Thu
May 7

The best part about abstraction is {the best part about abstraction - 1, the best part about abstraction - 2}

Oh, programming is so easy! Just look at the source!

$ less /usr/include/errno.h

#include <bits/errno.h>

$ less /usr/include/bits/errno.h

#include <linux/errno.h>

$ less /usr/include/linux/errno.h:

#include <asm/errno.h>

$ less /usr/include/asm/errno.h:

#include <asm-generic/errno.h>

$ less /usr/include/asm-generic/errno.h:

#include <asm-generic/errno-base.h>

$ less /usr/include/asm-generic/errno-base.h:

#define EPERM     1   /* Operation not permitted */
[…]

It’s a wonder that gcc ever terminates.

Mon
May 4

CI20: Interrupt handling

This post is part of the CI20 bare-metal project — a project to write operating-system-level code on the CI20 MIPS-based demo board.

In the previous instalment we ran a memory tester and verified that DDR was initialised. Let’s start adding real OS features, starting with interrupts. We’ll add a generic interrupt mechanism and then apply it to the timer interrupt.

I don’t have any relevant pictures to go with this post, so here are some nesting swans I saw recently in Exeter:


Silly swans, building their nest so close to the path that the council had to give them a bit of privacy. They didn’t seem to care, though. Well, on to interrupt support on the CI20.

Generic interrupt support

The CI20 has a multi-level approach for handling interrupts.
  1. Firstly, the global interrupt enable flag in the CP0 STATUS register must be switched on.
  2. Then, each individual interrupt must be unmasked in the same register.
  3. Then, the interrupt controller hardware must unmask the interrupt for a particular device.
  4. Finally, the device itself must be configured to generate interrupts.
After you do all of this, the CPU will jump to a special location when an interrupt occurs, after storing the program counter and setting some flags (such as CAUSE). All the rest is up to software. 

The special location is well-defined, but is well-defined to be in any of four places, depending on CPU flags:
  • If BEV is set in the CP0 STATUS register, the address is in uncached memory.
  • If IV is set in the CP0 CAUSE register, the address for interrupts is distinct from the address for other types of exceptions, otherwise it’s the same.
We don’t want to use uncached memory, and, in fact, on the CI20, we can’t, as the address is 0xBFC00380, in memory-mapped device territory, so we’ll leave BEV unset. However, we *do* want separate addresses for interrupts and other exceptions, because this means that there’s a little less work to do when an interrupt or exception arrives, so we’ll set IV. 

Finally, we need to write an interrupt handler. The interrupt handler will consist of two parts: an assembly-language part which does the minimum necessary to safely jump into C code, and a C part which determines which interrupt has occurred and deals with it appropriately.

Here’s the assembly-language portion of the interrupt handler included in this post’s start.S:

.org 0x200
_irq_asm:
    sw at, -4(sp)
    sw v0, -8(sp)
    sw v1, -12(sp)

    [ snip: many more registers saved ]

    sw fp, -108(sp)
    sw ra, -112(sp)

    addi sp, sp, -112

    jal libci20_interrupt
    nop

    addi sp, sp, 112
    lw at, -4(sp)
    lw v0, -8(sp)
    lw v1, -12(sp)

    [ snip: many registers re-loaded ]

    lw fp, -108(sp)
    lw ra, -112(sp)

    eret

Fairly straightforward, then: save all registers, run the C portion of the interrupt handler, restore all registers, and return from interrupt. This is fine for now, but if this were to be used in a real system it would certainly want to switch to a dedicated interrupt stack — or, at the very least, make sure it was on a kernel stack.

The C portion is similarly straightforward. The CI20 has two interrupt pending registers, which are bitfields, one bit per device. A bit is set if an interrupt is pending for that device. The C routine allows device drivers to register a handler routine for their interrupt — if a handler is registered when an interrupt for that device arrives, it will be called.

Finally, the job of the handler is to inform the device that the interrupt has been handled. 

The timer interrupt

The “OS timer” device, used by timer.c, can be set up to generate an interrupt whenever the timer reaches a 32-bit comparison value. We previously initialised the timer to tick 3 million times a second, so let’s get it to generate an interrupt every millisecond by setting our comparison value equal to 3 million / 1000 = 3000. The timer then registers its interrupt handler for TCU0, which is the timer unit:

intc_register_handler_tcu0(ostimer_interrupt);

When a timer interrupt occurs, ostimer_interrupt is called. The only thing it absolutely has to do is to tell the TCU that the interrupt has been handled:

poke32(TFCR, TFR_OSTFLAG);

… but if that’s all it did then we wouldn’t even know it was working. So in addition to silencing the interrupt, we add support for timer callbacks, functions which are invoked by the timer interrupt handler:

void ostimer_interrupt(void)
{
for(int i = 0; i < timer_callback_count; i++)
timer_callbacks[i]();

/* Clear interrupt flag. If we don't do this we will immediately return to
* this interrupt on exit! */
poke32(TFCR, TFR_OSTFLAG);
}

At this point, finally, we can register a callback handler in our main() function and increment a 1ms counter.

Running the code

Check out the code as usual, this time using the interrupts tag:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/interrupts

Make sure you have also checked out pyelftools if you haven’t already:

$ cd thirdparty/

Now build and run. I now use a single command for this:

$ make && python3 usbloader.py build/stage1.elf && sleep 1 && python3 usbloader.py build/kernel.elf

Benchmarking 

If all goes well, you should see a short benchmark run three times, printing something like this:

00000C3C
00000C3C
00000C3C

This is the number of milliseconds taken to run a simple delay loop in main(). It doesn’t mean very much by itself, but I was curious to see how what we’ve got so far compared with Linux. So I wrote a short Linux benchmark which did the same thing (download it here), booted my CI20 into Linux, ran the benchmark, and got these results:

00000C51
00000C48
00000C49

In other words, Linux has more variance and is slightly slower than our OS. This is exactly as we’d expect: Linux is running other things behind the scenes, which will both cause the increased variance and slow the benchmark down. The results are within 5% of each other, however, which is encouraging — we did all the right things so far, or, at least, we did them as right as Linux does.

Other changes

This release includes quite a few changes:
  • “os” became “kernel” — which makes much more sense.
  • The kernel-mode stuff is mostly contained in a library, libci20, which is used by both stage1 and kernel. But use of the library started to diverge in this section, and will certainly diverge further. For example, both stage1 and kernel require a timer, but they use it differently: stage1 uses it for busy-waiting, while kernel uses it to generate periodic interrupts. Also, stage1 needs to be less than 14K, so there’s no room for fancy extra features. There is no perfect solution to this when you’re working in C. My solution is to link two different libraries, libci20 and libci20_mini. The mini version has won’t add any more files, but has its own simple implementations of some things (like the timer). The Makefile changed to reflect this.
  • The kernel’s assembly-language startup file, start.S, now zeroes out BSS. It didn’t do it before because previously we didn’t have a BSS section. (BSS (https://en.wikipedia.org/wiki/.bss) is where all uninitialised file-scope variables get placed — like the array defined in libci20/interrupts.c.) The kernel’s linker.lds file changed to accommodate the new sections, and also to align the data blocks to the length of a cache line. Note that the BSS section (and its architecture-specific friend, .sbss) is marked as “NOLOAD” — which means it takes up no space in the file at all.
  • The USB loader changed again, this time to pad uploaded data to a multiple of 2k when writing to TCSM. Experiments with crossing 2k block boundaries failed unless the data were padded. I have no idea why this peculiarly hardware-specific quirk works, or even if it’s doing the right thing, but it does seem to work.
The end, or just the beginning?*

We’ve now got all the resources we need to start writing an operating system. Next time we’ll begin on that, starting with the scheduler.

* Probably not the beginning.

Sun
Apr 19

CI20: The DDR odyssey, part 4: memtester

I can’t think of a worse type of bug than one related to faulty RAM. Actually, that’s not true — probably concurrency bugs are worse. Oh well, so much for the strong opening. In any case, we’ve spent the last 3 posts initialising the RAM, so let’s now run memtester and make sure it works.

The CI20 bare-metal project: all posts

Memtester is a popular open-source program for testing RAM. At its core it is quite simple: it runs a suite of tests on the RAM by writing specially-crafted data to it, designed to expose any issues with the RAM, then reading it back and verifying that it survived the trip.

I ported Memtester to run on bare-metal CI20 by removing most of it: all the command-line parsing and POSIX-specific functionality, apart from random number generation. Instead the tester runs directly in cached kernel memory (0x80000000) and tests a fixed size (200MB). We’re only testing a fixed amount of memory because kseg0 isn’t very large (256MB, when you exclude memory-mapped devices). Anyway, it doesn’t really matter if we don’t touch every byte of memory, since the point isn’t to discover bad RAM but to discover whether the DDR controller and DDR PHY are configured properly — problems which should be obvious even after testing only a very small amount of memory.

Running the test 
Get this version by checking out the OS as normal:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/memtester

Now check out two “ports”:

$ cd ports
$ git clone https://github.com/nfd/ci20-os-port-posix posix
$ cd ..

Now build the system:

$ make
$ cd ports/memtester-4.3.0
$ make
$ cd ../../

Now run everything. This happens in two stages. First we use stage1 to initialise the memory:

$ python3 usbloader.py build/stage1.elf

Look at the serial console output, and when stage1 indicates that the memory test is complete, load memtester.

$ python3 usbloader.py ports/memtester-4.3.0/build/memtester.elf

You should see memtester load and start testing RAM. The test suite repeats until memtester finds a problem.

How it works
The hard part of all of this was the build system, which needed quite a bit of expansion to support “ports”. Ports are libraries or binaries for third-party applications or their support. The idea is that you drop the repository into ports/ and use a custom Makefile to build them as part of the rest of the system. The memtester Makefile looks like this:

PORT_SRC=memtester.c tests.c
PORT_TYPE=elf
PORT_TARGET=memtester.elf
PORT_INCLUDES=ports/posix/include
PORT_LIBS=build/libci20.a ports/posix/build/libposix.a

include ../baremetal.mk

In other words, Memtester is a program (an ELF file) defined by two C source files, depending on the Posix port and libci20. Pretty straight-forward so far, but the hard work is performed by ports/baremetal.mk, which will build any dependencies and then the port itself. The implementation of baremetal.mk is a bit gruesome, using quite a few Make “features”. I’m perversely proud of it, which are the two feelings I always get when I accomplish something nontrivial in Make.

Next steps
An interesting thing to do now is to edit stage1/ddr.py, changing follow_reference_code=True to follow_reference_code=False. This activates a whole lot of changes related to RAM initialisation, but doesn’t seem to affect system stability — memtester runs just fine, which indicates that DDR might be more robust to timing variations than it looks. An interesting next step might be to measure memory speed in addition to memory reliability, but let’s move on from RAM for a little while: next time we’ll get back to the OS development proper, and look at handing interrupts.

Thu
Apr 16

CI20: The DDR odyssey, part 3: memory remapping

This post is part of the CI20 bare-metal project (link leads to index of all posts), and is part three of four posts about initialising the DDR on the CI20 Creator. 

In the last post we got the RAM working enough to boot something into kseg0, but we missed one curious part of RAM initialisation: memory remapping.

The jz4780 DDR controller has a several DREMAP registers, described a little opaquely as "DREMAP1~5 are used to define address mapping in DDRC.” Fine, but what is “address mapping in the DDRC”?

My understanding, which is basically guesswork, goes like this: We know that DDR memory addresses are specified in four dimensions: bank, row, column, and byte number within the word*. But of course physical memory addresses are a single number. So one thing you could do is just assign the various dimensions to bits within a 32-bit address space, like this:


Then physical address 0x00000000 maps to bank 0, row 0, column 0, byte 0; and physical address, say, 0x1CE1CEBB maps to bank 3, row 19996, column 942, and byte 3. (The two unused bits at the top are because we have a 4 gigabyte address space, but only 1 gigabyte of RAM to play with.)

The problem with this particular mapping is that we don’t expect the bank to change very frequently. Programs tend to follow the principle of locality, which means that their next memory reference is likely to be very near their last memory reference. But DDR RAM works by “precharging” a bank + row combination, after which columns can be read. This precharging takes time, and only one row can be precharged per bank. If all the addresses we need in the near future reside in same bank but in different rows, we have no choice but to wait for the bank precharge multiple times in sequence, once for each row. If, however, the upcoming addresses span multiple banks, we could precharge all the banks we needed at once, saving some time.

In other words, we might prefer this arrangement, in which bank number should change more frequently than row number:


This is the purpose of the DREMAP registers — they let us swap bits around, effectively allowing us to change the positioning of our 4d-to-1d mapping.

And this is what the reference code (and, now, ddr.py) does: switch bank and row addresses to make it more likely that we’ll be able to precharge multiple banks simultaneously. This is actually a power / performance trade-off: we end up using more power (for bank precharging) but don’t spent so much time waiting.

Code is available under the ddr_remap tag:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/ddr_remap

If you build and run it, you should find that nothing has changed, and everything still works. Which is a comfort.

In the next post, we’ll finish off the RAM stuff by running a proper memory test.

* It’s actually five, as we also have rank, but there’s only one of those on the CI20 (ranks make more sense for removable memory, where you can define a rank as “whatever goes in a memory slot”).

Wed
Apr 8

CI20: The DDR odyssey, part 2: getting it working

This post is part of the CI20 bare-metal project (link leads to index of all posts), and is part two of three or perhaps four posts about initialising the DDR on the CI20 Creator. This is an interesting one though because at the end we end up with usable RAM.

DDR3 is these two chips (and two on the other side)

DDR RAM is designed to make the RAM chips as cheap as possible by offloading a lot of the task of driving them to a separate chip. This chip typically contains two IP blocks: a DDR controller (DDRC), which does higher-level control of the RAM, and a DDR PHY, which handles the physical layer. To get the DDR working, we have to tell the DDRC and PHY a large amount of information about the physical characteristics of the RAM.

How do we get this information? Various sources.
  • A lot of the required information has standardised names which can be read straight out of the RAM datasheet
  • Most of the DDRC registers are documented in the JZ4780 programmer’s manual.
  • For the stuff that isn’t documented, we can take the required values from sample code, such as Ingenic’s board support codeu-boot, or ci20-tools.
But in addition to just having something which works it would also be rather nice to know what is going on. That isn’t so easy:
  • The DDRC, while mostly well-documented, is still missing some information which is supplied in the source as magic numbers.
  • The PHY is not documented at all. I wonder if it’s actually licensed from someone else? In any case, we can get some information about what the registers are from their symbolic names, from what is put into them, and, as a last resort, from datasheets for similar PHY blocks (PDF).
  • The sample code, even the best version of it (ci20-tools), is not great. This isn’t the programmers’ fault but simply a consequence of a bad original version (the Ingenic board support package).
All the sample code for DDR3 initialisation is written in C, but I ended up writing a Python program which generates C code. Doing the hard work in Python made life much easier, because it’s much easier to separate concerns. For example, here is the code to initialise a register with the DDR timing value named tRTP:

Name: tRTP
Description: READ to PRECHARGE command period
Value (from the datasheet): 4 DDR clock cycles or 7.5 nanoseconds, whichever is greater.

C implementation:

#define DDR_tRTP DDR_MAX(4, 7500)
...
tmp = DIV_ROUND_UP(DDR_tRTP * 1000, ps);
if (tmp < 1) tmp = 1;
if (tmp > 6) tmp = 6;
ddrc_timing1 |= (tmp << DDRC_TIMING1_TRTP_BIT);
other register values
writel(ddrc_timing1, DDRC_TIMING(1));

Python implementation:

ram.tRTP = NS('max(4 * nCK, 7.5)')
hardware.write_register(‘DDR.DTIMING1’, tRTP=ram.tRTP.ticks, other register values)

It’s hopefully pretty clear that the Python code is easier to understand. The key helpful part here is that tRTP becomes an object with two attributes “ns” and “ticks” — the first being the timing value in nanoseconds, and the second being the timing value in multiples of the DDR clock cycle. This reflects the fact that timing values are specified in nanoseconds (and any calculations on timing values are usually done in nanoseconds), but they are ultimately written into DDRC and PHY registers as multiples of a DDR clock cycle (one clock tick is 2.5 nanoseconds, at 400MHz).

You can view the Python online here: ddr.py. The interesting stuff is closer to the bottom of the file.

Class AutogenOutput produces C output based on method calls, so it defines what sort of operations can be performed to initialise RAM. For example, calling write_register causes AutogenOutput to produce a line of C code which modifies a register. Other operations include waiting for some time interval, updating only parts of a register, and repeatedly reading from a register until its value equals some predefined setting. These are all the operations which are required to initialise the DDR.

The actual initialisation is done in the init_ram function (which calls init_phy). It is full of function calls which look like this:

hardware.note('reset DDRC')
hardware.write_register('DDR.DCTRL', DFI_RST=1, DLL_RST=1, CTL_RST=1, CFG_RST=1)
hardware.write_register_raw('DDR.DCTRL', 0)

… where hardware is an instance of the AutogenOutput class. 

Further down is the generate function, which establishes the RAM timing parameters. The timing parameters are evaluated on-demand, which means they don’t need to be in any particular order — and they can be arbitrarily complex expressions. For example, the timing value tWR is a relatively simple 15 nanoseconds:

ram.tWR = NS(15)

… but the timing value tWTR is quite complex:

ram.tRTW = TCK('ram.tRL.ticks + ram.tCCD.ticks + 2 - ram.tWL.ticks')

Further down the generate function is a set of conditional settings depending on whether the initialisation should follow the reference code exactly or not. When writing the generator, I noticed some discrepancies between the reference code and the DDR datasheet, as well as what I’m at least 90% sure is a genuine bug. For example, the reference code stores what is apparently a nanosecond value into a register directly:

ram.phy_dtpr2_tCKE = TCK('math.ceil(ram.tCKE.ns)')

(note the forced conversion between nanoseconds and ticks), whereas the correct value should be in terms of ticks:

ram.phy_dtpr2_tCKE = ram.tCKE

In any case, that’s enough picking through code. You can check out the DDR-initialising bootloader using the “ddr” tag from the usual place:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/ddr

If you make and install this, you will see a memory test passing.

Next up: a very interesting part of memory initialisation which this version completely avoids: DDR address remapping! After that, we’ll look at actually loading something with our boot loader.

JZ4780 USB Loading should start at 0xf4000800

A minor change: you’ll notice that the linker script and usbloader.py have changed to use the start address of 0xf4000800 — 2048 bytes higher than previously. This was a pretty annoying bug to track down: the first 2k of my bootloader binary was running fine, but any code beyond that point just wasn’t working at all. It turns out that if you load into the first 2k of TCSM, you can’t write more than 2k. I don’t know why this is (though I’m sure it’s to do with the fact that TCSM is divided into 2k “banks”), and the documentation doesn’t help (in fact, it flat-out states that TCSM for bootloading starts at 0xf4000000), but skipping the first 2k solves the problem. 

Wed
Apr 1

CI20: The DDR odyssey, part 1: PLLs

This post is part of the CI20 bare-metal project (link leads to index of all posts), and is part one of three or perhaps four posts about initialising the DDR on the CI20 Creator.

The CI20 has 1GB of DDR3 SDRAM onboard. Communicating with DDR is quite complex. Actually even the reason that it’s complex is somewhat complex: essentially a trade-off was made early on to keep the actual RAM as simple and cheap as it could possibly be without compromising performance. The result of that decision is that a lot of the control circuitry lives in a separate bit of hardware called a DDR controller. On the CI20, the DDR controller is part of JZ4780 SOC. Modern Intel chips are similar, having the DDR controller as part of the CPU package.

The DDR controller is complemented by another bit of hardware (“IP block” in hardware design speak) called the DDR PHY. These two tightly-integrated parts split the task of DDR control into a high level and a low level:
  1. The DDR controller maps the multiple RAM chips to a logical, flat address space. It takes into account the timings of the RAM (how fast you can read and write it, basically), manages DRAM refresh, and uses its knowledge of the physical layout of the RAM to try to maximise performance at the protocol level. It communicates with the DDR through the PHY.
  2. The DDR PHY manages the physical interface to the RAM. The RAM is in a separate chip from the PHY (which is part of the system-on-chip), connected by long (to a computer) circuit-board traces. The PHY is in charge of sending high-speed signals along these wires and dealing with the complexities involved. For example, without proper impedance matching, signals can reflect off the end of the wire and bounce back to interfere with incoming signals. The PHY also knows about details of the RAM timing, so it can optimise the speed at which it communicates with the DDR at a physical level. The PHY is kind of fascinating, not least because it’s almost certainly the least-well-documented part of the CI20 apart from the GPU.
So the DDR controller and PHY do all the really hard work, but in order to do it they have to know a lot of information about the RAM. Getting that information into these two parts is the subject of the next few posts. Before we get there, we have one more bit of groundwork to cover: supplying a clock signal to the DDR controller, the PHY, and the DDR itself.

Phase-locked loops

The CI20 has two external oscillators. One is the very slow 32KHz oscillator used for the real-time clock. The other is a much-faster 48MHz oscillator used by everything else. Well, “much faster” is relative — 48MHz is still a long way from the 1.2GHz we need to run the CPU at full speed, or even from the 400MHz required for the DDR. So how do we generate these much faster clocks?

The answer is a circuit called a phase-locked loop, or PLL. The electrical-engineering details of a PLL are out of scope for this post, and, frankly, out of scope for my brain, but conceptually they seem simple enough: they generate a frequency which is some multiple of their input frequency, and stay synchronised by using a phase detector. If the output is out of phase with the input, the PLL will either speed up or slow down its internal oscillator until the phases match. In other words, the speed of the oscillator is controlled by a feedback loop.

PLLs typically also incorporate a frequency divider (or two), so you can essentially multiply the input frequency by any fraction you like, within the range of numerator and denominator supported by your specific PLL.

The JZ4780 has four PLLs: APLL, MPLL, VPLL, and EPLL. These names seem arbitrary, but they are apparently traditional — plenty of non-jz47xx code refers to PLLs with these names. In fact, even the purposes of these PLL names are re-used, to some extent, between devices. For example, VPLL is used to drive the video hardware, and EPLL is often audio.

Initialising a JZ4780 PLL is simple enough, with the obvious-sounding caveat that we shouldn’t re-initialise a PLL that’s already being used to drive the CPU clock, and the less-obvious caveat that we can’t change a PLL’s speed to more than about 20% faster or slower without halting and restarting it, because it will “lose lock” (the feedback loop will go out of synchronisation). This is relevant to us because the USB bootloader built in to the JZ4780 initialises the first PLL, PLLA, and uses it to drive the CPU. Consequently the code for this post initialises the second PLL, PLLM, and uses that one for the CPU and DDR.

The code is under the tag “plls”:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/plls

… and can be built the regular way:

$ make bootloader.bin
$ python3 usbloader.py bootloader.bin


(If you boot this, you should see the CI20 begin a memory test, but never finish it. This is to be expected — the DDR clock is initialised, but the DDR controller isn’t yet. So the memory test ends up attempting to write to an address which doesn’t exist.)

The PLL code is in the file pllclock.c and is fairly lavishly commented. Perhaps the most interesting parts are:
  • The magic numbers defined at the top (CI20_PDIV, CI20_H2DIV and so on) come from Ingenic’s reference code and determine how fast particular parts of the SOC should run. This aspect of the SOC (the timing information) isn’t part of the public documentation, so we don’t have much choice but to re-use these numbers as is: they set up various dividers to ensure that the peripherals run at about 100MHz, the AHB buses run at 200MHz and 400MHz, and the L2 cache runs at half the speed of the CPU. Presumably they could be changed, but I don’t know to what extent.
  • Switching the CPU (and friends) to a new PLL is a two-step process: first the frequency dividers are installed, and then the PLL source is switched over.
  • CI20 PLLs have one multiplier and two dividers. The first divider is applied to the input frequency, and the second divider is applied to the output frequency. I don’t know when you’d use one and when you’d use the other, but the reference code uses the input divider for the CPU’s PLL and the output divider for PLLs for video and audio.
  • The PLL is set up with a multiplier of twice the speed required, and then with a divider of two. Apparently this is a  reasonably common thing to do, to reduce jitter. Or to normalise the duty cycle. It’s not clear which.
Magic ahead


Sadly, we are starting to enter a realm of magic numbers which describe undocumented aspects of the hardware, generally typically related to timing. This happens a little with the PLLs, just slightly more with the DDR controller (which has reasonable documentation), and significantly more with the DDR PHY (which has no official documentation at all). Nonetheless the situation is far from hopeless: it’s possible to get quite a clear idea of what’s going on even without official documentation, as we’ll see in the next few posts.

Tue
Mar 31

The CI20 bare-metal project


This is an index of posts I’ve made about the CI20 Creator board. I’ll update it when I add new ones.

Running bare-metal code (most recent first):

Diversions:

Sat
Mar 14

ci20: Enabling the timer

This is part 4 of an ongoing series (part 1part 2part 3) about writing bare-metal code for the CI20 Creator. The last post talked about the RAM available to the CI20: 1 gigabyte of DDR, which requires initialisation, and 16 kilobytes of SRAM, which can be used by a bootloader to initialise the DDR.

Before we get to DDR, we need to give the bootloader some way to keep track of time intervals. The proximate requirement is the DDR controller requires a 200 microsecond delay, and without a timer of some sort we have no way of knowing how long that is — but timers are generally useful things to have around in any case. The JZ4780 CPU has lots of timers available to it, so the simplest thing to do is just to enable one of them.

You can check out the repository for this code:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/ostimer

There are a few changes from last time, but the only really functionally relevant one is that there are two new files, timer.c and timer.h, which set up the “OS Timer” in the JZ4780 SOC. Initialisation of this timer is very straightforward, as can be seen in os_timer_init():
  • Set the frequency of count, which is some fraction of one of the available clocks — we use the 48MHz EXTCLK which doesn’t require any additional set-up, and then divide it by 16 to give us a timer which increments 3 million times a second.
  • Clear the counter value (not required, just neat);
  • Set the clock source (EXTCLK); and
  • Start the timer running.
Compile and run the code as before:

$ make bootloader.bin
$ python3 usbloader.py bootloader.bin

And you should see a message printed to the serial port once a second.

This timer isn’t using any interrupts, because we haven’t yet written any way to handle them. This means that the delay function, usleep(), simply polls the timer register in a tight loop. This is of course very bad practise generally, but it’s just about okay for run-at-boot initialisation code and short time intervals.

Sat
Mar 7

Running code on the CI20 without uboot

This is part of a large series on the CI20, the CI20 bare-metal project.

Previously we were using uboot to load the code. This is pretty convenient, but it does mean that there are some mysteries about setting up the board’s hardware which uboot hides from view. What does it do? This post and the next one will explore that a little, by writing some code which doesn’t use it at all.

Where is the RAM?

One thing I took for granted previously is that the board has RAM available for us at 0x80000000, i.e. at the start of KSEG0. We know that the board has 1GB of RAM, and that RAM has to be accessible to us somewhere.

Or does it? The JZ4780 programmer’s manual is full of references to a “DDR [RAM] Controller”, which has quite a complicated set-up. But if the DDR controller has to be initialised, where does that initialisation code run?

It turns out that on the JZ4780, there is in fact a separate tiny (16k) bit of static ram, known as the tighly-coupled shared memory area, or TCSM, which solves this problem: you can copy your DDR initialisation code into TCSM and then copy the rest into DDR.

Of course, now we are faced with the problem of getting our code into the TCSM. To deal with this problem, the JZ4780 includes a small ROM capable of loading things into TCSM. This boot room is quite sophisticated, supporting NAND, SD, USB, and SPI loading.

Booting from USB

If we’re going to play around, then USB loading is quite an attractive option because it’s nice and fast (compared with unplugging SD cards and so on). Unfortunately the USB loader is a bit finicky: the device doesn’t even enumerate properly on Mac OS X (though it’s fine on Linux), and it requires a custom loader. I wrote the custom loader, so if you have access to a Linux box (or, perhaps, if you’re running Linux in a VM) you can follow along.

By default, the CI20 requires you to hold down a button to do a USB boot. That sounded a bit tedious, so I soldered some headers to the button terminals:


Connecting a switch to the headers means I can have always-on USB booting.

We’re going to use the USB OTG port as a peripheral. Remove the jumper from the pins next to the port. This turns the port into a USB peripheral, rather than a host. You can then connect a mini-USB cable from your computer to the OTG port and power up the board. If all is well, you should see a new USB device appear (run lsusb on Linux, or use USB Prober, or System Information, on a Mac). 

Serial-port communication

Rather than flashing a LED, this example now communicates over a serial port — the same one, UART4, used by uboot and the Linux kernel, so if you set up a usb-serial cable for previous parts this will work without any changes.

Running the example

I’ve created a new repository for this stuff, which you can check out using git:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/usbloader

You’ll need to build the bootloader (make sure you’ve set up your environment as described in my previous CI20 posts):

$ make bootloader.bin

Now reset the board and load the example using my USB loader. Note that you will need to install pyusb before running this (and that, in turn, will require libusb0).

$ python3 usbloader.py bootloader.bin

If you get “permission denied” errors, you can set up udev rules to give yourself access to the port. Or you can use sudo if you’re naughty.

You should then get a cheery message on the serial port.

Next steps

This example hasn’t actually initialised the DDR. It’s just running straight out of TCSM at the special TCSM location 0xf4000000. Overall, the board is only just barely initialised enough to get stuff out of the serial port. Next time we’ll look at setting up the DDR so that we can begin to think about loading things back into the familiar 0x80000000 “proper” RAM, and see what other board initialisation is needed by the jz4780.

Thu
Feb 12

A reset button for the CI20 Creator MIPS-based board


I’ve been rambling about the CI20 board a lot recently. It’s a fun board, but it lacks a hardware reset button.

Fortunately you can easily hook one up using the EJTAG header. Connect a momentary switch between RST_N (pin 11) and GND (pin 2, 4, 6, 8 or 10). If your board is angled the same way as mine is in the picture, then the pin-out of the header matches that on the hardware page (linked above) — i.e. RST_N is second from bottom on the left and GND is third from bottom on the right.

I’ve got somewhat-reassuring confirmation from the CI20 mailing list that RST_N has a debounce circuit and a pull-up resistor, so a simple switch should be okay.