Sun
Apr 19

CI20: The DDR odyssey, part 4: memtester

I can’t think of a worse type of bug than one related to faulty RAM. Actually, that’s not true — probably concurrency bugs are worse. Oh well, so much for the strong opening. In any case, we’ve spent the last 3 posts initialising the RAM, so let’s now run memtester and make sure it works.

The CI20 bare-metal project: all posts

Memtester is a popular open-source program for testing RAM. At its core it is quite simple: it runs a suite of tests on the RAM by writing specially-crafted data to it, designed to expose any issues with the RAM, then reading it back and verifying that it survived the trip.
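
The write-then-verify idea can be sketched in a few lines. This is not memtester's actual code: the function name, the pattern list, and the faulty-memory class below are all made up for illustration, and the "RAM" is just a Python list.

```python
# Hypothetical sketch of a write/read-back pattern test (not memtester's real code).
PATTERNS = (0x00000000, 0xFFFFFFFF, 0x55555555, 0xAAAAAAAA)

def check_region(mem, patterns=PATTERNS):
    """Fill mem with each pattern, read it back, and collect mismatches.

    Returns a list of (index, expected, actual) tuples.
    """
    failures = []
    for pattern in patterns:
        for i in range(len(mem)):
            mem[i] = pattern
        for i in range(len(mem)):
            actual = mem[i]
            if actual != pattern:
                failures.append((i, pattern, actual))
    return failures

class StuckBitMemory(list):
    """Made-up faulty RAM: bit 3 of word 5 always reads back as 0."""
    def __getitem__(self, i):
        value = super().__getitem__(i)
        return value & ~0x8 if i == 5 else value

print(check_region([0] * 16))                  # healthy memory: no failures
print(check_region(StuckBitMemory([0] * 16)))  # the stuck bit shows up
```

Only the patterns that set the stuck bit expose it, which is why a real tester cycles through many different patterns.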

I ported Memtester to run on bare-metal CI20 by removing most of it: all the command-line parsing and POSIX-specific functionality, apart from random number generation. Instead the tester runs directly in cached kernel memory (0x80000000) and tests a fixed size (200MB). We’re only testing a fixed amount of memory because kseg0 isn’t very large (256MB, when you exclude memory-mapped devices). Anyway, it doesn’t really matter if we don’t touch every byte of memory, since the point isn’t to discover bad RAM but to discover whether the DDR controller and DDR PHY are configured properly — problems which should be obvious even after testing only a very small amount of memory.

Running the test 
Get this version by checking out the OS as normal:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/memtester

Now check out two “ports”:

$ cd ports
$ git clone https://github.com/nfd/ci20-os-port-posix posix
$ cd ..

Now build the system:

$ make
$ cd ports/memtester-4.3.0
$ make
$ cd ../../

Now run everything. This happens in two stages. First we use stage1 to initialise the memory:

$ python3 usbloader.py build/stage1.elf

Look at the serial console output, and when stage1 indicates that the memory test is complete, load memtester.

$ python3 usbloader.py ports/memtester-4.3.0/build/memtester.elf

You should see memtester load and start testing RAM. The test suite repeats until memtester finds a problem.

How it works
The hard part of all of this was the build system, which needed quite a bit of expansion to support “ports”. Ports are libraries or binaries for third-party applications or their support. The idea is that you drop the repository into ports/ and use a custom Makefile to build them as part of the rest of the system. The memtester Makefile looks like this:

PORT_SRC=memtester.c tests.c
PORT_TYPE=elf
PORT_TARGET=memtester.elf
PORT_INCLUDES=ports/posix/include
PORT_LIBS=build/libci20.a ports/posix/build/libposix.a

include ../baremetal.mk

In other words, Memtester is a program (an ELF file) defined by two C source files, depending on the Posix port and libci20. Pretty straightforward so far, but the hard work is performed by ports/baremetal.mk, which will build any dependencies and then the port itself. The implementation of baremetal.mk is a bit gruesome, using quite a few Make “features”. I’m perversely proud of it, which is the feeling I always get when I accomplish something nontrivial in Make.

Next steps
An interesting thing to do now is to edit stage1/ddr.py, changing follow_reference_code=True to follow_reference_code=False. This activates a whole lot of changes related to RAM initialisation, but doesn’t seem to affect system stability: memtester runs just fine, which indicates that DDR might be more robust to timing variations than it looks. An interesting next step might be to measure memory speed in addition to memory reliability, but let’s move on from RAM for a little while: next time we’ll get back to the OS development proper, and look at handling interrupts.

Thu
Apr 16

CI20: The DDR odyssey, part 3: memory remapping

This post is part of the CI20 bare-metal project (link leads to index of all posts), and is part three of four posts about initialising the DDR on the CI20 Creator. 

In the last post we got the RAM working enough to boot something into kseg0, but we missed one curious part of RAM initialisation: memory remapping.

The jz4780 DDR controller has several DREMAP registers, described a little opaquely as “DREMAP1~5 are used to define address mapping in DDRC.” Fine, but what is “address mapping in the DDRC”?

My understanding, which is basically guesswork, goes like this: We know that DDR memory addresses are specified in four dimensions: bank, row, column, and byte number within the word*. But of course physical memory addresses are a single number. So one thing you could do is just assign the various dimensions to bits within a 32-bit address space, like this:


Then physical address 0x00000000 maps to bank 0, row 0, column 0, byte 0; and physical address, say, 0x1CE1CEBB maps to bank 3, row 19996, column 942, and byte 3. (The two unused bits at the top are because we have a 4 gigabyte address space, but only 1 gigabyte of RAM to play with.)
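
The worked example can be checked with a short decoder. The field positions below are inferred from the numbers in the example (3 bank bits, 15 row bits, 10 column bits, 2 byte bits, with the top two bits unused); they are an assumption, not something taken from the JZ4780 manual.

```python
# Field layout inferred from the worked example: (name, shift, width).
FIELDS = (
    ("bank", 27, 3),
    ("row", 12, 15),
    ("column", 2, 10),
    ("byte", 0, 2),
)

def decode(addr):
    """Split a physical address into its DDR coordinates."""
    return {name: (addr >> shift) & ((1 << width) - 1)
            for name, shift, width in FIELDS}

print(decode(0x00000000))
print(decode(0x1CE1CEBB))
# → {'bank': 3, 'row': 19996, 'column': 942, 'byte': 3}
```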

The problem with this particular mapping is that we don’t expect the bank to change very frequently. Programs tend to follow the principle of locality, which means that their next memory reference is likely to be very near their last memory reference. But DDR RAM works by “precharging” a bank + row combination, after which columns can be read. This precharging takes time, and only one row can be precharged per bank. If all the addresses we need in the near future reside in the same bank but in different rows, we have no choice but to wait for the bank precharge multiple times in sequence, once for each row. If, however, the upcoming addresses span multiple banks, we could precharge all the banks we needed at once, saving some time.

In other words, we might prefer this arrangement, in which bank number should change more frequently than row number:


This is the purpose of the DREMAP registers — they let us swap bits around, effectively allowing us to change the positioning of our 4d-to-1d mapping.

And this is what the reference code (and, now, ddr.py) does: switch bank and row addresses to make it more likely that we’ll be able to precharge multiple banks simultaneously. This is actually a power / performance trade-off: we end up using more power (for bank precharging) but don’t spend so much time waiting.
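
To see the effect of swapping the bank and row fields, here is a rough sketch that decodes the same addresses under both layouts. Both layouts are assumptions: the first is inferred from the worked example earlier in this post, and the second simply swaps the bank and row positions. The bit assignment that the reference code's DREMAP values actually produce may differ.

```python
# (name, shift, width) tuples. Both layouts are assumed, for illustration only.
ORIGINAL = (("bank", 27, 3), ("row", 12, 15), ("column", 2, 10), ("byte", 0, 2))
REMAPPED = (("row", 15, 15), ("bank", 12, 3), ("column", 2, 10), ("byte", 0, 2))

def decode(addr, layout):
    """Split a physical address into DDR coordinates under the given layout."""
    return {name: (addr >> shift) & ((1 << width) - 1)
            for name, shift, width in layout}

def banks_and_rows(layout, addresses):
    """Return the (bank, row) pair touched by each address."""
    return [(decode(a, layout)["bank"], decode(a, layout)["row"])
            for a in addresses]

# Four consecutive 4KB blocks, as a program walking through memory might touch.
addresses = [0x0000, 0x1000, 0x2000, 0x3000]
print(banks_and_rows(ORIGINAL, addresses))  # → [(0, 0), (0, 1), (0, 2), (0, 3)]
print(banks_and_rows(REMAPPED, addresses))  # → [(0, 0), (1, 0), (2, 0), (3, 0)]
```

Under the first layout the four blocks are four rows of one bank, so they have to be opened one after another. Under the second they land in four different banks, which can be prepared at the same time.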

Code is available under the ddr_remap tag:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/ddr_remap

If you build and run it, you should find that nothing has changed, and everything still works. Which is a comfort.

In the next post, we’ll finish off the RAM stuff by running a proper memory test.

* It’s actually five, as we also have rank, but there’s only one of those on the CI20 (ranks make more sense for removable memory, where you can define a rank as “whatever goes in a memory slot”).

Wed
Apr 8

CI20: The DDR odyssey, part 2: getting it working

This post is part of the CI20 bare-metal project (link leads to index of all posts), and is part two of three or perhaps four posts about initialising the DDR on the CI20 Creator. This is an interesting one though because at the end we end up with usable RAM.

DDR3 is these two chips (and two on the other side)

DDR RAM is designed to make the RAM chips as cheap as possible by offloading a lot of the task of driving them to a separate chip. This chip typically contains two IP blocks: a DDR controller (DDRC), which does higher-level control of the RAM, and a DDR PHY, which handles the physical layer. To get the DDR working, we have to tell the DDRC and PHY a large amount of information about the physical characteristics of the RAM.

How do we get this information? Various sources.
  • A lot of the required information has standardised names which can be read straight out of the RAM datasheet
  • Most of the DDRC registers are documented in the JZ4780 programmer’s manual.
  • For the stuff that isn’t documented, we can take the required values from sample code, such as Ingenic’s board support code, u-boot, or ci20-tools.
But in addition to just having something which works it would also be rather nice to know what is going on. That isn’t so easy:
  • The DDRC, while mostly well-documented, is still missing some information which is supplied in the source as magic numbers.
  • The PHY is not documented at all. I wonder if it’s actually licensed from someone else? In any case, we can get some information about what the registers are from their symbolic names, from what is put into them, and, as a last resort, from datasheets for similar PHY blocks (PDF).
  • The sample code, even the best version of it (ci20-tools), is not great. This isn’t the programmers’ fault but simply a consequence of a bad original version (the Ingenic board support package).
All the sample code for DDR3 initialisation is written in C, but I ended up writing a Python program which generates C code. Doing the hard work in Python made life much easier, because it’s much easier to separate concerns. For example, here is the code to initialise a register with the DDR timing value named tRTP:

Name: tRTP
Description: READ to PRECHARGE command period
Value (from the datasheet): 4 DDR clock cycles or 7.5 nanoseconds, whichever is greater.

C implementation:

#define DDR_tRTP DDR_MAX(4, 7500)
...
tmp = DIV_ROUND_UP(DDR_tRTP * 1000, ps);
if (tmp < 1) tmp = 1;
if (tmp > 6) tmp = 6;
ddrc_timing1 |= (tmp << DDRC_TIMING1_TRTP_BIT);
other register values
writel(ddrc_timing1, DDRC_TIMING(1));

Python implementation:

ram.tRTP = NS('max(4 * nCK, 7.5)')
hardware.write_register('DDR.DTIMING1', tRTP=ram.tRTP.ticks, other register values)

It’s hopefully pretty clear that the Python code is easier to understand. The key helpful part here is that tRTP becomes an object with two attributes “ns” and “ticks” — the first being the timing value in nanoseconds, and the second being the timing value in multiples of the DDR clock cycle. This reflects the fact that timing values are specified in nanoseconds (and any calculations on timing values are usually done in nanoseconds), but they are ultimately written into DDRC and PHY registers as multiples of a DDR clock cycle (one clock tick is 2.5 nanoseconds, at 400MHz).
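
The nanoseconds-to-ticks conversion is simple enough to sketch. This is not the code from ddr.py: the helper names are made up, and rounding up to a whole clock cycle is an assumption (it matches the DIV_ROUND_UP in the C version).

```python
import math

TCK_NS = 2.5  # one DDR clock cycle at 400MHz, in nanoseconds

def ns_to_ticks(ns, tck_ns=TCK_NS):
    """Convert a timing value in nanoseconds to whole DDR clock cycles, rounding up."""
    return math.ceil(ns / tck_ns)

def trtp_ns(tck_ns=TCK_NS):
    """tRTP from the datasheet: 4 clock cycles or 7.5ns, whichever is greater."""
    return max(4 * tck_ns, 7.5)

print(trtp_ns())               # → 10.0 (4 cycles wins at 400MHz)
print(ns_to_ticks(trtp_ns()))  # → 4
print(ns_to_ticks(15))         # tWR of 15ns → 6
```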

You can view the Python online here: ddr.py. The interesting stuff is closer to the bottom of the file.

Class AutogenOutput produces C output based on method calls, so it defines what sort of operations can be performed to initialise RAM. For example, calling write_register causes AutogenOutput to produce a line of C code which modifies a register. Other operations include waiting for some time interval, updating only parts of a register, and repeatedly reading from a register until its value equals some predefined setting. These are all the operations which are required to initialise the DDR.

The actual initialisation is done in the init_ram function (which calls init_phy). It is full of function calls which look like this:

hardware.note('reset DDRC')
hardware.write_register('DDR.DCTRL', DFI_RST=1, DLL_RST=1, CTL_RST=1, CFG_RST=1)
hardware.write_register_raw('DDR.DCTRL', 0)

… where hardware is an instance of the AutogenOutput class. 
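
To give a feel for the approach, here is a much-reduced sketch of a class that turns method calls into lines of C. It is a hypothetical stand-in, not the real AutogenOutput: the class name, the delay_us method, and the exact C it emits (writel and usleep calls with symbolic register names) are all assumptions for illustration.

```python
class CEmitter:
    """Hypothetical, cut-down stand-in for a C-generating output class."""

    def __init__(self):
        self.lines = []

    def note(self, text):
        # Notes become C comments in the generated code.
        self.lines.append(f"/* {text} */")

    def write_register_raw(self, name, value):
        # 'DDR.DCTRL' becomes the symbolic C name DDR_DCTRL.
        self.lines.append(f"writel(0x{value:08x}, {name.replace('.', '_')});")

    def delay_us(self, microseconds):
        self.lines.append(f"usleep({microseconds});")

hardware = CEmitter()
hardware.note("reset DDRC")
hardware.write_register_raw("DDR.DCTRL", 0)
hardware.delay_us(200)
print("\n".join(hardware.lines))
```

Because the initialisation sequence only ever talks to an object like this, the same sequence could in principle be pointed at a different back end (one that prints a trace, say) without being rewritten.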

Further down is the generate function, which establishes the RAM timing parameters. The timing parameters are evaluated on-demand, which means they don’t need to be in any particular order — and they can be arbitrarily complex expressions. For example, the timing value tWR is a relatively simple 15 nanoseconds:

ram.tWR = NS(15)

… but the timing value tRTW is quite complex:

ram.tRTW = TCK('ram.tRL.ticks + ram.tCCD.ticks + 2 - ram.tWL.ticks')

Further down the generate function is a set of conditional settings depending on whether the initialisation should follow the reference code exactly or not. When writing the generator, I noticed some discrepancies between the reference code and the DDR datasheet, as well as what I’m at least 90% sure is a genuine bug. For example, the reference code stores what is apparently a nanosecond value into a register directly:

ram.phy_dtpr2_tCKE = TCK('math.ceil(ram.tCKE.ns)')

(note the forced conversion between nanoseconds and ticks), whereas the correct value should be in terms of ticks:

ram.phy_dtpr2_tCKE = ram.tCKE

In any case, that’s enough picking through code. You can check out the DDR-initialising bootloader using the “ddr” tag from the usual place:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/ddr

If you make and install this, you will see a memory test passing.

Next up: a very interesting part of memory initialisation which this version completely avoids: DDR address remapping! After that, we’ll look at actually loading something with our boot loader.

JZ4780 USB Loading should start at 0xf4000800

A minor change: you’ll notice that the linker script and usbloader.py have changed to use the start address of 0xf4000800 — 2048 bytes higher than previously. This was a pretty annoying bug to track down: the first 2k of my bootloader binary was running fine, but any code beyond that point just wasn’t working at all. It turns out that if you load into the first 2k of TCSM, you can’t write more than 2k. I don’t know why this is (though I’m sure it’s to do with the fact that TCSM is divided into 2k “banks”), and the documentation doesn’t help (in fact, it flat-out states that TCSM for bootloading starts at 0xf4000000), but skipping the first 2k solves the problem. 

Wed
Apr 1

CI20: The DDR odyssey, part 1: PLLs

This post is part of the CI20 bare-metal project (link leads to index of all posts), and is part one of three or perhaps four posts about initialising the DDR on the CI20 Creator.

The CI20 has 1GB of DDR3 SDRAM onboard. Communicating with DDR is quite complex. Actually even the reason that it’s complex is somewhat complex: essentially a trade-off was made early on to keep the actual RAM as simple and cheap as it could possibly be without compromising performance. The result of that decision is that a lot of the control circuitry lives in a separate bit of hardware called a DDR controller. On the CI20, the DDR controller is part of the JZ4780 SOC. Modern Intel chips are similar, having the DDR controller as part of the CPU package.

The DDR controller is complemented by another bit of hardware (“IP block” in hardware design speak) called the DDR PHY. These two tightly-integrated parts split the task of DDR control into a high level and a low level:
  1. The DDR controller maps the multiple RAM chips to a logical, flat address space. It takes into account the timings of the RAM (how fast you can read and write it, basically), manages DRAM refresh, and uses its knowledge of the physical layout of the RAM to try to maximise performance at the protocol level. It communicates with the DDR through the PHY.
  2. The DDR PHY manages the physical interface to the RAM. The RAM is in a separate chip from the PHY (which is part of the system-on-chip), connected by long (to a computer) circuit-board traces. The PHY is in charge of sending high-speed signals along these wires and dealing with the complexities involved. For example, without proper impedance matching, signals can reflect off the end of the wire and bounce back to interfere with incoming signals. The PHY also knows about details of the RAM timing, so it can optimise the speed at which it communicates with the DDR at a physical level. The PHY is kind of fascinating, not least because it’s almost certainly the least-well-documented part of the CI20 apart from the GPU.
So the DDR controller and PHY do all the really hard work, but in order to do it they have to know a lot of information about the RAM. Getting that information into these two parts is the subject of the next few posts. Before we get there, we have one more bit of groundwork to cover: supplying a clock signal to the DDR controller, the PHY, and the DDR itself.

Phase-locked loops

The CI20 has two external oscillators. One is the very slow 32KHz oscillator used for the real-time clock. The other is a much-faster 48MHz oscillator used by everything else. Well, “much faster” is relative — 48MHz is still a long way from the 1.2GHz we need to run the CPU at full speed, or even from the 400MHz required for the DDR. So how do we generate these much faster clocks?

The answer is a circuit called a phase-locked loop, or PLL. The electrical-engineering details of a PLL are out of scope for this post, and, frankly, out of scope for my brain, but conceptually they seem simple enough: they generate a frequency which is some multiple of their input frequency, and stay synchronised by using a phase detector. If the output is out of phase with the input, the PLL will either speed up or slow down its internal oscillator until the phases match. In other words, the speed of the oscillator is controlled by a feedback loop.

PLLs typically also incorporate a frequency divider (or two), so you can essentially multiply the input frequency by any fraction you like, within the range of numerator and denominator supported by your specific PLL.

The JZ4780 has four PLLs: APLL, MPLL, VPLL, and EPLL. These names seem arbitrary, but they are apparently traditional — plenty of non-jz47xx code refers to PLLs with these names. In fact, even the purposes of these PLL names are re-used, to some extent, between devices. For example, VPLL is used to drive the video hardware, and EPLL is often audio.

Initialising a JZ4780 PLL is simple enough, with the obvious-sounding caveat that we shouldn’t re-initialise a PLL that’s already being used to drive the CPU clock, and the less-obvious caveat that we can’t change a PLL’s speed to more than about 20% faster or slower without halting and restarting it, because it will “lose lock” (the feedback loop will go out of synchronisation). This is relevant to us because the USB bootloader built in to the JZ4780 initialises the first PLL, APLL, and uses it to drive the CPU. Consequently the code for this post initialises the second PLL, MPLL, and uses that one for the CPU and DDR.

The code is under the tag “plls”:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/plls

… and can be built the regular way:

$ make bootloader.bin
$ python3 usbloader.py bootloader.bin


(If you boot this, you should see the CI20 begin a memory test, but never finish it. This is to be expected — the DDR clock is initialised, but the DDR controller isn’t yet. So the memory test ends up attempting to write to an address which doesn’t exist.)

The PLL code is in the file pllclock.c and is fairly lavishly commented. Perhaps the most interesting parts are:
  • The magic numbers defined at the top (CI20_PDIV, CI20_H2DIV and so on) come from Ingenic’s reference code and determine how fast particular parts of the SOC should run. This aspect of the SOC (the timing information) isn’t part of the public documentation, so we don’t have much choice but to re-use these numbers as is: they set up various dividers to ensure that the peripherals run at about 100MHz, the AHB buses run at 200MHz and 400MHz, and the L2 cache runs at half the speed of the CPU. Presumably they could be changed, but I don’t know to what extent.
  • Switching the CPU (and friends) to a new PLL is a two-step process: first the frequency dividers are installed, and then the PLL source is switched over.
  • CI20 PLLs have one multiplier and two dividers. The first divider is applied to the input frequency, and the second divider is applied to the output frequency. I don’t know when you’d use one and when you’d use the other, but the reference code uses the input divider for the CPU’s PLL and the output divider for PLLs for video and audio.
  • The PLL is set up with a multiplier of twice the speed required, and then with a divider of two. Apparently this is a reasonably common thing to do, to reduce jitter. Or to normalise the duty cycle. It’s not clear which.
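
The multiplier-and-dividers arithmetic from the list above can be written out as a small calculation. The function name and the specific values (multiplier 50, input divider 1, output divider 2) are illustrative assumptions chosen to reach 1.2GHz from the 48MHz oscillator; they are not read out of pllclock.c.

```python
from fractions import Fraction

EXTCLK_HZ = 48_000_000  # the CI20's fast external oscillator

def pll_output_hz(input_hz, multiplier, input_divider, output_divider):
    """Output frequency of a PLL with one multiplier and two dividers."""
    return Fraction(input_hz) * multiplier / (input_divider * output_divider)

# Multiply up to twice the target speed, then divide by two on the output.
cpu_hz = pll_output_hz(EXTCLK_HZ, multiplier=50, input_divider=1, output_divider=2)
print(cpu_hz)  # → 1200000000
```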
Magic ahead


Sadly, we are starting to enter a realm of magic numbers which describe undocumented aspects of the hardware, typically related to timing. This happens a little with the PLLs, just slightly more with the DDR controller (which has reasonable documentation), and significantly more with the DDR PHY (which has no official documentation at all). Nonetheless the situation is far from hopeless: it’s possible to get quite a clear idea of what’s going on even without official documentation, as we’ll see in the next few posts.

Tue
Mar 31

The CI20 bare-metal project


This is an index of posts I’ve made about the CI20 Creator board. I’ll update it when I add new ones.

Running bare-metal code (most recent first):

Diversions:

Sat
Mar 14

ci20: Enabling the timer

This is part 4 of an ongoing series (part 1, part 2, part 3) about writing bare-metal code for the CI20 Creator. The last post talked about the RAM available to the CI20: 1 gigabyte of DDR, which requires initialisation, and 16 kilobytes of SRAM, which can be used by a bootloader to initialise the DDR.

Before we get to DDR, we need to give the bootloader some way to keep track of time intervals. The proximate requirement is that the DDR controller requires a 200 microsecond delay, and without a timer of some sort we have no way of knowing how long that is. Timers are generally useful things to have around in any case, though. The JZ4780 CPU has lots of timers available to it, so the simplest thing to do is just to enable one of them.

You can check out the repository for this code:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/ostimer

There are a few changes from last time, but the only really functionally relevant one is that there are two new files, timer.c and timer.h, which set up the “OS Timer” in the JZ4780 SOC. Initialisation of this timer is very straightforward, as can be seen in os_timer_init():
  • Set the counting frequency, which is some fraction of one of the available clocks: we use the 48MHz EXTCLK, which doesn’t require any additional set-up, and divide it by 16 to give us a timer which increments 3 million times a second;
  • Clear the counter value (not required, just neat);
  • Set the clock source (EXTCLK); and
  • Start the timer running.
Compile and run the code as before:

$ make bootloader.bin
$ python3 usbloader.py bootloader.bin

And you should see a message printed to the serial port once a second.

This timer isn’t using any interrupts, because we haven’t yet written any way to handle them. This means that the delay function, usleep(), simply polls the timer register in a tight loop. This is of course very bad practice generally, but it’s just about okay for run-at-boot initialisation code and short time intervals.
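
The polling delay can be modelled in a few lines. This is a Python model rather than the C in timer.c: the read_counter callback stands in for reading the timer register, and treating the counter as a 32-bit value that wraps around is an assumption made for the sketch.

```python
import itertools

TIMER_HZ = 48_000_000 // 16  # EXTCLK divided by 16: 3 million ticks per second

def us_to_ticks(microseconds):
    """Number of timer ticks in the given number of microseconds."""
    return microseconds * TIMER_HZ // 1_000_000

def elapsed(start, now, bits=32):
    """Ticks between two counter reads, allowing for counter wrap-around."""
    return (now - start) & ((1 << bits) - 1)

def usleep(microseconds, read_counter):
    """Busy-wait by polling read_counter() until enough ticks have passed."""
    start = read_counter()
    target = us_to_ticks(microseconds)
    while elapsed(start, read_counter()) < target:
        pass

# Fake counter that advances 100 ticks per read and wraps past 2**32.
fake = itertools.count(start=0xFFFFFF00, step=100)
usleep(200, lambda: next(fake) & 0xFFFFFFFF)
print(us_to_ticks(200))  # → 600
```

Subtracting and masking, rather than comparing against an absolute deadline, keeps the loop correct even when the counter wraps during the wait.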

Sat
Mar 7

Running code on the CI20 without uboot

This is part of a series of posts about bare-metal programming for the CI20 Creator development board, though you don’t need to have read the other parts (part 1, part 2) to follow this one.

Previously we were using uboot to load the code. This is pretty convenient, but it does mean that there are some mysteries about setting up the board’s hardware which uboot hides from view. What does it do? This post and the next one will explore that a little, by writing some code which doesn’t use it at all.

Where is the RAM?

One thing I took for granted previously is that the board has RAM available for us at 0x80000000, i.e. at the start of KSEG0. We know that the board has 1GB of RAM, and that RAM has to be accessible to us somewhere.

Or does it? The JZ4780 programmer’s manual is full of references to a “DDR [RAM] Controller”, which has quite a complicated set-up. But if the DDR controller has to be initialised, where does that initialisation code run?

It turns out that on the JZ4780, there is in fact a separate tiny (16k) bit of static RAM, known as the tightly-coupled shared memory area, or TCSM, which solves this problem: you can copy your DDR initialisation code into TCSM and then copy the rest into DDR.

Of course, now we are faced with the problem of getting our code into the TCSM. To deal with this problem, the JZ4780 includes a small ROM capable of loading things into TCSM. This boot ROM is quite sophisticated, supporting NAND, SD, USB, and SPI loading.

Booting from USB

If we’re going to play around, then USB loading is quite an attractive option because it’s nice and fast (compared with unplugging SD cards and so on). Unfortunately the USB loader is a bit finicky: the device doesn’t even enumerate properly on Mac OS X (though it’s fine on Linux), and it requires a custom loader. I wrote the custom loader, so if you have access to a Linux box (or, perhaps, if you’re running Linux in a VM) you can follow along.

By default, the CI20 requires you to hold down a button to do a USB boot. That sounded a bit tedious, so I soldered some headers to the button terminals:


Connecting a switch to the headers means I can have always-on USB booting.

We’re going to use the USB OTG port as a peripheral. Remove the jumper from the pins next to the port. This turns the port into a USB peripheral, rather than a host. You can then connect a mini-USB cable from your computer to the OTG port and power up the board. If all is well, you should see a new USB device appear (run lsusb on Linux, or use USB Prober, or System Information, on a Mac). 

Serial-port communication

Rather than flashing an LED, this example now communicates over a serial port. It is the same one, UART4, used by uboot and the Linux kernel, so if you set up a usb-serial cable for previous parts this will work without any changes.

Running the example

I’ve created a new repository for this stuff, which you can check out using git:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/usbloader

You’ll need to build the bootloader (make sure you’ve set up your environment as described in my previous CI20 posts):

$ make bootloader.bin

Now reset the board and load the example using my USB loader. Note that you will need to install pyusb before running this (and that, in turn, will require libusb0).

$ python3 usbloader.py bootloader.bin

If you get “permission denied” errors, you can set up udev rules to give yourself access to the port. Or you can use sudo if you’re naughty.

You should then get a cheery message on the serial port.

Next steps

This example hasn’t actually initialised the DDR. It’s just running straight out of TCSM at the special TCSM location 0xf4000000. Overall, the board is only just barely initialised enough to get stuff out of the serial port. Next time we’ll look at setting up the DDR so that we can begin to think about loading things back into the familiar 0x80000000 “proper” RAM, and see what other board initialisation is needed by the jz4780.

Thu
Feb 12

A reset button for the CI20 Creator MIPS-based board


I’ve been rambling about the CI20 board a lot recently. It’s a fun board, but it lacks a hardware reset button.

Fortunately you can easily hook one up using the EJTAG header. Connect a momentary switch between RST_N (pin 11) and GND (pin 2, 4, 6, 8 or 10). If your board is angled the same way as mine is in the picture, then the pin-out of the header matches that on the hardware page (linked above) — i.e. RST_N is second from bottom on the left and GND is third from bottom on the right.

I’ve got somewhat-reassuring confirmation from the CI20 mailing list that RST_N has a debounce circuit and a pull-up resistor, so a simple switch should be okay.

Wed
Feb 11

The impact of caching on MIPS


I wrote my bare-metal CI20 demo tutorial to run in the uncached area of the MIPS address space. This means that all memory accesses bypass the cache completely and go straight to and from main memory. Unsurprisingly, this is quite slow.

It’s pretty easy to change the demo to use cached memory:

  • Fix the entry point: modify linker.lds, changing “. = 0xa8000000” to “. = 0x88000000”. This moves the code from kseg1 (uncached) to kseg0 (cached);
  • Fix the stack: modify start.S, changing “li sp, 0xa9000000” to “li sp, 0x89000000”. This does the same thing to the stack;
  • Make it more obvious: modify the delay function in main.c, making the loop from 0 to 1000000 (previously it ended at 1000); and
  • Rebuild: make clean && make, then copy the new hello.bin to your tftp directory.
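
The address changes in the list above follow from how MIPS32 lays out its unmapped segments: kseg0 and kseg1 are two windows onto the same low 512MB of physical memory, and only the top three address bits differ. A small sketch of the conversion (the helper names are made up):

```python
KSEG0_BASE = 0x80000000    # cached, unmapped
KSEG1_BASE = 0xA0000000    # uncached, unmapped
SEGMENT_MASK = 0x1FFFFFFF  # low 29 bits: the physical address

def to_physical(vaddr):
    """Physical address behind a kseg0 or kseg1 virtual address."""
    return vaddr & SEGMENT_MASK

def to_kseg0(vaddr):
    """Cached alias of the same physical location."""
    return KSEG0_BASE | to_physical(vaddr)

def to_kseg1(vaddr):
    """Uncached alias of the same physical location."""
    return KSEG1_BASE | to_physical(vaddr)

print(hex(to_kseg0(0xA8000000)))  # → 0x88000000 (entry point)
print(hex(to_kseg0(0xA9000000)))  # → 0x89000000 (stack)
```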

If you do that, then you will see the very dramatic effect that caching has on code performance on the MIPS! I made a short video (37 seconds) demonstrating the difference: https://www.youtube.com/watch?v=Ve0xtVGApbA

Tue
Feb 10

Running bare-metal code on a ci20: part 2

This is a follow-up to part 1, from yesterday.

By this point you have a serial connection to the board, and you have a toolchain. Let’s write some code!

Step 3: Code

What to code? We could always output something to the UART, but why do that when the board has a nice, large, glowing LED on it? The hardware page has a short recipe to make the LED appear purple by rapidly switching it between red and blue. Let’s do that.

Ideally we’d like to write as much as possible in C, but C needs a stack, so the first thing we do is write a very short assembly language program which sets up a stack pointer and jumps to C. Create a file named start.S and add this to it:

#include "mipsregs.h"

/* make it accessible outside */
.globl _start
/* Tell binutils it's a function */
.ent _start
.text

_start:
	/* Set up a stack */
	li sp, 0xa9000000 

	/* And jump to C */
	la t0, entrypoint
	jr t0
	nop

.end _start
	  
All this file does, apart from play nice with binutils, is set up a stack pointer and jump to a symbol named “entrypoint”, which is where we’re going to write our C. Where did the stack pointer address come from? I basically made it up. :) Looking at the MIPS memory map, we know we have an uncached area, called kseg1, starting at 0xa0000000 and ending at 0xc0000000. On this board the first 256MB of that are directly mapped to RAM. We’ll use this area to store code, data, and the stack.

What about “mipsregs.h”? GCC doesn’t natively understand the symbolic register names “sp” and “t0”, so I wrote a small header which defines them for me. The important parts are these two lines:
#define t0 $8  /* temporary values */
#define sp $29 /* stack pointer */
	  

… but the full file is included in my Github repository. See below.

We now have some code which jumps to a function named “entrypoint”, so the next step is to write “entrypoint”. Create a file named main.c, and add these lines:

#define GPIO_F_SET 0xb0010544
#define GPIO_F_CLEAR 0xb0010548

#define GPIO_F_LED_PIN (1 << 15)

static inline void write_l(unsigned int addr, unsigned int val)
{
    volatile unsigned int *ptr = (unsigned int *)(addr);

    *ptr = val;
}

static void delay()
{
    volatile int i;

    for(i=0; i<1000; i++)
        ;
}

void entrypoint(void)
{
    /* Do the purple LED thing */
    while(1) {
        write_l(GPIO_F_CLEAR, GPIO_F_LED_PIN); /* Turn LED blue */
        delay();
        write_l(GPIO_F_SET, GPIO_F_LED_PIN); /* Turn LED red */
        delay();
    }

}
		


As described in the ci20 hardware page, the LED is accessible via a general-purpose IO port — GPIO — called GPIO F (the board also has GPIO ports A to E, and each port is 32 bits wide). Bit 15 of that port controls the LED. If it’s set to 0, the LED is blue, and if it’s set to 1, the LED is red.

The jz4780 makes it very easy to set and clear bits in the GPIO ports. Each port has a “set” address and a “clear” address. Any 1s which you write to the “set” address cause the corresponding bit of the GPIO port to be set, and, conversely, any 1s which you write to the “clear” address cause the corresponding bit to be cleared. In both cases, 0s are ignored. So you can just set or clear the bits you need without worrying about reading from the port first to avoid changing values you’re not interested in.
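
The set/clear behaviour is easy to model in software. The class below is a made-up Python model of one 32-bit port, not a driver. It only shows why writes to the two addresses never disturb bits you did not mention.

```python
class GpioPort:
    """Made-up software model of a 32-bit GPIO port with set/clear registers."""

    def __init__(self, value=0):
        self.value = value

    def write_set(self, mask):
        # 1s in the mask set the matching bits; 0s are ignored.
        self.value |= mask & 0xFFFFFFFF

    def write_clear(self, mask):
        # 1s in the mask clear the matching bits; 0s are ignored.
        self.value &= ~mask & 0xFFFFFFFF

LED_PIN = 1 << 15

port = GpioPort(value=0x1)  # some other pin is already high
port.write_set(LED_PIN)     # LED red
print(hex(port.value))      # → 0x8001
port.write_clear(LED_PIN)   # LED blue
print(hex(port.value))      # → 0x1
```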

You can see that the GPIO ports lie in the uncached KSEG1 region as well. In fact, the jz4780 processor reserves the upper 256MB of KSEG1 (0xb0000000-0xbfffffff) for memory-mapped devices.

The final thing we need to do is to instruct the linker to put everything at an address inside KSEG1. Create a linker script named linker.lds and add the following:

OUTPUT_ARCH(mips)

ENTRY(_start)

SECTIONS
{
    /* Our base address */
    . = 0xa8000000;

    /* Code */
    .text : {
        *(.text)
    }

    /* Static data */
    .rodata : {
        *(.rodata)
        *(.rodata.*)
    }
    /* Non-static data */
    .data : {
        *(.data*)
    }
}
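
Linker scripts are often referred to as voodoo, but this one is pretty simple. We tell the linker to start writing “text” (program code) at 0xa8000000, and that the program’s entry point is _start. Note that ENTRY only records the entry point; _start ends up as the first thing in the image because start.o is the first object file we hand to the linker. Data comes after the code.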

Step 4: Compiling

Now to build everything. Let’s use Make. Create a file named Makefile:

AS=mipsel-unknown-elf-as -mips32
CC=mipsel-unknown-elf-gcc
LD=mipsel-unknown-elf-ld
OBJCOPY=mipsel-unknown-elf-objcopy
CFLAGS=-Os

OBJS=start.o main.o

hello.bin: hello.elf
	$(OBJCOPY) -O binary $< $@

hello.elf: $(OBJS)
	$(LD) -T linker.lds -o $@ $+

%.o: %.[Sc]
	$(CC) $(CFLAGS) -c -o $@ $<

clean:
	rm -f *.o *.elf *.bin

This Makefile does a few interesting things.

  • It uses our custom-built toolchain, in which all the tools are named mipsel-unknown-elf-something
  • It uses GCC to compile everything, even the assembly language files. This is standard practice: GCC runs the C preprocessor over any file whose name ends in .S, so we can use those symbolic register names.
  • It turns the output of the linker into a raw binary file using objcopy.

That last step is required because the linker will produce an “elf” file, which is a structured file in the Executable and Linkable Format, the standard format for executable files on Linux (and many other Unix-like systems, but not Macs). However, we’re going to use uboot to load the file, and we will ask uboot to treat it as a raw “memory image”, which it can just copy directly into RAM and run. We keep the .elf file around, though, because it’s useful for debugging.
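
One quick way to tell the two files apart: an ELF file always begins with the four magic bytes 0x7f, 'E', 'L', 'F', whereas a raw image simply begins with the first bytes of code. Here is a rough sketch of that check in C. The helper name looks_like_elf is hypothetical, and it inspects only the magic number rather than validating the rest of the header.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf starts with the ELF magic number (0x7f 'E' 'L' 'F'). */
static int looks_like_elf(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 4)
        return 0; /* too short to hold the magic number */
    return buf[0] == 0x7f && buf[1] == 'E' && buf[2] == 'L' && buf[3] == 'F';
}
```

Fed the first few bytes of hello.elf this should return 1. For hello.bin it should return 0, since that file starts directly with machine code.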

You should now build everything:

$ make

If you get this far, you’re ready to boot your new “kernel”.

Step 5: Booting!

First, we have to set up a TFTP server. On a Mac, you can use the built-in TFTP server by running these commands from a terminal:

$ sudo launchctl load -F /System/Library/LaunchDaemons/tftp.plist
$ sudo launchctl start com.apple.tftpd

This will serve files from /private/tftpboot. Copy your new code there:

$ sudo cp hello.bin /private/tftpboot/

Now connect an Ethernet cable to your ci20, get your serial terminal ready, and reset the board. When you see “Hit any key to stop autoboot”, which happens within 5 seconds of booting, press a key. You should be greeted by a uboot prompt:

ci20#

This is uboot, which is quite a featureful bootloader — type “help” to get an idea of what it can do. For now, we’re going to configure it to load our file via tftpboot. Set the server IP (the IP address of your TFTP server) and the board’s IP address. My server is at 192.168.1.12, and I gave the board an IP of 192.168.1.7:

ci20# setenv serverip 192.168.1.12
ci20# setenv ipaddr 192.168.1.7

Now we can instruct uboot to load our file:

ci20# tftpboot 192.168.1.12:hello.bin
Load address: 0x88000000
Loading: #
         12.7 KiB/s
done
Bytes transferred = 144 (90 hex)

Note that the default load address is at 0x88000000, which isn’t what we asked for. However, it’s not a problem, because this address and 0xa8000000 both refer to the same location, but the former is cached.
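
The aliasing can be sketched in C. On MIPS32, KSEG0 and KSEG1 are both fixed windows onto the bottom 512MB of physical memory, so masking off the top three bits of an address in either segment gives the physical address. The helper names kseg_to_phys and kseg_is_cached are mine for illustration, not part of any library.

```c
#include <assert.h>
#include <stdint.h>

#define KSEG0_BASE 0x80000000u /* cached, unmapped window */
#define KSEG1_BASE 0xa0000000u /* uncached, unmapped window */
#define KSEG2_BASE 0xc0000000u /* first address past KSEG1 */

/* Physical address behind a KSEG0 or KSEG1 virtual address. */
static uint32_t kseg_to_phys(uint32_t vaddr)
{
    assert(vaddr >= KSEG0_BASE && vaddr < KSEG2_BASE);
    return vaddr & 0x1fffffffu;
}

/* 1 if the address lies in cached KSEG0, 0 if it lies in uncached KSEG1. */
static int kseg_is_cached(uint32_t vaddr)
{
    assert(vaddr >= KSEG0_BASE && vaddr < KSEG2_BASE);
    return vaddr < KSEG1_BASE;
}
```

Both kseg_to_phys(0x88000000) and kseg_to_phys(0xa8000000) return 0x08000000, so the bytes uboot loaded through the cached address are the same ones we are about to execute through the uncached address.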

If you get this far, cross your fingers, and type:

ci20# go 0xa8000000
## Starting application at 0xA8000000 …

You should get a pretty purple (well, pinkish, really) LED, with no operating system required.

Step 6: Future work

Some things one might do to expand on this:

  • Enable caches and work from cached memory
  • Write to the uart, to make a true “Hello, world!” program
  • Set up a timer and flip the LED from the timer interrupt routine
  • Read up on MIPS TLB refill and run the LED flipping function from KUSEG (also known as user space), using the refill handler to map the memory in
  • Use the timer to switch between executing one function which turns the LED red, and another which turns it blue…

Complete code

Is available from GitHub:

https://github.com/nfd/ci20-hello-world