Sun
Oct 29

Supporting Clang in the CI20 bare-metal project

This post is part of the CI20 bare-metal project.

Today I added support for the LLVM toolchain (the C/C++ frontend, Clang, and the linker, lld) to the CI20 bare-metal project. It is now possible to build the project using only LLVM tools.

That doesn’t mean GCC support is gone, though! You can now choose between the two toolchains by setting the TOOLCHAIN CMake variable to either ‘gcc’ or ‘clang’ (for example, by running ccmake, or by passing it on the cmake command line). If you’re using GCC, you can set TOOLCHAIN_PREFIX; if you’re using Clang, you can set CLANG_PATH (to the bin directory containing clang and friends). This reflects the fact that GCC binaries tend to be prefixed per target, whereas Clang uses the same binary for all targets.

Adding the support wasn’t as hard as I had thought — massive respect to the LLVM team for building something so compatible. Here are some notes. I’m going to try to make them interesting, but I’ve got to be honest: it’s about adding toolchain support to an operating system; there are limits as to how interesting things could possibly get.

  • It is significantly easier to build the LLVM toolchain than it is to build GCC — LLVM uses CMake, includes support for all architectures by default, and automatically builds the extra tools (clang and lld) if their source is checked out in the right place. This is much nicer than the Makefiles, option-tastic configure scripts, and (if you wish to stay sane) third-party cross-compiler build tools of GCC.
  • It is much harder to get information out of the LLVM tools. Want to know which target triples clang supports? Check the source! On the other hand, the defaults actually do what I want — e.g. the right flavour of MIPS is in there by default (once I figured out how to ask for it) and the linker doesn’t need any particular configuration to support everything.
  • CMake’s support for configuration is somewhat rudimentary, but it works. Internally, user configuration is done by changing variables in the CMake cache — when CMakeLists.txt is evaluated, variables are substituted from the cache if they are present. This “cache” nomenclature threw me for a bit, but it makes sense once you know the internal details.
  • LLVM’s linker, lld, is a bit less forgiving than GNU ld. I ran into a somewhat opaque error in which lld complained that the section flags for the .text section were 0x4 but I wanted to add a file with flags of 0x6. It turns out this was because I had marked the .text section in my start.S as executable (0x4, ‘x’) but not as allocated (0x2, ‘a’), and lld was using that object (the first one it saw) to determine the output section flags. GNU ld, on the other hand, did the sensible thing and added the ‘a’ flag by itself.

Building an LLVM toolchain for the CI20 bare-metal project

If you want to use clang/lld instead of gcc/ld, and you don’t have a suitable clang installed, you’ll need to compile it yourself. I followed the Clang getting started instructions (also see the LLVM getting started instructions) and made these changes:

  • Downloaded lld: cd llvm/tools ; svn co http://llvm.org/svn/llvm-project/lld/trunk lld
  • Set CMAKE_BUILD_TYPE to Release as I’m not interested in actually hacking on LLVM
  • Set CMAKE_INSTALL_PREFIX so as not to interfere with my system toolchain

After that it was a simple make -j4 and make install.

Building a GCC toolchain for the CI20 bare-metal project, 2017 edition (on a Mac)

This isn’t very fair as a comparison, because I’m using a case-insensitive file system. But if you are, too, and want to use GCC, here’s how:

  • Install Crosstool-ng. Previously I’ve used Homebrew for this, but the Homebrew crosstool recipe is currently broken, so I cloned it from GitHub and installed it manually.
  • Create a case-sensitive disk image (for building) and mount it somewhere.
  • Create a case-sensitive disk image (for installing) and mount it somewhere. (Tip: hdiutil attach <dmg> -mountpoint <mountpoint> if you don’t want to run your toolchain from /Volumes)
  • Create a build directory inside your build image and cd into it.
  • ct-ng mips-unknown-elf
  • make menuconfig
  • Set the prefix directory (Paths and Options) to a path inside your installation disk image.
  • Set endianness (Target Options) to “Little endian”.
  • Save and exit.
  • ct-ng build
  • Unmount (and later delete) the build disk image when done.

As a Mac-specific bonus, gcc, ld, ar, and ranlib (at least) appear to work fine even if they’re running from a case-insensitive filesystem. So if you like, you can unmount your installation image, mount it in /Volumes, and copy the installed files to the right spot (i.e. to the same directory structure as the previously-mounted image). You then don’t need to have any disk images mounted to run gcc.

Mon
Sep 18

zlib Z_DATA_ERROR "invalid distance too far back"

If anyone else is looking for this: Internet wisdom says it’s corrupt input data, but another cause could be that you misread the documentation and passed 32 (enable gzip and zlib decoding with header detection) to inflateInit2 while forgetting to add a window size. Pass 32 + 15. Magic constants are bad.
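
For reference, a minimal initialisation looks something like the following — a sketch only, but note the 32 + 15 rather than a bare 32:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    /* 15 is the window size (32 KiB); adding 32 enables automatic
       gzip/zlib header detection. Passing a bare 32 is the mistake
       described above. */
    z_stream strm;
    memset(&strm, 0, sizeof strm);

    int ret = inflateInit2(&strm, 32 + 15);
    if (ret != Z_OK) {
        fprintf(stderr, "inflateInit2 failed: %d\n", ret);
        return 1;
    }

    /* ... feed strm.next_in / strm.avail_in and call inflate() here ... */

    inflateEnd(&strm);
    return 0;
}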

Mon
Jul 31

Evolution of SNES games (I swear real work is coming out of this)

Super Nintendo Entertainment System games made their backgrounds out of (usually) 8x8-pixel tiles. Have a look at the first level of Super Mario World with its tilemap superimposed:

Smw tilemap

It’s not super easy to see, but it’s clearly very regular. Walkable parts use tiles 385, 386, and 387. Blocks use tiles 96, 97, 98, and 99. The background is all tile 248 (there is actually a separate tile layer with the hills and things).

By contrast, look at level 1 of Donkey Kong Country:

Dkc tilemap

The walkable portion is all messed up. The walkable ground is made up of tiles, what, 345, 168, 166, 221, 19, 282, 169… and that’s just the ones I looked at quickly in the lower left.

Super Mario World came out in 1990, with the release of the SNES. Donkey Kong Country came out four years later. It seems pretty clear that the tilemap for SMW was made by hand. It seems almost equally clear that the tilemap for DKC was made by a computer. Not obvious here, but quite interesting, is that several tiles are repurposed. For example, there is a tile which is just “noise” — light brown dots on a darker brown background, used as part of the ground. However, that exact same tile, this time using a green palette, also forms part of the forest backdrop. 

This has implications for automated game-playing: it’s pretty obvious that you could train a computer to play SMW just by looking at the tilemap, but for DKC there is presumably a separate memory area which stores the walkable path — looking at the tile map doesn’t buy you much over looking at the bitmap itself.

Sun
Feb 5

Raytracing in the browser

Large and pretty

On a whim, I bought “Ray Tracing in One Weekend” and wrote a ray tracer. I then put it on the web using Emscripten. Check it out here: http://lardcave.net/raytrace/.

Interestingly / scarily, the Web version runs almost as fast as the native version on my Mac. That’s really impressive.

Sat
Jan 28

CI20: User-space cache coherency

This is a CI20 bare-metal project post.

Back in the dual-core post we talked a bit about multicore cache coherency — specifically, that kseg0 didn’t have it. This meant that if two cores wanted to communicate using shared memory in kseg0 they would have to use explicit cache flushing, or turn caching off entirely.

Now that we have a user mode, we can run basically the same test in user space, where things are very different. To do this, we modify sigma0’s main.c to do one of two things:

  • If it’s core 0, increment a counter
  • If it’s core 1, periodically print the counter.

The code for this is in the userspace-cache-coherency branch. If you build and run that, you should see a changing number, whereas the kseg0 version repeatedly printed zeroes. The user-space version is somehow magically keeping the caches coherent between the two cores.
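
The test is roughly the following shape — a hedged sketch rather than the actual sigma0/main.c, with current_core() and print_hex() standing in for whatever helpers the real code uses:

#include <stdint.h>

extern unsigned int current_core(void);  /* assumed helper: which core am I? */
extern void print_hex(uint32_t value);   /* assumed helper: print via a syscall */

volatile uint32_t counter;

void sigma0_main(void)
{
    if (current_core() == 0) {
        for (;;)
            counter++;                   /* core 0: increment forever */
    } else {
        for (;;) {
            for (volatile int delay = 0; delay < 0xfffff; delay++)
                ;                        /* crude delay loop */
            print_hex(counter);          /* core 1: the value changes over time */
        }
    }
}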

The magic is that the hardware implements the MSI protocol: when one core modifies a cache line, the other cores find out somehow (either by bus snooping or by consulting a central directory) so they can invalidate their own copy. This is very convenient for us, so why isn’t it implemented in kseg0 as well? Actually, on some implementations, it can be (by setting the CCA bits in the Config register of CP0). But this is implementation-specific: the standard “cacheable” setting (3) of CCA is noncoherent.

As noted in the “When You Say Nothing At All” post, the philosophy we’ll adopt with this operating system is to leave inter-core communication to user space, as much as possible, and cache coherence is one of the reasons why.

Mon
Jan 23

CI20: User space!

This is a CI20 bare-metal project post.

Big update this time as we enter user mode for the first time. Before we get started, it’s worth checking out the code to follow along. The git tag is ‘userspace’:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/userspace 

Here’s what we’re going to do, overall:

  1. Implement system call handling
  2. Implement TLB management
  3. Write a user-mode program
  4. Get the user-mode program into memory somehow
  5. Run everything

Surprisingly, the most complicated part of the whole process is step 4 — getting the user-mode program into memory. That step is also where most of this instalment’s design work lives, so let’s start there.

4. Get the user-mode program into memory

As you know, the boot process at the moment goes like this:

  1. Use CI20’s built-in USB loader to load stage1, which initialises the DDR RAM.
  2. Use the same built-in USB loader to load and run the kernel.

Fundamentally, we just want to load another ELF. We’re already loading two ELF files to start the system. This leads to obvious solution 1.

Obvious solution 1: add a step 1.5: “Use the USB loader to load the user-space program”.

You’ve probably guessed, since this heading is called “obvious solution 1”, that I didn’t go with it. I think there are a couple of problems with this approach.

The first problem is that you need to tell the kernel where the user-space program is loaded and what the entry point is, so that the kernel can run it. The second problem is that you need to link the user-space program at some address which is not already occupied by the kernel (or, to put it another way: the kernel loads at 0x80000000, where does the user-space program load?). 

Both of the above problems can be avoided by making the kernel deal with them. That leads to obvious solution 2.

Obvious solution 2: store the user-space ELF as an uninterpreted binary blob inside the kernel ELF. Then write an ELF loader in the kernel.

This solution has its own problems. ELF loaders are complicated and a potential source of security vulnerabilities. It’s also not very obvious (to put it politely) that ELF loading is something the kernel should be doing — perhaps a user-space program could be in charge of ELF loading instead. Then all we would need to do is get that one user-space ELF loaded, and it could do the rest.

Possibly non-obvious solution which I actually used

My chosen solution is to combine the kernel and user-space programs into a single ELF file according to the following diagram:

Linking process for combined kernel

In words: the kernel is compiled and linked normally (at a base address of kseg0, i.e. 0x80000000). Then the user-mode program, which is called sigma0 for reasons we’ll get back to, is linked. The base address of sigma0 is just past the highest virtual address of the kernel. In other words, if the kernel is linked to occupy addresses 0x80000000 to 0x80004FFF, then sigma0 will be loaded from 0x80005000 up*. The two files are then combined into one ELF file, which works because the loadable segments of each file do not overlap (but are contiguous). Finally, a data structure in the combined ELF file is patched up to contain the entrypoint of sigma0, as well as some other details.

This adds a lot of complexity, but the complexity is in building the system: the running code is very simple.

This horrible ELF manipulation is achieved through a new suite of tools called Saruman. Saruman is now a required part of the CI20 build process.

In your CI20 directory, download and compile Saruman:

$ git clone https://github.com/nfd/saruman
$ cd saruman
$ cmake .
$ make -j 

That covers the loading part. The rest is simple!

* Side note: Why 0x80005000? I thought this was user space? Actually, the link address would be 0x5000 in this example, but the load address inside the ELF file is in kseg0 so that the USB loader can upload it without needing TLB refill support. What sorcery is this? Saruman again.

1. Implement system call handling

There is nothing CI20-specific about this — we implement a very simple MIPS syscall handler which saves all registers (to a structure called “current”), runs a C function to handle the syscall, and then returns to user mode using eret. There is one syscall implemented (implementation in architecture/mips/exception.c) which writes a character to the serial port.
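
To make the shape concrete, here is a hedged sketch of such a dispatcher. It is not the actual exception.c — the structure layout and the names SYSCALL_PUTC and uart_putc are illustrative assumptions:

#include <stdint.h>

enum { SYSCALL_PUTC = 1 };

struct saved_registers {
    uint32_t gpr[32];                  /* filled in by the assembly stub */
};

extern struct saved_registers current; /* "current", as described above */
extern void uart_putc(char c);         /* assumed serial-output helper */

void handle_syscall(void)
{
    uint32_t number = current.gpr[2];  /* $v0: syscall number */
    uint32_t arg0   = current.gpr[4];  /* $a0: first argument */

    switch (number) {
    case SYSCALL_PUTC:
        uart_putc((char)arg0);
        break;
    default:
        break;                         /* unknown syscall: ignore for now */
    }
}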

2. Implement TLB management

Like many modern MIPS processors, the JZ4780 provides a separate happy-path TLB exception handler entrypoint. The implementation of this is in start.S (label _exc_tlb_refill) and it simply direct-maps user space — giving vaddrs from 0 to 0x7FFFFFFF the same treatment that vaddrs in kseg0 get (except, of course, that user-mode code can only access the former).
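
The real handler is a few lines of assembly, but the idea can be sketched in C. This is illustrative only — the constants follow the MIPS32 EntryLo layout, but the names and helpers are not the project’s actual code. The TLB is refilled in pairs of 4 KiB pages whose physical frame number simply equals the virtual frame number:

#include <stdint.h>

#define PAGE_SHIFT         12
#define ENTRYLO_PFN_SHIFT  6
#define ENTRYLO_CCA_CACHED (3u << 3)  /* cacheable, noncoherent */
#define ENTRYLO_DIRTY      (1u << 2)  /* writable */
#define ENTRYLO_VALID      (1u << 1)

static inline uint32_t direct_map_entrylo(uint32_t vaddr)
{
    uint32_t pfn = vaddr >> PAGE_SHIFT;   /* physical frame == virtual frame */
    return (pfn << ENTRYLO_PFN_SHIFT)
         | ENTRYLO_CCA_CACHED | ENTRYLO_DIRTY | ENTRYLO_VALID;
}

/* For a faulting user address, compute the EntryLo0/EntryLo1 pair that the
 * refill handler would write (MIPS maps an even/odd page pair per entry). */
static inline void direct_map_refill(uint32_t badvaddr,
                                     uint32_t *lo0, uint32_t *lo1)
{
    uint32_t even_page = badvaddr & ~((2u << PAGE_SHIFT) - 1);
    *lo0 = direct_map_entrylo(even_page);
    *lo1 = direct_map_entrylo(even_page + (1u << PAGE_SHIFT));
}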

3. Write the user-mode program

This was a relaxing part! Implementation in the sigma0 directory. It does almost nothing at this point. sigma0/main.c includes some commented-out code to access kseg0. If you uncomment this code, you should see the kernel… well, panic with an unhandled exception, which is admittedly not very nice, but which does demonstrate that the program is indeed running in user mode.

4. Get the user-mode program into memory somehow

We did this bit!

5. Run everything

If you’ve downloaded and built Saruman, you should be able to compile and load everything the normal way.

You should see two messages from sigma0 — “Sup0” and “Sup1”. These correspond to the two cores we are running. Each core is running its own user-space program. This is all part of the grand plan to eliminate inter-core communication in the kernel, and we’ll discuss it in more detail in the next post.

Sun
Jan 15

The Saruman suite of ELF manipulation tools

I’m writing an operating system and one of the interesting parts is constructing the executables. I’d like the kernel to be as small as possible, which means deferring a lot of its functionality to a separate program running with lower privileges — in user space, in other words. But this leads to a bootstrapping problem, since support for loading a user-space program is exactly the sort of complicated thing I would prefer not to add to my kernel.

My solution is to store the initial user-space program, sigma0, as part of the kernel executable. This way, whatever loads the kernel also ends up writing the user-space program into memory as well. This isn’t an original solution; the typical ways people do this are to use objcopy to convert the user-space program to a raw binary blob, or to include the user-space program verbatim and add an ELF loader to the kernel. Neither of these solutions is particularly great — raw binary blobs are by definition structureless and hard to debug, and I don’t want to add an ELF loader to the kernel. (The Linux solution, called initrd, incorporates an entire file system as a giant blob — which is fine for Linux, as it is a giant kernel and already includes filesystem-mounting and ELF-loading functionality.)

So my solution is to make one ELF file containing both kernel and user-space program. ELF files are structured into segments (among other things), so all one needs to do is construct a new ELF file containing segments both for the kernel and for the user-space program, then patch up the kernel to tell it where the user-space program should be.

To cut a long story short, that need, and a few other ELF-related requirements I had, weren’t well served by the standard toolchain. So I wrote the Saruman suite of pipeline-friendly ELF tools.

GitHub: https://github.com/nfd/saruman
License: Public domain, as usual 

There are three tools currently.

objcat combines ELF files by producing a new file containing all the loadable segments from all input files. For example, objcat program1 program2 >program3. This obviously doesn’t produce anything interesting if the virtual addresses of the input segments overlap, but it’s useful if they don’t.

objinfo prints information from ELF files. Currently it can print the value of a symbol (objinfo -S name program) or the entrypoint of the file (objinfo -E program).

objpatch lets you overwrite specific bytes in ELF files. The interesting part is that the address of the bytes is specified as a virtual address: objpatch will find the part of the file corresponding to the vaddr you specify. For example, objpatch -V 0x80000c00=u32:0x4000 program >out will overwrite the unsigned 32-bit integer which would load at virtual address 0x80000c00 with the value 0x4000.
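
As a hypothetical example of how these tools compose (the structure and field names below are illustrative, not from the CI20 code): the kernel reserves a small structure at a known symbol, objinfo reports that symbol’s address and the user program’s entrypoint, and objpatch fills the structure in after objcat has glued the files together.

#include <stdint.h>

/* Hypothetical kernel-side structure for a build script to patch, e.g.:
 *   objinfo -S boot_info kernel            (find the structure's vaddr)
 *   objinfo -E userprog                    (find the user program's entrypoint)
 *   objpatch -V <vaddr>=u32:<entrypoint> combined >final
 */
struct boot_info {
    uint32_t user_entrypoint;  /* patched to the user program's entry point */
    uint32_t user_end;         /* patched to the end of the user program's image */
};

struct boot_info boot_info = { 0, 0 };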

I’ll probably add more to these in future, but it’s already enough to produce the two-executables-in-one-ELF functionality I need for my kernel.

(Why Saruman and not Sauron? Because Sauron took a good thing and made it bad, but Saruman took a bad thing and made it worse.)

Thu
Dec 15

Some quick notes on threading models

This is sort-of part of the CI20 bare-metal project, but it’s all theory and has nothing to do with the CI20 or bare-metal programming specifically. It’s about how to design a system which supports threads.

A thread consists of a stack and a set of registers (okay, and thread-local storage). Threads enter the kernel through syscalls; when they do, it would be very dangerous for the thread running in-kernel to continue to use the user-level stack, so we need a kernel stack for each thread concurrently running in the kernel as well. And of course threads running in-kernel also need their own set of registers, so that user-mode state can be restored when the thread exits the kernel.

Take a concept like a process. Within a process, there are four ways to model the user-mode/kernel-mode thread split:

  • Nothing: One “thread” per process, as per old-school Unix.
  • 1:N: There is only one kernel thread, but multiple user threads.
  • 1:1: For each user thread, there is a corresponding kernel thread.
  • M:N: There are multiple user and kernel threads, but no one-to-one correlation.

1:N means exclusively user-level threads; it is very uncommon nowadays because it cannot take advantage of multiple cores, which are now everywhere. 1:1 is the obvious generalisation of “nothing” and is the most common threading model nowadays. The M:N model offers the best performance and flexibility, but almost no real-world operating systems actually use it.

Why the M:N model is better

There is a lot of background, but basically it boils down to two things:

  1. You don’t need to enter the kernel and run the (expensive) kernel scheduler so often, and
  2. User-space scheduling can be more efficient as it has more contextual information to work with.

You really need examples to talk about this stuff, and then things get a little murky because it’s hard to know how representative the examples are. Nonetheless Paul Turner gave a good motivating example in his 2013 talk on user-level thread scheduling in Linux: a multi-threaded server where one thread listens to the socket(s) and dispatches to worker threads. The typical thing that happens is that the dispatcher receives data, sends it to a worker, and goes to sleep waiting for more data. If you do this naively, your dispatcher gets scheduled and sends a message to a worker. The kernel scheduler runs and starts the worker, perhaps on a different CPU if one is available. Meanwhile the first CPU resumes running the dispatcher, which promptly goes to sleep waiting for more data. Now the worker is on a fresh CPU with a cold cache. The simple solution here is to pin worker and dispatcher to the same CPU, but this has obvious performance implications as well. Paul’s solution involves adding some syscalls to Linux, in particular a “switch_to” syscall which exchanges the states of the caller and callee thread without doing a reschedule.

This example is made more convincing by the fact that it is an IO-bound task, which M:N threading models have traditionally been bad at.

Why nobody uses the M:N model

The fundamental problem is backwards compatibility. Debuggers and other tools (such as tracers) assume a 1:1 thread model and get confused. Both the Linux user-level threading patches, and Windows 7’s user-level scheduling, therefore enforce a 1:1 thread model even while doing user-level scheduling (in Linux’s case, by a syscall, and in Windows’ case, by switching entirely at user-level and having the kernel fix things up if a thread enters the kernel but isn’t the one the kernel was expecting).

Some older arguments for an M:N model don’t apply any more, at least on x86 — the SYSENTER and SYSEXIT opcodes make entering and leaving kernel mode very fast, whereas previously it was relatively slow. However, modern scheduler implementations are very complicated, so the bottleneck has moved: now it’s the kernel scheduler itself.

Another problem, strongly related to backward compatibility, is POSIX. Delivering signals to user-level threads the kernel has no idea about was, apparently, a nightmare, as was thread-local storage. 

It is of course theoretically possible to implement an M:N model on top of a 1:1 model without any kernel involvement at all, just by putting user-level threads in your language runtime. This was, apparently, what Rust did for a while. Unfortunately, unless your entire system is designed with that model in mind, things eventually go wrong — Rust was obliged, for example, to use asynchronous versions of every system call, which slowed down “normal” code, and libraries which made use of thread-local storage or which made blocking system calls broke the model. Implementing such a thing without pervasive OS-wide support seems very difficult in practice.

Both FreeBSD and NetBSD attempted an M:N implementation, which was apparently rather complex. It’s hard to tell how much of that complexity was related to supporting POSIX semantics, but presumably at least some of it was. NetBSD’s implementation was also, apparently, written before NetBSD supported multiple processors; when multiple processors arrived, threads within a process were limited to a single core.

Windows NT had a user-level threading solution called Fibers, but it was (apparently gratuitously) incompatible with everything except itself, including the Win32 API, so it doesn’t really add any data — I mention it because it tends to come up in these discussions, along with GNU Pth, for some reason.

Fundamentally, it seems that 1:1 is good enough for most purposes most of the time.

What people are doing instead

Both Linux and Windows have, essentially, hacks to support user-level scheduling whilst not breaking backwards compatibility — Linux with the switch_to switch-without-a-reschedule, and Windows by having the kernel fix itself up on entry. Both approaches maintain the 1:1 kernel-to-user mapping for legacy reasons. (Actually it doesn’t seem like the Linux user-level scheduling patches even made it into the mainline kernel.)

Interestingly, both approaches recognise the need to abstract above threads to some extent at user level. The switch_to syscall is fairly high-level and encodes a simple scheduling decision. Meanwhile Windows 7 has a user-level scheduling framework, ConcRT, which aims to lift programmers above a thread-centric model and have them focus instead on a task-centric model.

M:N crops up in research operating systems, such as Nemesis and Barrelfish, and even in some “real” OSes, such as DragonFly BSD (with its “light weight kernel threads”).

More information

Scheduler Activations (pdf): the original M:N threading model (better-quality recreation (pdf)).

User-level threads… with threads (video): Paul Turner’s talk about Linux user-level scheduling.

Dave Probert: Inside Windows 7 - User Mode Scheduler (UMS) (video): Dave Probert’s chat about Windows user-level scheduling.

[rust-dev] The future of M:N threading: on why Rust gave up on their hybrid model.

Thu
Dec 8

CI20: You say it best when you say nothing at all

I’ve always liked the song lyric in the title, in a wry sort of way, because of its apparent, and surprisingly blithe, indictment of the recipient’s ability to express themselves, and/or its suggestion that it was past time that the object of the singer’s affections shut up and got on with... well, with whatever nonverbal communication the singer was concerned with at the time. Yes, I realise it’s about how sometimes words are unnecessary and that it’s not supposed to be mean. But it’s so easy to read it as mean! I just don’t think the writer expressed himself very well, that’s all. I suspect he doesn’t take his own advice.

In any case, “you say it best when you say nothing at all” is surprisingly relevant to the CI20 bare-metal project. As we saw in the previous post about multicore, cache (in-)coherence is a real problem when communicating between cores in kseg0. There are multiple options available to us:

1. Communicate using only uncached memory. This is effective, but slow. We could take this option if communication were very infrequent.

2. Manually manage the cache. Use the MIPS cache-management instructions to effectively disable caching only for the shared variable. 

3. Communicate via other means. For example, the JZ4780 has support for inter-processor interrupts (IPIs) which are triggered by writing to a special mailbox location in CP0. This is fast and convenient, but limits communication to 32 bits — we could pass a command, or a single pointer.

4. (The country option): say nothing at all. To the extent that it’s possible, eliminate inter-core communication altogether.

Perhaps surprisingly, it seems quite feasible to eliminate inter-core communication for most things that a kernel might want to do. If you’re into OS theory, there is plenty of work in this area, for example in Barrelfish. There are several good reasons to aim for such a thing: no inter-core communication eliminates two particularly tricky sources of bugs (cache invalidation and synchronisation); it simplifies the kernel; it pushes matters of policy (which tend to be where we need multi-core synchronisation) out of the kernel and into user space (more on this later); and it scales trivially (if there is no inter-core communication, then inter-core communication can’t possibly become a scalability bottleneck).

So that’s the option we’ll be taking with the CI20 bare-metal project. We will get into cache management at some point, but as far as the kernel is concerned, we will aim to keep each core as isolated as possible. When cores do need to share information — for example, when both cores are running code in the same address space — we will aim to do this sharing outside the kernel.

In this post we’ve moved away from CI20-specific code into more general areas of operating system design. This will be a theme of at least the next few posts. We will still be implementing all this stuff, though, so there will still be plenty of low-level and system-specific code to contend with!

Wed
Dec 7

CI20: Dual cores! Double your fun!

We’re back with some actual code! This time, let’s bring up the second core.

The CI20’s JZ4780 SOC has two MIPS cores on it. By default, though, it starts up with only the first core running, which makes sense — a single core presents the simplest startup environment, and it’s easy enough for the first core to bring up the others.

There are apparently several competing standards for multicore MIPS systems. Refreshingly, the JZ4780 doesn’t follow any of them (well, let’s say it follows the “XBurst standard”, as implemented by… the JZ4780). However, once you know what’s involved, it’s very simple to start up a new core. The key code is in soc/jz4780/multicore.c. There are four main steps:

  1. Ensure the core has a clock source.
  2. Set an entry point for the new core.
  3. Ensure the core is powered up.
  4. Take the core out of reset.

There are two functional units which make this happen: coprocessor 0, which has some XBurst-specific multicore extensions, and the accurately-named clock gating / low power control module.

As a reminder, all of these module and register names are in reference to the JZ4780 programmer’s manual (PDF link).

CP0 extensions for multicore

Ingenic has added several registers to its coprocessor 0 (for a quick discussion about coprocessor 0, see below). The two we need to manipulate to start the second core are:

  • CORE_CTL: determines whether the core is in sleep mode and whether it is in reset. On startup, CORE1 is held in reset — taking it out of reset (assuming it has power and a clock source) will immediately start it running at its entrypoint. CORE_CTL also has a flag which determines whether a core will use the default (ROM) entry point, or a custom one.
  • CORE_REIM: allows us to set our custom entry point.

Low-power and clock gating settings

As a designed-for-mobile SOC, the JZ4780 has a lot of support for controlling energy use. You can selectively power down almost everything in the SOC (the key register here is LCR, for Low-Power Control Register), and you can also selectively disable the clock source for most system modules using “clock gating”. There is a little more info on this lower down, if you’re interested, but the basics are that you need to enable the core’s clock, and make sure the core isn’t in one of several low-power modes, before you can start it up.

Running the code

Check out and run this code in the usual way (the release tag for this version is “multicore”, i.e. git checkout tags/multicore).

You should see “Hello, world!” followed by (or perhaps partially interspersed with) “Hello, multicore world!”, followed by three rows of 00000000. What’s with those?

In a previous release, we enabled interrupt handling and set up a timer to increment a counter. The kernel entrypoint would then print the value of this counter at regular intervals. But in this release, the code in main.c doesn’t register a counter (and actually the init code doesn’t even set up the timer). However, if you look further down main.c, you will see something interesting. The second core is incrementing the counter like mad, but the first core keeps printing 00000000, even though the “counter” variable is volatile.

The reason is that both cores are accessing “counter” from their cache. Each core has its own level-1 cache (“primary cache” in MIPS terminology), which means that if caching is enabled we need to do something special to ensure that the caches remain coherent, i.e. that they agree with each other. However, if you look in start.S, you will see that we are explicitly not doing this:

/* Enable caching in kseg0 */
li t0, CACHE_MODE_CACHABLE_NONCOHERENT
mtc0 t0, CP0_CONFIG

So we can just change that CACHE_MODE_CACHABLE_NONCOHERENT to CACHE_MODE_UNCACHED and it will work, right? Right, but if you try this you will notice that everything runs much slower (and you may want to change the delay loop in main() from 0xfffffff to, say, 0xfffff, so you’re not waiting for ages). 

Fortunately, turning off caching entirely is not the only solution to this problem. More on that in the next installment.

Implementation notes

Quite a few small changes went into this release, some of which are worth discussing:
 
Multicore entrypoint: CORE_REIM only allows us to specify the upper 16 bits of an entrypoint — the lower 16 are assumed to be zero. That means that our entry point for multicore must be aligned to 64k. There are several ways to do this, but the way I chose puts the multicore entrypoint right at the start of the kernel, at 0x80000000. This is nice in some ways (it’s as aligned as it could possibly be; there is room for it there before the exception handler) and less nice in others (system-specific functionality in generic code). I may come back to it later.

Stacks: The kernel now declares its own stack (as part of its BSS). Previously, start.S was simply carving off a piece of RAM to use as a stack. This obviously doesn’t make sense for anything other than a tutorial, but it gets worse with multiple cores. architecture/mips32/kernelstack.c declares a (small) kernel stack for each core, and architecture/mips32/start.S and soc/jz4780/multicore_lowlevel.S make use of it.
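
In outline, the declaration is something like this (a sketch only — the size and names are illustrative, not the actual kernelstack.c):

#define NUM_CORES         2
#define KERNEL_STACK_SIZE 4096

/* One small kernel stack per core, living in BSS. Conceptually, start.S
 * and multicore_lowlevel.S point each core's sp at the top of its own
 * stack: sp = &kernel_stacks[core_id][KERNEL_STACK_SIZE]. */
char kernel_stacks[NUM_CORES][KERNEL_STACK_SIZE] __attribute__((aligned(16)));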

CP0 accessor functions: We now take a FreeBSD-inspired approach to, e.g., writing to device registers or to CP0 — small inline functions with sensible names, used from C. This is nicer than bit shifting and poking values, and much nicer than dropping down to assembly language. An example of this approach is in include/soc/jz4780/jz47xx-cgm.h, which defines a number of functions for gating and ungating clock sources, as used by soc/jz4780/multicore.c.
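
The style looks roughly like this — a sketch only: the base address, offset, and bit positions are placeholders rather than values from the manual, and the real definitions live in include/soc/jz4780/jz47xx-cgm.h:

#include <stdint.h>

#define CGU_EXAMPLE_BASE     0xb0000000u  /* some uncached (kseg1) device base -- illustrative */
#define CLKGR_EXAMPLE_OFFSET 0x20u        /* some clock-gate register -- illustrative */
#define CGU_REG(offset) (*(volatile uint32_t *)(CGU_EXAMPLE_BASE + (offset)))

static inline void jz47xx_example_ungate(unsigned int bit)
{
    CGU_REG(CLKGR_EXAMPLE_OFFSET) &= ~(1u << bit);  /* 0 = clock running */
}

static inline void jz47xx_example_gate(unsigned int bit)
{
    CGU_REG(CLKGR_EXAMPLE_OFFSET) |= (1u << bit);   /* 1 = clock gated */
}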

Library-ification: start.S, being MIPS-specific, is now part of libsystem.a. Unfortunately this means that it needs to be in its own section, so that the linker can be instructed to place it before any other code. I’m still not entirely happy with what’s going on here behind the scenes — GNU ld seems to be creating a LOAD segment which has zero length and is entirely BSS, which doesn’t make much sense to me — but the outcome is positive overall.

Supplemental material: Coprocessor 0

I have hinted at CP0 a few times during this series, but I haven’t explicitly addressed it. The MIPS design includes support for “coprocessors”, which are accessed using special instructions. Coprocessor 0, which is required, is mostly about memory management and caching, but it also has a grab-bag of other things, such as processor information (make, model, size of cache line), plus implementation-defined portions which can do anything — such as turn on the second core, in the case of the JZ4780. 

There are two relevant assembly-language instructions: mtc0 (move register value to coprocessor 0) and mfc0 (move from coprocessor 0 to register). In this release, any access to them from C is wrapped up in CP0 accessor functions (see above).
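
For example, an accessor pair for the Status register (CP0 register 12) might look like this — written in the spirit of the approach, not with the project’s actual function names:

#include <stdint.h>

/* Tiny inline wrappers so that C code never open-codes mfc0/mtc0. */
static inline uint32_t cp0_status_read(void)
{
    uint32_t value;
    __asm__ __volatile__("mfc0 %0, $12" : "=r"(value));
    return value;
}

static inline void cp0_status_write(uint32_t value)
{
    __asm__ __volatile__("mtc0 %0, $12" : : "r"(value));
}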

Supplemental material: everything you would have wanted to know about clock gating, had you known that clock gating existed

Q: What is clock gating?
A: Clock gating is an energy-saving measure. Switching transistors on and off repeatedly, which is what a clock does, takes energy. In a complex system like the JZ4780, a single clock is connected to multiple modules — so it’s very likely that the clock will be running, but not all modules will be active. In this case, you can gate (disable) the clock source for the modules you aren’t using, without having to turn the clock off completely.

Q: How does clock gating apply here?
A: Well, the JZ4780 lets you gate pretty much everything, including the second processor core — and, on startup, pretty much everything, including the second core, is in fact gated. If we enable CORE1 with a gated clock it won’t execute any instructions.

Q: I notice that the JZ4780 programmer’s manual refers to CORE1’s clock gate bit as “P1”, with no further information. Did figuring out that P1 meant CORE1 take any time?
A: No, none at all. Well, not much. I mean, a little. Well, definitely not much more than I’d want to admit.  

Q: I also notice that the JZ4780 programmer’s manual refers to CORE1 as “SCPU”, with no further information, in documentation for the same module. I guess that was similarly easy to recognise?
A: *hopeless stare*

Q: Look, would you like a coffee?
A: I’d love a coffee!