Dec 26



Just in time for Christmas, almost, here is the classic Amiga demo State of the Art, runnable in a browser. Works well on Chrome, Safari, and Firefox on a computer. Less well on mobile (missing or broken audio — work in progress) and Edge (nobody cares about Edge).

Nov 23

Wire colour to TRS connector, Apple Lightning to 3.5" adapter

The Lightning to 3.5” adapter is a small digital-to-analogue converter connected to a "standard headphone" (not for long) tip-ring-sleeve socket. If you cut the socket off (carefully, as the wires are thin and coated in plastic), you can repurpose it.

Here is the map between wire colour and destination on a 3.5” tip-ring-sleeve connector:

Wire colour   TRS destination   Purpose
Red           Tip               Left
Green+Gold    Tip               Mic
Blue          Sleeve            Ground
Green         Ring              Right
Red+Green     Sleeve            Mic ground

If you are going to solder the wires, it's helpful to melt the plastic coating off them first (with a lighter, for example).

Oct 29

Supporting Clang in the CI20 bare-metal project

This post is part of the CI20 bare-metal project.

Today I added support for the LLVM toolchain (the C/C++ frontend, Clang, and the linker, lld) to the CI20 bare-metal project. It is now possible to build the project using only LLVM tools.

That doesn’t mean GCC support is gone, though! You can now choose between the two toolchains by setting the TOOLCHAIN CMake variable to either ‘gcc’ or ‘clang’ (for example, by running ccmake, or by passing it on the cmake command line). If you’re using GCC, you can set TOOLCHAIN_PREFIX; if you’re using Clang, you can set CLANG_PATH (to the bin directory containing clang and friends). This reflects the fact that GCC tends to be prefixed for different targets, whereas Clang uses the same binary for all targets.

Adding the support wasn’t as hard as I had thought — massive respect to the LLVM team for building something so compatible. Here are some notes. I’m going to try to make them interesting, but I’ve got to be honest: it’s about adding toolchain support to an operating system; there are limits as to how interesting things could possibly get.

  • It is significantly easier to build the LLVM toolchain than it is to build GCC — LLVM uses cmake, includes support for all architectures by default, and automatically includes tool support (clang and lld) if it’s in the right place. This is much nicer than the Makefiles, option-tastic configure scripts, and (if you wish to stay sane) third-party cross-compiler build tools of GCC.
  • It is much harder to get information out of LLVM tools. Want to know what target triples Clang supports? Check the source! On the other hand, the defaults actually do what I want — e.g. the right flavour of MIPS is in there by default (once I figured out how to ask for it) and the linker doesn’t need any particular configuration to support everything.
  • CMake’s support for configuration is somewhat rudimentary, but it works. Internally, user configuration is done by changing variables in the CMake cache — when CMakeLists.txt is evaluated, variables are substituted from the cache if they are present. This weird nomenclature threw me for a bit, but it makes sense once you know the internal details.
  • LLVM’s linker, lld, is a bit less forgiving than GNU ld. I ran into a somewhat opaque error in which lld was complaining that the section flags for the .text section were 0x4 but I wanted to add a file with flags of 0x6; turns out that this is because I had set my start.S file as executable (0x4, ‘x’) but not as allocating space (0x2, ‘a’) and lld was using that object (the first one it saw) to determine section flags. GNU ld, on the other hand, did the sensible thing and added the ‘a’ flag by itself.
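The flags in that last note are ELF section-header flags: SHF_WRITE = 0x1, SHF_ALLOC = 0x2, SHF_EXECINSTR = 0x4. A tiny sketch that decodes them makes errors like lld’s easier to read (the function name here is mine, not part of any tool):

```python
# ELF section-header flags, as defined in the ELF specification.
SHF_WRITE = 0x1      # 'w': section is writable at run time
SHF_ALLOC = 0x2      # 'a': section occupies memory at run time
SHF_EXECINSTR = 0x4  # 'x': section contains executable instructions

def decode_sh_flags(flags):
    """Render an sh_flags value using GNU-style one-letter names."""
    letters = [(SHF_WRITE, "w"), (SHF_ALLOC, "a"), (SHF_EXECINSTR, "x")]
    return "".join(name for bit, name in letters if flags & bit)

print(decode_sh_flags(0x4))  # start.S's .text: executable only -> "x"
print(decode_sh_flags(0x6))  # everyone else's .text -> "ax"
```

Decoded like this, lld’s complaint boils down to: start.S’s .text lacked the ‘a’ flag that every other object’s .text had.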

Building an LLVM toolchain for the CI20 bare-metal project

If you want to use clang/lld instead of gcc/ld, and you don’t have a suitable clang installed, you’ll need to compile it yourself. I followed the Clang getting started instructions (also see the LLVM getting started instructions) and made these changes:

  • Downloaded lld: cd llvm/tools ; svn co http://llvm.org/svn/llvm-project/lld/trunk lld
  • Set CMAKE_BUILD_TYPE to Release, as I’m not interested in actually hacking on LLVM
  • Set CMAKE_INSTALL_PREFIX so as not to interfere with my system toolchain

After that it was a simple make -j4 and make install.

Building a GCC toolchain for the CI20 bare-metal project, 2017 edition (on a Mac)

This isn’t very fair as a comparison, because I’m using a case-insensitive file system. But if you are, too, and want to use GCC, here’s how:

  • Install crosstool-NG. Previously I’ve used Homebrew for this, but the Homebrew crosstool recipe is currently broken, so I cloned it from GitHub and installed it manually.
  • Create a case-sensitive disk image (for building) and mount it somewhere.
  • Create a case-sensitive disk image (for installing) and mount it somewhere. (Tip: hdiutil attach <dmg> -mountpoint <mountpoint> if you don’t want to run your toolchain from /Volumes)
  • Create a build directory inside your build image and cd into it.
  • ct-ng mips-unknown-elf
  • make menuconfig
  • Set prefix directory (Paths and Options) to your case-sensitive disk image.
  • Set endianness (Target Options) to “Little endian”.
  • Save and exit.
  • ct-ng build
  • Unmount (and later delete) the build disk image when done.

As a Mac-specific bonus, gcc, ld, ar, and ranlib (at least) appear to work fine even if they’re running from a case-insensitive filesystem. So if you like, you can unmount your installation image, mount it in /Volumes, and copy the installed files to the right spot (i.e. to the same directory structure as the previously-mounted image). You then don’t need to have any disk images mounted to run gcc.

Sep 18

zlib Z_DATA_ERROR "invalid distance too far back"

If anyone else is looking for this: Internet wisdom says it’s corrupt input data, but another cause could be that you misread the documentation and passed 32 (enable gzip and zlib decoding with header detection) to inflateInit2 but forgot to add a window size. Pass 32 + 15. Magic constants are bad.
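Python’s zlib module exposes the same windowBits convention as the C API, so the difference is easy to demonstrate there (a sketch using Python rather than the original C inflateInit2 call):

```python
import gzip
import zlib

data = b"an example payload " * 10

# The same payload in both container formats.
as_zlib = zlib.compress(data)
as_gzip = gzip.compress(data)

# 32 alone is not a valid window size. Add the window-size logarithm
# (15 is the maximum, i.e. a 32 KiB window): 32 + 15 = 47 means
# "auto-detect zlib or gzip headers, maximum window".
for blob in (as_zlib, as_gzip):
    assert zlib.decompress(blob, wbits=32 + 15) == data
```

Passing 32 on its own is what triggers the misleading "invalid distance too far back" error on otherwise perfectly good input.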

Jul 31

Evolution of SNES games (I swear real work is coming out of this)

Super Nintendo Entertainment System games made their backgrounds out of (usually) 8x8-pixel tiles. Have a look at the first level of Super Mario World with its tilemap superimposed:

[Image: SMW tilemap]

It’s not super easy to see, but it’s clearly very regular. Walkable parts use tiles 385, 386, and 387. Blocks use tiles 96, 97, 98, and 99. The background is all tile 248 (there is actually a separate tile layer with the hills and things).

By contrast, look at level 1 of Donkey Kong Country:

[Image: DKC tilemap]

The walkable portion is all messed up. The walkable ground is made up of tiles, what, 345, 168, 166, 221, 19, 282, 169… and that’s just the ones I looked at quickly in the lower left.

Super Mario World came out in 1990, with the release of the SNES. Donkey Kong Country came out four years later. It seems pretty clear that the tilemap for SMW was made by hand. It seems almost equally clear that the tilemap for DKC was made by a computer. Not obvious here, but quite interesting, is that several tiles are repurposed. For example, there is a tile which is just “noise” — light brown dots on a darker brown background, used as part of the ground. However, that exact same tile, this time using a green palette, also forms part of the forest backdrop. 

This has implications for automated game-playing: it’s pretty obvious that you could train a computer to play SMW just by looking at the tilemap, but for DKC there is presumably a separate memory area which stores the walkable path — looking at the tile map doesn’t buy you much over looking at the bitmap itself.

Feb 5

Raytracing in the browser


On a whim, I bought “Ray Tracing in One Weekend” and wrote a ray tracer. I then put it on the web using Emscripten. Check it out here: http://lardcave.net/raytrace/.

Interestingly / scarily, the Web version runs almost as fast as the native version on my Mac. That’s really impressive.

Jan 28

CI20: User-space cache coherency

This is a CI20 bare-metal project post.

Back in the dual-core post we talked a bit about multicore cache coherency — specifically, that kseg0 didn’t have it. This meant that if two cores wanted to communicate using shared memory in kseg0 they would have to use explicit cache flushing, or turn caching off entirely.

Now that we have a user mode, we can run basically the same test in user space, where things are very different. To do this, we modify sigma0’s main.c to do one of two things:

  • If it’s core 0, increment a counter
  • If it’s core 1, periodically print the counter.

The code for this is in the userspace-cache-coherency branch. If you build and run it, you should see a changing number where the kseg0 version repeatedly printed zeroes. The user-space version is somehow magically keeping the caches coherent between the two cores.
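In spirit, the test looks like the Python sketch below. (This only shows the shape of the test — Python threads on an already-coherent host can’t reproduce the kseg0 failure. The real code is C in sigma0’s main.c, one instance per core.)

```python
import threading
import time

counter = 0
running = True

def core0():
    """'Core 0': increment a shared counter as fast as possible."""
    global counter
    while running:
        counter += 1

def core1(samples=5, interval=0.01):
    """'Core 1': periodically read and print the counter."""
    for _ in range(samples):
        time.sleep(interval)
        print(counter)

t = threading.Thread(target=core0, daemon=True)
t.start()
core1()          # with coherent memory, each printed value keeps growing
running = False
```

On the CI20 in kseg0 the analogous C version printed zeroes forever, because core 1’s cached copy of the counter was never invalidated.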

The magic is that the hardware implements the MSI protocol: when one core modifies a cache line, the other cores find out (either by bus snooping or by consulting a central directory) so they can invalidate their own copies. This is very convenient for us, so why isn’t it implemented in kseg0 as well? Actually, on some architectures, it can be (by setting the CCA bits in the Config register of CP0). But this is implementation-specific: the standard “cacheable” setting (3) of CCA is incoherent.

As noted in the “When You Say Nothing At All” post, the philosophy we’ll adopt with this operating system is to leave inter-core communication to user space, as much as possible, and cache coherence is one of the reasons why.

Jan 23

CI20: User space!

This is a CI20 bare-metal project post.

Big update this time as we enter user mode for the first time. Before we get started, it’s worth checking out the code to follow along. The git tag is ‘userspace’:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/userspace 

Here’s what we’re going to do, overall:

  1. Implement system call handling
  2. Implement TLB management
  3. Write a user-mode program
  4. Get the user-mode program into memory somehow
  5. Run everything

Surprisingly, the most complicated part of the whole process is step 4 — getting the user-mode program into memory. That stage is also where most of the design is in this instalment, so let’s start with that.

4. Get the user-mode program into memory

As you know, the boot process at the moment goes like this:

  1. Use CI20’s built-in USB loader to load stage1, which initialises the DDR RAM.
  2. Use the same built-in USB loader to load and run the kernel.

Fundamentally, we just want to load another ELF. We’re already loading two ELF files to start the system. This leads to obvious solution 1.

Obvious solution 1: add a step 1.5: “Use the USB loader to load the user-space program”.

You’ve probably guessed, since this heading is called “obvious solution 1”, that I didn’t go with it. I think there are several problems with this approach.

The first problem is that you need to tell the kernel where the user-space program is loaded and what the entry point is, so that the kernel can run it. The second problem is that you need to link the user-space program at some address which is not already occupied by the kernel (or, to put it another way: the kernel loads at 0x80000000, where does the user-space program load?). 

Both of the above problems can be avoided by making the kernel deal with them. That leads to obvious solution 2.

Obvious solution 2: store the user-space ELF as an uninterpreted binary blob inside the kernel ELF. Then write an ELF loader in the kernel.

This solution has its own problems. ELF loaders are complicated and a potential source of security vulnerabilities. It’s also not very obvious (to put it politely) that ELF loading is something the kernel should do — perhaps a user-space program could be in charge of it. In that case, all we need to do is get that one user-space ELF loaded, and it can do the rest.

Possibly non-obvious solution which I actually used

My chosen solution is to combine the kernel and user-space programs into a single ELF file according to the following diagram:

Linking process for combined kernel

In words: the kernel is compiled and linked normally (at a base address of kseg0, i.e. 0x80000000). Then the user-mode program, which is called sigma0 for reasons we’ll get back to, is linked. The base address of sigma0 is the highest virtual address of the kernel. In other words, if the kernel is linked to occupy addresses 0x80000000 to 0x80004FFF, then sigma0 will be loaded from 0x80005000 up*. The two files are then combined into one ELF file, which works because the loadable segments of each file do not overlap (but are contiguous). Finally, a data structure in the combined ELF file is patched up to contain the entrypoint of sigma0, as well as some other details.
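The base-address arithmetic here is simple: take the highest end address of any loadable kernel segment and round up to a page boundary. A sketch of the idea, assuming 4 KiB pages (Saruman reads the real values from the kernel’s program headers; the segment numbers below are just the example from the text):

```python
PAGE_SIZE = 0x1000  # assuming 4 KiB pages

def sigma0_link_base(kernel_segments):
    """Given (p_vaddr, p_memsz) pairs for the kernel's loadable
    segments, return the first page-aligned address above all of them."""
    highest = max(vaddr + memsz for vaddr, memsz in kernel_segments)
    # Round up to the next page boundary.
    return (highest + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)

# A kernel occupying 0x80000000..0x80004FFF, as in the example above:
print(hex(sigma0_link_base([(0x80000000, 0x5000)])))  # -> 0x80005000
```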

This adds a lot of complexity, but the complexity is in building the system: the running code is very simple.

This horrible ELF manipulation is achieved through a new toolchain called Saruman. Saruman is now a required part of the CI20 build process.

In your CI20 directory, download and compile Saruman:

$ git clone https://github.com/nfd/saruman
$ cd saruman
$ cmake .
$ make -j 

That covers the loading part. The rest is simple!

* Side note: Why 0x80005000? I thought this was user space? Actually, the link address would be 0x5000 in this example, but the load address inside the ELF file is in kseg0 so that the USB loader can upload it without needing TLB refill support. What sorcery is this? Saruman again.

1. Implement system call handling

There is nothing CI20-specific about this — we implement a very simple MIPS syscall handler which saves all registers (to a structure called “current”), runs a C function to handle the syscall, and then returns to user mode using eret. There is one syscall implemented (implementation in architecture/mips/exception.c) which writes a character to the serial port.

2. Implement TLB management

Like many modern MIPS processors, the JZ4780 provides a separate happy-path TLB exception handler entrypoint. The implementation of this is in start.S (label _exc_tlb_refill) and it simply direct-maps user space — giving vaddrs from 0 to 0x7FFFFFFF the same treatment that vaddrs in kseg0 get (except, of course, that user-mode code can only access the former).
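As a rough illustration of what a refill handler like this computes: each MIPS TLB entry maps an even/odd pair of 4 KiB pages, and for a direct map the physical frame number is just the virtual page number. The bit layout below (PFN at bit 6; C, D, V, G flag bits) follows the MIPS32 privileged architecture, but treat this Python sketch as an explanation of the arithmetic, not a substitute for the assembly in start.S:

```python
# Assumptions: MIPS32, 4 KiB pages, EntryLo layout with PFN at bit 6
# and C (bits 5..3), D (bit 2), V (bit 1), G (bit 0) below it.
PAGE_SHIFT = 12
C_CACHEABLE = 3   # the standard "cacheable" CCA
FLAG_D = 1 << 2   # dirty, i.e. writable
FLAG_V = 1 << 1   # valid
FLAG_G = 1 << 0   # global (ignore ASID)

def refill_entries(bad_vaddr):
    """Compute EntryLo0/EntryLo1 for an identity-mapped page pair."""
    even_vpn = (bad_vaddr >> PAGE_SHIFT) & ~1   # even page of the pair

    def entrylo(pfn):
        return (pfn << 6) | (C_CACHEABLE << 3) | FLAG_D | FLAG_V | FLAG_G

    # Identity map: physical frame number == virtual page number.
    return entrylo(even_vpn), entrylo(even_vpn + 1)

lo0, lo1 = refill_entries(0x00400123)
print(hex(lo0), hex(lo1))
```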

3. Write the user-mode program

This was a relaxing part! Implementation in the sigma0 directory. It does almost nothing at this point. sigma0/main.c includes some commented-out code to access kseg0. If you uncomment this code, you should see the kernel… well, panic with an unhandled exception, which is admittedly not very nice, but which does demonstrate that the program is indeed running in user mode.

4. Get the user-mode program into memory somehow

We did this bit!

5. Run everything

If you’ve downloaded and built Saruman, you should be able to compile and load everything the normal way.

You should see two messages from sigma0 — “Sup0” and “Sup1”. These correspond to the two cores we are running; each core is running its own user-space program. This is all part of the grand plan to eliminate inter-core communication in the kernel, and we’ll discuss it in more detail in the next post.

Jan 15

The Saruman suite of ELF manipulation tools

I’m writing an operating system and one of the interesting parts is constructing the executables. I’d like the kernel to be as small as possible, which means deferring a lot of its functionality to a separate program running with lower privileges — in user space, in other words. But this leads to a bootstrapping problem, since support for loading a user-space program is exactly the sort of complicated thing I would prefer not to add to my kernel.

My solution is to store the initial user-space program, sigma0, as part of the kernel executable. This way, whatever loads the kernel also ends up writing the user-space program into memory as well. This isn’t an original solution; the typical ways people do this are to use objcopy to convert the user-space program to a raw binary blob, or to include the user-space program verbatim and add an ELF loader to the kernel. Neither of these solutions is particularly great — raw binary blobs are by definition structureless and hard to debug, and I don’t want to add an ELF loader to the kernel. (The Linux solution, called initrd, incorporates an entire file system as a giant blob — which is fine for Linux, as it is a giant kernel and already includes filesystem-mounting and ELF-loading functionality.)

So my solution is to make one ELF file containing both kernel and user-space program. ELF files are structured into segments (among other things), so all one needs to do is construct a new ELF file containing segments both for the kernel and for the user-space program, then patch up the kernel to tell it where the user-space program should be.

To cut a long story short, that need, and a few other ELF-related requirements I had, weren’t well served by the standard toolchain. So I wrote the Saruman suite of pipeline-friendly ELF tools.

License: Public domain, as usual 

There are three tools currently.

objcat combines ELF files by producing a new file containing all the loadable segments from all input files. For example, objcat program1 program2 >program3. This obviously doesn’t produce anything interesting if the virtual addresses of the input segments overlap, but it’s useful if they don’t.

objinfo prints information from ELF files. Currently it can print the value of a symbol (objinfo -S name program) or the entrypoint of the file (objinfo -E program).

objpatch lets you overwrite specific bytes in ELF files. The interesting part is that the address of the bytes is specified as a virtual address: objpatch finds the part of the file corresponding to the vaddr you specify. For example, objpatch -V 0x80000c00=u32:0x4000 program >out will overwrite the unsigned 32-bit integer which would load at virtual address 0x80000c00 with the value 0x4000.
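The vaddr-to-file-offset lookup at the heart of a tool like objpatch is a walk over the ELF program headers. A sketch of the idea (field names follow the ELF spec; this is an illustration, not Saruman’s actual code):

```python
def vaddr_to_offset(program_headers, vaddr):
    """Map a virtual address to a file offset via ELF program headers.

    program_headers: iterable of (p_vaddr, p_offset, p_filesz) tuples
    for the loadable segments.
    """
    for p_vaddr, p_offset, p_filesz in program_headers:
        # Only bytes present in the file (p_filesz, not p_memsz) can
        # be patched.
        if p_vaddr <= vaddr < p_vaddr + p_filesz:
            return p_offset + (vaddr - p_vaddr)
    raise ValueError("vaddr 0x%x is not backed by any segment" % vaddr)

# A hypothetical kernel segment loaded at 0x80000000 and stored at file
# offset 0x1000: patching vaddr 0x80000c00 touches file offset 0x1c00.
print(hex(vaddr_to_offset([(0x80000000, 0x1000, 0x5000)], 0x80000c00)))
```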

I’ll probably add more to these in future, but it’s already enough to produce the two-executables-in-one-ELF functionality I need for my kernel.

(Why Saruman and not Sauron? Because Sauron took a good thing and made it bad, but Saruman took a bad thing and made it worse.)

Dec 15

Some quick notes on threading models

This is sort-of part of the CI20 bare-metal project, but it’s all theory and nothing to do with the CI20 or bare-metal programming. It’s about how to design a system which supports threads.

A thread consists of a stack and a set of registers (okay, and thread-local storage). Threads enter the kernel through syscalls; when they do, it would be very dangerous for the thread running in-kernel to continue to use the user-level stack, so we need a kernel stack for each thread concurrently running in the kernel as well. And of course threads running in-kernel also need their own set of registers, so that user-mode state can be restored when the thread exits the kernel.

Take a concept like a process. Within a process, there are four ways to model the user-mode/kernel-mode thread split:

  • Nothing: One “thread” per process, as per old-school Unix.
  • 1:N: There is only one kernel thread, but multiple user threads.
  • 1:1: For each user thread, there is a corresponding kernel thread.
  • M:N: There are multiple user and kernel threads, but no one-to-one correlation.

1:N means exclusively user-level threads and is very uncommon nowadays, since multi-core systems are so common. 1:1 is the obvious generalisation of “nothing” and is the most common threading model nowadays. The M:N model offers the best performance and flexibility but almost no real-world operating systems actually use it.

Why the M:N model is better

There is a lot of background, but basically it boils down to two things:

  1. You don’t need to enter the kernel and run the (expensive) kernel scheduler so often, and
  2. User-space scheduling can be more efficient as it has more contextual information to work with.

You really need examples to talk about this stuff, and then things get a little murky because it’s hard to know how representative the examples are. Nonetheless Paul Turner gave a good motivating example in his 2013 talk on user-level thread scheduling in Linux: a multi-threaded server where one thread listens to the socket(s) and dispatches to worker threads. The typical thing that happens is that the dispatcher receives data, sends it to a worker, and goes to sleep waiting for more data. If you do this naively, your dispatcher gets scheduled and sends a message to a worker. The kernel scheduler runs and starts the worker, perhaps on a different CPU if one is available. Meanwhile the first CPU resumes running the dispatcher, which promptly goes to sleep waiting for more data. Now the worker is on a fresh CPU with a cold cache. The simple solution here is to pin worker and dispatcher to the same CPU, but this has obvious performance implications as well. Paul’s solution involves adding some syscalls to Linux, in particular a “switch_to” syscall which exchanges the states of the caller and callee thread without doing a reschedule.

This example is made more convincing by being an example of an IO-bound task, which M:N thread models have been traditionally bad at.

Why nobody uses the M:N model

The fundamental problem is backwards compatibility. Debuggers and other tools (such as tracers) assume a 1:1 thread model and get confused. Both the Linux user-level threading patches, and Windows 7’s user-level scheduling, therefore enforce a 1:1 thread model even while doing user-level scheduling (in Linux’s case, by a syscall, and in Windows’ case, by switching entirely at user-level and having the kernel fix things up if a thread enters the kernel but isn’t the one the kernel was expecting).

Some older arguments for an M:N model don’t apply any more, at least on x86 — the SYSENTER and SYSEXIT opcodes make entering and leaving kernel mode very fast, whereas previously it was relatively slow. However, modern scheduler implementations are very complicated, so now the kernel scheduler is the bottleneck. 

Another problem, strongly related to backward compatibility, is POSIX. Delivering signals to user-level threads the kernel has no idea about was, apparently, a nightmare, as was thread-local storage. 

It is of course theoretically possible to implement an M:N model on top of a 1:1 model without any kernel involvement at all, just by putting user-level threads in your language runtime. This was, apparently, what Rust did for a while. Unfortunately, unless your entire system is designed with that model in mind, things eventually go wrong — Rust was obliged, for example, to use asynchronous versions of every system call, which slowed down “normal” code, and libraries which made use of thread-local storage or which made blocking system calls broke the model. Implementing such a thing without pervasive OS-wide support seems very difficult in practice.

Both FreeBSD and NetBSD attempted an M:N implementation, which was apparently rather complex. It’s hard to tell how much of this complexity is related to supporting POSIX semantics, but presumably at least some of it was. NetBSD’s implementation was also, apparently, written before NetBSD supported multiple processors. When multiple processors arrived, threads within a process were limited to a single core.

Windows NT had a user-level threading solution called Fibers, but it was (apparently gratuitously) incompatible with everything except itself, including the win32 API, so doesn’t really add any data — I mention it because it tends to come up in these discussions, along with GNU pth, for some reason.

Fundamentally, it seems that 1:1 is good enough for most purposes most of the time.

What people are doing instead

Both Linux and Windows have, essentially, hacks to support user-level scheduling whilst not breaking backwards compatibility — Linux with the switch_to schedule-without-a-schedule, and Windows by having the kernel fix itself up on entry. Both approaches maintain the 1:1 kernel-to-user mapping for legacy reasons. (Actually it doesn’t seem like the Linux user-level scheduling patches even made it into the mainline kernel.)

Interestingly, both approaches recognise the need to abstract above threads to some extent at user level. The switch_to syscall is fairly high-level and encodes a simple scheduling decision. Meanwhile Windows 7 has a user-level scheduling framework, ConcRT, which aims to lift programmers above a thread-centric model and have them focus instead on a task-centric model.

M:N crops up in research operating systems, such as Nemesis and Barrelfish, and even in some “real” OSes, such as DragonFly BSD (with its “light weight kernel threads”).

More information

Scheduler Activations (pdf): the original M:N threading model (better-quality recreation (pdf)).

User-level threads… with threads (video): Paul Turner’s talk about Linux user-level scheduling.

Dave Probert: Inside Windows 7 - User Mode Scheduler (UMS) (video): Dave Probert’s chat about Windows user-level scheduling.

[rust-dev] The future of M:N threading: on why Rust gave up on their hybrid model.