Thu
Dec 15

Some quick notes on threading models

This is sort-of part of the CI20 bare-metal project, but it’s all theory and nothing to do with the CI20 or bare-metal programming. It’s about how to design a system which supports threads.

A thread consists of a stack and a set of registers (okay, and thread-local storage). Threads enter the kernel through syscalls; when they do, it would be very dangerous for the thread running in-kernel to continue to use the user-level stack, so we need a kernel stack for each thread concurrently running in the kernel as well. And of course threads running in-kernel also need their own set of registers, so that user-mode state can be restored when the thread exits the kernel.

Take a concept like a process. Within a process, there are four ways to model the user-mode/kernel-mode thread split:

  • Nothing: One “thread” per process, as per old-school Unix.
  • 1:N: There is only one kernel thread, but multiple user threads.
  • 1:1: For each user thread, there is a corresponding kernel thread.
  • M:N: There are multiple user and kernel threads, but no one-to-one correlation.

1:N means exclusively user-level threads and is very uncommon nowadays, since multi-core systems are so common. 1:1 is the obvious generalisation of “nothing” and is the most common threading model nowadays. The M:N model offers the best performance and flexibility but almost no real-world operating systems actually use it.

Why the M:N model is better

There is a lot of background, but basically it boils down to two things:

  1. You don’t need to enter the kernel and run the (expensive) kernel scheduler so often, and
  2. User-space scheduling can be more efficient as it has more contextual information to work with.

You really need examples to talk about this stuff, and then things get a little murky because it’s hard to know how representative the examples are. Nonetheless Paul Turner gave a good motivating example in his 2013 talk on user-level thread scheduling in Linux: a multi-threaded server where one thread listens to the socket(s) and dispatches to worker threads. The typical thing that happens is that the dispatcher receives data, sends it to a worker, and goes to sleep waiting for more data. If you do this naively, your dispatcher gets scheduled and sends a message to a worker. The kernel scheduler runs and starts the worker, perhaps on a different CPU if one is available. Meanwhile the first CPU resumes running the dispatcher, which promptly goes to sleep waiting for more data. Now the worker is on a fresh CPU with a cold cache. The simple solution here is to pin worker and dispatcher to the same CPU, but this has obvious performance implications as well. Paul’s solution involves adding some syscalls to Linux, in particular a “switch_to” syscall which exchanges the states of the caller and callee thread without doing a reschedule.

This example is made more convincing by being an example of an IO-bound task, which M:N thread models have been traditionally bad at.

Why nobody uses the M:N model

The fundamental problem is backwards compatibility. Debuggers and other tools (such as tracers) assume a 1:1 thread model and get confused. Both the Linux user-level threading patches, and Windows 7’s user-level scheduling, therefore enforce a 1:1 thread model even while doing user-level scheduling (in Linux’s case, by a syscall, and in Windows’ case, by switching entirely at user-level and having the kernel fix things up if a thread enters the kernel but isn’t the one the kernel was expecting).

Some older arguments for an M:N model don’t apply any more, at least on x86 — the SYSENTER and SYSEXIT opcodes make entering and leaving kernel mode very fast, whereas previously it was relatively slow. However, modern scheduler implementations are very complicated, so now the kernel scheduler is the bottleneck. 

Another problem, strongly related to backward compatibility, is POSIX. Delivering signals to user-level threads the kernel has no idea about was, apparently, a nightmare, as was thread-local storage. 

It is of course theoretically possible to implement an M:N model on top of a 1:1 model without any kernel involvement at all, just by putting user-level threads in your language runtime. This was, apparently, what Rust did for a while. Unfortunately, unless your entire system is designed with that model in mind, things eventually go wrong — Rust was obliged, for example, to use asynchronous versions of every system call, which slowed down “normal” code, and libraries which made use of thread-local storage or which made blocking system calls broke the model. Implementing such a thing without pervasive OS-wide support seems very difficult in practise.

Both FreeBSD and NetBSD attempted an M:N implementation, which was apparently rather complex. It’s hard to tell how much of this complexity is related to supporting POSIX semantics, but presumably at least some of it was. NetBSD’s implementation was also, apparently, written before NetBSD supported multiple processors. When multiple processors arrived, threads within a process were limited to a single core.

Windows NT had a user-level threading solution called Fibers, but it was (apparently gratuitously) incompatible with everything except itself, including the win32 API, so doesn’t really add any data — I mention it because it tends to come up in these discussions, along with GNU pth, for some reason.

Fundamentally, it seems that 1:1 is good enough for most purposes most of the time.

What people are doing instead

Both Linux and Windows have, essentially, hacks to support user-level scheduling whilst not breaking backwards compatibility — Linux with the switch_to schedule-without-a-schedule, and Windows by having the kernel fix itself up on entry. Both approaches maintain the 1:1 kernel-to-user mapping for legacy reasons. (Actually it doesn’t seem like the Linux user-level scheduling patches even made it into the mainline kernel.)

Interestingly, both approaches recognise the need to abstract above threads to some extent at user level. The switch_to syscall is fairly high-level and encodes a simple scheduling decision. Meanwhile Windows 7 has a user-level scheduling framework, ConcRT, which aims to lift programmers above a thread-centric model and have them focus instead on a task-centric model.

M:N crops up in research operating systems, such as Nemesis and Barrelfish, and even in some “real” OSes, such as DragonFly BSD (with its “light weight kernel threads”).

More information

Scheduler Activations (pdf): the original M:N threading model (better-quality recreation (pdf)).

User-level threads… with threads (video): Paul Turner’s talk about Linux user-level scheduling.

Dave Probert: Inside Windows 7 - User Mode Scheduler (UMS) (video): Dave Probert’s chat about Windows user-level scheduling.

[rust-dev] The future of M:N threading: on why Rust gave up on their hybrid model.