Raytracing in the browser
Interestingly / scarily, the Web version runs almost as fast as the native version on my Mac. That’s really impressive.
Interestingly / scarily, the Web version runs almost as fast as the native version on my Mac. That’s really impressive.
This is a CI20 bare-metal project post.
Back in the dual-core post we talked a bit about multicore cache coherency — specifically, that kseg0 didn’t have it. This meant that if two cores wanted to communicate using shared memory in kseg0 they would have to use explicit cache flushing, or turn caching off entirely.
Now that we have a user mode, we can run basically the same test in user space, where things are very different. To do this, we modify sigma0’s main.c to do one of two things:
The code for this is in the userspace-cache-coherency branch. If you build and run that, you should see a changing number when the kseg0 version repeatedly printed zeroes. The user-space version is somehow magically keeping the caches coherent between the two cores.
The magic is that the hardware implements the MSI protocol: when one core modifies a cache, the other cores find out somehow (either by bus snooping or by consulting a central directory) so they can invalidate their own state. This is very convenient for us, so why isn’t it implemented in kseg0 as well? Actually, on some architectures, it can be (by setting the CCA bits in the Config register of CP0). But this is implementation-specific: the standard “cacheable” setting (3) of CCA is incoherent.
As noted in the “When You Say Nothing At All” post, the philosophy we’ll adopt with this operating system is to leave inter-core communication to user space, as much as possible, and cache coherence is one of the reasons why.
This is a CI20 bare-metal project post.
Big update this time as we enter user mode for the first time. Before we get started, it’s worth checking out the code to follow along. The git tag is ‘userspace’:
$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/userspace
Here’s what we’re going to do, overall:
Surprisingly, the most complicated part of the whole process is step 4 — getting the user-mode program into memory. That stage is also where most of the design is in this instalment, so let’s start with that.
As you know, the boot process at the moment goes like this:
Fundamentally, we just want to load another ELF. We’re already loading two ELF files to start the system. This leads to obvious solution 1.
You’ve probably guessed, since this heading is called “obvious solution 1”, that I didn’t go with it. I think there are several problems with this approach.
The first problem is that you need to tell the kernel where the user-space program is loaded and what the entry point is, so that the kernel can run it. The second problem is that you need to link the user-space program at some address which is not already occupied by the kernel (or, to put it another way: the kernel loads at 0x80000000, where does the user-space program load?).
Both of the above problems can be avoided by making the kernel deal with them. That leads to obvious solution 2.
This solution has its own problems. ELF loaders are complicated and a potential source of security vulnerabilities. It’s also not very obvious (to put it politely) that ELF loading needs to be something the kernel should do — perhaps a user-space program could be in charge of ELF loading. So all we need to do is to get that one user-space ELF loaded, and then it could do the rest.
My chosen solution is to combine the kernel and user-space programs into a single ELF file according to the following diagram:
In words: the kernel is compiled and linked normally (at a base address of kseg0, i.e. 0x80000000). Then the user-mode program, which is called sigma0 for reasons we’ll get back to, is linked. The base address of sigma0 is the highest virtual address of the kernel. In other words, if the kernel is linked to occupy addresses 0x80000000 to 0x80004FFF, then sigma0 will be loaded from 0x80005000 up*. The two files are then combined into one ELF file, which works because the loadable segments of each file do not overlap (but are contiguous). Finally, a data structure in the combined ELF file is patched up to contain the entrypoint of Sigma0, as well as some other details.
This adds a lot of complexity, but the complexity is in building the system: the running code is very simple.
This horrible ELF manipulation is achieved through a new toolchain called Saruman. Saruman is now a required part of the CI20 build process.
In your CI20 directory, download and compile Saruman:
$ git clone https://github.com/nfd/saruman
$ cd saruman
$ cmake .
$ make -j
That covers the loading part. The rest is simple!
* Side note: Why 0x80005000? I thought this was user space? Actually, the link address would be 0x5000 in this example, but the load address inside the ELF file is in kseg0 so that the USB loader can upload it without needing TLB refill support. What sorcery is this? Saruman again.
There is nothing CI20-specific about this — we implement a very simple MIPS syscall handler which saves all registers (to a structure called “current”), runs a C function to handle the syscall, and then returns to user mode using eret. There is one syscall implemented (implementation in architecture/mips/exception.c) which writes a character to the serial port.
Like many modern MIPS processors, the JZ4780 provides a separate happy-path TLB exception handler entrypoint. The implementation of this is in start.S (label _exc_tlb_refill) and it simply direct-maps user space — giving vaddrs from 0 to 0x7FFFFFFF the same treatment that vaddrs in kseg0 get (except, of course, that user-mode code can only access the former).
This was a relaxing part! Implementation in the sigma0 directory. It does almost nothing at this point. sigma0/main.c includes some commented-out code to access kseg0. If you uncomment this code, you should see the kernel… well, panic with an unhandled exception, which is admittedly not very nice, but which does demonstrate that the program is indeed running in user mode.
We did this bit!
If you’ve downloaded and built Saruman, you should be able to compile and load everything the normal way.
You should see two messages from sigma0 — “Sup0” and “Sup1”. These correspond to the two cores we are running. Each core is running its own user-space program. This is all part of the grand plan to eliminate inter-core communication on the kernel, and we’ll discuss it in more detail in the next post.
I’m writing an operating system and one of the interesting parts is constructing the executables. I’d like the kernel to be as small as possible, which means deferring a lot of its functionality to a separate program running with lower privileges — in user space, in other words. But this leads to a bootstrapping problem, since support for loading a user-space program is exactly the sort of complicated thing I would prefer not to add to my kernel.
My solution is to store the initial user-space program, sigma0, as part of the kernel executable. This way, whatever loads the kernel also ends up writing the user-space program into memory as well. This isn’t an original solution; the typical ways people do this are to use objcopy to convert the user-space program to a raw binary blob, or to include the user-space program verbatim and add an ELF loader to the kernel. Neither of these solutions is particularly great — raw binary blobs are by definition structureless and hard to debug, and I don’t want to add an ELF loader to the kernel. (The Linux solution, called initrd, incorporates an entire file system as a giant blob — which is fine for Linux, as it is a giant kernel and already includes filesystem-mounting and ELF-loading functionality.)
So my solution is to make one ELF file containing both kernel and user-space program. ELF files are structured into segments (among other things), so all one needs to do is construct a new ELF file containing segments both for the kernel and for the user-space program, then patch up the kernel to tell it where the user-space program should be.
To cut a long story short, that need, and a few other ELF-related requirements I had, weren’t well served by the standard toolchain. So I wrote the Saruman suite of pipeline-friendly ELF tools.
License: Public domain, as usual
There are three tools currently.
objcat combines ELF files by producing a new file containing all the loadable segments from all input files. For example, objcat program1 program2 >program3. This obviously doesn’t produce anything interesting if the virtual addresses of the input segments overlap, but it’s useful if they don’t.
objinfo prints information from ELF files. Currently it can print the value of a symbol (objinfo -S name program) or the entrypoint of the file (objinfo -E program)
objpatch lets you overwrite specific bytes in ELF files. The interesting part is that the address of the bytes is specified as a virtual address. Objpatch will find the part of the file corresponding with the vaddr you specify. For example, objpatch -V 0x80000c00=u32:0x4000 program >out will overwrite the unsigned 32-bit integer which would load at virtual address 0x80000c00 with the value 0x4000.
I’ll probably add more to these in future, but it’s already enough to produce the two-executables-in-one-ELF functionality I need for my kernel.
(Why Saruman and not Sauron? Because Sauron took a good thing and made it bad, but Saruman took a bad thing and made it worse.)
This is sort-of part of the CI20 bare-metal project, but it’s all theory and nothing to do with the CI20 or bare-metal programming. It’s about how to design a system which supports threads.
A thread consists of a stack and a set of registers (okay, and thread-local storage). Threads enter the kernel through syscalls; when they do, it would be very dangerous for the thread running in-kernel to continue to use the user-level stack, so we need a kernel stack for each thread concurrently running in the kernel as well. And of course threads running in-kernel also need their own set of registers, so that user-mode state can be restored when the thread exits the kernel.
Take a concept like a process. Within a process, there are four ways to model the user-mode/kernel-mode thread split:
1:N means exclusively user-level threads and is very uncommon nowadays, since multi-core systems are so common. 1:1 is the obvious generalisation of “nothing” and is the most common threading model nowadays. The M:N model offers the best performance and flexibility but almost no real-world operating systems actually use it.
There is a lot of background, but basically it boils down to two things:
You really need examples to talk about this stuff, and then things get a little murky because it’s hard to know how representative the examples are. Nonetheless Paul Turner gave a good motivating example in his 2013 talk on user-level thread scheduling in Linux: a multi-threaded server where one thread listens to the socket(s) and dispatches to worker threads. The typical thing that happens is that the dispatcher receives data, sends it to a worker, and goes to sleep waiting for more data. If you do this naively, your dispatcher gets scheduled and sends a message to a worker. The kernel scheduler runs and starts the worker, perhaps on a different CPU if one is available. Meanwhile the first CPU resumes running the dispatcher, which promptly goes to sleep waiting for more data. Now the worker is on a fresh CPU with a cold cache. The simple solution here is to pin worker and dispatcher to the same CPU, but this has obvious performance implications as well. Paul’s solution involves adding some syscalls to Linux, in particular a “switch_to” syscall which exchanges the states of the caller and callee thread without doing a reschedule.
This example is made more convincing by being an example of an IO-bound task, which M:N thread models have been traditionally bad at.
The fundamental problem is backwards compatibility. Debuggers and other tools (such as tracers) assume a 1:1 thread model and get confused. Both the Linux user-level threading patches, and Windows 7’s user-level scheduling, therefore enforce a 1:1 thread model even while doing user-level scheduling (in Linux’s case, by a syscall, and in Windows’ case, by switching entirely at user-level and having the kernel fix things up if a thread enters the kernel but isn’t the one the kernel was expecting).
Some older arguments for an M:N model don’t apply any more, at least on x86 — the SYSENTER and SYSEXIT opcodes make entering and leaving kernel mode very fast, whereas previously it was relatively slow. However, modern scheduler implementations are very complicated, so now the kernel scheduler is the bottleneck.
Another problem, strongly related to backward compatibility, is POSIX. Delivering signals to user-level threads the kernel has no idea about was, apparently, a nightmare, as was thread-local storage.
It is of course theoretically possible to implement an M:N model on top of a 1:1 model without any kernel involvement at all, just by putting user-level threads in your language runtime. This was, apparently, what Rust did for a while. Unfortunately, unless your entire system is designed with that model in mind, things eventually go wrong — Rust was obliged, for example, to use asynchronous versions of every system call, which slowed down “normal” code, and libraries which made use of thread-local storage or which made blocking system calls broke the model. Implementing such a thing without pervasive OS-wide support seems very difficult in practise.
Both FreeBSD and NetBSD attempted an M:N implementation, which was apparently rather complex. It’s hard to tell how much of this complexity is related to supporting POSIX semantics, but presumably at least some of it was. NetBSD’s implementation was also, apparently, written before NetBSD supported multiple processors. When multiple processors arrived, threads within a process were limited to a single core.
Windows NT had a user-level threading solution called Fibers, but it was (apparently gratuitously) incompatible with everything except itself, including the win32 API, so doesn’t really add any data — I mention it because it tends to come up in these discussions, along with GNU pth, for some reason.
Fundamentally, it seems that 1:1 is good enough for most purposes most of the time.
Both Linux and Windows have, essentially, hacks to support user-level scheduling whilst not breaking backwards compatibility — Linux with the switch_to schedule-without-a-schedule, and Windows by having the kernel fix itself up on entry. Both approaches maintain the 1:1 kernel-to-user mapping for legacy reasons. (Actually it doesn’t seem like the Linux user-level scheduling patches even made it into the mainline kernel.)
Interestingly, both approaches recognise the need to abstract above threads to some extent at user level. The switch_to syscall is fairly high-level and encodes a simple scheduling decision. Meanwhile Windows 7 has a user-level scheduling framework, ConcRT, which aims to lift programmers above a thread-centric model and have them focus instead on a task-centric model.
M:N crops up in research operating systems, such as Nemesis and Barrelfish, and even in some “real” OSes, such as DragonFly BSD (with its “light weight kernel threads”).
User-level threads… with threads (video): Paul Turner’s talk about Linux user-level scheduling.
Dave Probert: Inside Windows 7 - User Mode Scheduler (UMS) (video): Dave Probert’s chat about Windows user-level scheduling.
[rust-dev] The future of M:N threading: on why Rust gave up on their hybrid model.
I’ve always liked the song lyric in the title, in a wry sort of way, because of its apparent, and surprisingly blithe, indictment of the recipient’s ability to express themselves, and/or its suggestion that it was past time that the object of the singer’s affections shut up and got on with... well, with whatever nonverbal communication the singer was concerned with at the time. Yes, I realise it’s about how sometimes words are unnecessary and that it’s not supposed to be mean. But it’s so easy to read it as mean! I just don’t think the writer expressed himself very well, that’s all. I suspect he doesn’t take his own advice.
In any case, “you say it best when you say nothing at all" is surprisingly relevant to the CI20 bare-metal project. As we saw in the previous post about multicore, cache (in-)coherence is a real problem when communicating between cores in kseg0. There are multiple options available to us:
1. Communicate using only uncached memory. this is effective, but slow. We could take this option if communication was very infrequent.
2. Manually manage the cache. Use the MIPS cache-management instructions to effectively disable caching only for the shared variable.
3. Communicate via other means. For example, the JZ4780 has support for inter-processor interrupts (IPIs) which are triggered by writing to a special mailbox location in CP0. This is fast and convenient, but limits communication to 32 bits — we could pass a command, or a single pointer.
4. (The country option): say nothing at all. To the extent that it’s possible, eliminate inter-core communication altogether.
Perhaps surprisingly, it seems quite feasible to eliminate inter-core communication for most things that a kernel might want to do. If you’re into OS theory, there is plenty of work in this area, for example in Barrelfish. There are several good reasons to aim for such a thing: no inter-core communication eliminates two particularly tricky sources of bugs (cache invalidation and synchronisation); it simplifies the kernel; and specifically it pushes matters relating to policy (which tends to be where we need multi-core synchronisation) out of the kernel and into user-space (more on this later); and it’s very easy to scale (because if there is no inter-core communication, then inter-core communication can’t possibly become a scalability bottleneck).
So that’s the option we’ll be taking with the CI20 bare-metal project. We will get into cache management at some point, but as far as the kernel is concerned, we will aim to keep each core as isolated as possible. When cores do need to share information — for example, when both cores are running code in the same address space — we will aim to do this sharing outside the kernel.
In this post we’ve moved away from CI20-specific code into more general areas of operating system design. This will be a theme of at least the next few posts. We will still be implementing all this stuff, though, so there will still be plenty of low-level and system-specific code to contend with!
We’re back with some actual code! This time, let’s bring up the second core.
The CI20’s JZ4780 SOC has two MIPS cores on it. By default, though, it starts up with only the first core running, which makes sense — a single core presents the simplest startup environment, and it’s easy enough for the first core to bring up the others.
There are apparently several competing standards for multicore MIPS systems. Refreshingly, the JZ4780 doesn’t follow any of them (well, let’s say it follows the “XBurst standard”, as implemented by… the JZ4780). However, once you know what’s involved, it’s very simple to start up a new core. The key code is in soc/jz4780/multicore.c. There are four main steps:
As a designed-for-mobile SOC, the JZ4780 has a lot of support for controlling energy use. You can selectively power-down almost everything in the SOC (the key register here is LCR, for Low-Power Control Register), and you can also selectively disable the clock source for most system modules using “clock gating”. There is a little more info on this lower down, if you’re interested, but the basics are that you need to enable the core’s clock and make sure it’s not in one of several low-power modes before you can start it up.
Check out and run this code in the usual way (the release tag for this version is “multicore”, i.e. git checkout tags/multicore).
You should see “Hello, world!” followed by (or perhaps partially interspersed with) “Hello, multicore world!”, followed by three rows of 00000000. What’s with those?
In a previous release, we enabled interrupt handling and set up a timer to increment a counter. The kernel entrypoint would then print the value of this counter at regular intervals. But in this release, the code in main.c doesn’t register a counter (and actually the init code doesn’t even set up the timer). However, if you look further down main.c, you will see something interesting. The second core is incrementing the counter like mad, but the first core keeps printing 00000000, even though the “counter” variable is volatile.
The reason is that both cores are accessing “counter” from their cache. Each core has its own level-1 cache (“primary cache” in MIPS terminology), which means that if caching is enabled we need to do something special to ensure that the caches remain coherent, i.e. that they agree with each other. However, if you look in start.S, you will see that we are explicitly not doing this:
/* Enable caching in kseg0 */
li t0, CACHE_MODE_CACHABLE_NONCOHERENT
mtc0 t0, CP0_CONFIG
So we can just change that CACHE_MODE_CACHABLE_NONCOHERENT to CACHE_MODE_UNCACHED and it will work, right? Right, but if you try this you will notice that everything runs much slower (and you may want to change the delay loop in main() from 0xfffffff to, say, 0xfffff, so you’re not waiting for ages).
Fortunately, turning off caching entirely is not the only solution to this problem. More on that in the next installment.
Stacks: The kernel now declares its own stack (as part of its BSS). Previously, start.S was simply carving off a piece of RAM to use as a stack. This obviously doesn’t make sense for anything other than a tutorial, but it gets worse with multiple cores. architecture/mips32/kernelstack.c declares a (small) kernel stack for each core, and architecture/mips32/start.S and soc/jz4780/multicore_lowlevel.S make use of it.
CP0 accessor functions: We now take a FreeBSD-inspired approach to, e.g., writing to device registers or to CP0 — small inline functions with sensible names, used from C. This is nicer than bit shifting and poking values, and much nicer than dropping down to assembly language. An example of this approach is in include/soc/jz4780/jz47xx-cgm.h, which defines a number of functions for gating and ungating clock sources, as used by soc/jz4780/multicore.c.
Library-ification: Start.S, being MIPS-specific, is now part of libsystem.a. Unfortunately this means that it needs to be in its own section, so that the linker can be instructed to place it before any other code. I’m still not entirely happy with what’s going on here behind the scenes — GNU ld seems to be creating a LOAD segment which has zero length and is entirely BSS, which doesn’t make much sense to me — but the outcome is positive overall.
I have hinted at CP0 a few times during this series, but I haven’t explicitly addressed it. The MIPS design includes support for “coprocessors”, which are accessed using special instructions. Coprocessor 0, which is required, is mostly about memory management and caching, but it also has a grab-bag of other things, such as processor information (make, model, size of cache line), plus implementation-defined portions which can do anything — such as turn on the second core, in the case of the JZ4780.
There are two relevant assembly-language instructions: mtc0 (move register value to coprocessor 0) and mfc0 (move from coprocessor 0 to register). In this release, any access to them from C is wrapped up in CP0 accessor functions (see above).
Q: What is clock gating?
A: Clock gating is an energy-saving measure. Switching transistors on and off repeatedly, which is what a clock does, takes energy. In a complex system like the JZ4780, a single clock is connected to multiple modules — so it’s very likely that the clock will be running, but not all modules will be active. In this case, you can gate (disable) the clock source for the modules you aren’t using, without having to turn the clock off completely.
Q: How does clock gating apply here?
A: Well, the jz4780 lets you gate pretty much everything, including the second processor core — and, on startup, pretty much everything, including the second core, is in fact gated. If we enable CORE1 with a gated clock it won't execute any instructions.
Q: I notice that the JZ4780 programmer’s manual refers to CORE1’s clock gate bit as “P1”, with no further information. Did figuring out that P1 meant CORE1 take any time?
A: No, none at all. Well, not much. I mean, a little. Well, definitely not much more than I’d want to admit.
Q: I also notice that the JZ4780 programmer’s manual refers to CORE1 as “SCPU”, with no further information, in documentation for the same module. I guess that was similarly easy to recognise?
A: *hopeless stare*
Q: Look, would you like a coffee?
A: I’d love a coffee!
Welcome back to the CI20 bare-metal project! I’ve got a few interesting things planned, but this release is just a clean-up: there are no new features, but the project now builds using CMake.
CMake build scripts are quite a bit easier to understand than Make, not least because the tool implicitly understands C dependencies, such as header files. Of course, the downside is that we have added CMake as a dependency.
This release is tagged “cmake”:
$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/cmake
To build it for the first time, run CMake:
$ cmake .
Then run make as usual.
You can then boot in the normal way:
$ python3 usbloader.py stage1/stage1.elf
$ python3 usbloader.py kernel/kernel.elf
After the first time, you shouldn’t have to re-run CMake — even if you change the build system (CMakeLists.txt).
I read a lot of Web pages but it’s not always convenient to read them when I find them. Apps like Instapaper and Pocket let one save a version of the page for later offline viewing, such as when one is on an underground train. These apps also default to providing a “readable” version of the page, rather than the original — this is a version where only the useful parts of the page are presented, sans headers, sidebars, ads and so on. At least, that’s the theory — it’s not easy to automatically figure out which parts of the page are useful.
Anyway, I’m reasonably happy with these apps but of course it’d be better if I didn’t need to give my reading history to a third party, so I started wondering how difficult it would be to write one of these things myself. Turns out it’s quite easy. There are two obvious options:
Option 2. Remote-control a browser and extract the useful portions of the page in browser context.
The process, then, is:
… but so far I only have the first three steps of the process to demonstrate to you.
git clone https://github.com/nfd/cefpythondemo
git clone https://github.com/nfd/readability-js
pip install cefpython3
python readabilitydemo.py <url to download>
This will store a file named readable.html in the current working directory. Unfortunately Python 2 is required — this is a limitation of cefpython. Tested on a Mac — some changes may be required for other systems.
Future work: readability.js is a bit old now and doesn’t work for some pages. Also, it’s useful to store images, sometimes, which readability.js doesn’t do. Since we’re running a full browser it’s also possible to take screenshots (of the hidden display), which would be useful for various things. Lots of finessing would be required around timeouts, resource limits, and clean-up before I’d put this thing on a server. And of course it needs an app to be actually useful.
The motivation behind language models is to get some idea of the likelihood of a particular sequence of words, or (same thing) the likelihood of a particular word given a context of previous words. There are many applications of language models — spelling correction and speech recognition come to mind for simple ones, and combined with the concept of word embeddings they can be used to draw analogies, identify synonyms, and more.
An n-gram is a continuous set of items from some text. Unigrams from the previous sentence would be “An”, “n-gram”, “is”, etc. Bigrams would be (“An”, “n-gram”), (“n-gram”, “is”), (“is”, “a”), etc.
Language models compute the probability of a sequence of words, or, given a sequence of words, compute the probability of an upcoming word. They are effectively probabilistic grammars. An n-gram model uses n-grams to compute this probability.
N-gram models make use of the Markov assumption, which is that the Markov property holds. The Markov property of some process is conditional probability of future states depends only on the current state, and not on past states. This means that in a bigram language model we assume that the conditional probability of future words depends only on the previous two words. This is of course totally wrong, but it’s a simplifying assumption which is sufficient for some things.
A unigram model, also known as a bag of words model, uses the frequency of individual words to make its predictions. For each word, the probability is simply the number of times it occurs in the training corpus divided by the total number of words in the corpus. For example, in the statement
“If you like to like those who like people who like you, you are a you-liker liker liker”
… P(like) is 4/18 ~= 0.22.
By extension an n-gram model uses the conditional probability of word n given word n-1, … 1. In the bigram model, we compute P(you | if), P(like | you), P(to | like), P(like | to), etc.
An n-gram language model will perform best at modelling its training set, but the sparsity of language means that it won’t work as well on other corpuses because plenty of probabilities will be exactly zero — for n words which didn’t appear at all in the training set. The general approach to fixing this is to assign small but non-zero probabilities to all zero-labelled n-grams in the set, by “stealing” from the probabilities of other members of the set. You can also, sometimes, just do really stupid things like pretend that every n-gram appears at least once. Why bother? It can make the model more realistic. There is also the practical consideration that a common measurement of model accuracy, perplexity, relies on non-zero probabilities.
An interesting bit of jargon: the maximum likelihood estimate for some parameter of a model is the estimate which maximises the likelihood that that we’re in the training set. To paraphrase the Stanford slides (see references), if “bagel” appears 400 times in a one-million-word corpus, then the MLE is 400/1000000 = 0.0004, and this is the probability that makes it most likely that the text we’re looking at is statistically the same as the training set, i.e. that a randomly-chosen word will be “bagel” 0.04% of the time.
Continuous space models
N-gram models, as discussed above, are based on the completely-unrealistic but sometimes-useful assumption that the n most recent words are all we need to predict future words. Ideally we would set n to be as large as possible, but the sparsity of language means that this isn’t feasible — almost all grammatical 18-word sentences, for example, including that “liker liker liker” abomination above, have never been written before; there simply isn’t enough data to train an n-gram model for large n (and “large” appears to be closer to 4 than 18, in practise).
We saw above that the n-gram approach to solving the zero-probability problem was some form of smoothing. However, this is by definition unrealistic — we invent the probabilities of unseen n-grams based on the probabilities of component (n-1)-grams — or something even less plausible, such as adding a small constant amount.
Continuous-space models address this problem. The idea behind them is to put each word into a real-valued (“continuous”) high-dimensional vector space — in other words, to assign each word a set of real values y1, y2, y3, … which together represent the word. The aim is to achieve this goal:
Goal: Ensure that each dimension of this space, or at least some of them, represents some characteristic of the word.
If we manage to do this, then
Consequence 1: the closer together (in a literal, Euclidean sense) two words are in this space, the more related they are, and
Consequence 2: you can use this property to estimate the likelihoods of n-grams which you haven’t seen before.
The output of a model trained to place words into a space in this fashion is a set of word embeddings — a new name, since they aren’t probabilities any more. (The popular form of continuous space model at the moment uses a neural network to learn the embeddings. Non-neural-network-based methods exist, but I’m not going to write about them here.)
The distributional hypothesis in linguistics states that words used in similar contexts tend to be related somehow. This makes sense intuitively — imagine sentences like “I like to eat x”, “Let’s go to the x”, “It’s not x being green”, and so on. We can train a neural network to take advantage of the distributional hypothesis by training it to predict the next word given a context, i.e. P(W(n) | W(n-j), W(n-j+1) … W(n-1)). The idea is to minimise the distance between the target word and its context. By doing this ‘online’ (example-by-example, rather than in a batch) the vectors for words which tend to be close together end up closer together themselves. Thus the neural network takes advantage of the distributional hypothesis to generate a set of word embeddings — words are initially assigned random vectors and gradually “move into place” during training.
The first neural-network-based word embeddings did the above by maximising the log-likelihood of a given word and its context. This is computationally expensive, because to calculate log-likelihood you need to calculate the probability of all other words in the same context at every training step.
The big innovation in the word2vec algorithm is to use a different training method, noise-contrastive training, which is more efficient computationally. Instead of maximising log-likelihood for a word given its context, word2vec trains a neural network to distinguish a word in a context from several “noise” words in the same context (where a ’noise’ word is a valid word which does not appear in the given context). This is computationally much cheaper (no calculation of other-word probabilities, and the answer is a simple logistic-regression “yes” or “no”.
A neat word2vec trick is coming up with analogies — you can ask questions like “king” - “man” + “woman” = ? and a word2vec-trained language model will answer “queen”. This isn’t specific to word2vec, though — it’s a consequence of the application of the distributional hypothesis. Word2vec made it feasible to compute these embeddings on large data sets, however.
CBOW vs skip-gram
As mentioned above, bag of words is the name given to unigram language models. Continuous bag of words (CBOW) is the extension of that model into a continuous vector space.
The above description of neural continuous models is CBOW. We saw that we train a network to predict the next word given a context — but that context is presented as a vector generated by averaging all the words present in it. This means that CBOW training discards information — you could imagine that multiple contexts (ideally similar ones, but nonetheless) could average to the same, or to very similar, vectors.
The alternative is skip-gram, which flips the training around: rather than attempting to predict word from context, predict each context word from a given target word. It turns out that this approach performs better in larger data sets, but the smoothing provided by CBOW actually helps in smaller data sets.