Sun
Jan 15

The Saruman suite of ELF manipulation tools

I’m writing an operating system and one of the interesting parts is constructing the executables. I’d like the kernel to be as small as possible, which means deferring a lot of its functionality to a separate program running with lower privileges — in user space, in other words. But this leads to a bootstrapping problem, since support for loading a user-space program is exactly the sort of complicated thing I would prefer not to add to my kernel.

My solution is to store the initial user-space program, sigma0, as part of the kernel executable. This way, whatever loads the kernel also ends up writing the user-space program into memory as well. This isn’t an original solution; the typical ways people do this are to use objcopy to convert the user-space program to a raw binary blob, or to include the user-space program verbatim and add an ELF loader to the kernel. Neither of these solutions is particularly great — raw binary blobs are by definition structureless and hard to debug, and I don’t want to add an ELF loader to the kernel. (The Linux solution, called initrd, incorporates an entire file system as a giant blob — which is fine for Linux, as it is a giant kernel and already includes filesystem-mounting and ELF-loading functionality.)

So my solution is to make one ELF file containing both kernel and user-space program. ELF files are structured into segments (among other things), so all one needs to do is construct a new ELF file containing segments both for the kernel and for the user-space program, then patch up the kernel to tell it where the user-space program should be.

To cut a long story short, that need, and a few other ELF-related requirements I had, weren’t well served by the standard toolchain. So I wrote the Saruman suite of pipeline-friendly ELF tools.

Github: https://github.com/nfd/saruman
License: Public domain, as usual 

There are three tools currently.

objcat combines ELF files by producing a new file containing all the loadable segments from all input files. For example, objcat program1 program2 >program3. This obviously doesn’t produce anything interesting if the virtual addresses of the input segments overlap, but it’s useful if they don’t.

objinfo prints information from ELF files. Currently it can print the value of a symbol (objinfo -S name program) or the entrypoint of the file (objinfo -E program).

objpatch lets you overwrite specific bytes in ELF files. The interesting part is that the address of the bytes is specified as a virtual address. Objpatch will find the part of the file corresponding to the vaddr you specify. For example, objpatch -V 0x80000c00=u32:0x4000 program >out will overwrite the unsigned 32-bit integer which would load at virtual address 0x80000c00 with the value 0x4000.
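
The interesting work in objpatch is translating a virtual address into a file offset. Here is a minimal sketch of the idea in C (not Saruman’s actual code, and assuming a 32-bit ELF image already read into memory):

#include <elf.h>
#include <stdint.h>
#include <stddef.h>

/* Return the file offset corresponding to vaddr, or (size_t)-1 if no
   loadable segment contains it. */
static size_t vaddr_to_offset(const uint8_t *image, uint32_t vaddr)
{
	const Elf32_Ehdr *ehdr = (const Elf32_Ehdr *)image;
	const Elf32_Phdr *phdr = (const Elf32_Phdr *)(image + ehdr->e_phoff);

	for (int i = 0; i < ehdr->e_phnum; i++) {
		if (phdr[i].p_type != PT_LOAD)
			continue;
		if (vaddr >= phdr[i].p_vaddr &&
		    vaddr < phdr[i].p_vaddr + phdr[i].p_filesz)
			return phdr[i].p_offset + (vaddr - phdr[i].p_vaddr);
	}
	return (size_t)-1;
}

Once you have the offset, the patch itself is just a write of the requested width at that position in the file.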

I’ll probably add more to these in future, but it’s already enough to produce the two-executables-in-one-ELF functionality I need for my kernel.

(Why Saruman and not Sauron? Because Sauron took a good thing and made it bad, but Saruman took a bad thing and made it worse.)

Thu
Dec 15

Some quick notes on threading models

This is sort-of part of the CI20 bare-metal project, but it’s all theory and nothing to do with the CI20 or bare-metal programming. It’s about how to design a system which supports threads.

A thread consists of a stack and a set of registers (okay, and thread-local storage). Threads enter the kernel through syscalls; when they do, it would be very dangerous for the thread running in-kernel to continue to use the user-level stack, so we need a kernel stack for each thread concurrently running in the kernel as well. And of course threads running in-kernel also need their own set of registers, so that user-mode state can be restored when the thread exits the kernel.
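
As a rough sketch of the state involved (every name here is illustrative, not taken from any particular kernel), the per-thread bookkeeping looks something like this:

#include <stdint.h>

/* Illustrative only: the per-thread state the paragraph above implies. */
struct saved_registers {
	uintptr_t gpr[32];   /* general-purpose registers */
	uintptr_t pc;        /* where to resume execution */
};

struct thread {
	void *user_stack;                    /* used while running in user mode */
	void *kernel_stack;                  /* separate stack used inside the kernel */
	struct saved_registers user_state;   /* user registers, saved on kernel entry */
	void *tls_base;                      /* thread-local storage pointer */
};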

Take a concept like a process. Within a process, there are four ways to model the user-mode/kernel-mode thread split:

  • Nothing: One “thread” per process, as per old-school Unix.
  • 1:N: There is only one kernel thread, but multiple user threads.
  • 1:1: For each user thread, there is a corresponding kernel thread.
  • M:N: There are multiple user and kernel threads, but no one-to-one correlation.

1:N means exclusively user-level threads and is now very uncommon, since multi-core systems are everywhere. 1:1 is the obvious generalisation of “nothing” and is the most common threading model today. The M:N model offers the best performance and flexibility, but almost no real-world operating systems actually use it.

Why the M:N model is better

There is a lot of background, but basically it boils down to two things:

  1. You don’t need to enter the kernel and run the (expensive) kernel scheduler so often, and
  2. User-space scheduling can be more efficient as it has more contextual information to work with.

You really need examples to talk about this stuff, and then things get a little murky because it’s hard to know how representative the examples are. Nonetheless Paul Turner gave a good motivating example in his 2013 talk on user-level thread scheduling in Linux: a multi-threaded server where one thread listens to the socket(s) and dispatches to worker threads. The typical thing that happens is that the dispatcher receives data, sends it to a worker, and goes to sleep waiting for more data. If you do this naively, your dispatcher gets scheduled and sends a message to a worker. The kernel scheduler runs and starts the worker, perhaps on a different CPU if one is available. Meanwhile the first CPU resumes running the dispatcher, which promptly goes to sleep waiting for more data. Now the worker is on a fresh CPU with a cold cache. The simple solution here is to pin worker and dispatcher to the same CPU, but this has obvious performance implications as well. Paul’s solution involves adding some syscalls to Linux, in particular a “switch_to” syscall which exchanges the states of the caller and callee thread without doing a reschedule.

This example is all the more convincing for being an IO-bound task, something M:N threading models have traditionally been bad at.

Why nobody uses the M:N model

The fundamental problem is backwards compatibility. Debuggers and other tools (such as tracers) assume a 1:1 thread model and get confused. Both the Linux user-level threading patches, and Windows 7’s user-level scheduling, therefore enforce a 1:1 thread model even while doing user-level scheduling (in Linux’s case, by a syscall, and in Windows’ case, by switching entirely at user-level and having the kernel fix things up if a thread enters the kernel but isn’t the one the kernel was expecting).

Some older arguments for an M:N model don’t apply any more, at least on x86 — the SYSENTER and SYSEXIT opcodes make entering and leaving kernel mode very fast, whereas previously it was relatively slow. However, modern scheduler implementations are very complicated, so now the kernel scheduler is the bottleneck. 

Another problem, strongly related to backward compatibility, is POSIX. Delivering signals to user-level threads the kernel has no idea about was, apparently, a nightmare, as was thread-local storage. 

It is of course theoretically possible to implement an M:N model on top of a 1:1 model without any kernel involvement at all, just by putting user-level threads in your language runtime. This was, apparently, what Rust did for a while. Unfortunately, unless your entire system is designed with that model in mind, things eventually go wrong — Rust was obliged, for example, to use asynchronous versions of every system call, which slowed down “normal” code, and libraries which made use of thread-local storage or which made blocking system calls broke the model. Implementing such a thing without pervasive OS-wide support seems very difficult in practice.

Both FreeBSD and NetBSD attempted an M:N implementation, which was apparently rather complex. It’s hard to tell how much of this complexity is related to supporting POSIX semantics, but presumably at least some of it was. NetBSD’s implementation was also, apparently, written before NetBSD supported multiple processors. When multiple processors arrived, threads within a process were limited to a single core.

Windows NT had a user-level threading solution called Fibers, but it was (apparently gratuitously) incompatible with everything except itself, including the win32 API, so doesn’t really add any data — I mention it because it tends to come up in these discussions, along with GNU pth, for some reason.

Fundamentally, it seems that 1:1 is good enough for most purposes most of the time.

What people are doing instead

Both Linux and Windows have, essentially, hacks to support user-level scheduling whilst not breaking backwards compatibility — Linux with the switch_to schedule-without-a-schedule, and Windows by having the kernel fix itself up on entry. Both approaches maintain the 1:1 kernel-to-user mapping for legacy reasons. (Actually it doesn’t seem like the Linux user-level scheduling patches even made it into the mainline kernel.)

Interestingly, both approaches recognise the need to abstract above threads to some extent at user level. The switch_to syscall is fairly high-level and encodes a simple scheduling decision. Meanwhile Windows 7 has a user-level scheduling framework, ConcRT, which aims to lift programmers above a thread-centric model and have them focus instead on a task-centric model.

M:N crops up in research operating systems, such as Nemesis and Barrelfish, and even in some “real” OSes, such as DragonFly BSD (with its “light weight kernel threads”).

More information

Scheduler Activations (pdf): the original M:N threading model (better-quality recreation (pdf)).

User-level threads… with threads (video): Paul Turner’s talk about Linux user-level scheduling.

Dave Probert: Inside Windows 7 - User Mode Scheduler (UMS) (video): Dave Probert’s chat about Windows user-level scheduling.

[rust-dev] The future of M:N threading: on why Rust gave up on their hybrid model.

Thu
Dec 8

CI20: You say it best when you say nothing at all

I’ve always liked the song lyric in the title, in a wry sort of way, because of its apparent, and surprisingly blithe, indictment of the recipient’s ability to express themselves, and/or its suggestion that it was past time that the object of the singer’s affections shut up and got on with... well, with whatever nonverbal communication the singer was concerned with at the time. Yes, I realise it’s about how sometimes words are unnecessary and that it’s not supposed to be mean. But it’s so easy to read it as mean! I just don’t think the writer expressed himself very well, that’s all. I suspect he doesn’t take his own advice.

In any case, “you say it best when you say nothing at all” is surprisingly relevant to the CI20 bare-metal project. As we saw in the previous post about multicore, cache (in-)coherence is a real problem when communicating between cores in kseg0. There are multiple options available to us:

1. Communicate using only uncached memory. This is effective, but slow. We could take this option if communication were very infrequent.

2. Manually manage the cache. Use the MIPS cache-management instructions to effectively disable caching only for the shared variable. 

3. Communicate via other means. For example, the JZ4780 has support for inter-processor interrupts (IPIs) which are triggered by writing to a special mailbox location in CP0. This is fast and convenient, but limits communication to 32 bits — we could pass a command, or a single pointer.

4. (The country option): say nothing at all. To the extent that it’s possible, eliminate inter-core communication altogether.

Perhaps surprisingly, it seems quite feasible to eliminate inter-core communication for most things that a kernel might want to do. If you’re into OS theory, there is plenty of work in this area, for example in Barrelfish. There are several good reasons to aim for such a thing: eliminating inter-core communication removes two particularly tricky sources of bugs (cache invalidation and synchronisation); it simplifies the kernel, in particular by pushing matters of policy (which tend to be where multi-core synchronisation is needed) out of the kernel and into user space (more on this later); and it scales trivially, because if there is no inter-core communication, then inter-core communication can’t possibly become a scalability bottleneck.

So that’s the option we’ll be taking with the CI20 bare-metal project. We will get into cache management at some point, but as far as the kernel is concerned, we will aim to keep each core as isolated as possible. When cores do need to share information — for example, when both cores are running code in the same address space — we will aim to do this sharing outside the kernel.

In this post we’ve moved away from CI20-specific code into more general areas of operating system design. This will be a theme of at least the next few posts. We will still be implementing all this stuff, though, so there will still be plenty of low-level and system-specific code to contend with!

Wed
Dec 7

CI20: Dual cores! Double your fun!

We’re back with some actual code! This time, let’s bring up the second core.

The CI20’s JZ4780 SOC has two MIPS cores on it. By default, though, it starts up with only the first core running, which makes sense — a single core presents the simplest startup environment, and it’s easy enough for the first core to bring up the others.

There are apparently several competing standards for multicore MIPS systems. Refreshingly, the JZ4780 doesn’t follow any of them (well, let’s say it follows the “XBurst standard”, as implemented by… the JZ4780). However, once you know what’s involved, it’s very simple to start up a new core. The key code is in soc/jz4780/multicore.c. There are four main steps:

  1. Ensure the core has a clock source.
  2. Set an entry point for the new core.
  3. Ensure the core is powered up.
  4. Take the core out of reset.

There are two functional units which make this happen: coprocessor 0, which has some XBurst-specific multicore extensions, and the accurately-named clock gating / low power control module.
 
As a reminder, all of these module and register names are in reference to the JZ4780 programmer’s manual (PDF link).

CP0 extensions for multicore

Ingenic has added several registers to its coprocessor 0 (for a quick discussion about coprocessor 0, see below). The two we need to manipulate to start the second core are:
  • CORE_CTL: determines whether the core is in sleep mode and whether it is in reset. On startup, CORE1 is held in reset — taking it out of reset (assuming it has power and a clock source) will immediately start it running at its entrypoint. CORE_CTL also has a flag which determines whether a core will use the default (ROM) entry point, or a custom one.
  • CORE_REIM: allows us to set our custom entry point.

Low-power and clock gating settings

As a designed-for-mobile SOC, the JZ4780 has a lot of support for controlling energy use. You can selectively power-down almost everything in the SOC (the key register here is LCR, for Low-Power Control Register), and you can also selectively disable the clock source for most system modules using “clock gating”. There is a little more info on this lower down, if you’re interested, but the basics are that you need to enable the core’s clock and make sure it’s not in one of several low-power modes before you can start it up. 
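
Putting the four steps together, the bring-up sequence looks roughly like the sketch below. Every helper name here is invented for illustration; the real code, with the real register definitions, lives in soc/jz4780/multicore.c.

#include <stdint.h>

/* All of the helpers below are hypothetical stand-ins for the real accessors. */
extern void cgm_ungate_core1(void);               /* clock gating module */
extern void lcr_power_up_core1(void);             /* low-power control */
extern void core_reim_write(uint32_t entry);      /* CP0 CORE_REIM */
extern void core_ctl_use_custom_entrypoint(void); /* CP0 CORE_CTL flag */
extern void core_ctl_release_reset(void);         /* CP0 CORE_CTL flag */

void start_core1(uint32_t entrypoint)
{
	/* 1. Ensure the core has a clock source. */
	cgm_ungate_core1();

	/* 2. Set the entry point. CORE_REIM holds only the upper 16 bits,
	   so the entry point must be 64k-aligned. */
	core_reim_write(entrypoint & 0xffff0000);

	/* 3. Ensure the core is powered up. */
	lcr_power_up_core1();

	/* 4. Take the core out of reset; it starts running immediately. */
	core_ctl_use_custom_entrypoint();
	core_ctl_release_reset();
}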

Running the code

Check out and run this code in the usual way (the release tag for this version is “multicore”, i.e. git checkout tags/multicore).

You should see “Hello, world!” followed by (or perhaps partially interspersed with) “Hello, multicore world!”, followed by three rows of 00000000. What’s with those?

In a previous release, we enabled interrupt handling and set up a timer to increment a counter. The kernel entrypoint would then print the value of this counter at regular intervals. But in this release, the code in main.c doesn’t register a counter (and actually the init code doesn’t even set up the timer). However, if you look further down main.c, you will see something interesting. The second core is incrementing the counter like mad, but the first core keeps printing 00000000, even though the “counter” variable is volatile.

The reason is that both cores are accessing “counter” from their cache. Each core has its own level-1 cache (“primary cache” in MIPS terminology), which means that if caching is enabled we need to do something special to ensure that the caches remain coherent, i.e. that they agree with each other. However, if you look in start.S, you will see that we are explicitly not doing this:

/* Enable caching in kseg0 */
li t0, CACHE_MODE_CACHABLE_NONCOHERENT
mtc0 t0, CP0_CONFIG

So we can just change that CACHE_MODE_CACHABLE_NONCOHERENT to CACHE_MODE_UNCACHED and it will work, right? Right, but if you try this you will notice that everything runs much slower (and you may want to change the delay loop in main() from 0xfffffff to, say, 0xfffff, so you’re not waiting for ages). 

Fortunately, turning off caching entirely is not the only solution to this problem. More on that in the next installment.

Implementation notes

Quite a few small changes went into this release, some of which are worth discussing:
 
Multicore entrypoint: CORE_REIM only allows us to specify the upper 16 bits of an entrypoint — the lower 16 are assumed to be zero. That means that our entry point for multicore must be aligned to 64k. There are several ways to do this, but the way I chose puts the multicore entrypoint right at the start of the kernel, at 0x80000000. This is nice in some ways (it’s as aligned as it could possibly be; there is room for it there before the exception handler) and less nice in others (system-specific functionality in generic code). I may come back to it later.

Stacks: The kernel now declares its own stack (as part of its BSS). Previously, start.S was simply carving off a piece of RAM to use as a stack. This obviously doesn’t make sense for anything other than a tutorial, but it gets worse with multiple cores. architecture/mips32/kernelstack.c declares a (small) kernel stack for each core, and architecture/mips32/start.S and soc/jz4780/multicore_lowlevel.S make use of it.

CP0 accessor functions: We now take a FreeBSD-inspired approach to, e.g., writing to device registers or to CP0 — small inline functions with sensible names, used from C. This is nicer than bit shifting and poking values, and much nicer than dropping down to assembly language. An example of this approach is in include/soc/jz4780/jz47xx-cgm.h, which defines a number of functions for gating and ungating clock sources, as used by soc/jz4780/multicore.c.
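
For example, a CP0 accessor in this style might look like the following (a sketch in the same spirit as the real headers, not code copied from them):

#include <stdint.h>

/* Read and write the CP0 Status register (CP0 register 12). */
static inline uint32_t cp0_status_read(void)
{
	uint32_t val;
	__asm__ volatile("mfc0 %0, $12" : "=r"(val));
	return val;
}

static inline void cp0_status_write(uint32_t val)
{
	__asm__ volatile("mtc0 %0, $12" : : "r"(val));
}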

Library-ification: Start.S, being MIPS-specific, is now part of libsystem.a. Unfortunately this means that it needs to be in its own section, so that the linker can be instructed to place it before any other code. I’m still not entirely happy with what’s going on here behind the scenes — GNU ld seems to be creating a LOAD segment which has zero length and is entirely BSS, which doesn’t make much sense to me — but the outcome is positive overall.

Supplemental material: Coprocessor 0

I have hinted at CP0 a few times during this series, but I haven’t explicitly addressed it. The MIPS design includes support for “coprocessors”, which are accessed using special instructions. Coprocessor 0, which is required, is mostly about memory management and caching, but it also has a grab-bag of other things, such as processor information (make, model, size of cache line), plus implementation-defined portions which can do anything — such as turn on the second core, in the case of the JZ4780. 

There are two relevant assembly-language instructions: mtc0 (move register value to coprocessor 0) and mfc0 (move from coprocessor 0 to register). In this release, any access to them from C is wrapped up in CP0 accessor functions (see above).

Supplemental material: everything you would have wanted to know about clock gating, had you known that clock gating existed

Q: What is clock gating?
A: Clock gating is an energy-saving measure. Switching transistors on and off repeatedly, which is what a clock does, takes energy. In a complex system like the JZ4780, a single clock is connected to multiple modules — so it’s very likely that the clock will be running, but not all modules will be active. In this case, you can gate (disable) the clock source for the modules you aren’t using, without having to turn the clock off completely.

Q: How does clock gating apply here?
A: Well, the JZ4780 lets you gate pretty much everything, including the second processor core — and, on startup, pretty much everything, including the second core, is in fact gated. If we enable CORE1 with a gated clock it won't execute any instructions.
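
In register terms, gating and ungating come down to setting or clearing a single bit in a clock-gate register; as I read the manual, in the CLKGR registers a 1 means the clock is stopped and a 0 means it is running. A hypothetical sketch, with the register pointer and bit position left as parameters (the real definitions are in include/soc/jz4780/jz47xx-cgm.h):

#include <stdint.h>

static inline void cgm_ungate(volatile uint32_t *clkgr, unsigned bit)
{
	*clkgr &= ~(1u << bit);   /* 0 = clock running */
}

static inline void cgm_gate(volatile uint32_t *clkgr, unsigned bit)
{
	*clkgr |= (1u << bit);    /* 1 = clock stopped */
}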

Q: I notice that the JZ4780 programmer’s manual refers to CORE1’s clock gate bit as “P1”, with no further information. Did figuring out that P1 meant CORE1 take any time?
A: No, none at all. Well, not much. I mean, a little. Well, definitely not much more than I’d want to admit.  

Q: I also notice that the JZ4780 programmer’s manual refers to CORE1 as “SCPU”, with no further information, in documentation for the same module. I guess that was similarly easy to recognise?
A: *hopeless stare*

Q: Look, would you like a coffee?
A: I’d love a coffee!

 

Fri
Dec 2

CI20: Switch to CMake

Welcome back to the CI20 bare-metal project! I’ve got a few interesting things planned, but this release is just a clean-up: there are no new features, but the project now builds using CMake.

CMake build scripts are quite a bit easier to understand than Make, not least because the tool implicitly understands C dependencies, such as header files. Of course, the downside is that we have added CMake as a dependency.

This release is tagged “cmake”:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/cmake

To build it for the first time, run CMake:

$ cmake .

Then run make as usual.

$ make

You can then boot in the normal way:

$ python3 usbloader.py stage1/stage1.elf
$ python3 usbloader.py kernel/kernel.elf

After the first time, you shouldn’t have to re-run CMake — even if you change the build system (CMakeLists.txt).

Sat
Nov 12

Cloning Pocket, part 1 of about 1000

I read a lot of Web pages but it’s not always convenient to read them when I find them. Apps like Instapaper and Pocket let one save a version of the page for later offline viewing, such as when one is on an underground train. These apps also default to providing a “readable” version of the page, rather than the original — this is a version where only the useful parts of the page are presented, sans headers, sidebars, ads and so on. At least, that’s the theory — it’s not easy to automatically figure out which parts of the page are useful.

Anyway, I’m reasonably happy with these apps but of course it’d be better if I didn’t need to give my reading history to a third party, so I started wondering how difficult it would be to write one of these things myself. Turns out it’s quite easy. There are two obvious options:

Option 1. Download the page and attempt to parse the HTML manually. This seems like it would initially be easy but fixing annoyances would be hard — for example, what if JavaScript is required to generate the page? This led me to choose:

Option 2. Remote-control a browser and extract the useful portions of the page in browser context.

Obviously, you don’t actually need a complete browser — in particular, you don’t need a user interface — but you do need most of it: JavaScript engine, cookies, everything related to the DOM, and so on. Enter Chromium Embedded Framework (CEF), and in particular Chromium Embedded Framework for Python.

The Chromium Embedded Framework is effectively a full Chromium browser without, uh, the chrome. You get complete control over what it loads and what it runs, and have access to the DOM via JavaScript. This is great from a Pocket-cloning perspective because several JavaScript projects exist to provide a “readable Web page” experience, such as the original Readability.js which was the inspiration behind Firefox and Safari's “Reader View”.

The process, then, is:

  • Retrieve page using CEF.
  • Inject readability.js and wait for it to complete
  • Siphon out the HTML created by readability.js
  • Store this in a database
  • User accounts and Web UI for stored data
  • App to sync the stored data and view it offline
  • ???
  • Profit (Warning: neither Instapaper nor Pocket have figured out what “???” is)

… but so far I only have the first three steps of the process to demonstrate to you.

git clone https://github.com/nfd/cefpythondemo
cd cefpythondemo
git clone https://github.com/nfd/readability-js
pip install cefpython3
python readabilitydemo.py <url to download>

This will store a file named readable.html in the current working directory. Unfortunately Python 2 is required — this is a limitation of cefpython. Tested on a Mac — some changes may be required for other systems.

Future work: readability.js is a bit old now and doesn’t work for some pages. Also, it’s useful to store images, sometimes, which readability.js doesn’t do. Since we’re running a full browser it’s also possible to take screenshots (of the hidden display), which would be useful for various things. Lots of finessing would be required around timeouts, resource limits, and clean-up before I’d put this thing on a server. And of course it needs an app to be actually useful.

Tue
Nov 8

Some quick notes on language models

The motivation behind language models is to get some idea of the likelihood of a particular sequence of words, or (same thing) the likelihood of a particular word given a context of previous words. There are many applications of language models — spelling correction and speech recognition come to mind for simple ones, and combined with the concept of word embeddings they can be used to draw analogies, identify synonyms, and more.

An n-gram is a contiguous sequence of items from some text. Unigrams from the previous sentence would be “An”, “n-gram”, “is”, etc. Bigrams would be (“An”, “n-gram”), (“n-gram”, “is”), (“is”, “a”), etc.

Language models compute the probability of a sequence of words, or, given a sequence of words, compute the probability of an upcoming word. They are effectively probabilistic grammars. An n-gram model uses n-grams to compute this probability.

N-gram models make use of the Markov assumption, which is that the Markov property holds. The Markov property of a process is that the conditional probability of future states depends only on the current state, and not on past states. This means that in a bigram language model we assume that the conditional probability of the next word depends only on the previous word. This is of course totally wrong, but it’s a simplifying assumption which is sufficient for some things.

A unigram model, also known as a bag of words model, uses the frequency of individual words to make its predictions. For each word, the probability is simply the number of times it occurs in the training corpus divided by the total number of words in the corpus. For example, in the statement

“If you like to like those who like people who like you, you are a you-liker liker liker”

… P(like) is 4/18 ~= 0.22.

By extension, an n-gram model uses the conditional probability of a word given the previous n−1 words. In the bigram model, we compute P(you | if), P(like | you), P(to | like), P(like | to), etc.
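
Written out as a formula, the bigram estimate is just counting (this is the maximum likelihood estimate discussed below):

P(w_n \mid w_{n-1}) = \frac{\mathrm{count}(w_{n-1}\,w_n)}{\mathrm{count}(w_{n-1})}

So, from the sentence above, P(like | you) = count(“you like”) / count(“you”) = 1/3.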

An n-gram language model will perform best at modelling its training set, but the sparsity of language means that it won’t work as well on other corpuses, because plenty of probabilities will be exactly zero — any sequence of n words which didn’t appear in the training set at all gets probability zero. The general approach to fixing this is to assign small but non-zero probabilities to all the zero-count n-grams, by “stealing” from the probabilities of other members of the set. You can also, sometimes, just do really stupid things like pretend that every n-gram appears at least once. Why bother? It can make the model more realistic. There is also the practical consideration that a common measurement of model accuracy, perplexity, relies on non-zero probabilities.

An interesting bit of jargon: the maximum likelihood estimate for some parameter of a model is the estimate which maximises the likelihood of the training set. To paraphrase the Stanford slides (see references), if “bagel” appears 400 times in a one-million-word corpus, then the MLE is 400/1000000 = 0.0004, and this is the probability which makes the observed training set most likely, i.e. the one under which a randomly-chosen word will be “bagel” 0.04% of the time.

Continuous space models

N-gram models, as discussed above, are based on the completely-unrealistic but sometimes-useful assumption that the previous n−1 words are all we need to predict the next word. Ideally we would set n to be as large as possible, but the sparsity of language means that this isn’t feasible — almost all grammatical 18-word sentences, for example, including that “liker liker liker” abomination above, have never been written before; there simply isn’t enough data to train an n-gram model for large n (and “large” appears to be closer to 4 than 18, in practice).

We saw above that the n-gram approach to solving the zero-probability problem was some form of smoothing. However, this is by definition unrealistic — we invent the probabilities of unseen n-grams based on the probabilities of component (n-1)-grams — or something even less plausible, such as adding a small constant amount.

Continuous-space models address this problem. The idea behind them is to put each word into a real-valued (“continuous”) high-dimensional vector space — in other words, to assign each word a set of real values y1, y2, y3, … which together represent the word. The aim is to achieve this goal:

Goal: Ensure that each dimension of this space, or at least some of them, represents some characteristic of the word.

If we manage to do this, then 

Consequence 1: the closer together (in a literal, Euclidean sense) two words are in this space, the more related they are, and
Consequence 2: you can use this property to estimate the likelihoods of n-grams which you haven’t seen before. 

The output of a model trained to place words into a space in this fashion is a set of word embeddings — a new name, since they aren’t probabilities any more. (The popular form of continuous space model at the moment uses a neural network to learn the embeddings. Non-neural-network-based methods exist, but I’m not going to write about them here.)

The distributional hypothesis in linguistics states that words used in similar contexts tend to be related somehow. This makes sense intuitively — imagine sentences like “I like to eat x”, “Let’s go to the x”, “It’s not x being green”, and so on. We can train a neural network to take advantage of the distributional hypothesis by training it to predict the next word given a context, i.e. P(W(n) | W(n-j), W(n-j+1) … W(n-1)). The idea is to minimise the distance between the target word and its context. By doing this ‘online’ (example-by-example, rather than in a batch) the vectors for words which tend to be close together end up closer together themselves. Thus the neural network takes advantage of the distributional hypothesis to generate a set of word embeddings — words are initially assigned random vectors and gradually “move into place” during training.

The first neural-network-based word embeddings did the above by maximising the log-likelihood of a given word and its context. This is computationally expensive, because to calculate log-likelihood you need to calculate the probability of all other words in the same context at every training step. 

The big innovation in the word2vec algorithm is to use a different training method, noise-contrastive training, which is more efficient computationally. Instead of maximising log-likelihood for a word given its context, word2vec trains a neural network to distinguish a word in a context from several “noise” words in the same context (where a “noise” word is a valid word which does not appear in the given context). This is computationally much cheaper (no calculation of other-word probabilities, and the answer is a simple logistic-regression “yes” or “no”).
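
Schematically, for a target word w in context c and k noise words w_1 … w_k drawn from a noise distribution, the per-example objective looks roughly like this (following the word2vec negative-sampling formulation):

\log \sigma(v_w \cdot v_c) + \sum_{i=1}^{k} \log \sigma(-v_{w_i} \cdot v_c)

The first term pushes the real (word, context) pair towards “yes”, the second pushes the sampled noise pairs towards “no”, and each term is a simple logistic regression.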

A neat word2vec trick is coming up with analogies — you can ask questions like “king” - “man” + “woman” = ? and a word2vec-trained language model will answer “queen”. This isn’t specific to word2vec, though — it’s a consequence of the application of the distributional hypothesis. Word2vec made it feasible to compute these embeddings on large data sets, however.
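
In embedding terms the analogy trick is just vector arithmetic followed by a nearest-neighbour search (excluding the query words themselves):

\mathrm{answer} = \arg\max_w \; \cos(v_w,\; v_{king} - v_{man} + v_{woman})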

CBOW vs skip-gram

As mentioned above, bag of words is the name given to unigram language models. Continuous bag of words (CBOW) is the extension of that model into a continuous vector space.

The above description of neural continuous models is CBOW. We saw that we train a network to predict the next word given a context — but that context is presented as a vector generated by averaging all the words present in it. This means that CBOW training discards information — you could imagine that multiple contexts (ideally similar ones, but nonetheless) could average to the same, or to very similar, vectors. 

The alternative is skip-gram, which flips the training around: rather than attempting to predict word from context, predict each context word from a given target word. It turns out that this approach performs better in larger data sets, but the smoothing provided by CBOW actually helps in smaller data sets.

References:

 

Sat
Oct 29

Some quick notes on garbage collection

Just cribbing some stuff from this HN thread for later reference — and possibly, dear reader, your interest!

  • Go recently gained a significantly faster garbage collector with “typical worst-case” performance of 100 usec.
  • It still “stops the world” to do so. In order to stop all threads quickly, it uses page poisoning: have threads read from a page at GC-safe points (such as on function entry); when GC is required the collector can just install a segfault handler, unmap the page, wait for all threads to segfault, do the collection and restart the threads. (This thread-stopping technique is unrelated to the overall GC speed-up, which is about efficiently avoiding stack re-scanning.)
  • The page poisoning technique trades extra memory accesses for fast thread stopping. For example, a loop which makes no function calls never reaches a GC-safe point, so it can apparently stall GC.
  • Apparently the changes increase overall time in GC — to 20% of total runtime for one poster.

Sun
Aug 14

Smartcards available

If you're interested in using a JavaCard-based smartcard with KeePassNFC but don't want to mess about with obtaining a smartcard and suitable reader, then I can sell you one of mine at cost price, which is £5.40 plus postage, pre-loaded with KeePassNFC and NDEF apps. Send me an email if you're interested.

These are NXP J3A040 contactless cards, running JavaCard 2.2.2 and GlobalPlatform 2.1.1. I'll install the KeePassNFC applet and an open source NDEF applet with appropriate configuration. NDEF is important because it means that Android can be configured to automatically start KeePassNFC when the card is presented.

NFC crypto applet source code available

Updates: 14/09/2016: Added applet binary.

The source code for the JavaCard applet and new KPNFC with applet support is now available on Github.

Binaries

Binaries are available:

Building from source

To compile the applet yourself, you'll need to add the following to the ext/ directory:

  • java_card_kit-2_2_2
  • ant-javacard.jar

These are both available from Martin Paljak's excellent AppletPlayground repository.

To build:

$ cd applet
$ ant

You can then install the app using a Global Platform client. To do it using Martin Paljak's gp.jar:

$ java -jar gp.jar --install kpnfc.app

Pre-installation requirement: public key

Before using the applet, you must instruct it to generate a private key. This is done off-device because it takes quite a long time.

$ cd JavaClient
$ java -jar JavaClient.jar generate_card_key