Dec 2

CI20: Switch to CMake

Welcome back to the CI20 bare-metal project! I’ve got a few interesting things planned, but this release is just a clean-up: there are no new features, but the project now builds using CMake.

CMake build scripts are quite a bit easier to understand than Make, not least because the tool implicitly understands C dependencies, such as header files. Of course, the downside is that we have added CMake as a dependency.

This release is tagged “cmake”:

$ git clone
$ cd ci20-os
$ git checkout tags/cmake

To build it for the first time, run CMake:

$ cmake .

Then run make as usual.

$ make

You can then boot in the normal way:

$ python3 stage1/stage1.elf
$ python3 kernel/kernel.elf

After the first time, you shouldn’t have to re-run CMake — even if you change the build system (CMakeLists.txt).

Nov 12

Cloning Pocket, part 1 of about 1000

I read a lot of Web pages but it’s not always convenient to read them when I find them. Apps like Instapaper and Pocket let one save a version of the page for later offline viewing, such as when one is on an underground train. These apps also default to providing a “readable” version of the page, rather than the original — this is a version where only the useful parts of the page are presented, sans headers, sidebars, ads and so on. At least, that’s the theory — it’s not easy to automatically figure out which parts of the page are useful.

Anyway, I’m reasonably happy with these apps but of course it’d be better if I didn’t need to give my reading history to a third party, so I started wondering how difficult it would be to write one of these things myself. Turns out it’s quite easy. There are two obvious options:

Option 1. Download the page and attempt to parse the HTML manually. This seems like it would initially be easy but fixing annoyances would be hard — for example, what if JavaScript is required to generate the page? This led me to choose:

Option 2. Remote-control a browser and extract the useful portions of the page in browser context.

Obviously, you don’t actually need a complete browser — in particular, you don’t need a user interface — but you do need most of it — JavaScript engine, cookies, everything related to the DOM, and so on. Enter Chromium Embedded Framework (CEF), and in particular Chromium Embedded Framework for Python

The Chromium Embedded Framework is effectively a completely Chromium browser without, uh, the chrome. You get complete control over what it loads and what it runs, and have access to the DOM, via JavaScript. This is great from a Pocket-cloning perspective because several JavaScript projects exist to provide a “readable Web page” experience, such as the original Readability.js which was the inspiration behind Firefox and Safari's “Reader View”.

The process, then, is:

  • Retrieve page using CEF.
  • Inject readability.js and wait for it to complete
  • Siphon out the HTML created by readability.js
  • Store this in a database
  • User accounts and Web UI for stored data
  • App to sync the stored data and view it offline
  • ???
  • Profit (Warning: neither Instapaper nor Pocket have figured out what “???” is)

… but so far I only have the first three steps of the process to demonstrate to you.

git clone
cd cefpythondemo
git clone
pip install cefpython3
python <url to download>

This will store a file named readable.html in the current working directory. Unfortunately Python 2 is required — this is a limitation of cefpython. Tested on a Mac — some changes may be required for other systems.

Future work: readability.js is a bit old now and doesn’t work for some pages. Also, it’s useful to store images, sometimes, which readability.js doesn’t do. Since we’re running a full browser it’s also possible to take screenshots (of the hidden display), which would be useful for various things. Lots of finessing would be required around timeouts, resource limits, and clean-up before I’d put this thing on a server. And of course it needs an app to be actually useful.

Nov 8

Some quick notes on language models

The motivation behind language models is to get some idea of the likelihood of a particular sequence of words, or (same thing) the likelihood of a particular word given a context of previous words. There are many applications of language models — spelling correction and speech recognition come to mind for simple ones, and combined with the concept of word embeddings they can be used to draw analogies, identify synonyms, and more.

An n-gram is a continuous set of items from some text. Unigrams from the previous sentence would be “An”, “n-gram”, “is”, etc. Bigrams would be (“An”, “n-gram”), (“n-gram”, “is”), (“is”, “a”), etc. 

Language models compute the probability of a sequence of words, or, given a sequence of words, compute the probability of an upcoming word. They are effectively probabilistic grammars. An n-gram model uses n-grams to compute this probability.

N-gram models make use of the Markov assumption, which is that the Markov property holds. The Markov property of some process is conditional probability of future states depends only on the current state, and not on past states. This means that in a bigram language model we assume that the conditional probability of future words depends only on the previous two words. This is of course totally wrong, but it’s a simplifying assumption which is sufficient for some things.

A unigram model, also known as a bag of words model, uses the frequency of individual words to make its predictions. For each word, the probability is simply the number of times it occurs in the training corpus divided by the total number of words in the corpus. For example, in the statement

“If you like to like those who like people who like you, you are a you-liker liker liker”

… P(like) is 4/18 ~= 0.22.

By extension an n-gram model uses the conditional probability of word n given word n-1, … 1. In the bigram model, we compute P(you | if), P(like | you), P(to | like), P(like | to), etc. 

An n-gram language model will perform best at modelling its training set, but the sparsity of language means that it won’t work as well on other corpuses because plenty of probabilities will be exactly zero — for n words which didn’t appear at all in the training set. The general approach to fixing this is to assign small but non-zero probabilities to all zero-labelled n-grams in the set, by “stealing” from the probabilities of other members of the set. You can also, sometimes, just do really stupid things like pretend that every n-gram appears at least once. Why bother? It can make the model more realistic. There is also the practical consideration that a common measurement of model accuracy, perplexity, relies on non-zero probabilities. 

An interesting bit of jargon: the maximum likelihood estimate for some parameter of a model is the estimate which maximises the likelihood that that we’re in the training set. To paraphrase the Stanford slides (see references), if “bagel” appears 400 times in a one-million-word corpus, then the MLE is 400/1000000 = 0.0004, and this is the probability that makes it most likely that the text we’re looking at is statistically the same as the training set, i.e. that a randomly-chosen word will be “bagel” 0.04% of the time.

Continuous space models

N-gram models, as discussed above, are based on the completely-unrealistic but sometimes-useful assumption that the n most recent words are all we need to predict future words. Ideally we would set n to be as large as possible, but the sparsity of language means that this isn’t feasible — almost all grammatical 18-word sentences, for example, including that “liker liker liker” abomination above, have never been written before; there simply isn’t enough data to train an n-gram model for large n (and “large” appears to be closer to 4 than 18, in practise).

We saw above that the n-gram approach to solving the zero-probability problem was some form of smoothing. However, this is by definition unrealistic — we invent the probabilities of unseen n-grams based on the probabilities of component (n-1)-grams — or something even less plausible, such as adding a small constant amount.

Continuous-space models address this problem. The idea behind them is to put each word into a real-valued (“continuous”) high-dimensional vector space — in other words, to assign each word a set of real values y1, y2, y3, … which together represent the word. The aim is to achieve this goal:

Goal: Ensure that each dimension of this space, or at least some of them, represents some characteristic of the word.

If we manage to do this, then 

Consequence 1: the closer together (in a literal, Euclidean sense) two words are in this space, the more related they are, and
Consequence 2: you can use this property to estimate the likelihoods of n-grams which you haven’t seen before. 

The output of a model trained to place words into a space in this fashion is a set of word embeddings — a new name, since they aren’t probabilities any more. (The popular form of continuous space model at the moment uses a neural network to learn the embeddings. Non-neural-network-based methods exist, but I’m not going to write about them here.)

The distributional hypothesis in linguistics states that words used in similar contexts tend to be related somehow. This makes sense intuitively — imagine sentences like “I like to eat x”, “Let’s go to the x”, “It’s not x being green”, and so on. We can train a neural network to take advantage of the distributional hypothesis by training it to predict the next word given a context, i.e. P(W(n) | W(n-j), W(n-j+1) … W(n-1)). The idea is to minimise the distance between the target word and its context. By doing this ‘online’ (example-by-example, rather than in a batch) the vectors for words which tend to be close together end up closer together themselves. Thus the neural network takes advantage of the distributional hypothesis to generate a set of word embeddings — words are initially assigned random vectors and gradually “move into place” during training.

The first neural-network-based word embeddings did the above by maximising the log-likelihood of a given word and its context. This is computationally expensive, because to calculate log-likelihood you need to calculate the probability of all other words in the same context at every training step. 

The big innovation in the word2vec algorithm is to use a different training method, noise-contrastive training, which is more efficient computationally. Instead of maximising log-likelihood for a word given its context, word2vec trains a neural network to distinguish a word in a context from several “noise” words in the same context (where a ’noise’ word is a valid word which does not appear in the given context). This is computationally much cheaper (no calculation of other-word probabilities, and the answer is a simple logistic-regression “yes” or “no”. 

A neat word2vec trick is coming up with analogies — you can ask questions like “king” - “man” + “woman” = ? and a word2vec-trained language model will answer “queen”. This isn’t specific to word2vec, though — it’s a consequence of the application of the distributional hypothesis. Word2vec made it feasible to compute these embeddings on large data sets, however.

CBOW vs skip-gram

As mentioned above, bag of words is the name given to unigram language models. Continuous bag of words (CBOW) is the extension of that model into a continuous vector space.

The above description of neural continuous models is CBOW. We saw that we train a network to predict the next word given a context — but that context is presented as a vector generated by averaging all the words present in it. This means that CBOW training discards information — you could imagine that multiple contexts (ideally similar ones, but nonetheless) could average to the same, or to very similar, vectors. 

The alternative is skip-gram, which flips the training around: rather than attempting to predict word from context, predict each context word from a given target word. It turns out that this approach performs better in larger data sets, but the smoothing provided by CBOW actually helps in smaller data sets.



Oct 29

Some quick notes on garbage collection

Just cribbing some stuff from this HN thread for later reference — and possibly, dear reader, your interest!

  • Go recently gained a significantly faster garbage collector with “typical worst-case” performance of 100 usec.
  • It still "stops the world” to do so. In order to stop all threads quickly, it uses page poisoning: have threads read from a page at GC-safe points (such as on function entry); when GC is required the collector can just install a segfault handler, unmap the page, wait for all threads to segfault, do the collection and restart the threads. (This thread-stopping technique is unrelated to the overall GC speed-up, which is about efficiently avoiding stack re-scanning)
  • The page poisoning technique trades off fast GC and unnecessary memory accesses. For example, a loop which does not make function calls can apparently stall GC.
  • Apparently the changes increase overall time in GC — to 20% of total runtime for one poster.
Aug 14

Smartcards available

If you're interested in using a JavaCard-based smartcard with KeePassNFC but don't want to mess about with obtaining a smartcard and suitable reader, then I can sell you one of mine at cost price, which is £5.40 plus postage, pre-loaded with KeePassNFC and NDEF apps. Send me an email if you're interested.

These are NXP J3A040 contactless cards, running JavaCard 2.2.2 and GlobalPlatform 2.1.1. I'll install the KeePassNFC applet and an open source NDEF applet with appropriate configuration. NDEF is important because it means that Android can be configured to automatically start KeePassNFC when the card is presented.

NFC crypto applet source code available

Updates: 14/09/2016: Added applet binary.

The source code for the JavaCard applet and new KPNFC with applet support is now available on Github.


Binaries are available:

Building from source

To compile the applet yourself, you'll need to add the following to the ext/ directory:

  • java_card_kit-2_2_2
  • ant-javacard.jar

These are both available from Martin Paljak's excellent AppletPlayground repository.

To build:

$ cd applet
$ ant

You can then install the app using a Global Platform client. To do it using Martin Paljak's gp.jar:

$ java -jar gp.jar --install

Pre-installation requirement: public key

Before using the applet, you must instruct it to generate a private key. This is done off-device because it takes quite a long time.

$ cd JavaClient
$ java -jar JavaClient.jar generate_card_key
Aug 6

Better security for KeePassNFC

Update 2016-08-14: This applet is now available, and I will even send you a smartcard!

Update 2016-08-09: I've now got an experimental version of KeePassNFC working with the scheme below, and everything seems to work fine. If I can't break it in the next couple of days I'll start uploading source.

My NFC authentication app for KeePassDroid is fine as long as your phone is secure, but if your phone is stolen, or someone gets root access on it, then all they need to do is scan your NFC tag (remotely!) to get access to your KeePass database (or perhaps someone scanned for NFC tags first to determine the type of person who may have interesting stuff on their phone). This isn't a great failure mode.

The problem is that the key is easily read by anybody in the vicinity. It can't be used without privileged access to the phone, but it would be much better if it were not available at all. You could address this with a JavaCard applet running on a contactless smartcard. Applets can contain private data, so they can be written to never reveal the secret key. Additionally, JavaCard supports on-card generation of encryption keys, which is quite nice because it means that there is no time in which snooping on NFC communications will give an attacker any useful information.

The design is in two parts.

Establishing a password key with a JavaCard
Part 1: Storing a secret key

Plain text:  
Encrypted with public key (RSA):  
Encrypted with password key (AES):  
Encrypted with transaction key (AES):  

In the first part, the phone encrypts the sensitive information (e.g. the database password and keyfile) using the password key. It then transmits the password key to the smartcard, by first encrypting it with the card's public key (the smartcard having previously been instructed to generate a public/private key pair). The phone can then discard the password key.

Part 2: Decrypting secret data

At this point, the password key is only stored on the card. Decryption then necessarily involves the card. The idea here is to have the card decrypt the data using its secret key, but to re-encrypt it using a temporary key before sending it to the phone

The phone first generates a transaction key and transmits it to the card securely (by encrypting it with the card's public key). It then transmits the encrypted sensitive information to the card and requests that it be decrypted. The card first decrypts the information, using its previously-stored password key, and then re-encrypts it using the transaction key before transmitting it back to the phone. When the phone has received the data, it decrypts it and throws the transaction key away.

It should also be possible to do something similarly secure (in terms of NFC snoop resistance) using a PGP applet running on the card. Another alternative, would be to store the sensitive information on the card directly, but that creates its own set of problems around attestation and traffic sniffing.

Thanks to Bill Broadley for the email discussion which led to me implementing this.

Dec 6

JavaScript-based MNIST demo

This is an online JavaScript-based digit recogniser using the MNIST data set. A demo is available at — you can draw a digit and see what the network thinks. It’s in the public domain, and source code is available.

It’s based on a two-layer neural network which was trained on a quad-core Intel i5. Training took between five and ten minutes (I didn’t make a note). Error is 3.8%, which is around what one should expect from a two-layer neural network.

The lesson I learnt from this is that preparation of the data is a significant part of the whole process. MNIST has some degree of antialiasing — not too much, but not too little either; letters are centred within the box, they occupy the full height, and so on. Get any of these things wrong and the neural network won’t be at all effective. It now does a reasonable, but not stellar, job, and I suspect that it could do better with more preprocessing, even without changing the neural network architecture.

Nov 3

Turn Twitter hearts into skulls

Here is a small Userscript to turn Twitter’s new hearts into skulls, like this:

Download here:

Twitter Likes Skulls (0.3)

To install it you’ll need Greasemonkey (Firefox) or Tampermonkey (Chrome).

Let me know if it doesn’t work and don’t forget to skull people!

Sep 29

Using LXC with Debian

Using Debian unstable, currently, I found setting up linux containers not to be quite as pain-free as promised. Here are some of the more unusual aspects:

Linux cgroups are the abstraction which enable custom containers by providing resource isolation. On Debian I did not, by default, have permission to create my own groups.

Apparently this is changing, so you may not need to worry about this portion.

# echo 1 >/proc/sys/kernel/unprivileged_userns_clone
$ sudo cgm create all me
$ sudo cgm chown all me $(id -u) $(id -g)
$ sudo cgm movepid all me $$

Network configuration
This, along with other lxc parameters, I configured by modifying the configuration file directly:
$ vim .local/share/lxc/your container name here/config

These lines create a NATed network for the container: = veth = up = lxc-bridge-nat = =

I then wanted to enable ssh access remotely (from externally accessible port 2000), which is the standard Linux business of:

$ sudo iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 2000 -j DNAT --to
$ sudo iptables -A FORWARD -m state -p tcp -d --dport 2000 --state NEW,ESTABLISHED,RELATED -j ACCEPT