Sun
Aug 14

Smartcards available

If you're interested in using a JavaCard-based smartcard with KeePassNFC but don't want to mess about with obtaining a smartcard and suitable reader, then I can sell you one of mine at cost price, which is £5.40 plus postage, pre-loaded with KeePassNFC and NDEF apps. Send me an email if you're interested.

These are NXP J3A040 contactless cards, running JavaCard 2.2.2 and GlobalPlatform 2.1.1. I'll install the KeePassNFC applet and an open source NDEF applet with appropriate configuration. NDEF is important because it means that Android can be configured to automatically start KeePassNFC when the card is presented.

NFC crypto applet source code available

Updates: 14/09/2016: Added applet binary.

The source code for the JavaCard applet and new KPNFC with applet support is now available on GitHub.

Binaries

Binaries are available:

Building from source

To compile the applet yourself, you'll need to add the following to the ext/ directory:

  • java_card_kit-2_2_2
  • ant-javacard.jar

These are both available from Martin Paljak's excellent AppletPlayground repository.

To build:

$ cd applet
$ ant

You can then install the app using a Global Platform client. To do it using Martin Paljak's gp.jar:

$ java -jar gp.jar --install kpnfc.app

Pre-installation requirement: public key

Before using the applet, you must instruct it to generate a private key. This is done off-device because it takes quite a long time.

$ cd JavaClient
$ java -jar JavaClient.jar generate_card_key
Sat
Aug 6

Better security for KeePassNFC

Update 2016-08-14: This applet is now available, and I will even send you a smartcard!

Update 2016-08-09: I've now got an experimental version of KeePassNFC working with the scheme below, and everything seems to work fine. If I can't break it in the next couple of days I'll start uploading source.

My NFC authentication app for KeePassDroid is fine as long as your phone is secure, but if your phone is stolen, or someone gets root access on it, then all they need to do is scan your NFC tag (remotely!) to get access to your KeePass database (or perhaps someone scanned for NFC tags first to determine the type of person who may have interesting stuff on their phone). This isn't a great failure mode.

The problem is that the key is easily read by anybody in the vicinity. It can't be used without privileged access to the phone, but it would be much better if it were not available at all. You could address this with a JavaCard applet running on a contactless smartcard. Applets can contain private data, so they can be written to never reveal the secret key. Additionally, JavaCard supports on-card generation of encryption keys, which is quite nice because it means that there is no point at which snooping on NFC communications will give an attacker any useful information.

The design is in two parts.

Establishing a password key with a JavaCard
Part 1: Storing a secret key

Figure legend: plain text; encrypted with public key (RSA); encrypted with password key (AES); encrypted with transaction key (AES).

In the first part, the phone encrypts the sensitive information (e.g. the database password and keyfile) using the password key. It then transmits the password key to the smartcard, by first encrypting it with the card's public key (the smartcard having previously been instructed to generate a public/private key pair). The phone can then discard the password key.

Part 2: Decrypting secret data

At this point, the password key is only stored on the card. Decryption then necessarily involves the card. The idea here is to have the card decrypt the data using the password key it holds, but to re-encrypt it using a temporary key before sending it to the phone.

The phone first generates a transaction key and transmits it to the card securely (by encrypting it with the card's public key). It then transmits the encrypted sensitive information to the card and requests that it be decrypted. The card first decrypts the information, using its previously-stored password key, and then re-encrypts it using the transaction key before transmitting it back to the phone. When the phone has received the data, it decrypts it and throws the transaction key away.
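The two parts can be walked through in a few lines of Python. This is only a sketch of the message flow, not the applet's code: the Card class and its method names are hypothetical, textbook RSA on two small fixed primes stands in for the card's real key pair, and a SHA-256 keystream XOR stands in for AES. Neither stand-in is secure; they are there so the example runs with the standard library alone.

```python
import hashlib
import secrets

# Toy stand-ins only: textbook RSA on small fixed primes and a SHA-256
# keystream XOR in place of AES. Not secure, purely illustrative.
P, Q, E = 2**61 - 1, 2**89 - 1, 65537  # two Mersenne primes


def stream_xor(key: bytes, data: bytes) -> bytes:
    """Symmetric stand-in for AES: XOR data with a hash-derived keystream."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class Card:
    """Hypothetical model of the applet: the private key never leaves it."""

    def __init__(self):
        self.n = P * Q  # public modulus
        self._d = pow(E, -1, (P - 1) * (Q - 1))  # private exponent
        self._password_key = None
        self._transaction_key = None

    def _unwrap(self, wrapped: int) -> bytes:
        return pow(wrapped, self._d, self.n).to_bytes(16, "big")

    def store_password_key(self, wrapped: int) -> None:
        self._password_key = self._unwrap(wrapped)

    def set_transaction_key(self, wrapped: int) -> None:
        self._transaction_key = self._unwrap(wrapped)

    def decrypt(self, blob: bytes) -> bytes:
        # Decrypt with the stored password key, re-encrypt for the trip back.
        plain = stream_xor(self._password_key, blob)
        return stream_xor(self._transaction_key, plain)


def wrap(key: bytes, n: int) -> int:
    """Phone side: encrypt a 16-byte key with the card's public key."""
    return pow(int.from_bytes(key, "big"), E, n)


card = Card()

# Part 1: encrypt the secret, hand the password key to the card, forget it.
password_key = secrets.token_bytes(16)
stored_blob = stream_xor(password_key, b"database password")
card.store_password_key(wrap(password_key, card.n))
del password_key

# Part 2: a fresh transaction key protects the reply on its way back.
transaction_key = secrets.token_bytes(16)
card.set_transaction_key(wrap(transaction_key, card.n))
reply = card.decrypt(stored_blob)
recovered = stream_xor(transaction_key, reply)
```

Everything that crosses the (simulated) NFC link is either wrapped with the public key or encrypted with a symmetric key, so a snooper sees wrapped keys, stored_blob and reply but never the plaintext.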

It should also be possible to do something similarly secure (in terms of NFC snoop resistance) using a PGP applet running on the card. Another alternative would be to store the sensitive information on the card directly, but that creates its own set of problems around attestation and traffic sniffing.

Thanks to Bill Broadley for the email discussion which led to me implementing this.

Sun
Dec 6

JavaScript-based MNIST demo


This is an online JavaScript-based digit recogniser using the MNIST data set. A demo is available at http://nyloncactus.com/mnistjs/mnistjs.html — you can draw a digit and see what the network thinks. It’s in the public domain, and source code is available.

It’s based on a two-layer neural network which was trained on a quad-core Intel i5. Training took between five and ten minutes (I didn’t make a note). Error is 3.8%, which is around what one should expect from a two-layer neural network.

The lesson I learnt from this is that preparation of the data is a significant part of the whole process. MNIST has some degree of antialiasing (not too much, but not too little either); digits are centred within the box, they occupy the full height, and so on. Get any of these things wrong and the neural network won’t be at all effective. It now does a reasonable, but not stellar, job, and I suspect that it could do better with more preprocessing, even without changing the neural network architecture.
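As an illustration of the kind of preprocessing involved, here is a minimal Python sketch of one step, centring a drawn digit in its box. It is not the demo's JavaScript: the function name and the list-of-lists image format are assumptions made for the example, and it skips the scaling and antialiasing a real pipeline would also need.

```python
def centre_in_box(image, size):
    """Crop the inked bounding box of `image` and centre it on a blank
    size x size grid. `image` is a list of rows; 0 means no ink."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    out = [[0] * size for _ in range(size)]
    if not rows:
        return out  # nothing drawn
    top, left = rows[0], cols[0]
    height = rows[-1] - top + 1
    width = cols[-1] - left + 1
    if height > size or width > size:
        raise ValueError("drawing is larger than the target box")
    off_r = (size - height) // 2
    off_c = (size - width) // 2
    for r in range(height):
        for c in range(width):
            out[off_r + r][off_c + c] = image[top + r][left + c]
    return out


# A 2x2 blob drawn in the top-left corner of a 5x5 canvas.
drawing = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
centred = centre_in_box(drawing, 6)
```

After the call the blob sits in rows 2 to 3 and columns 2 to 3 of the 6x6 grid, wherever it was originally drawn.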

Tue
Nov 3

Turn Twitter hearts into skulls

Here is a small Userscript to turn Twitter’s new hearts into skulls, like this:


Download here:

Twitter Likes Skulls (0.3)

To install it you’ll need Greasemonkey (Firefox) or Tampermonkey (Chrome).

Let me know if it doesn’t work and don’t forget to skull people!

Tue
Sep 29

Using LXC with Debian

On the current Debian unstable, I found that setting up Linux containers was not quite as pain-free as promised. Here are some of the more unusual aspects:

CGManager
Linux cgroups are the abstraction which enables custom containers by providing resource isolation. On Debian I did not, by default, have permission to create my own groups.

Apparently this is changing, so you may not need to worry about this portion.

# echo 1 >/proc/sys/kernel/unprivileged_userns_clone
$ sudo cgm create all me
$ sudo cgm chown all me $(id -u) $(id -g)
$ sudo cgm movepid all me $$

Network configuration
This, along with other lxc parameters, I configured by modifying the configuration file directly:
$ vim ~/.local/share/lxc/<your container name here>/config

These lines create a NATed network for the container:
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = lxc-bridge-nat
lxc.network.ipv4 = 10.10.1.10/24
lxc.network.ipv4.gateway = 10.10.1.1

I then wanted to enable ssh access remotely (from externally accessible port 2000), which is the standard Linux business of:

$ sudo iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 2000 -j DNAT --to 10.10.1.10:2000
$ sudo iptables -A FORWARD -m state -p tcp -d 10.10.1.10 --dport 2000 --state NEW,ESTABLISHED,RELATED -j ACCEPT

Thu
May 7

The best part about abstraction is {the best part about abstraction - 1, the best part about abstraction - 2}

Oh, programming is so easy! Just look at the source!

$ less /usr/include/errno.h

#include <bits/errno.h>

$ less /usr/include/bits/errno.h

#include <linux/errno.h>

$ less /usr/include/linux/errno.h

#include <asm/errno.h>

$ less /usr/include/asm/errno.h

#include <asm-generic/errno.h>

$ less /usr/include/asm-generic/errno.h

#include <asm-generic/errno-base.h>

$ less /usr/include/asm-generic/errno-base.h

#define EPERM     1   /* Operation not permitted */
[…]

It’s a wonder that gcc ever terminates.

Sun
May 3

CI20: Interrupt handling

This post is part of the CI20 bare-metal project — a project to write operating-system-level code on the CI20 MIPS-based demo board.

In the previous instalment we ran a memory tester and verified that DDR was initialised. Let’s start adding real OS features, starting with interrupts. We’ll add a generic interrupt mechanism and then apply it to the timer interrupt.

I don’t have any relevant pictures to go with this post, so here are some nesting swans I saw recently in Exeter:


Silly swans, building their nest so close to the path that the council had to give them a bit of privacy. They didn’t seem to care, though. Well, on to interrupt support on the CI20.

Generic interrupt support

The CI20 has a multi-level approach for handling interrupts.
  1. Firstly, the global interrupt enable flag in the CP0 STATUS register must be switched on.
  2. Then, each individual interrupt must be unmasked in the same register.
  3. Then, the interrupt controller hardware must unmask the interrupt for a particular device.
  4. Finally, the device itself must be configured to generate interrupts.
After you do all of this, the CPU will jump to a special location when an interrupt occurs, after storing the program counter and setting some status bits (such as those in the CAUSE register). All the rest is up to software.

The special location is well-defined, but is well-defined to be in any of four places, depending on CPU flags:
  • If BEV is set in the CP0 STATUS register, the address is in uncached memory.
  • If IV is set in the CP0 CAUSE register, the address for interrupts is distinct from the address for other types of exceptions, otherwise it’s the same.
We don’t want to use uncached memory, and, in fact, on the CI20, we can’t, as the address is 0xBFC00380, in memory-mapped device territory, so we’ll leave BEV unset. However, we *do* want separate addresses for interrupts and other exceptions, because this means that there’s a little less work to do when an interrupt or exception arrives, so we’ll set IV. 
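The four possible locations can be written down as a small lookup. The sketch below is in Python purely for illustration; the function name is made up, and the addresses are the standard MIPS32 defaults, assuming the exception base has not been moved from its reset value.

```python
def interrupt_vector(bev: bool, iv: bool) -> int:
    """Address the CPU jumps to on an interrupt, given the BEV bit
    (CP0 STATUS) and the IV bit (CP0 CAUSE)."""
    # BEV selects the uncached boot-time base; otherwise the cached base.
    base = 0xBFC00200 if bev else 0x80000000
    # IV gives interrupts their own offset; otherwise they share the
    # general exception vector.
    offset = 0x200 if iv else 0x180
    return base + offset


for bev in (False, True):
    for iv in (False, True):
        print(f"BEV={int(bev)} IV={int(iv)} -> 0x{interrupt_vector(bev, iv):08X}")
```

With BEV unset and IV set, the handler lives at offset 0x200 from the base, which is why the assembly stub in start.S is placed with .org 0x200.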

Finally, we need to write an interrupt handler. The interrupt handler will consist of two parts: an assembly-language part which does the minimum necessary to safely jump into C code, and a C part which determines which interrupt has occurred and deals with it appropriately.

Here’s the assembly-language portion of the interrupt handler included in this post’s start.S:

.org 0x200
_irq_asm:
    sw at, -4(sp)
    sw v0, -8(sp)
    sw v1, -12(sp)

    [ snip: many more registers saved ]

    sw fp, -108(sp)
    sw ra, -112(sp)

    addi sp, sp, -112

    jal libci20_interrupt
    nop

    addi sp, sp, 112
    lw at, -4(sp)
    lw v0, -8(sp)
    lw v1, -12(sp)

    [ snip: many registers re-loaded ]

    lw fp, -108(sp)
    lw ra, -112(sp)

    eret

Fairly straightforward, then: save all registers, run the C portion of the interrupt handler, restore all registers, and return from interrupt. This is fine for now, but if this were to be used in a real system it would certainly want to switch to a dedicated interrupt stack — or, at the very least, make sure it was on a kernel stack.

The C portion is similarly straightforward. The CI20 has two interrupt pending registers, which are bitfields, one bit per device. A bit is set if an interrupt is pending for that device. The C routine allows device drivers to register a handler routine for their interrupt — if a handler is registered when an interrupt for that device arrives, it will be called.

Finally, the job of the handler is to inform the device that the interrupt has been handled. 
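The dispatch logic is easier to see in a few lines than in prose. This is a Python model of the idea rather than the C code in libci20: the function names, the 64-entry table and the bit number used in the example are all assumptions made for illustration.

```python
NUM_SOURCES = 64  # two 32-bit pending registers, one bit per device
handlers = [None] * NUM_SOURCES


def register_handler(source, func):
    """Remember which function to call when `source` raises an interrupt."""
    handlers[source] = func


def dispatch(pending0, pending1):
    """Call the handler for every pending bit that has one registered.
    Returns the list of sources that were served."""
    pending = pending0 | (pending1 << 32)
    served = []
    for source in range(NUM_SOURCES):
        if pending & (1 << source) and handlers[source] is not None:
            handlers[source]()
            served.append(source)
    return served


# Hypothetical example: a timer on bit 27, plus an unhandled device on bit 3.
ticks = []
register_handler(27, lambda: ticks.append("tick"))
served = dispatch((1 << 27) | (1 << 3), 0)
```

Bit 3 is pending but has no handler, so it is skipped; a real kernel would probably want to log or mask such an interrupt rather than ignore it.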

The timer interrupt

The “OS timer” device, used by timer.c, can be set up to generate an interrupt whenever the timer reaches a 32-bit comparison value. We previously initialised the timer to tick 3 million times a second, so let’s get it to generate an interrupt every millisecond by setting our comparison value equal to 3 million / 1000 = 3000. The timer then registers its interrupt handler for TCU0, which is the timer unit:

intc_register_handler_tcu0(ostimer_interrupt);

When a timer interrupt occurs, ostimer_interrupt is called. The only thing it absolutely has to do is to tell the TCU that the interrupt has been handled:

poke32(TFCR, TFR_OSTFLAG);

… but if that’s all it did then we wouldn’t even know it was working. So in addition to silencing the interrupt, we add support for timer callbacks, functions which are invoked by the timer interrupt handler:

void ostimer_interrupt(void)
{
    for(int i = 0; i < timer_callback_count; i++)
        timer_callbacks[i]();

    /* Clear interrupt flag. If we don't do this we will immediately return to
     * this interrupt on exit! */
    poke32(TFCR, TFR_OSTFLAG);
}

At this point, finally, we can register a callback handler in our main() function and increment a 1ms counter.

Running the code

Check out the code as usual, this time using the interrupts tag:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/interrupts

Make sure you have also checked out pyelftools if you haven’t already:

$ cd thirdparty/

Now build and run. I now use a single command for this:

$ make && python3 usbloader.py build/stage1.elf && sleep 1 && python3 usbloader.py build/kernel.elf

Benchmarking 

If all goes well, you should see a short benchmark run three times, printing something like this:

00000C3C
00000C3C
00000C3C

This is the number of milliseconds taken to run a simple delay loop in main(). It doesn’t mean very much by itself, but I was curious to see how what we’ve got so far compared with Linux. So I wrote a short Linux benchmark which did the same thing (download it here), booted my CI20 into Linux, ran the benchmark, and got these results:

00000C51
00000C48
00000C49

In other words, Linux has more variance and is slightly slower than our OS. This is exactly as we’d expect: Linux is running other things behind the scenes, which will both cause the increased variance and slow the benchmark down. The results are within 5% of each other, however, which is encouraging — we did all the right things so far, or, at least, we did them as right as Linux does.
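To put numbers on that comparison, the hex readings can be converted to decimal milliseconds. A quick Python sketch, using only the six values quoted above (the variable names are mine):

```python
ours = [0x00000C3C, 0x00000C3C, 0x00000C3C]
linux = [0x00000C51, 0x00000C48, 0x00000C49]


def mean(values):
    return sum(values) / len(values)


# How much slower Linux is, as a percentage of our OS's time.
slowdown = (mean(linux) - mean(ours)) / mean(ours) * 100
# Spread between the fastest and slowest run on each system.
spread_ours = max(ours) - min(ours)
spread_linux = max(linux) - min(linux)

print(f"ours:  {mean(ours):.1f} ms, spread {spread_ours} ms")
print(f"linux: {mean(linux):.1f} ms, spread {spread_linux} ms")
print(f"linux is {slowdown:.2f}% slower")
```

Our OS takes 3132 ms on every run, while Linux averages a little over 3147 ms with a 9 ms spread, so the gap is about half a percent.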

Other changes

This release includes quite a few changes:
  • “os” became “kernel” — which makes much more sense.
  • The kernel-mode stuff is mostly contained in a library, libci20, which is used by both stage1 and kernel. But use of the library started to diverge in this section, and will certainly diverge further. For example, both stage1 and kernel require a timer, but they use it differently: stage1 uses it for busy-waiting, while kernel uses it to generate periodic interrupts. Also, stage1 needs to be less than 14K, so there’s no room for fancy extra features. There is no perfect solution to this when you’re working in C. My solution is to link two different libraries, libci20 and libci20_mini. The mini version won’t add any more files, but has its own simple implementations of some things (like the timer). The Makefile changed to reflect this.
  • The kernel’s assembly-language startup file, start.S, now zeroes out BSS. It didn’t do it before because previously we didn’t have a BSS section. (BSS (https://en.wikipedia.org/wiki/.bss) is where all uninitialised file-scope variables get placed — like the array defined in libci20/interrupts.c.) The kernel’s linker.lds file changed to accommodate the new sections, and also to align the data blocks to the length of a cache line. Note that the BSS section (and its architecture-specific friend, .sbss) is marked as “NOLOAD” — which means it takes up no space in the file at all.
  • The USB loader changed again, this time to pad uploaded data to a multiple of 2k when writing to TCSM. Experiments with crossing 2k block boundaries failed unless the data were padded. I have no idea why this peculiarly hardware-specific quirk works, or even if it’s doing the right thing, but it does seem to work.
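The padding mentioned in the last item amounts to rounding the payload up to the next 2k boundary. A minimal Python sketch of that step follows; the helper name and the zero fill are assumptions, and this is not copied from usbloader.py.

```python
BLOCK = 2048  # pad to a multiple of 2k


def pad_to_block(data: bytes, block: int = BLOCK) -> bytes:
    """Append zero bytes so that len(result) is a multiple of `block`."""
    remainder = len(data) % block
    if remainder == 0:
        return data  # already aligned, nothing to add
    return data + bytes(block - remainder)


payload = b"\x01" * 3000
padded = pad_to_block(payload)
print(len(payload), "->", len(padded))  # 3000 -> 4096
```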
The end, or just the beginning?*

We’ve now got all the resources we need to start writing an operating system. Next time we’ll begin on that, starting with the scheduler.

* Probably not the beginning.

Sun
Apr 19

CI20: The DDR odyssey, part 4: memtester

I can’t think of a worse type of bug than one related to faulty RAM. Actually, that’s not true — probably concurrency bugs are worse. Oh well, so much for the strong opening. In any case, we’ve spent the last 3 posts initialising the RAM, so let’s now run memtester and make sure it works.

The CI20 bare-metal project: all posts

Memtester is a popular open-source program for testing RAM. At its core it is quite simple: it runs a suite of tests on the RAM by writing specially-crafted data to it, designed to expose any issues with the RAM, then reading it back and verifying that it survived the trip.
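The write, read back and compare idea fits in a few lines. The Python sketch below is a deliberate simplification and not one of Memtester's actual tests: the function name, the four patterns and the simulated fault are all made up for illustration.

```python
def pattern_test(memory, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill `memory` with each pattern, read it back, and return a list of
    (offset, expected, actual) tuples for every byte that did not survive."""
    failures = []
    for pattern in patterns:
        for i in range(len(memory)):
            memory[i] = pattern
        for i in range(len(memory)):
            if memory[i] != pattern:
                failures.append((i, pattern, memory[i]))
    return failures


class StuckBitRam(bytearray):
    """Simulated fault: bit 0 of the byte at offset 5 is stuck at zero."""

    def __setitem__(self, index, value):
        if index == 5:
            value &= 0xFE
        super().__setitem__(index, value)


good = pattern_test(bytearray(16))
bad = pattern_test(StuckBitRam(16))
```

The healthy buffer produces no failures. The faulty one fails only for the patterns that try to set bit 0 (0xFF and 0x55), which is why real testers cycle through many different patterns.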

I ported Memtester to run on bare-metal CI20 by removing most of it: all the command-line parsing and POSIX-specific functionality, apart from random number generation. Instead the tester runs directly in cached kernel memory (0x80000000) and tests a fixed size (200MB). We’re only testing a fixed amount of memory because kseg0 isn’t very large (256MB, when you exclude memory-mapped devices). Anyway, it doesn’t really matter if we don’t touch every byte of memory, since the point isn’t to discover bad RAM but to discover whether the DDR controller and DDR PHY are configured properly — problems which should be obvious even after testing only a very small amount of memory.

Running the test 
Get this version by checking out the OS as normal:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/memtester

Now check out two “ports”:

$ cd ports
$ git clone https://github.com/nfd/ci20-os-port-posix posix
$ cd ..

Now build the system:

$ make
$ cd ports/memtester-4.3.0
$ make
$ cd ../../

Now run everything. This happens in two stages. First we use stage1 to initialise the memory:

$ python3 usbloader.py build/stage1.elf

Look at the serial console output, and when stage1 indicates that the memory test is complete, load memtester.

$ python3 usbloader.py ports/memtester-4.3.0/build/memtester.elf

You should see memtester load and start testing RAM. The test suite repeats until memtester finds a problem.

How it works
The hard part of all of this was the build system, which needed quite a bit of expansion to support “ports”. Ports are libraries or binaries for third-party applications or their support. The idea is that you drop the repository into ports/ and use a custom Makefile to build them as part of the rest of the system. The memtester Makefile looks like this:

PORT_SRC=memtester.c tests.c
PORT_TYPE=elf
PORT_TARGET=memtester.elf
PORT_INCLUDES=ports/posix/include
PORT_LIBS=build/libci20.a ports/posix/build/libposix.a

include ../baremetal.mk

In other words, Memtester is a program (an ELF file) defined by two C source files, depending on the POSIX port and libci20. Pretty straightforward so far, but the hard work is performed by ports/baremetal.mk, which will build any dependencies and then the port itself. The implementation of baremetal.mk is a bit gruesome, using quite a few Make “features”. I’m perversely proud of it, which are the two feelings I always get when I accomplish something nontrivial in Make.

Next steps
An interesting thing to do now is to edit stage1/ddr.py, changing follow_reference_code=True to follow_reference_code=False. This activates a whole lot of changes related to RAM initialisation, but doesn’t seem to affect system stability: memtester runs just fine, which indicates that DDR might be more robust to timing variations than it looks. An interesting next step might be to measure memory speed in addition to memory reliability, but let’s move on from RAM for a little while: next time we’ll get back to the OS development proper, and look at handling interrupts.

Thu
Apr 16

CI20: The DDR odyssey, part 3: memory remapping

This post is part of the CI20 bare-metal project (link leads to index of all posts), and is part three of four posts about initialising the DDR on the CI20 Creator. 

In the last post we got the RAM working enough to boot something into kseg0, but we missed one curious part of RAM initialisation: memory remapping.

The jz4780 DDR controller has several DREMAP registers, described a little opaquely as “DREMAP1~5 are used to define address mapping in DDRC.” Fine, but what is “address mapping in the DDRC”?

My understanding, which is basically guesswork, goes like this: We know that DDR memory addresses are specified in four dimensions: bank, row, column, and byte number within the word*. But of course physical memory addresses are a single number. So one thing you could do is just assign the various dimensions to bits within a 32-bit address space, like this:


Then physical address 0x00000000 maps to bank 0, row 0, column 0, byte 0; and physical address, say, 0x1CE1CEBB maps to bank 3, row 19996, column 942, and byte 3. (The two unused bits at the top are because we have a 4 gigabyte address space, but only 1 gigabyte of RAM to play with.)
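The example address can be checked by pulling the fields apart in Python. The field widths below (2 byte bits, 10 column bits, 15 row bits, 3 bank bits, 2 unused) are inferred from the worked example rather than taken from the datasheet, so treat them as an assumption.

```python
# Field widths, lowest bits first (inferred from the example above).
BYTE_BITS, COLUMN_BITS, ROW_BITS, BANK_BITS = 2, 10, 15, 3


def decode(address):
    """Split a physical address into (bank, row, column, byte)."""
    byte = address & ((1 << BYTE_BITS) - 1)
    address >>= BYTE_BITS
    column = address & ((1 << COLUMN_BITS) - 1)
    address >>= COLUMN_BITS
    row = address & ((1 << ROW_BITS) - 1)
    address >>= ROW_BITS
    bank = address & ((1 << BANK_BITS) - 1)
    return bank, row, column, byte


print(decode(0x00000000))  # (0, 0, 0, 0)
print(decode(0x1CE1CEBB))  # (3, 19996, 942, 3)
```

The four widths add up to 30 bits, which matches 1 gigabyte of RAM and leaves the top two bits of the 32-bit address unused.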

The problem with this particular mapping is that we don’t expect the bank to change very frequently. Programs tend to follow the principle of locality, which means that their next memory reference is likely to be very near their last memory reference. But DDR RAM works by “precharging” a bank + row combination, after which columns can be read. This precharging takes time, and only one row can be precharged per bank. If all the addresses we need in the near future reside in the same bank but in different rows, we have no choice but to wait for the bank precharge multiple times in sequence, once for each row. If, however, the upcoming addresses span multiple banks, we could precharge all the banks we needed at once, saving some time.

In other words, we might prefer this arrangement, in which bank number should change more frequently than row number:


This is the purpose of the DREMAP registers — they let us swap bits around, effectively allowing us to change the positioning of our 4d-to-1d mapping.

And this is what the reference code (and, now, ddr.py) does: switch bank and row addresses to make it more likely that we’ll be able to precharge multiple banks simultaneously. This is actually a power / performance trade-off: we end up using more power (for bank precharging) but don’t spend so much time waiting.

Code is available under the ddr_remap tag:

$ git clone https://github.com/nfd/ci20-os
$ cd ci20-os
$ git checkout tags/ddr_remap

If you build and run it, you should find that nothing has changed, and everything still works. Which is a comfort.

In the next post, we’ll finish off the RAM stuff by running a proper memory test.

* It’s actually five, as we also have rank, but there’s only one of those on the CI20 (ranks make more sense for removable memory, where you can define a rank as “whatever goes in a memory slot”).