# Some quick notes on language models

The motivation behind language models is to get some idea of the likelihood of a particular sequence of words, or (same thing) the likelihood of a particular word given a context of previous words. There are many applications of language models — spelling correction and speech recognition come to mind for simple ones, and combined with the concept of word embeddings they can be used to draw analogies, identify synonyms, and more.

An **n-gram** is a continuous set of items from some text. Unigrams from the previous sentence would be “An”, “n-gram”, “is”, etc. Bigrams would be (“An”, “n-gram”), (“n-gram”, “is”), (“is”, “a”), etc.

**Language models** compute the probability of a sequence of words, or, given a sequence of words, compute the probability of an upcoming word. They are effectively probabilistic grammars. An n-gram model uses n-grams to compute this probability.

N-gram models make use of the Markov assumption, which is that the Markov property holds. The Markov property of some process is conditional probability of future states depends only on the current state, and not on past states. This means that in a bigram language model we assume that the conditional probability of future words depends only on the previous two words. This is of course totally wrong, but it’s a simplifying assumption which is sufficient for some things.

A unigram model, also known as a **bag of words** model, uses the frequency of individual words to make its predictions. For each word, the probability is simply the number of times it occurs in the training corpus divided by the total number of words in the corpus. For example, in the statement

**“If you like to like those who like people who like you, you are a you-liker liker liker”**

… P(like) is 4/18 ~= 0.22.

By extension an n-gram model uses the conditional probability of word n given word n-1, … 1. In the bigram model, we compute P(you | if), P(like | you), P(to | like), P(like | to), etc.

An n-gram language model will perform best at modelling its training set, but the sparsity of language means that it won’t work as well on other corpuses because plenty of probabilities will be exactly zero — for n words which didn’t appear at all in the training set. The general approach to fixing this is to assign small but non-zero probabilities to all zero-labelled n-grams in the set, by “stealing” from the probabilities of other members of the set. You can also, sometimes, just do really stupid things like pretend that every n-gram appears at least once. Why bother? It can make the model more realistic. There is also the practical consideration that a common measurement of model accuracy, *perplexity*, relies on non-zero probabilities.

An interesting bit of jargon: the **maximum likelihood estimate** for some parameter of a model is the estimate which maximises the likelihood that that we’re in the training set. To paraphrase the Stanford slides (see references), if “bagel” appears 400 times in a one-million-word corpus, then the MLE is 400/1000000 = 0.0004, and this is the probability that makes it most likely that the text we’re looking at is statistically the same as the training set, i.e. that a randomly-chosen word will be “bagel” 0.04% of the time.

**Continuous space models**

N-gram models, as discussed above, are based on the completely-unrealistic but sometimes-useful assumption that the n most recent words are all we need to predict future words. Ideally we would set n to be as large as possible, but the sparsity of language means that this isn’t feasible — almost all grammatical 18-word sentences, for example, including that “liker liker liker” abomination above, have never been written before; there simply isn’t enough data to train an n-gram model for large n (and “large” appears to be closer to 4 than 18, in practise).

We saw above that the n-gram approach to solving the zero-probability problem was some form of smoothing. However, this is by definition unrealistic — we invent the probabilities of unseen n-grams based on the probabilities of component (n-1)-grams — or something even less plausible, such as adding a small constant amount.

Continuous-space models address this problem. The idea behind them is to put each word into a real-valued (“continuous”) high-dimensional vector space — in other words, to assign each word a set of real values y1, y2, y3, … which together represent the word. The aim is to achieve this goal:

Goal: Ensure that each dimension of this space, or at least some of them, represents some characteristic of the word.

If we manage to do this, then

Consequence 1: the closer together (in a literal, Euclidean sense) two words are in this space, the more related they are, and

Consequence 2: you can use this property to estimate the likelihoods of n-grams which you haven’t seen before.

The output of a model trained to place words into a space in this fashion is a set of *word embeddings* — a new name, since they aren’t probabilities any more. (The popular form of continuous space model at the moment uses a neural network to learn the embeddings. Non-neural-network-based methods exist, but I’m not going to write about them here.)

The **distributional hypothesis in linguistics** states that words used in similar contexts tend to be related somehow. This makes sense intuitively — imagine sentences like “I like to eat *x*”, “Let’s go to the *x*”, “It’s not *x *being green”, and so on. We can train a neural network to take advantage of the distributional hypothesis by training it to predict the next word given a context, i.e. P(W(n) | W(n-j), W(n-j+1) … W(n-1)). The idea is to minimise the distance between the target word and its context. By doing this ‘online’ (example-by-example, rather than in a batch) the vectors for words which tend to be close together end up closer together themselves. Thus the neural network takes advantage of the distributional hypothesis to generate a set of word embeddings — words are initially assigned random vectors and gradually “move into place” during training.

The first neural-network-based word embeddings did the above by maximising the log-likelihood of a given word and its context. This is computationally expensive, because to calculate log-likelihood you need to calculate the probability of all other words in the same context at every training step.

The big innovation in the **word2vec** algorithm is to use a different training method, *noise-contrastive training*, which is more efficient computationally. Instead of maximising log-likelihood for a word given its context, word2vec trains a neural network to distinguish a word in a context from several “noise” words in the same context (where a ’noise’ word is a valid word which does not appear in the given context). This is computationally much cheaper (no calculation of other-word probabilities, and the answer is a simple logistic-regression “yes” or “no”.

A neat word2vec trick is coming up with analogies — you can ask questions like “king” - “man” + “woman” = ? and a word2vec-trained language model will answer “queen”. This isn’t specific to word2vec, though — it’s a consequence of the application of the distributional hypothesis. Word2vec made it feasible to compute these embeddings on large data sets, however.

**CBOW vs skip-gram**

As mentioned above, bag of words is the name given to unigram language models. Continuous bag of words (CBOW) is the extension of that model into a continuous vector space.

The above description of neural continuous models is CBOW. We saw that we train a network to predict the next word given a context — but that context is presented as a vector generated by averaging all the words present in it. This means that CBOW training discards information — you could imagine that multiple contexts (ideally similar ones, but nonetheless) could average to the same, or to very similar, vectors.

The alternative is **skip-gram**, which flips the training around: rather than attempting to predict word from context, predict each context word from a given target word. It turns out that this approach performs better in larger data sets, but the smoothing provided by CBOW actually helps in smaller data sets.

**References:**