Language modelling with recurrent networks
Before neural networks there were n-grams
Modelled frequency of word sequences from a corpus
Struggled with rare or unseen word sequences (the sparsity problem)
Mitigated somewhat by smoothing and backoff (see the sketch below)
e.g. smoothing: add a small value to every count so no n-gram has a zero numerator
e.g. backoff: if a 4-gram has never been seen, back off to the 3-gram estimate
These used one-hot representations, so no semantic relationships or word embeddings
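A minimal sketch of count-based n-gram estimation with add-k smoothing and a simple ("stupid backoff"-style) backoff; the toy corpus, constants and function names are illustrative assumptions, not from the notes:

```python
from collections import Counter

# Toy corpus (illustrative only); real counts would come from a large corpus
corpus = "the cat sat on the mat the cat ate the fish".split()
vocab = set(corpus)

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def add_k_prob(prev, word, k=1.0):
    # Add-k smoothing: every count gets +k, so no bigram has a zero numerator
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * len(vocab))

def backoff_prob(prev, word, alpha=0.4):
    # Stupid-backoff-style estimate: use the bigram if it was seen,
    # otherwise back off to a discounted unigram probability
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / sum(unigrams.values())

print(add_k_prob("the", "ate"))    # non-zero even though "the ate" never occurred
print(backoff_prob("the", "ate"))  # falls back to the unigram count of "ate"
```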
Recurrent network
current word -> hidden layer -> predicted next word, with the hidden layer also receiving the previous step's hidden state
Recurrent network learnt word embeddings in its hidden layer (similar words / contexts predict similar next words!)
But an Elman network, trained step by step without unfolding, only propagates error back one time step (sketch of a single step below)
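A rough numpy sketch of one Elman-style step as described above; the sizes and random weights are illustrative assumptions (a real model would learn them by training):

```python
import numpy as np

# Illustrative sizes and random weights
vocab_size, hidden_size = 10, 8
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # current word -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # previous hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # hidden -> predicted next word

def step(word_id, h_prev):
    # One time step: one-hot current word + previous hidden state
    # -> new hidden state -> distribution over the next word
    x = np.zeros(vocab_size)
    x[word_id] = 1.0                       # one-hot input
    h = np.tanh(W_xh @ x + W_hh @ h_prev)  # column word_id of W_xh acts as a learned embedding
    logits = W_hy @ h
    probs = np.exp(logits - logits.max())
    return h, probs / probs.sum()          # softmax over the vocabulary

h = np.zeros(hidden_size)
for word_id in [3, 1, 4]:                  # toy input word ids
    h, next_word_probs = step(word_id, h)
```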
Unfolding (Backpropagation through time)
Choose a depth D
Break the corpus into sub-sequences of length D+1
The same weights are shared by every unfolded copy of the hidden layer (and the input and output connections)
But as you train, the gradients shrink as they are propagated back towards earlier words (vanishing gradients), making it hard to capture long-distance dependencies in text
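A hedged PyTorch sketch of the unfolding idea (truncated backpropagation through time), assuming a depth D and a random stand-in corpus: the corpus is split into sub-sequences of length D+1, the same weights are reused at every unfolded step, and gradients only flow back through at most D steps:

```python
import torch
import torch.nn as nn

# Illustrative sizes and a random stand-in for a tokenised corpus
D, vocab_size, hidden_size = 5, 50, 32
corpus = torch.randint(0, vocab_size, (1000,))

embed = nn.Embedding(vocab_size, hidden_size)
rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)   # Elman-style recurrence
out = nn.Linear(hidden_size, vocab_size)
params = list(embed.parameters()) + list(rnn.parameters()) + list(out.parameters())
opt = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Break the corpus into sub-sequences of length D+1: D inputs and D shifted targets
for start in range(0, len(corpus) - D - 1, D):
    chunk = corpus[start:start + D + 1]
    inputs, targets = chunk[:-1].unsqueeze(0), chunk[1:]

    # The same weights are reused at every unfolded step; gradients flow back
    # through at most D steps, so long-range dependencies remain hard to learn
    hidden_states, _ = rnn(embed(inputs))
    loss = loss_fn(out(hidden_states.squeeze(0)), targets)

    opt.zero_grad()
    loss.backward()      # backpropagation through the D unfolded time steps
    opt.step()
```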
Evaluate a language model