Language modelling with recurrent networks

Before neural networks there were n-grams

  • Modelled frequency of word sequences from a corpus

    • Struggled with low-likelihood words and never-seen word sequences (the sparsity problem)

      • Mitigated somewhat by smoothing and backoff (see the sketch after this list)

        • e.g. smoothing: add a small value to every count so no n-gram ends up with a probability of zero

        • e.g. backoff: if the 5-gram count is zero, fall back to the 4-gram

  • These used one-hot representations, so there were no semantic relationships or word embeddings
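
A minimal sketch of both tricks, assuming a tiny made-up corpus and a bigram model (the corpus, the value of `k` and the unnormalised backoff are illustrative only):

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)          # vocabulary size
N = len(corpus)            # corpus size

def smoothed_bigram(prev, word, k=1.0):
    """Add-k smoothing: every count gets +k, so no probability is ever zero."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

def backoff_bigram(prev, word):
    """Crude (unnormalised) backoff: use the bigram if seen, else the unigram."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return unigrams[word] / N          # fall back to the lower-order model

print(smoothed_bigram("the", "cat"))   # seen pair
print(smoothed_bigram("the", "frog"))  # unseen word: still non-zero thanks to smoothing
print(backoff_bigram("cat", "rug"))    # unseen pair: falls back to the unigram estimate
```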

Recurrent network

current word + previous hidden state -> hidden -> predicted next word

The recurrent network learns word embeddings in its hidden layer (words that appear in similar contexts predict similar next words, so they end up with similar representations!)
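
A rough numpy sketch of that picture (the sizes, initialisation and word ids are made up, and there is no training loop); the key point is that the previous hidden state is fed back in alongside the current word:

```python
import numpy as np

vocab_size, hidden_size = 10, 8                          # illustrative sizes only
rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.1, (hidden_size, vocab_size))     # current word  -> hidden
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))    # previous step -> hidden
W_hy = rng.normal(0, 0.1, (vocab_size, hidden_size))     # hidden        -> predicted word

def step(word_id, h_prev):
    """One Elman step: current word + previous hidden state -> next-word distribution."""
    x = np.zeros(vocab_size)
    x[word_id] = 1.0                                     # one-hot input
    h = np.tanh(W_xh @ x + W_hh @ h_prev)                # W_xh @ x picks out one learned column: that word's embedding
    logits = W_hy @ h
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), h                        # softmax over the vocabulary

h = np.zeros(hidden_size)
for word_id in [3, 1, 4]:                                # hypothetical word ids
    probs, h = step(word_id, h)                          # h carries exactly one step of history
```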

But an Elman network only looks back one time step

Unfolding (Backpropagation through time)

  • Choose a depth D

  • Break the corpus into sub-sequences of D + 1 words

  • The same weights are shared by every unrolled copy of the hidden layer (and the input/output connections); see the sketch below
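
One way to build those sub-sequences, under one reading of the notes (non-overlapping chunks; sliding windows are another common choice):

```python
def make_bptt_chunks(token_ids, D):
    """Unfolding to depth D: break the corpus into sub-sequences of D + 1 tokens.
    Within a chunk the network is unrolled D times with the SAME weight matrices
    at every position; the first D tokens are inputs, tokens 2..D+1 are targets."""
    return [token_ids[i : i + D + 1]
            for i in range(0, len(token_ids) - D, D + 1)]

print(make_bptt_chunks(list(range(10)), D=3))
# [[0, 1, 2, 3], [4, 5, 6, 7]]  (a leftover tail shorter than D + 1 is dropped)
```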

But as you train, the gradients shrink as they are propagated back to earlier words (the vanishing gradient problem), making it hard to capture long-distance dependencies in text.
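
A small illustration of why (sizes and weight scales are arbitrary): backpropagating through the tanh recurrence multiplies the gradient by a Jacobian at every step, and those repeated products tend to shrink it.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, D = 8, 30
W_xh = rng.normal(0, 0.2, (hidden_size, hidden_size))
W_hh = rng.normal(0, 0.2, (hidden_size, hidden_size))

# Forward pass over D random "word" vectors, remembering every hidden state.
hs, h = [], np.zeros(hidden_size)
for _ in range(D):
    h = np.tanh(W_xh @ rng.normal(size=hidden_size) + W_hh @ h)
    hs.append(h)

# Backward pass: each step multiplies the gradient by W_hh.T @ diag(1 - h_t**2).
grad = np.ones(hidden_size)
for h_t in reversed(hs):
    grad = W_hh.T @ ((1 - h_t ** 2) * grad)

print(np.linalg.norm(grad))   # typically many orders of magnitude below its starting norm
```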

Evaluating a language model

Perplexity allows you to evaluate a language model; the standard definition is given below.
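
Filling in the maths with the standard definition (for a held-out sequence $w_1, \dots, w_N$):

$$
\mathrm{PP}(w_1, \dots, w_N) = P(w_1, \dots, w_N)^{-\frac{1}{N}}
= \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\right)
$$

Lower perplexity means the model assigns higher probability to unseen text; it can be read as the effective number of choices the model is picking between at each word.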
