Learning theory
How do we know we have found the correct patterns?
Learner's performance
Loss function L(f(x,w), y) gives learner a score for a set of values w
Classification
L = 0 if f(x,w) = y, and L = 1 if f(x,w) != y (the 0-1 loss)
Regression (MSE)
L=(f(x,w)-y)^2
Punishes errors by their square, so large mistakes cost much more than small ones
Cross-entropy (CE)
L = -y ln f(x,w) - (1-y) ln (1-f(x,w))
Rewards confident correct predictions and heavily penalises confident mistakes
During training we modify w so as to minimise the average of L over the training data (see the sketch below)
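As a minimal sketch, the three losses above written as plain Python/numpy functions; the function names and the clipping constant `eps` are illustrative choices, not part of the notes:

```python
import numpy as np

def zero_one_loss(y_pred, y):
    """0-1 loss for classification: 0 if the prediction matches the label, 1 otherwise."""
    return 0.0 if y_pred == y else 1.0

def mse_loss(y_pred, y):
    """Squared-error loss for regression: larger errors are punished quadratically."""
    return (y_pred - y) ** 2

def cross_entropy_loss(p_pred, y, eps=1e-12):
    """Binary cross-entropy: p_pred is the predicted probability of class 1, y is 0 or 1."""
    p_pred = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -y * np.log(p_pred) - (1 - y) * np.log(1 - p_pred)
```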
Risk
True risk: the expectation of the loss (performance of the hypothesis on all possible data)
Not computable!
Empirical risk: the average of the loss computed from available data (performance of the hypothesis on the data we have)
Can be computed!
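A sketch of the empirical risk as the average loss over a dataset; the toy linear model, data, and weights below are assumptions for illustration only:

```python
import numpy as np

def empirical_risk(loss_fn, f, w, X, Y):
    """Average loss of hypothesis f(., w) over the available data (X, Y)."""
    return np.mean([loss_fn(f(x, w), y) for x, y in zip(X, Y)])

# Example: a linear model f(x, w) = w . x scored with the squared-error loss.
f = lambda x, w: np.dot(w, x)
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
Y = np.array([3.0, -0.5, 2.0])
w = np.array([1.0, 0.9])
print(empirical_risk(lambda yp, y: (yp - y) ** 2, f, w, X, Y))
```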
Generalisation
But we can't compute the true risk, so we can't directly guarantee generalisation
We can, however, apply a concentration inequality (e.g. Hoeffding's) to the sample average A_m and substitute the empirical risk for A_m
This gives the probability that the empirical risk lies more than 𝜖 away from the true risk
The term ln|H| is a rough measure of the complexity of H. A more accurate measure of the complexity of a hypothesis space is the VC-dimension, denoted d. With the VC-dimension as the complexity measure we have:
True risk <= empirical risk + O( sqrt( (d ln(m/d) + ln(1/q)) / m ) ), where m is the number of training examples
with probability 1-q
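To get a feel for how the bound shrinks with more data, the snippet below evaluates the O(.) term with its hidden constant dropped (so the numbers are only indicative); the values of d and q are arbitrary assumptions:

```python
import numpy as np

def vc_bound_term(m, d, q):
    """sqrt((d ln(m/d) + ln(1/q)) / m) -- the O(.) term with its constant dropped."""
    return np.sqrt((d * np.log(m / d) + np.log(1 / q)) / m)

# The gap between true and empirical risk shrinks as the sample size m grows.
for m in [1_000, 10_000, 100_000]:
    print(m, vc_bound_term(m, d=100, q=0.05))
```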
VC-dimension
VC-dimension is the maximum number of points that a hypothesis space can shatter. It measures the complexity of a hypothesis space in terms of its representational power.
e.g. a line (a linear classifier in 2D) can shatter three points in general position but not any four points, so its VC-dimension is 3
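A brute-force check of this example: the sketch below tests linear separability with a linear-programming feasibility problem (scipy), trying every labelling of a point set; the specific point sets are illustrative assumptions:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """LP feasibility check: does some line w.x + b = 0 separate the labels?
    We look for w, b with y_i * (w . x_i + b) >= 1 for every point."""
    # Variables z = [w1, w2, b]; constraints rewritten as A_ub @ z <= -1.
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    """A point set is shattered if every +/-1 labelling is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]            # three points in general position
four = [(0, 0), (1, 0), (0, 1), (1, 1)]     # four points (the unit square)
print(shattered(three))  # True: all 8 labellings are separable
print(shattered(four))   # False: the XOR labellings are not separable
```

Three points in general position pass every labelling (shattered), while the four corners of a square fail on the XOR labellings, consistent with a VC-dimension of 3 for lines.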
Neural networks and VC-dimension
Neural networks: d <= O(W^2), where W is the number of parameters
ReLU networks: d <= O(W * L * log W), where L is the number of layers
Deep vs shallow
A single hidden layer neural network is a universal function approximator
Since both shallow and deep networks can in principle represent any function, why go deep?
For some problems, the deeper network has a lower VC-dimension so generalises better
But deep networks can also fit random patterns easily
Reducing complexity of neural networks
Regularisation
Penalty for the complexity of the network factored into the optimised cost/loss function (see the code sketch after this list)
Weight decay
Constraints imposed on the model limit its representational power
Dropout (chance for a neuron/unit to do nothing for an iteration of learning)
Batch normalisation (standardises the inputs, makes it easier to learn)
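A hedged sketch, assuming PyTorch, of the three techniques above combined in one small model: weight decay via the optimiser's weight_decay argument, a Dropout layer, and a BatchNorm1d layer; the architecture, data, and hyperparameters are arbitrary illustrations:

```python
import torch
import torch.nn as nn

# Small classifier combining the three complexity-reduction techniques above.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # batch normalisation: standardises the layer's inputs
    nn.ReLU(),
    nn.Dropout(p=0.5),    # dropout: each unit is zeroed with probability 0.5 per step
    nn.Linear(64, 2),
)

# Weight decay adds an L2 penalty on the weights to the optimised loss.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random data.
x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
model.train()              # enables dropout and batch-norm statistics updates
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```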