Learning theory
How do we know we have found the correct patterns?
Learner's performance
Loss function L(f(x,w), y) gives learner a score for a set of values w
Classification
L = 0 if f(x,w) = y, and L = 1 if f(x,w) != y (the 0-1 loss)
Regression (MSE)
L=(f(x,w)-y)^2
Punishes errors by their square, so large mistakes cost much more than small ones
Cross-entropy (CE)
L = -y ln f(x,w) - (1-y) ln (1-f(x,w))
Rewards confident correct predictions and heavily penalises confident mistakes
During training we modify w so as to minimise the average of L over the training data (see the sketch below)
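As a minimal sketch, the three losses above written as plain Python/numpy functions; the function names and the clipping constant `eps` are illustrative choices, not part of the notes:

```python
import numpy as np

def zero_one_loss(y_pred, y):
    """0-1 loss for classification: 0 if the prediction matches the label, 1 otherwise."""
    return 0.0 if y_pred == y else 1.0

def mse_loss(y_pred, y):
    """Squared-error loss for regression: larger errors are punished quadratically."""
    return (y_pred - y) ** 2

def cross_entropy_loss(p_pred, y, eps=1e-12):
    """Binary cross-entropy: p_pred is the predicted probability of class 1, y is 0 or 1."""
    p_pred = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -y * np.log(p_pred) - (1 - y) * np.log(1 - p_pred)
```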
Risk
True risk: the expectation of the loss (performance of the hypothesis on all possible data)
Not computable!
Empirical risk: the average of the loss computed from available data (performance of the hypothesis on the data we have)
Can be computed!
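A sketch of the empirical risk as the average loss over a dataset; the toy linear model, data, and weights below are assumptions for illustration only:

```python
import numpy as np

def empirical_risk(loss_fn, f, w, X, Y):
    """Average loss of hypothesis f(., w) over the available data (X, Y)."""
    return np.mean([loss_fn(f(x, w), y) for x, y in zip(X, Y)])

# Example: a linear model f(x, w) = w . x scored with the squared-error loss.
f = lambda x, w: np.dot(w, x)
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
Y = np.array([3.0, -0.5, 2.0])
w = np.array([1.0, 0.9])
print(empirical_risk(lambda yp, y: (yp - y) ** 2, f, w, X, Y))
```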
Generalisation
But we can't compute the true risk, so we can't directly guarantee generalisation
We can, however, apply a concentration inequality (e.g. Hoeffding's) to the sample average A_m and substitute the empirical risk for A_m
This gives the probability that the empirical risk lies more than 𝜖 away from the true risk
The term ln|H| is a rough measure of the complexity of H. A more accurate measure of the complexity of a hypothesis space is the VC-dimension, denoted d. With the VC-dimension as the complexity measure we have:
True risk <= empirical risk + O( sqrt( (d ln(m/d) + ln(1/q)) / m ) ), where m is the number of training examples
with probability 1-q
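To get a feel for how the bound shrinks with more data, the snippet below evaluates the O(.) term with its hidden constant dropped (so the numbers are only indicative); the values of d and q are arbitrary assumptions:

```python
import numpy as np

def vc_bound_term(m, d, q):
    """sqrt((d ln(m/d) + ln(1/q)) / m) -- the O(.) term with its constant dropped."""
    return np.sqrt((d * np.log(m / d) + np.log(1 / q)) / m)

# The gap between true and empirical risk shrinks as the sample size m grows.
for m in [1_000, 10_000, 100_000]:
    print(m, vc_bound_term(m, d=100, q=0.05))
```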
VC-dimension
VC-dimension is the maximum number of points that a hypothesis space can shatter. It measures the complexity of a hypothesis space in terms of its representational power.
e.g. a line (a linear classifier in 2D) can shatter three points in general position but not any four points, so its VC-dimension is 3
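A brute-force check of this example: the sketch below tests linear separability with a linear-programming feasibility problem (scipy), trying every labelling of a point set; the specific point sets are illustrative assumptions:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """LP feasibility check: does some line w.x + b = 0 separate the labels?
    We look for w, b with y_i * (w . x_i + b) >= 1 for every point."""
    # Variables z = [w1, w2, b]; constraints rewritten as A_ub @ z <= -1.
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    """A point set is shattered if every +/-1 labelling is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]            # three points in general position
four = [(0, 0), (1, 0), (0, 1), (1, 1)]     # four points (the unit square)
print(shattered(three))  # True: all 8 labellings are separable
print(shattered(four))   # False: the XOR labellings are not separable
```

Three points in general position pass every labelling (shattered), while the four corners of a square fail on the XOR labellings, consistent with a VC-dimension of 3 for lines.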
Neural networks and VC-dimension
Neural networks: d <= O(W^2), where W is the number of parameters
ReLU networks: d <= O(W * L * log W), where L is the number of layers
Deep vs shallow
A single hidden layer neural network is a universal function approximator
Since both shallow and deep networks can in principle represent any function, why go deep?
For some problems, the deeper network has a lower VC-dimension so generalises better
But deep networks can also fit random patterns easily
Reducing complexity of neural networks
Regularisation
Penalty for the complexity of the network factored into the optimised cost/loss function (see the code sketch after this list)
Weight decay
Constraints imposed on the model limit its representational power
Dropout (chance for a neuron/unit to do nothing for an iteration of learning)
Batch normalisation (standardises the inputs, makes it easier to learn)
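A hedged sketch, assuming PyTorch, of the three techniques above combined in one small model: weight decay via the optimiser's weight_decay argument, a Dropout layer, and a BatchNorm1d layer; the architecture, data, and hyperparameters are arbitrary illustrations:

```python
import torch
import torch.nn as nn

# Small classifier combining the three complexity-reduction techniques above.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # batch normalisation: standardises the layer's inputs
    nn.ReLU(),
    nn.Dropout(p=0.5),    # dropout: each unit is zeroed with probability 0.5 per step
    nn.Linear(64, 2),
)

# Weight decay adds an L2 penalty on the weights to the optimised loss.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random data.
x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
model.train()              # enables dropout and batch-norm statistics updates
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```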