Lecture 8 Notes
When we are close enough to the optimum, the error surface becomes approximately quadratic, and the convergence rate depends on the ratio λmax / λmin of the largest to smallest eigenvalue (the condition number)
Gradient descent equation from previous lectures:
w(t+1) = w(t) - η ∇E (where η is learning rate)
Depending on our chosen learning rate we can have different outcomes:
a) the learning rate is too big and convergence never happens; with every step of descent we move further and further from the optimum
b) the learning rate is too small; convergence happens, but progress is slow and takes many iterations of the algorithm
c) a well-chosen learning rate converges, and does so reasonably quickly
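The three outcomes can be seen on the simplest possible example, a 1-D quadratic E(w) = ½λw² with gradient λw, so each step gives w ← (1 − ηλ)w. (The function and values below are my own illustration, not from the lecture.)

```python
def descend(eta, lam=1.0, w0=1.0, steps=50):
    """Run gradient descent on E(w) = 0.5*lam*w**2 and return the
    final distance |w| to the optimum at w = 0."""
    w = w0
    for _ in range(steps):
        w -= eta * lam * w          # w(t+1) = w(t) - eta * dE/dw
    return abs(w)

print(descend(eta=2.5))   # too big: |1 - eta*lam| > 1, the iterates diverge
print(descend(eta=0.01))  # too small: converges, but very slowly
print(descend(eta=1.0))   # well chosen: lands on the optimum immediately
```

For the quadratic, η = 1/λ is exactly the "well-chosen" rate: the contraction factor 1 − ηλ becomes zero.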
To get fast convergence without a dangerously large learning rate, we introduce a new parameter α, the momentum coefficient, such that
0 ≤ α < 1
(at α = 0 we have no momentum at all; at α = 1 the momentum would carry us past the minimum and never stop, so we need some friction to slow down once we've reached our goal)
If we add this new factor our formula becomes:
w(t+1) = w(t) - η ∇E + α (w(t) - w(t-1))
Simplify:
w(t+1) - w(t) = - η∇E + α(w(t) - w(t-1))
Rewrite w(t+1) - w(t) as Δw(t) (so w(t) - w(t-1) = Δw(t-1)) and substitute into the formula above:
Δw(t) = - η ∇E + α Δw(t-1)
where for stability η < 2 / λmax
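A sketch of the update Δw(t) = -η∇E + αΔw(t-1) on a 2-D quadratic with one large and one small eigenvalue (the eigenvalues and starting point below are my own illustrative choices):

```python
def momentum_descent(eta, alpha, lam=(10.0, 0.1), w0=(1.0, 1.0), steps=200):
    """Gradient descent with momentum on E(w) = 0.5 * sum(lam_i * w_i**2)."""
    w = list(w0)
    dw = [0.0, 0.0]                   # Δw(t-1), zero before the first step
    for _ in range(steps):
        for i in range(2):
            grad = lam[i] * w[i]      # ∂E/∂w_i
            dw[i] = -eta * grad + alpha * dw[i]
            w[i] += dw[i]
    return w

# eta = 0.1 respects the stability bound eta < 2 / lam_max = 0.2, and the
# momentum term speeds up the slow lam_min direction considerably.
print(momentum_descent(eta=0.1, alpha=0.9))
```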
There has to be a balance: the momentum must work well along both the λmax and the λmin directions. If λmax is a lot bigger than λmin, look at the slow λmin direction: there the gradient stays roughly constant, so the step settles into a steady state with Δbmin(t) = Δbmin(t-1), giving
Δbmin = - η∂E/∂bmin + αΔbmin
Δbmin - αΔbmin = - η∂E/∂bmin
(1 - α)Δbmin = - η∂E/∂bmin
Δbmin = - η/(1 - α)∂E/∂bmin
If α is too high, the effective learning rate η/(1 - α) becomes too large along the λmax direction, and the algorithm overshoots the minimum.
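The steady-state result above can be checked numerically: iterating the momentum update with a constant gradient g (the values below are my own illustrative choices) settles at exactly the effective step -η/(1 - α) · g.

```python
# Iterate Δb <- -η*g + α*Δb with a constant gradient g and compare the
# fixed point against the closed form -η/(1 - α) * g derived above.
eta, alpha, g = 0.1, 0.9, 2.0
db = 0.0
for _ in range(500):
    db = -eta * g + alpha * db
print(db, -eta / (1 - alpha) * g)   # both settle at -2.0
```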
However, in practice, for big data sets there are too many calculations to use batch gradient descent. Instead, another option is to use stochastic gradient descent:
Δw(t) = - η ∇Ê
(where ∇Ê is ∇E plus some bounded amount of zero-mean noise)
η(t) has to go to zero, but slowly enough that the iterates can still travel all the way to the optimum:
Σt=1...∞ η(t) = ∞
(so the steps add up to an unbounded distance), while also shrinking fast enough that the accumulated noise dies out:
Σt=1...∞ η(t)² < ∞
Together these mean η(t) must decrease faster than 1/√t but no faster than 1/t; for example, η(t) = 1/t^p with 1/2 < p ≤ 1 satisfies both conditions.
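The borderline schedule η(t) = 1/t can be checked numerically: its partial sums keep growing (like ln t), while the partial sums of its squares level off (toward π²/6 ≈ 1.645).

```python
# Partial sums for the schedule η(t) = 1/t over the first 10,000 steps.
s1 = sum(1.0 / t for t in range(1, 10001))        # grows without bound
s2 = sum(1.0 / t ** 2 for t in range(1, 10001))   # converges to π²/6
print(s1, s2)
```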
In practice, stochastic descent is rarely run to convergence either. We can't always reach the optimum; for an algorithm working on real, changing data, a good approximation is what we aim for.
All gradient descent methods are relatively weak; when the optimum can be found analytically, it is much better to just "jump" there.
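A minimal illustration of the "jump there" remark (my own example, not the lecture's): for a quadratic E(w) = ½λw², a single Newton step w ← w - E′(w)/E″(w) lands exactly on the minimum, with no iteration at all.

```python
# One Newton step on E(w) = 0.5 * lam * w**2: E'(w) = lam*w, E''(w) = lam,
# so w - E'(w)/E''(w) = w - w = 0, the exact optimum.
lam, w = 3.0, 5.0
w = w - (lam * w) / lam
print(w)   # 0.0
```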
MLPs require a lot of tuning. They are hard to implement and there are many decisions we have to make. A new approach is the SVM (developed by Vladimir Vapnik and his colleagues). Some of the SVM logic can be added to MLPs to enhance their performance. Data example:
(figure: two classes of points, squares and circles, in the plane)
For data like this we can fit many possible linear decision boundaries that will predict different values. If we had to choose one, we would pick the boundary that:
- classifies the data correctly;
- is positioned so that the nearest square and the nearest circle are as far away from it as possible.
Its position depends solely on the data points touching the margin (the support vectors).
SVM: a linear classifier with maximum margin. (The bigger the margin, the better.)
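A toy 1-D sketch of the maximum-margin idea (the data and names are my own, not from the lecture): the best threshold sits midway between the closest points of the two classes, and only those two points, the support vectors, determine it.

```python
squares = [0.5, 1.2, 2.0]    # class -1
circles = [4.0, 5.0, 6.3]    # class +1

sv_left = max(squares)       # nearest square to the boundary (support vector)
sv_right = min(circles)      # nearest circle to the boundary (support vector)
threshold = (sv_left + sv_right) / 2   # maximum-margin decision boundary
margin = (sv_right - sv_left) / 2      # distance to the nearest point
print(threshold, margin)     # 3.0 1.0
```

Moving any point other than the two support vectors (without crossing the margin) leaves the boundary unchanged, which is exactly the "depends solely on the support vectors" property.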