Lecture 8 Notes
When we are close enough to the optimum, the error surface becomes approximately quadratic, and the convergence rate depends on the ratio λmax / λmin of the largest to smallest eigenvalue (the condition number)
Gradient descent equation from previous lectures:
w(t+1) = w(t) - η ∇E (where η is learning rate)
Depending on our chosen learning rate we can have different outcomes:
a) the learning rate is too big and convergence never happens; with every step of descent we move further and further from the optimum
b) the learning rate is too small; convergence happens, but progress is slow and takes many iterations of the algorithm
c) a well-chosen learning rate converges, and does so reasonably quickly
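The three outcomes can be seen on the simplest possible example, a 1-D quadratic E(w) = ½λw² with gradient λw, so each step gives w ← (1 − ηλ)w. (The function and values below are my own illustration, not from the lecture.)

```python
def descend(eta, lam=1.0, w0=1.0, steps=50):
    """Run gradient descent on E(w) = 0.5*lam*w**2 and return the
    final distance |w| to the optimum at w = 0."""
    w = w0
    for _ in range(steps):
        w -= eta * lam * w          # w(t+1) = w(t) - eta * dE/dw
    return abs(w)

print(descend(eta=2.5))   # too big: |1 - eta*lam| > 1, the iterates diverge
print(descend(eta=0.01))  # too small: converges, but very slowly
print(descend(eta=1.0))   # well chosen: lands on the optimum immediately
```

For the quadratic, η = 1/λ is exactly the "well-chosen" rate: the contraction factor 1 − ηλ becomes zero.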
To get fast convergence without a dangerously large learning rate, we introduce a new parameter α, the momentum coefficient, such that
0 ≤ α < 1
(at α = 0 we have no momentum at all; at α = 1 the momentum would carry us past the minimum and never stop, so we need some friction to slow down once we've reached our goal)
If we add this new factor our formula becomes:
w(t+1) = w(t) - η ∇E + α (w(t) - w(t-1))
Simplify:
w(t+1) - w(t) = - η∇E + α(w(t) - w(t-1))
Rewrite w(t+1) - w(t) as Δw(t) (so w(t) - w(t-1) = Δw(t-1)) and substitute into the formula above:
Δw(t) = - η ∇E + α Δw(t-1)
where for stability η < 2 / λmax
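A sketch of the update Δw(t) = -η∇E + αΔw(t-1) on a 2-D quadratic with one large and one small eigenvalue (the eigenvalues and starting point below are my own illustrative choices):

```python
def momentum_descent(eta, alpha, lam=(10.0, 0.1), w0=(1.0, 1.0), steps=200):
    """Gradient descent with momentum on E(w) = 0.5 * sum(lam_i * w_i**2)."""
    w = list(w0)
    dw = [0.0, 0.0]                   # Δw(t-1), zero before the first step
    for _ in range(steps):
        for i in range(2):
            grad = lam[i] * w[i]      # ∂E/∂w_i
            dw[i] = -eta * grad + alpha * dw[i]
            w[i] += dw[i]
    return w

# eta = 0.1 respects the stability bound eta < 2 / lam_max = 0.2, and the
# momentum term speeds up the slow lam_min direction considerably.
print(momentum_descent(eta=0.1, alpha=0.9))
```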
There has to be a balance: the momentum must work well along both the λmax and the λmin directions. If λmax is a lot bigger than λmin, look at the slow λmin direction: there the gradient stays roughly constant, so the step settles into a steady state with Δbmin(t) = Δbmin(t-1), giving
Δbmin = - η∂E/∂bmin + αΔbmin
Δbmin - αΔbmin = - η∂E/∂bmin
(1 - α)Δbmin = - η∂E/∂bmin
Δbmin = - η/(1 - α)∂E/∂bmin
If α is too high, the effective learning rate η/(1 - α) becomes too large along the λmax direction, and the algorithm overshoots the minimum.
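The steady-state result above can be checked numerically: iterating the momentum update with a constant gradient g (the values below are my own illustrative choices) settles at exactly the effective step -η/(1 - α) · g.

```python
# Iterate Δb <- -η*g + α*Δb with a constant gradient g and compare the
# fixed point against the closed form -η/(1 - α) * g derived above.
eta, alpha, g = 0.1, 0.9, 2.0
db = 0.0
for _ in range(500):
    db = -eta * g + alpha * db
print(db, -eta / (1 - alpha) * g)   # both settle at -2.0
```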
However, in practice, for big data sets there are too many calculations to use batch gradient descent. Instead, another option is to use stochastic gradient descent:
Δw(t) = - η ∇Ê
(where ∇Ê is ∇E plus some bounded amount of zero-mean noise)
η(t) has to go to zero, but slowly enough that the iterates can still travel all the way to the optimum:
Σt=1...∞ η(t) = ∞
(so the steps add up to an unbounded distance), while also shrinking fast enough that the accumulated noise dies out:
Σt=1...∞ η(t)² < ∞
Together these mean η(t) must decrease faster than 1/√t but no faster than 1/t; for example, η(t) = 1/t^p with 1/2 < p ≤ 1 satisfies both conditions.
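The borderline schedule η(t) = 1/t can be checked numerically: its partial sums keep growing (like ln t), while the partial sums of its squares level off (toward π²/6 ≈ 1.645).

```python
# Partial sums for the schedule η(t) = 1/t over the first 10,000 steps.
s1 = sum(1.0 / t for t in range(1, 10001))        # grows without bound
s2 = sum(1.0 / t ** 2 for t in range(1, 10001))   # converges to π²/6
print(s1, s2)
```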
In practice, stochastic descent is rarely run to convergence either. We can't always reach the optimum; for an algorithm working on real, changing data, a good approximation is what we aim for.
All gradient descent methods are relatively weak; when the optimum can be found analytically, it is much better to just "jump" there.
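A minimal illustration of the "jump there" remark (my own example, not the lecture's): for a quadratic E(w) = ½λw², a single Newton step w ← w - E′(w)/E″(w) lands exactly on the minimum, with no iteration at all.

```python
# One Newton step on E(w) = 0.5 * lam * w**2: E'(w) = lam*w, E''(w) = lam,
# so w - E'(w)/E''(w) = w - w = 0, the exact optimum.
lam, w = 3.0, 5.0
w = w - (lam * w) / lam
print(w)   # 0.0
```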
MLPs require a lot of tuning. They are hard to implement and there are many decisions we have to make. A new approach is the SVM (developed by Vladimir Vapnik and his colleagues). Some of the SVM logic can be added to MLPs to enhance their performance. Data example:
(figure: two classes of points, squares and circles, in the plane)
For data like this we can fit many possible linear decision boundaries that will predict different values. If we had to choose one, we would pick the boundary that:
- classifies the data correctly;
- is positioned so that the nearest square and the nearest circle are as far away from it as possible.
Its position depends solely on the data points touching the margin (the support vectors).
SVM: a linear classifier with maximum margin. (The bigger the margin, the better.)
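A toy 1-D sketch of the maximum-margin idea (the data and names are my own, not from the lecture): the best threshold sits midway between the closest points of the two classes, and only those two points, the support vectors, determine it.

```python
squares = [0.5, 1.2, 2.0]    # class -1
circles = [4.0, 5.0, 6.3]    # class +1

sv_left = max(squares)       # nearest square to the boundary (support vector)
sv_right = min(circles)      # nearest circle to the boundary (support vector)
threshold = (sv_left + sv_right) / 2   # maximum-margin decision boundary
margin = (sv_right - sv_left) / 2      # distance to the nearest point
print(threshold, margin)     # 3.0 1.0
```

Moving any point other than the two support vectors (without crossing the margin) leaves the boundary unchanged, which is exactly the "depends solely on the support vectors" property.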