20/10/2015
#Support Vector Machine (SVM)
Last Lecture: Maximum margin problems
- Introduction of soft margin
- When there is mislabelled data the hyperplane can no longer cleanly split the data; the soft margin lets some points violate the margin while still maximising the margin distance
- Problem based on linear classification
Margin is straightforward to calculate in the linear case but not when the problem is non-linear:
![Non-linear margin](http://www.blaenkdenum.com/images/notes/machine-learning/support-vector-machines/x-space-non-linear-svm.png)
##Kernel Trick
φ: ℝ^(10^6) ⟼ ℝ^(10^100)
- φ(x): intractable (hard to calculate by itself)
Analogy: a GPU input vector which you cannot alter once it is being processed
##Kernelize the Algorithm
- Instead of operating in input space, change the x's to φ(x⁽ⁱ⁾)'s to move to feature space
ω: primal representation of the weight vector
α: dual representation of the same vector
ξᵢ: slack variables
ω, θ, ξ: the variables we optimise over; the ξᵢ are the penetration variables (they let points penetrate the margin)
y⁽ⁱ⁾ = ±1 for the "yes"/"no" class of each data point
Minimise over (ω, θ, ξ): ½‖ω‖² + c Σᵢ ξᵢ
Subject to: y⁽ⁱ⁾ (ω·x⁽ⁱ⁾ − θ) ≥ 1 − ξᵢ
And: ξᵢ ≥ 0
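A minimal sketch of this soft-margin primal on synthetic toy data, assuming the cvxpy package as a generic QP solver (not part of the lecture); the names `X`, `y`, `C` are illustrative:

```python
# Soft-margin SVM primal, solved directly as a QP with cvxpy (assumed available).
import numpy as np
import cvxpy as cp

# Synthetic toy data: m points in n dimensions, labels y in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, (20, 2)), rng.normal(-2.0, 1.0, (20, 2))])
y = np.array([+1.0] * 20 + [-1.0] * 20)
m, n = X.shape
C = 1.0                      # the penalty c on the slack variables

w = cp.Variable(n)           # omega, the primal weight vector
theta = cp.Variable()        # the threshold
xi = cp.Variable(m)          # the slack variables xi_i

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w - theta) >= 1 - xi,   # y_i (w.x_i - theta) >= 1 - xi_i
               xi >= 0]
cp.Problem(objective, constraints).solve()
print("w =", w.value, "theta =", theta.value)
```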
Math Breakdown:
ω = Σᵢ αᵢ y⁽ⁱ⁾ φ(x⁽ⁱ⁾)
‖ω‖² = ω·ω
= (Σᵢ αᵢ y⁽ⁱ⁾ φ(x⁽ⁱ⁾)) · (Σⱼ αⱼ y⁽ʲ⁾ φ(x⁽ʲ⁾))
= Σᵢ,ⱼ αᵢ αⱼ y⁽ⁱ⁾ y⁽ʲ⁾ (φ(x⁽ⁱ⁾) · φ(x⁽ʲ⁾))
= Σᵢ,ⱼ αᵢ αⱼ y⁽ⁱ⁾ y⁽ʲ⁾ k(x⁽ⁱ⁾, x⁽ʲ⁾)
= αᵀ K̃ α, where K̃ᵢⱼ = y⁽ⁱ⁾ y⁽ʲ⁾ kᵢⱼ and kᵢⱼ = k(x⁽ⁱ⁾, x⁽ʲ⁾)
Substitute: ω·φ(x⁽ⁱ⁾) = Σⱼ αⱼ y⁽ʲ⁾ φ(x⁽ʲ⁾)·φ(x⁽ⁱ⁾) = Σⱼ αⱼ y⁽ʲ⁾ k(x⁽ⁱ⁾, x⁽ʲ⁾)
###Kernelized Algorithm
Minimise over (α, θ, ξ): ½ αᵀ K̃ α + c Σᵢ ξᵢ
Subject to: y⁽ⁱ⁾ (Σⱼ αⱼ y⁽ʲ⁾ k(x⁽ⁱ⁾, x⁽ʲ⁾) − θ) ≥ 1 − ξᵢ
And: ξᵢ ≥ 0
- The dual representation (α) is used in the quadratic programming problem instead of the primal representation (ω)
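Continuing the primal sketch above (same `X`, `y`, `m`, `C`), a hedged kernelised version that optimises α instead of ω; the degree-2 polynomial kernel is just an example choice:

```python
# Kernelised soft-margin problem: same QP structure, but over alpha.
import numpy as np
import cvxpy as cp

def k(u, v):
    return (u @ v + 1.0) ** 2        # example Mercer kernel (degree-2 polynomial)

K = np.array([[k(a, b) for b in X] for a in X])   # Gram matrix k_ij
K_tilde = np.outer(y, y) * K                      # entries y_i y_j k_ij

alpha = cp.Variable(m)
theta = cp.Variable()
xi = cp.Variable(m)

# Tiny jitter keeps the quadratic form numerically positive semi-definite.
objective = cp.Minimize(0.5 * cp.quad_form(alpha, K_tilde + 1e-6 * np.eye(m))
                        + C * cp.sum(xi))
# ((K * y) @ alpha)_i equals sum_j alpha_j y_j k(x_i, x_j)
constraints = [cp.multiply(y, (K * y) @ alpha - theta) >= 1 - xi,
               xi >= 0]
problem = cp.Problem(objective, constraints)
problem.solve()
print("optimal objective:", problem.value)
```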
##Kernel Function
###Mercer's Theorem
Pre-condition:
If k is symmetric:
k(u, v) = k(v, u)
and non-negative definite:
Σᵢ Σⱼ k(xᵢ, xⱼ) cᵢ cⱼ ≥ 0
for all finite sequences of points x₁, ..., xₙ of [a, b] and all choices of real numbers c₁, ..., cₙ
Post-condition:
⇒ ∃ φ s.t. k(u, v) = φ(u)·φ(v)
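As a quick illustration of the non-negative-definite condition (a numerical spot check, not a proof), one can build the Gram matrix of a candidate kernel on random points and inspect its eigenvalues; the helper below is made up for this sketch:

```python
# Numerical sanity check of Mercer's condition: the Gram matrix of a valid
# kernel should have no (meaningfully) negative eigenvalues.
import numpy as np

def gram(kernel, points):
    return np.array([[kernel(u, v) for v in points] for u in points])

rng = np.random.default_rng(1)
pts = rng.normal(size=(30, 5))

k_linear = lambda u, v: u @ v                 # a valid (identity/linear) kernel
k_bad = lambda u, v: -np.linalg.norm(u - v)   # generally NOT a Mercer kernel

for name, kern in [("linear", k_linear), ("negative distance", k_bad)]:
    min_eig = np.linalg.eigvalsh(gram(kern, pts)).min()
    print(name, "min eigenvalue:", min_eig)   # ~0 or positive vs. clearly negative
```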
Examples:
Identity Kernel:
k(u,v)=u·v
- takes O(n) work in n-space
Quadratic Kernel:
k(u, v) = (u·v)²
= (Σₖ uₖ vₖ)²
= (Σₖ uₖ vₖ)(Σₗ uₗ vₗ)
= Σₖ,ₗ (uₖ uₗ)(vₖ vₗ)
= Σₖ,ₗ φ(u)ₖₗ φ(v)ₖₗ, where φ(u)ₖₗ = uₖ uₗ
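A small numpy check of the expansion above: computing (u·v)² in input space agrees with the dot product of the explicit feature maps φ(u)ₖₗ = uₖuₗ (the example vectors are arbitrary):

```python
# Quadratic kernel vs. explicit feature map: both give the same number.
import numpy as np

def phi(u):
    return np.outer(u, u).ravel()    # all pairwise products u_k * u_l (n^2 features)

rng = np.random.default_rng(2)
u, v = rng.normal(size=4), rng.normal(size=4)

print((u @ v) ** 2)        # kernel trick: O(n) work in input space
print(phi(u) @ phi(v))     # same value via the n^2-dimensional feature space
```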
###Polynomial Kernel
For degree-d polynomials, the polynomial kernel is defined as:
k(x, y) = (x·y + c)ᵈ
where x and y are vectors in the input space and c ≥ 0 is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial.
φ: ℝⁿ ⟼ ℝ^D where D ≈ nᵈ/d! (roughly the number of degree-d monomial features)
- When we used the quadratic kernel we dropped all linear terms
- No arguments about which kernel function to use, as you can always use them both (add them up). Example:
k(u, v) = (u·v + 1)²
= (Σₖ uₖ vₖ + 1)(Σₗ uₗ vₗ + 1)
= Σₖ,ₗ uₖ uₗ vₖ vₗ + 2 Σₖ uₖ vₖ + 1
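A quick numerical check of this expansion, showing the constant term brings the linear kernel back alongside the quadratic one (arbitrary example vectors):

```python
# (u.v + 1)^2 = (u.v)^2 + 2*(u.v) + 1: quadratic + (scaled) linear + constant kernels.
import numpy as np

rng = np.random.default_rng(3)
u, v = rng.normal(size=4), rng.normal(size=4)

lhs = (u @ v + 1.0) ** 2
rhs = (u @ v) ** 2 + 2.0 * (u @ v) + 1.0
print(lhs, rhs)   # equal up to floating-point rounding
```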
##Gaussian Process
- Gaussian Kernel:
k(u, v) = e^(−d‖u−v‖²)
- Can be used to compute similarities between images
- The price for using this: the corresponding φ maps to an infinite-dimensional feature space
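A minimal sketch of the Gaussian kernel as a similarity score; the width d and the random "image" vectors below are just placeholders:

```python
# Gaussian kernel k(u, v) = exp(-d * ||u - v||^2) as a similarity measure.
import numpy as np

def gaussian_kernel(u, v, d=0.5):
    return np.exp(-d * np.sum((u - v) ** 2))

rng = np.random.default_rng(4)
img_a = rng.random((8, 8)).ravel()   # stand-ins for flattened image patches
img_b = rng.random((8, 8)).ravel()

print(gaussian_kernel(img_a, img_a))   # 1.0 for identical inputs
print(gaussian_kernel(img_a, img_b))   # decays towards 0 as inputs move apart
```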
##Kernel Function applications
- Find similarities between two pieces of text
When we finished the optimisation above:
Problem: there were no x's left, just i's and j's (the data only enters through k(x⁽ⁱ⁾, x⁽ʲ⁾))
When we want to embed the SVM in a system (e.g. a sneeze-detection function in a camera), classify a new x by checking whether ω·φ(x) ≥ θ:
ω·φ(x) = Σᵢ αᵢ y⁽ⁱ⁾ φ(x⁽ⁱ⁾)·φ(x) = Σᵢ αᵢ y⁽ⁱ⁾ k(x⁽ⁱ⁾, x)
Most αᵢ are zero, so only the terms with αᵢ ≠ 0 (the support vectors) need to be evaluated!
We only need to store the support vectors from the training set (the decisive sneeze/non-sneeze examples)
Only download these into the camera:
e.g. 200 of the 1000 training cases
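A sketch of that deployment idea, trained here with scikit-learn's SVC for convenience (it solves an equivalent dual formulation); the hypothetical "camera" only keeps the support vectors, their coefficients αᵢ y⁽ⁱ⁾, the threshold, and the kernel parameter:

```python
# Only the support vectors are needed at prediction time.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X_train = np.vstack([rng.normal(+1.5, 1.0, (500, 2)),    # "sneeze" examples
                     rng.normal(-1.5, 1.0, (500, 2))])   # "no sneeze" examples
y_train = np.array([+1] * 500 + [-1] * 500)

gamma = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_train, y_train)
print("training cases:", len(X_train), "support vectors:", len(clf.support_vectors_))

# Everything the camera has to store:
sv = clf.support_vectors_        # the x_i with alpha_i != 0
coef = clf.dual_coef_.ravel()    # alpha_i * y_i for those points
theta = -clf.intercept_[0]       # threshold in the lecture's sign convention

def camera_decision(x):
    # sum_i alpha_i y_i k(x_i, x) compared against theta
    k_vals = np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))
    return 1 if coef @ k_vals - theta >= 0 else -1

x_new = np.array([1.0, 1.0])
print(camera_decision(x_new), clf.predict([x_new])[0])   # should agree
```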
##SVM Conclusions
Popular kernels: quite robust
- Reasons why people like SVMs:
Positives:
- Beautiful maths (Mercer's theorem, ...)
- Complexity depends on the number of support vectors
- Can work in very high-dimensional feature spaces as it only looks at a subset of the training vectors
- Turn-key (works with little tuning)
Criticism:
- Glorified template matching