slidenumbers: true autoscale:true
Boston, Sep 19th, 2015
#[fit] ACTING ON DATA
###Rahul Dave ([email protected])
###@rahuldave
###Data Science.
###Simulation.
###Software.
machine learning, complex systems, stochastic methods, viz, extreme computing
###DEGREE PROGRAMS:
- Master of Science: one year
- Master of Engineering: two years, with a thesis/research project
#[fit]CLASSIFICATION
- will a customer churn?
- is this a check? For how much?
- a man or a woman?
- will this customer buy?
- do you have cancer?
- is this spam?
- whose picture is this?
- what is this text about?[^1]
#[fit]REGRESSION
- how many dollars will you spend?
- what is your creditworthiness?
- how many people will vote for Bernie $$t$$ days before the election?
- used to predict probabilities for classification
- causal modeling in econometrics
#[fit]From Bayesian Reasoning and Machine Learning, David Barber:
"A father decides to teach his young son what a sports car is. Finding it difficult to explain in words, he decides to give some examples. They stand on a motorway bridge and ... the father cries out ‘that’s a sports car!’ when a sports car passes by. After ten minutes, the father asks his son if he’s understood what a sports car is. The son says, ‘sure, it’s easy’. An old red VW Beetle passes by, and the son shouts – ‘that’s a sports car!’. Dejected, the father asks – ‘why do you say that?’. ‘Because all sports cars are red!’, replies the son."
30 points of data. Which fit is better? The line or the curve?
#[fit]What does it mean to FIT?
Minimize distance from the line?
Minimize squared distance from the line.
##[fit]Get the intercept and slope that minimize the squared distance
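A minimal sketch of this least-squares fit in Python (the 30 points here are synthetic stand-ins, not the slide's data):

```python
import numpy as np

# 30 synthetic points scattered around a line (stand-in data)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 2.0 + 3.0 * x + rng.normal(0, 0.3, size=30)

# least squares: np.polyfit minimizes the summed squared distance to the line
slope, intercept = np.polyfit(x, y, deg=1)   # degree-1 fit returns (slope, intercept)
print(intercept, slope)
```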
#EMPIRICAL RISK MINIMIZATION
The sample must be representative of the population!
A: The empirical risk estimates the out-of-sample risk. B: We make the empirical risk small. Thus the out-of-sample risk is also small.
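In symbols (notation assumed here, matching the loss $$\ell$$ used on the risk slides below): the empirical risk on the sample stands in for the out-of-sample risk,

$$R_{\rm emp}(g) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(g(x_i), y_i\big) \;\approx\; R_{\rm out}(g) = E_{(x,y)}\big[\ell(g(x), y)\big]$$

so a hypothesis $$g$$ with small empirical risk should, if A holds, also have small out-of-sample risk.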
Which fit is better now? The line or the curve?
#[fit] DETERMINISTIC NOISE (Bias) vs STOCHASTIC NOISE
#UNDERFITTING (Bias)
#vs
#OVERFITTING (Variance)
#[fit]If you are having problems in machine learning, the reason is almost always[^2]
#[fit]OVERFITTING
DATA SIZE MATTERS: straight line fits to a sine curve
Corollary: Must fit simpler models to less data!
#HOW DO WE LEARN?
#BALANCE THE COMPLEXITY
#[fit]VALIDATION
- train-test is not enough, as we fit for $$d$$ on the test set and contaminate it
- thus do train-validate-test
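A minimal train-validate-test sketch with scikit-learn (X, y and the split sizes are assumptions):

```python
from sklearn.model_selection import train_test_split

# hold out a final test set; it is never used while choosing d
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# split the remainder into a training set (fit the model) and a validation set (choose d)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
```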
#[fit]CROSS-VALIDATION
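Instead of a single validation set, cross-validation averages over several train/validate splits. A sketch for choosing the degree $$d$$ (X_train, y_train assumed from the split above):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# score each degree d by 5-fold cross-validation; pick the d with the best mean score
for d in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    print(d, cross_val_score(model, X_train, y_train, cv=5).mean())
```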
#[fit]REGULARIZATION
Keep higher a-priori complexity and impose a
#complexity penalty
on the risk instead, to effectively choose a SUBSET of the complex hypothesis space
#[fit]REGULARIZATION
As we increase the regularization strength $$\lambda$$, the coefficients are pushed towards 0.
Lasso uses the $$L_1$$ penalty $$\lambda \sum_i |w_i|$$, which sets some coefficients exactly to 0.
Thus regularization automates:
FEATURE ENGINEERING
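A Lasso sketch with scikit-learn (LassoCV picks the penalty strength by cross-validation; the data arrays are assumed):

```python
from sklearn.linear_model import LassoCV

# L1 penalty lambda * sum(|w_i|): shrinks coefficients, and sets some exactly to zero
lasso = LassoCV(cv=5).fit(X_train, y_train)
print(lasso.alpha_)                 # penalty strength chosen by cross-validation
print((lasso.coef_ != 0).sum())     # number of features that survive
```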
#[fit]CLASSIFICATION
#[fit]BY LINEAR SEPARATION
#Which line?
- Different algorithms, different lines.
- SVM uses max-margin[^1]
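A max-margin sketch with a linear SVM (the training arrays are assumed):

```python
from sklearn.svm import SVC

# linear SVM: picks the separating line with the widest margin
svm = SVC(kernel="linear", C=1.0).fit(Xtrain, ytrain)
print(svm.support_vectors_.shape)   # the few points that fix the margin
```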
#DISCRIMINATIVE CLASSIFIER
##VS. DISCRIMINANT
#GENERATIVE CLASSIFIER
#[fit]The virtual ATM[^3]
#[fit] PCA[^4]
#[fit]unsupervised dim reduction from 332x137x3 to 50
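A sketch of that reduction (assuming an array `images` of shape (n_images, 332, 137, 3)):

```python
from sklearn.decomposition import PCA

# flatten each 332 x 137 x 3 image into one long row, then keep 50 principal components
X_flat = images.reshape(len(images), -1)        # shape (n_images, 332*137*3)
pca = PCA(n_components=50)
X50 = pca.fit_transform(X_flat)                 # shape (n_images, 50)
print(pca.explained_variance_ratio_.sum())      # variance captured by the 50 components
```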
#kNN
#[fit] BIAS and VARIANCE in KNN
#kNN #CROSS-VALIDATED
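A sketch of how a cross-validated kNN like the `bestcv` used on the next slides might be built (Xtrain, ytrain assumed; here they would be the 50-dimensional PCA features):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# small k -> low bias, high variance; large k -> high bias, low variance;
# cross-validation picks a k in between
bestcv = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 10, 20, 40]},
                      cv=5).fit(Xtrain, ytrain)
print(bestcv.best_params_)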
#[fit]ENSEMBLE LEARNING
#Combine multiple classifiers and vote.
E.g.: decision trees (see the random-forest sketch below)
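For example, a random forest combines many decision trees by voting; a minimal sketch (data arrays assumed):

```python
from sklearn.ensemble import RandomForestClassifier

# many decision trees, each grown on a bootstrap sample; predictions are combined by voting
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtrain, ytrain)
print(forest.score(Xtest, ytest))
```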
#EVALUATING CLASSIFIERS
```python
# note: predictions are passed first here, so rows are the predicted class, columns the true class
confusion_matrix(bestcv.predict(Xtest), ytest)
# [[11,  0],
#  [ 3,  4]]
```
checks=blue=1, dollars=red=0
#[fit]CLASSIFIER PROBABILITIES
- classifiers output rankings or probabilities
- ought to be well calibrated, or at least similarly ordered
#CLASSIFICATION RISK
- The risk of predicting class $$g$$ at a point $$x$$ is
$$R_{g,\cal{D}}(x) = P(y_1 | x)\,\ell(g, y_1) + P(y_0 | x)\,\ell(g, y_0)$$
- The usual loss is the 1-0 loss: $$\ell = \mathbb{1}_{g \ne y}$$
- Thus $$R_{g=y_1}(x) = P(y_0 | x)$$ and $$R_{g=y_0}(x) = P(y_1 | x)$$

CHOOSE CLASS WITH LOWEST RISK

choose 1 if $$R_{g=y_1}(x) < R_{g=y_0}(x)$$, i.e.

choose 1 if $$P(y_1 | x) > P(y_0 | x)$$, that is, $$P(y_1 | x) > 0.5$$
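In code, for the symmetric 1-0 loss (assuming the classifier exposes predict_proba):

```python
# choose class 1 exactly when P(y_1 | x) > 0.5
probs = bestcv.predict_proba(Xtest)[:, 1]
ypred = (probs > 0.5).astype(int)    # equivalent to .predict(Xtest) under the 1-0 loss
```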
#ASYMMETRIC RISK[^5]
want no checks misclassified as dollars, i.e., no false negatives:
Start with an asymmetric loss: a false negative costs $$r$$ times a false positive, $$\ell(y_0, y_1) = r\,\ell(y_1, y_0)$$.
choose 1 if $$R_{g=y_1}(x) < R_{g=y_0}(x)$$, i.e. if $$P(y_0 | x)\,\ell(y_1, y_0) < P(y_1 | x)\,\ell(y_0, y_1)$$
#ASYMMETRIC RISK
i.e. choose 1 if $$r\,P(y_1 | x) > P(y_0 | x) = 1 - P(y_1 | x)$$
Or: choose 1 if $$P(y_1 | x) > \frac{1}{1+r}$$; with $$r = 10$$, predict a check whenever $$P(y_1 | x) > 1/11 \approx 0.09$$
```python
# asymmetric risk with r = 10: predict a check (class 1) whenever P(check | x) > 1/(1 + r)
ypred_asym = (bestcv2.predict_proba(Xtest)[:, 1] > 1.0 / (1 + 10)).astype(int)
confusion_matrix(ytest, ypred_asym)
# [[5, 4],
#  [0, 9]]
```
#[fit]ROC SPACE[^6]
#[fit]ROC CURVE
#ROC CURVE
- Rank test set by prob/score from highest to lowest
- At the beginning, nothing is predicted positive
- Keep moving the threshold down the ranking
- Compute a confusion matrix at each threshold
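A sketch with scikit-learn (classifier and test arrays assumed):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# roc_curve sweeps the threshold down the ranked scores and records (FPR, TPR) at each step
scores = bestcv.predict_proba(Xtest)[:, 1]
fpr, tpr, thresholds = roc_curve(ytest, scores)
plt.plot(fpr, tpr, label="AUC = %.2f" % auc(fpr, tpr))
plt.xlabel("false positive rate (FPR)")
plt.ylabel("true positive rate (TPR)")
plt.legend()
```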
#[fit]COMPARING CLASSIFIERS
Telecom customer churn data set from @YhatHQ[^7]
#ROC curves
#ASYMMETRIC CLASSES
- A has a large FP rate[^8]
- B has a large FN rate.
- On asymmetric data sets, A will do very badly.
- But is it so?
#EXPECTED VALUE FORMALISM
Can be used for risk or profit/utility (negative risk)
$$x$$ = fraction of the test set predicted to be positive
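A sketch of the computation (the cost-benefit numbers here are made up; ypred and ytest assumed):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# cost-benefit matrix (made-up numbers): rows = true class, cols = predicted class
benefit = np.array([[ 0, -1],     # true negative: 0, false positive: -1
                    [-5,  4]])    # false negative: -5, true positive: +4

cm = confusion_matrix(ytest, ypred, labels=[0, 1])
expected_profit = (cm * benefit).sum() / cm.sum()   # average profit/utility per test case
```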
#ASYMMETRIC CLASSES
Equicost lines in ROC space have slope $$\frac{p(-)\,c_{FP}}{p(+)\,c_{FN}}$$ (zero cost for correct classifications).
A small slope (costly false negatives, or rare positives) pushes the best operating point towards the top right of the ROC curve.
For churn and cancer you don't want FNs: an uncaught churner or cancer patient (positive class = churn/cancer).
#Profit curve
- Rank test set by prob/score from highest to lowest
- Calculate the expected profit/utility ($$U$$) for each confusion matrix
- Calculate the fraction of the test set predicted as positive ($$x$$)
- Plot $$U$$ against $$x$$
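A sketch of the profit curve, reusing `scores` and the `benefit` matrix from the sketches above (both assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix

order = np.argsort(-scores)                    # rank the test set, highest score first
y_sorted = np.asarray(ytest)[order]

fractions, profits = [], []
for k in range(1, len(y_sorted) + 1):
    ypred_k = np.zeros(len(y_sorted), dtype=int)
    ypred_k[:k] = 1                            # target (predict positive) the top k
    cm = confusion_matrix(y_sorted, ypred_k, labels=[0, 1])
    profits.append((cm * benefit).sum() / cm.sum())
    fractions.append(k / len(y_sorted))

plt.plot(fractions, profits)
plt.xlabel("fraction of test set targeted ($x$)")
plt.ylabel("expected profit ($U$)")
```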
#Finite budget[^8]
- 100,000 customers, a $40,000 budget, $5 per customer
- we can target 8,000 customers
- thus target the top 8%
- classifier 1 does better there, even though classifier 2 makes the maximum profit overall
#[fit]WHERE TO GO FROM HERE?
- Follow @YhatHQ, @kdnuggets etc
- Read Provost, Foster; Fawcett, Tom. Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media.
- check out Harvard's cs109: cs109.org
- Ask lots of questions of your data science team
- Follow your intuition
#THANKS!!
###@rahuldave
[^1]: Image from code in http://bit.ly/1Azg29G
[^2]: Background from http://commons.wikimedia.org/wiki/File:Overfitting.svg
[^3]: Inspired by http://blog.yhathq.com/posts/image-classification-in-Python.html
[^4]: Diagram from http://stats.stackexchange.com/a/140579
[^5]: Image of breast carcinoma from http://bit.ly/1QgqhBw
[^6]: This and the next figure: Data Science for Business, Provost & Fawcett.
[^7]: http://blog.yhathq.com/posts/predicting-customer-churn-with-sklearn.html