slidenumbers: true autoscale:true


Boston, Sep 19th, 2015

#[fit] ACTING ON DATA

###Rahul Dave ([email protected])

###@rahuldave


###Data Science. ###Simulation. ###Software.

###Social Good. ###Interesting Problems.


machine learning, complex systems, stochastic methods, viz, extreme computing

###DEGREE PROGRAMS:

  • Master of Science – one year
  • Master of Engineering – two years, with thesis/research project

#[fit]CLASSIFICATION

  • will a customer churn?
  • is this a check? For how much?
  • a man or a woman?
  • will this customer buy?
  • do you have cancer?
  • is this spam?
  • whose picture is this?
  • what is this text about?[^1]



#[fit]REGRESSION

  • how many dollars will you spend?
  • what is your creditworthiness?
  • how many people will vote for Bernie $$t$$ days before the election?
  • use to predict probabilities for classification
  • causal modeling in econometrics



#[fit]From Bayesian Reasoning and Machine Learning, David Barber:

"A father decides to teach his young son what a sports car is. Finding it difficult to explain in words, he decides to give some examples. They stand on a motorway bridge and ... the father cries out ‘that’s a sports car!’ when a sports car passes by. After ten minutes, the father asks his son if he’s understood what a sports car is. The son says, ‘sure, it’s easy’. An old red VW Beetle passes by, and the son shouts – ‘that’s a sports car!’. Dejected, the father asks – ‘why do you say that?’. ‘Because all sports cars are red!’, replies the son."


30 points of data. Which fit is better? Line in $$\cal{H_1}$$ or curve in $$\cal{H_{20}}$$?
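A minimal sketch of this comparison, on 30 hypothetical points (the data here is made up):

```python
import numpy as np

# 30 hypothetical data points (noiseless for now; noise comes later)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x)

# fit a line (H_1) and a degree-20 polynomial (H_20); numpy may warn
# about poor conditioning at degree 20
g1 = np.poly1d(np.polyfit(x, y, 1))
g20 = np.poly1d(np.polyfit(x, y, 20))

# in-sample mean squared error: the curve wins on the training points
print(np.mean((y - g1(x)) ** 2), np.mean((y - g20(x)) ** 2))
```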




#[fit]What does it mean to FIT?

Minimize distance from the line?

$$R_{\cal{D}}(h_1(x)) = \frac{1}{N} \sum_{y_i \in \cal{D}} (y_i - h_1(x_i))^2 $$

Minimize squared distance from the line.

$$ g_1(x) = \arg\min_{h_1(x) \in \cal{H}} R_{\cal{D}}(h_1(x)).$$

##[fit]Get intercept $$w_0$$ and slope $$w_1$$.
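A minimal sketch of this minimization with numpy, on hypothetical (x, y) data:

```python
import numpy as np

# hypothetical data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# least squares over lines h_1(x) = w0 + w1*x minimizes R_D
w1, w0 = np.polyfit(x, y, 1)               # polyfit returns the slope first
g1 = lambda x: w0 + w1 * x
print(w0, w1, np.mean((y - g1(x)) ** 2))   # intercept, slope, empirical risk
```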


#EMPIRICAL RISK MINIMIZATION

The sample must be representative of the population!


$$A : R_{\cal{D}}(g) \,\,smallest\,on\,\cal{H}$$ $$B : R_{out\,of\,sample}(g) \approx R_{\cal{D}}(g)$$

B: the empirical risk estimates the out-of-sample risk. A + B: thus the out-of-sample risk is also small.


THE REAL WORLD HAS NOISE

Which fit is better now? The line or the curve?




#[fit] DETERMINISTIC NOISE (Bias) vs STOCHASTIC NOISE



#UNDERFITTING (Bias) #vs OVERFITTING (Variance)



#[fit]If you are having problems in machine learning, the reason is almost always[^2]

#[fit]OVERFITTING


DATA SIZE MATTERS: straight line fits to a sine curve


Corollary: Must fit simpler models to less data!


#HOW DO WE LEARN?




#BALANCE THE COMPLEXITY




#[fit]VALIDATION

  • a train-test split is not enough, as we fit the hyperparameter $$d$$ on the test set and contaminate it
  • thus do train-validate-test (a minimal sketch follows below)
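A minimal sketch of the three-way split with scikit-learn, on hypothetical data (variable names are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical data standing in for the real dataset
X = np.random.rand(100, 1)
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * np.random.randn(100)

# hold out a test set, then split the rest into train and validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# fit each candidate degree d on X_train, pick d on X_val,
# and quote the final error once on X_test
```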



#[fit]CROSS-VALIDATION




#[fit]REGULARIZATION

Keep higher a priori complexity and impose a

#complexity penalty

on risk instead, to choose a SUBSET of $$\cal{H}_{big}$$. We'll make the coefficients small:

$$\sum_{i=0}^j a_i^2 < C.$$





#[fit]REGULARIZATION $$\cal{R}(h_j) = \sum_{y_i \in \cal{D}} (y_i - h_j(x_i))^2 +\alpha \sum_{i=0}^j a_i^2.$$

As we increase $$\alpha$$, coefficients go towards 0.

Lasso uses $$\alpha \sum_{i=0}^j |a_i|$$, which sets some coefficients exactly to 0.

Thus regularization automates:

       FEATURE ENGINEERING
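A minimal sketch of ridge vs lasso on hypothetical degree-20 polynomial features: as $$\alpha$$ grows, ridge shrinks the coefficients while lasso zeroes many of them out.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures

# hypothetical noisy data and a deliberately over-complex feature set
x = np.linspace(0, 1, 30)[:, None]
y = np.sin(2 * np.pi * x[:, 0]) + 0.2 * np.random.randn(30)
X = PolynomialFeatures(degree=20).fit_transform(x)

for alpha in (1e-6, 1e-2, 1.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha, max_iter=100000).fit(X, y)
    # ridge: coefficients shrink; lasso: many become exactly zero
    print(alpha, np.abs(ridge.coef_).max(), np.sum(lasso.coef_ == 0))
```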


#[fit]CLASSIFICATION

#[fit]BY LINEAR SEPARATION

#Which line?

  • Different Algorithms, different lines.

  • SVM uses max-margin[^1] (a minimal sketch follows below)
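A minimal sketch of a max-margin linear separator with scikit-learn, on hypothetical two-class data:

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical two-feature, two-class data
X = np.vstack([np.random.randn(50, 2) + [2, 2],
               np.random.randn(50, 2) - [2, 2]])
y = np.array([1] * 50 + [0] * 50)

# linear kernel: the fitted line is the maximum-margin separator
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)   # parameters of the separating line
```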



#DISCRIMINATIVE CLASSIFIER $$P(y|x): P(male | height, weight)$$


##VS. DISCRIMINANT


#GENERATIVE CLASSIFIER $$P(y|x) \propto P(x|y)P(y): P(height, weight | male) \times P(male)$$
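A minimal sketch contrasting the two on hypothetical height/weight data: logistic regression models $$P(y|x)$$ directly, while Gaussian naive Bayes models $$P(x|y)P(y)$$ and applies Bayes' rule.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# hypothetical (height, weight) samples for two classes
X = np.vstack([np.random.randn(100, 2) * [8, 10] + [178, 80],   # class 1
               np.random.randn(100, 2) * [7, 9] + [165, 65]])   # class 0
y = np.array([1] * 100 + [0] * 100)

disc = LogisticRegression(max_iter=1000).fit(X, y)  # discriminative: P(y|x)
gen = GaussianNB().fit(X, y)                        # generative: P(x|y)P(y)
print(disc.predict_proba([[172, 70]]), gen.predict_proba([[172, 70]]))
```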



#[fit]The virtual ATM[^3]



#[fit] PCA[^4] #[fit] unsupervised dim reduction from 332x137x3 to 50
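A minimal sketch of that reduction with scikit-learn, on a hypothetical stack of flattened 332x137x3 images:

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical image stack: 100 images, each flattened to 332*137*3 features
images = np.random.rand(100, 332 * 137 * 3)

pca = PCA(n_components=50)            # unsupervised reduction to 50 dimensions
reduced = pca.fit_transform(images)
print(reduced.shape, pca.explained_variance_ratio_.sum())
```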



#kNN



#[fit] BIAS and VARIANCE in kNN



#kNN #CROSS-VALIDATED
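A minimal sketch of cross-validating the number of neighbors $$k$$ with scikit-learn, on hypothetical data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# hypothetical two-class data
X = np.random.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 5-fold cross-validation over k; small k = high variance, large k = high bias
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": list(range(1, 31))}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```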



#[fit]ENSEMBLE LEARNING


#Combine multiple classifiers and vote.

Examples: Decision Trees $$\to$$ random forest, bagging, boosting
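A minimal sketch of one such ensemble, a random forest, on hypothetical data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# hypothetical data
X = np.random.randn(300, 5)
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# 100 decision trees, each fit on a bootstrap sample; predictions are votes
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:5]), forest.predict_proba(X[:5])[:, 1])
```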


#EVALUATING CLASSIFIERS


```
confusion_matrix(bestcv.predict(Xtest), ytest)

[[11, 0],
 [3, 4]]
```

checks=blue=1, dollars=red=0


#[fit]CLASSIFIER PROBABILITIES

  • classifiers output rankings or probabilities
  • ought to be well calibrated, or at least similarly ordered (a minimal sketch follows below)
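A minimal sketch of pulling per-class probabilities out of a scikit-learn classifier, on hypothetical data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical features and labels standing in for the checks/dollars data
X = np.random.randn(100, 3)
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]   # P(class 1 | x), usable for ranking or thresholding
print(np.round(probs[:5], 2))
```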



#CLASSIFICATION RISK

  • $$R_{g,\cal{D}}(x) = P(y_1 | x) \ell(g,y_1) + P(y_0 | x) \ell(g,y_0) $$
  • The usual loss is the 1-0 loss $$\ell = \mathbb{1}_{g \ne y}$$.
  • Thus, $$R_{g=y_1}(x) = P(y_0 |x)$$ and $$R_{g=y_0}(x) = P(y_1 |x)$$

       CHOOSE CLASS WITH LOWEST RISK

1 if $$R_1 \le R_0 \implies$$ 1 if $$P(0 | x) \le P(1 |x)$$.

       choose 1 if $$P(1|x) \ge 0.5$$ ! Intuitive!


#ASYMMETRIC RISK[^5]

want no checks misclassified as dollars, i.e., no false negatives: $$\ell_{10} \ne \ell_{01}$$.

Start with $$R_{g}(x) = P(1 | x) \ell(g,1) + P(0 | x) \ell(g,0) $$

$$R_1 = l_{11} p_1 + l_{10} p_0 = l_{10}p_0$$ $$R_0 = l_{01} p_1 + l_{00} p_0 = l_{01}p_1$$ (since $$l_{11} = l_{00} = 0$$)

choose 1 if $$R_1 < R_0$$


#ASYMMETRIC RISK

right, fit

i.e. $$r\,p_0 < p_1$$ where

$$r = \frac{l_{10}}{l_{01}} = \frac{l_{FP}}{l_{FN}}$$

Or $$P_1 > t$$ where $$t = \frac{r}{1+r}$$

```
confusion_matrix(ytest, bestcv2, r=10, Xtest)

[[5, 4],
 [0, 9]]
```
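A minimal sketch of applying this rule by thresholding predicted probabilities at $$t = r/(1+r)$$, on hypothetical data (here $$r < 1$$, since false negatives are the costlier error):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# hypothetical data standing in for the checks/dollars features
X = np.random.randn(200, 3)
y = (X[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X, y)

r = 0.1                     # r = l_FP / l_FN; false negatives cost 10x more
t = r / (1 + r)             # decision threshold from the rule above
pred = (clf.predict_proba(X)[:, 1] >= t).astype(int)
print(confusion_matrix(y, pred))   # fewer false negatives than with t = 0.5
```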

#[fit]ROC SPACE[^6]

$$TPR = \frac{TP}{OP} = \frac{TP}{TP+FN}.$$

$$FPR = \frac{FP}{ON} = \frac{FP}{FP+TN}$$
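A minimal sketch of computing ROC points from scores with scikit-learn, on hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# hypothetical true labels and classifier scores
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.3, 0.9, 0.45])

# one (FPR, TPR) point per threshold, plus the area under the curve
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(list(zip(np.round(fpr, 2), np.round(tpr, 2))), roc_auc_score(y_true, scores))
```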




#[fit]ROC CURVE


#ROC CURVE

  • Rank the test set by probability/score from highest to lowest
  • At the beginning nothing is predicted positive
  • Keep moving the threshold down
  • Compute the confusion matrix at each threshold



#[fit]COMPARING CLASSIFIERS

Telecom customer Churn data set from @YhatHQ[^7]



#ROC curves



#ASYMMETRIC CLASSES

  • A has large FP[^8]
  • B has large FN.
  • On asymmetric data sets, A will do very badly.
  • But is it so?



#EXPECTED VALUE FORMALISM

Can be used for risk or profit/utility (negative risk)

$$EP = p(1,1) \ell_{11} + p(0,1) \ell_{10} + p(0,0) \ell_{00} + p(1,0) \ell_{01}$$

$$EP = p_a(1)\,[TPR\,\ell_{11} + (1-TPR)\,\ell_{10}] + p_a(0)\,[(1-FPR)\,\ell_{00} + FPR\,\ell_{01}]$$

Fraction of the test set predicted to be positive, $$x = PP/N$$:

$$x = (TP+FP)/N = TPR\,p_o(1) + FPR\,p_o(0)$$
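A minimal sketch of the expected profit computed from a confusion matrix and a cost/benefit matrix, with hypothetical numbers:

```python
import numpy as np

# hypothetical confusion matrix: rows = true class (0, 1), cols = predicted class
conf = np.array([[80, 20],
                 [10, 40]])
N = conf.sum()

# hypothetical profit for (true 0, pred 0), (true 0, pred 1 = FP),
#                         (true 1, pred 0 = FN), (true 1, pred 1 = TP)
profit = np.array([[0.0, -1.0],
                   [-5.0, 10.0]])

EP = (conf / N * profit).sum()   # expected profit per instance
x = conf[:, 1].sum() / N         # fraction of the test set predicted positive
print(EP, x)
```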


#ASYMMETRIC CLASSES

$$r = \frac{l_{FP}}{l_{FN}}$$

Equicost lines with slope

$$\frac{r\,p(0)}{p(1)} = \frac{r\,p(-)}{p(+)}$$

Small $$r$$ penalizes FN.

For churn and cancer you don't want FN: an uncaught churner or cancer patient (positive = churn/cancer)




#Profit curve

  • Rank test set by prob/score from highest to lowest
  • Calculate the expected profit/utility for each confusion matrix ($$U$$)
  • Calculate fraction of test set predicted as positive ($$x$$)
  • plot $$U$$ against $$x$$ (a minimal sketch follows below)
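A minimal sketch of that procedure, with hypothetical labels, scores, and cost/benefit matrix:

```python
import numpy as np

# hypothetical labels, scores, and per-cell profit (rows = true, cols = predicted)
y = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
profit = np.array([[0.0, -1.0],
                   [-5.0, 10.0]])

order = np.argsort(-scores)            # rank from highest to lowest score
curve = []
for k in range(1, len(y) + 1):         # predict the top k as positive
    pred = np.zeros_like(y)
    pred[order[:k]] = 1
    conf = np.array([[np.sum((y == i) & (pred == j)) for j in (0, 1)]
                     for i in (0, 1)])
    curve.append((k / len(y), (conf / len(y) * profit).sum()))   # (x, U)
print(curve)
```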

#Finite budget[^8]


  • 100,000 customers, a $40,000 budget, $5 per customer
  • we can target 8000 customers
  • thus target top 8%
  • classifier 1 does better there, even though classifier 2 makes max profit

#[fit]WHERE TO GO FROM HERE?

  • Follow @YhatHQ, @kdnuggets etc
  • Read Provost, Foster; Fawcett, Tom. Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media.
  • check out Harvard's cs109: cs109.org
  • Ask lots of questions of your data science team
  • Follow your intuition

#THANKS!!

###@rahuldave

[^1]: image from code in http://bit.ly/1Azg29G

[^2]: Background from http://commons.wikimedia.org/wiki/File:Overfitting.svg

[^3]: Inspired by http://blog.yhathq.com/posts/image-classification-in-Python.html

[^4]: Diagram from http://stats.stackexchange.com/a/140579

[^5]: image of breast carcinoma from http://bit.ly/1QgqhBw

[^6]: this + next figure: Data Science for Business, Provost and Fawcett

[^7]: http://blog.yhathq.com/posts/predicting-customer-churn-with-sklearn.html

[^8]: figure from Data Science for Business, Provost and Fawcett