# Notes {-}
## What is Statistical Learning?
This chapter deals with
developing an **accurate** model
that can be used to **predict**
some value.
Notation:
- **Input variables**: $X_1, \cdots, X_p$
Also known as *predictors, features, independent variables*.
- **Output variable**: $Y$
Also known as *response or dependent variable*.
We assume there is some relationship between
$Y$ and $X = \left( X_1, \cdots, X_p \right)$,
which we write as
$$Y = f(X) + \epsilon,$$
where $\epsilon$ is a random **error term**, which
is **independent** of $X$ and has mean zero,
and $f$ represents the **systematic information**
that $X$ provides about $Y$.
```{r}
#| label: 02-prediction
#| echo: false
#| fig-cap: Income data set
#| out-width: 100%
knitr::include_graphics("images/02-prediction.jpg")
```
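As a toy illustration of this setup (not from the book; the $f$ below is an assumption for the demo), we can simulate data from a known $f$ plus mean-zero noise:

```{r}
#| label: 02-sim-f
# Sketch: simulate Y = f(X) + epsilon with an assumed "true" f,
# making the systematic part and the error term explicit.
set.seed(1)
f <- function(x) 2 + 3 * sin(x)          # assumed true f (hypothetical)
x <- runif(100, min = 0, max = 2 * pi)   # a single predictor
epsilon <- rnorm(100, mean = 0, sd = 1)  # error: mean zero, independent of x
y <- f(x) + epsilon
plot(x, y, col = "grey")
curve(f, from = 0, to = 2 * pi, add = TRUE, col = "red", lwd = 2)
```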
In essence, statistical learning deals with **different approaches to estimate $f$**.
### Why estimate $f$?
Two main reasons to estimate $f$:
#### Prediction
- Predict $Y$ using a set of inputs $X$.
- Representation: $\hat{Y} = \hat{f}(X)$,
where $\hat{f}$ represents our *estimate* for $f$,
and $\hat{Y}$ our *prediction* for $Y$.
- In this setting, $\hat{f}$ is often treated as a
**black box**, meaning we don't mind not knowing
the exact form of $\hat{f}$, as long as it generates
accurate predictions for $Y$.
- $\hat{Y}$'s accuracy depends on:
- **Reducible error**
- Due to $\hat{f}$ not being a perfect estimate for $f$.
- **Can be reduced** by using a proper statistical learning technique.
- **Irreducible error**
- Due to $\epsilon$ and its variability.
- $\epsilon$ is independent of $X$, so no matter
how well we estimate $f$, we can't reduce this error.
- The quantity $\epsilon$ may contain **unmeasured variables**
useful for predicting $Y$, or **unmeasurable variation**,
so no prediction model will be perfect.
- Mathematical form, after choosing predictors $X$ and an estimate
$\hat{f}$:
$$
E(Y - \hat{Y})^2 =
E(f(X) + \epsilon - \hat{f}(X))^2 =
\underbrace{[f(X) - \hat{f}(X)]^2}_{\text{reducible}} +
\underbrace{\text{Var}(\epsilon)}_{\text{irreducible}} \; .
$$
In practice, we almost never know how $\epsilon$'s
variability affects our model,
so, in this book, we will focus on techniques for estimating $f$.
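As a quick sanity check of this decomposition (a sketch with an assumed $f$, a deliberately imperfect $\hat{f}$, and a fixed $x_0$), simulation recovers the two terms:

```{r}
#| label: 02-error-decomposition
# Sketch: verify E(Y - Yhat)^2 = [f(x0) - fhat(x0)]^2 + Var(epsilon)
# at a fixed x0, using an assumed f and an imperfect estimate fhat.
set.seed(1)
f     <- function(x) 2 + 3 * x    # assumed true f (hypothetical)
fhat  <- function(x) 1 + 3.2 * x  # deliberately imperfect estimate
x0    <- 2
sigma <- 1.5                      # sd of epsilon, so Var(epsilon) = 2.25
y0 <- f(x0) + rnorm(1e6, sd = sigma)  # many draws of Y at x0
mean((y0 - fhat(x0))^2)               # simulated expected squared error
(f(x0) - fhat(x0))^2 + sigma^2        # reducible + irreducible
```

The two printed values should agree up to simulation error.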
#### Inference
In this case, we are interested in **understanding the association**
between $Y$ and $X_1, \cdots, X_p$.
- For example:
- *Which predictors are most associated with the response?*
- *What is the relationship between the response and each predictor?*
- *Can the relationship be summarized via a linear equation, or is it more complex?*
In this setting, the exact form of $\hat{f}$ needs to be known.
Linear models allow for easier interpretability but
can lack prediction accuracy, while
non-linear models can be more accurate but less interpretable.
### How do we estimate $f$?
- First, let's agree on some conventions:
- $n$: Number of observations.
- $x_{ij}$: Value of the $j\text{th}$ predictor for the $i\text{th}$ observation.
- $y_i$: Response variable for the $i\text{th}$ observation.
- **Training data**:
- Set of observations.
- Used to estimate $f$.
- $\left\{ (x_1, y_1), \cdots, (x_n, y_n) \right\}$,
where $x_i = (x_{i1}, \cdots, x_{ip})^T$.
- Goal: Find a function $\hat{f}$ such that $Y \approx \hat{f}(X)$
for any observation $(X, Y)$.
- Most statistical methods for achieving this goal can be
characterized as either **parametric** or **non-parametric**.
#### Parametric methods
- Steps:
1. Make an assumption about the **form** of $f$. \
It could be linear
($f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$, with
parameters $\beta_0, \cdots, \beta_p$ to be estimated) or not.
1. After a model has been selected,
we need a procedure to **fit** the model using the training data. \
The most common such fitting procedure is
**(ordinary) least squares** (a sketch follows this list).
- Via these steps, the problem of estimating $f$ has been reduced
to a problem of estimating a **set of parameters**.
- We can make the models more **flexible** by considering a
greater number of parameters, but this can lead to **overfitting the data**, that is, following the errors/noise too closely,
which will not yield accurate estimates of the response for
observations outside of the original training data.
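A minimal sketch of the two steps in R (simulated data, not a data set from the book): assume a linear form, then fit it by ordinary least squares with `lm()`.

```{r}
#| label: 02-ols-sketch
# Sketch of the parametric approach on simulated data.
# Step 1: assume a linear form f(X) = beta0 + beta1*X1 + beta2*X2.
# Step 2: estimate the betas by ordinary least squares via lm().
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)  # data from an assumed linear f
fit <- lm(y ~ x1 + x2)
coef(fit)  # estimates of beta0, beta1, beta2
```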
#### Non-parametric methods
- No assumptions about the form of $f$ are made.
- Instead, we seek an estimate of $f$
that gets as close to the data points as possible (see the sketch below).
- Has the potential to fit a wider range of possible forms for $f$.
- Typically requires a very large number of observations
(compared to the parametric approach) in order to accurately estimate $f$.
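For contrast, a non-parametric sketch (simulated data, using base R's `smooth.spline()`): no functional form is assumed, the fitted curve simply tracks the data.

```{r}
#| label: 02-nonparam-sketch
# Sketch: a non-parametric estimate of f via a smoothing spline;
# no assumption is made about the form of f.
set.seed(1)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.3)
fit <- smooth.spline(x, y)
plot(x, y, col = "grey")
lines(fit, col = "red", lwd = 2)
```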
### The trade-off between prediction accuracy and model interpretability
We've seen that parametric models are usually restrictive,
while non-parametric models are flexible. However:
- **Restrictive** models are usually more **interpretable**,
so they are useful for inference.
- **Flexible** models can be difficult to interpret, due to
the complexity of $\hat{f}$.
Despite this, we will often obtain **more accurate predictions**
using a **less flexible method**, due to the potential for
*overfitting the data* in highly flexible models.
### Supervised vs Unsupervised Learning
In **supervised learning**, we wish to fit a model
that relates inputs/predictors to some output.
In **unsupervised learning**, we lack a response variable
to predict. Instead, we seek to understand the relationships
between the variables or between the observations.
There are instances where a mix of such methods is required
(**semi-supervised learning problems**), but this topic
will not be covered in this book.
### Regression vs Classification problems
- If the **response is** ...
- **Quantitative**, then it's a regression problem.
- **Categorical**, then it's a classification problem.
- Most of the methods covered in this book can be applied
regardless of the predictor variable types, but categorical
predictors will require some pre-processing (see the sketch below).
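As a small illustration of that pre-processing (made-up data), `model.matrix()` expands a factor into 0/1 dummy variables:

```{r}
#| label: 02-dummy-sketch
# Sketch: a categorical predictor becomes dummy variables
# in the model matrix (one level is absorbed by the intercept).
region <- factor(c("north", "south", "west", "south"))  # hypothetical predictor
model.matrix(~ region)
```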
## Assessing model accuracy
- There is no single **best method** in statistical learning;
a method's efficacy can depend on the data set.
- For a specific data set, **how do we select the best statistical learning approach**?
### Measuring the quality of fit
- The performance of a statistical learning method can be evaluated
by comparing the model's predictions with the true responses.
- Most commonly used measure for this:
- **Mean squared error** (see the sketch after this list):
- $\text{MSE} = \dfrac{1}{n} \displaystyle{\sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2}$
- Small MSE means that the predicted and the true responses are very close.
- We want the model to accurately predict **unseen data** (testing data),
not so much the training data, where the response is already known.
- The *best* model will be the one which produces the **lowest test MSE**,
not the lowest training MSE.
- It's **not true** that the model with the lowest training MSE will also
have the lowest test MSE.
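A sketch of computing both quantities (simulated data, a hypothetical random train/test split):

```{r}
#| label: 02-mse-sketch
# Sketch: training vs test MSE for one fitted model,
# using a random split of simulated data.
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
train <- sample(200, 100)
fit <- lm(y ~ poly(x, 3), subset = train)
mse <- function(obs, pred) mean((obs - pred)^2)
mse(y[train],  predict(fit, data.frame(x = x[train])))   # training MSE
mse(y[-train], predict(fit, data.frame(x = x[-train])))  # test MSE
```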
```{r}
#| label: 02-train-test-MSE
#| echo: false
#| fig-cap: Training MSE vs Test MSE
#| out-width: 100%
knitr::include_graphics("images/02-train-test-MSE.jpg")
```
- **Fundamental property**: For any data set and any statistical learning method used, as the flexibility of the statistical learning method increases:
- The training MSE decreases monotonically.
- The test MSE graph has a *U*-shape.
> As model flexibility increases, training MSE will decrease,
> but the test MSE **may not**.
- Small training MSE but big test MSE implies having overfitted the data.
- Regardless of overfitting or not, we almost always expect
$\text{training MSE} < \text{test MSE}$, because most statistical learning methods seek to minimize the training MSE.
- Estimating the test MSE is difficult, usually because of a lack of test data.
Later in this book, we'll discuss approaches to estimate the
**minimum point** of the test MSE curve (a small simulation follows).
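The *U*-shape can be reproduced with a small simulation (a sketch, with polynomial degree standing in for flexibility):

```{r}
#| label: 02-flexibility-sketch
# Sketch: as flexibility (polynomial degree) grows, training MSE
# keeps falling while test MSE traces a U-shape.
set.seed(1)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.4)
train <- sample(300, 150)
degrees <- 1:10
mses <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), subset = train)
  c(train = mean((y[train]  - predict(fit, data.frame(x = x[train])))^2),
    test  = mean((y[-train] - predict(fit, data.frame(x = x[-train])))^2))
})
matplot(degrees, t(mses), type = "l", lty = 1,
        xlab = "Flexibility (polynomial degree)", ylab = "MSE")
legend("topright", legend = rownames(mses), lty = 1, col = 1:2)
```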
### The Bias-Variance Trade-off
- **Definition**: The **expected test MSE** at $x_0$,
$E(y_0 - \hat{f}(x_0))^2$, refers to
the average test MSE that we would obtain after repeatedly estimating
$f$ using a large number of training sets, and testing each estimate
at $x_0$.
- **Definition**: The **variance** of a statistical learning method which
produces an estimate $\hat{f}$ refers to how much the estimated function
changes across different training sets.
- **Definition**: **Bias** refers to the error introduced by approximating
a possibly very complicated real-life relationship by a much simpler model (how $f$ and the possible $\hat{f}$ *differ*).
- As a general rule, the **more flexible** a statistical method,
the **higher its variance** and **lower its bias**.
- For any given value $x_0$, the following can be proved:
$$
E(y_0 - \hat{f}(x_0))^2 = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\epsilon)
$$
- Since variance and squared bias are non-negative, the previous equation
implies that, to **minimize the expected test error**, we need a
statistical learning method which achieves **low variance** and **low bias**.
- The tradeoff:
- *Extremely low bias but high variance*: For example, a curve that passes through every single point in the training data.
- *Extremely low variance but high bias*: For example, fit a
horizontal line to the data.
- > The challenge lies in finding
> a method for which both the variance and the squared bias are low.
- In a real-life situation, $f$ is usually unknown, so it's not possible
to explicitly compute the test MSE, bias, or variance of a statistical method.
- The test MSE can be estimated using **cross-validation**,
which we'll discuss later in this book (a simulation check of the decomposition follows).
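In simulation, where $f$ *is* known, the decomposition can be checked directly (a sketch: re-estimate $\hat{f}$ on many fresh training sets and evaluate at a fixed $x_0$):

```{r}
#| label: 02-bias-variance-sketch
# Sketch: estimate Var(fhat(x0)), Bias(fhat(x0))^2 and Var(epsilon)
# by re-estimating fhat on many simulated training sets.
set.seed(1)
f <- function(x) sin(2 * pi * x)  # assumed true f (hypothetical)
sigma <- 0.3
x0 <- 0.5
fhat_x0 <- replicate(2000, {
  x <- runif(50)
  y <- f(x) + rnorm(50, sd = sigma)
  predict(lm(y ~ poly(x, 3)), data.frame(x = x0))
})
var(fhat_x0)               # variance of the method at x0
(mean(fhat_x0) - f(x0))^2  # squared bias at x0
sigma^2                    # irreducible error, Var(epsilon)
```

The three printed quantities should sum (approximately) to the expected test MSE at $x_0$.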
### The Classification setting
Let's see how the concepts just discussed change
when the response is a categorical variable.
The most common approach for quantifying the accuracy
of our estimate $\hat{f}$ is the **training error rate**,
the proportion of mistakes made by applying $\hat{f}$
to the training observations:
$$
\dfrac{1}{n}\displaystyle{ \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)} \; ,
$$
where $I$ is an indicator variable: $1$ when $y_i \neq \hat{y}_i$, and $0$ otherwise.
- The **test error rate** is defined as
$\text{Average}(I(y_i \neq \hat{y}_i))$, where the average
is computed by comparing the predictions $\hat{y}_i$ with the
true responses $y_i$ on the test observations.
- A **good classifier** is one for which the test error is smallest.
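Computing an error rate in R is a one-liner (a sketch with hypothetical label vectors):

```{r}
#| label: 02-error-rate-sketch
# Sketch: the error rate is the mean of the indicator I(y_i != yhat_i).
y    <- c("yes", "no", "yes", "yes", "no")  # hypothetical true labels
yhat <- c("yes", "yes", "yes", "no", "no")  # hypothetical predictions
mean(y != yhat)  # 2 mistakes out of 5 = 0.4
```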