---
title: "6.8 Exercises"
output:
github_document:
md_extensions: -fancy_lists+startnum
html_notebook:
md_extensions: -fancy_lists+startnum
---
```{r setup, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse)
library(leaps) # for feature selection
library(glmnet) # for lasso and ridge
library(ISLR) # for datasets
library(modelr) # for tidy manipulation of models
library(pls) # for Principal Components regression
```
## Conceptual
(1) We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain $p + 1$
models, containing $0, 1, 2, ..., p$ predictors. Explain your answers:
(a) Which of the three models with k predictors has the smallest training RSS?
A: Best subset selection should have the smallest training RSS (or at worst tie with the other methods), because for each size $k$ it evaluates all $\binom{p}{k}$ combinations of predictors. Forward and backward stepwise instead examine only a subset of those combinations ($p-k+1$ candidates at the step that produces the $k$-variable model in forward selection, and $k+1$ candidates in backward selection).
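As a small illustration of those counts (my own sketch, not part of the exercise), for a hypothetical $p = 10$ the number of candidate models examined at the step that produces the $k$-variable model is shown below (the backward count only applies for $k < p$, since backward selection starts from the full model):
```{r}
# Illustrative only: candidate models examined at the step that yields the
# k-variable model, for a hypothetical p = 10 predictors
p <- 10
tibble(
  k = 1:p,
  best_subset = choose(p, k),  # all combinations of k predictors
  forward = p - k + 1,         # add one of the p - (k - 1) remaining predictors
  backward = k + 1             # drop one of the k + 1 predictors of the larger model
)
```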
(b) Which of the three models with k predictors has the smallest test RSS?
A: There is no guarantee here. Best subset selection checks more combinations of predictors, so it may well do best on test data too, but the larger search also increases the risk of overfitting the training set; depending on how the test sample falls, any of the three methods could end up with the smallest test RSS.
(c) True or False:
i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the $(k+1)$-variable model identified by forward stepwise selection.
TRUE
ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the $(k +1)$-variable model identified by backward stepwise selection.
TRUE
iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the $(k +1)$-variable model identified by forward stepwise selection.
FALSE
iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the $(k+1)$-variable model identified by backward stepwise selection.
FALSE
v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the $(k +1)$-variable model identified by best subset selection.
FALSE
(2) For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
A: (iii) is correct. The lasso increases bias by imposing a "budget restriction" on the coefficients that shrinks their values (and can even set some of them exactly to zero). This extra bias is acceptable to the extent that it buys a larger decrease in variance (due to less overfitting to the training sample).
(b) Repeat (a) for ridge regression relative to least squares.
A: (iii) is correct, for the same reasons stated for the lasso (the only difference between the lasso and ridge regression is the "shape" of the constraint on the coefficients).
(c) Repeat (a) for non-linear methods relative to least squares.
A: (ii) is correct. Non-linear methods provide more flexibility than least squares, so they can reduce bias at the expense of higher variance. If the true relationship in the data is non-linear, we can expect an improvement in prediction accuracy (the reduction in bias will exceed the increase in variance).
(3) Suppose we estimate the regression coefficients in a linear regression model by minimizing
![](ch6_files/chapter6_1.png)
for a particular value of s. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.
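For reference, in case the image above doesn't render: as I read it, this is the constrained ("budget") form,
$$\min_{\beta}\ \sum_{i=1}^n\Big(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^p|\beta_j|\le s,$$
and the answers below only rely on $s$ being a budget on the size of the coefficients, so they hold whether the constraint is on $\sum_j|\beta_j|$ or on $\sum_j\beta_j^2$.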
(a) As we increase s from 0, the training RSS will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant
A: (iv) Steadily decrease. The flexibility of the model increases as $s$ increases, and higher flexibility always yields a lower (or equal) training RSS.
(b) Repeat (a) for test RSS.
A: (ii) The test RSS will decrease as long as the increase in variance (due to higher flexibility) is smaller than the reduction in bias. At some point (which depends on the true relationships in the data) the trend will invert, because the reduction in bias will no longer offset the increase in variance.
(c) Repeat (a) for variance.
A: (iii) Variance always increases as there is more flexibility.
(d) Repeat (a) for (squared) bias.
A: (iv) Bias should decrease, or at least remain constant, as the flexibility increases.
(e) Repeat (a) for the irreducible error.
A: (v). The irreducible error doesn't depend on the flexibility of the model.
(4) Suppose we estimate the regression coefficients in a linear regression model by minimizing
![](ch6_files/chapter6_2.png)
for a particular value of λ. For parts (a) through (e), indicate which
of i. through v. is correct. Justify your answer.
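For reference, in case the image above doesn't render: as I read it, this is the penalized (ridge-style) form,
$$\min_{\beta}\ \sum_{i=1}^n\Big(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\Big)^2 + \lambda\sum_{j=1}^p\beta_j^2,$$
and the answers below only rely on $\lambda$ penalizing large coefficients, so they would hold for an $\ell_1$ penalty as well.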
(a) As we increase λ from 0, the training RSS will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.
A: (iii) An increase in $\lambda$ implies a decrease in flexibility (with $\lambda = 0$ the model is just OLS, and as $\lambda \to \infty$ it collapses to the intercept-only regression). So the training RSS will increase as $\lambda$ increases and flexibility decreases.
(b) Repeat (a) for test RSS.
A: (ii) Lower flexibility has the benefit of reducing variance, at the expense of increasing bias. We should expect the reduction in variance to offset the increase in bias for a range, reach a minimum in total test RSS, and then see the trend reversed.
(c) Repeat (a) for variance.
A: (iv) Variance always decreases as flexibility decreases.
(d) Repeat (a) for (squared) bias.
A: (iii) As $\lambda$ gets higher, the model is more constrained, so the error due to bias increases.
(e) Repeat (a) for the irreducible error.
A: (v). The irreducible error doesn't depend on the flexibility of the model.
(5) It is well-known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give quite different coefficient values to correlated variables. We will now explore this property in a very simple setting.
Suppose that $n = 2$, $p = 2$, $x_{11} = x_{12}$, $x_{21} = x_{22}$. Furthermore, suppose that $y_1 + y_2 = 0$ and $x_{11} + x_{21} = 0$ and $x_{12} + x_{22} = 0$, so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: $\hat{\beta}_0 = 0$.
(a) Write out the ridge regression optimization problem in this setting.
![](ch6_files/chapter6_3.png)
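In text form (my transcription of what the image presumably shows, using $\hat{\beta}_0 = 0$):
$$\min_{\beta_1,\beta_2}\ \Big\{(y_1-\beta_1x_{11}-\beta_2x_{12})^2+(y_2-\beta_1x_{21}-\beta_2x_{22})^2+\lambda\big(\beta_1^2+\beta_2^2\big)\Big\}$$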
(b) Argue that in this setting, the ridge coefficient estimates satisfy $\hat{β}_1 = \hat{β}_2$.
A: The RSS part of the expression depends only on the *sum* of the two coefficients (not on their individual values), because the two predictors are identical. For any given value of that sum, there are infinitely many coefficient pairs that produce it; since there are two coefficients, those pairs form a line in the plane with slope $-1$ (a minus 45° line).
The regularization part of the expression then asks for the pair on that line whose sum of squared coefficients is smallest. Geometrically, this is the point where the line is tangent to the smallest circle centered at $(0, 0)$, and at that point both coefficients are equal.
![](ch6_files/chapter6_4.png)
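An algebraic way to see the same result (a sketch of the geometric argument above): whatever value $c = \beta_1 + \beta_2$ the optimum uses, the ridge penalty decomposes as
$$\beta_1^2+\beta_2^2 = \frac{(\beta_1+\beta_2)^2}{2}+\frac{(\beta_1-\beta_2)^2}{2} = \frac{c^2}{2}+\frac{(\beta_1-\beta_2)^2}{2},$$
so among all pairs with that sum (which all give the same RSS), the objective is smallest when $\beta_1 = \beta_2$.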
(c) Write out the lasso optimization problem in this setting.
![](ch6_files/chapter6_5.png)
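In text form (again, my transcription of what the image presumably shows):
$$\min_{\beta_1,\beta_2}\ \Big\{(y_1-\beta_1x_{11}-\beta_2x_{12})^2+(y_2-\beta_1x_{21}-\beta_2x_{22})^2+\lambda\big(|\beta_1|+|\beta_2|\big)\Big\}$$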
(d) Argue that in this setting, the lasso coefficients are not unique — in other words, there are many possible solutions to the optimization problem in (c). Describe these solutions.
A: As in the previous case, the RSS part of the expression requires the coefficient pairs to lie on a line with slope $-1$. But the regularization part is different: it asks for the pairs on that line whose sum of absolute values is smallest.
Geometrically, this is like having a square-shaped constraint region, centered at zero and rotated 45°. In this setting, the intersection of the line with the boundary of that region is not a single point but a whole segment of the line, which contains infinitely many solution pairs.
![](ch6_files/chapter6_6.png)
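To make the solution set explicit (a sketch, assuming the sum of coefficients at the optimum is some value $c \ge 0$): every pair on the segment
$$\{(\beta_1,\beta_2):\ \beta_1+\beta_2=c,\ \beta_1\ge 0,\ \beta_2\ge 0\}$$
gives the same RSS and the same penalty $|\beta_1|+|\beta_2| = c$, whereas any pair on the line with one negative coordinate has $|\beta_1|+|\beta_2| > c$; so the solutions are exactly the pairs on this segment (with the inequalities reversed if $c < 0$).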
(6) We will now explore (6.12) and (6.13) further.
(a) Consider (6.12) with $p = 1$. For some choice of $y_1$ and $λ > 0$, plot (6.12) as a function of $β_1$. Your plot should confirm that (6.12) is solved by (6.14).
A: Here is the plot for $y_1 = 5$ and $\lambda = 1$.
![](ch6_files/chapter6_7.png)
We see that at the minimum $\beta_1$ equals 2.5 rather than 5. This confirms the formula given in (6.14): the ridge coefficient equals the OLS coefficient divided by $(1 + \lambda)$, which is 2 in this particular case.
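A minimal sketch (my own, not from the exercise) of how a plot like the one above could be produced, assuming the same $y_1 = 5$ and $\lambda = 1$:
```{r}
# Sketch of (6.12) with p = 1: (y1 - beta1)^2 + lambda * beta1^2.
# The red line marks the ridge solution y1 / (1 + lambda) from (6.14).
y1 <- 5
lambda_ridge <- 1
tibble(beta1 = seq(-2, 8, by = 0.01)) %>%
  mutate(objective = (y1 - beta1)^2 + lambda_ridge * beta1^2) %>%
  ggplot(aes(beta1, objective)) +
  geom_line() +
  geom_vline(xintercept = y1 / (1 + lambda_ridge), color = "red")
```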
(b) Consider (6.13) with $p = 1$. For some choice of $y_1$ and $λ> 0$, plot (6.13) as a function of $β_1$. Your plot should confirm that
(6.13) is solved by (6.15).
![](ch6_files/chapter6_7b.png)
We see that at the minimum $\beta_1$ equals 4.5 instead of 5 (the actual value of $y_1$). This matches the formula given in (6.15): when $y_1$ is greater than $\lambda/2$ (as it is here), the lasso coefficient equals $y_1 - \lambda/2$, i.e. $5 - 1/2$ in this particular case.
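And a similar sketch for the lasso criterion, again assuming $y_1 = 5$ and $\lambda = 1$:
```{r}
# Sketch of (6.13) with p = 1: (y1 - beta1)^2 + lambda * |beta1|.
# The red line marks the lasso solution y1 - lambda / 2 from (6.15),
# which applies here because y1 > lambda / 2.
y1 <- 5
lambda_lasso <- 1
tibble(beta1 = seq(-2, 8, by = 0.01)) %>%
  mutate(objective = (y1 - beta1)^2 + lambda_lasso * abs(beta1)) %>%
  ggplot(aes(beta1, objective)) +
  geom_line() +
  geom_vline(xintercept = y1 - lambda_lasso / 2, color = "red")
```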
(7) We will now derive the Bayesian connection to the lasso and ridge regression discussed in Section 6.2.2.
Skipped. Good solution here: https://blog.princehonest.com/stat-learning/ch6/7.html
## Applied
(8) In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.
(a) Use the `rnorm()` function to generate a predictor $X$ of length
$n = 100$, as well as a noise vector $\epsilon$ of length $n = 100$.
```{r}
set.seed(42)
x <- rnorm(100)
noise <- rnorm(100)
```
(b) Generate a response vector $Y$ of length $n = 100$ according to the model $Y = β_0 + β_1X + β_2X^2 + β_3X^3 + \epsilon$, where $β_0$, $β_1$, $β_2$, and $β_3$ are constants of your choice:
```{r}
y <- 1 + 2*x + 0.4*x^2 + 0.17*x^3 + noise
```
(c) Use the `regsubsets()` function to perform best subset selection in order to choose the best model containing the predictors $X,X^2,...,X^{10}$. What is the best model obtained according to Cp, BIC,and adjusted $R^2$? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained.
First we create the dataframe:
```{r}
df <- tibble(
y = y,
x1 = x,
x2 = x^2,
x3 = x^3,
x4 = x^4,
x5 = x^5,
x6 = x^6,
x7 = x^7,
x8 = x^8,
x9 = x^9,
x10 = x^10
)
df
```
```{r}
# Note: regsubsets() uses nvmax = 8 by default, so models with up to 8 predictors are considered
bestsubset <- regsubsets(x = as.matrix(select(df, -y)),
                         y = as.matrix(select(df, y)))
summary(bestsubset)
```
Now, for each number of predictors, we have the best model. Let's compare the models with different numbers of predictors:
```{r}
summary(bestsubset)[["adjr2"]] %>% plot()
```
```{r}
summary(bestsubset)[["adjr2"]] %>% which.max()
```
```{r}
summary(bestsubset)[["cp"]] %>% plot()
```
```{r}
summary(bestsubset)[["cp"]] %>% which.min()
```
```{r}
summary(bestsubset)[["bic"]] %>% plot()
```
```{r}
summary(bestsubset)[["bic"]] %>% which.min()
```
Based on adjusted R² and Cp, the best model is the one with 7 variables; based on BIC, it is the one with 5. Both conclusions overshoot, because we generated the data using a model with only 3 variables.
The coefficients for the "best" models:
```{r}
bestsubset %>% coef(id = 5)
```
```{r}
bestsubset %>% coef(id = 7)
```
(d) Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the
results in (c)?
First we do forward selection:
```{r}
forwardselection <- regsubsets(x = as.matrix(select(df, -y)),
y = as.matrix(select(df, y)),
method = "forward")
summary(forwardselection)
```
The best models, based on forward selection and chosen by Adj-R2, Cp, and BIC, are as follows:
```{r}
forwardmodels <-
tibble(
metric = c("adjr2", "cp", "bic"),
best_model = c(
summary(forwardselection)[["adjr2"]]%>% which.max(),
summary(forwardselection)[["cp"]] %>% which.min(),
summary(forwardselection)[["bic"]] %>% which.min()
)
)
forwardmodels
```
Now with backward selection:
```{r}
backwardselection <- regsubsets(x = as.matrix(select(df,-y)),
y = as.matrix(select(df, y)),
method = "backward")
summary(backwardselection)
```
The best models, based on backward selection and chosen by Adj-R2, Cp, and BIC, are as follows:
```{r}
backwardmodels <-
tibble(
metric = c("adjr2", "cp", "bic"),
best_model = c(
summary(backwardselection)[["adjr2"]]%>% which.max(),
summary(backwardselection)[["cp"]] %>% which.min(),
summary(backwardselection)[["bic"]] %>% which.min()
)
)
backwardmodels
```
Unfortunately, none of the feature selection methods was able to find the "true model" used to generate this data.
(e) Now fit a lasso model to the simulated data, again using $X,X^2, ...,X^{10}$ as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.
Using cross validation to select the optimal value of λ.
```{r}
cv_lasso <-
cv.glmnet(
x = as.matrix(select(df, -y)),
y = as.matrix(select(df, y)),
alpha = 1,
nfolds = 10
)
cv_lasso[["lambda.min"]]
```
Plot of CV error as function of λ:
```{r}
error_vs_lambda <-
tibble(
cvm = cv_lasso[["cvm"]],
lambdas = cv_lasso[["lambda"]]
)
ggplot(data = error_vs_lambda) +
geom_line(aes(lambdas, cvm)) +
geom_vline(xintercept = cv_lasso[["lambda.min"]], color = "red") +
geom_label(x = cv_lasso[["lambda.min"]],
y = 7,
label = round(cv_lasso[["lambda.min"]], 2),
color = "red") +
labs(y = "Mean of CV error",
title = "Choosing lambda in lasso regression")
```
Obtaining coefficients with the optimal lambda:
```{r}
glmnet(
x = as.matrix(select(df,-y)),
y = as.matrix(select(df, y)),
alpha = 1,
lambda = cv_lasso[["lambda.min"]]
) %>% coef()
```
We see that the lasso does a much better job of selecting the features that are truly part of the data-generating process. Also, the coefficient estimates are very close to the true ones (2, 0.4 and 0.17).
(f) Now generate a response vector Y according to the model $Y = β_0 + β_7X^7 + \epsilon$, and perform best subset selection and the lasso. Discuss the results obtained.
```{r}
df2 <- df %>%
mutate(y = 1 + 2*x7 + noise)
```
```{r}
bestsubset2 <- regsubsets(
x = as.matrix(select(df2, -y)),
y = as.matrix(select(df2, y))
)
bestmodels2 <-
tibble(
metric = c("adjr2", "cp", "bic"),
best_model = c(
summary(bestsubset2)[["adjr2"]]%>% which.max(),
summary(bestsubset2)[["cp"]] %>% which.min(),
summary(bestsubset2)[["bic"]] %>% which.min()
)
)
bestmodels2
```
```{r}
summary(bestsubset2)
```
Now we do feature selection using the lasso:
```{r}
cv_lasso2 <-
cv.glmnet(
x = as.matrix(select(df2, -y)),
y = as.matrix(select(df2, y)),
alpha = 1,
nfolds = 100
)
glmnet(
x = as.matrix(select(df2,-y)),
y = as.matrix(select(df2, y)),
alpha = 1,
lambda = cv_lasso2[["lambda.min"]]
) %>% coef()
```
Again, the lasso comes closer to choosing the features that are part of the true model. However, here it incorrectly picks $X^9$ as a predictor, and it also estimates an intercept somewhat off the true one.
(9) In this exercise, we will predict the number of applications received using the other variables in the `College` data set.
(a) Split the data set into a training set and a test set.
```{r}
college_train <- ISLR::College %>%
as_tibble(rownames = "name") %>%
sample_frac(size = 0.5)
college_test <- ISLR::College %>%
as_tibble(rownames = "name") %>%
anti_join(college_train, by = "name")
```
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
```{r}
lm_model <- college_train %>%
select(-name) %>%
lm(Apps ~ ., data = .)
college_test %>%
add_predictions(model = lm_model, var = "Apps_pred") %>%
select(Apps, Apps_pred, everything()) %>%
summarise(test_mse = mean((Apps_pred-Apps)^2))
```
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
```{r}
grid <- c(10^seq(10, -10, length = 300),
0)
best_lambda <-
cv.glmnet(
x = model.matrix(Apps ~ . - name, college_train),
y = as.matrix(select(college_train, Apps)),
nfold = 300,
alpha = 0,
lambda = grid
)[["lambda.min"]]
ridge_college <-
glmnet(
x = model.matrix(Apps ~ . - name, college_train),
y = college_train[["Apps"]],
alpha = 0,
lambda = best_lambda,
thresh =1e-12
)
pred_apps_ridge <-
predict.glmnet(ridge_college,
newx = model.matrix(Apps ~ . - name, college_test),
s = best_lambda)
mean((pred_apps_ridge - college_test[["Apps"]])^2)
```
The ridge test MSE turns out to be slightly higher than the OLS test MSE. One possible explanation is that the cross-validated λ ends up very small, so the ridge fit is almost identical to the OLS fit and the small gap is mostly noise from this particular train/test split.
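As a quick, illustrative check of that explanation, we can simply print the selected lambda (`best_lambda` from the chunk above):
```{r}
# Illustrative check: if the selected lambda is tiny relative to the scale of Apps,
# the ridge fit is essentially the OLS fit
best_lambda
```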
The coefficients
```{r}
predict(ridge_college,
type = 'coefficients',
s = best_lambda)
```
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
```{r}
best_lambda_lasso <-
cv.glmnet(
x = model.matrix(Apps ~ . - name, college_train),
y = as.matrix(select(college_train, Apps)),
nfold = 300,
alpha = 1,
lambda = grid
)[["lambda.min"]]
lasso_college <-
glmnet(
x = model.matrix(Apps ~ . - name, college_train),
y = as.matrix(select(college_train, Apps)),
alpha = 1,
lambda = best_lambda_lasso,
thresh =1e-12
)
pred_apps_lasso <-
predict.glmnet(lasso_college,
newx = model.matrix(Apps ~ . - name, college_test),
s = best_lambda_lasso)
mean((pred_apps_lasso - college_test[["Apps"]])^2)
```
The lasso test MSE is lower than both the OLS and the ridge test MSE :)
The coefficients (using the whole dataset)
```{r}
full_lasso_college <-
glmnet(
x = model.matrix(Apps ~ ., College),
y = as.matrix(select(College, Apps)),
alpha = 1,
lambda = best_lambda_lasso,
thresh =1e-12
)
predict(full_lasso_college, type = "coefficients",
s = best_lambda_lasso)
```
We see that `perc.alumni` was dropped by the lasso (its coefficient is set to zero).
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value
of M selected by cross-validation.
```{r}
pcr_college <-
pcr(Apps ~ . - name,
data = college_train,
scale = TRUE,
validation = "CV")
summary(pcr_college)
```
We see that the lowest CV error is achieved when we use all the available components (M = 17).
```{r}
validationplot(pcr_college,val.type="MSEP")
```
The lowest CV RMSE occurs at M = 17, but I will pick M = 16, since its RMSE is nearly identical while using one fewer component.
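For a numeric view of the same CV curve (my own addition, to back up the choice of M = 16), `pls::RMSEP` reports the cross-validated RMSEP for each number of components:
```{r}
# CV and adjusted-CV RMSEP by number of components
RMSEP(pcr_college)
```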
Now let's check the test MSE:
```{r}
apps_pred_pcr <-
predict(pcr_college, newdata = college_test, ncomp = 16)
mean((apps_pred_pcr - college_test[["Apps"]])^2)
```
The test MSE is lower than in Lasso, Ridge and linear regression.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
```{r}
pls_college <-
plsr(Apps ~ . - name,
data = college_train,
scale = TRUE,
validation = "CV")
summary(pls_college)
```
```{r}
validationplot(pls_college,val.type="MSEP")
```
The lowest CV error is achieved with M = 10.
```{r}
apps_pred_pls <-
predict(pls_college, newdata = college_test, ncomp = 10)
mean((apps_pred_pls - college_test[["Apps"]])^2)
```
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
ANSWER: PCR appears to be the method that predicts the number of college applications most accurately on the test data. However, the test MSEs are all roughly similar, and none of the methods offers a big improvement over plain linear regression. This suggests that the model's error comes mostly from bias rather than variance, so shrinkage and dimension-reduction methods cannot deliver large gains here.
The R-squared on the test data (for the PCR predictions) is about 93%, which is very high:
```{r}
avg_apps <- mean(college_test[["Apps"]])
1 - mean((college_test[["Apps"]] - apps_pred_pcr)^2)/mean((college_test[["Apps"]] - avg_apps)^2)
```
(10) We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.
(a) Generate a data set with $p =20$ features, $n = 1000$ observations, and an associated quantitative response vector generated according to the model $Y = Xβ + \epsilon$, where $β$ has some elements that are exactly equal to zero.
```{r}
set.seed(42)
df_sim <-
tibble(.rows = 1000)
for (i in 1:20) {
varname <- str_c("X", i)
df_sim[[varname]] <- rnorm(1000)
}
noise <- rnorm(1000)
m_sim <- as.matrix(df_sim)
# Declare coefficients
betas <- rep(c(1,2,3,0), length.out = 20) %>% as.matrix()
y <-
m_sim %*% betas + noise
df_sim <- df_sim %>%
mutate(y = y,
id = row_number())
```
(b) Split your data set into a training set containing 100 observations and a test set containing 900 observations
```{r}
sim_training <- df_sim %>%
sample_n(size = 100)
sim_test <- df_sim %>%
anti_join(sim_training, by = "id")
```
(c) Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.
```{r}
sim_bestsubset <-
regsubsets(y ~ . -id, data = sim_training, nvmax = 20)
summary(sim_bestsubset)
```
```{r}
# Training MSE = RSS / n, where n is the number of training observations (100)
sim_best_rss <- summary(sim_bestsubset)[["rss"]]
sim_best_mse <- sim_best_rss / nrow(sim_training)
plot(sim_best_mse)
```
```{r}
which.min(sim_best_mse)
```
The model with the lowest training MSE is the one with the most variables (p = 20).
(e) For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.
```{r}
# Model matrix for test data
sim_test_matrix <-
sim_test %>% model.matrix(y ~ . - id, data = .)
# Create a function for obtaining the test MSE for each best model
mse_for_bestsubset <- function(n_vars) {
# Extract coefficients for each model size
coefs_i <- sim_bestsubset %>% coef(id = n_vars)
# Compute Y_predicted
pred_i <- sim_test_matrix[, names(coefs_i)] %*% coefs_i
# Obtain MSE
mse_i <- mean((sim_test[["y"]] - pred_i) ^ 2)
mse_i
}
# Iterate for 1:20 number of variables
test_mse_best <-
tibble(
n_vars = 1:20,
test_mse = map_dbl(n_vars, mse_for_bestsubset)
)
#Plot
min_test_mse <- which.min(test_mse_best[["test_mse"]])
ggplot(test_mse_best,
aes(n_vars, test_mse)) +
geom_line() +
geom_vline(xintercept = min_test_mse, color = "red")
```
The minimum test MSE is achieved in the model with 15 variables (plus the intercept).
(f) How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.
ANSWER: In this case, the best subset algorithm correctly chooses a model that includes all the true predictors, and only the true predictors.
Also, the estimated coefficients are very close to the true ones (used for generating the data).
```{r}
coef(sim_bestsubset, id = 15)
```
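To back up that claim, here is a small sketch (my own addition) that lines the estimates up against the true coefficients from the `betas` vector defined in (a), matching by variable name; the true intercept is 0 because $y$ was generated without an intercept:
```{r}
# Compare estimated vs. true coefficients for the selected 15-variable model
est_15 <- coef(sim_bestsubset, id = 15)
true_coefs <- c("(Intercept)" = 0, set_names(as.vector(betas), str_c("X", 1:20)))
tibble(
  term = names(est_15),
  estimated = as.numeric(est_15),
  true = unname(true_coefs[names(est_15)])
)
```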
(g) Create a plot displaying of $\sqrt{\sum_{j=1}^p(\beta_j-\hat{\beta}_j^r)^2}$ for a range of values of $r$, where $\hat{\beta}_j^r$ is the jth coefficient estimate for the best model containing $r$ coefficients. Comment on what you observe. How does this compare to the test MSE plot from (d)?
```{r}
# Names of all the variables
names_betas <-
coef(sim_bestsubset, id = 20) %>% names()
# Create vector with true coefficients + their names
betas_named <- c(0, as.vector(betas)) %>%
set_names(names_betas)
error_coeffs <- function(r) {
zeros <- rep(0, times = 21) %>%
set_names(names_betas)
coefs_r <-
coef(sim_bestsubset, id = r)
zeros[names(coefs_r)] <- coefs_r
((betas_named - zeros) ^ 2) %>% sum() %>% sqrt()
}
test_mse_best <- test_mse_best %>%
mutate(error_coeffs = map_dbl(n_vars, error_coeffs))
ggplot(test_mse_best) +
geom_line(aes(x = n_vars, y = error_coeffs)) +
geom_vline(xintercept = which.min(test_mse_best[["error_coeffs"]]),
color = "red")
```
This matches what we found in (d): the model that minimizes both the test MSE and the error in the coefficient estimates is the one with r = 15 (i.e. the best model containing 15 coefficients).
(11) We will now try to predict per capita crime rate in the `Boston` data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
```{r}
(boston <- MASS::Boston %>% as_tibble())
```
First let's split the data into training and test sets.
```{r}
boston_training <- boston %>%
sample_frac(size = 0.5)
boston_test <- boston %>%
anti_join(boston_training)
```
Best subset selection:
```{r}
crim_mod_best <-
regsubsets(crim ~ ., data = boston_training, nvmax = 13)
summary(crim_mod_best)
```
Comparing "best subset" models for different number of variables:
```{r}
best_sub_models <-
tibble(
nvars = 1:13
)
boston_test_matrix <- model.matrix(crim ~ ., data = boston_test)
# Function to compute test MSE
mse_for_bestsubset_boston <- function(n_vars) {
# Extract coefficients for each model size
coefs_i <- crim_mod_best %>% coef(id = n_vars)
# Compute Y_predicted
pred_i <- boston_test_matrix[, names(coefs_i)] %*% coefs_i
# Obtain MSE
mse_i <- mean((boston_test[["crim"]] - pred_i) ^ 2)
mse_i
}
best_sub_models <- best_sub_models %>%
mutate(mse = map_dbl(nvars, mse_for_bestsubset_boston))
ggplot(best_sub_models, aes(nvars, mse)) +
geom_line() +
geom_vline(xintercept = which.min(best_sub_models[["mse"]]),
color = "red")
```
I'm getting noisy results with a single train/test split, so I'm going to use cross-validation to find the optimal model size:
```{r}
set.seed(42)
# Split original dataset into 10 folds
boston_cv <- boston %>%
mutate(fold = sample(1:10, nrow(boston), replace = TRUE))
# Declaring function
mse_cv_bestsubsets_boston <- function(n_vars, fold_i) {
# Split in train/test according to fold
boston_cv_train <-
boston_cv %>%
filter(fold != fold_i)
boston_cv_test <-
boston_cv %>%
filter(fold == fold_i)
boston_cv_test_matrix <- model.matrix(crim ~ ., boston_cv_test)
# Train best-subsets
cv_bestsubset <- regsubsets(crim ~ ., data = boston_cv_train, nvmax = 13)
# Extract coefficients by model size
coefs_i <- cv_bestsubset %>% coef(id = n_vars)
# Compute Y_predicted
pred_i <- boston_cv_test_matrix[, names(coefs_i)] %*% coefs_i
# Obtain MSE
mse_i <- mean((boston_cv_test[["crim"]] - pred_i) ^ 2)
mse_i
}
cv_best_subsets <-
expand_grid(n_vars = 1:13, fold_i = 1:10) %>%
mutate(mse = map2_dbl(n_vars, fold_i, mse_cv_bestsubsets_boston)) %>%
group_by(n_vars) %>%
summarise(mse = mean(mse))
ggplot(cv_best_subsets, aes(n_vars, mse)) +
geom_line() +
geom_vline(xintercept = which.min(cv_best_subsets[["mse"]]),
color = "red")
```
Therefore, we conclude that the optimal model size is 9 variables/predictors. This is the test MSE for a model of that size, evaluated on `boston_test`:
```{r}
best_subset_mse <-
mse_for_bestsubset_boston(9)
best_subset_mse
```
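For reference (my own addition), these are the coefficients of the selected 9-variable model, fit on `boston_training`:
```{r}
# Coefficients of the selected 9-variable model from best subset selection
coef(crim_mod_best, id = 9)
```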
Now let's try the lasso (alpha = 1).
First, find the optimal lambda through CV:
```{r}
boston_train_m <- model.matrix(crim ~ ., data = boston_training)
lambda_seq <- 10^seq(10, -10, length.out = 400)
cv_lambda_boston <-
cv.glmnet(
x = boston_train_m,
y = boston_training[["crim"]],
lambda = lambda_seq,
nfold = 10,
alpha = 1
)
best_lambda_boston <- cv_lambda_boston$lambda.min
boston_lasso <-
glmnet(
x = boston_train_m,
y = boston_training[["crim"]],
lambda = best_lambda_boston,
alpha = 1
)
```
Now let's estimate the Test MSE:
```{r}
boston_test_m <- model.matrix(crim ~ ., data = boston_test)
crim_boston_lasso <-
predict.glmnet(boston_lasso, newx = boston_test_m, s = best_lambda_boston)
lasso_mse <-
mean((crim_boston_lasso - boston_test[["crim"]])^2)
lasso_mse
```
It's slightly higher than the Best Subset MSE.
Now with ridge:
```{r}
cv_lambda_boston_ridge <-
cv.glmnet(
x = boston_train_m,
y = boston_training[["crim"]],
lambda = lambda_seq,
nfold = 10,
alpha = 0
)
best_lambda_boston_ridge <- cv_lambda_boston_ridge$lambda.min
boston_ridge <-
glmnet(
x = boston_train_m,
y = boston_training[["crim"]],
lambda = best_lambda_boston_ridge,
alpha = 0
)
crim_boston_ridge <-
predict.glmnet(boston_ridge, newx = boston_test_m,
s = best_lambda_boston_ridge)
ridge_mse <-
mean((crim_boston_ridge - boston_test[["crim"]])^2)
ridge_mse
```
Slightly higher than Lasso MSE.
Now with PCR:
```{r}
boston_pcr <-
pcr(crim ~ .,
data = boston_training,
scale = TRUE,
validation = "CV")
summary(boston_pcr)
```
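As with the `College` data, the CV curve can also be inspected graphically (my own addition, using the same `validationplot` call as before):
```{r}
# CV error (MSEP) by number of components, complementing the summary above
validationplot(boston_pcr, val.type = "MSEP")
```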
I will use 4 components, because the CV error barely improves when more components are added beyond that.
```{r}
boston_pcr_train <-
pcr(crim ~ .,
data = boston_training,
scale = TRUE)
crim_pcr <- predict(boston_pcr_train, newdata = boston_test, ncomp = 4)
pcr_mse <- mean((crim_pcr - boston_test[["crim"]])^2)
pcr_mse
```
Based on the previous results, I would choose the ridge regression model; in this run its test MSE is in the same range as the lasso and best subset models (and the exact ranking can shift with a different random train/test split). It uses all the variables in the data (since ridge doesn't perform feature selection), but it shrinks the coefficient values according to the regularization term and the optimal lambda value.
```{r}
boston_ridge %>% predict.glmnet(type = "coefficients")
```