---
title: "Lab2"
author:
date: "September 7, 2019"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(ggplot2)  # used directly for the added-variable plot below
library(GGally)
```
## Topics
* added-variable plots: `avPlots` from `library(car)`
* `termplot` from base R
* `boxCox` from `library(car)`
* `powerTransform` from `library(car)`
## Interpretation with Multiple Regression: Added-Variable Plots
* Regression coefficients in multiple regression are adjusted for the other variables
* Use the United Nations data and the 'full' model
$$E[\log(\text{Fertility}) \mid \log(\text{PPgdp}) = x_1, \text{Purban} = x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
* Keeping `Purban` fixed, a one-unit change in `log(PPgdp)` changes the expected `log(Fertility)` by $\beta_1$
* Visualize with added-variable plots
## Data and Cleaning
```{r, echo=T}
# UN3 data from the alr3 package: national statistics on fertility, GDP, urbanization
data(UN3, package = "alr3")
UN = dplyr::select(UN3, c(Fertility, PPgdp, Purban)) %>%
  mutate(logPPgdp = log(PPgdp),
         logFertility = log(Fertility)) %>%
  na.omit()  # drop countries with missing values
```
## Scatter plots
```{r}
# scatterplot matrix of Purban and the logged variables (columns 3-5)
ggpairs(UN, columns = c(3, 4, 5))
```
## Adjustment for `Purban`
1) Regress `log(Fertility)` on `Purban` and obtain the residuals; call them `e_Y` for example.
```{r eY, echo=T}
e_Y = residuals(lm(logFertility ~ Purban, data=UN))
```
2) Regress `logPPgdp` on `Purban` and obtain the residuals (call them `e_X1`)
```{r, echo=T}
e_X1 = residuals(lm(logPPgdp ~ Purban, data=UN))
```
3) Regress `e_Y` on `e_X1` (see the chunk below).
4) The slope is the same as the estimated slope for `logPPgdp` in the full model with both predictors.
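Completing steps 3 and 4 with the residuals computed above:
```{r, echo=T}
# the slope on e_X1 reproduces the multiple-regression coefficient for logPPgdp
coef(lm(e_Y ~ e_X1))
```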
## Added Variable Plot
```{r, echo=T, fig.width=5, fig.height=3}
# added-variable plot: residuals of the response vs. residuals of the predictor
df = data.frame(e_Y = e_Y, e_X1 = e_X1)
ggplot(data = df, aes(x = e_X1, y = e_Y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```
## Compare
```{r, echo=T}
summary(lm(logFertility ~ logPPgdp + Purban, data=UN))$coef  # full model
summary(lm(e_Y ~ e_X1, data=df))$coef                        # two-stage regression
```
* coefficients are the same
* residuals are the same
* t-values are not quite the same (why? see the check below)
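The two-stage fit does not account for having estimated the `Purban` coefficient, so its residual degrees of freedom (and hence the residual standard error and t-value) differ slightly:
```{r, echo=T}
# n - 3 residual df in the full model vs. n - 2 in the two-stage fit
c(full = lm(logFertility ~ logPPgdp + Purban, data = UN)$df.residual,
  two_stage = lm(e_Y ~ e_X1, data = df)$df.residual)
```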
## In General: `avPlots`
```{r, echo=T}
car::avPlots(lm(logFertility ~ logPPgdp + Purban, data=UN))
```
## Added Variable Plot Example & Transformations
```{r simdata, echo=T}
# simulate data that are linear in log(X1) and sqrt(X2) on the log scale of Y
n = 100
logx1 = rnorm(n)
x1 = exp(logx1)
x2 = abs(rnorm(n))
logy = .5 + 3*logx1 - 2*sqrt(x2) + rnorm(n)
simdat = data.frame(X1 = x1, X2 = x2, Y = exp(logy))
```
```{r}
library(car)
mod1 = lm(Y ~ X1 + X2, data=simdat)
avPlots(mod1)
```
Like `termplot` below, the added-variable plot can also show whether a linear term in $X1$ is appropriate, whether the relationship is nonlinear, and whether outliers or influential points affect the adjusted relationship.
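As a quick check, since the simulated data are linear in $\log(X1)$ and $\sqrt{X2}$, refitting on the transformed scale should straighten the added-variable plots:
```{r}
mod2 = lm(log(Y) ~ log(X1) + sqrt(X2), data = simdat)
avPlots(mod2)  # both panels should now look linear
```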
## Transformations
```{r}
data("Prestige")  # occupational prestige data from the car package
```
```{r}
pairs(Prestige)
```
### BoxCox
```{r}
car::boxCox(lm(prestige ~ income + education + type + poly(women, 2),
               data = Prestige))
```
See `help(boxCox)` for generalizations to non-positive responses.
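A suggested power can also be read off the profile log-likelihood directly; a sketch, assuming `boxCox` (like `MASS::boxcox`) invisibly returns the $\lambda$ grid and log-likelihood values:
```{r}
bc = car::boxCox(lm(prestige ~ income + education + type + poly(women, 2),
                    data = Prestige), plotit = FALSE)
bc$x[which.max(bc$y)]  # lambda maximizing the profile log-likelihood
```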
### PowerTransform
```{r}
# omit the last column (type) since it is discrete;
# family = "bcnPower" allows non-positive values
car::powerTransform(Prestige[, 1:5], family = "bcnPower")
# see help(powerTransform) for examples and tests for "nice" powers
```
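The tests for "nice" powers are reported by the `summary` method; for example:
```{r}
pt = car::powerTransform(Prestige[, 1:5], family = "bcnPower")
summary(pt)  # reports estimates and tests for simple values of the powers
```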
## Termplot output: `X1`
```{r}
termplot(lm(Y ~ X1 + X2, data=simdat), terms = "X1",
partial.resid = T, se=T, rug=T,
smooth = panel.smooth)
```
## Termplot output: `X2`
```{r}
termplot(lm(Y ~ X1 + X2, data=simdat), terms = "X2",
partial.resid = T, se=T, rug=T,
smooth = panel.smooth)
```
## termplot with transformation of Y and X1: term for `log(X1)`
```{r}
termplot(lm(log(Y) ~ log(X1) + X2, data=simdat),
terms = "log(X1)", partial.resid = T, se=T, rug=T,
smooth = panel.smooth)
```
## termplot with transformation of Y and X1: term for `X2`
```{r}
termplot(lm(log(Y) ~ log(X1) + X2, data=simdat), terms = "X2",
partial.resid = T, se=T, rug=T, smooth = panel.smooth)
```
## termplot with transformation of Y, X1, and X2
```{r}
termplot(lm(log(Y) ~ log(X1) + sqrt(X2), data=simdat), terms = "sqrt(X2)",
partial.resid = T, se=T, rug=T, smooth = panel.smooth)
```
## What is in a term plot?
* x-axis is the (untransformed) variable in your data frame ($X1$, $X2$)
* the line is the "term": that variable's contribution to $f(x)$
* y-axis is the partial residuals for the term
* `partial.resid = T` adds the partial residuals to the plot
* `rug = T` shows the location of the data on the axes
* `se = T` adds the SE of the term's contribution to $f(x)$
* `smooth = panel.smooth` adds "smoothed" means to the plot
## Terms
$$Y = \hat{\beta}_0 + \hat{\beta}_1 X1 + \hat{\beta}_2 X2 + e$$
Equivalent to the centered model
$$Y = \bar{Y} + \hat{\beta}_1 (X1 - \overline{X1}) + \hat{\beta}_2 (X2 - \overline{X2}) + e$$
Terms are coefficient estimates times centered predictors:
$$\hat{\beta}_1 (X1 - \overline{X1})$$
$$\hat{\beta}_2 (X2 - \overline{X2})$$
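In R, these terms are exactly what `predict` with `type = "terms"` returns; a quick check on the simulated data:
```{r}
mod = lm(Y ~ X1 + X2, data = simdat)
# each column of the terms matrix equals beta_j * (x_j - mean(x_j))
head(predict(mod, type = "terms"))
head(coef(mod)["X1"] * (simdat$X1 - mean(simdat$X1)))
```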
## Terms with transformations
$$\log(Y) = \hat{\beta}_0 + \hat{\beta}_1 \log(X1) + \hat{\beta}_2 X2 + e$$
Equivalent to the centered model
$$\log(Y) = \overline{\log(Y)} + \hat{\beta}_1 (\log(X1) - \overline{\log(X1)}) + \hat{\beta}_2 (X2 - \overline{X2}) + e$$
Terms are coefficient estimates times centered "predictors":
$$\hat{\beta}_1 (\log(X1) - \overline{\log(X1)})$$
## Partial residuals for a term
$$\log(Y) = \overline{\log(Y)} + \hat{\beta}_1 (\log(X1) - \overline{\log(X1)}) + \hat{\beta}_2 (X2 - \overline{X2}) + e$$
$$\log(Y) - \left(\overline{\log(Y)} + \hat{\beta}_1 (\log(X1) - \overline{\log(X1)})\right) = \hat{\beta}_2 (X2 - \overline{X2}) + e$$
* The left-hand side takes the response and removes the part explained by $X1$
* It equals the `term` for $X2$ plus the residual $e$
* Partial residual for $X2$: the part of the residual variation not explained by the other terms that could potentially be explained by $X2$
* Similarly, the partial residual for $X1$ is
$$\hat{\beta}_1 (\log(X1) - \overline{\log(X1)}) + e$$
(add the term back to the residual to obtain the partial residual)
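These are available directly in R via `residuals` with `type = "partial"`, which returns one column per term (term plus residual):
```{r}
modt = lm(log(Y) ~ log(X1) + X2, data = simdat)
head(residuals(modt, type = "partial"))
```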
## Summary
* Linear predictors may be based on functions of other predictors
(dummy variables, interactions, non-linear terms)
* for interpretation, transform predictions back to the original units
* a log transform is useful for positive responses (back-transformed predictions stay positive)
* Be careful of the units of the data
    + plots should show units
    + summary statements should include units
* Goodness-of-fit measures $R^2$ and adjusted $R^2$ depend on the scale of the response; $R^2$ is the percent of variation in "$Y$" explained by the model:
$$R^2 = 1 - SSE/SST$$
where $SSE = \sum_i (Y_i - \hat{Y}_i)^2$ and $SST = \sum_i (Y_i - \bar{Y})^2$
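Computing $R^2$ from this definition for the UN model and comparing to `summary`:
```{r}
fit = lm(logFertility ~ logPPgdp + Purban, data = UN)
SSE = sum(residuals(fit)^2)
SST = sum((UN$logFertility - mean(UN$logFertility))^2)
c(by_hand = 1 - SSE/SST, from_summary = summary(fit)$r.squared)
```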