---
pagetitle: "Feature Engineering A-Z | Target Encoding"
---
# Target Encoding {#sec-categorical-target}
::: {style="visibility: hidden; height: 0px;"}
## Target Encoding
:::
**Target encoding** (also called **mean encoding**, **likelihood encoding**, or **impact encoding**) is a method that maps the categorical levels to probabilities of your target variable [@micci2001].
This method is in some ways quite similar to [frequency encoding](categorical-frequency).
We take a single categorical variable and turn it into a single numeric variable.
This is a trained and supervised method since we are using the outcome of our modeling problem to guide the way this method is estimated.
In the most simple formulation, target encoding is done by replacing each level of a categorical variable with the mean of the target variable within said level.
The target variable will typically be the outcome, but that is not necessarily a requirement.
::: {.callout-caution}
# TODO
show a motivating example of what happens when this method is applied to an ID-like predictor with one observation per level. It would capture the full information of the outcome and become a perfect predictor.
:::
Consider the following example data set.
```{r}
#| label: dog-cat-example
#| echo: false
animals <- tibble::tibble(
  cuteness = c(1, 5, 9, 3, 2, 4),
  animal = c("dog", "cat", "cat", "cat", "dog", "horse")
)
knitr::kable(animals)
```
If we were to calculate target encoding on `animal` using `cuteness` as the target, we would first need to calculate the mean of `cuteness` within each level of `animal`.
```{r}
#| label: animals_means
#| echo: false
#| message: false
library(dplyr)
animals_means <- animals |>
  summarise(
    math = paste(cuteness, collapse = " + "),
    math = if_else(length(cuteness) == 1,
                   paste0(math, " / 1"),
                   paste0("(", math, ") / ", length(cuteness))),
    mean = mean(cuteness),
    .by = animal
  )
knitr::kable(animals_means)
```
Taking these means, we can now use them as an encoding.
```{r}
#| label: example-table
#| echo: false
animals |>
  left_join(animals_means, by = join_by(animal)) |>
  select(-animal, -math) |>
  rename(animal = mean) |>
  knitr::kable()
```
From the above example, we notice two things.
Firstly, once the calculations have been done, applying the encoding to new data is a fairly easy procedure, as it amounts to a left join.
Secondly, how will this method handle unseen levels?
Let us think about the unseen levels first: what do we do when we have no information about a given class?
This could happen in at least two different ways.
Either the level is truly unseen, e.g., a company that was just started and therefore couldn't appear in the training data set,
or the level is known but simply wasn't observed, e.g., no Sundays in the training data set.
Regardless of the reason, we will want to give these levels a baseline number.
For this, we can use the mean value of the target, across all of the training data set.
So for our toy example, we have a mean cuteness of `r mean(animals$cuteness)`, which we will assign to any new animal.
This value is by no means ideal, but it is an educated guess that can be calculated with ease.
This also means that regardless of the distribution of the target, these values can be calculated.
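To make the mechanics concrete, here is a minimal sketch of how applying the encoding works on the toy data above; the `new_animals` data and its `rabbit` level are made up for illustration.
```{r}
#| label: unseen-levels-sketch
library(dplyr)

# Encoding learned from the toy training data above
encoding <- animals |>
  summarise(animal_encoded = mean(cuteness), .by = animal)

# Fallback value for unseen levels: the global mean of the target
global_mean <- mean(animals$cuteness)

# Hypothetical new data, including an unseen level ("rabbit")
new_animals <- tibble::tibble(animal = c("cat", "dog", "rabbit"))

# Applying the encoding amounts to a left join plus a fallback fill
new_animals |>
  left_join(encoding, by = "animal") |>
  mutate(animal_encoded = coalesce(animal_encoded, global_mean))
```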
We have so far talked about target encoding from the perspective of a regression task.
But target encoding is not limited to numeric outcomes; it can be used in classification settings as well.
In the classification setting, where we have a categorical outcome, we cannot calculate the mean of the target variable directly, so we need a different statistic.
One option is to calculate the probability of the first level; another is to convert those probabilities to log odds, which puts them on a more linear scale.
From there, everything works as before.
::: callout-caution
# TODO
find good example
:::
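In lieu of a full example, here is a rough sketch of what the calculation could look like for a binary outcome; the `plan` and `churned` variables below are made up for illustration.
```{r}
#| label: classification-encoding-sketch
library(dplyr)

# Made-up data: did a customer on a given plan churn? (1 = yes, 0 = no)
churn <- tibble::tibble(
  plan    = c("basic", "basic", "basic", "premium", "premium"),
  churned = c(1, 1, 0, 0, 1)
)

# Per-level probability of churning, and the same information as log odds
churn |>
  summarise(
    prob     = mean(churned),
    log_odds = log(prob / (1 - prob)),
    .by = plan
  )
```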
A type of data that is sometimes seen is hierarchical categorical variables.
Typical examples are city-state and region-subregion.
Hierarchical categorical variables often come with many levels; target encoding can be applied to such data and can even take advantage of the hierarchical structure.
The encoding is calculated as usual, but the smoothing is done at a level below the top of the hierarchy.
So instead of shrinking each level toward the global mean, you shrink it toward the mean of its parent group.
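As a rough sketch of the idea, the encoding for each city below is blended with its state's mean instead of the global mean; the data and the fixed weight `w` are made up for illustration, and in practice the weight would depend on how many observations each city has.
```{r}
#| label: hierarchical-encoding-sketch
library(dplyr)

# Made-up hierarchical data: cities nested within states
sales <- tibble::tibble(
  state = c("WA", "WA", "WA", "OR", "OR"),
  city  = c("Seattle", "Seattle", "Spokane", "Portland", "Salem"),
  price = c(10, 12, 20, 15, 9)
)

w <- 0.5 # weight on the city's own mean; the rest comes from its state

sales |>
  mutate(state_mean = mean(price), .by = state) |>
  summarise(
    encoding = w * mean(price) + (1 - w) * first(state_mean),
    .by = c(state, city)
  )
```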
## Low count groups {#sec-target-low-count-groups}
Target encoding as described in this chapter doesn't apply smoothing to the statistics that are calculated.
Depending on the data, this may not have much of a negative effect.
This subsection goes over what happens when one or more of the groups have few counts.
At the same time, it acts as the main motivation for why you shouldn't use the base form of target encoding described in this chapter,
and should instead use a smoothed variant as seen in @sec-categorical-glmm.
Suppose we have the following data.
```{r}
#| label: low-count-target
#| echo: false
low_count <- tibble::tribble(
  ~target, ~predictor,
  "A", 0.4,
  "A", 0.6,
  "A", 0.5,
  "A", 0.2,
  "A", 0.5,
  "B", 0.2,
  "B", 0.3,
  "B", 0.2,
  "B", 0.4,
  "B", 0.6,
  "C", 0.1
)

low_count |>
  knitr::kable()
```
Calculating the means gives us the following encoding.
```{r}
#| label: low-count-target-means
#| echo: false
low_count |>
  dplyr::summarise(predictor = mean(predictor), .by = target) |>
  knitr::kable()
```
However, how much can we trust this encoding?
`C` only has a single value, while `A` and `B` both have 5.
Maybe `0.1` is a good estimate of the target mean under `C`,
but it could also be an outlier of the distribution itself.
Taking a single value of `A` would give a value anywhere between `0.2` and `0.6`.
We are overfitting to the training data set by having a single observation for `C` alone determine the encoding.
Smoothing would take the global mean (here `r round(mean(low_count$predictor), 2)`) into account by shrinking each encoding toward it.
Levels with many observations would get an encoding close to their own mean,
and levels with few observations would get an encoding close to the global mean.
We see the same effect in sports and online reviews.
An athlete who scored 1 goal out of 1 attempt isn't as reliable a statistic as an athlete who scores 80 out of 100.
Likewise, 10 five-star reviews aren't as strong a signal as 100 five-star reviews.
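To make the idea concrete, here is a minimal sketch of one simple way to smooth these encodings: a weighted average of each group mean and the global mean, where the pseudo-count `m` is a made-up tuning value. The approach in @sec-categorical-glmm is the more principled alternative.
```{r}
#| label: smoothing-sketch
library(dplyr)

global_mean <- mean(low_count$predictor)
m <- 5 # pseudo-count: each group is padded with this many "global" observations

# Groups with small n are pulled strongly toward the global mean
low_count |>
  summarise(n = n(), group_mean = mean(predictor), .by = target) |>
  mutate(smoothed = (n * group_mean + m * global_mean) / (n + m)) |>
  knitr::kable()
```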
## Pros and Cons
### Pros
- Can deal with categorical variables with many levels
- Can deal with unseen levels in a sensible way
### Cons
- Can be prone to overfitting
## R Examples
The embed package comes with a couple of functions to do target encoding; we will look at `step_lencode_glm()` as it most closely matches the method described in this chapter.
These functions are named as such because they **l**ikelihood **encode** variables, and because the encodings can be calculated using no-intercept models.
::: callout-caution
# TODO
find a better data set
:::
```{r}
#| label: set-seed
#| echo: false
set.seed(1234)
```
```{r}
#| label: step_lencode_glm
#| message: false
library(recipes)
library(embed)
data(ames, package = "modeldata")
rec_target <- recipe(Sale_Price ~ Neighborhood, data = ames) |>
  step_lencode_glm(Neighborhood, outcome = vars(Sale_Price)) |>
  prep()

rec_target |>
  bake(new_data = NULL)
```
And we see that it works as intended. We can pull out the exact encoding for each level using the `tidy()` method.
```{r}
#| label: tidy
rec_target |>
  tidy(1)
```
## Python Examples
```{python}
#| label: python-setup
#| echo: false
import pandas as pd
from sklearn import set_config
set_config(transform_output="pandas")
pd.set_option('display.precision', 3)
```
We are using the `ames` data set for examples.
{sklearn} provides the `TargetEncoder()` method we can use.
For this to work, we need to remember to specify an outcome when we `fit()`.
```{python}
#| label: targetencoder
from feazdata import ames
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import TargetEncoder
ct = ColumnTransformer(
    [('target', TargetEncoder(target_type="continuous"), ['Neighborhood'])],
    remainder="passthrough")

ct.fit(ames, y=ames[["Sale_Price"]].values.flatten())
ct.transform(ames).filter(regex="target.*")
```