# Chemometrics {#sec-chemometrics}
In this section, we introduce an alternative modeling paradigm that is commonly referred to as **Chemometrics**. According to [Wikipedia](https://en.wikipedia.org/wiki/Chemometrics),
> **Chemometrics** is the science of extracting information from chemical systems by data-driven means. Chemometrics is inherently interdisciplinary, using methods frequently employed in core data-analytic disciplines such as multivariate statistics, applied mathematics, and computer science, in order to address problems in chemistry, biochemistry, medicine, biology and chemical engineering. In this way, it mirrors other interdisciplinary fields, such as psychometrics and econometrics.
Chemometrics has been widely employed in soil spectroscopy, especially through traditional preprocessing tools and Partial Least Squares Regression (PLSR). PLSR is a classic algorithm that is able to deal with the multivariate and multicollinear nature of the spectra to create robust regression models. Along with regression, many other features were built around PLSR and PCA that aid in the interpretation of models and prediction results. This section presents and discusses these possibilities.
The choice between classic Chemometrics tools and the ones presented in the [**Processing**](#subsec-preparing) and [**Machine Learning**](#subsec-ml) sections will largely depend on the project application and familiarity. There is no perfect method, so we usually recommend testing both and choosing the one that best suits your needs.
The **mdatools** R package is a collection of Chemometrics methods that share a common "user interface". It contains a series of features for preprocessing, exploring, modeling, and interpreting multivariate spectral data.
The **mdatools** framework is presented in @Kucheryavskiy2020. For a full documentation of the available features, please visit <https://mdatools.com/docs/index.html>.
A list of all required packages for this section is provided in the following code chunk:
```{r, libraries, message=FALSE, warning=FALSE}
library("tidyverse") # Regular data wrangling
library("mdatools") # Chemometrics
library("yardstick") # Additional performance metrics
library("qs") # Loading serialized files
```
## Preparation of files
Instead of using the `train.csv` and `test.csv` files from the **Machine Learning** section, we are going to preprocess and model the same spectral data using a different set of tools. For this, we are going to use the same Neospectra database that is part of the Open Soil Spectral Library (OSSL).
```{r files, message=FALSE, warning=FALSE}
## Internet configuration for downloading big datasets
options(timeout = 10000)
## Reading serialized files
neospectra.soil <- qread_url("https://storage.googleapis.com/soilspec4gg-public/neospectra_soillab_v1.2.qs")
dim(neospectra.soil)
neospectra.site <- qread_url("https://storage.googleapis.com/soilspec4gg-public/neospectra_soilsite_v1.2.qs")
dim(neospectra.site)
neospectra.nir <- qread_url("https://storage.googleapis.com/soilspec4gg-public/neospectra_nir_v1.2.qs")
dim(neospectra.nir)
```
From the original files, we can now select only the relevant data for our modeling.
```{r preparation, message=FALSE, warning=FALSE}
# Selecting relevant site data
neospectra.site <- neospectra.site %>%
select(id.sample_local_c, location.country_iso.3166_txt)
# Selecting relevant soil data
neospectra.soil <- neospectra.soil %>%
select(id.sample_local_c, oc_usda.c729_w.pct)
# Selecting relevant NIR data and taking average across repeats
neospectra.nir <- neospectra.nir %>%
select(id.sample_local_c, starts_with("scan_nir")) %>%
group_by(id.sample_local_c) %>%
summarise(across(everything(), mean), .groups = "drop")
# Renaming spectral columns headers to numeric integers
neospectra.nir %>%
select(starts_with("scan_nir")) %>%
names() %>%
head()
old.names <- neospectra.nir %>%
select(starts_with("scan_nir")) %>%
names()
new.names <- gsub("scan_nir.|_ref", "", old.names)
neospectra.nir <- neospectra.nir %>%
rename_with(~new.names, all_of(old.names))
spectral.column.names <- new.names
head(spectral.column.names)
tail(spectral.column.names)
```
Lastly, we join the tables and make sure all data are prepared and complete.
```{r join, message=FALSE, warning=FALSE}
# Joining data
neospectra <- left_join(neospectra.site,
neospectra.soil,
by = "id.sample_local_c") %>%
left_join(., neospectra.nir, by = "id.sample_local_c")
# Filtering out samples without SOC values
neospectra <- neospectra %>%
filter(!is.na(oc_usda.c729_w.pct))
```
A subset of the imported spectra can be visualized below.
```{r visualization, message=FALSE, warning=FALSE}
# Spectral visualization
set.seed(1993)
neospectra %>%
sample_n(100) %>%
pivot_longer(any_of(spectral.column.names),
names_to = "wavelength",
values_to = "reflectance") %>%
mutate(wavelength = as.numeric(wavelength),
reflectance = as.numeric(reflectance)) %>%
ggplot(data = .) +
geom_line(aes(x = wavelength, y = reflectance, group = id.sample_local_c),
alpha = 0.25, linewidth = 0.25) +
theme_light()
```
## Chemometric modeling
To start with chemometrics, **mdatools** offers a series of built-in functions for preprocessing spectra. We are going to use the **Standard Normal Variate** (SNV) function.
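For reference, SNV is a row-wise normalization: each spectrum is centered and scaled by its own mean and standard deviation across wavelengths,

$$x_{ij}^{\mathrm{SNV}} = \frac{x_{ij} - \bar{x}_{i}}{s_{i}},$$

where $\bar{x}_i$ and $s_i$ are the mean and standard deviation of the reflectance values of spectrum $i$.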
```{r preprocessing, message=FALSE, warning=FALSE}
# Preprocessing with SNV
neospectra.snv <- neospectra %>%
select(all_of(spectral.column.names)) %>%
as.matrix() %>%
prep.snv(.) %>%
as_tibble() %>%
bind_cols({neospectra %>%
select(-all_of(spectral.column.names))}, .)
# Visualization
set.seed(1993)
neospectra.snv %>%
sample_n(100) %>%
pivot_longer(any_of(spectral.column.names),
names_to = "wavelength",
values_to = "reflectance") %>%
mutate(wavelength = as.numeric(wavelength),
reflectance = as.numeric(reflectance)) %>%
ggplot(data = .) +
geom_line(aes(x = wavelength, y = reflectance, group = id.sample_local_c),
alpha = 0.25, linewidth = 0.25) +
theme_light()
```
Before model calibration, we are going to split the dataset into train and test sets. We use the same split rule employed in the **Machine Learning** section.
```{r split, message=FALSE, warning=FALSE}
# Preparing train split, separating predictors and outcome, and applying log1p
neospectra.train <- neospectra.snv %>%
filter(location.country_iso.3166_txt == "USA")
neospectra.train.predictors <- neospectra.train %>%
select(all_of(spectral.column.names))
neospectra.train.outcome <- neospectra.train %>%
select(oc_usda.c729_w.pct) %>%
mutate(oc_usda.c729_w.pct = log1p(oc_usda.c729_w.pct))
# Preparing test split, separating predictors and outcome, and applying log1p
neospectra.test <- neospectra.snv %>%
filter(location.country_iso.3166_txt != "USA")
neospectra.test.predictors <- neospectra.test %>%
select(all_of(spectral.column.names))
neospectra.test.outcome <- neospectra.test %>%
select(oc_usda.c729_w.pct) %>%
mutate(oc_usda.c729_w.pct = log1p(oc_usda.c729_w.pct))
```
Now we just feed the preprocessed spectra into the PLSR algorithm.
In this case, PLSR compresses the spectra into several orthogonal features that jointly explain the variance in the spectra and in the soil property of interest, here SOC. The resulting scores of these latent factors are fed into a multivariate linear regression model.
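In matrix notation, the PLSR model can be summarized by the bilinear decomposition

$$\mathbf{X} = \mathbf{T}\mathbf{P}^{\top} + \mathbf{E}, \qquad \mathbf{y} = \mathbf{T}\mathbf{q} + \mathbf{f},$$

where $\mathbf{T}$ holds the scores of the latent factors, $\mathbf{P}$ and $\mathbf{q}$ the loadings, and $\mathbf{E}$ and $\mathbf{f}$ the residuals. The factors are extracted sequentially so that each one maximizes the covariance between the spectra and the outcome.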
::: {.callout-note}
## Note
If you want to learn more about PLSR and PCA, the **mdatools** documentation offers a good explanation. Similarly, take a look at this awesome chapter from [**All Models Are Wrong: Concepts of Statistical Learning**](https://allmodelsarewrong.github.io/pls.html).
:::
We are going to test up to 20 factors (`ncomp = 20`), run 10-fold cross-validation on the train data (`cv = 10`), center and scale the spectra before compression, performed locally, i.e., independently for each fold (`center = TRUE, scale = TRUE, cv.scope = 'local'`), and use a data-driven method (`"ddmoments"`) to estimate critical limits for extreme and outlier samples. The model is named "SOC prediction model".
We can fit a PLSR by passing the train and test samples together in the call...
```{r plsr, message=FALSE, warning=FALSE}
set.seed(1993)
pls.model.soc <- pls(x = neospectra.train.predictors,
y = neospectra.train.outcome,
ncomp = 20,
x.test = neospectra.test.predictors,
y.test = neospectra.test.outcome,
center = TRUE, scale = TRUE,
cv = 10, lim.type = "ddmoments", cv.scope = 'local',
info = "SOC prediction model")
```
Or run the calibration and the test-set prediction separately.
```{r plsr_split, message=FALSE, warning=FALSE}
# Alternatively
set.seed(1993)
pls.model.soc.calibration <- pls(x = neospectra.train.predictors,
y = neospectra.train.outcome,
ncomp = 20,
center = TRUE, scale = TRUE,
cv = 10, lim.type = "ddmoments", cv.scope = 'local',
info = "SOC prediction model")
pls.model.soc.predictions <- predict(pls.model.soc.calibration,
x = neospectra.test.predictors,
y = neospectra.test.outcome, cv = FALSE)
```
From these model fits, we can get a summary of the model.
```{r model_summary}
# Get an overview of performance
summary(pls.model.soc)
```
You can see that the number of selected components is 17, which was automatically chosen based on the 10-fold cross-validation results.
Let's plot the model overview. We will explore the visualization features in detail soon.
```{r model_viz}
# Visualize representation limits, model coefficients, performance, and fit
plot(pls.model.soc, show.legend = TRUE)
```
There is a big mismatch between the CV and test performance, which likely happened due to the geographical difference between the samples. Let's set the number of components that worked best for the test samples.
```{r}
# The performance is more consistent across train, cv, and test with 5 comps
pls.model.soc <- selectCompNum(pls.model.soc, 5)
summary(pls.model.soc)
plot(pls.model.soc, show.legend = FALSE)
```
## Model and prediction interpretation
We can start the interpretation by looking at the classification of observations based on spectral dissimilarity. For this, **mdatools** uses both the **Q statistics** and the **Hotelling $T^2$** as distance metrics for classifying samples as regular, extreme, or outlier.
The **Q statistics**, known as orthogonal distance in **mdatools**, measures the remaining spectral variance that is not accounted for during spectral compression and is not used by the models; observations with unique features not well represented by PLSR are therefore flagged. The **Hotelling $T^2$**, on the other hand, focuses on the deviation of the observation scores produced by the retained factors used in the models, flagging very distinct samples. Together, these two complementary methods are able to detect observations that are underrepresented by the PLSR model. The classification is based on critical limits calculated from the calibration set, and **mdatools** currently makes four different methods available for estimating them. By default, it uses the data-driven moments method (`lim.type = "ddmoments"`) developed by the author.
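In the usual notation, for observation $i$ decomposed with $A$ components, the two distances are computed from the spectral residuals $e_{ij}$ and the scores $t_{ia}$:

$$Q_i = \sum_{j=1}^{p} e_{ij}^{2}, \qquad T_i^{2} = \sum_{a=1}^{A} \frac{t_{ia}^{2}}{\lambda_a},$$

where $p$ is the number of wavelengths and $\lambda_a$ is the variance of the scores of component $a$.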
Let's compute the classification and plot it with the model residuals. We can see that for the test set, several observations were flagged as potential extreme observations, while no spectral outlier was detected.
```{r classification, message=FALSE, warning=FALSE}
# Categorization
outlier.detection <- categorize(pls.model.soc.calibration,
pls.model.soc.predictions,
ncomp = 5)
head(outlier.detection)
plotResiduals(pls.model.soc.predictions,
cgroup = outlier.detection,
ncomp = 5)
```
We can further explore the PLSR model with additional built-in plot functions, for example, exploring in detail the model performance across the range of components tested, for all data splits.
```{r performance_components}
plotRMSE(pls.model.soc, show.labels = TRUE)
```
With the predictions made, we can run a classic observed vs predicted scatterplot.
```{r scatterplot}
# Predictions scatterplot
plotPredictions(pls.model.soc, show.line = TRUE, pch = 20, cex = 0.25)
abline(a = 0, b = 1)
# plotPredictions(pls.model.soc.predictions, show.line = TRUE, pch = 20, cex = 0.25)
# abline(a = 0, b = 1)
```
And inspect the residuals...
```{r residuals}
# Inspection plot for outcome residual
plotYResiduals(pls.model.soc, cex = 0.5, show.labels = TRUE)
```
Another common visualization that aids model interpretation is variable importance. For PLSR, the Variable Importance in Projection (VIP) is routinely employed.
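VIP scores are commonly defined as

$$\mathrm{VIP}_j = \sqrt{\frac{p \sum_{a=1}^{A} \mathrm{SS}_a \, \left( w_{ja} / \lVert \mathbf{w}_a \rVert \right)^{2}}{\sum_{a=1}^{A} \mathrm{SS}_a}},$$

where $p$ is the number of wavelengths, $w_{ja}$ is the PLS weight of variable $j$ for component $a$, and $\mathrm{SS}_a$ is the variance in the outcome explained by component $a$. Variables with VIP scores above 1 are conventionally considered influential.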
```{r vip}
# Variable influence on projection (VIP) scores
plotVIPScores(pls.model.soc)
vip <- vipscores(pls.model.soc, ncomp = 5)
head(vip)
```
There are many other plot functions for interpreting the PLSR model.
For example, how much of the original spectral variance (X variables) was retained by the 5 components/factors? Around 97% for both the train and test sets.
```{r x_variance}
# Cumulative variance retained from predictors (spectra)
plotXCumVariance(pls.model.soc, type = 'h', show.labels = TRUE, legend.position = 'bottomright')
```
As PLSR tries to maximize the explained variance of both X (predictors) and Y (single or multiple outcomes) during compression, how much of the outcome variance was retained by the 5 components/factors? About 40% in the train and 90% in the test samples.
```{r y_variance}
# Cumulative variance retained from outcome
plotYCumVariance(pls.model.soc, type = 'b', show.labels = TRUE, legend.position = 'bottomright')
```
And visualize the scores of the latent factors produced. In this case, we are plotting the first 3 factors.
```{r x_scores}
# Scores plots for the compressed spectra
plotXScores(pls.model.soc, comp = c(1, 2), show.legend = TRUE, cex = 0.5, legend.position = 'topleft')
plotXScores(pls.model.soc, comp = c(1, 3), show.legend = TRUE, cex = 0.5, legend.position = 'bottomright')
```
During spectra compression, we can visualize the estimated loadings for the first 5 components across the original spectrum.
```{r loadings}
# Loadings of the first 5 components across the original spectrum
plotXLoadings(pls.model.soc, comp = c(1, 2, 3, 4, 5), type = 'l')
```
Lastly, we can explore all the internal information retained in the **pls** and **plsres** objects produced by **mdatools**.
```{r internal}
# Exploring internal info from model object
pls.model.soc$T2lim
pls.model.soc$Qlim
pls.model.soc$res$cal
pls.model.soc$res$cal$xdecomp
pls.model.soc$res$test
pls.model.soc$res$test$xdecomp
```
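These result objects also store the performance statistics behind `summary()`. As a minimal sketch (assuming the `rmse` and `r2` fields of **plsres** objects described in the **mdatools** documentation), we could pull the calibration errors per number of components directly:
```{r internal_stats}
# RMSE and R2 of the calibration set, one value per number of components
pls.model.soc$res$cal$rmse
pls.model.soc$res$cal$r2
```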
Let's grab the test predictions for a customized plot.
```{r predictions}
# Grabbing test predictions
test.predictions <- pls.model.soc$res$test$y.pred
dim(test.predictions)
# Selecting "Comp 5 predictons"
test.predictions <- tibble(predicted = test.predictions[,5,1])
# Formatting
neospectra.test.results <- neospectra.test %>%
select(-all_of(spectral.column.names)) %>%
bind_cols(test.predictions) %>%
rename(observed = oc_usda.c729_w.pct) %>%
mutate(predicted = expm1(predicted))
```
And calculate additional evaluation metrics.
```{r yardstick}
# Calculating performance metrics ("yardstick" was already loaded above)
neospectra.test.performance <- neospectra.test.results %>%
summarise(n = n(),
rmse = rmse_vec(truth = observed, estimate = predicted),
bias = msd_vec(truth = observed, estimate = predicted),
rsq = rsq_trad_vec(truth = observed, estimate = predicted),
ccc = ccc_vec(truth = observed, estimate = predicted, bias = TRUE),
rpiq = rpiq_vec(truth = observed, estimate = predicted))
neospectra.test.performance
```
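For reference, the RPIQ reported above is the ratio of the interquartile range of the observed values to the RMSE, so larger values indicate better predictions:

$$\mathrm{RPIQ} = \frac{Q_3 - Q_1}{\mathrm{RMSE}}.$$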
And plot the scatterplot with **ggplot**.
```{r ggplot}
# Final accuracy plot
performance.annotation <- paste0("CCC = ", round(neospectra.test.performance[[1, "ccc"]], 2),
"\nRMSE = ", round(neospectra.test.performance[[1, "rmse"]], 2), " wt%")
p.final <- ggplot(neospectra.test.results) +
geom_point(aes(x = observed, y = predicted)) +
geom_abline(intercept = 0, slope = 1) +
geom_text(aes(x = -Inf, y = Inf, hjust = -0.1, vjust = 1.2),
label = performance.annotation, size = 3) +
labs(title = "Soil Organic Carbon (wt%) test prediction",
x = "Observed", y = "Predicted") +
theme_light() + theme(legend.position = "bottom")
# Making sure we have a square plot
r.max <- max(layer_scales(p.final)$x$range$range)
r.min <- min(layer_scales(p.final)$x$range$range)
s.max <- max(layer_scales(p.final)$y$range$range)
s.min <- min(layer_scales(p.final)$y$range$range)
t.max <- round(max(r.max,s.max),1)
t.min <- round(min(r.min,s.min),1)
p.final <- p.final + coord_equal(xlim = c(t.min,t.max), ylim = c(t.min,t.max))
p.final
```