---
title: "Soil spectroscopy ring trial"
output:
  github_document:
    toc: true
    toc_depth: 4
editor_options:
  markdown:
    wrap: 72
---
```{r setup, include=FALSE}
library("tidyverse")
library("qs")
mnt.dir <- "~/projects/mnt-ringtrial/"
```
## Overview
Inter-laboratory comparison of soil spectral measurements as part of the SoilSpec4GG project.
This repository is used for analyzing the metadata of different instruments of the ring trial.
The project workspace consists of:
- GitHub repository: [soilspectroscopy/ringtrial-metadata](https://github.com/soilspectroscopy/ringtrial-metadata).
- Google Cloud storage for efficient file storage and access: [whrc.org/soilcarbon-soilspec/storage/sc-ringtrial](https://console.cloud.google.com/storage/browser/sc-ringtrial).
## Metadata
Similar levels were grouped under a common format. All strings start with an upper-case first letter, and spaces are replaced with dashes.
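The formatting rule above can be sketched in base R as follows. This is only an illustration: `clean_level()` is a hypothetical helper, not a function from this repository, and the example strings are made up.

```r
# Hypothetical helper (not part of this repository): normalize one or more
# metadata levels to the common format described above.
clean_level <- function(x) {
  x <- trimws(x)                # drop leading/trailing whitespace
  x <- gsub("\\s+", "-", x)     # replace each run of whitespace with one dash
  out <- paste0(toupper(substring(x, 1, 1)),  # upper-case the first letter
                substring(x, 2))              # keep the rest unchanged
  out[is.na(x)] <- NA_character_              # keep missing values missing
  out
}

clean_level(c("bruker", "thermo fisher", NA))
```

Only the first letter is changed; the case of the remaining characters is left as entered, which is an assumption about how the levels were harmonized.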
The following information was prepared:
- Manufacturer: the instrument's manufacturer. Only spaces were replaced with dashes.
- Model: the instrument's model. Several levels were provided, so the model number was omitted to group the variants under a common model. Spaces were replaced with dashes.
- Year: year the instrument was built. No modification.
- Beamsplitter: the crystal used to split the beam for generating the interferogram. Some proprietary materials have coatings, but these were grouped under the same base material.
- Detector: the beam detector that is part of the interferometer. Again, some variants were grouped under the same basic type.
- Mirror: mirror material used in the interferometer.
- Accessory: scanning accessory for DRIFT. The manufacturer and accessory name are provided. Names were formatted to a common string.
- Background: material used as reference for internal calibration.
- Sample presentation: soil sample presentation (in the accessory) before scanning.
- Neat/Mulled: additional sample preparation. Mulled = mixed with a compound to form a paste. Neat = unmixed.
- Purged: whether the instrument was internally purged with gas before scanning.
Prepared metadata:
```{r metadata, message=FALSE, warning=FALSE, echo=FALSE}
read_csv("outputs/instruments_metadata_clean.csv") %>%
  knitr::kable()
```
## Pooled PCA
Pooled PCA was performed, retaining 99.5% of the original variance. This resulted in 7 components.
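As a rough sketch of how the number of components can be chosen from a cumulative-variance threshold, the example below runs `prcomp()` on a synthetic matrix standing in for the pooled spectra. The data, the mean-centering without scaling, and all object names are assumptions for illustration; the preprocessing actually used for the ring trial is not shown in this README.

```r
set.seed(1)

# Synthetic stand-in for pooled spectra: 60 samples x 20 correlated variables
latent   <- matrix(rnorm(60 * 3), nrow = 60)
loadings <- matrix(rnorm(3 * 20), nrow = 3)
spectra  <- latent %*% loadings + matrix(rnorm(60 * 20, sd = 0.05), nrow = 60)

# PCA on mean-centered data (assumption: no scaling to unit variance)
pca <- prcomp(spectra, center = TRUE, scale. = FALSE)

# Proportion of variance per component and its running total
explained  <- pca$sdev^2 / sum(pca$sdev^2)
cumulative <- cumsum(explained)

# Smallest number of components whose cumulative variance reaches 99.5%
n_comp <- which(cumulative >= 0.995)[1]

# Scores restricted to the retained components
scores <- pca$x[, seq_len(n_comp), drop = FALSE]
```

With real spectra, `n_comp` depends on the data and on the preprocessing applied before the PCA.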
<img src="outputs/plot_pca_scores_pooled_raw.png" width="100%" height="100%">
<img src="outputs/plot_pca_loadings_pooled_raw.png" width="100%" height="100%">
## Clustering analysis
<img src="outputs/plot_kmeans_aic_absolute.png" width="100%" height="100%">
<img src="outputs/plot_kmeans_aic_relative.png" width="100%" height="100%">
<img src="outputs/plot_kmeans_clusters.png" width="100%" height="100%">
Proportion of samples from each instrument (%) assigned to each spectral cluster:
```{r proportion, message=FALSE, warning=FALSE, echo=FALSE}
read_csv("outputs/proportions_clustering.csv") %>%
  rename(instrument = organization) %>%
  knitr::kable(digits = 3)
```
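A table of this kind can be derived from per-sample cluster assignments. The sketch below uses made-up assignments and hypothetical object names, and it assumes that the "majority" cluster of an instrument is simply the cluster holding the largest share of its samples; the repository's own definition may differ.

```r
# Hypothetical per-sample cluster assignments for two instruments
assignments <- data.frame(
  instrument = c(1, 1, 1, 1, 2, 2, 2, 2),
  cluster    = c("C1", "C1", "C1", "C2", "C2", "C2", "C2", "C2")
)

# Count samples per instrument (rows) and cluster (columns)
counts <- table(assignments$instrument, assignments$cluster)

# Row-wise percentages: each instrument's row sums to 100
proportions <- 100 * prop.table(counts, margin = 1)

# Assumed definition of the majority cluster: the largest share per instrument
majority <- colnames(proportions)[apply(proportions, 1, which.max)]
```

Note that `which.max()` returns the first maximum, so ties would be resolved in favor of the lower-numbered cluster in this sketch.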
<img src="outputs/plot_kmeans_clusters_majority.png" width="100%" height="100%">
<img src="outputs/plot_spectral_variation_clusters.png" width="100%" height="100%">
## Correspondence analysis
Correspondence analysis is used to explore relationships among qualitative variables. Like principal component analysis, it summarizes and visualizes a data set in two-dimensional plots. It is based on the frequencies formed by two categorical variables, i.e. a contingency table. Using an asymmetric biplot, one can plot metadata information (columns) over the cluster space (rows) in order to understand the associations between the categorical levels. For this, rows are represented in principal coordinates and columns are projected in standard coordinates (row-metric-preserving).
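The row-principal, column-standard coordinates described above can be obtained from a singular value decomposition of the standardized residuals of the contingency table. The following base R sketch shows the algebra on a made-up table; the counts, labels, and object names are hypothetical, and this is not necessarily how the plots in this repository were produced.

```r
# Hypothetical contingency table: clusters (rows) x manufacturers (columns)
N <- matrix(c(8, 1, 0,
              2, 6, 1,
              0, 2, 7),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("C1", "C2", "C3"),
                            c("Maker-A", "Maker-B", "Maker-C")))

P        <- N / sum(N)   # correspondence matrix (relative frequencies)
row_mass <- rowSums(P)   # row masses
col_mass <- colSums(P)   # column masses

# Standardized residuals: deviations from independence, scaled by the masses
S <- diag(1 / sqrt(row_mass)) %*%
  (P - row_mass %o% col_mass) %*%
  diag(1 / sqrt(col_mass))

dec <- svd(S)
k   <- 2                 # dimensions kept for a two-dimensional biplot

# Rows in principal coordinates (scaled by the singular values)
row_principal <- diag(1 / sqrt(row_mass)) %*% dec$u[, 1:k] %*% diag(dec$d[1:k])
# Columns in standard coordinates (not scaled by the singular values)
col_standard  <- diag(1 / sqrt(col_mass)) %*% dec$v[, 1:k]

dimnames(row_principal) <- list(rownames(N), paste0("Dim", 1:k))
dimnames(col_standard)  <- list(colnames(N), paste0("Dim", 1:k))

# Inertia per dimension; the total equals the chi-square statistic divided by n
inertia <- dec$d^2
```

In this scaling, distances between row points approximate chi-square distances between cluster profiles, while column points act as reference directions for the metadata levels.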
The chi-square test of independence is used to analyze the frequency table (i.e. contingency table) formed by two categorical variables. The chi-square test evaluates whether there is a significant association between the categories of the two variables:
- Null hypothesis (H0): the row and the column variables of the contingency table are independent.
- Alternative hypothesis (H1): row and column variables are dependent.
We reject H0 in favor of H1 when the p-value is below the significance level alpha (typically 1% or 5%).
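A minimal sketch of this decision rule with base R's `chisq.test()` is given below. The 2 x 2 table is made up for illustration, and `alpha = 0.05` is just one of the two levels mentioned above.

```r
# Hypothetical contingency table: cluster (rows) x purged (columns)
ct <- matrix(c(12, 3,
               4, 11),
             nrow = 2, byrow = TRUE,
             dimnames = list(cluster = c("C1", "C2"),
                             purged  = c("No", "Yes")))

# Pearson chi-square test of independence.
# correct = FALSE turns off the continuity correction that chisq.test()
# applies by default to 2 x 2 tables.
test <- chisq.test(ct, correct = FALSE)

alpha     <- 0.05
reject_h0 <- test$p.value < alpha   # TRUE means H0 (independence) is rejected

test$expected                       # expected counts under independence
```

For sparse tables with small expected counts, `chisq.test()` issues a warning that the chi-square approximation may be inaccurate; in that case `simulate.p.value = TRUE` is one available alternative.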
Example of a contingency table using manufacturer information:
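```{r contingency_table, message=FALSE, warning=FALSE, echo=FALSE}
# Scores with cluster assignments (columns used below: organization, cluster, majority)
scores <- qread(paste0(mnt.dir, "metadata/pca_majority.qs"))
# Cleaned instrument metadata, keyed by organization code
metadata <- read_csv("outputs/instruments_metadata_clean.csv") %>%
  rename(organization = code) %>%
  mutate(organization = as.factor(organization))
# Join scores with metadata and label clusters as C1, C2, ...
ca.data <- left_join(scores, metadata, by = "organization") %>%
  mutate(cluster = paste0("C", cluster)) %>%
  mutate(majority = paste0("C", majority))
# Contingency table: majority cluster (rows) by manufacturer (columns)
ct.manufacturer <- table(ca.data$majority, ca.data$manufacturer)
knitr::kable(ct.manufacturer)
```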
<img src="outputs/plot_ca_manufacturer.png" width="100%" height="100%">
<img src="outputs/plot_ca_model.png" width="100%" height="100%">
<img src="outputs/plot_ca_beamsplitter.png" width="100%" height="100%">
<img src="outputs/plot_ca_detector.png" width="100%" height="100%">
<img src="outputs/plot_ca_mirror_material.png" width="100%" height="100%">
<img src="outputs/plot_ca_accessory.png" width="100%" height="100%">
<img src="outputs/plot_ca_background.png" width="100%" height="100%">
<img src="outputs/plot_ca_sample_presentation.png" width="100%" height="100%">
<img src="outputs/plot_ca_purged.png" width="100%" height="100%">