README.Rmd

---
title: "Soil spectroscopy ring trial"
output: 
  github_document:
    toc: true
    toc_depth: 4
editor_options: 
  markdown: 
    wrap: 72
---
  
```{r setup, include=FALSE}
library("tidyverse")
library("qs")
mnt.dir <- "~/projects/mnt-ringtrial/"
```

## Overview

Inter-laboratory comparison of soil spectral measurements as part of the SoilSpec4GG project.

This repository is used for analyzing the metadata of different instruments of the ring trial.

The workspace development is defined by:

- GitHub repository: [soilspectroscopy/ringtrial-metadata](https://github.com/soilspectroscopy/ringtrial-metadata).
- Google Cloud storage for efficient file storage and access: [whrc.org/soilcarbon-soilspec/storage/sc-ringtrial](https://console.cloud.google.com/storage/browser/sc-ringtrial).

## Metadata

Similar levels were grouped to a common format. All strings starts with upper case in the first letter. Spaces are replaced with dash.

The following information was prepared:

- Manufacturer: the instrument's manufacturer. Only spaces replaced with dash.  
- Model: the instrument's model. Several levels are provided, so model number was omitted to group the variation to a common model. Spaces replaced with dash.  
- Year: year the instrument was built. No modification.  
- Beamsplitter: The crystal used to split the beam for generating the interferogram. Some proprietary materials have some coatings but they were placed under the same basic material.  
- Detector: The beam detector that is part of the interferogram. Again, some variations were placed under the same basic type.  
- Mirror: mirror material as part of the interferogram..  
- Accessory: scanning accessory for DRIFT. Manufacturer and accessory name is provided. Names formatted to a common string
- Background: material used as reference for internal calibration.  
- Sample presentation: soil sample presentation (in the accessory) before scanning.  
- Neat/Mulled: additional sample preparation. Mulled = mixed with a compound to form a paste. Neat = unmixed.  
- Purged: internal gas cleaning before scanning.  

Prepared metadata:
```{r metadata, message=F, warning=F, echo=F}
read_csv("outputs/instruments_metadata_clean.csv") %>%
  knitr::kable()
```

## Pooled PCA

Pooled PCA was performend to retain 99.5% of the original variance. This resulted in 7 components.

<img src="outputs/plot_pca_scores_pooled_raw.png" width=100% heigth=100%>

<img src="outputs/plot_pca_loadings_pooled_raw.png" width=100% heigth=100%>

## Clustering analysis

<img src="outputs/plot_kmeans_aic_absolute.png" width=100% heigth=100%>

<img src="outputs/plot_kmeans_aic_relative.png" width=100% heigth=100%>

<img src="outputs/plot_kmeans_clusters.png" width=100% heigth=100%>

Proportion of instrument samples (%) belonging to spectral clusters

```{r proportion, message=F, warning=F, echo=F}
read_csv("outputs/proportions_clustering.csv") %>%
  rename(instrument = organization) %>%
  knitr::kable(digits = 3)
```

<img src="outputs/plot_kmeans_clusters_majority.png" width=100% heigth=100%>

<img src="outputs/plot_spectral_variation_clusters.png" width=100% heigth=100%>

## Correspondence analysis

Correspondence analysis is used to explore relationships among qualitative variables. Like principal component analysis, it provides a solution for summarizing and visualizing data set in two-dimension plots. It is based on the frequencies formed by two categorical data, i.e. a contingency table. Using an asymmetrical biplot, one can plot metadata information (columns) over the cluster space (rows) in order to understand the associations between the categorical levels. For this, rows are represented in principal coordinates and columns are projected with standard coordinates (row-metric-preserving).

The chi-square test of independence is used to analyze the frequency table (i.e. contingency table) formed by two categorical variables. The chi-square test evaluates whether there is a significant association between the categories of the two variables:  
- Null hypothesis (H0): the row and the column variables of the contingency table are independent.  
- Alternative hypothesis (H1): row and column variables are dependent.  

We accept H1 when p-value is below alpha (probability error of 1 or 5%)

Example of contingency table using manufacturer info.
```{r contigency_table, message=F, warning=F, echo=F}
scores <- qread(paste0(mnt.dir, "metadata/pca_majority.qs"))

metadata <- read_csv("outputs/instruments_metadata_clean.csv")

metadata <- metadata %>%
  rename(organization = code) %>%
  mutate(organization = as.factor(organization))

ca.data <- left_join(scores, metadata, by = "organization") %>%
  mutate(cluster = paste0("C", cluster)) %>%
  mutate(majority = paste0("C", majority))

ct.manufacturer <- table(ca.data$majority, ca.data$manufacturer)

knitr::kable(ct.manufacturer)
```

<img src="outputs/plot_ca_manufacturer.png" width=100% heigth=100%>

<img src="outputs/plot_ca_model.png" width=100% heigth=100%>

<img src="outputs/plot_ca_beamsplitter.png" width=100% heigth=100%>

<img src="outputs/plot_ca_detector.png" width=100% heigth=100%>

<img src="outputs/plot_ca_mirror_material.png" width=100% heigth=100%>

<img src="outputs/plot_ca_accessory.png" width=100% heigth=100%>

<img src="outputs/plot_ca_background.png" width=100% heigth=100%>

<img src="outputs/plot_ca_sample_presentation.png" width=100% heigth=100%>

<img src="outputs/plot_ca_purged.png" width=100% heigth=100%>