MetQy functions and usage examples – Analysis functions
The analysis family of functions is designed primarily to facilitate analysis of the output of the query_genomes_to_modules function, which generates a matrix of module completion fraction (mcf) values for the genomes and modules analysed.
MetQy features three analysis functions:
- analysis_pca_mean_distance_calculation
- analysis_pca_mean_distance_grouping
- analysis_genomes_module_output
analysis_pca_mean_distance_calculation takes the principal component matrix generated by the stats::prcomp function on the mcf matrix (query_genomes_to_modules output) and calculates the mean Euclidean distance between all points as a proxy for within-group variance.
The mean distance of p points is calculated as the sum of the individual pairwise Euclidean distances in n dimensions, divided by the total number of distances.
> library(MetQy)
> data(data_example_moduleIDs)
> data(data_example_genomeIDs)
# Calculate the module completion fraction (mcf) for the genomes and modules contained in the data objects above.
> OUT <- query_genomes_to_modules(data_example_genomeIDs,MODULE_ID = data_example_moduleIDs)
> pca <- prcomp(OUT$MATRIX)
> mean_dist <- analysis_pca_mean_distance_calculation(pca$x)
> mean_dist
# [1] 0.4805169
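The calculation described above can be approximated in base R. The following is a minimal sketch, not MetQy's actual implementation: `mean_pairwise_distance` is a hypothetical helper, and it assumes that "all points" means every unordered pair of rows, so that p points give p(p - 1)/2 distances.

```r
# Hypothetical helper (not part of MetQy): mean pairwise Euclidean distance
# between the rows of a numeric matrix, e.g. the `x` element of a prcomp result.
mean_pairwise_distance <- function(x) {
  x <- as.matrix(x)
  p <- nrow(x)
  if (p < 2) return(NA_real_)               # no pairs, so no distance to average
  d <- stats::dist(x, method = "euclidean") # one distance per unordered pair of rows
  sum(d) / (p * (p - 1) / 2)                # divide by the number of distances
}

# Three points in two dimensions with pairwise distances 5, 10 and 5
pts <- rbind(c(0, 0), c(3, 4), c(6, 8))
mean_pairwise_distance(pts)
# [1] 6.666667
```

Applied to `pca$x`, a helper like this gives a quick sanity check on the reported value, provided the package averages over the same set of pairs.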
analysis_pca_mean_distance_grouping calculates the mean Euclidean distance (by calling analysis_pca_mean_distance_calculation) for each group specified by a grouping factor, as a proxy for within-group variance. A plot of these distances can be made using plot_scatter.
> library(MetQy)
> data(data_example_moduleIDs)
> data(data_example_genomeIDs)
# Calculate the module completion fraction (mcf) for the genomes and modules contained in the data objects above.
> OUT <- query_genomes_to_modules(data_example_genomeIDs,MODULE_ID = data_example_moduleIDs)
> pca <- prcomp(OUT$MATRIX)
# Group data: assign each genome to one of five arbitrary groups (A to E)
> this_FACTOR <- rep(LETTERS[1:5], length.out = length(data_example_genomeIDs))
> mean_dist_output <- analysis_pca_mean_distance_grouping(pca$x, this_FACTOR, xLabs_angle = FALSE, Width = 2, Height = 1.5, Filename = "plot_pca_scatter.png")
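The per-group calculation can be sketched in base R by splitting the rows by the grouping factor and averaging the pairwise distances within each group. This is an illustrative sketch under assumptions, not the package's code: `mean_distance_by_group` and `mean_pairwise_distance` are hypothetical helpers, and groups with a single member are assumed to return `NA`.

```r
# Hypothetical helper (not part of MetQy): mean pairwise Euclidean distance
# between the rows of a matrix; NA when there are fewer than two rows.
mean_pairwise_distance <- function(x) {
  if (nrow(x) < 2) return(NA_real_)
  mean(as.vector(stats::dist(x, method = "euclidean")))
}

# Hypothetical helper: apply the calculation above to each group of rows.
mean_distance_by_group <- function(x, groups) {
  x <- as.matrix(x)
  stopifnot(length(groups) == nrow(x))      # one group label per row
  row_idx <- split(seq_len(nrow(x)), groups)
  vapply(row_idx,
         function(i) mean_pairwise_distance(x[i, , drop = FALSE]),
         numeric(1))
}

# Toy data: group A spans a distance of 5, group B a distance of 10
pts <- rbind(c(0, 0), c(3, 4), c(1, 1), c(7, 9))
grp <- c("A", "A", "B", "B")
by_group <- mean_distance_by_group(pts, grp)
by_group  # named vector: A = 5, B = 10
```

A larger value indicates that the members of a group are more spread out in principal component space.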
Carry out an automated analysis and report using the analysis_genomes_module_output function to analyse all the data generated by query_genomes_to_modules. Additionally, this function can take in a grouping factor to split the data corresponding to the genomes (e.g. genus, species, sample, ...).
This function will:
1) report the total number of data sets (genomes) and modules analysed,
2) generate a heatmap of the mcf of all genomes and modules analysed,
3) generate boxplots of the mcf across all genomes for each module,
4) generate a scatter plot of the standard deviation of the mcf across all genomes for each module,
5) identify any modules that have a constant (zero-variance) mcf across all genomes,
6) group the genomes by genus and make a heatmap of the mean mcf for each module and genus,
7) carry out a PCA analysis, showing the cumulative variance and a PC plot,
8) visualise the PC plot overlaying the genus grouping, and
9) measure the within-group (genus) variance, using the mean Euclidean distance as a proxy for spread.
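Several of the steps listed above can be reproduced by hand in base R, which may help when interpreting the report. The sketch below covers steps 4 to 7 only and is not how analysis_genomes_module_output is implemented: the toy `mcf` matrix and the `groups` factor are made-up values standing in for real output and a real grouping such as genus.

```r
# Toy mcf matrix (made-up values): 4 genomes (rows) x 3 modules (columns)
mcf <- matrix(c(1.0, 0.5, 0.0,
                1.0, 0.0, 0.5,
                1.0, 1.0, 0.5,
                1.0, 0.5, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("genome", 1:4), paste0("module", 1:3)))
groups <- factor(c("A", "A", "B", "B"))  # hypothetical grouping, e.g. genus

# Step 4: standard deviation of the mcf across all genomes for each module
module_sd <- apply(mcf, 2, sd)

# Step 5: modules whose mcf is constant (zero variance) across all genomes
constant_modules <- names(module_sd)[module_sd == 0]

# Step 6: mean mcf for each module and group (group sums / group sizes)
group_means <- rowsum(mcf, groups) / as.vector(table(groups))

# Step 7: PCA and the cumulative proportion of variance explained
pca <- prcomp(mcf)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
```

Here `constant_modules` contains only "module1", and `group_means` has one row per group and one column per module.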
See the worked-out biological example (specifically part 6) and the report generated in this section.