This guide showcase a few of the basic functions in the ampvis
package. I’m greatfull for the awesome R packages dplyr
, vegan
, ggplot2
and phyloseq
which makes up the backbone ampvis
.
A few packages needs to be installed before ampvis can be installed. Make sure you run RStudio as Administrator
to enable installation of all packages. If you are asked to update packages, then press “a” for all.
install.packages("devtools")
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
biocLite("DESeq2")
biocLite("phyloseq")
Ampvis is installed directly from github using the devtools package.
library("devtools")
install_github("MadsAlbertsen/ampvis")
library(ampvis)
Ampvis works directly on phyloseq objects. However, for internal compability data generated with workflow scripts v.4+ can be loaded using the amp_load
function. It converts the data into a single phyloseq object.
otutable <- read.delim(file = "data/otutable.txt", sep = "\t", header = T, check.names = F, row.names = 1)
metadata <- read.delim(file = "data/metadata.txt", header = T, sep = "\t")
refseq <- readDNAStringSet("data/otus.fa", format = "fasta")
… and then converted to a single phyloseq object using the amp_load
function.
d <- amp_load(otutable = otutable,
metadata = metadata,
refseq = refseq)
The MiDAS dataset is directly included in ampvis and can be loaded using the data()
function.
data(MiDAS_1.20)
The data is a phyloseq object with 574 samples from Danish wastewater treatment plants, which have been sampled up to 4 times per year since 2006.
MiDAS_1.20
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 14791 taxa and 574 samples ]
## sample_data() Sample Data: [ 574 samples by 10 sample variables ]
## tax_table() Taxonomy Table: [ 14791 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 14791 reference sequences ]
The principle behind ampvis is that you first subset the data to what you want to look at using phyloseq
and then visualise it using ampvis
. Samples can be subset based on any available metadata. See the phyloseq guide for more examples.
sample_variables(MiDAS_1.20)
## [1] "SampleID" "Plant" "Date" "DeltaDays" "Year"
## [6] "Period" "LIB.ID" "Latitude" "Longitude" "RawData"
Here we’ll subset to samples from the wastewater treatment plants Aalborg West
and Aalborg East
using the Plant
variable in the metadata and store the data in the object ds
.
ds <- subset_samples(MiDAS_1.20, Plant %in% c("Aalborg West", "Aalborg East"))
ds
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 14791 taxa and 52 samples ]
## sample_data() Sample Data: [ 52 samples by 10 sample variables ]
## tax_table() Taxonomy Table: [ 14791 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 14791 reference sequences ]
The included data have different number of sequences per sample. If the purpose is to make heatmaps or boxplots, it is reccomended to convert the abundances to percentages instead using the transform_sample_counts
function.
dsn <- transform_sample_counts(ds, function(x) x / sum(x) * 100)
If the purpose is to make ordination it is recommended to rarefy to the same number of reads using the rarefy_even_depth
function.
dsr <- rarefy_even_depth(ds, sample.size = 10000, rngseed = 712)
If doing ordination, it is recommended to first remove low abundant species. This can be done using the filter_taxa
function from phyloseq. Here we keep OTUs that have been seen more than 9 times (of 10000) in at least 1 sample.
dsf <- filter_taxa(ds, function(x) max(x) >= 10, TRUE)
dsf
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1496 taxa and 52 samples ]
## sample_data() Sample Data: [ 52 samples by 10 sample variables ]
## tax_table() Taxonomy Table: [ 1496 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 1496 reference sequences ]
The data can also be limited to specific taxonomic groups. Here I’ve limited the analysis to the genus Tetrasphaera using the subset_taxa
function.
dsl <- subset_taxa(ds, Genus == "g__Tetrasphaera")
dsl
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 29 taxa and 52 samples ]
## sample_data() Sample Data: [ 52 samples by 10 sample variables ]
## tax_table() Taxonomy Table: [ 29 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 29 reference sequences ]
The above functions can be combined more elegantly in order to make fewer temporary files using the then
operator %>%
from the magrittr
package. It takes the output from a given function and feeds it to the next. It is often called piping or chaining.
dsrf <- subset_samples(MiDAS_1.20, Plant %in% c("Aalborg West", "Aalborg East")) %>%
rarefy_even_depth(sample.size = 10000, rngseed = 712) %>%
filter_taxa(function(x) max(x) >= 10, TRUE)
Looking at phylum level (default) on Aalborg East (AAE)
and Aalborg West (AAW)
from 2006 to 2013. Proteobacteria have been split into their respective classes. Samples can be grouped based on any metadata variable or combination of vairables. All functions have a basic help file that can be acessed using e.g. ?amp_heatmap
. The scale.seq
option is used to indicate that we are using percentages instead.
amp_heatmap(data = dsn,
group = c("Plant", "Year"),
tax.class = "p__Proteobacteria",
scale.seq = 100)
The tax.aggregate
parameter controls which level the taxonomic information is shown on. Here the 25 most abundant genera are shown. The data is sorted based on average abundance across all groups of samples. In addition, we scale the color using log10 instead of the default square root, using the plot.colorscale
option.
amp_heatmap(data = dsn,
group = c("Plant", "Year"),
tax.class = "p__Proteobacteria",
scale.seq = 100,
tax.aggregate = "Genus",
tax.show = 25,
plot.colorscale = "log10")
You can always add higher taxonomic ranks to the names using the tax.add
parameter. Here we look at the 25 most abundant OTUs and label each OTU with their respective genus and phylum name. If you don’t like the numbers, you can just remove them using the plot.numbers
option.
amp_heatmap(data = dsn,
group = c("Plant", "Year"),
tax.class = "p__Proteobacteria",
scale.seq = 100,
tax.aggregate = "OTU",
tax.add = c("Phylum", "Genus"),
tax.show = 25,
plot.colorscale = "log10",
plot.na = T,
plot.numbers = F)
The amp_heatmap
function always displays the average if multiple samples are present in each group. Here the average composition in AAW
and AAE
is shown.
amp_heatmap(data = dsn,
group = "Plant",
tax.class = "p__Proteobacteria",
scale.seq = 100,
tax.aggregate = "Genus",
tax.add = c("Phylum"),
tax.show = 25,
plot.colorscale = "log10",
plot.na = T)
The heatmaps always show the average of the samples. The function amp_rabund
(short for Rank Abundance) can be used to generate boxplots. The syntax is similar to amp_heatmap
.
amp_rabund(data = dsn,
tax.aggregate = "Genus",
tax.add = "Phylum",
scale.seq = 100)
Boxplots can also be grouped.
amp_rabund(data = dsn,
tax.aggregate = "Genus",
tax.add = "Phylum",
scale.seq = 100,
tax.show = 10,
group = "Plant",
tax.class = "p__Proteobacteria")
The amp_rabund
can also be used to generate cumulative rank abundance curves using the plot.type
variable. These can be used to compare the diversity between samples. I.e. how many OTUs constitute 75% of all reads in the sample.
amp_rabund(data = dsn,
plot.type = "curve",
tax.aggregate = "OTU",
scale.seq = 100,
group = "Plant",
plot.log = T)
The function amp_ordinate
can be used to make simple PCA plots (and NMDS) of your data. Notice that the PCA uses the square root transformed OTU counts directly and not e.g. bray.curtis similarities. The amp_ordinate
uses the vegan
package for ordination.
amp_ordinate(data = dsrf,
plot.color = "Plant")
The plot.group
variable can be used to make the groupings more easy to see.
amp_ordinate(data = dsrf,
plot.color = "Plant",
plot.group = "chull")
There is a large number of ways to visualise the data. Here a few basic options are shown. You can for example add the n most influencial species (using genus classification by default).
amp_ordinate(data = dsrf,
plot.color = "Plant",
plot.group = "chull",
plot.nspecies = 10)
The trajectory
and trajectory.group
options can be used to trace the development over time of different groups of samples in the same plot. In addtion the group names can be directly added to the plot using the plot.group.label
. It plots the name of the group in the centroid of all samples in the group.
amp_ordinate(data = dsrf,
plot.color = "Year",
trajectory = "Date",
trajectory.group = "Plant",
plot.group.label = "Plant")
All functions have a plot.theme
option that can be used to change the overall theme of the plots. The plot.theme = clean
option is often suited for publication. In addition, all plots are ggplot2
objects and can be easily manipulated. Here, the legend is removed and the limits of the x-axis is changed.
amp_ordinate(data = dsrf,
plot.color = "Year",
trajectory = "Date",
trajectory.group = "Plant",
plot.group.label = "Plant",
plot.theme = "clean") +
theme(legend.position = "none") +
xlim(-3,4)
Another option is to test for significance of environmental factors (uses the vegan
envfit function). Here we also use the output = "complete"
option to get the data behind the plot, so we can see the significance of the tested variables. Note: envifit only test if there is a correlation to the PC’s you display, here PC1 and PC2. There might be correlations hidden in the other PC’s.
res <- amp_ordinate(data = dsrf,
plot.color = "Plant",
envfit.factor = c("Plant","Period"),
output = "complete")
You can access plots, data and modelling results in the res
object. Here we take a look to see if the fitting of Plant and Peroid was significant based on a permutation test.
res$eff.model
##
## ***FACTORS:
##
## Centroids:
## PC1 PC2
## PlantAalborg East -2.1760 -0.2102
## PlantAalborg West 2.5386 0.2453
## PeriodFall -0.1496 -0.8700
## PeriodSpring 0.0342 0.5931
## PeriodSummer 0.2758 -0.9135
## PeriodWinter -0.1334 1.2523
##
## Goodness of fit:
## r2 Pr(>r)
## Plant 0.4779 0.001 ***
## Period 0.0805 0.218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
Finally you can constrain the PCA and look for relationships to specific variables. Here we can see that there seem to be a shift in both plants to more Dechloromonas from 2006 to 2013.
amp_ordinate(data = dsrf,
plot.color = "Year",
constrain = "Year",
plot.nspecies = 10)
The amp_core
function can be used to investigate the core community in your samples. The default plot is a frequency plot
that groups the OTUs based on how many samples they are seen in. If weight
is set to TRUE
then each OTU is weighted by their abundance.
amp_core(data = dsn,
plot.type = "frequency",
weight = T,
scale.seq = 100)
..if not it simply display the number of OTUs in each group.
amp_core(data = dsn,
plot.type = "frequency",
weight = F)
The amp_core
function also enables more refined core plots. Here, each OTU is plottted as “Abundant or not” vs. their frequency. I’ve set the “Abundant cirteira” as 0.1 % abundance.
amp_core(data=dsn,
plot.type = "core",
abund.treshold = 0.1,
scale.seq = 100)
All functions have an option to export the plot and the processed data. This is particulary usefull if you want to get the names and sequences of the “core” species. Here we store the results in the res
object.
res <- amp_core(data=dsn,
plot.type = "core",
abund.treshold = 0.1,
output = "complete",
scale.seq = 100)
The processed data is stored as data
in the output. Hence, to get the species that are abundant and seen in all samples we filter the data by Frequency
and freq_A
. Then we select the data we want to see, and finally sort it based on abundance.
core_OTUs <- filter(res$data, Frequency > 51, freq_A > 51) %>%
select(OTU, Abundance, Phylum, Genus) %>%
arrange(desc(Abundance))
print(core_OTUs)
## OTU Abundance Phylum Genus
## 1 MiDAS_3 5.88 Firmicutes Trichococcus
## 2 MiDAS_1 5.81 Actinobacteria Tetrasphaera
## 3 MiDAS_2 2.96 Actinobacteria Candidatus Microthrix
## 4 MiDAS_6 1.73 Actinobacteria Fodinibacter
## 5 MiDAS_7 1.31 Actinobacteria Candidatus Microthrix
## 6 MiDAS_19 1.06 Proteobacteria Acidovorax
## 7 MiDAS_8 1.01 Proteobacteria Rhodobacter
## 8 MiDAS_55 0.83 Proteobacteria Sphingopyxis
## 9 MiDAS_12 0.82 Bacteroidetes Candidatus Epiflobacter
## 10 MiDAS_42 0.73 Firmicutes p-55-a5
## 11 MiDAS_16 0.69 Proteobacteria Hyphomicrobium
## 12 MiDAS_32 0.69 Proteobacteria Rhodoferax
## 13 MiDAS_57 0.69 Proteobacteria 188up
## 14 MiDAS_28 0.68 Proteobacteria f__Bradyrhizobiaceae_MiDAS_28
## 15 MiDAS_20 0.63 Proteobacteria Rhodobacter
## 16 MiDAS_37 0.61 Proteobacteria f__Phyllobacteriaceae_MiDAS_37
## 17 MiDAS_69 0.53 Proteobacteria QEEB1BB10
## 18 MiDAS_60 0.45 Actinobacteria Leucobacter
## 19 MiDAS_22 0.36 Actinobacteria HTG5
## 20 MiDAS_58 0.34 Proteobacteria Paracoccus
## 21 MiDAS_30 0.28 Proteobacteria MNG7
## 22 MiDAS_119 0.27 Firmicutes Clostridium sensu stricto 1
## 23 MiDAS_106 0.25 Bacteroidetes PHOS-HE31
## 24 MiDAS_153 0.24 Firmicutes p-55-a5
As a last thing we want to extract the OTU sequences for further work. This can be done using the amp_export
function. It simply takes a phyloseq object and exports the associated sequences. However, first we subset the phyloseq object to our “core OTUs” using the prune_taxa
function from phyloseq
.
core_p <- prune_taxa(x = dsn, taxa = as.character(core_OTUs$OTU))
.. then the sequences are exported.
amp_export(data = core_p, file = "my_sequences.fa")