This guide showcase a few of the basic functions in the ampvis package. I’m greatfull for the awesome R packages dplyr, vegan, ggplot2 and phyloseq which makes up the backbone ampvis.

Installation

Install R, Rtools and Rstudio

Install the latest version of R, Rtools and Rstudio.

Install ampvis

A few packages needs to be installed before ampvis can be installed. Make sure you run RStudio as Administrator to enable installation of all packages. If you are asked to update packages, then press “a” for all.

install.packages("devtools")
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
biocLite("DESeq2")
biocLite("phyloseq")

Ampvis is installed directly from github using the devtools package.

library("devtools")
install_github("MadsAlbertsen/ampvis")

Load ampvis and import data

Load ampvis

library(ampvis)

Load data

Ampvis works directly on phyloseq objects. However, for internal compability data generated with workflow scripts v.4+ can be loaded using the amp_load function. It converts the data into a single phyloseq object.

otutable <- read.delim(file = "data/otutable.txt", sep = "\t", header = T, check.names = F, row.names = 1)
metadata <- read.delim(file = "data/metadata.txt", header = T, sep = "\t")
refseq <- readDNAStringSet("data/otus.fa", format = "fasta")

… and then converted to a single phyloseq object using the amp_load function.

d <- amp_load(otutable = otutable, 
              metadata = metadata, 
              refseq = refseq)

MiDAS data

The MiDAS dataset is directly included in ampvis and can be loaded using the data() function.

data(MiDAS_1.20)

The data is a phyloseq object with 574 samples from Danish wastewater treatment plants, which have been sampled up to 4 times per year since 2006.

MiDAS_1.20
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 14791 taxa and 574 samples ]
## sample_data() Sample Data:       [ 574 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 14791 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 14791 reference sequences ]

Analyse the data

Subset and filtering

The principle behind ampvis is that you first subset the data to what you want to look at using phyloseq and then visualise it using ampvis. Samples can be subset based on any available metadata. See the phyloseq guide for more examples.

sample_variables(MiDAS_1.20)
##  [1] "SampleID"  "Plant"     "Date"      "DeltaDays" "Year"     
##  [6] "Period"    "LIB.ID"    "Latitude"  "Longitude" "RawData"

Here we’ll subset to samples from the wastewater treatment plants Aalborg West and Aalborg East using the Plant variable in the metadata and store the data in the object ds.

ds <- subset_samples(MiDAS_1.20, Plant %in% c("Aalborg West", "Aalborg East"))
ds
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 14791 taxa and 52 samples ]
## sample_data() Sample Data:       [ 52 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 14791 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 14791 reference sequences ]

The included data have different number of sequences per sample. If the purpose is to make heatmaps or boxplots, it is reccomended to convert the abundances to percentages instead using the transform_sample_counts function.

dsn <- transform_sample_counts(ds, function(x) x / sum(x) * 100)

If the purpose is to make ordination it is recommended to rarefy to the same number of reads using the rarefy_even_depth function.

dsr <- rarefy_even_depth(ds, sample.size = 10000, rngseed = 712)

If doing ordination, it is recommended to first remove low abundant species. This can be done using the filter_taxa function from phyloseq. Here we keep OTUs that have been seen more than 9 times (of 10000) in at least 1 sample.

dsf <- filter_taxa(ds, function(x) max(x) >= 10, TRUE)
dsf
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1496 taxa and 52 samples ]
## sample_data() Sample Data:       [ 52 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 1496 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 1496 reference sequences ]

The data can also be limited to specific taxonomic groups. Here I’ve limited the analysis to the genus Tetrasphaera using the subset_taxa function.

dsl <- subset_taxa(ds, Genus == "g__Tetrasphaera")
dsl       
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 29 taxa and 52 samples ]
## sample_data() Sample Data:       [ 52 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 29 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 29 reference sequences ]

Chaining it all together

The above functions can be combined more elegantly in order to make fewer temporary files using the then operator %>% from the magrittr package. It takes the output from a given function and feeds it to the next. It is often called piping or chaining.

dsrf <- subset_samples(MiDAS_1.20, Plant %in% c("Aalborg West", "Aalborg East")) %>%
        rarefy_even_depth(sample.size = 10000, rngseed = 712) %>%
        filter_taxa(function(x) max(x) >= 10, TRUE)

Basic ampvis functions

Heatmap

Looking at phylum level (default) on Aalborg East (AAE) and Aalborg West (AAW) from 2006 to 2013. Proteobacteria have been split into their respective classes. Samples can be grouped based on any metadata variable or combination of vairables. All functions have a basic help file that can be acessed using e.g. ?amp_heatmap. The scale.seq option is used to indicate that we are using percentages instead.

amp_heatmap(data = dsn, 
            group = c("Plant", "Year"), 
            tax.class = "p__Proteobacteria",
            scale.seq = 100)

The tax.aggregate parameter controls which level the taxonomic information is shown on. Here the 25 most abundant genera are shown. The data is sorted based on average abundance across all groups of samples. In addition, we scale the color using log10 instead of the default square root, using the plot.colorscale option.

amp_heatmap(data = dsn, 
            group = c("Plant", "Year"), 
            tax.class = "p__Proteobacteria",
            scale.seq = 100,
            tax.aggregate = "Genus",
            tax.show = 25,
            plot.colorscale = "log10")