This guide showcase a few of the basic functions in the ampvis package. I’m greatfull for the awesome R packages dplyr, vegan, ggplot2 and phyloseq which makes up the backbone ampvis.

Installation

Install R, Rtools and Rstudio

Install the latest version of R, Rtools and Rstudio.

Install ampvis

A few packages needs to be installed before ampvis can be installed. Make sure you run RStudio as Administrator to enable installation of all packages. If you are asked to update packages, then press “a” for all.

install.packages("devtools")
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
biocLite("DESeq2")
biocLite("phyloseq")

Ampvis is installed directly from github using the devtools package.

library("devtools")
install_github("MadsAlbertsen/ampvis")

Load ampvis and import data

Load ampvis

library(ampvis)

Load data

Ampvis works directly on phyloseq objects. However, for internal compability data generated with workflow scripts v.4+ can be loaded using the amp_load function. It converts the data into a single phyloseq object.

otutable <- read.delim(file = "data/otutable.txt", sep = "\t", header = T, check.names = F, row.names = 1)
metadata <- read.delim(file = "data/metadata.txt", header = T, sep = "\t")
refseq <- readDNAStringSet("data/otus.fa", format = "fasta")

… and then converted to a single phyloseq object using the amp_load function.

d <- amp_load(otutable = otutable, 
              metadata = metadata, 
              refseq = refseq)

MiDAS data

The MiDAS dataset is directly included in ampvis and can be loaded using the data() function.

data(MiDAS_1.20)

The data is a phyloseq object with 574 samples from Danish wastewater treatment plants, which have been sampled up to 4 times per year since 2006.

MiDAS_1.20

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 14791 taxa and 574 samples ]
## sample_data() Sample Data:       [ 574 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 14791 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 14791 reference sequences ]

Analyse the data

Subset and filtering

The principle behind ampvis is that you first subset the data to what you want to look at using phyloseq and then visualise it using ampvis. Samples can be subset based on any available metadata. See the phyloseq guide for more examples.

sample_variables(MiDAS_1.20)

##  [1] "SampleID"  "Plant"     "Date"      "DeltaDays" "Year"     
##  [6] "Period"    "LIB.ID"    "Latitude"  "Longitude" "RawData"

Here we’ll subset to samples from the wastewater treatment plants Aalborg West and Aalborg East using the Plant variable in the metadata and store the data in the object ds.

ds <- subset_samples(MiDAS_1.20, Plant %in% c("Aalborg West", "Aalborg East"))
ds

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 14791 taxa and 52 samples ]
## sample_data() Sample Data:       [ 52 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 14791 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 14791 reference sequences ]

The included data have different number of sequences per sample. If the purpose is to make heatmaps or boxplots, it is reccomended to convert the abundances to percentages instead using the transform_sample_counts function.

dsn <- transform_sample_counts(ds, function(x) x / sum(x) * 100)

If the purpose is to make ordination it is recommended to rarefy to the same number of reads using the rarefy_even_depth function.

dsr <- rarefy_even_depth(ds, sample.size = 10000, rngseed = 712)

If doing ordination, it is recommended to first remove low abundant species. This can be done using the filter_taxa function from phyloseq. Here we keep OTUs that have been seen more than 9 times (of 10000) in at least 1 sample.

dsf <- filter_taxa(ds, function(x) max(x) >= 10, TRUE)
dsf

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1496 taxa and 52 samples ]
## sample_data() Sample Data:       [ 52 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 1496 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 1496 reference sequences ]

The data can also be limited to specific taxonomic groups. Here I’ve limited the analysis to the genus Tetrasphaera using the subset_taxa function.

dsl <- subset_taxa(ds, Genus == "g__Tetrasphaera")
dsl

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 29 taxa and 52 samples ]
## sample_data() Sample Data:       [ 52 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 29 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 29 reference sequences ]

Chaining it all together

The above functions can be combined more elegantly in order to make fewer temporary files using the then operator %>% from the magrittr package. It takes the output from a given function and feeds it to the next. It is often called piping or chaining.

dsrf <- subset_samples(MiDAS_1.20, Plant %in% c("Aalborg West", "Aalborg East")) %>%
        rarefy_even_depth(sample.size = 10000, rngseed = 712) %>%
        filter_taxa(function(x) max(x) >= 10, TRUE)

Basic ampvis functions

Heatmap

Looking at phylum level (default) on Aalborg East (AAE) and Aalborg West (AAW) from 2006 to 2013. Proteobacteria have been split into their respective classes. Samples can be grouped based on any metadata variable or combination of vairables. All functions have a basic help file that can be acessed using e.g. ?amp_heatmap. The scale.seq option is used to indicate that we are using percentages instead.

amp_heatmap(data = dsn, 
            group = c("Plant", "Year"), 
            tax.class = "p__Proteobacteria",
            scale.seq = 100)

The tax.aggregate parameter controls which level the taxonomic information is shown on. Here the 25 most abundant genera are shown. The data is sorted based on average abundance across all groups of samples. In addition, we scale the color using log10 instead of the default square root, using the plot.colorscale option.

amp_heatmap(data = dsn, 
            group = c("Plant", "Year"), 
            tax.class = "p__Proteobacteria",
            scale.seq = 100,
            tax.aggregate = "Genus",
            tax.show = 25,
            plot.colorscale = "log10")

You can always add higher taxonomic ranks to the names using the tax.add parameter. Here we look at the 25 most abundant OTUs and label each OTU with their respective genus and phylum name. If you don’t like the numbers, you can just remove them using the plot.numbers option.

amp_heatmap(data = dsn, 
            group = c("Plant", "Year"), 
            tax.class = "p__Proteobacteria",
            scale.seq = 100,
            tax.aggregate = "OTU",
            tax.add = c("Phylum", "Genus"),
            tax.show = 25,
            plot.colorscale = "log10",
            plot.na = T,
            plot.numbers = F)

The amp_heatmap function always displays the average if multiple samples are present in each group. Here the average composition in AAW and AAE is shown.

amp_heatmap(data = dsn, 
            group = "Plant", 
            tax.class = "p__Proteobacteria",
            scale.seq = 100,
            tax.aggregate = "Genus",
            tax.add = c("Phylum"),
            tax.show = 25,
            plot.colorscale = "log10",
            plot.na = T)

Boxplots

The heatmaps always show the average of the samples. The function amp_rabund (short for Rank Abundance) can be used to generate boxplots. The syntax is similar to amp_heatmap.

amp_rabund(data = dsn,
           tax.aggregate = "Genus",
           tax.add = "Phylum",
           scale.seq = 100)

Boxplots can also be grouped.

amp_rabund(data = dsn,
           tax.aggregate = "Genus",
           tax.add = "Phylum", 
           scale.seq = 100,
           tax.show = 10,
           group = "Plant",
           tax.class = "p__Proteobacteria")

The amp_rabund can also be used to generate cumulative rank abundance curves using the plot.type variable. These can be used to compare the diversity between samples. I.e. how many OTUs constitute 75% of all reads in the sample.

amp_rabund(data = dsn,
           plot.type = "curve",
           tax.aggregate = "OTU",
           scale.seq = 100,
           group = "Plant",
           plot.log = T)

PCA

The function amp_ordinate can be used to make simple PCA plots (and NMDS) of your data. Notice that the PCA uses the square root transformed OTU counts directly and not e.g. bray.curtis similarities. The amp_ordinate uses the vegan package for ordination.

amp_ordinate(data = dsrf, 
             plot.color = "Plant")

The plot.group variable can be used to make the groupings more easy to see.

amp_ordinate(data = dsrf, 
             plot.color = "Plant", 
             plot.group = "chull")

There is a large number of ways to visualise the data. Here a few basic options are shown. You can for example add the n most influencial species (using genus classification by default).

amp_ordinate(data = dsrf, 
             plot.color = "Plant", 
             plot.group = "chull",
             plot.nspecies = 10)

The trajectory and trajectory.group options can be used to trace the development over time of different groups of samples in the same plot. In addtion the group names can be directly added to the plot using the plot.group.label. It plots the name of the group in the centroid of all samples in the group.

amp_ordinate(data = dsrf, 
             plot.color = "Year", 
             trajectory = "Date", 
             trajectory.group = "Plant",
             plot.group.label = "Plant")

All functions have a plot.theme option that can be used to change the overall theme of the plots. The plot.theme = clean option is often suited for publication. In addition, all plots are ggplot2 objects and can be easily manipulated. Here, the legend is removed and the limits of the x-axis is changed.

amp_ordinate(data = dsrf, 
             plot.color = "Year", 
             trajectory = "Date", 
             trajectory.group = "Plant",
             plot.group.label = "Plant",
             plot.theme = "clean") +
  theme(legend.position = "none") +
  xlim(-3,4)

Another option is to test for significance of environmental factors (uses the vegan envfit function). Here we also use the output = "complete" option to get the data behind the plot, so we can see the significance of the tested variables. Note: envifit only test if there is a correlation to the PC’s you display, here PC1 and PC2. There might be correlations hidden in the other PC’s.

res <- amp_ordinate(data = dsrf, 
                    plot.color = "Plant", 
                    envfit.factor = c("Plant","Period"), 
                    output = "complete")

You can access plots, data and modelling results in the res object. Here we take a look to see if the fitting of Plant and Peroid was significant based on a permutation test.

res$eff.model

## 
## ***FACTORS:
## 
## Centroids:
##                       PC1     PC2
## PlantAalborg East -2.1760 -0.2102
## PlantAalborg West  2.5386  0.2453
## PeriodFall        -0.1496 -0.8700
## PeriodSpring       0.0342  0.5931
## PeriodSummer       0.2758 -0.9135
## PeriodWinter      -0.1334  1.2523
## 
## Goodness of fit:
##            r2 Pr(>r)    
## Plant  0.4779  0.001 ***
## Period 0.0805  0.218    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

Finally you can constrain the PCA and look for relationships to specific variables. Here we can see that there seem to be a shift in both plants to more Dechloromonas from 2006 to 2013.

amp_ordinate(data = dsrf, 
             plot.color = "Year", 
             constrain = "Year", 
             plot.nspecies = 10)

Core plots

The amp_core function can be used to investigate the core community in your samples. The default plot is a frequency plot that groups the OTUs based on how many samples they are seen in. If weight is set to TRUE then each OTU is weighted by their abundance.

amp_core(data = dsn, 
         plot.type = "frequency", 
         weight = T,
         scale.seq = 100)

..if not it simply display the number of OTUs in each group.

amp_core(data = dsn,
         plot.type = "frequency", 
         weight = F)

The amp_core function also enables more refined core plots. Here, each OTU is plottted as “Abundant or not” vs. their frequency. I’ve set the “Abundant cirteira” as 0.1 % abundance.

amp_core(data=dsn, 
         plot.type = "core", 
         abund.treshold = 0.1,
         scale.seq = 100)

All functions have an option to export the plot and the processed data. This is particulary usefull if you want to get the names and sequences of the “core” species. Here we store the results in the res object.

res <- amp_core(data=dsn, 
                plot.type = "core", 
                abund.treshold = 0.1, 
                output = "complete",
                scale.seq = 100)

The processed data is stored as data in the output. Hence, to get the species that are abundant and seen in all samples we filter the data by Frequency and freq_A. Then we select the data we want to see, and finally sort it based on abundance.

core_OTUs <- filter(res$data, Frequency > 51, freq_A > 51) %>% 
             select(OTU, Abundance, Phylum, Genus) %>%
             arrange(desc(Abundance))

print(core_OTUs)

##          OTU Abundance         Phylum                          Genus
## 1    MiDAS_3      5.88     Firmicutes                   Trichococcus
## 2    MiDAS_1      5.81 Actinobacteria                   Tetrasphaera
## 3    MiDAS_2      2.96 Actinobacteria          Candidatus Microthrix
## 4    MiDAS_6      1.73 Actinobacteria                   Fodinibacter
## 5    MiDAS_7      1.31 Actinobacteria          Candidatus Microthrix
## 6   MiDAS_19      1.06 Proteobacteria                     Acidovorax
## 7    MiDAS_8      1.01 Proteobacteria                    Rhodobacter
## 8   MiDAS_55      0.83 Proteobacteria                   Sphingopyxis
## 9   MiDAS_12      0.82  Bacteroidetes        Candidatus Epiflobacter
## 10  MiDAS_42      0.73     Firmicutes                        p-55-a5
## 11  MiDAS_16      0.69 Proteobacteria                 Hyphomicrobium
## 12  MiDAS_32      0.69 Proteobacteria                     Rhodoferax
## 13  MiDAS_57      0.69 Proteobacteria                          188up
## 14  MiDAS_28      0.68 Proteobacteria  f__Bradyrhizobiaceae_MiDAS_28
## 15  MiDAS_20      0.63 Proteobacteria                    Rhodobacter
## 16  MiDAS_37      0.61 Proteobacteria f__Phyllobacteriaceae_MiDAS_37
## 17  MiDAS_69      0.53 Proteobacteria                      QEEB1BB10
## 18  MiDAS_60      0.45 Actinobacteria                    Leucobacter
## 19  MiDAS_22      0.36 Actinobacteria                           HTG5
## 20  MiDAS_58      0.34 Proteobacteria                     Paracoccus
## 21  MiDAS_30      0.28 Proteobacteria                           MNG7
## 22 MiDAS_119      0.27     Firmicutes    Clostridium sensu stricto 1
## 23 MiDAS_106      0.25  Bacteroidetes                      PHOS-HE31
## 24 MiDAS_153      0.24     Firmicutes                        p-55-a5

As a last thing we want to extract the OTU sequences for further work. This can be done using the amp_export function. It simply takes a phyloseq object and exports the associated sequences. However, first we subset the phyloseq object to our “core OTUs” using the prune_taxa function from phyloseq.

core_p <- prune_taxa(x = dsn, taxa = as.character(core_OTUs$OTU))

.. then the sequences are exported.

amp_export(data = core_p, file = "my_sequences.fa")

Ampvis basic guide

Mads Albertsen

March 12, 2015