Example: Load Data

Install mmgenome

In case you havn’t installed the mmgenome package, it can be installed using the devtools package through github. You might need enable administrative privileges to install some of the packages if you are using windows. At the time of compiling the Biostrings package wasn’t available for R 3.1.1, hence it needs to be installed through bioconducter seperately.

install.packages("devtools")
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
devtools::install_github("MadsAlbertsen/mmgenome/mmgenome")

Load mmgenome

library("mmgenome")

Download example data

The raw example dataset used in this tutorial can be downloaded from figshare. For simplicity all raw data files are flat text files, hence take a look at them if you want to load your own custom data.

Load data

Metagenome assembly (required)

The assembly in fasta format is loaded.

assembly <- readDNAStringSet("data/assembly.fa", format = "fasta")

Coverage profiles (1 coverage profile is required)

The coverage data is loaded. Note that each coverage dataset must have the scaffold name in the first column and the coverage in the second column. The name of the example coverage datasets reflect when the data was sampled. The coverage data loaded in this example is exported directly from CLC genomics workbench and contains a number of columns that isn’t needed. Hence, we only store the relevant columns (Reference.sequence and Average.coverage).

C13.11.14 <- read.table("data/C13.11.14.csv", header = T, sep = ",")[,c("Reference.sequence", "Average.coverage")]               
C13.11.25 <- read.table("data/C13.11.25.csv", header = T, sep = ",")[,c("Reference.sequence", "Average.coverage")]               
C13.12.03 <- read.table("data/C13.12.03.csv", header = T, sep = ",")[,c("Reference.sequence", "Average.coverage")]               
C14.01.09 <- read.table("data/C14.01.09.csv", header = T, sep = ",")[,c("Reference.sequence", "Average.coverage")]

Essential genes (optional, but required for some functions)

The overview of the number of essential genes in each scaffold is loaded.

ess <- read.table("data/essential.txt", header = T, sep = " ")

Taxonomy based on essential genes (optional)

Load basic taxonomy of the scaffolds. Only scaffolds with essential genes are taxonomic classified.

tax <- read.table("data/tax.txt", header = T, sep = "\t")

Paired-end connections (optional)

Load information on which scaffolds that are connected by either paired-end (PE) or mate-pair (MP) connections.

pe <- read.table("data/network_pe.txt", header = T, sep = "\t")
mp <- read.table("data/network_mp.txt", header = T, sep = "\t")

Note: you can load any additional dataset and integrate it with the rest of the data as long as the first column contains the scaffold name.

Additional dataset: 16S rRNA information

The 16S sequences were taxonomically classified using SINA.

rRNA16S <- read.table("data/16S.txt", header = T, sep = "\t")

Additional dataset: PPS+ taxonomic classification.

All metagenome scaffolds were taxonomically classified using PhyloPythiaS+. The original output file is slightly modified after being loaded.

pps <- read.table("data/pps.txt", sep = "\t", header = F, col.names = c("scaffold","pps_root", "pps_kingdom", "pps_phylum", "pps_class", "pps_order", "pps_family", "pps_genus", "pps_species"))[,c(1,4,5,6,7,8)]
pps[pps == ""] <- NA

Merge data

The loaded data is merged to a single object using the mmload function. The assembly is used to extract scaffold length, GC content and tetranucleotide frequencies (used for a very simple PCA). All tax parameters refers to the loaded taxonomic classification of the essential genes. The tax.expand parameter converts a phylum level taxonomic classification to the underlying classes instead. The tax.freq parameter removes any classification with less than 50 entries in the complete dataset. The other parameter can be used to load any additional datasets.

d <- mmload(assembly = assembly, 
            pca = T,
            coverage = c("C13.11.14", "C13.11.25", "C13.12.03", "C14.01.09"), 
            essential = ess,           
            tax = tax,
            tax.expand = "Proteobacteria",
            tax.freq = 50,
            other = c("rRNA16S", "pps")
           )

## [1] "Loading scaffold length, coverage and gc."
## [1] "Calculating PCA."
## [1] "Loading taxonomy."
## [1] "Loading additional datasets."

## Warning: The file rRNA16S contains less scaffolds than the assembly. Missing values are treated as NA.
## Warning: The file pps contains less scaffolds than the assembly. Missing values are treated as NA.

Save data

Remove temporary data and save the generated data for easy loading.

rm(list = c("C13.11.14", "C13.11.25", "C13.12.03", "C14.01.09", "ess", "tax", "rRNA16S","pps"))
save.image(file="data.RData")