In case you havn’t installed the mmgenome package, it can be installed using the devtools
package through github. You might need enable administrative privileges to install some of the packages if you are using windows. At the time of compiling the Biostrings
package wasn’t available for R 3.1.1
, hence it needs to be installed through bioconducter seperately.
install.packages("devtools")
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
devtools::install_github("MadsAlbertsen/mmgenome/mmgenome")
library("mmgenome")
The raw example dataset used in this tutorial can be downloaded from figshare. For simplicity all raw data files are flat text files, hence take a look at them if you want to load your own custom data.
The assembly in fasta format is loaded.
assembly <- readDNAStringSet("data/assembly.fa", format = "fasta")
The coverage data is loaded. Note that each coverage dataset must have the scaffold name in the first column and the coverage in the second column. The name of the example coverage datasets reflect when the data was sampled. The coverage data loaded in this example is exported directly from CLC genomics workbench and contains a number of columns that isn’t needed. Hence, we only store the relevant columns (Reference.sequence
and Average.coverage
).
C13.11.14 <- read.table("data/C13.11.14.csv", header = T, sep = ",")[,c("Reference.sequence", "Average.coverage")]
C13.11.25 <- read.table("data/C13.11.25.csv", header = T, sep = ",")[,c("Reference.sequence", "Average.coverage")]
C13.12.03 <- read.table("data/C13.12.03.csv", header = T, sep = ",")[,c("Reference.sequence", "Average.coverage")]
C14.01.09 <- read.table("data/C14.01.09.csv", header = T, sep = ",")[,c("Reference.sequence", "Average.coverage")]
The overview of the number of essential genes in each scaffold is loaded.
ess <- read.table("data/essential.txt", header = T, sep = " ")
Load basic taxonomy of the scaffolds. Only scaffolds with essential genes are taxonomic classified.
tax <- read.table("data/tax.txt", header = T, sep = "\t")
Load information on which scaffolds that are connected by either paired-end (PE) or mate-pair (MP) connections.
pe <- read.table("data/network_pe.txt", header = T, sep = "\t")
mp <- read.table("data/network_mp.txt", header = T, sep = "\t")
Note: you can load any additional dataset and integrate it with the rest of the data as long as the first column contains the scaffold name.
The 16S sequences were taxonomically classified using SINA.
rRNA16S <- read.table("data/16S.txt", header = T, sep = "\t")
All metagenome scaffolds were taxonomically classified using PhyloPythiaS+. The original output file is slightly modified after being loaded.
pps <- read.table("data/pps.txt", sep = "\t", header = F, col.names = c("scaffold","pps_root", "pps_kingdom", "pps_phylum", "pps_class", "pps_order", "pps_family", "pps_genus", "pps_species"))[,c(1,4,5,6,7,8)]
pps[pps == ""] <- NA
The loaded data is merged to a single object using the mmload
function. The assembly is used to extract scaffold length, GC content and tetranucleotide frequencies (used for a very simple PCA). All tax
parameters refers to the loaded taxonomic classification of the essential genes. The tax.expand
parameter converts a phylum level taxonomic classification to the underlying classes instead. The tax.freq
parameter removes any classification with less than 50 entries in the complete dataset. The other
parameter can be used to load any additional datasets.
d <- mmload(assembly = assembly,
pca = T,
coverage = c("C13.11.14", "C13.11.25", "C13.12.03", "C14.01.09"),
essential = ess,
tax = tax,
tax.expand = "Proteobacteria",
tax.freq = 50,
other = c("rRNA16S", "pps")
)
## [1] "Loading scaffold length, coverage and gc."
## [1] "Calculating PCA."
## [1] "Loading taxonomy."
## [1] "Loading additional datasets."
## Warning: The file rRNA16S contains less scaffolds than the assembly. Missing values are treated as NA.
## Warning: The file pps contains less scaffolds than the assembly. Missing values are treated as NA.
Remove temporary data and save the generated data for easy loading.
rm(list = c("C13.11.14", "C13.11.25", "C13.12.03", "C14.01.09", "ess", "tax", "rRNA16S","pps"))
save.image(file="data.RData")