Load packages

In case you havn’t installed all the needed packages, they can be installed via e.g. install_github("MadsAlbertsen/mmgenome/mmgenome").

library("mmgenome")
options(scipen = 8)

Load data

The assembly in fasta format is loaded in fasta using the Biostrings package.

assembly <- readDNAStringSet("data/assembly.fa", format = "fasta")

The coverage data is loaded. Note that each coverage dataset must have the scaffold name in the first column and the coverage in the second coloumn. The name of each coverage dataset reflect when the data was sampled.

CTAB <- read.table("data/T67.csv", header = T, sep = ",")[,c("Reference.sequence", "Average.coverage")]
KITpe <- read.table("data/T68.csv", header = T, sep = ",")[,c("Reference.sequence", "Average.coverage")]
KITmp <- read.table("data/mp.csv", header = T, sep = ",")[,c("Reference.sequence", "Average.coverage")]
KIT <- KITpe
KIT$Average.coverage <- KITpe$Average.coverage + KITmp$Average.coverage

The overview of the number of essential genes in each scaffold is loaded.

ess <- read.table("data/essential.txt", header = T, sep = " ")

Load basic taxonomy of the scaffolds. Only scaffolds with essential genes are taxonomic classified.

tax <- read.table("data/tax.txt", header = T, sep = "\t")

Load information on which scaffolds that are connected by either paired-end (PE) or mate-pair (MP) connections.

pe <- read.table("data/network_pe.txt", header = T, sep = "\t")
mp <- read.table("data/network_mp.txt", header = T, sep = "\t")

Additional: 16S rRNA infomration.

rRNA16S <- read.table("data/16S.txt", header = T, sep = "\t")

Merge data

The loaded data is merged to a single object. The assembly is used to extract scaffold length, GC contente and tetranucleotide frequencies, which are used for a simple PCA. The tax.expand parameter converts a phylum level taxonomic classification to the underlying classes instead. The tax.freq parameter removes any classification with less than 50 entries in the complete dataset. The other parameter can be used to load any additional datasets.

d <- mmload(assembly = assembly, 
            coverage = c("CTAB", "KITpe", "KITmp", "KIT"), 
            essential = ess,           
            tax = tax,
            tax.expand = "Proteobacteria",
            tax.freq = 85,
            other = "rRNA16S",
            pca = T
           )
## [1] "Loading scaffold length, coverage and gc."
## [1] "Calculating PCA."
## [1] "Loading taxonomy."
## [1] "Loading additional datasets."
## Warning in mmload(assembly = assembly, coverage = c("CTAB", "KITpe",
## "KITmp", : The file rRNA16S contains less scaffolds than the assembly.
## Missing values are treated as NA.

Save data

Remove temporary data and save the generated data for easy loading.

rm(list = c("ess", "rRNA16S", "KIT", "CTAB", "KITpe", "KITmp", "tax"))
save.image(file="vanKessel.RData")