This report documents the binning of a Nitrospira Comammox genome from an enrichment reactor. See vanKessel et al., 2015: Complete nitrification by a single microorganism for further details.
In case you haven’t installed the mmgenome package, see the Load data example.
library("mmgenome")
The Rmarkdown file Load_data.Rmd describes the data that is to be loaded. The data is then loaded using the mmimport
function. The data loading and genome extraction is split to enable cleaner workflows. I.e. you load data once, but extract multiple genomes in their separate Rmarkdown file.
load("vanKessel.RData")
The object d
contains information on scaffolds and essential genes within the scaffolds. For each scaffold the dataset contains the following information: The columns CTAB
, KITpe
, KITmp
and KIT
contain the coverage information from 4 different samples; PC1
, PC2
and PC3
contain coordinates of the three first principal components from a PCA analysis on tetranucleotide frequencies; essential
contain information taxonomic information for each scaffold based on classification on essential genes; rRNA16S
contain taxonomic information on scaffolds that have an associated 16S rRNA gene.
colnames(d$scaffolds)
## [1] "scaffold" "length" "gc" "CTAB" "KITpe"
## [6] "KITmp" "KIT" "PC1" "PC2" "PC3"
## [11] "essential" "rRNA16S"
The basic statistics of the full dataset can be summarised using the mmstats
function.
mmstats(d, ncov = 4)
## General Stats
## n.scaffolds 47584.00
## GC.mean 53.90
## N50 6256.00
## Length.total 178755569.00
## Length.max 1528908.00
## Length.mean 3756.60
## Coverage.CTAB 5.81
## Coverage.KITpe 4.55
## Coverage.KITmp 6.76
## Coverage.KIT 11.31
## Ess.total 3158.00
## Ess.unique 109.00
In general, the metagenome assembly is very nice and the two Nitrospira genomes can easily be identified. The first Nitrospira genome is located in the same coverage space as a genome from the Phylum Planctomycetes. However, that is removed in the subsequent steps.
p <- mmplot(data = d,
x = "CTAB",
y = "KITpe",
color = "essential",
minlength = 10000)
#p
#sel <- mmplot_locator(p)
sel <- data.frame(CTAB = c(10.2, 11.3, 14.3, 16.7, 15.6, 14.1, 10.5),
KITpe = c(9.82, 12.6, 14.1, 11.5, 9.39, 6.93, 7.12))
mmplot_selection(p, sel) +
scale_x_log10(limits = c(1,50), breaks = c(1, 2, 5, 10, 25, 50)) +
scale_y_log10(limits = c(1,50), breaks = c(1, 2, 5, 10, 25, 50)) +
scale_size_area(breaks = c(10000, 50000, 100000, 500000, 1000000),
max_size = 20, labels = c(10, 50, 100, 500, 1000),
name = "Scaffold Length (Kbp)") +
scale_color_discrete(name = "Taxonomy") +
xlab("Coverage (CTAB)") +
ylab("Coverage (Kit)") +
theme_classic()
The scaffolds included in the defined subspace are extracted using the mmextract
function. Note that all scaffolds in the defined subspace is extracted and not just the scaffolds over 10 kbp that was plotted.
dA <- mmextract(d, sel)
The mmstats
function applies to any extracted object. Hence, it can be used directly on the subset.
mmstats(dA)
## General Stats
## n.scaffolds 311.00
## GC.mean 50.70
## N50 152547.00
## Length.total 6975425.00
## Length.max 1073143.00
## Length.mean 22429.00
## Coverage.CTAB 13.69
## Coverage.KITpe 10.07
## Ess.total 152.00
## Ess.unique 106.00
The function mmplot_pairs
can visualize a number of different variables at the same time. In this case PC2
and PC3
seem like a good choice.
mmplot_pairs(data = dA, variables = c("CTAB", "KIT", "gc", "PC1", "PC2", "PC3"))