This report documents the binning of a Nitrospira Comammox genome from an enrichment reactor. See vanKessel et al., 2015: Complete nitrification by a single microorganism for further details.
In case you haven’t installed the mmgenome package, see the Load data example.
library("mmgenome")
The Rmarkdown file Load_data.Rmd describes the data that is to be loaded. The data is then loaded using the mmimport
function. The data loading and genome extraction is split to enable cleaner workflows. I.e. you load data once, but extract multiple genomes in their separate Rmarkdown file.
load("vanKessel.RData")
The object d
contains information on scaffolds and essential genes within the scaffolds. For each scaffold the dataset contains the following information: The columns CTAB
, KITpe
, KITmp
and KIT
contain the coverage information from 4 different samples; PC1
, PC2
and PC3
contain coordinates of the three first principal components from a PCA analysis on tetranucleotide frequencies; essential
contain information taxonomic information for each scaffold based on classification on essential genes; rRNA16S
contain taxonomic information on scaffolds that have an associated 16S rRNA gene.
colnames(d$scaffolds)
## [1] "scaffold" "length" "gc" "CTAB" "KITpe"
## [6] "KITmp" "KIT" "PC1" "PC2" "PC3"
## [11] "essential" "rRNA16S"
The basic statistics of the full dataset can be summarised using the mmstats
function.
mmstats(d, ncov = 4)
## General Stats
## n.scaffolds 47584.00
## GC.mean 53.90
## N50 6256.00
## Length.total 178755569.00
## Length.max 1528908.00
## Length.mean 3756.60
## Coverage.CTAB 5.81
## Coverage.KITpe 4.55
## Coverage.KITmp 6.76
## Coverage.KIT 11.31
## Ess.total 3158.00
## Ess.unique 109.00
In general, the metagenome assembly is very nice and the two Nitrospira genomes can easily be identified. The first Nitrospira genome is located in the same coverage space as a genome from the Phylum Planctomycetes. However, that is removed in the subsequent steps.
p <- mmplot(data = d,
x = "CTAB",
y = "KITpe",
color = "essential",
minlength = 10000)
#p
#sel <- mmplot_locator(p)
sel <- data.frame(CTAB = c(10.2, 11.3, 14.3, 16.7, 15.6, 14.1, 10.5),
KITpe = c(9.82, 12.6, 14.1, 11.5, 9.39, 6.93, 7.12))
mmplot_selection(p, sel) +
scale_x_log10(limits = c(1,50), breaks = c(1, 2, 5, 10, 25, 50)) +
scale_y_log10(limits = c(1,50), breaks = c(1, 2, 5, 10, 25, 50)) +
scale_size_area(breaks = c(10000, 50000, 100000, 500000, 1000000),
max_size = 20, labels = c(10, 50, 100, 500, 1000),
name = "Scaffold Length (Kbp)") +
scale_color_discrete(name = "Taxonomy") +
xlab("Coverage (CTAB)") +
ylab("Coverage (Kit)") +
theme_classic()
The scaffolds included in the defined subspace are extracted using the mmextract
function. Note that all scaffolds in the defined subspace is extracted and not just the scaffolds over 10 kbp that was plotted.
dA <- mmextract(d, sel)
The mmstats
function applies to any extracted object. Hence, it can be used directly on the subset.
mmstats(dA)
## General Stats
## n.scaffolds 311.00
## GC.mean 50.70
## N50 152547.00
## Length.total 6975425.00
## Length.max 1073143.00
## Length.mean 22429.00
## Coverage.CTAB 13.69
## Coverage.KITpe 10.07
## Ess.total 152.00
## Ess.unique 106.00
The function mmplot_pairs
can visualize a number of different variables at the same time. In this case PC2
and PC3
seem like a good choice.
mmplot_pairs(data = dA, variables = c("CTAB", "KIT", "gc", "PC1", "PC2", "PC3"))
We use the same procedure as with the initial subset. However, now the dA
subset is used as input data.
p <- mmplot(data = dA,
x = "PC2",
y = "PC3",
color = "essential")
#p
#sel <- mmplot_locator(p)
sel <- data.frame(PC2 = c(-0.193, -0.179, -0.138, -0.113, -0.127, -0.176),
PC3 = c(-0.169, -0.142, -0.136, -0.17, -0.197, -0.199))
mmplot_selection(p, sel) +
scale_size_area(breaks = c(10000, 50000, 100000, 500000, 1000000),
max_size = 20, labels = c(10, 50, 100, 500, 1000),
name = "Scaffold Length (Kbp)") +
scale_color_manual(name = "Taxonomy", values = cols[c(9,10)]) +
theme_classic()
dB <- mmextract(dA, sel)
mmstats(dB)
## General Stats
## n.scaffolds 19
## GC.mean 55
## N50 659693
## Length.total 4400979
## Length.max 1073143
## Length.mean 231631
## Coverage.CTAB 13
## Coverage.KITpe 10
## Ess.total 104
## Ess.unique 103
Until now we have just used coverage profiles to extract scaffolds related to our genome of interest. However, some scaffolds might be present in many copies (repeats) and hence have a much higher coverage than the rest of the genome. In addition, some scaffolds will by chance have a slightly different coverage profile than the rest of the genome and thereby also be missed.
The function mmplot_network
can be used to generate a network plot of scaffolds connected by paired-end reads. We start by plotting the scaffolds we have in our current subset.
mmplot_network(data = dB, network = mp, nconnections = 5, color = "essential", scale.links = 0.5)
To include repeats and other missed scaffolds we simply extract all scaffolds that are directly connected by mate-pair reads to our current subset dB
using mmextract_network
. Only scaffolds directly connected to the subset is extracted.
dC <- mmextract_network(subset = dB, original = d, network = mp, nconnections = 5, type = "direct")
… and then plot the new subset.
mmplot_network(data = dC, network = mp, nconnections = 5, color = "KIT", scale.links = 0.5)
We finally remove the low abundant Nitrospira scaffolds that were included using the mate-pair data.
p <- mmplot(data = dC,
x = "CTAB",
y = "KITpe")
#p
#sel <- mmplot_locator(p)
sel <- data.frame(CTAB = c(8.99, 10.4, 36.9, 39.3, 39.4, 28.2, 13.3),
KITpe = c(7.35, 13, 26.7, 26.7, 25, 9.95, 5.37))
mmplot_selection(p, sel)
dD <- mmextract(dC, sel)
mmstats(dD)
## General Stats
## n.scaffolds 25
## GC.mean 55
## N50 659693
## Length.total 4413075
## Length.max 1073143
## Length.mean 176523
## Coverage.CTAB 13
## Coverage.KITpe 10
## Ess.total 104
## Ess.unique 103
Now that we are happy with the genome bin, the scaffolds can be exported to a separate fasta file using mmexport
.
mmexport(data=dD, assembly=assembly, file = "Nitrospira1.fa")
mmplot_network(data = dD,
network = mp,
nconnections = 5,
color = "essential",
scale.links = 0.5) +
scale_size_area(breaks = c(10000, 50000, 100000, 500000, 1000000),
max_size = 20, labels = c(10, 50, 100, 500, 1000),
name = "Scaffold Length (Kbp)") +
scale_color_manual(name = "Taxonomy", values = cols[c(9)])