This report documents the binning of a Nitrospira Comammox genome from an enrichment reactor. See vanKessel et al., 2015: Complete nitrification by a single microorganism for further details.
In case you haven’t installed the mmgenome package, see the Load data example.
library("mmgenome")
The Rmarkdown file Load_data.Rmd describes the data that is to be loaded. The data is then loaded using the mmimport
function. The data loading and genome extraction is split to enable cleaner workflows. I.e. you load data once, but extract multiple genomes in their separate Rmarkdown file.
load("vanKessel.RData")
The object d
contains information on scaffolds and essential genes within the scaffolds. For each scaffold the dataset contains the following information: The columns CTAB
, KITpe
, KITmp
and KIT
contain the coverage information from 4 different samples; PC1
, PC2
and PC3
contain coordinates of the three first principal components from a PCA analysis on tetranucleotide frequencies; essential
contain information taxonomic information for each scaffold based on classification on essential genes; rRNA16S
contain taxonomic information on scaffolds that have an associated 16S rRNA gene.
colnames(d$scaffolds)
## [1] "scaffold" "length" "gc" "CTAB" "KITpe"
## [6] "KITmp" "KIT" "PC1" "PC2" "PC3"
## [11] "essential" "rRNA16S"
The basic statistics of the full dataset can be summarised using the mmstats
function.
mmstats(d, ncov = 4)
## General Stats
## n.scaffolds 47584.00
## GC.mean 53.90
## N50 6256.00
## Length.total 178755569.00
## Length.max 1528908.00
## Length.mean 3756.60
## Coverage.CTAB 5.81
## Coverage.KITpe 4.55
## Coverage.KITmp 6.76
## Coverage.KIT 11.31
## Ess.total 3158.00
## Ess.unique 109.00
In general, the metagenome assembly is very nice and the two Nitrospira genomes can easily be identified. The second Nitrospira genome is located in the same coverage space as other genomes. However, that is removed in the subsequent steps.
p <- mmplot(data = d,
x = "CTAB",
y = "KITpe",
color = "essential",
minlength = 10000)
#p
#sel <- mmplot_locator(p)
sel <- data.frame(CTAB = c(3.87, 4.44, 6.6, 6.11, 4.44),
KITpe = c(4.61, 6.11, 5.85, 4.5, 3.59))
mmplot_selection(p, sel) +
scale_x_log10(limits = c(1,50), breaks = c(1, 2, 5, 10, 25, 50)) +
scale_y_log10(limits = c(1,50), breaks = c(1, 2, 5, 10, 25, 50)) +
scale_size_area(breaks = c(10000, 50000, 100000, 500000, 1000000),
max_size = 20, labels = c(10, 50, 100, 500, 1000),
name = "Scaffold Length (Kbp)") +
scale_color_discrete(name = "Taxonomy") +
xlab("Coverage (CTAB)") +
ylab("Coverage (Kit - pe)") +
theme_classic()
dA <- mmextract(d, sel)
mmstats(dA)
## General Stats
## n.scaffolds 281.00
## GC.mean 55.90
## N50 81679.00
## Length.total 4724504.00
## Length.max 335390.00
## Length.mean 16813.20
## Coverage.CTAB 4.90
## Coverage.KITpe 4.95
## Ess.total 115.00
## Ess.unique 97.00
mmplot_pairs(data = dA, variables = c("CTAB", "KIT", "gc", "PC1", "PC2", "PC3"), color = "essential", minlength = 3000)
We use the same procedure as with the initial subset. However, now the dA
subset is used as input data.
p <- mmplot(data = dA,
x = "PC2",
y = "PC3",
color = "essential")
#p
#sel <- mmplot_locator(p)
sel <- data.frame(PC2 = c(-0.177, -0.126, -0.0433, -0.0307, -0.0672, -0.146),
PC3 = c(-0.172, -0.0991, -0.117, -0.17, -0.241, -0.236))
mmplot_selection(p, sel) +
scale_size_area(breaks = c(10000, 50000, 100000, 500000, 1000000),
max_size = 20,
labels = c(10, 50, 100, 500, 1000),
name = "Scaffold Length (Kbp)") +
scale_color_manual(name = "Taxonomy",
values = cols[c(4,9,10)]) +
theme_classic()
dB <- mmextract(dA, sel)
mmstats(dB)
## General Stats
## n.scaffolds 70.00
## GC.mean 56.30
## N50 104130.00
## Length.total 3965939.00
## Length.max 335390.00
## Length.mean 56656.30
## Coverage.CTAB 4.91
## Coverage.KITpe 5.00
## Ess.total 97.00
## Ess.unique 96.00
Until now we have just used coverage profiles to extract scaffolds related to our genome of interest. However, some scaffolds might be present in many copies (repeats) and hence have a much higher coverage than the rest of the genome. In addition, some scaffolds will by chance have a slightly different coverage profile than the rest of the genome and thereby also be missed.
The function mmplot_network
can be used to generate a network plot of scaffolds connected by paired-end reads. We start by plotting the scaffolds we have in our current subset.
mmplot_network(data = dB, network = mp, nconnections = 1, color = "essential", print.nolinks = T)
## [1] "The following scaffolds have no links to other scaffolds:"
## [1] "15418" "28883" "3475" "5759" "6144"
To include repeats and other missed scaffolds we simply extract all scaffolds that are directly connected by mate-pair reads to our current subset dB
using mmextract_network
. Only scaffolds directly connected to the subset is extracted.
dC <- mmextract_network(subset = dB, original = d, network = mp, nconnections = 5, type = "direct")
… and then plot the new subset. To identify potential wrongly included scaffolds, the original scaffolds are labelled in the network. Scaffolds that exclusively link to new scaffolds are potential from other genomes.
mmplot_network(data = dC, network = mp, nconnections = 5, color = "KIT", scale.links = 0.5, highlight = dA)
The contaminating scaffolds are removed and the network extraction is repeated.
dC <- mmextract(dA, sel, exclude = c("15418","28883", "6144", "3475"))
dD <- mmextract_network(subset = dC, original = d, network = mp, nconnections = 5, type = "direct")
mmplot_network(data = dD, network = mp, nconnections = 5, color = "KIT", scale.links = 0.5, highlight = dA)
mmstats(dD)
## General Stats
## n.scaffolds 95.00
## GC.mean 56.10
## N50 113613.00
## Length.total 4486248.00
## Length.max 381150.00
## Length.mean 47223.70
## Coverage.CTAB 5.60
## Coverage.KITpe 5.28
## Ess.total 143.00
## Ess.unique 101.00
We do a finally cleanup to be conservative.
p <- mmplot(data = dD, x = "CTAB", y = "KITpe", log.x = F, log.y = F, color = "essential")
#p
#sel <- mmplot_locator(p)
sel <- data.frame(CTAB = c(2.52, 3.09, 5.22, 6.62, 6.4, 3.51),
KITpe = c(3.61, 6.18, 7.77, 6.65, 3.32, 2.43))
mmplot_selection(p, sel)
dE <- mmextract(dD, sel)
mmstats(dE)
## General Stats
## n.scaffolds 86.00
## GC.mean 56.30
## N50 103850.00
## Length.total 4088547.00
## Length.max 335390.00
## Length.mean 47541.20
## Coverage.CTAB 4.89
## Coverage.KITpe 4.99
## Ess.total 103.00
## Ess.unique 101.00
Now that we are happy with the genome bin, the scaffolds can be exported to a separate fasta file using mmexport
.
mmexport(data=dE, assembly=assembly, file = "Nitrospira2.fa")
mmplot_network(data = dE,
network = mp,
nconnections = 5,
color = "essential",
scale.links = 0.5) +
scale_size_area(breaks = c(10000, 50000, 100000, 500000, 1000000),
max_size = 20, labels = c(10, 50, 100, 500, 1000),
name = "Scaffold Length (Kbp)") +
scale_color_manual(name = "Taxonomy", values = cols[c(9)])