This report documents the binning of a Nitrospira Comammox genome from a pilot-scale MBR reactor used for wastewater treatment (MBR). See Daims et al., 2015: Complete Nitrification by Nitrospira Bacteria for further details.
The metagenome data is analysed using the mmgenome package.
library("mmgenome")
The Rmarkdown file Load_data.Rmd describes the loading of the data and can be imported using the mmimport
function. However, the preprocessed data can also be downloaded directly from figshare: Daims_MBR. Hence, here we import the prepocessed data from figshare instead.
load("Daims_MBR.RData")
The combination of the coverage of sample Gel1
and MBR
are used to make an intial extraction that includes the Nitrospira genome.
p <- mmplot(data = d,
x = "Gel1",
y = "MBR1",
log.x = F,
log.y = F,
color = "essential",
minlength = 5000,
factor.shape = "solid") +
xlim(5,25) +
ylim(1,15)
#p
#sel <- mmplot_locator(p)
sel <- data.frame(Gel1 = c(8.64, 12.6, 16, 18, 15.1, 10.6, 8.27),
MBR1 = c(5.7, 8.63, 9.72, 8.93, 6.24, 4.67, 4.48))
mmplot_selection(p, sel)
The scaffolds included in the defined subspace are extracted using the mmextract
function.
dA <- mmextract(d, sel)
The mmstats
function applies to any extracted object. Hence, it can be used directly on the subset.
mmstats(dA, ncov = 7)
## General Stats
## n.scaffolds 1212.00
## GC.mean 59.80
## N50 8721.00
## Length.total 6571860.00
## Length.max 54219.00
## Length.mean 5422.30
## Coverage.MBR1 6.44
## Coverage.Gel1 12.24
## Coverage.Gel7 18.30
## Coverage.Gel 0.73
## Coverage.MBR 0.62
## Coverage.Foam 0.80
## Coverage.Foam2 3.61
## Ess.total 149.00
## Ess.unique 90.00
There is still a lot of contaminating scaffolds in the bin, hence we look if they can be seperated using another combination of datasets.
mmplot_pairs(data = dA,
variables = c("MBR1", "MBR", "Gel", "Gel1", "Gel7", "Foam", "Foam2", "gc"),
log = c("MBR1", "MBR", "Gel", "Gel1", "Gel7", "Foam", "Foam2"),
minlength = 5000,
color = "essential")
The Foam2
and Gel7
samples are chosen. We could also have used other combinations.
p <- mmplot(data = dA, x = "Gel7", y = "Foam2", log.x = T, log.y = T, color = "essential")
#p
#sel <- mmplot_locator(p)
sel <- data.frame(Gel7 = c(3.82, 3.82, 8.75, 10.6, 10.6, 6.41),
Foam2 = c(2.12, 4.22, 5.7, 3.86, 1.96, 1.47))
mmplot_selection(p, sel)
The scaffolds included in the defined subspace are extracted using the mmextract
function.
dB <- mmextract(dA, sel)
mmstats(dB, ncov = 7)
## General Stats
## n.scaffolds 481.00
## GC.mean 56.00
## N50 10706.00
## Length.total 3285788.00
## Length.max 41684.00
## Length.mean 6831.20
## Coverage.MBR1 6.16
## Coverage.Gel1 11.70
## Coverage.Gel7 7.41
## Coverage.Gel 0.79
## Coverage.MBR 0.80
## Coverage.Foam 0.70
## Coverage.Foam2 3.05
## Ess.total 94.00
## Ess.unique 78.00
It starts to look better. However, there is still a large amount of contamination.
mmplot_pairs(data = dB,
variables = c("MBR1", "Gel1", "Gel7", "gc", "PC1", "PC2", "PC3"),
log = c("MBR1", "Gel1", "Gel7"),
color = "essential")
The first two PCs seem to remove most contamination.
p <- mmplot(data = dB,
x = "PC1",
y = "PC2",
log.x = F,
log.y = F,
color = "essential")
#p
#sel <- mmplot_locator(p)
sel <- data.frame(PC1 = c(-0.0906, -0.0675, 0.00465, 0.0339, 0.0197, -0.0269, -0.0689),
PC2 = c(-0.192, -0.11, -0.113, -0.147, -0.266, -0.34, -0.296))
mmplot_selection(p, sel)
The scaffolds included in the defined subspace are extracted using the mmextract
function.
dC <- mmextract(dB, sel)
mmstats(dC, ncov = 7)
## General Stats
## n.scaffolds 313.00
## GC.mean 54.40
## N50 11753.00
## Length.total 2389054.00
## Length.max 41684.00
## Length.mean 7632.80
## Coverage.MBR1 6.52
## Coverage.Gel1 11.96
## Coverage.Gel7 6.64
## Coverage.Gel 0.74
## Coverage.MBR 0.89
## Coverage.Foam 0.69
## Coverage.Foam2 2.99
## Ess.total 81.00
## Ess.unique 78.00
Until now we have just used coverage profiles to extract scaffolds related to our genome of interest. However, some scaffolds might be present in many copies (repeats) and hence have a much higher coverage than the rest of the genome. In addition, some scaffolds will by chance have a slightly different coverage profile than the rest of the genome and thereby also be missed.
The function mmplot_network
can be used to generate a network plot of scaffolds connected by paired-end reads. We start by plotting the scaffolds we have in our current subset.
mmplot_network(data = dC, network = pe, nconnections = 1, color = "essential", scale.links = 0.5)
To include repeats and other missed scaffolds we simply extract all scaffolds that are directly connected by paired-end reads to our current subset dC
using mmextract_network
. Only scaffolds directly connected to the subset is extracted.
dD <- mmextract_network(subset = dC, original = d, network = pe, nconnections = 1, type = "direct")
… and then plot the new subset. There was no additional repeats to be included.
mmplot_network(data = dD, network = pe, nconnections = 1, color = "essential", scale.links = 0.5)
mmstats(dD, ncov = 7)
## General Stats
## n.scaffolds 677.00
## GC.mean 54.20
## N50 9355.00
## Length.total 3622075.00
## Length.max 41684.00
## Length.mean 5350.20
## Coverage.MBR1 10.23
## Coverage.Gel1 20.20
## Coverage.Gel7 20.74
## Coverage.Gel 3.01
## Coverage.MBR 1.86
## Coverage.Foam 1.54
## Coverage.Foam2 16.58
## Ess.total 91.00
## Ess.unique 85.00
We do a final clean-up as we might have wrongly included some scaffolds.
p <- mmplot(data = dD,
x = "Gel1",
y = "MBR1",
log.x = F,
log.y = F,
color = "essential") +
xlim(1,25) +
ylim(1,15)
#p
#sel <- mmplot_locator(p)
sel <- data.frame(Gel1 = c(2.18, 6.32, 12.9, 16, 18.2, 16.1, 7.59, 2.98),
MBR1 = c(2.73, 5.67, 9.84, 9.95, 8.65, 6.13, 1.28, 1.41))
mmplot_selection(p, sel)
The scaffolds included in the defined subspace are extracted using the mmextract
function.
dE <- mmextract(dD, sel)
mmstats(dE, ncov = 7)
## General Stats
## n.scaffolds 632.00
## GC.mean 54.10
## N50 9533.00
## Length.total 3483358.00
## Length.max 41684.00
## Length.mean 5511.60
## Coverage.MBR1 5.72
## Coverage.Gel1 10.58
## Coverage.Gel7 5.86
## Coverage.Gel 0.64
## Coverage.MBR 0.77
## Coverage.Foam 0.60
## Coverage.Foam2 2.57
## Ess.total 90.00
## Ess.unique 85.00
Now that we are happy with the genome bin, the scaffolds can be exported to a separate fasta file using mmexport
. The genome were then re-assembled using the reads in the genome bin using the SPAdes assembler.
mmexport(data = dE, assembly = assembly, file = "MBR_Nitrospira_it1.fa")