Introduction

This report documents the binning of a Nitrospira Comammox genome from a pilot-scale MBR reactor used for wastewater treatment (MBR). See Daims et al., 2015: Complete Nitrification by Nitrospira Bacteria for further details.

Load the mmgenome package

The metagenome data is analysed using the mmgenome package.

library("mmgenome")

Import data

The Rmarkdown file Load_data.Rmd describes the loading of the data and can be imported using the mmimport function. However, the preprocessed data can also be downloaded directly from figshare: Daims_MBR. Hence, here we import the prepocessed data from figshare instead.

load("Daims_MBR.RData")

Extract Nitrospira 1

The combination of the coverage of sample Gel1 and MBR are used to make an intial extraction that includes the Nitrospira genome.

p <- mmplot(data = d, 
            x = "Gel1", 
            y = "MBR1", 
            log.x = F, 
            log.y = F, 
            color = "essential", 
            minlength = 5000,
            factor.shape = "solid") +
  xlim(5,25) +
  ylim(1,15)

#p
#sel <- mmplot_locator(p)

sel <- data.frame(Gel1  =  c(8.64, 12.6, 16, 18, 15.1, 10.6, 8.27),
                  MBR1  =  c(5.7, 8.63, 9.72, 8.93, 6.24, 4.67, 4.48))

mmplot_selection(p, sel)

The scaffolds included in the defined subspace are extracted using the mmextract function.

dA <- mmextract(d, sel)

The mmstats function applies to any extracted object. Hence, it can be used directly on the subset.

mmstats(dA, ncov = 7)
##                General Stats
## n.scaffolds          1212.00
## GC.mean                59.80
## N50                  8721.00
## Length.total      6571860.00
## Length.max          54219.00
## Length.mean          5422.30
## Coverage.MBR1           6.44
## Coverage.Gel1          12.24
## Coverage.Gel7          18.30
## Coverage.Gel            0.73
## Coverage.MBR            0.62
## Coverage.Foam           0.80
## Coverage.Foam2          3.61
## Ess.total             149.00
## Ess.unique             90.00

There is still a lot of contaminating scaffolds in the bin, hence we look if they can be seperated using another combination of datasets.

mmplot_pairs(data = dA, 
             variables = c("MBR1", "MBR", "Gel", "Gel1", "Gel7", "Foam", "Foam2", "gc"), 
             log = c("MBR1", "MBR", "Gel", "Gel1", "Gel7", "Foam", "Foam2"), 
             minlength = 5000, 
             color = "essential")

Subset B

The Foam2 and Gel7 samples are chosen. We could also have used other combinations.

p <- mmplot(data = dA, x = "Gel7", y = "Foam2", log.x = T, log.y = T, color = "essential")

#p
#sel <- mmplot_locator(p)

sel <- data.frame(Gel7  =  c(3.82, 3.82, 8.75, 10.6, 10.6, 6.41),
                  Foam2  =  c(2.12, 4.22, 5.7, 3.86, 1.96, 1.47))

mmplot_selection(p, sel)

The scaffolds included in the defined subspace are extracted using the mmextract function.

dB <- mmextract(dA, sel)
mmstats(dB, ncov = 7)
##                General Stats
## n.scaffolds           481.00
## GC.mean                56.00
## N50                 10706.00
## Length.total      3285788.00
## Length.max          41684.00
## Length.mean          6831.20
## Coverage.MBR1           6.16
## Coverage.Gel1          11.70
## Coverage.Gel7           7.41
## Coverage.Gel            0.79
## Coverage.MBR            0.80
## Coverage.Foam           0.70
## Coverage.Foam2          3.05
## Ess.total              94.00
## Ess.unique             78.00

It starts to look better. However, there is still a large amount of contamination.

mmplot_pairs(data = dB, 
             variables = c("MBR1", "Gel1", "Gel7", "gc", "PC1", "PC2", "PC3"), 
             log = c("MBR1", "Gel1", "Gel7"), 
             color = "essential")

Subset C

The first two PCs seem to remove most contamination.

p <- mmplot(data = dB, 
            x = "PC1", 
            y = "PC2", 
            log.x = F, 
            log.y = F, 
            color = "essential")

#p
#sel <- mmplot_locator(p)

sel <- data.frame(PC1  =  c(-0.0906, -0.0675, 0.00465, 0.0339, 0.0197, -0.0269, -0.0689),
                  PC2  =  c(-0.192, -0.11, -0.113, -0.147, -0.266, -0.34, -0.296))

mmplot_selection(p, sel)

The scaffolds included in the defined subspace are extracted using the mmextract function.

dC <- mmextract(dB, sel)
mmstats(dC, ncov = 7)
##                General Stats
## n.scaffolds           313.00
## GC.mean                54.40
## N50                 11753.00
## Length.total      2389054.00
## Length.max          41684.00
## Length.mean          7632.80
## Coverage.MBR1           6.52
## Coverage.Gel1          11.96
## Coverage.Gel7           6.64
## Coverage.Gel            0.74
## Coverage.MBR            0.89
## Coverage.Foam           0.69
## Coverage.Foam2          2.99
## Ess.total              81.00
## Ess.unique             78.00

Using paried-end connections (Subset D)

Until now we have just used coverage profiles to extract scaffolds related to our genome of interest. However, some scaffolds might be present in many copies (repeats) and hence have a much higher coverage than the rest of the genome. In addition, some scaffolds will by chance have a slightly different coverage profile than the rest of the genome and thereby also be missed.

The function mmplot_network can be used to generate a network plot of scaffolds connected by paired-end reads. We start by plotting the scaffolds we have in our current subset.

mmplot_network(data = dC, network = pe, nconnections = 1, color = "essential", scale.links = 0.5)

To include repeats and other missed scaffolds we simply extract all scaffolds that are directly connected by paired-end reads to our current subset dC using mmextract_network. Only scaffolds directly connected to the subset is extracted.

dD <- mmextract_network(subset = dC, original = d, network = pe, nconnections = 1, type = "direct")

… and then plot the new subset. There was no additional repeats to be included.

mmplot_network(data = dD, network = pe, nconnections = 1, color = "essential", scale.links = 0.5)

mmstats(dD, ncov = 7)
##                General Stats
## n.scaffolds           677.00
## GC.mean                54.20
## N50                  9355.00
## Length.total      3622075.00
## Length.max          41684.00
## Length.mean          5350.20
## Coverage.MBR1          10.23
## Coverage.Gel1          20.20
## Coverage.Gel7          20.74
## Coverage.Gel            3.01
## Coverage.MBR            1.86
## Coverage.Foam           1.54
## Coverage.Foam2         16.58
## Ess.total              91.00
## Ess.unique             85.00

Subset E

We do a final clean-up as we might have wrongly included some scaffolds.

p <- mmplot(data = dD, 
            x = "Gel1", 
            y = "MBR1", 
            log.x = F, 
            log.y = F, 
            color = "essential") +
  xlim(1,25) +
  ylim(1,15)
  

#p
#sel <- mmplot_locator(p)

sel <- data.frame(Gel1  =  c(2.18, 6.32, 12.9, 16, 18.2, 16.1, 7.59, 2.98),
                  MBR1  =  c(2.73, 5.67, 9.84, 9.95, 8.65, 6.13, 1.28, 1.41))

mmplot_selection(p, sel)

The scaffolds included in the defined subspace are extracted using the mmextract function.

dE <- mmextract(dD, sel)
mmstats(dE, ncov = 7)
##                General Stats
## n.scaffolds           632.00
## GC.mean                54.10
## N50                  9533.00
## Length.total      3483358.00
## Length.max          41684.00
## Length.mean          5511.60
## Coverage.MBR1           5.72
## Coverage.Gel1          10.58
## Coverage.Gel7           5.86
## Coverage.Gel            0.64
## Coverage.MBR            0.77
## Coverage.Foam           0.60
## Coverage.Foam2          2.57
## Ess.total              90.00
## Ess.unique             85.00

Export the scaffolds

Now that we are happy with the genome bin, the scaffolds can be exported to a separate fasta file using mmexport. The genome were then re-assembled using the reads in the genome bin using the SPAdes assembler.

mmexport(data = dE, assembly = assembly, file = "MBR_Nitrospira_it1.fa")