Introduction

This report documents the binning of a Nitrospira genome from a pilot-scale MBR reactor used for wastewater treatment (MBR). See Daims et al., 2015: Complete Nitrification by Nitrospira Bacteria for further details.

Load the mmgenome package

The metagenome data is analysed using the mmgenome package.

library("mmgenome")

Import data

The Rmarkdown file Load_data.Rmd describes the loading of the data and can be imported using the mmimport function. However, the preprocessed data can also be downloaded directly from figshare: Daims_MBR. Hence, here we import the prepocessed data from figshare instead.

load("Daims_MBR.RData")

Extract Nitrospira 2

The combination of the coverage of sample Gel1 and MBR1 are used to make an intial extraction that includes the Nitrospira genome.

p <- mmplot(data = d, 
            x = "Gel1", 
            y = "MBR1", 
            log.x = F, 
            log.y = F, 
            color = "essential", 
            minlength = 5000,
            factor.shape = "solid") +
  xlim(1,15) +
  ylim(1,15)

#p
#sel <- mmplot_locator(p)

sel <- data.frame(Gel1  =  c(4.29, 6.05, 7.78, 8.41, 8.35, 6.17, 4.62, 4.02),
                  MBR1 = c(6.28, 7.83, 8.65, 7.63, 6.77, 5.3, 4.33, 5.02))

mmplot_selection(p, sel)

The scaffolds included in the defined subspace are extracted using the mmextract function.

dA <- mmextract(d, sel)

The mmstats function applies to any extracted object. Hence, it can be used directly on the subset.

mmstats(dA, ncov = 7)
##                General Stats
## n.scaffolds          1493.00
## GC.mean                52.30
## N50                 21763.00
## Length.total     14971450.00
## Length.max         237891.00
## Length.mean         10027.80
## Coverage.MBR1           6.04
## Coverage.Gel1           5.51
## Coverage.Gel7           7.81
## Coverage.Gel            0.83
## Coverage.MBR            0.69
## Coverage.Foam           0.69
## Coverage.Foam2          3.13
## Ess.total             366.00
## Ess.unique            108.00

There is still a lot of contaminating scaffolds in the bin, hence we look if they can be seperated using another combination of datasets.

mmplot_pairs(data = dA, 
             variables = c("MBR1", "MBR", "Gel", "Gel1", "Gel7", "Foam", "Foam2", "gc"), 
             log = c("MBR1", "MBR", "Gel", "Gel1", "Gel7", "Foam", "Foam2"), 
             minlength = 5000, 
             color = "essential")

Subset B

The Gel7 sample vs. gc is chosen. We could also have used other combinations.

p <- mmplot(data = dA, x = "gc", y = "Gel7", log.y = T, color = "essential")

#p
#sel <- mmplot_locator(p)

sel <- data.frame(gc  =  c(48.1, 47, 57.6, 63.6, 64.8, 59.2),
                  Gel7  =  c(5.25, 9.07, 11.1, 7.81, 5.38, 3.8))

mmplot_selection(p, sel)

The scaffolds included in the defined subspace are extracted using the mmextract function.

dB <- mmextract(dA, sel)
mmstats(dB, ncov = 7)
##                General Stats
## n.scaffolds           456.00
## GC.mean                59.00
## N50                 16661.00
## Length.total      4070910.00
## Length.max          58196.00
## Length.mean          8927.40
## Coverage.MBR1           6.16
## Coverage.Gel1           5.78
## Coverage.Gel7           6.99
## Coverage.Gel            1.23
## Coverage.MBR            0.99
## Coverage.Foam           0.96
## Coverage.Foam2          4.16
## Ess.total              93.00
## Ess.unique             89.00

It starts to look better. However, there is still a som contamination.

mmplot_pairs(data = dB, 
             variables = c("Gel7", "Foam", "Foam2", "PC1", "PC2"),
             minlength = 5000, 
             color = "essential")

Subset C

The PC1 vs. Foam2 seem to remove most contamination.

p <- mmplot(data = dB, 
            x = "PC1", 
            y = "Foam2", 
            log.x = F, 
            log.y = F, 
            color = "essential")

#p
#sel <- mmplot_locator(p)

sel <- data.frame(PC1  =  c(-0.0743, 0.00365, 0.0324, 0.0594, 0.0642, 0.0132, -0.0613, -0.0709),
                  Foam2  =  c(3.47, 6.38, 5.99, 4.32, 1.41, 0.703, 0.739, 2.16))

mmplot_selection(p, sel)

The scaffolds included in the defined subspace are extracted using the mmextract function. Based on the subsequent network plot some scaffolds were found to be part of other genomes. Hence, they are excluded directly here.

dC <- mmextract(dB, sel, exclude = c("7732", "1244", "11244", "8210","22459","21777", "5331", "61245", "41747", "28230", "22976"))
mmstats(dC, ncov = 7)
##                General Stats
## n.scaffolds           327.00
## GC.mean                58.60
## N50                 17994.00
## Length.total      3484805.00
## Length.max          58196.00
## Length.mean         10656.90
## Coverage.MBR1           6.11
## Coverage.Gel1           5.87
## Coverage.Gel7           6.99
## Coverage.Gel            1.21
## Coverage.MBR            0.94
## Coverage.Foam           0.88
## Coverage.Foam2          3.77
## Ess.total              91.00
## Ess.unique             88.00

Using paried-end connections (Subset D)

Until now we have just used coverage profiles to extract scaffolds related to our genome of interest. However, some scaffolds might be present in many copies (repeats) and hence have a much higher coverage than the rest of the genome. In addition, some scaffolds will by chance have a slightly different coverage profile than the rest of the genome and thereby also be missed.

The function mmplot_network can be used to generate a network plot of scaffolds connected by paired-end reads. We start by plotting the scaffolds we have in our current subset.

mmplot_network(data = dC, network = pe, nconnections = 1, color = "essential", scale.links = 0.5)

To include repeats and other missed scaffolds we simply extract all scaffolds that are directly connected by paired-end reads to our current subset dC using mmextract_network. Only scaffolds directly connected to the subset is extracted.

dD <- mmextract_network(subset = dC, original = d, network = pe, nconnections = 1, type = "direct")

… and then plot the new subset. There was no additional repeats to be included.

mmplot_network(data = dD, network = pe, nconnections = 1, color = "essential", scale.links = 0.5)

The mmstats function applies to any extracted object. Hence, it can be used directly on the subset.

mmstats(dD, ncov = 7)
##                General Stats
## n.scaffolds           578.00
## GC.mean                58.40
## N50                 16312.00
## Length.total      4337194.00
## Length.max          58196.00
## Length.mean          7503.80
## Coverage.MBR1           5.77
## Coverage.Gel1           5.66
## Coverage.Gel7           6.74
## Coverage.Gel            1.12
## Coverage.MBR            0.88
## Coverage.Foam           0.81
## Coverage.Foam2          3.52
## Ess.total              96.00
## Ess.unique             93.00

Subset E

We do a final clean-up as we might have wrongly included some scaffolds. Some scaffolds from the other Nitrospira genome had been recruited to the bin using PE reads. Hence, these are excluded.

p <- mmplot(data = dD, 
            x = "Gel1", 
            y = "MBR1", 
            log.x = F, 
            log.y = F,
            color = "essential")

#p
#sel <- mmplot_locator(p)

sel <- data.frame(Gel1  =  c(-0.972, 2.22, 12.3, 45.2, 47.5, 35.5, 13.1, 9.4, 9.31, 4.52, 0.00282),
                  MBR1  =  c(0.427, 7.71, 16.4, 21.1, 20.2, 15, 10.7, 8.02, 5.95, 0.323, -0.229))

mmplot_selection(p, sel)

The scaffolds included in the defined subspace are extracted using the mmextract function.

dE <- mmextract(dD, sel)
mmstats(dE, ncov = 7)
##                General Stats
## n.scaffolds           571.00
## GC.mean                58.40
## N50                 15957.00
## Length.total      4251713.00
## Length.max          58196.00
## Length.mean          7446.10
## Coverage.MBR1           5.73
## Coverage.Gel1           5.53
## Coverage.Gel7           6.70
## Coverage.Gel            1.12
## Coverage.MBR            0.88
## Coverage.Foam           0.81
## Coverage.Foam2          3.52
## Ess.total              94.00
## Ess.unique             91.00

Export the scaffolds

Finally the binned scaffolds are exported.

mmexport(data = dE, assembly = assembly, file = "MBR_Nitrospira2_it1.fa")