Introduction

This report documents the binning of a Betaproteobacteria genome from an enrichment reactor (ENR4). See Daims et al., 2015: Complete Nitrification by Nitrospira Bacteria for further details.

Load the mmgenome package

In case you haven’t installed the mmgenome package, see the Load data example.

library("mmgenome")

Import data

The Rmarkdown file Load_data.Rmd describes the loading of the data and can be imported using the mmimport function. However, the preprocessed data can also be downloaded directly from figshare: Daims_ENR4. Hence, here we import the prepocessed data from figshare instead.

load("Daims_ENR4.RData")

Data overview

The object d contains information on scaffolds and essential genes within the scaffolds. For each scaffold the dataset contains the following information: The columns ENR4A, ENR4E, ENR4F and ENR6 contain the coverage information from 4 different samples; PC1, PC2 and PC3 contain coordinates of the three first principal components from a PCA analysis on tetranucleotide frequencies; essential contain information taxonomic information for each scaffold based on classification on essential genes; rRNA contain taxonomic information on scaffolds that have an associated 16S rRNA gene.

colnames(d$scaffolds)
##  [1] "scaffold"  "length"    "gc"        "ENR4A"     "ENR4E"    
##  [6] "ENR4F"     "ENR6"      "PC1"       "PC2"       "PC3"      
## [11] "essential" "rRNA16S"

The basic statistics of the full dataset can be summarised using the mmstats function.

mmstats(d, ncov = 4)
##                General Stats
## n.scaffolds           161.00
## GC.mean                65.60
## N50                198700.00
## Length.total     12846008.00
## Length.max        1306334.00
## Length.mean         79788.90
## Coverage.ENR4A        157.66
## Coverage.ENR4E        139.64
## Coverage.ENR4F        255.62
## Coverage.ENR6         150.12
## Ess.total             419.00
## Ess.unique            107.00

Extract the Betaproteobacteria

The combination of the coverage of sample ENR4A and ENR4E provides the cleanest separation of the genomes and are used for binning. A subspace is defined that clearly seperates the Betaproteobacteria from the three other species.

p <- mmplot(data = d, x = "ENR4A", y = "ENR4E", log.x = T, log.y = T, color = "essential") 

#p
#sel <- mmplot_locator(p)

sel <- data.frame(ENR4A  =  c(56.8, 160, 281, 302, 122, 57.9),
                  ENR4E  =  c(131, 332, 385, 316, 83.1, 96.2))

mmplot_selection(p, sel)

dA <- mmextract(d, sel)
mmstats(dA, ncov = 3)
##                General Stats
## n.scaffolds           105.00
## GC.mean                66.70
## N50                 63393.00
## Length.total      3496184.00
## Length.max         206674.00
## Length.mean         33297.00
## Coverage.ENR4A        105.33
## Coverage.ENR4E        149.93
## Coverage.ENR4F        283.36
## Ess.total             106.00
## Ess.unique            105.00

Using paried-end connections

Until now we have just used coverage profiles to extract scaffolds related to our genome of interest. However, some scaffolds might be present in many copies (repeats) and hence have a much higher coverage than the rest of the genome. In addition, some scaffolds will by chance have a slightly different coverage profile than the rest of the genome and thereby also been missed.

The function mmplot_network can be used to generate a network plot of scaffolds connected by paired-end reads. We start by plotting the scaffolds we have in our current subset.

mmplot_network(data = dA, network = pe, nconnections = 5, color = "ENR4A", scale.links = 0.5)

Most scaffolds seem to be connected and we might be able to improve the assembly substantial in the refinement stage. However, there are a high number of repeats that do not seem solveable using just the PE information.

To include repeats and other missed scaffolds we simply extract all scaffolds that are directly connected by paired-end reads to our current subset dB using mmextract_network. Only scaffolds directly connected to the subset is extracted.

dB <- mmextract_network(subset = dA, original = d, network = pe, nconnections = 5, type = "direct")

… and then plot the new subset. It seems like we included a high coverage scaffold that was missed initially.

mmplot_network(data = dB, network = pe, nconnections = 5, color = "ENR4A", scale.links = 0.5)

mmstats(dB, ncov = 4)
##                General Stats
## n.scaffolds           107.00
## GC.mean                66.70
## N50                 69704.00
## Length.total      3567859.00
## Length.max         206674.00
## Length.mean         33344.50
## Coverage.ENR4A        103.97
## Coverage.ENR4E        147.89
## Coverage.ENR4F        279.60
## Coverage.ENR6         220.04
## Ess.total             114.00
## Ess.unique            105.00

Final subset

Given the complexity of the network plot, the extracted scaffolds are re-plotted in coverage space to see what was extracted. It seems like there by chance was a PE connection to a scaffold from the Actinobacteria, which is removed by defining a subspace.

p <- mmplot(data = dB, x = "ENR4A", y = "ENR4E", log.x = T, log.y = T, color = "essential") 

#p
#sel <- mmplot_locator(p)

sel <- data.frame(ENR4A  =  c(63, 93.9, 762, 924, 851, 106, 66.5),
                  ENR4E  =  c(109, 355, 1190, 1180, 685, 79.3, 85.6))

mmplot_selection(p, sel)

dC <- mmextract(dB, sel)
mmstats(dC, ncov = 3)
##                General Stats
## n.scaffolds           106.00
## GC.mean                66.70
## N50                 63393.00
## Length.total      3497219.00
## Length.max         206674.00
## Length.mean         32992.60
## Coverage.ENR4A        105.54
## Coverage.ENR4E        150.21
## Coverage.ENR4F        283.93
## Ess.total             106.00
## Ess.unique            105.00

Export the scaffolds

Now that we are happy with the genome bin, the scaffolds can be exported to a separate fasta file using mmexport. The genome were then re-assembled using the reads in the genome bin using the SPAdes assembler.

mmexport(data=dC, assembly=assembly, file = "Betaproteobacteria.fa")