This report documents the binning of a Alphaproteobacterial genome from an enrichment reactor (ENR4).See Daims et al., 2015: Complete Nitrification by Nitrospira Bacteria for further details.
In case you haven’t installed the mmgenome package, see the Load data example.
library("mmgenome")
The Rmarkdown file Load_data.Rmd describes the loading of the data and can be imported using the mmimport
function. However, the preprocessed data can also be downloaded directly from figshare: Daims_ENR4. Hence, here we import the prepocessed data from figshare instead.
load("Daims_ENR4.RData")
The object d
contains information on scaffolds and essential genes within the scaffolds. For each scaffold the dataset contains the following information: The columns ENR4A
, ENR4E
, ENR4F
and ENR6
contain the coverage information from 4 different samples; PC1
, PC2
and PC3
contain coordinates of the three first principal components from a PCA analysis on tetranucleotide frequencies; essential
contain information taxonomic information for each scaffold based on classification on essential genes; rRNA
contain taxonomic information on scaffolds that have an associated 16S rRNA gene.
colnames(d$scaffolds)
## [1] "scaffold" "length" "gc" "ENR4A" "ENR4E"
## [6] "ENR4F" "ENR6" "PC1" "PC2" "PC3"
## [11] "essential" "rRNA16S"
The basic statistics of the full dataset can be summarised using the mmstats
function.
mmstats(d, ncov = 4)
## General Stats
## n.scaffolds 161.00
## GC.mean 65.60
## N50 198700.00
## Length.total 12846008.00
## Length.max 1306334.00
## Length.mean 79788.90
## Coverage.ENR4A 157.66
## Coverage.ENR4E 139.64
## Coverage.ENR4F 255.62
## Coverage.ENR6 150.12
## Ess.total 419.00
## Ess.unique 107.00
The combination of the coverage of sample ENR4A
and ENR4E
provides the cleanest separation of the genomes and are used for binning. A subspace is defined that clearly seperates the Alphaproteobacteria from the three other species.
p <- mmplot(data = d, x = "ENR4A", y = "ENR4E", log.x = T, log.y = T, color = "essential")
#p
#sel <- mmplot_locator(p)
sel <- data.frame(ENR4A = c(4.54, 7.03, 13, 14, 8.58, 5.06),
ENR4E = c(9.97, 19.8, 21.8, 12.1, 7.55, 8.2))
mmplot_selection(p, sel)
dA <- mmextract(d, sel)
mmstats(dA, ncov = 3)
## General Stats
## n.scaffolds 33.00
## GC.mean 67.60
## N50 186013.00
## Length.total 3878078.00
## Length.max 458643.00
## Length.mean 117517.50
## Coverage.ENR4A 7.06
## Coverage.ENR4E 10.82
## Coverage.ENR4F 14.75
## Ess.total 106.00
## Ess.unique 105.00
Until now we have just used coverage profiles to extract scaffolds related to our genome of interest. However, some scaffolds might be present in many copies (repeats) and hence have a much higher coverage than the rest of the genome. In addition, some scaffolds will by chance have a slightly different coverage profile than the rest of the genome and thereby also been missed.
The function mmplot_network
can be used to generate a network plot of scaffolds connected by paired-end reads. We start by plotting the scaffolds we have in our current subset.
mmplot_network(data = dA, network = pe, nconnections = 5, color = "essential", scale.links = 0.5)
All scaffolds seem to be nicely connected. However, it seems like a few repeats will prevent much further improvement to the assembly. To include repeats and other missed scaffolds we simply extract all scaffolds that are directly connected by paired-end reads to our current subset dB using mmextract_network. Only scaffolds directly connected to the subset is extracted.
dB <- mmextract_network(subset = dA, original = d, network = pe, nconnections = 5, type = "direct")
… and then plot the new subset. It seems like all scaffolds were already included in the initial subset.
mmplot_network(data = dB, network = pe, nconnections = 5, color = "ENR4A", scale.links = 0.5)
… this can also be seen by looking at the stats.
mmstats(dB, ncov = 4)
## General Stats
## n.scaffolds 33.00
## GC.mean 67.60
## N50 186013.00
## Length.total 3878078.00
## Length.max 458643.00
## Length.mean 117517.50
## Coverage.ENR4A 7.06
## Coverage.ENR4E 10.82
## Coverage.ENR4F 14.75
## Coverage.ENR6 0.00
## Ess.total 106.00
## Ess.unique 105.00
Now that we are happy with the genome bin, the scaffolds can be exported to a separate fasta file using mmexport
. The genome were then re-assembled using the reads in the genome bin using the SPAdes assembler.
mmexport(data=dB, assembly=assembly, file = "Alphaproteobacteria.fa")