This report documents the binning of a Nitrospira Commamox genome from a full-scale wastewater treatment plant (VetMed). See Daims et al., 2015: Complete Nitrification by Nitrospira Bacteria for further details.
The metagenome data is analysed using the mmgenome package.
library("mmgenome")
The Rmarkdown file Load_data.Rmd describes the loading of the data and can be imported using the mmimport
function. However, the preprocessed data can also be downloaded directly from figshare: Daims_VetMed. Hence, here we import the prepocessed data from figshare instead.
load("Daims_VetMed.RData")
The combination of the coverage of sample VM23
and VMPS
are used to make an intial extraction that includes the Nitrospira genome.
p <- mmplot(data = d,
x = "VM23",
y = "VMPS",
color = "essential",
minlength = 5000,
factor.shape = "solid") +
xlim(0,20) +
ylim(0,5)
#p
#sel <- mmplot_locator(p)
sel <- data.frame(VM23 = c(8.75, 10.1, 15.2, 18.5, 18.9, 16.3, 11),
VMPS = c(1.67, 2.44, 2.94, 2.74, 1.83, 1.07, 1.06))
mmplot_selection(p, sel)
The scaffolds included in the defined subspace are extracted using the mmextract
function.
dA <- mmextract(d, sel)
The mmstats
function applies to any extracted object. Hence, it can be used directly on the subset.
mmstats(dA, ncov = 2)
## General Stats
## n.scaffolds 5076.00
## GC.mean 62.10
## N50 4525.00
## Length.total 17448875.00
## Length.max 43902.00
## Length.mean 3437.50
## Coverage.VM23 12.22
## Coverage.VMPS 1.67
## Ess.total 370.00
## Ess.unique 107.00
There is still a lot of contaminating scaffolds in the bin. Hence, we look if they can be seperated using another combination of datasets.
mmplot_pairs(data = dA,
variables = c("VM23", "VMPS","gc", "PC1", "PC2", "PC3"),
log = c("VM23", "VMPS"),
minlength = 5000,
color = "essential")
There do not seem to be enough resolution to cleanly seperate the Nitrospira from the rest of the contamination. There might be something in using the coverage from the VMPS
sample and the gc
content. However, we can try to use ESOM to further seperate the scaffolds based on tetranucleotide frequency.
mmplot(data = dA,
x = "esomx",
y = "esomy",
log.x = F,
log.y = F,
color = "essential",
point.size = 4,
esom.map = esom_map,
esom.color = "grey10",
esom.cut = 0.15,
factor.shape = "solid")
The ESOM map of tetranucleotides of the whole dataset seem to have better resolution than the PCA. However, there is still not a clean seperation to the Planctomycetes.
As a final try we do a ESOM on the subset itself. First the scaffolds in the subset is exported and an ESOM is calculated based on this subset outside R using the databionic software. Note that these steps are disabled in the guide as the data is provided in the initial data("Daims_VetMed.RData")
file
mmexport(data = dA, assembly = assembly, file = "vetmet_dA.fa")
Then we import the ESOM data.
esom_names <- read.delim("data/esom_dAp.names", skip = 1, header = F)
esom_bm <- read.delim("data/esom_dAp.bm", skip = 2, header = F)
esom_umx <- read.delim("data/esom_dA.umx", skip = 1, header = F)
esom <- mmformat_esom(names = esom_names,
bm = esom_bm,
umx = esom_umx,
xname = "esomx_dA",
yname = "esomy_dA")
esom_map_dA <- esom$esom_map
esom_coord_dA <- esom$esom_coord
And finally add it to the excisting data frame.
dA <- mmadd(data = dA,
newdata = esom_coord_dA)
For completeness we also import PhyloPythiaS+ taxonomic classification of the scaffolds.
ppsp_raw <- read.delim(file = "data/ppsp.txt", header = T, skip = 2)
ppsp <- mmformat_ppsp(ppsp_raw,
ranks = "phylum",
ranknames = "ppsp_phylum")
And add it to the excising data frame.
dA <- mmadd(data = dA,
newdata = ppsp)
The ESOM looks better now and it seems like we can remove most of the Planctomycetes contamination.
mmplot(data = dA,
x = "esomx_dA",
y = "esomy_dA",
color = "essential",
point.size = 4,
esom.map = esom_map_dA,
esom.color = "grey10",
esom.cut = 0.15,
factor.shape = "solid")
Using the local ESOM map
and guided by the coverage in the VMPS
sampe it is possible to exclude the Planctomycetes
relatively easy.
p <- mmplot(data = dA,
x = "esomx_dA",
y = "esomy_dA",
color = "VMPS",
point.size = 3,
esom.map = esom_map_dA,
esom.color = "grey10",
esom.cut = 0.15)
#p
#sel <- mmplot_locator(p)
sel <- data.frame(esomx_dA = c(42.7, 42.7, 36.7, 35.4, 32.9, 30.3, 23.6, 26.2, 30, 23.6, 35.4, 39.3, 41.6, 109, 111, 101, 89.1, 76.5, 74.7, 80.6, 87, 90.4, 106, 108),
esomy_dA = c(86.5, 77, 63.3, 55.8, 49.7, 45.8, 39.6, 29.2, 24.2, 17.9, 9.71, 0.487, -2.44, -2.29, 4.29, 15, 21.3, 29.9, 37.8, 45.3, 50.7, 64.9, 78, 87.5))
mmplot_selection(p, sel)
The scaffolds included in the defined subspace are extracted using the mmextract
function.
dB <- mmextract(dA, sel)
From the stats it can be seen that we actually have at least 2 Nitrospira species in the bin. Even though it seems like they can be split using the ESOM map it is not possible to obtain a clean seperation (I’ve tried..).
mmstats(dB, ncov = 2)
## General Stats
## n.scaffolds 2008.0
## GC.mean 56.9
## N50 4202.0
## Length.total 6607137.0
## Length.max 30150.0
## Length.mean 3290.4
## Coverage.VM23 12.6
## Coverage.VMPS 1.8
## Ess.total 145.0
## Ess.unique 94.0
To be conservative, we clean the Nitrospira bin further using the complete ESOM map. Here, we select the scaffolds we want to remove.
p <- mmplot(data = dB,
x = "esomx",
y = "esomy",
color = "essential",
esom.map = esom_map,
esom.color = "grey10",
esom.cut = 0.15,
factor.shape = "solid")
#p
#sel <- mmplot_locator(p)
sel <- data.frame(esomx = c(20.2, 45.1, 63.2, 83.8, 92.4, 110, 116, 191, 211, 210, 232, 421, 491, 491, 434, 440, 219, 133, 119, 104, 15.9, 20.2, 105, 46),
esomy = c(166, 188, 183, 180, 216, 217, 172, 188, 212, 234, 249, 248, 195, 29.2, 9.8, -4.05, -4.6, 38.1, 15.9, -7.93, -4.6, 22, 54.7, 103))
mmplot_selection(p, sel)
The scaffolds included in the defined subspace are extracted using the mmextract
function and afterwards we exclude them from the dB
subset.
exclude <- mmextract(dB, sel)
dC <- mmsubset(dB, !(scaffold %in% exclude$scaffolds$scaffold))
mmstats(dC, ncov = 2)
## General Stats
## n.scaffolds 1926.0
## GC.mean 56.8
## N50 4276.0
## Length.total 6444050.0
## Length.max 30150.0
## Length.mean 3345.8
## Coverage.VM23 12.6
## Coverage.VMPS 1.8
## Ess.total 138.0
## Ess.unique 94.0
Until now we have used coverage and tetranucleotide profiles to extract scaffolds related to our genome of interest. However, some scaffolds might be present in many copies (repeats) and hence have a much higher coverage than the rest of the genome. In addition, some scaffolds will by chance have a slightly different coverage (or tetranucleotide frequency) profile than the rest of the genome and thereby also be missed.
To include repeats and other missed scaffolds we simply extract all scaffolds that are directly connected by paired-end reads to our current subset dC
using mmextract_network
. Only scaffolds directly connected to the subset is extracted.
dD <- mmextract_network(subset = dC,
original = d,
network = pe,
nconnections = 1,
type = "direct")
mmstats(dD, ncov = 2)
## General Stats
## n.scaffolds 2779.0
## GC.mean 57.1
## N50 3731.0
## Length.total 8259931.0
## Length.max 30150.0
## Length.mean 2972.3
## Coverage.VM23 12.6
## Coverage.VMPS 1.8
## Ess.total 164.0
## Ess.unique 101.0
The 16S and 23S rRNA sequences from Nitrospira was not extracted using the above approach. Hence, we include them manually.
dE <- mmextract(data = dD,
original = d,
include = c("6813", "25466"))
Finally we take a look at what we extracted. It consists of atleast 2 Nitrospira species.
mmplot(data = dE,
x = "VM23",
y = "VMPS",
log.x = T,
log.y = T,
color = "essential",
factor.shape = "solid")
Finally the binned scaffolds are exported.
mmexport(data = dE,
assembly = assembly,
file = "VetMed_Nitrospira_it1.fa")