Introduction

This report documents the initial genome extraction of Candidatus Accumulibacter aalborgensis in Albertsen et al., 2016: “Candidatus Propionivibrio aalborgensis”: a novel glycogen accumulating organism abundant in full-scale enhanced biological phosphorus removal plants.

Load the mmgenome package

In case you haven’t installed the mmgenome package, see the Load data example.

library("mmgenome")

Import data

The Rmarkdown file Load_data.Rmd describes the loading of the data and can be imported using the mmimport function. However, the preprocessed data can also be downloaded directly from figshare: Holmes. Hence, here we import the prepocessed data from figshare instead.

load("Holmes.RData")

Data overview

The object d contains information on scaffolds and essential genes within the scaffolds. For each scaffold the dataset contains the following information: The columns H09.06, H11.05, H11.25, H12.13 and H12.09 contain the coverage information from 5 different samples; PC1, PC2 and PC3 contain coordinates of the three first principal components from a PCA analysis on tetranucleotide frequencies; essential contain information taxonomic information for each scaffold based on classification on essential genes; rRNA contain taxonomic information on scaffolds that have an associated 16S rRNA gene; pps_xxx contain taxonomic information obtained using PhyloPythiaS+.

colnames(d$scaffolds)
##  [1] "scaffold"   "length"     "gc"         "H09.06"     "H11.05"    
##  [6] "H11.25"     "H12.13"     "H12.19"     "PC1"        "PC2"       
## [11] "PC3"        "essential"  "rRNA16S"    "pps_phylum" "pps_class" 
## [16] "pps_order"  "pps_family" "pps_genus"

The basic statistics of the full dataset can be summarised using the mmstats function.

mmstats(d, ncov = 5)
##                 General Stats
## n.scaffolds          30725.00
## GC.mean                 48.20
## N50                   2951.00
## Length.total      50450465.00
## Length.max          230584.00
## Length.mean           1642.00
## Coverage.H09.06          0.31
## Coverage.H11.05          1.92
## Coverage.H11.25        253.90
## Coverage.H12.13         48.17
## Coverage.H12.19         28.32
## Ess.total              794.00
## Ess.unique             108.00

Accumulibacter

The combination of the coverage of sample H11.25 and H11.05 provides the cleanest separation of the two genomes and are used for binning.

p <- mmplot(data = d, x = "H11.25", y = "H11.05", log.x = T, log.y = T, color = "essential", minlength = 3000)

#p
#sel <- mmplot_locator(p)

sel <- data.frame(H11.25  =  c(413, 641, 1350, 1370, 569),
                  H11.05  =  c(1.03, 1.99, 1.57, 0.614, 0.506))

mmplot_selection(p, sel) +
  theme(axis.line.x = element_line(), 
        axis.line.y = element_line())

The scaffolds included in the defined subspace are extracted using the mmextract function.

dA <- mmextract(d, sel)

The mmstats function applies to any extracted object. Hence, it can be used directly on the subset.

mmstats(dA, ncov = 5)
##                 General Stats
## n.scaffolds            143.00
## GC.mean                 62.20
## N50                  71862.00
## Length.total       4649134.00
## Length.max          230584.00
## Length.mean          32511.40
## Coverage.H09.06          0.38
## Coverage.H11.05          1.06
## Coverage.H11.25        777.72
## Coverage.H12.13        160.98
## Coverage.H12.19        156.38
## Ess.total              107.00
## Ess.unique             105.00

Using paried-end connections

Until now we have just used coverage profiles to extract scaffolds related to our genome of interest. However, some scaffolds might be present in many copies (repeats) and hence have a much higher coverage than the rest of the genome. In addition, some scaffolds will by chance have a slightly different coverage profile than the rest of the genome and thereby also be missed.

The function mmplot_network can be used to generate a network plot of scaffolds connected by paired-end reads. We start by plotting the scaffolds we have in our current subset.

mmplot_network(data = dA, network = pe, color = "H11.25", nconnections = 10, log.color = T)

To include repeats and other missed scaffolds we simply extract all scaffolds that are directly connected by paired-end reads to our current subset dB using mmextract_network. Only scaffolds directly connected to the subset is extracted.

dB <- mmextract_network(subset = dA, original = d, network = pe, nconnections = 10, type = "direct")

Subspace extraction 2

We finally remove the low abundant scaffolds that were included using the paired-end data.

p <- mmplot(data = dB, x = "H11.25", y = "H12.13", log.x = T, log.y = T, color = "essential")

#p
#sel <- mmplot_locator(p)

sel <- data.frame(H11.25  =  c(420, 535, 1600, 17400, 15500, 3150, 415),
                  H12.13  =  c(137, 65.1, 98.8, 2010, 4000, 1580, 240))

mmplot_selection(p, sel) +
  theme(axis.line.x = element_line(), 
        axis.line.y = element_line())

Extract the scaffolds in the selection.

dC <- mmextract(dB, sel)

Look at the basic stats.

mmstats(dC, ncov = 5)
##                 General Stats
## n.scaffolds            189.00
## GC.mean                 62.20
## N50                  70641.00
## Length.total       4798751.00
## Length.max          230584.00
## Length.mean          25390.20
## Coverage.H09.06          0.41
## Coverage.H11.05          1.46
## Coverage.H11.25        839.39
## Coverage.H12.13        172.67
## Coverage.H12.19        164.06
## Ess.total              107.00
## Ess.unique             105.00

Export the scaffolds

Now that we are happy with the genome bin, the scaffolds can be exported to a separate fasta file using mmexport.

mmexport(data=dC, assembly=assembly, file = "Accumulibacter.fa")