Introduction

This report documents the binning of a comammox Nitrospira genome from an enrichment reactor. See van Kessel et al., 2015: Complete nitrification by a single microorganism, for further details.

Load the mmgenome package

In case you haven’t installed the mmgenome package, see the Load data example.

library("mmgenome")

Import data

The Rmarkdown file Load_data.Rmd describes the data that is to be loaded. The data is then loaded using the mmimport function. Data loading and genome extraction are split to enable cleaner workflows, i.e. you load the data once but extract each genome in its own Rmarkdown file.

load("vanKessel.RData")

Data overview

The object d contains information on the scaffolds and on the essential genes found within them. For each scaffold the dataset contains the following information: the columns CTAB, KITpe, KITmp and KIT contain the coverage in 4 different samples; PC1, PC2 and PC3 contain the coordinates on the first three principal components of a PCA on tetranucleotide frequencies; essential contains taxonomic information for each scaffold, based on classification of its essential genes; rRNA16S contains taxonomic information for scaffolds that carry a 16S rRNA gene.

colnames(d$scaffolds)
##  [1] "scaffold"  "length"    "gc"        "CTAB"      "KITpe"    
##  [6] "KITmp"     "KIT"       "PC1"       "PC2"       "PC3"      
## [11] "essential" "rRNA16S"
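To make the origin of the PC1, PC2 and PC3 columns more concrete, the following is a minimal base R sketch of a PCA on tetranucleotide frequencies. It is not the code used by mmgenome: the random toy sequences, the helper count_tetra and the choice to skip reverse-complement merging are all assumptions made for illustration.

```r
set.seed(1)
bases <- c("A", "C", "G", "T")

# All 256 possible tetranucleotides (hypothetical helper vector)
kmers <- apply(expand.grid(bases, bases, bases, bases), 1, paste0, collapse = "")

# Relative frequency of each tetranucleotide in one sequence,
# counted with a sliding window of width 4 (illustrative helper)
count_tetra <- function(s) {
  starts <- seq_len(nchar(s) - 3)
  words <- substring(s, starts, starts + 3)
  as.vector(table(factor(words, levels = kmers))) / length(words)
}

# Six random toy "scaffolds" of 2000 bp each (made-up data)
seqs <- vapply(1:6, function(i) {
  paste0(sample(bases, 2000, replace = TRUE), collapse = "")
}, character(1))

# One row per scaffold, one column per tetranucleotide
freq <- t(vapply(seqs, count_tetra, numeric(256), USE.NAMES = FALSE))
colnames(freq) <- kmers

# PCA on the frequency matrix; keep the first three components
pca <- prcomp(freq, center = TRUE)
pcs <- pca$x[, 1:3]
```

Each row of pcs then plays the role of the PC1, PC2 and PC3 values of one scaffold.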

The basic statistics of the full dataset can be summarised using the mmstats function.

mmstats(d, ncov = 4)
##                General Stats
## n.scaffolds         47584.00
## GC.mean                53.90
## N50                  6256.00
## Length.total    178755569.00
## Length.max        1528908.00
## Length.mean          3756.60
## Coverage.CTAB           5.81
## Coverage.KITpe          4.55
## Coverage.KITmp          6.76
## Coverage.KIT           11.31
## Ess.total            3158.00
## Ess.unique            109.00
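The N50 value is the least self-explanatory entry in this table. The sketch below shows the usual definition on made-up scaffold lengths: sort the scaffolds from longest to shortest and report the length of the scaffold at which the cumulative sum first reaches half of the total assembly length. The function n50 is a hypothetical helper, not part of mmgenome, and mmstats may handle edge cases differently.

```r
# N50: length of the scaffold at which the cumulative length,
# going from longest to shortest, first reaches half of the total
n50 <- function(lengths) {
  sorted <- sort(as.numeric(lengths), decreasing = TRUE)
  cum <- cumsum(sorted)
  sorted[which(cum >= sum(sorted) / 2)[1]]
}

toy_lengths <- c(100, 200, 300, 400, 500)  # made-up scaffold lengths
toy_n50 <- n50(toy_lengths)                # 400
toy_mean <- mean(toy_lengths)              # 300
```

In the toy example the total length is 1500; the two longest scaffolds (500 and 400) together reach 900, which is the first cumulative sum at or above 750, so the N50 is 400 while the mean length is 300.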

Nitrospira 1

Initial subspace extraction (Figure ED2a)

In general, the metagenome assembly is of good quality and the two Nitrospira genomes can easily be identified. The first Nitrospira genome is located in the same coverage space as a genome from the phylum Planctomycetes. However, the Planctomycetes scaffolds are removed in the subsequent steps.

p <- mmplot(data = d, 
            x = "CTAB", 
            y = "KITpe", 
            color = "essential", 
            minlength = 10000) 

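# Run the two lines below interactively to draw the selection on the plot,
# then hard-code the returned coordinates as done for sel below.
#p
#sel <- mmplot_locator(p)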

sel <- data.frame(CTAB  =  c(10.2, 11.3, 14.3, 16.7, 15.6, 14.1, 10.5),
                  KITpe  =  c(9.82, 12.6, 14.1, 11.5, 9.39, 6.93, 7.12))

mmplot_selection(p, sel)  +
  scale_x_log10(limits = c(1,50), breaks = c(1, 2, 5, 10, 25, 50)) +
  scale_y_log10(limits = c(1,50), breaks = c(1, 2, 5, 10, 25, 50)) +
  scale_size_area(breaks = c(10000, 50000,  100000, 500000, 1000000), 
                  max_size = 20, labels = c(10, 50, 100, 500, 1000), 
                  name = "Scaffold Length (Kbp)") +
  scale_color_discrete(name = "Taxonomy") +
  xlab("Coverage (CTAB)") +
  ylab("Coverage (KITpe)") +
  theme_classic()

The scaffolds included in the defined subspace are extracted using the mmextract function. Note that all scaffolds in the defined subspace are extracted, not just the scaffolds over 10 kbp that were plotted.

dA <- mmextract(d, sel)
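Conceptually, the extraction keeps every scaffold whose (CTAB, KITpe) coverage pair falls inside the polygon defined by sel. The base R sketch below illustrates that idea with a standard ray-casting point-in-polygon test. It is not mmextract's implementation: the helper point_in_polygon and the toy scaffolds are hypothetical, and it assumes the test is done on untransformed coverage values.

```r
# Polygon vertices, copied from the selection above
sel <- data.frame(CTAB  = c(10.2, 11.3, 14.3, 16.7, 15.6, 14.1, 10.5),
                  KITpe = c(9.82, 12.6, 14.1, 11.5, 9.39, 6.93, 7.12))

# Ray casting: a point is inside if a horizontal ray starting at the
# point crosses the polygon boundary an odd number of times
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- rep(FALSE, length(px))
  j <- n  # index of the previous vertex; starts at the closing edge
  for (i in seq_len(n)) {
    crosses <- ((vy[i] > py) != (vy[j] > py)) &
      (px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i])
    inside[crosses] <- !inside[crosses]
    j <- i
  }
  inside
}

# Three made-up scaffolds: one inside the selection, two outside
toy <- data.frame(scaffold = c("s1", "s2", "s3"),
                  CTAB     = c(13, 5, 20),
                  KITpe    = c(10, 5, 10))

keep <- point_in_polygon(toy$CTAB, toy$KITpe, sel$CTAB, sel$KITpe)
toy_subset <- toy[keep, ]  # only s1 remains
```

Because the test uses only the coverage coordinates, scaffold length plays no role, which is why scaffolds shorter than the plotting cutoff are extracted as well.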

The mmstats function applies to any extracted object. Hence, it can be used directly on the subset.

mmstats(dA)
##                General Stats
## n.scaffolds           311.00
## GC.mean                50.70
## N50                152547.00
## Length.total      6975425.00
## Length.max        1073143.00
## Length.mean         22429.00
## Coverage.CTAB          13.69
## Coverage.KITpe         10.07
## Ess.total             152.00
## Ess.unique            106.00

Identify the next relevant variables for subsetting

The function mmplot_pairs can visualize a number of different variables at the same time. In this case PC2 and PC3 seem like a good choice.

mmplot_pairs(data = dA, variables = c("CTAB", "KIT", "gc", "PC1", "PC2", "PC3"))