Load the mmgenome package

In case you haven’t installed the mmgenome package, see the Load data example.

library("mmgenome")

Import data

The Rmarkdown file Load_data.Rmd describes the data that is to be loaded. The data is then loaded using the mmimport function. The data loading and genome extraction is split to enable cleaner workflows. I.e. you load data once, but extract multiple genomes in their separate Rmarkdown file.

#mmimport(file = "Load_data.Rmd")

However, in this example we load the attached example dataset rocco.

data(rocco)

Data overview

The object d contains information on scaffolds and essential genes within the scaffolds. The rocco dataset contains the following information on each scaffold.

colnames(d$scaffolds)
##  [1] "scaffold"   "length"     "gc"         "C13.11.14"  "C13.11.25" 
##  [6] "C13.12.03"  "C14.01.09"  "PC1"        "PC2"        "PC3"       
## [11] "essential"  "rRNA16S"    "pps_phylum" "pps_class"  "pps_order" 
## [16] "pps_family" "pps_genus"

The basic statistics of the full dataset can be summarised using the mmstats function.

mmstats(d, ncov = 4)
##                    General Stats
## n.scaffolds              97285.0
## GC.mean                     52.2
## N50                       5291.0
## Length.total         331908376.0
## Length.max             1446979.0
## Length.mean               3411.7
## Coverage.C13.11.14           2.6
## Coverage.C13.11.25          16.1
## Coverage.C13.12.03           2.9
## Coverage.C14.01.09          11.6
## Ess.total                 7138.0
## Ess.unique                 109.0

Initial scaffold extraction

The mmplot function is used to generate a differential coverage plot. The plot is stored in the object p and then plotted using p. mmplot_locator is used to interactively define a subspace to extract on the plot. For complete reproducibility the coordinates of the subspace is also stored manually (the coordinates are written to the console). mmplot_selection is used plot the subspace in the final markdown report. To use mmplot_locator, simply plot the data using p and then run mmplot_locator(p) to interactively define the subspace, press finish when done.

Here we choose an Actinobacteria that seems like an easy target.

p <- mmplot(data = d, x = "C13.11.25", y = "C14.01.09", log.x = T, log.y = T, color = "essential", minlength = 3000)

#p
#sel <- mmplot_locator(p)

sel <- data.frame(C13.11.25  =  c(7.2, 16.2, 25.2, 23.3, 10.1),
                  C14.01.09  =  c(47, 77, 52.8, 29.5, 22.1))

mmplot_selection(p, sel)

plot of chunk zoomA

The scaffolds included in the defined subspace are extracted using the mmextract function. Note: all scaffolds in the subspace are extracted even though only scaffolds over 3000 bp were used for plotting in the example above. mmextract also contains a minlength parameter if needed.

dA <- mmextract(d, sel)

The mmstats function applies to any extracted object. Hence, it can be used directly on the subset.

mmstats(dA, ncov = 4)
##                    General Stats
## n.scaffolds                 61.0
## GC.mean                     71.0
## N50                     199460.0
## Length.total           4452901.0
## Length.max              454223.0
## Length.mean              72998.4
## Coverage.C13.11.14           0.6
## Coverage.C13.11.25          13.3
## Coverage.C13.12.03           6.9
## Coverage.C14.01.09          38.8
## Ess.total                  109.0
## Ess.unique                 105.0

To get a more clear look on what was in the initial subset we simply plot it using the mmplot function. Note that the dA object is used as input.

mmplot(data = dA, x = "C13.11.25", y = "C14.01.09", log.x = T,  log.y = T, color = "essential")

plot of chunk replotA

Finding the relevant data for next subset

Although it seems like we would be able to extract the Actinobacteria using the initial two coverage datasets, we’ll take a look at the additional data we have available.

The mmplot_pairs function can be used to plot a number of variables against each other, to get a quick overview of which might be used to separate the genomes of interest. The scaffolds are colored with the PhyloPythiaS+ taxonomic classification.

mmplot_pairs(data = dA,
             variables = c("C13.11.14","C13.11.25","C13.12.03", "C14.01.09", "gc", "PC2"), 
             log = c("C13.11.14","C13.11.25","C13.12.03", "C14.01.09"),
             color = "pps_phylum",
             textsize = 5
             )