Example: Genome Extraction

Load the mmgenome package

In case you haven’t installed the mmgenome package, see the Load data example.

library("mmgenome")

Import data

The Rmarkdown file Load_data.Rmd describes the data that is to be loaded. The data is then loaded using the mmimport function. The data loading and genome extraction is split to enable cleaner workflows. I.e. you load data once, but extract multiple genomes in their separate Rmarkdown file.

#mmimport(file = "Load_data.Rmd")

However, in this example we load the attached example dataset rocco.

data(rocco)

Data overview

The object d contains information on scaffolds and essential genes within the scaffolds. The rocco dataset contains the following information on each scaffold.

colnames(d$scaffolds)

##  [1] "scaffold"   "length"     "gc"         "C13.11.14"  "C13.11.25" 
##  [6] "C13.12.03"  "C14.01.09"  "PC1"        "PC2"        "PC3"       
## [11] "essential"  "rRNA16S"    "pps_phylum" "pps_class"  "pps_order" 
## [16] "pps_family" "pps_genus"

The basic statistics of the full dataset can be summarised using the mmstats function.

mmstats(d, ncov = 4)

##                    General Stats
## n.scaffolds              97285.0
## GC.mean                     52.2
## N50                       5291.0
## Length.total         331908376.0
## Length.max             1446979.0
## Length.mean               3411.7
## Coverage.C13.11.14           2.6
## Coverage.C13.11.25          16.1
## Coverage.C13.12.03           2.9
## Coverage.C14.01.09          11.6
## Ess.total                 7138.0
## Ess.unique                 109.0

Initial scaffold extraction

The mmplot function is used to generate a differential coverage plot. The plot is stored in the object p and then plotted using p. mmplot_locator is used to interactively define a subspace to extract on the plot. For complete reproducibility the coordinates of the subspace is also stored manually (the coordinates are written to the console). mmplot_selection is used plot the subspace in the final markdown report. To use mmplot_locator, simply plot the data using p and then run mmplot_locator(p) to interactively define the subspace, press finish when done.

Here we choose an Actinobacteria that seems like an easy target.

p <- mmplot(data = d, x = "C13.11.25", y = "C14.01.09", log.x = T, log.y = T, color = "essential", minlength = 3000)

#p
#sel <- mmplot_locator(p)

sel <- data.frame(C13.11.25  =  c(7.2, 16.2, 25.2, 23.3, 10.1),
                  C14.01.09  =  c(47, 77, 52.8, 29.5, 22.1))

mmplot_selection(p, sel)

plot of chunk zoomA

The scaffolds included in the defined subspace are extracted using the mmextract function. Note: all scaffolds in the subspace are extracted even though only scaffolds over 3000 bp were used for plotting in the example above. mmextract also contains a minlength parameter if needed.

dA <- mmextract(d, sel)

The mmstats function applies to any extracted object. Hence, it can be used directly on the subset.

mmstats(dA, ncov = 4)

##                    General Stats
## n.scaffolds                 61.0
## GC.mean                     71.0
## N50                     199460.0
## Length.total           4452901.0
## Length.max              454223.0
## Length.mean              72998.4
## Coverage.C13.11.14           0.6
## Coverage.C13.11.25          13.3
## Coverage.C13.12.03           6.9
## Coverage.C14.01.09          38.8
## Ess.total                  109.0
## Ess.unique                 105.0

To get a more clear look on what was in the initial subset we simply plot it using the mmplot function. Note that the dA object is used as input.

mmplot(data = dA, x = "C13.11.25", y = "C14.01.09", log.x = T,  log.y = T, color = "essential")

plot of chunk replotA

Finding the relevant data for next subset

Although it seems like we would be able to extract the Actinobacteria using the initial two coverage datasets, we’ll take a look at the additional data we have available.

The mmplot_pairs function can be used to plot a number of variables against each other, to get a quick overview of which might be used to separate the genomes of interest. The scaffolds are colored with the PhyloPythiaS+ taxonomic classification.

mmplot_pairs(data = dA,
             variables = c("C13.11.14","C13.11.25","C13.12.03", "C14.01.09", "gc", "PC2"), 
             log = c("C13.11.14","C13.11.25","C13.12.03", "C14.01.09"),
             color = "pps_phylum",
             textsize = 5
             )

plot of chunk pairsA

Using the C13.11.14 and C13.12.03 coverage profiles might enable removal of the leftover contaminants. Hence, they are used for a new subspace extraction.

Subspace extraction 2

We use the same procedure as with the initial subset. However, now the dA subset is used as input data.

p <- mmplot(data = dA, x = "C13.12.03", y = "C13.11.14", log.x = T, log.y = T, color = "pps_phylum")

#p
#sel <- mmplot_locator(p)

sel <- data.frame(C13.12.03  =  c(3.48, 4.95, 6.97, 13.6, 15.7, 9.68, 4.48),
                  C13.11.14  =  c(0.407, 1.72, 2.92, 1.45, 0.264, 0.17, 0.163))

mmplot_selection(p, sel)

plot of chunk zoomB

The scaffolds in the subspace are extracted and stored in the object dB. Note that dA is now used as the input dataset to mmextract.

dB <- mmextract(dA, sel)

… and mmstats is used to check the stats of extracted bin dB.

mmstats(dB, ncov = 4)

##                    General Stats
## n.scaffolds                 45.0
## GC.mean                     71.1
## N50                     199460.0
## Length.total           4384536.0
## Length.max              454223.0
## Length.mean              97434.1
## Coverage.C13.11.14           0.5
## Coverage.C13.11.25          13.2
## Coverage.C13.12.03           6.9
## Coverage.C14.01.09          38.8
## Ess.total                  108.0
## Ess.unique                 105.0

Using paried-end connections

Until now we have just used coverage profiles to extract scaffolds related to our genome of interest. However, some scaffolds might be present in many copies (repeats) and hence have a much higher coverage than the rest of the genome. In addition, some scaffolds will by chance have a slightly different coverage profile than the rest of the genome and thereby also been missed.

The function mmplot_network can be used to generate a network plot of scaffolds connected by paired-end reads. We start by plotting the scaffolds we have in our current subset.

mmplot_network(data = dB, network = pe, nconnections = 1, color = "essential")

plot of chunk networkB

Most scaffolds seem to be nicely connected and we should be able to improve the assembly substantial in the refinement stage.

To include repeats and other missed scaffolds we simply extract all scaffolds that are directly connected by paired-end reads to our current subset dB using mmextract_network. Only scaffolds directly connected to the subset is extracted.

dC <- mmextract_network(subset = dB, original = d, network = pe, nconnections = 1, type = "direct")

… and then plot the new subset. After the network extraction we have 4 separate clusters of scaffolds that are connected. This means that we had some scaffolds in the dB subset that didn’t belong to our target genome. However, we have also included a few correct repeats that belongs to the target Actinobacteria. Compared to the previous network plot it can also be seen that a number of small scaffolds are now connected to the Actinobacteria bin. This is low abundant micro-diversity.

mmplot_network(data = dC, network = pe, nconnections = 1, color = "pps_phylum")

plot of chunk networkC

Subspace extraction 3

To exclude the contamination and include the correct repeats we make another subspace extraction. To make it more easy to spot the small scaffolds we remove the length scale from the scaffolds by setting a fixed point.size.

p <- mmplot(data = dC, x = "C14.01.09", y = "C13.11.25", log.x = T, log.y = T, color = "pps_phylum", point.size = 5)

#p
#sel <- mmplot_locator(p)

sel <- data.frame(C14.01.09  =  c(19, 15.5, 69.1, 103, 106, 54.9),
                  C13.11.25  =  c(7.28, 17.2, 163, 165, 21.4, 7.76))

mmplot_selection(p, sel)

plot of chunk zoomC

Now, before we go ahead and extract the scaffolds within the subspace we have to remove the contaminating scaffolds that were present inside our bin. As we are using the same variables for subsetting as before, the contaminants will still be there. One way to remove them is simply to identify their scaffold names.

This can be done by plotting the network graph of the current subset dC and labelling the scaffolds using the previous subset dB. The contaminating scaffolds are 10932, 29668 and 50917.

mmplot_network(data = dC, network = pe, nconnections = 1, color = "pps_phylum", highlight = dB)

plot of chunk networkC2

The mmextract function is used to extract the defined subset and at the same time exclude the manually specified scaffolds.

dD <- mmextract(data = dC, selection = sel, exclude = c("10932", "29668", "50917"))

Final overview

The statistics of the final subset can be seen using mmstats.

mmstats(dD, ncov = 4)

##                    General Stats
## n.scaffolds                 45.0
## GC.mean                     71.1
## N50                     199460.0
## Length.total           4387186.0
## Length.max              454223.0
## Length.mean              97493.0
## Coverage.C13.11.14           0.6
## Coverage.C13.11.25          13.4
## Coverage.C13.12.03           6.9
## Coverage.C14.01.09          38.8
## Ess.total                  108.0
## Ess.unique                 105.0

The extracted scaffolds are connected quite nicely by PE reads and it might even be possible to close the genome using the MP data.

mmplot_network(data = dD, network = pe, nconnections = 1, color = "pps_phylum")

plot of chunk networkD

The extracted subset can also be highlighted in context of the original the dataset. Note that all plots are ggplot objects and hence can be manipulated in almost any way you can imagine. Here, the limits of the axis is adjusted.

mmplot(data = d, 
       x = "C14.01.09", 
       y = "C13.11.25", 
       log.x = T, 
       log.y = T, 
       color = "none", 
       highlight = dD, 
       minlength = 3000) +
  scale_x_log10(limits=c(0.1,300)) +
  scale_y_log10(limits=c(0.05,200))

Evaluate completeness

We can also compare the number of essential genes to all complete Actinobacteria in NCBI using the mmref function. There are a few duplicated “single copy” essential genes. However, the duplicated genes are also seen in some of the complete Actinobacteria. Hint: Try using tax.aggregate = “Genus”.

mmref(data=dD, tax.level = "Phylum", tax.compare="Actinobacteria")

plot of chunk mmref

Export the scaffolds

Now that we are happy with the genome bin, the scaffolds can be exported to a separate fasta file using mmexport.

mmexport(data=dD, assembly=assembly, file = "Awesome_actinobacteria.fa")