In case you haven’t installed the mmgenome package, see the Load data example.
library("mmgenome")
The Rmarkdown file Load_data.Rmd describes the data that is to be loaded. The data is then loaded using the mmimport
function. The data loading and genome extraction is split to enable cleaner workflows. I.e. you load data once, but extract multiple genomes in their separate Rmarkdown file.
#mmimport(file = "Load_data.Rmd")
However, in this example we load the attached example dataset rocco
.
data(rocco)
The object d
contains information on scaffolds and essential genes within the scaffolds. The rocco
dataset contains the following information on each scaffold.
colnames(d$scaffolds)
## [1] "scaffold" "length" "gc" "C13.11.14" "C13.11.25"
## [6] "C13.12.03" "C14.01.09" "PC1" "PC2" "PC3"
## [11] "essential" "rRNA16S" "pps_phylum" "pps_class" "pps_order"
## [16] "pps_family" "pps_genus"
The basic statistics of the full dataset can be summarised using the mmstats
function.
mmstats(d, ncov = 4)
## General Stats
## n.scaffolds 97285.0
## GC.mean 52.2
## N50 5291.0
## Length.total 331908376.0
## Length.max 1446979.0
## Length.mean 3411.7
## Coverage.C13.11.14 2.6
## Coverage.C13.11.25 16.1
## Coverage.C13.12.03 2.9
## Coverage.C14.01.09 11.6
## Ess.total 7138.0
## Ess.unique 109.0
The mmplot
function is used to generate a differential coverage plot. The plot is stored in the object p
and then plotted using p
. mmplot_locator
is used to interactively define a subspace to extract on the plot. For complete reproducibility the coordinates of the subspace is also stored manually (the coordinates are written to the console). mmplot_selection
is used plot the subspace in the final markdown report. To use mmplot_locator
, simply plot the data using p
and then run mmplot_locator(p)
to interactively define the subspace, press finish
when done.
Here we choose an Actinobacteria that seems like an easy target.
p <- mmplot(data = d, x = "C13.11.25", y = "C14.01.09", log.x = T, log.y = T, color = "essential", minlength = 3000)
#p
#sel <- mmplot_locator(p)
sel <- data.frame(C13.11.25 = c(7.2, 16.2, 25.2, 23.3, 10.1),
C14.01.09 = c(47, 77, 52.8, 29.5, 22.1))
mmplot_selection(p, sel)
The scaffolds included in the defined subspace are extracted using the mmextract
function. Note: all scaffolds in the subspace are extracted even though only scaffolds over 3000 bp were used for plotting in the example above. mmextract
also contains a minlength
parameter if needed.
dA <- mmextract(d, sel)
The mmstats
function applies to any extracted object. Hence, it can be used directly on the subset.
mmstats(dA, ncov = 4)
## General Stats
## n.scaffolds 61.0
## GC.mean 71.0
## N50 199460.0
## Length.total 4452901.0
## Length.max 454223.0
## Length.mean 72998.4
## Coverage.C13.11.14 0.6
## Coverage.C13.11.25 13.3
## Coverage.C13.12.03 6.9
## Coverage.C14.01.09 38.8
## Ess.total 109.0
## Ess.unique 105.0
To get a more clear look on what was in the initial subset we simply plot it using the mmplot
function. Note that the dA
object is used as input.
mmplot(data = dA, x = "C13.11.25", y = "C14.01.09", log.x = T, log.y = T, color = "essential")
Although it seems like we would be able to extract the Actinobacteria using the initial two coverage datasets, we’ll take a look at the additional data we have available.
The mmplot_pairs
function can be used to plot a number of variables against each other, to get a quick overview of which might be used to separate the genomes of interest. The scaffolds are colored with the PhyloPythiaS+ taxonomic classification.
mmplot_pairs(data = dA,
variables = c("C13.11.14","C13.11.25","C13.12.03", "C14.01.09", "gc", "PC2"),
log = c("C13.11.14","C13.11.25","C13.12.03", "C14.01.09"),
color = "pps_phylum",
textsize = 5
)
Using the C13.11.14
and C13.12.03
coverage profiles might enable removal of the leftover contaminants. Hence, they are used for a new subspace extraction.
We use the same procedure as with the initial subset. However, now the dA
subset is used as input data.
p <- mmplot(data = dA, x = "C13.12.03", y = "C13.11.14", log.x = T, log.y = T, color = "pps_phylum")
#p
#sel <- mmplot_locator(p)
sel <- data.frame(C13.12.03 = c(3.48, 4.95, 6.97, 13.6, 15.7, 9.68, 4.48),
C13.11.14 = c(0.407, 1.72, 2.92, 1.45, 0.264, 0.17, 0.163))
mmplot_selection(p, sel)
The scaffolds in the subspace are extracted and stored in the object dB
. Note that dA
is now used as the input dataset to mmextract
.
dB <- mmextract(dA, sel)
… and mmstats
is used to check the stats of extracted bin dB
.
mmstats(dB, ncov = 4)
## General Stats
## n.scaffolds 45.0
## GC.mean 71.1
## N50 199460.0
## Length.total 4384536.0
## Length.max 454223.0
## Length.mean 97434.1
## Coverage.C13.11.14 0.5
## Coverage.C13.11.25 13.2
## Coverage.C13.12.03 6.9
## Coverage.C14.01.09 38.8
## Ess.total 108.0
## Ess.unique 105.0
Until now we have just used coverage profiles to extract scaffolds related to our genome of interest. However, some scaffolds might be present in many copies (repeats) and hence have a much higher coverage than the rest of the genome. In addition, some scaffolds will by chance have a slightly different coverage profile than the rest of the genome and thereby also been missed.
The function mmplot_network
can be used to generate a network plot of scaffolds connected by paired-end reads. We start by plotting the scaffolds we have in our current subset.
mmplot_network(data = dB, network = pe, nconnections = 1, color = "essential")
Most scaffolds seem to be nicely connected and we should be able to improve the assembly substantial in the refinement stage.
To include repeats and other missed scaffolds we simply extract all scaffolds that are directly connected by paired-end reads to our current subset dB
using mmextract_network
. Only scaffolds directly connected to the subset is extracted.
dC <- mmextract_network(subset = dB, original = d, network = pe, nconnections = 1, type = "direct")
… and then plot the new subset. After the network extraction we have 4 separate clusters of scaffolds that are connected. This means that we had some scaffolds in the dB
subset that didn’t belong to our target genome. However, we have also included a few correct repeats that belongs to the target Actinobacteria. Compared to the previous network plot it can also be seen that a number of small scaffolds are now connected to the Actinobacteria bin. This is low abundant micro-diversity.
mmplot_network(data = dC, network = pe, nconnections = 1, color = "pps_phylum")
To exclude the contamination and include the correct repeats we make another subspace extraction. To make it more easy to spot the small scaffolds we remove the length scale from the scaffolds by setting a fixed point.size
.
p <- mmplot(data = dC, x = "C14.01.09", y = "C13.11.25", log.x = T, log.y = T, color = "pps_phylum", point.size = 5)
#p
#sel <- mmplot_locator(p)
sel <- data.frame(C14.01.09 = c(19, 15.5, 69.1, 103, 106, 54.9),
C13.11.25 = c(7.28, 17.2, 163, 165, 21.4, 7.76))
mmplot_selection(p, sel)
Now, before we go ahead and extract the scaffolds within the subspace we have to remove the contaminating scaffolds that were present inside our bin. As we are using the same variables for subsetting as before, the contaminants will still be there. One way to remove them is simply to identify their scaffold names.
This can be done by plotting the network graph of the current subset dC
and labelling the scaffolds using the previous subset dB
. The contaminating scaffolds are 10932
, 29668
and 50917
.
mmplot_network(data = dC, network = pe, nconnections = 1, color = "pps_phylum", highlight = dB)
The mmextract
function is used to extract the defined subset and at the same time exclude the manually specified scaffolds.
dD <- mmextract(data = dC, selection = sel, exclude = c("10932", "29668", "50917"))
The statistics of the final subset can be seen using mmstats
.
mmstats(dD, ncov = 4)
## General Stats
## n.scaffolds 45.0
## GC.mean 71.1
## N50 199460.0
## Length.total 4387186.0
## Length.max 454223.0
## Length.mean 97493.0
## Coverage.C13.11.14 0.6
## Coverage.C13.11.25 13.4
## Coverage.C13.12.03 6.9
## Coverage.C14.01.09 38.8
## Ess.total 108.0
## Ess.unique 105.0
The extracted scaffolds are connected quite nicely by PE reads and it might even be possible to close the genome using the MP data.
mmplot_network(data = dD, network = pe, nconnections = 1, color = "pps_phylum")
The extracted subset can also be highlighted in context of the original the dataset. Note that all plots are ggplot
objects and hence can be manipulated in almost any way you can imagine. Here, the limits of the axis is adjusted.
mmplot(data = d,
x = "C14.01.09",
y = "C13.11.25",
log.x = T,
log.y = T,
color = "none",
highlight = dD,
minlength = 3000) +
scale_x_log10(limits=c(0.1,300)) +
scale_y_log10(limits=c(0.05,200))
We can also compare the number of essential genes to all complete Actinobacteria in NCBI using the mmref
function. There are a few duplicated “single copy” essential genes. However, the duplicated genes are also seen in some of the complete Actinobacteria. Hint: Try using tax.aggregate = “Genus”.
mmref(data=dD, tax.level = "Phylum", tax.compare="Actinobacteria")
Now that we are happy with the genome bin, the scaffolds can be exported to a separate fasta file using mmexport
.
mmexport(data=dD, assembly=assembly, file = "Awesome_actinobacteria.fa")