
Recovery of complete genomes from metagenomes


Frequently Asked Questions

Can you make the re-assembly of the Wrighton data available?

The re-assembly of the Wrighton et al., 2012 data can be found in the Additional Data section.

Why can I not match your sequences on NCBI with the data in your scripts?

The sequences on NCBI are contigs. Their headers encode the parent scaffold: e.g. HPminus3.2 is the second contig of scaffold number 3. Scaffold number 3 is the ID used in the R data to associate e.g. coverage with the scaffold.

Alternatively you can download the scaffolds here.
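For scripted lookups, the scaffold number can be stripped out of a contig header with standard shell tools; a minimal sketch using sed, assuming the HPminus&lt;scaffold&gt;.&lt;contig&gt; naming shown above:

```shell
# Extract the scaffold number from a contig header like "HPminus3.2"
echo "HPminus3.2" | sed -E 's/^HPminus([0-9]+)\.[0-9]+$/\1/'
# prints "3"
```

The same substitution can be applied to a whole FASTA header file to build a contig-to-scaffold lookup table.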

I do not have access to CLC - how do I map reads to the assembly and generate the .csv file used in R?

There is plenty of great free short-read mapping software available. We usually use bowtie2 when we need an alternative to CLC. To map reads to an assembly, you first need to generate an indexed version of the de novo assembly (assembly.fa) using bowtie2-build.

bowtie2-build assembly.fa assembly

You can now map the reads to the assembly and store the alignment in SAM format using bowtie2.

bowtie2 -x assembly -U reads.fastq -S alignment.sam

From the alignment (alignment.sam) there are many ways to extract the coverage of each scaffold. One option is to use the depth function in samtools.

Using samtools, first convert the .sam file to .bam format and then sort it. The depth function can then be used to generate a position-based coverage profile of each scaffold (depth.txt).

samtools view -bS alignment.sam > alignment.bam
samtools sort -o alignment.sorted.bam alignment.bam
samtools depth alignment.sorted.bam > depth.txt

To generate a single average coverage value for each scaffold you can use the small perl script in the folder multi-metagenome/misc.scripts. It generates coverage.length.csv, which is equivalent to the .csv file obtained using CLC.

perl multi-metagenome/misc.scripts/calc.coverage.in.bam.depth.pl -i depth.txt -o coverage.length.csv

Alternatively, you can load depth.txt directly into R and do the calculations there. However, the depth.txt file can be quite large, as it stores one line for each base in the assembly. If you want to handle the data in R, you could do something like this:

cov <- read.delim("depth.txt", header = F)
colnames(cov) <- c("scaffold", "position", "coverage")
# Mean coverage of each scaffold
cov.scaffold <-$coverage, cov$scaffold, mean))
# Length of each scaffold (position of the last covered base)
length.scaffold <-$position, cov$scaffold, max))
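If you would rather not load the full depth.txt into R, the same per-scaffold summary (mean coverage and length) can be computed in a single awk pass. A sketch under the same assumptions as above: three tab-separated columns (scaffold, position, coverage), one line per covered base, and scaffold length taken as the highest covered position. The tiny printf input is only there to make the example self-contained:

```shell
# Tiny example input; on real data use the depth.txt produced by samtools
printf 'scaffold_1\t1\t10\nscaffold_1\t2\t20\n' > depth.txt
# Sum coverage and track the highest position per scaffold, then print
# scaffold, mean coverage and length as comma-separated values
awk -F'\t' '{ sum[$1] += $3; if ($2 > len[$1]) len[$1] = $2 }
            END { for (s in sum) print s "," sum[s] / len[s] "," len[s] }' depth.txt
# prints "scaffold_1,15,2"
```

On real data, drop the printf line and redirect the awk output to coverage.length.csv.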

I do not have access to CLC - can you recommend other metagenome assemblers?

There are several free de novo assembly programs that are able to handle metagenome data. See e.g. SOAPdenovo, metAMOS, Ray Meta or IDBA-UD.

Alternatively the program khmer can be used as a preprocessing step to enable de novo assembly of very large datasets.