
Recovery of complete genomes from metagenomes


Frequently Asked Questions

Can you make the re-assembly of the Wrighton data available?

The re-assembly of the Wrighton et al., 2012 data can be found in the Additional Data section.

Why can I not match your sequences on NCBI with the data in your scripts?

The sequences on NCBI are contigs. Their headers encode the parent scaffold: e.g. HPminus3.2 is the second contig of scaffold number 3. Scaffold number 3 is the ID used in the R data to associate e.g. coverage with the scaffold.

Alternatively you can download the scaffolds here.
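For scripted lookups, the scaffold number can be stripped out of a contig header with standard shell tools; a minimal sketch using sed, assuming the HPminus&lt;scaffold&gt;.&lt;contig&gt; naming shown above:

```shell
# Extract the scaffold number from a contig header like "HPminus3.2"
echo "HPminus3.2" | sed -E 's/^HPminus([0-9]+)\.[0-9]+$/\1/'
# prints "3"
```

The same substitution can be applied to a whole FASTA header file to build a contig-to-scaffold lookup table.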

I do not have access to CLC - how do I map reads to the assembly and generate the .csv file used in R?

There is plenty of great free short-read mapping software available. We usually use bowtie2 when we need an alternative to CLC. To map reads to an assembly, you first need to generate an indexed version of the de novo assembly (assembly.fa) using bowtie2-build.

bowtie2-build assembly.fa assembly

You can now map the reads to the assembly and store the alignment in SAM format using bowtie2.

bowtie2 -x assembly -U reads.fastq -S alignment.sam

From the alignment (alignment.sam) there are many ways to extract the coverage of each scaffold. One option is to use the depth function in samtools.

Using samtools, first convert the .sam file to .bam format and then sort it. The depth function can then be used to generate a position-based coverage profile of each scaffold (depth.txt).

samtools view -bS alignment.sam > alignment.bam
samtools sort -o alignment.sorted.bam alignment.bam
samtools depth alignment.sorted.bam > depth.txt

To generate a single average coverage value for each scaffold you can use the small perl script in the folder multi-metagenome/misc.scripts. It generates coverage.length.csv, which is equivalent to the .csv file obtained using CLC.

perl multi-metagenome/misc.scripts/calc.coverage.in.bam.depth.pl -i depth.txt -o coverage.length.csv

Alternatively, you can load depth.txt directly into R and do the calculations there. However, the depth.txt file can be quite large, as it stores one line for each base in the assembly. If you want to handle the data in R, you could do something like this:

cov <- read.delim("depth.txt", header = F)
colnames(cov) <- c("scaffold", "position", "coverage")
# Mean coverage of each scaffold
cov.scaffold <-$coverage, cov$scaffold, mean))
# Length of each scaffold (position of the last covered base)
length.scaffold <-$position, cov$scaffold, max))
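If you would rather not load the full depth.txt into R, the same per-scaffold summary (mean coverage and length) can be computed in a single awk pass. A sketch under the same assumptions as above: three tab-separated columns (scaffold, position, coverage), one line per covered base, and scaffold length taken as the highest covered position. The tiny printf input is only there to make the example self-contained:

```shell
# Tiny example input; on real data use the depth.txt produced by samtools
printf 'scaffold_1\t1\t10\nscaffold_1\t2\t20\n' > depth.txt
# Sum coverage and track the highest position per scaffold, then print
# scaffold, mean coverage and length as comma-separated values
awk -F'\t' '{ sum[$1] += $3; if ($2 > len[$1]) len[$1] = $2 }
            END { for (s in sum) print s "," sum[s] / len[s] "," len[s] }' depth.txt
# prints "scaffold_1,15,2"
```

On real data, drop the printf line and redirect the awk output to coverage.length.csv.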

I do not have access to CLC - can you recommend other metagenome assemblers?

There are several free de novo assembly programs that are able to handle metagenome data. See e.g. SOAPdenovo, metAMOS, Ray Meta or IDBA-UD.

Alternatively the program khmer can be used as a preprocessing step to enable de novo assembly of very large datasets.