Frequently Asked Questions
Can you make the re-assembly of the Wrighton data available?
I cannot match your sequences on NCBI with the data in your script - how do they relate?
The sequences on NCBI are contigs. They have a header structure that can be used to relate them to the parent scaffold: e.g. HPminus3.2 is the second contig in scaffold number 3. Scaffold number 3 is the ID used in the R data to associate e.g. coverage with the scaffold.
Alternatively you can download the scaffolds here.
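As a sketch of that naming convention, the scaffold number can be recovered from a contig header by stripping the prefix and the contig suffix. The HPminus prefix is taken from the example above; that it generalises to all headers is an assumption.

```shell
# Extract the scaffold number from a contig header of the form
# <prefix><scaffold>.<contig>, e.g. HPminus3.2 -> scaffold 3.
# The HPminus prefix is an assumption based on the example in the text.
contig="HPminus3.2"
scaffold=$(echo "$contig" | sed 's/^HPminus//; s/\..*$//')
echo "$scaffold"   # prints 3
```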
I do not have access to CLC - how do I map reads to the assembly and generate the .csv file used in R?
There are plenty of great free short-read mappers available. We usually use bowtie2 if we need an alternative to CLC. To map reads to an assembly you first need to generate an indexed version of the de novo assembly (assembly.fa):
bowtie2-build assembly.fa assembly
You can now map the reads to the assembly and store the alignment in SAM format using bowtie2.
bowtie2 -x assembly -U reads.fastq -S alignment.sam
From the alignment (alignment.sam) there are many ways to extract the coverage of each scaffold. One option is to use the depth function in samtools.
Using samtools, first convert the .sam file to .bam format and then sort it. The depth function can then be used to generate a position-based coverage profile of each scaffold:
samtools view -bS alignment.sam > alignment.bam
samtools sort alignment.bam alignment.sorted
samtools depth alignment.sorted.bam > depth.txt
To generate a single average coverage value for each scaffold you can use the small perl script calc.coverage.in.bam.depth.pl in the folder multi-metagenome/misc.scripts. It generates coverage.length.csv, which is equivalent to the .csv file obtained using CLC.
perl calc.coverage.in.bam.depth.pl -i depth.txt -o coverage.length.csv
Alternatively, you can load depth.txt directly into R and do the calculations there. However, the depth.txt file can be quite large, as it stores a line for each base in the assembly. If you want to handle the data in R you could do something like this:
cov <- read.delim("depth.txt", header = F)
colnames(cov) <- c("scaffold", "position", "coverage")
cov.scaffold <- as.data.frame(tapply(cov$coverage, cov$scaffold, mean))
length.scaffold <- as.data.frame(tapply(cov$position, cov$scaffold, max))
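If you want to avoid loading the full depth.txt into R, the same per-scaffold averages can also be computed in a single awk pass. This is a sketch: the toy input mimics the three-column samtools depth output, and the exact column layout of the CLC-style .csv is an assumption. Note that samtools depth skips zero-coverage positions unless run with -a, so both the mean and the length are approximations for sparsely covered scaffolds.

```shell
# Toy stand-in for samtools depth output (scaffold, position, coverage).
printf 'scaffold_1\t1\t10\nscaffold_1\t2\t20\nscaffold_2\t1\t5\n' > depth.txt

# Average coverage = sum(coverage) / number of reported positions;
# length = highest reported position for each scaffold.
awk 'BEGIN { OFS="," }
     { sum[$1] += $3; n[$1]++; if ($2 > len[$1]) len[$1] = $2 }
     END { for (s in sum) print s, sum[s]/n[s], len[s] }' \
    depth.txt > coverage.length.csv
```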
I do not have access to CLC - can you recommend other metagenome assemblers?
Alternatively, the program khmer can be used as a preprocessing step to enable de novo assembly of very large datasets.