Assembly
Captus’ first module is called captus_assembly
but can also be executed as simply captus
and it aims to create marker alignments across samples starting from the raw sequencing reads of each sample (or alternatively from previously cleaned reads or even previously assembled reads).
To accomplish this the module has four commands: clean
, assemble
, extract
, and align
which are tipically run in that order:
1. clean
This command will perform adaptor trimming (plus poly-A trimming if you are cleaning RNAseq reads) followed by quality trimming using bbduk.sh
from the BBTools suite. Once the cleaning is completed, FastQC
or Falco
is run on the raw and cleaned reads and a HTML report is generated summarizing the results from all the samples.
2. assemble
Using the cleaned reads produced by the previous step, Captus
will perform de novo assembly using MEGAHIT
. The default assembly parameters are tuned for hybridization capture or genome skimming data or a combination of both (CAPSKIM
preset), additionally we provide two more presets for RNA-Seq (RNA
) or high-coverage Whole Genome Shotgun (WGS
) data. MEGAHIT
only outputs a rough estimation of depth of coverage for the assembled contigs, therefore we run Salmon
right after assembly to rapidly and accurately estimate contig depth of coverage and automatically remove contigs with depth <1.5x (but this default threshold can be changed). Captus
can also remove contigs exceeding a given percentage of GC content. This is particularly useful if, for example, you are working with Eukaryotes and want to remove bacterial contamination, whose contigs typically have GC contents above 60 %. An HTML report summarizing several assembly statistics is also produced after this step. After examining the report, the filtering by GC content and/or minimum depth of coverage can be repeated multiple times without the need to reassemble the reads. Even though we recommend using the cleaned reads produced by the clean
command you can also provide your own previously cleaned reads.
3. extract
During this step Captus
will search the assemblies produced by the previous step for the loci contained in the provided reference target sequence datasetsets (aminoacids or nucleotides) and then extract them. Proteins can be provided in either aminoacid or nucleotide, these are searched and extracted using Scipio
. Additionally, you can provide as reference targets any other DNA sequences (e.g., ribosomal genes, individual exons, entire genes with introns, non-coding regions, RAD loci, etc.), in this case Captus
uses BLAT
for searching and our own code for extracting and stitching partial hits if needed. Finally, since most of the contigs in the assembly are tipically not used because the targets will not be found in most contigs, we provide the option of clustering those unused contigs across samples using MMseqs2
in order to discover new putative homologous regions that can be used for phylogenomics. If you have your own assemblies in FASTA format you can use them instead of the assemblies produced by the assemble
command (e.g., downloaded genomes from NCBI). Like in the previous steps, Captus
will produce an HTML report summarizing the marker recovery statistics across all samples and extracted markers.
4. align
In this step Captus
will process the results from the extract
command. First, it will collect all the markers across samples and create a separate FASTA file per marker. Then, the reference sequences used for extraction will be added to their corresponding FASTA marker file to aid as an alignment guide. This is followed by alignment using MAFFT
or MUSCLE
. If you are aligning coding sequences, Captus
will codon-align the nucleotide version using as template the aminoacid alignment of the locus. Captus
extracts all the copies (hits) of a marker that are found in the assembly and ranks them by their similarity to the reference sequence, once the sequences are aligned, the program filters the paralogs using either the naive
method which retains the best hit as the ortholog or the informed
method which takes into account the references and the frequency with which they were selected across all samples to decide which of the copies most likely represents the ortholog (which is not necessarily the best hit). After paralogs have been filtered, the references used for guiding the alignment are removed. Finally, the alignments are trimmed using the recently published package ClipKIT
. As in previous steps, Captus
will summarize the alignment statistics of all the markers (e.g. length, mean pairwise identity, missingness, number of informative sites, etc.) and produce an HTML report.
To show the main help of the captus_assembly
module just type captus --help
:
(captus)$ captus --help
usage: captus command [options]
Captus 1.1.0: Assembly of Phylogenomic Datasets from High-Throughput Sequencing data
Captus-assembly commands:
command Program commands (in typical order of execution)
clean = Trim adaptors and quality filter reads with BBTools,
run FastQC on the raw and cleaned reads
assemble = Perform de novo assembly with MEGAHIT and estimate
contig depth of coverage with Salmon: Assembling
reads that were cleaned with the 'clean' command is
recommended, but reads cleaned elsewhere are also
allowed
extract = Recover targeted markers with BLAT and Scipio:
Extracting markers from the assembly obtained with
the 'assemble' command is recommended, but any other
assemblies in FASTA format are also allowed
align = Align extracted markers across samples with MAFFT or
MUSCLE: Marker alignment depends on the directory
structure created by the 'extract' command. This step
also performs paralog filtering and alignment trimming
using ClipKIT
Help:
-h, --help Show this help message and exit
--version Show Captus' version number
For help on a particular command: captus_assembly command -h
So, for example, if you want to show the help of the extract
command you can type:
captus extract --help
Created by Edgardo M. Ortiz (06.08.2021)
Last modified by Edgardo M. Ortiz (19.12.2024)