Assembly
Captus’ first module is called captus_assembly, and it aims to create marker alignments across samples starting from the raw sequencing reads of each sample (or, alternatively, from previously cleaned reads or even previously assembled reads). To accomplish this, the module has four commands: clean, assemble, extract, and align, which are typically run in that order:
1. clean
This command performs adaptor trimming (plus poly-A trimming if you are cleaning RNA-Seq reads) followed by quality trimming using bbduk.sh from the BBTools suite. Once cleaning is complete, Falco or FastQC is run on the raw and cleaned reads, and an HTML report is generated summarizing the results from all the samples.
2. assemble
Using the cleaned reads produced by the previous step, Captus will perform de novo assembly using MEGAHIT. The default assembly parameters are tuned for hybridization capture data, genome skimming data, or a combination of both (CAPSKIM preset). We also provide two more presets, for RNA-Seq (RNA) and for high-coverage Whole Genome Shotgun (WGS) data. Following assembly, Captus can also remove contigs exceeding a given percentage of GC content. This is particularly useful if, for example, you are working with eukaryotes and want to remove bacterial contamination, whose contigs typically have GC contents above 60%. An HTML report summarizing several assembly statistics is also produced after this step. Even though we recommend using the cleaned reads produced by the clean command, you can also provide your own previously cleaned reads.
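The GC filter described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Captus' own implementation: the function names, the example contigs, and the 60% threshold used here are assumptions chosen for the example.

```python
def gc_percent(seq: str) -> float:
    """Return GC content as a percentage of the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def filter_by_gc(contigs: dict[str, str], max_gc: float = 60.0) -> dict[str, str]:
    """Keep only the contigs whose GC content does not exceed max_gc percent.

    The 60.0 default is an assumption for this sketch, not a Captus default.
    """
    return {name: seq for name, seq in contigs.items() if gc_percent(seq) <= max_gc}


# Hypothetical contigs: the first is AT-rich, the second is GC-rich
contigs = {
    "contig_1": "ATGCATATTTAAGCAT",  # 4 of 16 bases are G or C (25%)
    "contig_2": "GCGCGGCCGCATGCGC",  # 14 of 16 bases are G or C (87.5%)
}

kept = filter_by_gc(contigs, max_gc=60.0)
print(sorted(kept))  # ['contig_1']
```

Ambiguous bases such as N are ignored when computing the percentage, so a contig made only of Ns is reported as 0% GC and kept; a real filter may prefer to handle that case differently.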
3. extract
During this step, Captus will search the assemblies produced by the previous step for the loci contained in the provided reference sequence datasets (amino acids or nucleotides) and then extract them. Protein references can be provided as either amino acid or nucleotide sequences; these are searched and extracted using Scipio. Additionally, you can provide as references any other DNA sequences (e.g., ribosomal genes, individual exons, entire genes with introns, non-coding regions, RAD loci, etc.); in this case, Captus uses BLAT for searching and our own code for extracting and stitching partial hits if needed. Finally, since most of the assembly is typically not used because the references will not be found in most contigs, we provide the option of clustering those unused contigs across samples using MMseqs2 in order to discover new homologous regions that can be used for phylogenomics. If you have your own assemblies in FASTA format (e.g., genomes downloaded from NCBI), you can use them instead of the assemblies produced by the assemble command. As in the previous steps, Captus will produce an HTML report summarizing the marker recovery statistics across all samples and extracted markers.
4. align
In this step, Captus will process the results from the extract command. First, it will collect all the markers across samples and create a separate FASTA file per marker. Then, the reference sequences used for extraction will be added to their corresponding marker FASTA file to serve as an alignment guide. This is followed by alignment using MAFFT or MUSCLE. If you are aligning coding sequences, Captus will codon-align the nucleotide version using the amino acid alignment of the locus as a template. Captus extracts all the copies (hits) of a marker that are found in the assembly and ranks them by their similarity to the reference sequence. Once the sequences are aligned, the program filters the paralogs using either the naive method, which retains the best hit as the ortholog, or the informed method, which takes into account the references and the frequency with which they were selected across all samples to decide which of the copies most likely represents the ortholog (which is not necessarily the best hit). After paralogs have been filtered, the references used for guiding the alignment are removed. Finally, the alignments are trimmed using the recently published package ClipKIT. As in previous steps, Captus will summarize the alignment statistics of all the markers (e.g., length, mean pairwise identity, missingness, number of informative sites, etc.) and produce an HTML report.
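Two of the ideas in this step, the naive paralog filter and codon alignment, are easy to illustrate with a minimal Python sketch. It is a simplified illustration under stated assumptions, not Captus' actual code: the function names, the data layout (a list of copy name and similarity pairs per sample), and the example values are all hypothetical.

```python
def naive_filter(hits: dict[str, list[tuple[str, float]]]) -> dict[str, str]:
    """For each sample, keep the copy most similar to the reference (the best hit).

    hits maps a sample name to (copy_name, similarity_to_reference) pairs.
    """
    return {
        sample: max(copies, key=lambda copy: copy[1])[0]
        for sample, copies in hits.items()
        if copies
    }


def codon_align(aa_aligned: str, nt_unaligned: str) -> str:
    """Thread an unaligned coding sequence onto its aligned amino acid sequence.

    Every residue consumes one codon and every gap ('-') becomes '---'.
    """
    if len(nt_unaligned) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [nt_unaligned[i:i + 3] for i in range(0, len(nt_unaligned), 3)]
    residues = sum(1 for aa in aa_aligned if aa != "-")
    if residues != len(codons):
        raise ValueError("number of residues and number of codons differ")
    remaining = iter(codons)
    return "".join("---" if aa == "-" else next(remaining) for aa in aa_aligned)


# Hypothetical hits for one marker: sampleA has two copies, sampleB has one
hits = {
    "sampleA": [("copy_01", 91.5), ("copy_00", 98.2)],
    "sampleB": [("copy_00", 87.0)],
}
print(naive_filter(hits))  # {'sampleA': 'copy_00', 'sampleB': 'copy_00'}

# The gap in the amino acid alignment becomes a three-base gap in the nucleotides
print(codon_align("MK-L", "ATGAAACTG"))  # ATGAAA---CTG
```

The informed method is not sketched here because it also depends on which references were selected across all samples, and the source does not give enough detail to reproduce it faithfully.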
To show the main help of the captus_assembly module, just type captus_assembly --help:
(captus)$ captus_assembly --help
usage: captus_assembly command [options]
Captus 0.9.89: Assembly of Phylogenomic Datasets from High-Throughput Sequencing data
Captus-assembly commands:
command Program commands (in typical order of execution)
clean = Trim adaptors and quality filter reads with BBTools, run
FastQC on the raw and cleaned reads
assemble = Perform de novo assembly with MEGAHIT: Assembling reads
that were cleaned with the 'clean' command is
recommended, but reads cleaned elsewhere are also allowed
extract = Recover targeted markers with BLAT and Scipio: Extracting
markers from the assembly obtained with the 'assemble'
command is recommended, but any other assemblies in FASTA
format are also allowed.
align = Align extracted markers across samples with MAFFT or MUSCLE:
Marker alignment depends on the directory structure created
by the 'extract' command. This step also performs paralog
filtering and alignment trimming using ClipKIT
Help:
-h, --help Show this help message and exit
--version Show Captus' version number
For help on a particular command: captus_assembly command -h
So, for example, if you want to show the help of the extract command, you can type:
captus_assembly extract --help
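If you want to browse the help of all four commands in one go, a small Python wrapper can loop over them. This is just a convenience sketch: it only uses the captus_assembly command --help form shown above, the helper name help_argv is made up for this example, and the script assumes captus_assembly is on your PATH (for instance, inside an activated captus environment).

```python
import shutil
import subprocess

# The four commands, in their typical order of execution
COMMANDS = ["clean", "assemble", "extract", "align"]


def help_argv(command: str) -> list[str]:
    """Build the argument list that shows the help of one command."""
    return ["captus_assembly", command, "--help"]


if shutil.which("captus_assembly"):
    for command in COMMANDS:
        # Print the help text of each command to the terminal
        subprocess.run(help_argv(command), check=True)
else:
    print("captus_assembly was not found on PATH")
```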
Created by Edgardo M. Ortiz (06.08.2021)
Last modified by Edgardo M. Ortiz (30.05.2022)