Output Files
For this example we will use the directory 02_assemblies previously created with the assemble module. We run the following Captus command to search for and extract our reference markers from the assemblies:
captus_assembly extract -a 02_assemblies \
-n Angiosperms353 \
-p SeedPlantsPTD \
-m SeedPlantsMIT \
-d noncoding_DNA.fasta \
-c \
--max_loci_files 500
Notice the addition of the option --max_loci_files 500, which is used here only to show the output directories containing a separate FASTA file per marker (3’, 4’, 5’, 6’, 13’, and 14’ in the image below, not created by default). We don’t recommend using this option in your runs since it will unnecessarily create large numbers of small FASTA files, which would have to be concatenated again anyway during the alignment step.
You can read more about the option here: --max_loci_files
After the run is finished we should see a new directory called 03_extractions with the following structure and files:
1. [SAMPLE_NAME]__captus-ext
A subdirectory ending in __captus-ext is created to contain the extracted markers of each sample separately (S1, S2, S3, and S4 in the image).
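As a rough illustration of this naming scheme, the sketch below recovers the sample names from the per-sample subdirectories. It is not part of Captus: the function name list_extracted_samples is hypothetical, and it only assumes that each sample directory is named [SAMPLE_NAME]__captus-ext as described above.

```python
from pathlib import Path

SUFFIX = "__captus-ext"  # directory suffix described above


def list_extracted_samples(extractions_dir):
    """Return the sorted sample names found as '[SAMPLE_NAME]__captus-ext' subdirectories.

    Hypothetical helper for illustration only, not a Captus function.
    """
    root = Path(extractions_dir)
    return sorted(
        d.name[: -len(SUFFIX)]
        for d in root.iterdir()
        if d.is_dir() and d.name.endswith(SUFFIX)
    )
```

For the run in this example you would call it as list_extracted_samples("03_extractions").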
2. 01_coding_NUC, 02_coding_PTD, 03_coding_MIT
These directories contain the extracted coding markers from the NUClear, PlasTiDial, and MITochondrial genomes respectively.
The markers are presented in four formats: protein sequence (coding_AA), coding sequence in nucleotides (coding_NT), exons and introns concatenated (genes), and the concatenation of exons and introns flanked by a fixed length of sequence (genes_flanked):
3. [MARKER_TYPE]_coding_AA.faa, 01_AA
Coding sequence in amino acids. Prefixes can be NUC, PTD, or MIT. For details on sequence headers see FASTA headers explanation.
4. [MARKER_TYPE]_coding_NT.fna, 02_NT
Coding sequence in nucleotides, a.k.a. CDS. Prefixes can be NUC, PTD, or MIT. For details on sequence headers see FASTA headers explanation.
5. [MARKER_TYPE]_genes.fna, 03_genes
Gene sequence (exons in capital letters + introns in lowercase letters) in nucleotides. A contig connector of 50 n characters is included when the protein match spans more than a single contig. Prefixes can be NUC, PTD, or MIT. For details on sequence headers see FASTA headers explanation.
6. [MARKER_TYPE]_genes_flanked.fna, 04_genes_flanked
Gene sequence (exons in capital letters + introns in lowercase letters) plus additional flanking sequence in lowercase nucleotides. A contig connector of 50 n characters is included when the protein match spans more than a single contig. Prefixes can be NUC, PTD, or MIT. For details on sequence headers see FASTA headers explanation.
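The letter-case convention of the genes format can be exploited when post-processing these files. The following is a minimal sketch, not Captus code: the helper names are hypothetical, and it assumes only what is stated above (exons in capitals, everything else in lowercase, and a connector of 50 n characters). Note that in genes_flanked files the lowercase segments also include the flanks, and that a lowercase run of 50 or more n inside an intron would be indistinguishable from a connector.

```python
import re

CONTIG_CONNECTOR = "n" * 50  # connector described above for multi-contig matches


def split_by_case(gene_seq):
    """Split a gene sequence into ('exon' | 'lowercase', segment) tuples.

    Uppercase runs are exons; lowercase runs are introns, connectors,
    or (in genes_flanked files) flanking sequence.
    """
    segments = []
    for match in re.finditer(r"[A-Z]+|[a-z]+", gene_seq):
        segment = match.group()
        kind = "exon" if segment.isupper() else "lowercase"
        segments.append((kind, segment))
    return segments


def spans_multiple_contigs(gene_seq):
    """Return True if the sequence contains the 50-n contig connector."""
    return CONTIG_CONNECTOR in gene_seq
```

Joining only the segments labeled exon gives the exonic portion of the gene sequence under these assumptions.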
7. [MARKER_TYPE]_contigs_list.txt
List of contig names that had protein hits. Prefixes can be NUC, PTD, or MIT.
8. [MARKER_TYPE]_contigs.gff
Annotation track in GFF format for protein hits to contigs in the assembly. Prefixes can be NUC, PTD, or MIT.
9. [MARKER_TYPE]_recovery_stats.tsv
Tab-separated-values table with marker recovery statistics. These are concatenated across marker types and samples and summarized in the final Marker Recovery report. Prefixes can be NUC, PTD, or MIT.
For more information on the table see 26. captus-assembly_extract.stats.tsv
10. [MARKER_TYPE]_scipio_final.log
Log of Scipio’s second run, where the best references have already been selected (when using references with multiple sequences per locus) and only the contigs that had hits during Scipio’s initial run are used. Prefixes can be NUC, PTD, or MIT.
11. 00_initial_scipio_[MARKER_TYPE]
Directory for Scipio’s initial run results. The directory contains the set of filtered protein references [MARKER_TYPE]_best_proteins.faa (when using references with multiple sequences per locus) and the log of Scipio’s initial run [MARKER_TYPE]_scipio_initial.log. Suffixes can be NUC, PTD, or MIT.
12. 04_misc_DNA, 05_clusters
These directories contain the extracted miscellaneous DNA markers, either from a custom set of DNA references or from the CLusteRing resulting from using the option --cluster_leftovers.
The markers are presented in two formats: matching DNA segments (matches), and the matched segments including flanks and other intervening segments not present in the reference (matches_flanked).
13. [MARKER_TYPE]_matches.fna, 01_matches
Matches per miscellaneous DNA marker in nucleotides. Prefixes can be DNA or CLR. For details on sequence headers see FASTA headers explanation.
14. [MARKER_TYPE]_matches_flanked.fna, 02_matches_flanked
Matches plus additional flanking sequence per miscellaneous DNA marker in nucleotides. Prefixes can be DNA or CLR. For details on sequence headers see FASTA headers explanation.
15. [MARKER_TYPE]_contigs_list.txt
List of contig names that had miscellaneous DNA marker hits. Prefixes can be DNA or CLR.
16. [MARKER_TYPE]_contigs.gff
Annotation track in GFF format for miscellaneous DNA marker hits to contigs in the assembly. Prefixes can be DNA or CLR.
17. [MARKER_TYPE]_recovery_stats.tsv
Tab-separated-values table with marker recovery statistics. These are concatenated across marker types and samples and summarized in the final Marker Recovery report. Prefixes can be DNA or CLR.
For more information on the table see 26. captus-assembly_extract.stats.tsv
18. [MARKER_TYPE]_blat_search.log
Log of BLAT’s run. Prefixes can be DNA or CLR.
19. 06_assembly_annotated
The main outputs of this directory are a FASTA file containing all the contigs that had hits to the reference markers, called [SAMPLE_NAME]_hit_contigs.fasta, as well as an annotation track for those markers, called [SAMPLE_NAME]_hit_contigs.gff. You can visualize the annotations in Geneious, for example, by importing the FASTA file and then dropping the GFF file on top:
20. [SAMPLE_NAME]_hit_contigs.fasta
This file contains the subset of the contigs assembled by MEGAHIT that had hits to the reference markers. See the red rectangles in 19. 06_assembly_annotated.
21. [SAMPLE_NAME]_hit_contigs.gff
Unified annotation track in GFF format for ALL the marker types found in the assembly’s contigs. See the red rectangles in 19. 06_assembly_annotated.
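If you want a quick overview of the annotation track without opening a genome browser, a sketch like the one below counts the annotated features per contig. It is not part of Captus and the function name is hypothetical. It assumes only the general GFF layout (nine tab-separated columns, with the sequence ID in the first column and comment lines starting with #), not any Captus-specific attribute.

```python
from collections import Counter


def count_features_per_contig(gff_lines):
    """Count GFF feature lines per sequence ID (first column).

    Comment lines, blank lines, and lines without nine columns are skipped.
    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue
        counts[fields[0]] += 1
    return counts
```

A file can be passed directly as an open handle, for example count_features_per_contig(open(path)), where path points to a [SAMPLE_NAME]_hit_contigs.gff file.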
22. [SAMPLE_NAME]_recovery_stats.tsv
Unified tab-separated-values table with marker recovery statistics from ALL the marker types found in the sample. These are concatenated across samples and summarized in the final Marker Recovery report.
For more information on the table see 26. captus-assembly_extract.stats.tsv
23. leftover_contigs.fasta.gz
This file contains the subset of the contigs assembled by MEGAHIT that had no hit to the reference markers. The file is compressed to save space. These are the contigs that are used for clustering across samples in order to discover additional homologous markers.
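Because the file is gzip-compressed, it has to be opened accordingly if you want to inspect it yourself. The following is a minimal sketch using only the Python standard library. The function name read_fasta_gz is hypothetical and nothing here is taken from the Captus code base.

```python
import gzip


def read_fasta_gz(path):
    """Yield (header, sequence) tuples from a gzip-compressed FASTA file.

    Hypothetical helper for illustration only.
    """
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    # Emit the last record, if any
    if header is not None:
        yield header, "".join(chunks)
```

For example, sum(1 for _ in read_fasta_gz(path)) gives the number of leftover contigs in a sample, where path points to its leftover_contigs.fasta.gz file.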
24. leftover_contigs_after_custering.fasta.gz
This file contains the subset of the contigs assembled by MEGAHIT that had no hit to the reference markers or even to the newly discovered markers derived from clustering. The file is compressed to save space.
25. captus-assembly_extract.refs.json
This file stores the paths to all the references used for extraction. This file is necessary so the alignment step can correctly add the references to the final alignments to be used as guides.
26. captus-assembly_extract.stats.tsv
Unified tab-separated-values table with marker recovery statistics from ALL the markers found in ALL the samples. This table is used to create the final Marker Recovery report. Even though the report is quite useful for visualization, you might need to do more complex statistical analyses; this table is the most appropriate output file for that purpose.
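As a starting point for such analyses, the sketch below computes the mean of one numeric column grouped by another column of a tab-separated table. It is a generic example, not Captus code: the function name mean_by_group is hypothetical, and the column names are deliberately left as arguments because they are not listed in this section. Check the header line of your own table before using it.

```python
import csv
from collections import defaultdict


def mean_by_group(tsv_path, group_col, value_col):
    """Return {group: mean of value_col} for a tab-separated table with a header.

    Rows whose value cannot be parsed as a number are skipped.
    Column names are supplied by the caller; none are assumed here.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                value = float(row[value_col])
            except (TypeError, ValueError):
                continue
            sums[row[group_col]] += value
            counts[row[group_col]] += 1
    return {group: sums[group] / counts[group] for group in sums}
```

The same table can of course be loaded into any statistics package that reads tab-separated files.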
27. captus-assembly_extract.report.html
This is the final Marker Recovery report, summarizing marker extraction statistics across all samples and marker types.
28. captus-assembly_extract.log
This is the log from Captus
, it contains the command used and all the information shown during the run. If the option --show_less
was enabled, the log will also contain all the extra detailed information that was hidden during the run.
29. clust_id##.##_cov##.##_captus_clusters_refs.fasta
This FASTA file contains the cluster representatives that will be searched and extracted across samples (prefix CLR); the loci are named captus_#. These represent newly discovered homologous markers in contigs that had no hits to other reference proteins or miscellaneous DNA markers.
FASTA headers explanation
All the FASTA files produced by Captus containing extracted markers follow the same header style:
Paralogs are ranked according to their wscore which, in turn, is calculated from the percent identity (ident) as well as the percent coverage (cover) with respect to the selected reference sequence (query). The best hit is always ranked 00 and secondary hits start at 01. When a single hit is found for a marker (like locus 5859 in the image), the ranking 00 is not included in the sequence name. Only when multiple hits are found for a marker (like locus 5865 in the image) is the ranking (00, 01, and so on) included in the sequence name to make the names unique. The description field frameshifts is only present when the output sequence has corrected frameshifts, and the numbers indicate their positions in the output sequence.
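The naming rule can be restated as a short sketch. This is not the Captus implementation: the function name name_hits is hypothetical, the wscore values are taken as given (how they are derived from ident and cover is not reproduced here), and the double-underscore separator follows the header style of this section.

```python
def name_hits(sample, locus, wscores):
    """Return [(sequence_name, wscore)] with hits ranked by descending wscore.

    A single hit gets no ranking in its name; multiple hits get 00, 01, ...
    Hypothetical helper for illustration only.
    """
    ranked = sorted(wscores, reverse=True)
    if len(ranked) == 1:
        return [(f"{sample}__{locus}", ranked[0])]
    return [
        (f"{sample}__{locus}__{rank:02d}", wscore)
        for rank, wscore in enumerate(ranked)
    ]
```

With one hit the name has two parts (sample and locus), and with several hits each name carries a third part holding the two-digit ranking.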
As you can see, Sample name, Locus name, and Paralog ranking are separated by double underscores (__). This is the reason why we don’t recommend using __ inside your sample names (see sample naming convention).
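To see why a __ inside a sample name is a problem, consider a minimal parser for the sequence names. This is a sketch under the conventions described in this section, not Captus code. The function name parse_sequence_name is hypothetical, and it assumes the sequence name is the first whitespace-delimited token of the header.

```python
def parse_sequence_name(header):
    """Split a FASTA header into (sample, locus, paralog_rank).

    paralog_rank is None when the ranking was omitted (single hit).
    Raises ValueError if the name does not have two or three '__'-separated parts,
    which is what happens when a sample name itself contains '__'.
    """
    tokens = header.lstrip(">").split()
    if not tokens:
        raise ValueError("empty header")
    parts = tokens[0].split("__")
    if len(parts) == 2:
        sample, locus = parts
        return sample, locus, None
    if len(parts) == 3:
        sample, locus, rank = parts
        return sample, locus, rank
    raise ValueError(f"unexpected sequence name: {tokens[0]}")
```

A name such as my__sample__5865__00 splits into four parts, so the sample and locus can no longer be told apart unambiguously.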
Created by Edgardo M. Ortiz (06.08.2021)
Last modified by Edgardo M. Ortiz (03.06.2022)