Output Files

For this example we will use the directory 02_assemblies previously created with the assemble module. We run the following Captus command to search for and extract our reference markers from the assemblies:

captus_assembly extract -a 02_assemblies \
-n Angiosperms353 \
-p SeedPlantsPTD \
-m SeedPlantsMIT \
-d noncoding_DNA.fasta \
-c \
--max_loci_files 500
Warning

Notice the addition of the option --max_loci_files 500, which is used here only to show the output directories containing a separate FASTA file per marker (3’, 4’, 5’, 6’, 13’, and 14’ in the image below, not created by default). We don’t recommend using this option in your runs, since it unnecessarily creates large numbers of small FASTA files that would have to be concatenated again during the alignment step anyway.
You can read more about the option here: --max_loci_files

After the run is finished we should see a new directory called 03_extractions with the following structure and files:

Extractions

1. [SAMPLE_NAME]__captus-ext

A subdirectory ending in __captus-ext is created to contain the extracted markers of each sample separately (S1, S2, S3, and S4 in the image).
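
As a minimal sketch of how the per-sample subdirectories could be enumerated, the Python snippet below lists every subdirectory ending in __captus-ext and strips the suffix to recover the sample name. The function name `list_extracted_samples` is hypothetical and not part of Captus; it only relies on the naming convention described above.

```python
from pathlib import Path

SUFFIX = "__captus-ext"  # suffix of the per-sample subdirectories described above


def list_extracted_samples(extractions_dir):
    """Return the sorted sample names found as [SAMPLE_NAME]__captus-ext subdirectories."""
    root = Path(extractions_dir)
    return sorted(
        p.name[: -len(SUFFIX)]
        for p in root.iterdir()
        if p.is_dir() and p.name.endswith(SUFFIX)
    )
```

For the run in this example, it would be called as `list_extracted_samples("03_extractions")`.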


2. 01_coding_NUC, 02_coding_PTD, 03_coding_MIT

These directories contain the extracted coding markers from the NUClear, PlasTiDial, and MITochondrial genomes respectively.
The markers are presented in four formats: protein sequence (coding_AA), coding sequence in nucleotide (coding_NT), exons and introns concatenated (genes), and the concatenation of exons and introns flanked by a fixed length of sequence (genes_flanked):

Protein extraction formats


3. [MARKER_TYPE]_coding_AA.faa, 01_AA

Coding sequence in amino acids. Prefixes can be NUC, PTD, or MIT. For details on sequence headers see FASTA headers explanation.


4. [MARKER_TYPE]_coding_NT.fna, 02_NT

Coding sequence in nucleotides, a.k.a. CDS. Prefixes can be NUC, PTD, or MIT. For details on sequence headers see FASTA headers explanation.


5. [MARKER_TYPE]_genes.fna, 03_genes

Gene sequence (exons in capital letters + introns in lowercase letters) in nucleotides. A contig connector of 50 n characters is included when the protein match spans more than a single contig. Prefixes can be NUC, PTD, or MIT. For details on sequence headers see FASTA headers explanation.


6. [MARKER_TYPE]_genes_flanked.fna, 04_genes_flanked

Gene sequence (exons in capital letters + introns in lowercase letters) plus additional flanking sequence in lowercase nucleotides. A contig connector of 50 n characters is included when the protein match spans more than a single contig. Prefixes can be NUC, PTD, or MIT. For details on sequence headers see FASTA headers explanation.
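
The case convention used in the genes and genes_flanked formats makes it possible to recover exons and introns from the sequence alone. The Python sketch below splits a sequence into exon (uppercase), lowercase, and contig connector (50 n characters) segments. The function name `split_gene` is hypothetical, and the sketch assumes that any run of exactly 50 lowercase n characters is a connector, which could mislabel a longer stretch of ambiguous bases. In genes_flanked files the lowercase segments at either end would be flanking sequence rather than introns.

```python
import re

CONNECTOR = "n" * 50  # contig connector described above


def split_gene(seq):
    """Split a gene sequence into (kind, segment) tuples.

    kind is 'exon' for uppercase runs, 'lower' for lowercase runs
    (introns, or flanks in genes_flanked files), and 'connector' for
    the 50-n contig connector.
    """
    segments = []
    # Keep the connector itself in the output by using a capturing group.
    for piece in re.split(r"(n{50})", seq):
        if piece == CONNECTOR:
            segments.append(("connector", piece))
            continue
        for run in re.findall(r"[A-Z]+|[a-z]+", piece):
            kind = "exon" if run.isupper() else "lower"
            segments.append((kind, run))
    return segments
```

Joining the 'exon' segments of a genes sequence should approximate the corresponding coding_NT sequence, although we have not verified that the two files always agree base for base.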


7. [MARKER_TYPE]_contigs_list.txt

List of contig names that had protein hits. Prefixes can be NUC, PTD, or MIT.


8. [MARKER_TYPE]_contigs.gff

Annotation track in GFF format for protein hits to contigs in assembly. Prefixes can be NUC, PTD, or MIT.

See 19. 06_assembly_annotated


9. [MARKER_TYPE]_recovery_stats.tsv

Tab-separated-values table with marker recovery statistics. These are concatenated across marker types and samples and summarized in the final Marker Recovery report. Prefixes can be NUC, PTD, or MIT.

For more information on the table see 26. captus-assembly_extract.stats.tsv


10. [MARKER_TYPE]_scipio_final.log

Log of Scipio’s second run, where the best references have already been selected (when using multi-sequence per locus references) and only the contigs that had hits during Scipio’s initial run are used. Prefixes can be NUC, PTD, or MIT.


11. 00_initial_scipio_[MARKER_TYPE]

Directory for Scipio’s initial run results. The directory contains the set of filtered protein references [MARKER_TYPE]_best_proteins.faa (when using multi-sequence per locus references) and the log of Scipio’s initial run [MARKER_TYPE]_scipio_initial.log. Suffixes can be NUC, PTD, or MIT.


12. 04_misc_DNA, 05_clusters

These directories contain the extracted miscellaneous DNA markers, either from a DNA custom set of references or from the CLusteRing resulting from using the option --cluster_leftovers.
The markers are presented in two formats: matching DNA segments (matches), and the matched segments including flanks and other intervening segments not present in the reference (matches_flanked).

Miscellaneous DNA extraction formats


13. [MARKER_TYPE]_matches.fna, 01_matches

Matches per miscellaneous DNA marker in nucleotides. Prefixes can be DNA or CLR. For details on sequence headers see FASTA headers explanation.


14. [MARKER_TYPE]_matches_flanked.fna, 02_matches_flanked

Matches plus additional flanking sequence per miscellaneous DNA marker in nucleotides. Prefixes can be DNA or CLR. For details on sequence headers see FASTA headers explanation.


15. [MARKER_TYPE]_contigs_list.txt

List of contig names that had miscellaneous DNA marker hits. Prefixes can be DNA or CLR.


16. [MARKER_TYPE]_contigs.gff

Annotation track in GFF format for miscellaneous DNA marker hits to contigs in assembly. Prefixes can be DNA or CLR.

See 19. 06_assembly_annotated


17. [MARKER_TYPE]_recovery_stats.tsv

Tab-separated-values table with marker recovery statistics. These are concatenated across marker types and samples and summarized in the final Marker Recovery report. Prefixes can be DNA or CLR.

For more information on the table see 26. captus-assembly_extract.stats.tsv


18. [MARKER_TYPE]_blat_search.log

Log of BLAT’s run. Prefixes can be DNA or CLR.


19. 06_assembly_annotated

The main outputs of this directory are a FASTA file containing all the contigs that had hits to the reference markers, called [SAMPLE_NAME]_hit_contigs.fasta, as well as an annotation track for those markers, called [SAMPLE_NAME]_hit_contigs.gff. You can visualize the annotations in Geneious, for example, by importing the FASTA file and then dropping the GFF file on top:

Assembly annotated


20. [SAMPLE_NAME]_hit_contigs.fasta

This file contains the subset of the contigs assembled by MEGAHIT that had hits to the reference markers. See the red rectangles in 19. 06_assembly_annotated.


21. [SAMPLE_NAME]_hit_contigs.gff

Unified annotation track in GFF format for ALL the marker types found in the assembly’s contigs. See the red rectangles in 19. 06_assembly_annotated.
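
If you want a quick summary of the annotation track without opening a genome viewer, the Python sketch below counts annotated features per contig. It assumes only the generic GFF layout (nine tab-separated columns with the contig name in the first column, and comment lines starting with #); the function name `count_features_per_contig` is hypothetical, and nothing here depends on the specific attributes Captus writes.

```python
from collections import Counter


def count_features_per_contig(lines):
    """Count GFF feature lines per contig (first column of each record)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and GFF comments/directives.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A valid GFF record has nine tab-separated columns.
        if len(fields) < 9:
            continue
        counts[fields[0]] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for [SAMPLE_NAME]_hit_contigs.gff can be passed to it directly.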


22. [SAMPLE_NAME]_recovery_stats.tsv

Unified tab-separated-values table with marker recovery statistics from ALL the marker types found in the sample. These are concatenated across samples and summarized in the final Marker Recovery report.

For more information on the table see 26. captus-assembly_extract.stats.tsv


23. leftover_contigs.fasta.gz

This file contains the subset of the contigs assembled by MEGAHIT that had no hit to the reference markers. The file is compressed to save space. These are the contigs that are used for clustering across samples in order to discover additional homologous markers.
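
Because the file is gzip-compressed, it has to be decompressed on the fly when inspected. As a minimal sketch, the Python function below counts how many leftover contigs a sample has by counting FASTA header lines; the function name `count_fasta_records` is hypothetical and uses only the standard library.

```python
import gzip


def count_fasta_records(path):
    """Count the sequences in a gzip-compressed FASTA file."""
    # "rt" opens the compressed file in text mode.
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```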


24. leftover_contigs_after_clustering.fasta.gz

This file contains the subset of the contigs assembled by MEGAHIT that had no hit to the reference markers or even to the newly discovered markers derived from clustering. The file is compressed to save space.


25. captus-assembly_extract.refs.json

This file stores the paths to all the references used for extraction. This file is necessary so the alignment step can correctly add the references to the final alignments to be used as guides.


26. captus-assembly_extract.stats.tsv

Unified tab-separated-values table with marker recovery statistics from ALL the markers found in ALL the samples. This table is used to create the final Marker Recovery report. Even though the report is quite useful for visualization, you might need to do more complex statistical analyses, and this table is the most appropriate output file for them.
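
As a starting point for such analyses, the Python sketch below loads the table with the standard library and tallies rows by the values of a column you choose. It assumes the first row of the file holds the column names; no particular column name is assumed, so inspect the returned field names first. The function names `read_stats` and `count_by` are hypothetical.

```python
import csv
from collections import Counter


def read_stats(path):
    """Return (column_names, rows) from a tab-separated table with a header row."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = list(reader)
        return reader.fieldnames, rows


def count_by(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)
```

For heavier statistics the same file can be loaded into a data-frame library, since it is a plain tab-separated table.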

Information included in the table

27. captus-assembly_extract.report.html

This is the final Marker Recovery report, summarizing marker extraction statistics across all samples and marker types.


28. captus-assembly_extract.log

This is the log from Captus, it contains the command used and all the information shown during the run. If the option --show_less was enabled, the log will also contain all the extra detailed information that was hidden during the run.


29. clust_id##.##_cov##.##_captus_clusters_refs.fasta

This FASTA file contains the cluster representatives that will be searched and extracted across samples (prefix CLR); the loci are named captus_#. These represent newly discovered homologous markers in contigs that had no hits to other reference proteins or miscellaneous DNA markers.
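
The two numeric fields in the file name can be recovered programmatically, for example to record them alongside downstream results. The Python sketch below parses them with a regular expression built from the name pattern shown in the heading. We assume here that the id and cov fields encode the identity and coverage settings used for clustering, which the text above does not state explicitly; the function name `parse_cluster_refs_name` is hypothetical.

```python
import re
from pathlib import Path

# Pattern derived from clust_id##.##_cov##.##_captus_clusters_refs.fasta
NAME_PATTERN = re.compile(
    r"^clust_id(\d+(?:\.\d+)?)_cov(\d+(?:\.\d+)?)_captus_clusters_refs\.fasta$"
)


def parse_cluster_refs_name(path):
    """Return (id_value, cov_value) as floats, or None if the name does not match."""
    match = NAME_PATTERN.match(Path(path).name)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))
```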



FASTA headers explanation

Info

All the FASTA files produced by Captus containing extracted markers follow the same header style:

FASTA header

Paralogs are ranked according to their wscore which, in turn, is calculated from the percent identity (ident) and the percent coverage (cover) with respect to the selected reference sequence (query). The best hit is always ranked 00 and secondary hits start at 01. When a single hit is found for a marker (like in locus 5859 in the image), the ranking 00 is not included in the sequence name. Only when multiple hits are found for a marker (like in locus 5865 in the image) is the ranking (00, 01, and so on) included in the sequence name to make each name unique. The description field frameshifts is only present when the output sequence has corrected frameshifts; the numbers indicate their positions in the output sequence.

As you can see, Sample name, Locus name, and Paralog ranking are separated by double underscores (__). This is the reason why we don’t recommend using __ inside your sample names (see sample naming convention).
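
The double-underscore convention makes the sequence name easy to split in a script. The Python sketch below extracts the sample name, locus name, and optional paralog ranking from a header line. It assumes the sequence name is everything before the first whitespace and that neither the sample nor the locus name contains __; the function name `parse_seq_name` is hypothetical, and the description fields are deliberately left unparsed.

```python
def parse_seq_name(header):
    """Return (sample, locus, rank) from a FASTA header; rank is None for single hits."""
    fields = header.lstrip(">").split(None, 1)
    if not fields:
        raise ValueError("empty FASTA header")
    # The sequence name is the part before the first whitespace.
    parts = fields[0].split("__")
    if len(parts) == 2:
        sample, locus = parts
        return sample, locus, None
    if len(parts) == 3:
        sample, locus, rank = parts
        return sample, locus, rank
    raise ValueError(f"unexpected sequence name: {fields[0]!r}")
```

With a hypothetical sample called SampleA, the name `SampleA__5865__01` would yield the rank `"01"`, while `SampleA__5859` would yield `None` because a single hit carries no ranking. A sample name containing __ would make the split ambiguous, which is why the function raises an error in that case.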


Created by Edgardo M. Ortiz (06.08.2021)
Last modified by Edgardo M. Ortiz (03.06.2022)