Output Files

For this example we will use the directory 02_assemblies previously created with the assemble module. We run the following Captus command to search for and extract our reference markers from the assemblies:

captus_assembly extract -a 02_assemblies \
-n Angiosperms353 \
-p SeedPlantsPTD \
-m SeedPlantsMIT \
-d noncoding_DNA.fasta \
-c \
--max_loci_files 500
Warning

Notice the addition of the option --max_loci_files 500. It is used here only to show the output directories that contain a separate FASTA file per marker (3’, 4’, 5’, 6’, 13’, and 14’ in the image below), which are not created by default. We don’t recommend using this option in your runs, since it will unnecessarily create large numbers of small FASTA files that would have to be concatenated again during the alignment step.
You can read more about the option here: --max_loci_files

After the run is finished we should see a new directory called 03_extractions with the following structure and files:

[Image: Extractions]

1. [SAMPLE_NAME]__captus-ext

A subdirectory ending in __captus-ext is created to contain the extracted markers of each sample separately (S1, S2, S3, and S4 in the image).
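
As a minimal sketch of how these per-sample directories could be located from a script, the following function maps each sample name to its directory. It relies only on the __captus-ext suffix described above; the function name list_sample_dirs is hypothetical and not part of Captus.

```python
from pathlib import Path

SUFFIX = "__captus-ext"  # suffix of every per-sample subdirectory


def list_sample_dirs(extractions_dir):
    """Map each sample name to its [SAMPLE_NAME]__captus-ext directory."""
    samples = {}
    for path in sorted(Path(extractions_dir).glob("*" + SUFFIX)):
        if path.is_dir():
            # Strip the suffix to recover the sample name.
            samples[path.name[: -len(SUFFIX)]] = path
    return samples
```

Calling list_sample_dirs("03_extractions") after the run above would be expected to return one entry per sample.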


2. 01_coding_NUC, 02_coding_PTD, 03_coding_MIT

These directories contain the extracted coding markers from the NUClear, PlasTiDial, and MITochondrial genomes respectively.
The markers are presented in four formats: protein sequence (coding_AA), coding sequence in nucleotides (coding_NT), exons and introns concatenated (genes), and the concatenation of exons and introns flanked by a fixed length of sequence (genes_flanked):

[Image: Protein extraction formats]


3. [MARKER_TYPE]_coding_AA.faa, 01_AA

Coding sequence in amino acids. Prefixes can be NUC, PTD, or MIT. For details on sequence headers see FASTA headers explanation.


4. [MARKER_TYPE]_coding_NT.fna, 02_NT

Coding sequence in nucleotides, a.k.a. CDS. Prefixes can be NUC, PTD, or MIT. For details on sequence headers see FASTA headers explanation.


5. [MARKER_TYPE]_genes.fna, 03_genes

Gene sequence (exons in capital letters + introns in lowercase letters) in nucleotides. A contig connector of 50 n characters is included when the protein match spans more than a single contig. Prefixes can be NUC, PTD, or MIT. For details on sequence headers see FASTA headers explanation.


6. [MARKER_TYPE]_genes_flanked.fna, 04_genes_flanked

Gene sequence (exons in capital letters + introns in lowercase letters) plus additional flanking sequence in lowercase nucleotides. A contig connector of 50 n characters is included when the protein match spans more than a single contig. Prefixes can be NUC, PTD, or MIT. For details on sequence headers see FASTA headers explanation.
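
Because exons, introns, flanks, and contig connectors are distinguished only by letter case, a gene sequence can be taken apart with a regular expression. The sketch below assumes nothing beyond the case convention and the 50 n connector described above; the function names are hypothetical. Note that a lowercase run can be an intron, a flank, or a connector, and that the concatenated uppercase segments are not guaranteed to be identical to the sequence in the coding_NT file.

```python
import re

CONNECTOR = "n" * 50  # contig connector described above


def split_by_case(seq):
    """Return (kind, segment) pairs, where kind is 'upper' or 'lower'."""
    return [
        ("upper" if m.group().isupper() else "lower", m.group())
        for m in re.finditer(r"[A-Z]+|[a-z]+", seq)
    ]


def coding_part(seq):
    """Concatenate only the uppercase (exon) segments."""
    return "".join(seg for kind, seg in split_by_case(seq) if kind == "upper")


def spans_multiple_contigs(seq):
    """True if the sequence contains at least one run of 50 n characters."""
    return CONNECTOR in seq
```

An intron that happens to contain 50 or more consecutive n characters would also be reported by spans_multiple_contigs, so the hit_contigs column of the statistics table is the more reliable source for that information.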


7. [MARKER_TYPE]_contigs_list.txt

List of contig names that had protein hits. Prefixes can be NUC, PTD, or MIT.

NODE_138_length_18304_cov_18.0000_k_175_flag_1
NODE_635_length_5337_cov_17.0000_k_175_flag_1
NODE_46_length_16959_cov_19.0000_k_175_flag_1
NODE_321_length_4728_cov_10.0000_k_175_flag_1
NODE_760_length_19021_cov_17.9856_k_175_flag_0
NODE_621_length_11511_cov_11.9845_k_175_flag_0
NODE_965_length_6331_cov_15.0000_k_175_flag_1
NODE_948_length_26295_cov_58.0000_k_175_flag_0
NODE_1726_length_1438_cov_9.0000_k_175_flag_1
NODE_210_length_2896_cov_9.9471_k_175_flag_0
NODE_677_length_10733_cov_14.0000_k_175_flag_1
NODE_996_length_2375_cov_11.0000_k_175_flag_1
NODE_1837_length_366_cov_3.0000_k_175_flag_1
NODE_1647_length_529_cov_7.0000_k_175_flag_1
NODE_792_length_4384_cov_17.0000_k_175_flag_1
NODE_1378_length_5491_cov_21.0000_k_175_flag_1
NODE_602_length_14961_cov_47.0000_k_175_flag_1
NODE_1293_length_649_cov_3.0000_k_175_flag_1
NODE_949_length_2240_cov_50.0000_k_175_flag_1
NODE_751_length_2777_cov_34.7909_k_175_flag_0
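
The contig names carry several values that can be recovered with a regular expression. The following sketch is based only on the naming pattern visible in the list above; reading the fields as contig length, coverage, k-mer size, and a flag is an assumption drawn from their labels, and the names ContigName and parse_contig_name are hypothetical.

```python
import re
from typing import NamedTuple


class ContigName(NamedTuple):
    node: int
    length: int
    cov: float
    k: int
    flag: int


_PATTERN = re.compile(
    r"^NODE_(\d+)_length_(\d+)_cov_(\d+(?:\.\d+)?)_k_(\d+)_flag_(\d+)$"
)


def parse_contig_name(name):
    """Parse one line of a contigs list; return None if it does not match."""
    match = _PATTERN.match(name.strip())
    if match is None:
        return None
    node, length, cov, k, flag = match.groups()
    return ContigName(int(node), int(length), float(cov), int(k), int(flag))
```

For example, parse_contig_name("NODE_138_length_18304_cov_18.0000_k_175_flag_1").length gives 18304.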

8. [MARKER_TYPE]_contigs.gff

Annotation track in GFF format for protein hits to contigs in assembly. Prefixes can be NUC, PTD, or MIT.

See 19. 06_assembly_annotated


9. [MARKER_TYPE]_recovery_stats.tsv

Tab-separated-values table with marker recovery statistics. These are concatenated across marker types and samples and summarized in the final Marker Recovery report. Prefixes can be NUC, PTD, or MIT.

For more information on the table see 26. captus-assembly_extract.stats.tsv


10. [MARKER_TYPE]_scipio_final.log

Log of Scipio’s second run, in which the best references have already been selected (when using multi-sequence per locus references) and only the contigs that had hits during Scipio’s initial run are used. Prefixes can be NUC, PTD, or MIT.


11. 00_initial_scipio_[MARKER_TYPE]

Directory for Scipio’s initial run results. The directory contains the set of filtered protein references [MARKER_TYPE]_best_proteins.faa (when using multi-sequence per locus references) and the log of Scipio’s initial run [MARKER_TYPE]_scipio_initial.log. Suffixes can be NUC, PTD, or MIT.


12. 04_misc_DNA, 05_clusters

These directories contain the extracted miscellaneous DNA markers, either from a DNA custom set of references or from the CLusteRing resulting from using the option --cluster_leftovers.
The markers are presented in two formats: matching DNA segments (matches), and the matched segments including flanks and other intervening segments not present in the reference (matches_flanked).

[Image: Miscellaneous DNA extraction formats]


13. [MARKER_TYPE]_matches.fna, 01_matches

Matches per miscellaneous DNA marker in nucleotides. Prefixes can be DNA or CLR. For details on sequence headers see FASTA headers explanation.


14. [MARKER_TYPE]_matches_flanked.fna, 02_matches_flanked

Matches plus additional flanking sequence per miscellaneous DNA marker in nucleotides. Prefixes can be DNA or CLR. For details on sequence headers see FASTA headers explanation.


15. [MARKER_TYPE]_contigs_list.txt

List of contig names that had miscellaneous DNA marker hits. Prefixes can be DNA or CLR.

NODE_1858_length_3636_cov_14.0000_k_175_flag_1
NODE_2876_length_3179_cov_25.0000_k_175_flag_1
NODE_502_length_5771_cov_37.0000_k_175_flag_1
NODE_347_length_475_cov_2.0000_k_175_flag_1
NODE_1393_length_297_cov_4.0000_k_175_flag_1
NODE_3093_length_960_cov_17.0000_k_175_flag_1
NODE_3265_length_1041_cov_18.0000_k_175_flag_1

16. [MARKER_TYPE]_contigs.gff

Annotation track in GFF format for miscellaneous DNA marker hits to contigs in assembly. Prefixes can be DNA or CLR.

See 19. 06_assembly_annotated


17. [MARKER_TYPE]_recovery_stats.tsv

Tab-separated-values table with marker recovery statistics. These are concatenated across marker types and samples and summarized in the final Marker Recovery report. Prefixes can be DNA or CLR.

For more information on the table see 26. captus-assembly_extract.stats.tsv


18. [MARKER_TYPE]_blat_search.log

Log of BLAT’s run. Prefixes can be DNA or CLR.


19. 06_assembly_annotated

The main outputs of this directory are a FASTA file containing all the contigs that had hits to the reference markers, called [SAMPLE_NAME]_hit_contigs.fasta, as well as an annotation track for those markers called [SAMPLE_NAME]_hit_contigs.gff. You can visualize the annotations in Geneious, for example, by importing the FASTA file and then dropping the GFF file on top:

[Image: Assembly annotated]


20. [SAMPLE_NAME]_hit_contigs.fasta

This file contains the subset of the contigs assembled by MEGAHIT that had hits to the reference markers. See the red rectangles in 19. 06_assembly_annotated.


21. [SAMPLE_NAME]_hit_contigs.gff

Unified annotation track in GFF format for ALL the marker types found in the assembly’s contigs. See the red rectangles in 19. 06_assembly_annotated.
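
If a graphical viewer is not at hand, the annotation track can also be inspected with a few lines of code. The sketch below assumes only the standard GFF layout of nine tab-separated columns and comment lines starting with #; it makes no assumption about which values Captus writes in the source, type, or attributes columns. The function names are hypothetical.

```python
from collections import Counter

GFF_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff_line(line):
    """Turn one GFF feature line into a dict keyed by column name."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(GFF_COLUMNS):
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = dict(zip(GFF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def features_per_contig(lines):
    """Count annotated features for each contig (seqid column)."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        counts[parse_gff_line(line)["seqid"]] += 1
    return counts
```

Passing an open file handle of [SAMPLE_NAME]_hit_contigs.gff to features_per_contig would give the number of annotations on each contig.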


22. [SAMPLE_NAME]_recovery_stats.tsv

Unified tab-separated-values table with marker recovery statistics from ALL the marker types found in the sample. These are concatenated across samples and summarized in the final Marker Recovery report.

For more information on the table see 26. captus-assembly_extract.stats.tsv


23. leftover_contigs.fasta.gz

This file contains the subset of the contigs assembled by MEGAHIT that had no hit to the reference markers. The file is compressed to save space. These are the contigs that are used for clustering across samples in order to discover additional homologous markers.


24. leftover_contigs_after_custering.fasta.gz

This file contains the subset of the contigs assembled by MEGAHIT that had no hit to the reference markers or even to the newly discovered markers derived from clustering. The file is compressed to save space.
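
Since both leftover files are gzip-compressed FASTA, they can be read without decompressing them on disk. The following is a minimal sketch using only the Python standard library; the function names are hypothetical.

```python
import gzip


def read_fasta_gz(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_fasta_gz(path):
    """Return (number_of_contigs, total_bases) for the file."""
    n_contigs = total = 0
    for _, seq in read_fasta_gz(path):
        n_contigs += 1
        total += len(seq)
    return n_contigs, total
```

Comparing the result for leftover_contigs.fasta.gz with the result for the file produced after clustering shows how many contigs were absorbed by the newly discovered markers.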


25. captus-assembly_extract.refs.json

This file stores the paths to all the references used for extraction. This file is necessary so the alignment step can correctly add the references to the final alignments to be used as guides.

{
    "NUC": {
        "AA_path": "/Users/emortiz/software/GitHub/Captus/data/Angiosperms353.FAA",
        "AA_msg": "Angiosperms353 /Users/emortiz/software/GitHub/Captus/data/Angiosperms353.FAA",
        "NT_path": "/Users/emortiz/software/GitHub/Captus/data/Angiosperms353.FNA",
        "NT_msg": "Angiosperms353 /Users/emortiz/software/GitHub/Captus/data/Angiosperms353.FNA"
    },
    "PTD": {
        "AA_path": "/Users/emortiz/software/GitHub/Captus/data/SeedPlantsPTD.FAA",
        "AA_msg": "SeedPlantsPTD /Users/emortiz/software/GitHub/Captus/data/SeedPlantsPTD.FAA",
        "NT_path": "/Users/emortiz/software/GitHub/Captus/data/SeedPlantsPTD.FNA",
        "NT_msg": "SeedPlantsPTD /Users/emortiz/software/GitHub/Captus/data/SeedPlantsPTD.FNA"
    },
    "MIT": {
        "AA_path": "/Users/emortiz/software/GitHub/Captus/data/SeedPlantsMIT.FAA",
        "AA_msg": "SeedPlantsMIT /Users/emortiz/software/GitHub/Captus/data/SeedPlantsMIT.FAA",
        "NT_path": "/Users/emortiz/software/GitHub/Captus/data/SeedPlantsMIT.FNA",
        "NT_msg": "SeedPlantsMIT /Users/emortiz/software/GitHub/Captus/data/SeedPlantsMIT.FNA"
    },
    "DNA": {
        "AA_path": null,
        "AA_msg": null,
        "NT_path": "/Volumes/Shuttle500G/for_docs_output/nrDNA.fasta",
        "NT_msg": "/Volumes/Shuttle500G/for_docs_output/nrDNA.fasta"
    },
    "CLR": {
        "AA_path": null,
        "AA_msg": null,
        "NT_path": "/Volumes/Shuttle500G/for_docs_output/03_extractions/01_clustering_data/clust_id79.20_cov80.00_captus_clusters_refs.fasta",
        "NT_msg": "/Volumes/Shuttle500G/for_docs_output/03_extractions/01_clustering_data/clust_id79.20_cov80.00_captus_clusters_refs.fasta"
    }
}
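
To check which references a run used, the file can be loaded with any JSON parser. The sketch below uses only the keys visible in the example above (AA_path and NT_path per marker type); the embedded JSON is a shortened copy with placeholder paths, and the function name is hypothetical.

```python
import json

# Shortened copy of the structure shown above; the paths are placeholders.
EXAMPLE = """
{
    "NUC": {
        "AA_path": "/path/to/Angiosperms353.FAA",
        "AA_msg": "Angiosperms353 /path/to/Angiosperms353.FAA",
        "NT_path": "/path/to/Angiosperms353.FNA",
        "NT_msg": "Angiosperms353 /path/to/Angiosperms353.FNA"
    },
    "DNA": {
        "AA_path": null,
        "AA_msg": null,
        "NT_path": "/path/to/nrDNA.fasta",
        "NT_msg": "/path/to/nrDNA.fasta"
    }
}
"""


def available_references(refs):
    """Map each marker type to its non-null reference paths (AA and/or NT)."""
    result = {}
    for marker_type, entry in refs.items():
        paths = {}
        for kind in ("AA", "NT"):
            path = entry.get(kind + "_path")
            if path:  # JSON null becomes None and is skipped
                paths[kind] = path
        result[marker_type] = paths
    return result


refs = available_references(json.loads(EXAMPLE))
```

Here refs["DNA"] contains only an NT entry, reflecting that miscellaneous DNA references have no amino acid version.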

26. captus-assembly_extract.stats.tsv

Unified tab-separated-values table with marker recovery statistics from ALL the markers found in ALL the samples. This table is used to create the final Marker Recovery report. Even though the report is quite useful for visualization, you might need to perform more complex statistical analyses, and this table is the most appropriate output file for them.
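
As a starting point for such analyses, the table can be read with Python's csv module. This sketch assumes that the first line of the file is a header row with the column names listed in the table that follows; it uses only the columns sample_name, marker_type, hit, and pct_recovered, and the function name is hypothetical.

```python
import csv
from collections import defaultdict


def best_hit_recovery(handle):
    """Summarize best hits (hit == 00) per (sample_name, marker_type).

    Returns {(sample, marker_type): (number_of_loci, mean_pct_recovered)}.
    """
    values = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if int(row["hit"]) == 0:  # keep only the best hit of each locus
            key = (row["sample_name"], row["marker_type"])
            values[key].append(float(row["pct_recovered"]))
    return {key: (len(v), sum(v) / len(v)) for key, v in values.items()}
```

The function takes an open text handle, for example one obtained by opening 03_extractions/captus-assembly_extract.stats.tsv with newline="" as the csv module recommends.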

Column Description
sample_name Name of the sample.
marker_type Type of marker. Possible values are NUC, PTD, MIT, DNA, or CLR.
locus Name of the locus.
ref_name Name of the reference selected for the locus. Relevant when the reference contains multiple sequences per locus like in Angiosperms353 for example.
ref_coords Match coordinates with respect to the reference. Each segment is expressed as [start]-[end], segments within the same contig are separated by ,, and segments in different contigs are separated by ;. For example: 1-47;48-354,355-449 indicates that a contig contained a segment matching reference coordinates 1-47 and a different contig matched two segments, 48-354 and 355-449 respectively.
ref_type Whether the reference is an amino acid (prot) or nucleotide (nucl) sequence.
ref_len_matched Number of residues matched in the reference.
hit Paralog ranking, 00 is assigned to the best hit, secondary hits start at 01.
pct_recovered Percentage of the total length of the reference sequence that was matched.
pct_identity Percentage of sequence identity between the hit and the reference sequence.
score Inspired by Scipio’s score: (matches - mismatches) / reference length.
wscore Weighted score. When the reference contains multiple sequences per locus, the best-matching reference is decided after normalizing their recovered length across references in the locus and multiplying that value by their respective score, thus producing the wscore. Finally wscore is also penalized by the number of frameshifts (if the marker is coding) and number of contigs used in the assembly of the hit.
hit_len Number of residues matched in the sample’s contig(s) plus the length of the flanking sequence.
cds_len If ref_type is prot this number represents the number of residues corresponding to coding sequence (i.e. exons). If the ref_type is nucl this field shows NA.
intron_len If ref_type is prot this number represents the number of residues corresponding to intervening non-coding sequence segments (i.e. introns). If the ref_type is nucl this field shows NA.
flanks_len Number of residues included in the flanking sequence.
frameshifts Positions of the corrected frameshifts in the output sequence. If the ref_type is nucl this field shows NA.
hit_contigs Number of contigs used to assemble the hit.
hit_l50 Least number of contigs in the hit that contain 50% of the recovered length.
hit_l90 Least number of contigs in the hit that contain 90% of the recovered length.
hit_lg50 Least number of contigs in the hit that contain 50% of the reference locus length.
hit_lg90 Least number of contigs in the hit that contain 90% of the reference locus length.
ctg_names Name of the contigs used in the reconstruction of the hit. Example: NODE_6256_length_619_cov_3.0000_k_169_flag_1;NODE_3991_length_1778_cov_19.0000_k_169_flag_1, for a hit where two contigs were used.
ctg_strands Contig strands (+ or -) provided in the same order as ctg_names. Example: +;- indicates that the contig NODE_6256_length_619_cov_3.0000_k_169_flag_1 was matched in the positive strand while the contig NODE_3991_length_1778_cov_19.0000_k_169_flag_1 was matched in the negative strand.
ctg_coords Match coordinates with respect to the contigs in the sample’s assembly. Each segment is expressed as [start]-[end], segments within the same contig are separated by ,, and segments in different contigs are separated by ; which are provided in the same order as ctg_names and ctg_strands. Example: 303-452;694-1626,301-597 indicates that a single segment was matched in contig NODE_6256_length_619_cov_3.0000_k_169_flag_1 in the + strand with coordinates 303-452, while two segments were matched in contig NODE_3991_length_1778_cov_19.0000_k_169_flag_1 in the - strand with coordinates 694-1626 and 301-597 respectively.
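
The coordinate notation shared by ref_coords and ctg_coords (; between contigs, , between segments, - between start and end) can be decoded as sketched below, and then paired with ctg_names and ctg_strands, which follow the same contig order. The function names are hypothetical.

```python
def parse_coords(coords):
    """Decode '1-47;48-354,355-449' into one list of (start, end) per contig."""
    contigs = []
    for contig_part in coords.split(";"):
        segments = []
        for segment in contig_part.split(","):
            start, end = segment.split("-")
            segments.append((int(start), int(end)))
        contigs.append(segments)
    return contigs


def hit_layout(ctg_names, ctg_strands, ctg_coords):
    """Pair each contig name with its strand and matched segments."""
    names = ctg_names.split(";")
    strands = ctg_strands.split(";")
    coords = parse_coords(ctg_coords)
    if not (len(names) == len(strands) == len(coords)):
        raise ValueError("ctg_names, ctg_strands and ctg_coords disagree")
    return list(zip(names, strands, coords))
```

With the example values from the table, hit_layout returns one entry for NODE_6256_length_619_cov_3.0000_k_169_flag_1 with a single segment on the + strand, and one entry for NODE_3991_length_1778_cov_19.0000_k_169_flag_1 with two segments on the - strand.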

27. captus-assembly_extract.report.html

This is the final Marker Recovery report, summarizing marker extraction statistics across all samples and marker types.


28. captus-assembly_extract.log

This is the log from Captus, it contains the command used and all the information shown during the run. If the option --show_less was enabled, the log will also contain all the extra detailed information that was hidden during the run.


29. clust_id##.##_cov##.##_captus_clusters_refs.fasta

This FASTA file contains the cluster representatives that will be searched and extracted across samples (prefix CLR); the loci are named captus_#. These represent newly discovered homologous markers in contigs that had no hits to other reference proteins or miscellaneous DNA markers.

>GenusA_speciesA_CAP-captus_1
GACTTGAGCCCCAAAACTAGGTTGGGTGCAGGGGGTCGATCTTGATTTTATTACTCAGGGTGCTTCAGATCAGGTTCTTGCAGCTGAACATGCTTCGGGACATCGACCCTATGGTCAGAATCTTCAATCTGGAAGCTCTGCTGGTGCATCAAGCCAACAAGACATGTCCAAGATCATATGCCAGTAAGAAAACAAGGTATGTAAATACCTCTGCGGGTGGTGCTTTACCTGGAACAGCGTCTGAATCGACTGGCGATATCCCTGCTACTGATGATACCCCCCTGATGGATGCTGCAGGGTAAACAGGTTAATATAGATCTGCTATCTCGAGCCCTGTAACCTGGCTATGCTCTCTTTGCGTCTGCTACTAGATATGCTTCTGGCTTTGCCATTTCTATGAATTCTTATAGAATTTTGATTCAGTCACTATGCAAATTTTATTTGTTATGTTTATAAGGTCTCTTGATTGCGCTCGGGCTCCTGTCTCGGGGGAGCCCTGCGCTCCCTACGCTTACTAGAATATTCATTCTCCCCTATTAAAATAAAAATTTATCGAGATAAAAAAAGAA
>GenusA_speciesA_CAP-captus_2
ACCAGATTCCTCCATTGTACAGACAACCATAGGGTCCGACTTTGCAAGACGTTTAAGGCCTTCCACAAGCTTGGGAAGGTCAAATGCCACCTTGCACTGGACAGCCACACGTACCACAGGAGATACAGAAAATTTCATGGCTCGAATTGGATGGGCATCAACTTCCTTCTCGTTTGTTAGAGTGGCATTCTTGGTGATAAATTGGTCCAACCCAACCATGGCAACAGTATTACCACAAGGCACATCCTCCACCGTCTCTTGCCTCTTCCCCATCCAAATAACAGTTCTCTGGACACTCTTCACATACAGGTCTTTCTTCTCTCCAGGAACATAGTTTGGACCCATGATTCTAACTTTCAAACCGGTTGAAACTTTCCCAGAAAAGACTCGACCAAAGGCAAAGAACCTGCCCTTGTCTGATGCAGGAATCATCTTTGAGACATAGAGCATAAGAGGTCCCTCAGGGTCACAATTTCTAATGGCACTAGCATATGCGTCATCTAGTGGACCCTCATACAAATTCTCAACACGAT
>GenusB_speciesB_CAP-captus_3
TCCATACATATCAGAATCAGCATCCTTTTTAGGTCTGTATAGGGTTGAAAGAGTGGGCTGAGCAGTAAATAAACCCTTGTCGTATATGTTGTACTGGTCGTCGTTGGCAAATCCAGAGTCCATTCCTTTATCTTGGTTAAATAGCCTCTGGTCATACATAACCTCTCCCTCTCTACCTGCTCCTGTAGAAGCCATGCCAAGTGCAACCTTTTCACTGATATCACGATCTCTGTCTCTTGTGATCTTACTCTTCTTTCCCATTGCAGCATCTTTAGCTTCCAACCTTCTTTCCCTTTCCCTCTCTCGACGTCGTTCTTCACGAATCTTCTCCCTTTGCAACCGCTCTTCCCTTTCCTCCCTTGTCTCCTTTGGTAAATCCTTCTCTTTTTCTCTTACACGCTCAAATTCTCCTTTCTTCTCAGAAGTATCTACTGTGTTTCTATCAGATGGATAAAGAACTGAAGAAGGCGGTGCAGCTCCTGTTCTCTCAGACCGGGCTTTCTGCGCTAATGCTCGGAGCTCCAGCTCCTTCTTCTCCTTCTGCTTCATAAGCATTTCTTTCTGAACCTTGGATCTCATTGCAACTGCTTCTCTTGCTTTCTGTTCTGCAACATACAGAGCTTCAGATAGCTTTGCAAAGTTATCGTTAATTTGAACTTCTTGAAGGCCTCTCCCATCAGCTGCAAGGCGCTTGTCAAGTGGGATCGTATAACCTTTTGGATTCTTCCAATTTGAAATACAAGGAGGAATCTTCCAATCCTGCTGATCCTTCACAGTGACAGGACGGGGAGGGGAATGCATAACAGGCACAGGTGGAGACCCCGAAGCTCTTGGAACACGTTTATGCTTGAACTTTGGAGGCTCAAGTGGATCAACTGGCATCTCCACCATTCTAATTATTCTCTCCTTGGCGCCTGAATTAAATGCAGCTGATTGTTGAGATGGCTTATACTTGATAAA
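
The headers in this file can be split into their two parts with a short function. This sketch assumes, based only on the three example headers above, that the text before the last hyphen identifies where the representative came from and that the text after it is the captus_# locus name; the function name is hypothetical.

```python
from collections import Counter


def split_cluster_ref_name(header):
    """Split 'GenusA_speciesA_CAP-captus_1' at its last hyphen."""
    name = header.lstrip(">").split()[0]
    source, sep, locus = name.rpartition("-")
    if not sep:
        raise ValueError(f"no hyphen found in {name!r}")
    return source, locus


headers = [
    ">GenusA_speciesA_CAP-captus_1",
    ">GenusA_speciesA_CAP-captus_2",
    ">GenusB_speciesB_CAP-captus_3",
]
# Number of cluster representatives per source name.
per_source = Counter(split_cluster_ref_name(h)[0] for h in headers)
```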


FASTA headers explanation

Info

All the FASTA files produced by Captus containing extracted markers follow the same header style:

[Image: FASTA header]

Paralogs are ranked according to their wscore which, in turn, is calculated from the percent identity (ident) as well as the percent coverage (cover) with respect to the selected reference sequence (query). The best hit is always ranked 00 and secondary hits start at 01. When a single hit is found for a marker (like in locus 5859 in the image), the ranking 00 is not included in the sequence name. Only when multiple hits are found for a marker (like in locus 5865 in the image) is the ranking 00 or 01 included in the sequence name to make the names unique. The description field frameshifts is only present when the output sequence has corrected frameshifts, and the numbers indicate their positions in the output sequence.
As you can see, Sample name, Locus name, and Paralog ranking are separated by double underscores (__). This is the reason why we don’t recommend using __ inside your sample names (see sample naming convention).
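
The double-underscore convention makes the sequence names easy to take apart, and also shows why a __ inside a sample name is a problem. The sketch below parses only the sequence name (the text before the first space) and leaves the description fields untouched; the function name and the example sample name are hypothetical.

```python
def parse_sequence_name(header):
    """Return (sample, locus, paralog) from a FASTA header line.

    paralog is None when the ranking is omitted (single hit for the marker).
    """
    name = header.lstrip(">").split()[0]
    parts = name.split("__")
    if len(parts) == 2:
        sample, locus = parts
        return sample, locus, None
    if len(parts) == 3:
        sample, locus, paralog = parts
        return sample, locus, paralog
    # More than two separators: the sample name itself probably contains "__".
    raise ValueError(f"cannot split {name!r} unambiguously")


single = parse_sequence_name(">GenusA_speciesA_CAP__5859")
ranked = parse_sequence_name(">GenusA_speciesA_CAP__5865__01")
```

Here single has no paralog ranking while ranked carries the ranking 01. A sample named with an internal __ would produce four parts and raise the error, which is the ambiguity the naming recommendation avoids.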


Created by Edgardo M. Ortiz (06.08.2021)
Last modified by Edgardo M. Ortiz (03.06.2022)