Clean

Unless you are starting with your own assemblies, cleaning will be the first step in the analysis. Captus makes it easy to process many samples in a consistent manner, automatically, and providing a comprehensive Quality Control HTML report.

Captus allows you the flexibility to provide reads cleaned elsewhere. however, we recommend our cleaning method for its accuracy in removing adaptors, which in turn improves the chances of getting a higher quality assembly.

Note

In case you still want to start with previously cleaned reads, you can jump ahead to the assemble command page.

Further reading: Why did we choose BBTools for data cleaning?

We needed something FAST, but most importantly ACCURATE and RELIABLE. Years ago, Brian Bushnell, the author of BBTools, posted a comparison between his own bbduk.sh, and the popular Cutadapt and Trimmomatic. You can read the experiment setup and his results here.

Since then, there have been new versions of these programs, and a new one, fastp, was published. Therefore, we updated the experiment including fastp, increasing the minimum length to 21 bp, and processing paired-end reads which is closer to the real use of these tools.

Another popular software, Trim Galore, was not considered because it is only a wrapper around Cutadapt, additionally, it displays some unusual behavior when used in Macs.

First, we created a conda environment for the four selected programs:

conda create -n rem_adaptors -c bioconda -c conda-forge bbmap cutadapt fastp trimmomatic
conda activate rem_adaptors

Then, we grabbed a million reads (PE 2x150 bp) from a plant genome project, reformat.sh is also part of BBTools:

reformat.sh in=SB12_R#.fq.gz out=raw_R#.fq.gz reads=1000000

These are the same “rotated” (A->T, C->A, G->C, T->G) adaptors that were included in the file gruseq.fa that Bushnell made for his test:

>Gruseq_Adapter_Index_1_6
CTGACCTTCTCATATACGAGCTTAGAATCGATATGATACTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_2
CTGACCTTCTCATATACGAGCTTAGAATCGATAACTGCGTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_3
CTGACCTTCTCATATACGAGCTTAGAATCGATAGGTCCATGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_4
CTGACCTTCTCATATACGAGCTTAGAATCGATAGCTAATTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_5
CTGACCTTCTCATATACGAGCTTAGAATCGATATATCGCTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_6
CTGACCTTCTCATATACGAGCTTAGAATCGATACAATTGTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_7
CTGACCTTCTCATATACGAGCTTAGAATCGATAATCTGATGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_8
CTGACCTTCTCATATACGAGCTTAGAATCGATATAGGCTTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_9
CTGACCTTCTCATATACGAGCTTAGAATCGATACTGATCTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_10
CTGACCTTCTCATATACGAGCTTAGAATCGATAGTCAGGTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_11
CTGACCTTCTCATATACGAGCTTAGAATCGATACCAGTATGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_12
CTGACCTTCTCATATACGAGCTTAGAATCGATAAGGCGTTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_13
CTGACCTTCTCATATACGAGCTTAGAATCGATATCGATTATTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_14
CTGACCTTCTCATATACGAGCTTAGAATCGATATCGGAACGTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_15
CTGACCTTCTCATATACGAGCTTAGAATCGATATGCGATCTTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_16
CTGACCTTCTCATATACGAGCTTAGAATCGATAAACGAAACTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_18_7
CTGACCTTCTCATATACGAGCTTAGAATCGATACGAACATATGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_19
CTGACCTTCTCATATACGAGCTTAGAATCGATACGCTTTACTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_20
CTGACCTTCTCATATACGAGCTTAGAATCGATACGCCAAGGTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_21
CTGACCTTCTCATATACGAGCTTAGAATCGATACGGGACCTTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_22
CTGACCTTCTCATATACGAGCTTAGAATCGATAACGTACGTTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_23
CTGACCTTCTCATATACGAGCTTAGAATCGATACTCGCCTGTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_25
CTGACCTTCTCATATACGAGCTTAGAATCGATATAGCTGTGTGAGACGTGCAACGAGGAGCAGGC
>Gruseq_Adapter_Index_27
CTGACCTTCTCATATACGAGCTTAGAATCGATATGGAAGGGTGAGACGTGCAACGAGGAGCAGGC

Now, we add those adaptors to the raw reads with another program from BBTools:

addadapters.sh in=raw_R#.fq.gz out=dirty_R#.fq.gz qout=33 ref=gruseq.fa right

addadapters.sh modifies the read headers adding the exact information of where was the adaptor inserted. For example, the adaptors were added from position 124 in this read:

@4_150_124_150_124 /1
ANTAAAAAGAGTAGTGTCAGATAGCTTATATGGAGAAAGCCATAGCAATTTTATCAGTGCTGTAGAGGAATTAAAAATAGAATATGCAGTGGGAATCTGGAGCAATCATGGGGTCTGGCTTCCACTGACCTTCTCATATACGAGCTTAGA
+
F!FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:F:FFFFFFFFFFF

Now we proceed to remove these synthetic adaptors, these are the commands used for each software:

# Cutadapt 3.4:
time cutadapt -m 21 -j 0 -b "file:gruseq.fa" -B "file:gruseq.fa" -o cutadapt_R1.fq.gz -p cutadapt_R2.fq.gz dirty_R1.fq.gz dirty_R2.fq.gz

# Trimmomatic 0.39:
time trimmomatic PE -phred33 dirty_R1.fq.gz dirty_R2.fq.gz trimmomatic_R1.fq.gz trimmomatic_U1.fq.gz trimmomatic_R2.fq.gz trimmomatic_U2.fq.gz ILLUMINACLIP:gruseq.fa:2:28:10:2:keepBothReads MINLEN:21

# fastp 0.22.0:
time fastp -w 8 -Q -l 21 --adapter_fasta gruseq.fa --detect_adapter_for_pe --in1 dirty_R1.fq.gz --in2 dirty_R2.fq.gz --out1 fastp_R1.fq.gz --out2 fastp_R2.fq.gz

# bbduk.sh 38.92:
time bbduk.sh in=dirty_R#.fq.gz out=bbduk_R#.fq.gz ref=gruseq.fa ktrim=r mink=12 hdist=1 minlen=21 tpe tbo

# bbduk.sh 38.92 in Captus:
time bbduk.sh ktrim=r minlength=21 interleaved=f tpe tbo ref=gruseq.fa in=dirty_R#.fq.gz out=stdout.fq k=21 mink=11 hdist=2 | bbduk.sh ktrim=r minlength=21 interleaved=f tpe tbo ref=gruseq.fa in=stdin.fq out=captus_R#.fq.gz k=19 mink=9 hdist=1

Finally, we use addapters.sh to grade the presence of adaptor of each set of reads:

addadapters.sh in=dirty_R#.fq.gz grade
addadapters.sh in=cutadapt_R#.fq.gz grade
addadapters.sh in=trimmomatic_R#.fq.gz grade
addadapters.sh in=fastp_R#.fq.gz grade
addadapters.sh in=bbduk_R#.fq.gz grade
addadapters.sh in=captus_R#.fq.gz grade

The tests were made in a MacbookPro 2015 with 4 CPU cores (8 threads) and 16 GB of RAM, we can see in this table the summary of the times and the results from the grading of addadapters.sh:

Metric	dirty	Cutadapt	Trimmomatic	fastp	bbduk	Captus
Time to clean	NA	3m42.848s	1m11.250s	1m42.455s	0m9.574s	0m15.249s
Reads retained	100.000	93.345	92.514	92.997	93.002	92.994
Bases retained	100.000	74.53	74.436	73.970	74.268	74.186
Perfectly correct (Reads)	49.970	97.35	80.922	96.256 *	94.849	95.784
Perfectly correct (Bases)	49.970	96.92	86.426	96.035 *	93.900	95.099
Incorrect (Reads)	50.030	2.65	19.078	3.744 *	5.151	4.216
Incorrect (Bases)	50.030	3.08	13.574	3.965 *	6.100	4.901
Adaptors remaining (Reads)	50.030	2.41	5.846	1.830 *	3.866	2.798
Adaptors remaining (Bases)	25.182	0.28	0.422	0.049 *	0.193	0.105
Non-adaptor removed (Reads)	0.000	1.53	13.231	1.914	1.285	1.418
Non-adaptor removed (Bases)	0.000	0.04	0.218	0.566	0.308	0.325

All numbers (except times) represent percentages, Reads and Bases retained are shown as percentage of the input “dirty” reads, while all the rest are as percentage of each method’s number of reads and bases retained. Best values are in bold and italics and the second best only in bold.

Trimmomatic is by far the least accurate in general, the rest have comparable high accuracies.

The most accurate method measured by most metrics [except Adaptors remaining (Bases) and Non-adaptor removed (Reads)] is Cutadapt, but at the cost of being the slowest. It is >23x slower than bbduk.sh and >14x slower than Captus’ settings for bbduk.sh.

Even though fastp is slightly more accurate than bbduk.sh [as measured by all metrics, except Non-adaptor removed (Reads or Bases)], it takes >10x longer to finish. More importantly, we put an * next to fastp’s winning values because when cleaning real unaltered data (without extra adapters added) we noticed that fastp sometimes mistakes real genomic sequence as adaptor, and we are not the only that noticed this, you can read about it in this GitHub Issue. In the future, if the developers of fastp fix the false detection problem we could incorporate it in the pipeline.

The fastest times correspond in both cases to bbduk.sh. Notably, the second most effective method in removing adaptor bases is bbduk.sh with Captus’s settings [see Adaptors remaining (Bases) after cleaning].

Our preference for bbduk.sh is justified at this point by the balance between its high accuracy (>95%) and its great speed (>10x faster than the second fastest), but mainly because of its efficiency in removing adaptor bases from the reads.

Created by Edgardo M. Ortiz (06.08.2021)
Last modified by Gentaro Shigita (12.05.2025)