Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc

    Practical comparison of NGS adapter trimming tools

    1st June 2016

    I used to work with sequencing providers who were giving me fairly clean data.
    It was already barcode-separated, and had no over-represented adapter sequences.
    The only thing I had to do was to (optionally) quality-trim the reads, and check for biological contamination.

    Recently, however, I have come across some real-world data which not only contained contamination, but also a noticeable percentage of adapter sequences.
    I did a quick test of multiple tools to see if they fit my requirements:

    • should be easy/logical to use: no arcane/convoluted command lines or config files
    • should detect adapters automatically, either using its own database or a provided plain FASTA file
    • should be reasonably fast
    • must leave no adapter traces behind: I prefer aggressive trimming
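    To make the last requirement concrete: aggressive trimming means clipping everything from the first adapter match to the 3' end of the read, including partial adapter prefixes hanging off the read's end. A toy Python sketch of that idea (my own illustration with exact matching only; real trimmers also tolerate mismatches):

```python
def clip_3prime(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Clip adapter (and everything after it) from the 3' end of a read.

    First looks for the full adapter anywhere in the read, then for a
    partial adapter prefix overhanging the read's 3' end.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]  # full match: drop it and everything downstream
    # partial match: a prefix of the adapter at the very end of the read
    for k in range(len(adapter) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read

# full adapter inside the read -> everything from the match onward goes
print(clip_3prime("ACGTACGTAGATCGGAAGAGCTTTT", "AGATCGGAAGAGC"))  # ACGTACGT
# only the first 5 adapter bases made it into the read
print(clip_3prime("ACGTACGTAGATC", "AGATCGGAAGAGC"))  # ACGTACGT
```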

    I have tried the following tools:

    • fastq-mcf from the ea-tools package
    • skewer
    • TrimmomaticPE
    • cutadapt: haven’t used it directly, but it is used by some of the compared tools
    • bbduk from BBMAP
    • autoadapt
    • TrimGalore!

    As input, I have used 2 FASTQ files, each about 8.4 gigabytes
    (or 3 785 687 KBytes together in 2 bzip2-compressed files, or 129 753 452 lines / 32 438 363 reads per file).
    Time was measured with bash’s built-in time.
    The all_adapters.txt is a plain FASTA file I took from FastQC distribution a long while ago,
    and possibly added some more adapter sequences scavenged from the internet.
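    As a sanity check, the per-file figures above are self-consistent, since a FASTQ record is exactly 4 lines:

```python
lines_per_file = 129_753_452            # from `wc -l` on one input FASTQ
reads_per_file, remainder = divmod(lines_per_file, 4)  # 4 lines per record
assert remainder == 0                   # otherwise the file is truncated
print(reads_per_file)  # 32438363, matching the read count quoted above
```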

    fastq-mcf (ea-tools)
    fastq-mcf ~/bin/all_adapters.txt -o R1.clip.fastq -o R2.clip.fastq input_R1.fastq input_R2.fastq

    • non-obvious way to specify 2 outputs for 2 inputs, but not complicated either
    • can be given a file with dozens of adapters: will auto-identify which adapters to trim
    • single-threaded, uses 315M RES and 380M VIRT
    • 83.5 minutes on a loaded system
    • Reads too short after clip: 137 684
      Clipped ‘end’ reads (input_R1.fastq): Count 895 775, Mean: 24.36, Sd: 17.32
      Trimmed 2 072 551 reads (input_R1.fastq) by an average of 4.46 bases on quality < 7
      Clipped ‘end’ reads (input_R2.fastq): Count 850 718, Mean: 25.70, Sd: 17.19
      Trimmed 8 729 083 reads (input_R2.fastq) by an average of 4.44 bases on quality < 7

    skewer
    skewer -x ~/bin/all_adapters.txt --mode pe --threads 8 input_R1.fastq input_R2.fastq

    • looks much fancier: uses colors and has a text-mode progress bar
    • is multi-threaded, but appears to be extremely slow – much slower than single-threaded fastq-mcf; update: it is incredibly fast if you give it just 3 or so adapters instead of 96
    • can read up to 96 adapters from the file… should be fine for most purposes
    • uses very little RAM (~4 megabytes RES, ~450M VIRT)
    • really slow: real 177m52.933s , user 1212m3.644s (7 threads)
    • 32 438 363 read pairs processed; of these:
      12 339 ( 0.04%) short read pairs filtered out after trimming by size control
      94 409 ( 0.29%) empty read pairs filtered out after trimming by size control
      32 331 615 (99.67%) read pairs available; of these:
      934 379 ( 2.89%) trimmed read pairs available after processing
      31 397 236 (97.11%) untrimmed read pairs available after processing

    TrimmomaticPE -threads 8 -trimlog trimmomatic.log input_R1.fastq.bz2 input_R2.fastq.bz2 lane1_forward_paired.fq.gz lane1_forward_unpaired.fq.gz lane1_reverse_paired.fq.gz lane1_reverse_unpaired.fq.gz ILLUMINACLIP:/usr/share/trimmomatic/TruSeq3-PE-2.fa:2:40:15

    • failed to start, with an uninformative error message, when the seemingly optional arguments to ILLUMINACLIP were omitted
    • uses 1.5+GB RES, 7.8GB VIRT, and does not fully utilize all 8 threads (CPU load only at around 500%, where 100% means 1 core)
    • does not seem to be I/O bound, but log file is huge: contains all read identifiers
      • it might be better to disable log file (do not specify -trimlog) for higher I/O speed
    • comes bundled with some adapters already, but:
      • does not detect adapters itself: you have to know which file to choose
      • adapter files are structured in a way preventing merging them into a single file: adapter names have special meaning to Trimmomatic
    • real 19m39.431s, user 71m8.600s , sys 23m44.556s: much faster than either skewer or fastq-mcf
    • Input Read Pairs: 32438363
      Both Surviving: 31591307 (97.39%)
      Forward Only Surviving: 750772 (2.31%)
      Reverse Only Surviving: 8023 (0.02%)
      Dropped: 88261 (0.27%)

    NOT trying cutadapt:

    • looks great based on reading the manual
    • only accepts adapters on the command-line, and does not come with adapter files to use
    • is in Python/Python3, so could be more easily re-used from Python programs

    bbduk (BBMAP)
    bbduk.sh in=input_R1.fastq.bz2 in2=input_R2.fastq.bz2 out=bbduk_clean_1.fastq out2=bbduk_clean_2.fastq ref=~/bin/all_adapters.txt

    • refused to load some JNI library:
      Error: Could not find or load main class utilities.bbmap.jni.

    • changed into bbmap/jni and ran export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 ; make -f makefile.linux, but this didn’t help
    • failed to run

    autoadapt (relies on FastQC and cutadapt)
    autoadapt --threads=8 input_R1.fastq autoadapt_clean_1.fastq input_R2.fastq autoadapt_clean_2.fastq

    • first runs FastQC to a temporary file (0.5GB RES, 4.8GB VIRT)
      • fastqc is started with --threads 8, but only 1 file is fed to fastqc…
    • auto-detected adapters, from FastQC’s output:

      Detected the following known contaminant sequences:

    • used over 15 GB RAM! + swap!
    • this is too much, killed and re-starting with 1 thread
    • uses cutadapt (<8M RES, <31M VIRT), looking for adapters anywhere (and not only at 3' like TrimGalore does); here's the generated command sample:
      cutadapt --format fastq --match-read-wildcards --times 2 --error-rate 0.2
      --minimum-length 18 --quality-cutoff 20 --quality-base 33
      --paired-output autoadapt/autoadapt.tmp.f_zxQr95/autoadapt_R2.fastq.tmp
      -o autoadapt/autoadapt.tmp.f_zxQr95/autoadapt_R1.fastq.tmp
      input_R1.fastq input_R2.fastq && cutadapt --format fastq --match-read-wildcards
      --times 2 --error-rate 0.2 --minimum-length 18 --quality-cutoff 20 --quality-base 33
      --paired-output autoadapt_R1.fastq -o autoadapt_R2.fastq

    • uses its own directory for intermediate/temporary files, then moves them to the destination – not good…
      • the problem is that the program’s partition may not have enough space for all the intermediate data
      • actually, cutadapt is run twice:
        • first to the temporary directory
        • then to the final destination, using temporary/intermediate files as inputs
    • ran out of space in /home… created a copy of autoadapt under the ~/data volume
    • /usr/bin/time -f '%C: %e s, %M Kb' ~/data/autoadapt-tmp-copy/ --threads=1 input_R1.fastq autoadapt_R1.fastq input_R2.fastq autoadapt_R2.fastq
    • over 1h CPU time already, and still about half-done… should try with --threads=2 or 4, maybe RAM use will be somewhat better?
    • total time 9979.87 seconds (2.8 hours), max RAM 235 480 Kb
    • trying in 4 threads: again 15+ Gb RAM and 7+Gb swap, killed at this point;
      • the problem seems to be somewhere in the read splitting code – apparently, it keeps reads in RAM (???) while splitting…
      • looking at the split files: they are all partial, so somehow attempts to parallel-split into all thread segments at once
    • trying to edit splitFile() function to use GNU split command; hopefully, mergeFile() does not use gigabytes of RAM…
      • for testing: hard-code tmp dir name; skip actual fastqc
      • this now works great! let’s wait for merging…
      • mergeFile() still eats ~2.5Gb of RES
    • because of all the splitting, temporary directory size easily jumps to about 3x the original file size, or ~48 GB for ~16 GB of input files
    • 3007.68 s (50 minutes – this does not include the initial FastQC run), 2 523 952 Kb (this is mostly the file merging operation)
    • it does not show any stats at the end
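    The splitFile() fix above boils down to streaming the FASTQ and writing every N records to a new chunk, so RAM stays flat regardless of input size. A minimal sketch of that approach (my own illustration, not autoadapt's actual code; GNU split -l achieves the same, as long as the per-chunk line count is a multiple of 4):

```python
import itertools

def split_fastq(in_path: str, reads_per_chunk: int, prefix: str) -> list:
    """Stream a FASTQ file into chunks of reads_per_chunk records each.

    A record is 4 lines; streaming with islice keeps memory constant,
    unlike buffering whole files in RAM (the behaviour observed above).
    """
    chunks = []
    with open(in_path) as fh:
        for i in itertools.count():
            lines = list(itertools.islice(fh, reads_per_chunk * 4))
            if not lines:
                break
            out_path = f"{prefix}.{i:03d}.fastq"
            with open(out_path, "w") as out:
                out.writelines(lines)
            chunks.append(out_path)
    return chunks
```

For inputs the size of the ones above (~32M read pairs), reads_per_chunk would simply be the read count divided by the number of threads.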

    Trim Galore!
    trim_galore --fastqc --path-to-cutadapt /usr/bin/cutadapt3 --paired input_R1.fastq input_R2.fastq

    • the trim_galore perl wrapper itself consumes just a few megabytes of RAM
    • uses cutadapt for actual work
    • auto-detects adapters, although somehow the Illumina adapter found is only a substring of what was found by autoadapt/FastQC…

      Found perfect matches for the following adapter sequences:
      Adapter type Count Sequence Sequences analysed Percentage
      Illumina 17429 AGATCGGAAGAGC 1000000 1.74
      Nextera 0 CTGTCTCTTATA 1000000 0.00
      smallRNA 0 TGGAATTCTCGG 1000000 0.00
      Using Illumina adapter for trimming (count: 17429). Second best hit was Nextera (count: 0)

    • can run FastQC itself on the processed data, if so instructed by a command-line option
    • trims and summarizes each file separately
    • Total reads processed: 32,438,363
      Reads with adapters: 6,878,225 (21.2%)
      Reads written (passing filters): 32,438,363 (100.0%)
      Total basepairs processed: 3,276,274,663 bp
      Quality-trimmed: 11,132,367 bp (0.3%)
      Total written (filtered): 3,226,980,229 bp (98.5%)

    • cutadapt processes about 4 million reads/minute on my work PC i7
    • Total reads processed: 32,438,363
      Reads with adapters: 6,030,241 (18.6%)
      Reads written (passing filters): 32,438,363 (100.0%)
      Total basepairs processed: 3,276,274,663 bp
      Quality-trimmed: 40,530,133 bp (1.2%)
      Total written (filtered): 3,199,297,597 bp (97.7%)

    • length is checked after cutadapt:

      Number of sequence pairs removed because at least one read was shorter than the length cutoff (20 bp): 145312 (0.45%)

    • 1955.49 s (32.6 minutes), 228 592 Kb (this is likely FastQC’s top RAM use)
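    The adapter auto-detection shown in the table above is conceptually simple: count perfect matches of each known adapter's probe sequence in a sample of reads and pick the most frequent. A minimal sketch under that assumption (probe sequences as printed in the table; the sample reads are made up for illustration):

```python
def detect_adapter(reads, adapters):
    """Count perfect probe matches per adapter; return the winner and counts."""
    counts = {name: sum(probe in read for read in reads)
              for name, probe in adapters.items()}
    return max(counts, key=counts.get), counts

# probe sequences as printed in Trim Galore's detection table
adapters = {
    "Illumina": "AGATCGGAAGAGC",
    "Nextera":  "CTGTCTCTTATA",
    "smallRNA": "TGGAATTCTCGG",
}
reads = [
    "ACGTACGTAGATCGGAAGAGCTTTT",   # carries the Illumina probe
    "AGATCGGAAGAGCAAAACCCC",       # carries the Illumina probe
    "ACGTACGTACGTACGTACGT",        # clean read
]
best, counts = detect_adapter(reads, adapters)
print(best, counts["Illumina"])  # Illumina 2
```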

    How do I evaluate the quality of trimming?
    Notably, all trimmers removed the “Adapters detected” section from FastQC’s output.
    For now, I’m simply choosing the smallest pair of processed read files
    (under the assumption that the smallest is the most aggressively trimmed).

    File sizes after trimming, R1+R2
    16’750’631 trimmomatic
    16’770’603 autoadapt, threads=1
    16’771’639 autoadapt, threads=8 // after swapping Perl splitter function for GNU split
    16’924’934 trimGalore
    16’963’937 fastq-mcf
    17’057’065 skewer

    Looking at FastQC plots, major differences can be seen in read lengths distribution (which depends on how much of the sequence tail/head was trimmed),
    per-tile quality (trimmomatic and skewer do not perform any kind of quality trimming by default, others do), and k-mer content.
    For k-mer content, trimmomatic, trimGalore, and skewer look the most natural: there is a background of random-looking lesser spikes (up to 2-4),
    and one or two bigger spikes (up to 12). For other tools (autoadapt, fastq-mcf) k-mer content looks like a flat line (but likely also 2-4)
    with several huge spikes (up to 35-40). In fact, only autoadapt, trimgalore, and skewer got a “warning” on k-mer content – all others got an “error”.

    Overall, Trimmomatic and trimGalore appear to be the two best adapter trimmers, both by aggressiveness+FastQC reports and by speed.
    However, trimGalore detected a significantly shorter adapter, while Trimmomatic produced a smaller, more aggressively trimmed file.
    On the downside, Trimmomatic does not auto-detect adapters! This can be alleviated by first running FastQC on the input files,
    then checking /usr/share/trimmomatic/ for matching adapter files – those which contain both adapters detected by FastQC.
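    That lookup is easy to script: scan the bundled FASTA files and keep those that contain every adapter sequence FastQC reported. A hedged sketch (the directory layout and detected sequences are assumptions, not Trimmomatic functionality):

```python
import glob
import os

def fasta_seqs(path):
    """Yield sequences from a FASTA file, joining wrapped lines per record."""
    seq = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                seq = []
            else:
                seq.append(line.strip())
    if seq:
        yield "".join(seq)

def files_with_all_adapters(fasta_dir, detected):
    """Adapter FASTA files in fasta_dir containing every detected sequence."""
    hits = []
    for path in sorted(glob.glob(os.path.join(fasta_dir, "*.fa"))):
        seqs = list(fasta_seqs(path))
        if all(any(d in s for s in seqs) for d in detected):
            hits.append(path)
    return hits

# e.g. files_with_all_adapters("/usr/share/trimmomatic",
#                              ["AGATCGGAAGAGC...", "AGATCGGAAGAGC..."])
```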

    Will use Trimmomatic for now.

    Important update:

    • It is possible to (quite easily) construct a file with all the adapters for Trimmomatic, and it will happily try to trim anything from that file; Trimmomatic is now my sledgehammer – give it anything, and it will crush it.
    • I have just used cutadapt directly, on a peculiar case of Nextera transposon contamination throughout the length of reads. The advantage of cutadapt is that you can specify how many times to trim the adapter – by default it is just 1, but I’ve set it to 20 and got rid of all Nextera leftovers. cutadapt is now my scalpel – I use it in pathological cases, when I know what (and how much of it) to cut out.
    • Specifically for Nextera, I’m now using NxTrim – a tool from Illumina, which examines the reads and splits them into several categories: proper MP, PE, single-end/overlapping reads, and unknown. After NxTrim, individual reads should still have other sequencing adapters clipped.
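    The --times behaviour amounts to re-scanning the read after each removal, up to the given count. A toy model of why a single pass can leave copies behind (simplified: each pass splices out one occurrence, whereas cutadapt's real -a/-b modes also remove flanking sequence):

```python
def trim_times(read, adapter, times=1):
    """Splice out up to `times` occurrences of adapter, one per pass.

    Simplified model of repeated trimming; cutadapt's actual adapter
    types additionally remove sequence before/after the match.
    """
    for _ in range(times):
        pos = read.find(adapter)
        if pos == -1:
            break
        read = read[:pos] + read[pos + len(adapter):]
    return read

nextera = "CTGTCTCTTATA"
read = "AAAA" + nextera + "CCCC" + nextera + "GGGG"
print(trim_times(read, nextera, times=1))   # one copy still left behind
print(trim_times(read, nextera, times=20))  # AAAACCCCGGGG: all copies gone
```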

    6 Responses to “Practical comparison of NGS adapter trimming tools”

    1. caio Says:

      Could you share your adapters file for Trimmomatic?

    2. Bogdan Says:

      Hi, sorry for a very slow reply…
      Do you still need the adapters file?
      There are 2 caveats:
      - I think Illumina forbids/discourages publishing adapter sequences, so I can only email those to you;
      - do not assume that adapters in that file are all hand-picked and definitely correct – they were assembled from different sources and are more reminiscent of a “heap of things” than of a proper collection.

    3. Joyce Says:

      I’m new to using trimmomatic. I’ve been trying to get it to remove adapter sequences, but it keeps missing the ones I specified in my adapter fasta file. Can you share with me the naming conventions needed for the trimmer to work?


    4. Bogdan Says:

      Hi Joyce, I haven’t added any custom adapters to the file provided with Trimmomatic, so I don’t remember the naming conventions – but they are described in the Trimmomatic manual.

    5. Brian Bushnell Says:


      I’m not really sure why BBDuk did not work for you, but you can try this –

      At the bottom of the launcher shell script is this line:
      local CMD="java -Djava.library.path=$NATIVELIBDIR $EA $z $z2 -cp $CP jgi.BBDukF $@"

      Change that to:
      local CMD="java $EA $z $z2 -cp $CP jgi.BBDukF $@"

      …and then it should not encounter any JNI-related problems.

    6. Bogdan Says:

      Hi Brian,

      thank you for stopping by and offering a fix! I wonder how you found this page :)

      As you probably noticed, this is not a proper comparison with any degree of rigor – but rather a quick pick of “whatever seems to work”.
      This is why I have not invested much effort into making BBduk run…

      I’m definitely still interested in using the entire BBMAP suite – the comparison plots on the SourceForge page are quite attractive.

      As I’m still working with all kinds of NGS data, I’ll give BBMAP a proper try in one of my next projects.
