Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc

    Practical comparison of NGS adapter trimming tools

    1st June 2016

    I used to work with sequencing providers who were giving me fairly clean data.
    It was already barcode-separated, and had no over-represented adapter sequences.
    The only thing I had to do was to (optionally) quality-trim the reads, and check for biological contamination.

    Recently, however, I have come across some real-world data which not only contained contamination, but also a noticeable percentage of adapter sequences.
    I did a quick test of multiple tools to see if they fit my requirements:

    • should be easy/logical to use: no arcane/convoluted command lines or config files
    • should detect adapters automatically, either using its own database or a provided plain FASTA file
    • should be reasonably fast
    • must leave no adapter traces behind: I prefer aggressive trimming
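    To make the last requirement concrete: aggressive trimming means clipping everything from the first adapter match to the 3' end of the read, including partial adapter prefixes hanging off the read's end. A toy Python sketch of that idea (my own illustration with exact matching only; real trimmers also tolerate mismatches):

```python
def clip_3prime(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Clip adapter (and everything after it) from the 3' end of a read.

    First looks for the full adapter anywhere in the read, then for a
    partial adapter prefix overhanging the read's 3' end.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]  # full match: drop it and everything downstream
    # partial match: a prefix of the adapter at the very end of the read
    for k in range(len(adapter) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read

# full adapter inside the read -> everything from the match onward goes
print(clip_3prime("ACGTACGTAGATCGGAAGAGCTTTT", "AGATCGGAAGAGC"))  # ACGTACGT
# only the first 5 adapter bases made it into the read
print(clip_3prime("ACGTACGTAGATC", "AGATCGGAAGAGC"))  # ACGTACGT
```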

    I have tried the following tools:

    • fastq-mcf from the ea-tools package
    • skewer
    • TrimmomaticPE
    • cutadapt: haven’t used it directly, but it is used by some of the compared tools
    • bbduk from BBMAP
    • autoadapt
    • TrimGalore!

    As input, I have used 2 FASTQ files, each about 8.4 gigabytes
    (or 3 785 687 KBytes together in 2 bzip2-compressed files, or 129 753 452 lines / 32 438 363 reads per file).
    Time was measured with bash’s built-in time.
    The all_adapters.txt is a plain FASTA file I took from FastQC distribution a long while ago,
    and possibly added some more adapter sequences scavenged from the internet.
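    As a sanity check, the per-file figures above are self-consistent, since a FASTQ record is exactly 4 lines:

```python
lines_per_file = 129_753_452            # from `wc -l` on one input FASTQ
reads_per_file, remainder = divmod(lines_per_file, 4)  # 4 lines per record
assert remainder == 0                   # otherwise the file is truncated
print(reads_per_file)  # 32438363, matching the read count quoted above
```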

    fastq-mcf (ea-tools)
    fastq-mcf ~/bin/all_adapters.txt -o R1.clip.fastq -o R2.clip.fastq input_R1.fastq input_R2.fastq

    • non-obvious way to specify 2 outputs for 2 inputs, but not complicated either
    • can be given a file with dozens of adapters: will auto-identify which adapters to trim
    • single-threaded, uses 315M RES and 380M VIRT
    • 83.5 minutes on a loaded system
    • Reads too short after clip: 137 684
      Clipped ‘end’ reads (input_R1.fastq): Count 895 775, Mean: 24.36, Sd: 17.32
      Trimmed 2 072 551 reads (input_R1.fastq) by an average of 4.46 bases on quality < 7
      Clipped ‘end’ reads (input_R2.fastq): Count 850 718, Mean: 25.70, Sd: 17.19
      Trimmed 8 729 083 reads (input_R2.fastq) by an average of 4.44 bases on quality < 7

    skewer
    skewer -x ~/bin/all_adapters.txt --mode pe --threads 8 input_R1.fastq input_R2.fastq

    • looks much fancier: uses colors and has a text-mode progress bar
    • is multi-threaded, but appears to be extremely slow – much slower than single-threaded fastq-mcf; update: it is incredibly fast if you give it just 3 or so adapters instead of 96
    • can read up to 96 adapters from the file… should be fine for most purposes
    • uses very little RAM (~4 megabytes RES, ~450M VIRT)
    • really slow: real 177m52.933s , user 1212m3.644s (7 threads)
    • 32 438 363 read pairs processed; of these:
      12 339 ( 0.04%) short read pairs filtered out after trimming by size control
      94 409 ( 0.29%) empty read pairs filtered out after trimming by size control
      32 331 615 (99.67%) read pairs available; of these:
      934 379 ( 2.89%) trimmed read pairs available after processing
      31 397 236 (97.11%) untrimmed read pairs available after processing

    TrimmomaticPE -threads 8 -trimlog trimmomatic.log input_R1.fastq.bz2 input_R2.fastq.bz2 lane1_forward_paired.fq.gz lane1_forward_unpaired.fq.gz lane1_reverse_paired.fq.gz lane1_reverse_unpaired.fq.gz ILLUMINACLIP:/usr/share/trimmomatic/TruSeq3-PE-2.fa:2:40:15

    • failed to start, with an uninformative error message, when the seemingly optional arguments to ILLUMINACLIP were omitted
    • uses 1.5+GB RES, 7.8GB VIRT, and does not fully utilize all 8 threads (CPU load only at around 500%, where 100% means 1 core)
    • does not seem to be I/O bound, but log file is huge: contains all read identifiers
      • it might be better to disable log file (do not specify -trimlog) for higher I/O speed
    • comes bundled with some adapters already, but:
      • does not detect adapters itself: you have to know which file to choose
      • adapter files are structured in a way preventing merging them into a single file: adapter names have special meaning to Trimmomatic
    • real 19m39.431s, user 71m8.600s , sys 23m44.556s: much faster than either skewer or fastq-mcf
    • Input Read Pairs: 32438363
      Both Surviving: 31591307 (97.39%)
      Forward Only Surviving: 750772 (2.31%)
      Reverse Only Surviving: 8023 (0.02%)
      Dropped: 88261 (0.27%)

    NOT trying cutadapt:

    • looks great based on reading the manual
    • only accepts adapters on the command-line, and does not come with adapter files to use
    • is in Python/Python3, so could be more easily re-used from Python programs

    bbduk (BBMAP)
    bbduk.sh in=input_R1.fastq.bz2 in2=input_R2.fastq.bz2 out=bbduk_clean_1.fastq out2=bbduk_clean_2.fastq ref=~/bin/all_adapters.txt

    • refused to load some JNI library:
      Error: Could not find or load main class utilities.bbmap.jni.

    • changed into bbmap/jni and ran export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 ; make -f makefile.linux, but this didn’t help
    • failed to run

    autoadapt (relies on FastQC and cutadapt)
    autoadapt --threads=8 input_R1.fastq autoadapt_clean_1.fastq input_R2.fastq autoadapt_clean_2.fastq

    • first runs FastQC to a temporary file (0.5GB RES, 4.8GB VIRT)
      • fastqc is started with --threads 8, but only 1 file is fed to fastqc…
    • auto-detected adapters, from FastQC’s output:

      Detected the following known contaminant sequences:

    • used over 15 GB RAM! + swap!
    • this is too much, killed and re-starting with 1 thread
    • uses cutadapt (<8M RES, <31M VIRT), looking for adapters anywhere (and not only at 3' like TrimGalore does); here's the generated command sample:
      cutadapt --format fastq --match-read-wildcards --times 2 --error-rate 0.2
      --minimum-length 18 --quality-cutoff 20 --quality-base 33
      --paired-output autoadapt/autoadapt.tmp.f_zxQr95/autoadapt_R2.fastq.tmp
      -o autoadapt/autoadapt.tmp.f_zxQr95/autoadapt_R1.fastq.tmp
      input_R1.fastq input_R2.fastq && cutadapt --format fastq --match-read-wildcards
      --times 2 --error-rate 0.2 --minimum-length 18 --quality-cutoff 20 --quality-base 33
      --paired-output autoadapt_R1.fastq -o autoadapt_R2.fastq

    • uses its own directory for intermediate/temporary files, then moves them to the destination – not good…
      • the problem is that the program’s partition may not have enough space for all the intermediate data
      • actually, cutadapt is run twice:
        • first to the temporary directory
        • then to the final destination, using temporary/intermediate files as inputs
    • ran out of space in /home… created a copy of autoadapt under the ~/data volume
    • /usr/bin/time -f '%C: %e s, %M Kb' ~/data/autoadapt-tmp-copy/ --threads=1 input_R1.fastq autoadapt_R1.fastq input_R2.fastq autoadapt_R2.fastq
    • over 1h CPU time already, and still about half-done… should try with --threads=2 or 4, maybe RAM use will be somewhat better?
    • total time 9979.87 seconds (2.8 hours), max RAM 235 480 Kb
    • trying in 4 threads: again 15+ Gb RAM and 7+Gb swap, killed at this point;
      • the problem seems to be somewhere in the read splitting code – apparently, it keeps reads in RAM (???) while splitting…
      • looking at the split files: they are all partial, so somehow attempts to parallel-split into all thread segments at once
    • trying to edit splitFile() function to use GNU split command; hopefully, mergeFile() does not use gigabytes of RAM…
      • for testing: hard-code tmp dir name; skip actual fastqc
      • this now works great! let’s wait for merging…
      • mergeFile() still eats ~2.5Gb of RES
    • because of all the splitting, temporary directory size easily jumps to about 3x the original file size, or ~48 GB for ~16 GB of input files
    • 3007.68 s (50 minutes – this does not include the initial FastQC run), 2 523 952 Kb (this is mostly the file merging operation)
    • it does not show any stats at the end
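    The splitFile() fix above boils down to streaming the FASTQ and writing every N records to a new chunk, so RAM stays flat regardless of input size. A minimal sketch of that approach (my own illustration, not autoadapt's actual code; GNU split -l achieves the same, as long as the per-chunk line count is a multiple of 4):

```python
import itertools

def split_fastq(in_path: str, reads_per_chunk: int, prefix: str) -> list:
    """Stream a FASTQ file into chunks of reads_per_chunk records each.

    A record is 4 lines; streaming with islice keeps memory constant,
    unlike buffering whole files in RAM (the behaviour observed above).
    """
    chunks = []
    with open(in_path) as fh:
        for i in itertools.count():
            lines = list(itertools.islice(fh, reads_per_chunk * 4))
            if not lines:
                break
            out_path = f"{prefix}.{i:03d}.fastq"
            with open(out_path, "w") as out:
                out.writelines(lines)
            chunks.append(out_path)
    return chunks
```

For inputs the size of the ones above (~32M read pairs), reads_per_chunk would simply be the read count divided by the number of threads.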

    Trim Galore!
    trim_galore --fastqc --path-to-cutadapt /usr/bin/cutadapt3 --paired input_R1.fastq input_R2.fastq

    • the trim_galore perl wrapper itself consumes just a few megabytes of RAM
    • uses cutadapt for actual work
    • auto-detects adapters, although somehow the Illumina adapter found is only a substring of what was found by autoadapt/FastQC…

      Found perfect matches for the following adapter sequences:
      Adapter type Count Sequence Sequences analysed Percentage
      Illumina 17429 AGATCGGAAGAGC 1000000 1.74
      Nextera 0 CTGTCTCTTATA 1000000 0.00
      smallRNA 0 TGGAATTCTCGG 1000000 0.00
      Using Illumina adapter for trimming (count: 17429). Second best hit was Nextera (count: 0)

    • can run FastQC itself on the processed data, if so instructed by a command-line option
    • trims and summarizes each file separately
    • Total reads processed: 32,438,363
      Reads with adapters: 6,878,225 (21.2%)
      Reads written (passing filters): 32,438,363 (100.0%)
      Total basepairs processed: 3,276,274,663 bp
      Quality-trimmed: 11,132,367 bp (0.3%)
      Total written (filtered): 3,226,980,229 bp (98.5%)

    • cutadapt processes about 4 million reads/minute on my work PC i7
    • Total reads processed: 32,438,363
      Reads with adapters: 6,030,241 (18.6%)
      Reads written (passing filters): 32,438,363 (100.0%)
      Total basepairs processed: 3,276,274,663 bp
      Quality-trimmed: 40,530,133 bp (1.2%)
      Total written (filtered): 3,199,297,597 bp (97.7%)

    • length is checked after cutadapt:

      Number of sequence pairs removed because at least one read was shorter than the length cutoff (20 bp): 145312 (0.45%)

    • 1955.49 s (32.6 minutes), 228 592 Kb (this is likely FastQC’s top RAM use)
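    The adapter auto-detection shown in the table above is conceptually simple: count perfect matches of each known adapter's probe sequence in a sample of reads and pick the most frequent. A minimal sketch under that assumption (probe sequences as printed in the table; the sample reads are made up for illustration):

```python
def detect_adapter(reads, adapters):
    """Count perfect probe matches per adapter; return the winner and counts."""
    counts = {name: sum(probe in read for read in reads)
              for name, probe in adapters.items()}
    return max(counts, key=counts.get), counts

# probe sequences as printed in Trim Galore's detection table
adapters = {
    "Illumina": "AGATCGGAAGAGC",
    "Nextera":  "CTGTCTCTTATA",
    "smallRNA": "TGGAATTCTCGG",
}
reads = [
    "ACGTACGTAGATCGGAAGAGCTTTT",   # carries the Illumina probe
    "AGATCGGAAGAGCAAAACCCC",       # carries the Illumina probe
    "ACGTACGTACGTACGTACGT",        # clean read
]
best, counts = detect_adapter(reads, adapters)
print(best, counts["Illumina"])  # Illumina 2
```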

    How do I evaluate the quality of trimming?
    Notably, all trimmers removed the “Adapters detected” section from FastQC’s output.
    For now, I’m simply choosing the smallest pair of processed read files
    (under the assumption that the smallest is the most aggressively trimmed).

    File sizes after trimming, R1+R2
    16’750’631 trimmomatic
    16’770’603 autoadapt, threads=1
    16’771’639 autoadapt, threads=8 // after swapping Perl splitter function for GNU split
    16’924’934 trimGalore
    16’963’937 fastq-mcf
    17’057’065 skewer

    Looking at FastQC plots, major differences can be seen in read lengths distribution (which depends on how much of the sequence tail/head was trimmed),
    per-tile quality (trimmomatic and skewer do not perform any kind of quality trimming by default, others do), and k-mer content.
    For k-mer content, trimmomatic, trimGalore, and skewer look the most natural: there is a background of random-looking lesser spikes (up to 2-4),
    and one or two bigger spikes (up to 12). For other tools (autoadapt, fastq-mcf) k-mer content looks like a flat line (but likely also 2-4)
    with several huge spikes (up to 35-40). In fact, only autoadapt, trimgalore, and skewer got a “warning” on k-mer content – all others got an “error”.

    Overall, Trimmomatic and trimGalore appear to be the two best adapter trimmers, both by aggressiveness+FastQC reports and by speed.
    However, trimGalore detected a significantly shorter adapter, while Trimmomatic produced a smaller, more aggressively trimmed file.
    On the downside, Trimmomatic does not auto-detect adapters! This can be alleviated by first running FastQC on the input files,
    then checking /usr/share/trimmomatic/ for matching adapter files – those which contain both adapters detected by FastQC.
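    That lookup is easy to script: scan the bundled FASTA files and keep those that contain every adapter sequence FastQC reported. A hedged sketch (the directory layout and detected sequences are assumptions, not Trimmomatic functionality):

```python
import glob
import os

def fasta_seqs(path):
    """Yield sequences from a FASTA file, joining wrapped lines per record."""
    seq = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                seq = []
            else:
                seq.append(line.strip())
    if seq:
        yield "".join(seq)

def files_with_all_adapters(fasta_dir, detected):
    """Adapter FASTA files in fasta_dir containing every detected sequence."""
    hits = []
    for path in sorted(glob.glob(os.path.join(fasta_dir, "*.fa"))):
        seqs = list(fasta_seqs(path))
        if all(any(d in s for s in seqs) for d in detected):
            hits.append(path)
    return hits

# e.g. files_with_all_adapters("/usr/share/trimmomatic",
#                              ["AGATCGGAAGAGC...", "AGATCGGAAGAGC..."])
```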

    Will use Trimmomatic for now.

    Important update:

    • It is possible to (quite easily) construct a file with all the adapters for Trimmomatic, and it will happily try to trim anything from that file; Trimmomatic is now my sledgehammer – give it anything, and it will crush it.
    • I have just used cutadapt directly, on a peculiar case of Nextera transposon contamination throughout the length of reads. The advantage of cutadapt is that you can specify how many times to trim the adapter – by default it is just 1, but I’ve set it to 20 and got rid of all Nextera leftovers. cutadapt is now my scalpel – I use it in pathological cases, when I know what (and how much of it) to cut out.
    • Specifically for Nextera, I’m now using NxTrim – a tool from Illumina, which examines the reads and splits them into several categories: proper MP, PE, single-end/overlapping reads, and unknown. After NxTrim, individual reads should still have other sequencing adapters clipped.
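    The --times behaviour amounts to re-scanning the read after each removal, up to the given count. A toy model of why a single pass can leave copies behind (simplified: each pass splices out one occurrence, whereas cutadapt's real -a/-b modes also remove flanking sequence):

```python
def trim_times(read, adapter, times=1):
    """Splice out up to `times` occurrences of adapter, one per pass.

    Simplified model of repeated trimming; cutadapt's actual adapter
    types additionally remove sequence before/after the match.
    """
    for _ in range(times):
        pos = read.find(adapter)
        if pos == -1:
            break
        read = read[:pos] + read[pos + len(adapter):]
    return read

nextera = "CTGTCTCTTATA"
read = "AAAA" + nextera + "CCCC" + nextera + "GGGG"
print(trim_times(read, nextera, times=1))   # one copy still left behind
print(trim_times(read, nextera, times=20))  # AAAACCCCGGGG: all copies gone
```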

    6 Responses to “Practical comparison of NGS adapter trimming tools”

    1. caio Says:

      Could you share your adapters file for Trimmomatic?

    2. Bogdan Says:

      Hi, sorry for a very slow reply…
      Do you still need the adapters file?
      There are 2 caveats:
      - I think Illumina forbids/discourages publishing adapter sequences, so I can only email those to you;
      - do not assume that adapters in that file are all hand-picked and definitely correct – they were assembled from different sources and are more reminiscent of a “heap of things” than of a proper collection.

    3. Joyce Says:

      I’m new to using trimmomatic. I’ve been trying to get it to remove adapter sequences, but it keeps missing the ones I specified in my adapter fasta file. Can you share with me the naming conventions needed for the trimmer to work?


    4. Bogdan Says:

      Hi Joyce, I haven’t added any custom adapters to the file provided with Trimmomatic, so I don’t remember the naming conventions – but they are described in the Trimmomatic manual.

    5. Brian Bushnell Says:


      I’m not really sure why BBDuk did not work for you, but you can try this –

      At the bottom of the launcher shell script is this line:
      local CMD="java -Djava.library.path=$NATIVELIBDIR $EA $z $z2 -cp $CP jgi.BBDukF $@"

      Change that to:
      local CMD="java $EA $z $z2 -cp $CP jgi.BBDukF $@"

      …and then it should not encounter any JNI-related problems.

    6. Bogdan Says:

      Hi Brian,

      thank you for stopping by and offering a fix! I wonder how you found this page :)

      As you probably noticed, this is not a proper comparison with any degree of rigor – but rather a quick pick of “whatever seems to work”.
      This is why I have not invested much effort into making BBduk run…

      I’m definitely still interested in using the entire BBMAP suite – the comparison plots on the SourceForge page are quite attractive.

      As I’m still working with all kinds of NGS data, I’ll give BBMAP a proper try in one of my next projects.
