How to use mkfifo named pipes with prinseq-lite.pl
24th February 2016
prinseq-lite.pl is a utility written in Perl for preprocessing NGS reads, including reads in FASTQ format.
It can read sequences both from files and from stdin (if you only have one input file).
I wanted to use it with compressed (gzip or bzip2) FASTQ input files.
As I do not need to store decompressed input files, the most efficient solution is to use pipes.
This works well for a single file, but not for 2 files (paired-end reads), because a process has only one stdin.
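For the single-file case, the pattern might look like the sketch below. It uses awk as a stand-in consumer so that it runs without prinseq-lite.pl installed; the file names and the tiny test input are made up for illustration. The prinseq-lite.pl invocation in the comment assumes the `-fastq stdin`, `-out_good` and `-out_bad` options as described in the PRINSEQ manual, so check `prinseq-lite.pl -h` for your version.

```shell
# Scratch directory with a tiny gzipped FASTQ file (two records, eight lines).
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip -c > "$workdir/R1.fastq.gz"

# Assumed prinseq-lite.pl equivalent (option names from the PRINSEQ manual):
#   gzip -dc R1.fastq.gz | prinseq-lite.pl -fastq stdin -out_good R1_good -out_bad null

# Stand-in consumer: decompress to stdout and count FASTQ records (4 lines each).
# gzip -dc does the same job as zcat on Linux.
record_count=$(gzip -dc "$workdir/R1.fastq.gz" | awk 'END { print NR / 4 }')

echo "records: $record_count"
rm -r "$workdir"
```

Nothing decompressed ever touches the disk here, but the consumer can only receive one stream this way.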
For 2 files, named pipes (also known as FIFOs) can be used.
You can create a named pipe in Linux with the mkfifo command, for example:
mkfifo R1_decompressed.fastq
To use it, start decompressing something into it (either in a different terminal or in the background), for example:
zcat R1.fastq.gz > R1_decompressed.fastq &
We can call this the writing/generating process, because it writes into the pipe.
(If you are writing software to use named pipes, any processes writing into them should be started in a new thread, as they will block until all the data is consumed.)
Now if you give R1_decompressed.fastq as a file argument to some other program, it will see the decompressed content (e.g. wc -l R1_decompressed.fastq will tell you the number of lines in the decompressed file); we can call a program reading from the named pipe a reading/consuming process.
As soon as the consuming process has consumed (read) all of the data, the writing/generating process will finally exit.
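The whole life cycle of a named pipe can be sketched as a small self-contained script. The one-record test file and the scratch directory are made up for illustration; only mkfifo, gzip and wc are needed.

```shell
# Work in a scratch directory so nothing is left behind.
workdir=$(mktemp -d)

# Build a tiny gzipped FASTQ file (one record, four lines) as stand-in input.
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/R1.fastq.gz"

# Create the named pipe.
mkfifo "$workdir/R1_decompressed.fastq"

# Writing process: started in the background, it blocks until a reader
# opens the pipe. gzip -dc does the same job as zcat on Linux.
gzip -dc "$workdir/R1.fastq.gz" > "$workdir/R1_decompressed.fastq" &
writer_pid=$!

# Consuming process: reads the pipe as if it were an ordinary file.
line_count=$(wc -l < "$workdir/R1_decompressed.fastq")

# Once all data has been read, the writer exits; wait collects its status.
wait "$writer_pid"
writer_status=$?

echo "lines: $line_count, writer exit status: $writer_status"
rm -r "$workdir"
```

The named pipe itself stays on disk as an empty special file until it is removed, so it can be reused for another writer/reader pair.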
This, however, does not work with prinseq-lite.pl (version 0.20.4 or earlier): it fails with a broken pipe error.
Named pipes are very similar to usual files, with two major differences:
- named pipes are not seekable: you cannot move the file pointer in either direction (the only way to skip forward is to read and discard data);
- you cannot arbitrarily close/re-open a named pipe from the consuming end: once the consuming end is closed, the writing/generating process gets a broken pipe error on its next write.
The reason why prinseq-lite.pl does not work with named pipes is that it performs file format checking first – by opening the file, reading the first 3 lines, and closing it.
Closing a named pipe causes a broken pipe error in the writing process, and when prinseq-lite.pl attempts to open the pipe again, it succeeds, but there is no data there anymore, so it just sits and waits for data.
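The open, read 3 lines, close, re-open sequence can be imitated without prinseq-lite.pl. In this sketch head stands in for the format check, seq stands in for zcat, and timeout (GNU coreutils) is used only so the script does not hang forever; none of this is taken from the prinseq-lite.pl source. In the sketch the second open itself blocks, because the writer has already exited by then; either way, the reader ends up waiting for data that never arrives.

```shell
workdir=$(mktemp -d)
fifo="$workdir/reads.fastq"
mkfifo "$fifo"

# Writing process: far more data than fits into the pipe buffer, so it is
# still writing when the reader closes its end.
seq 1 1000000 > "$fifo" &
writer_pid=$!

# First open: read three lines and close, like the format check does.
first_lines=$(head -n 3 "$fifo")

# The writer is terminated once the read end is closed, so its exit status
# is non-zero (usually 141, i.e. 128 + SIGPIPE).
wait "$writer_pid"
writer_status=$?

# Second open: no writer is left, so the reader just sits and waits.
# timeout gives up after 2 seconds and reports exit status 124.
timeout 2 cat "$fifo" > /dev/null
reopen_status=$?

echo "writer exit status: $writer_status, re-open exit status: $reopen_status"
rm -r "$workdir"
```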
I’m ok with a quick and dirty solution, so here it is: prinseq-lite.pl patch to enable mkfifo named pipes as input files (also local prinseq-lite.pl.patch).
WARNING: this patch simply disables file format checking!
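With the format check out of the way, a paired-end run through two named pipes might look like the sketch below. paste is a stand-in consumer that opens both pipes and reads them to the end, so the script runs without prinseq-lite.pl; the test reads and file names are made up. The prinseq-lite.pl command in the comment assumes the `-fastq`, `-fastq2`, `-out_good` and `-out_bad` options from the PRINSEQ manual and a patched prinseq-lite.pl; it has not been run as part of this sketch.

```shell
workdir=$(mktemp -d)

# Tiny gzipped stand-ins for the two paired-end files (one record each).
printf '@read1/1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/R1.fastq.gz"
printf '@read1/2\nTTGA\n+\nIIII\n' | gzip -c > "$workdir/R2.fastq.gz"

# One named pipe per input file.
mkfifo "$workdir/R1_decompressed.fastq" "$workdir/R2_decompressed.fastq"

# One background writing process per pipe; each blocks until it is read.
gzip -dc "$workdir/R1.fastq.gz" > "$workdir/R1_decompressed.fastq" &
pid1=$!
gzip -dc "$workdir/R2.fastq.gz" > "$workdir/R2_decompressed.fastq" &
pid2=$!

# Assumed invocation with a patched prinseq-lite.pl:
#   prinseq-lite.pl -fastq R1_decompressed.fastq -fastq2 R2_decompressed.fastq \
#       -out_good paired_good -out_bad null

# Stand-in consumer: read both pipes side by side and count the merged lines.
paired_lines=$(paste "$workdir/R1_decompressed.fastq" "$workdir/R2_decompressed.fastq" | wc -l)

# Both writers exit once their data has been consumed.
wait "$pid1"
wait "$pid2"

echo "merged lines: $paired_lines"
rm -r "$workdir"
```

Since the patch removes the format check, it is worth confirming separately that both inputs really are FASTQ before starting a long run.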