Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc

    The favourite file compressor: gzip, bzip2, or 7z?

    17th October 2013

    Here comes a heap of assorted web-links!

    I have personally settled on pbzip2 for two simple reasons:

    • performance scales quasi-linearly with the number of CPU cores (until one hits an I/O bottleneck);
    • when an archive is damaged, you should lose only the damaged block(s), each holding 100-900 kB of uncompressed data; the remaining information may be salvageable.
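Both points can be illustrated with a minimal Python sketch. It is not pbzip2 itself, only the same idea as I understand it: cut the input into chunks, compress each chunk as an independent bzip2 stream, and treat the streams separately when salvaging. The names `compress_chunks` and `salvage` and the default chunk size are assumptions made for this illustration.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

# Assumption for this sketch: 900 kB chunks, mirroring bzip2's largest block size (-9).
DEFAULT_CHUNK_SIZE = 900_000


def compress_chunks(data, chunk_size=DEFAULT_CHUNK_SIZE, workers=4):
    """Compress each chunk of `data` as an independent bzip2 stream.

    Threads are used on the assumption that CPython's bz2 module releases
    the GIL while compressing, so chunks can be compressed concurrently.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the results in the same order as the input chunks.
        return list(pool.map(bz2.compress, chunks))


def salvage(streams):
    """Decompress every stream on its own; a damaged stream yields None."""
    recovered = []
    for stream in streams:
        try:
            recovered.append(bz2.decompress(stream))
        except (OSError, ValueError):
            # Corrupt or truncated stream: only this chunk is lost.
            recovered.append(None)
    return recovered
```

Joining the streams with `b"".join(streams)` gives a multi-stream bzip2 file that `bz2.decompress` (Python 3.3+) reads back as a whole. If one stream is damaged, `salvage` still returns every other chunk intact.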

    Compared to pbzip2, neither gzip nor 7z (LZMA) offers a quasi-linear speedup with the number of CPU cores.
    pigz, the parallel gzip, does parallelize compression, but gzip does not compress as well as bzip2, and pigz decompression is not parallel the way pbzip2 decompression is.
    7z is multi-threaded, but it tops out at 2 CPU cores (see the links below for tests).
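Compression ratios depend heavily on the data, so it is worth measuring them on your own files. Here is a rough sketch using only the Python standard library. The function name `compressed_sizes` is made up for this example, and Python's `lzma` module writes the .xz container rather than .7z, so the LZMA figure is only a proxy for what 7z would produce.

```python
import bz2
import gzip
import lzma


def compressed_sizes(data):
    """Return the size in bytes of `data` after each compressor."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "bzip2": len(bz2.compress(data, compresslevel=9)),
        # Default preset; LZMA in an .xz container, not a .7z archive.
        "lzma": len(lzma.compress(data)),
    }


if __name__ == "__main__":
    # Hypothetical sample; replace with bytes read from a real file.
    sample = b"@read1\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII\n" * 1000
    for name, size in compressed_sizes(sample).items():
        print(name, size)
```

This compares single-threaded compressed sizes only; it says nothing about speed or about how each tool scales across cores.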

    pbzip2 is also quite a good choice for FASTQ data files: even if a few blocks get lost due to data corruption, this should not noticeably affect the entire dataset.
    Specialized tools for FASTQ compression do exist (see e.g. this article, and the fastqz, fqzcomp, and samcomp project pages). I liked fastqz quite a bit, but I still have to examine data recoverability in the case of archive damage.

    It is possible to use external parity tools, such as the Linux par2 utility, which repair files using pre-calculated recovery files; these work for bzip2 archives and for any other files in general. However, adding a parity file may negate the benefit of a higher compression ratio. Also, if the archive has no independent block structure, an insufficient parity file may lead to the loss of the entire archive. In other words, this still has to be tested.
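To make the "insufficient parity" concern concrete, here is a toy sketch of parity-based repair. It uses a single XOR parity block, which is far simpler than what par2 actually does and is not par2's algorithm. The functions `xor_parity` and `recover` are hypothetical helpers for illustration only.

```python
def xor_parity(blocks):
    """Return one parity block: the byte-wise XOR of equally sized blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover(blocks, parity):
    """Rebuild at most one missing block (marked as None).

    With a single parity block, losing two or more blocks is unrecoverable.
    """
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("parity data insufficient: more than one block lost")
    repaired = list(blocks)
    if missing:
        present = [block for block in blocks if block is not None]
        # XOR of all surviving blocks and the parity yields the lost block.
        repaired[missing[0]] = xor_parity(present + [parity])
    return repaired
```

The trade-off is the same as with real parity files: one parity block costs the size of one data block and repairs exactly one loss. Surviving more damage takes more recovery data, which eats into the space saved by better compression.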

    Now the long-promised web-links!


