17th October 2013
Here comes a heap of assorted web-links!
I have personally settled on pbzip2 for these simple reasons:
- performance scales quasi-linearly with the number of CPU cores (until one hits an I/O bottleneck);
- when an archive is damaged, the only guaranteed loss is the damaged block(s), each holding 100-900 kB of uncompressed data – the remaining information may be salvageable.
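The block-independence argument can be illustrated with a minimal Python sketch. It is not pbzip2's actual code: it only assumes that each chunk is compressed as a separate, self-contained bzip2 stream, so that a damaged stream fails on its own. The names `compress_in_chunks`, `salvage`, and `flip_byte` are hypothetical helpers invented for this illustration, and the streams are kept in a list for simplicity; a real archive would first have to be split at block boundaries, which is the job of a tool such as bzip2recover.

```python
import bz2
from typing import List, Tuple


def compress_in_chunks(data: bytes, chunk_size: int = 900_000) -> List[bytes]:
    """Compress each chunk of `data` as a separate, self-contained bzip2 stream."""
    return [bz2.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def salvage(streams: List[bytes]) -> Tuple[bytes, int]:
    """Decompress every stream that is still intact and count the ones that are not."""
    recovered = []
    lost = 0
    for stream in streams:
        try:
            recovered.append(bz2.decompress(stream))
        except (OSError, ValueError, EOFError):
            # A damaged stream only costs us its own chunk.
            lost += 1
    return b"".join(recovered), lost


def flip_byte(stream: bytes, offset: int = 10) -> bytes:
    """Simulate corruption by inverting one byte (hypothetical helper)."""
    damaged = bytearray(stream)
    damaged[offset] ^= 0xFF
    return bytes(damaged)


if __name__ == "__main__":
    # Toy FASTQ-like payload, purely for illustration.
    data = b"@read\nACGTACGT\n+\nIIIIIIII\n" * 2000
    streams = compress_in_chunks(data, chunk_size=10_000)
    streams[1] = flip_byte(streams[1])  # damage exactly one chunk
    recovered, lost = salvage(streams)
    print(f"lost {lost} of {len(streams)} chunks, "
          f"recovered {len(recovered)} of {len(data)} bytes")
```

Joining the undamaged streams back together gives a valid multi-stream bzip2 file, which `bz2.decompress` reads in one call.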
Compared to pbzip2, neither gzip nor 7z (lzma) offers quasi-linear speedups proportional to the number of CPU cores.
pigz, the parallel gzip, does parallelize compression, but gzip does not compress as well as bzip2, and pigz decompression is not parallel, unlike in pbzip2.
7z is multi-threaded, but it tops out at using 2 CPU cores (see links below for tests).
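Why chunked bzip2 compression can scale with the number of cores is easy to sketch in Python. This is only an illustration of the general idea, not pbzip2's implementation: the input is cut into independent chunks, each chunk is compressed by a worker, and the resulting streams are concatenated in order. The function name `parallel_bz2` and its parameters are made up for this example, and it assumes that CPython's `bz2` module releases the GIL while compressing, so that threads can actually run in parallel.

```python
import bz2
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def parallel_bz2(data: bytes, chunk_size: int = 900_000,
                 workers: Optional[int] = None) -> bytes:
    """Compress independent chunks concurrently and concatenate the streams."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        # Empty input: still emit one valid (empty) bzip2 stream.
        return bz2.compress(b"")
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        # map() returns results in input order, so the chunks stay in sequence.
        compressed = list(pool.map(bz2.compress, chunks))
    return b"".join(compressed)


if __name__ == "__main__":
    payload = b"ACGT" * 500_000
    archive = parallel_bz2(payload, chunk_size=200_000)
    # Concatenated streams decompress as a single multi-stream file.
    assert bz2.decompress(archive) == payload
    print(f"{len(payload)} bytes -> {len(archive)} bytes")
```

Because no chunk depends on another, the work divides cleanly among workers until disk I/O becomes the limit; the price is a slightly larger file than a single stream would give.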
pbzip2 is also quite a good choice for FASTQ data files: even if a few blocks get lost due to data corruption, this should not noticeably affect the entire dataset.
Specialized tools for FASTQ compression do exist (see e.g. this article, also Fastqz, fqzcomp, and samcomp project pages). I think I liked fastqz quite a bit, but I still have to examine data recoverability in the case of archive damage. It is possible to use external parity tools that support file repair using pre-calculated recovery files, like the Linux par2 utility. These work for bzip2 archives and for any other files in general, but adding a parity file may negate the benefit of the higher compression ratio. Also, if the archive has no independent block structure, an insufficient parity file may lead to the loss of the entire archive. In other words, this still has to be tested.
Now the long-promised web-links!
- gzip vs bzip2: bzip2 is ~2.5x slower to compress, ~10x slower to decompress, and produces a 26% smaller file
- gzip, bzip2, lzma: bzip2 best on low-entropy data; gzip is the fastest; lzma is the slowest, and also not the best
- gzip, bzip, 7z: 7z decompresses faster than bzip2
- 7z, bzip2: 7z only uses up to 2 cores, bzip2 can use all; for the same compression time, 7z yields a smaller file – e.g. 7z fast 926 MB vs. bzip2 maximum 987 MB, both took 7 minutes
- 7z, gzip, compress, bzip2 (graphs): bzip2 is at least 2x faster than 7z, and up to 2x slower than gzip; bzip2 is the slowest at decompression, 7z is up to 2x better, while gzip is easily 6x faster
- bzip2, gzip: superuser answer provides a simple 5-axis ranking:
decompression speed (fast > slow): gzip, zip > 7z > rar > bzip2
compression speed (fast > slow): gzip, zip > bzip2 > 7z > rar
compression ratio (better > worse): 7z > rar, bzip2 > gzip > zip
availability (unix): gzip > bzip2 > zip > 7z > rar
availability (windows): zip > rar > 7z > gzip, bzip2
- Gzip vs Bzip2 vs LZMA vs XZ vs LZ4 vs LZO (with nice tables): bzip2 is up to 3x slower to decompress; lzma may use up to 16x bzip2's memory for decompression, which is 64 MB; bzip2 only needs up to 7.2 MB for compression; lzma needs up to 670 MB; fastest lzma is as fast as gzip, but produces smaller files; bzip2 compression time is nearly unaffected by -1..-9 settings; with the same running time, lzma-3 and bzip2-9 produced files of the same size
- lzop vs compress vs gzip vs bzip2 vs lzma vs lzma2/xz: xz-2 is comparable to bzip2-3 in file size, but is faster to compress and decompress; xz-0 is overall better than gzip-9
- gzip, bzip2: bzip2 is better at heterogeneous data
- gzip, bzip2, xz: bzip2 here is faster and smaller than gzip, while still slow to decompress