Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc

    The favourite file compressor: gzip, bzip2, or 7z?

    17th October 2013

    Here comes a heap of assorted web-links!

    I had personally settled on using pbzip2 for these simple reasons:

    • performance scales quasi-linearly with the number of CPU cores (until one hits an I/O bottleneck);
    • when an archive is damaged, you are only guaranteed to lose the damaged block(s) of size 100-900 KiB; the remaining information might be salvageable.

    Compared to pbzip2, neither gzip nor 7z (LZMA) offers quasi-linear speedups proportional to the number of CPU cores.
    pigz, the parallel gzip, does parallelize compression, but gzip does not compress as well as bzip2, and pigz decompression is not parallel the way it is in pbzip2.
    7z is multi-threaded, but it tops out at using 2 CPU cores (see links below for tests).
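The block-independence idea behind the two reasons above can be sketched in Python. This is only an illustration of the general approach (compress fixed-size blocks independently, then concatenate the resulting bzip2 streams), not pbzip2's actual code; the names `compress_parallel`, `decompress_all`, and `BLOCK_SIZE` are made up for this example, and threads are used purely for simplicity.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

# Hypothetical block size, picked from the 100-900 KiB range mentioned above.
BLOCK_SIZE = 900 * 1000


def compress_parallel(data: bytes, block_size: int = BLOCK_SIZE, workers: int = 4) -> bytes:
    """Compress each block as its own bzip2 stream and concatenate the streams."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so stream N corresponds to block N.
        streams = list(pool.map(bz2.compress, blocks))
    return b"".join(streams)


def decompress_all(archive: bytes) -> bytes:
    """Decompress an archive made of one or more concatenated bzip2 streams."""
    # bz2.decompress() handles multiple concatenated streams (Python 3.3+).
    return bz2.decompress(archive)
```

Because every block is a self-contained stream, a damaged block should in principle cost only that block's data, which is the property that makes partial recovery plausible.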

    pbzip2 is also quite a good choice for FASTQ data files: even if a few blocks get lost due to data corruption, this should not noticeably affect the entire dataset.
    Specialized tools for FASTQ compression do exist (see e.g. this article, and also the Fastqz, fqzcomp, and samcomp project pages). I think I liked fastqz quite a bit, but I still have to examine data recoverability in the case of archive damage. It is possible to use external parity tools which support file repair using pre-calculated recovery files, like the Linux par2 utility; these work for bzip2 archives and for any other files in general. However, adding a parity file may negate the benefits of a higher compression ratio. Also, if the archive has no independent block structure, an insufficient parity file may lead to the loss of the entire archive. In other words, this still has to be tested.

    Now the long-promised web-links!
    Read the rest of this entry »

    Posted in *nix, Links, Notepad, Software | 1 Comment »

    Debian: how to whitelist IP addresses in tumgreyspf

    7th August 2013

    SPF is nice for protecting your mail server from spam, but sometimes there is a need to bypass SPF checking. For example, if you rely on 3rd party servers to do spam protection for you :)

    Current setup:

    • MX records point to the spam protection mail servers, which then
    • connect to my server and deliver (hopefully spam-free) mail.

    Problem: some senders (like last.fm) do have proper, strict SPF records. Tumgreyspf on my server then rejects emails relayed through the spam-protection service, because the relay servers' IPs are not among the hosts authorized by the sender's SPF record.

    If these spam protection relay servers are the only ones that send mail to your server, then it makes sense to fully disable/uninstall tumgreyspf. Putting tumgreyspf into the permanent “learning mode” (set defaultSeedOnly = 1 in /etc/tumgreyspf/tumgreyspf.conf) may not fix the SPF problem described above, as SeedOnly seems to only affect greylisting, and not the rejection of unauthorized senders.

    Solution: whitelist relay server IPs.
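The logic of such a whitelist can be sketched in a few lines of Python. This is not tumgreyspf's configuration format or code; it only illustrates the decision "skip the SPF check when the connecting client is a known relay". The networks below are documentation-only address ranges standing in for the real relay IPs, and the function names are hypothetical.

```python
import ipaddress

# Hypothetical relay networks (reserved documentation ranges, not real relays).
RELAY_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.16/28"),
]


def is_whitelisted(client_ip: str, networks=RELAY_NETWORKS) -> bool:
    """Return True if the connecting client IP belongs to a trusted relay network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


def should_check_spf(client_ip: str) -> bool:
    """SPF is only meaningful for clients that are not trusted relays."""
    return not is_whitelisted(client_ip)
```

With these assumed ranges, `should_check_spf("192.0.2.25")` is false (trusted relay), while a client outside both networks would still go through the SPF check.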
    Read the rest of this entry »

    Posted in *nix, how-to, Software | No Comments »

    Pacific Rim: recommended

    22nd July 2013

    You don’t like FX-only movies, right? You can guess the plot from the trailer, and it then falls apart from the very first minutes – you just find it fairly hard to believe without first turning off your brain, either willfully, if you are well-trained in mental/Jedi techniques, or with the help of beer/other alcohol. Then, all you get are the special effects thrown in your face in the post-processing-added, ugly and eye-hurting 3D. There are a few “wow” moments, but that’s it – you leave the cinema with the mixed feelings of emptiness, lost time and money, and disappointment.

    If that sounds familiar – go watch Pacific Rim. Go with the above-described expectations. Do watch the HD trailer. Try not to read anything revealing the plot – that could be a serious spoiler. Just one hint: you may want to sit through the 3D credits (about 2 minutes) for one final movie scene. I’ve heard there’s also something after the non-3D credits, but wasn’t patient enough to verify that.

    Pacific Rim leaves a pleasant double-sided impression. Read the rest of this entry »

    Posted in Movies | 1 Comment »

    Graphs in Python

    13th July 2013

    Sooner or later, everyone has to deal with graphs. Some people have to do programming with graphs, and a subset of those do it in Python.

    NetworkX is a pure Python implementation, where anything can be a node. Both nodes and edges have attributes. NetworkX supports directed graphs and multigraphs (where there are multiple edges between nodes). It might be slower than other implementations, but you may not even notice that, especially when working with smaller graphs or when not applying computationally intensive algorithms to your graphs.

    graph-tool uses the Boost graph library (C++), so it should be really fast. It could be the only multi-threaded graph library for Python. It supports pickling the graphs, allows interactive graph drawing, and has well-illustrated documentation. If performance and efficiency are of utmost importance, it could be the best choice.

    igraph is also really fast: on 1 CPU it is on par with graph-tool, which only wins conclusively when it is run on multiple CPUs/cores. igraph's core is written in C, and it also has an R package.

    Pure Python is also an option for really small cases.
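For such small cases, a directed graph can be kept in a plain dict of sets. The sketch below uses only the standard library; `add_edge` and `bfs_order` are names made up for this example and do not come from any of the libraries above.

```python
from collections import deque


def add_edge(graph: dict, src, dst) -> None:
    """Add a directed edge src -> dst to an adjacency dict of sets."""
    graph.setdefault(src, set()).add(dst)
    graph.setdefault(dst, set())  # make sure the target node exists too


def bfs_order(graph: dict, start) -> list:
    """Return nodes reachable from start in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        # sorted() makes the visiting order deterministic.
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


graph = {}
for src, dst in [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]:
    add_edge(graph, src, dst)
print(bfs_order(graph, "a"))  # ['a', 'b', 'c', 'd']
```

Once node or edge attributes, multigraphs, or heavier algorithms enter the picture, one of the libraries above is probably the better fit.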

    Finally, there’s a discussion around Python Graph API to simplify the inter-changeability and inter-operability of various existing Python graph modules. It also has a list of some less-known Python graph libraries, so check it out.

    Posted in Programming, Python | No Comments »

    MultiParanoid vs. QuickParanoid: pro et contra for each

    9th July 2013

    MultiParanoid

    Here we present a new proteome-scale analysis program called MultiParanoid that can automatically find orthology relationships between proteins in multiple proteomes. The software is an extension of the InParanoid program that identifies orthologs and inparalogs in pairwise proteome comparisons. MultiParanoid applies a clustering algorithm to merge multiple pairwise ortholog groups from InParanoid into multi-species ortholog groups.

    QuickParanoid

    QuickParanoid is a suite of programs for automatic ortholog clustering and analysis. It takes as input a collection of files produced by InParanoid and finds ortholog clusters among multiple species. For a given dataset, QuickParanoid first preprocesses each InParanoid output file and then computes ortholog clusters. It also provides a couple of programs qa1 and qa2 for analyzing the result of ortholog clustering.
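Both descriptions boil down to merging pairwise ortholog groups into multi-species clusters. A naive way to picture that step is single-linkage merging with a union-find structure, sketched below. This is explicitly not the algorithm used by MultiParanoid or QuickParanoid; the function name and the `species|protein` identifiers are invented for illustration.

```python
def merge_groups(pairwise_groups):
    """Merge groups that share at least one protein (naive single linkage)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for group in pairwise_groups:
        members = list(group)
        for member in members:
            find(member)  # register singletons as well
        for other in members[1:]:
            union(members[0], other)

    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    # Sorted output keeps the result deterministic.
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical pairwise groups from three two-species comparisons.
groups = [
    ["human|A", "mouse|A"],
    ["mouse|A", "fly|A"],
    ["human|B", "fly|B"],
]
print(merge_groups(groups))
# [['fly|A', 'human|A', 'mouse|A'], ['fly|B', 'human|B']]
```

Real tools have to deal with conflicting or low-confidence pairwise assignments, which simple transitive merging like this ignores.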

    So… both use InParanoid… Are there any differences? Let me list those which I’ve found.

    Read the rest of this entry »

    Posted in *nix, Bioinformatics, Software | 2 Comments »

    Hands-on examination of Linux disk caching effects

    8th July 2013

    LinuxAteMyRAM :) (also as a PDF: Linux disk caching effects)

    To examine your Linux box's disk caching behavior under specific loads, see Linux write cache mystery (PDF).

    To understand what is going on, see also The Linux Page Cache and pdflush (PDF) by the same author, Gregory Smith.

    Another useful resource is OpenSUSE’s Tuning the Memory Management Subsystem, which nicely explains some of the kernel cache/memory-related configuration options.
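To watch the cache while running your own experiments, the relevant counters can be read from /proc/meminfo. The following is a minimal sketch, assuming a Linux system; `parse_meminfo` and `cache_summary` are hypothetical helper names, and the selection of fields is just one reasonable choice.

```python
import os


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            # Most entries are reported in kB; a few are plain counts.
            values[name.strip()] = int(fields[0])
    return values


def cache_summary(values: dict) -> dict:
    """Pick the fields most relevant to disk caching (None if absent)."""
    keys = ("MemFree", "Buffers", "Cached", "Dirty", "Writeback")
    return {key: values.get(key) for key in keys}


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        for key, value in cache_summary(parse_meminfo(handle.read())).items():
            print(f"{key}: {value}")
```

Running it before and after a large file copy should show "Cached" growing and "Dirty" rising and then falling as data is flushed to disk.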

    Posted in *nix, Software | No Comments »

    Bicycle trip Saarbrücken – Baden-Baden (and back)

    5th July 2013

    On Monday, the 24th of June 2013, at 05:15 in the morning we (my bicycle-maniac coworker and I) set off from Dudweiler, Saarbrücken towards Baden-Baden by bicycle.

    We thoroughly planned this trip:
    Read the rest of this entry »

    Posted in Life | No Comments »