Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc

    Archive for the 'Links' Category

    Interesting and relevant links I found.

    The favourite file compressor: gzip, bzip2, or 7z?

    17th October 2013

    Here comes a heap of assorted web-links!

    I had personally settled on using pbzip2 for these simple reasons:

    • performance scales quasi-linearly with the number of CPU cores (until one hits an I/O bottleneck);
    • when an archive is damaged, you are only guaranteed to lose the damaged block(s) of size 100-900 KiB – the remaining information might be salvageable.

    Compared to pbzip2, neither gzip nor 7z (lzma) offers quasi-linear speedups proportional to the number of CPU cores.
    pigz, the parallel gzip, does parallelize compression, but gzip does not compress as well as bzip2, and pigz decompression is not parallel the way it is in pbzip2.
    7z is multi-threaded, but it tops out at using 2 CPU cores (see links below for tests).
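
    To illustrate why independent blocks make both parallel compression and partial recovery possible, here is a minimal Python sketch. It is not pbzip2's actual code: the names `split_blocks` and `parallel_bz2_compress`, the block size, and the worker count are all made up for this example. It only assumes that each block is compressed as its own bzip2 stream and that the streams are concatenated, which the standard `bz2` module can decompress in one call.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

# Hypothetical block size for this sketch (bzip2 itself works on 100-900 KiB blocks).
BLOCK_SIZE = 900 * 1024


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut the input into fixed-size chunks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def parallel_bz2_compress(data, workers=4, block_size=BLOCK_SIZE):
    """Compress every chunk as an independent bzip2 stream and concatenate them.

    Threads are used for simplicity; a process pool is another option
    if threads do not give a speedup on a particular setup.
    """
    blocks = split_blocks(data, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the original block order.
        streams = list(pool.map(bz2.compress, blocks))
    return b"".join(streams)


if __name__ == "__main__":
    payload = b"ACGT" * 500_000
    packed = parallel_bz2_compress(payload, block_size=256 * 1024)
    # bz2.decompress() accepts several concatenated streams (Python 3.3+).
    assert bz2.decompress(packed) == payload
    print(len(payload), "->", len(packed), "bytes")
```

    Because every stream is self-contained, a damaged stream can in principle be skipped while the others are still decompressed, which is the property the list above relies on.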

    pbzip2 is also quite a good choice for FASTQ data files: even if a few blocks get lost due to data corruption, this should not noticeably affect the entire dataset.
    Specialized tools for FASTQ compression do exist (see e.g. this article, and also the Fastqz, fqzcomp, and samcomp project pages). I think I liked fastqz quite a bit, but I still have to examine data recoverability in the case of archive damage. It is possible to use external parity tools which support file repair using pre-calculated recovery files – like the linux par2 utility, which works for bzip2 archives and any other files in general – but adding a parity file may negate the benefit of a higher compression ratio. Also, if the archive has no independent block structure, an insufficient parity file may lead to the loss of the entire archive. In other words, this still has to be tested.
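
    The idea of a pre-calculated recovery file can be sketched with plain XOR parity. This is a deliberately simplified stand-in: par2 uses Reed-Solomon codes and can repair several damaged blocks, while the hypothetical helpers below (`make_parity`, `recover_block`) can rebuild only one lost block, and only if its position and original length are known.

```python
def xor_blocks(a, b):
    """XOR two equally long byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(blocks):
    """Build one parity block as the XOR of all data blocks (zero-padded)."""
    size = max(len(block) for block in blocks)
    parity = bytes(size)
    for block in blocks:
        parity = xor_blocks(parity, block.ljust(size, b"\0"))
    return parity


def recover_block(blocks, parity, missing_index, original_length):
    """Rebuild the block at missing_index from the parity and the other blocks."""
    recovered = parity
    for index, block in enumerate(blocks):
        if index != missing_index:
            recovered = xor_blocks(recovered, block.ljust(len(parity), b"\0"))
    # Drop the zero padding that was added when the parity was computed.
    return recovered[:original_length]


if __name__ == "__main__":
    blocks = [b"alpha", b"be", b"gamma!"]
    parity = make_parity(blocks)
    damaged = [blocks[0], None, blocks[2]]  # block 1 is lost
    print(recover_block(damaged, parity, 1, 2))
```

    The parity block here is as large as the largest data block, which shows the trade-off mentioned above: the more damage you want to survive, the more recovery data you have to store next to the archive.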

    Now the long-promised web-links!

    Posted in *nix, Links, Notepad, Software | 1 Comment »

    Free private git repository hosting

    29th August 2012

    GitHub is awesome and still improving, but sometimes I’d prefer to have some of my repositories hidden from the eyes of the public – not so much because of the code value (though that is also important sometimes), but rather because those repositories are all “work in progress” or “short-lived” and may at times hold so much junk that it would simply be too embarrassing to publish this untidiness.

    Previously, I used gitosis to set up git repository hosting on my server. I’m still using it for long-lived projects, but I’m now lazy enough to dislike the steps needed to set up a new repo (and I’m creating more and more new repos, some of which are likely to die very young). Some kind of GUI would help, but gitweb seems not that useful to me (here’s how to make it work with gitosis, and another recipe, or maybe just try gitosis-web or gitosis-web-admin).

    Another downside is that gitosis is no longer actively maintained and was even removed from the Ubuntu repositories. The suggested course of action for gitosis users is to migrate to gitolite. However, the basic design of gitolite is the same, so personally (looking for something easier to use) I see only minor gains in this migration (which I’ll have to perform anyway sooner or later).

    Another interesting self-hosted option is girocco. Too bad I have absolutely no experience with http://repo.or.cz/, so it’s hard to tell if girocco is convenient to use or not… Comments are welcome.

    Using Dropbox for git repositories (also here) seems a nice and fairly easy option, with only a few downsides: it’ll eat your Dropbox space (which is still much more than you get from free git hosters), and it isn’t that easy in a multi-user environment. Also, you will have to set up Dropbox on the headless servers where you may want to run your code, which isn’t exactly what I’d want to do. The same arguments apply to git on Google Drive.

    An alternative to various self-hosted systems would be to use an existing system with free private projects. The Git wiki has a list of hosts to start with.

    Here’s a brief summary of the options I’ve found relatively attractive (see below for my experience with 3 of the listed services). (See also this recent comparison.)

    Providers \ Features   Repositories   Users       Space       Paid plans?
    BitBucket              Unlimited      5           Unlimited   +
    Assembla               Unlimited      Unlimited   1 GB        +
    GIT Enterprise         Unlimited      10          1 GB        +
    ProjectLocker          1              2           0.2 GB      +

    Initially, I found GIT Enterprise and Assembla to be the most attractive options to try. After trying both, I found Assembla faster and generally more attractive to work with. It wasn’t immediately obvious how to create more than one source repository, but after figuring that out everything went smoothly.

    However, after trying BitBucket, I immediately switched all my Assembla repositories to it :) BitBucket is just like GitHub, but with free private repositories. It also has an issue tracker and a wiki. It even allows small teams to work on private repositories!

    Posted in *nix, Links, Software | 1 Comment »

    R functions for regression analysis cheat sheet

    29th May 2012

    Original PDF.
    My local copy.

    Posted in Bioinformatics, Links, Misc | No Comments »

    The genetics of orchids and dandelions

    1st May 2012

    Quite an interesting article on the genetics of behavior.

    Posted in Links, Misc | No Comments »

    Beanstalkd and related tools for easy parallelizing and backgrounding

    18th February 2012

    beanstalkd: a simple, fast work queue.
    Jack and the Beanstalkd: a web-app for basic work queue administration.
    beanstalkc: a simple beanstalkd client library for Python.
    queueit: a CLI interface tool which helps to integrate beanstalkd into shell scripts.

    Posted in Links, Programming, Python, Software | No Comments »

    Megahack of Stratfor

    9th January 2012

    If you haven’t heard yet – stratfor.com was hacked in December 2011, leaking full information about 75k credit cards (including owners’ addresses and CVV codes) and 860k (right, almost a million) user accounts. All Stratfor email archives were also reportedly stolen (around 160-200 GB of data), but those were not made publicly available on the internet – unlike the credit card and user account information, which is still relatively easy to find and download.

    I do not really recollect anything that large. Well, not counting Dropbox’s 4-hour window of “any password fits all accounts”, but that was different.

    Here are some of the news items about this seriously large hacking incident:

    Here come more technical reports:

    TheTechGerald’s analysis linked to above got my attention. Unfortunately, a while ago I had subscribed to stratfor’s “free intelligence mailing list”, and was wondering if my account information was now publicly available. I was most worried about the password I had used to subscribe, because of the risk of having used the same password somewhere else.

    Unlike TheTechGerald, I haven’t used any dictionaries – just the default configuration of a well-known tool for finding weak passwords. Within a single hour, ~100k password hashes were cracked (~12% of all). By the end of the day, ~50k more were cracked (totalling 17.4% of 860k). At this point my password was still safe, and I had found a way to verify that it was not used anywhere else, so I aborted further cracking.

    There are a few simple conclusions:

    • anybody who had a stratfor account must verify that he/she isn’t using that password anywhere else, because if 1 PC can crack 17% of all the passwords in less than a day, it is only a matter of time until all the leaked passwords are cracked and made publicly available in various “md5 decryption databases”
    • system owners should run periodic screenings for weak passwords (and implement policies to prevent creating obviously weak passwords from the very beginning)
    • md5 hashes are very fast to bruteforce – a much slower hashing function wouldn’t hurt; also, using a more complex hashing approach, maybe even with a closed-source shared library, could help
    • single-factor authentication (password-based) is likely to get replaced with 2-factor authentication in the near future
    • one may enjoy increased personal data safety by using throw-away passwords in conjunction with antispam mailboxes like spam.la and mailinator.com (at least 1600 users – 0.186% – did use these services).
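
    As an illustration of the “much slower hashing function” point, here is a minimal Python sketch that contrasts a single unsalted md5 call with salted PBKDF2 from the standard library. The function names and the iteration count are my own choices for this example, not anything Stratfor or the tools mentioned above used; PBKDF2 is just one of several deliberately slow schemes.

```python
import hashlib
import hmac
import os


def fast_md5(password):
    """Single unsalted md5: cheap to compute, hence cheap to bruteforce."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def hash_password(password, salt=None, iterations=200_000):
    """Salted PBKDF2-HMAC-SHA256; the iteration count is an assumed value."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, iterations, digest


def verify_password(password, salt, iterations, expected):
    """Recompute the hash with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return hmac.compare_digest(candidate, expected)


if __name__ == "__main__":
    salt, iterations, digest = hash_password("correct horse")
    print(verify_password("correct horse", salt, iterations, digest))
    print(verify_password("wrong guess", salt, iterations, digest))
```

    Every guess against the PBKDF2 hash costs the attacker the full number of iterations, and the per-user salt means that identical passwords no longer produce identical hashes, so precomputed lookup databases stop being useful.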

    Posted in Links, Misc, Security, Software, Web | No Comments »

    Good advice: /bin/false is not security

    1st October 2011

    SSH Security and You – /bin/false is *not* security.

    Posted in *nix, Links, Security | No Comments »