19th October 2013
These days, whenever I find myself retyping the same sets and sequences of commands, I try to wrap them up into a script, each time choosing the most appropriate language – shell when I need to start lots of existing command-line tools, Python when there’s some data handling and processing involved, and R when I’m invoking commands from R packages. So far I have been avoiding the fairly popular makefile-based approach to automating pipelines and workflows that rely heavily on existing tools. However, being curious, I’ve compiled a short list of modern make-like alternatives to possibly explore… sometime later…
- First comes make itself – the oldest and the most widely used software build tool. Stable and powerful. Still, even people who are used to make have some gripes about it. The most detailed list of gripes is probably here.
- SCons is a build tool written in Python. I guess I like that “configuration files are Python scripts” – maybe knowing Python is enough to use SCons, which makes SCons a better choice than make for me. SCons seems to have gained some support (scroll down for comments/discussion). There were some doubts about SCons performance (1, 2, and 3); not sure where SCons is at right now in that regard.
- waf, a Python-based framework for configuring, compiling and installing applications.
- pyDoIt is a Python automation tool. It seems to use Python syntax. It aims at bringing the power of build-tools to execute any kind of task, where a task describes some computation to be done (actions), and contains some extra meta-data. Based on the description alone, I’m quite intrigued! I wonder if anyone has already worked with pyDoIt and can share their experience?
- Rake – Ruby make – is a simple build program with capabilities similar to those of make. I have seen a lot of positive feedback about this one – mostly regarding simplicity of use. Still, [py]DoIt so far looks more attractive to me personally.
- Ruffus is a lightweight Python module for running computational pipelines. Sounds like some good competition to [py]DoIt!
- Anduril is an open source component-based workflow framework for scientific data analysis. Sounds promising, though the latest downloadable version is over 400 MB… It probably already contains a bunch of binaries and maybe even data and complete workflows for data analysis. Probably worth a look, but may turn out a little overweight for simple pipelining.
- snakemake is a scalable bioinformatics workflow engine. I get the feeling that Python is truly dominating the pipelines/workflows world: snakemake, as even the name suggests, is in Python, too. The front-page example is so simple and clear that snakemake immediately pushes DoIt down from first place! Awesome.
- Paver is yet another Python-based software project scripting tool along the lines of Make or Rake, designed to help out with repetitive tasks with the convenience of Python’s syntax. Sounds similar to DoIt. I have no idea how they actually compare to each other.
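What all of these tools add over a plain script is, roughly, that a step is re-run only when its output is missing or older than its inputs. A minimal shell sketch of that idea follows; the function names needs_rebuild and run_step are hypothetical, and the "sort a file" step is only a stand-in for a real pipeline command.

```shell
# needs_rebuild TARGET SOURCE...
# Succeeds (exit 0) when TARGET is missing or older than any SOURCE,
# i.e. when a make-like tool would re-run the step.
needs_rebuild() {
    target=$1
    shift
    [ -e "$target" ] || return 0
    for src in "$@"; do
        # -nt ("newer than") is supported by bash, dash and ksh
        [ "$src" -nt "$target" ] && return 0
    done
    return 1
}

# run_step INPUT OUTPUT
# Hypothetical pipeline step: sort INPUT into OUTPUT only when needed.
run_step() {
    if needs_rebuild "$2" "$1"; then
        sort "$1" > "$2"
        echo "rebuilt $2"
    else
        echo "$2 is up to date"
    fi
}
```

Calling run_step twice in a row on the same files does the work once; the second call only reports that the output is up to date. The real tools in the list layer dependency graphs, parallelism and nicer syntax on top of this check.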
That is it for now.
What were your experiences with automating repetitive tasks and building simple pipelines?
Posted in *nix, Notepad, Programming, Software | No Comments »
18th October 2013
The usual, or even classical, way is to create the list of installed packages with dpkg --get-selections | awk '$2 == "install" {print $1}' > package_list (dpkg --get-selections prints a state column such as install or deinstall next to each package name, so only the names are kept), and then restore when/if necessary with cat package_list | xargs sudo apt-get -y install.
As VihangD points out in his serverfault answer, the same can be achieved with aptitude, while also excluding dependent, automatically installed packages (which are included by the classical method). To create the list of packages, run aptitude search -F '%p' '~i!~M' > package_list. Here, -F '%p' asks aptitude to only print package names (instead of the default output, which also contains package state and description); the search term '~i!~M' asks for all non-automatically installed packages.
To install packages using the created list, run xargs aptitude --schedule-only install < package_list; aptitude install. The first of these two commands instructs aptitude to mark all the packages from the list as scheduled for installation. The second command actually performs the installation.
Hamish Downer suggests an alternative way of getting the initial package_list: using the deborphan utility, deborphan -a --no-show-section > package_list. This command asks deborphan to list the packages that no other package depends on. This sounds very similar to what we did with aptitude above, but using deborphan will most likely result in a much shorter list of packages (on my system, deborphan printed 291 package names, aptitude printed 847, and dpkg printed 3650 package names). One more potentially important difference between aptitude- and deborphan-produced package lists is that aptitude only specifies the package architecture when it is different from native (e.g. 'googleearth:i386' on a 64-bit system), while deborphan specifies architectures for all the packages (resulting in e.g. 'google-talkplugin:amd64' and 'googleearth-package:all' on a 64-bit system).
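Since the three methods produce lists of very different sizes, it can help to see exactly which names one list contains and another does not. Below is a minimal sketch using sort and comm; the function name only_in_first and the file names in the usage line are hypothetical. It strips the ':arch' suffix first, on the assumption that the architecture difference described above should be ignored when comparing.

```shell
# only_in_first LIST_A LIST_B
# Print package names that appear in LIST_A but not in LIST_B.
# Both files hold one package name per line, in any order; an optional
# ":architecture" suffix (e.g. ":amd64") is removed before comparing.
only_in_first() {
    a_sorted=$(mktemp)
    b_sorted=$(mktemp)
    sed 's/:.*$//' "$1" | sort -u > "$a_sorted"
    sed 's/:.*$//' "$2" | sort -u > "$b_sorted"
    # comm -23 keeps only the lines unique to the first sorted input
    comm -23 "$a_sorted" "$b_sorted"
    rm -f "$a_sorted" "$b_sorted"
}

# Usage (file names are placeholders):
#   only_in_first aptitude_list deborphan_list
```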
Posted in *nix, how-to, Notepad | 2 Comments »
17th October 2013
I’ve tried [briefly] Cantor (which also supports Octave and KAlgebra as backends), rkward, deducer/JGR, R Commander, and RStudio.
My personal choice was RStudio: it is good-looking, intuitive, and easy to use, yet powerful.
The next step would be using some R equivalent of IPython’s excellent Mathematica-like Notebook web interface…
Posted in *nix, Notepad, Programming, Science, Software | No Comments »
17th October 2013
In one of the previous posts I’ve mentioned that BitBucket is über-cool.
Redmine is also really cool, and is actually more feature-rich than what BitBucket has to offer, but maintaining it needs just a tiny bit more time and attention than I’m willing to spend these days. So, migration it is!
Redmine has issue 3647 titled “Data import/export system”; it is not resolved, but has a number of links to other resources, like the Redmine exporter at hostedredmine.com, which provides a free hosted Redmine service. Redmine itself has a REST API, though I have no idea if it allows exporting all the data I may need. There’s also an XLS export plugin, but it has to be installed first, and I’m too lazy. There’s also TaskAdapter, but they do not support BitBucket (yet?).
For the complete backup, I’m thinking of using the pure-Ruby redmine project data export script. To migrate issues only, I’ll consider the redmine2bitbucket script.
P.S. Not implying anything (yet?), but my previous migration was from Trac to Redmine… At that time, Trac seemed to have fewer features than I wanted. And now I’m migrating back to “fewer features”, but with the benefit of no support required from me.
Posted in Links, Notepad, Programming | No Comments »
17th October 2013
Note: this is a draft post back from 2010. As it is still useful to me, I’ve decided to publish it as is.
I had already mused on the powers of rsync before.
This time, a reminder to self on how to resume copying broken scp/mc/fish transfers, using rsync.
First, an assortment of example commands.
 export RSYNC_RSH=ssh
 rsync --partial file_to_transfer user@remotehost:/path/remote_file
 rsync -av --partial --progress --inplace SRC DST
 rsync --partial --progress --rsh=ssh host:/work/source.tar.bz2 .
 rsync --partial --progress --rsh=ssh -r me@host.com:/datafiles/ ./
One could also try the --append option of rsync to base the transfer resumption on the sizes of the two files rather than verifying that their contents match.
Now a single command line explained in a little more detail:
 rsync -vrPtz -e ssh host:/remote_path/* /local_path/
 Explained:
 -e ssh rsync will use ssh client instead of rsh, which makes data exchange encrypted
 -z compress file transfer
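 -t preserve modification times (other attributes, such as owner and permissions, can also be preserved with further options)
 -P same as --partial --progress: keep partially transferred files, so that an interrupted transfer can be resumed, and show progress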
 -r recurse into subdirectories
 -v verbose
To specify a port when using ssh you must add it to the ssh command.
 Example: rsync --partial --progress --rsh="ssh -p 16703" user@host:path ./
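On a flaky connection, the resume commands above can be wrapped in a small retry loop, so that rsync is restarted until it finishes. This is a sketch only: the function name retry and the RETRY_DELAY variable are hypothetical helpers rather than anything rsync provides, and the host and path in the usage line are placeholders.

```shell
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Re-run COMMAND until it exits with status 0 or MAX_ATTEMPTS is reached.
# Waits RETRY_DELAY seconds (default 5) between attempts.
retry() {
    max=$1
    shift
    attempt=1
    while :; do
        "$@" && return 0
        [ "$attempt" -ge "$max" ] && return 1
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"
    done
}

# Usage (not run here):
#   retry 10 rsync --partial --progress --rsh=ssh host:/work/source.tar.bz2 .
```

Because --partial keeps the partially transferred file, each new attempt can pick up from what has already arrived instead of starting from scratch.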
Posted in Notepad | No Comments »
17th October 2013
Here comes a heap of assorted web-links!
I had personally settled on using pbzip2 for these simple reasons:
- performance scales quasi-linearly with the number of CPU cores (until one hits an I/O bottleneck);
- when an archive is damaged, you are only guaranteed to lose the damaged block(s) of size 100-900 kB – the remaining information might be salvageable.
Compared to pbzip2, neither gzip nor 7z (lzma) offers quasi-linear speedups proportional to the number of CPU cores.
 pigz, the parallel gzip, does parallelize compression, but gzip does not compress as well as bzip2, and decompression is not parallel as it is in pbzip2.
 7z is multi-threaded, but it tops out at using 2 CPU cores (see links below for tests).
pbzip2 is also quite a good choice for FASTQ data files: even if a few blocks get lost due to data corruption, this should not noticeably affect the entire dataset.
 Specialized tools for FASTQ compression do exist (see e.g. this article, and the Fastqz, fqzcomp, and samcomp project pages). I think I liked fastqz quite a bit, but I still have to examine data recoverability in the case of archive damage. It is possible to use external parity tools, which support file repair using pre-calculated recovery files – like the Linux par2 utility, which works for bzip2 archives and any other files in general – but adding a parity file may negate the benefit of a higher compression ratio. Also, if the archive has no independent block structure, an insufficient parity file may lead to the loss of the entire archive. In other words, this still has to be tested.
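For reference, a minimal sketch of the pbzip2 workflow discussed above: compress with several threads, keep the original, and test the archive afterwards. The function names compress_blocks and verify_archive are hypothetical, and the fallback to plain bzip2 is only there so that the sketch also runs on machines without pbzip2.

```shell
# compress_blocks FILE [THREADS]
# Compress FILE to FILE.bz2 in independent bzip2 blocks, keeping the original.
# Uses pbzip2 when it is installed and falls back to plain bzip2 otherwise.
compress_blocks() {
    file=$1
    threads=${2:-2}
    if command -v pbzip2 >/dev/null 2>&1; then
        # -p: number of processors, -9: 900k blocks, -k: keep the input file
        pbzip2 -p"$threads" -9 -k "$file"
    else
        bzip2 -9 -k "$file"
    fi
}

# verify_archive FILE.bz2
# Test archive integrity without extracting it.
verify_archive() {
    bzip2 -t "$1"
}
```

If the integrity test fails, bzip2recover (shipped with bzip2) can be pointed at the damaged archive to extract the blocks that are still readable into separate files.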
Now the long-promised web-links!
 Read the rest of this entry »
Posted in *nix, Links, Notepad, Software | 1 Comment »
6th October 2011
For some reason, I believed that the MyISAM storage engine should be very fast – faster than InnoDB and Postgres. After all, MyISAM does not support transactions, has no logging, and is overall simpler than “true” storage engines/databases.
I was surprised to find out that this isn’t true, at least for the specific (simple!) query I’m interested in:
- SELECT primary_id FROM tablename WHERE indexed_varchar = %s AND intcol1 <= %d AND intcol2 > %d
 Read the rest of this entry »
Posted in Notepad | No Comments »