Autarchy of the Private Cave » Science

The sugar conspiracy

Bogdan — Sun, 19 Jun 2016 10:27:14 +0000

A long but interesting read: The Sugar Conspiracy.

Practical comparison of NGS adapter trimming tools

Bogdan — Wed, 01 Jun 2016 19:23:14 +0000

I used to work with sequencing providers who were giving me fairly clean data.
It was already barcode-separated, and had no over-represented adapter sequences.
The only thing I had to do was to (optionally) quality-trim the reads, and check for biological contamination.

Recently, however, I have come across some real-world data, which not only had contamination in it, but also quite a noticeable percentage of adapters.
I did a quick test of multiple tools to see if they fit my requirements:

should be easy/logical to use: no arcane/convoluted command lines or config files
should detect adapters automatically, either using its own database or a provided plain FASTA file
should be reasonably fast
must leave no adapter traces behind: I prefer aggressive trimming

I have tried the following tools:

fastq-mcf from the ea-utils package
skewer
TrimmomaticPE
cutadapt: haven’t used it directly, but it is used by some of the compared tools
bbduk from BBMap
autoadapt
TrimGalore!

As input, I have used 2 FASTQ files, each about 8.4 gigabytes
(or 3 785 687 KBytes together in 2 bzip2-compressed files, or 129 753 452 lines / 32 438 363 reads per file).
Time was measured with bash’s built-in time.
all_adapters.txt is a plain FASTA file I took from the FastQC distribution a long while ago,
possibly with some more adapter sequences scavenged from the internet added to it.
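The line and read counts quoted above are related by the FASTQ layout: assuming the usual unwrapped format with four lines per record, the read count is the line count divided by four. A minimal sketch of that arithmetic (the function name is made up for this illustration):

```python
def reads_from_line_count(line_count: int) -> int:
    """Convert a FASTQ line count into a read count.

    Assumes unwrapped FASTQ, i.e. exactly 4 lines per record.
    """
    if line_count % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return line_count // 4


# Line count per file, as quoted in the post
print(reads_from_line_count(129_753_452))  # 32438363
```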

fastq-mcf (ea-utils)
fastq-mcf ~/bin/all_adapters.txt -o R1.clip.fastq -o R2.clip.fastq input_R1.fastq input_R2.fastq

non-obvious way to specify 2 outputs for 2 inputs, but not complicated either
can be given a file with dozens of adapters: will auto-identify which adapters to trim
single-threaded, uses 315M RES and 380M VIRT
83.5 minutes on a loaded system

Reads too short after clip: 137 684
Clipped ‘end’ reads (input_R1.fastq): Count 895 775, Mean: 24.36, Sd: 17.32
Trimmed 2 072 551 reads (input_R1.fastq) by an average of 4.46 bases on quality < 7
Clipped 'end' reads (input_R2.fastq): Count 850 718, Mean: 25.70, Sd: 17.19
Trimmed 8 729 083 reads (input_R2.fastq) by an average of 4.44 bases on quality < 7

skewer
skewer -x ~/bin/all_adapters.txt --mode pe --threads 8 input_R1.fastq input_R2.fastq

looks much fancier: uses colors and has a text-mode progress bar
is multi-threaded, but appears to be extremely slow, much slower than single-threaded fastq-mcf (update: it is incredibly fast if, instead of 96 adapters, you give it just 3 or so)
can read up to 96 adapters from the file… should be fine for most purposes
uses very little RAM (~4 megabytes RES, ~450M VIRT)
really slow: real 177m52.933s , user 1212m3.644s (7 threads)

32 438 363 read pairs processed; of these:
12 339 ( 0.04%) short read pairs filtered out after trimming by size control
94 409 ( 0.29%) empty read pairs filtered out after trimming by size control
32 331 615 (99.67%) read pairs available; of these:
934 379 ( 2.89%) trimmed read pairs available after processing
31 397 236 (97.11%) untrimmed read pairs available after processing

TrimmomaticPE
TrimmomaticPE -threads 8 -trimlog trimmomatic.log input_R1.fastq.bz2 input_R2.fastq.bz2 lane1_forward_paired.fq.gz lane1_forward_unpaired.fq.gz lane1_reverse_paired.fq.gz lane1_reverse_unpaired.fq.gz ILLUMINACLIP:/usr/share/trimmomatic/TruSeq3-PE-2.fa:2:40:15

failed to start, with an uninformative error message, until the seemingly optional arguments to ILLUMINACLIP (the :2:40:15 part above) were supplied
uses 1.5+GB RES, 7.8GB VIRT, and does not fully utilize all 8 threads (CPU load only at around 500%, where 100% means 1 core)
does not seem to be I/O bound, but log file is huge: contains all read identifiers
- it might be better to disable log file (do not specify -trimlog) for higher I/O speed
comes bundled with some adapters already, but:
- does not detect adapters itself: you have to know which file to choose
- adapter files are structured in a way preventing merging them into a single file: adapter names have special meaning to Trimmomatic
real 19m39.431s, user 71m8.600s , sys 23m44.556s: much faster than either skewer or fastq-mcf

Input Read Pairs: 32438363
Both Surviving: 31591307 (97.39%)
Forward Only Surviving: 750772 (2.31%)
Reverse Only Surviving: 8023 (0.02%)
Dropped: 88261 (0.27%)
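As a rough cross-check of the timings quoted above for skewer and Trimmomatic, the values reported by bash's time can be turned into an average number of busy cores: (user + sys) / real. This is only a sketch; the function names are invented here, and skewer's sys time is not quoted in the post, so it is treated as zero.

```python
import re


def to_seconds(stamp: str) -> float:
    """Parse a bash `time` value such as '177m52.933s' into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def busy_cores(real: str, user: str, sys: str = "0m0s") -> float:
    """Average number of cores kept busy: (user + sys) / real."""
    return (to_seconds(user) + to_seconds(sys)) / to_seconds(real)


# skewer: sys time not quoted in the post, assumed to be 0 here
print(round(busy_cores("177m52.933s", "1212m3.644s"), 1))
# Trimmomatic: real, user and sys as quoted in the post
print(round(busy_cores("19m39.431s", "71m8.600s", "23m44.556s"), 1))
```

The results come out close to the "7 threads" noted for skewer and the "around 500%" CPU load noted for Trimmomatic.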

NOT trying cutadapt:

looks great based on reading the manual
only accepts adapters on the command-line, and does not come with adapter files to use
is written in Python/Python 3, so it could be easier to reuse from Python programs

BBMAP
bbduk.sh in=input_R1.fastq.bz2 in2=input_R2.fastq.bz2 out=bbduk_clean_1.fastq out2=bbduk_clean_2.fastq ref=~/bin/all_adapters.txt

refused to load some JNI library:
Error: Could not find or load main class utilities.bbmap.jni.
changed into bbmap/jni and ran export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 ; make -f makefile.linux, but this didn’t help
failed to run

autoadapt (relies on FastQC and cutadapt)
autoadapt.pl --threads=8 input_R1.fastq autoadapt_clean_1.fastq input_R2.fastq autoadapt_clean_2.fastq

first runs FastQC to a temporary file (0.5GB RES, 4.8GB VIRT)
- fastqc is started with --threads 8, but only 1 file is fed to fastqc…
auto-detected adapters, from FastQC’s output:
Detected the following known contaminant sequences:
Illumina Single End PCR Primer 1 (AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT)
TruSeq Adapter, Index 7 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG)
used over 15 GB RAM! + swap!
this is too much, killed and re-starting with 1 thread

uses cutadapt (<8M RES, <31M VIRT), looking for adapters anywhere (and not only at 3' like TrimGalore does); here's the generated command sample:

cutadapt --format fastq --match-read-wildcards --times 2 --error-rate 0.2
--minimum-length 18 --quality-cutoff 20 --quality-base 33
--anywhere=GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
--anywhere=CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC
--anywhere=AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
--anywhere=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
--paired-output autoadapt/autoadapt.tmp.f_zxQr95/autoadapt_R2.fastq.tmp
-o autoadapt/autoadapt.tmp.f_zxQr95/autoadapt_R1.fastq.tmp
input_R1.fastq input_R2.fastq && cutadapt --format fastq --match-read-wildcards
--times 2 --error-rate 0.2 --minimum-length 18 --quality-cutoff 20 --quality-base 33
--anywhere=GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
--anywhere=CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC
--anywhere=AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
--anywhere=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
--paired-output autoadapt_R1.fastq -o autoadapt_R2.fastq
autoadapt/autoadapt.tmp.f_zxQr95/autoadapt_R2.fastq.tmp
autoadapt/autoadapt.tmp.f_zxQr95/autoadapt_R1.fastq.tmp

uses its own directory for intermediate/temporary files, then moves to destination – not good…
- the problem is that program’s partition may not have enough space for all the intermediate data
- actually, cutadapt is run twice:
  - first to the temporary directory
  - then to the final destination, using temporary/intermediate files as inputs
ran out of space in /home… created a copy of autoadapt on the ~/data volume
/usr/bin/time -f '%C: %e s, %M Kb' ~/data/autoadapt-tmp-copy/autoadapt.pl --threads=1 input_R1.fastq autoadapt_R1.fastq input_R2.fastq autoadapt_R2.fastq
over 1h CPU time already, and still about half-done… should try with --threads=2 or 4, maybe RAM use will be somewhat better?
total time 9979.87 seconds (2.8 hours), max RAM 235 480 Kb
trying in 4 threads: again 15+ Gb RAM and 7+Gb swap, killed at this point;
- the problem seems to be somewhere in the read splitting code – apparently, it keeps reads in RAM (???) while splitting…
- looking at the split files: they are all partial, so autoadapt.pl somehow attempts to parallel-split into all thread segments at once
trying to edit splitFile() function to use GNU split command; hopefully, mergeFile() does not use gigabytes of RAM…
- for testing: hard-code tmp dir name; skip actual fastqc
- this now works great! let’s wait for merging…
- mergeFile() still eats ~2.5Gb of RES
because of all the splitting, temporary directory size easily jumps to about 3x the original file size, or ~48 GB for ~16 GB of input files
3007.68 s (50 minutes – this does not include the initial FastQC run), 2 523 952 Kb (this is mostly the file merging operation)
it does not show any stats at the end

Trim Galore!
trim_galore --fastqc --path-to-cutadapt /usr/bin/cutadapt3 --paired input_R1.fastq input_R2.fastq

the trim_galore perl wrapper itself consumes just a few megabytes of RAM
uses cutadapt for actual work
auto-detects adapters, although somehow the Illumina adapter found is only a substring of what was found by autoadapt/FastQC…
Found perfect matches for the following adapter sequences:
Adapter type   Count   Sequence        Sequences analysed   Percentage
Illumina       17429   AGATCGGAAGAGC   1000000              1.74
Nextera        0       CTGTCTCTTATA    1000000              0.00
smallRNA       0       TGGAATTCTCGG    1000000              0.00
Using Illumina adapter for trimming (count: 17429). Second best hit was Nextera (count: 0)
can run FastQC itself on the processed data, if so instructed by a command-line option
trims and summarizes each file separately

Total reads processed: 32,438,363
Reads with adapters: 6,878,225 (21.2%)
Reads written (passing filters): 32,438,363 (100.0%)
Total basepairs processed: 3,276,274,663 bp
Quality-trimmed: 11,132,367 bp (0.3%)
Total written (filtered): 3,226,980,229 bp (98.5%)

cutadapt processes about 4 million reads/minute on my work PC (an i7)

Total reads processed: 32,438,363
Reads with adapters: 6,030,241 (18.6%)
Reads written (passing filters): 32,438,363 (100.0%)
Total basepairs processed: 3,276,274,663 bp
Quality-trimmed: 40,530,133 bp (1.2%)
Total written (filtered): 3,199,297,597 bp (97.7%)

length is checked after cutadapt:
Number of sequence pairs removed because at least one read was shorter than the length cutoff (20 bp): 145312 (0.45%)
1955.49 s (32.6 minutes), 228 592 Kb (this is likely FastQC’s top RAM use)

How do I evaluate the quality of trimming?
Notably, after every trimmer the “Adapters detected” section was gone from FastQC’s output.
For now, I’m simply choosing the smallest pair of processed read files
(under the assumption that the smallest is the most aggressively trimmed).

File sizes after trimming, R1+R2
16’750’631 trimmomatic
16’770’603 autoadapt, threads=1
16’771’639 autoadapt, threads=8 // after swapping Perl splitter function for GNU split
16’924’934 trimGalore
16’963’937 fastq-mcf
17’057’065 skewer

Looking at FastQC plots, major differences can be seen in the read length distribution (which depends on how much of the sequence tail/head was trimmed),
per-tile quality (trimmomatic and skewer do not perform any kind of quality trimming by default, others do), and k-mer content.
For k-mer content, trimmomatic, trimGalore, and skewer look the most natural: there is a background of random-looking lesser spikes (up to 2-4),
and one or two bigger spikes (up to 12). For other tools (autoadapt, fastq-mcf) k-mer content looks like a flat line (but likely also 2-4)
with several huge spikes (up to 35-40). In fact, only autoadapt, trimgalore, and skewer got a “warning” on k-mer content – all others got an “error”.

Overall, Trimmomatic and trimGalore appear to be the two best adapter trimmers, both by aggressiveness+FastQC reports and by speed.
However, trimGalore detected a significantly shorter adapter, and Trimmomatic produced a smaller, more aggressively trimmed file.
On the downside, Trimmomatic does not auto-detect adapters! This can be alleviated by first running FastQC on the input files,
then checking /usr/share/trimmomatic/ for matching adapter files – those which contain both adapters detected by FastQC.
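The manual check described above (which bundled adapter file contains both adapters reported by FastQC?) could be scripted roughly as follows. This is a sketch under stated assumptions: the function names are invented, the files are assumed to be plain FASTA with a .fa extension, and an adapter counts as "present" if it, or its reverse complement, contains or is contained in a sequence from the file. That matching rule is my own choice and not anything Trimmomatic or FastQC does.

```python
from pathlib import Path

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def parse_fasta(text):
    """Return the sequences of a FASTA text as upper-case strings."""
    sequences, chunks = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                sequences.append("".join(chunks))
                chunks = []
        elif line:
            chunks.append(line.upper())
    if chunks:
        sequences.append("".join(chunks))
    return sequences


def covers(file_sequences, adapter):
    """True if the adapter or its reverse complement contains,
    or is contained in, any sequence from the file."""
    forms = (adapter.upper(), reverse_complement(adapter))
    return any(f in s or s in f for s in file_sequences for f in forms)


def matching_adapter_files(adapter_dir, detected):
    """Names of *.fa files in adapter_dir that cover every detected adapter."""
    hits = []
    for fasta in sorted(Path(adapter_dir).glob("*.fa")):
        sequences = parse_fasta(fasta.read_text())
        if all(covers(sequences, adapter) for adapter in detected):
            hits.append(fasta.name)
    return hits
```

A call such as matching_adapter_files("/usr/share/trimmomatic", [adapter1, adapter2]), with the two sequences copied from the FastQC report, would then list the candidate files.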

Will use Trimmomatic for now.

Important update:

It is possible to (quite easily) construct a file with all the adapters for Trimmomatic, and it will happily try to trim anything from that file; Trimmomatic is now my sledgehammer – give it anything, and it will crush it.
I have just used cutadapt directly, on a peculiar case of Nextera transposon contamination throughout the length of reads. The advantage of cutadapt is that you can specify how many times to trim the adapter – by default it is just 1, but I’ve set it to 20 and got rid of all Nextera leftovers. cutadapt is now my scalpel – I use it in pathological cases, when I know what (and how much of it) to cut out.
Specifically for Nextera, I’m now using NxTrim – a tool from Illumina, which examines the reads and splits them into several categories: proper MP, PE, single-end/overlapping reads, and unknown. After NxTrim, individual reads should still have other sequencing adapters clipped.

Nobody wants higher-quality, complete bacterial genomes

Bogdan — Tue, 24 May 2016 15:18:07 +0000

This is a bit of a rant.

Disclaimer

The story, all names, characters, genomes and incidents portrayed in this blog post are fictitious.
No identification with actual persons (living, dead or undead), places, companies, and processes is intended or should be inferred.
No animals were harmed in the making of this blog post.

Let’s try answering a question:

why are there many incomplete/draft bacterial genomes, and far fewer complete genomes?

The answer is simple: insufficient value/cost ratio.
This can also be summarized as the good enough principle: if something is good enough, it does not get improved.

Sample scenario 1.
Players: Principal Investigator (PI), Bacterial Genome (BG), Biologist (B), Sequencing Company (SC), (optional) Bioinformatician (oBI), Genomes Database (GD).

B is interested in working with BG, and gets PI’s approval to sequence it.
Biomaterial is sent to SC, which sequences and even assembles the BG.
BG looks overall great and comes in just a handful of fragments.
oBI is (optionally) involved, to annotate and describe the BG.
B works happily with the BG, describing and characterizing all the interesting biosynthetic features it contains.
An article is prepared, and oBI is (optionally) involved again, to prepare and submit the BG to the GD.
While preparing the BG, oBI has to answer the question of whether this BG contains any plasmids.
Upon closer examination, oBI finds that one of the fragments is actually the complete chromosome, and all others are just unplaced fragments of it.
oBI knows that this genome could probably be merged into a single draft scaffold
using bioinformatics tools and manual examination in maybe a few days (or a week… or two?).
oBI also knows that with a little bit of B’s help (a few primer walking experiments) it should be possible to have the complete BG within a month or two.
However, BG stays a draft, and is not going to be complete any time soon.

Why?

Let’s look at motivations of all the players, and see if any of the players wants the complete BG:

PI wants publications; spending extra time/effort to make BG complete does not present any obvious benefits;
BG wants to be left alone;
B wants to publish exciting new findings; they are already supported by the draft BG, so there is clearly no need for a complete BG;
SC was happy to get payment in time; SC is also proud to be able to provide genome assembly as an extra service with its (primary) sequencing offers;
oBI has an interest in finishing the BG: it will then be complete; however, there are 5 other BGs awaiting processing, and the backlog of semi-written manuscripts only keeps growing… finishing this specific BG will not result in a perceived benefit to oBI;
GD stores genomes; it doesn’t care much if the genome submitted could have been better.

Surprise!
Looks like none of the players sees benefits in actually finishing the BG,
simply because the effort spent (or time waited) does not bring any perceived benefits to any of the players.

Sample scenario 2.
Players: Bacterial Genome (BG), Biologist (B), Sequencing Company (SC), non-optional Bioinformatician (noBI), Genomes Database (GD).

This time, B (who is interested in quickly publishing a short genome announcement) asks for noBI’s help from the moment the BG is provided by the SC.
noBI has a cursory look at the BG, and although there is a huge discrepancy between thousands of contigs on the one hand and insanely high coverage on the other,
the BG otherwise appears good enough for further work, especially after scaffolding; after all, this is just a genome announcement, not a full-blown article!
There is also some weirdness about the coverage distribution of the BG, but noBI carelessly ignores that.
The BG is worked on: annotated, examined, described, prepared for submission to the GD.
Meanwhile, the announcement article is also nearly complete.
Genome is submitted, and GD’s response comes back: some scaffolds contain orangutan and human DNA, and some scaffolds contain known adapter sequences in the middle…
“Oh crap”, thinks noBI, “I should have checked the raw reads for adapters and contamination, in spite of having the BG assembly already…”
The GD also kindly offers an easy way out: just remove the obviously-orangutan scaffolds, and remove/mask/discard adapter sequences.
This is the easy way, leading to a quicker genome announcement, and a slight bump to the personal publication records of both B and noBI.

The right way is, of course, to clean raw reads from adapters and contamination, re-assemble, re-scaffold, re-annotate, re-describe the BG,
then prepare again for submission. This can delay the quick genome announcement by about a week,
but will highly likely result in a more contiguous and more correct BG – although still not complete.

As we have learned from Scenario 1, perceived benefits of going the right way (as opposed to the easy way) are nearly non-existent…

There was a genome I finalized manually a few years ago.
I had some good-quality data, obtained an initial assembly of 300-something contigs,
then scaffolded and manually finalized it to about 10 scaffolds.
There was simply not enough evidence (data) to keep merging scaffolds, so I had to stop.

Nowadays, as bacterial genome sequencing prices are akin to weekend supermarket shopping expenses,
nobody is going the extra mile to produce a better quality, more contiguous, or even a complete genome.
And this feels sad…

On the other hand, consumer markets have functioned like that for decades.
An old water heater with a failed heating element is not repaired: it is replaced by a new water heater,
because the human time needed to repair the old one costs more than just buying a new one.

Funnily, universal basic income might change that: without the need to spend 40+ hours a week at work
(and thus being unable to repair that water heater on one’s own),
one might just order that heating element and fix it, instead of buying a new one.

Would universal basic income have the same effect on draft and incomplete bacterial genomes? I have no idea.

Streptomyces morphogenesis regulation: overview presentation

Bogdan — Fri, 13 May 2016 09:18:44 +0000

Note: this post is just a placeholder/draft; it will be extended later. But it can already be useful.

Streptomyces Morphogenesis
Streptomyces Morphogenesis notes
Morphogenesis regulation poster

Preprint servers and open journals

Bogdan — Sun, 28 Feb 2016 13:13:53 +0000

Let’s start with some definitions.

With Open Journals I’m referring to open/public peer-review journals.
With preprint servers, I’m referring to services which allow you to publish your manuscript with a DOI, for pre-submission interest and feedback collection.

I am aware of the following public peer-review journals:

F1000 Research: your submission is made public without any editorial pre-screening within an average of 7 days, but only gets indexed in PubMed/Scopus/Scholar after a successful public peer review. Public means that a reviewer-signed evaluation appears together with the submitted manuscript. Authors may respond to criticism, and upload revisions of their submission. I believe a submission passes peer review after two positive reviews. Note that even your initial submission receives a DOI, and is thus citable (as are all subsequent revisions). A brief examination of articles in some of the topics tells me that F1000 Research is a good place to publish, esp. because it is a kind of pre-print + journal in one package. You pay per submission; there are 3 tiers by word count.
The Winnower: submit-review-revise, but here you pay for the DOI after your submission is reviewed. Before review your submission is thus not citable (except by URL, which isn’t tracked as easily as DOI references). I haven’t formed an opinion on how attractive The Winnower is for submitting, but I did find this quite interesting story for you to enjoy.
Science Open: this project encompasses 5 mostly medical journals. It lists over 11 million articles on the front page, but those are sourced from other publications; Science Open itself seems to have several hundred publications across all 5 journals. Submissions get a DOI, then can undergo public review. It is not clear to me in which direction Science Open will be moving – towards becoming an excellent research papers aggregator, or towards becoming a publishing platform, or – like now – towards both.

I’m also aware of the following preprint servers:

arXiv: probably the oldest one, suitable for quantitative research. Submissions are pre-screened to meet certain minimal requirements.
bioRxiv (CSHL): preprint server for biology. Submissions are pre-screened to meet certain minimal requirements.
figShare: online repository for digital artifacts, including figures, datasets, tables, PDF files et cetera. Uploaded items get a DOI. I used to think that you have to pay for a DOI, but right now this feature is listed under free account features.
PeerJ preprints (and PeerJ journal): preprints are free, and you can submit a PeerJ preprint to PeerJ with a single button click. PeerJ has two journals, PeerJ itself (Life, Bio, Health) and PeerJ Computer Science. As is common, manuscript submitter pays for open access article. PeerJ has several different schemes of payment, including per-article, author membership, and institutional subscription. PeerJ has approximately 1800 articles published.
Zenodo is a DOI-providing repository similar to FigShare, powered by Horizon-2020 EU program funding and CERN’s Data Centre.

So far I have only had experience with bioRxiv, and it was great. I’ll consider F1000 Research or PeerJ for some of my next manuscripts: both models are quite attractive, especially F1000’s open review.

How to use mkfifo named pipes with prinseq-lite.pl

Bogdan — Wed, 24 Feb 2016 11:39:37 +0000

prinseq-lite.pl is a utility written in Perl for preprocessing NGS reads, including reads in FASTQ format.
It can read sequences both from files and from stdin (if you only have 1 input file).

I wanted to use it with compressed (gzip or bzip2) FASTQ input files.
As I do not need to store decompressed input files, the most efficient solution is to use pipes.
This works well for a single file, but not for 2 files (paired-end reads).

For 2 files, named pipes (also known as FIFOs) can be used.
You can create a named pipe in Linux with the mkfifo command, for example mkfifo R1_decompressed.fastq.
To use it, start decompressing something into it (either in a different terminal, or in the background), for example zcat R1.fastq.gz > R1_decompressed.fastq &;
we can call this a writing/generating process, because it writes into a pipe.
(If you are writing software to use named pipes, any processes writing into them should be started in a new thread, as they will block until all the data is consumed.)
Now if you give R1_decompressed.fastq as a file argument to some other program, it will see the decompressed content (e.g. wc -l R1_decompressed.fastq will tell you the number of lines in the decompressed file); we can call the program reading from the named pipe a reading/consuming process.
As soon as the consuming process has consumed (read) all of the data, the writing/generating process will finally exit.
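For anyone doing the same from a program rather than from the shell, here is a minimal Python sketch of the writer/reader arrangement described above. It is POSIX-only (os.mkfifo), the function names are invented for this illustration, and the writer runs in its own thread because opening and writing to the FIFO blocks until a reader is attached.

```python
import gzip
import os
import shutil
import tempfile
import threading


def feed_fifo(gz_path, fifo_path):
    """Writing/generating side: decompress gz_path into the named pipe."""
    # Opening the FIFO for writing blocks until a reader opens the other end.
    with gzip.open(gz_path, "rb") as src, open(fifo_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def count_lines_via_fifo(gz_path):
    """Reading/consuming side: count the lines of the decompressed stream."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "R1_decompressed.fastq")
    os.mkfifo(fifo_path)  # POSIX only
    writer = threading.Thread(target=feed_fifo, args=(gz_path, fifo_path))
    writer.start()
    try:
        with open(fifo_path, "rb") as fifo:
            n_lines = sum(1 for _ in fifo)
    finally:
        writer.join()  # the writer exits once everything has been read
        os.remove(fifo_path)
        os.rmdir(workdir)
    return n_lines


def demo():
    """Write a tiny gzipped FASTQ file and count its lines through a FIFO."""
    workdir = tempfile.mkdtemp()
    gz_path = os.path.join(workdir, "R1.fastq.gz")
    with gzip.open(gz_path, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
    try:
        return count_lines_via_fifo(gz_path)
    finally:
        shutil.rmtree(workdir)
```

Calling demo() plays both roles in one process; in real use the consumer would be an external program that is given the FIFO path as its input file.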

This, however, does not work with prinseq-lite.pl (version 0.20.4 or earlier): it fails with a broken pipe error.

Named pipes are very similar to usual files, with two major differences:

named pipes are not seekable: you cannot move file pointer (at least not backwards, not sure about skipping forward);
you cannot arbitrarily close/re-open a named pipe from the consuming end: closing a pipe on the consuming end also closes it for the writing/generating process.

The reason why prinseq-lite.pl does not work with named pipes is that it performs file format checking first – by opening the file, reading the first 3 lines, and closing it.
Closing a named pipe causes a broken pipe error for the writing process. When prinseq-lite.pl then attempts to open the pipe again, it succeeds, but there is no data there anymore, so it just sits and waits for data.

I’m ok with a quick and dirty solution, so here it is: prinseq-lite.pl patch to enable mkfifo named pipes as input files (also local prinseq-lite.pl.patch).
WARNING: this patch simply disables file format checking!
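A gentler alternative to disabling the check would be to peek at the first lines without ever closing the input, and then hand the peeked lines back to the consumer together with the rest of the stream. The sketch below shows the idea in Python, not in prinseq-lite.pl's Perl; the function name and the exact check (line 1 starts with "@", line 3 starts with "+") are my assumptions about what a FASTQ sanity check could look like, not what prinseq-lite.pl actually tests.

```python
import io
import itertools


def open_checked_fastq(handle):
    """Check the first 3 lines, then return an iterator over ALL lines.

    The handle is never closed or re-opened, so this also works when
    the handle is a named pipe or stdin.
    """
    head = list(itertools.islice(handle, 3))
    looks_like_fastq = (
        len(head) == 3
        and head[0].startswith("@")
        and head[2].startswith("+")
    )
    if not looks_like_fastq:
        raise ValueError("input does not look like FASTQ")
    # Replay the peeked lines, then continue with the rest of the stream.
    return itertools.chain(head, handle)


# In-memory stand-in for a pipe
stream = io.StringIO("@r1\nACGT\n+\nIIII\n")
for line in open_checked_fastq(stream):
    print(line, end="")
```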

Good hands-on explanation of differences between Spearman’s and Pearson’s correlation

Bogdan — Tue, 22 Apr 2014 10:42:10 +0000

Linear correlation vs. Rank order correlation: drag 11 data points around the plot and observe how both Spearman’s and Pearson’s correlation measures change. But first follow the Next button at the bottom-right for a guided tour of data manipulations.
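The same contrast can be reproduced offline with a few lines of plain Python. This is only an illustrative sketch with made-up data (11 points on a cubic curve): Pearson's coefficient measures linear association, while Spearman's is Pearson's coefficient computed on ranks, so any strictly monotonic relationship scores a perfect 1.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson's linear correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's coefficient of the ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = list(range(1, 12))     # 11 made-up data points
ys = [x ** 3 for x in xs]   # monotonic, but far from linear
print(pearson(xs, ys), spearman(xs, ys))
```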

How to cite PHYLIP

Bogdan — Fri, 10 Jan 2014 15:29:07 +0000

The official PHYLIP FAQ does suggest a few ways to cite the software, but I believe that the best citation is mentioned in the Wikipedia PHYLIP article: the PubMed reference for PMID 7288891. This PubMed citation seems the best, because

it does mention the software tool implementing the maximum likelihood approach,
it is likely the earliest mention of the PHYLIP software (which was distributed since around 1980),
it refers to a journal indexed by PubMed, and
according to Google Scholar, it has already been cited over 6660 times

GUIs for R

Bogdan — Thu, 17 Oct 2013 20:59:01 +0000

I’ve tried [briefly] Cantor (which also supports Octave and KAlgebra as backends), rkward, deducer/JGR, R Commander, and RStudio.

My personal choice was RStudio: it is good-looking, intuitive, and easy to use, yet powerful.

The next step would be using some R equivalent of IPython’s excellent Mathematica-like Notebook web interface…

MultiParanoid vs. QuickParanoid: pro et contra for each

Bogdan — Tue, 09 Jul 2013 15:57:31 +0000

MultiParanoid

Here we present a new proteome-scale analysis program called MultiParanoid that can automatically find orthology relationships between proteins in multiple proteomes. The software is an extension of the InParanoid program that identifies orthologs and inparalogs in pairwise proteome comparisons. MultiParanoid applies a clustering algorithm to merge multiple pairwise ortholog groups from InParanoid into multi-species ortholog groups.

QuickParanoid

QuickParanoid is a suite of programs for automatic ortholog clustering and analysis. It takes as input a collection of files produced by InParanoid and finds ortholog clusters among multiple species. For a given dataset, QuickParanoid first preprocesses each InParanoid output file and then computes ortholog clusters. It also provides a couple of programs qa1 and qa2 for analyzing the result of ortholog clustering.

So… both use InParanoid… Are there any differences? Let me list those which I’ve found.

MultiParanoid

requires all species names to be passed through the command line; I had 11, so that’s a downside (even though I got that list with 1 extra ls command)
is written in Perl
needs source code editing – to specify the input directory with all the InParanoid’s sqltable.* files, and also the output file
seemed to run fairly fast on my 11 genomes – finished in under 4 minutes

Overall, MultiParanoid left a somewhat “messy” impression… But it definitely did work.

QuickParanoid

interactive: it just asks you for a directory containing all the sqltable.* files, for configuration file, and the executable prefix
configuration file is just a list of all species names; this is similar to MultiParanoid and is also produced with an ls command (ls -1 *.faa > config in my case), but feels a little better than dropping those 11 filenames on the command line
written in C++
generates and compiles code! after collecting your input, two custom binaries are generated to actually run the analysis, which seems to have no practical utility for the end-user, but is definitely cool!
much faster than MultiParanoid – analysis itself (with the generated custom binary) took less than 5 seconds; generating that custom binary adds only a few more seconds
contains helpful qa1 and qa2 utilities; qa1 summarizes the final clusters, and qa2 compares two different results with each other (see examples below)
it had to be compiled first with make qa; also, an #include was missing in one of the source files…

Overall, QuickParanoid left a “quick and cool” impression, with the minor drawback of having to add that missing string.h include.

With the help of QuickParanoid’s qa1, I’ve collected some stats on the clusters of orthologs in my 11 genomes.

QuickParanoid clusters:

Number of clusters consisting of 1 species : 0
Number of clusters consisting of 2 species : 1757
Number of clusters consisting of 3 species : 975
Number of clusters consisting of 4 species : 622
Number of clusters consisting of 5 species : 463
Number of clusters consisting of 6 species : 406
Number of clusters consisting of 7 species : 325
Number of clusters consisting of 8 species : 355
Number of clusters consisting of 9 species : 448
Number of clusters consisting of 10 species : 607
Number of clusters consisting of 11 species : 2449
Total: 8407

MultiParanoid clusters:

Number of clusters consisting of 1 species : 0
Number of clusters consisting of 2 species : 1872
Number of clusters consisting of 3 species : 1023
Number of clusters consisting of 4 species : 637
Number of clusters consisting of 5 species : 479
Number of clusters consisting of 6 species : 418
Number of clusters consisting of 7 species : 338
Number of clusters consisting of 8 species : 358
Number of clusters consisting of 9 species : 454
Number of clusters consisting of 10 species : 605
Number of clusters consisting of 11 species : 2451
Total: 8635
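The two qa1 summaries above can also be compared size by size. The sketch below just re-enters the numbers from the two lists (the variable names are mine), checks the totals, and prints how many more clusters MultiParanoid reports for each species count.

```python
# Number of clusters by species count, copied from the qa1 summaries above
quickparanoid = {2: 1757, 3: 975, 4: 622, 5: 463, 6: 406,
                 7: 325, 8: 355, 9: 448, 10: 607, 11: 2449}
multiparanoid = {2: 1872, 3: 1023, 4: 637, 5: 479, 6: 418,
                 7: 338, 8: 358, 9: 454, 10: 605, 11: 2451}

# Totals should match the "Total:" lines reported by qa1
print(sum(quickparanoid.values()), sum(multiparanoid.values()))

# Positive numbers: MultiParanoid has more clusters of that size
difference = {n: multiparanoid[n] - quickparanoid[n] for n in quickparanoid}
for n_species, extra in difference.items():
    print(f"{n_species:2d} species: {extra:+d}")
```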

So, qa1 is definitely useful. As the output of QuickParanoid is almost the same as that of MultiParanoid, qa1 also works on MultiParanoid results – one just has to add a hash # at the beginning of the very first line of the MultiParanoid results file.
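Adding that hash by hand is easy enough, but it can also be scripted. A small sketch, with invented function names and a made-up header line in the usage example:

```python
from pathlib import Path


def comment_out_header(text: str) -> str:
    """Prefix the very first line with '#', unless it already has one."""
    if text.startswith("#"):
        return text
    return "#" + text


def make_qa1_compatible(src_path, dst_path):
    """Write a copy of a results file whose first line starts with '#'."""
    Path(dst_path).write_text(comment_out_header(Path(src_path).read_text()))


# Made-up two-line example; the real header will look different
print(comment_out_header("header line\ndata line\n"), end="")
```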

qa2 makes it possible to compare, e.g., MultiParanoid and QuickParanoid results. Here’s the output from the default ‘names-only’ comparison mode:

Checking only sequence names…
Number of clusters in multiparanoid_result.txt : 8635
Number of clusters in quickparanoid_result.txt : 8407
Number of matched clusters: 8253
Residue clusters in multiparanoid_result.txt:
5945 4338 8374 7913 8334 8315 4564 7508 7492 6173 5922 8187 8104 4377 7003 8590 7808 5265 8064 5195 6297 8285 6849 6868 6317 8619 5983 4549 6659 4503 7500 4550 8510 7591 7776 8075 5969 6051 3857 5949 7795 6031 8005 8441 7737 7445 8507 4534 8289 6712 7294 6083 8377 8117 7344 8040 6858 8138 4559 7788 7818 7479 7100 7906 6287 8007 8552 7198 7489 7446 8522 8270 8327 6367 5278 8150 6832 8519 6332 6361 6674 8323 6156 6183 7187 8048 6328 8127 7260 7143 7794 7810 7376 8541 8397 8389 8516 5146 6311 8347 7573 7103 4391 6980 8330 8384 5918 4318 4759 4656 6061 7030 6694 8260 6180 4084 4321 7745 7875 7650 5112 7635 6976 4776 6249 7607 5208 5046 7566 4401 8487 7307 5571 4310 7322 6625 4398 5410 7401 6213 7392 4985 8369 8165 5704 8520 6244 6305 8072 8172 8143 8236 8419 5235 4746 5333 5930 4309 4349 3097 8393 5665 4265 4317 6502 7175 6386 5215 7117 4332 5182 7590 4788 5907 8396 7910 7221 7484 4807 5610 6613 4050 4410 4878 6522 4284 4245 1936 4274 5407 6963 7694 6528 8538 5800 6577 6767 5344 4069 5747 4105 7162 5047 4879 5036 6303 78 3446 5883 3891 4727 6912 4219 4724 3775 2404 7618 4171 6499 6127 3438 3844 4962 5708 4824 4657 99 160 5507 4992 6610 3683 5871 3908 4127 4725 4749 5200 5450 4138 3342 4283 4814 2210 5672 5551 6487 569 1545 2556 2950 6949 3933 4051 5084 5492 4389 1142 1303 1555 1784 2045 2490 3037 3894 4390 4668 4066 510 695 3982 1349 1922 2148 2221 4781 1065 1100 1125 1263 6489 2054 3147 951 1708 2472 3009 3458 3569 3875 4996 6470 145 1565 2196 2298 1470 2312 3868 4750 1996 3579 1029 515 716 1075 2119 4825 509 1855 1863 101 390 756 1426 4721 4912 5342 351 4734 1455 1738 1880 2367 5167 788 2750 338 3543 3745 4762 837 2662 4143 209 3175 4717 2071 2429 847 1220 1340 2046 4573 2065 2328 939 1995 97 845 1631 2076 3711 4880 1039 1214 534 1133 578 767 754 226 988 3521 1221 2689 1848 1917 3654 2906 61 435 738 1577 971 1099 926 2101 376 294 173 119
Residue clusters in quickparanoid_result.txt:
7697 5071 8137 8292 7403 6349 6516 7093 4532 5193 3596 6608 6214 5738 5438 3606 3492 4631 4037 5155 2725 5804 5871 3930 1185 3044 4776 5064 5321 760 2653 3419 4426 4530 5281 958 3602 1158 3008 256 5551 2742 3356 3143 3626 3548 1754 1411 1718 2104 3922 3867 3087 967 1932 2261 269 3881 3692 3206 3059 1115 500 3664 3985 3180 3615 3713 748 3311 1351 1396 1409 4113 3822 1917 1816 1109 3505 1110 1826 4079 4007 201 1308 4572 471 5128 1216 3060 2697 5732 2248 2299 3590 4537 2741 1440 2563 1001 3772 206 3020 4075 4571 3147 3727 2438 1744 1365 1156 1181 1442 2379 272 3902 777 3198 350 2111 3320 2670 3415 2118 3939 326 2826 3343 3698 2066 3284 3319 2883 3487 2976 876 1576 661 942 3960 1525 2962 270 577 1266 287 1499 2643 1651 1401 2218 1991 1 1681

Not a conclusion:

thanks to the summary of qa1, I’ve decided to take MultiParanoid results – they have (in my case) larger clusters with more genes in them, which is good, and overall more clusters – which is also good
if I had 20+ genomes to compare, or if I had to re-run this type of analysis multiple times – I’d use QuickParanoid
if I had to implement yet-another-inparanoid-based orthology clustering tool, then I’d first consider the QuickParanoid’s preprocessor/code generator, which was designed in an easy to extend manner

Initially, I had also considered OrthoMCL for multi-species ortholog clustering. However, InParanoid + Multi/QuickParanoid is much easier and quicker to set up and use, because OrthoMCL requires a database back-end (which it uses for better scalability).

Well, QuickParanoid has a test dataset with 120 species, and

… it takes only 199.56 seconds on an Intel 2.4Ghz machine with 1 gigabyte memory to process a dataset of 120 species …

R functions for regression analysis cheat sheet

Bogdan — Tue, 29 May 2012 13:11:48 +0000

Original PDF.
My local copy.

Information criteria for choosing best predictive models

Bogdan — Tue, 29 May 2012 11:44:50 +0000

I usually use 10-fold (non-stratified) CV to measure the predictive power of models: it gives consistent results, and is easy to perform (at least on smaller datasets).

Just came across Akaike’s Information Criterion (AIC) and the Schwarz Bayesian Information Criterion (BIC). Citing robjhyndman,

Asymptotically, minimizing the AIC is equivalent to minimizing the CV value. This is true for any model (Stone 1977), not just linear models. It is this property that makes the AIC so useful in model selection when the purpose is prediction.
…
Because of the heavier penalty, the model chosen by BIC is either the same as that chosen by AIC, or one with fewer terms. Asymptotically, for linear models minimizing BIC is equivalent to leave–v–out cross-validation when v = n[1-1/(log(n)-1)] (Shao 1997).

Want to try AIC and maybe BIC on my models. Conveniently, both functions exist in R.
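
For reference, the textbook definitions make the “heavier penalty” from the quote explicit. The notation here is an assumption of this note, not taken from the quoted post: k is the number of estimated parameters, n the number of observations, and \hat{L} the maximized likelihood of the fitted model.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Each extra parameter costs 2 under AIC and ln n under BIC, so BIC penalizes more heavily as soon as ln n > 2, that is for n ≥ 8 observations. In R, the corresponding functions are AIC() and BIC(), which take a fitted model object such as the result of lm().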

Academia or life?

Bogdan — Sat, 16 Apr 2011 10:56:42 +0000

Worth reading: Goodbye academia, I get a life.

Amazonia! 6462 human microarray datasets

Bogdan — Sun, 06 Mar 2011 19:18:51 +0000

Amazonia! – explore the jungle of microarray results

Paradoxically, the tremendous downpour of microarray results prevents a simple use of expression data. Therefore, we propose a thematic entry to public transcriptomes: you may for instance query a gene on a “Stem Cells page”, where you will see the expression of your favorite gene across selected microarray experiments related to stem cell biology. This selection of samples can be customized at will among the 6462 samples currently present in the database.

Every transcriptome study results in the identification of lists of genes relevant to a given biological condition. In order to include this valuable information in any new query in the Amazonia! database, we indicate for each gene in which lists it is included. This is a straightforward and efficient way to synthesize hundreds of microarray publications.
A special feature of Amazonia! is the field of human stem cells, notably embryonic stem cells.

Introduction to Python for bioinformatics

Bogdan — Fri, 25 Feb 2011 12:03:55 +0000

This overview presentation is two years old, but still a highly valuable resource: modules and tools mentioned are alive and useful.
I think this is the second presentation by Giovanni I’m embedding (first one being about GNU/make for bioinformatics).

Introduction to python for bioinformatics

How to replace newlines with commas, tabs etc (merge lines)

Bogdan — Tue, 16 Nov 2010 08:20:45 +0000

Imagine you need to get a few lines from a group of files with missing identifier mappings. I have a bunch of files with content similar to this one:

ENSRNOG00000018677 1368832_at 25233
ENSRNOG00000002079 1369102_at 25272
ENSRNOG00000043451 25353
ENSRNOG00000001527 1388013_at 25408
ENSRNOG00000007390 1389538_at 25493

In the example above I need '25353', which has no corresponding affy_probeset_id in the 2nd column.

It is clear how to do that:

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}'

This outputs a column of required IDs (EntrezGene in this example):

116720
679845
309295
364867
298220
298221
25353

However, I need these IDs as a comma-separated list, not as a newline-separated one.

There are several ways to achieve the desired result (only the last pipe commands differ):

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | gawk '$1=$1' ORS=', '

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | tr '\n' ','

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | sed ':a;N;$!ba;s/\n/, /g'

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | sed ':q;N;s/\n/, /g;t q'

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | paste -s -d ","

These solutions differ in efficiency and (slightly) in output: the gawk and tr variants leave a trailing separator and no final newline, while the sed and paste variants do not. Both sed variants collect all the input in the pattern space before replacing the newlines, so they might not be best for large files. tr might be the most efficient, but I haven’t tested that. paste cycles through the characters given to -d as individual delimiters, so you cannot really get comma-space “, ” separation with it.
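
If a multi-character separator is needed without a trailing separator and without holding the whole input in memory, a small awk function is one more option. This is only a sketch and is not taken from the sources listed below; join_lines is a hypothetical helper name.

```shell
# join_lines SEPARATOR
# Reads lines from stdin and prints them on one line, joined by SEPARATOR.
# Works line by line, so memory use does not grow with the input size.
join_lines() {
    awk -v sep="$1" '
        NR > 1 { printf "%s", sep }   # separator goes before every line but the first
               { printf "%s", $0 }
        END    { if (NR) print "" }   # finish with a single newline
    '
}

# Example with a few of the IDs from above:
printf '%s\n' 116720 679845 25353 | join_lines ', '
```

In the pipeline above it would replace the last command, for example `... | awk '{print $2}' | join_lines ', '`.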

Sources: linuxquestions 1 (explains used sed commands), linuxquestions 2, nixcraft.

Overlaying gene expression data onto pathways from databases

Bogdan — Fri, 05 Nov 2010 13:20:06 +0000

Superimposing gene expression data onto pathways from databases is a common task in the final steps of microarray data analysis – that is, biological interpretation and results discussion.

I have found many tools which claim to facilitate this procedure. Some of them are reviewed below (in no specific order).

Pathway Explorer by Bernhard Mlecnik was last updated in 2007, but is fully functional (I believe it is being maintained without changes to the last-updated date). Both online and downloadable Java applications are available. Note that for the downloadable application you will need to obtain a license key – the procedure is well documented and was very fast for me.

Pathway Explorer supports import from 3 sources: KEGG XML files, BioCarta URLs, and GenMAPP URLs. Import from KEGG does work as described in the short manual, and seems functional (I had some problems exporting/saving the resulting picture, but didn’t investigate further). BioCarta import seems to work, but for some reason does not display expression levels of pathway components. I could not test the import of GenMAPP pathways, because they are not available online.

I found Pathway Explorer good, but then switched to PathVisio (reviewed next), because for some reason Pathway Explorer was recognising only a small fraction of genes from my expression data. It could be that its identifier mappings are outdated, but this is just a guess.

PathVisio appears to be a spin-off of GenMAPP and WikiPathways. It excels at importing/visualizing WikiPathways data, which even comes bundled with the PathVisio Java application. It is easier to use than Pathway Explorer, and it seems to recognize more genes (although still not all the genes which are present in the data). There is KEGG pathways support, but it is not always usable – many edges (links between genes/proteins) are absent, so instead of a pathway you get a bunch of nodes relevant to a pathway, but cannot really see how they are connected. PathVisio supports an insanely long list of database identifiers, so it is highly unlikely that you will have to map your data to a different identifier. This pathway mapper exports to several formats, including PNG and PDF.

I could not fully test AffyWEB, because it doesn’t list the rat arrays we used. Trying their barley genome example did work, so the tool is probably functional. It overlays your expression data onto KEGG pathways.

G-language Microarray System is a comparatively simple pathway visualizer. It accepts CSV files containing a column of EntrezGene IDs and a single column of expression values normalized to the 1-100 range, fetches the requested KEGG pathway, and generates a Flash (SWF) object depicting that pathway with coloured components. It does work with sample data. I was too lazy to normalize my expression data to the [1;100] range, and SWF is not exactly a usable format, so I haven’t tested this tool any further (you can right-click to zoom in the Flash pathway below).

If time permits (or work requires) this post may be extended with the reviews of GenMAPP, GEPAT, KEGG2heatmap script, EGAN, MapMan, Pathway Miner, ArrayXPath, VisANT, SpotXplore, and maybe others.

Please comment to share your experience using pathway expression overlaying tools or to suggest other tools.

Batch-retrieve EntrezGene homologs using NCBI’s HomoloGene and R’s annotationTools

Bogdan — Wed, 27 Oct 2010 10:49:01 +0000

Install the annotationTools R package:
source("http://bioconductor.org/biocLite.R")
biocLite("annotationTools")
Download the full HomoloGene data file (homologene.data) from ftp://ftp.ncbi.nlm.nih.gov/pub/HomoloGene/current
library(annotationTools)
homologene = read.delim("homologene.data", header=FALSE)
mygenes = read.table("file with one entrez ID of the source organism per line.txt")
getHOMOLOG(unlist(mygenes), taxonomy_ID_of_target_organism, homologene) [alternatively, wrap the call to getHOMOLOG into unlist to get a vector]

It might be easier to achieve the same results with a Perl script calling NCBI’s e-utils.
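
A plain awk pass over homologene.data is another option. The sketch below assumes that the file is tab-separated, with the HomoloGene group ID in column 1, the taxonomy ID in column 2 and the EntrezGene ID in column 3; check that against the actual file before relying on it. homologs is a hypothetical helper name, and unlike getHOMOLOG it simply prints one source/target ID pair per line.

```shell
# homologs IDS_FILE TARGET_TAXID HOMOLOGENE_FILE
# Prints "source_id<TAB>target_id" for every homolog found in the target organism.
# Assumed column layout of HOMOLOGENE_FILE: group ID, taxonomy ID, gene ID, ...
homologs() {
    awk -F '\t' -v taxid="$2" '
        FNR == 1 { pass++ }                  # a new input file starts a new pass
        pass == 1 { wanted[$1]; next }       # pass 1: remember the source gene IDs
        pass == 2 {                          # pass 2: note which groups hold them
            if ($3 in wanted) groups[$1] = groups[$1] " " $3
            next
        }
        $2 == taxid && ($1 in groups) {      # pass 3: report target-organism genes
            n = split(groups[$1], src, " ")
            for (i = 1; i <= n; i++) print src[i] "\t" $3
        }
    ' "$1" "$3" "$3"
}

# Example call (9606 is the NCBI taxonomy ID of human; file names are placeholders):
# homologs my_entrez_ids.txt 9606 homologene.data
```

The data file is read twice because a group's members can appear in any order: the first read finds the groups that contain the source genes, the second prints the target-organism genes of those groups.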

International salary survey in sciences (2010)

Bogdan — Thu, 14 Oct 2010 15:28:42 +0000

Nature published this survey, based on responses from over 10000 people employed in science. It has lots of multi-axis data to explore, and some major trends are discussed in the special report. Highly recommended for anyone considering a career change in science.

Tools for conversion of IDs in genomics

Bogdan — Tue, 10 Aug 2010 12:31:44 +0000

Tools for conversion of IDs in genomics