Archive for the 'Bioinformatics' Category

Bioinformatics is a general term for the application of computers and computational/mathematical methods to biology.

Practical comparison of NGS adapter trimming tools

1st June 2016

I used to work with sequencing providers who were giving me fairly clean data.
It was already barcode-separated, and had no over-represented adapter sequences.
The only thing I had to do was to (optionally) quality-trim the reads, and check for biological contamination.

Recently, however, I have come across some real-world data which had not only contamination in it, but also quite a noticeable percentage of adapter sequences.
I did a quick test of multiple tools to see if they fit my requirements:

should be easy/logical to use: no arcane/convoluted command lines or config files
should detect adapters automatically, either using its own database or a provided plain FASTA file
should be reasonably fast
must leave no adapter traces behind: I prefer aggressive trimming

I have tried the following tools:

fastq-mcf from the ea-utils package
skewer
TrimmomaticPE
cutadapt: haven’t used it directly, but it is used by some of the compared tools
bbduk from BBMap
autoadapt
Trim Galore!

Read the rest of this entry »

Posted in Bioinformatics | 6 Comments »

Nobody wants higher-quality, complete bacterial genomes

24th May 2016

This is a bit of a rant.

Disclaimer

The story, all names, characters, genomes and incidents portrayed in this blog post are fictitious.
No identification with actual persons (living, dead or undead), places, companies, and processes is intended or should be inferred.
No animals were harmed in the making of this blog post.

Let’s try answering a question:

why are there so many incomplete/draft bacterial genomes, and far fewer complete ones?

Read the rest of this entry »

Posted in Bioinformatics, Rant | 2 Comments »

How to use mkfifo named pipes with prinseq-lite.pl

24th February 2016

prinseq-lite.pl is a utility written in Perl for preprocessing NGS reads, including reads in FASTQ format.
It can read sequences both from files and from stdin (if you only have one input file).

I wanted to use it with compressed (gzip/bzip2) FASTQ input files.
As I do not need to store decompressed input files, the most efficient solution is to use pipes.
This works well for a single file, but not for 2 files (paired-end reads).

For 2 files, named pipes (also known as FIFOs) can be used.
You can create a named pipe in Linux with the mkfifo command, for example mkfifo R1_decompressed.fastq.
To use it, start decompressing something into it (either in a different terminal, or in the background), for example zcat R1.fastq.gz > R1_decompressed.fastq & (the trailing & runs the command in the background);
we can call this a writing/generating process, because it writes into a pipe.
(If you are writing software to use named pipes, any processes writing into them should be started in a new thread, as they will block until all the data is consumed.)
Now if you give R1_decompressed.fastq as a file argument to some other program, it will see the decompressed content (e.g. wc -l R1_decompressed.fastq will tell you the number of lines in the decompressed file); we can call the program reading from the named pipe a reading/consuming process.
As soon as the consuming process has consumed (read) all of the data, the writing/generating process will finally exit.
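
The two-file case described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny FASTQ files are made up on the spot so that the script is self-contained, and paste stands in for any consumer that takes two file arguments (it is not a real paired-end tool). gzip -dc is used as an equivalent of zcat.

```shell
#!/bin/sh
set -eu

# Work in a scratch directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical one-read paired-end inputs, created only for this sketch.
printf '@r1/1\nACGT\n+\nIIII\n' | gzip -c > R1.fastq.gz
printf '@r1/2\nTGCA\n+\nIIII\n' | gzip -c > R2.fastq.gz

# One named pipe per input file.
mkfifo R1_decompressed.fastq R2_decompressed.fastq

# Writing/generating processes: each blocks until a reader opens its pipe,
# so both are started in the background.
gzip -dc R1.fastq.gz > R1_decompressed.fastq &
gzip -dc R2.fastq.gz > R2_decompressed.fastq &

# Reading/consuming process: paste reads both pipes as if they were
# ordinary files and joins the corresponding lines with a tab.
paste R1_decompressed.fastq R2_decompressed.fastq > paired.tsv

# The writers exit once their data has been consumed.
wait

# Count the joined lines (4 lines for one read pair).
lines=$(wc -l < paired.tsv)
echo "$lines"
```

The named pipes never hold the decompressed data on disk; they only pass it from the writers to the reader, which is why the decompressed files do not need to be stored.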

This, however, does not work with prinseq-lite.pl (version 0.20.4 or earlier): it fails with a broken pipe error.

Read the rest of this entry »

Posted in *nix, Bioinformatics, Software | No Comments »

MultiParanoid vs. QuickParanoid: pro et contra for each

9th July 2013

MultiParanoid

Here we present a new proteome-scale analysis program called MultiParanoid that can automatically find orthology relationships between proteins in multiple proteomes. The software is an extension of the InParanoid program that identifies orthologs and inparalogs in pairwise proteome comparisons. MultiParanoid applies a clustering algorithm to merge multiple pairwise ortholog groups from InParanoid into multi-species ortholog groups.

QuickParanoid

QuickParanoid is a suite of programs for automatic ortholog clustering and analysis. It takes as input a collection of files produced by InParanoid and finds ortholog clusters among multiple species. For a given dataset, QuickParanoid first preprocesses each InParanoid output file and then computes ortholog clusters. It also provides a couple of programs qa1 and qa2 for analyzing the result of ortholog clustering.

So… both use InParanoid… Are there any differences? Let me list those which I’ve found.

Read the rest of this entry »

Posted in *nix, Bioinformatics, Software | 2 Comments »

R functions for regression analysis cheat sheet

29th May 2012

Original PDF.
My local copy.

Posted in Bioinformatics, Links, Misc | No Comments »

Information criteria for choosing best predictive models

29th May 2012

I usually use 10-fold (non-stratified) cross-validation (CV) to measure the predictive power of my models: it gives consistent results, and is easy to perform (at least on smaller datasets).

Just came across Akaike’s Information Criterion (AIC) and the Schwarz Bayesian Information Criterion (BIC). Citing robjhyndman,

Asymptotically, minimizing the AIC is equivalent to minimizing the CV value. This is true for any model (Stone 1977), not just linear models. It is this property that makes the AIC so useful in model selection when the purpose is prediction.
…
Because of the heavier penalty, the model chosen by BIC is either the same as that chosen by AIC, or one with fewer terms. Asymptotically, for linear models minimizing BIC is equivalent to leave–v–out cross-validation when v = n[1-1/(log(n)-1)] (Shao 1997).

Want to try AIC and maybe BIC on my models. Conveniently, both functions exist in R.
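
For reference, here are the standard textbook definitions of the two criteria (they are not taken from the quoted post). The symbols are my own notation: k is the number of estimated parameters, n the number of observations, and \hat{L} the maximized value of the model's likelihood.

```latex
\begin{align}
\mathrm{AIC} &= 2k - 2\ln\hat{L} \\
\mathrm{BIC} &= k\ln n - 2\ln\hat{L}
\end{align}
```

The penalty per parameter is 2 for AIC and ln n for BIC, so BIC penalizes extra terms more heavily whenever ln n > 2, that is for n of 8 or more. This is the "heavier penalty" the quote refers to. In both cases the model with the lower value is preferred.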

Posted in Bioinformatics, Machine learning | No Comments »

Amazonia! 6462 human microarray datasets

6th March 2011

Amazonia! – explore the jungle of microarray results

Paradoxically, the tremendous downpour of microarray results prevents a simple use of expression data. Therefore, we propose a thematic entry to public transcriptomes: you may for instance query a gene on a “Stem Cells page”, where you will see the expression of your favorite gene across selected microarray experiments related to stem cell biology. This selection of samples can be customized at will among the 6462 samples currently present in the database.

Every transcriptome study results in the identification of lists of genes relevant to a given biological condition. In order to include this valuable information in any new query in the Amazonia! database, we indicate for each gene in which lists it is included. This is a straightforward and efficient way to synthesize hundreds of microarray publications.
A special feature of Amazonia! is the field of human stem cells, notably embryonic stem cells.

Posted in Bioinformatics, Links, Science | No Comments »


Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc
