Archive for the 'Bioinformatics' Category

Bioinformatics is a general term for the application of computers and computational or mathematical methods to biology.

Introduction to Python for bioinformatics

25th February 2011

This overview presentation is two years old but still a highly valuable resource: the modules and tools it mentions are still maintained and useful.
I think this is the second presentation by Giovanni that I’m embedding (the first one was about GNU make for bioinformatics).

Introduction to python for bioinformatics

Posted in Bioinformatics, Links, Python, Software | No Comments »

How to replace newlines with commas, tabs etc (merge lines)

16th November 2010

Imagine you need to extract, from a group of files, the few lines that lack an identifier mapping. I have a bunch of files with content similar to this one:

ENSRNOG00000018677 1368832_at 25233
ENSRNOG00000002079 1369102_at 25272
ENSRNOG00000043451 25353
ENSRNOG00000001527 1388013_at 25408
ENSRNOG00000007390 1389538_at 25493

In the example above I need '25353', which does not have a corresponding affy_probeset_id in the 2nd column.

It is clear how to do that:

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}'

This outputs a column of required IDs (EntrezGene in this example):

116720
679845
309295
364867
298220
298221
25353
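To see why this pipeline picks out the unmapped IDs, here is a minimal sketch run on the five sample lines shown earlier. The temporary directory and the file name sample_affy_ensembl.txt are made up for illustration. It assumes the columns are separated by spaces or tabs, so that awk's default field splitting puts the EntrezGene ID into $2 whenever the probeset column is empty.

```shell
#!/bin/sh
# Hypothetical sample file reproducing the five example lines from above.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample_affy_ensembl.txt" <<'EOF'
ENSRNOG00000018677 1368832_at 25233
ENSRNOG00000002079 1369102_at 25272
ENSRNOG00000043451 25353
ENSRNOG00000001527 1388013_at 25408
ENSRNOG00000007390 1389538_at 25493
EOF

# sort -u drops duplicate lines across all matching files,
# grep -v removes every line that contains a probeset ID ("_at"),
# awk prints the second whitespace-separated field of what is left.
ids=$(sort -u "$tmpdir"/*_affy_ensembl.txt | grep -v '_at' | awk '{print $2}')
echo "$ids"   # prints 25353

rm -rf "$tmpdir"
```

Only the line without a probeset ID survives the grep, and its second field is the EntrezGene ID.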

However, I need these IDs as a comma-separated list, not as a newline-separated list.

There are several ways to achieve the desired result (only the last pipe commands differ):

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | gawk '$1=$1' ORS=', '

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | tr '\n' ','

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | sed ':a;N;$!ba;s/\n/, /g'

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | sed ':q;N;s/\n/, /g;t q'

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | paste -s -d ","

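These solutions differ in efficiency and (slightly) in output. The gawk and tr variants also convert the final newline, so their output ends with a trailing separator, while the sed and paste variants do not. sed reads all the input into its buffer before replacing the newlines, so it might not be the best choice for large files. tr might be the most efficient, but I haven’t tested that. paste treats -d as a list of single-character delimiters and cycles through them, so you cannot really get comma-space “, ” separation with it.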
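The difference in output is easiest to see on a tiny input. The sketch below uses three of the sample IDs from above. Piping paste into sed to widen the separator is one possible workaround, not something taken from the sources cited below.

```shell
#!/bin/sh
# Three of the sample IDs from above, one per line.
ids=$(printf '%s\n' 116720 679845 25353)

# tr turns every newline into a comma, including the last one,
# so the result ends with a trailing comma.
with_tr=$(printf '%s\n' "$ids" | tr '\n' ',')

# paste -s joins the lines with a single-character delimiter and adds
# no trailing one; the follow-up sed then widens each "," to ", ".
with_paste=$(printf '%s\n' "$ids" | paste -s -d ',' - | sed 's/,/, /g')

echo "$with_tr"      # prints 116720,679845,25353,
echo "$with_paste"   # prints 116720, 679845, 25353
```

The extra sed step is safe here only because the IDs themselves never contain commas.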

Sources: linuxquestions 1 (explains the sed commands used), linuxquestions 2, nixcraft.

Posted in *nix, Bioinformatics, how-to, Notepad, Software | 2 Comments »

Overlaying gene expression data onto pathways from databases

5th November 2010

Superimposing gene expression data onto pathways from databases is a common task in the final steps of microarray data analysis – that is, biological interpretation and results discussion.

I have found many tools which claim to facilitate this procedure. Some of them are reviewed below (in no specific order).
Read the rest of this entry »

Posted in Bioinformatics, Links, Software | No Comments »

Batch-retrieve EntrezGene homologs using NCBI’s HomoloGene and R’s annotationTools

27th October 2010

Install the annotationTools R package:
source("http://bioconductor.org/biocLite.R")
biocLite("annotationTools")
Download full HomoloGene data file from ftp://ftp.ncbi.nlm.nih.gov/pub/HomoloGene/current
library(annotationTools)
homologene = read.delim("homologene.data", header=FALSE)
mygenes = read.table("file with one entrez ID of the source organism per line.txt")
getHOMOLOG(unlist(mygenes), taxonomy_ID_of_target_organism, homologene) [alternatively, wrap the call to getHOMOLOG into unlist to get a vector]

It might be easier to achieve the same results with a Perl script calling NCBI’s e-utils.
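The underlying lookup can also be sketched without R, directly on the downloaded file. This is a rough illustration of the idea, not annotationTools' implementation. It assumes that the first three tab-separated columns of homologene.data are the HomoloGene group ID, the taxonomy ID and the Entrez gene ID. The rows, gene IDs and the file name mygenes.txt below are made up for the example.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)

# Made-up rows in the assumed layout: group ID, taxonomy ID, gene ID.
printf '1\t10116\t111\n1\t9606\t222\n2\t10116\t333\n2\t9606\t444\n3\t10116\t555\n' \
    > "$tmpdir/homologene.data"

# Source-organism gene IDs, one per line (hypothetical file name).
printf '111\n555\n' > "$tmpdir/mygenes.txt"

target_taxon=9606

# Pass 1 (mygenes.txt): remember the wanted source gene IDs.
# Pass 2 (homologene.data): record the group of each wanted source gene.
# Pass 3 (homologene.data again): print "source<TAB>homolog" for every
# target-taxon gene that sits in one of the recorded groups.
homologs=$(awk -F '\t' -v target="$target_taxon" '
    FNR == 1 { pass++ }
    pass == 1 { wanted[$1] = 1; next }
    pass == 2 { if ($3 in wanted) group[$1] = $3; next }
    pass == 3 && $2 == target && ($1 in group) { print group[$1] "\t" $3 }
' "$tmpdir/mygenes.txt" "$tmpdir/homologene.data" "$tmpdir/homologene.data")

echo "$homologs"
rm -rf "$tmpdir"
```

In this toy data, gene 111 maps to 222, while gene 555 yields no line because its group has no member in the target taxon.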

Posted in Bioinformatics, how-to, Notepad | 2 Comments »

Tools for conversion of IDs in genomics

10th August 2010

Tools for conversion of IDs in genomics

Posted in Bioinformatics, Links, Science | No Comments »

R tutorial links

29th March 2010

R time series tutorial (2010, a website of the “Time Series Analysis and Its Applications: With R Examples” book)
Statistics with R (2007)
R for programmers PDF (2008, 104 pages, linked to from here)
Brief R tutorial (2004)
Statistical computing with R: a tutorial (2004)
An introduction to R (from the official r-project website, should be always up-to-date)
R tutorial (date unknown, definitely newer than 2005)

Posted in Bioinformatics, Links, Science, Systems Biology | 1 Comment »

R script to filter probesets with log-expression values below the lowest spike-in

27th January 2010

Sometimes there is a need to remove all probesets that have expression values below the minimal spike-in intensity on the Affymetrix microarray. The reasoning behind this procedure is simple: the minimal-expression spike-ins represent the lower limit of microarray sensitivity, and anything below that limit cannot be reliably quantified, which also means that both the fold-change and the p-value of expression variance will be unreliable for these probesets.

Here’s a simple R script to do just that. It is abundantly commented, and also contains an optional (commented out) fragment which allows the removal of more low-variance, low-intensity probesets.

Read the rest of this entry »

Posted in Bioinformatics, Programming, Science | No Comments »

Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc
