R script to filter probesets with log-expression values below the lowest spike-in

27th January 2010

Sometimes there is a need to remove all the probesets, which have expression values below the minimal spike-in intensity on the Affymetrix microarray. The reasoning behind this procedure is simple: minimal-expression spike-ins represent the bottom margin of microarray sensitivity, and anything below that margin cannot be reliably quantified – which also means that both fold-change and p-value of expression variance will be unreliable for these probesets.

Here’s a simple R script to do just that. It is abundantly commented, and also contains an optional (commented out) fragment which allows the removal of more low-variance, low-intensity probesets.

Hint: click the “plain text” box header to be able to right-click the code, “Select All”, and then “Copy”.
[CODE]
filter_below_spikes = function(eset) {
# Finds max(lowest) AFFX/spike-in intensity, and removes rows consisting entirely of values below max(lowest).
# @param eset
# ExpressionSet
# @returns
# exprs(ExpressionSet), filtered

# Without Biobase exprs() will not work.
require(Biobase)
expr = exprs(eset)

# ‘expr’ sample:
# 1
# 1367452_at 10.880208
# 1367453_at 10.554647

cat(nrow(expr), “rows before filtering.\n”)

# Make a vector of spike row names.
spikes = grep(“AFFX”, rownames(expr), value = TRUE)
cat(“Expression matrix has”, length(spikes), “spike-in rows.\n”)
cat(“Summary of spike-in values distribution follows:\n”)
print(summary(expr[spikes, ]))

# Find max(lowest) spike-in values.
minval_max = max(as.double(substr(grep(“Min”, summary(expr[spikes, ]), value = TRUE), 10, 14)))
cat(“max(minimal spike-in log-intensity values) =”, minval_max, “\n”)

# Remove spike-ins from expr.
expr = expr[grep("AFFX", rownames(expr), value = TRUE, invert = TRUE), ]

nospike_rows = nrow(expr)
cat(nospike_rows, “rows remaining after the removal of”, length(spikes), “spike-in probesets.\n”)

# Optional: calculate max(SD) of all removed rows.
#bad_sds_max = max(apply(expr[!apply((expr > minval_max), 1, any),], 1, sd))

# Now remove all rows, where each value is <= minval_max. expr = expr[!apply((expr <= minval_max), 1, all), ] cat(nrow(expr), "rows remaining after filtering out", nospike_rows - nrow(expr), "probesets with all values below", minval_max, "\n") #cat(bad_sds_max, "is max(SD) of all", nospike_rows - nrow(expr), "filtered probesets with all values below", minval_max, "\n") # Optional: Remove *some* of the rows, which have at least one value below minval_max, and row_SD <= bad_sds_max. #pre_final_rows = nrow(expr) #expr = expr[(apply(expr, 1, sd) > bad_sds_max) | (apply(expr, 1, min) > minval_max), ]
#cat(pre_final_rows-nrow(expr), “rows with SD <=", bad_sds_max, "and min(row) <=", minval_max, "were removed.\n") #cat(nrow(expr), "final rows returned.\n") return(expr) } [/CODE] Sample use:

> source(“script.R”)
> expr.filtered = filter_below_spikes(eset)

Comments and suggestions are welcome.

This entry was posted on Wednesday, January 27th, 2010 at 14:44 and is filed under Bioinformatics, Programming, Science. You can follow any responses to this entry through the RSS 2.0 feed. You can skip to the end and leave a response. Pinging is currently not allowed.

« Chandler 1.0.3 package for Debian testing amd64

Had happiness ever creeped up on you? »

Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc

Categories

Related entries

Subscribe

Archives

Recent comments

Meta