Position Frequency Matrix to Position Weight Matrix (PFM2PWM)
11th September 2006
In the course of my current research, I was dealing with the TFBS (Transcription Factor Binding Sites) search. To perfrom the search, one needs position weight matrix (PWM) for each TFBS. When you refer to the TRANSFAC database of transcription factors (and matrices), you will get position frequency matrix (PFM), and will need to convert PFM into PWM.
I did find a couple of conversion formulas, but that was quite an effort to figure out which one is correct – I had seen two different formula variations. Here I will share what I had found.
The original PFM2PWM might have been introduced by Hertz et al, 1999. In compiling the formula presented below, I looked through the sources of the Perl module TFBS-0.5.0, consulted RSA-tools: convert-matrix manual (or here), and used some supplementary presentation evidently prepared on the basis of the article “Identifying estrogen receptor α target genes using integrated computational genomics and chromatin immunoprecipitation microarray” by Victor X. Jin et al 2004.
Here is the formula with some explanations:
- w = log2 ( ( f + sqrt(N) * p ) / ( N + sqrt(N) ) / p )
If we have PFM for ISRE from TRANSFAC 7.0:
A 1 12 0 0 0 0 0 7 1 1 0 0 0 2 1
C 8 0 0 0 0 0 13 1 7 0 0 3 8 7 8
G 2 1 12 0 0 0 0 1 2 0 0 0 0 2 3
T 2 0 0 13 13 13 0 4 3 12 13 10 5 2 1
w – is a weight for the current nucleotide we are calculating
f – is a number of occurences of the current nucleotide in the current column (e.g., “1″ for A in column 1, “8″ for C etc)
N – total number of observations, the sum of all nucleotides occurences in a column (13 in this example)
p – [prior] [background] frequency of the current nucleotide; this one usually defaults to 0.25 (i.e. one nucleotide out of four)
For the example, to calculate weight for nucleotide A in the first column I would need to calculate the following:
- -1.1265 = log2((1+0.25*sqrt(13))/(13+sqrt(13))/0.25)
The negative value “-1.1265″ here means that nucleotide A is not favoured by the TF (transcription factor) binding to the TFBS we are looking for. If we calculate weight for nucleotide C in this column, we get “1.1″ – a positive value which shows that C is favoured by TF in this position.
Finally, for the example PFM above, corresponding PWM is:
A -1.13 1.64 -2.20 -2.20 -2.20 -2.20 -2.20 0.93 -1.13 -1.13 -2.20 -2.20 -2.20 -0.52 -1.13
C 1.10 -2.20 -2.20 -2.20 -2.20 -2.20 1.74 -1.13 0.93 -2.20 -2.20 -0.09 1.10 0.93 1.10
G -0.52 -1.13 1.64 -2.20 -2.20 -2.20 -2.20 -1.13 -0.52 -2.20 -2.20 -2.20 -2.20 -0.52 -0.09
T -0.52 -2.20 -2.20 1.74 1.74 1.74 -2.20 0.24 -0.09 1.64 1.74 1.39 0.51 -0.52 -1.13
I do not know why the [prior] background nucleotide frequency is usually taken to be equal 0.25 (except for the evident 1/4 explanation) – I think this could be replaced with real background frequency of the nucleotide in the sequence fragment where we are looking for TFBS. That might make a little difference for the GC-rich regions, such as promoters – where TFBS are usually located.
I would be glad to hear any comments on the subject, especially on the p = 0.25 or p = [real bg freq.].
Updated: see my notes on using sequence-optimized p(b).
If you just want to convert PFM into PWM, try using RSA-tools convert-matrix. It produces different results – especially if you apply the “smoothing” algorithm (which is on by default). Though different, results are close to those obtained with the formula above. Both with “smoothing” on and off, weights are less “divergent” – that is, values tend to group closer to zero.
October 18th, 2006 at 23:39
[...] Position Frequency Matrix to Position Weight Matrix (PFM2PWM)search-hilite.php fileSelecting notebook model to suit your needs [...]
March 20th, 2007 at 7:57
Hey thanks for the formula. Youre using the common method of sqrt(N) to compensate for small sample sizes. It turns out that this is still commonly used – see BOX 2 in (Wasserman, 2004) Applied Bioinformatics for the identification of regulatory elements. Wasserman uses the exact formula you have here but states using a “pseudocount function”. There are other methods for compensating for small samples, plus more stuff for nucleotide frequencies in regulatory regions, but the formula on this page is… succinct. I’m making a presentation for class and your page is cited if thats ok.
-j
March 20th, 2007 at 8:40
Jason, citing this page is perfectly OK. The “pseudocount function” was also used as a name for sqrt(N) part in the source where I first encountered the formula.
Thanks for the reference to Wasserman, 2004.
May 3rd, 2007 at 16:53
[...] Position Frequency Matrix to Position Weight Matrix (PFM2PWM) [...]
December 18th, 2007 at 8:08
I think it should be ln, i.e. log(e,x), not log2, isn’t it?
December 18th, 2007 at 14:58
Realzhang,
it should be log2(). For the explanation why, please see the “Bioinformatics” book by David W. Mount (section on PSSM information content).
In short, log2() is used to determine the uncertainty (entropy) and information content, thus it is also used for the PFM2PWM conversion.
However, in the resources cited in the post both base 2 and base e logarithms are used (see e.g. Perl modules documentation for TFBS 0.5 and Jason’s slide 5, which, in turn, cites Wasserman, 2004; actually, the only reference to using ln() is in Hertz, 1999).
December 19th, 2007 at 17:11
Bogdan:
Thanks for your reply. I’ve seen the use of ln() at WITA, which just let the pseudocount=1. After reading your reply, I think thers is no significant difference between the log2 and ln() if all the matrices use the same base.
By the way, do you have any good idea on how to determin the threshold of the PWM similarity? The genome.UCSC uses a interesing statistical method, and how do you think of this problem?
December 19th, 2007 at 18:07
Realzhang,
there’s some ambiguity in your question on PWM similarity threshold.
If you are interested in comparing PWM matrices, then I’d suggest Similarity of position frequency matrices for transcription factor binding sites.
However, if you mean the matrix-to-sequence similarity (the threshold/cut-off problem, which arises when looking for TFBSs with a known PFM/PWM matrix) – then it’s a complicated issue. From what I had previously seen in literature, 0.75 similarity (relative score) is often used. Based on my little research (look for Fig.1 and explanations in the text), for ISRE TFBS in rattus norvegicus promoters 0.75 similarity includes some 2/3 of all the maximal-scoring ISREs in all rat promoters. Though I used the 0.8 similarity cut-off, now I think that 0.75 (or even 0.7, given enough post-processing) is much more favourable. (Note: Fig.1 in that PDF has some possibly important theoretical flaws, but it’s a fair representation of actual maximal matrix-sequence scores, obtained for promoters and exons in rat.)
I think that TFBS search itself, no matter how you optimize the threshold, will not give biologically valid results. (The only exception I can think of right now is developing some algorithm which would automatically adjust the cut-off individually for each searched promoter or even each individual searched sub-sequence; however, it’s unclear what should be the criteria for such an algorithm to adjust the cut-off.) Thus, the best approach would be to use the lowest meaningful cut-off, and then just apply a series of filters (post-processors), which would refine the results set. One of the approaches to do that is to somehow employ phylogenetic information and evolutionary sequences conservation/divergence.
The UCSC link you gave me tries to do just that. (By the way, the only interesting thing in their calculations of PWM-sequence scores is the calculation of Z-score – this I haven’t met before, and that’s something to evaluate.)
Actually, I’m nearly done developing a genome-wide TFBS finder web-tool (COTRASIF), which also relies on the inter-species evolutionary conservation of sequences – but I do that somewhat differently from the method described at UCSC. I’ll make an official “COTRASIF opening” post in the nearest future, when all the initial features will be complete, and there will be a sufficient description for the tool.
Meanwhile, if you are interested, you may join the development of COTRASIF. This isn’t a paying job (the project, at least currently, is not commercial), but it just might fit your interests. And there are huge and challenging plans for future (including additional results filtering by the DNA 3D-structure…. but psst, I didn’t say that!)
December 21st, 2007 at 8:51
Realzhang,
did you get the link to results file from the task you submitted? As noted on the task submission page, gmail and yahoo sometimes either reject or delay for several days the delivery of emails from our processing server.
Your task was complete 5 minutes after submission, but I wonder if it got through to your inbox.
December 22nd, 2007 at 9:33
to Bogdan:
Yes, the result email was delayed about 2 days, and the “submitted” notification and “finished” notification were dilivered at the same time. After all, I received the mail.
But the result page seems blank. May you can check it for me?
December 22nd, 2007 at 14:19
Of course I’ll check.
The page opened for me, but I do believe there can be problems – we’re using two separate servers (interface & workhorse), which are physically quite distant and connected only via some 13-hop public internet channels which can be slow at times. I’ll try to negotiate more reliable single-server collocation at my institute.
I sent the file to you via email (or you can try again to see if it works from the site). I also significantly extended the help page, which now explains the format and meaning of the results file, and also some other important things about the functioning of COTRASIF. Please pay attention to the “duplicate lines” problem described on the help page – I considered that issue resolved until I had a look at your results file. So thanks for you help!
Feedback, criticism and suggestions are welcome – you may use both my email and contact page for replies.
December 17th, 2008 at 11:40
Hello,Bogdan,I search in the internet and want to find a tool could translate PFM into PWM,this page is useful.
But I have a little and stupid question in the formula:
w = log2 ( ( f + sqrt(N) * p ) / ( N + sqrt(N) ) / p )
and your example:
-1.1265 = log2((1+0.25*sqrt(13))/(13+sqrt(13))/0.25)
I calculate that with my pen and OFFICE excel
((1+0.25*sqrt(13)) = 1.9013878
(13+sqrt(13))/0.25) = 66.4222051
SO
log2((1+0.25*sqrt(13))/(13+sqrt(13))/0.25) = log2(0.0286258) = -5.1265409 not -1.1265
And I found if (13+sqrt(13))/0.25) change to (13+sqrt(13))*0.25)
then
(13+sqrt(13))*0.25) = 4.1513878
final answer will be log2(1.9013878/4.1513878) = -1.1265409
I am confused about this,please tell me what’s wrong with me…
However, sorry for my poor english.
December 17th, 2008 at 13:38
Hi Steven,
there is nothing wrong with you. Fortunately, there seems to be nothing wrong with me either.
For the example we are discussing:
and
If I calculate that as w = log2 ( 1.901387819 / (16.605551275 / 0.25) ) – note additional braces – then I’d get -5.12654089.
But if I calculate exactly in the order written, as w = log2 ( (1.901387819 / 16.605551275) / 0.25 ) – braces added for clarity – then I get -1.12654089.
So, on the one hand, the error in your calculation was to group 16.605551275 / 0.25, which is ( N + sqrt(N) ) / p), although it is not grouped in the formula; the error stems from the incorrect order of operations. On the other hand, your finding of (13+sqrt(13))*0.25) = 4.1513878 and further correct results makes sense:
So w = (x/y)/z = x/(y*z), log2 ( 1.901387819 / 16.605551275 / 0.25 ) = log2 ( 1.901387819 / (16.605551275 * 0.25) ).
December 17th, 2008 at 13:45
Steven,
post has links to TFBS Perl module, which AFAIK has PFM2PWM conversion.
December 18th, 2008 at 5:36
Thank you very much Bodan for help me finding the scotoma ,I am a newbie in perl,Because I read the English document very slow,so there is long way to go,however,thanks for your information~~
December 8th, 2009 at 12:18
Thanks for the formula to calculate the weight. My question is once one convert the pfm to pwm how do we use the pwm to find binding sites. Everyone talks about using PSSM to find binding site but how to we use PSSM. I have managed to convert my PFM to PWM thanks to your formula. I have my PSSM, how do I use it to locate the binding sites?
Your assistance is and will be grealty appreciated
December 8th, 2009 at 13:09
I assume you know that PSSM (Position-Specific Scoring Matrix) is just an alternative name of PWMs. It appears that there is now an article on the subject here.
Generally, one should first look for existing tools to perform the needed task. You may find this supplement, briefly comparing several existing tools, helpful in identifying the tool you need.
If you do not need an existing tool, but rather the algorithm/schema of the search, then reading an aforementioned review by Wasserman, 2004 (comments #2 and #6 on this page) will help.
Let me know if you have more questions.
December 8th, 2009 at 14:49
Thanks for your answer. I want to understand how to find TFBS using PWM. The best way for me to learn is to implement it myself that is why I am doing the implementation step by step.
To be more specific. I wrote a c program that reads a file containing 14 lines of nucleotides. Each line has 50 nucleotides sequences. My c program reads the file and generate the PFM then after that I convert the PFM to PWM. Now what is the next step? what do I do with the PWM? How do I use it to find TFBS. I am a bit new in the field and I am still learning. Sorry if my questions sound stupid.
I can email you the c code (ANSI C) and the file containing the sequences maybe it can help to understand my concern
Thanks.
Regards,
A
December 8th, 2009 at 15:14
Assuming your program works properly (I haven’t checked that):
1. Do you really need a 50-long PWM? Your nucleotides.seq does not even distantly resemble a set of aligned conserved sequences, which are usually used to construct PFMs/PWMs. You could be mixing the concepts of “input sequence” and “sequences to construct PWM from”.
2. Assuming you really want a low-Ic 50-long PWM: to perform a search, you will now need to read the sequence you want to search in, doing so in 50-long chunks, and score each chunk with your PWM – by adding up individual row-column scores of matching nucleotides. You really want to see Wasserman, 2004, for a figure explaining how this is done. Then you will need to normalize the absolute score, to get a number between 0 and 1, and then make a decision whether currently processed chunk is above the “found threshold” (“found cut-off”).
March 10th, 2012 at 9:06
am working with TF-DNA binding. i computed score using PWM (using same equations mentioned by Wasserman,2004).but in this link (http://web.mit.edu/8.592/www/lectures/lec18/proteinDNAenergetics.pdf) they have computed energy(kT) and they mentioned that they used PWM only. I tried a lot, i got equations for probabilities from energy. if i use the same am not able to compute energy as it is in that link. Can anyone please help me to find this out? Thanks in advance.
March 25th, 2012 at 23:39
@Bubby, their formula (7) for the weight matrix is a little different. Mentioning just in case you somehow haven’t noticed that.
July 4th, 2013 at 12:58
Hello everyone here, I come across to your open discussion on PFM2PWM problem. However, I have another (maybe stupid one) question to you guys below: a) how to transfer back from a PWM to its corresponding PFM? b) if this can be done, wondering if the solution in unique or not? c) why do we have to use PWM for binding sites detection? To me, this is questionable.
Looking forward to hearing from Bogdan and others. Thanks!
July 8th, 2013 at 20:57
@Justin
If you have a PWM, the exact formula used to calculate it, and the total number of sequences used to construct the PFM – you can easily (and uniquely) convert PWM to PFM. If you do not have the total number of sequences, then reconstructing PFM from PWM is still possible – at the very least using a simple brute-force approach; in this case the solution is not guaranteed to be identical to the original PFM. If you only have the PWM and nothing else, you could still try the brute-force approach, but now for the multitude of possible PFM2PWM formulas.
Regarding your c point, the simplest answer is that PWM in a single number combines both the frequency information and Ic (information content) for that position, automatically up- or down-weighting important/non-important positions. PFMs do not give you that.
July 8th, 2013 at 23:25
Thanks, Bogdan. I basically understand the PFM and PWM but sometimes confused by myself about something like where the PFM or PWM comes from without knowing the true binding sites on hands. As for c), my point is the use of these models for detecting binding sites is lack of theoretical foundation, that is, the nucleotides cannot be independent of each other which directly results in a wrong intepretation of the methodology.
July 8th, 2013 at 23:50
PFMs are simply summaries of the known binding sites – so I do not quite understand your “without knowing the true binding sites”.
Also, in one of the earlier works PWM values were shown to approximate the binding energies – here’s the theoretical foundation you might be looking for.
The “independent nucleotides” model does seem to work nicely, even though it does not cover/explain all of the possible DNA-protein interactions.