Comments on: Position Frequency Matrix to Position Weight Matrix (PFM2PWM)

By: Bogdan

Bogdan — Mon, 08 Jul 2013 21:50:23 +0000

PFMs are simply summaries of the known binding sites – so I do not quite understand your “without knowing the true binding sites”.

Also, in one of the earlier works PWM values were shown to approximate the binding energies – here’s the theoretical foundation you might be looking for.

The “independent nucleotides” model does seem to work nicely, even though it does not cover/explain all of the possible DNA-protein interactions.

By: Justin

Justin — Mon, 08 Jul 2013 21:25:10 +0000

Thanks, Bogdan. I basically understand the PFM and PWM but sometimes confused by myself about something like where the PFM or PWM comes from without knowing the true binding sites on hands. As for c), my point is the use of these models for detecting binding sites is lack of theoretical foundation, that is, the nucleotides cannot be independent of each other which directly results in a wrong intepretation of the methodology.

By: Bogdan

Bogdan — Mon, 08 Jul 2013 18:57:10 +0000

@Justin

If you have a PWM, the exact formula used to calculate it, and the total number of sequences used to construct the PFM – you can easily (and uniquely) convert PWM to PFM. If you do not have the total number of sequences, then reconstructing PFM from PWM is still possible – at the very least using a simple brute-force approach; in this case the solution is not guaranteed to be identical to the original PFM. If you only have the PWM and nothing else, you could still try the brute-force approach, but now for the multitude of possible PFM2PWM formulas.

Regarding your c point, the simplest answer is that PWM in a single number combines both the frequency information and Ic (information content) for that position, automatically up- or down-weighting important/non-important positions. PFMs do not give you that.

By: Justin

Justin — Thu, 04 Jul 2013 10:58:07 +0000

Hello everyone here, I come across to your open discussion on PFM2PWM problem. However, I have another (maybe stupid one) question to you guys below: a) how to transfer back from a PWM to its corresponding PFM? b) if this can be done, wondering if the solution in unique or not? c) why do we have to use PWM for binding sites detection? To me, this is questionable.

Looking forward to hearing from Bogdan and others. Thanks!

By: Bogdan

Bogdan — Sun, 25 Mar 2012 21:39:57 +0000

@Bubby, their formula (7) for the weight matrix is a little different. Mentioning just in case you somehow haven’t noticed that.

By: bubby

bubby — Sat, 10 Mar 2012 07:06:07 +0000

am working with TF-DNA binding. i computed score using PWM (using same equations mentioned by Wasserman,2004).but in this link (http://web.mit.edu/8.592/www/lectures/lec18/proteinDNAenergetics.pdf) they have computed energy(kT) and they mentioned that they used PWM only. I tried a lot, i got equations for probabilities from energy. if i use the same am not able to compute energy as it is in that link. Can anyone please help me to find this out? Thanks in advance.

By: Bogdan

Bogdan — Tue, 08 Dec 2009 13:14:05 +0000

Assuming your program works properly (I haven’t checked that):

1. Do you really need a 50-long PWM? Your nucleotides.seq does not even distantly resemble a set of aligned conserved sequences, which are usually used to construct PFMs/PWMs. You could be mixing the concepts of “input sequence” and “sequences to construct PWM from”.

2. Assuming you really want a low-Ic 50-long PWM: to perform a search, you will now need to read the sequence you want to search in, doing so in 50-long chunks, and score each chunk with your PWM – by adding up individual row-column scores of matching nucleotides. You really want to see Wasserman, 2004, for a figure explaining how this is done. Then you will need to normalize the absolute score, to get a number between 0 and 1, and then make a decision whether currently processed chunk is above the “found threshold” (“found cut-off”).

By: abraham

abraham — Tue, 08 Dec 2009 12:49:37 +0000

Thanks for your answer. I want to understand how to find TFBS using PWM. The best way for me to learn is to implement it myself that is why I am doing the implementation step by step.

To be more specific. I wrote a c program that reads a file containing 14 lines of nucleotides. Each line has 50 nucleotides sequences. My c program reads the file and generate the PFM then after that I convert the PFM to PWM. Now what is the next step? what do I do with the PWM? How do I use it to find TFBS. I am a bit new in the field and I am still learning. Sorry if my questions sound stupid.

I can email you the c code (ANSI C) and the file containing the sequences maybe it can help to understand my concern
Thanks.
Regards,

By: Bogdan

Bogdan — Tue, 08 Dec 2009 11:09:40 +0000

I assume you know that PSSM (Position-Specific Scoring Matrix) is just an alternative name of PWMs. It appears that there is now an article on the subject here. Generally, one should first look for existing tools to perform the needed task. You may find this supplement, briefly comparing several existing tools, helpful in identifying the tool you need. If you do not need an existing tool, but rather the algorithm/schema of the search, then reading an aforementioned review by Wasserman, 2004 (comments #2 and #6 on this page) will help. Let me know if you have more questions.

By: abraham

abraham — Tue, 08 Dec 2009 10:18:16 +0000

Thanks for the formula to calculate the weight. My question is once one convert the pfm to pwm how do we use the pwm to find binding sites. Everyone talks about using PSSM to find binding site but how to we use PSSM. I have managed to convert my PFM to PWM thanks to your formula. I have my PSSM, how do I use it to locate the binding sites?

Your assistance is and will be grealty appreciated

By: Steven Chou

Steven Chou — Thu, 18 Dec 2008 03:36:46 +0000

Thank you very much Bodan for help me finding the scotoma ,I am a newbie in perl,Because I read the English document very slow,so there is long way to go,however,thanks for your information~~

By: Bogdan

Bogdan — Wed, 17 Dec 2008 11:45:02 +0000

Steven,

post has links to TFBS Perl module, which AFAIK has PFM2PWM conversion.

By: Bogdan

Bogdan — Wed, 17 Dec 2008 11:38:12 +0000

Hi Steven,

there is nothing wrong with you. Fortunately, there seems to be nothing wrong with me either.

For the example we are discussing:

w = log2 ( ( f + sqrt(N) * p ) / ( N + sqrt(N) ) / p )

and

w = log2 ( 1.901387819 / 16.605551275 / 0.25 )

If I calculate that as w = log2 ( 1.901387819 / (16.605551275 / 0.25) ) – note additional braces – then I’d get -5.12654089.
But if I calculate exactly in the order written, as w = log2 ( (1.901387819 / 16.605551275) / 0.25 ) – braces added for clarity – then I get -1.12654089.

So, on the one hand, the error in your calculation was to group 16.605551275 / 0.25, which is ( N + sqrt(N) ) / p), although it is not grouped in the formula; the error stems from the incorrect order of operations. On the other hand, your finding of (13+sqrt(13))*0.25) = 4.1513878 and further correct results makes sense:

w = (x/y)/z (this is the correct order of calculations, not w = x/(y/z) ),

w = (x/y)*1/z

w = x/(y*z)

So w = (x/y)/z = x/(y*z), log2 ( 1.901387819 / 16.605551275 / 0.25 ) = log2 ( 1.901387819 / (16.605551275 * 0.25) ).

By: Steven Chou

Steven Chou — Wed, 17 Dec 2008 09:40:09 +0000

Hello,Bogdan,I search in the internet and want to find a tool could translate PFM into PWM,this page is useful.
But I have a little and stupid question in the formula:

w = log2 ( ( f + sqrt(N) * p ) / ( N + sqrt(N) ) / p )

and your example:

-1.1265 = log2((1+0.25*sqrt(13))/(13+sqrt(13))/0.25)

I calculate that with my pen and OFFICE excel

((1+0.25*sqrt(13)) = 1.9013878

(13+sqrt(13))/0.25) = 66.4222051

SO
log2((1+0.25*sqrt(13))/(13+sqrt(13))/0.25) = log2(0.0286258) = -5.1265409 not -1.1265

And I found if (13+sqrt(13))/0.25) change to (13+sqrt(13))*0.25)

then

(13+sqrt(13))*0.25) = 4.1513878

final answer will be log2(1.9013878/4.1513878) = -1.1265409

I am confused about this,please tell me what’s wrong with me…

However, sorry for my poor english.

By: Bogdan

Bogdan — Sat, 22 Dec 2007 12:19:01 +0000

Of course I'll check. The page opened for me, but I do believe there can be problems - we're using two separate servers (interface & workhorse), which are physically quite distant and connected only via some 13-hop public internet channels which can be slow at times. I'll try to negotiate more reliable single-server collocation at my institute. I sent the file to you via email (or you can try again to see if it works from the site). I also significantly extended the help page, which now explains the format and meaning of the results file, and also some other important things about the functioning of COTRASIF. Please pay attention to the "duplicate lines" problem described on the help page - I considered that issue resolved until I had a look at your results file. So thanks for you help! Feedback, criticism and suggestions are welcome - you may use both my email and contact page for replies.

By: realzhang

realzhang — Sat, 22 Dec 2007 07:33:56 +0000

to Bogdan:

Yes, the result email was delayed about 2 days, and the “submitted” notification and “finished” notification were dilivered at the same time. After all, I received the mail.

But the result page seems blank. May you can check it for me?

By: Bogdan

Bogdan — Fri, 21 Dec 2007 06:51:53 +0000

Realzhang,

did you get the link to results file from the task you submitted? As noted on the task submission page, gmail and yahoo sometimes either reject or delay for several days the delivery of emails from our processing server.

Your task was complete 5 minutes after submission, but I wonder if it got through to your inbox.

By: Bogdan

Bogdan — Wed, 19 Dec 2007 16:07:45 +0000

Realzhang,

there’s some ambiguity in your question on PWM similarity threshold.

If you are interested in comparing PWM matrices, then I’d suggest Similarity of position frequency matrices for transcription factor binding sites.

However, if you mean the matrix-to-sequence similarity (the threshold/cut-off problem, which arises when looking for TFBSs with a known PFM/PWM matrix) – then it’s a complicated issue. From what I had previously seen in literature, 0.75 similarity (relative score) is often used. Based on my little research (look for Fig.1 and explanations in the text), for ISRE TFBS in rattus norvegicus promoters 0.75 similarity includes some 2/3 of all the maximal-scoring ISREs in all rat promoters. Though I used the 0.8 similarity cut-off, now I think that 0.75 (or even 0.7, given enough post-processing) is much more favourable. (Note: Fig.1 in that PDF has some possibly important theoretical flaws, but it’s a fair representation of actual maximal matrix-sequence scores, obtained for promoters and exons in rat.)

I think that TFBS search itself, no matter how you optimize the threshold, will not give biologically valid results. (The only exception I can think of right now is developing some algorithm which would automatically adjust the cut-off individually for each searched promoter or even each individual searched sub-sequence; however, it’s unclear what should be the criteria for such an algorithm to adjust the cut-off.) Thus, the best approach would be to use the lowest meaningful cut-off, and then just apply a series of filters (post-processors), which would refine the results set. One of the approaches to do that is to somehow employ phylogenetic information and evolutionary sequences conservation/divergence.

The UCSC link you gave me tries to do just that. (By the way, the only interesting thing in their calculations of PWM-sequence scores is the calculation of Z-score – this I haven’t met before, and that’s something to evaluate.)

Actually, I’m nearly done developing a genome-wide TFBS finder web-tool (COTRASIF), which also relies on the inter-species evolutionary conservation of sequences – but I do that somewhat differently from the method described at UCSC. I’ll make an official “COTRASIF opening” post in the nearest future, when all the initial features will be complete, and there will be a sufficient description for the tool.

Meanwhile, if you are interested, you may join the development of COTRASIF. This isn’t a paying job (the project, at least currently, is not commercial), but it just might fit your interests. And there are huge and challenging plans for future (including additional results filtering by the DNA 3D-structure…. but psst, I didn’t say that!)

By: realzhang

realzhang — Wed, 19 Dec 2007 15:11:50 +0000

Bogdan: Thanks for your reply. I've seen the use of ln() at WITA, which just let the pseudocount=1. After reading your reply, I think thers is no significant difference between the log2 and ln() if all the matrices use the same base. By the way, do you have any good idea on how to determin the threshold of the PWM similarity? The genome.UCSC uses a interesing statistical method, and how do you think of this problem?

By: Bogdan

Bogdan — Tue, 18 Dec 2007 12:58:06 +0000

Realzhang, it should be log₂(). For the explanation why, please see the "Bioinformatics" book by David W. Mount (section on PSSM information content). In short, log₂() is used to determine the uncertainty (entropy) and information content, thus it is also used for the PFM2PWM conversion. However, in the resources cited in the post both base 2 and base e logarithms are used (see e.g. Perl modules documentation for TFBS 0.5 and Jason's slide 5, which, in turn, cites Wasserman, 2004; actually, the only reference to using ln() is in Hertz, 1999).