Also, in one of the earlier works PWM values were shown to approximate the binding energies – here’s the theoretical foundation you might be looking for.

The “independent nucleotides” model does seem to work nicely, even though it does not cover/explain all of the possible DNA-protein interactions.

]]>If you have a PWM, the exact formula used to calculate it, and the total number of sequences used to construct the PFM – you can easily (and uniquely) convert PWM to PFM. If you do not have the total number of sequences, then reconstructing PFM from PWM is still possible – at the very least using a simple brute-force approach; in this case the solution is not guaranteed to be identical to the original PFM. If you only have the PWM and nothing else, you could still try the brute-force approach, but now for the multitude of possible PFM2PWM formulas.
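When the formula and the total number of sequences are both known, the conversion back is a plain algebraic rearrangement. Below is a minimal sketch, assuming the PFM2PWM formula quoted elsewhere in this thread, w = log2 ( ( f + sqrt(N) * p ) / ( N + sqrt(N) ) / p ), with a uniform background p = 0.25. The function names and the example column are hypothetical, chosen only for illustration.

```python
import math

def pfm_to_pwm(f, n, p=0.25):
    """Weight for a raw count f out of n sequences (assumed formula)."""
    return math.log2((f + math.sqrt(n) * p) / (n + math.sqrt(n)) / p)

def pwm_to_pfm(w, n, p=0.25):
    """Invert the formula above to recover the count from a weight."""
    f = (2.0 ** w) * p * (n + math.sqrt(n)) - math.sqrt(n) * p
    return round(f)  # counts are integers; rounding removes float noise

# Round trip for one column of a hypothetical 13-sequence PFM.
column = {"A": 1, "C": 0, "G": 10, "T": 2}
weights = {base: pfm_to_pwm(count, 13) for base, count in column.items()}
recovered = {base: pwm_to_pfm(w, 13) for base, w in weights.items()}
```

If N is unknown, the same inversion could be tried for each candidate N, keeping the values for which every recovered count is close to an integer and each column sums to N. That is the brute-force idea, and it may yield more than one candidate.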

Regarding your **c** point, the simplest answer is that a PWM combines, in a single number, both the frequency information and the Ic (information content) for that position, automatically up- or down-weighting important/non-important positions. PFMs do not give you that.

Looking forward to hearing from Bogdan and others. Thanks!

]]>1. Do you really need a 50-long PWM? Your nucleotides.seq does not even distantly resemble a set of aligned conserved sequences, which are usually used to construct PFMs/PWMs. You could be mixing the concepts of “input sequence” and “sequences to construct PWM from”.

2. Assuming you really want a low-Ic 50-long PWM: to perform a search, you will now need to read the sequence you want to search in, doing so in 50-long chunks, and score each chunk with your PWM by adding up the individual row-column scores of the matching nucleotides. You really want to see Wasserman, 2004, for a figure explaining how this is done. Then you will need to normalize the absolute score to get a number between 0 and 1, and then decide whether the currently processed chunk is above the “found threshold” (“found cut-off”).
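The search in point 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the window is assumed to slide one nucleotide at a time, the normalization is assumed to be (score - min) / (max - min) over the best and worst possible PWM scores, and the 0.75 cut-off, the toy matrix and all names are hypothetical.

```python
def score_window(pwm, window):
    """Sum the PWM entries matching each nucleotide of the window."""
    return sum(pwm[i][base] for i, base in enumerate(window))

def scan(pwm, sequence, cutoff=0.75):
    """Slide over the sequence; return (position, relative score) hits."""
    width = len(pwm)
    # Lowest and highest scores any window could reach with this PWM.
    min_score = sum(min(col.values()) for col in pwm)
    max_score = sum(max(col.values()) for col in pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        raw = score_window(pwm, sequence[start:start + width])
        # Assumed normalization: rescale the raw score to the 0..1 range.
        relative = (raw - min_score) / (max_score - min_score)
        if relative >= cutoff:
            hits.append((start, relative))
    return hits

# Hypothetical 3-column PWM (one dict of weights per position).
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]
hits = scan(toy_pwm, "TTACGTT")
```

The sketch assumes the sequence contains only A, C, G and T, and that the PWM is not flat (max and min scores differ). A real tool would also scan the reverse strand.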

]]>To be more specific: I wrote a C program that reads a file containing 14 lines of nucleotides. Each line is a sequence of 50 nucleotides. My C program reads the file and generates the PFM, and after that I convert the PFM to a PWM. Now what is the next step? What do I do with the PWM? How do I use it to find TFBSs? I am a bit new to the field and I am still learning. Sorry if my questions sound stupid.

I can email you the C code (ANSI C) and the file containing the sequences; maybe it can help you to understand my concern.

Thanks.

Regards,

A

]]>Generally, one should first look for existing tools to perform the needed task. You may find this supplement, briefly comparing several existing tools, helpful in identifying the tool you need.

If you do not need an existing tool, but rather the algorithm/schema of the search, then reading the aforementioned review by Wasserman, 2004 (comments #2 and #6 on this page) will help.

Let me know if you have more questions.

]]>Your assistance is and will be greatly appreciated.

]]>post has links to TFBS Perl module, which AFAIK has PFM2PWM conversion.

]]>there is nothing wrong with you. Fortunately, there seems to be nothing wrong with me either.

For the example we are discussing:

- w = log2 ( ( f + sqrt(N) * p ) / ( N + sqrt(N) ) / p )

and

- w = log2 ( 1.901387819 / 16.605551275 / 0.25 )

If I calculate that as w = log2 ( 1.901387819 / (16.605551275 / 0.25) ) – note the additional parentheses – then I’d get -5.12654089.

But if I calculate exactly in the order written, as w = log2 ( (1.901387819 / 16.605551275) / 0.25 ) – parentheses added for clarity – then I get -1.12654089.

So, on the one hand, the error in your calculation was to group 16.605551275 / 0.25, which is ( N + sqrt(N) ) / p, although it is **not** grouped in the formula; the error stems from an incorrect order of operations. On the other hand, your finding that (13+sqrt(13))*0.25 = 4.1513878 leads to the correct result makes sense:

- w = (x/y)/z (this is the correct order of calculations, not w = x/(y/z) ),

- w = (x/y)*1/z

- w = x/(y*z)

So w = (x/y)/z = x/(y*z), log2 ( 1.901387819 / 16.605551275 / 0.25 ) = log2 ( 1.901387819 / (16.605551275 * 0.25) ).
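The three groupings can be checked numerically. The short sketch below uses only the numbers already quoted in this comment; the variable names are illustrative.

```python
import math

x, y, z = 1.901387819, 16.605551275, 0.25

as_written = math.log2(x / y / z)    # evaluated left to right: (x / y) / z
equivalent = math.log2(x / (y * z))  # same value, different grouping
misgrouped = math.log2(x / (y / z))  # the erroneous grouping

print(round(as_written, 4), round(equivalent, 4), round(misgrouped, 4))
# prints -1.1265 -1.1265 -5.1265
```

The two results differ by exactly 4 because the wrong grouping divides by z twice, and log2(1 / 0.25²) = 4.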

]]>But I have a small and perhaps stupid question about the formula:

w = log2 ( ( f + sqrt(N) * p ) / ( N + sqrt(N) ) / p )

and your example:

-1.1265 = log2((1+0.25*sqrt(13))/(13+sqrt(13))/0.25)

I calculated that with my pen and Office Excel:

(1+0.25*sqrt(13)) = 1.9013878

(13+sqrt(13))/0.25 = 66.4222051

So

log2((1+0.25*sqrt(13))/(13+sqrt(13))/0.25) = log2(0.0286258) = -5.1265409 not -1.1265

And I found that if (13+sqrt(13))/0.25 is changed to (13+sqrt(13))*0.25

then

(13+sqrt(13))*0.25 = 4.1513878

and the final answer will be log2(1.9013878/4.1513878) = -1.1265409

I am confused about this, please tell me what’s wrong with me…

Also, sorry for my poor English.

]]>The page opened for me, but I do believe there can be problems – we’re using two separate servers (interface & workhorse), which are physically quite distant and connected only via some 13-hop public internet channels which can be slow at times. I’ll try to negotiate more reliable single-server collocation at my institute.

I sent the file to you via email (or you can try again to see if it works from the site). I also significantly extended the help page, which now explains the format and meaning of the results file, and also some other important things about the functioning of COTRASIF. Please pay attention to the “duplicate lines” problem described on the help page – I considered that issue resolved until I had a look at your results file. So thanks for your help!

Feedback, criticism and suggestions are welcome – you may use both my email and contact page for replies.

]]>Yes, the result email was delayed by about 2 days, and the “submitted” notification and the “finished” notification were delivered at the same time. In the end, I did receive the mail.

But the result page seems to be blank. Could you check it for me?

]]>did you get the link to results file from the task you submitted? As noted on the task submission page, gmail and yahoo sometimes either reject or delay for several days the delivery of emails from our processing server.

Your task was completed 5 minutes after submission, but I wonder if it got through to your inbox.

]]>there’s some ambiguity in your question on PWM similarity threshold.

If you are interested in comparing PWM matrices, then I’d suggest Similarity of position frequency matrices for transcription factor binding sites.

However, if you mean the matrix-to-sequence similarity (the threshold/cut-off problem, which arises when looking for TFBSs with a known PFM/PWM matrix) – then it’s a complicated issue. From what I had previously seen in the literature, 0.75 similarity (relative score) is often used. Based on my little research (look for Fig.1 and explanations in the text), for ISRE TFBS in *Rattus norvegicus* promoters 0.75 similarity includes some 2/3 of all the maximal-scoring ISREs in all rat promoters. Though I used the 0.8 similarity cut-off, now I think that 0.75 (or even 0.7, given enough post-processing) is much more favourable. (Note: Fig.1 in that PDF has some possibly important theoretical flaws, but it’s a fair representation of actual maximal matrix-sequence scores, obtained for promoters and exons in rat.)

I think that a TFBS search by itself, no matter how you optimize the threshold, will not give biologically valid results. (The only exception I can think of right now is developing some algorithm which would automatically adjust the cut-off individually for each searched promoter or even each individual searched sub-sequence; however, it’s unclear what the criteria for such an algorithm to adjust the cut-off should be.) Thus, the best approach would be to use the lowest meaningful cut-off, and then just apply a series of filters (post-processors), which would refine the results set. One of the approaches to do that is to somehow employ phylogenetic information and evolutionary sequence conservation/divergence.

The UCSC link you gave me tries to do just that. (By the way, the only interesting thing in their calculations of PWM-sequence scores is the calculation of Z-score – this I haven’t met before, and that’s something to evaluate.)

Actually, I’m nearly done developing a genome-wide TFBS finder web-tool (COTRASIF), which also relies on the inter-species evolutionary conservation of sequences – but I do that somewhat differently from the method described at UCSC. I’ll make an official “COTRASIF opening” post in the nearest future, when all the initial features will be complete, and there will be a sufficient description for the tool.

Meanwhile, if you are interested, you may join the development of COTRASIF. This isn’t a paying job (the project, at least currently, is not commercial), but it just might fit your interests. And there are huge and challenging plans for the future (including additional results filtering by the DNA 3D-structure… but psst, I didn’t say that!)

]]>Thanks for your reply. I’ve seen the use of ln() at WITA, which simply sets the pseudocount to 1. After reading your reply, I think there is no significant difference between log2 and ln() as long as all the matrices use the same base.

By the way, do you have any good ideas on how to determine the threshold of the PWM similarity? The genome.UCSC uses an interesting statistical method; what do you think of this problem?

]]>it should be log2(). For the explanation why, please see the “Bioinformatics” book by David W. Mount (section on PSSM information content).

In short, log2() is used to determine the uncertainty (entropy) and information content, thus it is also used for the PFM2PWM conversion.

However, in the resources cited in the post both base 2 and base e logarithms are used (see e.g. Perl modules documentation for TFBS 0.5 and Jason’s slide 5, which, in turn, cites Wasserman, 2004; actually, the only reference to using ln() is in Hertz, 1999).
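As a side illustration of how the two bases relate: ln(x) = log2(x) * ln(2), so a matrix built with ln() is the log2() matrix multiplied by a constant, and scores rescaled to the 0..1 range come out the same. The ratios below are hypothetical values, not taken from any of the cited resources.

```python
import math

# Hypothetical probability ratios (corrected frequency / background).
ratios = [0.46, 1.38, 2.61, 0.22]

pwm_log2 = [math.log2(r) for r in ratios]
pwm_ln = [math.log(r) for r in ratios]

def relative(scores):
    """Rescale to 0..1 between the lowest and highest value."""
    low, high = min(scores), max(scores)
    return [(s - low) / (high - low) for s in scores]

# Each ln-based weight is the log2-based weight times ln(2).
scale = [b / a for a, b in zip(pwm_log2, pwm_ln)]
rel_log2 = relative(pwm_log2)
rel_ln = relative(pwm_ln)
```

The base therefore matters for reading weights as bits of information, but not for ranking or thresholding by relative score, provided one base is used consistently.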

]]>