Homology and similarity

23rd October 2006

In bioinformatics and biology, the term “homology” is used quite often, and quite often it is misused. So what are “homology” and “similarity”, and how can one use these terms correctly?

Zoologists and botanists distinguish between homologous organs (of the same or very similar ancestry and anatomy, e.g. a bat’s wing and a human’s hand) and analogous organs (similar only in function, but different in ancestry and structure, e.g. a bat’s wing and a butterfly’s wing). Homologous organs are not necessarily similar (or at least the similarity may not be obvious); similar organs are not necessarily homologous.

In application to sequences (DNA, RNA, proteins), the concepts of homology and similarity are often used however the author prefers to use them (reference). Phrases like “sequence homology”, “structural homology”, “high homology”, “significant homology”, or even “35% homology” are clearly incorrect given the definition above. However, these phrases are common in scientific articles. “Sequence homology” has even found its way into the NLM’s Medical Subject Headings (MeSH) system (see here or here). This MeSH term had been assigned to approximately 150 000 articles in PubMed at the time of writing. Here, “homology” is used as a good-looking substitute for “similarity”.

One could argue that it might be left as it is: after all, words can change meaning when migrating between fields of knowledge. But in this case the notion of “common ancestry” is being lost, as “homology” boils down to simple “similarity”. Thus it is reasonable to speak about “sequence (structure) similarity”, from which “sequence (structure) homology” might be inferred.

Whether similar sequences are homologs depends on a number of factors. In short, the following criteria can be applied:

Similarity level, %. If it is 80% or more, there is a high chance that you are looking at homologs. However, similarity alone is not sufficient.
Statistical significance of the similarity (that is, how likely it is to be random). One can compute an E-value (expectation value): the number of hits with a score at least as good as the observed one that would be expected purely by chance in a search of that size. The lower the E-value, the less likely it is that the similarity is random. This approach is used, for example, by BLAST to show whether a sequence search against a database produced reliable similarity results. Usually, the shorter the sequence, the higher the probability of random similarity.
Patterns in the alignment of the similar sequences. For example, a short, highly similar stretch of sequence might be present in all of the investigated sequences. This is an indirect indication of true homology, in contrast to sequences that are “highly similar” but align badly.
Additional evidence. For example, when comparing DNA sequences, the protein sequences and 3D structures can also be taken into account. 3D structure, if available, is a valuable source of similarity information, as it largely defines protein function.
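
The first two criteria can be sketched in a few lines of Python. This is only an illustration, not how BLAST or any other tool actually implements them. The function names are made up for this example. Percent identity is counted here over gap-free alignment columns, which is just one of several possible conventions. The E-value follows the Karlin-Altschul form E = K * m * n * exp(-lambda * S); the values used for lambda, K, the per-match score and the database length are arbitrary placeholders, since the real ones depend on the scoring system.

```python
import math


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over alignment columns without gaps.

    Both inputs must already be aligned (equal length, gaps as `gap`).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def expected_hits(score: float, query_len: int, db_len: int,
                  lam: float, k: float) -> float:
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    print(percent_identity("ACGT-A", "ACCTTA"))  # 4 of 5 gap-free columns match

    # Placeholder parameters, NOT taken from any real scoring matrix.
    LAM, K, MATCH_SCORE, DB_LEN = 0.3, 0.1, 5, 1_000_000

    # A perfect 10-residue match versus a perfect 100-residue match.
    for length in (10, 100):
        e = expected_hits(MATCH_SCORE * length, length, DB_LEN, LAM, K)
        print(length, e)
```

Even a perfect match of a very short sequence reaches only a low score, so its E-value stays comparatively high. This is the “shorter sequence, higher chance of random similarity” effect mentioned above.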

That seems to be it for homology and similarity.
Use words wisely: they are not an infinite resource.

This entry was posted on Monday, October 23rd, 2006 at 17:39 and is filed under Bioinformatics, Science.


Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc
