Homology and similarity
23rd October 2006
In bioinformatics and biology, the term “homology” is used quite often, and quite often it is misused. So what are “homology” and “similarity”, and how can one use these terms correctly?
Zoologists and botanists make a distinction between homologous organs (of the same or very similar ancestry/anatomy, e.g. a bat’s wing and a human hand) and analogous organs (functionally similar, but different in ancestry/structure, e.g. a bat’s wing and a butterfly’s wing). Homologous organs are not necessarily similar (or at least the similarity may not be obvious), and similar organs are not necessarily homologous.
When applied to sequences (DNA, RNA, proteins), the concepts of homology and similarity are often used however the author prefers (reference). Phrases like “sequence homology”, “structural homology”, “high homology”, “significant homology”, or even “35% homology” are clearly incorrect given the definition above: two sequences either share a common ancestor or they do not, so homology cannot come in degrees. However, these phrases are common in scientific articles. “Sequence homology” has even found its way into the NLM’s Medical Subject Heading (MeSH) system (see here or here). This MeSH term has been assigned to approximately 150 000 articles in PubMed (at the time of writing). Here, “homology” is used as a good-looking substitute for “similarity”.
One could argue that it might be left as it is; after all, words can change meaning when migrating between fields of knowledge. But in this case the notion of “common ancestry” is lost, as “homology” boils down to mere “similarity”. It is therefore more reasonable to speak of “sequence (structure) similarity”, from which “sequence (structure) homology” might be inferred.
Whether similar sequences are homologs depends on a number of factors. In short, the following criteria can be applied:
- similarity level, %. If it is 80% or more, there is a high chance that you are looking at homologs. However, similarity alone is not sufficient.
- random similarity probability (or, better, statistical reliability of similarity). One can compute an e-value (expectation value), which is the number of matches scoring at least as well as the observed one that would be expected purely by chance in a search of that size. The closer it is to zero, the less likely it is that two unrelated sequences ended up this similar by accident. This approach is used, for example, by BLAST, to determine whether a sequence search against a database produced reliable similarity results. Usually, the shorter the sequence, the higher the probability of “random similarity”.
- patterns in the alignment. Aligning the similar sequences may reveal some regularities: for example, a short, highly similar stretch might be present in all the investigated sequences. This is indirect evidence of true homology, in contrast to sequences that are “highly similar” but align badly.
- additional evidence. For example, when comparing DNA sequences, the protein sequences and 3D structure can also be taken into account. 3D structure, if available, is a valuable source of similarity information, as it largely defines protein function.
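To make the first two criteria a bit more concrete, here is a minimal Python sketch. It is not how BLAST is implemented; it only illustrates the two numbers discussed above. The function names are made up for this post, percent identity is counted over non-gap columns only (one of several possible conventions), and the e-value uses the Karlin-Altschul form E = K·m·n·e^(−λS), with λ and K set to placeholder values that in practice depend on the scoring matrix and gap penalties.

```python
import math


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over the non-gap columns of an alignment.

    Both inputs must already be aligned, i.e. have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def expected_chance_hits(raw_score: float, query_len: int, db_len: int,
                         lam: float = 0.267, k: float = 0.041) -> float:
    """Karlin-Altschul style e-value: E = K * m * n * exp(-lambda * S).

    lam and k are illustrative placeholders (assumptions), not values
    to be used for a real search.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    print(percent_identity("ACGT-A", "ACCTTA"))
    # Same query and database size, two different alignment scores.
    print(expected_chance_hits(40, 100, 1_000_000))
    print(expected_chance_hits(120, 100, 1_000_000))
```

A short match cannot reach a high score, so its e-value stays large even when the percent identity looks impressive. That is why the percentage alone says little about homology.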
That seems to be it for homology and similarity.
Use words wisely, they are not an infinite resource