Multiple Sequence Alignment is not a Solved Problem
Multiple sequence alignment is a basic procedure in molecular biology, and it
is often treated as an essentially solved computational problem. However,
this is not so; here I review the evidence for this claim and outline the
requirements for a solution. The goal of alignment is often stated as being to
juxtapose nucleotides (or their derivatives, such as amino acids) that have
been inherited from a common ancestral nucleotide (although other goals are
also possible). Unfortunately, this is not an operational definition, because
homology (in this sense) refers to unique and unobservable historical events,
and so there can be no objective mathematical function to optimize.
Consequently, almost all algorithms developed for multiple sequence alignment
are based on optimizing some sort of compositional similarity (similarity =
homology + analogy). As a result, many, if not most, practitioners either
manually modify computer-produced alignments or they perform de novo manual
alignment, especially in the field of phylogenetics. So, if homology is the
goal, then multiple sequence alignment is not yet a solved computational
problem. Several criteria have been developed by biologists to help them
identify potential homologies (compositional, ontogenetic, topographical and
functional similarity, plus conjunction and congruence), and these criteria can
be applied to molecular data, in principle. Current computer programs do
implement one (or occasionally two) of these criteria, but no program
implements them all. What is needed is a program that evaluates all of the
evidence for the sequence homologies, optimizes their combination, and thus
produces the best hypotheses of homology. This is basically an inference
problem, not an optimization problem.
Comment: 37 pages, 5 figures, 3 tables