Aligning 415 519 proteins in less than two hours on PC
The rapid development of modern sequencing platforms has enabled an
unprecedented growth of protein family databases. The abundance of sets
composed of hundreds of thousands of sequences is a great challenge for
multiple sequence alignment algorithms. In this article we introduce FAMSA, a
new progressive
algorithm designed for fast and accurate alignment of thousands of protein
sequences. Its features include the use of the longest common subsequence
measure for determining pairwise similarities, a novel method of evaluating gap
costs, and a new iterative refinement scheme. Importantly, its
implementation is highly optimised and parallelised to make the most of modern
computer platforms. Thanks to the above, quality indicators, namely
sum-of-pairs and total-column scores, show FAMSA to be superior to competing
algorithms like Clustal Omega or MAFFT for datasets exceeding a few thousand
sequences. This quality does not come at the expense of time and memory
requirements, which are an order of magnitude lower than those of existing
solutions. For example, a family of 415 519 sequences was analysed in less than
two hours and required only 8 GB of RAM.
FAMSA is freely available at http://sun.aei.polsl.pl/REFRESH/famsa
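
To illustrate the longest common subsequence (LCS) measure mentioned in the abstract, the following is a minimal sketch of how an LCS length could be turned into a pairwise similarity. It is not FAMSA's implementation: the abstract does not specify the normalisation, so dividing by the length of the shorter sequence is an assumption made here, and the function names `lcs_length` and `lcs_similarity` are hypothetical. The plain dynamic-programming loop is also far slower than a highly optimised, parallelised implementation would be.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (textbook DP)."""
    if len(b) > len(a):
        a, b = b, a  # iterate over the longer string, keep the DP row short
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching residues extend the LCS of both prefixes by one.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two shorter prefixes.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalisation: LCS length over the shorter sequence length."""
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0
```

In a progressive aligner, similarities of this kind would typically be computed for sequence pairs and used to build the guide tree that determines the order in which sequences are merged.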