Optimization of Consistency-Based Multiple Sequence Alignment using Big Data technologies
With the advent of new high-throughput next-generation sequencing
technologies, the volume of genetic data processed has increased significantly.
It is becoming essential for these applications to achieve large-scale
alignments with thousands of sequences or even whole genomes. However, all
current MSA tools have exhibited scalability issues when the number of sequences
increases. The main drawback of these methods is that errors made
in early pairwise alignments are propagated to the final result, affecting the
accuracy of the global alignment. The use of consistency information improves
the final result and makes its accuracy more stable.
However, such methods are severely limited by the memory
required to store the consistency information. In a previous work, the authors
analyzed the structure and distribution of the data stored in the constraint
library and demonstrated that it could be reduced without losing
accuracy, making it possible to increase the number of sequences to
be aligned. However, the execution time for obtaining the constraint library
for a larger number of sequences also increases greatly. In the present paper,
the authors apply Big Data technologies to take advantage of the high degree
of parallelism provided by the MapReduce paradigm in order to considerably
reduce the library calculation time. Moreover, Big Data infrastructure
provides a distributed storage system to improve the library scalability and
machine-learning algorithms to enhance the consistency selection policies.

This work has been supported by the MEyC-Spain under contracts TIN2014-53234-C2-2-R, TIN2017-84553-C2-2-R and TIN2016-81840-REDT.