Many machine learning (ML) models for protein function and structure
prediction depend on evolutionary information, which is captured through
multiple-sequence alignments (MSAs) or position-specific scoring matrices
(PSSMs) generated by PSI-BLAST. Consequently, these predictive methods carry
substantial computational demands and long running times. The principal
bottleneck is that PSI-BLAST must load large sequence databases sequentially
in batches and then search them for sequences that align to a given query
sequence. For batch queries, the runtime scales linearly with the number of
queries. The
problem is becoming more severe as bio-sequence data repositories grow
exponentially over time, which places a proportional strain on the runtime of
PSI-BLAST. One promising remedy is MMseqs2, which can speed up the search by
a factor of 100. However, MMseqs2 cannot directly produce the final output in
the desired PSI-BLAST format of alignments and PSSM profiles. In this
work, I developed a robust, optimized, and efficient hybrid alignment
pipeline that integrates MMseqs2 and PSI-BLAST. The hybrid tool generates
sequence alignment profiles two orders of magnitude faster than PSI-BLAST.
It is implemented in C++ and is freely available under the MIT
license at https://github.com/issararab/EPSAPG.
Comment: 10th IEEE/ACM International Conference on Big Data Computing,
Applications and Technologie