Novel algorithms for motif discovery in bio-sequence datasets

Abstract

The significant growth in the volume of bio-molecular sequence data (DNA, RNA, and protein sequences) over the past decade calls for novel computational techniques to extract meaningful information from such data. Existing methods predominantly identify patterns, or motifs: for example, repeated substrings of bio-sequences, conserved substrings in a group of homologous protein sequences, or similar substrings in a set of DNA sequences. Identifying such motifs has applications in, among other areas, understanding gene function, understanding human disease, and identifying potential therapeutic drug targets. Several variants of the motif discovery problem appear in the literature, and numerous algorithms have been proposed for them. In this research work, we propose novel algorithms, significantly different from the techniques adopted by existing algorithms, to address salient problems in molecular biology that require discovering motifs in a set of bio-sequences. The proposed algorithms employ basic sorting techniques and simple data structures such as arrays and linked lists, and have been shown to perform better in practice than many previously known algorithms when applied to synthetic and real biological datasets.
