Alignment, Clustering and Extraction of Structured Motifs in DNA Promoter Sequences

Abstract

A simple motif is a short DNA sequence found in the promoter region and believed to act as a binding site for a transcription factor protein. A structured motif is a sequence of simple motifs (boxes) separated by short sequences (gaps). Biologists theorize that the presence of these motifs play a key role in gene expression regulation. Discovering these patterns is an important step towards understanding protein-gene and gene-gene interaction thus facilitates the building of accurate gene regulatory network models. DNA sequence motif extraction is an important problem in bioinformatics. Many studies have proposed algorithms to solve the problem instance of simple motif extraction. Only in the past decade has the more complex structured motif extraction problem been examined by researchers. The problem is inherently challenging as structured motif patterns are segmented into several boxes separated by variable size gaps for each instance. These boxes may not be exact copies, but may have multiple mismatched positions. The challenge is extenuated by the lack of resources for real datasets covering a wide range of possible cases. Also, incomplete annotation of real data leads to the discovery of unknown motifs that may be regarded as false positives. Furthermore, current algorithms demand unreasonable amount of prior knowledge to successfully extract the target pattern. The contributions of this research are four new algorithms. First, SMGenerate generates simulated datasets of implanted motifs that covers a wide range of biologically possible cases. Second, SMAlign aligns a pair of structured motifs optimally and efficiently given their gap constraints. Third, SMCluster produces multiple alignment of structured motifs through hierarchical clustering using SMAlign\u27s affinity score. Finally, SMExtract extracts structured motifs from a set of sequences by using SMCluster to construct the target pattern from the top reported two-box patterns (fragments), extracted using an existing algorithm (Exmotif) and a two-box template. The main advantage of SMExtract is its efficiency to extract longer degenerate patterns while requiring less prior knowledge, about the pattern to be extracted, than current algorithms

    Similar works