162 research outputs found

    Probabilistic Model and Algorithms Design for Motif Detection

    Get PDF
    In this thesis, a natural probabilistic model has been used to test the quality of motif discovery programs. In this model, there are k background sequences, in which each character is a random character from Σ. Motif is a string G = g1g2 . . . gm. Each background sequence is implanted a probabilistically generated approximate copy of G. For each copy b1b2 . . . bm of G, every character bi is probabilistically generated such that the probability for bi 6= gi is at most α. Based on this model, two randomized algorithms ,one deterministic algorithm and one enumerative algorithm are designed, which can handle any motif patterns, and run much faster than those before, one can even run in sublinear time. These methods have been implemented in software

    Efficient motif finding algorithms for large-alphabet inputs

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>We consider the problem of identifying motifs, recurring or conserved patterns, in the biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings.</p> <p>Results</p> <p>The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of alphabet. On a synthetic planted DNA motif finding problem our algorithm is over 10Ă— more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. Improvements are orders of magnitude higher in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA) we observed reduction in running time of over 12Ă—, with high detection accuracy. The algorithm was also successful in rapidly identifying protein motifs in Lipocalin, Zinc metallopeptidase, and supersecondary structure motifs for Cadherin and Immunoglobin families.</p> <p>Conclusions</p> <p>Our algorithm reduces computational complexity of the current motif finding algorithms and demonstrate strong running time improvements over existing exact algorithms, especially in important and difficult cases of large-alphabet sequences.</p

    Structator: fast index-based search for RNA sequence-structure patterns

    Get PDF
    Background The secondary structure of RNA molecules is intimately related to their function and often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires to match sequence-structure patterns. Unfortunately, current tools for this task have, in the best case, a running time that is only linear in the size of sequence databases. Furthermore, established index data structures for fast sequence matching, like suffix trees or arrays, cannot benefit from the complementarity constraints introduced by the secondary structure of RNAs. Results We present a novel method and readily applicable software for time efficient matching of RNA sequence-structure patterns in sequence databases. Our approach is based on affix arrays, a recently introduced index data structure, preprocessed from the target database. Affix arrays support bidirectional pattern search, which is required for efficiently handling the structural constraints of the pattern. Structural patterns like stem-loops can be matched inside out, such that the loop region is matched first and then the pairing bases on the boundaries are matched consecutively. This allows to exploit base pairing information for search space reduction and leads to an expected running time that is sublinear in the size of the sequence database. The incorporation of a new chaining approach in the search of RNA sequence-structure patterns enables the description of molecules folding into complex secondary structures with multiple ordered patterns. The chaining approach removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our method runs up to two orders of magnitude faster than previous methods. Conclusions The presented method's sublinear expected running time makes it well suited for RNA sequence-structure pattern matching in large sequence databases. RNA molecules containing several stem-loop substructures can be described by multiple sequence-structure patterns and their matches are efficiently handled by a novel chaining method. Beyond our algorithmic contributions, we provide with Structator a complete and robust open-source software solution for index-based search of RNA sequence-structure patterns. The Structator software is available at http://www.zbh.uni-hamburg.de/Structator webcite.Deutsche Forschungsgemeinschaft (grant WI 3628/1-1
    • …
    corecore