
    Algorithms for path-constrained sequence alignment

    We define a novel variation on the constrained sequence alignment problem in which the constraint is given in the form of a regular expression. Given two sequences, an alphabet $\Gamma$ describing pairwise sequence alignment operations, and a regular expression $R$ over $\Gamma$, the problem is to compute the highest-scoring sequence alignment $A$ of the given sequences such that $A \in \Gamma^* L(R) \Gamma^*$. Two algorithms are given for solving this problem. The first, basic algorithm is general and solves the problem in $O(nmr \log^2 r)$ time and $O(\min\{n,m\}r)$ space, where $m$ and $n$ are the lengths of the two sequences and $r$ is the size of the NFA for $R$. The second algorithm is restricted to patterns $P$ that do not contain the Kleene-closure star, and exploits this constraint to reduce the NFA size factor $r$ in the time complexity to a smaller factor $|P|$. The factor $|P|$ is compacted by supporting alignment patterns extended by meta-characters, including general insertion, deletion and match operations, as well as some cases of substitutions. For a regular expression $P = P_1 \cup \ldots \cup P_k$, these time bounds range from $O(knm)$ to $O(knm \log(\max\{|P_i|\}))$, depending on the meta-characters used in $P$. An additional result obtained along the way is an extension of the algorithm of Fischer and Paterson for String Matching with Wildcards. Our extension allows the input strings to include "negation symbols" (that match all letters but a specific one) while retaining the original time complexity. We implemented both algorithms and applied them to data-mine new miRNA seeding patterns in C. elegans Clip-seq experimental data.
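To make the problem statement concrete, the sketch below computes the best global alignment score under a constraint of the form $A \in \Gamma^* P \Gamma^*$ for a single literal pattern $P$. It is a plain product of the alignment table with a small NFA, not either of the two algorithms from the abstract, and it does not reach their time or space bounds. Everything specific here is an assumption made for illustration: the function name `constrained_alignment_score`, the operation alphabet `M` (match), `S` (substitution), `I` (insertion), `D` (deletion), the unit scores, and the restriction to one pattern without unions or meta-characters.

```python
NEG_INF = float("-inf")


def constrained_alignment_score(s, t, pattern, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of s and t whose operation string
    (over M, S, I, D) contains `pattern` as a contiguous substring.

    Returns -inf when no alignment satisfies the constraint.
    Hypothetical helper: naive O(n * m * |pattern|) product DP.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    n, m, k = len(s), len(t), len(pattern)

    def step(q, op):
        # NFA for Gamma* pattern Gamma*; state q = pattern symbols consumed.
        if q == k:            # accepting state loops on every operation
            return [k]
        nxt = []
        if q == 0:            # start state loops on every operation
            nxt.append(0)
        if pattern[q] == op:  # advance along the pattern
            nxt.append(q + 1)
        return nxt

    # best[i][j][q]: best score aligning s[:i] with t[:j], NFA in state q.
    best = [[[NEG_INF] * (k + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    best[0][0][0] = 0

    for i in range(n + 1):
        for j in range(m + 1):
            for q in range(k + 1):
                cur = best[i][j][q]
                if cur == NEG_INF:
                    continue
                moves = []
                if i < n and j < m:
                    if s[i] == t[j]:
                        moves.append(("M", i + 1, j + 1, match))
                    else:
                        moves.append(("S", i + 1, j + 1, mismatch))
                if i < n:
                    moves.append(("D", i + 1, j, gap))  # drop s[i]
                if j < m:
                    moves.append(("I", i, j + 1, gap))  # insert t[j]
                for op, ni, nj, w in moves:
                    for nq in step(q, op):
                        if cur + w > best[ni][nj][nq]:
                            best[ni][nj][nq] = cur + w

    return best[n][m][k]


print(constrained_alignment_score("ACGT", "AGT", "DM"))   # 2
print(constrained_alignment_score("ACGT", "AGT", "MMM"))  # -inf
```

In the first call the unconstrained optimum (operations `MDMM`) already contains `DM`, so the constraint costs nothing. In the second call no three consecutive matches exist, so no alignment is feasible.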