24,648 research outputs found

    Pattern Matching and Consensus Problems on Weighted Sequences and Profiles

    Get PDF
    We study pattern matching problems on two major representations of uncertain sequences used in molecular biology: weighted sequences (also known as position weight matrices, PWM) and profiles (i.e., scoring matrices). In the simple version, in which only the pattern or only the text is uncertain, we obtain efficient algorithms with theoretically-provable running times using a variation of the lookahead scoring technique. We also consider a general variant of the pattern matching problems in which both the pattern and the text are uncertain. Central to our solution is a special case where the sequences have equal length, called the consensus problem. We propose algorithms for the consensus problem parameterized by the number of strings that match one of the sequences. As our basic approach, a careful adaptation of the classic meet-in-the-middle algorithm for the knapsack problem is used. On the lower bound side, we prove that our dependence on the parameter is optimal up to lower-order terms conditioned on the optimality of the original algorithm for the knapsack problem.Comment: 22 page

    Indexing weighted sequences: Neat and efficient

    Get PDF
    In a weighted sequence, for every position of the sequence and every letter of the alphabet a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example in molecular biology, where they are known under the name of Position Weight Matrices. Given a probability threshold 1/z , we say that a string P of length m occurs in a weighted sequence X at position i if the product of probabilities of the letters of P at positions i, . . . , i+m−1 in X is at least 1/z . In this article, we consider an indexing variant of the problem, in which we are to pre-process a weighted sequence to answer multiple pattern matching queries. We present an O(nz)-time construction of an O(nz)-sized index for a weighted sequence of length n that answers pattern matching queries in the optimal O(m+Occ) time, where Occ is the number of occurrences reported. The cornerstone of our data structure is a novel construction of a family of [z] strings that carries the information about all the strings that occur in the weighted sequence with a sufficient probability. We thus improve the most efficient previously known index by Amir et al. (Theor. Comput. Sci., 2008) with size and construction time O(nz2 log z), preserving optimal query time. On the way we develop a new, more straightforward index for the so-called property matching problem. We provide an open-source implementation of our data structure and present experimental results using both synthetic and real data. Our construction allows us also to obtain a significant improvement over the complexities of the approximate variant of the weighted index presented by Biswas et al. at EDBT 2016 and an improvement of the space complexity of their general index. We also present applications of our index

    Cross-Document Pattern Matching

    Get PDF
    We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted level ancestor problem

    Reverse-Safe Data Structures for Text Indexing

    Get PDF
    We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z-reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D. The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z, we propose an algorithm which constructs a z-reverse-safe data structure that has size O(n) and answers pattern matching queries of length at most d optimally, where d is maximal for any such z-reverse-safe data structure. The construction algorithm takes O(n ω log d) time, where ω is the matrix multiplication exponent. We show that, despite the n ω factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We further show that plugging our method in data analysis applications gives insignificant or no data utility loss. Finally, we show how our technique can be extended to support applications under a realistic adversary model
    • …
    corecore