80 research outputs found

    Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array

    Full text link
    The longest common prefix (LCP) array is a versatile auxiliary data structure in indexed string matching. It can be used to speed up searching using the suffix array (SA) and provides an implicit representation of the topology of an underlying suffix tree. The LCP array of a string of length nn can be represented as an array of length nn words, or, in the presence of the SA, as a bit vector of 2n2n bits plus asymptotically negligible support data structures. External memory construction algorithms for the LCP array have been proposed, but those proposed so far have a space requirement of O(n)O(n) words (i.e. O(nlogn)O(n \log n) bits) in external memory. This space requirement is in some practical cases prohibitively expensive. We present an external memory algorithm for constructing the 2n2n bit version of the LCP array which uses O(nlogσ)O(n \log \sigma) bits of additional space in external memory when given a (compressed) BWT with alphabet size σ\sigma and a sampled inverse suffix array at sampling rate O(logn)O(\log n). This is often a significant space gain in practice where σ\sigma is usually much smaller than nn or even constant. We also consider the case of computing succinct LCP arrays for circular strings

    YASS: enhancing the sensitivity of DNA similarity search

    Get PDF
    YASS is a DNA local alignment tool based on an efficient and sensitive filtering algorithm. It applies transition-constrained seeds to specify the most probable conserved motifs between homologous sequences, combined with a flexible hit criterion used to identify groups of seeds that are likely to exhibit significant alignments. A web interface () is available to upload input sequences in fasta format, query the program and visualize the results obtained in several forms (dot-plot, tabular output and others). A standalone version is available for download from the web page

    Text indexing for long patterns: Anchors are all you need

    Get PDF
    In many real-world database systems, a large fraction of the data is represented by strings: Sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (ⅰ) index space; (ⅱ) query time;(ⅲ) construction space; and (iv) construction time. Unfortunately, however, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with locally consistent anchors (lc-anchors) offers remarkably good performance in all four measures, when we have at hand a lower bound ℓ on the length of the queried patterns — which is arguably a quite reasonable assumption in practical applications. Specifically, we improve on the construction of the index proposed by Loukides and Pissis, which is based on bidirectional string anchors (bd-anchors), a new type of lc-anchors,by: (i) designing an average-case linear-time algorithm to compute bd-anchors; and (ii) developing a semi-external-memory implementation to construct the index in small space using near-optimal work. We then present an extensive experimental evaluation, based on the four measures, using real benchmark datasets. The results show that, for long patterns, the index constructed using our improved algorithms compares favorably to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index

    An Algorithm to Compute the Character Access Count Distribution for Pattern Matching Algorithms

    Get PDF
    We propose a framework for the exact probabilistic analysis of window-based pattern matching algorithms, such as Boyer--Moore, Horspool, Backward DAWG Matching, Backward Oracle Matching, and more. In particular, we develop an algorithm that efficiently computes the distribution of a pattern matching algorithm's running time cost (such as the number of text character accesses) for any given pattern in a random text model. Text models range from simple uniform models to higher-order Markov models or hidden Markov models (HMMs). Furthermore, we provide an algorithm to compute the exact distribution of \emph{differences} in running time cost of two pattern matching algorithms. Methodologically, we use extensions of finite automata which we call \emph{deterministic arithmetic automata} (DAAs) and \emph{probabilistic arithmetic automata} (PAAs)~\cite{Marschall2008}. Given an algorithm, a pattern, and a text model, a PAA is constructed from which the sought distributions can be derived using dynamic programming. To our knowledge, this is the first time that substring- or suffix-based pattern matching algorithms are analyzed exactly by computing the whole distribution of running time cost. Experimentally, we compare Horspool's algorithm, Backward DAWG Matching, and Backward Oracle Matching on prototypical patterns of short length and provide statistics on the size of minimal DAAs for these computations

    Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

    Full text link
    Sparse suffix sorting is the problem of sorting b=o(n)b=o(n) suffixes of a string of length nn. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in O(nlogb)\mathcal{O}(n\log b) time, in the worst case, or in O(n)\mathcal{O}(n) time, when the total number of suffixes with an LCP value greater than 2lognb+112^{\lfloor \log \frac{n}{b} \rfloor + 1}-1 is in O(b/logb)\mathcal{O}(b/\log b), matching the time of the optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only 8b+o(b)8b+o(b) machine words. Our algorithms are simplified, yet non-trivial, space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in O(nlogb)\mathcal{O}(n\log b) time [STACS 2014]. We also provide proof-of-concept experiments to justify our claims on simplicity and efficiency.Comment: 16 pages, 1 figur

    Succinct Data Structures for Parameterized Pattern Matching and Related Problems

    Get PDF
    Let T be a fixed text-string of length n and P be a varying pattern-string of length |P| \u3c= n. Both T and P contain characters from a totally ordered alphabet Sigma of size sigma \u3c= n. Suffix tree is the ubiquitous data structure for answering a pattern matching query: report all the positions i in T such that T[i + k - 1] = P[k], 1 \u3c= k \u3c= |P|. Compressed data structures support pattern matching queries, using much lesser space than the suffix tree, mainly by relying on a crucial property of the leaves in the tree. Unfortunately, in many suffix tree variants (such as parameterized suffix tree, order-preserving suffix tree, and 2-dimensional suffix tree), this property does not hold. Consequently, compressed representations of these suffix tree variants have been elusive. We present the first compressed data structures for two important variants of the pattern matching problem: (1) Parameterized Matching -- report a position i in T if T[i + k - 1] = f(P[k]), 1 \u3c= k \u3c= |P|, for a one-to-one function f that renames the characters in P to the characters in T[i,i+|P|-1], and (2) Order-preserving Matching -- report a position i in T if T[i + j - 1] and T[i + k -1] have the same relative order as that of P[j] and P[k], 1 \u3c= j \u3c k \u3c= |P|. For each of these two problems, the existing suffix tree variant requires O(n*log n) bits of space and answers a query in O(|P|*log sigma + occ) time, where occ is the number of starting positions where a match exists. We present data structures that require O(n*log sigma) bits of space and answer a query in O((|P|+occ) poly(log n)) time. As a byproduct, we obtain compressed data structures for a few other variants, as well as introduce two new techniques (of independent interest) for designing compressed data structures for pattern matching

    Bidirectional string anchors: A new string sampling mechanism

    Get PDF
    The minimizers sampling mechanism is a popular mechanism for string sampling introduced independently by Schleimer et al. [SIGMOD 2003] and by Roberts et al. [Bioinf. 2004]. Given two positive integers w and k, it selects the lexicographically smallest length-k substring in every fragment of w consecutive length-k substrings (in every sliding window of length w+k-1). Minimizers samples are approximately uniform, locally consistent, and computable in linear time. Although they do not have good worst-case guarantees on their size, they are often small in practice. They thus have been successfully employed in several string processing applications. Two main disadvantages of minimizers sampling mechanisms are: first, they also do not have good guarantees on the expected size of their samples for every combination of w and k; and, second, indexes that are constructed over their samples do not have good worst-case guarantees for on-line pattern searches. To alleviate these disadvantages, we introduce bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given a positive integer , our mechanism selects the lexicographically smallest rotation in every length- fragment (in every sliding window of length ). We show that bd-anchors samples are also approximately uniform, locally consistent, and computable in linear time. In addition, our experimen

    Algorithms for Order-Preserving Matching

    Get PDF
    String matching is a widely studied problem in Computer Science. There have been many recent developments in this field. One fascinating problem considered lately is the order-preserving matching (OPM) problem. The task is to find all the substrings in the text which have the same length and relative order as the pattern, where the relative order is the numerical order of the numbers in a string. The problem finds its applications in the areas involving time series or series of numbers. More specifically, it is useful for those who are interested in the relative order of the pattern and not in the pattern itself. For example, it can be used by analysts in a stock market to study movements of prices.  In addition to the OPM problem, we also studied its approximate variation. In approximate order-preserving matching, we search for those substrings in the text which have relative order similar to the pattern, i.e., relative order of the pattern matches with at most k mismatches. With respect to applications of order-preserving matching, approximate search is more meaningful than exact search. We developed various advanced solutions for the problem and its variant. Special emphasis was laid on the practical efficiency of the solutions. Particularly, we introduced a simple solution for the OPM problem using filtration. We proved experimentally that our method was effective and faster than the previous solutions for the problem. In addition, we combined the Single Instruction Multiple Data (SIMD) instruction set architecture with filtration to develop competent solutions which were faster than our previous solution. Moreover, we proposed another efficient solution without filtration using the SIMD architecture. We also presented an offline solution based on the FM-index scheme. Furthermore, we proposed practical solutions for the approximate order-preserving matching problem and one of the solutions was the first sublinear solution on average for the problem
    corecore