60,596 research outputs found

    String Indexing with Compressed Patterns

    Get PDF
    Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern

    Partial fillup and search time in LC tries

    Full text link
    Andersson and Nilsson introduced in 1993 a level-compressed trie (in short: LC trie) in which a full subtree of a node is compressed to a single node of degree being the size of the subtree. Recent experimental results indicated a 'dramatic improvement' when full subtrees are replaced by partially filled subtrees. In this paper, we provide a theoretical justification of these experimental results showing, among others, a rather moderate improvement of the search time over the original LC tries. For such an analysis, we assume that n strings are generated independently by a binary memoryless source with p denoting the probability of emitting a 1. We first prove that the so called alpha-fillup level (i.e., the largest level in a trie with alpha fraction of nodes present at this level) is concentrated on two values with high probability. We give these values explicitly up to O(1), and observe that the value of alpha (strictly between 0 and 1) does not affect the leading term. This result directly yields the typical depth (search time) in the alpha-LC tries with p not equal to 1/2, which turns out to be C loglog n for an explicitly given constant C (depending on p but not on alpha). This should be compared with recently found typical depth in the original LC tries which is C' loglog n for a larger constant C'. The search time in alpha-LC tries is thus smaller but of the same order as in the original LC tries.Comment: 13 page

    On Locating Paths in Compressed Tries

    Full text link
    In this paper, we consider the problem of compressing a trie while supporting the powerful \emph{locate} queries: to return the pre-order identifiers of all nodes reached by a path labeled with a given query pattern. Our result builds on top of the XBWT tree transform of Ferragina et al. [FOCS 2005] and generalizes the \emph{r-index} locate machinery of Gagie et al. [SODA 2018, JACM 2020] based on the run-length encoded Burrows-Wheeler transform (BWT). Our first contribution is to propose a suitable generalization of the run-length BWT to tries. We show that this natural generalization enjoys several of the useful properties of its counterpart on strings: in particular, the transform natively supports counting occurrences of a query pattern on the trie's paths and its size rr captures the trie's repetitiveness and lower-bounds a natural notion of trie entropy. Our main contribution is a much deeper insight into the combinatorial structure of this object. In detail, we show that a data structure of O(rlogn)+2n+o(n)O(r\log n) + 2n + o(n) bits, where nn is the number of nodes, allows locating the occocc occurrences of a pattern of length mm in nearly-optimal O(mlogσ+occ)O(m\log\sigma + occ) time, where σ\sigma is the alphabet's size. Our solution consists in sampling O(r)O(r) nodes that can be used as "anchor points" during the locate process. Once obtained the pre-order identifier of the first pattern occurrence (in co-lexicographic order), we show that a constant number of constant-time jumps between those anchor points lead to the identifier of the next pattern occurrence, thus enabling locating in optimal O(1)O(1) time per occurrence.Comment: Improved toehold lemma running time; added more detailed proofs that take care of all border cases in the locate strategy; postprint version to appear in SODA 202

    A fast approach for overcomplete sparse decomposition based on smoothed L0 norm

    Full text link
    In this paper, a fast algorithm for overcomplete sparse decomposition, called SL0, is proposed. The algorithm is essentially a method for obtaining sparse solutions of underdetermined systems of linear equations, and its applications include underdetermined Sparse Component Analysis (SCA), atomic decomposition on overcomplete dictionaries, compressed sensing, and decoding real field codes. Contrary to previous methods, which usually solve this problem by minimizing the L1 norm using Linear Programming (LP) techniques, our algorithm tries to directly minimize the L0 norm. It is experimentally shown that the proposed algorithm is about two to three orders of magnitude faster than the state-of-the-art interior-point LP solvers, while providing the same (or better) accuracy.Comment: Accepted in IEEE Transactions on Signal Processing. For MATLAB codes, see (http://ee.sharif.ir/~SLzero). File replaced, because Fig. 5 was missing erroneousl

    On the Benefit of Merging Suffix Array Intervals for Parallel Pattern Matching

    Get PDF
    We present parallel algorithms for exact and approximate pattern matching with suffix arrays, using a CREW-PRAM with pp processors. Given a static text of length nn, we first show how to compute the suffix array interval of a given pattern of length mm in O(mp+lgp+lglgplglgn)O(\frac{m}{p}+ \lg p + \lg\lg p\cdot\lg\lg n) time for pmp \le m. For approximate pattern matching with kk differences or mismatches, we show how to compute all occurrences of a given pattern in O(mkσkpmax(k,lglgn) ⁣+ ⁣(1+mp)lgplglgn+occ)O(\frac{m^k\sigma^k}{p}\max\left(k,\lg\lg n\right)\!+\!(1+\frac{m}{p}) \lg p\cdot \lg\lg n + \text{occ}) time, where σ\sigma is the size of the alphabet and pσkmkp \le \sigma^k m^k. The workhorse of our algorithms is a data structure for merging suffix array intervals quickly: Given the suffix array intervals for two patterns PP and PP', we present a data structure for computing the interval of PPPP' in O(lglgn)O(\lg\lg n) sequential time, or in O(1+lgplgn)O(1+\lg_p\lg n) parallel time. All our data structures are of size O(n)O(n) bits (in addition to the suffix array)

    String Indexing for Patterns with Wildcards

    Get PDF
    We consider the problem of indexing a string tt of length nn to report the occurrences of a query pattern pp containing mm characters and jj wildcards. Let occocc be the number of occurrences of pp in tt, and σ\sigma the size of the alphabet. We obtain the following results. - A linear space index with query time O(m+σjloglogn+occ)O(m+\sigma^j \log \log n + occ). This significantly improves the previously best known linear space index by Lam et al. [ISAAC 2007], which requires query time Θ(jn)\Theta(jn) in the worst case. - An index with query time O(m+j+occ)O(m+j+occ) using space O(σk2nlogklogn)O(\sigma^{k^2} n \log^k \log n), where kk is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time. - A time-space trade-off, generalizing the index by Cole et al. [STOC 2004]. We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest

    Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries

    Full text link
    We present the first thorough practical study of the Lempel-Ziv-78 and the Lempel-Ziv-Welch computation based on trie data structures. With a careful selection of trie representations we can beat well-tuned popular trie data structures like Judy, m-Bonsai or Cedar

    c-trie++: A Dynamic Trie Tailored for Fast Prefix Searches

    Full text link
    Given a dynamic set KK of kk strings of total length nn whose characters are drawn from an alphabet of size σ\sigma, a keyword dictionary is a data structure built on KK that provides locate, prefix search, and update operations on KK. Under the assumption that α=w/lgσ\alpha = w / \lg \sigma characters fit into a single machine word ww, we propose a keyword dictionary that represents KK in nlgσ+Θ(klgn)n \lg \sigma + \Theta(k \lg n) bits of space, supporting all operations in O(m/α+lgα)O(m / \alpha + \lg \alpha) expected time on an input string of length mm in the word RAM model. This data structure is underlined with an exhaustive practical evaluation, highlighting the practical usefulness of the proposed data structure, especially for prefix searches - one of the most elementary keyword dictionary operations
    corecore