9 research outputs found

    Full-fledged Real-Time Indexing for Constant Size Alphabets

    Full text link
    In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text over an alphabet ofconstant size. Each new symbol can be prepended to TT in O(1) worst-case time. At any moment, we can report all occurrences of a pattern PP in the current text in O(P+k)O(|P|+k) time, where P|P| is the length of PP and kk is the number of occurrences. This resolves, under assumption of constant-size alphabet, a long-standing open problem of existence of a real-time indexing method for string matching (see \cite{AmirN08})

    Sufficient Conditions for Efficient Indexing Under Different Matchings

    Get PDF
    The most important task derived from the massive digital data accumulation in the world, is efficient access to this data, hence the importance of indexing. In the last decade, many different types of matching relations were defined, each requiring an efficient indexing scheme. Cole and Hariharan in a ground breaking paper [Cole and Hariharan, SIAM J. Comput., 33(1):26-42, 2003], formulate sufficient conditions for building an efficient indexing for quasi-suffix collections, collections that behave as suffixes. It was shown that known matchings, including parameterized, 2-D array and order preserving matchings, fit their indexing settings. In this paper, we formulate more basic sufficient conditions based on the order relation derived from the matching relation itself, our conditions are more general than the previously known conditions

    Managing Unbounded-Length Keys in Comparison-Driven Data Structures with Applications to On-Line Indexing

    Full text link
    This paper presents a general technique for optimally transforming any dynamic data structure that operates on atomic and indivisible keys by constant-time comparisons, into a data structure that handles unbounded-length keys whose comparison cost is not a constant. Examples of these keys are strings, multi-dimensional points, multiple-precision numbers, multi-key data (e.g.~records), XML paths, URL addresses, etc. The technique is more general than what has been done in previous work as no particular exploitation of the underlying structure of is required. The only requirement is that the insertion of a key must identify its predecessor or its successor. Using the proposed technique, online suffix tree can be constructed in worst case time O(logn)O(\log n) per input symbol (as opposed to amortized O(logn)O(\log n) time per symbol, achieved by previously known algorithms). To our knowledge, our algorithm is the first that achieves O(logn)O(\log n) worst case time per input symbol. Searching for a pattern of length mm in the resulting suffix tree takes O(min(mlogΣ,m+logn)+tocc)O(\min(m\log |\Sigma|, m + \log n) + tocc) time, where tocctocc is the number of occurrences of the pattern. The paper also describes more applications and show how to obtain alternative methods for dealing with suffix sorting, dynamic lowest common ancestors and order maintenance

    Sliding Window String Indexing in Streams

    Full text link
    Given a string SS over an alphabet Σ\Sigma, the 'string indexing problem' is to preprocess SS to subsequently support efficient pattern matching queries, i.e., given a pattern string PP report all the occurrences of PP in SS. In this paper we study the 'streaming sliding window string indexing problem'. Here the string SS arrives as a stream, one character at a time, and the goal is to maintain an index of the last ww characters, called the 'window', for a specified parameter ww. At any point in time a pattern matching query for a pattern PP may arrive, also streamed one character at a time, and all occurrences of PP within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e. the window) of a stream while supporting efficient pattern matching. Our main result is a simple O(w)O(w) space data structure that uses O(logw)O(\log w) time with high probability to process each character from both the input string SS and the pattern string PP. Reporting each occurrence from PP uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream. We also consider a delayed variant of the problem, where a query may be answered at any point within the next δ\delta characters that arrive from either stream. We present an O(w+δ)O(w + \delta) space data structure for this problem that improves the above time bounds to O(log(w/δ))O(\log(w/\delta)). In particular, for a delay of δ=ϵw\delta = \epsilon w we obtain an O(w)O(w) space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees

    Sliding Window String Indexing in Streams

    Get PDF
    Given a string S over an alphabet ?, the string indexing problem is to preprocess S to subsequently support efficient pattern matching queries, that is, given a pattern string P report all the occurrences of P in S. In this paper we study the streaming sliding window string indexing problem. Here the string S arrives as a stream, one character at a time, and the goal is to maintain an index of the last w characters, called the window, for a specified parameter w. At any point in time a pattern matching query for a pattern P may arrive, also streamed one character at a time, and all occurrences of P within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e. the window) of a stream while supporting efficient pattern matching. Our main result is a simple O(w) space data structure that uses O(log w) time with high probability to process each character from both the input string S and any pattern string P. Reporting each occurrence of P uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream with high probability. We also consider a delayed variant of the problem, where a query may be answered at any point within the next ? characters that arrive from either stream. We present an O(w + ?) space data structure for this problem that improves the above time bounds to O(log (w/?)). In particular, for a delay of ? = ? w we obtain an O(w) space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees

    Locally Consistent Parsing for Text Indexing in Small Space

    Full text link
    We consider two closely related problems of text indexing in a sub-linear working space. The first problem is the Sparse Suffix Tree (SST) construction of a set of suffixes BB using only O(B)O(|B|) words of space. The second problem is the Longest Common Extension (LCE) problem, where for some parameter 1τn1\le\tau\le n, the goal is to construct a data structure that uses O(nτ)O(\frac {n}{\tau}) words of space and can compute the longest common prefix length of any pair of suffixes. We show how to use ideas based on the Locally Consistent Parsing technique, that was introduced by Sahinalp and Vishkin [STOC '94], in some non-trivial ways in order to improve the known results for the above problems. We introduce new Las-Vegas and deterministic algorithms for both problems. We introduce the first Las-Vegas SST construction algorithm that takes O(n)O(n) time. This is an improvement over the last result of Gawrychowski and Kociumaka [SODA '17] who obtained O(n)O(n) time for Monte-Carlo algorithm, and O(nlogB)O(n\sqrt{\log |B|}) time for Las-Vegas algorithm. In addition, we introduce a randomized Las-Vegas construction for an LCE data structure that can be constructed in linear time and answers queries in O(τ)O(\tau) time. For the deterministic algorithms, we introduce an SST construction algorithm that takes O(nlognB)O(n\log \frac{n}{|B|}) time (for B=Ω(logn)|B|=\Omega(\log n)). This is the first almost linear time, O(npolylogn)O(n\cdot poly\log{n}), deterministic SST construction algorithm, where all previous algorithms take at least Ω(min{nB,n2B})\Omega\left(\min\{n|B|,\frac{n^2}{|B|}\}\right) time. For the LCE problem, we introduce a data structure that answers LCE queries in O(τlogn)O(\tau\sqrt{\log^*n}) time, with O(nlogτ)O(n\log\tau) construction time (for τ=O(nlogn)\tau=O(\frac{n}{\log n})). This data structure improves both query time and construction time upon the results of Tanimura et al. [CPM '16].Comment: Extended abstract to appear is SODA 202

    Full-Fledged Real-Time Indexing for Constant Size Alphabets

    No full text
    International audienceIn this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text over an alphabet of constant size. Each new symbol can be prepended to T in O(1) worst-case time. At any moment, we can report all occurrences of a pattern P in the current text in O(|P|+k) time, where |P| is the length of P and k is the number of occurrences. This resolves, under assumption of constant size alphabet, a long-standing open problem of existence of a real-time indexing method for string matching (see Amir and Nor in Real-time indexing over fixed finite alphabets, pp. 1086–1095, 2008)
    corecore