14 research outputs found

    Cross-Document Pattern Matching

    We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted level ancestor problem.

    Weighted ancestors in suffix trees

    The classical, ubiquitous, predecessor problem is to construct a data structure for a set of integers that supports fast predecessor queries. Its generalization to weighted trees, a.k.a. the weighted ancestor problem, has been extensively explored and successfully reduced to the predecessor problem. It is known that any solution for both problems with an input set from a polynomially bounded universe that preprocesses a weighted tree in $O(n\,\mathrm{polylog}(n))$ space requires $\Omega(\log\log n)$ query time. Perhaps the most important and frequent application of the weighted ancestors problem is for suffix trees. It has been a long-standing open question whether the weighted ancestors problem has better bounds for suffix trees. We answer this question positively: we show that a suffix tree built for a text $w[1..n]$ can be preprocessed using $O(n)$ extra space, so that queries can be answered in $O(1)$ time. Thus we improve the running times of several applications. Our improvement is based on a number of data structure tools and a periodicity-based insight into the combinatorial structure of a suffix tree. Comment: 27 pages, LNCS format. A condensed version will appear in ESA 201
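
To make the query in the abstract above concrete, here is a minimal sketch of a weighted ancestor query answered by a naive walk towards the root. It is not the constant-time structure the abstract describes. The representation (a `parent` list and a `weight` list with weights strictly increasing from the root downwards) and the function name are assumptions made for illustration.

```python
def weighted_ancestor(parent, weight, v, d):
    """Return the ancestor of v closest to the root whose weight is >= d.

    parent[u] is the parent of node u (None for the root); weight[u] is
    assumed to increase strictly along every root-to-leaf path, as string
    depths do in a suffix tree. Returns None if even v is lighter than d.
    This naive walk takes time proportional to the depth of v.
    """
    if weight[v] < d:
        return None
    # Climb while the parent still satisfies the weight threshold.
    while parent[v] is not None and weight[parent[v]] >= d:
        v = parent[v]
    return v


# A path-shaped tree 0 -> 1 -> 2 -> 3 with weights 0, 2, 5, 9.
parent = [None, 0, 1, 2]
weight = [0, 2, 5, 9]
print(weighted_ancestor(parent, weight, 3, 4))
```

In a suffix tree, a query of this kind with `v` a leaf and `d` a length locates the node at or just below the point where a given substring ends.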

    Computing Lempel-Ziv Factorization Online

    We present an algorithm which computes the Lempel-Ziv factorization of a word $W$ of length $n$ on an alphabet $\Sigma$ of size $\sigma$ online in the following sense: it reads $W$ starting from the left, and, after reading each $r = O(\log_{\sigma} n)$ characters of $W$, updates the Lempel-Ziv factorization. The algorithm requires $O(n \log \sigma)$ bits of space and $O(n \log^2 n)$ time. The basis of the algorithm is a sparse suffix tree combined with wavelet trees.
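
As an illustration of the object being computed, the following is a naive quadratic-time sketch of a Lempel-Ziv factorization. It assumes one common definition (each factor is the longest prefix of the remaining suffix that also starts at an earlier position, overlaps allowed, or a single fresh character); the abstract does not say which variant is used, and this is not the online, space-efficient algorithm it describes.

```python
def lz_factorize(w):
    """Naive Lempel-Ziv factorization of the string w.

    Assumed definition: at position i, the next factor is the longest
    prefix of w[i:] that also occurs starting at some j < i (the earlier
    occurrence may overlap the factor), or one character if none exists.
    """
    factors = []
    i, n = 0, len(w)
    while i < n:
        best = 0
        for j in range(i):
            # Length of the longest common prefix of w[j:] and w[i:].
            k = 0
            while i + k < n and w[j + k] == w[i + k]:
                k += 1
            best = max(best, k)
        length = max(best, 1)  # a fresh character forms its own factor
        factors.append(w[i:i + length])
        i += length
    return factors


print(lz_factorize("abababb"))
```

Concatenating the returned factors always reproduces the input, which is a convenient sanity check for any variant.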

    Full-fledged Real-Time Indexing for Constant Size Alphabets

    In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text over an alphabet of constant size. Each new symbol can be prepended to $T$ in $O(1)$ worst-case time. At any moment, we can report all occurrences of a pattern $P$ in the current text in $O(|P|+k)$ time, where $|P|$ is the length of $P$ and $k$ is the number of occurrences. This resolves, under assumption of constant-size alphabet, a long-standing open problem of existence of a real-time indexing method for string matching (see \cite{AmirN08}).

    Internal Pattern Matching Queries in a Text and Applications

    We consider several types of internal queries: questions about subwords of a text. As the main tool we develop an optimal data structure for the problem called here internal pattern matching. This data structure provides constant-time answers to queries about occurrences of one subword $x$ in another subword $y$ of a given text, assuming that $|y|=\mathcal{O}(|x|)$, which allows for a constant-space representation of all occurrences. This problem can be viewed as a natural extension of the well-studied pattern matching problem. The data structure has linear size and admits a linear-time construction algorithm. Using the solution to the internal pattern matching problem, we obtain very efficient data structures answering queries about: primitivity of subwords, periods of subwords, general substring compression, and cyclic equivalence of two subwords. All these results improve upon the best previously known counterparts. The linear construction time of our data structure also allows us to improve the algorithm for finding $\delta$-subrepetitions in a text (a more general version of maximal repetitions, also called runs). For any fixed $\delta$ we obtain the first linear-time algorithm, which matches the linear time complexity of the algorithm computing runs. Our data structure has already been used as a part of the efficient solutions for subword suffix rank & selection, as well as substring compression using Burrows-Wheeler transform composed with run-length encoding. Comment: 31 pages, 9 figures; accepted to SODA 201
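
The query type in the abstract above can be sketched naively as follows. This is a brute-force illustration, not the constant-time data structure: the function names and the half-open index convention are assumptions. The second helper reflects the standard fact that when $|y| < 2|x|$ the occurrences form a single arithmetic progression, which is one way a constant-space answer becomes possible.

```python
def internal_occurrences(text, i, j, a, b):
    """Start positions (in text) of occurrences of x = text[i:j] that lie
    entirely inside y = text[a:b]. Brute force, for illustration only."""
    x = text[i:j]
    m = len(x)
    return [s for s in range(a, b - m + 1) if text[s:s + m] == x]


def as_progression(positions):
    """Compress a sorted position list to (first, step, count), or return
    None if the list is empty or is not an arithmetic progression."""
    if not positions:
        return None
    if len(positions) == 1:
        return (positions[0], 0, 1)
    step = positions[1] - positions[0]
    if any(q - p != step for p, q in zip(positions, positions[1:])):
        return None
    return (positions[0], step, len(positions))


text = "abababab"
# x = text[0:4] = "abab", y = text[1:8] = "bababab", so |y| < 2|x|.
occ = internal_occurrences(text, 0, 4, 1, 8)
print(occ, as_progression(occ))
```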

    Fast Algorithm for Partial Covers in Words

    A factor $u$ of a word $w$ is a cover of $w$ if every position in $w$ lies within some occurrence of $u$ in $w$. A word $w$ covered by $u$ thus generalizes the idea of a repetition, that is, a word composed of exact concatenations of $u$. In this article we introduce a new notion of $\alpha$-partial cover, which can be viewed as a relaxed variant of cover, that is, a factor covering at least $\alpha$ positions in $w$. We develop a data structure of $O(n)$ size (where $n=|w|$) that can be constructed in $O(n\log n)$ time which we apply to compute all shortest $\alpha$-partial covers for a given $\alpha$. We also employ it for an $O(n\log n)$-time algorithm computing a shortest $\alpha$-partial cover for each $\alpha=1,2,\ldots,n$.
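
The definitions of cover and $\alpha$-partial cover given above translate directly into a brute-force check. The sketch below only tests a candidate factor against those definitions; it does not implement the $O(n\log n)$ data structure, and the function names are chosen here for illustration.

```python
def covered_positions(w, u):
    """Number of positions of w that lie within some occurrence of u."""
    if not u:
        return 0
    covered = [False] * len(w)
    start = w.find(u)
    while start != -1:
        for p in range(start, start + len(u)):
            covered[p] = True
        # Search again from the next position so overlaps are found.
        start = w.find(u, start + 1)
    return sum(covered)


def is_cover(w, u):
    """u is a cover of w if every position of w is covered."""
    return len(u) > 0 and covered_positions(w, u) == len(w)


def is_partial_cover(w, u, alpha):
    """u is an alpha-partial cover of w if it covers at least alpha positions."""
    return covered_positions(w, u) >= alpha


print(is_cover("abaababaaba", "aba"), covered_positions("abcab", "ab"))
```

For example, "aba" covers "abaababaaba" even though the word is not a concatenation of copies of "aba", because the occurrences overlap.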

    On Optimal Top-K String Retrieval

    Let $\mathcal{D} = \{d_1, d_2, d_3, \ldots, d_D\}$ be a given set of $D$ (string) documents of total length $n$. The top-$k$ document retrieval problem is to index $\mathcal{D}$ such that when a pattern $P$ of length $p$ and a parameter $k$ come as a query, the index returns the $k$ most relevant documents to the pattern $P$. Hon et al. \cite{HSV09} gave the first linear space framework to solve this problem in $O(p + k\log k)$ time. This was improved by Navarro and Nekrich \cite{NN12} to $O(p + k)$. These results are powerful enough to support arbitrary relevance functions like frequency, proximity, PageRank, etc. In many applications like desktop or email search, the data resides on disk and hence disk-bound indexes are needed. Despite continued progress on this problem in terms of theoretical, practical and compression aspects, any non-trivial bounds in the external memory model have so far been elusive. The internal memory (or RAM) solution to this problem decomposes the problem into $O(p)$ subproblems and thus incurs the additive factor of $O(p)$. In external memory, these approaches will lead to $O(p)$ I/Os instead of the optimal $O(p/B)$ I/O term, where $B$ is the block size. We re-interpret the problem independent of $p$, as interval stabbing with priority over a tree-shaped structure. This leads us to a linear space index in external memory supporting top-$k$ queries (with unsorted outputs) in near optimal $O(p/B + \log_B n + \log^{(h)} n + k/B)$ I/Os for any constant $h$, where $\log^{(1)} n = \log n$ and $\log^{(h)} n = \log(\log^{(h-1)} n)$. Then we get an $O(n\log^* n)$ space index with optimal $O(p/B+\log_B n + k/B)$ I/Os. Comment: 3 figures
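
To show what a top-$k$ query returns, here is a minimal index-free sketch that scans every document and uses term frequency as the relevance function, one of the options the abstract mentions. The tie-breaking rule (lower document index first), the function names, and the assumption of a non-empty pattern are choices made for this example and do not come from the abstract.

```python
import heapq


def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of a non-empty pattern."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(docs, pattern, k):
    """Indices of the k documents in which pattern occurs most often.

    Documents without any occurrence are never reported. Ties are broken
    in favour of the lower index (an assumption of this sketch).
    """
    scored = []
    for doc_id, text in enumerate(docs):
        freq = count_occurrences(text, pattern)
        if freq > 0:
            scored.append((freq, doc_id))
    best = heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))
    return [doc_id for _, doc_id in best]


docs = ["banana", "bandana", "cabana", "apple"]
print(top_k_documents(docs, "an", 2))
```

This scan reads every document for each query, which is exactly the cost that an index for the problem is meant to avoid.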

    The Online House Numbering Problem: Min-Max Online List Labeling

    We introduce and study the online house numbering problem, where houses are added arbitrarily along a road and must be assigned labels to maintain their ordering along the road. The online house numbering problem is related to classic online list labeling problems, except that the optimization goal here is to minimize the maximum number of times that any house is relabeled. We provide several algorithms that achieve interesting tradeoffs between upper bounds on the maximum number of relabels per element and the number of bits used by labels.