176 research outputs found

    Structural Pattern Matching - Succinctly

    Get PDF
    Let T be a text of length n containing characters from an alphabet Sigma, which is the union of two disjoint sets: Sigma_s containing static characters (s-characters) and Sigma_p containing parameterized characters (p-characters). Each character in Sigma_p has an associated complementary character from Sigma_p. A pattern P (also over Sigma) matches an equal-length substring SS of T iff the s-characters match exactly, there exists a one-to-one function that renames the p-characters in S to the p-characters in P, and if a p-character x is renamed to another p-character y then the complement of x is renamed to the complement of y. The task is to find the starting positions (occurrences) of all such substrings S. Previous indexing solution [Shibuya, SWAT 2000], known as Structural Suffix Tree, requires Theta(nlog n) bits of space, and can find all occ occurrences in time O(|P|log sigma+ occ), where sigma = |Sigma|. In this paper, we present the first succinct index for this problem, which occupies n log sigma + O(n) bits and offers O(|P|logsigma+ occcdot log n logsigma) query time

    pBWT: Achieving succinct data structures for parameterized pattern matching and related problems

    Get PDF
    The fields of succinct data structures and compressed text indexing have seen quite a bit of progress over the last two decades. An important achievement, primarily using techniques based on the Burrows-Wheeler Transform (BWT), was obtaining the full functionality of the suffix tree in the optimal number of bits. A crucial property that allows the use of BWT for designing compressed indexes is order-preserving suffix links. Specifically, the relative order between two suffixes in the subtree of an internal node is same as that of the suffixes obtained by truncating the furst character of the two suffixes. Unfortunately, in many variants of the text-indexing problem, for e.g., parameterized pattern matching, 2D pattern matching, and order-isomorphic pattern matching, this property does not hold. Consequently, the compressed indexes based on BWT do not directly apply. Furthermore, a compressed index for any of these variants has been elusive throughout the advancement of the field of succinct data structures. We achieve a positive breakthrough on one such problem, namely the Parameterized Pattern Matching problem. Let T be a text that contains n characters from an alphabet , which is the union of two disjoint sets: containing static characters (s-characters) and containing parameterized characters (p-characters). A pattern P (also over ) matches an equal-length substring S of T i the s-characters match exactly, and there exists a one-to-one function that renames the p-characters in S to that in P. The task is to find the starting positions (occurrences) of all such substrings S. Previous index [Baker, STOC 1993], known as Parameterized Suffix Tree, requires (n log n) bits of space, and can find all occ occurrences in time O(jPj log +occ), where = jj. We introduce an n log +O(n)-bit index with O(jPj log +occlog n log ) query time. At the core, lies a new BWT-like transform, which we call the Parame- terized Burrows-Wheeler Transform (pBWT). The techniques are extended to obtain a succinct index for the Parameterized Dictionary Matching problem of Idury and Schaer [CPM, 1994]

    Shared-Constraint Range Reporting

    Get PDF
    Orthogonal range reporting is one of the classic and most fundamental data structure problems. (2,1,1) query is a 3 dimensional query with two-sided constraint on the first dimension and one sided constraint on each of the 2nd and 3rd dimension. Given a set of N points in three dimension, a particular formulation of such a (2,1,1) query (known as four-sided range reporting in three-dimension) asks to report all those K points within a query region [a, b]X(-infinity, c]X[d, infinity). These queries have overall 4 constraints. In Word-RAM model, the best known structure capable of answering such queries with optimal query time takes O(N log^{epsilon} N) space, where epsilon>0 is any positive constant. It has been shown that any external memory structure in optimal I/Os must use Omega(N log N/ log log_B N) space (in words), where B is the block size [Arge et al., PODS 1999]. In this paper, we study a special type of (2,1,1) queries, where the query parameters a and c are the same i.e., a=c. Even though the query is still four-sided, the number of independent constraints is only three. In other words, one constraint is shared. We call this as a Shared-Constraint Range Reporting (SCRR) problem. We study this problem in both internal as well as external memory models. In RAM model where coordinates can only be compared, we achieve linear-space and O(log N+K) query time solution, matching the best-known three dimensional dominance query bound. Whereas in external memory, we present a linear space structure with O(log_B N + log log N + K/B) query I/Os. We also present an I/O-optimal (i.e., O(log_B N+K/B) I/Os) data structure which occupies O(N log log N)-word space. We achieve these results by employing a novel divide and conquer approach. SCRR finds application in database queries containing sharing among the constraints. We also show that SCRR queries naturally arise in many well known problems such as top-k color reporting, range skyline reporting and ranked document retrieval

    Forbidden Extension Queries

    Get PDF
    Document retrieval is one of the most fundamental problem in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem such as ranked document retrieval, document listing with two patterns and forbidden patterns have been studied. We introduce the problem of document retrieval with forbidden extensions. Let D={T_1,T_2,...,T_D} be a collection of D string documents of n characters in total, and P^+ and P^- be two query patterns, where P^+ is a proper prefix of P^-. We call P^- as the forbidden extension of the included pattern P^+. A forbidden extension query asks to report all occ documents in D that contains P^+ as a substring, but does not contain P^- as one. A top-k forbidden extension query asks to report those k documents among the occ documents that are most relevant to P^+. We present a linear index (in words) with an O(|P^-| + occ) query time for the document listing problem. For the top-k version of the problem, we achieve the following results, when the relevance of a document is based on PageRank: - an O(n) space (in words) index with O(|P^-|log sigma+ k) query time, where sigma is the size of the alphabet from which characters in D are chosen. For constant alphabets, this yields an optimal query time of O(|P^-|+ k). - for any constant epsilon > 0, a |CSA| + |CSA^*| + Dlog frac{n}{D} + O(n) bits index with O(search(P)+ k cdot tsa cdot log ^{2+epsilon} n) query time, where search(P) is the time to find the suffix range of a pattern P, tsa is the time to find suffix (or inverse suffix) array value, and |CSA^*| denotes the maximum of the space needed to store the compressed suffix array CSA of the concatenated text of all documents, or the total space needed to store the individual CSA of each document

    LF Successor: Compact Space Indexing for Order-Isomorphic Pattern Matching

    Get PDF

    Space-Time Trade-Offs for the Shortest Unique Substring Problem

    Get PDF
    Given a string X[1, n] and a position k in [1, n], the Shortest Unique Substring of X covering k, denoted by S_k, is a substring X[i, j] of X which satisfies the following conditions: (i) i leq k leq j, (ii) i is the only position where there is an occurrence of X[i, j], and (iii) j - i is minimized. The best-known algorithm [Hon et al., ISAAC 2015] can find S k for all k in [1, n] in time O(n) using the string X and additional 2n words of working space. Let tau be a given parameter. We present the following new results. For any given k in [1, n], we can compute S_k via a deterministic algorithm in O(n tau^2 log n tau) time using X and additional O(n/tau) words of working space. For every k in [1, n], we can compute S_k via a deterministic algorithm in O(n tau^2 log n/tau) time using X and additional O(n/tau) words and 4n + o(n) bits of working space. For both problems above, we present an O(n tau log^{c+1} n)-time randomized algorithm that uses n/ log c n words in addition to that mentioned above, where c geq 0 is an arbitrary constant. In this case, the reported string is unique and covers k, but with probability at most n^{-O(1)}may not be the shortest. As a consequence of our techniques, we also obtain similar space-and-time tradeoffs for a related problem of finding Maximal Unique Matches of two strings [Delcher et al., Nucleic Acids Res. 1999]

    On Optimal Top-K String Retrieval

    Full text link
    Let D{\cal{D}} = {d1,d2,d3,...,dD}\{d_1, d_2, d_3, ..., d_D\} be a given set of DD (string) documents of total length nn. The top-kk document retrieval problem is to index D\cal{D} such that when a pattern PP of length pp, and a parameter kk come as a query, the index returns the kk most relevant documents to the pattern PP. Hon et. al. \cite{HSV09} gave the first linear space framework to solve this problem in O(p+klogk)O(p + k\log k) time. This was improved by Navarro and Nekrich \cite{NN12} to O(p+k)O(p + k). These results are powerful enough to support arbitrary relevance functions like frequency, proximity, PageRank, etc. In many applications like desktop or email search, the data resides on disk and hence disk-bound indexes are needed. Despite of continued progress on this problem in terms of theoretical, practical and compression aspects, any non-trivial bounds in external memory model have so far been elusive. Internal memory (or RAM) solution to this problem decomposes the problem into O(p)O(p) subproblems and thus incurs the additive factor of O(p)O(p). In external memory, these approaches will lead to O(p)O(p) I/Os instead of optimal O(p/B)O(p/B) I/O term where BB is the block-size. We re-interpret the problem independent of pp, as interval stabbing with priority over tree-shaped structure. This leads us to a linear space index in external memory supporting top-kk queries (with unsorted outputs) in near optimal O(p/B+logBn+log(h)n+k/B)O(p/B + \log_B n + \log^{(h)} n + k/B) I/Os for any constant hh{log(1)n=logn\log^{(1)}n =\log n and log(h)n=log(log(h1)n)\log^{(h)} n = \log (\log^{(h-1)} n)}. Then we get O(nlogn)O(n\log^*n) space index with optimal O(p/B+logBn+k/B)O(p/B+\log_B n + k/B) I/Os.Comment: 3 figure
    corecore