
176 research outputs found

Structural Pattern Matching - Succinctly

Author: Ganguly Arnab
Shah Rahul
Thankachan Sharma V.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 28th International Symposium on Algorithms and Computation (ISAAC 2017)
Publication date: 01/01/2017

Let T be a text of length n containing characters from an alphabet Sigma, which is the union of two disjoint sets: Sigma_s containing static characters (s-characters) and Sigma_p containing parameterized characters (p-characters). Each character in Sigma_p has an associated complementary character from Sigma_p. A pattern P (also over Sigma) matches an equal-length substring S of T iff the s-characters match exactly, there exists a one-to-one function that renames the p-characters in S to the p-characters in P, and if a p-character x is renamed to another p-character y then the complement of x is renamed to the complement of y. The task is to find the starting positions (occurrences) of all such substrings S. The previous indexing solution [Shibuya, SWAT 2000], known as the Structural Suffix Tree, requires Theta(n log n) bits of space, and can find all occ occurrences in time O(|P| log sigma + occ), where sigma = |Sigma|. In this paper, we present the first succinct index for this problem, which occupies n log sigma + O(n) bits and offers O(|P| log sigma + occ · log n log sigma) query time.
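The matching rule in this abstract can be illustrated with a brute-force checker. This is only a sketch of the problem definition, not the succinct index from the paper: the function names, the 0-based positions, and the representation of the complement relation as a Python dict are all assumptions made for illustration, and the handling of complements that do not occur in S follows one reading of the definition.

```python
def structural_match(s, p, static_chars, complement):
    """Return True if pattern p structurally matches the equal-length string s.

    static_chars: set of s-characters (hypothetical representation).
    complement:   dict mapping each p-character to its complementary p-character.
    """
    if len(s) != len(p):
        return False
    fwd, bwd = {}, {}  # renaming from S to P, and its inverse (for one-to-one-ness)
    for a, b in zip(s, p):
        if a in static_chars or b in static_chars:
            if a != b:  # s-characters must match exactly
                return False
            continue
        # Renaming a -> b also forces complement(a) -> complement(b).
        for x, y in ((a, b), (complement[a], complement[b])):
            if fwd.setdefault(x, y) != y or bwd.setdefault(y, x) != x:
                return False
    return True


def structural_occurrences(text, pattern, static_chars, complement):
    """Naive scan: try every window of the text (0-based start positions)."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if structural_match(text[i:i + m], pattern, static_chars, complement)]


# RNA-like toy alphabet: A pairs with U, C pairs with G; '$' is static.
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}
print(structural_occurrences("AUCG", "CG", {"$"}, COMPLEMENT))
```

The window "UC" is rejected here even though U -> C, C -> G is a one-to-one renaming, because U -> C forces A -> G and G would then have two preimages. The scan takes O(n |P|) time, which is the cost an index is meant to avoid.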

Dagstuhl Research Online Publication Server

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

pBWT: Achieving succinct data structures for parameterized pattern matching and related problems

Author: Ganguly Arnab
Shah Rahul
Thankachan Sharma V.
Publication venue: LSU Digital Commons
Publication date: 01/01/2017

The fields of succinct data structures and compressed text indexing have seen quite a bit of progress over the last two decades. An important achievement, primarily using techniques based on the Burrows-Wheeler Transform (BWT), was obtaining the full functionality of the suffix tree in the optimal number of bits. A crucial property that allows the use of BWT for designing compressed indexes is order-preserving suffix links. Specifically, the relative order between two suffixes in the subtree of an internal node is the same as that of the suffixes obtained by truncating the first character of the two suffixes. Unfortunately, in many variants of the text-indexing problem, e.g., parameterized pattern matching, 2D pattern matching, and order-isomorphic pattern matching, this property does not hold. Consequently, the compressed indexes based on BWT do not directly apply. Furthermore, a compressed index for any of these variants has been elusive throughout the advancement of the field of succinct data structures. We achieve a positive breakthrough on one such problem, namely the Parameterized Pattern Matching problem. Let T be a text that contains n characters from an alphabet Sigma, which is the union of two disjoint sets: Sigma_s containing static characters (s-characters) and Sigma_p containing parameterized characters (p-characters). A pattern P (also over Sigma) matches an equal-length substring S of T iff the s-characters match exactly, and there exists a one-to-one function that renames the p-characters in S to those in P. The task is to find the starting positions (occurrences) of all such substrings S. The previous index [Baker, STOC 1993], known as the Parameterized Suffix Tree, requires Theta(n log n) bits of space, and can find all occ occurrences in time O(|P| log sigma + occ), where sigma = |Sigma|. We introduce an n log sigma + O(n)-bit index with O(|P| log sigma + occ · log n log sigma) query time. At the core lies a new BWT-like transform, which we call the Parameterized Burrows-Wheeler Transform (pBWT). The techniques are extended to obtain a succinct index for the Parameterized Dictionary Matching problem of Idury and Schaffer [CPM, 1994].
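Parameterized matching as defined above can be tested with the "previous occurrence" encoding commonly associated with Baker's parameterized suffix tree: each p-character is replaced by the distance to its previous occurrence (0 for a first occurrence), and two equal-length strings match exactly when their encodings are equal. The sketch below is a naive window-by-window illustration under that assumption. It is not the pBWT, and the function names and 0-based positions are hypothetical.

```python
def prev_encode(s, static_chars):
    """Encode s: s-characters stay as-is; each p-character becomes the
    distance to its previous occurrence in s, or 0 if it has none."""
    last = {}
    out = []
    for i, ch in enumerate(s):
        if ch in static_chars:
            out.append(ch)
        else:
            out.append(i - last[ch] if ch in last else 0)
            last[ch] = i
    return out


def parameterized_occurrences(text, pattern, static_chars):
    """Naive scan: re-encode every window and compare with the pattern."""
    m = len(pattern)
    target = prev_encode(pattern, static_chars)
    return [i for i in range(len(text) - m + 1)
            if prev_encode(text[i:i + m], static_chars) == target]


# 'a' and 'b' are static; 'x', 'y', 'z' are parameterized.
print(parameterized_occurrences("axbyaxbz", "aybx", set("ab")))
```

Each window is re-encoded from scratch because a distance that points before the window start must become 0, so the encoding of a substring is not simply a slice of the encoding of the text.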

Crossref

Louisiana State University

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

Shared-Constraint Range Reporting

Author: Biswas Sudip
Patil Manish
Shah Rahul
Thankachan Sharma V.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 18th International Conference on Database Theory (ICDT 2015)
Publication date: 01/01/2015

Orthogonal range reporting is one of the classic and most fundamental data structure problems. A (2,1,1) query is a three-dimensional query with a two-sided constraint on the first dimension and a one-sided constraint on each of the second and third dimensions. Given a set of N points in three dimensions, a particular formulation of such a (2,1,1) query (known as four-sided range reporting in three dimensions) asks to report all K points within a query region [a, b] x (-infinity, c] x [d, infinity). These queries have four constraints overall. In the word-RAM model, the best known structure capable of answering such queries with optimal query time takes O(N log^{epsilon} N) space, where epsilon > 0 is any positive constant. It has been shown that any external memory structure with optimal I/Os must use Omega(N log N / log log_B N) space (in words), where B is the block size [Arge et al., PODS 1999]. In this paper, we study a special type of (2,1,1) query, where the query parameters a and c are the same, i.e., a = c. Even though the query is still four-sided, the number of independent constraints is only three. In other words, one constraint is shared. We call this the Shared-Constraint Range Reporting (SCRR) problem. We study this problem in both internal and external memory models. In the RAM model where coordinates can only be compared, we achieve a linear-space solution with O(log N + K) query time, matching the best-known three-dimensional dominance query bound. In external memory, we present a linear-space structure with O(log_B N + log log N + K/B) query I/Os. We also present an I/O-optimal (i.e., O(log_B N + K/B) I/Os) data structure which occupies O(N log log N)-word space. We achieve these results by employing a novel divide-and-conquer approach. SCRR finds application in database queries containing sharing among the constraints. We also show that SCRR queries naturally arise in many well-known problems such as top-k color reporting, range skyline reporting and ranked document retrieval.
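To make the shared constraint concrete, the following linear-scan sketch evaluates an SCRR query directly from the definition: the region is [a, b] x (-infinity, a] x [d, infinity), so the value a bounds both the first and the second coordinate. The function name and the tuple representation of points are assumptions for illustration, and the scan takes O(N) time per query rather than the bounds stated in the abstract.

```python
def scrr_query(points, a, b, d):
    """Report all points (x, y, z) with a <= x <= b, y <= a and z >= d.

    The second-dimension bound c of the general (2,1,1) query is fixed
    to a, which is the shared constraint.
    """
    return [(x, y, z) for (x, y, z) in points
            if a <= x <= b and y <= a and z >= d]


points = [(1, 0, 5), (2, 1, 9), (3, 5, 9), (6, 0, 9), (2, 2, 1)]
print(scrr_query(points, a=1, b=4, d=4))
```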

Dagstuhl Research Online Publication Server

Forbidden Extension Queries

Author: Biswas Sudip
Ganguly Arnab
Shah Rahul
Thankachan Sharma V.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015)
Publication date: 01/01/2015

Document retrieval is one of the most fundamental problems in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem, such as ranked document retrieval, document listing with two patterns, and forbidden patterns, have been studied. We introduce the problem of document retrieval with forbidden extensions. Let D = {T_1, T_2, ..., T_D} be a collection of D string documents of n characters in total, and P^+ and P^- be two query patterns, where P^+ is a proper prefix of P^-. We call P^- the forbidden extension of the included pattern P^+. A forbidden extension query asks to report all occ documents in D that contain P^+ as a substring, but do not contain P^- as one. A top-k forbidden extension query asks to report those k documents among the occ documents that are most relevant to P^+. We present a linear index (in words) with an O(|P^-| + occ) query time for the document listing problem. For the top-k version of the problem, we achieve the following results, when the relevance of a document is based on PageRank:
- an O(n) space (in words) index with O(|P^-| log sigma + k) query time, where sigma is the size of the alphabet from which characters in D are chosen. For constant alphabets, this yields an optimal query time of O(|P^-| + k).
- for any constant epsilon > 0, a |CSA| + |CSA^*| + D log(n/D) + O(n) bits index with O(search(P) + k · t_sa · log^{2+epsilon} n) query time, where search(P) is the time to find the suffix range of a pattern P, t_sa is the time to find a suffix (or inverse suffix) array value, and |CSA^*| denotes the maximum of the space needed to store the compressed suffix array CSA of the concatenated text of all documents, or the total space needed to store the individual CSA of each document.
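The query semantics above can be sketched with a direct scan over the documents. This is an illustration of the problem statement only, not of either index: the function names are hypothetical, documents are identified by their 0-based position in the list, and the per-document score list stands in for a static relevance measure such as PageRank.

```python
import heapq


def forbidden_extension_query(docs, p_plus, p_minus):
    """Report ids of documents that contain p_plus but not p_minus."""
    if not (p_minus.startswith(p_plus) and len(p_plus) < len(p_minus)):
        raise ValueError("p_plus must be a proper prefix of p_minus")
    return [i for i, text in enumerate(docs)
            if p_plus in text and p_minus not in text]


def top_k_forbidden_extension(docs, scores, p_plus, p_minus, k):
    """Among the reported documents, return the k with the highest score.

    scores[i] is an assumed, precomputed relevance of document i.
    """
    hits = forbidden_extension_query(docs, p_plus, p_minus)
    return heapq.nlargest(k, hits, key=lambda i: scores[i])


docs = ["banana", "bandana", "cabana", "apple"]
print(forbidden_extension_query(docs, "ban", "band"))
print(top_k_forbidden_extension(docs, [0.1, 0.9, 0.5, 0.2], "ban", "band", 1))
```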

Dagstuhl Research Online Publication Server

Louisiana State University

LF Successor: Compact Space Indexing for Order-Isomorphic Pattern Matching

Author: Ganguly Arnab
Patel Dhrumil
Shah Rahul
Thankachan Sharma V.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)
Publication date: 01/01/2021

Dagstuhl Research Online Publication Server

Space-Time Trade-Offs for the Shortest Unique Substring Problem

Author: Ganguly Arnab
Hon Wing-Kai
Shah Rahul
Thankachan Sharma V.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 27th International Symposium on Algorithms and Computation (ISAAC 2016)
Publication date: 01/01/2016

Given a string X[1, n] and a position k in [1, n], the Shortest Unique Substring of X covering k, denoted by S_k, is a substring X[i, j] of X which satisfies the following conditions: (i) i <= k <= j, (ii) i is the only position where there is an occurrence of X[i, j], and (iii) j - i is minimized. The best-known algorithm [Hon et al., ISAAC 2015] can find S_k for all k in [1, n] in time O(n) using the string X and additional 2n words of working space. Let tau be a given parameter. We present the following new results. For any given k in [1, n], we can compute S_k via a deterministic algorithm in O(n tau^2 log(n/tau)) time using X and additional O(n/tau) words of working space. For every k in [1, n], we can compute S_k via a deterministic algorithm in O(n tau^2 log(n/tau)) time using X and additional O(n/tau) words and 4n + o(n) bits of working space. For both problems above, we present an O(n tau log^{c+1} n)-time randomized algorithm that uses n / log^c n words in addition to that mentioned above, where c >= 0 is an arbitrary constant. In this case, the reported string is unique and covers k, but with probability at most n^{-O(1)} may not be the shortest. As a consequence of our techniques, we also obtain similar space-and-time tradeoffs for a related problem of finding Maximal Unique Matches of two strings [Delcher et al., Nucleic Acids Res. 1999].
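Conditions (i) to (iii) translate directly into a brute-force search, sketched below. It tries lengths in increasing order and returns the leftmost unique substring covering k. It uses 0-based inclusive indices rather than the 1-based indices of the abstract, the function names are hypothetical, and its running time is far worse than any of the bounds quoted above.

```python
def occurs_once(x, sub):
    """True if sub occurs in x at exactly one position (overlaps counted)."""
    first = x.find(sub)
    return first != -1 and x.find(sub, first + 1) == -1


def shortest_unique_substring(x, k):
    """Return (i, j), 0-based inclusive, of a shortest unique substring
    x[i..j] with i <= k <= j; ties go to the leftmost start."""
    n = len(x)
    if not 0 <= k < n:
        raise IndexError("k must be a valid position in x")
    for length in range(1, n + 1):
        # Starts i for which the window [i, i + length - 1] covers k.
        for i in range(max(0, k - length + 1), min(k, n - length) + 1):
            if occurs_once(x, x[i:i + length]):
                return i, i + length - 1
    # Unreachable for non-empty x: the whole string occurs exactly once.


print(shortest_unique_substring("abaab", 2))
```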

Dagstuhl Research Online Publication Server

Louisiana State University

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

On Optimal Top-K String Retrieval

Author: Shah Rahul
Sheng Cheng
Thankachan Sharma V.
Vitter Jeffrey Scott
Publication date: 01/01/2012

Let $\mathcal{D} = \{d_1, d_2, d_3, \ldots, d_D\}$ be a given set of $D$ (string) documents of total length $n$. The top-$k$ document retrieval problem is to index $\mathcal{D}$ such that when a pattern $P$ of length $p$ and a parameter $k$ come as a query, the index returns the $k$ most relevant documents to the pattern $P$. Hon et al. \cite{HSV09} gave the first linear space framework to solve this problem in $O(p + k\log k)$ time. This was improved by Navarro and Nekrich \cite{NN12} to $O(p + k)$. These results are powerful enough to support arbitrary relevance functions like frequency, proximity, PageRank, etc. In many applications like desktop or email search, the data resides on disk and hence disk-bound indexes are needed. Despite continued progress on this problem in terms of theoretical, practical and compression aspects, any non-trivial bounds in the external memory model have so far been elusive. The internal memory (or RAM) solution to this problem decomposes the problem into $O(p)$ subproblems and thus incurs the additive factor of $O(p)$. In external memory, these approaches will lead to $O(p)$ I/Os instead of the optimal $O(p/B)$ I/O term, where $B$ is the block size. We re-interpret the problem independent of $p$, as interval stabbing with priority over a tree-shaped structure. This leads us to a linear space index in external memory supporting top-$k$ queries (with unsorted outputs) in near optimal $O(p/B + \log_B n + \log^{(h)} n + k/B)$ I/Os for any constant $h$ {$\log^{(1)} n = \log n$ and $\log^{(h)} n = \log(\log^{(h-1)} n)$}. Then we get an $O(n\log^* n)$ space index with optimal $O(p/B + \log_B n + k/B)$ I/Os.

Comment: 3 figure
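The top-k query itself can be illustrated with a scan that scores every document and keeps the k best. The sketch assumes term frequency (the number of occurrences of P in a document) as the relevance function, which is one of the functions named in the abstract. The function names are hypothetical, documents are identified by 0-based list position, and nothing here reflects the external-memory index.

```python
import heapq


def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(docs, pattern, k):
    """Ids of the k documents with the most occurrences of pattern,
    in decreasing order of frequency; ties go to the smaller id.
    Documents that do not contain the pattern are never reported."""
    scored = []
    for doc_id, text in enumerate(docs):
        freq = count_occurrences(text, pattern)
        if freq > 0:
            scored.append((freq, -doc_id))
    return [-neg_id for _, neg_id in heapq.nlargest(k, scored)]


print(top_k_documents(["abab", "ab", "ba", "ababab"], "ab", 2))
```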

arXiv.org e-Print Archive

CiteSeerX