Cross-Document Pattern Matching
We study a new variant of the string matching problem called cross-document
string matching, which is the problem of indexing a collection of documents to
support an efficient search for a pattern in a selected document, where the
pattern itself is a substring of another document. Several variants of this
problem are considered, and efficient linear-space solutions are proposed with
query time bounds that either do not depend at all on the pattern size or
depend on it in a very limited way (doubly logarithmic). As a side result, we
propose an improved solution to the weighted level ancestor problem.
Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array
The longest common prefix (LCP) array is a versatile auxiliary data structure
in indexed string matching. It can be used to speed up searching using the
suffix array (SA) and provides an implicit representation of the topology of an
underlying suffix tree. The LCP array of a string of length $n$ can be
represented as an array of length $n$ words, or, in the presence of the SA, as
a bit vector of $2n$ bits plus asymptotically negligible support data
structures. External memory construction algorithms for the LCP array have been
proposed, but those proposed so far have a space requirement of $O(n)$ words
(i.e. $O(n \log n)$ bits) in external memory. This space requirement is in some
practical cases prohibitively expensive. We present an external memory
algorithm for constructing the $2n$ bit version of the LCP array which uses
$O(n \log \sigma)$ bits of additional space in external memory when given a
(compressed) BWT with alphabet size $\sigma$ and a sampled inverse suffix array
at sampling rate $O(\log n)$. This is often a significant space gain in
practice where $\sigma$ is usually much smaller than $n$ or even constant. We
also consider the case of computing succinct LCP arrays for circular strings.
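As an illustration of the bit-vector representation mentioned in this abstract, the following in-memory sketch computes the permuted LCP (PLCP) array and encodes it in $2n$ bits using the standard unary difference code. It is not the external memory algorithm of the paper; the function names, the naive suffix sorting, and the test string are assumptions chosen for brevity.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; fine for a small illustration.
    return sorted(range(len(s)), key=lambda i: s[i:])

def plcp_array(s, sa):
    # plcp[i] = LCP of suffix i with the suffix preceding it in suffix-array order.
    n = len(s)
    phi = [-1] * n                      # phi[i] = predecessor of suffix i in the SA
    for r in range(1, n):
        phi[sa[r]] = sa[r - 1]
    plcp = [0] * n
    h = 0
    for i in range(n):
        j = phi[i]
        if j < 0:
            h = 0                       # lexicographically smallest suffix
        else:
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
        plcp[i] = h
        if h > 0:
            h -= 1                      # uses plcp[i + 1] >= plcp[i] - 1
    return plcp

def encode_plcp(plcp):
    # For each i, write (plcp[i] - plcp[i-1] + 1) zeros followed by a one.
    bits, prev = [], 0
    for v in plcp:
        bits.append("0" * (v - prev + 1) + "1")
        prev = v
    return "".join(bits)

def decode_plcp(bits):
    # The number of zeros before the (i+1)-th one equals plcp[i] + i + 1.
    out, zeros = [], 0
    for b in bits:
        if b == "0":
            zeros += 1
        else:
            out.append(zeros - len(out) - 1)
    return out

s = "banana"
plcp = plcp_array(s, suffix_array(s))
bits = encode_plcp(plcp)
print(plcp, bits, len(bits))
```

The encoding works because consecutive PLCP values can drop by at most one, so every unary gap is non-negative; with rank/select support on the bit vector, a single value can be read without decoding the whole array.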
On Indexing and Compressing Finite Automata
An index for a finite automaton is a powerful data structure that supports
locating paths labeled with a query pattern, thus solving pattern matching on
the underlying regular language. In this paper, we solve the long-standing
problem of indexing arbitrary finite automata. Our solution consists in finding
a partial co-lexicographic order of the states and proving, as in the total
order case, that states reached by a given string form one interval on the
partial order, thus enabling indexing. We provide a lower bound stating that
such an interval requires $O(p)$ words to be represented, $p$ being the order's
width (i.e. the size of its largest antichain). Indeed, we show that $p$
determines the complexity of several fundamental problems on finite automata:
(i) Letting $\sigma$ be the alphabet size, we provide an encoding for NFAs
using $\lceil\log \sigma\rceil + 2\lceil\log p\rceil + 2$ bits per transition
and a smaller encoding for DFAs using
$\lceil\log \sigma\rceil + \lceil\log p\rceil + 2$ bits per transition. This is
achieved by generalizing the Burrows-Wheeler transform to arbitrary automata.
(ii) We show that indexed pattern matching can be solved in
$\tilde O(m \cdot p^2)$ query time on NFAs.
(iii) We provide a polynomial-time algorithm to index DFAs, while matching the
optimal value for $p$. On the other hand, we prove that the problem is
NP-hard on NFAs. (iv) We show that, in the worst case, the classic powerset
construction algorithm for NFA determinization generates an equivalent DFA of
size $2^p(n-p+1)-1$, where $n$ is the number of NFA's states.
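For readers unfamiliar with the classic powerset construction referred to in item (iv), here is a minimal sketch of it for NFAs without epsilon transitions. The dictionary-based transition encoding, the function name, and the example automaton are assumptions made for illustration; the sketch does not compute the width of any co-lexicographic order and says nothing about the bound stated in the abstract.

```python
from collections import deque

def determinize(start, delta, accepting, alphabet):
    """Subset construction for an NFA without epsilon moves.

    delta maps (state, symbol) to a set of successor states.
    Only subsets reachable from {start} are generated; the empty
    (dead) subset is omitted.
    """
    start_set = frozenset([start])
    seen = {start_set}
    dfa_delta = {}
    queue = deque([start_set])
    while queue:
        subset = queue.popleft()
        for a in alphabet:
            target = frozenset(q for p in subset for q in delta.get((p, a), ()))
            if not target:
                continue
            dfa_delta[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfa_accepting = {s for s in seen if s & accepting}
    return seen, dfa_delta, dfa_accepting

# Hypothetical example: strings over {a, b} whose second-to-last symbol is 'a'.
nfa_delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "a"): {2},
    (1, "b"): {2},
}
states, trans, acc = determinize(0, nfa_delta, {2}, "ab")
print(len(states), sorted(sorted(s) for s in acc))
```

On this three-state NFA the construction reaches four subsets, which shows in miniature how determinization can enlarge the state set.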
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching (Extended Abstract)
The proliferation of online text, such as on the World Wide Web and in databases, motivates the need for space-efficient index methods that support fast search. Consider a text T of n binary symbols to index. Given any query pattern P of m binary symbols, the goal is to search for P in T quickly, with T being fully scanned only once, namely, when the index is created. All indexing schemes published in the last thirty years support searching in \Theta(m) worst-case time and require \Theta(n) memory words (or \Theta(n log n) bits), which is significantly larger than the text itself. In this paper we provide a breakthrough both in searching time and index space under the same model of computation as the one adopted in previous work. Based upon new compressed representations of suffix arrays and suffix trees, we construct an index structure that occupies only O(n) bits and compares favorably with inverted lists in space. We can search any binary pattern P, stored in O(m / log n) words, in only o(m) time. Specifically, searching takes O(1) time for m = o(log n), and O(m / log n + log^\epsilon n) = o(m) time for m = \Omega(log n) and any fixed 0 < \epsilon < 1. That is, we achieve optimal O(m / log n) search time for sufficiently large m = \Omega(log^{1+\epsilon} n). We can list all the occ pattern occurrences in optimal O(occ) additional time when m = \Omega(polylog(n)) or when occ = \Omega(n^\epsilon); otherwise, listing takes O(occ log^\epsilon n) additional time.
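For context, the baseline that compressed indexes improve upon can be sketched as a plain suffix array with binary search. This is the uncompressed structure using \Theta(n log n) bits, not the compressed suffix array of the paper; the function names and the sample binary text are assumptions for illustration, and each query here costs O(m log n) character comparisons.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes; adequate for a small example.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return all starting positions of pattern in text, in increasing order."""
    m = len(pattern)
    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "0110100110"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "10"))
```

All suffixes that begin with the pattern occupy one contiguous range of the suffix array, which is why two binary searches suffice to find every occurrence.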
On Locating Paths in Compressed Tries
In this paper, we consider the problem of compressing a trie while supporting
the powerful \emph{locate} queries: to return the pre-order identifiers of all
nodes reached by a path labeled with a given query pattern. Our result builds
on top of the XBWT tree transform of Ferragina et al. [FOCS 2005] and
generalizes the \emph{r-index} locate machinery of Gagie et al. [SODA 2018,
JACM 2020] based on the run-length encoded Burrows-Wheeler transform (BWT). Our
first contribution is to propose a suitable generalization of the run-length
BWT to tries. We show that this natural generalization enjoys several of the
useful properties of its counterpart on strings: in particular, the transform
natively supports counting occurrences of a query pattern on the trie's paths
and its size captures the trie's repetitiveness and lower-bounds a natural
notion of trie entropy. Our main contribution is a much deeper insight into the
combinatorial structure of this object. In detail, we show that a data
structure of $O(r \log n) + 2n + o(n)$ bits, where $n$ is the number of nodes,
allows locating the $occ$ occurrences of a pattern of length $m$ in
nearly-optimal $O(m \log \sigma + occ)$ time, where $\sigma$ is the alphabet's
size. Our solution consists in sampling $O(r)$ nodes that can be used as
"anchor points" during the locate process. Once the pre-order identifier of
the first pattern occurrence (in co-lexicographic order) has been obtained, we
show that a constant number of constant-time jumps between those anchor points
leads to the identifier of the next pattern occurrence, thus enabling locating
in optimal time per occurrence.
Comment: Improved toehold lemma running time; added more detailed proofs that
take care of all border cases in the locate strategy; postprint version to
appear in SODA 202
String Searching with Ranking Constraints and Uncertainty
Strings play an important role in many areas of computer science. Searching for a pattern in a string or string collection is one of the most classic problems. Variations of this problem such as document retrieval, ranked document retrieval, and dictionary matching have been well studied. The enormous growth of the internet, large genomic projects, sensor networks, and digital libraries necessitates not just efficient algorithms and data structures for general string indexing, but also indexes for texts with fuzzy information and support for queries with different constraints. This dissertation addresses some of these problems and proposes indexing solutions. One such variation is the document retrieval query for included and excluded (forbidden) patterns, where the objective is to retrieve all relevant documents that contain the included patterns and do not contain the excluded patterns. We continue the previous work on this problem and propose a more efficient solution. We conjecture that any significant improvement over these results is highly unlikely. We also consider the scenario in which the query consists of more than two patterns. The forbidden pattern problem suffers from the drawback that linear-space (in words) solutions are unlikely to achieve a per-document reporting time better than $O(\sqrt{n/occ})$, where n is the total length of the documents and occ is the number of output documents. Continuing along this path, we introduce a new variation, namely document retrieval with forbidden extension query, where the forbidden pattern is an extension of the included pattern. We also address the more general top-k version of the problem, which retrieves the top k documents, where the ranking is based on the PageRank relevance metric. This problem is motivated by search applications. It also holds theoretical interest, as we show that the hardness of the forbidden pattern problem is alleviated in this problem. We achieve linear space and optimal query time for this variation.
We also propose succinct indexes for both these problems. Position-restricted pattern matching considers the scenario where only part of the text is searched. We propose a succinct index for this problem with efficient query time. An important application of this problem stems from searching in genomic sequences, where only part of the gene sequence is searched for interesting patterns. The problem of computing discriminating (resp. generic) words is to report all minimal (resp. maximal) extensions of a query pattern which are contained in at most (resp. at least) a given number of documents. These problems are motivated by applications in computational biology, text mining, and automated text classification. We propose succinct indexes for these problems. Strings with uncertainty and fuzzy information play an important role in an increasing number of applications. We propose a general framework for indexing uncertain strings such that a deterministic query string can be searched efficiently. String matching becomes a probabilistic event when a string contains uncertainty, i.e. each position of the string can have different probable characters, with an associated probability of occurrence for each character. Such uncertain strings are prevalent in various applications such as biological sequence data, event monitoring, and automatic ECG annotations. We consider two basic problems of string searching, namely substring searching and string listing. We formulate these well-known problems for the uncertain strings paradigm and propose exact and approximate solutions for them. We also discuss a constrained variation of orthogonal range searching. Given a set of points, the task of orthogonal range searching is to build a data structure such that all the points inside an orthogonal query region can be reported. We introduce a new variation, namely shared constraint range searching, which naturally arises in constrained pattern matching applications.
Shared constraint range searching is a special four-sided range reporting query problem in which two of the constraints are shared, effectively reducing the number of independent constraints. For this problem, we propose a linear-space index that can match the best known bound for the three-dimensional dominance reporting problem. We extend our data structure to the external memory model.
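To make the forbidden pattern query from this abstract concrete, the sketch below answers it by brute force: it scans every document and keeps those that contain the included pattern but not the excluded one. This is only a semantic reference point, not the index proposed in the dissertation; the function name and the sample documents are assumptions for illustration.

```python
def forbidden_pattern_query(documents, included, excluded):
    """Return the indices of documents containing `included` but not `excluded`.

    Brute-force scan: time proportional to the total length of all
    documents, regardless of how many documents are reported.
    """
    return [
        doc_id
        for doc_id, doc in enumerate(documents)
        if included in doc and excluded not in doc
    ]

# Hypothetical sample collection.
docs = ["banana", "cabana", "panama", "bandit", "ananas"]
print(forbidden_pattern_query(docs, "ana", "ban"))
```

The scan touches every document even when few qualify, which is the cost an index for this query aims to avoid.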
Breaking the $O(n)$-Barrier in the Construction of Compressed Suffix Arrays
The suffix array, describing the lexicographic order of suffixes of a given
text, is the central data structure in string algorithms. The suffix array of a
length-$n$ text uses $\Theta(n \log n)$ bits, which is prohibitive in many
applications. To address this, Grossi and Vitter [STOC 2000] and,
independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient
versions of the suffix array, known as the compressed suffix array (CSA) and
the FM-index. For a length-$n$ text over an alphabet of size $\sigma$, these
data structures use only $O(n \log \sigma)$ bits. Immediately after their
discovery, they almost completely replaced plain suffix arrays in practical
applications, and a race started to develop efficient construction procedures.
Yet, after more than 20 years, even for $\sigma = 2$, the fastest algorithm
remains stuck at $O(n)$ time [Hon et al., FOCS 2003], which is slower by a
$\Theta(\log n)$ factor than the lower bound of $\Omega(n / \log n)$ (following
simply from the necessity to read the entire input). We break this
long-standing barrier with a new data structure that takes $O(n \log \sigma)$
bits, answers suffix array queries in $O(\log^{\epsilon} n)$ time, and can be
constructed in $O(n \log \sigma / \sqrt{\log n})$ time using $O(n \log \sigma)$
bits of space. Our result is based on several new insights into the recently
developed notion of string synchronizing sets [STOC 2019]. In particular,
compared to their previous applications, we eliminate orthogonal range queries,
replacing them with new queries that we dub prefix rank and prefix selection
queries. As a further demonstration of our techniques, we present a new
pattern-matching index that simultaneously minimizes the construction time and
the query time among all known compact indexes (i.e., those using
$O(n \log \sigma)$ bits).
Comment: 41 page