198 research outputs found
String Indexing with Compressed Patterns
Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern
String Indexing for Patterns with Wildcards
We consider the problem of indexing a string of length to report the
occurrences of a query pattern containing characters and wildcards.
Let be the number of occurrences of in , and the size of
the alphabet. We obtain the following results.
- A linear space index with query time .
This significantly improves the previously best known linear space index by Lam
et al. [ISAAC 2007], which requires query time in the worst case.
- An index with query time using space , where is the maximum number of wildcards allowed in the pattern.
This is the first non-trivial bound with this query time.
- A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].
We also show that these indexes can be generalized to allow variable length
gaps in the pattern. Our results are obtained using a novel combination of
well-known and new techniques, which could be of independent interest
Sliding Window String Indexing in Streams
Given a string over an alphabet , the 'string indexing problem'
is to preprocess to subsequently support efficient pattern matching
queries, i.e., given a pattern string report all the occurrences of in
. In this paper we study the 'streaming sliding window string indexing
problem'. Here the string arrives as a stream, one character at a time, and
the goal is to maintain an index of the last characters, called the
'window', for a specified parameter . At any point in time a pattern
matching query for a pattern may arrive, also streamed one character at a
time, and all occurrences of within the current window must be returned.
The streaming sliding window string indexing problem naturally captures
scenarios where we want to index the most recent data (i.e. the window) of a
stream while supporting efficient pattern matching.
Our main result is a simple space data structure that uses
time with high probability to process each character from both the input string
and the pattern string . Reporting each occurrence from uses
additional constant time per reported occurrence. Compared to previous work in
similar scenarios this result is the first to achieve an efficient worst-case
time per character from the input stream. We also consider a delayed variant of
the problem, where a query may be answered at any point within the next
characters that arrive from either stream. We present an space data structure for this problem that improves the above time
bounds to . In particular, for a delay of we obtain an space data structure with constant time processing per
character. The key idea to achieve our result is a novel and simple
hierarchical structure of suffix trees of independent interest, inspired by the
classic log-structured merge trees
Nucleotide String Indexing using Range Matching
The two most common data-structures for genome indexing, FM-indices and
hash-tables, exhibit a fundamental trade-off between memory footprint and
performance. We present Ranger, a new indexing technique for nucleotide
sequences that is both memory efficient and fast. We observe that nucleotide
sequences can be represented as integer ranges and leverage a range-matching
algorithm based on neural networks to perform the lookup.
We prototype Ranger in software and integrate it into the popular Minimap2
tool. Ranger achieves almost identical end-to-end performance as the original
Minimap2, while occupying 1.7 and 1.2 less memory for short-
and long-reads, respectively. With a limited memory capacity, Ranger achieves
up to 4.3 speedup for short reads compared to FM-Index, and up to
4.2 and 1.8 speedups for short- and long-reads, compared to
hash-tables. Ranger opens up new opportunities in the context of hardware
acceleration by reducing the memory footprint of long-seed indexes used in
state-of-the-art alignment accelerators by up to 23 which results with
3 faster alignment and negligible accuracy degradation. Moreover, its
worst case memory bandwidth and latency can be bounded in advance without the
need to inflate DRAM capacity
Sliding Window String Indexing in Streams
Given a string S over an alphabet ?, the string indexing problem is to preprocess S to subsequently support efficient pattern matching queries, that is, given a pattern string P report all the occurrences of P in S. In this paper we study the streaming sliding window string indexing problem. Here the string S arrives as a stream, one character at a time, and the goal is to maintain an index of the last w characters, called the window, for a specified parameter w. At any point in time a pattern matching query for a pattern P may arrive, also streamed one character at a time, and all occurrences of P within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e. the window) of a stream while supporting efficient pattern matching.
Our main result is a simple O(w) space data structure that uses O(log w) time with high probability to process each character from both the input string S and any pattern string P. Reporting each occurrence of P uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream with high probability. We also consider a delayed variant of the problem, where a query may be answered at any point within the next ? characters that arrive from either stream. We present an O(w + ?) space data structure for this problem that improves the above time bounds to O(log (w/?)). In particular, for a delay of ? = ? w we obtain an O(w) space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees
String Indexing for Top- Close Consecutive Occurrences
The classic string indexing problem is to preprocess a string into a
compact data structure that supports efficient subsequent pattern matching
queries, that is, given a pattern string , report all occurrences of
within . In this paper, we study a basic and natural extension of string
indexing called the string indexing for top- close consecutive occurrences
problem (SITCCO). Here, a consecutive occurrence is a pair , ,
such that occurs at positions and in and there is no occurrence
of between and , and their distance is defined as . Given a
pattern and a parameter , the goal is to report the top- consecutive
occurrences of in of minimal distance. The challenge is to compactly
represent while supporting queries in time close to length of and .
We give two time-space trade-offs for the problem. Let be the length of
, the length of , and . Our first result achieves
space and optimal query time of , and our second result
achieves linear space and query time . Along the way, we
develop several techniques of independent interest, including a new translation
of the problem into a line segment intersection problem and a new recursive
clustering technique for trees.Comment: Fixed typos, minor change
String Indexing for Top-k Close Consecutive Occurrences
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string P, report all occurrences of P within S. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-k close consecutive occurrences problem (Sitcco). Here, a consecutive occurrence is a pair (i,j), i < j, such that P occurs at positions i and j in S and there is no occurrence of P between i and j, and their distance is defined as j-i. Given a pattern P and a parameter k, the goal is to report the top-k consecutive occurrences of P in S of minimal distance. The challenge is to compactly represent S while supporting queries in time close to the length of P and k. We give two time-space trade-offs for the problem. Let n be the length of S, m the length of P, and ? ? (0,1]. Our first result achieves O(nlog n) space and optimal query time of O(m+k), and our second result achieves linear space and query time O(m+k^{1+?}). Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees
Space-Efficient String Indexing for Wildcard Pattern Matching
In this paper we describe compressed indexes that support pattern matching queries for strings with wildcards. For a constant size alphabet our data structure uses O(n.log^e(n)) bits for any e>0 and reports all occ occurrences of a wildcard string in O(m+s^g.M(n)+occ) time, where M(n)=o(log(log(log(n)))), s is the alphabet size, m is the number of alphabet symbols and g is the number of wildcard symbols in the query string. We also present an O(n)-bit index with O((m+s^g+occ).log^e(n)) query time and an O(n{log(log(n))}^2)-bit index with O((m+s^g+occ).log(log(n))) query time. These are the first non-trivial data structures for this problem that need o(n.log(n)) bits of space
Deterministic indexing for packed strings
Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In the deterministic variant the goal is to solve the string indexing problem without any randomization (at preprocessing time or query time). In the packed variant the strings are stored with several character in a single word, giving us the opportunity to read multiple characters simultaneously. Our main result is a new string index in the deterministic and packed setting. Given a packed string S of length n over an alphabet s, we show how to preprocess S in O(n) (deterministic) time and space O(n) such that given a packed pattern string of length m we can support queries in (deterministic) time O(m/a + log m + log log s), where a = w /log s is the number of characters packed in a word of size w = log n. Our query time is always at least as good as the previous best known bounds and whenever several characters are packed in a word, i.e., log s << w, the query times are faster
Linear-time String Indexing and Analysis in Small Space
The field of succinct data structures has flourished over the past 16 years. Starting from the compressed suffix array by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. For example, one can compare two genomes by building a common index for their concatenation and by detecting common substructures by querying the index. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis: We show that the BWT of a string T is an element of {1, . . . , sigma}(n) can be built in deterministic O(n) time using just O(n log sigma) bits of space, where sigma We also show how to build many of the existing indexes based on the BWT, such as the compressed suffix array, the compressed suffix tree, and the bidirectional BWT index, in randomized O(n) time and in O(n log sigma) bits of space. The previously fastest construction algorithms for BWT, compressed suffix array and compressed suffix tree, which used O(n log sigma) bits of space, took O(n log log sigma) time for the first two structures and O(n log(epsilon) n) time for the third, where. is any positive constant smaller than one. Alternatively, the BWT could be previously built in linear time if one was willing to spend O(n log sigma log log(sigma) n) bits of space. Contrary to the state-of-the-art, our bidirectional BWT index supports every operation in constant time per element in its output.Peer reviewe
- …