3,116 research outputs found
Sampled longest common prefix array
When augmented with the longest common prefix (LCP) array and some other structures, the suffix array can solve many string processing problems in optimal time and space. A compressed representation of the LCP array is also one of the main building blocks in many compressed suffix tree proposals. In this paper, we describe a new compressed LCP representation: the sampled LCP array. We show that when used with a compressed suffix array (CSA), the sampled LCP array often offers better time/space trade-offs than the existing alternatives. We also show how to construct the compressed representations of the LCP array directly from a CS
Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array
The longest common prefix (LCP) array is a versatile auxiliary data structure
in indexed string matching. It can be used to speed up searching using the
suffix array (SA) and provides an implicit representation of the topology of an
underlying suffix tree. The LCP array of a string of length can be
represented as an array of length words, or, in the presence of the SA, as
a bit vector of bits plus asymptotically negligible support data
structures. External memory construction algorithms for the LCP array have been
proposed, but those proposed so far have a space requirement of words
(i.e. bits) in external memory. This space requirement is in some
practical cases prohibitively expensive. We present an external memory
algorithm for constructing the bit version of the LCP array which uses
bits of additional space in external memory when given a
(compressed) BWT with alphabet size and a sampled inverse suffix array
at sampling rate . This is often a significant space gain in
practice where is usually much smaller than or even constant. We
also consider the case of computing succinct LCP arrays for circular strings
Deterministic sub-linear space LCE data structures with efficient construction
Given a string of symbols, a longest common extension query
asks for the length of the longest common prefix of the
th and th suffixes of . LCE queries have several important
applications in string processing, perhaps most notably to suffix sorting.
Recently, Bille et al. (J. Discrete Algorithms 25:42-50, 2014, Proc. CPM 2015:
65-76) described several data structures for answering LCE queries that offers
a space-time trade-off between data structure size and query time. In
particular, for a parameter , their best deterministic
solution is a data structure of size which allows LCE queries to be
answered in time. However, the construction time for all
deterministic versions of their data structure is quadratic in . In this
paper, we propose a deterministic solution that achieves a similar space-time
trade-off of query time using
space, but significantly improve the construction time to
.Comment: updated titl
Optimal-Time Text Indexing in BWT-runs Bounded Space
Indexing highly repetitive texts --- such as genomic databases, software
repositories and versioned text collections --- has become an important problem
since the turn of the millennium. A relevant compressibility measure for
repetitive texts is , the number of runs in their Burrows-Wheeler Transform
(BWT). One of the earliest indexes for repetitive collections, the Run-Length
FM-index, used space and was able to efficiently count the number of
occurrences of a pattern of length in the text (in loglogarithmic time per
pattern symbol, with current techniques). However, it was unable to locate the
positions of those occurrences efficiently within a space bounded in terms of
. Since then, a number of other indexes with space bounded by other measures
of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size
of the smallest grammar generating the text, the size of the smallest automaton
recognizing the text factors --- have been proposed for efficiently locating,
but not directly counting, the occurrences of a pattern. In this paper we close
this long-standing problem, showing how to extend the Run-Length FM-index so
that it can locate the occurrences efficiently within space (in
loglogarithmic time each), and reaching optimal time within
space, on a RAM machine of bits. Within
space, our index can also count in optimal time .
Raising the space to , we support count and locate in
and time, which is optimal in the
packed setting and had not been obtained before in compressed space. We also
describe a structure using space that replaces the text and
extracts any text substring of length in almost-optimal time
. (...continues...
Longest Common Extensions in Sublinear Space
The longest common extension problem (LCE problem) is to construct a data
structure for an input string of length that supports LCE
queries. Such a query returns the length of the longest common prefix of the
suffixes starting at positions and in . This classic problem has a
well-known solution that uses space and query time. In this paper
we show that for any trade-off parameter , the problem can
be solved in space and query time. This
significantly improves the previously best known time-space trade-offs, and
almost matches the best known time-space product lower bound.Comment: An extended abstract of this paper has been accepted to CPM 201
Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space
Indexing highly repetitive texts - such as genomic databases, software
repositories and versioned text collections - has become an important problem
since the turn of the millennium. A relevant compressibility measure for
repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms
(BWTs). One of the earliest indexes for repetitive collections, the Run-Length
FM-index, used O(r) space and was able to efficiently count the number of
occurrences of a pattern of length m in the text (in loglogarithmic time per
pattern symbol, with current techniques). However, it was unable to locate the
positions of those occurrences efficiently within a space bounded in terms of
r. In this paper we close this long-standing problem, showing how to extend the
Run-Length FM-index so that it can locate the occ occurrences efficiently
within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m
+ occ), within O(r log log w ({\sigma} + n/r)) space, for a text of length n
over an alphabet of size {\sigma} on a RAM machine with words of w =
{\Omega}(log n) bits. Within that space, our index can also count in optimal
time, O(m). Multiplying the space by O(w/ log {\sigma}), we support count and
locate in O(dm log({\sigma})/we) and O(dm log({\sigma})/we + occ) time, which
is optimal in the packed setting and had not been obtained before in compressed
space. We also describe a structure using O(r log(n/r)) space that replaces the
text and extracts any text substring of length ` in almost-optimal time
O(log(n/r) + ` log({\sigma})/w). Within that space, we similarly provide direct
access to suffix array, inverse suffix array, and longest common prefix array
cells, and extend these capabilities to full suffix tree functionality,
typically in O(log(n/r)) time per operation.Comment: submitted version; optimal count and locate in smaller space: O(r log
log_w(n/r + sigma)
- …