97 research outputs found
Size-constrained Weighted Ancestors with Applications
The weighted ancestor problem on a rooted node-weighted tree T is a
generalization of the classic predecessor problem: construct a data structure
for a set of integers that supports fast predecessor queries. Both problems are
known to require Ω(log log n) time for queries provided O(n polylog n)
space is available, where n is the input
size. The weighted ancestor problem has attracted a lot of attention from the
combinatorial pattern matching community due to its direct application to
suffix trees. In this formulation of the problem, the nodes are weighted by
string depth. This attention has culminated in a data structure for weighted
ancestors in suffix trees with O(1) query time and an
O(n)-time construction algorithm [Belazzougui et al., CPM 2021]. In
this paper, we consider a different version of the weighted ancestor problem,
where the nodes are weighted by any function weight that maps the
nodes of T to positive integers, such that weight(u) ≤ size(u) for any node u
and weight(u1) ≤ weight(u2) if node u1 is a descendant of node u2, where
size(u) is the number of nodes in the subtree rooted at u. In the
size-constrained weighted ancestor (SWAQ) problem, for any node u of T and
any integer k, we are asked to return the lowest ancestor w of u with
weight at least k. We show that for any rooted tree with n nodes, we can
locate node w in O(1) time after O(n)-time
preprocessing. In particular, this implies a data structure for the SWAQ
problem in suffix trees with O(1) query time and
O(n)-time preprocessing, when the nodes are weighted by
weight. We also show several string-processing applications of this
result.
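A naive baseline can make the SWAQ query concrete. The sketch below is not the constant-time structure announced in the abstract; it simply walks up parent pointers, so a query costs time proportional to the depth of the node. The names `subtree_sizes` and `swaq_naive`, the parent-array encoding, and the convention that a node counts as its own ancestor are assumptions made for illustration.

```python
def subtree_sizes(parent):
    """Return size[u], the number of nodes in the subtree rooted at u.

    parent[u] is the parent of node u, or -1 if u is the root (assumed encoding).
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    root = -1
    for u, p in enumerate(parent):
        if p == -1:
            root = u
        else:
            children[p].append(u)
    # Breadth-first order from the root: every parent precedes its children.
    order = [root]
    for u in order:
        order.extend(children[u])
    size = [1] * n
    # Accumulate sizes bottom-up by scanning the order in reverse.
    for u in reversed(order):
        if parent[u] != -1:
            size[parent[u]] += size[u]
    return size


def swaq_naive(parent, weight, u, k):
    """Return the lowest ancestor of u (u itself included) with weight >= k.

    Returns None if no such ancestor exists. Runs in O(depth of u) time.
    """
    while u != -1:
        if weight[u] >= k:
            return u
        u = parent[u]
    return None


# Hypothetical example tree: 0 is the root, 1 and 2 are its children,
# 3 and 4 hang under 1, and 5 hangs under 2.
parent = [-1, 0, 0, 1, 1, 2]
size = subtree_sizes(parent)
```

Using subtree size itself as the weight satisfies both constraints stated in the abstract, so `swaq_naive(parent, size, u, k)` is a valid instance of the problem on this small tree.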
Text Indexing for Long Patterns: Anchors are All you Need
PVLDB Artifact Availability: The source code, data, and/or other artifacts have been made available at https://github.com/lorrainea/BDA-index. Copyright © 2023 the owner/author(s).
In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (i) index space; (ii) query time; (iii) construction space; and (iv) construction time. Unfortunately, however, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with locally consistent anchors (lc-anchors) offers remarkably good performance in all four measures, when we have at hand a lower bound l on the length of the queried patterns, which is arguably a quite reasonable assumption in practical applications. Specifically, we improve on the construction of the index proposed by Loukides and Pissis, which is based on bidirectional string anchors (bd-anchors), a new type of lc-anchors, by: (i) designing an average-case linear-time algorithm to compute bd-anchors; and (ii) developing a semi-external-memory implementation to construct the index in small space using near-optimal work. We then present an extensive experimental evaluation, based on the four measures, using real benchmark datasets.
The results show that, for long patterns, the index constructed using our improved algorithms compares favorably to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index.
European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively; and by UKRI through REPHRAIN (EP/V011189/1)
Substring Complexity in Sublinear Space
Shannon's entropy is a definitive lower bound for statistical compression.
Unfortunately, no such clear measure exists for the compressibility of
repetitive strings. Thus, ad-hoc measures are employed to estimate the
repetitiveness of strings, e.g., the size of the Lempel-Ziv parse or the
number of equal-letter runs of the Burrows-Wheeler transform. A more recent
one is the size of a smallest string attractor. Unfortunately, Kempa
and Prezza [STOC 2018] showed that computing is NP-hard. Kociumaka et
al. [LATIN 2020] considered a new measure that is based on the function
counting the cardinalities of the sets of substrings of each length of ,
also known as the substring complexity. This new measure is defined as and lower bounds all the measures previously
considered. In particular, always holds and can be
computed in time using working space. Kociumaka et
al. showed that if is given, one can construct an -sized representation of supporting efficient direct
access and efficient pattern matching queries on . Given that for highly
compressible strings, is significantly smaller than , it is natural
to pose the following question: Can we compute efficiently using
sublinear working space?
It is straightforward to show that any algorithm computing using
space requires time through a reduction
from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present
the following results: an -time and
-space algorithm to compute , for any ; and
an -time and -space algorithm to
compute , for any
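To make the measure concrete, here is a minimal quadratic-space sketch of the substring complexity function and of the measure of Kociumaka et al., which in the literature is written δ = max over k ≥ 1 of S_T(k)/k (the symbols are missing from the listing text above, so this formula is supplied from the cited work). This is a brute-force illustration, not the sublinear-space algorithms the abstract announces; the function names are hypothetical.

```python
from fractions import Fraction


def substring_complexity(t):
    """Return [S(1), ..., S(n)], where S(k) is the number of distinct
    substrings of length k of the non-empty string t."""
    n = len(t)
    return [len({t[i:i + k] for i in range(n - k + 1)}) for k in range(1, n + 1)]


def delta(t):
    """Return max_k S(k)/k as an exact fraction (t must be non-empty)."""
    counts = substring_complexity(t)
    return max(Fraction(s, k) for k, s in enumerate(counts, start=1))
```

For example, `substring_complexity("abab")` yields the counts for lengths 1 to 4, and `delta("abab")` picks the largest ratio among them. Exact fractions avoid floating-point ties when comparing ratios.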
Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications
An elastic-degenerate (ED) string T is a sequence of n sets T[1], ..., T[n] containing m strings in total whose cumulative length is N. We call n, m, and N the length, the cardinality and the size of T, respectively. The language of T is defined as L(T) = {S1 ··· Sn : Si ∈ T[i] for all i ∈ [1, n]}. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem. For two ED strings T1 and T2 of lengths n1 and n2, cardinalities m1 and m2, and sizes N1 and N2, respectively, we show the following: There is no O((N1 N2)^{1−ε})-time algorithm, thus no O((N1 m2 + N2 m1)^{1−ε})-time algorithm and no O((N1 n2 + N2 n1)^{1−ε})-time algorithm, for any constant ε > 0, for EDSI even when T1 and T2 are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false. There is no combinatorial O((N1 + N2)^{1.2−ε} f(n1, n2))-time algorithm, for any constant ε > 0 and any function f, for EDSI even when T1 and T2 are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false. An O(N1 log N1 log n1 + N2 log N2 log n2)-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when T1 and T2 are given in a compact representation, we show that the problem is NP-complete. An O(N1 m2 + N2 m1)-time algorithm for EDSI. An Õ(N1^{ω−1} n2 + N2^{ω−1} n1)-time algorithm for EDSI, where ω is the exponent of matrix multiplication; the Õ notation suppresses factors that are polylogarithmic in the input size. We also show that the techniques we develop have applications outside of ED string comparison.
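The definitions of L(T) and of the EDSI problem can be illustrated with a brute-force sketch. It enumerates both languages explicitly, so it takes exponential time in the worst case and is in no way one of the algorithms listed in the abstract. Representing an ED string as a Python list of sets of strings, and the names `language` and `ed_intersect`, are assumptions made for this example.

```python
from itertools import product


def language(ed):
    """Return L(T): all concatenations S1...Sn with Si chosen from ed[i]."""
    return {"".join(choice) for choice in product(*ed)}


def ed_intersect(t1, t2):
    """Return True if the languages of the two ED strings share a string."""
    return not language(t1).isdisjoint(language(t2))


# Hypothetical toy instances.
t1 = [{"a", "ab"}, {"c", "bc"}]   # L(t1) = {"ac", "abc", "abbc"}
t2 = [{"abc"}]
t3 = [{"b"}, {"c"}]
```

Here `ed_intersect(t1, t2)` holds because "abc" belongs to both languages, while `t1` and `t3` have disjoint languages.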
Faster algorithms for longest common substring
In the classic longest common substring (LCS) problem, we are given two strings S and T, each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an O(n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an O(n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in O(n log σ / log n) space and read in O(n log σ / log n) time. We show that, in this model, we can compute an LCS in time O(n log σ / √(log n)), which is sublinear in n if σ = 2^{o(√(log n))} (in particular, if σ = O(1)), using optimal space O(n log σ / log n).
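As a point of reference for the problem statement, the textbook dynamic program below computes an LCS in time proportional to |S|·|T|. It is a swapped-in classical technique, far slower than the suffix-tree and word-RAM algorithms discussed above; the function name is hypothetical.

```python
def longest_common_substring(s, t):
    """Return one longest string occurring as a fragment of both s and t.

    prev[j] holds the length of the longest common suffix of s[:i-1] and t[:j].
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return s[best_end - best_len:best_end]
```

Only two rows of the table are kept, so the extra space is linear in |T|.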
We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in O(n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in O(n log^k n) time for k = O(1) [J. Comput. Biol. 2016]. We show an O(n log^{k-1/2} n)-time algorithm, for any constant integer k > 0 and irrespective of the alphabet size, using O(n) space as the previous approaches. We thus notably break through the well-known n log^k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors.
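The k-mismatch variant can be sketched with a simple quadratic-time baseline: for every alignment of S against T, slide a window that tolerates at most k mismatching positions. This is a swapped-in elementary technique for illustration only, not the O(n log^{k-1/2} n)-time algorithm of the abstract; the function name is hypothetical.

```python
def k_mismatch_lcs(s, t, k):
    """Return a longest substring of s that occurs in t with at most k mismatches."""
    best_len, best_start = 0, 0
    # d = i - j identifies one diagonal, i.e. one alignment of s against t.
    for d in range(-(len(t) - 1), len(s)):
        i0 = max(0, d)          # first position of s on this diagonal
        j0 = i0 - d             # matching first position of t
        length = min(len(s) - i0, len(t) - j0)
        left = 0                # window is [left, right] along the diagonal
        mismatches = 0
        for right in range(length):
            if s[i0 + right] != t[j0 + right]:
                mismatches += 1
            # Shrink the window until it holds at most k mismatches again.
            while mismatches > k:
                if s[i0 + left] != t[j0 + left]:
                    mismatches -= 1
                left += 1
            if right - left + 1 > best_len:
                best_len, best_start = right - left + 1, i0 + left
    return s[best_start:best_start + best_len]
```

With k = 0 the function degenerates to an exact longest common substring, which gives a convenient sanity check.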
Text indexing for long patterns: Anchors are all you need
In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (i) index space; (ii) query time; (iii) construction space; and (iv) construction time. Unfortunately, however, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with locally consistent anchors (lc-anchors) offers remarkably good performance in all four measures, when we have at hand a lower bound ℓ on the length of the queried patterns, which is arguably a quite reasonable assumption in practical applications. Specifically, we improve on the construction of the index proposed by Loukides and Pissis, which is based on bidirectional string anchors (bd-anchors), a new type of lc-anchors, by: (i) designing an average-case linear-time algorithm to compute bd-anchors; and (ii) developing a semi-external-memory implementation to construct the index in small space using near-optimal work. We then present an extensive experimental evaluation, based on the four measures, using real benchmark datasets. The results show that, for long patterns, the index constructed using our improved algorithms compares favorably to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index.
String Covering: A Survey
The study of strings is an important combinatorial field that precedes the
digital computer. Strings can be very long, trillions of letters, so it is
important to find compact representations. Here we first survey various forms
of one potential compaction methodology, the cover of a given string x,
initially proposed in a simple form in 1990, but increasingly of interest as
more sophisticated variants have been discovered. We then consider covering by
a seed; that is, a cover of a superstring of x. We conclude with many proposals
for research directions that could make significant contributions to string
processing in future
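The basic notion surveyed here can be illustrated briefly. Under the usual definition, a string w is a cover of x if every position of x lies inside some occurrence of w in x. The sketch below checks this directly and finds a shortest cover by trying prefixes in increasing length; it is a straightforward quadratic illustration, not one of the linear-time algorithms from the covering literature, and the function names are hypothetical.

```python
def is_cover(w, x):
    """Return True if occurrences of w in x cover every position of x."""
    m = len(w)
    if m == 0 or m > len(x):
        return False
    covered_up_to = 0  # invariant: positions x[0:covered_up_to] are covered
    for i in range(len(x) - m + 1):
        if i > covered_up_to:
            return False  # position covered_up_to can no longer be covered
        if x.startswith(w, i):
            covered_up_to = i + m
    return covered_up_to == len(x)


def shortest_cover(x):
    """Return the shortest cover of x (x itself if nothing shorter works)."""
    for m in range(1, len(x) + 1):
        if is_cover(x[:m], x):
            return x[:m]
    return ""
```

A cover is necessarily a prefix of x, which is why `shortest_cover` only tests prefixes.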
Internal Shortest Absent Word Queries in Constant Time and Linear Space
Given a string T of length n over an alphabet Σ ⊂ {1, 2, ..., n^{O(1)}} of size σ, we are to preprocess T so that given a range [i, j], we can return a representation of a shortest string over Σ that is absent in the fragment T[i] ··· T[j] of T. We present an O(n)-space data structure that answers such queries in constant time and can be constructed in O(n log_σ n) time.
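The query itself can be illustrated with a brute-force sketch that answers one range at a time without any preprocessing. It is only meant to pin down what a shortest absent word is, and has nothing to do with the constant-time data structure described above. The function name, the 0-based inclusive range, and returning the lexicographically smallest answer are assumptions made for this example.

```python
from itertools import product


def shortest_absent_word(t, i, j, alphabet):
    """Return a shortest string over `alphabet` that does not occur in t[i..j].

    The range is 0-based and inclusive; `alphabet` must be non-empty.
    Among shortest absent words, the lexicographically smallest is returned.
    """
    letters = sorted(set(alphabet))
    frag = t[i:j + 1]
    length = 1
    while True:
        present = {frag[p:p + length] for p in range(len(frag) - length + 1)}
        # Some word of this length is absent iff fewer than sigma^length occur.
        if len(present) < len(letters) ** length:
            for cand in product(letters, repeat=length):
                word = "".join(cand)
                if word not in present:
                    return word
        length += 1
```

The loop always terminates, since a fragment of length m contains at most m − L + 1 distinct substrings of length L and none at all once L exceeds m.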
Internal shortest absent word queries
Given a string T of length n over an alphabet Σ ⊂ {1, 2, ..., n^{O(1)}} of size σ, we are to preprocess T so that given a range [i, j], we can return a representation of a shortest string over Σ that is absent in the fragment T[i] ··· T[j] of T. For any positive integer k ∈ [1, log log_σ n], we present an O((n/k) · log log_σ n)-size data structure, which can be constructed in O(n log_σ n) time, and answers queries in time O(log log_σ k).
Elastic-Degenerate String Matching with 1 Error
An elastic-degenerate string is a sequence of n finite sets of strings of
total length N, introduced to represent a set of related DNA sequences, also
known as a pangenome. The ED string matching (EDSM) problem consists in
reporting all occurrences of a pattern of length m in an ED text. This
problem has recently received some attention from the combinatorial pattern
matching community, culminating in an
Õ(nm^{ω−1}) + O(N)-time algorithm [Bernardini
et al., SIAM J. Comput. 2022], where ω denotes the matrix multiplication
exponent and the Õ(·) notation suppresses polylog
factors. In the k-EDSM problem, the approximate version of EDSM, we are asked
to report all pattern occurrences with at most k errors. k-EDSM can be
solved in O(k^2 mG + kN) time, under edit distance, or
O(kmG + kN) time, under Hamming distance, where G denotes the total
number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020].
Unfortunately, G is only bounded by N, and so even for k = 1, the existing
algorithms run in Ω(mN) time in the worst case. In this paper we show
that 1-EDSM can be solved in O((nm^2 + N) log m) or
O(nm^3 + N) time under edit distance. For the decision version, we
present a faster O(nm^2 √(log m) + N log log m)-time algorithm.
We also show that 1-EDSM can be solved in O(nm^2 + N log m) time
under Hamming distance. Our algorithms for edit distance rely on non-trivial
reductions from 1-EDSM to special instances of classic computational geometry
problems (2d rectangle stabbing or 2d range emptiness), which we show how to
solve efficiently. In order to obtain an even faster algorithm for Hamming
distance, we rely on employing and adapting the k-errata trees for indexing
with errors [Cole et al., STOC 2004].
Comment: This is an extended version of a paper accepted at LATIN 202