240 research outputs found
String Indexing for Patterns with Wildcards
We consider the problem of indexing a string of length to report the
occurrences of a query pattern containing characters and wildcards.
Let be the number of occurrences of in , and the size of
the alphabet. We obtain the following results.
- A linear space index with query time .
This significantly improves the previously best known linear space index by Lam
et al. [ISAAC 2007], which requires query time in the worst case.
- An index with query time using space , where is the maximum number of wildcards allowed in the pattern.
This is the first non-trivial bound with this query time.
- A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].
We also show that these indexes can be generalized to allow variable length
gaps in the pattern. Our results are obtained using a novel combination of
well-known and new techniques, which could be of independent interest
Gapped Indexing for Consecutive Occurrences
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns P? and P? and a gap range [?, ?] we can quickly find the consecutive occurrences of P? and P? with distance in [?, ?], i.e., pairs of subsequent occurrences with distance within the range. We present data structures that use O?(n) space and query time O?(|P?|+|P?|+n^{2/3}) for existence and counting and O?(|P?|+|P?|+n^{2/3}occ^{1/3}) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using O?(n) space must use ??(|P?| + |P?| + ?n) query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem
Linear Time Construction of Cover Suffix Tree and Applications
The Cover Suffix Tree (CST) of a string is the suffix tree of with
additional explicit nodes corresponding to halves of square substrings of .
In the CST an explicit node corresponding to a substring of is
annotated with two numbers: the number of non-overlapping consecutive
occurrences of and the total number of positions in that are covered by
occurrences of in . Kociumaka et al. (Algorithmica, 2015) have shown how
to compute the CST of a length- string in time. We show how to
compute the CST in time assuming that is over an integer alphabet.
Kociumaka et al. (Algorithmica, 2015; Theor. Comput. Sci., 2018) have shown
that knowing the CST of a length- string , one can compute a linear-sized
representation of all seeds of as well as all shortest -partial
covers and seeds in for a given in time. Thus our result
implies linear-time algorithms computing these notions of quasiperiodicity. The
resulting algorithm computing seeds is substantially different from the
previous one (Kociumaka et al., SODA 2012, ACM Trans. Algorithms, 2020).
Kociumaka et al. (Algorithmica, 2015) proposed an -time algorithm
for computing a shortest -partial cover for each ;
we improve this complexity to .
Our results are based on a new characterization of consecutive overlapping
occurrences of a substring of in terms of the set of runs (see Kolpakov
and Kucherov, FOCS 1999) in . This new insight also leads to an -sized
index for reporting overlapping consecutive occurrences of a given pattern
of length in time, where is the number of
occurrences reported. In comparison, a general index for reporting bounded-gap
consecutive occurrences of Navarro and Thankachan (Theor. Comput. Sci., 2016)
uses space.Comment: Accepted to ESA 2023. Abstract abridged to satisfy arxiv requirement
Faster algorithms for computing maximal multirepeats in multiple sequences
A repeat in a string is a substring that occurs more than once. A repeat is extendible if every occurrence of the repeat has an identical letter either on the left or on the right; otherwise, it is maximal. A multirepeat is a repeat that occurs at least mmin times (mmin greater than/equal to 2) in each of at least q greater than/equal to 1 strings in a given set of strings. In this paper, we describe a family of efficient algorithms based on suffix arrays to compute maximal multirepeats under various constraints. Our algorithms are faster, more flexible and much more space-efficient than algorithms recently proposed for this problem. The results extend recent work by two of the authors computing all maximal repeats in a single string
String Indexing for Top-k Close Consecutive Occurrences
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string P, report all occurrences of P within S. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-k close consecutive occurrences problem (Sitcco). Here, a consecutive occurrence is a pair (i,j), i < j, such that P occurs at positions i and j in S and there is no occurrence of P between i and j, and their distance is defined as j-i. Given a pattern P and a parameter k, the goal is to report the top-k consecutive occurrences of P in S of minimal distance. The challenge is to compactly represent S while supporting queries in time close to the length of P and k. We give two time-space trade-offs for the problem. Let n be the length of S, m the length of P, and ? ? (0,1]. Our first result achieves O(nlog n) space and optimal query time of O(m+k), and our second result achieves linear space and query time O(m+k^{1+?}). Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees
String Indexing for Top- Close Consecutive Occurrences
The classic string indexing problem is to preprocess a string into a
compact data structure that supports efficient subsequent pattern matching
queries, that is, given a pattern string , report all occurrences of
within . In this paper, we study a basic and natural extension of string
indexing called the string indexing for top- close consecutive occurrences
problem (SITCCO). Here, a consecutive occurrence is a pair , ,
such that occurs at positions and in and there is no occurrence
of between and , and their distance is defined as . Given a
pattern and a parameter , the goal is to report the top- consecutive
occurrences of in of minimal distance. The challenge is to compactly
represent while supporting queries in time close to length of and .
We give two time-space trade-offs for the problem. Let be the length of
, the length of , and . Our first result achieves
space and optimal query time of , and our second result
achieves linear space and query time . Along the way, we
develop several techniques of independent interest, including a new translation
of the problem into a line segment intersection problem and a new recursive
clustering technique for trees.Comment: Fixed typos, minor change
Faster algorithms for computing maximal multirepeats in multiple sequences
A repeat in a string is a substring that occurs more than once. A repeat is extendible if every occurrence of the repeat has an identical letter either on the left or on the right; otherwise, it is maximal. A multirepeat is a repeat that occurs at least mmin times (mmin _ 2) in each of at least q (greater than or equal to) 1 strings in a given set of strings. In this paper, we describe a family of efficient algorithms based on suffix arrays to compute maximal multirepeats under various constraints. Our algorithms are faster, more flexible and much more space-efficient than algorithms recently proposed for this problem. The results extend recent work by two of the authors computing all maximal repeats in a single string
Pattern Discovery in Colored Strings
In this paper, we consider the problem of identifying patterns of interest in
colored strings. A colored string is a string where each position is assigned
one of a finite set of colors. Our task is to find substrings of the colored
string that always occur followed by the same color at the same distance. The
problem is motivated by applications in embedded systems verification, in
particular, assertion mining. The goal there is to automatically find
properties of the embedded system from the analysis of its simulation traces.
We show that, in our setting, the number of patterns of interest is
upper-bounded by , where is the length of the string. We
introduce a baseline algorithm, running in time, which
identifies all patterns of interest satisfying certain minimality conditions,
for all colors in the string. For the case where one is interested in patterns
related to one color only, we also provide a second algorithm which runs in
time in the worst case but is faster than the baseline
algorithm in practice. Both solutions use suffix trees, and the second
algorithm also uses an appropriately defined priority queue, which allows us to
reduce the number of computations. We performed an experimental evaluation of
the proposed approaches over both synthetic and real-world datasets, and found
that the second algorithm outperforms the first algorithm on all simulated
data, while on the real-world data, the performance varies between a slight
slowdown (on half of the datasets) and a speedup by a factor of up to 11.Comment: 22 pages, 5 figures, 2 tables, published in ACM Journal of
Experimental Algorithmics. This is the journal version of the paper with the
same title at SEA 2020 (18th Symposium on Experimental Algorithms, Catania,
Italy, June 16-18, 2020
The streaming -mismatch problem
We consider the streaming complexity of a fundamental task in approximate
pattern matching: the -mismatch problem. It asks to compute Hamming
distances between a pattern of length and all length- substrings of a
text for which the Hamming distance does not exceed a given threshold . In
our problem formulation, we report not only the Hamming distance but also, on
demand, the full \emph{mismatch information}, that is the list of mismatched
pairs of symbols and their indices. The twin challenges of streaming pattern
matching derive from the need both to achieve small working space and also to
guarantee that every arriving input symbol is processed quickly.
We present a streaming algorithm for the -mismatch problem which uses
bits of space and spends \ourcomplexity time on
each symbol of the input stream, which consists of the pattern followed by the
text. The running time almost matches the classic offline solution and the
space usage is within a logarithmic factor of optimal.
Our new algorithm therefore effectively resolves and also extends an open
problem first posed in FOCS'09. En route to this solution, we also give a
deterministic -bit encoding of all
the alignments with Hamming distance at most of a length- pattern within
a text of length . This secondary result provides an optimal solution to
a natural communication complexity problem which may be of independent
interest.Comment: 27 page
- …