String Indexing for Patterns with Wildcards

Author: A. Tam
B. Chazelle
D. Harel
D. Tsur
G. Chen
G. Landau
G. Navarro
H.L. Chan
K. Hofmann
L.P. Coelho
M. Lewenstein
M. Maas
M.L. Fredman
P. Bille
P. Clifford
T.-W. Lam
Z. Galil
Publication date: 01/01/2012

We consider the problem of indexing a string $t$ of length $n$ to report the occurrences of a query pattern $p$ containing $m$ characters and $j$ wildcards. Let $occ$ be the number of occurrences of $p$ in $t$, and $\sigma$ the size of the alphabet. We obtain the following results.

- A linear space index with query time $O(m+\sigma^j \log \log n + occ)$. This significantly improves the previously best known linear space index by Lam et al. [ISAAC 2007], which requires query time $\Theta(jn)$ in the worst case.
- An index with query time $O(m+j+occ)$ using space $O(\sigma^{k^2} n \log^k \log n)$, where $k$ is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.
- A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].

We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest.
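To make the query semantics concrete, the following is a minimal index-free baseline, not the index described in the abstract. It assumes that a wildcard matches exactly one arbitrary character and is written as `*`; the function name and the wildcard symbol are illustrative choices, not taken from the paper. It runs in time proportional to the text length times the pattern length.

```python
def wildcard_occurrences(t: str, p: str, wildcard: str = "*") -> list[int]:
    """Return every start position in t where p matches.

    Assumption: each wildcard symbol in p matches exactly one
    arbitrary character of t. This is a brute-force scan, useful
    only as a reference for what an index must report.
    """
    n, m = len(t), len(p)
    positions = []
    for i in range(n - m + 1):
        # Compare p against the window t[i:i+m], skipping wildcards.
        if all(pc == wildcard or pc == t[i + k] for k, pc in enumerate(p)):
            positions.append(i)
    return positions


if __name__ == "__main__":
    print(wildcard_occurrences("abracadabra", "a*a"))
```

Such a scan can serve as a correctness check when experimenting with an actual index on small inputs.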

arXiv.org e-Print Archive

Crossref

Online Research Database In Technology

Gapped Indexing for Consecutive Occurrences

Author: Bille Philip
Steiner Teresa Anna
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)
Publication date: 01/01/2021

The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in $S$), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns $P_1$ and $P_2$ and a gap range $[\alpha, \beta]$ we can quickly find the consecutive occurrences of $P_1$ and $P_2$ with distance in $[\alpha, \beta]$, i.e., pairs of subsequent occurrences with distance within the range. We present data structures that use $\tilde{O}(n)$ space and query time $\tilde{O}(|P_1|+|P_2|+n^{2/3})$ for existence and counting and $\tilde{O}(|P_1|+|P_2|+n^{2/3}occ^{1/3})$ for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using $\tilde{O}(n)$ space must use $\tilde{\Omega}(|P_1| + |P_2| + \sqrt{n})$ query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem.

Dagstuhl Research Online Publication Server

Linear Time Construction of Cover Suffix Tree and Applications

Author: Radoszewski Jakub
Publication date: 08/08/2023

The Cover Suffix Tree (CST) of a string $T$ is the suffix tree of $T$ with additional explicit nodes corresponding to halves of square substrings of $T$. In the CST an explicit node corresponding to a substring $C$ of $T$ is annotated with two numbers: the number of non-overlapping consecutive occurrences of $C$ and the total number of positions in $T$ that are covered by occurrences of $C$ in $T$. Kociumaka et al. (Algorithmica, 2015) have shown how to compute the CST of a length-$n$ string in $O(n \log n)$ time. We show how to compute the CST in $O(n)$ time assuming that $T$ is over an integer alphabet. Kociumaka et al. (Algorithmica, 2015; Theor. Comput. Sci., 2018) have shown that knowing the CST of a length-$n$ string $T$, one can compute a linear-sized representation of all seeds of $T$ as well as all shortest $\alpha$-partial covers and seeds in $T$ for a given $\alpha$ in $O(n)$ time. Thus our result implies linear-time algorithms computing these notions of quasiperiodicity. The resulting algorithm computing seeds is substantially different from the previous one (Kociumaka et al., SODA 2012, ACM Trans. Algorithms, 2020). Kociumaka et al. (Algorithmica, 2015) proposed an $O(n \log n)$-time algorithm for computing a shortest $\alpha$-partial cover for each $\alpha=1,\ldots,n$; we improve this complexity to $O(n)$. Our results are based on a new characterization of consecutive overlapping occurrences of a substring $S$ of $T$ in terms of the set of runs (see Kolpakov and Kucherov, FOCS 1999) in $T$. This new insight also leads to an $O(n)$-sized index for reporting overlapping consecutive occurrences of a given pattern $P$ of length $m$ in $O(m+output)$ time, where $output$ is the number of occurrences reported. In comparison, a general index for reporting bounded-gap consecutive occurrences of Navarro and Thankachan (Theor. Comput. Sci., 2016) uses $O(n \log n)$ space.

Comment: Accepted to ESA 2023. Abstract abridged to satisfy arxiv requirement.

arXiv.org e-Print Archive

Faster algorithms for computing maximal multirepeats in multiple sequences

Author: Iliopoulos Costas
Smyth Bill
Yusufu M.
Publication venue: 'IOS Press'
Publication date: 01/01/2009

A repeat in a string is a substring that occurs more than once. A repeat is extendible if every occurrence of the repeat has an identical letter either on the left or on the right; otherwise, it is maximal. A multirepeat is a repeat that occurs at least $m_{\min}$ times ($m_{\min} \ge 2$) in each of at least $q \ge 1$ strings in a given set of strings. In this paper, we describe a family of efficient algorithms based on suffix arrays to compute maximal multirepeats under various constraints. Our algorithms are faster, more flexible and much more space-efficient than algorithms recently proposed for this problem. The results extend recent work by two of the authors computing all maximal repeats in a single string.
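The definitions of repeat, extendible and maximal can be illustrated for a single string with a brute-force sketch. This is not the suffix-array method the abstract refers to, and it makes one assumption that the abstract does not spell out: an occurrence touching the start or end of the string is treated as having no letter on that side, so it cannot share an identical neighbouring letter with other occurrences. The function name is hypothetical.

```python
from collections import defaultdict


def maximal_repeats(s: str) -> set[str]:
    """Return the maximal repeats of s by exhaustive enumeration.

    Assumption: a missing neighbour at a string boundary (None)
    never counts as an identical letter.
    """
    n = len(s)
    starts_of = defaultdict(list)
    # Record the start positions of every substring occurrence.
    for i in range(n):
        for j in range(i + 1, n + 1):
            starts_of[s[i:j]].append(i)

    result = set()
    for w, starts in starts_of.items():
        if len(starts) < 2:
            continue  # not a repeat
        lefts = {s[i - 1] if i > 0 else None for i in starts}
        rights = {s[i + len(w)] if i + len(w) < n else None for i in starts}
        left_extendible = len(lefts) == 1 and None not in lefts
        right_extendible = len(rights) == 1 and None not in rights
        if not (left_extendible or right_extendible):
            result.add(w)
    return result


if __name__ == "__main__":
    print(maximal_repeats("abcabc"))
```

The enumeration takes cubic time and quadratic space in the worst case, so it is only suitable for checking small examples.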

espace@Curtin

Linear Time Construction of Cover Suffix Tree and Applications

Author: Radoszewski Jakub
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st Annual European Symposium on Algorithms (ESA 2023)
Publication date: 01/01/2023

Dagstuhl Research Online Publication Server

String Indexing for Top-k Close Consecutive Occurrences

Author: Bille Philip
Rotenberg Eva
Steiner Teresa Anna
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020)
Publication date: 01/01/2020

The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (Sitcco). Here, a consecutive occurrence is a pair $(i,j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$, and their distance is defined as $j-i$. Given a pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ and $k$. We give two time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $\epsilon \in (0,1]$. Our first result achieves $O(n \log n)$ space and optimal query time of $O(m+k)$, and our second result achieves linear space and query time $O(m+k^{1+\epsilon})$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.
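The query defined above can be stated as a short brute-force reference. This sketch only restates the problem definition and does not reflect the data structures from the paper: it scans the whole string on every query, whereas the paper's indexes answer queries in time close to $m + k$. The helper names are hypothetical, and ties in distance are broken by position, which is an arbitrary choice.

```python
import heapq


def find_all(s: str, p: str) -> list[int]:
    """All start positions of p in s, overlapping ones included, in increasing order."""
    positions = []
    i = s.find(p)
    while i != -1:
        positions.append(i)
        i = s.find(p, i + 1)
    return positions


def top_k_close_consecutive(s: str, p: str, k: int) -> list[tuple[int, int]]:
    """Return the k consecutive occurrences (i, j) of p in s with smallest j - i."""
    occ = find_all(s, p)
    # Adjacent entries of the sorted occurrence list are exactly the
    # pairs (i, j) with no occurrence of p strictly between i and j.
    pairs = list(zip(occ, occ[1:]))
    return heapq.nsmallest(k, pairs, key=lambda ij: ij[1] - ij[0])


if __name__ == "__main__":
    print(top_k_close_consecutive("abcabxxabab", "ab", 2))
```

Here `ab` occurs at positions 0, 3, 7 and 9, so the consecutive occurrences are (0, 3), (3, 7) and (7, 9), with distances 3, 4 and 2.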

Dagstuhl Research Online Publication Server

String Indexing for Top-$k$ Close Consecutive Occurrences

Author: Bille Philip
Gørtz Inge Li
Pedersen Max Rishøj
Rotenberg Eva
Steiner Teresa Anna
Publication date: 29/09/2020

The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair $(i,j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$, and their distance is defined as $j-i$. Given a pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ and $k$. We give two time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $\epsilon\in(0,1]$. Our first result achieves $O(n\log n)$ space and optimal query time of $O(m+k)$, and our second result achieves linear space and query time $O(m+k^{1+\epsilon})$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.

Comment: Fixed typos, minor change

arXiv.org e-Print Archive

Online Research Database In Technology

Faster algorithms for computing maximal multirepeats in multiple sequences

Author: Iliopoulos C.S.
Smyth W.F.
Yusufu M.
Publication venue: 'IOS Press'
Publication date: 01/01/2009

A repeat in a string is a substring that occurs more than once. A repeat is extendible if every occurrence of the repeat has an identical letter either on the left or on the right; otherwise, it is maximal. A multirepeat is a repeat that occurs at least $m_{\min}$ times ($m_{\min} \ge 2$) in each of at least $q \ge 1$ strings in a given set of strings. In this paper, we describe a family of efficient algorithms based on suffix arrays to compute maximal multirepeats under various constraints. Our algorithms are faster, more flexible and much more space-efficient than algorithms recently proposed for this problem. The results extend recent work by two of the authors computing all maximal repeats in a single string.

Research Repository

Pattern Discovery in Colored Strings

Author: Lipták Zsuzsanna
Puglisi Simon J.
Rossi Massimiliano
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2021

In this paper, we consider the problem of identifying patterns of interest in colored strings. A colored string is a string where each position is assigned one of a finite set of colors. Our task is to find substrings of the colored string that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to automatically find properties of the embedded system from the analysis of its simulation traces. We show that, in our setting, the number of patterns of interest is upper-bounded by $\mathcal{O}(n^2)$, where $n$ is the length of the string. We introduce a baseline algorithm, running in $\mathcal{O}(n^2)$ time, which identifies all patterns of interest satisfying certain minimality conditions, for all colors in the string. For the case where one is interested in patterns related to one color only, we also provide a second algorithm which runs in $\mathcal{O}(n^2\log n)$ time in the worst case but is faster than the baseline algorithm in practice. Both solutions use suffix trees, and the second algorithm also uses an appropriately defined priority queue, which allows us to reduce the number of computations. We performed an experimental evaluation of the proposed approaches over both synthetic and real-world datasets, and found that the second algorithm outperforms the first algorithm on all simulated data, while on the real-world data, the performance varies between a slight slowdown (on half of the datasets) and a speedup by a factor of up to 11.

Comment: 22 pages, 5 figures, 2 tables, published in ACM Journal of Experimental Algorithmics. This is the journal version of the paper with the same title at SEA 2020 (18th Symposium on Experimental Algorithms, Catania, Italy, June 16-18, 2020).

arXiv.org e-Print Archive

Catalogo dei prodotti della ricerca

The streaming $k$-mismatch problem

Author: Clifford Raphaël
Kociumaka Tomasz
Porat Ely
Publication date: 09/04/2018

We consider the streaming complexity of a fundamental task in approximate pattern matching: the $k$-mismatch problem. It asks to compute Hamming distances between a pattern of length $n$ and all length-$n$ substrings of a text for which the Hamming distance does not exceed a given threshold $k$. In our problem formulation, we report not only the Hamming distance but also, on demand, the full mismatch information, that is the list of mismatched pairs of symbols and their indices. The twin challenges of streaming pattern matching derive from the need both to achieve small working space and also to guarantee that every arriving input symbol is processed quickly. We present a streaming algorithm for the $k$-mismatch problem which uses $O(k\log{n}\log\frac{n}{k})$ bits of space and spends \ourcomplexity time on each symbol of the input stream, which consists of the pattern followed by the text. The running time almost matches the classic offline solution and the space usage is within a logarithmic factor of optimal. Our new algorithm therefore effectively resolves and also extends an open problem first posed in FOCS'09. En route to this solution, we also give a deterministic $O( k (\log \frac{n}{k} + \log |\Sigma|) )$-bit encoding of all the alignments with Hamming distance at most $k$ of a length-$n$ pattern within a text of length $O(n)$. This secondary result provides an optimal solution to a natural communication complexity problem which may be of independent interest.

Comment: 27 pages
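The output that the $k$-mismatch problem asks for, including the mismatch information, can be pinned down with a naive offline reference. This is not the streaming algorithm from the abstract: it keeps the whole pattern and text in memory and spends up to time proportional to the pattern length per alignment. The function name and the output layout (a dictionary from alignment start to a list of `(index, pattern_symbol, text_symbol)` triples) are assumptions made for illustration.

```python
def k_mismatch(pattern: str, text: str, k: int) -> dict[int, list[tuple[int, str, str]]]:
    """Map each alignment with Hamming distance <= k to its mismatch list.

    The Hamming distance of an alignment is the length of its list.
    Alignments with more than k mismatches are omitted.
    """
    n = len(pattern)
    result = {}
    for start in range(len(text) - n + 1):
        mismatches = []
        for j in range(n):
            if pattern[j] != text[start + j]:
                mismatches.append((j, pattern[j], text[start + j]))
                if len(mismatches) > k:
                    break  # threshold exceeded, discard this alignment
        else:
            # Loop finished without exceeding k mismatches.
            result[start] = mismatches
    return result


if __name__ == "__main__":
    print(k_mismatch("abc", "abcabdxbc", 1))
```

For the pattern `abc`, the text `abcabdxbc` and $k = 1$, alignments 0, 3 and 6 are reported with zero, one and one mismatches respectively.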

arXiv.org e-Print Archive

Crossref

Explore Bristol Research