Linear Time Construction of Cover Suffix Tree and Applications

Author: Radoszewski Jakub
Publication date: 08/08/2023

The Cover Suffix Tree (CST) of a string $T$ is the suffix tree of $T$ with additional explicit nodes corresponding to halves of square substrings of $T$. In the CST an explicit node corresponding to a substring $C$ of $T$ is annotated with two numbers: the number of non-overlapping consecutive occurrences of $C$ and the total number of positions in $T$ that are covered by occurrences of $C$ in $T$. Kociumaka et al. (Algorithmica, 2015) have shown how to compute the CST of a length-$n$ string in $O(n \log n)$ time. We show how to compute the CST in $O(n)$ time assuming that $T$ is over an integer alphabet. Kociumaka et al. (Algorithmica, 2015; Theor. Comput. Sci., 2018) have shown that knowing the CST of a length-$n$ string $T$, one can compute a linear-sized representation of all seeds of $T$ as well as all shortest $\alpha$-partial covers and seeds in $T$ for a given $\alpha$ in $O(n)$ time. Thus our result implies linear-time algorithms computing these notions of quasiperiodicity. The resulting algorithm computing seeds is substantially different from the previous one (Kociumaka et al., SODA 2012, ACM Trans. Algorithms, 2020). Kociumaka et al. (Algorithmica, 2015) proposed an $O(n \log n)$-time algorithm for computing a shortest $\alpha$-partial cover for each $\alpha=1,\ldots,n$; we improve this complexity to $O(n)$. Our results are based on a new characterization of consecutive overlapping occurrences of a substring $S$ of $T$ in terms of the set of runs (see Kolpakov and Kucherov, FOCS 1999) in $T$. This new insight also leads to an $O(n)$-sized index for reporting overlapping consecutive occurrences of a given pattern $P$ of length $m$ in $O(m+output)$ time, where $output$ is the number of occurrences reported. In comparison, a general index for reporting bounded-gap consecutive occurrences of Navarro and Thankachan (Theor. Comput. Sci., 2016) uses $O(n \log n)$ space. Comment: Accepted to ESA 2023. Abstract abridged to satisfy arxiv requirement
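
To make the notion of covered positions and partial covers concrete, here is a minimal brute-force sketch in Python. It is not the linear-time construction from the paper and builds no suffix tree; the function names (`occurrences`, `covered_positions`, `shortest_partial_cover`) are hypothetical and chosen only for illustration. It assumes a non-empty pattern and reads an $\alpha$-partial cover as a substring whose occurrences cover at least $\alpha$ positions of the text.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Return all (possibly overlapping) start positions of pattern in text."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text.startswith(pattern, i)]


def covered_positions(text: str, pattern: str) -> int:
    """Count positions of text lying inside at least one occurrence of pattern."""
    m = len(pattern)
    covered = 0
    reach = 0  # first position not yet known to be covered
    for start in occurrences(text, pattern):
        end = start + m
        covered += end - max(start, reach)  # count only the newly covered part
        reach = end
    return covered


def shortest_partial_cover(text: str, alpha: int):
    """Brute force: a shortest substring covering at least alpha positions.

    Ties are broken lexicographically (an arbitrary choice for this sketch).
    Returns None if no substring covers alpha positions.
    """
    n = len(text)
    for length in range(1, n + 1):
        candidates = {text[i:i + length] for i in range(n - length + 1)}
        for candidate in sorted(candidates):
            if covered_positions(text, candidate) >= alpha:
                return candidate
    return None


if __name__ == "__main__":
    print(covered_positions("abaababa", "aba"))   # 8
    print(shortest_partial_cover("abaababa", 6))  # ab
```

The sketch takes polynomial time and is useful only as a reference for checking small inputs against a faster implementation.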

arXiv.org e-Print Archive

Linear Time Construction of Cover Suffix Tree and Applications

Author: Radoszewski Jakub
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st Annual European Symposium on Algorithms (ESA 2023)
Publication date: 01/01/2023

DROPS Dagstuhl Research Online Publication Server

Efficient Ranking of Lyndon Words and Decoding Lexicographically Minimal de Bruijn Sequence

Author: Kociumaka Tomasz
Radoszewski Jakub
Rytter Wojciech
Publication date: 09/10/2015

We give efficient algorithms for ranking Lyndon words of length $n$ over an alphabet of size $\sigma$. The rank of a Lyndon word is its position in the sequence of lexicographically ordered Lyndon words of the same length. The outputs are integers of exponential size, and the complexity of arithmetic operations on such large integers cannot be ignored. Our model of computation is the word-RAM, in which basic arithmetic operations on (large) numbers of size at most $\sigma^n$ take $O(n)$ time. Our algorithm for ranking Lyndon words makes $O(n^2)$ arithmetic operations (this would directly imply cubic time on the word-RAM). However, using an algebraic approach we are able to reduce the total time complexity on the word-RAM to $O(n^2 \log \sigma)$. We also present an $O(n^3 \log^2 \sigma)$-time algorithm that generates the Lyndon word of a given length and rank in lexicographic order. Finally, we use the connections between Lyndon words and lexicographically minimal de Bruijn sequences (theorem of Fredricksen and Maiorana) to develop the first polynomial-time algorithm for decoding the minimal de Bruijn sequence of any rank $n$ (it determines the position of an arbitrary word of length $n$ within the de Bruijn sequence). Comment: Improved version of a paper presented at CPM 201
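
The definitions in this abstract can be checked on small inputs with an exponential-time baseline. The Python sketch below is not the paper's algorithm: it enumerates Lyndon words with Duval's classic generation procedure, ranks by listing, and builds the minimal de Bruijn sequence via the Fredricksen-Maiorana concatenation. All function names are hypothetical, and the choices of a 1-based rank and a 0-based position are assumptions of this sketch.

```python
def lyndon_words(sigma: int, n: int):
    """Yield Lyndon words over {0..sigma-1} of length <= n in lexicographic
    order (Duval's generation algorithm)."""
    w = [-1]
    while w:
        w[-1] += 1
        yield tuple(w)
        m = len(w)
        while len(w) < n:           # extend periodically up to length n
            w.append(w[-m])
        while w and w[-1] == sigma - 1:
            w.pop()                 # drop trailing maximal letters


def lyndon_rank(word, sigma: int) -> int:
    """1-based rank (assumption) of a Lyndon word among those of its length."""
    n = len(word)
    same_length = [w for w in lyndon_words(sigma, n) if len(w) == n]
    return same_length.index(tuple(word)) + 1


def minimal_de_bruijn(sigma: int, n: int) -> list[int]:
    """Concatenate, in lexicographic order, the Lyndon words whose length
    divides n (Fredricksen-Maiorana)."""
    seq = []
    for w in lyndon_words(sigma, n):
        if n % len(w) == 0:
            seq.extend(w)
    return seq


def position_in_de_bruijn(word, sigma: int) -> int:
    """0-based position (assumption) of word in the cyclic minimal sequence."""
    n = len(word)
    seq = minimal_de_bruijn(sigma, n)
    cyclic = seq + seq[:n - 1]      # unroll the wrap-around
    for i in range(len(seq)):
        if cyclic[i:i + n] == list(word):
            return i
    raise ValueError("word not over the given alphabet")


if __name__ == "__main__":
    print(minimal_de_bruijn(2, 3))            # [0, 0, 0, 1, 0, 1, 1, 1]
    print(lyndon_rank((0, 1, 1), 2))          # 2
    print(position_in_de_bruijn((1, 1, 1), 2))  # 5
```

Because it enumerates every Lyndon word, this baseline is only practical for very small $n$ and $\sigma$.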

arXiv.org e-Print Archive

Pattern Matching and Consensus Problems on Weighted Sequences and Profiles

Author: Kociumaka Tomasz
Pissis Solon P.
Radoszewski Jakub
Publication date: 01/01/2016

We study pattern matching problems on two major representations of uncertain sequences used in molecular biology: weighted sequences (also known as position weight matrices, PWM) and profiles (i.e., scoring matrices). In the simple version, in which only the pattern or only the text is uncertain, we obtain efficient algorithms with theoretically provable running times using a variation of the lookahead scoring technique. We also consider a general variant of the pattern matching problems in which both the pattern and the text are uncertain. Central to our solution is a special case where the sequences have equal length, called the consensus problem. We propose algorithms for the consensus problem parameterized by the number of strings that match one of the sequences. As our basic approach, we use a careful adaptation of the classic meet-in-the-middle algorithm for the knapsack problem. On the lower bound side, we prove that our dependence on the parameter is optimal up to lower-order terms, conditioned on the optimality of the original algorithm for the knapsack problem. Comment: 22 pages
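
As an illustration of the simple version with an uncertain text and a solid pattern, here is a naive Python sketch. The abstract does not state the matching criterion; the sketch assumes the convention common in work on weighted sequences, namely that a pattern occurs at a position when the product of the per-position probabilities is at least $1/z$ for a given threshold parameter $z$. The names `match_probability` and `weighted_occurrences` are hypothetical, and this is not the lookahead scoring technique from the paper.

```python
from fractions import Fraction
from math import prod


def match_probability(weighted, pattern: str, start: int) -> Fraction:
    """Probability that pattern occurs at `start` in the weighted sequence.

    `weighted` is a list of dicts mapping a letter to its probability at
    that position; missing letters have probability 0.
    """
    return prod(
        (weighted[start + k].get(ch, Fraction(0)) for k, ch in enumerate(pattern)),
        start=Fraction(1),
    )


def weighted_occurrences(weighted, pattern: str, z: int) -> list[int]:
    """All start positions where the match probability is >= 1/z (assumed rule)."""
    threshold = Fraction(1, z)
    m = len(pattern)
    return [
        i
        for i in range(len(weighted) - m + 1)
        if match_probability(weighted, pattern, i) >= threshold
    ]


if __name__ == "__main__":
    half = Fraction(1, 2)
    text = [
        {"a": Fraction(1)},
        {"a": half, "b": half},
        {"b": Fraction(1)},
        {"a": Fraction(3, 4), "b": Fraction(1, 4)},
    ]
    print(weighted_occurrences(text, "ab", 2))  # [0, 1]
```

Exact rationals are used so that comparisons against the threshold are not affected by floating-point rounding.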

arXiv.org e-Print Archive

DROPS Dagstuhl Research Online Publication Server

King's Research Portal

Internal Pattern Matching Queries in a Text and Applications

Author: Kociumaka Tomasz
Radoszewski Jakub
Rytter Wojciech
Waleń Tomasz
Publication date: 13/10/2014

We consider several types of internal queries: questions about subwords of a text. As the main tool we develop an optimal data structure for the problem called here internal pattern matching. This data structure provides constant-time answers to queries about occurrences of one subword $x$ in another subword $y$ of a given text, assuming that $|y|=\mathcal{O}(|x|)$, which allows for a constant-space representation of all occurrences. This problem can be viewed as a natural extension of the well-studied pattern matching problem. The data structure has linear size and admits a linear-time construction algorithm. Using the solution to the internal pattern matching problem, we obtain very efficient data structures answering queries about: primitivity of subwords, periods of subwords, general substring compression, and cyclic equivalence of two subwords. All these results improve upon the best previously known counterparts. The linear construction time of our data structure also allows us to improve the algorithm for finding $\delta$-subrepetitions in a text (a more general version of maximal repetitions, also called runs). For any fixed $\delta$ we obtain the first linear-time algorithm, which matches the linear time complexity of the algorithm computing runs. Our data structure has already been used as a part of the efficient solutions for subword suffix rank & selection, as well as substring compression using Burrows-Wheeler transform composed with run-length encoding. Comment: 31 pages, 9 figures; accepted to SODA 201
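
The query answered by internal pattern matching can be stated as a naive Python sketch. This is a linear-scan baseline, not the constant-time data structure from the paper, and the names `internal_occurrences` and `as_progression` are hypothetical. The sketch also assumes the standard periodicity fact, not spelled out in the abstract, that when $|y| < 2|x|$ the occurrences of $x$ in $y$ form a single arithmetic progression, which is what makes a constant-space answer (first position, step, count) possible.

```python
def internal_occurrences(text: str, x_range, y_range) -> list[int]:
    """Start positions (in text coordinates) of x = text[a:b] inside y = text[c:d].

    Ranges are half-open (a, b) and (c, d) index pairs into text.
    """
    a, b = x_range
    c, d = y_range
    x = text[a:b]
    m = b - a
    # An occurrence starting at i must end by d to lie fully inside y.
    return [i for i in range(c, d - m + 1) if text.startswith(x, i)]


def as_progression(positions: list[int]):
    """Compress positions into (first, step, count).

    Returns None for an empty list and raises ValueError if the positions
    do not form an arithmetic progression.
    """
    if not positions:
        return None
    if len(positions) == 1:
        return (positions[0], 0, 1)
    step = positions[1] - positions[0]
    if any(q - p != step for p, q in zip(positions, positions[1:])):
        raise ValueError("positions are not an arithmetic progression")
    return (positions[0], step, len(positions))


if __name__ == "__main__":
    text = "abababab"
    # x = text[0:3] = "aba", y = text[2:7] = "ababa", so |y| < 2|x|
    occ = internal_occurrences(text, (0, 3), (2, 7))
    print(occ)                  # [2, 4]
    print(as_progression(occ))  # (2, 2, 2)
```

A query with a longer $y$ can be split into overlapping windows of length below $2|x|$, each answered by one such triple.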

arXiv.org e-Print Archive

Crossref