Linear Time Construction of Cover Suffix Tree and Applications
The Cover Suffix Tree (CST) of a string T is the suffix tree of T with
additional explicit nodes corresponding to halves of square substrings of T.
In the CST an explicit node corresponding to a substring C of T is
annotated with two numbers: the number of non-overlapping consecutive
occurrences of C and the total number of positions in T that are covered by
occurrences of C in T. Kociumaka et al. (Algorithmica, 2015) have shown how
to compute the CST of a length-n string in O(n log n) time. We show how to
compute the CST in O(n) time assuming that T is over an integer alphabet.
Kociumaka et al. (Algorithmica, 2015; Theor. Comput. Sci., 2018) have shown
that knowing the CST of a length-n string T, one can compute a linear-sized
representation of all seeds of T as well as all shortest α-partial
covers and seeds in T for a given α in O(n) time. Thus our result
implies linear-time algorithms computing these notions of quasiperiodicity. The
resulting algorithm computing seeds is substantially different from the
previous one (Kociumaka et al., SODA 2012, ACM Trans. Algorithms, 2020).
Kociumaka et al. (Algorithmica, 2015) proposed an O(n log n)-time algorithm
for computing a shortest α-partial cover for each α = 1, ..., n;
we improve this complexity to O(n).
Our results are based on a new characterization of consecutive overlapping
occurrences of a substring S of T in terms of the set of runs (see Kolpakov
and Kucherov, FOCS 1999) in T. This new insight also leads to an O(n)-sized
index for reporting overlapping consecutive occurrences of a given pattern P
of length m in O(m + output) time, where output is the number of
occurrences reported. In comparison, a general index for reporting bounded-gap
consecutive occurrences of Navarro and Thankachan (Theor. Comput. Sci., 2016)
uses O(n log n) space.
Comment: Accepted to ESA 2023. Abstract abridged to satisfy arXiv requirement
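The two numbers attached to a CST node can be illustrated by brute force. The sketch below is not the construction from the paper: it scans the text naively, and it assumes one particular reading of "non-overlapping consecutive occurrences", namely pairs of neighbouring occurrences whose starting positions are at least the substring length apart. The function names are hypothetical.

```python
def occurrences(text, pattern):
    """Return all 0-based starting positions of pattern in text (overlaps allowed)."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def covered_positions(text, pattern):
    """Count positions of text that lie inside at least one occurrence of pattern."""
    covered = set()
    for start in occurrences(text, pattern):
        covered.update(range(start, start + len(pattern)))
    return len(covered)


def non_overlapping_consecutive_pairs(text, pattern):
    """Count neighbouring occurrence pairs that do not overlap.

    Assumption: a pair (a, b) of consecutive occurrences is counted
    when b - a >= len(pattern); this is one reading of the abstract.
    """
    occ = occurrences(text, pattern)
    return sum(1 for a, b in zip(occ, occ[1:]) if b - a >= len(pattern))


if __name__ == "__main__":
    print(covered_positions("abaabaab", "aba"))
    print(non_overlapping_consecutive_pairs("abaabaab", "aba"))
```

This quadratic-time check is only useful for validating small cases against a faster implementation.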
Efficient Ranking of Lyndon Words and Decoding Lexicographically Minimal de Bruijn Sequence
We give efficient algorithms for ranking Lyndon words of length n over an
alphabet of size σ. The rank of a Lyndon word is its position in the
sequence of lexicographically ordered Lyndon words of the same length. The
outputs are integers of exponential size, and the complexity of arithmetic
operations on such large integers cannot be ignored. Our model of computation
is the word-RAM, in which basic arithmetic operations on (large) numbers of
size at most σ^n take O(n) time. Our algorithm for ranking Lyndon words
makes O(n^2) arithmetic operations (which would directly imply cubic time on
the word-RAM). However, using an algebraic approach we are able to reduce the
total time complexity on the word-RAM to O(n^2 log σ). We also present an
O(n^3 log^2 σ)-time algorithm that generates the Lyndon word of a given
length and rank in lexicographic order. Finally, we use the connections between
Lyndon words and lexicographically minimal de Bruijn sequences (theorem of
Fredricksen and Maiorana) to develop the first polynomial-time algorithm for
decoding the minimal de Bruijn sequence of any rank n (it determines the position
of an arbitrary word of length n within the de Bruijn sequence).
Comment: Improved version of a paper presented at CPM 201
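The notions of rank and decoding can be made concrete with an exponential-time brute force. This is only a sketch for tiny inputs, not the polynomial-time algorithms of the paper. It assumes 1-based ranks and 0-based positions in the cyclic sequence, and all function names are hypothetical. The de Bruijn sequence is built using the Fredricksen and Maiorana theorem: concatenate, in lexicographic order, the Lyndon words whose length divides n.

```python
from itertools import product


def is_lyndon(word):
    """A word is Lyndon if it is strictly smaller than all its proper rotations."""
    return all(word < word[i:] + word[:i] for i in range(1, len(word)))


def lyndon_words(length, alphabet):
    """All Lyndon words of the given length, in lexicographic order."""
    letters = sorted(alphabet)
    words = ("".join(p) for p in product(letters, repeat=length))
    return [w for w in words if is_lyndon(w)]


def lyndon_rank(word, alphabet):
    """1-based rank of a Lyndon word among Lyndon words of the same length."""
    return lyndon_words(len(word), alphabet).index(word) + 1


def minimal_de_bruijn(n, alphabet):
    """Lexicographically minimal de Bruijn sequence of rank n (FKM theorem)."""
    pieces = []
    for d in range(1, n + 1):
        if n % d == 0:
            pieces.extend(lyndon_words(d, alphabet))
    return "".join(sorted(pieces))


def decode(word, alphabet):
    """0-based position of word in the cyclic minimal de Bruijn sequence."""
    n = len(word)
    seq = minimal_de_bruijn(n, alphabet)
    return (seq + seq[:n - 1]).find(word)


if __name__ == "__main__":
    print(lyndon_rank("aabb", "ab"))
    print(minimal_de_bruijn(3, "ab"))
    print(decode("bba", "ab"))
```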
Pattern Matching and Consensus Problems on Weighted Sequences and Profiles
We study pattern matching problems on two major representations of uncertain
sequences used in molecular biology: weighted sequences (also known as position
weight matrices, PWM) and profiles (i.e., scoring matrices). In the simple
version, in which only the pattern or only the text is uncertain, we obtain
efficient algorithms with theoretically-provable running times using a
variation of the lookahead scoring technique. We also consider a general
variant of the pattern matching problems in which both the pattern and the text
are uncertain. Central to our solution is a special case where the sequences
have equal length, called the consensus problem. We propose algorithms for the
consensus problem parameterized by the number of strings that match one of the
sequences. Our basic approach is a careful adaptation of the classic
meet-in-the-middle algorithm for the knapsack problem. On the lower
bound side, we prove that our dependence on the parameter is optimal up to
lower-order terms conditioned on the optimality of the original algorithm for
the knapsack problem.
Comment: 22 page
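For reference, the classic meet-in-the-middle technique mentioned above can be sketched on the subset-sum form of knapsack. This is the textbook algorithm, not the adaptation to weighted sequences developed in the paper. The sketch assumes non-negative integer item sizes and a non-negative capacity, and the function names are hypothetical.

```python
from bisect import bisect_right


def subset_sums(items):
    """Return the sums of all subsets of items (2^len(items) values)."""
    sums = [0]
    for x in items:
        sums += [s + x for s in sums]
    return sums


def best_subset_sum(items, capacity):
    """Largest subset sum not exceeding capacity.

    Assumes non-negative items and capacity >= 0. The items are split
    into two halves; sums of one half are sorted so that the best partner
    for each sum of the other half is found by binary search.
    """
    half = len(items) // 2
    left = subset_sums(items[:half])
    right = sorted(subset_sums(items[half:]))
    best = 0
    for s in left:
        if s > capacity:
            continue
        # Index just past the largest right-half sum that still fits.
        k = bisect_right(right, capacity - s)
        if k:
            best = max(best, s + right[k - 1])
    return best


if __name__ == "__main__":
    print(best_subset_sum([3, 34, 4, 12, 5, 2], 10))
```

Splitting into halves reduces the enumeration from 2^n subsets to roughly 2^(n/2) per half, at the cost of storing and sorting one half's sums.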
Internal Pattern Matching Queries in a Text and Applications
We consider several types of internal queries: questions about subwords of a
text. As the main tool we develop an optimal data structure for the problem
called here internal pattern matching. This data structure provides
constant-time answers to queries about occurrences of one subword x in
another subword y of a given text, assuming that |y| = O(|x|),
which allows for a constant-space representation of all occurrences. This
problem can be viewed as a natural extension of the well-studied pattern
matching problem. The data structure has linear size and admits a linear-time
construction algorithm.
Using the solution to the internal pattern matching problem, we obtain very
efficient data structures answering queries about: primitivity of subwords,
periods of subwords, general substring compression, and cyclic equivalence of
two subwords. All these results improve upon the best previously known
counterparts. The linear construction time of our data structure also allows us
to improve the algorithm for finding δ-subrepetitions in a text (a more
general version of maximal repetitions, also called runs). For any fixed
δ we obtain the first linear-time algorithm, which matches the linear
time complexity of the algorithm computing runs. Our data structure has already
been used as a part of the efficient solutions for subword suffix rank &
selection, as well as substring compression using Burrows-Wheeler transform
composed with run-length encoding.
Comment: 31 pages, 9 figures; accepted to SODA 201
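The constant-space representation of occurrences rests on a standard fact: when one subword is at most twice as long as the other, the occurrences of the shorter in the longer form a single arithmetic progression. The sketch below answers such a query naively, in time linear in the subword lengths rather than constant time, so it illustrates the output format only and is not the data structure from the paper. Subwords are given as hypothetical half-open index ranges into the text.

```python
def internal_occurrences(text, x_range, y_range):
    """Occurrences of x = text[xs:xe] inside y = text[ys:ye].

    Returns (first, step, count) describing an arithmetic progression of
    starting positions in text, or None if there is no occurrence.
    Assumes len(y) <= 2 * len(x), the case in which a single progression
    is guaranteed to suffice.
    """
    xs, xe = x_range
    ys, ye = y_range
    x, y = text[xs:xe], text[ys:ye]
    if len(y) > 2 * len(x):
        raise ValueError("this sketch requires len(y) <= 2 * len(x)")
    occ = [ys + i for i in range(len(y) - len(x) + 1) if y[i:i + len(x)] == x]
    if not occ:
        return None
    step = occ[1] - occ[0] if len(occ) > 1 else 0
    # Defensive check of the arithmetic-progression property.
    assert all(b - a == step for a, b in zip(occ, occ[1:]))
    return (occ[0], step, len(occ))


if __name__ == "__main__":
    print(internal_occurrences("abababab", (0, 4), (0, 8)))
```

For longer y with |y| = O(|x|), one could cover y by a constant number of overlapping windows of length at most 2|x| and query each window separately.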