A Note on Easy and Efficient Computation of Full Abelian Periods of a Word
Constantinescu and Ilie (Bulletin of the EATCS 89, 167-170, 2006) introduced
the idea of an Abelian period with head and tail of a finite word. An Abelian
period is called full if both the head and the tail are empty. We present a
simple and easy-to-implement $O(n \log\log n)$-time algorithm for computing all
the full Abelian periods of a word of length $n$ over a constant-size alphabet.
Experiments show that our algorithm significantly outperforms the $O(n)$
algorithm proposed by Kociumaka et al. (Proc. of STACS, 245-256, 2013) for the
same problem. Comment: Accepted for publication in Discrete Applied Mathematics
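To make the definition concrete, the sketch below checks full Abelian periods directly from the definition: `p` is reported when it divides the word length and every consecutive block of length `p` has the same letter counts. This is a naive illustration, not the algorithm of the paper; the function name `full_abelian_periods` is an assumption introduced here, and the word length itself is reported as a trivial period.

```python
from collections import Counter


def full_abelian_periods(w):
    """Return every p dividing len(w) such that all length-p blocks of w
    are permutations of each other (same letter counts)."""
    n = len(w)
    periods = []
    for p in range(1, n + 1):
        if n % p:
            continue  # with empty head and tail, p must divide n
        first = Counter(w[:p])
        if all(Counter(w[i:i + p]) == first for i in range(p, n, p)):
            periods.append(p)
    return periods


print(full_abelian_periods("abbaab"))
```

For `"abbaab"` the blocks `ab`, `ba`, `ab` share the same letter counts, so 2 is a full Abelian period, while 3 is not (`abb` versus `aab`).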
Optimally Computing Compressed Indexing Arrays Based on the Compact Directed Acyclic Word Graph
In this paper, we present the first study of the computational complexity of
converting an automata-based text index structure, called the Compact Directed
Acyclic Word Graph (CDAWG), of size $e$ for a text $T$ of length $n$ into other
text indexing structures for the same text, suitable for highly repetitive
texts: the run-length BWT of size $r$, the irreducible PLCP array of size $r$,
and the quasi-irreducible LPF array of size $e$, as well as the lex-parse of
size $O(r)$ and the LZ77-parse of size $z$, where $r, z \le e$. As main
results, we show that the above structures can be optimally computed from
either the CDAWG for $T$ stored in read-only memory or its self-index version
of size $e$ without a text in $O(e)$ worst-case time and words of working
space. To obtain the above results, we devised techniques for enumerating a
particular subset of suffixes in the lexicographic and text orders using the
forward and backward search on the CDAWG, extending the results of
Belazzougui et al. (2015). Comment: The short version of this paper will appear in SPIRE 2023, Pisa,
Italy, September 26-28, 2023, Lecture Notes in Computer Science, Springer
Computing Covers Using Prefix Tables
An indeterminate string $x = x[1..n]$ on an alphabet $\Sigma$ is a
sequence of nonempty subsets of $\Sigma$; $x$ is said to be regular if
every subset is of size one. A proper substring $u$ of regular $x$ is said to
be a cover of $x$ iff for every $i \in 1..n$, an occurrence of $u$ in
$x$ includes $x[i]$. The cover array $\gamma = \gamma[1..n]$ of $x$ is
an integer array such that $\gamma[i]$ is the length of the longest cover of $x[1..i]$.
Fifteen years ago a complex, though nevertheless linear-time, algorithm was
proposed to compute the cover array of regular $x$ based on prior computation
of the border array of $x$. In this paper we first describe a linear-time
algorithm to compute the cover array of a regular string $x$ based on the prefix
table of $x$. We then extend this result to indeterminate strings. Comment: 14 pages, 1 figure
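As an illustration of how a prefix table can answer cover queries on a regular string, the sketch below computes the prefix table with the standard Z-algorithm and then derives the cover array straight from the definition. It is a definitional sketch that runs in cubic time in the worst case, not the linear-time algorithm of the paper; indices are 0-based, and the names `prefix_table`, `longest_cover` and `cover_array` are assumptions introduced here.

```python
def prefix_table(x):
    """pi[i] = length of the longest substring starting at i that is a prefix of x."""
    n = len(x)
    pi = [0] * n
    if n == 0:
        return pi
    pi[0] = n
    left = right = 0  # rightmost window [left, right) known to match a prefix
    for i in range(1, n):
        if i < right:
            pi[i] = min(right - i, pi[i - left])
        while i + pi[i] < n and x[pi[i]] == x[i + pi[i]]:
            pi[i] += 1
        if i + pi[i] > right:
            left, right = i, i + pi[i]
    return pi


def longest_cover(pi, k):
    """Length of the longest proper cover of x[:k], or 0 if there is none."""
    for m in range(k - 1, 0, -1):
        reach = m  # positions [0, reach) are covered so far
        for i in range(1, k - m + 1):
            if i > reach:
                break  # position `reach` can no longer be covered
            if pi[i] >= m:  # x[:m] occurs at position i and fits inside x[:k]
                reach = i + m
        if reach == k:
            return m
    return 0


def cover_array(x):
    pi = prefix_table(x)
    return [longest_cover(pi, k) for k in range(1, len(x) + 1)]


print(cover_array("ababab"))
```

The key observation used here is that the prefix of length `m` occurs at position `i` exactly when `pi[i] >= m`, so a single prefix table serves every prefix of the string.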
Handling Massive N-Gram Datasets Efficiently
This paper deals with the two fundamental problems concerning the handling of
large n-gram language models: indexing, that is, compressing the n-gram strings
and associated satellite data without compromising their retrieval speed; and
estimation, that is, computing the probability distribution of the strings from
a large textual source. Regarding the problem of indexing, we describe
compressed, exact and lossless data structures that achieve, at the same time,
high space reductions and no time degradation with respect to state-of-the-art
solutions and related software packages. In particular, we present a compressed
trie data structure in which each word following a context of fixed length k,
i.e., its preceding k words, is encoded as an integer whose value is
proportional to the number of words that follow such context. Since the number
of words following a given context is typically very small in natural
languages, we lower the space of representation to compression levels that were
never achieved before. Despite the significant savings in space, our technique
introduces a negligible penalty at query time. Regarding the problem of
estimation, we present a novel algorithm for estimating modified Kneser-Ney
language models, which have emerged as the de facto choice for language modeling
in both academia and industry thanks to their relatively low perplexity.
Estimating such models from large textual sources poses the
challenge of devising algorithms that make parsimonious use of the disk. The
state-of-the-art algorithm uses three sorting steps in external memory; we show
an improved construction that requires only one sorting step by
exploiting the properties of the extracted n-gram strings. With an extensive
experimental analysis performed on billions of n-grams, we show an average
improvement of 4.5x on the total running time of the state-of-the-art approach. Comment: Published in ACM Transactions on Information Systems (TOIS), February
2019, Article No: 2
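The context-based encoding described in this abstract can be sketched in a few lines: each word is replaced by its rank among the distinct words that follow the same context of length k, so the integers to be stored are bounded by the (typically small) number of successors rather than by the vocabulary size. This is a toy illustration under that reading of the abstract, not the paper's trie data structure; the function name `remap_by_context` and the sample n-grams are assumptions introduced here.

```python
from collections import defaultdict


def remap_by_context(ngrams, k):
    """Map each n-gram (a tuple of words, longer than k) to the rank of its
    last word among the distinct successors of its k-word context."""
    successors = defaultdict(set)
    for gram in ngrams:
        if len(gram) > k:
            context = tuple(gram[-k - 1:-1])  # the k words before the last word
            successors[context].add(gram[-1])
    # Rank successors of each context in sorted order: 0, 1, 2, ...
    rank = {
        ctx: {word: r for r, word in enumerate(sorted(words))}
        for ctx, words in successors.items()
    }
    return {
        tuple(gram): rank[tuple(gram[-k - 1:-1])][gram[-1]]
        for gram in ngrams
        if len(gram) > k
    }


bigrams = [("the", "cat"), ("the", "dog"), ("a", "cat"), ("a", "zebra")]
print(remap_by_context(bigrams, k=1))
```

In this sample every code is 0 or 1, because each context has only two successors, even though the vocabulary holds five distinct words.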
String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure
The Burrows-Wheeler transform (BWT) is an invertible text transformation that,
given a text $T$ of length $n$, permutes its symbols according to the
lexicographic order of suffixes of $T$. BWT is one of the most heavily studied
algorithms in data compression with numerous applications in indexing, sequence
analysis, and bioinformatics. Its construction is a bottleneck in many
scenarios, and settling the complexity of this task is one of the most
important unsolved problems in sequence analysis that has remained open for 25
years. Given a binary string of length $n$, occupying $O(n/\log n)$ machine
words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009)
runs in $O(n)$ time and $O(n/\log n)$ space. Recent advancements (Belazzougui,
STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size
dependency in the time complexity, but they still require $\Omega(n)$ time.
In this paper, we propose the first algorithm that breaks the $O(n)$-time
barrier for BWT construction. Given a binary string of length $n$, our
procedure builds the Burrows-Wheeler transform in $O(n/\sqrt{\log n})$ time and
$O(n/\log n)$ space. We complement this result with a conditional lower bound
proving that any further progress in the time complexity of BWT construction
would yield faster algorithms for the very well studied problem of counting
inversions: it would improve the state-of-the-art $O(m\sqrt{\log m})$-time
solution by Chan and Pătrașcu (SODA 2010). Our algorithm is based on a
novel concept of string synchronizing sets, which is of independent interest.
As one of the applications, we show that this technique lets us design a data
structure of the optimal size $O(n/\log n)$ that answers Longest Common
Extension queries (LCE queries) in $O(1)$ time and, furthermore, can be
deterministically constructed in the optimal $O(n/\log n)$ time. Comment: Full version of a paper accepted to STOC 201
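For readers unfamiliar with the transform itself, the sketch below builds the BWT by sorting suffixes explicitly and inverts it with the usual last-to-first mapping. It is a textbook illustration that is far slower than any of the algorithms discussed in the abstract and is unrelated to string synchronizing sets; the sentinel `$` (assumed absent from the text and smaller than every text symbol) and the function names are assumptions introduced here.

```python
def bwt(text, sentinel="$"):
    """BWT of text + sentinel: the symbol preceding each suffix, in suffix order."""
    s = text + sentinel
    order = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix sorting
    return "".join(s[i - 1] for i in order)  # s[-1] is the sentinel when i == 0


def inverse_bwt(last):
    """Recover the text (without the sentinel) from its BWT."""
    n = len(last)
    # Stable sort by symbol: the j-th symbol of the first column is last[order[j]].
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for j, i in enumerate(order):
        lf[i] = j  # last-to-first mapping
    out = []
    row = 0  # row 0 is the rotation that starts with the sentinel
    for _ in range(n - 1):
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out))


transformed = bwt("banana")
print(transformed, inverse_bwt(transformed))
```

The inverse reads the text right to left, which is why the collected symbols are reversed at the end.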
Computing regularities in strings
Regularities in strings model many phenomena and thus form the subject of extensive mathematical studies. Perhaps the most conspicuous regularities in strings are those that manifest themselves in the form of repeated subpatterns. In this paper, we study several forms of regularities of strings, that is, repeats, multirepeats, repetitions and runs. We present their similarities and differences by discussing their forms and properties and we explore the existing computation algorithms. We also discuss several data structures useful for computing regularities.
A simple algorithm for computing the Lempel-Ziv factorization
We give a simple, space-efficient algorithm for computing the Lempel-Ziv factorization of a string. For a string of length n over an integer alphabet, it runs in O(n) time independently of alphabet size and uses o(n) additional space.
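As a point of reference for what is being computed, the sketch below produces one common variant of the Lempel-Ziv factorization by brute force: each factor is the longest prefix of the remaining suffix that also starts at an earlier position (overlaps allowed), or a single fresh letter. Several variants of the factorization exist, so this definition is an assumption; the code runs in quadratic time or worse and is not the algorithm of the paper.

```python
def lz_factorize(s):
    """Greedy Lempel-Ziv factorization of s, returned as a list of factors."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        best = 0
        for j in range(i):  # every earlier starting position
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1  # the earlier occurrence may overlap position i
            best = max(best, length)
        step = max(best, 1)  # a letter never seen before forms its own factor
        factors.append(s[i:i + step])
        i += step
    return factors


print(lz_factorize("abababab"))
```

Concatenating the factors always reproduces the input, which gives a quick sanity check for any faster implementation.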