1,679 research outputs found

    A Note on Easy and Efficient Computation of Full Abelian Periods of a Word

    Get PDF
    Constantinescu and Ilie (Bulletin of the EATCS 89, 167-170, 2006) introduced the idea of an Abelian period with head and tail of a finite word. An Abelian period is called full if both the head and the tail are empty. We present a simple and easy-to-implement O(nloglogn)O(n\log\log n)-time algorithm for computing all the full Abelian periods of a word of length nn over a constant-size alphabet. Experiments show that our algorithm significantly outperforms the O(n)O(n) algorithm proposed by Kociumaka et al. (Proc. of STACS, 245-256, 2013) for the same problem.Comment: Accepted for publication in Discrete Applied Mathematic

    Optimally Computing Compressed Indexing Arrays Based on the Compact Directed Acyclic Word Graph

    Full text link
    In this paper, we present the first study of the computational complexity of converting an automata-based text index structure, called the Compact Directed Acyclic Word Graph (CDAWG), of size ee for a text TT of length nn into other text indexing structures for the same text, suitable for highly repetitive texts: the run-length BWT of size rr, the irreducible PLCP array of size rr, and the quasi-irreducible LPF array of size ee, as well as the lex-parse of size O(r)O(r) and the LZ77-parse of size zz, where r,zer, z \le e. As main results, we showed that the above structures can be optimally computed from either the CDAWG for TT stored in read-only memory or its self-index version of size ee without a text in O(e)O(e) worst-case time and words of working space. To obtain the above results, we devised techniques for enumerating a particular subset of suffixes in the lexicographic and text orders using the forward and backward search on the CDAWG by extending the results by Belazzougui et al. in 2015.Comment: The short version of this paper will appear in SPIRE 2023, Pisa, Italy, September 26-28, 2023, Lecture Notes in Computer Science, Springe

    Computing Covers Using Prefix Tables

    Get PDF
    An \emph{indeterminate string} x=x[1..n]x = x[1..n] on an alphabet Σ\Sigma is a sequence of nonempty subsets of Σ\Sigma; xx is said to be \emph{regular} if every subset is of size one. A proper substring uu of regular xx is said to be a \emph{cover} of xx iff for every i1..ni \in 1..n, an occurrence of uu in xx includes x[i]x[i]. The \emph{cover array} γ=γ[1..n]\gamma = \gamma[1..n] of xx is an integer array such that γ[i]\gamma[i] is the longest cover of x[1..i]x[1..i]. Fifteen years ago a complex, though nevertheless linear-time, algorithm was proposed to compute the cover array of regular xx based on prior computation of the border array of xx. In this paper we first describe a linear-time algorithm to compute the cover array of regular string xx based on the prefix table of xx. We then extend this result to indeterminate strings.Comment: 14 pages, 1 figur

    Handling Massive N-Gram Datasets Efficiently

    Get PDF
    This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, that have emerged as the de-facto choice for language modeling in both academia and industry, thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step thanks to exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach.Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No: 2

    String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

    Full text link
    Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text TT of length nn, permutes its symbols according to the lexicographic order of suffixes of TT. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length nn, occupying O(n/logn)O(n/\log n) machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in O(n)O(n) time and O(n/logn)O(n/\log n) space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require Ω(n)\Omega(n) time. In this paper, we propose the first algorithm that breaks the O(n)O(n)-time barrier for BWT construction. Given a binary string of length nn, our procedure builds the Burrows-Wheeler transform in O(n/logn)O(n/\sqrt{\log n}) time and O(n/logn)O(n/\log n) space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art O(mlogm)O(m\sqrt{\log m})-time solution by Chan and P\v{a}tra\c{s}cu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size O(n/logn)O(n/\log n) that answers Longest Common Extension queries (LCE queries) in O(1)O(1) time and, furthermore, can be deterministically constructed in the optimal O(n/logn)O(n/\log n) time.Comment: Full version of a paper accepted to STOC 201

    Computing regularities in strings

    Get PDF
    Regularities in strings model many phenomena and thus form the subject of extensive mathematical studies . Perhaps the most conspicuous regularities in strings are those that manifest themselves in the form of repeated subpatterns. In this paper, we study several forms of regularities of strings, that is, repeats, multirepeats, repetitions and runs. We present their similarities and differences by discussing their forms and properties and we explore the existing computation algorithms. We also discuss several data structures useful for computing regularities

    A simple algorithm for computing the Lempel-Ziv factorization

    Get PDF
    We give a space-efficient simple algorithm for computing the Lempel?Ziv factorization ofa string. For a string of length n over an integer alphabet, it runs in O(n) time independentlyof alphabet size and uses o(n) additional space
    corecore