A Note on Easy and Efficient Computation of Full Abelian Periods of a Word

Author: Fici Gabriele
Lecroq Thierry
Lefebvre Arnaud
Prieur-Gaston Élise
Smyth William F.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2015

Constantinescu and Ilie (Bulletin of the EATCS 89, 167-170, 2006) introduced the idea of an Abelian period with head and tail of a finite word. An Abelian period is called full if both the head and the tail are empty. We present a simple and easy-to-implement $O(n\log\log n)$-time algorithm for computing all the full Abelian periods of a word of length $n$ over a constant-size alphabet. Experiments show that our algorithm significantly outperforms the $O(n)$ algorithm proposed by Kociumaka et al. (Proc. of STACS, 245-256, 2013) for the same problem.

Comment: Accepted for publication in Discrete Applied Mathematics
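To make the definition concrete, here is a minimal brute-force sketch in Python. It is not the algorithm from the paper. It assumes the usual reading of the definition: `p` is a full Abelian period of `w` when `p` divides `len(w)` and all consecutive blocks of length `p` have the same letter counts. The function name `full_abelian_periods` is hypothetical, and the trivial period `p = len(w)` is included by assumption.

```python
from collections import Counter


def full_abelian_periods(w):
    """Return all p such that w splits into blocks of length p
    that are pairwise Abelian-equivalent (same letter counts).

    Brute force: O(n) work for every divisor p of n = len(w).
    """
    n = len(w)
    periods = []
    for p in range(1, n + 1):
        if n % p:
            continue  # blocks must tile the word exactly (empty head and tail)
        first = Counter(w[:p])  # letter counts of the first block
        if all(Counter(w[i:i + p]) == first for i in range(p, n, p)):
            periods.append(p)
    return periods


print(full_abelian_periods("abba"))  # [2, 4]
```

For `"abba"` the blocks `ab` and `ba` have identical letter counts, so 2 is a full Abelian period, while 1 is not.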

arXiv.org e-Print Archive

Research Repository

Archivio istituzionale della ricerca - Università di Palermo

Optimally Computing Compressed Indexing Arrays Based on the Compact Directed Acyclic Word Graph

Author: Arimura Hiroki
Inenaga Shunsuke
Kobayashi Yasuaki
Nakashima Yuto
Sue Mizuki
Publication date: 04/08/2023

In this paper, we present the first study of the computational complexity of converting an automata-based text index structure, called the Compact Directed Acyclic Word Graph (CDAWG), of size $e$ for a text $T$ of length $n$ into other text indexing structures for the same text, suitable for highly repetitive texts: the run-length BWT of size $r$, the irreducible PLCP array of size $r$, and the quasi-irreducible LPF array of size $e$, as well as the lex-parse of size $O(r)$ and the LZ77-parse of size $z$, where $r, z \le e$. As main results, we showed that the above structures can be optimally computed from either the CDAWG for $T$ stored in read-only memory or its self-index version of size $e$ without a text in $O(e)$ worst-case time and words of working space. To obtain the above results, we devised techniques for enumerating a particular subset of suffixes in the lexicographic and text orders using the forward and backward search on the CDAWG by extending the results by Belazzougui et al. in 2015.

Comment: The short version of this paper will appear in SPIRE 2023, Pisa, Italy, September 26-28, 2023, Lecture Notes in Computer Science, Springer

arXiv.org e-Print Archive

Computing Covers Using Prefix Tables

Author: Alatabbi Ali
Rahman M. Sohel
Smyth W. F.
Publication date: 01/01/2015

An \emph{indeterminate string} $x = x[1..n]$ on an alphabet $\Sigma$ is a sequence of nonempty subsets of $\Sigma$; $x$ is said to be \emph{regular} if every subset is of size one. A proper substring $u$ of regular $x$ is said to be a \emph{cover} of $x$ iff for every $i \in 1..n$, an occurrence of $u$ in $x$ includes $x[i]$. The \emph{cover array} $\gamma = \gamma[1..n]$ of $x$ is an integer array such that $\gamma[i]$ is the longest cover of $x[1..i]$. Fifteen years ago a complex, though nevertheless linear-time, algorithm was proposed to compute the cover array of regular $x$ based on prior computation of the border array of $x$. In this paper we first describe a linear-time algorithm to compute the cover array of regular string $x$ based on the prefix table of $x$. We then extend this result to indeterminate strings.

Comment: 14 pages, 1 figure
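As an illustration of how a prefix table relates to covers, the following Python sketch computes the prefix table of a regular string with the standard Z-algorithm and then tests every proper prefix as a candidate cover. It is a quadratic-time illustration only, not the linear-time cover-array algorithm of the paper. Indices are 0-based here, and the names `prefix_table` and `cover_lengths` are hypothetical.

```python
def prefix_table(x):
    """pi[i] = length of the longest substring starting at i
    that matches a prefix of x (standard Z-algorithm, O(n))."""
    n = len(x)
    pi = [0] * n
    if n == 0:
        return pi
    pi[0] = n
    l = r = 0  # [l, r) is the rightmost prefix match found so far
    for i in range(1, n):
        if i < r:
            pi[i] = min(r - i, pi[i - l])  # reuse earlier comparisons
        while i + pi[i] < n and x[pi[i]] == x[i + pi[i]]:
            pi[i] += 1
        if i + pi[i] > r:
            l, r = i, i + pi[i]
    return pi


def cover_lengths(x):
    """Lengths m < len(x) such that the prefix x[:m] is a cover of x."""
    n = len(x)
    pi = prefix_table(x)
    result = []
    for m in range(1, n):
        covered_to = 0  # positions [0, covered_to) are covered so far
        for i in range(n):
            if pi[i] >= m:          # x[:m] occurs at position i
                if i > covered_to:  # gap between consecutive occurrences
                    break
                covered_to = i + m
        if covered_to == n:
            result.append(m)
    return result


print(prefix_table("aabaab"))     # [6, 1, 0, 3, 1, 0]
print(cover_lengths("abaabaab"))  # [5]
```

In `"abaabaab"` the prefix `abaab` occurs at positions 0 and 3, and those two occurrences together touch every position, so 5 is the only proper cover length.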

arXiv.org e-Print Archive

Research Repository

King's Research Portal

Handling Massive N-Gram Datasets Efficiently

Author: Pibiri Giulio Ermanno
Venturini Rossano
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 25/06/2018

This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, that have emerged as the de-facto choice for language modeling in both academia and industry, thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step thanks to exploiting the properties of the extracted n-gram strings. 
With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach.

Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No: 2
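The context-based encoding described in the abstract can be sketched roughly as follows. This is one possible reading, not the paper's data structure: each word is replaced by its rank among the distinct words observed after the same context, so the integers stay small whenever a context has few successors. The function `context_ranks`, its dictionary layout, and the sample n-grams are all hypothetical.

```python
from collections import defaultdict


def context_ranks(ngrams):
    """Map (context, word) to the rank of word among the distinct
    successors of context, where context is all but the last token."""
    successors = defaultdict(set)
    for *context, word in ngrams:
        successors[tuple(context)].add(word)

    table = {}
    for context, words in successors.items():
        # Rank is bounded by the number of successors of this context,
        # not by the vocabulary size.
        for rank, word in enumerate(sorted(words)):
            table[(context, word)] = rank
    return table


bigrams = [("the", "cat"), ("the", "dog"), ("a", "cat"), ("the", "cat")]
codes = context_ranks(bigrams)
print(codes[(("the",), "dog")])  # 1
```

A real index would store these small integers with a compressed integer code inside a trie; that part is omitted here.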

arXiv.org e-Print Archive

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari


String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

Author: A
Alzamel Mai
Counting
Grossi Roberto
Hagerup Torben
Optimal
Uniqueness
Wavelet
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 05/05/2019

Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text $T$ of length $n$, permutes its symbols according to the lexicographic order of suffixes of $T$. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length $n$, occupying $O(n/\log n)$ machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in $O(n)$ time and $O(n/\log n)$ space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require $\Omega(n)$ time. In this paper, we propose the first algorithm that breaks the $O(n)$-time barrier for BWT construction. Given a binary string of length $n$, our procedure builds the Burrows-Wheeler transform in $O(n/\sqrt{\log n})$ time and $O(n/\log n)$ space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art $O(m\sqrt{\log m})$-time solution by Chan and Pătraşcu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size $O(n/\log n)$ that answers Longest Common Extension queries (LCE queries) in $O(1)$ time and, furthermore, can be deterministically constructed in the optimal $O(n/\log n)$ time.

Comment: Full version of a paper accepted to STOC 201
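For readers unfamiliar with the transform itself, a textbook construction by explicit suffix sorting looks like the Python sketch below. It illustrates the definition only and has nothing to do with the sublinear-time algorithm of the paper: sorting the suffixes this way can take quadratic time or worse. Appending a sentinel `$` that is smaller than every other symbol is a common convention and is assumed here; the function name `bwt` is hypothetical.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform by sorting all suffixes explicitly."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    # Suffix start positions in lexicographic order of the suffixes.
    order = sorted(range(len(t)), key=lambda i: t[i:])
    # Each output symbol is the one preceding the suffix; for the suffix
    # at position 0 this wraps around to the sentinel via t[-1].
    return "".join(t[i - 1] for i in order)


print(bwt("banana"))  # annb$aa
```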

arXiv.org e-Print Archive

Crossref

Computing regularities in strings

Author: Smyth William
Yusufu M.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009

Regularities in strings model many phenomena and thus form the subject of extensive mathematical studies. Perhaps the most conspicuous regularities in strings are those that manifest themselves in the form of repeated subpatterns. In this paper, we study several forms of regularities of strings, that is, repeats, multirepeats, repetitions and runs. We present their similarities and differences by discussing their forms and properties and we explore the existing computation algorithms. We also discuss several data structures useful for computing regularities.

Crossref

Research Repository

espace@Curtin

A simple algorithm for computing the Lempel-Ziv factorization

Author: Crochemore M.
Ilie L.
Smyth William
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2008

We give a space-efficient simple algorithm for computing the Lempel-Ziv factorization of a string. For a string of length n over an integer alphabet, it runs in O(n) time independently of alphabet size and uses o(n) additional space.
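A naive Python sketch of the factorization is given below for illustration; it is not the space-efficient algorithm of the paper and runs in cubic time in the worst case. It assumes one common variant of the definition: each factor is the longest prefix of the remaining text that also occurs starting at an earlier position (overlaps allowed), or a single letter when no such occurrence exists. The function name `lz_factorization` is hypothetical.

```python
def lz_factorization(x):
    """Greedy Lempel-Ziv factorization by brute-force matching."""
    factors = []
    i, n = 0, len(x)
    while i < n:
        longest = 0
        for j in range(i):  # candidate earlier starting positions
            k = 0
            # Extend the match; the earlier occurrence may overlap x[i:].
            while i + k < n and x[j + k] == x[i + k]:
                k += 1
            longest = max(longest, k)
        length = max(longest, 1)  # a new letter forms a factor of length 1
        factors.append(x[i:i + length])
        i += length
    return factors


print(lz_factorization("abaababa"))  # ['a', 'b', 'a', 'aba', 'ba']
```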

Crossref

Research Repository

King's Research Portal

espace@Curtin

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM