42 research outputs found

Suffix arrays: what are they good for?

Author: Puglisi Simon
Smyth Bill
Turpin A.
Publication venue: Australian Computer Society, Inc.
Publication date: 01/01/2006

Recently the theoretical community has displayed a flurry of interest in suffix arrays and compressed suffix arrays. New, asymptotically optimal algorithms for construction, search, and compression of suffix arrays have been proposed. In this talk we will present our investigations into the practicalities of these latest developments. In particular, we investigate whether suffix arrays can indeed replace inverted files, as suggested in recent literature on suffix arrays.

espace@Curtin

Reducing the space and time requirements of LZ-index using the XBW transformation

Author: Μιχελής Παναγιώτης
Publication date: 01/01/2009

University of Thessaly Institutional Repository

Comparison of LZ77-type Parsings

Author: Kosolobov Dmitry
Shur Arseny M.
Publication date: 23/05/2018

We investigate the relations between different variants of the LZ77 parsing existing in the literature. All of them are defined as greedily constructed parsings encoding each phrase by reference to a string occurring earlier in the input. They differ by the phrase encodings: encoded by pairs (length + position of an earlier occurrence) or by triples (length + position of an earlier occurrence + the letter following the earlier occurring part); and they differ by allowing or not allowing overlaps between the phrase and its earlier occurrence. For a given string of length $n$ over an alphabet of size $\sigma$, denote the numbers of phrases in the parsings allowing (resp., not allowing) overlaps by $z$ (resp., $\hat{z}$) for "pairs", and by $z_3$ (resp., $\hat{z}_3$) for "triples". We prove the following bounds and provide series of examples showing that these bounds are tight:

• $z \le \hat{z} \le z \cdot O(\log\frac{n}{z\log_\sigma z})$ and $z_3 \le \hat{z}_3 \le z_3 \cdot O(\log\frac{n}{z_3\log_\sigma z_3})$;

• $\frac{1}{2}\hat{z} < \hat{z}_3 \le \hat{z}$ and $\frac{1}{2} z < z_3 \le z$.

Comment: 6 pages
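The four parsings compared in this abstract can be illustrated with a small, deliberately naive sketch. It is not the authors' code: the function name `lz77_phrases` and its `overlap` and `triples` flags are hypothetical, and the conventions assumed here (a letter with no earlier occurrence forms its own phrase in the "pairs" variant; the last "triples" phrase may lack its trailing letter) are one common reading of the definitions, not taken from the paper itself.

```python
def lz77_phrases(s, *, overlap, triples):
    """Greedy LZ77-type parsing of s (quadratic-time illustration only).

    overlap: if True, the earlier occurrence may run into the phrase itself;
             if False, it must lie entirely before the phrase start.
    triples: if True, each phrase is the longest previously occurring prefix
             plus the letter that follows it; if False, just that prefix
             (or a single new letter when no earlier occurrence exists).
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best = 0  # length of the longest prefix of s[i:] seen earlier
        for j in range(i):  # candidate start of the earlier occurrence
            l = 0
            while i + l < n and s[j + l] == s[i + l] and (overlap or j + l < i):
                l += 1
            best = max(best, l)
        if triples:
            length = min(best + 1, n - i)  # copied part plus one fresh letter
        else:
            length = max(best, 1)          # a new letter is its own phrase
        phrases.append(s[i:i + length])
        i += length
    return phrases


if __name__ == "__main__":
    text = "aaaa"
    for triples in (False, True):
        for overlap in (True, False):
            parsing = lz77_phrases(text, overlap=overlap, triples=triples)
            print(f"triples={triples} overlap={overlap}: {parsing}")
```

Under these assumptions the string "aaaa" gives z = 2, \hat{z} = 3, z_3 = 2 and \hat{z}_3 = 3, which is consistent with the inequalities stated in the abstract.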

arXiv.org e-Print Archive

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Suffix arrays: what are they good for?

Author: Puglisi S.J.
Smyth W.F.
Turpin A.
Publication date: 01/01/2006

Research Repository

Lightweight Lempel-Ziv Parsing

Author: D. Okanohara
E. Ohlebusch
G. Chen
G. Navarro
J. Barbay
J. Fischer
J. Kärkkäinen
J. Ziv
M. Crochemore
M.I. Abouelhoda
P. Ferragina
R. Cánovas
S. Kreft
S. Kuruppu
T. Gagie
T. Kasai
T. Starikovskaya
U. Manber
W.I. Chang
Publication date: 01/01/2013

We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.

Comment: 12 pages

arXiv.org e-Print Archive

Crossref

Improved ESP-index: a practical self-index for highly repetitive texts

Author: F. Claude
G. Navarro
J. Barbay
J.I. Munro
K. Goto
O. Delpratt
S. Maruyama
T. Gagie
T. Yamamoto
Publication date: 01/01/2014

While several self-indexes for highly repetitive texts exist, developing a practical self-index applicable to real-world repetitive texts remains a challenge. ESP-index is a grammar-based self-index built on the notion of edit-sensitive parsing (ESP), an efficient parsing algorithm that guarantees upper bounds on parsing discrepancies between different appearances of the same subtexts in a text. Although ESP-index performs efficient top-down searches of query texts, it has a serious issue with the binary searches used to find appearances of variables for a query text, which slows down query searches. We present an improved ESP-index (ESP-index-I) that leverages the idea behind succinct data structures for large alphabets. While ESP-index-I keeps the same efficiency as ESP-index for top-down searches, it avoids the binary searches by using fast rank/select operations. We experimentally test ESP-index-I on its ability to search query texts and extract subtexts from real-world repetitive texts on a large scale, and we show that ESP-index-I performs better than other possible approaches.

Comment: This is the full version of a proceeding accepted to the 11th International Symposium on Experimental Algorithms (SEA2014)

arXiv.org e-Print Archive

Crossref

Indexing and searching over unstructured data (Indexación y búsqueda sobre datos no estructurados)

Author: Azar Paola
Esquivel Susana Cecilia
Herrera Norma Edith
Ruano Darío
Publication date: 01/04/2017

Current databases have incorporated the ability to store unstructured data such as images, sound, text, and video. The storage and search problems in these types of databases differ from those of classical databases, since the data cannot be organized into records and fields, and even if it could, exact search is of no interest. It is in this context that new database models emerge, capable of meeting the storage and search needs of these applications. Our interest lies in the design of efficient indexes for these new databases.

Track: Databases and Data Mining. Red de Universidades con Carreras en Informática (RedUNCI)

Servicio de Difusión de la Creación Intelectual

Indexing and searching over unstructured data (Indexación y búsqueda sobre datos no estructurados)

Author: Azar Paola
De Battista Anabella
Esquivel Susana Cecilia
Herrera Norma Edith
Ruano Darío
Publication date: 01/04/2016

Current databases have incorporated the ability to store unstructured data such as images, sound, text, and video. The storage and search problems in these types of databases differ from those of classical databases, since the data cannot be organized into records and fields, and even if it could, exact search is of no interest. It is in this context that new database models emerge, capable of meeting the storage and search needs of these applications. Our interest lies in the design of efficient indexes for these new databases.

Track: Databases and Data Mining. Red de Universidades con Carreras en Informática (RedUNCI)

Servicio de Difusión de la Creación Intelectual

Indexing text in secondary memory (Indexando texto en memoria secundaria)

Author: Esquivel Susana Cecilia
Herrera Norma Edith
Navarro Gonzalo
Rodríguez Brisaboa Nieves
Ruano Carina
Ruano Darío
Villegas Ana
Publication date: 01/04/2013

The next generation of database management systems must be able to index unstructured data (multimedia data) and answer queries over these data as efficiently as current systems answer exact-search queries over relational databases. Although numerous indexing techniques designed for this problem exist, improving their efficiency is of vital importance. Our research area is the study of efficient indexes for unstructured data.

Track: Databases and Data Mining. Red de Universidades con Carreras en Informática (RedUNCI)

Servicio de Difusión de la Creación Intelectual