42 research outputs found

    Suffix arrays: what are they good for?

    Get PDF
    Recently the theoretical community has displayed a flurry of interest in suffix arrays, and compressed suffix arrays. New, asymptotically optimal algorithms for construction, search, and compression of suffix arrays have been proposed. In this talk we will present our investigations into the practicalities of these latest developments. In particular, we investigate whether suffix arrays can indeed replace inverted files, as suggested in recent literature on suffix arrays

    Comparison of LZ77-type Parsings

    Full text link
    We investigate the relations between different variants of the LZ77 parsing existing in the literature. All of them are defined as greedily constructed parsings encoding each phrase by reference to a string occurring earlier in the input. They differ by the phrase encodings: encoded by pairs (length + position of an earlier occurrence) or by triples (length + position of an earlier occurrence + the letter following the earlier occurring part); and they differ by allowing or not allowing overlaps between the phrase and its earlier occurrence. For a given string of length nn over an alphabet of size σ\sigma, denote the numbers of phrases in the parsings allowing (resp., not allowing) overlaps by zz (resp., z^\hat{z}) for "pairs", and by z3z_3 (resp., z^3\hat{z}_3) for "triples". We prove the following bounds and provide series of examples showing that these bounds are tight: \bullet zz^zO(lognzlogσz)z \le \hat{z} \le z \cdot O(\log\frac{n}{z\log_\sigma z}) and z3z^3z3O(lognz3logσz3)z_3 \le \hat{z}_3 \le z_3 \cdot O(\log\frac{n}{z_3\log_\sigma z_3}); \bullet 12z^<z^3z^\frac{1}2\hat{z} < \hat{z}_3 \le \hat{z} and 12z<z3z\frac{1}2 z < z_3 \le z.Comment: 6 page

    Suffix arrays: what are they good for?

    Get PDF
    Recently the theoretical community has displayed a flurry of interest in suffix arrays, and compressed suffix arrays. New, asymptotically optimal algorithms for construction, search, and compression of suffix arrays have been proposed. In this talk we will present our investigations into the practicalities of these latest developments. In particular, we investigate whether suffix arrays can indeed replace inverted files, as suggested in recent literature on suffix arrays

    Lightweight Lempel-Ziv Parsing

    Full text link
    We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.Comment: 12 page

    Improved ESP-index: a practical self-index for highly repetitive texts

    Full text link
    While several self-indexes for highly repetitive texts exist, developing a practical self-index applicable to real world repetitive texts remains a challenge. ESP-index is a grammar-based self-index on the notion of edit-sensitive parsing (ESP), an efficient parsing algorithm that guarantees upper bounds of parsing discrepancies between different appearances of the same subtexts in a text. Although ESP-index performs efficient top-down searches of query texts, it has a serious issue on binary searches for finding appearances of variables for a query text, which resulted in slowing down the query searches. We present an improved ESP-index (ESP-index-I) by leveraging the idea behind succinct data structures for large alphabets. While ESP-index-I keeps the same types of efficiencies as ESP-index about the top-down searches, it avoid the binary searches using fast rank/select operations. We experimentally test ESP-index-I on the ability to search query texts and extract subtexts from real world repetitive texts on a large-scale, and we show that ESP-index-I performs better that other possible approaches.Comment: This is the full version of a proceeding accepted to the 11th International Symposium on Experimental Algorithms (SEA2014

    Indexación y búsqueda sobre datos no estructurados

    Get PDF
    Las bases de datos actuales han incluido la capacidad de almacenar datos no estructurados tales como imágenes, sonido, texto, video, etc. La problemática de almacenamiento y búsqueda en estos tipos de base de datos difiere de las bases de datos clásicas, dado que no es posible organizarlos en registros y campos, y aun cuando pudiera hacerse, la búsqueda exacta carece de interés. Es en este contexto donde surgen nuevos modelos de bases de datos capaces de cubrir las necesidades de almacenamiento y búsqueda de estas aplicaciones. Nuestro interés se basa en el diseño de índices eficientes para estas nuevas bases de datos.Eje: Bases de datos y Minería de datos.Red de Universidades con Carreras en Informática (RedUNCI

    Indexación y búsqueda sobre datos no estructurados

    Get PDF
    Las bases de datos actuales han incluido la capacidad de almacenar datos no estructurados tales como imágenes, sonido, texto, video, etc. La problemática de almacenamiento y búsqueda en estos tipos de base de datos difiere de las bases de datos clásicas, dado que no es posible organizarlos en registros y campos, y aun cuando pudiera hacerse, la búsqueda exacta carece de interés. Es en este contexto donde surgen nuevos modelos de bases de datos capaces de cubrir las necesidades de almacenamiento y búsqueda de estas aplicaciones. Nuestro interés se basa en el diseño de índices eficientes para estas nuevas bases de datos.Eje: Bases de Datos y Minería de DatosRed de Universidades con Carreras en Informática (RedUNCI

    Indexando texto en memoria secundaria

    Get PDF
    La próxima generación de administradores de bases de datos deberá ser capaz de indexar datos no estructurados (datos multimedia) y responder consultas sobre estos datos con tanta eficiencia como actualmente responden consultas de búsqueda exacta sobre bases de datos relacionales. Si bien existen numerosas técnicas de indexación diseñadas para esta problemática, mejorar la eficiencia de las mismas es de vital importancia. Nuestro ámbito de investigación es el estudio de índices eficientes para datos no estructurados.Eje: Bases de Datos y Minería de DatosRed de Universidades con Carreras en Informática (RedUNCI
    corecore