Search CORE

5,026 research outputs found

Scalable succinct indexing for large text collections

Author: Petri M
Publication venue: RMIT University
Publication date: 01/01/2013
Field of study

Self-indexes save space by emulating operations of traditional data structures using basic operations on bitvectors. Succinct text indexes provide full-text search functionality which is traditionally provided by suffix trees and suffix arrays for a given text, while using space equivalent to the compressed representation of the text. Succinct text indexes can therefore provide full-text search functionality over inputs much larger than what is viable using traditional uncompressed suffix-based data structures. Fields such as Information Retrieval involve the processing of massive text collections. However, the in-memory space requirements of succinct text indexes during construction have hampered their adoption for large text collections. One promising approach to support larger data sets is to avoid constructing the full suffix array by using alternative indexing representations. This thesis focuses on several aspects related to the scalability of text indexes to larger data sets. We identify practical improvements in the core building blocks of all succinct text indexing algorithms, and subsequently improve the index performance on large data sets. We evaluate our findings using several standard text collections and demonstrate: (1) the practical applications of our improved indexing techniques; and (2) that succinct text indexes are a practical alternative to inverted indexes for a variety of top-k ranked document retrieval problems

RMIT Research Repository

Universal Indexes for Highly Repetitive Document Collections

Author: Claude Francisco
Fariña Antonio
Martínez-Prieto Miguel A.
Navarro Gonzalo
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016
Field of study

Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk{\l}odowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094

arXiv.org e-Print Archive

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Académico de la Universidad de Chile

Document Retrieval on Repetitive Collections

Author: C. Hernández
F. Claude
F. Claude
G. Navarro
G. Navarro
H. Ferrada
K. Sadakane
T. Gagie
U. Manber
V. Mäkinen
Publication venue
Publication date: 01/01/2014
Field of study

Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this paper we address this shortcoming by exploring the relationship between the nature of the underlying collection and the performance of current methods. Via extensive experiments we show that established solutions are often beaten in practice by brute-force alternatives. We also design new methods that offer superior time/space trade-offs, particularly on repetitive collections.Comment: Accepted to ESA 2014. Implementation and experiments at http://www.cs.helsinki.fi/group/suds/rlcsa

arXiv.org e-Print Archive

Crossref

ImageTerrier: an extensible platform for scalable high-performance image retrieval

Author: Dupplaw David
Hare Jonathon
Lewis Paul H.
Samangooei Sina
Publication venue
Publication date: 05/06/2012
Field of study

Southampton (e-Prints Soton)

From Theory to Practice: Plug and Play with Succinct Data Structures

Author: F. Claude
G. Navarro
G. Navarro
J.S. Culpepper
K. Sadakane
K. Sadakane
N. Jesper Larsson
R. Grossi
S. Vigna
V. Mäkinen
Publication venue
Publication date: 05/11/2013
Field of study

Engineering efficient implementations of compact and succinct structures is a time-consuming and challenging task, since there is no standard library of easy-to- use, highly optimized, and composable components. One consequence is that measuring the practical impact of new theoretical proposals is a difficult task, since older base- line implementations may not rely on the same basic components, and reimplementing from scratch can be very time-consuming. In this paper we present a framework for experimentation with succinct data structures, providing a large set of configurable components, together with tests, benchmarks, and tools to analyze resource requirements. We demonstrate the functionality of the framework by recomposing succinct solutions for document retrieval.Comment: 10 pages, 4 figures, 3 table

arXiv.org e-Print Archive

CiteSeerX

Crossref