17 research outputs found

From Theory to Practice: Plug and Play with Succinct Data Structures

Author: F. Claude
G. Navarro
J.S. Culpepper
K. Sadakane
N. Jesper Larsson
R. Grossi
S. Vigna
V. Mäkinen
Publication date: 05/11/2013

Engineering efficient implementations of compact and succinct structures is a time-consuming and challenging task, since there is no standard library of easy-to-use, highly optimized, and composable components. One consequence is that measuring the practical impact of new theoretical proposals is difficult, since older baseline implementations may not rely on the same basic components, and reimplementing from scratch can be very time-consuming. In this paper we present a framework for experimentation with succinct data structures, providing a large set of configurable components, together with tests, benchmarks, and tools to analyze resource requirements. We demonstrate the functionality of the framework by recomposing succinct solutions for document retrieval. Comment: 10 pages, 4 figures, 3 tables

arXiv.org e-Print Archive

CiteSeerX

Crossref

The Potential of Learned Index Structures for Index Compression

Author: Culpepper J.S.
de Rijke M.
Koopman B.
Oosterhuis H.
Thomas P.
Trotman A.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2018

Inverted indexes are vital in providing fast keyword-based search. For every term in the document collection, a list of identifiers of documents in which the term appears is stored, along with auxiliary information such as term frequency and position offsets. While very effective, inverted indexes have large memory requirements for web-sized collections. Recently, the concept of learned index structures was introduced, where machine-learned models replace common index structures such as B-tree indexes, hash indexes, and Bloom filters. These learned index structures require less memory, and can be computationally much faster than their traditional counterparts. In this paper, we consider whether such models may be applied to conjunctive Boolean querying. First, we investigate how a learned model can replace document postings of an inverted index, and then evaluate the compromises such an approach might have. Second, we evaluate the potential gains that can be achieved in terms of memory requirements. Our work shows that learned models have great potential in inverted indexing, and this direction seems to be a promising area for future research.
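
For readers unfamiliar with the baseline this abstract refers to, the following is a minimal sketch of a conventional inverted index answering a conjunctive Boolean query. It is not the learned model studied in the paper; the function names `build_inverted_index` and `conjunctive_query`, the whitespace tokenizer, and the sample documents are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, term_frequency) sorted by doc_id."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, text in enumerate(docs):
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            counts[term][doc_id] += 1
    return {term: sorted(tf.items()) for term, tf in counts.items()}


def conjunctive_query(index, terms):
    """Return the sorted ids of documents that contain every query term."""
    postings = [index.get(term, []) for term in terms]
    if not postings:
        return []
    result = {doc_id for doc_id, _ in postings[0]}
    for plist in postings[1:]:
        # Conjunction (AND) is an intersection of the document id sets.
        result &= {doc_id for doc_id, _ in plist}
    return sorted(result)


docs = ["the quick brown fox", "the lazy dog", "quick brown dog"]
index = build_inverted_index(docs)
print(conjunctive_query(index, ["quick", "brown"]))
```

The postings lists built here are the part whose memory cost grows with the collection, which is the component the abstract proposes to replace with a learned model.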

A compressed self-indexed representation of XML documents

Author: E. Moura
G. Bordogna
J.S. Culpepper
N.R. Brisaboa
U. Manber
Publication date: 01/01/2009

This paper presents a structure we call XML Wavelet Tree (XWT) to represent any XML document in a compressed and self-indexed form. As a result, any query or procedure that could be performed over the original document can be performed more efficiently over the XWT representation, because it is shorter and has some indexing properties. In fact, XWT allows XPath queries to be answered more efficiently than over the uncompressed version of the documents. XWT is also competitive with inverted indexes over the XML document (if both structures use the same space).

CiteSeerX

LEKYTHOS

Crossref

Performance Improvements for Search Systems Using an Integrated Cache of Lists+Intersections

Author: E. Markatos
H. Turtle
H.T. Lam
I.H. Witten
J. Dean
J.S. Culpepper
R. Ozcan
T. Fagni
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

Crossref

Space-Efficient Top-k Document Retrieval

Author: D. Belazzougui
G. Navarro
J. Larsson
J.S. Culpepper
K. Sadakane
M. Bender
N. Välimäki
P. Ferragina
T. Gagie
U. Manber
W.-K. Hon
Publication date: 01/01/2012

Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel structures and algorithms dominate almost all the space/time tradeoff
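
The query defined in this abstract can be illustrated with a naive baseline. This is a minimal sketch that scans every document, not one of the reduced-space structures the paper studies; the function name `top_k_documents`, the choice to count overlapping occurrences, and the tie-break by document id are assumptions made for illustration.

```python
import heapq


def top_k_documents(docs, pattern, k):
    """Return up to k (doc_id, count) pairs, most frequent first.

    Occurrences are counted with overlaps; ties are broken by lower doc_id.
    """
    if not pattern:
        return []
    counts = []
    for doc_id, text in enumerate(docs):
        # Count every start position at which the pattern matches.
        occurrences = sum(
            1
            for start in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, start)
        )
        if occurrences > 0:
            counts.append((doc_id, occurrences))
    # Sort key: higher count first, then smaller document id.
    return heapq.nsmallest(k, counts, key=lambda pair: (-pair[1], pair[0]))


docs = ["abracadabra", "banana", "cabana", "xyz"]
print(top_k_documents(docs, "a", 2))
```

This scan takes time proportional to the total collection size for every query, which is why indexed solutions with small space overhead are of interest.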

CiteSeerX

Crossref

Compressed Self-indices Supporting Conjunctive Queries on Document Collections

Author: D. Benoit
E. Demaine
F. Claude
G. Manzini
J.M. Kleinberg
J.S. Culpepper
K. Sadakane
N. Välimäki
R. Baeza-Yates
R. González
S. Brin
T. Gagie
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Crossref

Improved compressed indexes for full-text document retrieval

Author: A. Apostolico
D.E. Willard
G. Manzini
G. Navarro
I. Munro
J. Fischer
J.S. Culpepper
K. Sadakane
N. Välimäki
P. Ferragina
R. Grossi
T. Gagie
U. Manber
Publication date: 01/01/2011

We give new space/time tradeoffs for compressed indexes that answer document retrieval queries on general sequences. On a collection of D documents of total length n, current approaches require at least |CSA| + O(n lg D / lg lg D) or 2|CSA| + o(n) bits of space, where CSA is a full-text index. Using monotone minimum perfect hash functions, we give new algorithms for document listing with frequencies and top-k document retrieval using just |CSA| + O(n lg lg lg D) bits. We also improve current solutions that use 2|CSA| + o(n) bits, and consider other problems such as colored range listing, top-k most important documents, and computing arbitrary frequencies.

CiteSeerX

Crossref

Efficient Indexing and Representation of Web Access Logs

Author: B. Mobasher
C. Sumathi
G. Manzini
G. Navarro
J. Domènech
J. Fischer
J. Han
J. Pei
J.I. Munro
J.S. Culpepper
K. Sadakane
P. Ferragina
X. Dongshan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

Crossref

Dual-Sorted Inverted Lists

Author: B. Croft
C.D. Manning
D.A. Hull
G. Navarro
G. Zipf
H. Heaps
I. Witten
J. Xu
J.S. Culpepper
M. Persin
N. Brisaboa
R. Baeza-Yates
S. Buettcher
T. Gagie
V. Anh
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Several IR tasks rely, to achieve high efficiency, on a single pervasive data structure called the inverted index. This is a mapping from the terms in a text collection to the documents where they appear, plus some supplementary data. Different orderings in the list of documents associated with a term, and different supplementary data, fit widely different IR tasks. Index designers have to choose the right order for one such task, rendering the index difficult to use for others. In this paper we introduce a general technique, based on wavelet trees, to maintain a single data structure that offers the combined functionality of two independent orderings for an inverted index, with competitive efficiency and within the space of one compressed inverted index. We show in particular that the technique allows combining an ordering by decreasing term frequency (useful for ranked document retrieval) with an ordering by increasing document identifier (useful for phrase and Boolean queries). We show that we can support not only the primitives required by the different search paradigms (e.g., in order to implement any intersection algorithm on top of our data structure), but also that the data structure offers novel ways of carrying out many operations of interest, including space-free treatment of stemming and hierarchical documents.
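
The two orderings this abstract contrasts can be made concrete with a minimal sketch that simply stores both lists side by side. This is the naive approach that costs roughly twice the space, not the wavelet-tree technique the paper introduces; the function name `build_dual_lists` and the whitespace tokenizer are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_dual_lists(docs):
    """Build two postings orderings per term.

    by_doc:  (doc_id, tf) pairs by increasing doc_id (Boolean/phrase queries).
    by_freq: (doc_id, tf) pairs by decreasing tf (ranked retrieval).
    """
    tf = defaultdict(Counter)
    for doc_id, text in enumerate(docs):
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            tf[term][doc_id] += 1
    by_doc = {term: sorted(c.items()) for term, c in tf.items()}
    by_freq = {
        term: sorted(c.items(), key=lambda pair: (-pair[1], pair[0]))
        for term, c in tf.items()
    }
    return by_doc, by_freq


docs = ["a b a", "b b b c", "a c"]
by_doc, by_freq = build_dual_lists(docs)
print(by_doc["b"], by_freq["b"])
```

Keeping both lists duplicates every posting, which is the overhead a single structure offering both orderings is meant to avoid.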

CiteSeerX

Crossref

Aggregate Exposure and Cumulative Risk Assessment—Integrating Occupational and Non-occupational Risk Factors

Crossref