
Efficient set intersection for inverted indexing

By J. Shane Culpepper and Alistair Moffat

Abstract

Conjunctive Boolean queries are a key component of modern information retrieval systems, especially when web-scale repositories are being searched. A conjunctive query q is equivalent to a |q|-way intersection over ordered sets of integers, where each set represents the documents containing one of the terms, and each integer in each set is an ordinal document identifier. As is the case with many computing applications, there is tension between the way in which the data is represented and the ways in which it is to be manipulated. In particular, the sets representing index data for typical document collections are highly compressible, but are processed using random access techniques, meaning that methods for carrying out set intersections must be alert to issues of access patterns and data representation. Our purpose in this paper is to explore these tradeoffs by investigating intersection techniques that make use of uncompressed “integer” representations as well as compressed arrangements. We also propose a simple hybrid method that provides both compact storage and faster intersection computations for conjunctive querying than is possible even with uncompressed representations.
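
To make the abstract's framing concrete, here is a minimal sketch of a conjunctive query evaluated as a |q|-way intersection over sorted lists of document identifiers. It is a generic illustration on uncompressed lists, not the hybrid method the paper proposes; the function names (`intersect_pair`, `conjunctive_query`) and the toy index are assumptions made for this example.

```python
from bisect import bisect_left


def intersect_pair(small, large):
    """Intersect two sorted lists of document ids.

    Each id from the shorter list is located in the longer list by
    binary search, which relies on random access into `large`.
    """
    result = []
    lo = 0
    for doc_id in small:
        # Search only the part of `large` that has not been passed yet.
        lo = bisect_left(large, doc_id, lo)
        if lo == len(large):
            break  # every remaining id in `small` exceeds all of `large`
        if large[lo] == doc_id:
            result.append(doc_id)
    return result


def conjunctive_query(postings):
    """Evaluate a |q|-way intersection, shortest list first."""
    if not postings:
        return []
    ordered = sorted(postings, key=len)
    candidates = list(ordered[0])
    for plist in ordered[1:]:
        candidates = intersect_pair(candidates, plist)
        if not candidates:
            break  # no document contains all terms
    return candidates


# Hypothetical toy index: term -> sorted list of ordinal document ids.
index = {
    "set": [1, 3, 4, 7, 9, 12],
    "intersection": [3, 4, 9, 10, 12, 15, 20],
    "index": [2, 3, 9, 12, 30],
}
query = ("set", "intersection", "index")
print(conjunctive_query([index[t] for t in query]))  # [3, 9, 12]
```

Starting from the shortest list keeps the candidate set small, and the binary search step is only cheap when the longer list supports random access, which is the tension with compressed representations that the abstract describes.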

Topics: Categories and Subject Descriptors: E.2 [Data Storage Representations]: Composite structures; H.3.2 [Information Storage]: File organization; H.3.3 [Information Search and Retrieval]: Search process. Additional Key Words and Phrases: Compact data structures, information retrieval, set intersection, set representation
Year: 2010
OAI identifier: oai:CiteSeerX.psu:10.1.1.415.2636
Provided by: CiteSeerX