17 research outputs found

    From Theory to Practice: Plug and Play with Succinct Data Structures

    Full text link
    Engineering efficient implementations of compact and succinct structures is a time-consuming and challenging task, since there is no standard library of easy-to- use, highly optimized, and composable components. One consequence is that measuring the practical impact of new theoretical proposals is a difficult task, since older base- line implementations may not rely on the same basic components, and reimplementing from scratch can be very time-consuming. In this paper we present a framework for experimentation with succinct data structures, providing a large set of configurable components, together with tests, benchmarks, and tools to analyze resource requirements. We demonstrate the functionality of the framework by recomposing succinct solutions for document retrieval.Comment: 10 pages, 4 figures, 3 table

    The Potential of Learned Index Structures for Index Compression

    No full text
    Inverted indexes are vital in providing fast key-word-based search. For every term in the document collection, a list of identifiers of documents in which the term appears is stored, along with auxiliary information such as term frequency, and position offsets. While very effective, inverted indexes have large memory requirements for web-sized collections. Recently, the concept of learned index structures was introduced, where machine learned models replace common index structures such as B-tree-indexes, hash-indexes, and bloom-filters. These learned index structures require less memory, and can be computationally much faster than their traditional counterparts. In this paper, we consider whether such models may be applied to conjunctive Boolean querying. First, we investigate how a learned model can replace document postings of an inverted index, and then evaluate the compromises such an approach might have. Second, we evaluate the potential gains that can be achieved in terms of memory requirements. Our work shows that learned models have great potential in inverted indexing, and this direction seems to be a promising area for future research

    A compressed self-indexed representation of xml documents

    Get PDF
    Abstract. This paper presents a structure we call XML Wavelet Tree (XWT) to represent any XML document in a compressed and self-indexed form. Therefore, any query or procedure that could be performed over the original document can be performed more efficiently over the XWT representation because it is shorter and has some indexing properties. In fact, XWT permits to answer XPath queries more efficiently than using the uncompressed version of the documents. XWT is also competitive when comparing it with inverted indexes over the XML document (if both structures use the same space).

    Space-Efficient Top-k Document Retrieval

    No full text
    Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel structures and algorithms dominate almost all the space/time tradeoff

    Improved compressed indexes for full-text document retrieval

    No full text
    We give new space/time tradeoffs for compressed indexes that answer document retrieval queries on general sequences. On a collection of D documents of total length n, current approaches require at lg D lg lg D least |CSA | + O(n) or 2|CSA | + o(n) bits of space, where CSA is a full-text index. Using monotone minimum perfect hash functions, we give new algorithms for document listing with frequencies and top-k document retrieval using just |CSA | + O(n lg lg lg D) bits. We also improve current solutions that use 2|CSA | + o(n) bits, and consider other problems such as colored range listing, top-k most important documents, and computing arbitrary frequencies

    Dual-Sorted Inverted Lists

    No full text
    Several IR tasks rely, to achieve high efficiency, on a single pervasive data structure called the inverted index. This is a mapping from the terms in a text collection to the documents where they appear, plus some supplementary data. Different orderings in the list of documents associated to a term, and different supplementary data, fit widely different IR tasks. Index designers have to choose the right order for one such task, rendering the index difficult to use for others. In this paper we introduce a general technique, based on wavelet trees, to maintain a single data structure that offers the combined functionality of two independent orderings for an inverted index, with competitive efficiency and within the space of one compressed inverted index. We show in particular that the technique allows combining an ordering by decreasing term frequency (useful for ranked document retrieval) with an ordering by increasing document identifier (useful for phrase and Boolean queries). We show that we can support not only the primitives required by the different search paradigms (e.g., in order to implement any intersection algorithm on top of our data structure), but also that the data structure offers novel ways of carrying out many operations of interest, including space-free treatment of stemming and hierarchical documents
    corecore