858 research outputs found

    Compressed Representations of Conjunctive Query Results

    Full text link
    Relational queries, and in particular join queries, often generate large output results when executed over a huge dataset. In such cases, it is often infeasible to store the whole materialized output if we plan to reuse it further down a data processing pipeline. Motivated by this problem, we study the construction of space-efficient compressed representations of the output of conjunctive queries, with the goal of supporting the efficient access of the intermediate compressed result for a given access pattern. In particular, we initiate the study of an important tradeoff: minimizing the space necessary to store the compressed result, versus minimizing the answer time and delay for an access request over the result. Our main contribution is a novel parameterized data structure, which can be tuned to trade off space for answer time. The tradeoff allows us to control the space requirement of the data structure precisely, and depends both on the structure of the query and the access pattern. We show how we can use the data structure in conjunction with query decomposition techniques, in order to efficiently represent the outputs for several classes of conjunctive queries.Comment: To appear in PODS'18; 35 pages; comments welcom

    The Potential of Learned Index Structures for Index Compression

    Full text link
    Inverted indexes are vital in providing fast key-word-based search. For every term in the document collection, a list of identifiers of documents in which the term appears is stored, along with auxiliary information such as term frequency, and position offsets. While very effective, inverted indexes have large memory requirements for web-sized collections. Recently, the concept of learned index structures was introduced, where machine learned models replace common index structures such as B-tree-indexes, hash-indexes, and bloom-filters. These learned index structures require less memory, and can be computationally much faster than their traditional counterparts. In this paper, we consider whether such models may be applied to conjunctive Boolean querying. First, we investigate how a learned model can replace document postings of an inverted index, and then evaluate the compromises such an approach might have. Second, we evaluate the potential gains that can be achieved in terms of memory requirements. Our work shows that learned models have great potential in inverted indexing, and this direction seems to be a promising area for future research.Comment: Will appear in the proceedings of ADCS'1

    Universal Indexes for Highly Repetitive Document Collections

    Get PDF
    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk{\l}odowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094

    Compressed k2-Triples for Full-In-Memory RDF Engines

    Get PDF
    Current "data deluge" has flooded the Web of Data with very large RDF datasets. They are hosted and queried through SPARQL endpoints which act as nodes of a semantic net built on the principles of the Linked Data project. Although this is a realistic philosophy for global data publishing, its query performance is diminished when the RDF engines (behind the endpoints) manage these huge datasets. Their indexes cannot be fully loaded in main memory, hence these systems need to perform slow disk accesses to solve SPARQL queries. This paper addresses this problem by a compact indexed RDF structure (called k2-triples) applying compact k2-tree structures to the well-known vertical-partitioning technique. It obtains an ultra-compressed representation of large RDF graphs and allows SPARQL queries to be full-in-memory performed without decompression. We show that k2-triples clearly outperforms state-of-the-art compressibility and traditional vertical-partitioning query resolution, remaining very competitive with multi-index solutions.Comment: In Proc. of AMCIS'201
    corecore