
    Inverted index compression based on term and document identifier reassignment

    Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2008. Thesis (Master's) -- Bilkent University, 2008. Includes bibliographical references (leaves 43-46). Compression of inverted indexes has received great attention in recent years. An inverted index consists of a list of document identifiers, also referred to as a posting list, for each term. Compressing an inverted index reduces the size of the index, which also improves query performance by reducing disk access time. Recent studies have shown that reassigning document identifiers has a great effect on the compressibility of an inverted index. In this work, we propose a novel technique that reassigns both the term and document identifiers of an inverted index by transforming the matrix representation of the index into a block-diagonal form, which improves the compression ratio dramatically. We adapt the row-net hypergraph-partitioning model for this transformation and improve the compression ratio by as much as 50%. To the best of our knowledge, this method is more effective than previous inverted index compression techniques. Baykan, İzzet Çağrı (M.S.)
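    The reordering idea can be illustrated with a small sketch (not the thesis' hypergraph-partitioning model): renaming document identifiers so that documents sharing terms become consecutive shrinks the gaps that posting lists actually store, and hence their gamma-coded size. The toy index and the clustered mapping below are hypothetical.

```python
# Illustrative sketch only: it shows why document-identifier reassignment
# shrinks the gaps stored in an inverted index. Gap costs are measured with
# the Elias gamma length formula, 2*floor(log2(g)) + 1 bits per gap g.
from math import floor, log2

def gamma_bits(gap: int) -> int:
    """Bits needed to gamma-encode one positive gap."""
    return 2 * floor(log2(gap)) + 1

def index_bits(postings: dict, doc_map: dict) -> int:
    """Total gamma-coded size of all posting lists after renaming
    each document id d to doc_map[d]."""
    total = 0
    for docs in postings.values():
        ids = sorted(doc_map[d] for d in docs)
        gaps = [ids[0] + 1] + [b - a for a, b in zip(ids, ids[1:])]
        total += sum(gamma_bits(g) for g in gaps)
    return total

# Hypothetical toy index: two "topics" whose documents are interleaved.
postings = {
    "storm": {0, 2, 4, 6},
    "rain":  {0, 4, 6},
    "chess": {1, 3, 5, 7},
    "rook":  {3, 5, 7},
}
identity = {d: d for d in range(8)}
# Reassignment that clusters each topic's documents contiguously,
# pushing the term-document matrix toward a block-diagonal form.
clustered = {0: 0, 2: 1, 4: 2, 6: 3, 1: 4, 3: 5, 5: 6, 7: 7}

print("original ids:  ", index_bits(postings, identity), "bits")
print("reassigned ids:", index_bits(postings, clustered), "bits")
```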

    On inverted index compression for search engine efficiency

    Efficient access to the inverted index data structure is a key aspect for a search engine to achieve fast response times to users’ queries. While the performance of an information retrieval (IR) system can be enhanced through the compression of its posting lists, there is little recent work in the literature that thoroughly compares and analyses the performance of modern integer compression schemes across different types of posting information (document ids, frequencies, positions). In this paper, we experiment with different modern integer compression algorithms, integrating these into a modern IR system. Through comprehensive experiments conducted on two large, widely used document corpora and large query sets, our results show the benefit of compression for different types of posting information to the space- and time-efficiency of the search engine. Overall, we find that the simple Frame of Reference compression scheme results in the best query response times for all types of posting information. Moreover, we observe that the frequency and position posting information in Web corpora that have large volumes of anchor text are more challenging to compress, yet compression is beneficial in reducing average query response times.
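    For context on the Frame of Reference (FOR) scheme highlighted above, here is a minimal sketch of FOR over blocks of d-gaps; it assumes fixed-size blocks and simple bit packing, and is an illustration of the general scheme rather than the implementation used in the paper's experiments.

```python
# Minimal Frame of Reference (FOR) sketch: each block stores a reference
# (its minimum), a fixed bit width, and every value as a packed offset.
def for_encode(values, block_size=128):
    """Encode each block as (reference, bit_width, count, packed offsets)."""
    blocks = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        ref = min(block)
        width = max(v - ref for v in block).bit_length()   # bits per offset
        packed = 0
        for j, v in enumerate(block):
            packed |= (v - ref) << (j * width)              # fixed-width packing
        blocks.append((ref, width, len(block), packed))
    return blocks

def for_decode(blocks):
    out = []
    for ref, width, n, packed in blocks:
        mask = (1 << width) - 1 if width else 0
        out.extend(ref + ((packed >> (j * width)) & mask) for j in range(n))
    return out

gaps = [7, 9, 8, 12, 7, 300, 8, 9]          # hypothetical d-gap sequence
assert for_decode(for_encode(gaps, block_size=4)) == gaps
```

    Note that the single outlier gap of 300 forces a wider bit width for its whole block; patched variants such as PForDelta address exactly this by storing a few outliers as exceptions.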

    On Optimally Partitioning Variable-Byte Codes

    The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with that of other, more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2x by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization comes almost for free: we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity, and we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders. Comment: Published in IEEE Transactions on Knowledge and Data Engineering (TKDE), 15 April 201
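    For concreteness, a short Variable-Byte sketch follows, assuming the common convention of 7 payload bits per byte with the high bit marking the final byte of each integer; the paper's partitioned representation is not shown here.

```python
# Variable-Byte coding sketch: 7 data bits per byte, high bit = terminator.
def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)      # low-order 7 bits, continuation byte
            n >>= 7
        out.append(n | 0x80)          # high bit set terminates the integer
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:               # terminator byte reached
            numbers.append(n)
            n, shift = 0, 0
        else:
            shift += 7
    return numbers

gaps = [5, 1, 130, 20000, 1]          # hypothetical d-gaps
assert vbyte_decode(vbyte_encode(gaps)) == gaps
```

    Because every integer costs at least one full byte, small gaps are where Variable-Byte wastes space relative to bit-aligned codes, which is the gap the paper's partitioning narrows.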

    Universal Indexes for Highly Repetitive Document Collections

    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower. Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
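    A tiny sketch of why this helps: in a versioned collection the d-gap sequence of a posting list is itself highly repetitive, so even plain run-length coding of the gaps (one of the alternatives the paper considers alongside Lempel-Ziv and grammar compression) collapses it dramatically. The document-id layout below is hypothetical.

```python
# Run-length view of a posting list's d-gaps in a versioned collection.
from itertools import groupby

def d_gaps(doc_ids):
    ids = sorted(doc_ids)
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def run_length(gaps):
    """(gap value, run length) pairs; runs of identical gaps collapse."""
    return [(g, len(list(run))) for g, run in groupby(gaps)]

# Hypothetical scenario: 1000 near-copies of one document, stored
# consecutively, all containing the term, plus two unrelated documents.
doc_ids = list(range(10, 1010)) + [2000, 5000]
gaps = d_gaps(doc_ids)
print(len(gaps), "gaps collapse to", len(run_length(gaps)), "runs")
# -> 1002 gaps collapse to 4 runs: (10, 1), (1, 999), (991, 1), (3000, 1)
```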

    Fast dictionary-based compression for inverted indexes

    Dictionary-based compression schemes provide fast decoding, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity that these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, and a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.
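    The following toy coder illustrates the dictionary idea in its simplest form: frequent fixed-length gap patterns go into a dictionary and are replaced by single references, with literal gaps as a fallback. It is a hedged sketch of the general technique, not the construction used in the paper; PATTERN_LEN and the example gap stream are arbitrary.

```python
# Toy dictionary coder for gap sequences: table-driven decoding is fast
# because a whole pattern is emitted per dictionary reference.
from collections import Counter

PATTERN_LEN = 4

def build_dictionary(gaps, size=256):
    """Most frequent length-4 gap patterns, sampled at aligned offsets."""
    windows = [tuple(gaps[i:i + PATTERN_LEN])
               for i in range(0, len(gaps) - PATTERN_LEN + 1, PATTERN_LEN)]
    return [p for p, _ in Counter(windows).most_common(size)]

def encode(gaps, dictionary):
    index = {p: i for i, p in enumerate(dictionary)}
    out, i = [], 0
    while i < len(gaps):
        pattern = tuple(gaps[i:i + PATTERN_LEN])
        if pattern in index:
            out.append(("dict", index[pattern]))   # one token covers 4 gaps
            i += PATTERN_LEN
        else:
            out.append(("lit", gaps[i]))           # fall back to a literal gap
            i += 1
    return out

def decode(tokens, dictionary):
    gaps = []
    for kind, value in tokens:
        gaps.extend(dictionary[value] if kind == "dict" else [value])
    return gaps

gaps = [1, 1, 2, 1] * 50 + [7, 3, 1, 1, 2, 1]      # hypothetical gap stream
d = build_dictionary(gaps)
assert decode(encode(gaps, d), d) == gaps
```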

    Inverted Index Compression with the Gamma Code Method in an Information Retrieval System

    ABSTRACT: In an information retrieval system, an inverted index is used to evaluate queries. The more documents that must be stored, the larger the inverted index becomes, and the more queries must be processed when searching those documents. Some form of performance optimization is therefore needed to cope with an ever-growing inverted index and query load; one such optimization is inverted index compression. Compressing the inverted index is expected to reduce its storage requirements and to improve the use of the memory cache, thus avoiding full disk accesses during query evaluation. One inverted index compression method is the Gamma code, a compression technique that turns an integer into a binary codeword. The compressed data are the document IDs and the term frequencies. In this thesis, we test the application of inverted index compression in an information retrieval system on a small document collection and on a large document collection. From the analysis of the test results, we conclude that the Gamma code performs well with respect to inverted index size on the large collection, because its terms are spread over many documents, so the Gamma encoding of each posting is shorter. It also performs well with respect to query processing on the large collection, because the average query processing time ratio is lower than on the small collection. Keywords: compression, inverted index, Gamma code, integer
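    Since the abstract centres on the Gamma code, a compact sketch of Elias Gamma coding follows: a positive integer n is written as floor(log2 n) zero bits followed by the binary representation of n, so the small d-gaps and term frequencies of a large collection get short codewords. Bit strings are used here for clarity rather than efficiency.

```python
# Elias Gamma coding sketch over bit strings.
def gamma_encode(n: int) -> str:
    assert n >= 1
    binary = bin(n)[2:]                       # e.g. 9 -> "1001"
    return "0" * (len(binary) - 1) + binary   # 9 -> "0001001"

def gamma_decode(bits: str):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                 # unary length prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

gaps = [1, 2, 9, 511]                         # hypothetical d-gaps / frequencies
stream = "".join(gamma_encode(g) for g in gaps)
assert gamma_decode(stream) == gaps
```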

    Using Inverted Files to Compress Text

    This is the first report on a new approach to text compression. It consists of representing the text file with a compressed inverted file index in conjunction with a very compact lexicon, where the lexicon includes every word in the text. The index is compressed using standard index compression techniques, and the lexicon is compressed by an original dictionary compression method that gives better compression results than existing procedures. The compression procedure is complex, but decompression time is linear in the file size, although it requires two passes and hence cannot be performed online. First experiments show that this method, when refined, can be competitive for larger texts that only need to be decompressed in real time.
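    A minimal sketch of the reconstruction idea: if the text is stored only as a lexicon plus a word-to-positions inverted file, decompression scatters each word back to its recorded positions in a single linear pass. Tokenisation details and the actual index and lexicon compression are omitted; the helper names are illustrative.

```python
# Rebuild a text from its lexicon and word -> positions inverted file.
def build_inverted_file(text):
    words = text.split()
    lexicon = sorted(set(words))
    postings = {w: [] for w in lexicon}
    for pos, w in enumerate(words):
        postings[w].append(pos)                 # word positions, in order
    return lexicon, postings

def reconstruct(lexicon, postings, length):
    words = [None] * length
    for w in lexicon:
        for pos in postings[w]:
            words[pos] = w                      # linear in the text size
    return " ".join(words)

text = "to be or not to be"
lexicon, postings = build_inverted_file(text)
assert reconstruct(lexicon, postings, len(text.split())) == text
```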