12 research outputs found

    Re-Pair Compression of Inverted Lists

    Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: we use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompression at arbitrary positions in main and secondary memory, we introduce variants that, in addition, speed up the operations required for inverted list intersection. We compare the resulting data structures with several recent proposals under various list intersection algorithms, and conclude that our Re-Pair variants offer an interesting time/space tradeoff for this problem, yet further improvements are required for them to improve upon the state of the art.
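
    The idea of Re-Pair over gap-encoded lists can be illustrated with a minimal sketch. It is not the paper's data structure: it is a naive, quadratic-time Re-Pair without the intersection-oriented variants, and the function names (`to_gaps`, `from_gaps`, `repair`, `expand`) and the use of negative integers for grammar symbols are assumptions made for illustration only.

```python
from collections import Counter
from itertools import accumulate


def to_gaps(postings):
    """Differences between consecutive positions; the first gap is the first id."""
    return [cur - prev for cur, prev in zip(postings, [0] + postings[:-1])]


def from_gaps(gaps):
    """Inverse of to_gaps: prefix sums recover the original positions."""
    return list(accumulate(gaps))


def repair(seq):
    """Naive Re-Pair: repeatedly replace a most frequent adjacent pair.

    Gaps of a strictly increasing list are positive, so negative integers
    are used here (an illustrative choice) as fresh grammar symbols.
    Overlapping occurrences are counted naively, which can overestimate
    a pair's frequency but never breaks decompression.
    """
    seq = list(seq)
    rules = {}
    next_symbol = -1
    while True:
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        rules[next_symbol] = pair
        out, i = [], 0
        while i < len(seq):
            # Scan left to right, replacing non-overlapping occurrences.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol -= 1
    return seq, rules


def expand(symbol, rules):
    """Recursively expand one symbol of the compressed sequence into gaps."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


postings = [3, 5, 8, 10, 13, 15, 18, 40]
gaps = to_gaps(postings)
compressed, rules = repair(gaps)
restored = from_gaps([g for s in compressed for g in expand(s, rules)])
```

    Because each symbol expands independently, a single symbol of the compressed sequence can be decoded without touching the rest, which is the property that makes decompression at arbitrary positions possible.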

    Map algebra on raster datasets represented by compact data structures

    [Abstract]: The increase in the size of data repositories has forced the design of new computing paradigms able to process large volumes of data in a reasonable amount of time. One of them is in-memory computing, which advocates storing all the data in main memory to avoid the disk I/O bottleneck. Compression is one of the key technologies for this approach. For raster data, a compact data structure called k²-raster has recently been proposed. It compresses raster maps while still supporting fast retrieval of a given datum or a portion of the data directly from the compressed data. The original k²-raster work introduced several queries in which it was superior to competitors. However, to serve as the basis of an in-memory system for raster data, it must also be shown to be efficient when performing more complex operations such as the map algebra operators. In this work, we present algorithms that run a set of these operators directly on k²-raster without a decompression procedure.
    Funding: This work was supported by the National Natural Science Foundation of China (Grant Nos. 31171944, 31640068), Anhui Provincial Natural Science Foundation (Grant No. 2019B319), and the Earmarked Fund for Anhui Science and Technology Major Project (202003b06020016). CITIC; Ministerio de Ciencia e Innovación, Grant/Award Numbers: PID2020-114635RB-I00; PDC2021-120917-C21; PDC2021-121239-C31; PID2019-105221RB-C41; TED2021-129245-C21; Xunta de Galicia, Grant/Award Numbers: ED431C 2021/53; IN852D 2021/3 (CO3). This work was partially supported by CITIC. CITIC is funded by the Xunta de Galicia through the collaboration agreement between the Department of Culture, Education, Vocational Training and Universities and the Galician universities for the reinforcement of the research centers of the Galician University System (CIGUS).
    IN852D 2021/3 (CO3): partially funded by the EU (ERDF) and GAIN, Conecta COVID call. GRC ED431C 2021/53: partially funded by GAIN/Xunta de Galicia. TED2021-129245B-C21; PDC2021-121239-C31; PDC2021-120917-C21: partially funded by MCIN/AEI/10.13039/501100011033 and “NextGenerationEU”/PRTR. PID2020-114635RB-I00; PID2019-105221RB-C41: partially funded by MCIN/AEI/10.13039/501100011033. Funding for open access charge: Universidade da Coruña/CISUG.

    Space-Efficient Data Structures for Information Retrieval

    The amount of data that people and companies store has grown exponentially over the last few years. Storing this information alone is not enough: to make it useful, we need to be able to search inside it efficiently. Furthermore, it is highly valuable to keep the historic data of each document, making it possible to access and search not only the newest version but also the whole history of the documents. Grammar-based compression has proven to be very effective for repetitive data, which is the case for versioned documents. In this thesis we present several results on representing textual information and searching in it. In particular, we present text indexes for grammar-based compressed text that support searching for a pattern and extracting substrings of the input text. These are the first general indexes for grammar-based compressed text that support searching in sublinear time. In order to build our indexes, we present new results on representing binary relations in a space-efficient manner, and construction algorithms that use little space to achieve their goal. These two results have a wide range of applications. In particular, the representations for binary relations can be used as a building block for several structures in computer science, such as graphs and inverted indexes. Finally, we present a new index, based on grammar-based compression, to solve the document listing problem. This problem deals with representing a collection of texts and searching for the documents that contain a given pattern. Although it is similar to the classical text indexing problem, it has proven to be a challenge when we want to pay time proportional not to the number of occurrences but to the size of the result. Our proposal is designed particularly for versioned text, allowing the storage of a collection of documents with all their historic versions in little space. This is currently the smallest structure for such a purpose in practice.

    Efficient compression of large repetitive strings

    When it comes to managing large volumes of data, general-purpose compressors such as gzip are ubiquitous. They are fast, practical and available on every modern platform from standard desktops to mobile devices. These tools exploit local redundancy in a text using a fixed-size sliding window. This window is usually very small relative to the text; in principle, however, it can be as large as available memory. The window acts as a dictionary. Compression is achieved by replacing substrings with pointers to previous occurrences found in the dictionary. This type of algorithm becomes problematic when dealing with collections that are larger than physical memory, as it fails to capture any non-local redundancy, that is, repetition that occurs outside of its search window. With rapid growth in the already enormous amount of data we store and process, there is a pressing need for improving compression effectiveness, reducing both storage requirements and decompression costs. However, many systems still use general-purpose compression tools on large, highly repetitive data collections. In this thesis we focus on addressing this issue. We explore compression in a variety of domains where large volumes of data need to be stored and accessed, and where general-purpose compression tools are the norm. First we discuss our work on web corpus compression, then we discuss the implementation of a practical index for repetitive texts that gives strong theoretical bounds in terms of size and access, and finally, we discuss our work on compression of high-throughput sequencing reads. We show that in all cases, our new methods improve on current techniques in both run-time and compression effectiveness, and provide important functionality such as fast decoding and random access.
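
    The sliding-window limitation described in this abstract can be observed with a small experiment. The sketch below is an illustration only, not taken from the thesis: it uses Python's standard `zlib` module (DEFLATE, whose window is at most 32 KiB), and the helper names, block sizes and seed are assumptions chosen for the demonstration.

```python
import random
import zlib


def compressed_size(data):
    """Size in bytes of the zlib (DEFLATE) output at maximum compression level."""
    return len(zlib.compress(data, 9))


def doubling_ratio(block_len, seed=0):
    """Compress one incompressible block, then the same block repeated twice.

    Returns size(block + block) / size(block). A ratio near 1 means the
    second copy was encoded as pointers into the window; a ratio near 2
    means the repetition lay outside the window and was not detected.
    """
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_len))
    return compressed_size(block + block) / compressed_size(block)


# The second copy starts 8 KiB after the first: inside the 32 KiB window.
near_ratio = doubling_ratio(8 * 1024)
# The second copy starts 256 KiB after the first: far outside the window.
far_ratio = doubling_ratio(256 * 1024)
```

    In this setup `near_ratio` stays close to 1 while `far_ratio` stays close to 2, even though both inputs are exactly one block repeated twice. This is the non-local redundancy that a fixed window cannot capture.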