3 research outputs found

    Select-based random access to variable-byte encodings

    Enormous datasets are common today, and compressing them is often beneficial. Compressed data structures require fast direct access to any element of the compressed data, which traditional compression methods do not easily support. Variable-byte encoding is a method for compressing integers of different byte lengths. It removes unused leading bytes and adds a continuation bit to each byte to denote whether the compressed integer continues into the next byte. An existing solution based on a rank data structure performs well at this task. This thesis introduces an alternative solution based on a select data structure and compares the two implementations. It also includes an experiment on retrieving a subarray from the compressed data structure. The rank implementation performs better on data containing mostly small integers, while the select implementation performs better on larger integers. The select implementation also has a significant advantage in subarray fetching because of how the data is compressed.
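
    The encoding described in this abstract can be illustrated with a minimal sketch. It assumes the common layout of 7 data bits plus 1 continuation bit per byte, least-significant chunk first; the thesis itself may store the continuation bits differently. The function names are hypothetical, and `access` uses a linear scan as a stand-in for a real select structure.

```python
def vbyte_encode(values):
    """Encode non-negative integers with 7 data bits per byte.

    The high bit of each byte is the continuation bit: 1 means the
    integer continues in the next byte, 0 marks its last byte.
    """
    out = bytearray()
    for v in values:
        if v < 0:
            raise ValueError("only non-negative integers are supported")
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            v >>= 7
        out.append(v)  # final byte, continuation bit clear
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    values, current, shift = [], 0, 0
    for b in data:
        current |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def access(data, i):
    """Return the i-th integer (0-based) without decoding the whole array.

    The scan for the i-th terminating byte is what a select structure
    would answer directly; here it is done naively in linear time.
    """
    start, seen = 0, 0
    for pos, b in enumerate(data):
        if not (b & 0x80):  # terminating byte of some integer
            if seen == i:
                return vbyte_decode(data[start:pos + 1])[0]
            seen += 1
            start = pos + 1
    raise IndexError("index out of range")
```

    Small values occupy a single byte, while each additional 7 bits of magnitude costs one more byte.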

    Librería de Estructuras de Datos Compactas en Rust

    [Abstract]: The exponential growth of data poses significant challenges for efficient storage and processing. This undergraduate thesis focuses on the importance of compact data structures as a key solution for handling large-scale data. Unlike classical compression techniques, these structures allow operations on the data without complete decompression, saving time and memory. This approach has become essential in fields such as information retrieval and bioinformatics because of the massive growth of data. The Rust programming language, known for its safety, automatic memory management, and efficiency, has become one of the preferred options for innovation and modernization in the technology industry. Since no Rust library of compact data structures is competitive with the state of the art in other programming languages, this project leverages the advantages of the language to develop an open-source compact data structures library. This will give the scientific community and Rust developers a flexible, powerful, and easy-to-use tool for their projects. Furthermore, this undergraduate thesis aims to promote reproducibility, reuse, and progress in research in the field of compact data structures. In this way, it will contribute to the expansion and adoption of Rust in research and to the development of efficient and reliable scientific software. Undergraduate thesis (UDC.FIC). Computer Engineering. Academic year 2022/202

    Inverted treaps

    We introduce a new representation of the inverted index that performs faster ranked unions and intersections while using similar space. Our index is based on the treap data structure, which allows us to intersect or merge the document identifiers while simultaneously thresholding by frequency, instead of using the costlier two-step classical processing methods. To achieve compression, we represent the treap topology using several alternative compact data structures. Further, the treap invariants allow us to elegantly encode both document identifiers and frequencies differentially. We also show how to extend this representation to support incremental updates over the index. Results show that, under the tf-idf scoring scheme, our index uses about the same space as state-of-the-art compact representations while performing 2-20 times faster on ranked single-word, union, or intersection queries. Under the BM25 scoring scheme, our index may use up to 40% more space than the others and outperforms them less frequently, but it still reaches improvement factors of 2-20 in the best cases. The index supporting incremental updates incurs an overhead of 50%-100% over the static variants in terms of space, construction, and query time.
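
    The idea of thresholding by frequency while traversing document identifiers can be illustrated with a minimal, uncompressed sketch. It assumes a treap keyed by document identifier with the term frequency as a max-heap priority, so a subtree can be skipped as soon as its root falls below the threshold. The names `Node`, `build_treap`, and `docs_with_min_freq` are hypothetical, and nothing here reproduces the compact topology or differential encoding of the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    doc: int                       # document identifier (search-tree key)
    freq: int                      # term frequency (max-heap priority)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_treap(postings):
    """Build a treap from (doc, freq) pairs sorted by increasing doc.

    This is the standard stack-based Cartesian tree construction and
    runs in linear time.
    """
    stack = []
    for doc, freq in postings:
        node = Node(doc, freq)
        last = None
        # Nodes with a smaller frequency end up in the new node's left subtree.
        while stack and stack[-1].freq < freq:
            last = stack.pop()
        node.left = last
        if stack:
            stack[-1].right = node
        stack.append(node)
    return stack[0] if stack else None


def docs_with_min_freq(node, threshold):
    """List documents with freq >= threshold, in increasing doc order.

    Heap order guarantees that no descendant exceeds its ancestor's
    frequency, so a whole subtree is pruned once its root is too small.
    """
    if node is None or node.freq < threshold:
        return []
    return (docs_with_min_freq(node.left, threshold)
            + [node.doc]
            + docs_with_min_freq(node.right, threshold))
```

    For example, with postings `[(1, 2), (3, 5), (4, 1), (7, 3), (9, 4)]` and a threshold of 3, the traversal never descends into the subtrees rooted at documents 1 and 4.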