5 research outputs found

    Sorting suffixes of a text via its Lyndon Factorization

    Sorting the suffixes of a text plays a fundamental role in text algorithms. Sorted suffixes are used, for instance, in constructing the Burrows-Wheeler transform and the suffix array, both widely used in several fields of computer science. For this reason, much recent research has been devoted to finding new strategies that yield effective methods for such sorting. In this paper we introduce a new methodology in which the Lyndon factorization plays an important role: the local suffixes inside the factors detected by this factorization keep their mutual order when extended to suffixes of the whole word. This property suggests a versatile technique that can easily be adapted to different implementation scenarios. Comment: Submitted to the Prague Stringology Conference 2013 (PSC 2013).
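The order-preservation property described in this abstract can be illustrated with a small sketch. It is not the paper's sorting method: it computes the Lyndon factorization with Duval's standard algorithm and then checks, by brute force, that suffixes starting inside one factor sort the same way locally and globally. The function names `lyndon_factorization` and `local_order_matches_global` are hypothetical and chosen only for this illustration.

```python
def lyndon_factorization(s):
    """Return the Lyndon factors of s (Duval's algorithm, linear time)."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it remains a power of a Lyndon word
        # (possibly followed by a proper prefix of that word).
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i          # still one Lyndon word; restart comparison
            else:
                k += 1         # periodic repetition continues
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def local_order_matches_global(s):
    """Check that, inside each Lyndon factor, local and global suffix orders agree."""
    start = 0
    for factor in lyndon_factorization(s):
        end = start + len(factor)
        positions = range(start, end)
        # Order of the suffixes of the factor itself.
        local = sorted(positions, key=lambda p: s[p:end])
        # Order of the suffixes of the whole text starting at the same positions.
        global_ = sorted(positions, key=lambda p: s[p:])
        if local != global_:
            return False
        start = end
    return True


print(lyndon_factorization("banana"))        # ['b', 'an', 'an', 'a']
print(local_order_matches_global("banana"))  # True
```

The brute-force check re-sorts suffixes with plain string comparison, so it is only suitable for small inputs; its purpose is to make the stated property concrete, not to sort suffixes efficiently.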

    A Quick Tour on Suffix Arrays and Compressed Suffix Arrays

    Suffix arrays are a key data structure for solving a range of problems on texts and sequences, from data compression and information retrieval to biological sequence analysis and pattern discovery. In their simplest version, they can be seen as a permutation of the elements in {1,2,…,n} that encodes the sorted sequence of suffixes of a given text of length n under the lexicographic order. Yet they are on a par with the ubiquitous and sophisticated suffix trees. Over the years, many interesting combinatorial properties have been established for this special class of permutations: for instance, they can implicitly encode extra information, and they are a well-characterized subset of the n! permutations. This paper gives a short tutorial on suffix arrays and their compressed versions, exploring and reviewing some of their algorithmic features and discussing the space issues related to their use in text indexing, combinatorial pattern matching, and data compression.
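As a minimal sketch of the "permutation of {1,2,…,n}" view described in this abstract, the following builds a suffix array by directly sorting suffixes and uses it for pattern lookup by binary search. It is a naive construction for illustration only, not one of the efficient or compressed constructions the tutorial covers; the names `suffix_array` and `occurrences` are hypothetical.

```python
def suffix_array(text):
    """Return the 1-based starting positions of the suffixes of text in lexicographic order."""
    # Naive construction: sort positions by the suffix that starts there.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def occurrences(text, sa, pattern):
    """Return the sorted 1-based positions where pattern occurs, via binary search on sa."""
    m = len(pattern)

    def prefix(rank):
        # First m characters of the suffix with the given rank.
        start = sa[rank] - 1
        return text[start:start + m]

    # Lower bound: first rank whose truncated suffix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose truncated suffix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


sa = suffix_array("banana")
print(sa)                                # [6, 4, 2, 1, 5, 3]
print(occurrences("banana", sa, "ana"))  # [2, 4]
```

All suffixes that begin with the pattern occupy a contiguous range of the suffix array, which is why two binary searches are enough to locate every occurrence.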

    Indexing Highly Repetitive String Collections

    Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated applications like bioinformatics, the string collections experienced a growth that outpaces Moore's Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to exploit it properly. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey we cover the algorithmic developments that have led to these data structures. We describe the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the basis of all the existing indexes, and the various structures that have been proposed, comparing them in both theoretical and practical aspects. We conclude with the current challenges in this fascinating field.

    Librería de Estructuras de Datos Compactas en Rust

    [Abstract]: The exponential growth of data nowadays poses significant challenges in terms of efficient storage and processing. This undergraduate thesis focuses on the importance of compact data structures as a key solution for handling large-scale data. Unlike classical compression techniques, these structures allow operations on data without the need for complete decompression, saving time and memory space. This approach has become essential in fields such as information retrieval and bioinformatics due to the massive growth of data. The Rust programming language, known for its safety, automatic memory management, and efficiency, has become one of the preferred options at present for innovation and modernization in the technology industry. In the absence of a compact data structures library in Rust that is competitive with the state of the art in other programming languages, this project will leverage the advantages provided by this language to develop an open-source compact data structures library. This will provide the scientific community and Rust developers with a flexible, powerful, and easy-to-use tool for their projects. Furthermore, this undergraduate thesis aims to promote reproducibility, reuse, and progress in research in the field of compact data structures. In this way, it will contribute to the expansion and adoption of Rust in research and to the development of efficient and reliable scientific software. Undergraduate thesis (UDC.FIC). Computer Engineering. Academic year 2022/202

    ALFALFA : fast and accurate mapping of long next generation sequencing reads
