157 research outputs found

    When a Dollar Makes a BWT

    Get PDF
    TheBurrows-Wheeler-Transform(BWT)isareversiblestring transformation which plays a central role in text compression and is fun- damental in many modern bioinformatics applications. The BWT is a permutation of the characters, which is in general better compressible and allows to answer several different query types more efficiently than the original string. It is easy to see that not every string is a BWT image, and exact charac- terizations of BWT images are known. We investigate a related combi- natorial question. In many applications, a sentinel character isaddedtomarktheendofthestring,andthustheBWTofastringendingwith is added to mark the end of the string, and thus the BWT of a string ending with contains exactly one character.Weask,givenastringw,inwhichpositions,ifany,canthe character. We ask, given a string w, in which positions, if any, can the -character be inserted to turn w into the BWT image of a word ending with the sentinel character. We show that this depends only on the standard permutation of w and give a combinatorial characterization of such positions via this permutation. We then develop an O(n log n)-time algorithm for identifying all such positions, improving on the naive quadratic time algorithm

    Bicriteria data compression

    Get PDF
    The advent of massive datasets (and the consequent design of high-performing distributed storage systems) have reignited the interest of the scientific and engineering community towards the design of lossless data compressors which achieve effective compression ratio and very efficient decompression speed. Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because of its decompression speed and its flexibility in trading decompression speed versus compressed-space efficiency. Each of the existing implementations offers a trade-off between space occupancy and decompression speed, so software engineers have to content themselves by picking the one which comes closer to the requirements of the application in their hands. Starting from these premises, and for the first time in the literature, we address in this paper the problem of trading optimally, and in a principled way, the consumption of these two resources by introducing the Bicriteria LZ77-Parsing problem, which formalizes in a principled way what data-compressors have traditionally approached by means of heuristics. The goal is to determine an LZ77 parsing which minimizes the space occupancy in bits of the compressed file, provided that the decompression time is bounded by a fixed amount (or vice-versa). This way, the software engineer can set its space (or time) requirements and then derive the LZ77 parsing which optimizes the decompression speed (or the space occupancy, respectively). We solve this problem efficiently in O(n log^2 n) time and optimal linear space within a small, additive approximation, by proving and deploying some specific structural properties of the weighted graph derived from the possible LZ77-parsings of the input file. The preliminary set of experiments shows that our novel proposal dominates all the highly engineered competitors, hence offering a win-win situation in theory&practice

    Transform Based And Search Aware Text Compression Schemes And Compressed Domain Text Retrieval

    Get PDF
    In recent times, we have witnessed an unprecedented growth of textual information via the Internet, digital libraries and archival text in many applications. While a good fraction of this information is of transient interest, useful information of archival value will continue to accumulate. We need ways to manage, organize and transport this data from one point to the other on data communications links with limited bandwidth. We must also have means to speedily find the information we need from this huge mass of data. Sometimes, a single site may also contain large collections of data such as a library database, thereby requiring an efficient search mechanism even to search within the local data. To facilitate the information retrieval, an emerging ad hoc standard for uncompressed text is XML which preprocesses the text by putting additional user defined metadata such as DTD or hyperlinks to enable searching with better efficiency and effectiveness. This increases the file size considerably, underscoring the importance of applying text compression. On account of efficiency (in terms of both space and time), there is a need to keep the data in compressed form for as much as possible. Text compression is concerned with techniques for representing the digital text data in alternate representations that takes less space. Not only does it help conserve the storage space for archival and online data, it also helps system performance by requiring less number of secondary storage (disk or CD Rom) accesses and improves the network transmission bandwidth utilization by reducing the transmission time. Unlike static images or video, there is no international standard for text compression, although compressed formats like .zip, .gz, .Z files are increasingly being used. In general, data compression methods are classified as lossless or lossy. Lossless compression allows the original data to be recovered exactly. Although used primarily for text data, lossless compression algorithms are useful in special classes of images such as medical imaging, finger print data, astronomical images and data bases containing mostly vital numerical data, tables and text information. Many lossy algorithms use lossless methods at the final stage of the encoding stage underscoring the importance of lossless methods for both lossy and lossless compression applications. In order to be able to effectively utilize the full potential of compression techniques for the future retrieval systems, we need efficient information retrieval in the compressed domain. This means that techniques must be developed to search the compressed text without decompression or only with partial decompression independent of whether the search is done on the text or on some inversion table corresponding to a set of key words for the text. In this dissertation, we make the following contributions: (1) Star family compression algorithms: We have proposed an approach to develop a reversible transformation that can be applied to a source text that improves existing algorithm\u27s ability to compress. We use a static dictionary to convert the English words into predefined symbol sequences. These transformed sequences create additional context information that is superior to the original text. Thus we achieve some compression at the preprocessing stage. We have a series of transforms which improve the performance. Star transform requires a static dictionary for a certain size. To avoid the considerable complexity of conversion, we employ the ternary tree data structure that efficiently converts the words in the text to the words in the star dictionary in linear time. (2) Exact and approximate pattern matching in Burrows-Wheeler transformed (BWT) files: We proposed a method to extract the useful context information in linear time from the BWT transformed text. The auxiliary arrays obtained from BWT inverse transform brings logarithm search time. Meanwhile, approximate pattern matching can be performed based on the results of exact pattern matching to extract the possible candidate for the approximate pattern matching. Then fast verifying algorithm can be applied to those candidates which could be just small parts of the original text. We present algorithms for both k-mismatch and k-approximate pattern matching in BWT compressed text. A typical compression system based on BWT has Move-to-Front and Huffman coding stages after the transformation. We propose a novel approach to replace the Move-to-Front stage in order to extend compressed domain search capability all the way to the entropy coding stage. A modification to the Move-to-Front makes it possible to randomly access any part of the compressed text without referring to the part before the access point. (3) Modified LZW algorithm that allows random access and partial decoding for the compressed text retrieval: Although many compression algorithms provide good compression ratio and/or time complexity, LZW is the first one studied for the compressed pattern matching because of its simplicity and efficiency. Modifications on LZW algorithm provide the extra advantage for fast random access and partial decoding ability that is especially useful for text retrieval systems. Based on this algorithm, we can provide a dynamic hierarchical semantic structure for the text, so that the text search can be performed on the expected level of granularity. For example, user can choose to retrieve a single line, a paragraph, or a file, etc. that contains the keywords. More importantly, we will show that parallel encoding and decoding algorithm is trivial with the modified LZW. Both encoding and decoding can be performed with multiple processors easily and encoding and decoding process are independent with respect to the number of processors

    Sensitivity of the Burrows-Wheeler Transform to small modifications, and other problems on string compressors in Bioinformatics

    Get PDF
    Extensive amount of data is produced in textual form nowadays, especially in bioinformatics. Several algorithms exist to store and process this data efficiently in compressed space. In this thesis, we focus on both combinatorial and practical aspects of two of the most widely used algorithms for compressing text in bioinformatics: the Burrows-Wheeler Transform (BWT) and Lempel-Ziv compression (LZ77). In the first part, we focus on combinatorial aspects of the BWT. Given a word v, r = r(v) denotes the number of maximal equal-letter runs in BWT(v). First, we investigate the relationship between r of a word and r of its reverse. We prove that there exist words for which these two values differ by a logarithmic factor in the length of the word. In other words, although the repetitiveness in the two words is preserved, the number of runs can change by a non-constant factor. This suggests that the number of runs may not be an ideal repetitiveness measure. The second combinatorial aspect we are interested in is how small alterations in a word may affect its BWT in a relevant way. We prove that the number of runs of the BWT of a word can change (increase or decrease) by up to a logarithmic factor in the length of the word by just adding, removing, or substituting a single character. We then consider the special character usedinreal−lifeapplicationstomarktheendofaword.WeinvestigatetheimpactofthischaracteronwordswithrespecttotheBWT.Wecharacterizepositionsinawordwhere used in real-life applications to mark the end of a word. We investigate the impact of this character on words with respect to the BWT. We characterize positions in a word where can be inserted in order to turn it into the BWT of a −terminatedwordoverthesamealphabet.Weshowthat,whetherandwhere-terminated word over the same alphabet. We show that, whether and where is allowed, depends entirely on the structure of a specific permutation of the indices of the word, which is called the standard permutation of the word. The final part of this thesis treats more applied aspects of text compressors. In bioinformatics, BWT-based compressed data structures are widely used for pattern matching. We give an algorithm based on the BWT to find Maximal Unique Matches (MUMs) of a pattern with respect to a reference text in compressed space, extending an existing tool called PHONI [Boucher et. al, DCC 2021]. Finally, we study some aspects of the Lempel-Ziv 77 (LZ77) factorization of a word. Modeling DNA short reads, we provide a bound on the compression size of the concatenation of regular samples of a word

    Parallel Suffix Arrays for Corpus Exploration

    Get PDF

    Processing and indexing large biological datasets using the Burrows-Wheeler Transform of string collections

    Get PDF
    In the last few decades, the advent of next-generation sequencing technologies (NGS) has dramatically reduced the cost of DNA sequencing. This has made it possible to sequence many genomes in very little time, paving the way for projects which aim at the creation of large and repetitive collections of genomic sequences. The abundance of biological data is driving the development of new memory-efficient algorithms and data structures that can scale for large datasets, thus tackling the high computational burden related to processing these data. This trend has a strong impact on the text algorithms area. In this thesis, we will study the Burrows-Wheeler Transform for processing, indexing, and compressing collections of strings. Data compression addresses the problem of encoding the input to reduce the space needed for storing it, while text indexing focuses on finding ways to efficiently process and extract information from the data. In bioinformatics, these two concepts have been frequently used together since they allow the design of data structures that can efficiently process biological data while keeping the input compressed. The Burrows-Wheeler Transform (BWT) is a reversible transformation on strings introduced by Michael Burrows and David J. Wheeler in 1994 that plays a central role in this area. It is the key component of several compressed data structures for text processing, like the FM-index [Ferraggina and Manzini, SODA, 2000] or the r-index [Gagie et al., SODA, 2018], and some of the most important software in bioinformatics, such as the well-known Bowtie [Langmead et al., Genome Biology, 2009] and BWA [Li and Durbin, Bioinformatics, 2010]. The BWT was originally defined for individual strings, so when the focus moved from single sequences to string collections, there was the need to extend this transform. Over the years, several different tools and algorithms for computing BWT of string collections were introduced. However, even if the transforms generated by these tools frequently differ from each other, the problem of characterizing the BWT variants was never addressed properly. In this thesis, we close this gap by presenting the first systematic study of the BWT of string collections. We identified five non-equivalent variants computed by the tools in current use and analyzed their properties to show how exactly they differ. We complete our theoretical analysis by comparing the five BWT variants on several real-life biological datasets. We show that not only the differences among the resulting transforms can be extensive, but they also lead to significant changes in the compressibility of the BWT of the underlying string collection. As a further complication, the BWT variants in use often depend on the input order of the sequences. This significantly impacts the number of runs r, which defines the size of BWT-based compressed data structures. In this thesis, we address the problem of reordering the input sequences by providing the first implementation of the algorithm of Bentley et al. [ESA 2020], which computes the order minimizing the number of runs of the BWT. This leads to the creation of the first tool for computing the optimal BWT, i.e., the BWT variant which guarantees the minimum number of runs. We show experimentally that the input order can dramatically affect the final result: on our real-life datasets, the optimal BWT had up to 31 times fewer runs than the BWT computed without reordering the input sequences. The extended BWT (eBWT) of Mantaci et al. [Theor. Comput. Sci. 2007] is one of the first BWT variants explicitly designed to process string collections. Even though this transform is mathematically sound and has useful properties, its construction has been a problem for more than a decade. In this thesis, we present two linear-time algorithms for computing the eBWT of large string collections. The first is an improvement of the Bijective BWT construction algorithm of Bannai et al. [CPM 2019], while the second uses the Prefix-free parsing (PFP) method [Boucher et al., Algorithms Mol. Biol., 2019] to specifically process large and repetitive genomic sequence collections. In the final part of the thesis, we conclude by studying, for the first time, how to index string collections using the eBWT. We present the extended r-index, an extension of the r-index to the eBWT, which maintains the same performance as the original r-index while inheriting the properties of the eBWT. We implemented this data structure using a variant of the PFP algorithm and tested it on real-life biological datasets containing circular bacterial genomes and plasmids. We show experimentally that our index has competitive query times compared to the r-index on different pattern lengths while supporting advanced pattern matching functionalities on circular sequences

    Personal genome editing algorithms to identify increased variant-induced off-target potential

    Get PDF
    Clustered regularly interspaced short palindromic repeats (CRISPR) technologies allow for facile genomic modification in a site-specific manner. A key step in this process is the in-silico design of single guide RNAs (sgRNAs) to efficiently and specifically target a site of interest. To this end, it is necessary to enumerate all potential off-target sites within a given genome that could be inadvertently altered by nuclease-mediated cleavage. Off-target sites are quasi-complementary regions of the genome in which the specified sgRNA can bind, even without a perfect complementary nucleotides sequence. This problem is known as off-target sites enumeration and became common after discovery of CRISPR technology. To solve this problem, many in-silico solutions were proposed in the last years but, currently available software for this task are limited by computational efficiency, variant support, genetic annotation, assessment of the functional impact of potential off-target effects at population and individual level, and a user-friendly graphical interface designed to be usable by non-informatician without any programming knowledge. This thesis addresses all these topics by proposing two software to directly answer the off-target enumeration problem and perform all the related analysis. In details, the thesis proposes CRISPRitz, a tool designed and developed to compute fast and exhaustive searches on reference and alternative genome to enumerate all the possible off-target for a user-defined set of sgRNAs with specific thresholds of mismatches (non-complementary bps in RNA-DNA binding) and bulges (bubbles that alters the physical structure of RNA and DNA limiting the binding activity). The thesis also proposes CRISPRme, a tool developed starting from CRISPRitz, which answers the requests of professionals and technicians to implement a comprehensive and easy to use interface to perform off-target enumeration, analysis and assessment, with graphical reports, a graphical interface and the capability of performing real-time query on the resulting data to extract desired targets, with a focus on individual and personalized genome analysis

    Efficient approximate string matching techniques for sequence alignment

    Get PDF
    One of the outstanding milestones achieved in recent years in the field of biotechnology research has been the development of high-throughput sequencing (HTS). Due to the fact that at the moment it is technically impossible to decode the genome as a whole, HTS technologies read billions of relatively short chunks of a genome at random locations. Such reads then need to be located within a reference for the species being studied (that is aligned or mapped to the genome): for each read one identifies in the reference regions that share a large sequence similarity with it, therefore indicating what the read¿s point or points of origin may be. HTS technologies are able to re-sequence a human individual (i.e. to establish the differences between his/her individual genome and the reference genome for the human species) in a very short period of time. They have also paved the way for the development of a number of new protocols and methods, leading to novel insights in genomics and biology in general. However, HTS technologies also pose a challenge to traditional data analysis methods; this is due to the sheer amount of data to be processed and the need for improved alignment algorithms that can generate accurate results quickly. This thesis tackles the problem of sequence alignment as a step within the analysis of HTS data. Its contributions focus on both the methodological aspects and the algorithmic challenges towards efficient, scalable, and accurate HTS mapping. From a methodological standpoint, this thesis strives to establish a comprehensive framework able to assess the quality of HTS mapping results. In order to be able to do so one has to understand the source and nature of mapping conflicts, and explore the accuracy limits inherent in how sequence alignment is performed for current HTS technologies. From an algorithmic standpoint, this work introduces state-of-the-art index structures and approximate string matching algorithms. They contribute novel insights that can be used in practical applications towards efficient and accurate read mapping. More in detail, first we present methods able to reduce the storage space taken by indexes for genome-scale references, while still providing fast query access in order to support effective search algorithms. Second, we describe novel filtering techniques that vastly reduce the computational requirements of sequence mapping, but are nonetheless capable of giving strict algorithmic guarantees on the completeness of the results. Finally, this thesis presents new incremental algorithmic techniques able to combine several approximate string matching algorithms; this leads to efficient and flexible search algorithms allowing the user to reach arbitrary search depths. All algorithms and methodological contributions of this thesis have been implemented as components of a production aligner, the GEM-mapper, which is publicly available, widely used worldwide and cited by a sizeable body of literature. It offers flexible and accurate sequence mapping while outperforming other HTS mappers both as to running time and to the quality of the results it produces.Uno de los avances más importantes de los últimos años en el campo de la biotecnología ha sido el desarrollo de las llamadas técnicas de secuenciación de alto rendimiento (high-throughput sequencing, HTS). Debido a las limitaciones técnicas para secuenciar un genoma, las técnicas de alto rendimiento secuencian individualmente billones de pequeñas partes del genoma provenientes de regiones aleatorias. Posteriormente, estas pequeñas secuencias han de ser localizadas en el genoma de referencia del organismo en cuestión. Este proceso se denomina alineamiento - o mapeado - y consiste en identificar aquellas regiones del genoma de referencia que comparten una alta similaridad con las lecturas producidas por el secuenciador. De esta manera, en cuestión de horas, la secuenciación de alto rendimiento puede secuenciar un individuo y establecer las diferencias de este con el resto de la especie. En última instancia, estas tecnologías han potenciado nuevos protocolos y metodologías de investigación con un profundo impacto en el campo de la genómica, la medicina y la biología en general. La secuenciación alto rendimiento, sin embargo, supone un reto para los procesos tradicionales de análisis de datos. Debido a la elevada cantidad de datos a analizar, se necesitan nuevas y mejoradas técnicas algorítmicas que puedan escalar con el volumen de datos y producir resultados precisos. Esta tesis aborda dicho problema. Las contribuciones que en ella se realizan se enfocan desde una perspectiva metodológica y otra algorítmica que propone el desarrollo de nuevos algoritmos y técnicas que permitan alinear secuencias de manera eficiente, precisa y escalable. Desde el punto de vista metodológico, esta tesis analiza y propone un marco de referencia para evaluar la calidad de los resultados del alineamiento de secuencias. Para ello, se analiza el origen de los conflictos durante la alineación de secuencias y se exploran los límites alcanzables en calidad con las tecnologías de secuenciación de alto rendimiento. Desde el punto de vista algorítmico, en el contexto de la búsqueda aproximada de patrones, esta tesis propone nuevas técnicas algorítmicas y de diseño de índices con el objetivo de mejorar la calidad y el desempeño de las herramientas dedicadas a alinear secuencias. En concreto, esta tesis presenta técnicas de diseño de índices genómicos enfocados a obtener un acceso más eficiente y escalable. También se presentan nuevas técnicas algorítmicas de filtrado con el fin de reducir el tiempo de ejecución necesario para alinear secuencias. Y, por último, se proponen algoritmos incrementales y técnicas híbridas para combinar métodos de alineamiento y mejorar el rendimiento en búsquedas donde el error esperado es alto. Todo ello sin degradar la calidad de los resultados y con garantías formales de precisión. Para concluir, es preciso apuntar que todos los algoritmos y metodologías propuestos en esta tesis están implementados y forman parte del alineador GEM. Este versátil alineador ofrece resultados de alta calidad en entornos de producción siendo varias veces más rápido que otros alineadores. En la actualidad este software se ofrece gratuitamente, tiene una amplia comunidad de usuarios y ha sido citado en numerosas publicaciones científicas

    Efficient approximate string matching techniques for sequence alignment

    Get PDF
    One of the outstanding milestones achieved in recent years in the field of biotechnology research has been the development of high-throughput sequencing (HTS). Due to the fact that at the moment it is technically impossible to decode the genome as a whole, HTS technologies read billions of relatively short chunks of a genome at random locations. Such reads then need to be located within a reference for the species being studied (that is aligned or mapped to the genome): for each read one identifies in the reference regions that share a large sequence similarity with it, therefore indicating what the read¿s point or points of origin may be. HTS technologies are able to re-sequence a human individual (i.e. to establish the differences between his/her individual genome and the reference genome for the human species) in a very short period of time. They have also paved the way for the development of a number of new protocols and methods, leading to novel insights in genomics and biology in general. However, HTS technologies also pose a challenge to traditional data analysis methods; this is due to the sheer amount of data to be processed and the need for improved alignment algorithms that can generate accurate results quickly. This thesis tackles the problem of sequence alignment as a step within the analysis of HTS data. Its contributions focus on both the methodological aspects and the algorithmic challenges towards efficient, scalable, and accurate HTS mapping. From a methodological standpoint, this thesis strives to establish a comprehensive framework able to assess the quality of HTS mapping results. In order to be able to do so one has to understand the source and nature of mapping conflicts, and explore the accuracy limits inherent in how sequence alignment is performed for current HTS technologies. From an algorithmic standpoint, this work introduces state-of-the-art index structures and approximate string matching algorithms. They contribute novel insights that can be used in practical applications towards efficient and accurate read mapping. More in detail, first we present methods able to reduce the storage space taken by indexes for genome-scale references, while still providing fast query access in order to support effective search algorithms. Second, we describe novel filtering techniques that vastly reduce the computational requirements of sequence mapping, but are nonetheless capable of giving strict algorithmic guarantees on the completeness of the results. Finally, this thesis presents new incremental algorithmic techniques able to combine several approximate string matching algorithms; this leads to efficient and flexible search algorithms allowing the user to reach arbitrary search depths. All algorithms and methodological contributions of this thesis have been implemented as components of a production aligner, the GEM-mapper, which is publicly available, widely used worldwide and cited by a sizeable body of literature. It offers flexible and accurate sequence mapping while outperforming other HTS mappers both as to running time and to the quality of the results it produces.Uno de los avances más importantes de los últimos años en el campo de la biotecnología ha sido el desarrollo de las llamadas técnicas de secuenciación de alto rendimiento (high-throughput sequencing, HTS). Debido a las limitaciones técnicas para secuenciar un genoma, las técnicas de alto rendimiento secuencian individualmente billones de pequeñas partes del genoma provenientes de regiones aleatorias. Posteriormente, estas pequeñas secuencias han de ser localizadas en el genoma de referencia del organismo en cuestión. Este proceso se denomina alineamiento - o mapeado - y consiste en identificar aquellas regiones del genoma de referencia que comparten una alta similaridad con las lecturas producidas por el secuenciador. De esta manera, en cuestión de horas, la secuenciación de alto rendimiento puede secuenciar un individuo y establecer las diferencias de este con el resto de la especie. En última instancia, estas tecnologías han potenciado nuevos protocolos y metodologías de investigación con un profundo impacto en el campo de la genómica, la medicina y la biología en general. La secuenciación alto rendimiento, sin embargo, supone un reto para los procesos tradicionales de análisis de datos. Debido a la elevada cantidad de datos a analizar, se necesitan nuevas y mejoradas técnicas algorítmicas que puedan escalar con el volumen de datos y producir resultados precisos. Esta tesis aborda dicho problema. Las contribuciones que en ella se realizan se enfocan desde una perspectiva metodológica y otra algorítmica que propone el desarrollo de nuevos algoritmos y técnicas que permitan alinear secuencias de manera eficiente, precisa y escalable. Desde el punto de vista metodológico, esta tesis analiza y propone un marco de referencia para evaluar la calidad de los resultados del alineamiento de secuencias. Para ello, se analiza el origen de los conflictos durante la alineación de secuencias y se exploran los límites alcanzables en calidad con las tecnologías de secuenciación de alto rendimiento. Desde el punto de vista algorítmico, en el contexto de la búsqueda aproximada de patrones, esta tesis propone nuevas técnicas algorítmicas y de diseño de índices con el objetivo de mejorar la calidad y el desempeño de las herramientas dedicadas a alinear secuencias. En concreto, esta tesis presenta técnicas de diseño de índices genómicos enfocados a obtener un acceso más eficiente y escalable. También se presentan nuevas técnicas algorítmicas de filtrado con el fin de reducir el tiempo de ejecución necesario para alinear secuencias. Y, por último, se proponen algoritmos incrementales y técnicas híbridas para combinar métodos de alineamiento y mejorar el rendimiento en búsquedas donde el error esperado es alto. Todo ello sin degradar la calidad de los resultados y con garantías formales de precisión. Para concluir, es preciso apuntar que todos los algoritmos y metodologías propuestos en esta tesis están implementados y forman parte del alineador GEM. Este versátil alineador ofrece resultados de alta calidad en entornos de producción siendo varias veces más rápido que otros alineadores. En la actualidad este software se ofrece gratuitamente, tiene una amplia comunidad de usuarios y ha sido citado en numerosas publicaciones científicas.Postprint (published version
    • …
    corecore