7 research outputs found

    XBWT Tricks

    Get PDF
    The eXtended Burrows-Wheeler Transform (XBWT) is a data transformation introduced in [Ferragina et al., FOCS 2005] to com- pactly represent a labeled tree and simultaneously support navigation and path-search operations over its label structure. A natural application of the XBWT is to store a dictionary of strings. A recent extensive experimental study [Martı́nez-Prieto et al., Informa- tion Systems, 2016] shows that, among the available string dictionary implementations, the XBWT is attractive because of its good tradeoff between small space usage, speed, and support for substring searches. In this paper we further investigate the use of the XBWT for storing a string dictionary. Our first contribution is to show how to add suffix links (aka failure links) to a XBWT string dictionary. For a XBWT dictionary with n internal nodes our suffix links can be traversed in constant time and only take 2n + o(n) bits of space. Our second contribution are practical construction algorithms for the XBWT, including the additional data structure supporting the traver- sal of suffix links. Our algorithms build on the many well engineered algorithms for Suffix Array and BWT construction and offer different tradeoffs between running time and working space

    Processing and indexing large biological datasets using the Burrows-Wheeler Transform of string collections

    Get PDF
    In the last few decades, the advent of next-generation sequencing technologies (NGS) has dramatically reduced the cost of DNA sequencing. This has made it possible to sequence many genomes in very little time, paving the way for projects which aim at the creation of large and repetitive collections of genomic sequences. The abundance of biological data is driving the development of new memory-efficient algorithms and data structures that can scale for large datasets, thus tackling the high computational burden related to processing these data. This trend has a strong impact on the text algorithms area. In this thesis, we will study the Burrows-Wheeler Transform for processing, indexing, and compressing collections of strings. Data compression addresses the problem of encoding the input to reduce the space needed for storing it, while text indexing focuses on finding ways to efficiently process and extract information from the data. In bioinformatics, these two concepts have been frequently used together since they allow the design of data structures that can efficiently process biological data while keeping the input compressed. The Burrows-Wheeler Transform (BWT) is a reversible transformation on strings introduced by Michael Burrows and David J. Wheeler in 1994 that plays a central role in this area. It is the key component of several compressed data structures for text processing, like the FM-index [Ferraggina and Manzini, SODA, 2000] or the r-index [Gagie et al., SODA, 2018], and some of the most important software in bioinformatics, such as the well-known Bowtie [Langmead et al., Genome Biology, 2009] and BWA [Li and Durbin, Bioinformatics, 2010]. The BWT was originally defined for individual strings, so when the focus moved from single sequences to string collections, there was the need to extend this transform. Over the years, several different tools and algorithms for computing BWT of string collections were introduced. However, even if the transforms generated by these tools frequently differ from each other, the problem of characterizing the BWT variants was never addressed properly. In this thesis, we close this gap by presenting the first systematic study of the BWT of string collections. We identified five non-equivalent variants computed by the tools in current use and analyzed their properties to show how exactly they differ. We complete our theoretical analysis by comparing the five BWT variants on several real-life biological datasets. We show that not only the differences among the resulting transforms can be extensive, but they also lead to significant changes in the compressibility of the BWT of the underlying string collection. As a further complication, the BWT variants in use often depend on the input order of the sequences. This significantly impacts the number of runs r, which defines the size of BWT-based compressed data structures. In this thesis, we address the problem of reordering the input sequences by providing the first implementation of the algorithm of Bentley et al. [ESA 2020], which computes the order minimizing the number of runs of the BWT. This leads to the creation of the first tool for computing the optimal BWT, i.e., the BWT variant which guarantees the minimum number of runs. We show experimentally that the input order can dramatically affect the final result: on our real-life datasets, the optimal BWT had up to 31 times fewer runs than the BWT computed without reordering the input sequences. The extended BWT (eBWT) of Mantaci et al. [Theor. Comput. Sci. 2007] is one of the first BWT variants explicitly designed to process string collections. Even though this transform is mathematically sound and has useful properties, its construction has been a problem for more than a decade. In this thesis, we present two linear-time algorithms for computing the eBWT of large string collections. The first is an improvement of the Bijective BWT construction algorithm of Bannai et al. [CPM 2019], while the second uses the Prefix-free parsing (PFP) method [Boucher et al., Algorithms Mol. Biol., 2019] to specifically process large and repetitive genomic sequence collections. In the final part of the thesis, we conclude by studying, for the first time, how to index string collections using the eBWT. We present the extended r-index, an extension of the r-index to the eBWT, which maintains the same performance as the original r-index while inheriting the properties of the eBWT. We implemented this data structure using a variant of the PFP algorithm and tested it on real-life biological datasets containing circular bacterial genomes and plasmids. We show experimentally that our index has competitive query times compared to the r-index on different pattern lengths while supporting advanced pattern matching functionalities on circular sequences

    Succinct Data Structures for Parameterized Pattern Matching and Related Problems

    Get PDF
    Let T be a fixed text-string of length n and P be a varying pattern-string of length |P| \u3c= n. Both T and P contain characters from a totally ordered alphabet Sigma of size sigma \u3c= n. Suffix tree is the ubiquitous data structure for answering a pattern matching query: report all the positions i in T such that T[i + k - 1] = P[k], 1 \u3c= k \u3c= |P|. Compressed data structures support pattern matching queries, using much lesser space than the suffix tree, mainly by relying on a crucial property of the leaves in the tree. Unfortunately, in many suffix tree variants (such as parameterized suffix tree, order-preserving suffix tree, and 2-dimensional suffix tree), this property does not hold. Consequently, compressed representations of these suffix tree variants have been elusive. We present the first compressed data structures for two important variants of the pattern matching problem: (1) Parameterized Matching -- report a position i in T if T[i + k - 1] = f(P[k]), 1 \u3c= k \u3c= |P|, for a one-to-one function f that renames the characters in P to the characters in T[i,i+|P|-1], and (2) Order-preserving Matching -- report a position i in T if T[i + j - 1] and T[i + k -1] have the same relative order as that of P[j] and P[k], 1 \u3c= j \u3c k \u3c= |P|. For each of these two problems, the existing suffix tree variant requires O(n*log n) bits of space and answers a query in O(|P|*log sigma + occ) time, where occ is the number of starting positions where a match exists. We present data structures that require O(n*log sigma) bits of space and answer a query in O((|P|+occ) poly(log n)) time. As a byproduct, we obtain compressed data structures for a few other variants, as well as introduce two new techniques (of independent interest) for designing compressed data structures for pattern matching
    corecore