577 research outputs found

    A framework for space-efficient read clustering in metagenomic samples

    Get PDF
    Background: A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Since samples are large and steadily growing, space-efficient clustering algorithms are strongly needed. Results: We design and implement a space-efficient algorithmic framework that solves a number of core primitives in unsupervised metagenomic clustering using just the bidirectional Burrows-Wheeler index and a union-find data structure on the set of reads. When run on a sample of total length n, with m reads of maximum length l each, on an alphabet of total size sigma, our algorithms take O(n(t + log sigma)) time and just 2n + o(n) + O(max{l sigma log n, K logm}) bits of space in addition to the index and to the union-find data structure, where K is a measure of the redundancy of the sample and t is the query time of the union-find data structure. Conclusions: Our experimental results show that our algorithms are practical, they can exploit multiple cores by a parallel traversal of the suffix-link tree, and they are competitive both in space and in time with the state of the art.Peer reviewe

    Parallel Construction of Wavelet Trees on Multicore Architectures

    Get PDF
    The wavelet tree has become a very useful data structure to efficiently represent and query large volumes of data in many different domains, from bioinformatics to geographic information systems. One problem with wavelet trees is their construction time. In this paper, we introduce two algorithms that reduce the time complexity of a wavelet tree's construction by taking advantage of nowadays ubiquitous multicore machines. Our first algorithm constructs all the levels of the wavelet in parallel in O(n)O(n) time and O(nlgσ+σlgn)O(n\lg\sigma + \sigma\lg n) bits of working space, where nn is the size of the input sequence and σ\sigma is the size of the alphabet. Our second algorithm constructs the wavelet tree in a domain-decomposition fashion, using our first algorithm in each segment, reaching O(lgn)O(\lg n) time and O(nlgσ+pσlgn/lgσ)O(n\lg\sigma + p\sigma\lg n/\lg\sigma) bits of extra space, where pp is the number of available cores. Both algorithms are practical and report good speedup for large real datasets.Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk{\l}odowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094

    Tilatehokas metagenomisten DNA-fragmenttien ryhmittely

    Get PDF
    The collection of all genomes in an environment is called the metagenome of the environment. In the past 15 years, high-throughput sequencing has made it feasible to sequence entire environments at once for the first time in history, which has resulted in a variety of interesting new algorithmic problems. This thesis focuses on the basic problem of clustering the reads from an environment according to which species, or more generally, taxonomic unit they originate from. In this work, we identify and formalize two fundamental string processing tasks useful in clustering metagenomic read sets. We solve the two problems with space efficiency in mind using the recently developed bidirectional Burrows-Wheeler index. The algorithms were implemented in a way which makes parallel processing possible. Our tool is experimentally shown to give good results for simple simulated datasets, and to use less than 10 times less space and time compared to two recently published metagenome clustering tools.Kaikkien ympäristössä esiintyvien genomien joukkoa kutsutaan kyseisen ympäristön \emph{metagenomiksi}. Viimeisen 15 vuoden aikana kehitetyt korkean läpisyötön sekvenssoriteknologiat ovat mahdollistaneet ensimmäistä kertaa historiassa kokonaisen ympäristön metagenomin kartoittamisen. Tämä kehityssuunta on johtanut uusiin mielenkiintoisiin algoritmisiin ongelmiin. Tämä työ käsittelee ympäristöistä näytteistettyjen DNA-fragmenttejen ryhmittelyä lajien, tai yleisemmin taksonomisten yksiköiden mukaan. Työssä tunnistetaan ja formalisoidaan kaksi merkkijono-ongelmaa, jotka ilmentyvät metagenomisten DNA-fragmentteja ryhmittelyssä. Ongelmiin esitetään tilatehokkaat ratkaisut käyttäen hiljattain kehitettyä kaksisuuntaista Burrows-Wheeler indeksiä. Algoritmit toteutettiin pitäen silmällä rinnakkaista laskentaa. Työssä osoitetaan, että uusi toteutus antaa hyviä tuloksia yksinkertaisille simuloiduille näytteille, ja että työkalu on kymmenen kertaa nopeampi ja tilatehokkaampi, kuin kaksi hiljattain julkaistua metagenomisten näytteiden ryhmittelyyn tarkoitettua työkalua

    Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

    Get PDF
    Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).Comment: 14 pages in Transactions of the Association for Computational Linguistics (TACL) 201

    Improving Data Structures and Algorithms

    Get PDF
    This thesis addresses important algorithms and data structures used in sequence analysis for applications such as read mapping. First, we give an overview on state-of-the-art FM indices and present the latest improvements. In particular, we will introduce a recently published FM index based on a new data structure: EPR dictionaries. This rank data structures allows search steps in constant time for unidirectional and bidirectional FM indices. To our knowledge this is the first and only constant-time implementation of a bidirectional FM index at the time of writing. We show that its running time is not only optimal in theory, but currently also outperforms all available FM index implementations in practice. Second, we cover approximate string matching in bidirectional indices. To improve the running time and make higher error rates suitable for index-based searches, we introduce an integer linear program for finding optimal search strategies. We show that it is significantly faster than other search strategies in indices and cover additional improvements such as hybrid approaches of index-based searches with in-text verification, i.e., at some point the partially matched string is located and verified directly in the text. Finally, we present a yet unpublished algorithm for fast computation of the mappability of genomic sequences. Mappability is a measure for the uniqueness of a genome by counting how often each kk-mer of the sequence occurs with a certain error threshold in the genome itself. We suggest two applications of mappability with prototype implementations: First, a read mapper incorporating the mappability information to improve the running time when mapping reads that match highly repetitive regions, and second, we use the mappability information to identify phylogenetic markers in a set of similar strains of the same species by the example of E. coli. Unique regions allow identifying and distinguishing even highly similar strains using unassembled sequencing data. The findings in this thesis can speed up many applications in bioinformatics as we demonstrate for read mapping and computation of mappability, and give suggestions for further research in this field
    corecore