264 research outputs found
Sorting suffixes of a text via its Lyndon Factorization
The process of sorting the suffixes of a text plays a fundamental role in
Text Algorithms. They are used for instance in the constructions of the
Burrows-Wheeler transform and the suffix array, widely used in several fields
of Computer Science. For this reason, several recent researches have been
devoted to finding new strategies to obtain effective methods for such a
sorting. In this paper we introduce a new methodology in which an important
role is played by the Lyndon factorization, so that the local suffixes inside
factors detected by this factorization keep their mutual order when extended to
the suffixes of the whole word. This property suggests a versatile technique
that easily can be adapted to different implementative scenarios.Comment: Submitted to the Prague Stringology Conference 2013 (PSC 2013
Efficient estimation of evolutionary distances
The advent of high throughput sequencers has lead to a dramatic increase in the size of available genomic data. Standard methods, which have worked well for many years, are not suitable for the analysis of big data sets, due to their reliance on a time-consuming alignment step. In this thesis, a new alignment-free approach for phylogeny reconstruction is introduced. The corresponding program, andi, is orders of magnitude faster than classical approaches and also superior to comparable alignment-free methods. The central data structure in andi is the enhanced suffix array. It is used to find long exact matches between sequences. In this thesis, various approaches to the construction of enhanced suffix arrays, including novel ones, are evaluated with respect to performance. Additionally, a new parallel algorithm for the computation of suffix arrays is introduced
Indexing Highly Repetitive String Collections
Two decades ago, a breakthrough in indexing string collections made it
possible to represent them within their compressed space while at the same time
offering indexed search functionalities. As this new technology permeated
through applications like bioinformatics, the string collections experienced a
growth that outperforms Moore's Law and challenges our ability of handling them
even in compressed form. It turns out, fortunately, that many of these rapidly
growing string collections are highly repetitive, so that their information
content is orders of magnitude lower than their plain size. The statistical
compression methods used for classical collections, however, are blind to this
repetitiveness, and therefore a new set of techniques has been developed in
order to properly exploit it. The resulting indexes form a new generation of
data structures able to handle the huge repetitive string collections that we
are facing.
In this survey we cover the algorithmic developments that have led to these
data structures. We describe the distinct compression paradigms that have been
used to exploit repetitiveness, the fundamental algorithmic ideas that form the
base of all the existing indexes, and the various structures that have been
proposed, comparing them both in theoretical and practical aspects. We conclude
with the current challenges in this fascinating field
Novel computational techniques for mapping and classifying Next-Generation Sequencing data
Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing.
In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred of published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm and only little attention has been paid to non-standard mapping approaches. Here, we propound the so-called dynamic mapping that we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments, helping to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing.
An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in the online fashion. We provide Ococo, the first online consensus caller that implements a smart statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk.
Metagenomic classification of NGS reads is another major topic studied in the thesis. Having a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge amount of NGS reads to tree nodes, and possibly estimate the relative abundance of involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for a quick k-mer look-up
- …