35 research outputs found

    Fast Spaced Seed Hashing

    Hashing k-mers is a common operation across many bioinformatics applications, widely used for indexing, querying and rapid similarity search. Recently, spaced seeds, a special type of pattern that accounts for errors or mutations, have come into routine use in place of k-mers. Spaced seeds improve sensitivity relative to k-mers in many applications; however, hashing spaced seeds substantially increases computation time. Hence, the ability to speed up the hashing of spaced seeds would have a major impact in the field, making spaced seed applications not only accurate but also faster and more efficient. In this paper we address the problem of efficient spaced seed hashing. The proposed algorithm exploits the similarity of adjacent spaced seed hash values in an input sequence to compute the next hash efficiently. We report a series of experiments on hashing NGS reads using several spaced seeds. In these experiments, our algorithm computes spaced seed hash values with a speedup over the traditional approach of between 1.6x and 5.3x, depending on the structure of the spaced seed.
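
    To make the notion of a spaced seed hash concrete, the following is a minimal sketch of the traditional position-by-position approach that the abstract uses as its baseline. It is not the paper's accelerated algorithm. The 2-bit nucleotide encoding, the seed written as a string of `1` (match) and `0` (don't care), and the function names are all assumptions made for illustration.

```python
# Illustrative 2-bit nucleotide encoding (an assumption, not taken from the paper).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def spaced_seed_hash(seq, start, seed):
    """Hash the window of seq starting at start, reading only positions where seed has '1'."""
    h = 0
    for offset, flag in enumerate(seed):
        if flag == "1":
            # Append the 2-bit code of this nucleotide to the hash.
            h = (h << 2) | ENCODE[seq[start + offset]]
    return h


def all_spaced_seed_hashes(seq, seed):
    """Baseline: recompute the hash from scratch for every window of the sequence."""
    return [spaced_seed_hash(seq, i, seed) for i in range(len(seq) - len(seed) + 1)]


# A mutation at a don't-care position leaves the spaced seed hash unchanged,
# whereas a contiguous k-mer (seed of all '1') would hash differently.
same = spaced_seed_hash("ACGT", 0, "1101") == spaced_seed_hash("ACTT", 0, "1101")
differs = spaced_seed_hash("ACGT", 0, "1111") != spaced_seed_hash("ACTT", 0, "1111")
```

    Each window is recomputed independently here, which is the cost the abstract says the proposed algorithm reduces by reusing information from adjacent hash values.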

    Improving metagenomic classification by boosting reference k-mers

    This thesis develops a solution to the problem of improving metagenomic classification by boosting the reference k-mers. The purpose of this report is to understand whether increasing the information available to the classifier helps improve classification at the genus and species level.

    Comparison of Metagenomics and Metatranscriptomics Tools: A Guide to Making the Right Choice

    The study of microorganisms is a field of great interest due to their environmental (e.g., soil contamination) and biomedical (e.g., parasitic diseases, autism) importance. The advent of revolutionary next-generation sequencing techniques, and their application to the hypervariable regions of the 16S, 18S or 23S ribosomal subunits, have allowed a large variety of organisms, including bacteria, archaea, eukaryotes and fungi, to be studied in greater depth. Additionally, together with the development of analysis software, the creation of specific databases (e.g., SILVA or RDP) has boosted the enormous growth of these studies. As the cost of sequencing per sample has continuously decreased, new protocols have also emerged, such as shotgun sequencing, which allows the profiling of all taxonomic domains in a sample. The sequencing of hypervariable regions and shotgun sequencing are technologies that enable the taxonomic classification of microorganisms from the DNA present in microbial communities. However, they cannot measure what is actively expressed. In contrast, we argue that metatranscriptomics is a “new” technology that makes it possible to identify the mRNAs of a microbial community, quantifying gene expression levels and active biological pathways. Furthermore, it can also be used to characterise symbiotic interactions between the host and its microbiome. In this manuscript, we examine the three technologies above and discuss the implementation of different software and databases, which greatly affects the reliability of the results. Finally, we have developed two easy-to-use pipelines leveraging Nextflow technology. These aim to provide everything required for an average user to perform a metagenomic analysis of marker genes with QIIME2 and a metatranscriptomic study using Kraken2/Bracken. Funding: regional Andalusian Government, POSTDOC_21 _0039

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and little attention has been paid to non-standard mapping approaches. Here, we propose so-called dynamic mapping, which we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping exploits the information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program that collects alignment statistics and guides updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which implements smart statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly to estimate the relative abundance of the species involved. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we propose ProPhyle, a new indexing strategy based on a BWT-index, which yields a much smaller and more informative index than Kraken. We provide a modified version of BWA that improves the BWT-index for quick k-mer look-up.
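
    The general idea of an online consensus caller with bounded per-position counters, as described in this abstract, can be sketched as follows. This is a hedged illustration only, not Ococo's actual statistics or API: the class name, the halving-on-saturation rule, the strict-majority update and the `min_coverage` threshold are all assumptions chosen for the example.

```python
class OnlineConsensus:
    """Toy online consensus caller: small saturating nucleotide counters per position."""

    BASES = "ACGT"

    def __init__(self, reference, counter_bits=4, min_coverage=3):
        self.consensus = list(reference)
        self.max_count = (1 << counter_bits) - 1  # largest value a counter may hold
        self.min_coverage = min_coverage          # hypothetical threshold before updating
        self.counts = [[0, 0, 0, 0] for _ in reference]

    def add_alignment(self, pos, read):
        """Record an ungapped alignment of read starting at reference position pos."""
        for i, base in enumerate(read):
            p = pos + i
            if base not in self.BASES or not 0 <= p < len(self.consensus):
                continue
            c = self.counts[p]
            idx = self.BASES.index(base)
            if c[idx] == self.max_count:
                # Counter is full: halve all four so they stay within counter_bits.
                for j in range(4):
                    c[j] >>= 1
            c[idx] += 1
            self._update(p)

    def _update(self, p):
        """Switch the consensus base once one nucleotide holds a strict majority."""
        c = self.counts[p]
        total = sum(c)
        best = max(range(4), key=lambda j: c[j])
        if total >= self.min_coverage and 2 * c[best] > total:
            self.consensus[p] = self.BASES[best]


caller = OnlineConsensus("ACGT")
for _ in range(3):
    caller.add_alignment(0, "ACTT")  # three reads supporting T at position 2
updated = "".join(caller.consensus)
```

    Because the consensus is revised as each alignment arrives, no alignments need to be stored, which matches the streaming use described in the abstract.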

    Development of bioinformatics tools for the rapid and sensitive detection of known and unknown pathogens from next generation sequencing data

    Infectious diseases remain one of the main causes of death across the globe. Despite huge advances in clinical diagnostics, establishing a clear etiology remains impossible in a proportion of cases. Since the emergence of next generation sequencing (NGS), a multitude of new research fields based on this technology have evolved. Its application in metagenomics (research on genomic material taken directly from its environment) in particular has led to a rapid development of new applications. Metagenomic NGS has proven to be a promising tool in the field of pathogen-related research and diagnostics. In this thesis, I present different approaches for the detection of known and the discovery of unknown pathogens from NGS data. These contributions comprise three newly developed methods and one publication on a real-world use case of the methodology we developed and the data analysis based on it. First, I present LiveKraken, a real-time read classification tool based on the core algorithm of Kraken. LiveKraken uses streams of raw data from Illumina sequencers to classify reads taxonomically. This way, we are able to produce results identical to those of Kraken the moment the sequencer finishes. We are furthermore able to provide comparable results in early stages of a sequencing run, saving up to a week of sequencing time. While the number of classified reads grows over time, false classifications appear in negligible numbers, and the proportions of identified taxa are affected only to a minor extent. In the second project, we designed and implemented PathoLive, a real-time diagnostics pipeline that allows the detection of pathogens from clinical samples before the sequencing procedure is finished. We adapted the core algorithm of HiLive, a real-time read mapper, and enhanced its accuracy for our use case. Furthermore, sequences that are probably irrelevant are automatically marked. The results are visualized in an interactive taxonomic tree that provides an intuitive overview and detailed metrics regarding the relevance of each identified pathogen. Testing PathoLive on the sequencing of a real plasma sample spiked with viruses, we showed that it ranked the results more accurately throughout the complete sequencing run than any other tested tool did at the end of the run. With PathoLive, we shift the focus of NGS-based diagnostics from read quantification towards a more meaningful assessment of results in unprecedented turnaround time. The third project aims at the detection of novel pathogens from NGS data. We developed RAMBO-K, a tool that allows rapid and sensitive removal of unwanted host sequences from NGS datasets. RAMBO-K is faster than any tool we tested, while showing consistently high sensitivity and specificity across different datasets. RAMBO-K rapidly and reliably separates reads from different species. It is suitable as a straightforward standard solution for workflows dealing with mixed datasets. In the fourth project, we used RAMBO-K as well as several other data analyses to discover Berlin squirrelpox virus, a divergent new poxvirus establishing a new genus of Poxviridae. Near Berlin, Germany, several juvenile red squirrels (Sciurus vulgaris) were found with moist, crusty skin lesions. Histology, electron microscopy, and cell culture isolation revealed an orthopoxvirus-like infection. After standard workflows yielded no significant results, poxviral reads were assigned using RAMBO-K, enabling the assembly of the genome of the novel virus. With these projects, we established three new application-related methods, each of which closes a different research gap. Taken together, we enhance the available repertoire of NGS-based tools for pathogen-related research and facilitate and accelerate a variety of research projects.

    Genetics of Halophilic Microorganisms

    Halophilic microorganisms are found in all domains of life and thrive in hypersaline (high salt content) environments. These unusual microbes have been a subject of study for many years due to their interesting properties and physiology. Studies of the genetics of halophilic microorganisms (from gene expression and regulation to genomics) have provided insight into the mechanisms by which life can exist at high salinity levels. Here, we highlight recent studies that advance the knowledge of biological function through examination of the genetics of halophilic microorganisms and their viruses.

    Evolutionary genomics: statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences, genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.