590 research outputs found

    ULTRA-FAST AND MEMORY-EFFICIENT LOOKUPS FOR CLOUD, NETWORKED SYSTEMS, AND MASSIVE DATA MANAGEMENT

    Systems that process big data (e.g., high-traffic networks and large-scale storage) prefer data structures and algorithms with small memory footprints and fast processing speed. Efficient and fast algorithms play an essential role in system design, despite improvements in hardware. This dissertation is organized around a novel algorithm called Othello Hashing. Othello Hashing supports ultra-fast and memory-efficient key-value lookup, and it fits the requirements of the core algorithms of many large-scale systems and big data applications. Using Othello Hashing, combined with domain expertise in cloud, computer networks, big data, and bioinformatics, I developed the following applications that resolve several major challenges in the area. Concise: Forwarding Information Base. A Forwarding Information Base (FIB) is a data structure used by the data plane of a forwarding device to determine the proper forwarding actions for packets. The polymorphic property of Othello Hashing allows the separation of its query and control functionalities, which is a perfect match for programmable networks such as Software Defined Networks. Using Othello Hashing, we built a fast and scalable FIB named Concise. Extensive evaluation results on three different platforms show that Concise outperforms other FIB designs. SDLB: Cloud Load Balancer. In a cloud network, a layer-4 load balancer is a device that acts as a reverse proxy and distributes network or application traffic across a number of servers. We built a software load balancer with Othello Hashing techniques named SDLB. SDLB accomplishes the two functions of a load balancer using one Othello query: finding the designated server for packets of ongoing sessions and distributing new or session-free packets. MetaOthello: Taxonomic Classification of Metagenomic Sequences. Metagenomic read classification is a critical step in the identification and quantification of microbial species sampled by high-throughput sequencing. 
Due to the growing popularity of metagenomic data in both basic science and clinical applications, as well as the increasing volume of data being generated, efficient and accurate algorithms are in high demand. We built a system to support efficient taxonomic classification of sequences using their k-mer signatures. SeqOthello: RNA-seq Sequence Search Engine. Advances in the study of functional genomics have produced a vast supply of RNA-seq datasets. However, how to quickly query and extract information from sequencing resources remains a challenging problem and has been the bottleneck for the broader dissemination of sequencing efforts. The challenge resides in both the sheer volume of the data and its unstructured representation. Using the Othello Hashing techniques, we built the SeqOthello sequence search engine. SeqOthello is a reference-free, alignment-free, and parameter-free sequence search system that supports arbitrary sequence queries against large collections of RNA-seq experiments, which enables large-scale integrative studies using sequence-level data
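
As a rough illustration of the k-mer signature idea mentioned in the abstract above, the sketch below classifies a read by majority vote over taxon-specific k-mers. It is only a sketch under stated assumptions: a plain Python dict stands in for the Othello structure, and the function names, the value of `K`, and the toy reference sequences are hypothetical and not taken from the dissertation.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the single taxon it occurs in.

    k-mers seen in more than one taxon are dropped, so every remaining
    key acts as a taxon-specific signature. (Assumption: a dict replaces
    the memory-efficient hash structure described in the abstract.)
    """
    index = {}
    ambiguous = set()
    for taxon, seq in references.items():
        for km in set(kmers(seq, k)):
            if km in ambiguous:
                continue
            if km in index and index[km] != taxon:
                del index[km]
                ambiguous.add(km)
            else:
                index[km] = taxon
    return index


def classify(read, index, k):
    """Return the taxon with the most signature k-mer hits, or None."""
    votes = Counter(index[km] for km in kmers(read, k) if km in index)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
K = 4
references = {
    "taxon_a": "ACGTACGTTGCA",
    "taxon_b": "TTTTGGGGCCCC",
}
index = build_index(references, K)
print(classify("ACGTTGCA", index, K))
```

A read with no signature k-mers in the index is left unclassified (`None`) rather than forced into a taxon.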

    NOVEL COMPUTATIONAL METHODS FOR SEQUENCING DATA ANALYSIS: MAPPING, QUERY, AND CLASSIFICATION

    Over the past decade, the evolution of next-generation sequencing technology has considerably advanced genomics research. As a consequence, fast and accurate computational methods are needed for analyzing large datasets in different applications. The research presented in this dissertation focuses on three areas: RNA-seq read mapping, large-scale data query, and metagenomic sequence classification. A critical step of RNA-seq data analysis is to map the RNA-seq reads onto a reference genome. This dissertation presents a novel splice alignment tool, MapSplice3. It achieves high read alignment and base mapping yields and is able to detect splice junctions, gene fusions, and circular RNAs comprehensively at the same time. Based on MapSplice3, we further develop a novel lightweight approach called iMapSplice that enables personalized mRNA transcriptional profiling. As a huge amount of RNA-seq data has been shared through public datasets, it provides an invaluable resource for researchers to test hypotheses by reusing existing datasets. To meet the need for efficiently querying large-scale sequencing data, a novel method called SeqOthello has been developed. It efficiently queries sequence k-mers against large-scale datasets and determines whether the given sequence is present. Metagenomics studies often generate tens of millions of reads to capture the presence of microbial organisms. Thus efficient and accurate algorithms are in high demand. In this dissertation, we introduce MetaOthello, a probabilistic hashing classifier for metagenomic sequences. It supports efficient query of a taxon using its k-mer signatures

    Recovering complete and draft population genomes from metagenome datasets.

    Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost of applying them to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., bins containing sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on genome-wide evolution

    Taxonomic classification of metagenomic sequences

    Gerlach W. Taxonomic classification of metagenomic sequences. Bielefeld: Universität; 2012. Bacteria, archaea and microeukaryotes can be found in almost every habitat present in nature, in particular in soil, sediments and sea water. They typically live in complex communities with different kinds of symbiotic associations which include relationships with larger organisms like animals or plants. Examples are microbial communities in the gut or on the skin of animals and humans, or bacteria that live in symbiosis with plants. The vast majority of such microbes are unculturable and thus cannot be sequenced by means of traditional methods. The emerging discipline of metagenomics provides various in vivo and in silico tools to overcome this limitation. In particular, high-throughput sequencing techniques like 454 or Solexa-Illumina make it possible to explore those microbes by studying whole natural microbial communities and analysing their biological diversity as well as the underlying metabolic pathways. A current limitation of these technologies is that they can sequence only DNA fragments of a limited length. With this limitation it is usually not possible to recover complete microbial genomes. In addition, the DNA fragments are drawn randomly from the microbial communities and the exact species of origin is unknown. Over the past few years, different methods have been developed for the taxonomic and functional characterization of metagenomic shotgun sequences. However, the taxonomic classification of metagenomic sequences from novel species without close homologues in the biological sequence databases poses a challenge due to the high number of wrong taxonomic predictions on lower taxonomic ranks. In this thesis we present CARMA3, a novel method for the taxonomic classification of assembled and unassembled metagenomic sequences that has been adapted to work with both BLAST and HMMER3 homology searches. 
CARMA3 accepts protein-encoding DNA sequences, protein sequences, and 16S-rDNA sequences as input. In addition, we present WebCARMA, a web application for the analysis of protein-encoding DNA sequences with CARMA3 without the need for a local installation. We evaluate our novel method in different experiments using simulated and real shotgun metagenomes and show that CARMA3 makes fewer wrong taxonomic predictions (at the same sensitivity) than other BLAST-based methods. In the last experiment we show that even very short reads can, in principle, be used to describe the taxonomic content of a metagenome

    Metagenomics: tools and insights for analyzing next-generation sequencing data derived from biodiversity studies

    Advances in next-generation sequencing (NGS) have allowed significant breakthroughs in microbial ecology studies. This has led to the rapid expansion of research in the field and the establishment of “metagenomics”, often defined as the analysis of DNA from microbial communities in environmental samples without prior need for culturing. Many metagenomics statistical/computational tools and databases have been developed in order to allow the exploitation of the huge influx of data. In this review article, we provide an overview of the sequencing technologies and how they are uniquely suited to various types of metagenomic studies. We focus on the currently available bioinformatics techniques, tools, and methodologies for performing each individual step of a typical metagenomic dataset analysis. We also provide future trends in the field with respect to tools and technologies currently under development. Moreover, we discuss data management, distribution, and integration tools that are capable of performing comparative metagenomic analyses of multiple datasets using well-established databases, as well as commonly used annotation standards

    Computational approaches for metagenomic analysis of high-throughput sequencing data

    High-throughput DNA sequencing has revolutionised microbiology and is the foundation on which the nascent field of metagenomics has been built. This ability to cheaply sample billions of DNA reads directly from environments has democratised sequencing and allowed researchers to gain unprecedented insights into diverse microbial communities. These technologies, however, are not without limitations: the short length of the reads requires the production of vast amounts of data to ensure all information is captured. This “data deluge” has been a major bottleneck and has necessitated the development of new algorithms for analysis. Sequence alignment methods provide the most information about the composition of a sample as they allow both taxonomic and functional classification, but the algorithms are prohibitively slow. This inefficiency has led to reliance on faster algorithms which only produce simple taxonomic classification or abundance estimation, losing the valuable information given by full alignments against annotated genomes. This thesis describes k-SLAM, a novel ultra-fast method for the alignment and taxonomic classification of metagenomic data. Using a k-mer based method, k-SLAM achieves speeds three orders of magnitude faster than current alignment-based approaches, making full taxonomic classification and gene identification tractable on modern large datasets. The alignments found by k-SLAM can also be used to find variants and identify genes, along with their nearest taxonomic origins. A novel pseudo-assembly method produces more specific taxonomic classifications on species which have high sequence identity within their genus. This provides a significant (up to 40%) increase in accuracy on these species. Also described is a re-analysis of a Shiga-toxin producing E. coli O104:H4 isolate via alignment against bacterial and viral species to find antibiotic resistance and toxin producing genes. 
k-SLAM has been used by a range of research projects including FLORINASH and is currently being used by a number of groups

    Expanding the ancient DNA bioinformatics toolbox, and its applications to archeological microbiomes

    The 1980s were very prolific years not only for music, but also for molecular biology and genetics, with the first publications on the microbiome and ancient DNA. Several technical revolutions later, the field of ancient metagenomics is now progressing full steam ahead, at an unprecedented pace. While generating sequencing data is becoming cheaper every year, the bioinformatics methods and the compute power needed to analyze them are struggling to catch up. In this thesis, I propose new methods to reduce the sequencing-to-analysis gap by introducing scalable and parallelized software for ancient DNA metagenomics analysis. In manuscript A, I first introduce a method for estimating the mixtures of different sources in a sequencing sample, a problem known as source tracking. I then apply this method to predict the original sources of paleofeces in manuscript B. In manuscript C, I propose a new method to scale lowest common ancestor calling from sequence alignment files, which addresses the computational intractability of fitting ever-growing metagenomic reference database indices in memory. In manuscript D, I present a method to statistically estimate ancient DNA deamination damage in parallel, and test it in the context of de novo assembly. Finally, in manuscript E, I apply some of the methods developed in this thesis to the analysis of ancient wine fermentation samples, and present the first ancient genomes of fermentation bacteria. Taken together, the tools developed in this thesis will help researchers working in the field of ancient DNA metagenomics to scale their analyses to the massive amount of sequencing data routinely produced nowadays
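
For readers unfamiliar with the term, lowest common ancestor (LCA) calling assigns a read that matches several reference taxa to the deepest taxonomic node shared by all of them. The sketch below shows only that basic idea on a toy tree. It is not the scalable method of manuscript C, and the `parent` mapping, the node names, and the function names are hypothetical.

```python
def lineage(taxon, parent):
    """Return the path from taxon up to the root, taxon first."""
    path = [taxon]
    while parent.get(taxon) is not None:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Return the deepest node present in the lineage of every taxon."""
    taxa = list(taxa)
    if not taxa:
        return None
    # Start from the first lineage (deepest node first) and keep only
    # the nodes that also appear in every other lineage.
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        others = set(lineage(taxon, parent))
        common = [node for node in common if node in others]
    return common[0] if common else None


# Toy taxonomy invented for illustration: child -> parent, root -> None.
parent = {
    "root": None,
    "phylum_A": "root",
    "phylum_B": "root",
    "genus_A1": "phylum_A",
    "genus_A2": "phylum_A",
    "genus_B1": "phylum_B",
}

# A read whose alignments hit two genera of the same phylum would be
# assigned to that phylum rather than to either genus.
print(lowest_common_ancestor(["genus_A1", "genus_A2"], parent))
```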

    Isolation and characterization of bacteriophages with therapeutic potential

    Critical Assessment of Metagenome Interpretation: A benchmark of metagenomics software

    In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performance, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions