206 research outputs found

    Multiple Comparative Metagenomics using Multiset k-mer Counting

    Get PDF
    Background. Large scale metagenomic projects aim to extract biodiversity knowledge between different environmental conditions. Current methods for comparing microbial communities face important limitations. Those based on taxonomical or functional assignation rely on a small subset of the sequences that can be associated to known organisms. On the other hand, de novo methods, that compare the whole sets of sequences, either do not scale up on ambitious metagenomic projects or do not provide precise and exhaustive results. Methods. These limitations motivated the development of a new de novo metagenomic comparative method, called Simka. This method computes a large collection of standard ecological distances by replacing species counts by k-mer counts. Simka scales-up today's metagenomic projects thanks to a new parallel k-mer counting strategy on multiple datasets. Results. Experiments on public Human Microbiome Project datasets demonstrate that Simka captures the essential underlying biological structure. Simka was able to compute in a few hours both qualitative and quantitative ecological distances on hundreds of metagenomic samples (690 samples, 32 billions of reads). We also demonstrate that analyzing metagenomes at the k-mer level is highly correlated with extremely precise de novo comparison techniques which rely on all-versus-all sequences alignment strategy or which are based on taxonomic profiling

    Streaming histogram sketching for rapid microbiome analytics

    Get PDF
    Background: The growth in publically available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time. Results: We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we use a ‘real life’ example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s. Conclusions: Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified. We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space

    A resource-frugal probabilistic dictionary and applications in (meta)genomics

    Get PDF
    Genomic and metagenomic fields, generating huge sets of short genomic sequences, brought their own share of high performance problems. To extract relevant pieces of information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is a task considered too expensive while being a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of element and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales-up. We believe that many tools and applications could benefit from either the fundamental data structure we provide or from the applications developed from this structure.Comment: Submitted to PSC 201

    A Study on the Effects of Using Sampling for Metagenomic Comparison

    Get PDF
    openNowadays, ecological sciences depend heavily on genetic studies. Among these, analysis of environmental genetic material — i.e., metagenomics — is becoming increasingly popular for inferring essential information about microbial life and its interaction with ecosystems. An interesting application of metagenomics in this field is metagenomic comparison, that is the assessment of biotic dissimilarity between microbial environments. Current technologies allow us to produce Terabytes of metagenomic data with little effort. Consequently, the analysis of datasets of such size requires a large amount of computational resources. This led to the development and application of several strategies of dimensionality reduction, which are now being exploited for metagenomic comparison too. In this thesis, we analyse three different methods of reducing dimensionality to see what an impact they have in relation to reference-based methods. Our results show that a sketching on distinct k-mers, as implemented in the tool SimkaMin, have almost no impact on both abundance-based and presence-absence-based comparison for a sketching size larger than 10^5 distinct k-mers. On smaller sketches, quality of results decreases. On SPRISS’ sampling scheme, in which reads are selected uniformly at random with replacement, abundance-based Bray-Curtis dissimilarity showed no significant variations on moderated sampling rates — e.g., above 2% — and a marked quality decline on lower sampling rates. When the k-mers used are too short, 12 bp for instance, this sampling scheme seems to improve drastically dissimilarity measures. On the presence-absence Jaccard distance, instead, SPRISS’ subsampling scheme improves the correlation between reference-based and compositional-based methods at moderate sampling rates. Lastly, comparison of approximate sets of frequent k-mers, as outputted by SPRISS, hold lower correlation with reference-based dissimilarities, except on very short k-mers. Overall, our study suggests that rare k-mers are of both types: weakly informative and noise. Their impact is imperceivable on abundance-based dissimilarity, whereas the noisy part of them affect negatively the quality of the Jaccard index, which benefits from a moderate subsampling indeed.Nowadays, ecological sciences depend heavily on genetic studies. Among these, analysis of environmental genetic material — i.e., metagenomics — is becoming increasingly popular for inferring essential information about microbial life and its interaction with ecosystems. An interesting application of metagenomics in this field is metagenomic comparison, that is the assessment of biotic dissimilarity between microbial environments. Current technologies allow us to produce Terabytes of metagenomic data with little effort. Consequently, the analysis of datasets of such size requires a large amount of computational resources. This led to the development and application of several strategies of dimensionality reduction, which are now being exploited for metagenomic comparison too. In this thesis, we analyse three different methods of reducing dimensionality to see what an impact they have in relation to reference-based methods. Our results show that a sketching on distinct k-mers, as implemented in the tool SimkaMin, have almost no impact on both abundance-based and presence-absence-based comparison for a sketching size larger than 10^5 distinct k-mers. On smaller sketches, quality of results decreases. On SPRISS’ sampling scheme, in which reads are selected uniformly at random with replacement, abundance-based Bray-Curtis dissimilarity showed no significant variations on moderated sampling rates — e.g., above 2% — and a marked quality decline on lower sampling rates. When the k-mers used are too short, 12 bp for instance, this sampling scheme seems to improve drastically dissimilarity measures. On the presence-absence Jaccard distance, instead, SPRISS’ subsampling scheme improves the correlation between reference-based and compositional-based methods at moderate sampling rates. Lastly, comparison of approximate sets of frequent k-mers, as outputted by SPRISS, hold lower correlation with reference-based dissimilarities, except on very short k-mers. Overall, our study suggests that rare k-mers are of both types: weakly informative and noise. Their impact is imperceivable on abundance-based dissimilarity, whereas the noisy part of them affect negatively the quality of the Jaccard index, which benefits from a moderate subsampling indeed

    AMAnD: an automated metagenome anomaly detection methodology utilizing DeepSVDD neural networks

    Get PDF
    The composition of metagenomic communities within the human body often reflects localized medical conditions such as upper respiratory diseases and gastrointestinal diseases. Fast and accurate computational tools to flag anomalous metagenomic samples from typical samples are desirable to understand different phenotypes, especially in contexts where repeated, long-duration temporal sampling is done. Here, we present Automated Metagenome Anomaly Detection (AMAnD), which utilizes two types of Deep Support Vector Data Description (DeepSVDD) models; one trained on taxonomic feature space output by the Pan-Genomics for Infectious Agents (PanGIA) taxonomy classifier and one trained on kmer frequency counts. AMAnD's semi-supervised one-class approach makes no assumptions about what an anomaly may look like, allowing the flagging of potentially novel anomaly types. Three diverse datasets are profiled. The first dataset is hosted on the National Center for Biotechnology Information's (NCBI) Sequence Read Archive (SRA) and contains nasopharyngeal swabs from healthy and COVID-19-positive patients. The second dataset is also hosted on SRA and contains gut microbiome samples from normal controls and from patients with slow transit constipation (STC). AMAnD can learn a typical healthy nasopharyngeal or gut microbiome profile and reliably flag the anomalous COVID+ or STC samples in both feature spaces. The final dataset is a synthetic metagenome created by the Critical Assessment of Metagenome Annotation Simulator (CAMISIM). A control dataset of 50 well-characterized organisms was submitted to CAMISIM to generate 100 synthetic control class samples. The experimental conditions included 12 different spiked-in contaminants that are taxonomically similar to organisms present in the laboratory blank sample ranging from one strain tree branch taxonomic distance away to one family tree branch taxonomic distance away. This experiment was repeated in triplicate at three different coverage levels to probe the dependence on sample coverage. AMAnD was again able to flag the contaminant inserts as anomalous. AMAnD's assumption-free flagging of metagenomic anomalies, the real-time model training update potential of the deep learning approach, and the strong performance even with lightweight models of low sample cardinality would make AMAnD well-suited to a wide array of applied metagenomics biosurveillance use-cases, from environmental to clinical utility

    Unbiased K-mer Analysis Reveals Changes in Copy Number of Highly Repetitive Sequences During Maize Domestication and Improvement

    Get PDF
    Citation: Liu, S. Z., Zheng, J., Migeon, P., Ren, J., Hu, Y., He, C., . . . Wang, G. Y. (2017). Unbiased K-mer Analysis Reveals Changes in Copy Number of Highly Repetitive Sequences During Maize Domestication and Improvement. Scientific Reports, 7, 15. https://doi.org/10.1038/srep42444The major component of complex genomes is repetitive elements, which remain recalcitrant to characterization. Using maize as a model system, we analyzed whole genome shotgun (WGS) sequences for the two maize inbred lines B73 and Mo17 using k-mer analysis to quantify the differences between the two genomes. Significant differences were identified in highly repetitive sequences, including centromere, 45S ribosomal DNA (rDNA), knob, and telomere repeats. Genotype specific 45S rDNA sequences were discovered. The B73 and Mo17 polymorphic k-mers were used to examine allelespecific expression of 45S rDNA in the hybrids. Although Mo17 contains higher copy number than B73, equivalent levels of overall 45S rDNA expression indicates that transcriptional or post-transcriptional regulation mechanisms operate for the 45S rDNA in the hybrids. Using WGS sequences of B73xMo17 doubled haploids, genomic locations showing differential repetitive contents were genetically mapped, which displayed different organization of highly repetitive sequences in the two genomes. In an analysis of WGS sequences of HapMap2 lines, including maize wild progenitor, landraces, and improved lines, decreases and increases in abundance of additional sets of k-mers associated with centromere, 45S rDNA, knob, and retrotransposons were found among groups, revealing global evolutionary trends of genomic repeats during maize domestication and improvement

    Resources for the analysis of bacterial and microbial genomic data with a focus on antibiotic resistance

    Get PDF
    Antibiotics are drugs which inhibit the growth of bacterial cells. Their discovery was one of the most significant achievements in medicine: it allowed the development of successful treatment options for severe bacterial infections, which has helped to significantly increase our life expectancy. However, bacteria have the ability to adapt to changing environmental conditions through genetic modifications, and can, therefore, become resistant to an antibiotic. Extensive use of antibiotics promotes the development of antibiotic resistance and, since some genetic factors can be exchanged between the cells, emergence of new resistance mechanisms and their spread have become a serious global problem. Counteractive measures have been initiated, focusing on the different factors contributing to the antibiotic resistance crisis. These include the study of bacterial isolates and complete microbial communities using whole-genome sequencing (WGS) data. In both cases, there are specific challenges and requirements for different analytical approaches. The goal of the present thesis was the implementation of multiple resources which should facilitate further microbiological studies, with a focus on bacteria and antibiotic resistance. The main project, GEAR-base, included an analysis of WGS and resistance data of around eleven thousand bacterial clinical isolates covering the main human pathogens and antibiotics from different drug classes. The dataset consisted of WGS data, antibiotic susceptibility profiles and meta-information, along with additional taxonomic characterization of a sample subset. The analysis of this isolate collection allowed for the identification of bacterial species demonstrating increasing resistance rates, to construct species pan-genomes from the de novo assembled genomes, and to link gene presence or absence to the available antibiotic resistance profiles. The generated data and results were made available through the online resource GEAR-base. This resource provides access to the resistance information and genomic data, and implements functionality to compare submitted genes or genomes to the data included in the resource. In microbial community studies, the metagenome obtained through WGS is analyzed to determine its taxonomic composition. For this task, genomic sequences are clustered, or binned, to represent sequences belonging to specific organisms or closely-related organism groups. BusyBee Web was developed to provide an automatic binning pipeline using frequencies of k-mers (subsequences of length k) and bootstrapped supervised clustering. It also includes further data annotation, such as taxonomic classification of the input sequences, presence of know resistance factors, and bin quality. Plasmids, extra-chromosomal DNA molecules found in some bacteria, play an important role in antibiotic resistance spread. As the classification of sequences from WGS data as chromosomal or plasmid-derived is challenging, demonstrated by evaluating four tools implementing three different approaches, having a reference dataset to detect the plasmids which are already known is therefore desirable. To this end, an online resource for complete bacterial plasmids (PLSDB) was implemented. In summary, the herein described online resources represent valuable datasets and/or tools for the analysis of microbial genomic data and, especially, bacterial pathogens and antibiotic resistance.Antibiotika sind Medikamente, die das Wachstum von Bakterienzellen hemmen. Ihre Entdeckung war eine der bedeutendsten Leistungen der Medizin: Es erlaubte die Entwicklung von erfolgreichen Behandlungsmöglichkeiten von schwerwiegenden bakteriellen Infektionen, was geholfen hat, unsere Lebenserwartung zu erhöhen. Allerdings sind Bakterien in der Lage sich den wechselnden Umweltbedingungen anzupassen und können dadurch resistent gegen ein Antibiotikum werden. Der extensive Gebrauch von Antibiotika fördert die Entwicklung von Antibiotikaresistenzen und, da einige genetische Faktoren zwischen den Zellen ausgetauscht werden können, sind das Auftauchen von neuen Resistenzmechanismen und deren Verbreitung zu einem seriösen globalen Problem geworden. Gegenmaßnahmen wurden ergriffen, die sich auf die verschiedenen Faktoren fokussieren, die zur Antibiotikaresistenzkrise beitragen. Diese umfassen Studien von bakteriellen Isolaten und ganzen Mikrobengemeinschaften mithilfe von Gesamt-Genom-Sequenzierung (GGS). In beiden FĂ€llen gibt es spezifische Herausforderungen und BedĂŒrfnisse fĂŒr verschiedene analytische Methoden. Das Ziel dieser Dissertation war die Implementierung von mehreren Ressourcen, die weitere mikrobielle Studien erleichtern sollen und einen Fokus auf Bakterien und Antibiotikaresistenz haben. Das Hauptprojekt, GEAR-base, beinhaltete eine Analyse von GGS- und Resistenzdaten von ungefĂ€hr elftausend klinischen Bakterienisolaten und umfasste die wichtigen menschlichen Pathogene und Antibiotika aus verschiedenen Medikamentenklassen. Neben den GGS-Daten, Empfindlichkeitsprofilen fĂŒr die Antibiotika und Metainformation, beinhaltete der Datensatz zusĂ€tzliche taxonomische Charakterisierung von einer Teilmenge der Proben. Die Analyse dieser Sammlung an Isolaten erlaubte die Identifizierung von Spezies mit ansteigenden Resistenzraten, die Konstruktion von den Spezies-Pan-Genomen aus den de novo assemblierten Genomen und die VerknĂŒpfung vom Vorhandensein oder Fehlen von Genen mit den Antibiotikaresistenzprofilen. Die generierten Daten und Ergebnisse wurden durch die Online-Ressource GEAR-base bereitgestellt. Diese Ressource bietet Zugang zur Resistenzinformation und den gesammelten genomischen Daten und implementiert Funktionen zum Vergleich von hochgeladenen Genen oder Genomen zu den Daten, die in der Ressource enthalten sind. In den Studien von Mikrobengemeinschaften wird das durch GGS erhaltene Metagenom analysiert, um seine taxonomische Zusammensetzung zu bestimmen. DafĂŒr werden die genomischen Sequenzen in sogenannte Bins gruppiert (Binning), die die Zugehörigkeit von den Sequenzen zu bestimmten Organismen oder zu Gruppen von nah verwandten Organismen reprĂ€sentieren. BusyBee Web wurde entwickelt, um eine automatische Binning-Pipeline anzubieten, die die HĂ€ufigkeitsprofile von k-meren (Teilsequenzen der LĂ€nge k) und eine auf dem Bootstrap-Verfahren basierte Methode fĂŒr die Gruppierung der Sequenzen nutzt. ZusĂ€tzlich wird eine Annotation der Daten durchgefĂŒhrt, wie die taxonomische Klassifizierung der hochgeladenen Sequenzen, das Vorhandensein von bekannten Resistenzfaktoren und die QualitĂ€t der Bins. Plasmide, DNA-MolekĂŒle, die zusĂ€tzlich zum Chromosom in einigen Bakterien vorhanden sind, spielen eine wichtige Rolle in der Verbreitung von Antibiotikaresistenzen. Die Klassifizierung von Sequenzen aus der GGS als von einem Chromosom oder einem Plasmid stammend ist herausfordernd, wie es in einer Evaluation von vier Tools, die drei verschiedene AnsĂ€tze implementieren, demonstriert wurde. Deshalb ist das Vorhandensein von einem Referenzdatensatz, um schon bekannte Plasmide zu detektieren, sehr wĂŒnschenswert. Zu diesem Zweck wurde eine Online-Ressource von vollstĂ€ndigen bakteriellen Plasmiden implementiert (PLSDB). Die hier beschriebenen Online-Ressourcen stellen nĂŒtzliche DatensĂ€tze und/oder Werkzeuge dar, die fĂŒr die Analyse von mikrobiellen genomischen Daten, insbesondere von bakteriellen Pathogenen und Antibiotikaresistenzen, eingesetzt werden können
    • 

    corecore