
206 research outputs found

Multiple Comparative Metagenomics using Multiset k-mer Counting

Author: Benoit Gaëtan
Drezen Erwan
Lavenier Dominique
Lemaitre Claire
Mariadassou Mahendra
Peterlongo Pierre
Schbath Sophie
Publication date: 28/04/2016

Background. Large-scale metagenomic projects aim to extract biodiversity knowledge between different environmental conditions. Current methods for comparing microbial communities face important limitations. Those based on taxonomic or functional assignment rely on a small subset of the sequences that can be associated with known organisms. On the other hand, de novo methods, which compare the whole sets of sequences, either do not scale up to ambitious metagenomic projects or do not provide precise and exhaustive results. Methods. These limitations motivated the development of a new de novo metagenomic comparative method, called Simka. This method computes a large collection of standard ecological distances by replacing species counts with k-mer counts. Simka scales up to today's metagenomic projects thanks to a new parallel k-mer counting strategy on multiple datasets. Results. Experiments on public Human Microbiome Project datasets demonstrate that Simka captures the essential underlying biological structure. Simka was able to compute in a few hours both qualitative and quantitative ecological distances on hundreds of metagenomic samples (690 samples, 32 billion reads). We also demonstrate that analyzing metagenomes at the k-mer level is highly correlated with extremely precise de novo comparison techniques that rely on an all-versus-all sequence alignment strategy or that are based on taxonomic profiling.
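The idea of replacing species counts with k-mer counts can be illustrated with a minimal sketch. This is not Simka's implementation: the function names, the toy reads, and the choice of k are assumptions made for illustration, and the parallel multi-dataset counting strategy described in the abstract is not reproduced. It only shows one quantitative distance (Bray-Curtis) and one qualitative distance (Jaccard) computed on k-mer counts, assuming both samples are non-empty.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def bray_curtis(a, b):
    """Abundance-based dissimilarity: 1 - 2 * sum(min counts) / (total_a + total_b)."""
    shared = sum(min(a[x], b[x]) for x in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total

def jaccard_distance(a, b):
    """Presence/absence distance: 1 - |A intersect B| / |A union B|."""
    union = a.keys() | b.keys()
    return 1.0 - len(a.keys() & b.keys()) / len(union)

# Hypothetical toy samples; real inputs would be read sets from sequencing files.
sample_a = kmer_counts(["ACGTACGT", "TTGACA"], k=4)
sample_b = kmer_counts(["ACGTTTGA", "GACAGT"], k=4)
print(bray_curtis(sample_a, sample_b), jaccard_distance(sample_a, sample_b))
```

In this sketch k-mers play the role that species play in classical ecology, so any distance defined on a species abundance table can be applied unchanged to the k-mer count table.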

arXiv.org e-Print Archive

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

Directory of Open Access Journals

Streaming histogram sketching for rapid microbiome analytics

Author: A Sczyrba
AG Shaw
AL Greninger
AP Carrieri
B Grüning
BD Ondov
C Alcon-Giner
C Kakkanatt
D Yang
DB Rusch
F Pedregosa
G Benoit
G Cormode
H Mulcahy-O’Grady
Human Microbiome Project Consortium
I Koychev
JD Forbes
K Sim
LP Coelho
LR Thompson
M Bawa
MW Libbrecht
Q Zhang
R Bovee
S Ioffe
S Seth
SY Anvar
T Brown
T Haveliwala
VB Dubinkina
W Wu
XC Morgan
Publication venue: Springer Science and Business Media LLC
Publication date: 01/03/2019

Background: The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time. Results: We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality-sensitive hashing indexing scheme. Furthermore, we use a ‘real life’ example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s. Conclusions: Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified.
We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space.
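To illustrate how a small fixed-size sketch can support Jaccard similarity estimation, here is a minimal sketch that uses plain MinHash over k-mer sets. This is a deliberately swapped-in, simpler technique and not the histosketch algorithm or HULK's code: it ignores k-mer abundances, whereas the abstract describes sketches of k-mer spectra (histograms). The function names, the number of hash functions, and the seeded BLAKE2b hashing are all assumptions for illustration, and the inputs are assumed to be non-empty.

```python
import hashlib

def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{kmer}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_sketch(kmer_set, num_hashes=64):
    """Keep, for each seeded hash function, the minimum hash over the set."""
    return [min(_hash(x, seed) for x in kmer_set) for seed in range(num_hashes)]

def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch slots that agree; estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)

# Hypothetical toy sequences standing in for two microbiome samples.
s1 = minhash_sketch(kmers("ACGTACGTTGACAGGT", 5))
s2 = minhash_sketch(kmers("ACGTACGTTGACATTT", 5))
print(estimate_jaccard(s1, s2))
```

Each sketch has a fixed length regardless of input size, which is the property that makes indexing and near-real-time comparison of many samples practical.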

University of Liverpool Repository

Crossref

University of Birmingham Research Portal

Directory of Open Access Journals

Spiral - Imperial College Digital Repository

University of East Anglia digital repository

A resource-frugal probabilistic dictionary and applications in (meta)genomics

Author: Bittner Lucie
Limasset Antoine
Marchet Camille
Peterlongo Pierre
Publication date: 26/05/2016

Genomic and metagenomic fields, generating huge sets of short genomic sequences, brought their own share of high-performance problems. To extract relevant pieces of information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is a task considered too expensive while being a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of elements and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales up. We believe that many tools and applications could benefit from either the fundamental data structure we provide or from the applications developed from this structure.
Comment: Submitted to PSC 201

arXiv.org e-Print Archive

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

HAL-Rennes 1

A Study on the Effects of Using Sampling for Metagenomic Comparison

Author: GALLINA GIORGIO
Publication date: 26/04/2023

Nowadays, ecological sciences depend heavily on genetic studies. Among these, analysis of environmental genetic material (i.e., metagenomics) is becoming increasingly popular for inferring essential information about microbial life and its interaction with ecosystems. An interesting application of metagenomics in this field is metagenomic comparison, that is, the assessment of biotic dissimilarity between microbial environments. Current technologies allow us to produce terabytes of metagenomic data with little effort. Consequently, the analysis of datasets of such size requires a large amount of computational resources. This led to the development and application of several dimensionality reduction strategies, which are now being exploited for metagenomic comparison too. In this thesis, we analyse three different methods of reducing dimensionality to see what impact they have in relation to reference-based methods. Our results show that sketching on distinct k-mers, as implemented in the tool SimkaMin, has almost no impact on either abundance-based or presence-absence-based comparison for a sketch size larger than 10^5 distinct k-mers. On smaller sketches, the quality of results decreases. With SPRISS’ sampling scheme, in which reads are selected uniformly at random with replacement, abundance-based Bray-Curtis dissimilarity showed no significant variation at moderate sampling rates (e.g., above 2%) and a marked quality decline at lower sampling rates. When the k-mers used are too short, 12 bp for instance, this sampling scheme seems to drastically improve dissimilarity measures. On the presence-absence Jaccard distance, instead, SPRISS’ subsampling scheme improves the correlation between reference-based and compositional-based methods at moderate sampling rates. Lastly, comparison of approximate sets of frequent k-mers, as output by SPRISS, shows lower correlation with reference-based dissimilarities, except on very short k-mers. Overall, our study suggests that rare k-mers are of two types: weakly informative and noise. Their impact is imperceptible on abundance-based dissimilarity, whereas the noisy part of them negatively affects the quality of the Jaccard index, which indeed benefits from moderate subsampling.
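The read-level sampling scheme described in this abstract (reads selected uniformly at random with replacement) can be sketched as follows. This is not SPRISS's code or interface: the function names, the rounding of the sample size, the fixed seed, and the toy reads are assumptions for illustration only.

```python
import random
from collections import Counter

def subsample_reads(reads, rate, seed=0):
    """Draw round(rate * len(reads)) reads uniformly at random, with replacement."""
    rng = random.Random(seed)            # fixed seed so the draw is reproducible
    n = max(1, round(rate * len(reads)))
    return rng.choices(reads, k=n)       # choices() samples with replacement

def kmer_counts(reads, k):
    """Count every k-mer across a list of reads."""
    counts = Counter()
    for read in reads:
        counts.update(read[i:i + k] for i in range(len(read) - k + 1))
    return counts

# Hypothetical toy sample of 100 reads.
reads = ["ACGTACGTAC", "TTGACAGGTA", "GGCATCGATC", "ACGTTTGACA"] * 25
full = kmer_counts(reads, k=4)
sub = kmer_counts(subsample_reads(reads, rate=0.2), k=4)
print(len(full), len(sub))  # number of distinct k-mers before and after sampling
```

Because rare k-mers are the ones most likely to be missed by a small sample, comparing the distinct k-mer sets before and after sampling gives an intuition for why presence-absence measures react to the sampling rate differently from abundance-based ones.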

Padua Thesis and Dissertation Archive


Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons

Author: Bomhoff Matthew
Choi Illyoung
Hartman John H
Hurwitz Bonnie L
Ponsero Alise J
Youens-Clark Ken
Publication venue: Oxford University Press (OUP)
Publication date: 01/02/2019

Background: Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content. Results: We developed a tool called Libra that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra uses a scalable Hadoop framework for massive metagenome comparisons, Cosine Similarity for calculating the distance using sequence composition and abundance while normalizing for sequencing depth, and a web-based implementation in iMicrobe (http://imicrobe.us) that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community. Conclusions: A comparison of Libra to equivalent tools using both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, reveals that methods commonly implemented to reduce compute time for large datasets, such as data reduction, read count normalization, and presence/absence distance metrics, greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
National Science Foundation [1640775]. Open access journal.
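A small sketch can show why cosine similarity on k-mer abundance vectors is insensitive to sequencing depth. It is not Libra's implementation (no Hadoop, and any k-mer weighting Libra may apply is not modeled here); the function name and the toy counts are assumptions for illustration, and both vectors are assumed to be non-zero.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse k-mer abundance vectors."""
    dot = sum(a[x] * b[x] for x in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical k-mer counts for one community sequenced at two depths.
shallow = Counter({"ACGT": 3, "CGTA": 1, "GTAC": 2})
deep = Counter({kmer: 10 * n for kmer, n in shallow.items()})  # 10x more reads
other = Counter({"TTTT": 5, "ACGT": 1})

print(cosine_similarity(shallow, deep))   # close to 1.0: same composition
print(cosine_similarity(shallow, other))  # lower: different composition
```

Scaling every count by the same factor changes the length of the vector but not its direction, so the measure compares relative k-mer composition without requiring a separate read-count normalization step.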

The University of Arizona

AMAnD: an automated metagenome anomaly detection methodology utilizing DeepSVDD neural networks

Author: Colin Price
Joseph A. Russell
Publication venue: Frontiers Media SA
Publication date: 01/07/2023

The composition of metagenomic communities within the human body often reflects localized medical conditions such as upper respiratory diseases and gastrointestinal diseases. Fast and accurate computational tools to flag anomalous metagenomic samples from typical samples are desirable to understand different phenotypes, especially in contexts where repeated, long-duration temporal sampling is done. Here, we present Automated Metagenome Anomaly Detection (AMAnD), which utilizes two types of Deep Support Vector Data Description (DeepSVDD) models: one trained on the taxonomic feature space output by the Pan-Genomics for Infectious Agents (PanGIA) taxonomy classifier and one trained on k-mer frequency counts. AMAnD's semi-supervised one-class approach makes no assumptions about what an anomaly may look like, allowing the flagging of potentially novel anomaly types. Three diverse datasets are profiled. The first dataset is hosted on the National Center for Biotechnology Information's (NCBI) Sequence Read Archive (SRA) and contains nasopharyngeal swabs from healthy and COVID-19-positive patients. The second dataset is also hosted on SRA and contains gut microbiome samples from normal controls and from patients with slow transit constipation (STC). AMAnD can learn a typical healthy nasopharyngeal or gut microbiome profile and reliably flag the anomalous COVID+ or STC samples in both feature spaces. The final dataset is a synthetic metagenome created by the Critical Assessment of Metagenome Annotation Simulator (CAMISIM). A control dataset of 50 well-characterized organisms was submitted to CAMISIM to generate 100 synthetic control class samples. The experimental conditions included 12 different spiked-in contaminants that are taxonomically similar to organisms present in the laboratory blank sample, ranging from one strain tree branch taxonomic distance away to one family tree branch taxonomic distance away.
This experiment was repeated in triplicate at three different coverage levels to probe the dependence on sample coverage. AMAnD was again able to flag the contaminant inserts as anomalous. AMAnD's assumption-free flagging of metagenomic anomalies, the real-time model training update potential of the deep learning approach, and the strong performance even with lightweight models of low sample cardinality would make AMAnD well-suited to a wide array of applied metagenomics biosurveillance use cases, from environmental to clinical utility.

Directory of Open Access Journals

Unbiased K-mer Analysis Reveals Changes in Copy Number of Highly Repetitive Sequences During Maize Domestication and Improvement

Author: Fu J. J.
He C.
Hu Y.
Liu H. J.
Liu Sanzhen Z.
Migeon P.
Ren J.
Toomajian Christopher
Wang G. Y.
White F. F.
Zheng J.
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2017

Citation: Liu, S. Z., Zheng, J., Migeon, P., Ren, J., Hu, Y., He, C., . . . Wang, G. Y. (2017). Unbiased K-mer Analysis Reveals Changes in Copy Number of Highly Repetitive Sequences During Maize Domestication and Improvement. Scientific Reports, 7, 15. https://doi.org/10.1038/srep42444
The major component of complex genomes is repetitive elements, which remain recalcitrant to characterization. Using maize as a model system, we analyzed whole genome shotgun (WGS) sequences for the two maize inbred lines B73 and Mo17 using k-mer analysis to quantify the differences between the two genomes. Significant differences were identified in highly repetitive sequences, including centromere, 45S ribosomal DNA (rDNA), knob, and telomere repeats. Genotype-specific 45S rDNA sequences were discovered. The B73 and Mo17 polymorphic k-mers were used to examine allele-specific expression of 45S rDNA in the hybrids. Although Mo17 contains a higher copy number than B73, equivalent levels of overall 45S rDNA expression indicate that transcriptional or post-transcriptional regulation mechanisms operate for the 45S rDNA in the hybrids. Using WGS sequences of B73xMo17 doubled haploids, genomic locations showing differential repetitive contents were genetically mapped, which displayed different organization of highly repetitive sequences in the two genomes. In an analysis of WGS sequences of HapMap2 lines, including the wild progenitor of maize, landraces, and improved lines, decreases and increases in abundance of additional sets of k-mers associated with centromere, 45S rDNA, knob, and retrotransposons were found among groups, revealing global evolutionary trends of genomic repeats during maize domestication and improvement.

K-State Research Exchange

PubMed Central

Figure S6: Results of Simka on low covered samples from the Global Ocean Sampling project (GOS)

Author: Altschul
Arumugam
Borg
Boutin
Broder
Břinda
Cai
Chao
Costello
Coveley
Deorowicz
Deutsch
Drezen
Dubinkina
Fofanov
Genitsaris
Gomez-Alvarez
Human Microbiome Project Consortium
Human Microbiome Project Consortium
Karsenti
Kent
Koren
Legendre
Liles
Maillet
Maillet
Nielsen
Ondov
Pavoine
Piganeau
Rizk
Segata
Seth
Shade
Teeling
Ulyantsev
Whittaker
Wood
Wu
Yooseph
Publication venue: PeerJ

Crossref

Resources for the analysis of bacterial and microbial genomic data with a focus on antibiotic resistance

Author: Galata Valentina
Publication venue: Saarländische Universitäts- und Landesbibliothek
Publication date: 01/01/2019

Antibiotics are drugs which inhibit the growth of bacterial cells. Their discovery was one of the most significant achievements in medicine: it allowed the development of successful treatment options for severe bacterial infections, which has helped to significantly increase our life expectancy. However, bacteria have the ability to adapt to changing environmental conditions through genetic modifications, and can, therefore, become resistant to an antibiotic. Extensive use of antibiotics promotes the development of antibiotic resistance and, since some genetic factors can be exchanged between the cells, the emergence of new resistance mechanisms and their spread have become a serious global problem. Counteractive measures have been initiated, focusing on the different factors contributing to the antibiotic resistance crisis. These include the study of bacterial isolates and complete microbial communities using whole-genome sequencing (WGS) data. In both cases, there are specific challenges and requirements for different analytical approaches. The goal of the present thesis was the implementation of multiple resources which should facilitate further microbiological studies, with a focus on bacteria and antibiotic resistance. The main project, GEAR-base, included an analysis of WGS and resistance data of around eleven thousand bacterial clinical isolates covering the main human pathogens and antibiotics from different drug classes. The dataset consisted of WGS data, antibiotic susceptibility profiles and meta-information, along with additional taxonomic characterization of a sample subset. The analysis of this isolate collection made it possible to identify bacterial species demonstrating increasing resistance rates, to construct species pan-genomes from the de novo assembled genomes, and to link gene presence or absence to the available antibiotic resistance profiles. The generated data and results were made available through the online resource GEAR-base.
This resource provides access to the resistance information and genomic data, and implements functionality to compare submitted genes or genomes to the data included in the resource. In microbial community studies, the metagenome obtained through WGS is analyzed to determine its taxonomic composition. For this task, genomic sequences are clustered, or binned, to represent sequences belonging to specific organisms or closely related organism groups. BusyBee Web was developed to provide an automatic binning pipeline using frequencies of k-mers (subsequences of length k) and bootstrapped supervised clustering. It also includes further data annotation, such as taxonomic classification of the input sequences, presence of known resistance factors, and bin quality. Plasmids, extra-chromosomal DNA molecules found in some bacteria, play an important role in the spread of antibiotic resistance. Because classifying sequences from WGS data as chromosomal or plasmid-derived is challenging, as demonstrated by an evaluation of four tools implementing three different approaches, a reference dataset for detecting already known plasmids is desirable. To this end, an online resource for complete bacterial plasmids (PLSDB) was implemented. In summary, the online resources described here represent valuable datasets and/or tools for the analysis of microbial genomic data and, especially, bacterial pathogens and antibiotic resistance.

Universaar

Acronym