Multiple Comparative Metagenomics using Multiset k-mer Counting
Background. Large scale metagenomic projects aim to extract biodiversity
knowledge between different environmental conditions. Current methods for
comparing microbial communities face important limitations. Those based on
taxonomical or functional assignation rely on a small subset of the sequences
that can be associated to known organisms. On the other hand, de novo methods,
that compare the whole sets of sequences, either do not scale up on ambitious
metagenomic projects or do not provide precise and exhaustive results.
Methods. These limitations motivated the development of a new de novo
metagenomic comparative method, called Simka. This method computes a large
collection of standard ecological distances by replacing species counts by
k-mer counts. Simka scales up to today's metagenomic projects thanks to a new
parallel k-mer counting strategy on multiple datasets.
Results. Experiments on public Human Microbiome Project datasets demonstrate
that Simka captures the essential underlying biological structure. Simka was
able to compute in a few hours both qualitative and quantitative ecological
distances on hundreds of metagenomic samples (690 samples, 32 billion
reads). We also demonstrate that analyzing metagenomes at the k-mer level is
highly correlated with extremely precise de novo comparison techniques that
rely on an all-versus-all sequence alignment strategy or are based on
taxonomic profiling.
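As a rough illustration of replacing species counts by k-mer counts, the following minimal Python sketch computes one quantitative distance (Bray-Curtis) and one qualitative distance (Jaccard) between two toy samples. It is not Simka's implementation: the function names are hypothetical, counting is done naively in memory, and reverse complements are ignored.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(counts_a, counts_b):
    """Abundance-based Bray-Curtis dissimilarity between two k-mer count tables."""
    shared = sum(min(n, counts_b.get(kmer, 0)) for kmer, n in counts_a.items())
    total = sum(counts_a.values()) + sum(counts_b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0


def jaccard_distance(counts_a, counts_b):
    """Presence/absence Jaccard distance between two k-mer count tables."""
    set_a, set_b = set(counts_a), set(counts_b)
    union = set_a | set_b
    return 1.0 - len(set_a & set_b) / len(union) if union else 0.0


if __name__ == "__main__":
    sample_a = kmer_counts(["ACGTACGT"], k=4)
    sample_b = kmer_counts(["ACGTTTTT"], k=4)
    print("Bray-Curtis:", bray_curtis(sample_a, sample_b))
    print("Jaccard distance:", jaccard_distance(sample_a, sample_b))
```

The two toy samples share a single distinct k-mer (ACGT), so both distances are close to 1, while a sample compared with itself gives 0 for both.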
Streaming histogram sketching for rapid microbiome analytics
Background: The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time. Results: We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed "histosketch" that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we use a "real life" example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s. Conclusions: Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified.
We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space.
A resource-frugal probabilistic dictionary and applications in (meta)genomics
Genomic and metagenomic fields, generating huge sets of short genomic
sequences, brought their own share of high performance problems. To extract
relevant pieces of information from the huge data sets generated by current
sequencing techniques, one must rely on extremely scalable methods and
solutions. Indexing billions of objects is a task considered too expensive
while being a fundamental need in this field. In this paper we propose a
straightforward indexing structure that scales to billions of elements and we
propose two direct applications in genomics and metagenomics. We show that our
proposal solves problem instances for which no other known solution scales up.
We believe that many tools and applications could benefit from either the
fundamental data structure we provide or from the applications developed from
this structure. Comment: Submitted to PSC 201
A Study on the Effects of Using Sampling for Metagenomic Comparison
Nowadays, ecological sciences depend heavily on genetic studies. Among these,
analysis of environmental genetic material (i.e., metagenomics) is becoming
increasingly popular for inferring essential information about microbial life and its
interaction with ecosystems. An interesting application of metagenomics in this field
is metagenomic comparison, that is the assessment of biotic dissimilarity between
microbial environments. Current technologies allow us to produce Terabytes of
metagenomic data with little effort. Consequently, the analysis of datasets of such
size requires a large amount of computational resources. This led to the development
and application of several strategies of dimensionality reduction, which are now
being exploited for metagenomic comparison too.
In this thesis, we analyse three different methods of reducing dimensionality to
assess their impact relative to reference-based methods. Our results show
that sketching on distinct k-mers, as implemented in the tool SimkaMin, has
almost no impact on either abundance-based or presence-absence-based comparison
for sketch sizes larger than 10^5 distinct k-mers. On smaller sketches, the quality of
results decreases. With the SPRISS sampling scheme, in which reads are selected uniformly
at random with replacement, the abundance-based Bray-Curtis dissimilarity showed
no significant variation at moderate sampling rates (e.g., above 2%) and a
marked decline in quality at lower sampling rates. When the k-mers used are too short,
12 bp for instance, this sampling scheme seems to drastically improve the dissimilarity
measures. For the presence-absence Jaccard distance, instead, the SPRISS subsampling
scheme improves the correlation between reference-based and compositional-based
methods at moderate sampling rates. Lastly, comparisons of approximate sets of
frequent k-mers, as output by SPRISS, show lower correlation with reference-based
dissimilarities, except for very short k-mers.
Overall, our study suggests that rare k-mers are of two types: weakly informative
and noise. Their impact is imperceptible on abundance-based dissimilarity, whereas
the noisy part negatively affects the quality of the Jaccard index, which
benefits from moderate subsampling.
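To make the idea of sketching on distinct k-mers concrete, here is a minimal Python sketch of a bottom-s sketch used to estimate the Jaccard similarity between two samples. It is an illustrative assumption, not the SimkaMin or SPRISS implementation: the function names, the choice of BLAKE2b as the hash function, and the sketch size are all hypothetical.

```python
import hashlib


def distinct_kmers(reads, k):
    """Return the set of distinct k-mers found in a list of reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def stable_hash(kmer):
    """Map a k-mer to a reproducible 64-bit integer."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(kmers, size):
    """Keep the `size` smallest hash values of the distinct k-mers."""
    return set(sorted(stable_hash(kmer) for kmer in kmers)[:size])


def sketch_jaccard(sketch_a, sketch_b, size):
    """Estimate Jaccard similarity from two bottom-`size` sketches."""
    # The `size` smallest hashes of the union form a sketch of the union.
    merged = sorted(sketch_a | sketch_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in sketch_a and h in sketch_b)
    return shared / len(merged)


if __name__ == "__main__":
    kmers_a = distinct_kmers(["ACGTACGT"], k=4)
    kmers_b = distinct_kmers(["ACGTTTTT"], k=4)
    size = 100  # hypothetical sketch size, far larger than these toy inputs
    estimate = sketch_jaccard(
        bottom_sketch(kmers_a, size), bottom_sketch(kmers_b, size), size
    )
    print("Estimated Jaccard similarity:", estimate)
```

When the sketch size is at least the number of distinct k-mers in both samples, the estimate equals the exact Jaccard similarity (barring hash collisions); with smaller sketches it becomes an approximation whose error grows as the sketch shrinks.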
Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons
Background: Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content. Results: We developed a tool called Libra that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra uses a scalable Hadoop framework for massive metagenome comparisons, Cosine Similarity for calculating the distance using sequence composition and abundance while normalizing for sequencing depth, and a web-based implementation in iMicrobe (http://imicrobe.us) that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community. Conclusions: A comparison of Libra to equivalent tools using both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, reveals that methods commonly implemented to reduce compute time for large datasets, such as data reduction, read count normalization, and presence/absence distance metrics, greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes. National Science Foundation [1640775]. Open access journal.
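The role of cosine similarity in normalizing for sequencing depth can be illustrated with a small Python sketch. This is a single-machine toy under assumed names, not Libra's Hadoop implementation, and it applies no k-mer weighting of any kind.

```python
import math
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two k-mer abundance vectors."""
    dot = sum(n * counts_b.get(kmer, 0) for kmer, n in counts_a.items())
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    shallow = kmer_counts(["ACGTACGT"], k=4)
    deep = kmer_counts(["ACGTACGT"] * 3, k=4)  # same composition, 3x the depth
    other = kmer_counts(["ACGTTTTT"], k=4)
    print("Same composition, different depth:", cosine_similarity(shallow, deep))
    print("Different composition:", cosine_similarity(shallow, other))
```

Because cosine similarity depends only on the direction of the abundance vector, tripling the number of reads leaves the similarity at (approximately) 1, whereas a sample with a different k-mer composition scores much lower.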
AMAnD: an automated metagenome anomaly detection methodology utilizing DeepSVDD neural networks
The composition of metagenomic communities within the human body often reflects localized medical conditions such as upper respiratory diseases and gastrointestinal diseases. Fast and accurate computational tools to flag anomalous metagenomic samples from typical samples are desirable to understand different phenotypes, especially in contexts where repeated, long-duration temporal sampling is done. Here, we present Automated Metagenome Anomaly Detection (AMAnD), which utilizes two types of Deep Support Vector Data Description (DeepSVDD) models: one trained on taxonomic feature space output by the Pan-Genomics for Infectious Agents (PanGIA) taxonomy classifier and one trained on k-mer frequency counts. AMAnD's semi-supervised one-class approach makes no assumptions about what an anomaly may look like, allowing the flagging of potentially novel anomaly types. Three diverse datasets are profiled. The first dataset is hosted on the National Center for Biotechnology Information's (NCBI) Sequence Read Archive (SRA) and contains nasopharyngeal swabs from healthy and COVID-19-positive patients. The second dataset is also hosted on SRA and contains gut microbiome samples from normal controls and from patients with slow transit constipation (STC). AMAnD can learn a typical healthy nasopharyngeal or gut microbiome profile and reliably flag the anomalous COVID+ or STC samples in both feature spaces. The final dataset is a synthetic metagenome created by the Critical Assessment of Metagenome Annotation Simulator (CAMISIM). A control dataset of 50 well-characterized organisms was submitted to CAMISIM to generate 100 synthetic control class samples. The experimental conditions included 12 different spiked-in contaminants that are taxonomically similar to organisms present in the laboratory blank sample, ranging from one strain tree branch taxonomic distance away to one family tree branch taxonomic distance away.
This experiment was repeated in triplicate at three different coverage levels to probe the dependence on sample coverage. AMAnD was again able to flag the contaminant inserts as anomalous. AMAnD's assumption-free flagging of metagenomic anomalies, the real-time model training update potential of the deep learning approach, and the strong performance even with lightweight models of low sample cardinality make AMAnD well suited to a wide array of applied metagenomics biosurveillance use-cases, from environmental to clinical utility.
Unbiased K-mer Analysis Reveals Changes in Copy Number of Highly Repetitive Sequences During Maize Domestication and Improvement
Citation: Liu, S. Z., Zheng, J., Migeon, P., Ren, J., Hu, Y., He, C., . . . Wang, G. Y. (2017). Unbiased K-mer Analysis Reveals Changes in Copy Number of Highly Repetitive Sequences During Maize Domestication and Improvement. Scientific Reports, 7, 15.
https://doi.org/10.1038/srep42444
The major component of complex genomes is repetitive elements, which remain recalcitrant to characterization. Using maize as a model system, we analyzed whole genome shotgun (WGS) sequences for the two maize inbred lines B73 and Mo17 using k-mer analysis to quantify the differences between the two genomes. Significant differences were identified in highly repetitive sequences, including centromere, 45S ribosomal DNA (rDNA), knob, and telomere repeats. Genotype-specific 45S rDNA sequences were discovered. The B73 and Mo17 polymorphic k-mers were used to examine allele-specific expression of 45S rDNA in the hybrids. Although Mo17 contains a higher copy number than B73, equivalent levels of overall 45S rDNA expression indicate that transcriptional or post-transcriptional regulation mechanisms operate for the 45S rDNA in the hybrids. Using WGS sequences of B73xMo17 doubled haploids, genomic locations showing differential repetitive contents were genetically mapped, which displayed different organization of highly repetitive sequences in the two genomes. In an analysis of WGS sequences of HapMap2 lines, including the maize wild progenitor, landraces, and improved lines, decreases and increases in abundance of additional sets of k-mers associated with centromere, 45S rDNA, knob, and retrotransposons were found among groups, revealing global evolutionary trends of genomic repeats during maize domestication and improvement.
Resources for the analysis of bacterial and microbial genomic data with a focus on antibiotic resistance
Antibiotics are drugs which inhibit the growth of bacterial cells. Their
discovery was one of the most significant achievements in medicine:
it allowed the development of successful treatment options for severe
bacterial infections, which has helped to significantly increase our life
expectancy. However, bacteria have the ability to adapt to changing
environmental conditions through genetic modifications, and can,
therefore, become resistant to an antibiotic. Extensive use of antibiotics
promotes the development of antibiotic resistance and, since
some genetic factors can be exchanged between the cells, emergence
of new resistance mechanisms and their spread have become a serious
global problem.
Counteractive measures have been initiated, focusing on the different
factors contributing to the antibiotic resistance crisis. These
include the study of bacterial isolates and complete microbial communities
using whole-genome sequencing (WGS) data. In both cases,
there are specific challenges and requirements for different analytical
approaches. The goal of the present thesis was the implementation
of multiple resources which should facilitate further microbiological
studies, with a focus on bacteria and antibiotic resistance. The main
project, GEAR-base, included an analysis of WGS and resistance data
of around eleven thousand bacterial clinical isolates covering the main
human pathogens and antibiotics from different drug classes. The
dataset consisted of WGS data, antibiotic susceptibility profiles and
meta-information, along with additional taxonomic characterization
of a sample subset. The analysis of this isolate collection allowed
for the identification of bacterial species demonstrating increasing
resistance rates, the construction of species pan-genomes from the de novo
assembled genomes, and the linking of gene presence or absence to the
available antibiotic resistance profiles. The generated data and results
were made available through the online resource GEAR-base. This
resource provides access to the resistance information and genomic
data, and implements functionality to compare submitted genes or
genomes to the data included in the resource.
In microbial community studies, the metagenome obtained through
WGS is analyzed to determine its taxonomic composition. For this
task, genomic sequences are clustered, or binned, to represent sequences
belonging to specific organisms or closely-related organism
groups. BusyBee Web was developed to provide an automatic binning
pipeline using frequencies of k-mers (subsequences of length k)
and bootstrapped supervised clustering. It also includes further data
annotation, such as taxonomic classification of the input sequences,
presence of known resistance factors, and bin quality.
Plasmids, extra-chromosomal DNA molecules found in some bacteria,
play an important role in the spread of antibiotic resistance. Classifying
sequences from WGS data as chromosomal or plasmid-derived is challenging,
as demonstrated by an evaluation of four tools implementing three different
approaches. A reference dataset for detecting already known plasmids is
therefore desirable. To this end, an online resource for complete bacterial
plasmids (PLSDB) was implemented.
In summary, the herein described online resources represent valuable
datasets and/or tools for the analysis of microbial genomic data
and, especially, bacterial pathogens and antibiotic resistance.