192 research outputs found

Prediction of microbial communities for urban metagenomics using neural network approach.

Author: Jiang Jyun-Yu
Ju Chelsea J-T
Wang Wei
Zhou Guangyu
Publication venue: eScholarship, University of California
Publication date: 01/10/2019

BACKGROUND: Microbes are closely associated with human health and disease, especially in densely populated cities. It is essential to understand the microbial ecosystem of an urban environment so that cities can monitor the transmission of infectious diseases and detect potentially urgent threats. To this end, DNA samples have been collected and analyzed at subway stations in major cities. However, city-scale sampling at fine-grained geospatial resolution is expensive and laborious. In this paper, we introduce MetaMLAnn, a neural network-based approach that infers microbial communities at unsampled locations from information reflecting different factors, including subway line networks, sampling material types, and microbial composition patterns. RESULTS: We evaluate the effectiveness of MetaMLAnn on the public metagenomics dataset collected from multiple locations in the New York and Boston subway systems. The experimental results suggest that MetaMLAnn consistently outperforms five other conventional classifiers across taxonomic ranks. At the genus level, MetaMLAnn achieves F1 scores of 0.63 and 0.72 on the New York and Boston datasets, respectively. CONCLUSIONS: By exploiting heterogeneous features, MetaMLAnn captures the hidden interactions between microbial compositions and the urban environment, which enables precise predictions of microbial communities at unmeasured locations.
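
For readers unfamiliar with the F1 score reported in the abstract above, the following is a minimal sketch of how it could be computed for a single location when the prediction is a set of genera. The function name `f1_score` and the example genera are illustrative assumptions; the abstract does not say how MetaMLAnn aggregates scores across locations or taxa.

```python
def f1_score(predicted, observed):
    """F1 score between a predicted and an observed set of taxa.

    Hypothetical helper for illustration only; not taken from MetaMLAnn.
    """
    predicted, observed = set(predicted), set(observed)
    true_positives = len(predicted & observed)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # fraction of predictions that are correct
    recall = true_positives / len(observed)      # fraction of observed taxa that were predicted
    return 2 * precision * recall / (precision + recall)


# Made-up example: 2 of 3 predicted genera are among the 4 observed genera.
predicted_genera = {"Pseudomonas", "Acinetobacter", "Bacillus"}
observed_genera = {"Pseudomonas", "Bacillus", "Staphylococcus", "Enterobacter"}
score = f1_score(predicted_genera, observed_genera)
```

Here precision is 2/3 and recall is 1/2, so the harmonic mean is 4/7.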

eScholarship - University of California

Machine learning and data-parallel processing for viral metagenomics

Author: Bzhalava Zurab
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 03/04/2020

More than 2 million cancer cases around the world each year are caused by viruses. In addition, there are epidemiological indications that other cancer-associated viruses may also exist. However, the identification of highly divergent and as yet unknown viruses in human biospecimens is one of the biggest challenges in bioinformatics. Modern Next Generation Sequencing (NGS) technologies can be used to directly sequence biospecimens from clinical cohorts with unprecedented speed and depth. These technologies can generate billions of bases at rapidly decreasing cost, but current bioinformatics tools cannot process these massive datasets efficiently. Thus, the objective of this thesis was to facilitate both the detection of highly divergent viruses among generated sequences and the large-scale analysis of human metagenomic datasets. To re-analyze human sample-derived sequences that were classified as being of “unknown” origin by conventional alignment-based methods, we used a methodology based on profile Hidden Markov Models (HMM), which can capture evolutionary changes by using multiple sequence alignments. We thus identified 510 sequences that were classified as distantly related to viruses. Many of these sequences were homologs to large viruses such as Herpesviridae and Mimiviridae, but some of them were also related to small circular viruses such as Circoviridae. We found that bioinformatics analysis using viral profile HMM is capable of extending the classification of previously unknown sequences and consequently the detection of viruses in biospecimens from humans. Different organisms use synonymous codons differently to encode the same amino acids. To investigate whether codon usage bias could predict the presence of virus in metagenomic sequencing data originating from human samples, we trained Random Forest and Artificial Neural Networks based on Relative Synonymous Codon Usage (RSCU) frequency.
Our analysis showed that machine learning techniques based on RSCU could identify putative viral sequences with an area under the ROC curve of 0.79 and provide important information for taxonomic classification. For identification of viral genomes among raw metagenomic sequences, we developed the tool ViraMiner, a deep learning-based method which uses Convolutional Neural Networks with two convolutional branches. Using 300 base-pair length sequences, ViraMiner achieved an area under the ROC curve of 0.923, a considerable improvement over previous machine learning methods for virus sequence classification. The proposed architecture, to the best of our knowledge, is the first deep learning tool which can detect viral genomes on raw metagenomic sequences originating from a variety of human samples. To enable large-scale analysis of massive metagenomic sequencing data, we used Apache Hadoop and Apache Spark to develop ViraPipe, a scalable parallel bioinformatics pipeline for viral metagenomics. ViraPipe (executed on 23 nodes) was 11 times faster in the metagenome analysis than the sequential pipeline (executed on a single node). The new distributed workflow contains several standard bioinformatics tools and can scale to terabytes of data by accessing more computing power from the nodes. To analyze terabytes of RNA-seq data originating from head and neck squamous cell carcinoma samples, we used our parallel bioinformatics pipeline ViraPipe and the most recent version of the HPV sequence database. We detected transcription of HPV viral oncogenes in 92/500 cancers. HPV 16 was the most important HPV type, followed by HPV 33 as the second most common infection. If these cancers are indeed caused by HPV, we estimated that vaccination might prevent about 36 000 head and neck cancer cases in the United States every year.
In conclusion, the work in this thesis improves the prospects for biomedical researchers to classify the sequence contents of ultra-deep datasets, conduct large-scale analysis of metagenome studies, and detect the presence of viral genomes in human biospecimens. Hopefully, this work will contribute to our understanding of the biodiversity of viruses in humans, which in turn can help explore infectious causes of human disease.
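
The Relative Synonymous Codon Usage (RSCU) feature mentioned in the abstract above can be sketched as follows. RSCU for a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function name `rscu` is hypothetical, the standard genetic code is assumed, the sequence is assumed to be read in frame from its first base, and stop codons are treated as one synonymous group.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' marks stop codons).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(sequence):
    """Return a dict mapping each of the 64 codons to its RSCU value.

    Assumes the sequence is in frame starting at position 0; codons that
    contain ambiguous bases (e.g. N) are ignored.
    """
    seq = sequence.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    values = {}
    for aa in set(AMINO_ACIDS):
        synonyms = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in synonyms)
        for c in synonyms:
            # observed count / (total count for the amino acid / number of synonyms)
            values[c] = counts[c] * len(synonyms) / total if total else 0.0
    return values


# Made-up example: four leucine codons, three CTG and one CTA.
example = rscu("CTGCTGCTGCTA")
```

Leucine has six synonymous codons, so in this example CTG gets 3 * 6 / 4 = 4.5 and CTA gets 1 * 6 / 4 = 1.5. The resulting 64-value vector is the kind of fixed-length input a Random Forest or neural network classifier could consume.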

Publications from Karolinska Institutet

Advanced Methods for Real-time Metagenomic Analysis of Nanopore Sequencing Data

Author: Ulrich Jens-Uwe
Publication date: 01/01/2023

Whole shotgun metagenomics sequencing allows researchers to retrieve information about all organisms in a complex sample. This method enables microbiologists to detect pathogens in clinical samples, study the microbial diversity in various environments, and detect abundance differences of certain microbes under different living conditions. The emergence of nanopore sequencing has offered many new possibilities for clinical and environmental microbiologists. In particular, the portability of the small nanopore sequencing devices and the ability to selectively sequence only DNA from organisms of interest are expected to make a significant contribution to the field. However, both options require memory-efficient methods that perform real-time data analysis on commodity hardware such as ordinary laptops. In this thesis, I present new methods for real-time analysis of nanopore sequencing data in a metagenomic context. These methods are based on optimized algorithmic approaches that query the sequenced data against large sets of reference sequences. The main goal of these contributions is to improve the sequencing and analysis of underrepresented organisms in complex metagenomic samples and to enable this analysis in low-resource settings in the field. First, I introduce ReadBouncer as a new tool for nanopore adaptive sampling, which can reject uninteresting DNA molecules during the sequencing process. ReadBouncer improves read classification compared to other adaptive sampling tools and requires less memory. These improvements enable a higher enrichment of underrepresented sequences while performing adaptive sampling in the field. I further show that, besides host sequence removal and enrichment of low-abundance microbes, adaptive sampling can enrich underrepresented plasmid sequences in bacterial samples. These plasmids play a crucial role in the dissemination of antibiotic resistance genes.
However, their characterization requires expensive and time-consuming lab protocols. I describe how adaptive sampling can be used as a cheap method for the enrichment of plasmids, which can make a significant contribution to the point-of-care sequencing of bacterial pathogens. Finally, I introduce a novel memory- and space-efficient algorithm for real-time taxonomic profiling of nanopore reads, implemented in Taxor. It improves the taxonomic classification of nanopore reads compared to other taxonomic profiling tools and substantially reduces the memory footprint. The resulting database index for thousands of microbial species is small enough to fit into the memory of a small laptop, enabling real-time metagenomics analysis of nanopore sequencing data with large reference databases in the field.

Institutional Repository of the Freie Universität Berlin

A probabilistic model to recover individual genomes from metagenomes

Author: Dröge Johannes
McHardy Alice C.
Schönhuth Alexander
Publication venue: 'PeerJ'
Publication date: 01/01/2017

Dröge J, Schönhuth A, McHardy AC. A probabilistic model to recover individual genomes from metagenomes. PeerJ Computer Science. 2017;3: e117. Shotgun metagenomics of microbial communities reveals information about strains of relevance for applications in medicine, biotechnology and ecology. Recovering their genomes is a crucial but very challenging step due to the complexity of the underlying biological system and technical factors. Microbial communities are heterogeneous, often with hundreds of genomes present that derive from different species or strains, all at varying abundances and with different degrees of similarity to each other and to reference data. We present a versatile probabilistic model for genome recovery and analysis, which aggregates three types of information that are commonly used for genome recovery from metagenomes. As potential applications we showcase metagenome contig classification, genome sample enrichment and genome bin comparisons. The open source implementation MGLEX is available via the Python Package Index and on GitHub and can be embedded into metagenome analysis workflows and programs.

Publications at Bielefeld University

A probabilistic model to recover individual genomes from metagenomes

Author: Dröge J. (Johannes)
McHardy A.C. (Alice)
Schönhuth A. (Alexander)
Publication venue: 'PeerJ'
Publication date: 09/12/2016

Shotgun metagenomics of microbial communities reveal information about strains of relevance for applications in medicine, biotechnology and ecology. Recovering their genomes is a crucial but very challenging step due to the complexity of the underlying biological system and technical factors. Microbial communities are heterogeneous, with oftentimes hundreds of present genomes deriving from different speci

Helmholtz Zentrum für Infektionsforschung Repository

CWI's Institutional Repository

Directory of Open Access Journals

SeqDeχ : a sequence deconvolution tool for genome separation of endosymbionts from mixed sequencing samples

Author: A. Chiodi
C. Bandi
D. Sassera
F. Comandatore
G. Petroni
M. Brilli
Publication venue: 'Frontiers Media SA'
Publication date: 01/01/2019

In recent years, the advent of NGS technology has made genome sequencing much cheaper than in the past; the high parallelization capability and the possibility of sequencing more than one organism at once have opened the door to processing whole symbiotic consortia. However, this approach requires the development of specific bioinformatics tools able to analyze these data. In this work, we describe SeqDex, a tool that starts from a preliminary assembly obtained by sequencing a mixture of DNA from different organisms and identifies the contigs coming from one organism of interest. SeqDex is a fully automated machine learning-based tool that exploits partial taxonomic affiliations and compositional analysis to predict the taxonomic affiliations of contigs in an assembly. In the literature, there are few methods able to deconvolve host-symbiont datasets, and most of them rely heavily on user curation and are therefore time consuming. The problem has strong similarities with metagenomic studies, where mixed samples are sequenced and the bioinformatics challenge is to separate contigs on the basis of their source organism; however, in symbiotic systems, additional information can be exploited to improve the output. To assess the ability of SeqDex to deconvolve host-symbiont datasets, we compared it to state-of-the-art methods for metagenomic binning and for host-symbiont deconvolution on three case studies. The results show the good performance of the presented tool which, together with its ease of use and customization potential, makes SeqDex a useful tool for rapid identification of endosymbiont sequences.

AIR Universita degli studi di Milano

Archivio Istituzionale della Ricerca - Università degli Studi di Pavia

Archivio della Ricerca - Università di Pisa

Supplemental Information 1: Supplementary material.

Author: Albertsen
Alneberg
Baran
Brady
Chatterji
Dröge
Dröge
Goodwin
Gregor
Hagen
Huang
Imelfort
Kang
Karlin
Kim
Kislyuk
Lander
Langmead
Lu
McHardy
Nielsen
Patil
Przyborowski
Rosen
Schloss
Sczyrba
Teeling
Tyson
Van der Walt
Wang
Wood
Wu
Publication venue: 'PeerJ'

Crossref

Better quality score compression through sequence-based quality smoothing

Author: Comin Matteo
Shibuya Yoshihiro
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 27/06/2018

Current NGS techniques are becoming exponentially cheaper. As a result, genomic data are growing exponentially, unfortunately without a matching exponential growth in storage, which makes compression necessary. Most of the entropy of NGS data lies in the quality values associated with each read. Those values are often more diversified than necessary. Because of that, many tools, such as Quartz or GeneCodeq, try to change (smooth) quality scores in order to improve compressibility without altering the important information they carry for downstream analyses such as SNP calling.

Archivio istituzionale della ricerca - Università di Padova

Taxonomic and environmental annotation of bacterial 16S rRNA gene sequences via Shannon entropy and database metadata terms

Author: Ijaz Ali Z.
Publication venue: 'American Psychological Association (APA)'
Publication date: 01/01/2017

Microbial ecology seeks to describe the diversity and distribution of microorganisms in various habitats within the context of environmental variables. High throughput sequencing has greatly boosted the number and scope of projects aiming to study and analyse these organisms, with ever-increasing amounts of data being generated. Amplicon-based taxonomic analysis, which determines the presence of microbial taxa in different environments on the basis of marker gene annotations, often uses percentage identity as the main metric to determine sequence similarity against databases. This data is then used to study the distribution of biodiversity as well as the response of microbial communities to stressors. However, the 16S rRNA gene displays varying degrees of sequence conservation along its length and is therefore prone to give different results depending on the part of the 16S rRNA gene used for sequencing and analysis. Furthermore, sequence alignment is primarily performed using the popular BLAST sequence alignment tool, which incurs a great computational performance penalty, although newer, more efficient tools are being developed. A new approach that is faster and more accurate is critically needed to process the avalanche of data. Additionally, repositories of environmental metadata can provide contextual information to sequence annotations, potentially enhancing analysis if they can be incorporated into bioinformatics pipelines. The overarching aim of this work was to enhance the taxonomic annotation of bacterial sequences by developing a weighted scheme that utilizes inherent evolutionary conservation in the bacterial 16S rRNA gene sequences and by adding contextual, environmental information pertaining to these sequences in a systematic fashion.
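
The title of this record names Shannon entropy as the measure of conservation along the 16S rRNA gene. As a rough illustration, per-position entropy over a set of aligned sequences could be computed as below; low entropy indicates a conserved position and high entropy a variable one. The function names and the toy alignment are assumptions made here for illustration, and the abstract does not describe how the thesis turns such values into weights.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the nucleotides in one alignment column.

    Gaps and ambiguous characters are ignored; an empty column scores 0.0.
    """
    symbols = [s for s in column.upper() if s in "ACGTU"]
    if not symbols:
        return 0.0
    n = len(symbols)
    return -sum((k / n) * math.log2(k / n) for k in Counter(symbols).values())


def alignment_entropy(aligned_sequences):
    """Entropy per column for equal-length, already aligned sequences."""
    return [column_entropy("".join(col)) for col in zip(*aligned_sequences)]


# Toy alignment (made up): columns 1 and 2 are fully conserved.
toy_alignment = ["ACGT", "ACGA", "ACCA", "ACTT"]
entropies = alignment_entropy(toy_alignment)
```

For the toy alignment, the first two columns have entropy 0, the third column (G, G, C, T) has 1.5 bits, and the fourth (T, A, A, T) has 1 bit.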

Western Sydney ResearchDirect