193 research outputs found

    Nonlinear machine learning pattern recognition and bacteria-metabolite multilayer network analysis of perturbed gastric microbiome

    Get PDF
    The stomach is inhabited by diverse microbial communities, co-existing in a dynamic balance. Long-term use of drugs such as proton pump inhibitors (PPIs), or bacterial infection such as Helicobacter pylori, cause significant microbial alterations. Yet, studies revealing how the commensal bacteria re-organize, due to these perturbations of the gastric environment, are in early phase and rely principally on linear techniques for multivariate analysis. Here we disclose the importance of complementing linear dimensionality reduction techniques with nonlinear ones to unveil hidden patterns that remain unseen by linear embedding. Then, we prove the advantages to complete multivariate pattern analysis with differential network analysis, to reveal mechanisms of bacterial network re-organizations which emerge from perturbations induced by a medical treatment (PPIs) or an infectious state (H. pylori). Finally, we show how to build bacteria-metabolite multilayer networks that can deepen our understanding of the metabolite pathways significantly associated to the perturbed microbial communities

    Resources for the analysis of bacterial and microbial genomic data with a focus on antibiotic resistance

    Get PDF
    Antibiotics are drugs which inhibit the growth of bacterial cells. Their discovery was one of the most significant achievements in medicine: it allowed the development of successful treatment options for severe bacterial infections, which has helped to significantly increase our life expectancy. However, bacteria have the ability to adapt to changing environmental conditions through genetic modifications, and can, therefore, become resistant to an antibiotic. Extensive use of antibiotics promotes the development of antibiotic resistance and, since some genetic factors can be exchanged between the cells, emergence of new resistance mechanisms and their spread have become a serious global problem. Counteractive measures have been initiated, focusing on the different factors contributing to the antibiotic resistance crisis. These include the study of bacterial isolates and complete microbial communities using whole-genome sequencing (WGS) data. In both cases, there are specific challenges and requirements for different analytical approaches. The goal of the present thesis was the implementation of multiple resources which should facilitate further microbiological studies, with a focus on bacteria and antibiotic resistance. The main project, GEAR-base, included an analysis of WGS and resistance data of around eleven thousand bacterial clinical isolates covering the main human pathogens and antibiotics from different drug classes. The dataset consisted of WGS data, antibiotic susceptibility profiles and meta-information, along with additional taxonomic characterization of a sample subset. The analysis of this isolate collection allowed for the identification of bacterial species demonstrating increasing resistance rates, to construct species pan-genomes from the de novo assembled genomes, and to link gene presence or absence to the available antibiotic resistance profiles. The generated data and results were made available through the online resource GEAR-base. This resource provides access to the resistance information and genomic data, and implements functionality to compare submitted genes or genomes to the data included in the resource. In microbial community studies, the metagenome obtained through WGS is analyzed to determine its taxonomic composition. For this task, genomic sequences are clustered, or binned, to represent sequences belonging to specific organisms or closely-related organism groups. BusyBee Web was developed to provide an automatic binning pipeline using frequencies of k-mers (subsequences of length k) and bootstrapped supervised clustering. It also includes further data annotation, such as taxonomic classification of the input sequences, presence of know resistance factors, and bin quality. Plasmids, extra-chromosomal DNA molecules found in some bacteria, play an important role in antibiotic resistance spread. As the classification of sequences from WGS data as chromosomal or plasmid-derived is challenging, demonstrated by evaluating four tools implementing three different approaches, having a reference dataset to detect the plasmids which are already known is therefore desirable. To this end, an online resource for complete bacterial plasmids (PLSDB) was implemented. In summary, the herein described online resources represent valuable datasets and/or tools for the analysis of microbial genomic data and, especially, bacterial pathogens and antibiotic resistance.Antibiotika sind Medikamente, die das Wachstum von Bakterienzellen hemmen. Ihre Entdeckung war eine der bedeutendsten Leistungen der Medizin: Es erlaubte die Entwicklung von erfolgreichen Behandlungsmöglichkeiten von schwerwiegenden bakteriellen Infektionen, was geholfen hat, unsere Lebenserwartung zu erhöhen. Allerdings sind Bakterien in der Lage sich den wechselnden Umweltbedingungen anzupassen und können dadurch resistent gegen ein Antibiotikum werden. Der extensive Gebrauch von Antibiotika fördert die Entwicklung von Antibiotikaresistenzen und, da einige genetische Faktoren zwischen den Zellen ausgetauscht werden können, sind das Auftauchen von neuen Resistenzmechanismen und deren Verbreitung zu einem seriösen globalen Problem geworden. Gegenmaßnahmen wurden ergriffen, die sich auf die verschiedenen Faktoren fokussieren, die zur Antibiotikaresistenzkrise beitragen. Diese umfassen Studien von bakteriellen Isolaten und ganzen Mikrobengemeinschaften mithilfe von Gesamt-Genom-Sequenzierung (GGS). In beiden Fällen gibt es spezifische Herausforderungen und Bedürfnisse für verschiedene analytische Methoden. Das Ziel dieser Dissertation war die Implementierung von mehreren Ressourcen, die weitere mikrobielle Studien erleichtern sollen und einen Fokus auf Bakterien und Antibiotikaresistenz haben. Das Hauptprojekt, GEAR-base, beinhaltete eine Analyse von GGS- und Resistenzdaten von ungefähr elftausend klinischen Bakterienisolaten und umfasste die wichtigen menschlichen Pathogene und Antibiotika aus verschiedenen Medikamentenklassen. Neben den GGS-Daten, Empfindlichkeitsprofilen für die Antibiotika und Metainformation, beinhaltete der Datensatz zusätzliche taxonomische Charakterisierung von einer Teilmenge der Proben. Die Analyse dieser Sammlung an Isolaten erlaubte die Identifizierung von Spezies mit ansteigenden Resistenzraten, die Konstruktion von den Spezies-Pan-Genomen aus den de novo assemblierten Genomen und die Verknüpfung vom Vorhandensein oder Fehlen von Genen mit den Antibiotikaresistenzprofilen. Die generierten Daten und Ergebnisse wurden durch die Online-Ressource GEAR-base bereitgestellt. Diese Ressource bietet Zugang zur Resistenzinformation und den gesammelten genomischen Daten und implementiert Funktionen zum Vergleich von hochgeladenen Genen oder Genomen zu den Daten, die in der Ressource enthalten sind. In den Studien von Mikrobengemeinschaften wird das durch GGS erhaltene Metagenom analysiert, um seine taxonomische Zusammensetzung zu bestimmen. Dafür werden die genomischen Sequenzen in sogenannte Bins gruppiert (Binning), die die Zugehörigkeit von den Sequenzen zu bestimmten Organismen oder zu Gruppen von nah verwandten Organismen repräsentieren. BusyBee Web wurde entwickelt, um eine automatische Binning-Pipeline anzubieten, die die Häufigkeitsprofile von k-meren (Teilsequenzen der Länge k) und eine auf dem Bootstrap-Verfahren basierte Methode für die Gruppierung der Sequenzen nutzt. Zusätzlich wird eine Annotation der Daten durchgeführt, wie die taxonomische Klassifizierung der hochgeladenen Sequenzen, das Vorhandensein von bekannten Resistenzfaktoren und die Qualität der Bins. Plasmide, DNA-Moleküle, die zusätzlich zum Chromosom in einigen Bakterien vorhanden sind, spielen eine wichtige Rolle in der Verbreitung von Antibiotikaresistenzen. Die Klassifizierung von Sequenzen aus der GGS als von einem Chromosom oder einem Plasmid stammend ist herausfordernd, wie es in einer Evaluation von vier Tools, die drei verschiedene Ansätze implementieren, demonstriert wurde. Deshalb ist das Vorhandensein von einem Referenzdatensatz, um schon bekannte Plasmide zu detektieren, sehr wünschenswert. Zu diesem Zweck wurde eine Online-Ressource von vollständigen bakteriellen Plasmiden implementiert (PLSDB). Die hier beschriebenen Online-Ressourcen stellen nützliche Datensätze und/oder Werkzeuge dar, die für die Analyse von mikrobiellen genomischen Daten, insbesondere von bakteriellen Pathogenen und Antibiotikaresistenzen, eingesetzt werden können

    Doctor of Philosophy

    Get PDF
    dissertationWith the ever-increasing amount of available computing resources and sensing devices, a wide variety of high-dimensional datasets are being produced in numerous fields. The complexity and increasing popularity of these data have led to new challenges and opportunities in visualization. Since most display devices are limited to communication through two-dimensional (2D) images, many visualization methods rely on 2D projections to express high-dimensional information. Such a reduction of dimension leads to an explosion in the number of 2D representations required to visualize high-dimensional spaces, each giving a glimpse of the high-dimensional information. As a result, one of the most important challenges in visualizing high-dimensional datasets is the automatic filtration and summarization of the large exploration space consisting of all 2D projections. In this dissertation, a new type of algorithm is introduced to reduce the exploration space that identifies a small set of projections that capture the intrinsic structure of high-dimensional data. In addition, a general framework for summarizing the structure of quality measures in the space of all linear 2D projections is presented. However, identifying the representative or informative projections is only part of the challenge. Due to the high-dimensional nature of these datasets, obtaining insights and arriving at conclusions based solely on 2D representations are limited and prone to error. How to interpret the inaccuracies and resolve the ambiguity in the 2D projections is the other half of the puzzle. This dissertation introduces projection distortion error measures and interactive manipulation schemes that allow the understanding of high-dimensional structures via data manipulation in 2D projections

    GraphKKE: graph Kernel Koopman embedding for human microbiome analysis

    Get PDF
    More and more diseases have been found to be strongly correlated with disturbances in the microbiome constitution, e.g., obesity, diabetes, or some cancer types. Thanks to modern high-throughput omics technologies, it becomes possible to directly analyze human microbiome and its influence on the health status. Microbial communities are monitored over long periods of time and the associations between their members are explored. These relationships can be described by a time-evolving graph. In order to understand responses of the microbial community members to a distinct range of perturbations such as antibiotics exposure or diseases and general dynamical properties, the time-evolving graph of the human microbial communities has to be analyzed. This becomes especially challenging due to dozens of complex interactions among microbes and metastable dynamics. The key to solving this problem is the representation of the time-evolving graphs as fixed-length feature vectors preserving the original dynamics. We propose a method for learning the embedding of the time-evolving graph that is based on the spectral analysis of transfer operators and graph kernels. We demonstrate that our method can capture temporary changes in the time-evolving graph on both synthetic data and real-world data. Our experiments demonstrate the efficacy of the method. Furthermore, we show that our method can be applied to human microbiome data to study dynamic processes

    Pattern Recognition for Complex Heterogeneous Time-Series Data: An Analysis of Microbial Community Dynamics

    Get PDF
    Microbial life is the most wide-spread and the most abundant life form on earth. They exist in complex and diverse communities in environments from the deep ocean trenches to Himalayan snowfields. Microbial life is essential for other forms of life as well. Scientific studies of microbial activity include diverse communities such as plant root microbiome, insect gut microbiome and human skin microbiome. In the human body alone, the number of microbial life forms supersedes the number of human body cells. Hence it is essential to understand microbial community dynamics. With the advent of 16S rRNA sequencing, we have access to a plethora of data on the microbiome, warrantying a shift from in-vitro analysis to in-silico analysis. This thesis focuses on challenges in analysing microbial community dynamics through complex, heterogeneous and temporal data. Firstly, we look at the mathematical modelling of microbial community dynamics and inference of microbial interaction networks by analysing longitudinal sequencing data. We look at the problem with the aims of minimising the assumptions involved and improving the accuracy of the inferred interaction networks. Secondly, we explore the temporally dynamic nature of microbial interaction networks. We look at the fallacies of static microbial interaction networks and approaches suitable for modelling temporally dynamic microbial interaction networks. Thirdly, we study multiple temporal microbial datasets from similar environments to understand macro and micro patterns apparent in these communities. We explore the individuality and conformity of microbial communities through visualisation techniques. Finally, we explore the possibility and challenges in representing heterogeneous microbial temporal activity in unique signatures. In summary, in this work, we have explored various aspects of complex, heterogeneous and time-series data through microbial temporal abundance datasets and have enhanced the knowledge about these complex and diverse communities through a pattern recognition approach

    A network perspective on the ecology of gut microbiota and progression of type 2 diabetes: Linkages to keystone taxa in a Mexican cohort

    Get PDF
    IntroductionThe human gut microbiota (GM) is a dynamic system which ecological interactions among the community members affect the host metabolism. Understanding the principles that rule the bidirectional communication between GM and its host, is one of the most valuable enterprise for uncovering how bacterial ecology influences the clinical variables in the host.MethodsHere, we used SparCC to infer association networks in 16S rRNA gene amplicon data from the GM of a cohort of Mexican patients with type 2 diabetes (T2D) in different stages: NG (normoglycemic), IFG (impaired fasting glucose), IGT (impaired glucose tolerance), IFG + IGT (impaired fasting glucose plus impaired glucose tolerance), T2D and T2D treated (T2D with a 5-year ongoing treatment).ResultsBy exploring the network topology from the different stages of T2D, we observed that, as the disease progress, the networks lose the association between bacteria. It suggests that the microbial community becomes highly sensitive to perturbations in individuals with T2D. With the purpose to identify those genera that guide this transition, we computationally found keystone taxa (driver nodes) and core genera for a Mexican T2D cohort. Altogether, we suggest a set of genera driving the progress of the T2D in a Mexican cohort, among them Ruminococcaceae NK4A214 group, Ruminococcaceae UCG-010, Ruminococcaceae UCG-002, Ruminococcaceae UCG-005, Alistipes, Anaerostipes, and Terrisporobacter.DiscussionBased on a network approach, this study suggests a set of genera that can serve as a potential biomarker to distinguish the distinct degree of advances in T2D for a Mexican cohort of patients. Beyond limiting our conclusion to one population, we present a computational pipeline to link ecological networks and clinical stages in T2D, and desirable aim to advance in the field of precision medicine

    Efficient Grouping Methods for the Annotation and Sorting of Single Cells

    Get PDF
    Lux M. Efficient Grouping Methods for the Annotation and Sorting of Single Cells. Bielefeld: Universität Bielefeld; 2018.Insights into large-scale biological data require computational methods which reliably and efficiently recognize latent structures and patterns. In many cases, it is necessary to find homogeneous subgroups of the data in order to solve complex problems and to enable the discovery of novel knowledge. Here, clustering and classification techniques are commonly employed in all fields of research. Confounding factors often complicate data analysis and require a thorough choice of methods and parameters. This thesis is focused on methods around single-cell research - I developed, evaluated, compared and adapted grouping methods for open problems from three different technologies: First, metagenomics is typically confronted with the problem of detecting clusters representing involved species in a given sample (binning). Albeit powerful technologies exist for the identification of known taxa, de novo binning is still in its infancy. In this context, I evaluated optimal choices of techniques and parameters regarding the integration of modern machine learning methods, such as dimensionality reduction and clustering, resulting in an automated binning pipeline. Second, in single-cell sequencing, a major problem is given by sample contamination with foreign genomic material. From a computational point of view, in both metagenomics and single-cell genome assemblies, genomes can be represented as clusters. Contrary to metagenomics, the clustering task for single cells is a fundamentally different one. Here, I developed a methodology to automatically detect contamination and estimate confidences in single-cell genome assemblies. A third challenge can be seen in the field of flow cytometry. Here, the precise identification of cell populations in a sample is crucial and requires manual, tedious, and possibly biased cell annotation. Automated methods exist, however they require difficult fine-tuning of hyper-parameters to obtain the best results. To overcome this limitation, I developed a semi-supervised tool for cell population identification, with few very robust parameters, being fast, accurate and interpretable at the same time

    Models and Algorithms for Metagenomics Analysis and Plasmid Classification

    Get PDF
    Metagenomics studies have provided key insights into the composition and structure of microbial communities found in different environments. Among the techniques used to analyze metagenomics data, binning is considered a crucial step to characterize the different species of microorganisms present. Metagenomics binning can be extended further towards determination of plasmids and chromosomes to study environmental adaptations. The field of metagenomics binning is mostly done on contigs from genome assemblies. Metagenomics studies are mostly performed with short read sequencing. Direct binning of short reads suffers from insufficient species-specific signal, thus they are usually assembled into longer contigs before binning. Therefore, the emergence of long-read sequencing technologies gives us the opportunity to study the binning of long reads directly, where such studies have been carried out in limited numbers. Firstly, this thesis presents the challenges in binning long reads compared to contigs assembled from short reads. One key challenge in binning long reads is the absence of coverage information, which is typically obtained from assembly. Moreover, the scale of long reads compared to contigs demands more computationally efficient methods for binning. Therefore, we develop MetaBCC-LR to address these challenges and perform metagenomics binning of long reads. We introduce the concept of k-mer coverage histogram to estimate the coverage of long reads without alignments and use a sampling strategy to handle the immense number of long reads. Since MetaBCC-LR is limited by the use of coverage and composition information in a stepwise manner, we further develop LRBinner to combine the coverage and composition information. This enables LRBinner to effectively combine coverage and composition features and use them simultaneously for binning. LRBinner also implemented a novel clustering algorithm that performs better on binning long-read datasets from species with varying abundances. Moreover, we propose OBLR to improve the coverage estimation of long reads via a read-overlap graph instead of k-mers. The read-overlap graph also enables OBLR to perform probabilistic sampling to better recover low-abundant species. Secondly, we investigate opportunities to improve plasmid detection which is considered as a binary plasmid-chromosome classification problem. We introduce PlasLR that enables adaptation of plasmid prediction tools designed for contigs to classify long and error-prone reads. We also develop GraphPlas that uses the assembly graph to improve plasmid classification results for assembled contigs. In summary, this thesis presents the progressive development of models and algorithms for metagenomics binning and plasmid classification
    corecore