146 research outputs found

    Variational inference for sparse network reconstruction from count data

    In multivariate statistics, the question of finding direct interactions can be formulated as a problem of network inference, or network reconstruction, for which the Gaussian graphical model (GGM) provides a canonical framework. Unfortunately, the Gaussian assumption does not apply to count data, which are encountered in domains such as genomics, social sciences or ecology. To circumvent this limitation, state-of-the-art approaches use two-step strategies that first transform counts to pseudo-Gaussian observations and then apply a (partial) correlation-based approach from the abundant literature on GGM inference. We adopt a different stance by relying on a latent model in which we directly model counts by means of Poisson distributions that are conditional on latent (hidden) correlated Gaussian variables. In this multivariate Poisson lognormal model, the dependency structure is completely captured by the latent layer. This parametric model makes it possible to account for the effects of covariates on the counts. To perform network inference, we add a sparsity-inducing constraint on the inverse covariance matrix of the latent Gaussian vector. Unlike in the usual Gaussian setting, the penalized likelihood is generally not tractable, and we resort instead to a variational approach for approximate likelihood maximization. The corresponding optimization problem is solved by alternating a gradient ascent on the variational parameters and a graphical-Lasso step on the covariance matrix. We show that our approach is highly competitive with the existing methods on simulations inspired by microbiological data. We then illustrate on three different data sets how accounting for sampling efforts via offsets and integrating external covariates (which is rarely done in the existing literature) drastically changes the topology of the inferred network.
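
The model described in this abstract can be written out as a short sketch. The notation below is an assumption made for illustration and does not come from the listing: $Y_{ij}$ is the count of variable $j$ in sample $i$, $o_{ij}$ an offset for sampling effort, $x_i$ the covariates, $\beta_j$ regression coefficients, $\Omega$ the latent precision matrix, $\psi$ the variational parameters, $J$ a variational lower bound on the log-likelihood, and $\lambda$ the sparsity penalty.

```latex
% Latent layer: correlated Gaussian vector with precision matrix \Omega
Z_i \sim \mathcal{N}\!\left(0,\ \Omega^{-1}\right)

% Observation layer: counts are conditionally Poisson given the latent layer
Y_{ij} \mid Z_{ij} \sim \mathcal{P}\!\left(\exp\!\left(o_{ij} + x_i^{\top}\beta_j + Z_{ij}\right)\right)

% Sparse network inference: penalized variational lower bound
\max_{\beta,\ \Omega,\ \psi}\ \ J(\beta, \Omega, \psi) \;-\; \lambda \sum_{j \neq k} \lvert \Omega_{jk} \rvert,
\qquad J(\beta, \Omega, \psi) \le \log p(Y \mid \beta, \Omega)
```

Under this reading, the alternating scheme mentioned in the abstract updates $\psi$ (and $\beta$) by gradient ascent with $\Omega$ fixed, then updates $\Omega$ by a graphical-Lasso step with the other parameters fixed; nonzero off-diagonal entries of $\Omega$ are the inferred edges.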

    APPLICATIONS OF MACHINE LEARNING IN MICROBIAL FORENSICS

    Microbial ecosystems are complex, with hundreds of members interacting with each other and the environment. The intricate and hidden behaviors underlying these interactions make research questions challenging, but they can be better understood through machine learning. However, most machine learning used in microbiome work is a black-box form of investigation, where accurate predictions can be made, but the inner logic driving the predictions is hidden behind nontransparent layers of complexity. Accordingly, the goal of this dissertation is to provide an interpretable and in-depth machine learning approach to investigate microbial biogeography and to use micro-organisms as novel tools to detect geospatial location and object provenance (previous known origin). These contributions are followed by a framework that allows extraction of interpretable metrics and actionable insights from microbiome-based machine learning models. The first part of this work provides an overview of machine learning in the context of microbial ecology, human microbiome studies and environmental monitoring, outlining common practice and shortcomings. The second part presents a field study demonstrating how machine learning can be used to characterize patterns in microbial biogeography globally, using microbes from ports located around the world. The third part studies the persistence and stability of natural microbial communities from the environment that have colonized objects (vessels) and stay attached as they travel through the water. Finally, the last part of this dissertation provides a robust framework for investigating the microbiome. This framework provides a reasonable understanding of the data being used in microbiome-based machine learning and allows researchers to better understand and interpret results.
Together, these extensive experiments support an understanding of how to carry an in-silico design that characterizes candidate microbial biomarkers from real-world settings through to a rapid, field-deployable diagnostic assay. The work presented here provides evidence for the use of microbial forensics as a toolkit to expand our basic understanding of microbial biogeography, microbial community stability and persistence in complex systems, and the ability of machine learning to be applied to downstream molecular detection platforms for rapid and accurate detection.

    Efficient Grouping Methods for the Annotation and Sorting of Single Cells

    Lux M. Efficient Grouping Methods for the Annotation and Sorting of Single Cells. Bielefeld: Universität Bielefeld; 2018. Insights into large-scale biological data require computational methods which reliably and efficiently recognize latent structures and patterns. In many cases, it is necessary to find homogeneous subgroups of the data in order to solve complex problems and to enable the discovery of novel knowledge. Here, clustering and classification techniques are commonly employed in all fields of research. Confounding factors often complicate data analysis and require a careful choice of methods and parameters. This thesis is focused on methods around single-cell research: I developed, evaluated, compared and adapted grouping methods for open problems from three different technologies. First, metagenomics is typically confronted with the problem of detecting clusters representing the species involved in a given sample (binning). Although powerful technologies exist for the identification of known taxa, de novo binning is still in its infancy. In this context, I evaluated optimal choices of techniques and parameters regarding the integration of modern machine learning methods, such as dimensionality reduction and clustering, resulting in an automated binning pipeline. Second, in single-cell sequencing, a major problem is sample contamination with foreign genomic material. From a computational point of view, in both metagenomics and single-cell genome assemblies, genomes can be represented as clusters. In contrast to metagenomics, however, the clustering task for single cells is a fundamentally different one. Here, I developed a methodology to automatically detect contamination and estimate confidences in single-cell genome assemblies. A third challenge arises in the field of flow cytometry. Here, the precise identification of cell populations in a sample is crucial and requires manual, tedious, and possibly biased cell annotation.
Automated methods exist; however, they require difficult fine-tuning of hyper-parameters to obtain the best results. To overcome this limitation, I developed a semi-supervised tool for cell population identification that has only a few, very robust parameters and is fast, accurate and interpretable at the same time.

    Intrinsic Dimension Estimation for non-Euclidean manifolds: from metagenomics to unweighted networks

    Within the field of unsupervised manifold learning, Intrinsic Dimension estimators are among the most important analysis tools. The Intrinsic Dimension provides a measure of the dimensionality of the hidden manifold from which data are sampled, even if the manifold is embedded in a space with a much higher number of features. The present Thesis tackles the still unanswered problem of computing the Intrinsic Dimension (ID) of spaces characterised by non-Euclidean metrics. In particular, we focus on datasets where the distances between points are measured by means of Manhattan, Hamming or shortest-path metrics and, thus, can only assume discrete values. This peculiarity has deep consequences for the way datapoints populate the neighbourhoods and for the structure of the manifold. For this reason we develop a general-purpose, nearest-neighbours-based ID estimator that has two peculiar features: the capability of explicitly selecting the scale at which the Intrinsic Dimension is computed and a validation procedure to check the reliability of the provided estimate. We then specialise the estimator to lattice spaces, where the volume is measured by means of the Ehrhart polynomials. After testing the reliability of the estimator on artificial datasets, we apply it to genomic sequences and discover an unexpectedly low ID, suggesting that evolutionary pressure exerts strong restraints on the way the nucleotide bases are allowed to mutate. This same framework is then employed to profile the scaling of the ID of unweighted networks. The diversity of the obtained ID signatures prompted us to use them to characterise the networks. Concretely, we employ the ID as a summary statistic within an Approximate Bayesian Computation framework in order to pinpoint the parameters of mechanistic generative network models of increasing complexity. We discover that, by targeting the ID of a given network, other typical network properties are also retrieved reasonably well.
As a last methodological development, we improved the ID estimator by adaptively selecting, for each datapoint, the largest neighbourhood with an approximately constant density. This offers a quantitative criterion to automatically select a meaningful scale at which the ID is computed and, at the same time, allows one to enforce the hypothesis of the method, yielding more reliable estimates.
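
To make the idea of a scale-dependent dimension under a discrete metric concrete, here is a minimal Python sketch. It is not the estimator developed in the thesis: it uses a simple ratio of neighbour counts in two Manhattan balls, and the function names (`manhattan`, `ball_count`, `local_dimension`) and the test grid are assumptions made for illustration.

```python
import math
from itertools import product


def manhattan(p, q):
    """Manhattan (L1) distance between two integer points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def ball_count(points, center, radius):
    """Number of points within the given Manhattan radius of center."""
    return sum(1 for p in points if manhattan(p, center) <= radius)


def local_dimension(points, center, r1, r2):
    """Crude dimension estimate from the growth of ball counts between
    two radii: d ~ log(N(r2) / N(r1)) / log(r2 / r1)."""
    n1 = ball_count(points, center, r1)
    n2 = ball_count(points, center, r2)
    return math.log(n2 / n1) / math.log(r2 / r1)


# Hypothetical test data: a full 2D integer lattice patch.
grid = list(product(range(-20, 21), repeat=2))
estimate = local_dimension(grid, (0, 0), 5, 10)
print(round(estimate, 2))  # a little below 2 at these small radii
```

On a full two-dimensional lattice, a Manhattan ball of radius r contains 2r² + 2r + 1 points, so the count is a polynomial in r rather than a pure power law. This is why the estimate sits slightly below 2 at small radii and why the choice of scale matters for discrete metrics.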

    Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements

    Why is an amino acid replacement in a protein accepted during evolution? The answer given by bioinformatics relies on the frequency of change of each amino acid by another one and the propensity of each to remain unchanged. We propose that these replacement rules are recoverable from the secondary structural trends of amino acids. A distance measure between high-resolution Ramachandran distributions reveals that structurally similar residues coincide with those found in substitution matrices such as BLOSUM: Asn ↔ Asp, Phe ↔ Tyr, Lys ↔ Arg, Gln ↔ Glu, Ile ↔ Val, Met → Leu; with Ala, Cys, His, Gly, Ser, Pro, and Thr as structurally idiosyncratic residues. We also found a high average correlation ($\overline{R} = 0.85$) between thirty amino acid mutability scales and the mutational inertia ($I_X$), which measures the energetic cost weighted by the number of observations at the most probable amino acid conformation. These results indicate that amino acid substitutions follow two optimally efficient principles: (a) amino acid interchangeability privileges secondary structural similarity, and (b) amino acid mutability depends directly on biosynthetic energy cost, and inversely on frequency. These two principles are the underlying rules governing the observed amino acid substitutions. © 2017 The Author(s)

    Fast Algorithms for Large-Scale Phylogenetic Reconstruction

    One of the most fundamental computational problems in biology is that of inferring evolutionary histories of groups of species from sequence data. Such evolutionary histories, known as phylogenies, are usually represented as binary trees in which leaves represent extant species and internal nodes represent their shared ancestors. As the amount of sequence data available to biologists increases, very fast phylogenetic reconstruction algorithms are becoming necessary. Currently, large sequence alignments can contain up to hundreds of thousands of sequences, making traditional methods, such as Neighbour Joining, computationally prohibitive. To address this problem, we have developed three novel fast phylogenetic algorithms. The first algorithm, QTree, is a quartet-based heuristic that runs in O(n log n) time. It is based on a theoretical algorithm that reconstructs the correct tree, with high probability, assuming every quartet is inferred correctly with constant probability. The core of our algorithm is a balanced search tree structure that enables us to locate an edge in the tree in O(log n) time. Our algorithm is several times faster than all the current methods, while its accuracy approaches that of Neighbour Joining. The second algorithm, LSHTree, is the first sub-quadratic time algorithm with theoretical performance guarantees under a Markov model of sequence evolution. Our new algorithm runs in O(n^{1+γ(g)} log^2 n) time, where γ is an increasing function of g, an upper bound on the mutation rate along any branch in the phylogeny, and γ(g) < 1 for all g. For phylogenies with very short branches, the running time of our algorithm is close to linear. In experiments, our prototype implementation was more accurate than the current fast algorithms, while being comparably fast. In the final part of this thesis, we apply the algorithmic framework behind LSHTree to the problem of placing large numbers of short sequence reads onto a fixed phylogenetic tree.
Our initial results in this area are promising, but there are still many challenges to be resolved.
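
The quartet queries that quartet-based methods rely on can be illustrated with a minimal Python sketch. This is the classical four-point method applied to a distance matrix, not the QTree algorithm itself; the function names and the toy distances are assumptions made for illustration.

```python
def infer_quartet(dist, a, b, c, d):
    """Return the pairing of four taxa whose within-pair distances have the
    smallest sum. For additive (tree-like) distances this pairing is the
    split induced by the internal edge of the true quartet tree."""
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

    def cost(pairing):
        (p, q), (r, s) = pairing
        return dist[p][q] + dist[r][s]

    return min(pairings, key=cost)


def toy_distances():
    """Hypothetical additive distances for the tree ((A,B),(C,D)) with
    pendant edges of length 1 and an internal edge of length 2."""
    pairs = {
        ("A", "B"): 2, ("C", "D"): 2,
        ("A", "C"): 4, ("A", "D"): 4,
        ("B", "C"): 4, ("B", "D"): 4,
    }
    dist = {taxon: {} for taxon in "ABCD"}
    for (x, y), value in pairs.items():
        dist[x][y] = value
        dist[y][x] = value
    return dist


split = infer_quartet(toy_distances(), "A", "B", "C", "D")
print(split)
```

With noisy distances estimated from real sequences, a single query of this kind can return the wrong split, which is why the abstract's guarantee is stated in terms of each quartet being inferred correctly with constant probability.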

    Ecological and evolutionary drivers of microbial community structure in termite guts

    Presumably descending from subsocial cockroaches 150 million years ago, termites are an order of social insects that gained the ability to digest wood through the acquisition of cellulolytic flagellates. These eukaryotic protists fill up the bulk of the hindgut volume and are the major habitat of the prokaryotic community present in the digestive tract of lower termites. The complete loss of gut flagellates in the youngest termite family Termitidae, also called higher termites, led to an entirely prokaryotic gut microbiota as well as a substantial dietary diversification and enormous ecological success. While the subfamily Macrotermitinae established a symbiosis with wood-degrading fungi of the genus Termitomyces, other higher termites exploit diets with a higher degree of humification. Previous studies on the gut communities of termites have observed that while the gut microbiota of closely related hosts are very similar, those of more distantly related hosts differ considerably. Since these observations are based on highly limited samplings of hosts, it is uncertain whether these differences reflect important evolutionary patterns. This dissertation includes studies examining the archaeal and bacterial diversity of the gut microbiota over a wide range of termites using high-throughput sequencing of the 16S rRNA genes. In comparison to the rather simple archaeal communities, which were mainly composed of methanogens, the bacterial gut microbiota were characterized by considerably higher diversity. At the phylum level, Bacteroidetes, Firmicutes, Proteobacteria and Spirochaetes were ubiquitously distributed among the termites, albeit with differences in relative abundance. Other phyla, however, such as Elusimicrobia, Fibrobacteres and the candidate division TG3, occurred only in certain host groups of termites.
The distribution pattern of archaeal and bacterial lineages reflects both host phylogeny and differences in the digestive strategy of the host. Although several genus-level bacterial lineages showed a certain degree of host specificity, phylogenetic analyses of the amplified rRNA genes showed that these bacterial lineages do not appear to be cospeciating with their hosts. The findings of studies included in this dissertation and other published studies were evaluated to identify potential drivers of community structure and other shaping mechanisms. This evaluation indicates that gut community structure in termites is primarily shaped by habitat and niche selection. The stochastic element of these mechanisms, however, is strongly attenuated by proctodeal trophallaxis, which facilitates coevolution and might ultimately lead to cospeciation. While coevolution is likely true for many lineages and documented by host-specific microbial lineages, there is little evidence of cospeciation in the gut microbiota of termites. If present, it is restricted almost exclusively to flagellates and their symbionts in lower termites. The higher wood-feeding termites have long been associated with a marked abundance of the phyla Fibrobacteres and cand. div. TG3. Although these phyla have been shown to be members of a specific cellulolytic community associated with wood particles in the hindguts of higher termites, their full functional potential remains unknown. In order to elucidate the role of these organisms, a study in this dissertation carries out metagenomic analyses of various higher termites. In wood-feeding representatives, Fibrobacteres and cand. div. TG3 were in fact highly abundant, but few or no genes could be assigned to either group by the usual database-dependent classification programs due to the lack of suitable genomes in these databases. In response, a new study was conceived to compensate for this discrepancy.
By further development of a new reference-independent method, over 30 population genomes of Fibrobacteres and cand. div. TG3 could be reconstructed from the metagenomic data sets. Subsequent comparative analysis revealed that organisms of both groups differ in their potential for wood degradation, but likely complement each other. Further analyses indicate that representatives of both groups might be able to fix nitrogen and respire under hypoxic conditions, two favourable adaptations to the unique termite gut environment.

    Network topology and community function in spatial microbial communities

    Complex communities of microbes act collectively to regulate human health, provide sources of clean energy, and ripen aromatic cheese. The efficient functioning of these communities can be directly related to competitive and cooperative interactions between species. Physical constraints and local environment affect the stability of these interactions. Here we explore the role of spatial habitat and interaction networks in microbial ecology and human disease. In the first part of the dissertation, we model mutualism to understand how spatial microbial communities survive number fluctuations in physical habitats. We explicitly account for the production, consumption, and diffusion of public goods in a two-species microbial community. We show that increased sharing of nutrients breaks down coexistence, and that species may benefit from making slower-diffusing nutrients. In multi-species communities, indirect and higher order interactions may affect community function. We find that the requirement for spatial proximity severely restricts the network of possible microbial interactions. While cooperation between two species is stable, higher-order mutualism requiring three or more species succumbs easily to number fluctuations. Additional cyclic or reciprocal interactions between pairs can stabilize multi-species communities. Inter-species interactions also affect human health via the human microbiome: microbial communities in the gut, lungs and skin. In the second part of the dissertation, we use machine learning and statistics to establish links between microbiota abundance and composition, and the incidence of chronic diseases. We study the gut fungal profile to probe the effects of diet and fungal dysbiosis in a cohort of Saudi children with Crohn's disease. 
While statistical microbiome studies established that each disease phenotype is associated with a distinct state of intestinal dysbiosis, they often produced conflicting results and identified a very large number of microbes associated with disease. We show that a handful of taxa could drive the dynamics of ecosystem-level abundance changes due to strong inter-species interactions. Using maximum entropy methods, we propose a simple statistical approach (Direct Association Analysis, or DAA) to account for interspecific interactions. When applied to the largest dataset on IBD, DAA detects a small subset of associations directly linked to the disease, avoids p-value inflation and identifies the most predictive features of the microbiome.