Variational inference for sparse network reconstruction from count data
In multivariate statistics, the question of finding direct interactions can
be formulated as a problem of network inference - or network reconstruction -
for which the Gaussian graphical model (GGM) provides a canonical framework.
Unfortunately, the Gaussian assumption does not apply to count data which are
encountered in domains such as genomics, social sciences or ecology.
To circumvent this limitation, state-of-the-art approaches use two-step
strategies that first transform counts to pseudo Gaussian observations and then
apply a (partial) correlation-based approach from the abundant literature of
GGM inference. We adopt a different stance by relying on a latent model where
we directly model counts by means of Poisson distributions that are conditional
on latent (hidden) correlated Gaussian variables. In this multivariate Poisson
lognormal model, the dependency structure is completely captured by the latent
layer. This parametric model makes it possible to account for the effects of
covariates on the counts.
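As an illustration of the latent model described above, the following minimal sketch simulates counts from a multivariate Poisson lognormal model with offsets and covariates. The function name `simulate_pln`, the toy precision matrix `Omega`, and the regression coefficients are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def simulate_pln(n, Sigma, B, X, offsets, seed=0):
    """Draw counts from a multivariate Poisson lognormal model.

    Latent layer:  Z_i ~ N(0, Sigma)
    Observations:  Y_ij | Z_ij ~ Poisson(exp(o_ij + x_i^T B_j + Z_ij))
    """
    rng = np.random.default_rng(seed)
    p = Sigma.shape[0]
    # Correlated Gaussian latent variables carry all the dependency structure.
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    # Offsets (sampling effort) and covariates enter the log-mean additively.
    log_mean = offsets + X @ B + Z
    Y = rng.poisson(np.exp(log_mean))
    return Y, Z

# Hypothetical toy setting: 3 species whose latent precision matrix is a chain,
# so species 0 and 2 interact only through species 1.
Omega = np.array([[2.0, -0.8, 0.0],
                  [-0.8, 2.0, -0.8],
                  [0.0, -0.8, 2.0]])
Sigma = np.linalg.inv(Omega)

n = 500
X = np.ones((n, 1))                          # intercept-only design matrix
B = np.log(np.array([[5.0, 10.0, 20.0]]))    # assumed baseline abundances
offsets = np.zeros((n, 3))                   # equal sampling effort
Y, Z = simulate_pln(n, Sigma, B, X, offsets)
```

The zero entry `Omega[0, 2]` encodes the absence of a direct interaction between the first and third species; recovering such zeros from `Y` alone is the network inference problem the abstract refers to.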
To perform network inference, we add a sparsity-inducing constraint on the
inverse covariance matrix of the latent Gaussian vector. Unlike the usual
Gaussian setting, the penalized likelihood is generally not tractable, and we
resort instead to a variational approach for approximate likelihood
maximization. The corresponding optimization problem is solved by alternating a
gradient ascent on the variational parameters and a graphical-Lasso step on the
covariance matrix.
We show that our approach is highly competitive with the existing methods on
simulations inspired by microbiological data. We then illustrate on three
different data sets how accounting for sampling efforts via offsets and
integrating external covariates (which is rarely done in the existing
literature) drastically changes the topology of the inferred network.
Applications of Machine Learning in Microbial Forensics
Microbial ecosystems are complex, with hundreds of members interacting with each other and the environment. The intricate and hidden behaviors underlying these interactions make research questions challenging, but they can be better understood through machine learning. However, most machine learning used in microbiome work is a black-box form of investigation, where accurate predictions can be made, but the inner logic driving the prediction is hidden behind nontransparent layers of complexity.
Accordingly, the goal of this dissertation is to provide an interpretable and in-depth machine learning approach to investigate microbial biogeography and to use micro-organisms as novel tools to detect geospatial location and object provenance (previously known origin). These contributions are accompanied by a framework that allows extraction of interpretable metrics and actionable insights from microbiome-based machine learning models. The first part of this work provides an overview of machine learning in the context of microbial ecology, human microbiome studies and environmental monitoring, outlining common practice and shortcomings. The second part presents a field study that demonstrates how machine learning can be used to characterize patterns in microbial biogeography globally, using microbes from ports located around the world. The third part studies the persistence and stability of natural microbial communities from the environment that have colonized objects (vessels) and stay attached as they travel through the water. Finally, the last part of this dissertation provides a robust framework for investigating the microbiome. This framework provides a reasonable understanding of the data being used in microbiome-based machine learning and allows researchers to better understand and interpret results.
Together, these extensive experiments help in understanding how to carry an in silico design that characterizes candidate microbial biomarkers from real-world settings through to a rapid, field-deployable diagnostic assay. The work presented here provides evidence for the use of microbial forensics as a toolkit to expand our basic understanding of microbial biogeography, microbial community stability and persistence in complex systems, and the ability of machine learning to be applied to downstream molecular detection platforms for rapid and accurate detection.
Efficient Grouping Methods for the Annotation and Sorting of Single Cells
Lux M. Efficient Grouping Methods for the Annotation and Sorting of Single Cells. Bielefeld: Universität Bielefeld; 2018.
Insights into large-scale biological data require computational methods which reliably and efficiently recognize latent structures and patterns. In many cases, it is necessary to find homogeneous subgroups of the data in order to solve complex problems and to enable the discovery of novel knowledge. Here, clustering and classification techniques are commonly employed in all fields of research. Confounding factors often complicate data analysis and require a thorough choice of methods and parameters. This thesis focuses on methods for single-cell research. I developed, evaluated, compared and adapted grouping methods for open problems from three different technologies:
First, metagenomics is typically confronted with the problem of detecting clusters representing the species involved in a given sample (binning). Although powerful technologies exist for the identification of known taxa, de novo binning is still in its infancy. In this context, I evaluated optimal choices of techniques and parameters regarding the integration of modern machine learning methods, such as dimensionality reduction and clustering, resulting in an automated binning pipeline.
Second, in single-cell sequencing, a major problem is sample contamination with foreign genomic material. From a computational point of view, in both metagenomics and single-cell genome assemblies, genomes can be represented as clusters. The clustering task for single cells, however, differs fundamentally from the one in metagenomics. Here, I developed a methodology to automatically detect contamination and estimate confidences in single-cell genome assemblies.
A third challenge can be seen in the field of flow cytometry. Here, the precise identification of cell populations in a sample is crucial and requires manual, tedious, and possibly biased cell annotation. Automated methods exist; however, they require difficult fine-tuning of hyper-parameters to obtain the best results. To overcome this limitation, I developed a semi-supervised tool for cell population identification that has few, very robust parameters and is fast, accurate and interpretable at the same time.
Intrinsic Dimension Estimation for non-Euclidean manifolds: from metagenomics to unweighted networks
Within the field of unsupervised manifold learning, Intrinsic Dimension estimators are
among the most important analysis tools. The Intrinsic Dimension provides a measure of the
dimensionality of the hidden manifold from which data are sampled, even if the manifold is
embedded in a space with a much higher number of features. The present Thesis tackles the
still unanswered problem of computing the Intrinsic Dimension (ID) of spaces characterised
by non-Euclidean metrics. In particular, we focus on datasets where the distances between
points are measured by means of Manhattan, Hamming or shortest-path metrics and, thus, can
only assume discrete values. This peculiarity has deep consequences for the way datapoints
populate the neighbourhoods and for the structure of the manifold. For this reason we
develop a general purpose, nearest-neighbours-based ID estimator that has two peculiar
features: the capability of selecting explicitly the scale at which the Intrinsic Dimension is
computed and a validation procedure to check the reliability of the provided estimate. We
then specialise the estimator to lattice spaces, where the volume is measured by means of the
Ehrhart polynomials. After testing the reliability of the estimator on artificial datasets, we
apply it to genomic sequences and discover an unexpectedly low ID, suggesting that
evolutionary pressure exerts strong constraints on the way the nucleotide bases are allowed to
mutate. This same framework is then employed to profile the scaling of the ID of unweighted
networks. The diversity of the obtained ID signatures prompted us to use the ID as a signature
to characterise the networks. Concretely, we employ the ID as a summary statistic within
an Approximate Bayesian Computation framework in order to pinpoint the parameters
of network mechanistic generative models of increasing complexity. We discover that, by
targeting the ID of a given network, other typical network properties are also fairly retrieved.
As a last methodological development, we improved the ID estimator by adaptively selecting,
for each datapoint, the largest neighbourhoods with an approximately constant density. This
offers a quantitative criterion to automatically select a meaningful scale at which the ID is
computed and, at the same time, allows the hypotheses of the method to be enforced, yielding
more reliable estimates.
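To make the role of lattice volumes concrete, the sketch below counts the points of the integer lattice inside a Manhattan ball, once with a standard closed form and once by brute force, and then reads off a scale-dependent dimension from the growth of that count. It is only an illustration of why discrete metrics need a dedicated volume formula; it is not the estimator developed in the thesis, and the function names are hypothetical.

```python
from itertools import product
from math import comb, log

def l1_ball_count(d, t):
    """Number of points of Z^d whose Manhattan norm is at most t.

    Uses the closed form sum_k 2^k * C(d, k) * C(t, k), which is a
    polynomial of degree d in the radius t.
    """
    return sum(2**k * comb(d, k) * comb(t, k) for k in range(min(d, t) + 1))

def l1_ball_count_brute(d, t):
    """Same count by explicit enumeration (only feasible for small d, t)."""
    return sum(1 for x in product(range(-t, t + 1), repeat=d)
               if sum(abs(c) for c in x) <= t)

def local_dimension(d, t):
    """Log-log slope of the ball volume between radii t and 2t.

    On a lattice this slope depends on the scale t and only approaches d
    when t is large, which is why the scale has to be chosen explicitly.
    """
    return log(l1_ball_count(d, 2 * t) / l1_ball_count(d, t)) / log(2)

# The closed form and the enumeration agree on small cases.
for d, t in [(1, 3), (2, 2), (3, 4)]:
    assert l1_ball_count(d, t) == l1_ball_count_brute(d, t)
```

Comparing `local_dimension(3, 1)` with `local_dimension(3, 50)` shows how strongly the apparent dimension of the same lattice depends on the radius at small scales.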
Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements
Why is an amino acid replacement in a protein accepted during evolution? The answer given by bioinformatics relies on the frequency of change of each amino acid by another one and the propensity of each to remain unchanged. We propose that these replacement rules are recoverable from the secondary structural trends of amino acids. A distance measure between high-resolution Ramachandran distributions reveals that structurally similar residues coincide with those found in substitution matrices such as BLOSUM: Asn ↔ Asp, Phe ↔ Tyr, Lys ↔ Arg, Gln ↔ Glu, Ile ↔ Val, Met → Leu; with Ala, Cys, His, Gly, Ser, Pro, and Thr as structurally idiosyncratic residues. We also found a high average correlation ($\overline{R} = 0.85$) between thirty amino acid mutability scales and the mutational inertia ($I_X$), which measures the energetic cost weighted by the number of observations at the most probable amino acid conformation. These results indicate that amino acid substitutions follow two optimally efficient principles: (a) amino acid interchangeability favours secondary structural similarity, and (b) the mutability of an amino acid depends directly on its biosynthetic energy cost and inversely on its frequency. These two principles are the underlying rules governing the observed amino acid substitutions. © 2017 The Author(s)
Fast Algorithms for Large-Scale Phylogenetic Reconstruction
One of the most fundamental computational problems in biology is that of inferring evolutionary histories of groups of species from sequence data. Such evolutionary histories, known as phylogenies, are usually represented as binary trees where leaves represent extant species, whereas internal nodes represent their shared ancestors. As the amount of sequence data available to biologists increases, very fast phylogenetic reconstruction algorithms are becoming necessary. Currently, large sequence alignments can contain up to hundreds of thousands of sequences, making traditional methods, such as Neighbor Joining, computationally prohibitive. To address this problem, we have developed three novel fast phylogenetic algorithms.
The first algorithm, QTree, is a quartet-based heuristic that runs in O(n log n) time. It is based on a theoretical algorithm that reconstructs the correct tree, with high probability, assuming every quartet is inferred correctly with constant probability. The core of our algorithm is a balanced search tree structure that enables us to locate an edge in the tree in O(log n) time. Our algorithm is several times faster than all the current methods, while its accuracy approaches that of Neighbor Joining.
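A quartet-based method relies on a primitive that decides how four taxa are paired in the tree. The abstract does not say how QTree infers quartets, so the sketch below uses the standard four-point method as a stand-in: among the three possible pairings, it picks the one with the smallest sum of within-pair distances. The function name and the toy distance matrix are assumptions for illustration.

```python
def quartet_topology(D, a, b, c, d):
    """Infer the split of taxa {a, b, c, d} with the four-point method.

    D is a symmetric matrix (list of lists) of pairwise distances.
    For distances that are additive on a tree, the true pairing xy|zw
    minimises D[x][y] + D[z][w] among the three possible pairings.
    """
    sums = {
        ((a, b), (c, d)): D[a][b] + D[c][d],
        ((a, c), (b, d)): D[a][c] + D[b][d],
        ((a, d), (b, c)): D[a][d] + D[b][c],
    }
    return min(sums, key=sums.get)

# Toy additive distances for the tree ((0,1),(2,3)): every leaf branch has
# length 1 and the internal edge has length 2 (hypothetical values).
D = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
split = quartet_topology(D, 0, 1, 2, 3)
```

With distances estimated from sequences rather than exact tree distances, this query can return the wrong pairing, which is why the guarantee quoted above only assumes each quartet is correct with constant probability.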
The second algorithm, LSHTree, is the first sub-quadratic time algorithm with theoretical performance guarantees under a Markov model of sequence evolution. Our new algorithm runs in O(n^{1+γ(g)} log^2 n) time, where γ is an increasing function of an upper bound on the mutation rate along any branch in the phylogeny, and γ(g) < 1 for all g. For phylogenies with very short branches, the running time of our algorithm is close to linear. In experiments, our prototype implementation was more accurate than the current fast algorithms, while being comparably fast.
In the final part of this thesis, we apply the algorithmic framework behind LSHTree to the problem of placing large numbers of short sequence reads onto a fixed phylogenetic tree. Our initial results in this area are promising, but there are still many challenges to be resolved.
Ecological and evolutionary drivers of microbial community structure in termite guts
Presumably descending from subsocial cockroaches 150 million years ago, termites are an order of social
insects that gained the ability to digest wood through the acquisition of cellulolytic flagellates. These
eukaryotic protists fill up the bulk of the hindgut volume and are the major habitat of the prokaryotic
community present in the digestive tract of lower termites. The complete loss of gut flagellates in the
youngest termite family Termitidae, also called higher termites, led to an entirely prokaryotic gut
microbiota as well as a substantial dietary diversification and enormous ecological success. While the
subfamily Macrotermitinae established a symbiosis with wood-degrading fungi of the genus Termitomyces,
other higher termites exploit diets with a higher degree of humification.
Previous studies on the gut communities of termites have observed that while the gut microbiota of
closely related hosts is very similar, those of more distantly related hosts are characterized by considerable
differences in gut communities. Since these observations are based on highly limited samplings of hosts, it
is uncertain if these differences reflect important evolutionary patterns. This dissertation includes studies
examining the archaeal and bacterial diversity of the gut microbiota over a wide range of termites using
high-throughput sequencing of the 16S rRNA genes. In comparison to the rather simple archaeal
communities, which were mainly composed of methanogens, the bacterial gut microbiota were
characterized by considerably higher diversity. At the phylum-level, Bacteroidetes, Firmicutes,
Proteobacteria and Spirochaetes were ubiquitously distributed among the termites, albeit with differences in
relative abundance. Other phyla, however, such as Elusimicrobia, Fibrobacteres and the candidate division
TG3, occurred only in certain host groups of termites. The distribution pattern of archaeal and bacterial
lineages reflects both host phylogeny and differences in the digestive strategy of the host. Although several
genus-level bacterial lineages showed a certain degree of host-specificity, phylogenetic analyses of the
amplified rRNA genes showed that these bacterial lineages do not appear to be cospeciating with their
hosts. The findings of studies included in this dissertation and other published studies were evaluated to
identify potential drivers of community structure and other shaping mechanisms. Thus, gut community
structure in termites is primarily shaped by habitat and niche selection. The stochastic element of these
mechanisms, however, is strongly attenuated by proctodeal trophallaxis, which facilitates coevolution and
might ultimately lead to cospeciation. While coevolution is likely true for many lineages and documented
by host-specific microbial lineages, there is little evidence of cospeciation in the gut microbiota of
termites. If present, it is restricted almost exclusively to flagellates and their symbionts in lower termites.
The higher wood-feeding termites have long been associated with a marked abundance of the phyla
Fibrobacteres and cand. div. TG3. Although these phyla have been shown to be members of a specific
cellulolytic community associated with wood particles in the hindguts of higher termites, their full
functional potential still remains unknown. In order to elucidate the role of these organisms, a study in this
dissertation carries out metagenomic analyses of various higher termites. In wood-feeding representatives,
Fibrobacteres and cand. div. TG3 were in fact highly abundant, but only a few or no genes could be
assigned to both groups by the usual database-dependent classification programs due to the lack of suitable
genomes in these databases. In response, a new study was conceived to compensate for this discrepancy. By
further development of a new reference-independent method, over 30 population genomes of Fibrobacteres
and cand. div. TG3 could be reconstructed from the metagenomic data sets. Subsequent comparative
analysis revealed that organisms of both groups differ in their potential of wood degradation, but likely
complement each other. Further analyses indicate that representatives of both groups might be able to fix
nitrogen and respire under hypoxic conditions — two favourable adaptations to the unique termite gut
environment.
Network topology and community function in spatial microbial communities
Complex communities of microbes act collectively to regulate human health, provide sources of clean energy, and ripen aromatic cheese. The efficient functioning of these communities can be directly related to competitive and cooperative interactions between species. Physical constraints and local environment affect the stability of these interactions. Here we explore the role of spatial habitat and interaction networks in microbial ecology and human disease.
In the first part of the dissertation, we model mutualism to understand how spatial microbial communities survive number fluctuations in physical habitats. We explicitly account for the production, consumption, and diffusion of public goods in a two-species microbial community. We show that increased sharing of nutrients breaks down coexistence, and that species may benefit from making slower-diffusing nutrients. In multi-species communities, indirect and higher order interactions may affect community function. We find that the requirement for spatial proximity severely restricts the network of possible microbial interactions. While cooperation between two species is stable, higher-order mutualism requiring three or more species succumbs easily to number fluctuations. Additional cyclic or reciprocal interactions between pairs can stabilize multi-species communities.
Inter-species interactions also affect human health via the human microbiome: microbial communities in the gut, lungs and skin. In the second part of the dissertation, we use machine learning and statistics to establish links between microbiota abundance and composition, and the incidence of chronic diseases. We study the gut fungal profile to probe the effects of diet and fungal dysbiosis in a cohort of Saudi children with Crohn's disease.
While statistical microbiome studies established that each disease phenotype is associated with a distinct state of intestinal dysbiosis, they often produced conflicting results and identified a very large number of microbes associated with disease. We show that a handful of taxa could drive the dynamics of ecosystem-level abundance changes due to strong inter-species interactions. Using maximum entropy methods, we propose a simple statistical approach (Direct Association Analysis or DAA) to account for interspecific interactions. When applied to the largest dataset on IBD, DAA detects a small subset of associations directly linked to the disease, avoids p-value inflation and identifies the most predictive features of the microbiome.