Correlates of the molecular vaginal microbiota composition of African women.
BACKGROUND: Sociodemographic, behavioral, and clinical correlates of the vaginal microbiome (VMB), as characterized by molecular methods, have not been adequately studied. A VMB dominated by bacteria other than lactobacilli may cause inflammation, which may in turn facilitate HIV acquisition and other adverse reproductive health outcomes. METHODS: We characterized the VMB of women in Kenya, Rwanda, South Africa, and Tanzania (KRST) using a 16S rDNA phylogenetic microarray. Cytokines were quantified in cervicovaginal lavages, and potential sociodemographic, behavioral, and clinical correlates were evaluated. RESULTS: Three hundred thirteen samples from 230 women were available for analysis. Five VMB clusters were identified: one cluster each dominated by Lactobacillus crispatus (KRST-I) and L. iners (KRST-II), and three clusters not dominated by a single species but containing multiple (facultative) anaerobes (KRST-III/IV/V). Women in clusters KRST-I and II had lower mean concentrations of interleukin (IL)-1α (p < 0.001) and granulocyte colony-stimulating factor (G-CSF) (p = 0.01), but higher concentrations of interferon-γ-induced protein 10 (IP-10) (p < 0.01), than women in clusters KRST-III/IV/V. A lower proportion of women in cluster KRST-I tested positive for bacterial sexually transmitted infections (STIs; p-trend = 0.07) and urinary tract infection (UTI; p = 0.06), and a higher proportion of women in clusters KRST-I and II had vaginal candidiasis (p-trend = 0.09), but these associations did not reach statistical significance. Women who reported unusual vaginal discharge were more likely to belong to clusters KRST-III/IV/V (p = 0.05). CONCLUSION: Vaginal dysbiosis in African women was significantly associated with vaginal inflammation; the associations with increased prevalence of STIs and UTI, and decreased prevalence of vaginal candidiasis, should be confirmed in larger studies.
Improvement of Insulin Sensitivity after Lean Donor Feces in Metabolic Syndrome Is Driven by Baseline Intestinal Microbiota Composition
The intestinal microbiota has been implicated in insulin resistance, although evidence regarding causality in humans is scarce. We therefore studied the effect of lean donor (allogenic) versus own (autologous) fecal microbiota transplantation (FMT) in male recipients with the metabolic syndrome. Whereas we did not observe metabolic changes at 18 weeks after FMT, insulin sensitivity at 6 weeks after allogenic FMT was significantly improved, accompanied by altered microbiota composition. We also observed changes in plasma metabolites such as gamma-aminobutyric acid and show that the metabolic response upon allogenic FMT (defined as improved insulin sensitivity 6 weeks after FMT) depends on decreased fecal microbial diversity at baseline. In conclusion, the beneficial effects of lean donor FMT on glucose metabolism are associated with changes in intestinal microbiota and plasma metabolites and can be predicted based on baseline fecal microbiota composition.
A Statistical Framework For Nutriomics Data Analysis
Nutriomics is a new discipline that investigates the relationship between nutrition and health through the use of high-throughput omics technologies. The inherent complexity of nutriomics data, however, poses several challenges for data analysis. In this thesis, we introduce nutriomics and the statistical challenges associated with its analysis, and propose statistical modelling and machine learning methods to tackle three main challenges: non-linearity, high dimensionality, and data heterogeneity.
To deal with these challenges, we first propose a statistical framework, which we coin LC-N2G, to test whether the association between nutrition intake and omics features of interest is significantly different from no association. We use public data as an example to show LC-N2G's ability to discover non-linear associations between nutrition and gene expression. We then propose a statistical method, coined eNODAL, to cluster high-dimensional omics features based on how they respond to nutrition intake. Applying eNODAL to a mouse proteomics nutrition study shows that it can identify interpretable clusters of proteins with similar responses to diet and drug treatment. Finally, we propose a statistical model, which we call NEMoE, to uncover the heterogeneous interplay among diet, omics, and health outcomes. We use a microbiome Parkinson's disease (PD) study to illustrate the method and show that NEMoE is able to identify diet-specific microbial signatures of PD.
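The non-linearity challenge above can be illustrated with a toy sketch (this is not the authors' LC-N2G framework; the simulated nutrient and expression variables are illustrative assumptions): for a U-shaped omics response to intake, Pearson correlation is near zero while a simple binned mutual-information estimate still detects the dependence.

```python
import numpy as np

def binned_mutual_information(x, y, bins=8):
    """Estimate mutual information (in nats) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
intake = rng.uniform(-1, 1, 2000)                     # hypothetical nutrient intake, centred
expression = intake**2 + rng.normal(0, 0.05, 2000)    # U-shaped response

r = np.corrcoef(intake, expression)[0, 1]   # linear correlation misses the U-shape
mi = binned_mutual_information(intake, expression)
print(f"Pearson r = {r:.3f}, MI = {mi:.3f}")
```

A linear test would declare intake and expression unrelated here; the non-linear dependence measure does not.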
Overall, this thesis proposes statistical methods to analyze nutriomics data and outlines possible future extensions of this research. The proposed methods could help researchers better understand the complex relationships between nutrition and health, ultimately leading to improved health outcomes.
Unsupervised approaches for time-evolving graph embeddings with application to human microbiome
A growing number of diseases have been found to be strongly correlated with disturbances in the composition of the microbiome, e.g., obesity, diabetes, and even some types of cancer. Advances in high-throughput omics technologies have made it possible to directly analyze the human microbiome and its impact on human health and physiology. Microbial composition is usually observed over long periods of time, and the interactions between its members are explored. Numerous studies have used microbiome data to accurately differentiate disease states and understand the differences in microbiome profiles between healthy and ill individuals. However, most of them focus on various statistical approaches while omitting the microbe-microbe interactions among a large number of microbial species that, in principle, drive microbiome dynamics. Constructing and analyzing time-evolving graphs is needed to understand how microbial ecosystems respond to a range of distinct perturbations, such as antibiotic exposure, disease, or other dynamic influences. This becomes especially challenging due to the many complex interactions among microbes and their metastable dynamics.
The key to addressing this challenge lies in representing time-evolving graphs constructed from microbiome data as fixed-length, low-dimensional feature vectors that preserve the original dynamics. We therefore propose two unsupervised approaches that map the time-evolving graph constructed from microbiome data into a low-dimensional space in which the original dynamics, such as the number of metastable states and their locations, are preserved. The first method relies on the spectral analysis of transfer operators, such as the Perron–Frobenius or Koopman operator, combined with graph kernels. These components enable us to extract topological information, such as complex interactions among species, from the time-evolving graph while taking into account the dynamic changes in human microbiome composition. Further, we study how deep learning techniques can contribute to the study of a complex network of microbial species. This method consists of two key components: 1) the Transformer, a state-of-the-art architecture for sequential data, which learns both the structural patterns of the time-evolving graph and the temporal changes of the microbiome system, and 2) contrastive learning, which allows the model to learn a low-dimensional representation while maintaining metastability in the low-dimensional space.
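The spectral idea behind transfer-operator analysis can be sketched in a minimal form (an illustrative assumption, not the thesis's actual estimation from microbiome graphs): for a row-stochastic transition matrix with two weakly coupled blocks, the number of eigenvalues close to 1 counts the metastable states, and a spectral gap separates them from the fast dynamics.

```python
import numpy as np

# Row-stochastic transition matrix over 4 hypothetical community states:
# states {0,1} and {2,3} form two weakly coupled (metastable) blocks.
P = np.array([
    [0.70, 0.28, 0.01, 0.01],
    [0.28, 0.70, 0.01, 0.01],
    [0.01, 0.01, 0.70, 0.28],
    [0.01, 0.01, 0.28, 0.70],
])

eigvals = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
# Metastable states appear as eigenvalues near 1, separated from the
# rest of the spectrum by a gap.
n_metastable = int(np.sum(eigvals > 0.9))
print(eigvals, n_metastable)
```

Here the spectrum is (1, 0.96, 0.42, 0.42): two eigenvalues near 1, hence two metastable states, matching the block structure by construction.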
Finally, this thesis addresses an important challenge in microbiome data analysis: identifying which species, or interactions among species, are responsible for or affected by the changes the microbiome undergoes from one state (healthy) to another (diseased or under antibiotic exposure). Using interpretability techniques for deep learning models, originally developed to assess the trustworthiness of such models, we can extract structural information from the time-evolving graph pertaining to particular metastable states.
Large Scale Machine Learning in Biology
Rapid technological advances during the last two decades have led to a data-driven revolution in biology opening up a plethora of opportunities to infer informative patterns that could lead to deeper biological understanding. Large volumes of data provided by such technologies, however, are not analyzable using hypothesis-driven significance tests and other cornerstones of orthodox statistics. We present powerful tools in machine learning and statistical inference for extracting biologically informative patterns and clinically predictive models using this data. Motivated by an existing graph partitioning framework, we first derive relationships between optimizing the regularized min-cut cost function used in spectral clustering and the relevance information as defined in the Information Bottleneck method. For fast-mixing graphs, we show that the regularized min-cut cost functions introduced by Shi and Malik over a decade ago can be well approximated as the rate of loss of predictive information about the location of random walkers on the graph. For graphs drawn from a generative model designed to describe community structure, the optimal information-theoretic partition and the optimal min-cut partition are shown to be the same with high probability. Next, we formulate the problem of identifying emerging viral pathogens and characterizing their transmission in terms of learning linear models that can predict the host of a virus using its sequence information. Motivated by an existing framework for representing biological sequence information, we learn sparse, tree-structured models, built from decision rules based on subsequences, to predict viral hosts from protein sequence data using multi-class Adaboost, a powerful discriminative machine learning algorithm. Furthermore, the predictive motifs robustly selected by the learning algorithm are found to show strong host-specificity and occur in highly conserved regions of the viral proteome. 
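The min-cut partitioning discussed above can be illustrated with a standard spectral sketch (this uses the classic normalized-Laplacian relaxation of Shi and Malik's cost, not the information-theoretic derivation; the toy graph is an assumption): the sign pattern of the second-smallest eigenvector splits the graph along its sparsest normalized cut.

```python
import numpy as np

# Toy graph: two 4-node cliques joined by a single bridge edge (nodes 3-4).
n = 8
A = np.zeros((n, n))
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_sym = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian

# The second-smallest eigenvector (Fiedler vector) relaxes the
# normalized min-cut; its sign pattern gives the two-way partition.
eigvals, eigvecs = np.linalg.eigh(L_sym)
fiedler = eigvecs[:, 1]
labels = (fiedler > 0).astype(int)
print(labels)
```

On this graph the partition recovers the two cliques, cutting only the bridge edge.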
We then extend this learning algorithm to the problem of predicting disease risk in humans using single nucleotide polymorphisms (SNPs) -- single-base-pair variations -- across their entire genome. While genome-wide association studies usually aim to infer individual SNPs that are strongly associated with disease, we use popular supervised learning algorithms to infer sufficiently complex tree-structured models, built from single-SNP decision rules, that are both highly predictive (for clinical goals) and facilitate biological interpretation (for basic science goals). In addition to high prediction accuracies, the models identify 'hotspots' in the genome that contain putative causal variants for the disease and also suggest combinatorial interactions that are relevant for the disease. Finally, motivated by the insufficiency of quantifying biological interpretability in terms of model sparsity, we propose a hierarchical Bayesian model that infers hidden structured relationships between features while simultaneously regularizing the classification model using the inferred group structure. The appropriate hidden structure maximizes the log-probability of the observed data, thus regularizing the classifier while increasing its predictive accuracy. We conclude by describing different extensions of this model that can be applied to various biological problems, specifically those described in this thesis, and enumerate promising directions for future research.
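As a toy illustration of tree-structured models built from single-SNP decision rules (a greedy depth-2 tree, not the boosted learner described above; the synthetic genotypes are an assumption), consider a "disease" defined by a purely combinatorial two-SNP interaction that no single-SNP association would detect:

```python
import numpy as np

def stump_accuracy(x, y):
    """Accuracy of predicting y by majority vote within each value of binary x."""
    acc = 0
    for v in (0, 1):
        yv = y[x == v]
        if yv.size:
            acc += max(np.sum(yv == 0), np.sum(yv == 1))
    return acc / y.size

def best_snp(X, y):
    """Index of the single-SNP decision rule with highest majority-vote accuracy."""
    return int(np.argmax([stump_accuracy(X[:, j], y) for j in range(X.shape[1])]))

def fit_depth2(X, y):
    """Greedy depth-2 tree: root SNP, one child SNP per branch, majority leaves.
    (Toy code: assumes both genotype values occur in every branch.)"""
    root = best_snp(X, y)
    tree = {"root": root, "branches": {}}
    for v in (0, 1):
        mask = X[:, root] == v
        child = best_snp(X[mask], y[mask])
        leaves = {w: int(np.round(y[mask][X[mask][:, child] == w].mean()))
                  for w in (0, 1)}
        tree["branches"][v] = (child, leaves)
    return tree

def predict(tree, x):
    child, leaves = tree["branches"][int(x[tree["root"]])]
    return leaves[int(x[child])]

# Synthetic genotypes: disease = SNP0 XOR SNP1, a combinatorial interaction
# for which every single SNP is marginally uninformative; SNP2 is noise.
X = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)])
y = X[:, 0] ^ X[:, 1]

tree = fit_depth2(X, y)
preds = np.array([predict(tree, x) for x in X])
acc = float(np.mean(preds == y))
print(tree["root"], acc)
```

Every single-SNP rule scores 50% on this data, yet the two-level tree classifies it perfectly, which is the sense in which tree-structured models expose combinatorial interactions.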
Adapting Community Detection Approaches to Large, Multilayer, and Attributed Networks
Networks have become a common data mining tool to encode relational definitions between a set of entities. Whether studying biological correlations or communication between individuals in a social network, network analysis tools enable interpretation, prediction, and visualization of patterns in the data. Community detection is a well-developed subfield of network analysis, where the objective is to cluster nodes into 'communities' based on their connectivity patterns. There are many useful and robust approaches for identifying communities in a single, moderately sized network, but working with more complicated types of networks containing extra or very large amounts of information poses challenges. In this thesis, we address three types of challenging network data and how to adapt standard community detection approaches to handle these situations. In particular, we focus on networks that are large, attributed, and multilayer. First, we present a method for identifying communities in multilayer networks, where there exist multiple relational definitions between a set of nodes. Next, we provide a pre-processing technique for reducing the size of large networks, where standard community detection approaches might have inconsistent results or be prohibitively slow. We then introduce an extension to a probabilistic model for community structure to take into account node attribute information and develop a test to quantify the extent to which connectivity and attribute information align. Finally, we demonstrate example applications of these methods in biological and social networks. This work helps to advance the understanding of network clustering, network compression, and the joint modeling of node attributes and network connectivity.
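Many of the community-detection approaches discussed above optimize or evaluate Newman's modularity Q, which compares within-community edge weight to what a degree-preserving random graph would give. A minimal sketch (the toy graph and partitions are illustrative assumptions, not the thesis's methods):

```python
import numpy as np

def modularity(A, labels):
    """Newman's modularity Q for an undirected graph with adjacency matrix A."""
    k = A.sum(axis=1)                 # node degrees
    two_m = k.sum()                   # 2 * number of edges
    same = labels[:, None] == labels[None, :]   # same-community indicator
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

# Two 4-node cliques joined by a single bridge edge.
n = 8
A = np.zeros((n, n))
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0

good = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # split along the bridge
bad = np.array([0, 1, 0, 1, 0, 1, 0, 1])    # split ignoring structure
print(modularity(A, good), modularity(A, bad))
```

The structure-respecting partition scores Q = 11/26 ≈ 0.42, while the arbitrary one scores negative, which is how modularity maximization discriminates between candidate clusterings.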
Rigid Transformations for Stabilized Lower Dimensional Space to Support Subsurface Uncertainty Quantification and Interpretation
Subsurface datasets inherently possess big-data characteristics such as vast volume, diverse features, and high sampling speeds, further compounded by the curse of dimensionality arising from various physical, engineering, and geological inputs. Among existing dimensionality reduction (DR) methods, nonlinear dimensionality reduction (NDR) methods, especially metric multidimensional scaling (MDS), are preferred for subsurface datasets due to their inherent complexity. While MDS retains the intrinsic data structure and quantifies uncertainty, its limitations include solutions that are unique only up to Euclidean transformations and the absence of an out-of-sample point (OOSP) extension. To enhance subsurface inferential and machine learning workflows, datasets must be transformed into stable, reduced-dimension representations that accommodate OOSPs.
Our solution employs rigid transformations to obtain a stabilized, Euclidean-invariant representation of the lower-dimensional space (LDS). By computing an MDS input dissimilarity matrix and applying rigid transformations to multiple realizations, we ensure transformation invariance and integrate OOSPs. This process leverages a convex hull algorithm and incorporates a loss function and normalized stress for distortion quantification. We validate our approach with synthetic data, varying distance metrics, and real-world wells from the Duvernay Formation. Results confirm our method's efficacy in achieving consistent LDS representations. Furthermore, our proposed "stress ratio" (SR) metric provides insight into uncertainty, beneficial for model adjustments and inferential analysis. Consequently, our workflow promises enhanced repeatability and comparability in NDR for subsurface energy resource engineering and associated big-data workflows.
Comment: 30 pages, 17 figures, submitted to Computational Geosciences Journal
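The rigid-transformation step can be sketched with orthogonal Procrustes alignment of two point configurations (a minimal illustration on assumed synthetic data, not the paper's full workflow with convex hulls, stress metrics, and OOSP handling): since MDS realizations differ only by rotation, reflection, and translation, an SVD-based alignment removes that ambiguity.

```python
import numpy as np

def rigid_align(A, B):
    """Align point set A onto B by translation plus an orthogonal map
    (rotation/reflection), via the orthogonal Procrustes solution."""
    cA, cB = A.mean(axis=0), B.mean(axis=0)
    U, _, Vt = np.linalg.svd((A - cA).T @ (B - cB))
    R = U @ Vt                       # optimal rotation/reflection
    return (A - cA) @ R + cB

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2))         # one hypothetical 2-D MDS realization
theta = 0.7                          # another realization: rotated and shifted
Rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
B = A @ Rot + np.array([3.0, -1.0])

residual = np.abs(rigid_align(A, B) - B).max()
print(residual)
```

Because B is an exact rigid transform of A, the aligned residual is numerically zero; on real MDS realizations the residual instead quantifies genuine distortion between runs.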
A primer on machine learning techniques for genomic applications
High-throughput sequencing technologies have enabled the study of complex biological aspects at single-nucleotide resolution, opening the big-data era. The analysis of large volumes of heterogeneous "omic" data, however, requires novel and efficient computational algorithms based on the paradigm of artificial intelligence. In the present review, we introduce and describe the most common machine learning methodologies, and more recently deep learning, applied to a variety of genomics tasks, seeking to emphasize their capabilities, strengths, and limitations in simple and intuitive language. We highlight the power of the machine learning approach in handling big data by means of a real-life example, and underline how the described methods could be relevant in all cases in which large amounts of multimodal genomic data are available.
Predicting Urban Heat Island Mitigation with Random Forest Regression in Belgian Cities
An abundance of impervious surfaces, such as building roofs, in densely populated cities makes green roofs a suitable solution for urban heat island (UHI) mitigation. We therefore employ random forest (RF) regression to predict the impact of green roofs on the surface UHI (SUHI) in Liege, Belgium. While several studies have identified the impact of green roofs on UHI, fewer utilize a remote-sensing-based approach to measure the impact on the Land Surface Temperatures (LST) used to estimate SUHI. Moreover, the RF algorithm can provide useful insights. In this study, we use LST obtained from Landsat-8 imagery and relate it to 2D and 3D morphological parameters that influence LST and UHI effects. Additionally, we utilise parameters that influence wind (e.g., frontal area index). We simulate green roofs by assigning suitable values of the normalised difference vegetation index and built-up index to buildings with flat roofs. Results suggest that green roofs decrease the average LST.
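The general approach can be sketched as follows (all variables, coefficients, and values are simplified stand-ins, not the study's Landsat-derived data): fit an RF regression of LST on morphology-related indices, then simulate a green roof by raising the vegetation index and lowering the built-up index of a pixel and comparing predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
ndvi = rng.uniform(0.0, 0.8, n)          # vegetation index (assumed range)
built_up = rng.uniform(0.0, 1.0, n)      # built-up index
frontal_area = rng.uniform(0.0, 0.5, n)  # wind-related morphology parameter

# Synthetic LST: cooler with vegetation, warmer with built-up surfaces.
lst = 30 - 8 * ndvi + 5 * built_up + 2 * frontal_area + rng.normal(0, 0.3, n)

X = np.column_stack([ndvi, built_up, frontal_area])
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, lst)

# Simulate a green roof on one "building pixel": raise NDVI, lower built-up.
base = np.array([[0.1, 0.9, 0.3]])
green = np.array([[0.6, 0.4, 0.3]])
cooling = rf.predict(base)[0] - rf.predict(green)[0]
print(f"predicted cooling: {cooling:.2f} degrees")
```

Averaging such pixel-level differences over all flat-roofed buildings would give a city-scale estimate of SUHI mitigation, which is the spirit of the study's workflow.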