Prediction of microbial communities for urban metagenomics using a neural network approach
BACKGROUND: Microbes are strongly associated with human health and disease, especially in densely populated cities. Understanding the microbial ecosystem in an urban environment is essential for cities to monitor the transmission of infectious diseases and detect potentially urgent threats. To achieve this goal, DNA samples have been collected and analyzed at subway stations in major cities. However, city-scale sampling at fine-grained geospatial resolution is expensive and laborious. In this paper, we introduce MetaMLAnn, a neural-network-based approach that infers microbial communities at unsampled locations from information reflecting different factors, including subway line networks, sampling material types, and microbial composition patterns. RESULTS: We evaluate the effectiveness of MetaMLAnn on a public metagenomics dataset collected from multiple locations in the New York and Boston subway systems. The experimental results suggest that MetaMLAnn consistently performs better than five other conventional classifiers across taxonomic ranks. At the genus level, MetaMLAnn achieves F1 scores of 0.63 and 0.72 on the New York and Boston datasets, respectively. CONCLUSIONS: By exploiting heterogeneous features, MetaMLAnn captures the hidden interactions between microbial compositions and the urban environment, which enables precise predictions of microbial communities at unmeasured locations.
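The abstract's core idea, predicting per-genus presence at an unsampled station from heterogeneous station features, can be illustrated with a minimal sketch. This is not the authors' MetaMLAnn code; the feature encoding, data, and network size below are invented for illustration, using synthetic stand-ins for subway-line, material-type, and neighbor-composition features.

```python
# Hypothetical sketch (not the MetaMLAnn implementation): predict whether a
# genus is present at a station from a feature vector that could encode subway
# line membership, surface material type, and neighboring compositions.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_stations, n_features = 200, 12           # toy sizes, not from the paper

X = rng.normal(size=(n_stations, n_features))   # encoded station features
w = rng.normal(size=n_features)                 # hidden "true" relationship
y = (X @ w + rng.normal(scale=0.5, size=n_stations) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)                     # held-out accuracy
print(round(acc, 2))
```

In practice each genus (or each taxonomic rank) would get its own output, and the features would come from the actual subway network and sampling metadata rather than random draws.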
Nonnegative matrix factorization with applications to sequencing data analysis
A latent factor model for count data is widely applied to deconvolute mixed signals in biological data, as exemplified by sequencing data in transcriptome or microbiome studies. When pure samples such as single-cell transcriptome data are available, estimators can achieve much better accuracy by exploiting this extra information. However, this advantage quickly disappears in the presence of excessive zeros. To correctly account for this phenomenon, we propose a zero-inflated non-negative matrix factorization that models excessive zeros in both mixed and pure samples, and we derive an effective multiplicative parameter-updating rule. In simulation studies, our method yields smaller bias compared with other deconvolution methods. We apply our approach to gene expression from brain tissue as well as fecal microbiome datasets, illustrating the superior performance of the approach. Our method is implemented as a publicly available R package, iNMF.
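For readers unfamiliar with the multiplicative-update style the abstract builds on, here is a plain Lee-Seung NMF sketch in NumPy. This is the standard baseline only; the paper's zero-inflated extension, which additionally models excess zeros, is not reproduced here.

```python
# Minimal multiplicative-update NMF (Lee-Seung): factor a nonnegative matrix
# V (samples x features) as V ~ W @ H with W, H >= 0. The zero-inflated
# variant from the abstract modifies these updates; this is the vanilla form.
import numpy as np

rng = np.random.default_rng(1)
V = rng.random((30, 20))            # nonnegative data (e.g., normalized counts)
k, eps = 4, 1e-10                   # latent factors; eps guards divisions
W = rng.random((30, k))
H = rng.random((k, 20))

for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)   # update loadings
    W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis

err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)   # relative Frobenius error
print(round(err, 3))
```

The multiplicative form guarantees that W and H stay nonnegative as long as they are initialized nonnegative, which is why it is a natural starting point for count-data extensions.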
In zero-inflated non-negative matrix factorization (iNMF) for the deconvolution of mixed signals in biological data, pure samples play a significant role by resolving the identifiability issue as well as improving the accuracy of the estimates. One of the main issues in using single-cell data is that the identities (labels) of the cells are not given; it is therefore crucial to sort these cells into their correct types computationally. We propose a nonlinear latent variable model that can be used for sorting pure samples as well as grouping mixed samples via deep neural networks. The computational difficulty is handled by adopting a method known as variational autoencoding. While doing so, we keep the NMF structure in the decoder neural network, which makes the output of the network interpretable.
Rigid Transformations for Stabilized Lower Dimensional Space to Support Subsurface Uncertainty Quantification and Interpretation
Subsurface datasets inherently possess big data characteristics such as vast volume, diverse features, and high sampling speeds, further compounded by the curse of dimensionality arising from the many physical, engineering, and geological inputs. Among existing dimensionality reduction (DR) methods, nonlinear dimensionality reduction (NDR) methods, especially metric multidimensional scaling (MDS), are preferred for subsurface datasets due to their inherent complexity. While MDS retains the intrinsic data structure and quantifies uncertainty, its limitations include solutions that are unique only up to Euclidean transformations, and hence unstable across realizations, and the absence of an out-of-sample point (OOSP) extension. To support subsurface inferential and machine learning workflows, datasets must be transformed into stable, reduced-dimension representations that accommodate OOSP.

Our solution employs rigid transformations to obtain a stabilized, Euclidean-transformation-invariant representation of the lower-dimensional space (LDS). By computing an MDS input dissimilarity matrix and applying rigid transformations to multiple realizations, we ensure transformation invariance and integrate OOSP. This process leverages a convex hull algorithm and incorporates a loss function and normalized stress for distortion quantification. We validate our approach with synthetic data, varying distance metrics, and real-world wells from the Duvernay Formation. Results confirm our method's efficacy in achieving consistent LDS representations. Furthermore, our proposed "stress ratio" (SR) metric provides insight into uncertainty, which is beneficial for model adjustments and inferential analysis. Consequently, our workflow promises enhanced repeatability and comparability in NDR for subsurface energy resource engineering and the associated big data workflows.

Comment: 30 pages, 17 figures, submitted to the Computational Geosciences journal
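The core stabilization idea, that two MDS runs on the same dissimilarities can differ by a rigid transform (rotation/reflection plus translation) and can be aligned to a reference via orthogonal Procrustes, can be sketched generically. This is not the paper's full workflow (no convex hull, OOSP, or stress-ratio computation), just the rigid-alignment step in NumPy.

```python
# Generic rigid (Procrustes) alignment of one embedding realization to a
# reference: center both, then solve min_Q ||ref_c - oth_c @ Q|| over
# orthogonal Q via SVD. Illustrative only; not the paper's code.
import numpy as np

rng = np.random.default_rng(2)
ref = rng.normal(size=(50, 2))                 # reference low-dimensional embedding

# A second "realization": the same configuration, rotated and translated.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
other = ref @ R.T + np.array([3.0, -1.0])

ref_c = ref - ref.mean(axis=0)
oth_c = other - other.mean(axis=0)
U, _, Vt = np.linalg.svd(oth_c.T @ ref_c)      # cross-covariance SVD
Q = U @ Vt                                     # optimal orthogonal transform
aligned = oth_c @ Q + ref.mean(axis=0)

residual = np.abs(aligned - ref).max()         # ~0 after rigid alignment
print(residual)
```

Applying this alignment to every MDS realization against a common reference is what makes repeated runs comparable point-by-point.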
Unsupervised approaches for time-evolving graph embeddings with application to human microbiome
More and more diseases have been found to be strongly correlated with disturbances in the microbiome constitution, e.g., obesity, diabetes, and even some types of cancer. Advances in high-throughput omics technologies have made it possible to directly analyze the human microbiome and its impact on human health and physiology. Microbial composition is usually observed over long periods of time, and the interactions between its members are explored. Numerous studies have used microbiome data to accurately differentiate disease states and understand the differences in microbiome profiles between healthy and ill individuals. However, most of them focus mainly on statistical approaches, omitting the microbe-microbe interactions among a large number of microbiome species that, in principle, drive microbiome dynamics. Constructing and analyzing time-evolving graphs is needed to understand how microbial ecosystems respond to a range of distinct perturbations, such as antibiotic exposure, disease, or other general dynamic properties. This becomes especially challenging due to the many complex interactions among microbes and their metastable dynamics.
The key to addressing this challenge lies in representing the time-evolving graphs constructed from microbiome data as fixed-length, low-dimensional feature vectors that preserve the original dynamics. We therefore propose two unsupervised approaches that map the time-evolving graph constructed from microbiome data into a low-dimensional space in which the original dynamics, such as the number of metastable states and their locations, are preserved. The first method relies on the spectral analysis of transfer operators, such as the Perron--Frobenius or Koopman operator, combined with graph kernels. These components enable us to extract topological information, such as the complex interactions of species, from the time-evolving graph and to take into account the dynamic changes in the human microbiome composition. Further, we study how deep learning techniques can contribute to the study of a complex network of microbial species. The second method consists of two key components: 1) the Transformer, a state-of-the-art architecture for sequential data, which learns both the structural patterns of the time-evolving graph and the temporal changes of the microbiome system, and 2) contrastive learning, which allows the model to learn a low-dimensional representation while maintaining metastability.
Finally, this thesis addresses an important challenge in microbiome data: identifying which species, or interactions of species, are responsible for or affected by the changes that the microbiome undergoes from one state (healthy) to another (diseased or antibiotic exposure). Using interpretability techniques for deep learning models, which were originally developed to establish the trustworthiness of such models, we can extract structural information of the time-evolving graph pertaining to particular metastable states.
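The transfer-operator intuition above can be made concrete with a toy Markov chain: when two groups of states mix quickly internally but rarely switch between groups, the second-largest eigenvalue of the transition matrix sits close to 1, and the sign pattern of its eigenvector separates the two metastable states. The chain below is invented for illustration; microbiome compositions and graph kernels are not modeled.

```python
# Toy metastability detection via the spectrum of a transition matrix:
# two 2-state blocks mix fast internally (prob 0.5) and switch between
# blocks with small probability delta.
import numpy as np

delta = 0.01
A = np.full((2, 2), 0.5)                       # fast within-block mixing
P = np.block([[(1 - 2 * delta) * A, delta * np.ones((2, 2))],
              [delta * np.ones((2, 2)), (1 - 2 * delta) * A]])

vals, vecs = np.linalg.eig(P)                  # P is symmetric here
order = np.argsort(-vals.real)
second = vals.real[order[1]]                   # slow mode: 1 - 4*delta = 0.96
split = np.sign(vecs[:, order[1]].real)        # signs separate the two blocks
print(round(second, 3), split)
```

In the thesis setting, the analogous slow eigenmodes of an estimated Perron-Frobenius/Koopman operator are what the embedding is asked to preserve.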
Biology-guided algorithms: Improved cardiovascular risk prediction and biomarker discovery
Medical research has seen a stark increase in the amount of available data. The sheer volume and complexity of the measured variables challenge the use of traditional statistical methods and are beyond the ability of any human to comprehend. Solving this problem demands powerful models capable of capturing the variable interactions and how those interactions are non-linearly related to the condition under study. In this thesis, we first use Machine Learning (ML) methods to achieve better cardiovascular risk prediction and disease biomarker identification, and then describe novel bio-inspired algorithms to solve some of the remaining challenges. On the clinical side, we demonstrate how combining targeted plasma proteomics with ML models outperforms traditional clinical risk factors in predicting first-time acute myocardial infarction as well as recurrent atherosclerotic cardiovascular disease. We then shed light on the pathophysiological pathways involved in heart failure development using a multi-domain ML model. To improve prediction, we develop a novel graph kernel that incorporates protein-protein interaction information, and we suggest a manifold mixing algorithm to increase inter-domain information flow in multi-domain models. Finally, we address global model interpretability to uncover the most important variables governing the prediction. Permutation importance is an intuitive and scalable method commonly used in practice, but it is biased in the presence of covariates. We propose a novel framework to disentangle the shared information between covariates, making permutation importance competitive against methodologies that consider all marginal contributions of a feature, such as SHAP.
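Plain permutation importance, the baseline the thesis improves on, is simple to sketch: permute one feature at a time and measure the drop in model score. The data and model below are invented for illustration; the thesis's covariate-disentangling framework is not shown.

```python
# Baseline permutation importance: shuffle each feature and record how much
# the model's R^2 drops. An informative feature yields a large drop; a noise
# feature yields a drop near zero.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)                        # informative feature
x2 = rng.normal(size=n)                        # pure-noise feature
y = 2.0 * x1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2])

model = LinearRegression().fit(X, y)
base = model.score(X, y)                       # R^2 before permutation

importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])       # break feature-target link
    importances.append(base - model.score(Xp, y))

print([round(v, 2) for v in importances])
```

The bias the thesis targets appears when x1 and x2 are correlated: permuting one then destroys shared information and mis-attributes importance, which is exactly the case this toy example avoids.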
Perturbation-based Analysis of Compositional Data
Existing statistical methods for compositional data analysis are inadequate for many modern applications for two reasons. First, modern compositional datasets, for example in microbiome research, display traits such as high-dimensionality and sparsity that are poorly modelled with traditional approaches. Second, assessing -- in an unbiased way -- how summary statistics of a composition (e.g., racial diversity) affect a response variable is not straightforward. In this work, we propose a framework based on hypothetical data perturbations that addresses both issues. Unlike existing methods for compositional data, we do not transform the data and instead use perturbations to define interpretable statistical functionals on the compositions themselves, which we call average perturbation effects. These average perturbation effects, which can be employed in many applications, naturally account for confounding that biases frequently used marginal dependence analyses. We show how average perturbation effects can be estimated efficiently by deriving a perturbation-dependent reparametrization and applying semiparametric estimation techniques. We analyze the proposed estimators empirically on simulated data and demonstrate advantages over existing techniques on US census and microbiome data. For all proposed estimators, we provide confidence intervals with uniform asymptotic coverage guarantees.
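The perturbation idea can be sketched in a toy setting: a compositional perturbation scales one component and renormalizes back to the simplex, and the "average perturbation effect" is the mean change in a (here, known) response surface under that hypothetical shift. The functional forms and names below are invented; the paper's semiparametric estimators and reparametrization are not reproduced.

```python
# Toy average perturbation effect on compositional data: scale component j by
# `factor`, renormalize rows to the simplex, and average the change in a
# known toy response surface mu. Illustrative only.
import numpy as np

rng = np.random.default_rng(4)
n, p = 1000, 5
raw = rng.gamma(shape=2.0, size=(n, p))
X = raw / raw.sum(axis=1, keepdims=True)        # compositions on the simplex

def perturb(comp, j, factor):
    out = comp.copy()
    out[:, j] *= factor
    return out / out.sum(axis=1, keepdims=True) # renormalize to the simplex

def mu(comp):
    return 3.0 * comp[:, 0]                     # hypothetical response surface

ape = float(np.mean(mu(perturb(X, 0, 1.5)) - mu(X)))
print(round(ape, 3))
```

In the paper's setting mu is unknown and must be estimated, which is where the perturbation-dependent reparametrization and semiparametric machinery come in.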
Sparsity and Weak Supervision in Quantum Machine Learning
Quantum computing is an interdisciplinary field at the intersection of computer science, mathematics, and physics that studies information processing tasks on a quantum computer. A quantum computer is a device whose operations are governed by the laws of quantum mechanics. As quantum computers near the era of commercialization and quantum supremacy, it is essential to consider the potential applications that we might benefit from. Among the many applications of quantum computation, one of the emerging fields is quantum machine learning. We focus on predictive models for binary classification and on variants of Support Vector Machines that we expect to be especially important when training data become so large that a quantum algorithm with a guaranteed speedup becomes useful. We present a quantum machine learning algorithm for training a Sparse Support Vector Machine for problems with large datasets that require a sparse solution. We also present the first quantum semi-supervised algorithm, where we still have a large dataset, but only a small fraction is provided with labels. While the availability of data for training machine learning models is steadily increasing, it is oftentimes much easier to collect feature vectors than to obtain the corresponding labels. Here, we present a quantum machine learning algorithm for training Semi-Supervised Kernel Support Vector Machines. The algorithm uses recent advances in quantum sample-based Hamiltonian simulation to extend the existing Quantum LS-SVM algorithm to handle the semi-supervised term in the loss while maintaining the same quantum speedup as the Quantum LS-SVM.
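For context, the classical LS-SVM that the quantum algorithm accelerates reduces training to solving one linear system in the dual variables. The sketch below is purely classical, uses invented toy data, and omits the semi-supervised term discussed in the abstract.

```python
# Classical LS-SVM: training reduces to the linear system
#   [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]
# where K is the kernel matrix. This is the system a quantum linear-system
# solver targets; everything here is classical and illustrative.
import numpy as np

rng = np.random.default_rng(5)
n = 80
X = np.vstack([rng.normal(-1.0, 0.6, size=(n // 2, 2)),
               rng.normal(+1.0, 0.6, size=(n // 2, 2))])
y = np.array([-1.0] * (n // 2) + [1.0] * (n // 2))

gamma, sigma = 1.0, 1.0
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))              # RBF kernel matrix

A = np.zeros((n + 1, n + 1))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
A[1:, 1:] = K + np.eye(n) / gamma               # regularized kernel block
rhs = np.concatenate([[0.0], y])
sol = np.linalg.solve(A, rhs)
b, alpha = sol[0], sol[1:]

pred = np.sign(K @ alpha + b)                   # predictions on training data
acc = (pred == y).mean()
print(acc)
```

Because the whole optimization collapses to one dense linear solve, LS-SVMs are a natural fit for quantum linear-system algorithms, which is the speedup the abstract's algorithms preserve.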