SWIFT: Scalable Wasserstein Factorization for Sparse Nonnegative Tensors
Existing tensor factorization methods assume that the input tensor follows some specific distribution (e.g., Poisson, Bernoulli, or Gaussian) and solve the factorization by minimizing an empirical loss function defined by the corresponding distribution. However, this approach suffers from several drawbacks: 1) in reality, the underlying distributions are complicated and unknown, so they cannot be well approximated by a simple parametric distribution; 2) the correlation across dimensions of the input tensor is not well utilized, leading to sub-optimal performance. Although heuristics have been proposed to incorporate such correlation as side information under a Gaussian distribution, they cannot easily be generalized to other distributions. Thus, a more principled way of utilizing correlation in tensor factorization models remains an open challenge. Without assuming any explicit distribution, we formulate tensor factorization as an optimal transport problem with the Wasserstein distance, which can handle non-negative inputs.
We introduce SWIFT, which minimizes the Wasserstein distance between the input tensor and its reconstruction. In particular, we define the N-th order tensor Wasserstein loss for the widely used CP factorization and derive an optimization algorithm that minimizes it. By leveraging the sparsity structure and equivalent reformulations for computational efficiency, SWIFT is as scalable as other well-known CP algorithms. Using the factor matrices as features, SWIFT achieves up to 9.65% and 11.31% relative improvement over baselines on downstream prediction tasks. Under noisy conditions, SWIFT achieves up to 15% and 17% relative improvement over the best competitors on the prediction tasks.

Comment: Accepted by AAAI-2
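The abstract does not give SWIFT's optimization procedure, but Wasserstein losses of this kind are commonly computed with entropic regularization and Sinkhorn iterations. A minimal sketch of that general technique for two nonnegative histograms (function names, the regularization value, and the ground cost here are illustrative, not the paper's algorithm) is:

```python
import numpy as np

def sinkhorn_distance(a, b, cost, reg=0.1, n_iter=200):
    """Entropic-regularized Wasserstein distance between two
    nonnegative histograms a and b via Sinkhorn scaling iterations."""
    a = a / a.sum()
    b = b / b.sum()
    K = np.exp(-cost / reg)          # Gibbs kernel of the ground cost
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)            # scale columns to match marginal b
        u = a / (K @ v)              # scale rows to match marginal a
    transport = np.diag(u) @ K @ np.diag(v)
    return np.sum(transport * cost)

# Two point masses on a 1-D grid, four cells apart
grid = np.arange(5, dtype=float)
cost = np.abs(grid[:, None] - grid[None, :])      # ground cost |i - j|
a = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
print(round(sinkhorn_distance(a, b, cost), 2))    # → 4.0 (mass moves 4 cells)
```

The same scaling loop extends to the tensor setting by comparing the flattened input tensor with its CP reconstruction, which is where the sparsity exploitation described above becomes essential.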
PANTHER: Pathway Augmented Nonnegative Tensor factorization for HighER-order feature learning
Genetic pathways usually encode molecular mechanisms that can inform targeted interventions. It is often challenging for existing machine learning approaches to jointly model genetic pathways (higher-order features) and variants (atomic features), and to present interpretable models to clinicians. To build more accurate and more interpretable machine learning models for genetic medicine, we introduce Pathway Augmented Nonnegative Tensor factorization for HighER-order feature learning (PANTHER). PANTHER selects informative genetic pathways that directly encode molecular mechanisms. We apply genetically motivated constrained tensor factorization to group pathways in a way that reflects molecular mechanism interactions, and then train a softmax classifier for disease types using the identified pathway groups. We evaluated PANTHER against multiple state-of-the-art constrained tensor/matrix factorization models, as well as group-guided and Bayesian hierarchical models. PANTHER outperforms all state-of-the-art comparison models significantly (p < 0.05). Our experiments on large-scale Next Generation Sequencing (NGS) and whole-genome genotyping datasets also demonstrate the wide applicability of PANTHER. We performed feature analysis for predicting disease types, which suggested insights into and benefits of the identified pathway groups.

Comment: Accepted by the 35th AAAI Conference on Artificial Intelligence (AAAI 2021)
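The abstract does not spell out PANTHER's update rules. As background for the nonnegativity constraint it relies on, the classic multiplicative-update scheme for the matrix analogue (Lee-Seung NMF, not the paper's constrained tensor method) can be sketched as:

```python
import numpy as np

def nmf_multiplicative(X, rank, n_iter=2000, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for nonnegative matrix
    factorization X ~= W @ H. Because each update multiplies by a
    ratio of nonnegative terms, W and H stay elementwise nonnegative,
    the same constraint nonnegative tensor factorization enforces."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # multiplicative step: H stays >= 0
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # multiplicative step: W stays >= 0
    return W, H

# A small nonnegative matrix with an exact rank-2 nonnegative factorization
X = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [1., 1., 1.]])
W, H = nmf_multiplicative(X, rank=2)
err = np.linalg.norm(W @ H - X)  # near zero: this X is exactly rank 2
```

In PANTHER the factors additionally carry pathway-motivated constraints; the sketch above shows only the generic nonnegative core such methods build on.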
Learning and validating clinically meaningful phenotypes from electronic health data
The ever-growing adoption of electronic health records (EHRs) to record patients' health journeys has resulted in vast amounts of heterogeneous, complex, and unwieldy information [Hripcsak and Albers, 2013]. Distilling this raw data into clinical insights presents great opportunities and challenges for the research and medical communities. One approach to this distillation is computational phenotyping: the process of extracting clinically relevant and interesting characteristics from a set of clinical documentation, such as that recorded in EHRs. Clinicians can use computational phenotyping, which can be viewed as a form of dimensionality reduction in which a set of phenotypes forms a latent space, to reason about populations, identify patients for randomized case-control studies, and extrapolate patient disease trajectories. In recent years, high-throughput computational approaches have made strides in extracting potentially clinically interesting phenotypes from data contained in EHR systems.
Tensor factorization methods have shown particular promise in deriving phenotypes. However, phenotyping via tensor factorization has the following weaknesses: 1) the extracted phenotypes can lack diversity, which makes them harder for clinicians to reason about and use in practice; 2) many tensor factorization methods are unsupervised and do not utilize side information that may be available about the population or about the relationships between clinical characteristics in the data (e.g., diagnoses and medications); and 3) validating the clinical relevance of the extracted phenotypes requires domain training and expertise. This dissertation addresses all three of these limitations. First, we present tensor factorization methods that discover sparse and concise phenotypes in unsupervised, supervised, and semi-supervised settings. Second, via two tools we built, we show how to leverage domain expertise in the form of publicly available medical articles to evaluate the clinical validity of the discovered phenotypes. Third, we combine tensor factorization and the phenotype validation tools to guide the discovery process toward more clinically relevant phenotypes.
Wellness Representation of Users in Social Media: Towards Joint Modelling of Heterogeneity and Temporality
The increasing popularity of social media has encouraged health consumers to share, explore, and validate health and wellness information on social networks, which provide a rich repository of Patient Generated Wellness Data (PGWD). While data-driven healthcare has attracted a lot of attention from academia and industry for improving care delivery through personalized healthcare, limited research has been done on harvesting and utilizing PGWD available on social networks. Recently, representation learning has been widely used in many applications to learn low-dimensional embeddings of users. However, existing approaches for representation learning are not directly applicable to PGWD due to its domain nature, characterized by longitudinality, incompleteness, and sparsity of observed data, as well as heterogeneity of the patient population. To tackle these problems, we propose an approach which directly learns the embedding from longitudinal data of users, instead of a vector-based representation. In particular, we simultaneously learn a low-dimensional latent space as well as the temporal evolution of users in the wellness space. The proposed method takes into account two types of wellness prior knowledge: (1) temporal progression of wellness attributes; and (2) heterogeneity of wellness attributes in the patient population. Our approach scales well to large datasets using parallel stochastic gradient descent. We conduct extensive experiments to evaluate our framework on three major tasks in the wellness domain: attribute prediction, success prediction, and community detection. Experimental results on two real-world datasets demonstrate the ability of our approach to learn effective user representations.
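The abstract does not specify the model's loss or its parallel SGD scheme. A toy sketch of the general idea it describes, per-timestep user embeddings with a temporal smoothness penalty, fit here with plain full-batch gradient steps for simplicity (the function name, loss form, and hyperparameters are all assumptions), might look like:

```python
import numpy as np

def fit_temporal_embedding(X, dim=2, lr=0.05, lam=0.1, epochs=300, seed=0):
    """Toy joint model: per-timestep user embeddings U[t] with shared
    attribute factors V, fit so X[t] ~= U[t] @ V.T. The penalty
    lam * ||U[t] - U[t-1]||^2 encourages smooth temporal evolution
    of each user in the latent wellness space."""
    rng = np.random.default_rng(seed)
    T, n_users, n_attrs = X.shape
    U = rng.normal(scale=0.1, size=(T, n_users, dim))
    V = rng.normal(scale=0.1, size=(n_attrs, dim))
    for _ in range(epochs):
        for t in range(T):
            err = X[t] - U[t] @ V.T                # reconstruction residual at time t
            grad_U = -err @ V + (lam * (U[t] - U[t - 1]) if t > 0 else 0.0)
            U[t] -= lr * grad_U                    # full-batch gradient step on U[t]
            V -= lr * (-err.T @ U[t])              # shared factors updated every step
    return U, V
```

Replacing the full-batch step with per-observation updates over a partitioned dataset is what makes the parallel SGD scaling mentioned above possible.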
Regularized and Smooth Double Core Tensor Factorization for Heterogeneous Data
We introduce a general tensor model suitable for data analytic tasks on heterogeneous data sets, wherein there are joint low-rank structures within groups of observations, but also discriminative structures across different groups. To capture such complex structures, a double core tensor (DCOT) factorization model is introduced together with a family of smoothing loss functions. By leveraging the proposed smoothing function, the model accurately estimates the model factors, even in the presence of missing entries. A linearized ADMM method is employed to solve regularized versions of DCOT factorizations, avoiding large tensor operations and large memory storage requirements. Further, we theoretically establish its global convergence, together with consistency of the estimates of the model parameters. The effectiveness of the DCOT model is illustrated on several real-world examples, including image completion, recommender systems, subspace clustering, and module detection in heterogeneous multi-modal Omics data, where it provides more insightful decompositions than conventional tensor methods.
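The abstract gives no algorithmic detail beyond "linearized ADMM". The ingredient such linearized schemes share is replacing an expensive exact subproblem with a gradient step followed by a cheap proximal (shrinkage) operation. A minimal illustration of that pattern on the lasso, using ISTA (a linearized proximal-gradient method, not the paper's DCOT solver), is:

```python
import numpy as np

def soft_threshold(Z, tau):
    """Proximal operator of tau * ||.||_1: elementwise shrinkage toward zero."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def ista_lasso(A, b, lam, n_iter=500):
    """ISTA for min_x 0.5*||A x - b||^2 + lam*||x||_1.
    Each iteration linearizes the smooth term at the current point and
    applies the cheap shrinkage prox -- the same trick a linearized ADMM
    solver uses to avoid large matrix/tensor inversions."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, lam / L)
    return x

# With A = I the lasso solution is just soft-thresholding of b:
x = ista_lasso(np.eye(3), np.array([3.0, 0.2, -2.0]), lam=1.0)
# x equals [2., 0., -1.]
```

In the DCOT setting the smooth term is the (smoothed) reconstruction loss over the factor being updated, and the prox encodes the chosen regularizer, but the per-iteration structure is the same.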
Structured representation learning from complex data
This thesis advances several theoretical and practical aspects of the recently introduced restricted Boltzmann machine, a powerful probabilistic and generative framework for modelling data and learning representations. The contributions of this study form a systematic and common theme: learning structured representations from complex data.