Conditional Hierarchical Bayesian Tucker Decomposition
Our research focuses on studying and developing methods for reducing the
dimensionality of large datasets, common in biomedical applications. A major
problem when learning information about patients based on genetic sequencing
data is that there are often more feature variables (genetic data) than
observations (patients). This makes direct supervised learning difficult. One
way of reducing the feature space is to use latent Dirichlet allocation in
order to group genetic variants in an unsupervised manner. Latent Dirichlet
allocation is a common model in natural language processing, which describes a
document as a mixture of topics, each with a probability of generating certain
words. This can be generalized as a Bayesian tensor decomposition to account
for multiple feature variables. While we made some progress improving and
modifying these methods, our most significant contributions are in hierarchical
topic modeling. We developed distinct methods of incorporating hierarchical
topic modeling, based on the nested Chinese restaurant process and the Pachinko
Allocation Machine, into Bayesian tensor decompositions. We apply these models
to predict whether patients have autism spectrum disorder based on genetic
sequencing data. We examine a dataset from the National Database for Autism
Research consisting of paired siblings -- one with autism, and the other
without -- and counts of their genetic variants. Additionally, we link the
genes with their Reactome biological pathways. We combine this information into
a tensor of patients, counts of their genetic variants, and the membership of
these genes in pathways. Once we decompose this tensor, we use logistic
regression on the reduced features in order to predict if patients have autism.
We also perform a similar analysis of a dataset of patients with one of four
common types of cancer (breast, lung, prostate, and colorectal).
Comment: 20 pages, added model evaluation and log-likelihood section
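The LDA-based dimensionality reduction the abstract describes can be sketched with a minimal collapsed Gibbs sampler. The toy data below (4 "patients" as documents, 8 variant ids as words) and all parameter values are illustrative stand-ins, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def lda_gibbs(docs, n_topics, n_vocab, iters=100, alpha=0.1, beta=0.01):
    """Collapsed Gibbs sampler for LDA. docs: list of lists of word ids.
    Returns theta, each document's topic mixture (the reduced features)."""
    z = [list(rng.integers(n_topics, size=len(d))) for d in docs]
    ndk = np.zeros((len(docs), n_topics))   # document-topic counts
    nkw = np.zeros((n_topics, n_vocab))     # topic-word counts
    nk = np.zeros(n_topics)                 # topic totals
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                 # unassign the current topic
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_vocab * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                 # resample from the conditional
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    return theta

# Hypothetical toy data: 4 "patients" (documents), 8 variant ids (words).
docs = [[0, 0, 1, 2], [0, 1, 1, 3], [4, 5, 5, 6], [4, 6, 6, 7]]
theta = lda_gibbs(docs, n_topics=2, n_vocab=8)
```

Each row of `theta` is a low-dimensional feature vector for one patient; fitting a logistic regression on these rows, rather than on the raw variant counts, is the downstream step the abstract describes.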
Bayesian Methods in Tensor Analysis
Tensors, also known as multidimensional arrays, are useful data structures in
machine learning and statistics. In recent years, Bayesian methods have emerged
as a popular direction for analyzing tensor-valued data since they provide a
convenient way to introduce sparsity into the model and conduct uncertainty
quantification. In this article, we provide an overview of frequentist and
Bayesian methods for solving tensor completion and regression problems, with a
focus on Bayesian methods. We review common Bayesian tensor approaches
including model formulation, prior assignment, posterior computation, and
theoretical properties. We also discuss potential future directions in this
field.
Comment: 32 pages, 8 figures, 2 tables
Dynamic Tensor Decomposition via Neural Diffusion-Reaction Processes
Tensor decomposition is an important tool for multiway data analysis. In
practice, the data is often sparse yet associated with rich temporal
information. Existing methods, however, often under-use the time information
and ignore the structural knowledge within the sparsely observed tensor
entries. To overcome these limitations and to better capture the underlying
temporal structure, we propose Dynamic EMbedIngs fOr dynamic Tensor
dEcomposition (DEMOTE). We develop a neural diffusion-reaction process to
estimate dynamic embeddings for the entities in each tensor mode. Specifically,
based on the observed tensor entries, we build a multi-partite graph to encode
the correlation between the entities. We construct a graph diffusion process to
co-evolve the embedding trajectories of the correlated entities and use a
neural network to construct a reaction process for each individual entity. In
this way, our model can capture both the commonalities and personalities during
the evolution of the embeddings for different entities. We then use a neural
network to model the entry value as a nonlinear function of the embedding
trajectories. For model estimation, we combine ODE solvers with a stochastic
mini-batch learning algorithm. We propose a stratified sampling
method to balance the cost of processing each mini-batch so as to improve the
overall efficiency. We show the advantage of our approach in both simulation
study and real-world applications. The code is available at
https://github.com/wzhut/Dynamic-Tensor-Decomposition-via-Neural-Diffusion-Reaction-Processes
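The diffusion-reaction idea can be illustrated with a toy forward-Euler integration of embedding trajectories. The graph, embedding dimension, and one-layer "reaction" network below are hypothetical simplifications, not DEMOTE's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bipartite interaction graph between 3 "users" and 2 "items";
# an edge means the pair co-occurs in an observed tensor entry.
A = np.array([[1, 0],
              [1, 1],
              [0, 1]], dtype=float)
n_u, n_i = A.shape

# Joint adjacency over all 5 entities and its graph Laplacian.
Adj = np.zeros((n_u + n_i, n_u + n_i))
Adj[:n_u, n_u:] = A
Adj[n_u:, :n_u] = A.T
L = np.diag(Adj.sum(axis=1)) - Adj

d = 4                                      # embedding dimension
U = rng.standard_normal((n_u + n_i, d))    # initial entity embeddings
W = 0.1 * rng.standard_normal((d, d))      # one-layer "reaction" network

def ode_rhs(U):
    diffusion = -L @ U          # pulls correlated entities' embeddings together
    reaction = np.tanh(U @ W)   # entity-specific nonlinear drift
    return diffusion + reaction

# Forward-Euler integration of the embedding trajectories.
dt, steps = 0.05, 100
for _ in range(steps):
    U = U + dt * ode_rhs(U)
```

The diffusion term captures the commonalities across correlated entities, while the reaction term gives each entity its own dynamics; the paper models entry values as a neural function of the resulting trajectories and trains with proper ODE solvers rather than this fixed-step sketch.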
Economic Complexity Unfolded: Interpretable Model for the Productive Structure of Economies
Economic complexity reflects the amount of knowledge that is embedded in the
productive structure of an economy. It resides on the premise of hidden
capabilities - fundamental endowments underlying the productive structure. In
general, measuring the capabilities behind economic complexity directly is
difficult, and indirect measures have been suggested which exploit the fact
that the presence of the capabilities is expressed in a country's mix of
products. We complement these studies by introducing a probabilistic framework
which leverages Bayesian non-parametric techniques to extract the dominant
features behind the comparative advantage in exported products. Based on
economic evidence and trade data, we place a restricted Indian Buffet Process
on the distribution of countries' capability endowments, appealing to a culinary
metaphor to model the process of capability acquisition. The approach comes
with a unique level of interpretability, as it produces a concise and
economically plausible description of the instantiated capabilities.
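For intuition about the culinary metaphor, here is a sketch of sampling from a standard (unrestricted) Indian Buffet Process; the paper's restricted variant adds economically motivated constraints not shown here:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_ibp(n_customers, alpha):
    """Draw a binary customer-by-dish matrix Z from a standard IBP prior."""
    counts = []                  # how many customers have taken each dish
    rows = []
    for n in range(1, n_customers + 1):
        # Take each existing dish with probability m / n (its popularity).
        row = [1 if rng.random() < m / n else 0 for m in counts]
        for k, taken in enumerate(row):
            counts[k] += taken
        # Then try Poisson(alpha / n) brand-new dishes.
        n_new = rng.poisson(alpha / n)
        row += [1] * n_new
        counts += [1] * n_new
        rows.append(row)
    K = len(counts)              # total number of dishes instantiated
    return np.array([r + [0] * (K - len(r)) for r in rows])

Z = sample_ibp(10, alpha=2.0)   # countries x capabilities, in the paper's terms
```

Rows are countries and columns are instantiated capabilities: popular capabilities are acquired by more countries, and the number of columns is not fixed in advance, which is what makes the prior non-parametric.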
Robust Bayesian Tensor Factorization with Zero-Inflated Poisson Model and Consensus Aggregation
Tensor factorizations (TF) are powerful tools for the efficient
representation and analysis of multidimensional data. However, classic TF
methods based on maximum likelihood estimation underperform when applied to
zero-inflated count data, such as single-cell RNA sequencing (scRNA-seq) data.
Additionally, the stochasticity inherent in TFs results in factors that vary
across repeated runs, making interpretation and reproducibility of the results
challenging. In this paper, we introduce Zero Inflated Poisson Tensor
Factorization (ZIPTF), a novel approach for the factorization of
high-dimensional count data with excess zeros. To address the challenge of
stochasticity, we introduce Consensus Zero Inflated Poisson Tensor
Factorization (C-ZIPTF), which combines ZIPTF with a consensus-based
meta-analysis. We evaluate our proposed ZIPTF and C-ZIPTF on synthetic
zero-inflated count data and synthetic and real scRNA-seq data. ZIPTF
consistently outperforms baseline matrix and tensor factorization methods in
terms of reconstruction accuracy for zero-inflated data, with the largest gains
when the probability of excess zeros is high.
Additionally, C-ZIPTF significantly improves the consistency and accuracy of
the factorization. When tested on both synthetic and real scRNA-seq data, ZIPTF
and C-ZIPTF consistently recover known and biologically meaningful gene
expression programs.
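The key ingredient of ZIPTF is the zero-inflated Poisson likelihood, which mixes an "excess zero" component with an ordinary Poisson. A minimal NumPy version follows; the parameter names are illustrative, and in the full model the Poisson rate would come from the tensor factorization rather than being a single scalar:

```python
import numpy as np
from math import lgamma

def zip_loglik(x, lam, pi):
    """Log-likelihood of counts x under a zero-inflated Poisson:
    with probability pi a count is an 'excess' zero, otherwise Poisson(lam)."""
    x = np.asarray(x, dtype=float)
    log_fact = np.array([lgamma(v + 1.0) for v in x])
    pois = x * np.log(lam) - lam - log_fact          # Poisson log-pmf
    return np.where(x == 0,
                    np.log(pi + (1 - pi) * np.exp(-lam)),
                    np.log(1 - pi) + pois).sum()
```

With `pi = 0` this reduces to the plain Poisson likelihood, and raising `pi` increases the likelihood of observed zeros, which is exactly the regime where scRNA-seq data punishes non-zero-inflated factorizations.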
Detection of Review Abuse via Semi-Supervised Binary Multi-Target Tensor Decomposition
Product reviews and ratings on e-commerce websites provide customers with
detailed insights about various aspects of the product such as quality,
usefulness, etc. Since they influence customers' buying decisions, product
reviews have become a fertile ground for abuse by sellers (colluding with
reviewers) to promote their own products or to tarnish the reputation of
competitors' products. In this paper, we focus on detecting such abusive
entities (both sellers and reviewers) by applying tensor decomposition on the
product reviews data. While tensor decomposition is mostly unsupervised, we
formulate our problem as a semi-supervised binary multi-target tensor
decomposition, to take advantage of currently known abusive entities. We
empirically show that our multi-target semi-supervised model achieves higher
precision and recall in detecting abusive entities as compared to unsupervised
techniques. Finally, we show that our proposed stochastic partial natural
gradient inference for our model empirically achieves faster convergence than
stochastic gradient and Online-EM with sufficient statistics.
Comment: Accepted to the 25th ACM SIGKDD Conference on Knowledge Discovery and
Data Mining, 2019. Contains supplementary material. arXiv admin note: text
overlap with arXiv:1804.0383
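The semi-supervised idea, using currently known abusive entities to steer the factors, can be sketched as a joint objective: CP reconstruction error plus a logistic loss on the labeled seller factors. Everything below (the toy tensor, labels `y`, weight `lam`, and plain gradient descent) is an illustrative stand-in; the paper itself formulates a binary multi-target model and fits it with stochastic partial natural gradient inference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tensor of (sellers x reviewers x rating bins).
I, J, K, R = 6, 5, 3, 2
T = rng.random((I, J, K))
y = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])   # known abusive sellers
lam = 0.5                                      # weight of the supervised term

A = 0.1 * rng.standard_normal((I, R))          # seller factors
B = 0.1 * rng.standard_normal((J, R))          # reviewer factors
C = 0.1 * rng.standard_normal((K, R))          # rating factors
w = np.zeros(R)                                # logistic head on seller factors

def loss(A, B, C, w):
    E = np.einsum('ir,jr,kr->ijk', A, B, C) - T
    p = 1 / (1 + np.exp(-(A @ w)))
    ce = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)).sum()
    return 0.5 * (E ** 2).sum() + lam * ce

loss0 = loss(A, B, C, w)
lr = 0.02
for _ in range(800):
    E = np.einsum('ir,jr,kr->ijk', A, B, C) - T
    p = 1 / (1 + np.exp(-(A @ w)))
    gA = np.einsum('ijk,jr,kr->ir', E, B, C) + lam * np.outer(p - y, w)
    gB = np.einsum('ijk,ir,kr->jr', E, A, C)
    gC = np.einsum('ijk,ir,jr->kr', E, A, B)
    gw = lam * (A.T @ (p - y))
    A -= lr * gA; B -= lr * gB; C -= lr * gC; w -= lr * gw
loss1 = loss(A, B, C, w)
```

The supervised term pulls the seller factors toward a representation that separates known abusive sellers, so the learned factors encode both the review structure and the available labels.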