3,754 research outputs found
Sparse Linear Identifiable Multivariate Modeling
In this paper we consider sparse and identifiable linear latent variable
(factor) and linear Bayesian network models for parsimonious analysis of
multivariate data. We propose a computationally efficient method for joint
parameter and model inference, and model comparison. It consists of a fully
Bayesian hierarchy for sparse models using slab and spike priors (two-component
delta-function and continuous mixtures), non-Gaussian latent factors and a
stochastic search over the ordering of the variables. The framework, which we
call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and
benchmarked on artificial and real biological data sets. SLIM is closest in
spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in
inference, Bayesian network structure learning and model comparison.
Experimentally, SLIM performs as well as or better than LiNGAM, with
comparable computational complexity. We attribute this mainly to the stochastic
search strategy used, and to parsimony (sparsity and identifiability), which is
an explicit part of the model. We propose two extensions to the basic i.i.d.
linear framework: non-linear dependence on observed variables, called SNIM
(Sparse Non-linear Identifiable Multivariate modeling) and allowing for
correlations between latent variables, called CSLIM (Correlated SLIM), for
temporal and/or spatial data. The source code and scripts are available from
http://cogsys.imm.dtu.dk/slim/.
Comment: 45 pages, 17 figures
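As a rough illustration of the two-component delta-function slab and spike prior mentioned in the abstract above, the sketch below draws a sparse weight matrix in which each entry is exactly zero (the spike) unless an inclusion indicator switches on a continuous slab. The Gaussian slab, the fixed inclusion probability, and the names `sample_spike_and_slab`, `inclusion_prob` and `slab_scale` are assumptions made for illustration; this is not the SLIM hierarchy itself.

```python
import numpy as np

def sample_spike_and_slab(shape, inclusion_prob=0.2, slab_scale=1.0, seed=0):
    """Draw weights from a delta-function spike at zero mixed with a Gaussian slab.

    Assumption: the slab is N(0, slab_scale^2) and every entry shares one
    inclusion probability; the abstract does not specify these choices.
    """
    rng = np.random.default_rng(seed)
    # Binary indicators: True means the entry is drawn from the slab.
    mask = rng.random(shape) < inclusion_prob
    slab = rng.normal(0.0, slab_scale, size=shape)
    # Entries outside the slab are exactly zero (the delta-function spike).
    return np.where(mask, slab, 0.0), mask

weights, mask = sample_spike_and_slab((5, 3))
print(weights)
```

Because the spike is a point mass, sparsity is exact: the zero pattern of `weights` coincides with the indicator matrix `mask`.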
Neural Likelihoods via Cumulative Distribution Functions
We leverage neural networks as universal approximators of monotonic functions
to build a parameterization of conditional cumulative distribution functions
(CDFs). By the application of automatic differentiation with respect to
response variables and then to parameters of this CDF representation, we are
able to build black box CDF and density estimators. A suite of families is
introduced as alternative constructions for the multivariate case. At one
extreme, the simplest construction is a competitive density estimator against
state-of-the-art deep learning methods, although it does not provide an easily
computable representation of multivariate CDFs. At the other extreme, we have a
flexible construction from which multivariate CDF evaluations and
marginalizations can be obtained by a simple forward pass in a deep neural net,
but where the computation of the likelihood scales exponentially with
dimensionality. Alternatives in between the extremes are discussed. We evaluate
the different representations empirically on a variety of tasks involving tail
area probabilities, tail dependence and (partial) density estimation.
Comment: 10 pages
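The idea of obtaining a density by differentiating a monotone CDF representation with respect to the response can be sketched in a few lines. The example below is a deliberately small, unconditional stand-in: a single layer of logistic units with positive slopes and softmax mixing weights, which is monotone in `y` and has limits 0 and 1. The derivative is written by hand rather than obtained by automatic differentiation, and all names (`make_params`, `cdf`, `density`) are hypothetical; the paper's architectures are not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_params(n_units=4, seed=0):
    """Random unconstrained parameters (illustrative only, not trained)."""
    rng = np.random.default_rng(seed)
    return {
        "log_w": rng.normal(size=n_units),   # slopes are exp(log_w) > 0
        "b": rng.normal(size=n_units),       # offsets
        "logits": rng.normal(size=n_units),  # softmax mixing weights
    }

def _unpack(params, y):
    w = np.exp(params["log_w"])
    pi = np.exp(params["logits"])
    pi = pi / pi.sum()
    # s has shape (len(y), n_units); each column is increasing in y.
    s = sigmoid(np.multiply.outer(np.asarray(y, dtype=float), w) + params["b"])
    return w, pi, s

def cdf(params, y):
    """F(y) = sum_j pi_j * sigmoid(w_j * y + b_j), monotone because w_j > 0."""
    _, pi, s = _unpack(params, y)
    return s @ pi

def density(params, y):
    """f(y) = dF/dy, derived by hand: sum_j pi_j * w_j * s_j * (1 - s_j)."""
    w, pi, s = _unpack(params, y)
    return (w * s * (1.0 - s)) @ pi

params = make_params()
grid = np.linspace(-5.0, 5.0, 201)
F = cdf(params, grid)
f = density(params, grid)
```

In a learned model the hand-written `density` would typically be replaced by an automatic-differentiation call on `cdf`, and the parameters would depend on conditioning inputs.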
Learning Topic Models and Latent Bayesian Networks Under Expansion Constraints
Unsupervised estimation of latent variable models is a fundamental problem
central to numerous applications of machine learning and statistics. This work
presents a principled approach for estimating broad classes of such models,
including probabilistic topic models and latent linear Bayesian networks, using
only second-order observed moments. The sufficient conditions for
identifiability of these models are primarily based on weak expansion
constraints on the topic-word matrix, for topic models, and on the directed
acyclic graph, for Bayesian networks. Because no assumptions are made on the
distribution among the latent variables, the approach can handle arbitrary
correlations among the topics or latent factors. In addition, a tractable
learning method via optimization is proposed and studied in numerical
experiments.
Comment: 38 pages, 6 figures, 2 tables, applications in topic models and
Bayesian networks are studied. Simulation section is added
Parametric Modelling of Multivariate Count Data Using Probabilistic Graphical Models
Multivariate count data are defined as the numbers of items of different
categories obtained by sampling within a population whose individuals are
grouped into categories. The analysis of multivariate count data is a recurrent
and crucial issue in numerous modelling problems, particularly in the fields of
biology and ecology (where the data can represent, for example, children counts
associated with multitype branching processes), sociology and econometrics. We
focus on (I) identifying categories that appear simultaneously or, on the
contrary, that are mutually exclusive, which is achieved by identifying
conditional independence relationships between the variables; (II) building
parsimonious parametric models consistent with these relationships; and (III)
characterising and testing the effects of covariates on the joint distribution
of the counts. To achieve these goals, we propose an approach based on
graphical probabilistic models, and more specifically partially directed
acyclic graphs.
A closed-form approach to Bayesian inference in tree-structured graphical models
We consider the inference of the structure of an undirected graphical model
in an exact Bayesian framework. To avoid convergence issues and highly
demanding Monte Carlo sampling, we aim at achieving the inference with
closed-form posteriors, avoiding any sampling step. This task would be
intractable without any restriction on the considered graphs, so we restrict
the set of considered graphs to mixtures of spanning trees. We investigate under
which conditions on the priors - on both tree structures and parameters - exact
Bayesian inference can be achieved. Under these conditions, we derive a fast and
exact algorithm to compute the posterior probability for an edge to belong to
the tree model using an algebraic result called the Matrix-Tree theorem. We
show that the assumption we have made does not prevent our approach from
performing well on synthetic and flow cytometry data.
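To make the role of the Matrix-Tree theorem concrete, here is a minimal sketch assuming that the distribution over spanning trees factorises as a product of non-negative edge weights (standing in for the per-edge prior and likelihood terms, which the abstract does not spell out). The theorem gives the sum over all spanning trees of the product of edge weights as any cofactor of the weighted graph Laplacian, and the probability that an edge belongs to the tree follows by comparing that sum with and without the edge. The function names are hypothetical, and this per-edge recomputation is a simple illustration rather than the paper's algorithm.

```python
import numpy as np

def tree_partition_function(W):
    """Sum over spanning trees of the product of edge weights (Matrix-Tree theorem).

    W is a symmetric matrix of non-negative edge weights with a zero diagonal.
    """
    W = np.asarray(W, dtype=float)
    laplacian = np.diag(W.sum(axis=1)) - W
    # Any cofactor works; here the first row and column are dropped.
    return np.linalg.det(laplacian[1:, 1:])

def edge_posterior(W, i, j):
    """P(edge (i, j) is in the tree) when P(tree) is proportional to its weight product."""
    W = np.asarray(W, dtype=float)
    W_without = W.copy()
    W_without[i, j] = W_without[j, i] = 0.0
    # Trees that avoid (i, j) are exactly the trees of the graph without it.
    return 1.0 - tree_partition_function(W_without) / tree_partition_function(W)

# Complete graph on 4 nodes with unit weights: 4**(4 - 2) = 16 spanning trees.
W = np.ones((4, 4)) - np.eye(4)
print(tree_partition_function(W))  # approximately 16
print(edge_posterior(W, 0, 1))     # approximately 0.5
```

A useful sanity check is that the edge probabilities sum to the number of nodes minus one, since every spanning tree has exactly that many edges.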