Passing Expectation Propagation Messages with Kernel Methods
We propose to learn a kernel-based message operator which takes as input all
expectation propagation (EP) incoming messages to a factor node and produces an
outgoing message. In ordinary EP, computing an outgoing message involves
estimating a multivariate integral which may not have an analytic expression.
Learning such an operator allows one to bypass the expensive computation of the
integral during inference by directly mapping all incoming messages into an
outgoing message. The operator can be learned from training data (examples of
input and output messages) which allows automated inference to be made on any
kind of factor that can be sampled. Comment: Accepted to Advances in Variational Inference, NIPS 2014 Workshop.
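For context, the integral referred to above is the standard EP projection step. In generic notation (a schematic sketch, not the paper's exact formulation), the outgoing message from a factor f to a variable x_j is

    m_{f \to j}(x_j) \;\propto\; \frac{\operatorname{proj}\!\left[\,\int f(x_1,\dots,x_d)\,\prod_{i} m_{i \to f}(x_i)\, dx_{\setminus j}\right]}{m_{j \to f}(x_j)},

where proj[.] denotes moment matching onto a chosen exponential family. The learned kernel operator maps the incoming messages m_{i -> f} directly to m_{f -> j}, bypassing this integral at inference time.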
On Deep Set Learning and the Choice of Aggregations
Recently, it has been shown that many functions on sets can be represented by
sum decompositions. These decompositions easily lend themselves to neural
approximations, extending the applicability of neural nets to set-valued
inputs---Deep Set learning. This work investigates a core component of Deep Set
architecture: aggregation functions. We suggest and examine alternatives to
commonly used aggregation functions, including learnable recurrent aggregation
functions. Empirically, we show that the Deep Set networks are highly sensitive
to the choice of aggregation functions: beyond improved performance, we find
that learnable aggregations lower hyper-parameter sensitivity and generalize
better to out-of-distribution input sizes.
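To make the sum-decomposition structure and the role of the aggregation concrete, here is a minimal sketch of a Deep Set model with a pluggable aggregation function, including a learnable recurrent one. It assumes PyTorch; the layer sizes and names are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn as nn

class DeepSet(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim, aggregation="sum"):
        super().__init__()
        # phi: applied independently to every set element
        self.phi = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, hid_dim))
        # rho: applied to the aggregated set representation
        self.rho = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, out_dim))
        self.aggregation = aggregation
        if aggregation == "gru":
            # A learnable recurrent aggregation; note it depends on the order
            # in which set elements are fed, unlike sum/mean/max pooling.
            self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)

    def aggregate(self, h):                 # h: (batch, set_size, hid_dim)
        if self.aggregation == "sum":
            return h.sum(dim=1)
        if self.aggregation == "mean":
            return h.mean(dim=1)
        if self.aggregation == "max":
            return h.max(dim=1).values
        if self.aggregation == "gru":
            _, last = self.gru(h)           # last hidden state: (1, batch, hid_dim)
            return last.squeeze(0)
        raise ValueError(self.aggregation)

    def forward(self, x):                   # x: (batch, set_size, in_dim)
        return self.rho(self.aggregate(self.phi(x)))

# Example: classify batches of 10-element sets of 3-D points with mean pooling.
model = DeepSet(in_dim=3, hid_dim=64, out_dim=5, aggregation="mean")
logits = model(torch.randn(8, 10, 3))       # -> shape (8, 5)
```

Swapping the aggregation argument is all it takes to compare sum, mean, max, and recurrent pooling under otherwise identical phi and rho networks.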
Dataset2Vec: Learning Dataset Meta-Features
Meta-learning, or learning to learn, is a machine learning approach that
utilizes prior learning experiences to expedite the learning process on unseen
tasks. As a data-driven approach, meta-learning requires meta-features that
represent the primary learning tasks or datasets, and are estimated
traditionally as engineered dataset statistics that require expert domain
knowledge tailored for every meta-task. In this paper, first, we propose a
meta-feature extractor called Dataset2Vec that combines the versatility of
engineered dataset meta-features with the expressivity of meta-features learned
by deep neural networks. Primary learning tasks or datasets are represented as
hierarchical sets, i.e., as a set of sets, specifically as a set of predictor/target
pairs, and then a DeepSet architecture is employed to regress meta-features on
them. Second, we propose a novel auxiliary meta-learning task with abundant
data called dataset similarity learning that aims to predict if two batches
stem from the same dataset or different ones. In an experiment on a large-scale
hyperparameter optimization task for 120 UCI datasets with varying schemas as a
meta-learning task, we show that the meta-features of Dataset2Vec outperform
the expert engineered meta-features and thus demonstrate the usefulness of
learned meta-features for datasets with varying schemas for the first time.
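As an illustration of the set-of-sets encoding and the dataset-similarity auxiliary task, here is a toy sketch. It assumes PyTorch, and the layer sizes, pooling order, and similarity score are illustrative assumptions rather than the Dataset2Vec implementation.

```python
import torch
import torch.nn as nn

class SetOfSetsEncoder(nn.Module):
    """Encode a sampled batch (X, Y) of a tabular dataset as a meta-feature vector."""
    def __init__(self, hid=64, meta_dim=32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2, hid), nn.ReLU())    # one (x, y) scalar pair
        self.g = nn.Sequential(nn.Linear(hid, hid), nn.ReLU())  # after pooling over rows
        self.h = nn.Linear(hid, meta_dim)                       # after pooling over columns

    def forward(self, X, Y):
        # X: (n_rows, n_predictors), Y: (n_rows, n_targets) for one sampled batch.
        pairs = torch.stack(torch.broadcast_tensors(
            X.unsqueeze(2), Y.unsqueeze(1)), dim=-1)     # (rows, predictors, targets, 2)
        e = self.f(pairs).mean(dim=0)                    # pool over instances (rows)
        e = self.g(e).mean(dim=(0, 1))                   # pool over predictor/target pairs
        return self.h(e)                                 # dataset meta-features

enc = SetOfSetsEncoder()
z1 = enc(torch.randn(20, 5), torch.randn(20, 1))         # batch from dataset A (5 columns)
z2 = enc(torch.randn(32, 7), torch.randn(32, 1))         # batch from dataset B (7 columns)
# Auxiliary task: predict whether two batches stem from the same dataset,
# here scored by a simple similarity of their meta-feature vectors.
same_prob = torch.exp(-torch.norm(z1 - z2))
```

Because pooling is done over rows and over predictor/target pairs, the encoder accepts batches from datasets with different numbers of rows and columns, which is what allows meta-features to transfer across varying schemas.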
On The Identifiability of Mixture Models from Grouped Samples
Finite mixture models are statistical models which appear in many problems in
statistics and machine learning. In such models it is assumed that data are
drawn from random probability measures, called mixture components, which are
themselves drawn from a probability measure P over probability measures. When
estimating mixture models, it is common to make assumptions on the mixture
components, such as parametric assumptions. In this paper, we make no
assumption on the mixture components, and instead assume that observations from
the mixture model are grouped, such that observations in the same group are
known to be drawn from the same component. We show that any mixture of m
probability measures can be uniquely identified provided there are 2m-1
observations per group. Moreover we show that, for any m, there exists a
mixture of m probability measures that cannot be uniquely identified when
groups have 2m-2 observations. Our results hold for any sample space with more
than one element.
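Stated a little more formally (a paraphrase in generic notation, not a quotation of the paper): if each group consists of n observations drawn i.i.d. from a single unknown component, then a group is distributed according to the mixture of product measures

    \sum_{i=1}^{m} w_i\, \mu_i^{\otimes n} \quad \text{on } \mathcal{X}^{n},

and this distribution determines the pairs (w_i, \mu_i) uniquely up to relabeling whenever n >= 2m - 1, while for every m there exists a mixture of m components for which uniqueness fails when n = 2m - 2.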
Linear-time Learning on Distributions with Approximate Kernel Embeddings
Many interesting machine learning problems are best posed by considering
instances that are distributions, or sample sets drawn from distributions.
Previous work devoted to machine learning tasks with distributional inputs has
done so through pairwise kernel evaluations between pdfs (or sample sets).
While such an approach is fine for smaller datasets, the computation of the full
Gram matrix is prohibitive for large datasets. Recent scalable
estimators that work over pdfs have done so only with kernels that use
Euclidean metrics, such as the L2 distance. However, there are a myriad of
other useful metrics available, such as total variation, Hellinger distance,
and the Jensen-Shannon divergence. This work develops the first random features
for pdfs whose dot product approximates kernels using these non-Euclidean
metrics, allowing estimators using such kernels to scale to large datasets by
working in a primal space, without computing large Gram matrices. We provide an
analysis of the approximation error in using our proposed random features and
show empirically the quality of our approximation both in estimating a Gram
matrix and in solving learning tasks on real-world and synthetic data.
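The "work in the primal" idea can be illustrated with a generic random-features construction: map each sample set to an explicit feature vector z(P) so that dot products z(P)·z(Q) approximate a kernel k(P, Q), then fit a linear model instead of forming a Gram matrix. The sketch below uses plain random Fourier features for a Gaussian kernel on crude histogram density estimates; the paper's construction for the non-Euclidean metrics above (total variation, Hellinger, Jensen-Shannon) is different, so treat this purely as an illustration of the computational pattern.

```python
import numpy as np

def histogram_density(samples, bins=20, lo=-3.0, hi=3.0):
    """Crude 1-D density estimate of a sample set, as a normalized histogram."""
    h, _ = np.histogram(samples, bins=bins, range=(lo, hi), density=True)
    return h

def random_fourier_features(x, W, b):
    """Features z(x) with z(x).z(y) approximating a Gaussian kernel exp(-||x-y||^2 / 2)."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(W @ x + b)

rng = np.random.default_rng(0)
D, bins = 256, 20
W = rng.normal(size=(D, bins))             # random frequencies ~ N(0, I)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)  # random phases

# Two sample sets drawn from two distributions, embedded in the primal space.
p = histogram_density(rng.normal(0.0, 1.0, size=500))
q = histogram_density(rng.normal(0.5, 1.0, size=500))
zp = random_fourier_features(p, W, b)
zq = random_fourier_features(q, W, b)
approx_k = zp @ zq                         # approximates k(p, q); no Gram matrix needed
```

With N distributions, a linear model on the stacked features (an N x D matrix) replaces the N x N Gram matrix, which is what makes training scale linearly in N.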
3D Object Recognition with Ensemble Learning --- A Study of Point Cloud-Based Deep Learning Models
In this study, we present an analysis of model-based ensemble learning for 3D
point-cloud object classification and detection. An ensemble of multiple model
instances is known to outperform a single model instance, but there is little
study of the topic of ensemble learning for 3D point clouds. First, an ensemble
of multiple model instances trained on the same part of the dataset was tested
for seven deep-learning, point-cloud-based classification algorithms. Second,
an ensemble of different architectures was tested. Results of our experiments
show that the tested ensemble learning methods improve over the state of the
art on the dataset, both for ensembles of instances of a single architecture
and for ensembles of two and of five different architectures. We show that an
ensemble of two models with different architectures can be as effective as an
ensemble of 10 models with the same architecture. Third, classic bagging
(i.e., with different data subsets used for training multiple model instances)
was tested, and the sources of ensemble accuracy gains were investigated for
the best-performing architecture. We also investigate ensemble learning in the
task of 3D object detection, increasing the average precision of 3D box
detection on the dataset using only three model instances. We measure the
inference time of all 3D classification architectures on a common embedded
computer for mobile robots to gauge the suitability of these models for
real-life applications.
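The single-architecture and mixed-architecture ensembles discussed above both reduce to averaging the per-class probabilities of several trained model instances. A minimal sketch, assuming NumPy and a list of already-trained classifiers that expose a predict_proba-style method (names are illustrative, not the paper's code):

```python
import numpy as np

def ensemble_predict(models, point_clouds):
    """Average per-class probabilities over model instances and take the argmax."""
    probs = np.stack([m.predict_proba(point_clouds) for m in models], axis=0)
    mean_probs = probs.mean(axis=0)       # shape (n_clouds, n_classes)
    return mean_probs.argmax(axis=1)      # predicted class per point cloud
```

The same function covers both cases studied above: the list may hold several instances of one architecture or single instances of several different architectures.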
FuSSO: Functional Shrinkage and Selection Operator
We present the FuSSO, a functional analogue to the LASSO, that efficiently
finds a sparse set of functional input covariates to regress a real-valued
response against. The FuSSO does so in a semi-parametric fashion, making no
parametric assumptions about the nature of input functional covariates and
assuming a linear form for the mapping of functional covariates to the response.
We provide statistical backing for the use of the FuSSO via a proof of asymptotic
sparsistency under various conditions. Furthermore, we observe good results on
both synthetic and real-world data.
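In rough form (a sketch of the estimator in generic notation, not a quotation of the paper), given responses Y_i and functional covariates f_{i1}, ..., f_{ip}, the FuSSO fits coefficient functions g_1, ..., g_p via a group-lasso-style problem

    \min_{g_1,\dots,g_p}\ \frac{1}{2n}\sum_{i=1}^{n}\Big(Y_i - \sum_{j=1}^{p}\langle f_{ij},\, g_j\rangle\Big)^{2} \;+\; \lambda \sum_{j=1}^{p}\lVert g_j\rVert,

with functions represented by their projection coefficients onto an orthonormal basis. The norm penalty zeroes out entire coefficient functions, so whole functional covariates are selected in or out of the model, in analogy to how the LASSO zeroes out scalar coefficients.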
Causal effects based on distributional distances
We develop a novel framework for estimating causal effects based on the
discrepancy between unobserved counterfactual distributions. In our setting a
causal effect is defined in terms of the distance between different
counterfactual outcome distributions, rather than a mean difference in outcome
values. Directly comparing counterfactual outcome distributions can provide
more nuanced and valuable information about causality than a simple comparison
of means. We consider single- and multi-source randomized studies, as well as
observational studies, and analyze error bounds and asymptotic properties of
the proposed estimators. We further propose methods to construct confidence
intervals for the unknown mean distribution distance. Finally, we illustrate
the new methods and verify their effectiveness in empirical studies. Comment: 31 pages.
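Concretely, writing P_{Y^1} and P_{Y^0} for the counterfactual outcome distributions under treatment and control, the estimand in this framework is a discrepancy

    \delta \;=\; D\big(P_{Y^{1}},\, P_{Y^{0}}\big),

for some choice of distributional distance D, rather than the usual mean difference E[Y^1] - E[Y^0]; the particular D shown here is a placeholder for whichever distance is adopted.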
Causal inference using deep neural networks
Causal inference from observational data is a core problem in many scientific
fields. Here we present a general supervised deep learning framework that
infers causal interactions by transforming the input vectors to an image-like
representation for every pair of inputs. Given a training dataset we first
construct a normalized empirical probability density function (NEPDF)
matrix. We then train a convolutional neural network (CNN) on NEPDFs for
causality predictions. We tested the method on several simulated and real-world
datasets and compared it with prior methods for causal inference. As we show,
the method is general, can efficiently handle very large datasets, and improves
upon prior methods.
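A sketch of the pipeline as described in the abstract, assuming NumPy/PyTorch, a 32 x 32 joint histogram as the image-like pairwise representation, and a toy CNN; this follows the description above rather than the authors' released code.

```python
import numpy as np
import torch
import torch.nn as nn

def nepdf(x, y, bins=32):
    """Normalized empirical joint density of a variable pair, as a 2-D 'image'."""
    h, _, _ = np.histogram2d(x, y, bins=bins)
    h = h / h.sum()                                    # normalize to a probability matrix
    return torch.tensor(h, dtype=torch.float32)[None, None]   # shape (1, 1, bins, bins)

cnn = nn.Sequential(                                   # tiny CNN scoring the pair
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(16, 3))                    # e.g. classes {x->y, y->x, no edge}

x = np.random.randn(1000)
y = 2.0 * x + 0.5 * np.random.randn(1000)              # toy pair with a causal link
logits = cnn(nepdf(x, y))                              # (1, 3) causal-direction scores
```

Training then amounts to supervised learning on NEPDF images labeled with known causal relations, after which the CNN is applied to unseen pairs.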
Kernel Mean Embedding of Distributions: A Review and Beyond
A Hilbert space embedding of a distribution---in short, a kernel mean
embedding---has recently emerged as a powerful tool for machine learning and
inference. The basic idea behind this framework is to map distributions into a
reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel
methods can be extended to probability measures. It can be viewed as a
generalization of the original "feature map" common to support vector machines
(SVMs) and other kernel methods. While initially closely associated with the
latter, it has meanwhile found application in fields ranging from kernel
machines and probabilistic modeling to statistical inference, causal discovery,
and deep learning. The goal of this survey is to give a comprehensive review of
existing work and recent advances in this research area, and to discuss the
most challenging issues and open problems that could lead to new research
directions. The survey begins with a brief introduction to the RKHS and
positive definite kernels which forms the backbone of this survey, followed by
a thorough discussion of the Hilbert space embedding of marginal distributions,
theoretical guarantees, and a review of its applications. The embedding of
distributions enables us to apply RKHS methods to probability measures which
prompts a wide range of applications such as kernel two-sample testing,
independence testing, and learning on distributional data. Next, we discuss the
Hilbert space embedding for conditional distributions, give theoretical
insights, and review some applications. The conditional mean embedding enables
us to perform sum, product, and Bayes' rules---which are ubiquitous in
graphical models, probabilistic inference, and reinforcement learning---in a
non-parametric way. We then discuss relationships between this framework and
other related areas. Lastly, we give some suggestions on future research
directions. Comment: 147 pages; this is a version of the manuscript after the review
process.
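For reference, the central object of the survey is, for a kernel k with RKHS H, the mean embedding of a distribution P and its empirical counterpart,

    \mu_P = \mathbb{E}_{X \sim P}\,[k(\cdot, X)] \in \mathcal{H}, \qquad \hat{\mu}_P = \frac{1}{n}\sum_{i=1}^{n} k(\cdot, x_i),

together with the induced distance MMD(P, Q) = \lVert \mu_P - \mu_Q \rVert_{\mathcal{H}}, which underlies the two-sample and independence tests mentioned above; for characteristic kernels, \mu_P determines P uniquely.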