Coarse Grained Exponential Variational Autoencoders
Variational autoencoders (VAEs) often use Gaussian or categorical distributions to model the inference process. This limits variational learning because such simplified assumptions rarely match the true posterior distribution, which is usually far more complex. To remove this limitation and allow arbitrary parametric distributions during inference, this paper derives a
\emph{semi-continuous} latent representation, which approximates a continuous
density up to a prescribed precision, and is much easier to analyze than its
continuous counterpart because it is fundamentally discrete. We demonstrate the approach by applying polynomial exponential family distributions, which are universal probability density function generators, as the posterior. Our experimental results show consistent improvements over commonly used VAE models.
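As a rough sketch of the posterior family named above (the degree K, grid points z_i, and spacing \Delta below are illustrative assumptions, not the paper's exact construction), a polynomial exponential density and its semi-continuous discretization take the form

    p_\theta(z) \propto \exp\Big( \sum_{k=1}^{K} \theta_k z^k \Big),
    \qquad
    q_\theta(z_i) = \frac{ p_\theta(z_i) }{ \sum_j p_\theta(z_j) },
    \qquad z_i = z_0 + i \Delta,

so that q_\theta is fundamentally discrete yet tracks the continuous density to a precision controlled by \Delta.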
An Optimization Framework for Semi-Supervised and Transfer Learning using Multiple Classifiers and Clusterers
Unsupervised models can provide supplementary soft constraints to help
classify new, "target" data since similar instances in the target set are more
likely to share the same class label. Such models can also help detect possible
differences between training and target distributions, which is useful in
applications where concept drift may take place, as in transfer learning
settings. This paper describes a general optimization framework that takes as
input class membership estimates from existing classifiers learnt on previously
encountered "source" data, as well as a similarity matrix from a cluster
ensemble operating solely on the target data to be classified, and yields a
consensus labeling of the target data. This framework admits a wide range of
loss functions and classification/clustering methods. It exploits properties of
Bregman divergences in conjunction with Legendre duality to yield a principled
and scalable approach. A variety of experiments show that the proposed
framework can yield results substantially superior to those provided by popular
transductive learning techniques or by naively applying classifiers learnt on
the original task to the target data.
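As a minimal sketch of this idea (a toy stand-in, not the paper's Bregman-divergence framework; the function name, blending rule, and parameters are assumptions), classifier soft labels can be reconciled with a cluster-ensemble similarity matrix by iterative neighborhood averaging:

    import numpy as np

    def consensus_labels(P, S, lam=1.0, n_iter=100):
        """P: (n, k) soft class estimates from source classifiers on target data;
        S: (n, n) nonnegative similarity matrix from a cluster ensemble."""
        W = S / S.sum(axis=1, keepdims=True)     # row-stochastic similarities
        Y = P.copy()
        for _ in range(n_iter):
            Y = (P + lam * W @ Y) / (1.0 + lam)  # blend source labels with neighbors
            Y /= Y.sum(axis=1, keepdims=True)    # keep each row a distribution
        return Y

The fixed point balances fidelity to the source classifiers against the soft constraint that instances similar under S should share labels, which is the intuition the optimization framework makes rigorous.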
Geometric Losses for Distributional Learning
Building upon recent advances in entropy-regularized optimal transport, and
upon Fenchel duality between measures and continuous functions, we propose a
generalization of the logistic loss that incorporates a metric or cost between
classes. Unlike previous attempts to use optimal transport distances for
learning, our loss results in unconstrained convex objective functions,
supports infinite (or very large) class spaces, and naturally defines a
geometric generalization of the softmax operator. The geometric properties of
this loss make it suitable for predicting sparse and singular distributions,
for instance supported on curves or hyper-surfaces. We study the theoretical
properties of our loss and showcase its effectiveness on two applications: ordinal regression and drawing generation.
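For reference, the entropy-regularized optimal transport cost that this work builds upon is standardly written as (the generic building block only; the paper's final loss involves further duality constructions)

    OT_\epsilon(\alpha, \beta) = \min_{\pi \in \Pi(\alpha, \beta)}
        \int C(x, y) \, d\pi(x, y) + \epsilon \, \mathrm{KL}(\pi \,\|\, \alpha \otimes \beta),

and, roughly, taking Fenchel conjugates of such a cost in one argument is what produces a softmax-like operator that is aware of the ground cost C between classes.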
Constructing the Matrix Multilayer Perceptron and its Application to the VAE
Like most learning algorithms, the multilayer perceptron (MLP) is designed
to learn a vector of parameters from data. However, in certain scenarios we are
interested in learning structured parameters (predictions) in the form of
symmetric positive definite matrices. Here, we introduce a variant of the MLP, referred to as the matrix MLP, that specializes in learning symmetric
positive definite matrices. We also present an application of the model within
the context of the variational autoencoder (VAE). Our formulation of the VAE
extends the vanilla formulation to cases where the recognition and generative networks belong to parametric families of distributions with dense covariance matrices. Two specific examples are discussed in more detail:
the dense covariance Gaussian and its generalization, the power exponential
distribution. Our new developments are illustrated using both synthetic and real data.
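As one common way for a network head to emit symmetric positive definite outputs (a sketch under assumptions; the matrix MLP of the paper may use a different construction), an unconstrained vector can be mapped through a Cholesky factor:

    import numpy as np

    def spd_from_vector(v, d):
        """Map an unconstrained vector of length d*(d+1)//2 to an SPD d x d matrix."""
        L = np.zeros((d, d))
        L[np.tril_indices(d)] = v                   # fill the lower triangle
        L[np.diag_indices(d)] = np.exp(np.diag(L))  # force a positive diagonal
        return L @ L.T                              # SPD by construction

Such a head could, for instance, parameterize the dense covariance of a Gaussian recognition network.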
Relaxed Wasserstein with Applications to GANs
Wasserstein Generative Adversarial Networks (WGANs) provide a versatile class
of models, which have attracted great attention in various applications.
However, this framework has two main drawbacks: (i) the Wasserstein-1 (or Earth-Mover) distance is restrictive, so WGANs cannot always fit the data geometry well; (ii) WGANs are difficult to train quickly. In this
paper, we propose a new class of \textit{Relaxed Wasserstein} (RW) distances by
generalizing the Wasserstein-1 distance with Bregman cost functions. We show that RW distances enjoy favorable statistical properties without sacrificing computational tractability. Combined with the GAN framework, we develop Relaxed WGANs (RWGANs), which are not only statistically flexible but can also be approximated efficiently using heuristic approaches. Experiments on real images demonstrate that the RWGAN with the Kullback-Leibler (KL) cost function outperforms competing approaches such as WGANs, even with gradient penalty.
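Schematically, the definitions being generalized read (with \phi a strictly convex generator; regularity conditions are in the paper)

    D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y), \, x - y \rangle,
    \qquad
    \mathrm{RW}_\phi(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)}
        \mathbb{E}_{(x, y) \sim \pi} \big[ D_\phi(x, y) \big],

and choosing \phi as the negative entropy makes the ground cost a generalized KL divergence, the variant reported above to perform best.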
On The Chain Rule Optimal Transport Distance
We define a novel class of distances between statistical multivariate
distributions by solving an optimal transportation problem on their marginal
densities with respect to a ground distance defined on their conditional
densities. By using the chain rule factorization of probabilities, we show how
to perform optimal transport on a ground space being an information-geometric
manifold of conditional probabilities. We prove that this new distance is a
metric whenever the chosen ground distance is a metric. Our distance
generalizes both the Wasserstein distances between point sets and a recently
introduced metric distance between statistical mixtures. As a first application
of this Chain Rule Optimal Transport (CROT) distance, we show that the ground
distance between statistical mixtures is upper bounded by this optimal
transport distance and its fast relaxed Sinkhorn distance whenever the ground distance is jointly convex. We report experiments that quantify the tightness of the CROT distance for the total variation distance, the square-root generalization of the Jensen-Shannon divergence, the Wasserstein metric, and the R\'enyi divergence between mixtures.
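Schematically (the notation is an illustrative simplification of the paper's setting), the chain rule p(x, y) = p(x) \, p(y \mid x) lets a ground distance d between conditional densities be lifted to joints by transporting the marginals:

    \mathrm{CROT}(p, q) = \min_{\pi \in \Pi(p_X, q_X)}
        \int d\big( p(\cdot \mid x), \, q(\cdot \mid x') \big) \, d\pi(x, x'),

which is exactly optimal transport on the marginal densities with a conditional-density ground cost, as described above.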
Copula Variational Bayes inference via information geometry
Variational Bayes (VB), also known as independent mean-field approximation,
has become a popular method for Bayesian network inference in recent years. Its
applications are vast, e.g., neural networks, compressed sensing, and clustering, to name just a few. In this paper, the independence constraint in VB is relaxed to a conditional constraint class, known in statistics as a copula. Since a joint probability distribution always belongs to some copula class, the novel copula VB (CVB) approximation is a generalized form of VB. Via information geometry, we show that the CVB algorithm iteratively projects the original joint distribution onto a copula constraint space until it reaches a local minimum of the Kullback-Leibler (KL) divergence. In this way, all mean-field approximations, e.g., iterative VB, Expectation-Maximization (EM), Iterated Conditional Mode (ICM), and k-means algorithms, are special cases of the CVB approximation.
For a generic Bayesian network, we also design an augmented hierarchical form of CVB. While mean-field algorithms can only return a locally optimal approximation for a correlated network, the augmented CVB network, an optimally weighted average of a mixture of simpler network structures, can potentially achieve the globally optimal approximation for the first time. In simulations of Gaussian mixture clustering, the classification accuracy of CVB is shown to be far superior to that of state-of-the-art VB, EM, and k-means algorithms.
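The standard background fact behind the relaxation is Sklar's theorem: any joint density factors through a copula density c applied to the marginal CDFs F_i,

    p(x_1, \dots, x_n) = c\big( F_1(x_1), \dots, F_n(x_n) \big) \prod_{i=1}^{n} p_i(x_i),

and the mean-field choice q = \prod_i q_i corresponds to the trivial copula c \equiv 1, which is why CVB strictly generalizes VB.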
Learning the Information Divergence
Information divergence that measures the difference between two nonnegative
matrices or tensors has found use in a variety of machine learning
problems. Examples are Nonnegative Matrix/Tensor Factorization, Stochastic
Neighbor Embedding, topic models, and Bayesian network optimization. The
success of such a learning task depends heavily on a suitable divergence. A
large variety of divergences have been suggested and analyzed, but very few
results are available for an objective choice of the optimal divergence for a
given task. Here we present a framework that facilitates automatic selection of
the best divergence among a given family, based on standard maximum likelihood
estimation. We first propose an approximate Tweedie distribution for the
beta-divergence family. Selecting the best beta then becomes a machine learning
problem solved by maximum likelihood. Next, we reformulate alpha-divergence in
terms of beta-divergence, which enables automatic selection of alpha by maximum
likelihood with reuse of the learning principle for beta-divergence.
Furthermore, we show the connections between gamma and beta-divergences as well
as R\'enyi and alpha-divergences, such that our automatic selection framework
is extended to non-separable divergences. Experiments on both synthetic and real-world data demonstrate that our method accurately selects the information divergence across different learning problems and various divergence families.
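For reference, the beta-divergence family over which the selection operates is commonly written, for scalars x, y > 0, as

    d_\beta(x \| y) = \frac{ x^\beta + (\beta - 1) \, y^\beta - \beta \, x y^{\beta - 1} }{ \beta (\beta - 1) },

with the limits \beta \to 1 and \beta \to 0 recovering the generalized Kullback-Leibler and Itakura-Saito divergences, and \beta = 2 giving the squared Euclidean distance; selecting \beta by maximum likelihood is what the approximate Tweedie construction enables.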
Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions
Several recent works have explored stochastic gradient methods for
variational inference that exploit the geometry of the variational-parameter
space. However, the theoretical properties of these methods are not
well-understood and these methods typically only apply to
conditionally-conjugate models. We present a new stochastic method for
variational inference which exploits the geometry of the variational-parameter
space and also yields simple closed-form updates even for non-conjugate models.
We also give a convergence-rate analysis of our method and many other previous
methods which exploit the geometry of the space. Our analysis generalizes
existing convergence results for stochastic mirror-descent on non-convex
objectives by using a more general class of divergence functions. Beyond giving
a theoretical justification for a variety of recent methods, our experiments
show that new algorithms derived in this framework lead to state-of-the-art results on a variety of problems. Further, due to its generality, we expect
that our theoretical analysis could also apply to other applications.
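The generic shape of the updates analyzed is a mirror-descent or proximal-gradient step with a general divergence D (a sketch; the paper splits the variational objective more carefully):

    \lambda_{t+1} = \arg\min_{\lambda} \;
        \langle \nabla_\lambda \widehat{\mathcal{L}}(\lambda_t), \, \lambda \rangle
        + \frac{1}{\beta_t} D(\lambda, \lambda_t),

where \beta_t is a step size; with D the squared Euclidean distance this is plain stochastic gradient descent, while a KL or Bregman divergence matched to the variational family yields the closed-form non-conjugate updates referred to above.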
Unbiased Estimation Equation under f-Separable Bregman Distortion Measures
We discuss unbiased estimation equations in a class of objective functions using a monotonically increasing function f and a Bregman divergence. The choice of the function f gives desirable properties such as robustness against outliers. In order to obtain unbiased estimation equations, analytically intractable integrals are generally required as bias-correction terms. In this study, we clarify the combinations of Bregman divergence, statistical model, and function f for which the bias-correction term vanishes. Focusing on the Mahalanobis and Itakura-Saito distances, we provide a generalization of fundamental existing results and characterize a class of distributions of positive reals with a scale parameter, which includes the gamma distribution as a special case. We discuss the possibility of latent bias minimization when the proportion of outliers is large, which is induced by the vanishing of the bias-correction term.
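Under the reading that the stripped symbol is f (an assumption based on the phrase "f-separable distortion measures"), the objective class has the schematic form

    L_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} f\big( D_\phi(x_i, \mu_\theta) \big),

where D_\phi is a Bregman divergence and f is monotonically increasing: a slowly growing f downweights large distortions, giving robustness to outliers, while unbiasedness of the resulting estimation equation generally demands an intractable correction term except in the combinations characterized above.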