Learning multi-modal generative models with permutation-invariant encoders and tighter variational bounds
Devising deep latent variable models for multi-modal data has been a
long-standing theme in machine learning research. Multi-modal Variational
Autoencoders (VAEs) have been a popular generative model class that learns
latent representations which jointly explain multiple modalities. Various
objective functions for such models have been suggested, often motivated as
lower bounds on the multi-modal data log-likelihood or from
information-theoretic considerations. In order to encode latent variables from
different modality subsets, Product-of-Experts (PoE) or Mixture-of-Experts
(MoE) aggregation schemes have been routinely used and shown to yield different
trade-offs, for instance, regarding their generative quality or consistency
across multiple modalities. In this work, we consider a variational bound that
can tightly lower bound the data log-likelihood. We develop more flexible
aggregation schemes that generalise PoE or MoE approaches by combining encoded
features from different modalities based on permutation-invariant neural
networks. Our numerical experiments illustrate trade-offs for multi-modal
variational bounds and various aggregation schemes. We show that tighter
variational bounds and more flexible aggregation models can be beneficial
when one aims to approximate the true joint distribution over observed
modalities and latent variables in identifiable models.
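
As a concrete illustration of the aggregation schemes discussed above, the following sketch contrasts the closed-form Product-of-Experts combination of diagonal-Gaussian encodings with a DeepSets-style permutation-invariant aggregator. This is a minimal sketch, not the paper's implementation; all names, shapes, and architectural choices are assumptions.

```python
# Two ways to aggregate per-modality Gaussian encodings
# q_m(z | x_m) = N(mu_m, diag(var_m)); shapes are assumptions.
import torch
import torch.nn as nn

def poe_gaussian(mus, logvars):
    """Product of diagonal Gaussians for M observed modalities.

    mus, logvars: tensors of shape (M, B, D). The product is again
    Gaussian, with precision equal to the sum of the experts' precisions.
    """
    precisions = torch.exp(-logvars)             # 1 / sigma_m^2
    var = 1.0 / precisions.sum(dim=0)            # combined variance
    mu = var * (mus * precisions).sum(dim=0)     # precision-weighted mean
    return mu, torch.log(var)

class PermutationInvariantAggregator(nn.Module):
    """Sum-pooling (DeepSets-style) aggregation of per-modality features."""
    def __init__(self, feat_dim, latent_dim):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.rho = nn.Linear(feat_dim, 2 * latent_dim)  # -> (mu, logvar)

    def forward(self, features):                 # features: (M, B, F)
        pooled = self.phi(features).sum(dim=0)   # invariant sum pool
        mu, logvar = self.rho(pooled).chunk(2, dim=-1)
        return mu, logvar
```

Sum pooling is one natural choice here: it is invariant to both the order and the number of observed modalities, so the same aggregator can encode any modality subset.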
Multi-Source Neural Variational Inference
Learning from multiple sources of information is an important problem in
machine-learning research. The key challenges are learning representations and
formulating inference methods that take into account the complementarity and
redundancy of various information sources. In this paper we formulate a
variational autoencoder based multi-source learning framework in which each
encoder is conditioned on a different information source. This allows us to
relate the sources via the shared latent variables by computing divergence
measures between individual sources' posterior approximations. We explore a
variety of options to learn these encoders and to integrate the beliefs they
compute into a consistent posterior approximation. We visualise learned beliefs
on a toy dataset and evaluate our methods for learning shared representations
and structured output prediction, showing trade-offs of learning separate
encoders for each information source. Furthermore, we demonstrate how conflict
detection and redundancy can increase robustness of inference in a multi-source
setting.
Comment: AAAI 2019 (Association for the Advancement of Artificial Intelligence)
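
To make the idea of relating sources through divergences between posterior approximations concrete, here is a hedged sketch of the closed-form KL divergence between two diagonal-Gaussian beliefs q_s(z | x_s); large values can flag conflicting sources. This is illustrative, not the paper's code.

```python
# Closed-form KL divergence between two diagonal-Gaussian posteriors.
import torch

def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ), summed over latent dims."""
    var_p, var_q = logvar_p.exp(), logvar_q.exp()
    kl = 0.5 * (logvar_q - logvar_p
                + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)
    return kl.sum(dim=-1)
```

Comparing such divergences across all pairs of per-source posteriors gives a simple, differentiable signal for conflict detection between information sources.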
Multimodal Variational Autoencoders for Semi-Supervised Learning: In Defense of Product-of-Experts
Multimodal generative models should be able to learn a meaningful latent
representation that enables a coherent joint generation of all modalities
(e.g., images and text). Many applications also require the ability to
accurately sample modalities conditioned on observations of a subset of the
modalities. Often, not all modalities are observed for every training data
point, so semi-supervised learning should be possible. In this study, we
evaluate a family of product-of-experts (PoE) based variational autoencoders
that have these desired properties. We include a novel PoE based architecture
and training procedure. An empirical evaluation shows that the PoE based models
can outperform an additive mixture-of-experts (MoE) approach. Our experiments
support the intuition that PoE models are more suited for a conjunctive
combination of modalities, while MoEs are more suited for a disjunctive fusion.
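
A brief sketch of why PoE-based models handle missing modalities, and hence semi-supervised training, gracefully: unobserved experts are simply dropped from the product, while a standard-normal prior expert keeps the posterior well defined. The function below assumes diagonal-Gaussian experts and is not the authors' implementation.

```python
# PoE over an arbitrary observed subset of modalities, with a N(0, I)
# prior expert (precision 1, mean 0) so the product is always defined.
import torch

def poe_subset(mus, logvars, observed_mask):
    """mus, logvars: (M, B, D); observed_mask: (M,) bool per modality."""
    prec = torch.exp(-logvars) * observed_mask.float().view(-1, 1, 1)
    var = 1.0 / (1.0 + prec.sum(dim=0))      # +1 from the prior expert
    mu = var * (mus * prec).sum(dim=0)       # prior mean is zero
    return mu, var.log()
```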
Generalized Product-of-Experts for Learning Multimodal Representations in Noisy Environments
Real-world applications and settings typically involve interactions between
different modalities (e.g., video, speech, text). In order to process the multimodal
information automatically and use it for an end application, Multimodal
Representation Learning (MRL) has emerged as an active area of research in
recent times. MRL involves learning reliable and robust representations of
information from heterogeneous sources and fusing them. However, in practice,
the data acquired from different sources are typically noisy. In extreme
cases, noise of large magnitude can completely alter the semantics of the
data, leading to inconsistencies in the parallel multimodal data. In this paper,
we propose a novel method for multimodal representation learning in a noisy
environment via the generalized product of experts technique. In the proposed
method, we train a separate network for each modality to assess the credibility
of information coming from that modality, and subsequently, the contribution
from each modality is dynamically varied while estimating the joint
distribution. We evaluate our method on two challenging benchmarks from two
diverse domains: multimodal 3D hand-pose estimation and multimodal surgical
video segmentation. We attain state-of-the-art performance on both benchmarks.
Our extensive quantitative and qualitative evaluations show the advantages of
our method compared to previous approaches.
Comment: 11 pages, accepted at ICMI 2022 (Oral)
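
The precision-weighting at the heart of a generalized product of experts can be sketched as follows: each expert's precision is scaled by a learned, input-dependent credibility weight beta_m in (0, 1), down-weighting noisy modalities when estimating the joint distribution. The credibility network here is illustrative, not the paper's architecture.

```python
# Generalized PoE: p(z|x) proportional to prod_m q_m(z|x_m)^{beta_m}.
# For Gaussians, raising to the power beta_m scales the precision.
import torch
import torch.nn as nn

class CredibilityNet(nn.Module):
    """Maps a modality's features to a scalar credibility weight in (0, 1)."""
    def __init__(self, feat_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, features):             # (B, F) -> (B, 1)
        return self.net(features)

def generalized_poe(mus, logvars, betas):
    """mus, logvars: (M, B, D); betas: (M, B, 1) credibility weights."""
    prec = betas * torch.exp(-logvars)       # credibility-scaled precisions
    var = 1.0 / prec.sum(dim=0)
    mu = var * (mus * prec).sum(dim=0)
    return mu, var.log()
```

With all weights fixed to one this reduces to the standard PoE, so the credibility networks only need to learn deviations from equal trust.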
Resampled Priors for Variational Autoencoders
We propose Learned Accept/Reject Sampling (LARS), a method for constructing
richer priors using rejection sampling with a learned acceptance function. This
work is motivated by recent analyses of the VAE objective, which pointed out
that commonly used simple priors can lead to underfitting. As the distribution
induced by LARS involves an intractable normalizing constant, we show how to
estimate it and its gradients efficiently. We demonstrate that LARS priors
improve VAE performance on several standard datasets both when they are learned
jointly with the rest of the model and when they are fitted to a pretrained
model. Finally, we show that LARS can be combined with existing methods for
defining flexible priors for an additional boost in performance.
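
A hedged sketch of the LARS construction: samples from a simple proposal pi(z) = N(0, I) are accepted with learned probability a(z), inducing the density p(z) = pi(z) a(z) / Z with Z = E_pi[a(z)]. The acceptance network, the Monte Carlo estimate of Z, and the sampling loop below are assumptions for illustration; the paper's handling of truncated rejection loops is only crudely approximated here.

```python
# Learned Accept/Reject Sampling prior: a simple proposal reshaped by a
# learned acceptance function, with the normalizer estimated by Monte Carlo.
import torch
import torch.nn as nn

class LARSPrior(nn.Module):
    def __init__(self, latent_dim, hidden=64):
        super().__init__()
        self.accept = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def log_prob(self, z, n_z=1024):
        """log p(z) = log pi(z) + log a(z) - log Z, Z estimated by MC."""
        pi = torch.distributions.Normal(0.0, 1.0)
        log_pi = pi.log_prob(z).sum(dim=-1)
        log_a = torch.log(self.accept(z).squeeze(-1) + 1e-8)
        z_mc = torch.randn(n_z, z.shape[-1])             # proposal samples
        log_Z = torch.log(self.accept(z_mc).mean() + 1e-8)
        return log_pi + log_a - log_Z

    @torch.no_grad()
    def sample(self, n, latent_dim, max_tries=100):
        """Rejection-sample until n latents are accepted (or tries run out)."""
        out = []
        for _ in range(max_tries):
            z = torch.randn(n, latent_dim)
            keep = torch.rand(n, 1) < self.accept(z)
            out.append(z[keep.squeeze(-1)])
            if sum(o.shape[0] for o in out) >= n:
                break
        return torch.cat(out)[:n]
```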