Unsupervised Learning of Disentangled Representations from Video
We present a new model, DrNET, that learns disentangled image representations
from video. Our approach leverages the temporal coherence of video and a novel
adversarial loss to learn a representation that factorizes each frame into a
stationary part and a temporally varying component. The disentangled
representation can be used for a range of tasks. For example, applying a
standard LSTM to the time-varying components enables prediction of future frames.
We evaluate our approach on a range of synthetic and real videos, demonstrating
the ability to coherently generate hundreds of steps into the future.
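As an illustration, here is a minimal sketch (assuming PyTorch; module names and sizes are hypothetical, not DrNET's actual architecture) of how a factorized representation supports future-frame prediction: the stationary content code from a reference frame is held fixed while an LSTM rolls the time-varying pose codes forward.

```python
import torch
import torch.nn as nn

class FactorizedPredictor(nn.Module):
    """Toy DrNET-style predictor: fixed content code, LSTM over pose codes."""
    def __init__(self, content_dim=128, pose_dim=16, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(pose_dim + content_dim, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)

    def forward(self, content, poses, n_future):
        # content: (B, content_dim), from one reference frame
        # poses:   (B, T, pose_dim), from the observed frames
        ctx = content.unsqueeze(1).expand(-1, poses.size(1), -1)
        _, state = self.lstm(torch.cat([poses, ctx], dim=-1))  # warm up on history
        pose, preds = poses[:, -1], []
        for _ in range(n_future):
            inp = torch.cat([pose, content], dim=-1).unsqueeze(1)
            out, state = self.lstm(inp, state)
            pose = self.to_pose(out[:, -1])     # next time-varying code
            preds.append(pose)                  # a decoder would render (content, pose)
        return torch.stack(preds, dim=1)        # (B, n_future, pose_dim)
```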
Stable Distribution Alignment Using the Dual of the Adversarial Distance
Methods that align distributions by minimizing an adversarial distance
between them have recently achieved impressive results. However, these
approaches are difficult to optimize with gradient descent and they often do
not converge well without careful hyperparameter tuning and proper
initialization. We investigate whether turning the adversarial min-max problem
into a straightforward minimization, by replacing the maximization part with its
dual, improves the quality of the resulting alignment, and we explore its connections to
Maximum Mean Discrepancy. Our empirical results suggest that using the dual
formulation for the restricted family of linear discriminators results in a
more stable convergence to a desirable solution when compared with the
performance of a primal min-max GAN-like objective and an MMD objective under
the same restrictions. We test our hypothesis on the problem of aligning two
synthetic point clouds on a plane and on a real-image domain adaptation problem
on digits. In both cases, the dual formulation yields an iterative procedure
that gives more stable and monotonic improvement over time.
Comment: ICLR 2018 Conference Invite to Workshop
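The intuition is easy to demonstrate: for the restricted family of linear discriminators, the adversarial distance is closely related to a linear-kernel MMD, i.e., the distance between domain means, which can be minimized directly with no inner maximization. The sketch below (assuming PyTorch; it simplifies the paper's dual to this closed-form special case) aligns two synthetic 2-D point clouds:

```python
import torch

torch.manual_seed(0)
src = torch.randn(500, 2) + torch.tensor([4.0, 1.0])   # shifted source cloud
tgt = torch.randn(500, 2)                              # target cloud

A = torch.nn.Parameter(torch.eye(2))                   # learned affine alignment
b = torch.nn.Parameter(torch.zeros(2))
opt = torch.optim.Adam([A, b], lr=0.05)

for step in range(200):
    aligned = src @ A.T + b
    # Closed-form objective: squared distance between domain means
    # (linear-kernel MMD^2). No discriminator updates, no alternation.
    loss = (aligned.mean(0) - tgt.mean(0)).pow(2).sum()
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final mean discrepancy: {loss.item():.6f}")    # decreases monotonically
```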
Probabilistic Video Generation using Holistic Attribute Control
Videos express highly structured spatio-temporal patterns of visual data. A
video can be thought of as being governed by two factors: (i) temporally
invariant (e.g., person identity), or slowly varying (e.g., activity),
attribute-induced appearance, encoding the persistent content of each frame,
and (ii) an inter-frame motion or scene dynamics (e.g., encoding evolution of
the person executing the action). Based on this intuition, we propose a
generative framework for video generation and future prediction. The proposed
framework generates a video (short clip) by decoding samples sequentially drawn
from a latent space distribution into full video frames. Variational
Autoencoders (VAEs) are used as a means of encoding/decoding frames into/from
the latent space and an RNN as a way to model the dynamics in the latent space. We
improve video generation consistency through temporally-conditional sampling,
and quality by structuring the latent space with attribute controls, ensuring
that attributes can be both inferred and conditioned on during
learning/generation. As a result, given attributes and/or the first frame, our
model is able to generate diverse but highly consistent sets of video sequences,
accounting for the inherent uncertainty in the prediction task. Experimental
results on Chair CAD, Weizmann Human Action, and MIT-Flickr datasets, along
with detailed comparison to the state of the art, verify the effectiveness of the
framework.
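A hedged sketch of the encode/decode-plus-dynamics idea (assuming PyTorch; the linear encoder/decoder and all sizes are placeholders for the paper's actual networks): an LSTM parameterizes the distribution of the next latent, and each sample is decoded into a frame conditioned on the attribute vector.

```python
import torch
import torch.nn as nn

class LatentDynamicsVAE(nn.Module):
    """Toy sketch: per-frame VAE plus an LSTM prior over the latent sequence."""
    def __init__(self, z_dim=32, attr_dim=8, hidden=128, frame_dim=64 * 64):
        super().__init__()
        self.enc = nn.Linear(frame_dim, 2 * z_dim)         # infers (mu, logvar) of z_1
        self.dec = nn.Linear(z_dim + attr_dim, frame_dim)  # attribute-conditioned decoder
        self.dyn = nn.LSTM(z_dim + attr_dim, hidden, batch_first=True)
        self.prior = nn.Linear(hidden, 2 * z_dim)

    def generate(self, first_z, attrs, n_frames):
        # first_z: (B, z_dim), inferred via self.enc from a given first frame
        z, state, frames = first_z, None, []
        for _ in range(n_frames):
            out, state = self.dyn(torch.cat([z, attrs], -1).unsqueeze(1), state)
            mu, logvar = self.prior(out[:, -1]).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sample next latent
            frames.append(torch.sigmoid(self.dec(torch.cat([z, attrs], -1))))
        return torch.stack(frames, dim=1)   # (B, n_frames, frame_dim)
```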
Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS
End-to-end neural TTS training has shown improved performance in speech style
transfer. However, the improvement is still limited by the training data in
both target styles and speakers. Inadequate style transfer performance occurs
when the trained TTS tries to transfer the speech to a target style from a new
speaker with an unknown, arbitrary style. In this paper, we propose a new
approach to style transfer for both seen and unseen styles, with disjoint,
multi-style datasets, i.e., datasets of different styles recorded separately,
with each style spoken by one speaker in multiple utterances. To encode the
style information, we adopt an inverse autoregressive flow (IAF) structure to
improve the variational inference. The whole system is optimized to minimize a
weighted sum of four different loss functions: 1) a reconstruction loss to
measure the distortions in both source and target reconstructions; 2) an
adversarial loss to "fool" a well-trained discriminator; 3) a style distortion
loss to measure the expected style loss after the transfer; 4) a cycle
consistency loss to preserve the speaker identity of the source after the
transfer. Experiments demonstrate, both objectively and subjectively, the
effectiveness of the proposed approach for seen and unseen style transfer
tasks. The new approach performs better and more robustly than four baseline
systems from the prior art.
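The overall objective is straightforward to assemble once the four terms are computed; a minimal sketch (assuming PyTorch; the weights and the exact form of each term are hypothetical, the paper tunes its own):

```python
import torch
import torch.nn.functional as F

def total_loss(l_rec_src, l_rec_tgt, d_fake_logits, l_style, l_cycle,
               w=(1.0, 0.1, 1.0, 1.0)):
    """Weighted sum of the four training losses described above."""
    l_rec = l_rec_src + l_rec_tgt                      # 1) source + target reconstruction
    l_adv = F.binary_cross_entropy_with_logits(        # 2) "fool" the discriminator
        d_fake_logits, torch.ones_like(d_fake_logits))
    return (w[0] * l_rec + w[1] * l_adv                # 3) style distortion
            + w[2] * l_style + w[3] * l_cycle)         # 4) cycle consistency
```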
Product of Orthogonal Spheres Parameterization for Disentangled Representation Learning
Learning representations that disentangle the explanatory attributes
underlying the data improves interpretability and provides control over
data generation. Various learning frameworks such as VAEs, GANs and
auto-encoders have been used in the literature to learn such representations.
Most often, the latent space is constrained to a partitioned representation or
structured by a prior to impose disentanglement. In this work, we advance the use
of a latent representation based on a Product of Orthogonal Spheres (PrOSe). The
PrOSe model is motivated by the reasoning that latent variables related to the
physics of image formation can, under certain relaxed assumptions, lead to
spherical spaces. Orthogonality between the spheres is motivated via
physical independence models. The orthogonal-sphere constraint is much simpler
to impose than other, more complicated physical models; it is fairly general,
flexible, and extensible beyond the factors used to motivate its development. Under
further relaxed assumptions of equal-sized latent blocks per factor, the
constraint can be written down in closed form as an orthonormality term in the
loss function. We show that our approach significantly improves the quality of
disentanglement, with consistent gains over several state-of-the-art approaches
across several benchmarks and metrics.
Comment: Accepted at the British Machine Vision Conference (BMVC) 2019
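Under the equal-sized-blocks assumption, the closed-form term is simple to write down. A sketch (assuming PyTorch; the block count and dimensions are illustrative): split the latent into one block per factor, stack the blocks as rows, and penalize deviation of their Gram matrix from the identity, i.e., unit-norm (spherical) blocks that are mutually orthogonal.

```python
import torch

def orthonormality_penalty(z, n_factors):
    """PrOSe-style regularizer sketch: ||Z Z^T - I||_F^2 per sample, where the
    rows of Z are the equal-sized latent blocks, one block per factor."""
    B, d = z.shape
    blocks = z.view(B, n_factors, d // n_factors)   # (B, k, d/k)
    gram = blocks @ blocks.transpose(1, 2)          # (B, k, k) Gram matrix
    eye = torch.eye(n_factors, device=z.device)
    return ((gram - eye) ** 2).sum(dim=(1, 2)).mean()

z = torch.randn(16, 64)                             # e.g. 4 blocks of 16 dims each
reg = orthonormality_penalty(z, n_factors=4)        # add lambda * reg to the loss
```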
Learning Latent Subspaces in Variational Autoencoders
Variational autoencoders (VAEs) are widely used deep generative models
capable of learning unsupervised latent representations of data. Such
representations are often difficult to interpret or control. We consider the
problem of unsupervised learning of features correlated to specific labels in a
dataset. We propose a VAE-based generative model which we show is capable of
extracting features correlated to binary labels in the data and structuring
them in a latent subspace that is easy to interpret. Our model, the Conditional
Subspace VAE (CSVAE), uses mutual information minimization to learn a
low-dimensional latent subspace associated with each label that can easily be
inspected and independently manipulated. We demonstrate the utility of the
learned representations for attribute manipulation tasks on both the Toronto
Face and CelebA datasets.
Comment: Published as a conference paper at NeurIPS 2018. 15 pages
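A hedged sketch of the mutual-information-minimization idea (assuming PyTorch; this adversarial surrogate is a common way to approximate an MI penalty and simplifies the CSVAE's actual objective): an auxiliary classifier tries to predict the label from the label-free part of the latent, while the encoder is trained to make that prediction maximally uncertain.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z_dim, n_labels = 32, 2
adversary = nn.Linear(z_dim, n_labels)   # tries to recover the label y from z

def mi_min_losses(z, y):
    # Adversary step: learn to predict y from (detached) z.
    l_adv = F.cross_entropy(adversary(z.detach()), y)
    # Encoder step: maximize the adversary's entropy, i.e. scrub label
    # information out of z so it lives only in the label subspace w.
    probs = F.softmax(adversary(z), dim=-1)
    l_enc = (probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
    return l_adv, l_enc

z = torch.randn(8, z_dim, requires_grad=True)
l_adv, l_enc = mi_min_losses(z, torch.randint(0, n_labels, (8,)))
```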
Image Generation from Layout
Despite significant recent progress on generative models, controlled
generation of images depicting multiple and complex object layouts is still a
difficult problem. Among the core challenges are the diversity of appearance a
given object may possess and, as a result, exponential set of images consistent
with a specified layout. To address these challenges, we propose a novel
approach for layout-based image generation; we call it Layout2Im. Given the
coarse spatial layout (bounding boxes + object categories), our model can
generate a set of realistic images which have the correct objects in the
desired locations. The representation of each object is disentangled into a
specified/certain part (category) and an unspecified/uncertain part
(appearance). The category is encoded using a word embedding and the appearance
is distilled into a low-dimensional vector sampled from a normal distribution.
Individual object representations are composed together using a convolutional
LSTM to obtain an encoding of the complete layout, which is then decoded to an
image. Several loss terms are introduced to encourage accurate and diverse
generation. The proposed Layout2Im model significantly outperforms the previous
state of the art, boosting the best reported inception score by 24.66% and
28.57% on the very challenging COCO-Stuff and Visual Genome datasets,
respectively. Extensive experiments also demonstrate our method's ability to
generate complex and diverse images with multiple objects.
Comment: Accepted to CVPR 2019 (Oral)
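A small sketch of the disentangled per-object code described above (assuming PyTorch; the category count and dimensions are hypothetical): the certain part is a category embedding, the uncertain part is drawn from a standard normal, so re-sampling it yields diverse images for the same layout.

```python
import torch
import torch.nn as nn

n_categories, cat_dim, app_dim = 171, 64, 32     # illustrative sizes
embed = nn.Embedding(n_categories, cat_dim)      # "certain" category part

def object_codes(category_ids, n_samples=1):
    """category_ids: (num_objects,) -> (n_samples, num_objects, cat_dim + app_dim)."""
    cat = embed(category_ids).expand(n_samples, -1, -1)           # shared across samples
    app = torch.randn(n_samples, category_ids.size(0), app_dim)   # "uncertain" part
    return torch.cat([cat, app], dim=-1)         # fed to the conv-LSTM composer

codes = object_codes(torch.tensor([3, 17, 42]), n_samples=4)
print(codes.shape)   # torch.Size([4, 3, 96]): 4 diverse codes for one 3-object layout
```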
Unsupervised Domain Alignment to Mitigate Low Level Dataset Biases
Dataset bias is a well-known problem in the field of computer vision. The
presence of implicit bias in any image collection prevents a model trained and
validated on a particular dataset from yielding similar accuracy when tested on
other datasets. In this paper, we propose a novel debiasing technique to reduce
the effects of a biased training dataset. Our goal is to augment the training
data using a generative network by learning a non-linear mapping from the
source domain (training set) to the target domain (testing set) while retaining
training set labels. The cycle consistency loss and adversarial loss for
generative adversarial networks are used to learn the mapping. A structural
similarity index (SSIM) loss is used to enforce label retention while
augmenting the training set. Our methods and hypotheses are supported by
quantitative comparisons with prior debiasing techniques. These comparisons
showcase the superiority of our method and its potential to mitigate the
effects of dataset bias during the inference stage.
Comment: 10 pages, 4 figures, 6 tables, submitted to ICAAI 2019
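A hedged sketch of the generator objective (assuming PyTorch; the loss weights and the simplified single-window SSIM are illustrative, real SSIM implementations use local Gaussian windows): adversarial and cycle terms learn the source-to-target mapping, while the SSIM term keeps the translated image structurally close to the original so labels carry over.

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified global SSIM between image batches in [0, 1] (one window per image)."""
    mx, my = x.mean(dim=(1, 2, 3)), y.mean(dim=(1, 2, 3))
    vx, vy = x.var(dim=(1, 2, 3)), y.var(dim=(1, 2, 3))
    cov = ((x - mx[:, None, None, None]) *
           (y - my[:, None, None, None])).mean(dim=(1, 2, 3))
    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
    return (1 - ssim).mean()

def generator_loss(src, fake_tgt, cycled_src, d_logits, lam_cyc=10.0, lam_ssim=1.0):
    l_adv = F.binary_cross_entropy_with_logits(d_logits, torch.ones_like(d_logits))
    l_cyc = (cycled_src - src).abs().mean()        # cycle consistency (L1)
    l_ssim = ssim_loss(src, fake_tgt)              # structural label retention
    return l_adv + lam_cyc * l_cyc + lam_ssim * l_ssim
```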
Bayes-Factor-VAE: Hierarchical Bayesian Deep Auto-Encoder Models for Factor Disentanglement
We propose a family of novel hierarchical Bayesian deep auto-encoder models
capable of identifying disentangled factors of variability in data. While many
recent attempts at factor disentanglement have focused on sophisticated
learning objectives within the VAE framework, their choice of a standard normal
as the latent factor prior is both suboptimal and detrimental to performance.
Our key observation is that the disentangled latent variables responsible for
major sources of variability, the relevant factors, can be more appropriately
modeled using long-tail distributions. Typical Gaussian priors are, on the
other hand, better suited for modeling nuisance factors. Motivated by this,
we extend the VAE to a hierarchical Bayesian model by introducing hyper-priors
on the variances of the Gaussian latent priors, mimicking an infinite mixture,
while maintaining the tractable learning and inference of traditional VAEs.
This analysis signifies the importance of partitioning the latent dimensions
corresponding to relevant factors and nuisances and treating them differently.
Our proposed models, dubbed Bayes-Factor-VAEs, are shown to
outperform existing methods both quantitatively and qualitatively in terms of
latent disentanglement across several challenging benchmark tasks.
Comment: International Conference on Computer Vision (ICCV) 2019
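As a worked illustration of why a variance hyper-prior produces the desired long tail (a standard conjugacy result; the paper's specific hyper-prior choices may differ): an inverse-Gamma hyper-prior on the variance of a zero-mean Gaussian marginalizes to a heavy-tailed Student-t, while fixing the variance recovers the usual Gaussian for nuisance dimensions.

```latex
\sigma_j^2 \sim \mathrm{Inv\text{-}Gamma}(\alpha,\beta), \quad
z_j \mid \sigma_j^2 \sim \mathcal{N}(0,\sigma_j^2)
\;\Longrightarrow\;
p(z_j) = \int \mathcal{N}(z_j;\, 0,\, \sigma_j^2)\, p(\sigma_j^2)\, \mathrm{d}\sigma_j^2
\;=\; \mathrm{St}_{\nu = 2\alpha}\!\big(z_j;\; 0,\; \sqrt{\beta/\alpha}\,\big)
```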
Towards Better Understanding of Disentangled Representations via Mutual Information
Most existing work on disentangled representation learning is built solely
upon a marginal independence assumption: all factors in disentangled
representations should be statistically independent. As recent studies have
shown theoretically, this assumption is necessary but not sufficient for
disentanglement without additional inductive biases in the modeling process.
We argue in this work that disentangled
representations should be characterized by their relation with observable data.
In particular, we formulate such a relation through the concept of mutual
information: the mutual information between each factor of the disentangled
representations and data should be invariant conditioned on values of the other
factors. Together with the widely accepted independence assumption, we further
bridge this formulation with the conditional independence of the factors in a
representation given the data. Moreover, we note that conditional independence
of the latent variables is already imposed in most VAE-type models and InfoGAN
by the choice of a factorized approximate posterior q(z|x) in the
encoders. Such an arrangement of encoders introduces a crucial inductive bias
for disentangled representations. To demonstrate the importance of our proposed
assumption and the related inductive bias, we show in experiments that
violating the assumption leads to a decline in disentanglement among the
factors of the learned representations.
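In symbols, the proposed characterization (notation ours, paraphrasing the text above): the conditional mutual information between each factor and the data should not depend on where the remaining factors are fixed, and the factorized encoder supplies the conditional-independence inductive bias.

```latex
I\big(z_j;\, x \,\big|\, \mathbf{z}_{-j} = \mathbf{a}\big)
 \;=\; I\big(z_j;\, x \,\big|\, \mathbf{z}_{-j} = \mathbf{b}\big)
 \quad \forall\, \mathbf{a}, \mathbf{b},\; \forall j,
\qquad \text{with} \qquad
q(\mathbf{z} \mid x) \;=\; \prod_{j} q(z_j \mid x)
```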