Improving Deep Representation Learning with Complex and Multimodal Data.
Representation learning has emerged as a way to learn meaningful representations from data and has enabled breakthroughs in many applications, including visual object recognition, speech recognition, and text understanding. However, learning representations from complex high-dimensional sensory data is challenging because many irrelevant factors of variation exist (e.g., data transformations, random noise). On the other hand, to build an end-to-end prediction system for structured output variables, one needs to incorporate probabilistic inference to properly model the mapping from a single input to the possible configurations of output variables. This thesis addresses these limitations of current representation learning in two parts.
The first part discusses efficient learning algorithms for invariant representations based on restricted Boltzmann machines (RBMs). After pointing out the difficulty of training, we develop an efficient initialization method for sparse and convolutional RBMs. On top of that, we develop RBM variants that learn representations invariant to data transformations such as translation, rotation, or scale variation by pooling the filter responses of transformed input data, and invariant to irrelevant patterns such as random or structured noise by jointly performing feature selection and feature learning. We demonstrate improved performance on visual object recognition and weakly supervised foreground object segmentation.
The second part discusses conditional graphical models and learning frameworks for structured output variables that use deep generative models as priors. First, we combine the best properties of the CRF and the RBM to enforce both local and global (e.g., object shape) consistency for visual object segmentation. Furthermore, we develop a deep conditional generative model of structured output variables that is an end-to-end system trainable by backpropagation, and we demonstrate the importance of a global prior and probabilistic inference for visual object segmentation. Second, we develop a novel multimodal learning framework by casting the problem as a structured output representation learning problem in which the output is one data modality to be predicted from the other modalities, and vice versa. We explain how our method can be more effective than maximum likelihood learning and demonstrate state-of-the-art performance on visual-text and visual-only recognition tasks.
PhD, Electrical Engineering: Systems. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113549/1/kihyuks_1.pd
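To make the structured-output idea concrete, the following is a minimal, hypothetical sketch of a deep conditional generative model (a conditional VAE with a recognition network, a conditional prior, and a generation network) trained end-to-end by backpropagation. The PyTorch framework, layer sizes, and toy data are assumptions for illustration, not the thesis implementation.

```python
# Illustrative sketch (not the thesis code): a conditional VAE that predicts a
# structured output y from an input x through a latent variable z.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVAE(nn.Module):
    def __init__(self, x_dim=256, y_dim=64, z_dim=16, h_dim=128):
        super().__init__()
        # recognition network q(z | x, y)
        self.enc = nn.Sequential(nn.Linear(x_dim + y_dim, h_dim), nn.ReLU())
        self.enc_mu, self.enc_logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        # conditional prior network p(z | x)
        self.prior = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.prior_mu, self.prior_logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        # generation network p(y | x, z)
        self.dec = nn.Sequential(nn.Linear(x_dim + z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, y_dim))

    def forward(self, x, y):
        h = self.enc(torch.cat([x, y], dim=1))
        mu_q, logvar_q = self.enc_mu(h), self.enc_logvar(h)
        p = self.prior(x)
        mu_p, logvar_p = self.prior_mu(p), self.prior_logvar(p)
        # reparameterization trick keeps the sampling step differentiable
        z = mu_q + torch.exp(0.5 * logvar_q) * torch.randn_like(mu_q)
        y_logits = self.dec(torch.cat([x, z], dim=1))
        # negative ELBO = reconstruction + KL(q(z|x,y) || p(z|x))
        recon = F.binary_cross_entropy_with_logits(y_logits, y, reduction='sum')
        kl = 0.5 * torch.sum(logvar_p - logvar_q +
                             (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1)
        return recon + kl

model = ConditionalVAE()
x, y = torch.randn(8, 256), torch.rand(8, 64).round()  # toy batch of binary structured outputs
loss = model(x, y)
loss.backward()  # the whole system is trainable end-to-end by backpropagation
```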
Channel-Recurrent Autoencoding for Image Modeling
Despite recent successes in synthesizing faces and bedrooms, existing
generative models struggle to capture more complex image types, potentially due
to the oversimplification of their latent space constructions. To tackle this
issue, we build on Variational Autoencoders (VAEs) and integrate recurrent
connections across channels into both the inference and generation steps,
allowing high-level features to be captured in a global-to-local,
coarse-to-fine manner. Combined with an adversarial loss, our channel-recurrent
VAE-GAN (crVAE-GAN) outperforms VAE-GAN in generating a diverse spectrum of
high-resolution images while maintaining the same level of computational efficiency.
Our model produces interpretable and expressive latent representations to
benefit downstream tasks such as image completion. Moreover, we propose two
novel regularizations, namely the KL objective weighting scheme over time steps
and mutual information maximization between transformed latent variables and
the outputs, to enhance training.
Comment: Code: https://github.com/WendyShang/crVAE. Supplementary Materials: http://www-personal.umich.edu/~shangw/wacv18_supplementary_material.pd
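As an illustration of the channel-recurrent idea, the sketch below runs an LSTM across groups of feature-map channels so that each group contributes one step of the latent code. The group count, feature dimensions, and PyTorch modules are assumptions for illustration, not the released crVAE code.

```python
# Minimal sketch of channel-recurrent latent inference, assuming a 512-channel
# 4x4 feature map split into 8 channel groups; all sizes are illustrative.
import torch
import torch.nn as nn

class ChannelRecurrentEncoder(nn.Module):
    def __init__(self, channels=512, groups=8, spatial=4, z_per_group=32):
        super().__init__()
        self.groups = groups
        step_dim = (channels // groups) * spatial * spatial   # features per channel group
        self.lstm = nn.LSTM(step_dim, 256, batch_first=True)  # recurrence across channel groups
        self.mu = nn.Linear(256, z_per_group)
        self.logvar = nn.Linear(256, z_per_group)

    def forward(self, feat):               # feat: (B, C, H, W) from a conv encoder
        b = feat.size(0)
        # treat each channel group as one time step of the LSTM
        steps = feat.view(b, self.groups, -1)
        h, _ = self.lstm(steps)            # (B, groups, 256), built coarse-to-fine
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z.flatten(1), mu, logvar    # latent code assembled group by group

enc = ChannelRecurrentEncoder()
z, mu, logvar = enc(torch.randn(2, 512, 4, 4))
print(z.shape)  # torch.Size([2, 256]) -> 8 groups x 32 latent dims
```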
Improving Object Detection with Deep Convolutional Networks via Bayesian Optimization and Structured Prediction
Object detection systems based on the deep convolutional neural network (CNN)
have recently made ground-breaking advances on several object detection
benchmarks. While the features learned by these high-capacity neural networks
are discriminative for categorization, inaccurate localization is still a major
source of error for detection. Building upon high-capacity CNN architectures,
we address the localization problem by 1) using a search algorithm based on
Bayesian optimization that sequentially proposes candidate regions for an
object bounding box, and 2) training the CNN with a structured loss that
explicitly penalizes the localization inaccuracy. In experiments, we
demonstrate that each of the proposed methods improves detection performance
over the baseline on the PASCAL VOC 2007 and 2012 datasets. Furthermore, the
two methods are complementary and, when combined, significantly outperform the
previous state-of-the-art.
Comment: CVPR 201
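The following toy sketch illustrates only the Bayesian-optimization component: a Gaussian process is fit to the detection scores of boxes evaluated so far, and the next candidate box is the one maximizing expected improvement. The stand-in scoring function, the search region, and the scikit-learn GP are assumptions, not the paper's fine-grained search procedure.

```python
# Sketch of Bayesian optimization for box refinement: fit a GP to detector
# scores of boxes seen so far, then propose the candidate box with the highest
# expected improvement.  `score_box` is a placeholder for a CNN detector.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def score_box(box):
    # stand-in for a CNN detection score of box = (x1, y1, x2, y2)
    target = np.array([50.0, 60.0, 150.0, 180.0])
    return -np.linalg.norm(box - target) / 100.0

def expected_improvement(mu, sigma, best):
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
boxes = rng.uniform([0, 0, 100, 100], [100, 100, 200, 200], size=(5, 4))  # initial proposals
scores = np.array([score_box(b) for b in boxes])

for _ in range(20):                                   # sequentially propose new boxes
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=30.0), normalize_y=True)
    gp.fit(boxes, scores)
    candidates = rng.uniform([0, 0, 100, 100], [100, 100, 200, 200], size=(500, 4))
    mu, sigma = gp.predict(candidates, return_std=True)
    best_candidate = candidates[np.argmax(expected_improvement(mu, sigma, scores.max()))]
    boxes = np.vstack([boxes, best_candidate])
    scores = np.append(scores, score_box(best_candidate))

print("best box:", boxes[np.argmax(scores)], "score:", scores.max())
```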
Video Probabilistic Diffusion Models in Projected Latent Space
Despite the remarkable progress in deep generative models, synthesizing
high-resolution and temporally coherent videos still remains a challenge due to
their high-dimensionality and complex temporal dynamics along with large
spatial variations. Recent works on diffusion models have shown their potential
to solve this challenge, yet they suffer from severe computation and memory
inefficiency that limits their scalability. To handle this issue, we
propose a novel generative model for videos, coined projected latent video
diffusion models (PVDM), a probabilistic diffusion model which learns a video
distribution in a low-dimensional latent space and thus can be efficiently
trained with high-resolution videos under limited resources. Specifically, PVDM
is composed of two components: (a) an autoencoder that projects a given video
into 2D-shaped latent vectors that factorize the complex cubic structure of video
pixels and (b) a diffusion model architecture specialized for our new
factorized latent space and the training/sampling procedure to synthesize
videos of arbitrary length with a single model. Experiments on popular video
generation datasets demonstrate the superiority of PVDM compared with previous
video synthesis methods; e.g., PVDM obtains an FVD score of 639.7 on the
UCF-101 long-video (128 frames) generation benchmark, improving on the prior
state-of-the-art score of 1773.4.
Comment: Project page: https://sihyun.me/PVD
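To illustrate the latent factorization, the sketch below collapses each axis of a video feature volume into a 2D-shaped latent plane using simple learned projections, yielding image-like latents that a 2D diffusion backbone could consume. The fixed resolutions and the linear projections are assumptions standing in for PVDM's learned autoencoder.

```python
# Toy sketch of the "project a video into 2D-shaped latents" idea: collapse each
# axis of a (C, T, H, W) feature volume with learned 1D projections.  Assumes
# 16 frames at 64x64 resolution; not the paper's architecture.
import torch
import torch.nn as nn

class VideoToTriplane(nn.Module):
    def __init__(self, frames=16, height=64, width=64):
        super().__init__()
        self.proj_t = nn.Linear(frames, 1)   # collapse time   -> (C, H, W) plane
        self.proj_h = nn.Linear(height, 1)   # collapse height -> (C, T, W) plane
        self.proj_w = nn.Linear(width, 1)    # collapse width  -> (C, T, H) plane

    def forward(self, video):                # video: (B, C, T, H, W)
        z_hw = self.proj_t(video.permute(0, 1, 3, 4, 2)).squeeze(-1)  # (B, C, H, W)
        z_tw = self.proj_h(video.permute(0, 1, 2, 4, 3)).squeeze(-1)  # (B, C, T, W)
        z_th = self.proj_w(video).squeeze(-1)                         # (B, C, T, H)
        return z_hw, z_tw, z_th              # three 2D-shaped latents for a 2D diffusion model

x = torch.randn(1, 8, 16, 64, 64)            # (batch, channels, frames, height, width)
planes = VideoToTriplane()(x)
print([p.shape for p in planes])
```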
Unsupervised Domain Adaptation for Face Recognition in Unlabeled Videos
Despite rapid advances in face recognition, there remains a clear gap between
the performance of still image-based face recognition and video-based face
recognition, due to the vast difference in visual quality between the domains
and the difficulty of curating diverse large-scale video datasets. This paper
addresses both of those challenges through an image-to-video feature-level
domain adaptation approach that learns discriminative video frame
representations. The framework utilizes large-scale unlabeled video data to
reduce the gap between different domains while transferring discriminative
knowledge from large-scale labeled still images. Given a face recognition
network that is pretrained in the image domain, the adaptation is achieved by
(i) distilling knowledge from the network to a video adaptation network through
feature matching, (ii) performing feature restoration through synthetic data
augmentation and (iii) learning a domain-invariant feature through a domain
adversarial discriminator. We further improve performance through a
discriminator-guided feature fusion that boosts high-quality frames while
eliminating those degraded by video domain-specific factors. Experiments on the
YouTube Faces and IJB-A datasets demonstrate that each module contributes to
our feature-level domain adaptation framework and substantially improves video
face recognition performance to achieve state-of-the-art accuracy. We
demonstrate qualitatively that the network learns to suppress diverse artifacts
in videos such as pose, illumination or occlusion without being explicitly
trained for them.
Comment: accepted for publication at International Conference on Computer Vision (ICCV) 201
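As a rough illustration of discriminator-guided feature fusion, the sketch below weights per-frame features by a learned quality score before averaging, so degraded frames contribute less to the video-level descriptor. The scoring head, temperature, and feature dimensions are assumptions rather than the paper's architecture.

```python
# Minimal sketch of discriminator-guided feature fusion: per-frame features are
# averaged with weights derived from a quality/discriminator score.
import torch
import torch.nn as nn

class GuidedFusion(nn.Module):
    def __init__(self, feat_dim=256, temperature=0.1):
        super().__init__()
        self.quality = nn.Linear(feat_dim, 1)   # stand-in for a discriminator score
        self.temperature = temperature

    def forward(self, frame_feats):             # (num_frames, feat_dim) for one video
        scores = self.quality(frame_feats).squeeze(-1)          # higher = better quality
        weights = torch.softmax(scores / self.temperature, 0)   # down-weight degraded frames
        return (weights.unsqueeze(-1) * frame_feats).sum(0)     # fused video-level descriptor

feats = torch.randn(30, 256)                    # features of 30 video frames
video_descriptor = GuidedFusion()(feats)
print(video_descriptor.shape)                   # torch.Size([256])
```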