Out-of-Distribution Detection Using Neural Rendering Generative Models
Out-of-distribution (OoD) detection is a natural downstream task for deep
generative models, due to their ability to learn the input probability
distribution. There are mainly two classes of approaches for OoD detection
using deep generative models, viz., those based on the likelihood measure and the
reconstruction loss. However, both approaches are unable to carry out OoD
detection effectively, especially when the OoD samples have smaller variance
than the training samples. For instance, both flow-based and VAE models assign
higher likelihood to images from SVHN when trained on CIFAR-10 images. We use a
recently proposed generative model known as the neural rendering model (NRM)
and derive metrics for OoD detection. We show that NRM unifies both approaches since it
provides a likelihood estimate and also carries out reconstruction in each
layer of the neural network. Among various measures, we found the joint
likelihood of latent variables to be the most effective one for OoD detection.
Our results show that when trained on CIFAR-10, lower likelihood (of latent
variables) is assigned to SVHN images. Additionally, we show that this metric
is consistent across other OoD datasets. To the best of our knowledge, this is
the first work to show consistently lower likelihood for OoD data with smaller
variance using deep generative models.
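As an aside for implementers: the detection rule this abstract describes
reduces to thresholding a per-sample score. A minimal Python sketch, assuming
a hypothetical `latent_log_likelihood` method standing in for the NRM's joint
likelihood of latent variables:

```python
import numpy as np

def ood_flags(model, test_images, val_images, quantile=0.05):
    """Flag images as OoD when their joint latent log-likelihood falls
    below a threshold calibrated on held-out in-distribution data.
    `model.latent_log_likelihood` is a hypothetical stand-in for the
    NRM's per-sample joint likelihood of latent variables."""
    # Calibrate the threshold as a low quantile of in-distribution scores.
    val_scores = np.array([model.latent_log_likelihood(x) for x in val_images])
    threshold = np.quantile(val_scores, quantile)
    # Anything scoring below the threshold is flagged as out-of-distribution.
    scores = np.array([model.latent_log_likelihood(x) for x in test_images])
    return scores < threshold
```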
Expecting the Unexpected: Training Detectors for Unusual Pedestrians with Adversarial Imposters
As autonomous vehicles become an every-day reality, high-accuracy pedestrian
detection is of paramount practical importance. Pedestrian detection is a
highly researched topic with mature methods, but most datasets focus on common
scenes of people engaged in typical walking poses on sidewalks. But performance
is most crucial for dangerous scenarios, such as children playing in the street
or people using bicycles/skateboards in unexpected ways. Such "in-the-tail"
data is notoriously hard to observe, making both training and testing
difficult. To analyze this problem, we have collected a novel annotated dataset
of dangerous scenarios called the Precarious Pedestrian dataset. Even given a
dedicated collection effort, it is relatively small by contemporary standards
(around 1000 images). To allow for large-scale data-driven learning, we explore
the use of synthetic data generated by a game engine. A significant challenge
is selecting the right "priors" or parameters for synthesis: we would like
realistic data with poses and object configurations that mimic true Precarious
Pedestrians. Inspired by Generative Adversarial Networks (GANs), we generate a
massive amount of synthetic data and train a discriminative classifier to
select a realistic subset, which we deem the Adversarial Imposters. We
demonstrate that this simple pipeline allows one to synthesize realistic
training data by making use of rendering/animation engines within a GAN
framework. Interestingly, we also demonstrate that such data can be used to
rank algorithms, suggesting that Adversarial Imposters can also be used for
"in-the-tail" validation at test-time, a notoriously difficult challenge for
real-world deployment.
Comment: To appear in CVPR 2017.
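The selection step of this pipeline is straightforward to sketch. In the
Python fragment below, `discriminator` is a hypothetical real-vs-synthetic
classifier (trained beforehand on real images versus the full synthetic pool)
returning an estimate of P(real); the top-scoring synthetic images are kept
as the "Adversarial Imposters":

```python
import numpy as np

def select_imposters(discriminator, synthetic_images, keep_fraction=0.1):
    """Keep the synthetic images that a real-vs-synthetic classifier
    scores as most realistic. `discriminator` and `keep_fraction` are
    illustrative assumptions, not the paper's exact settings."""
    scores = np.array([discriminator(img) for img in synthetic_images])
    # Rank by predicted realism and keep the top fraction for training.
    n_keep = max(1, int(keep_fraction * len(synthetic_images)))
    top_idx = np.argsort(scores)[::-1][:n_keep]
    return [synthetic_images[i] for i in top_idx]
```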
An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution
Few ideas have enjoyed as large an impact on deep learning as convolution.
For any problem involving pixels or spatial representations, common intuition
holds that convolutional neural networks may be appropriate. In this paper we
show a striking counterexample to this intuition via the seemingly trivial
coordinate transform problem, which simply requires learning a mapping between
coordinates in (x,y) Cartesian space and one-hot pixel space. Although
convolutional networks would seem appropriate for this task, we show that they
fail spectacularly. We demonstrate and carefully analyze the failure first on a
toy problem, at which point a simple fix becomes obvious. We call this solution
CoordConv, which works by giving convolution access to its own input
coordinates through the use of extra coordinate channels. Without sacrificing
the computational and parametric efficiency of ordinary convolution, CoordConv
allows networks to learn either complete translation invariance or varying
degrees of translation dependence, as required by the end task. CoordConv
solves the coordinate transform problem with perfect generalization, while
running 150 times faster and using 10--100 times fewer parameters than
convolution. This stark
contrast raises the question: to what extent has this inability of convolution
persisted insidiously inside other tasks, subtly hampering performance from
within? A complete answer to this question will require further investigation,
but we show preliminary evidence that swapping convolution for CoordConv can
improve models on a diverse set of tasks. Using CoordConv in a GAN produced
less mode collapse as the transform between high-level spatial latents and
pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST
showed 24% better IOU when using CoordConv, and in the RL domain agents playing
Atari games benefit significantly from the use of CoordConv layers.
Comment: Published in NeurIPS 2018.
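The mechanism is compact enough to sketch. A minimal PyTorch version of a
CoordConv layer (an illustrative sketch, not the authors' reference
implementation; the optional radius channel described in the paper is
omitted):

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Convolution given access to its own input coordinates via two
    extra channels holding each pixel's (x, y) position in [-1, 1]."""

    def __init__(self, in_channels, out_channels, **kwargs):
        super().__init__()
        # Two extra input channels carry the coordinate maps.
        self.conv = nn.Conv2d(in_channels + 2, out_channels, **kwargs)

    def forward(self, x):
        b, _, h, w = x.shape
        # Build normalized coordinate grids matching the spatial size.
        ys = torch.linspace(-1.0, 1.0, h, device=x.device)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([grid_x, grid_y]).expand(b, -1, -1, -1)
        return self.conv(torch.cat([x, coords], dim=1))

# Usage: layer = CoordConv2d(3, 16, kernel_size=3, padding=1)
```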
Deep Convolutional Inverse Graphics Network
This paper presents the Deep Convolution Inverse Graphics Network (DC-IGN), a
model that learns an interpretable representation of images. This
representation is disentangled with respect to transformations such as
out-of-plane rotations and lighting variations. The DC-IGN model is composed of
multiple layers of convolution and de-convolution operators and is trained
using the Stochastic Gradient Variational Bayes (SGVB) algorithm. We propose a
training procedure to encourage neurons in the graphics code layer to represent
a specific transformation (e.g. pose or light). Given a single input image, our
model can generate new images of the same object with variations in pose and
lighting. We present qualitative and quantitative results of the model's
efficacy at learning a 3D rendering engine.
Comment: First two authors contributed equally.
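The batch-clamping idea behind this training procedure can be illustrated in
a few lines. This is a simplified sketch: the full DC-IGN procedure also
manipulates gradients for the clamped dimensions, which is omitted here:

```python
import torch

def clamp_graphics_code(z, active_dim):
    """For a mini-batch in which only one scene transformation (e.g.
    pose) varies, clamp every other latent dimension to its batch mean
    so that only `active_dim` can explain the variation. Forward-pass
    clamping only; the paper's gradient handling is omitted."""
    z_clamped = z.mean(dim=0, keepdim=True).expand_as(z).clone()
    z_clamped[:, active_dim] = z[:, active_dim]  # let one dimension vary
    return z_clamped
```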
Superimposition-guided Facial Reconstruction from Skull
We develop a new algorithm to perform facial reconstruction from a given
skull. This technique has forensic application in helping the identification of
skeletal remains when other information is unavailable. Unlike most existing
strategies that directly reconstruct the face from the skull, we utilize a
database of portrait photos to create many face candidates, then perform a
superimposition to get a well matched face, and then revise it according to the
superimposition. To support this pipeline, we build an effective autoencoder
for image-based facial reconstruction, and a generative model for constrained
face inpainting. Our experiments have demonstrated that the proposed pipeline
is stable and accurate.
Comment: 14 pages; 14 figures.
Learning to Forecast Videos of Human Activity with Multi-granularity Models and Adaptive Rendering
We propose an approach for forecasting video of complex human activity
involving multiple people. Direct pixel-level prediction is too simple to
handle the appearance variability in complex activities. Hence, we develop
novel intermediate representations. An architecture combining a hierarchical
temporal model for predicting human poses and encoder-decoder convolutional
neural networks for rendering target appearances is proposed. Our hierarchical
model captures interactions among people by adopting a dynamic group-based
interaction mechanism. Next, our appearance rendering network encodes the
targets' appearances by learning adaptive appearance filters using a fully
convolutional network. Finally, these filters are placed in encoder-decoder
neural networks to complete the rendering. We demonstrate that our model can
generate videos that are superior to state-of-the-art methods, and can handle
complex human activity scenarios in video forecasting.
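The "adaptive appearance filters" can be read as a per-target dynamic
convolution. The sketch below makes that reading concrete under assumed
shapes; the paper's exact filter design may differ:

```python
import torch
import torch.nn.functional as F

def render_with_adaptive_filters(decoder_feats, filters):
    """Apply a per-sample filter bank (predicted from each target's
    appearance) to decoder features. Shapes are assumptions:
    decoder_feats is (B, C, H, W); filters is (B, C_out, C, k, k)."""
    out = []
    for feats, bank in zip(decoder_feats, filters):
        # Convolve each sample with its own predicted filters.
        out.append(F.conv2d(feats.unsqueeze(0), bank,
                            padding=bank.shape[-1] // 2))
    return torch.cat(out, dim=0)
```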
On Pre-Trained Image Features and Synthetic Images for Deep Learning
Deep Learning methods usually require huge amounts of training data to
perform at their full potential, and often require expensive manual labeling.
Using synthetic images is therefore very attractive to train object detectors,
as the labeling comes for free, and several approaches have been proposed to
combine synthetic and real images for training.
In this paper, we show that a simple trick is sufficient to train modern
object detectors very effectively with synthetic images only: we freeze the
layers responsible for feature extraction to generic layers pre-trained on real
images, and train only the remaining layers with plain OpenGL rendering. Our
experiments with very recent deep architectures for object recognition
(Faster-RCNN, R-FCN, Mask-RCNN) and image feature extractors (InceptionResnet
and Resnet) show that this simple approach performs surprisingly well.
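The trick amounts to a few lines in a modern framework. A sketch using
torchvision's Faster R-CNN (the exact backbone/head split here is an
assumption; the paper's point is freezing the pre-trained feature-extraction
layers and training the rest on synthetic images):

```python
import torch
import torchvision

# Detector whose backbone was pre-trained on real images.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Freeze feature extraction so the generic real-image features stay fixed.
for param in model.backbone.parameters():
    param.requires_grad = False

# Optimize only the remaining (still-trainable) layers on synthetic data.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
```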
Inverse Graphics with Probabilistic CAD Models
Recently, multiple formulations of vision problems as probabilistic
inversions of generative models based on computer graphics have been proposed.
However, applications to 3D perception from natural images have focused on
low-dimensional latent scenes, due to challenges in both modeling and
inference. Accounting for the enormous variability in 3D object shape and 2D
appearance via realistic generative models seems intractable, as does inverting
even simple versions of the many-to-many computations that link 3D scenes to 2D
images. This paper proposes and evaluates an approach that addresses key
aspects of both these challenges. We show that it is possible to solve
challenging, real-world 3D vision problems by approximate inference in
generative models for images based on rendering the outputs of probabilistic
CAD (PCAD) programs. Our PCAD object geometry priors generate deformable 3D
meshes corresponding to plausible objects and apply affine transformations to
place them in a scene. Image likelihoods are based on similarity in a feature
space based on standard mid-level image representations from the vision
literature. Our inference algorithm integrates single-site and locally blocked
Metropolis-Hastings proposals, Hamiltonian Monte Carlo and discriminative
data-driven proposals learned from training data generated from our models. We
apply this approach to 3D human pose estimation and object shape reconstruction
from single images, achieving quantitative and qualitative performance
improvements over state-of-the-art baselines.
Comment: For correspondence, contact [email protected]
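At its core, this is render-and-compare inference by MCMC. A single-site
Metropolis-Hastings sketch, where `render`, `features`, `log_prior`, and
`propose` are hypothetical stand-ins for the paper's renderer, mid-level
feature extractor, PCAD prior, and proposal distribution:

```python
import numpy as np

def metropolis_hastings(observed_feats, theta0, n_steps,
                        render, features, log_prior, propose, sigma=1.0):
    """Infer scene parameters by proposing edits, rendering, and
    accepting or rejecting via a Gaussian likelihood in feature space.
    A symmetric proposal is assumed."""
    def log_post(theta):
        diff = features(render(theta)) - observed_feats
        return log_prior(theta) - np.sum(diff ** 2) / (2 * sigma ** 2)

    theta, lp = theta0, log_post(theta0)
    for _ in range(n_steps):
        cand = propose(theta)
        lp_cand = log_post(cand)
        if np.log(np.random.rand()) < lp_cand - lp:   # MH accept rule
            theta, lp = cand, lp_cand
    return theta
```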
Attend, Infer, Repeat: Fast Scene Understanding with Generative Models
We present a framework for efficient inference in structured image models
that explicitly reason about objects. We achieve this by performing
probabilistic inference using a recurrent neural network that attends to scene
elements and processes them one at a time. Crucially, the model itself learns
to choose the appropriate number of inference steps. We use this scheme to
learn to perform inference in partially specified 2D models (variable-sized
variational auto-encoders) and fully specified 3D models (probabilistic
renderers). We show that such models learn to identify multiple objects -
counting, locating and classifying the elements of a scene - without any
supervision, e.g., decomposing 3D images with various numbers of objects in a
single forward pass of a neural network. We further show that the networks
produce accurate inferences when compared to supervised counterparts, and that
their structure leads to improved generalization.
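The control flow of this inference scheme is a short loop. A sketch in
PyTorch, where `rnn_cell` and `decoder_heads` are hypothetical stand-ins for
the paper's recurrent module and the heads producing the what/where latents
and the presence probability:

```python
import torch

def air_inference(rnn_cell, decoder_heads, image_encoding, max_steps=5):
    """Attend-Infer-Repeat-style loop: explain one object per step and
    let a sampled Bernoulli "presence" variable decide when to stop."""
    h = torch.zeros(1, rnn_cell.hidden_size)
    objects = []
    for _ in range(max_steps):
        h = rnn_cell(image_encoding, h)
        z_what, z_where, p_pres = decoder_heads(h)
        # A zero presence sample means the scene is fully explained.
        if torch.bernoulli(p_pres).item() == 0:
            break
        objects.append((z_what, z_where))  # one attended object per step
    return objects
```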