Amodal Segmentation through Out-of-Task and Out-of-Distribution Generalization with a Bayesian Model
Amodal completion is a visual task that humans perform easily but which is
difficult for computer vision algorithms. The aim is to segment those object
boundaries which are occluded and hence invisible. This task is particularly
challenging for deep neural networks because data is difficult to obtain and
annotate. Therefore, we formulate amodal segmentation as an out-of-task and
out-of-distribution generalization problem. Specifically, we replace the fully
connected classifier in neural networks with a Bayesian generative model of the
neural network features. The model is trained from non-occluded images using
bounding box annotations and class labels only, but is applied to generalize
out-of-task to object segmentation and to generalize out-of-distribution to
segment occluded objects. We demonstrate how such Bayesian models can naturally
generalize beyond the training task labels when they learn a prior that models
the object's background context and shape. Moreover, by leveraging an outlier
process, Bayesian models can further generalize out-of-distribution to segment
partially occluded objects and to predict their amodal object boundaries. Our
algorithm outperforms alternative methods that use the same supervision by a
large margin, and even outperforms methods where annotated amodal segmentations
are used during training, when the amount of occlusion is large. Code is
publicly available at https://github.com/YihongSun/Bayesian-Amodal
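A minimal sketch of the core idea, assuming per-class diagonal Gaussians over frozen backbone features and a constant-likelihood outlier process for occluders (class names, parameterization, and the outlier threshold are illustrative assumptions, not the released implementation linked above):

```python
import torch
import torch.nn.functional as F

class BayesianFeatureModel(torch.nn.Module):
    """Per-class Gaussian likelihood over backbone features plus a uniform
    outlier process; a sketch of the abstract's idea, not the authors' code."""

    def __init__(self, num_classes: int, feat_dim: int, outlier_loglik: float = -10.0):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(num_classes, feat_dim))
        self.log_var = torch.nn.Parameter(torch.zeros(num_classes, feat_dim))
        # Constant log-likelihood assigned to the outlier (occluder) process.
        self.outlier_loglik = outlier_loglik

    def log_likelihood(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, D) backbone features; returns (N, num_classes) log p(f | class)
        diff = feats.unsqueeze(1) - self.mu.unsqueeze(0)               # (N, C, D)
        var = self.log_var.exp().unsqueeze(0)                          # (1, C, D)
        return -0.5 * ((diff ** 2) / var + self.log_var.unsqueeze(0)).sum(-1)

    def segment(self, feats: torch.Tensor) -> torch.Tensor:
        # Pixels whose best class likelihood falls below the outlier process
        # are labelled as occluder/background (index num_classes).
        ll = self.log_likelihood(feats)
        best_ll, best_cls = ll.max(dim=1)
        occluded = best_ll < self.outlier_loglik
        return torch.where(occluded, torch.full_like(best_cls, ll.shape[1]), best_cls)
```

Because the generative model is defined over features rather than task labels, the same likelihoods can be reused for segmentation (out-of-task) and, via the outlier process, for partially occluded objects (out-of-distribution).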
OOD-CV: A Benchmark for Robustness to Out-of-Distribution Shifts of Individual Nuisances in Natural Images
Enhancing the robustness of vision algorithms in real-world scenarios is
challenging. One reason is that existing robustness benchmarks are limited, as
they either rely on synthetic data or ignore the effects of individual nuisance
factors. We introduce OOD-CV, a benchmark dataset that includes
out-of-distribution examples of 10 object categories with respect to pose, shape,
texture, context, and weather conditions, and enables benchmarking models
for image classification, object detection, and 3D pose estimation. In addition
to this novel dataset, we contribute extensive experiments using popular
baseline methods, which reveal that: 1. Some nuisance factors have a much
stronger negative effect on performance than others, with the effect also
depending on the vision task. 2. Current approaches to enhance robustness have
only marginal effects, and can even reduce robustness. 3. We do not observe
significant differences between convolutional and transformer architectures. We
believe our dataset provides a rich testbed to study robustness and will help
push forward research in this area.
Comment: Project webpage: https://bzhao.me/ROBIN/; this work was accepted as an Oral at ECCV 2022.
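A minimal sketch of how per-nuisance evaluation on such a benchmark might look, assuming a hypothetical directory layout with one subfolder per nuisance factor (the paths, folder names, and loader choice are illustrative assumptions, not the benchmark's actual API):

```python
import torch
from torchvision import datasets, transforms

# Hypothetical layout: ood_cv/<nuisance>/<class>/<image>.jpg; the real
# benchmark's file structure and loaders may differ.
NUISANCES = ["pose", "shape", "texture", "context", "weather"]

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

@torch.no_grad()
def accuracy_per_nuisance(model: torch.nn.Module, root: str = "ood_cv") -> dict:
    """Report classification accuracy separately for each nuisance factor."""
    model.eval()
    results = {}
    for nuisance in NUISANCES:
        split = datasets.ImageFolder(f"{root}/{nuisance}", transform=preprocess)
        loader = torch.utils.data.DataLoader(split, batch_size=64)
        correct, total = 0, 0
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
        results[nuisance] = correct / max(total, 1)
    return results
```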
Object Recognition and Parsing with Weak Supervision
Object recognition is a fundamental problem in computer vision and has attracted a lot of research attention, while object parsing is equally important for many computer vision tasks but has been studied less. With the recent development of deep neural networks, computer vision research has been dominated by deep learning approaches, which require large amounts of training data for a specific task in a specific domain. The cost of collecting rare samples and making "hard" labels is prohibitively high and has limited the development of many important vision studies, including object parsing. This dissertation will focus on object recognition and parsing with weak supervision, which tackles the problem when only a limited amount of data or labels is available for training deep neural networks in the target domain. The goal is to design more advanced computer vision models with enhanced data efficiency during training and increased robustness to out-of-distribution samples at test time. To achieve this goal, I will introduce several strategies, including unsupervised learning of compositional components in deep neural networks, zero/few-shot learning by preserving useful knowledge acquired in pre-training, weakly supervised learning combined with spatial-temporal information in video data, and learning from 3D computer graphics models and synthetic data. Furthermore, I will discuss new findings in our cognitive science projects and explain how part-based representations benefit the development of visual analogical reasoning models. I believe this series of works alleviates the data-hunger problem of deep neural networks and improves computer vision models so that they behave closer to human intelligence.
CoKe: Localized Contrastive Learning for Robust Keypoint Detection
Today's most popular approaches to keypoint detection involve very complex
network architectures that aim to learn holistic representations of all
keypoints. In this work, we take a step back and ask: Can we simply learn a
local keypoint representation from the output of a standard backbone
architecture? This will help make the network simpler and more robust,
particularly if large parts of the object are occluded. We demonstrate that
this is possible by looking at the problem from the perspective of
representation learning. Specifically, the keypoint kernels need to be chosen
to optimize three types of distances in the feature space: Features of the same
keypoint should be similar to each other, while differing from those of other
keypoints, and also being distinct from features from the background clutter.
We formulate this optimization within a framework, which we call CoKe, that
includes supervised contrastive learning. CoKe makes several approximations to
keep representation learning tractable on large datasets. In
particular, we introduce a clutter bank to approximate non-keypoint features,
and a momentum update to compute the keypoint representation while training the
feature extractor. Our experiments show that CoKe achieves state-of-the-art
results compared to approaches that jointly represent all keypoints
holistically (Stacked Hourglass Networks, MSS-Net) as well as to approaches
that are supervised by detailed 3D object geometry (StarMap). Moreover, CoKe is
robust and performs exceptionally well when objects are partially occluded and
significantly outperforms related work on a range of diverse datasets
(PASCAL3D+, MPII, ObjectNet3D).
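A minimal sketch of a localized contrastive objective with momentum-updated keypoint prototypes and a clutter bank of background features, in the spirit of the abstract (names, shapes, and the temperature are assumptions, not the authors' released implementation):

```python
import torch
import torch.nn.functional as F

class LocalKeypointContrast(torch.nn.Module):
    def __init__(self, num_keypoints: int, feat_dim: int,
                 bank_size: int = 1024, momentum: float = 0.9, tau: float = 0.07):
        super().__init__()
        self.register_buffer("prototypes",
                             F.normalize(torch.randn(num_keypoints, feat_dim), dim=1))
        self.register_buffer("clutter_bank",
                             F.normalize(torch.randn(bank_size, feat_dim), dim=1))
        self.momentum = momentum
        self.tau = tau

    @torch.no_grad()
    def update_prototypes(self, kp_feats: torch.Tensor, kp_ids: torch.Tensor) -> None:
        # Momentum update of each keypoint prototype from the current batch.
        for k in kp_ids.unique():
            mean_feat = F.normalize(kp_feats[kp_ids == k].mean(0), dim=0)
            self.prototypes[k] = F.normalize(
                self.momentum * self.prototypes[k] + (1 - self.momentum) * mean_feat, dim=0)

    def forward(self, kp_feats: torch.Tensor, kp_ids: torch.Tensor) -> torch.Tensor:
        # kp_feats: (N, D) features sampled at annotated keypoint locations,
        # kp_ids:   (N,)  keypoint indices; negatives are the other keypoint
        # prototypes plus the clutter bank of background features.
        kp_feats = F.normalize(kp_feats, dim=1)
        candidates = torch.cat([self.prototypes, self.clutter_bank], dim=0)  # (K + B, D)
        logits = kp_feats @ candidates.t() / self.tau                        # (N, K + B)
        return F.cross_entropy(logits, kp_ids)  # positive = own keypoint prototype
```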
Robust 3D-aware Object Classification via Discriminative Render-and-Compare
In real-world applications, it is essential to jointly estimate the 3D pose
and class label of objects, i.e., to perform 3D-aware classification. While
current approaches for either image classification or pose estimation can be
extended to 3D-aware classification, we observe that they are inherently
limited: 1) Their performance is much lower compared to the respective
single-task models, and 2) they are not robust in out-of-distribution (OOD)
scenarios. Our main contribution is a novel architecture for 3D-aware
classification, which builds upon a recent work and performs comparably to
single-task models while being highly robust. In our method, an object category
is represented as a 3D cuboid mesh composed of feature vectors at each mesh
vertex. Using differentiable rendering, we estimate the 3D object pose by
minimizing the reconstruction error between the mesh and the feature
representation of the target image. Object classification is then performed by
comparing the reconstruction losses across object categories. Notably, the
neural texture of the mesh is trained in a discriminative manner to enhance the
classification performance while also avoiding local optima in the
reconstruction loss. Furthermore, we show how our method and feed-forward
neural networks can be combined to scale the render-and-compare approach to
larger numbers of categories. Our experiments on PASCAL3D+, occluded-PASCAL3D+,
and OOD-CV show that our method outperforms all baselines at 3D-aware
classification by a wide margin in terms of performance and robustness.
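A minimal sketch of render-and-compare classification as described above, where the differentiable renderer is abstracted as a callable (in practice it could be built with a library such as PyTorch3D); the function signature, pose parameterization, and loss choice are illustrative assumptions:

```python
import torch

def classify_render_and_compare(
    target_feats: torch.Tensor,      # (D, H, W) features of the test image
    category_meshes: dict,           # category -> neural cuboid mesh with per-vertex features
    render_feats,                    # callable(mesh, pose) -> (D, H, W) rendered feature map
    init_poses: list,                # candidate initial poses, e.g. (azimuth, elevation, theta)
    steps: int = 50,
    lr: float = 0.05,
):
    """Return the category whose rendered neural mesh best reconstructs the
    target feature map, together with its optimized pose."""
    best = (None, None, float("inf"))
    for category, mesh in category_meshes.items():
        for init in init_poses:
            pose = torch.tensor(init, dtype=torch.float32, requires_grad=True)
            optim = torch.optim.Adam([pose], lr=lr)
            for _ in range(steps):
                optim.zero_grad()
                rendered = render_feats(mesh, pose)
                # Reconstruction loss: 1 - cosine similarity, averaged over pixels.
                loss = (1 - torch.nn.functional.cosine_similarity(
                    rendered, target_feats, dim=0)).mean()
                loss.backward()
                optim.step()
            if loss.item() < best[2]:
                best = (category, pose.detach(), loss.item())
    return best[0], best[1]
```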
Neural Textured Deformable Meshes for Robust Analysis-by-Synthesis
Human vision demonstrates higher robustness than current AI algorithms under
out-of-distribution scenarios. It has been conjectured that such robustness
benefits from performing analysis-by-synthesis. Our paper formulates three
vision tasks in a consistent manner using approximate analysis-by-synthesis,
realized as render-and-compare algorithms on neural features. In this work, we
introduce Neural Textured Deformable Meshes, which represent objects with
deformable geometry and allow optimization over both camera parameters and
object geometry. The deformable mesh is parameterized as a neural field and
covered by whole-surface neural texture maps, which are trained to be spatially
discriminative. During inference, we extract the feature map of the test
image and subsequently optimize the 3D pose and shape parameters of our model
using differentiable rendering to best reconstruct the target feature map. We
show that our analysis-by-synthesis is much more robust than conventional
neural networks when evaluated on real-world images and even in challenging
out-of-distribution scenarios, such as occlusion and domain shift. Our
algorithms are competitive with standard methods when tested on conventional
performance measures.
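A minimal sketch of the inference loop described above: jointly optimizing camera pose and mesh deformation so that the rendered neural texture best reconstructs the target feature map. The renderer is again abstracted as a callable, and the parameter dimensions, step count, and loss are illustrative assumptions rather than the paper's exact settings:

```python
import torch

def fit_pose_and_shape(
    target_feats: torch.Tensor,      # (D, H, W) backbone features of the test image
    render_neural_mesh,              # callable(pose, shape_params) -> (D, H, W)
    pose_dim: int = 6,               # e.g. rotation + translation parameters
    shape_dim: int = 32,             # latent deformation coefficients of the neural field
    steps: int = 100,
    lr: float = 0.02,
):
    pose = torch.zeros(pose_dim, requires_grad=True)
    shape = torch.zeros(shape_dim, requires_grad=True)
    optim = torch.optim.Adam([pose, shape], lr=lr)
    for _ in range(steps):
        optim.zero_grad()
        rendered = render_neural_mesh(pose, shape)
        # Feature reconstruction loss: negative cosine similarity per pixel.
        loss = -torch.nn.functional.cosine_similarity(rendered, target_feats, dim=0).mean()
        loss.backward()
        optim.step()
    return pose.detach(), shape.detach(), loss.item()
```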