Multi-task Self-Supervised Visual Learning
We investigate methods for combining multiple self-supervised tasks--i.e.,
supervised tasks where data can be collected without manual labeling--in order
to train a single visual representation. First, we provide an apples-to-apples
comparison of four different self-supervised tasks using the very deep
ResNet-101 architecture. We then combine tasks to jointly train a network. We
also explore lasso regularization to encourage the network to factorize the
information in its representation, and methods for "harmonizing" network inputs
in order to learn a more unified representation. We evaluate all methods on
ImageNet classification, PASCAL VOC detection, and NYU depth prediction. Our
results show that deeper networks work better, and that combining tasks--even
via a naive multi-head architecture--always improves performance. Our best
joint network nearly matches the PASCAL performance of a model pre-trained on
ImageNet classification, and matches the ImageNet network on NYU depth
prediction.
Comment: Published at ICCV 2017
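The naive multi-head design the abstract mentions is easy to picture in code. Below is a minimal sketch, assuming a shared trunk feeding one small head per pretext task, with the per-task losses summed; the tiny convolutional trunk and the task names are illustrative stand-ins, not the paper's ResNet-101 setup, its four specific tasks, or the lasso factorization.
```python
import torch
import torch.nn as nn

class MultiTaskSelfSup(nn.Module):
    def __init__(self, task_classes, feat_dim=256):
        super().__init__()
        # Shared trunk (a tiny stand-in for the paper's ResNet-101).
        self.trunk = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Naive multi-head setup: one linear head per self-supervised task.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feat_dim, n) for name, n in task_classes.items()}
        )

    def forward(self, x):
        feats = self.trunk(x)  # every task reads the same shared representation
        return {name: head(feats) for name, head in self.heads.items()}

# Hypothetical pretext tasks: 4-way rotation and 8-way patch-position labels.
model = MultiTaskSelfSup({"rotation": 4, "context": 8})
x = torch.randn(2, 3, 224, 224)
logits = model(x)
# Joint training signal: sum the per-task losses (placeholder random labels).
loss = sum(
    nn.functional.cross_entropy(out, torch.randint(0, out.shape[1], (2,)))
    for out in logits.values()
)
loss.backward()
```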
Sim2real transfer learning for 3D human pose estimation: motion to the rescue
Synthetic visual data can provide practically infinite diversity and rich
labels, while avoiding ethical issues with privacy and bias. However, for many
tasks, current models trained on synthetic data generalize poorly to real data.
The task of 3D human pose estimation is a particularly interesting example of
this sim2real problem, because learning-based approaches perform reasonably
well given real training data, yet labeled 3D poses are extremely difficult to
obtain in the wild, limiting scalability. In this paper, we show that standard
neural-network approaches, which perform poorly when trained on synthetic RGB
images, can perform well when the data is pre-processed to extract cues about
the person's motion, notably as optical flow and the motion of 2D keypoints.
Therefore, our results suggest that motion can be a simple way to bridge a
sim2real gap when video is available. We evaluate on the 3D Poses in the Wild
dataset, the most challenging modern benchmark for 3D pose estimation, where we
show full 3D mesh recovery that is on par with state-of-the-art methods trained
on real 3D sequences, despite training only on synthetic humans from the
SURREAL dataset.
Comment: Accepted at NeurIPS 2019
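The key preprocessing step, replacing raw RGB with motion cues, can be sketched as follows. This assumes OpenCV's Farneback dense flow as a stand-in estimator; the paper's exact flow method and its 2D-keypoint-motion channel are not reproduced here, and the helper name is hypothetical.
```python
# Sketch of motion-cue preprocessing: instead of feeding raw RGB, compute
# dense optical flow between consecutive frames and feed that to the pose
# network. Farneback flow is an assumed stand-in estimator.
import cv2
import numpy as np

def motion_cues(frames):
    """frames: list of HxWx3 uint8 RGB images -> list of HxWx2 float32 flows."""
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_RGB2GRAY)
    for frame in frames[1:]:
        cur = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        # Dense optical flow: one (dx, dy) displacement vector per pixel.
        flow = cv2.calcOpticalFlowFarneback(
            prev, cur, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
        )
        flows.append(flow.astype(np.float32))
        prev = cur
    return flows

# Two synthetic frames with a shifted square, just to exercise the function.
f0 = np.zeros((64, 64, 3), np.uint8); f0[20:30, 20:30] = 255
f1 = np.zeros((64, 64, 3), np.uint8); f1[22:32, 22:32] = 255
print(motion_cues([f0, f1])[0].shape)  # (64, 64, 2)
```
The point of the abstract is that a representation like this is largely appearance-invariant, so a network trained on it transfers from synthetic renders to real video far better than one trained on raw synthetic RGB.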
Video Action Transformer Network
We introduce the Action Transformer model for recognizing and localizing
human actions in video clips. We repurpose a Transformer-style architecture to
aggregate features from the spatiotemporal context around the person whose
actions we are trying to classify. We show that by using high-resolution,
person-specific, class-agnostic queries, the model spontaneously learns to
track individual people and to pick up on semantic context from the actions of
others. Additionally, its attention mechanism learns to emphasize hands and
faces, which are often crucial for discriminating an action, all without explicit
supervision other than boxes and class labels. We train and test our Action
Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming
the state-of-the-art by a significant margin using only raw RGB frames as
input.
Comment: CVPR 2019
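The core mechanism, person-specific queries cross-attending to spatiotemporal context, can be illustrated with a single attention layer. The shapes, the random features, and the single layer are assumptions for illustration; the actual model stacks several such units on top of a video trunk and derives queries from RoI-pooled person features.
```python
# Sketch of person-specific cross-attention: each person's query vector
# attends over flattened spatiotemporal context features.
import torch
import torch.nn as nn

d_model, n_ctx, n_people = 128, 196, 3

# Context features: e.g. a flattened T' x H' x W' grid of trunk features.
context = torch.randn(n_ctx, 1, d_model)           # (seq, batch, dim)
# One query per person box (hypothetically from RoI-pooled features).
person_queries = torch.randn(n_people, 1, d_model)

attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8)
# Each person query aggregates the context relevant to that person's action;
# an action classifier would then read off person_feats.
person_feats, weights = attn(person_queries, context, context)
print(person_feats.shape)  # torch.Size([3, 1, 128])
print(weights.shape)       # torch.Size([1, 3, 196]): attention over context
```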