PifPaf: Composite Fields for Human Pose Estimation
We propose a new bottom-up method for multi-person 2D human pose estimation
that is particularly well suited for urban mobility such as self-driving cars
and delivery robots. The new method, PifPaf, uses a Part Intensity Field (PIF)
to localize body parts and a Part Association Field (PAF) to associate body
parts with each other to form full human poses. Our method outperforms previous
methods at low resolution and in crowded, cluttered and occluded scenes thanks
to (i) our new composite field PAF encoding fine-grained information and (ii)
the choice of Laplace loss for regressions which incorporates a notion of
uncertainty. Our architecture is based on a fully convolutional, single-shot,
box-free design. We perform on par with the existing state-of-the-art bottom-up
method on the standard COCO keypoint task and produce state-of-the-art results
on a modified COCO keypoint task for the transportation domain. Comment: CVPR 201
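The abstract above credits a Laplace loss with bringing a notion of uncertainty into the regressions. The listing does not give the exact formulation, so the following is only a minimal sketch of the standard negative log-likelihood of a Laplace distribution, assuming the network predicts both a location and a positive scale. The function name `laplace_nll` and its parameters are hypothetical, chosen for illustration.

```python
import math


def laplace_nll(target, predicted, scale):
    """Negative log-likelihood of `target` under Laplace(predicted, scale).

    Assumption: `scale` is a positive spread predicted alongside the
    location. This is the textbook Laplace NLL, not necessarily the
    exact loss used in the paper.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    # |x - mu| / b penalises the error, log(2b) penalises over-wide spreads.
    return abs(target - predicted) / scale + math.log(2.0 * scale)


# A large error costs less when the model also reports a larger spread.
confident = laplace_nll(3.0, 1.0, 1.0)
uncertain = laplace_nll(3.0, 1.0, 2.0)
```

The first term shrinks as the predicted scale grows, while the log term grows with it, so a model is rewarded for reporting a large spread only where its errors are in fact large.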
Recurrent Attention Models for Depth-Based Person Identification
We present an attention-based model that reasons on human body shape and
motion dynamics to identify individuals in the absence of RGB information,
hence in the dark. Our approach leverages unique 4D spatio-temporal signatures
to address the identification problem across days. Formulated as a
reinforcement learning task, our model is based on a combination of
convolutional and recurrent neural networks with the goal of identifying small,
discriminative regions indicative of human identity. We demonstrate that our
model produces state-of-the-art results on several published datasets given
only depth images. We further study the robustness of our model towards
viewpoint, appearance, and volumetric changes. Finally, we share insights
gleaned from interpretable 2D, 3D, and 4D visualizations of our model's
spatio-temporal attention. Comment: Computer Vision and Pattern Recognition (CVPR) 201
Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts
Steering the behavior of a strong model pre-trained on internet-scale data
can be difficult due to the scarcity of competent supervisors. Recent studies
reveal that, despite supervisory noise, a strong student model may surpass its
weak teacher when fine-tuned on specific objectives. Yet, the effectiveness of
such weak-to-strong generalization remains limited, especially in the presence
of large capability gaps. In this paper, we propose to address this challenge
by harnessing a diverse set of specialized teachers, instead of a single
generalist one, that collectively supervises the strong student. Our approach
resembles the classical hierarchical mixture of experts, with two components
tailored for co-supervision: (i) we progressively alternate student training
and teacher assignment, leveraging the growth of the strong student to identify
plausible supervisions; (ii) we conservatively enforce teacher-student and
local-global consistency, leveraging their dependencies to reject potential
annotation noise. We validate the proposed method through visual recognition
tasks on the OpenAI weak-to-strong benchmark and additional multi-domain
datasets. Our code is available at https://github.com/yuejiangliu/csl. Comment: Preprint
Characterizing and Improving Stability in Neural Style Transfer
Recent progress in style transfer on images has focused on improving the
quality of stylized images and the speed of methods. However, real-time methods are
highly unstable, resulting in visible flickering when applied to videos. In this
work we characterize the instability of these methods by examining the solution
set of the style transfer objective. We show that the trace of the Gram matrix
representing style is inversely related to the stability of the method. Then,
we present a recurrent convolutional network for real-time video style transfer
which incorporates a temporal consistency loss and overcomes the instability of
prior methods. Our networks can be applied at any resolution, do not require
optical flow at test time, and produce high quality, temporally consistent
stylized videos in real-time.
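The abstract above relates stability to the trace of the Gram matrix that represents style. As a rough illustration of that quantity only, the sketch below computes an unnormalized Gram matrix from a feature map flattened to one row per channel, and then its trace. The helper names `gram_matrix` and `trace`, the flattened layout, and the absence of normalization are assumptions made here, not details taken from the paper.

```python
def gram_matrix(features):
    """Unnormalized Gram matrix of a feature map.

    `features` is a list of C rows, one per channel, each holding the
    N flattened spatial activations of that channel (assumed layout).
    Entry (i, j) is the inner product of channels i and j.
    """
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in features]
        for row_i in features
    ]


def trace(matrix):
    """Sum of the diagonal entries of a square matrix."""
    return sum(matrix[i][i] for i in range(len(matrix)))


# Toy feature map with 2 channels and 2 spatial positions.
toy_features = [[1.0, 2.0], [3.0, 4.0]]
toy_gram = gram_matrix(toy_features)
toy_trace = trace(toy_gram)
```

For this definition the trace equals the sum of squared activations over all channels, so it grows with the overall energy of the feature map.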
Social Scene Understanding: End-to-End Multi-Person Action Localization and Collective Activity Recognition
We present a unified framework for understanding human social behaviors in
raw image sequences. Our model jointly detects multiple individuals, infers
their social actions, and estimates the collective actions with a single
feed-forward pass through a neural network. We propose a single architecture
that does not rely on external detection algorithms but rather is trained
end-to-end to generate dense proposal maps that are refined via a novel
inference scheme. The temporal consistency is handled via a person-level
matching Recurrent Neural Network. The complete model takes as input a sequence
of frames and outputs detections along with the estimates of individual actions
and collective activities. We demonstrate state-of-the-art performance of our
algorithm on multiple publicly available benchmarks.