PifPaf: Composite Fields for Human Pose Estimation
We propose a new bottom-up method for multi-person 2D human pose estimation
that is particularly well suited for urban mobility such as self-driving cars
and delivery robots. The new method, PifPaf, uses a Part Intensity Field (PIF)
to localize body parts and a Part Association Field (PAF) to associate body
parts with each other to form full human poses. Our method outperforms previous
methods at low resolution and in crowded, cluttered and occluded scenes thanks
to (i) our new composite field PAF encoding fine-grained information and (ii)
the choice of Laplace loss for regressions which incorporates a notion of
uncertainty. Our architecture is based on a fully convolutional, single-shot,
box-free design. We perform on par with the existing state-of-the-art bottom-up
method on the standard COCO keypoint task and produce state-of-the-art results
on a modified COCO keypoint task for the transportation domain. Comment: CVPR 201
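The Laplace loss mentioned in (ii) can be made concrete. A minimal NumPy sketch of a Laplace negative log-likelihood for regression with a learned spread is shown below; the function name and the choice of predicting log b (to keep the spread positive) are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def laplace_loss(pred, target, log_b):
    """Negative log-likelihood of a Laplace distribution.

    Minimizing |pred - target| / b + log(2b) lets the network predict a
    spread b alongside each regressed coordinate, so large errors on
    genuinely uncertain keypoints are penalized less than under a plain
    L1 loss.
    """
    b = np.exp(log_b)  # predicted spread, kept positive via exp
    return np.abs(pred - target) / b + np.log(2.0 * b)
```

With zero error and unit spread the loss reduces to log 2; for a large error, predicting a larger spread lowers the loss, which is the notion of uncertainty the abstract refers to.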
Recurrent Attention Models for Depth-Based Person Identification
We present an attention-based model that reasons on human body shape and
motion dynamics to identify individuals in the absence of RGB information,
hence in the dark. Our approach leverages unique 4D spatio-temporal signatures
to address the identification problem across days. Formulated as a
reinforcement learning task, our model is based on a combination of
convolutional and recurrent neural networks with the goal of identifying small,
discriminative regions indicative of human identity. We demonstrate that our
model produces state-of-the-art results on several published datasets given
only depth images. We further study the robustness of our model towards
viewpoint, appearance, and volumetric changes. Finally, we share insights
gleaned from interpretable 2D, 3D, and 4D visualizations of our model's
spatio-temporal attention. Comment: Computer Vision and Pattern Recognition (CVPR) 201
Characterizing and Improving Stability in Neural Style Transfer
Recent progress in style transfer on images has focused on improving the
quality of stylized images and speed of methods. However, real-time methods are
highly unstable resulting in visible flickering when applied to videos. In this
work we characterize the instability of these methods by examining the solution
set of the style transfer objective. We show that the trace of the Gram matrix
representing style is inversely related to the stability of the method. Then,
we present a recurrent convolutional network for real-time video style transfer
which incorporates a temporal consistency loss and overcomes the instability of
prior methods. Our networks can be applied at any resolution, do not re- quire
optical flow at test time, and produce high quality, temporally consistent
stylized videos in real-time
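The Gram matrix whose trace the abstract relates to stability is easy to state concretely. A small NumPy sketch, illustrative rather than the paper's code:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a C x H x W feature map, normalized by spatial size.

    Entry (i, j) is the inner product of channels i and j over all
    spatial positions; the trace is the total squared activation energy,
    the quantity the abstract relates (inversely) to the stability of
    real-time style transfer.
    """
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (h * w)
```

The matrix is symmetric by construction, and its trace equals the sum of squared activations divided by the spatial size.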
Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts
Steering the behavior of a strong model pre-trained on internet-scale data
can be difficult due to the scarcity of competent supervisors. Recent studies
reveal that, despite supervisory noise, a strong student model may surpass its
weak teacher when fine-tuned on specific objectives. Yet, the effectiveness of
such weak-to-strong generalization remains limited, especially in the presence
of large capability gaps. In this paper, we propose to address this challenge
by harnessing a diverse set of specialized teachers, instead of a single
generalist one, that collectively supervises the strong student. Our approach
resembles the classical hierarchical mixture of experts, with two components
tailored for co-supervision: (i) we progressively alternate student training
and teacher assignment, leveraging the growth of the strong student to identify
plausible supervisions; (ii) we conservatively enforce teacher-student and
local-global consistency, leveraging their dependencies to reject potentially
noisy annotations. We validate the proposed method through visual recognition
tasks on the OpenAI weak-to-strong benchmark and additional multi-domain
datasets. Our code is available at \url{https://github.com/yuejiangliu/csl}. Comment: Preprint
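The conservative consistency check in (ii) can be sketched as a simple filtering rule: keep a teacher's annotation only if it is plausible under the current student. This is a minimal illustration under assumed names and an assumed probability threshold, not the repository's implementation:

```python
import numpy as np

def filter_supervision(student_probs, teacher_labels, threshold=0.1):
    """Return indices of samples whose teacher label is consistent with
    the student.

    student_probs: N x K array of the student's class probabilities.
    teacher_labels: length-N list of class indices assigned by teachers.
    A sample is kept only if the student assigns its teacher's label at
    least `threshold` probability, rejecting likely annotation noise.
    """
    return [i for i, (p, y) in enumerate(zip(student_probs, teacher_labels))
            if p[y] >= threshold]
```

As the student grows stronger over alternating rounds, this filter rejects more of the weak teachers' mistakes while retaining the supervision the student still agrees with.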
Social Scene Understanding: End-to-End Multi-Person Action Localization and Collective Activity Recognition
We present a unified framework for understanding human social behaviors in
raw image sequences. Our model jointly detects multiple individuals, infers
their social actions, and estimates the collective actions with a single
feed-forward pass through a neural network. We propose a single architecture
that does not rely on external detection algorithms but rather is trained
end-to-end to generate dense proposal maps that are refined via a novel
inference scheme. The temporal consistency is handled via a person-level
matching Recurrent Neural Network. The complete model takes as input a sequence
of frames and outputs detections along with the estimates of individual actions
and collective activities. We demonstrate state-of-the-art performance of our
algorithm on multiple publicly available benchmarks.
COCA-COLA’S FUTURE GROWTH STRATEGY: DIVERSIFICATION?
This case study examines Coca-Cola Co.'s growth strategy. It provides the audience with the company's background of various acquisitions, an analysis of the Carbonated Soft Drink (CSD) industry in which Coca-Cola operates, the competitive landscape, and the threats and opportunities facing the CSD industry, along with an internal analysis of Coca-Cola Co. covering the current state of the company, its current growth strategy, and its future. With the information provided, the audience (students) will be able to recommend strategies that can help Coca-Cola Co. grow in the future.
Unsupervised Learning of Long-Term Motion Dynamics for Videos
We present an unsupervised representation learning approach that compactly
encodes the motion dependencies in videos. Given a pair of images from a video
clip, our framework learns to predict the long-term 3D motions. To reduce the
complexity of the learning framework, we propose to describe the motion as a
sequence of atomic 3D flows computed with RGB-D modality. We use a Recurrent
Neural Network based Encoder-Decoder framework to predict these sequences of
flows. We argue that in order for the decoder to reconstruct these sequences,
the encoder must learn a robust video representation that captures long-term
motion dependencies and spatial-temporal relations. We demonstrate the
effectiveness of our learned temporal representations on activity
classification across multiple modalities and datasets such as NTU RGB+D and
MSR Daily Activity 3D. Our framework is generic to any input modality, i.e.,
RGB, Depth, and RGB-D videos. Comment: CVPR 201
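Describing motion as a sequence of atomic 3D flows amounts to quantizing dense flow vectors against a small codebook of canonical motions. A minimal NumPy sketch; the six-direction codebook is an illustrative assumption, not the paper's actual atom set:

```python
import numpy as np

# A toy codebook of canonical 3D motion directions ("atomic flows").
ATOMS = np.array([
    [1, 0, 0], [-1, 0, 0],
    [0, 1, 0], [0, -1, 0],
    [0, 0, 1], [0, 0, -1],
], dtype=float)

def quantize_flow(flow):
    """Map each 3D flow vector (N x 3) to the index of its nearest atom.

    Replacing continuous flow with discrete atom indices reduces the
    complexity of the sequences the encoder-decoder must predict.
    """
    dists = np.linalg.norm(flow[:, None, :] - ATOMS[None, :, :], axis=-1)
    return dists.argmin(axis=1)
```

A sequence of such indices over time is then the prediction target for the recurrent decoder described above.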