OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
Realtime multi-person 2D pose estimation is a key component in enabling
machines to have an understanding of people in images and videos. In this work,
we present a realtime approach to detect the 2D pose of multiple people in an
image. The proposed method uses a nonparametric representation, which we refer
to as Part Affinity Fields (PAFs), to learn to associate body parts with
individuals in the image. This bottom-up system achieves high accuracy and
realtime performance, regardless of the number of people in the image. In
previous work, PAFs and body part location estimation were refined
simultaneously across training stages. We demonstrate that refining the PAFs
alone, rather than both the PAFs and the body part locations, substantially
improves both runtime performance and accuracy. We also present
the first combined body and foot keypoint detector, based on an internal
annotated foot dataset that we have publicly released. We show that the
combined detector not only reduces the inference time compared to running them
sequentially, but also maintains the accuracy of each component individually.
This work has culminated in the release of OpenPose, the first open-source
realtime system for multi-person 2D pose detection, including body, foot, hand,
and facial keypoints.
Comment: Journal version of arXiv:1611.08050, with better accuracy and faster
speed; releases a new foot keypoint dataset:
https://cmu-perceptual-computing-lab.github.io/foot_keypoint_dataset
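The PAF-based association the abstract describes can be sketched as a line integral along each candidate limb: sample the field between two joint candidates and dot each field vector with the limb's unit direction. The function below is a minimal illustration, not OpenPose's actual implementation; the name `paf_limb_score`, the nearest-pixel sampling, and the sample count are assumptions.

```python
import numpy as np

def paf_limb_score(paf_x, paf_y, joint_a, joint_b, num_samples=10):
    """Score a candidate limb from joint_a to joint_b by sampling the
    PAF along the connecting segment and dotting each field vector
    with the limb's unit direction; a high score means the field
    agrees the two joints belong to the same person."""
    a = np.asarray(joint_a, dtype=float)  # (x, y)
    b = np.asarray(joint_b, dtype=float)
    d = b - a
    norm = np.linalg.norm(d)
    if norm < 1e-8:
        return 0.0
    u = d / norm  # unit vector along the candidate limb
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = a + t * d
        xi, yi = int(round(x)), int(round(y))  # nearest-pixel sample
        score += paf_x[yi, xi] * u[0] + paf_y[yi, xi] * u[1]
    return score / num_samples
```

With a field pointing uniformly in +x, a horizontal candidate limb scores near 1 while a vertical one scores near 0, which is what drives the bipartite matching between joint candidates.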
Improving Multi-Person Pose Estimation using Label Correction
Multi-person pose estimation methods have recently received significant
attention, as convolutional neural networks have driven rapid progress in the
field. In particular, a recent method that exploits part confidence maps and
Part Affinity Fields (PAFs) has achieved accurate real-time prediction of
multi-person keypoints. However, human-annotated labels are sometimes
inappropriate for training models. For example, if a limb extends outside the
image, its keypoint may have no annotation because it lies outside the image,
and thus no label for the limb can be generated. If a model is trained on data
containing such missing labels, the model's output at that location, even when
correct, is penalized as a false positive, which is likely to harm the model's
performance. In this paper, we point out the existence of several patterns of
inappropriate labels, and propose a novel method for correcting such labels
with a teacher model trained on the same incomplete data. Experiments on the
COCO dataset show that training with the corrected labels improves the model's
performance and also speeds up training.
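A minimal sketch of the two ingredients described above: a loss that is masked where labels are missing, and a correction step that adopts a teacher model's confident predictions as training targets. The function names, the per-channel masking, and the confidence threshold are illustrative assumptions; the paper's actual correction rules may differ.

```python
import numpy as np

def masked_heatmap_loss(pred, target, valid_mask):
    """MSE heatmap loss where keypoint channels with missing labels
    (valid_mask == 0) contribute nothing, so a correct prediction at
    an unlabeled keypoint is not punished as a false positive."""
    per_kp = ((pred - target) ** 2).reshape(pred.shape[0], -1).mean(axis=1)
    denom = max(valid_mask.sum(), 1.0)
    return float((per_kp * valid_mask).sum() / denom)

def correct_labels(target, valid_mask, teacher_pred, conf_thresh=0.5):
    """Where a human label is missing but the teacher model predicts
    the keypoint confidently, adopt the teacher's heatmap as the
    training target and mark the channel as valid."""
    corrected, new_mask = target.copy(), valid_mask.copy()
    for k in range(target.shape[0]):
        if new_mask[k] == 0 and teacher_pred[k].max() >= conf_thresh:
            corrected[k] = teacher_pred[k]
            new_mask[k] = 1
    return corrected, new_mask
```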
Efficient Online Multi-Person 2D Pose Tracking with Recurrent Spatio-Temporal Affinity Fields
We present an online approach to efficiently and simultaneously detect and
track the 2D pose of multiple people in a video sequence. We build upon the
Part Affinity Field (PAF) representation designed for static images, and
propose an architecture that can encode and predict Spatio-Temporal Affinity
Fields (STAFs) across a video sequence. In particular, we propose a novel
temporal topology cross-linked across limbs which can consistently handle body
motions over a wide range of magnitudes. Additionally, we make the overall
approach recurrent in nature: the network ingests STAF heatmaps from previous
frames and estimates those for the current frame. Our approach uses only
online inference and tracking, and is currently the fastest and most accurate
bottom-up approach whose runtime is invariant to the number of people in the
scene and whose accuracy is invariant to the camera's input frame rate.
Running at 30 fps on a single GPU at a single scale, it achieves highly
competitive results on the PoseTrack benchmarks.
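The recurrent online structure described above can be sketched as a simple loop in which the previous frame's STAF estimate is fed back into the network. Here `predict_staf` stands in for the trained model; the whole function is an assumption about the data flow, not the authors' code.

```python
def track_poses_online(frames, predict_staf, init_staf):
    """Recurrent online inference: the predictor ingests the STAF
    heatmaps estimated for the previous frame together with the
    current frame, and returns updated STAFs. Per-frame cost is
    constant regardless of how many people appear."""
    staf = init_staf
    outputs = []
    for frame in frames:
        staf = predict_staf(frame, staf)  # warm-started by previous estimate
        outputs.append(staf)
    return outputs
```

Because each step consumes only the current frame and the previous estimate, no look-ahead is needed, which is what makes the approach usable online.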
Out of the Box: A combined approach for handling occlusion in Human Pose Estimation
Human pose estimation is a challenging problem, especially 3D pose estimation
from 2D images, owing to factors such as occlusion, depth ambiguity,
intertwined people, and crowds in general. 2D multi-person pose estimation in
the wild suffers from the same problems: occlusion, ambiguity, and
disentangling people's body parts. As a fundamental problem with many
applications, including but not limited to surveillance, affordable motion
capture for video games and movies, and physiotherapy, it is worth solving
from both a practical and an intellectual perspective. Although there are
cases in which no pose estimator can ever predict with 100% accuracy (cases
where even humans would fail), several algorithms have set new
state-of-the-art performance for human pose estimation in the wild. We examine
a few algorithms with different approaches and also formulate our own approach
to tackle a persistent problem: occlusion.
Comment: 11 pages, 12 figures
Looking at Hands in Autonomous Vehicles: A ConvNet Approach using Part Affinity Fields
In the context of autonomous driving, where the computer may issue a takeover
request requiring the human to resume control, a key step toward driving
safety is monitoring the hands to ensure the driver is ready for such a
request. This work focuses on the first step of this process: locating the
hands. Such a system must work in real time and under varying, harsh lighting
conditions. This paper introduces a fast ConvNet approach based on the
original OpenPose work for full-body joint estimation. The network is modified
to have fewer parameters and retrained on our own daytime naturalistic
autonomous driving dataset to estimate joint and affinity heatmaps for the
driver's and passenger's wrists and elbows, for a total of 8 joint classes and
part affinity fields between each wrist-elbow pair. The approach runs in real
time on real-world data at 40 fps across multiple drivers and passengers. The
system is extensively evaluated both quantitatively and qualitatively, showing
at least 95% detection performance on joint localization and arm-angle
estimation.
Comment: 11 pages, 8 figures, 1 table. Submitted to "IEEE Transactions on
Intelligent Vehicles" (under review).
4D Association Graph for Realtime Multi-person Motion Capture Using Multiple Video Cameras
This paper contributes a novel realtime multi-person motion capture algorithm
using multiview video inputs. Due to the heavy occlusions in each view, joint
optimization over the multiview images and multiple temporal frames is
indispensable, which raises the essential challenge of realtime efficiency. To
this end, for the first time, we unify per-view parsing, cross-view matching,
and temporal tracking into a single optimization framework, i.e., a 4D
association graph in which each dimension (image space, viewpoint, and time)
is treated equally and simultaneously. To solve the 4D association graph
efficiently, we further contribute the idea of 4D limb bundle parsing based on
heuristic search, followed by limb bundle assembly using a proposed bundle
Kruskal's algorithm. Our method enables a realtime online motion capture
system running at 30 fps using 5 cameras on a 5-person scene. Benefiting from
the unified parsing, matching, and tracking constraints, our method is robust
to noisy detections and achieves high-quality online pose reconstruction. The
proposed method outperforms the state of the art quantitatively without using
high-level appearance information. We also contribute a multiview video
dataset synchronized with a marker-based motion capture system for
scientific evaluation.
Comment: Accepted to CVPR 2020.
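A much-simplified sketch of the Kruskal-style greedy assembly: candidate limbs are taken in descending score order and merged with a union-find, skipping any merge that would give one skeleton two joints of the same type. The 4D (cross-view, cross-time) bookkeeping of the actual bundle Kruskal's algorithm is omitted, and all names here are illustrative.

```python
def assemble_skeletons(candidates):
    """Greedy, Kruskal-style skeleton assembly. `candidates` is a list
    of (score, (joint_id, joint_type), (joint_id, joint_type)) tuples;
    returns one list of joint ids per assembled person."""
    parent = {}
    types = {}  # root joint -> set of joint types already in that skeleton

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Register every joint that appears in any candidate limb.
    for _, (ja, ta), (jb, tb) in candidates:
        for j, t in ((ja, ta), (jb, tb)):
            if j not in parent:
                parent[j] = j
                types[j] = {t}

    for _score, (ja, _ta), (jb, _tb) in sorted(candidates, key=lambda c: -c[0]):
        ra, rb = find(ja), find(jb)
        if ra == rb:
            continue  # already in the same skeleton
        if types[ra] & types[rb]:
            continue  # would duplicate a joint type in one person
        parent[rb] = ra
        types[ra] |= types.pop(rb)

    skeletons = {}
    for j in parent:
        skeletons.setdefault(find(j), []).append(j)
    return list(skeletons.values())
```

As in Kruskal's minimum-spanning-tree algorithm, the greedy order plus the cycle/conflict check is what keeps the assembly near-linear in the number of candidate limbs.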
FoxNet: A Multi-face Alignment Method
Multi-face alignment aims to identify the geometric structure of multiple
faces in an image, and its performance is essential for many practical tasks,
such as face recognition, face tracking, and face animation. In this work, we
present a fast bottom-up multi-face alignment approach that can simultaneously
localize multi-person facial landmarks with high precision. In more detail,
our bottom-up architecture maps the landmarks of all faces into a
high-dimensional space in which they are represented. By clustering the
features belonging to the same face, our approach aligns the facial landmarks
of multiple people synchronously. Extensive experiments show that our method
achieves high performance on the multi-face landmark alignment task while
being extremely fast. Moreover, we propose a new multi-face dataset to compare
the speed and precision of bottom-up face alignment methods with top-down
methods. Our dataset is publicly available at
https://github.com/AISAResearch/FoxNet
Comment: Accepted by the 26th IEEE International Conference on Image
Processing (ICIP 2019)
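The grouping step, clustering landmark embeddings so that landmarks of the same face end up together, might look like the greedy sketch below with 1-D embeddings. The distance threshold, the incremental-mean update, and the function name are all assumptions for illustration, not the paper's actual clustering procedure.

```python
def group_landmarks(embeddings, threshold=0.5):
    """Assign each landmark to the nearest existing face cluster (by
    embedding distance) or start a new cluster; returns one list of
    landmark indices per face."""
    clusters = []  # each: {"mean": running mean embedding, "members": [idx]}
    for i, e in enumerate(embeddings):
        best, best_d = None, threshold
        for c in clusters:
            d = abs(e - c["mean"])
            if d < best_d:
                best, best_d = c, d
        if best is None:
            clusters.append({"mean": float(e), "members": [i]})
        else:
            best["members"].append(i)
            n = len(best["members"])
            best["mean"] += (e - best["mean"]) / n  # incremental mean update
    return [c["members"] for c in clusters]
```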
Dual Path Networks for Multi-Person Human Pose Estimation
The task of multi-person human pose estimation in natural scenes is quite
challenging. Existing methods include both top-down and bottom-up approaches.
The main advantage of bottom-up methods is their excellent tradeoff between
estimation accuracy and computational cost. We follow this path and aim to
design smaller, faster, and more accurate neural networks for the regression
of keypoints and limb-association vectors. These two regression tasks are
naturally dependent on each other. In this work, we propose a dual-path
network specially designed for multi-person human pose estimation, and compare
its performance with the OpenPose network in terms of model size, forward
speed, and estimation accuracy.
Comment: ICCV 2017 Workshop on PoseTrack Challenge. Challenge results
available at:
https://posetrack.net/workshops/iccv2017/posetrack-challenge-results.htm
Cascade Feature Aggregation for Human Pose Estimation
Human pose estimation plays an important role in many computer vision tasks
and has been studied for decades. However, due to complex appearance
variations arising from pose, illumination, occlusion, and low resolution, it
remains a challenging problem. Taking advantage of high-level semantic
information from deep convolutional neural networks is an effective way to
improve the accuracy of human pose estimation. In this paper, we propose a
novel Cascade Feature Aggregation (CFA) method, which cascades several
hourglass networks for robust human pose estimation. Features from different
stages are aggregated to obtain abundant contextual information, leading to
robustness to pose variation, partial occlusion, and low resolution. Moreover,
results from different stages are fused to further improve localization
accuracy. Extensive experiments on the MPII and LIP datasets demonstrate that
our proposed CFA outperforms the state of the art and achieves the best
performance on the MPII benchmark.
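The cascade-plus-fusion structure can be sketched abstractly: each stage receives the input together with aggregated features from the earlier stages, and the per-stage results are fused at the end. Each stage below is a plain callable standing in for an hourglass network, and the aggregation and fusion operators (summation, averaging) are simplifying assumptions.

```python
def cascade_forward(x, stages):
    """Run cascaded stages (stand-ins for hourglass networks): each
    stage sees the input plus the aggregated features of all earlier
    stages, and the final prediction fuses the per-stage results."""
    feats, results = [], []
    for stage in stages:
        agg = x + sum(feats)  # aggregate features from earlier stages
        out = stage(agg)
        feats.append(out)
        results.append(out)
    return sum(results) / len(results)  # fuse stage results by averaging
```

The key design point this mirrors is that later stages are conditioned on earlier-stage features rather than only on the raw input, which is where the added contextual robustness comes from.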
Pose estimator and tracker using temporal flow maps for limbs
For human pose estimation in videos, how temporal information between frames
is used matters greatly. In this paper, we propose temporal flow maps for
limbs (TML) and a multi-stride method to estimate and track human poses. The
proposed temporal flow maps are unit vectors describing the limbs' movements.
We construct a network to learn both spatial and temporal information
end-to-end: spatial information such as joint heatmaps and part affinity
fields is regressed in the spatial part of the network, and the TML is
regressed in the temporal part. We also propose a data augmentation method to
better learn various types of TML. The proposed multi-stride method expands
the training data by randomly selecting two frames within a defined range. We
demonstrate that the proposed method efficiently estimates and tracks human
poses on the PoseTrack 2017 and 2018 datasets.
Comment: Won the Honorable Mention Award in the ECCV'18 PoseTrack challenge.
Accepted to the IJCNN'19 conference.
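The multi-stride sampling described above reduces to picking two frames separated by a random stride within a bound, so the network sees limb motions of varying magnitude during training. A minimal sketch; the function name and the uniform stride distribution are assumptions.

```python
import random

def sample_frame_pair(num_frames, max_stride, rng=None):
    """Pick two frame indices separated by a random stride in
    [1, max_stride], staying inside the video's frame range."""
    rng = rng or random.Random()
    stride = rng.randint(1, max_stride)
    start = rng.randint(0, num_frames - 1 - stride)
    return start, start + stride
```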