OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
Realtime multi-person 2D pose estimation is a key component in enabling
machines to have an understanding of people in images and videos. In this work,
we present a realtime approach to detect the 2D pose of multiple people in an
image. The proposed method uses a nonparametric representation, which we refer
to as Part Affinity Fields (PAFs), to learn to associate body parts with
individuals in the image. This bottom-up system achieves high accuracy and
realtime performance, regardless of the number of people in the image. In
previous work, PAFs and body part location estimation were refined
simultaneously across training stages. We demonstrate that a PAF-only
refinement rather than both PAF and body part location refinement results in a
substantial increase in both runtime performance and accuracy. We also present
the first combined body and foot keypoint detector, based on an internal
annotated foot dataset that we have publicly released. We show that the
combined detector not only reduces the inference time compared to running them
sequentially, but also maintains the accuracy of each component individually.
This work has culminated in the release of OpenPose, the first open-source
realtime system for multi-person 2D pose detection, including body, foot, hand,
and facial keypoints.
Comment: Journal version of arXiv:1611.08050, with better accuracy and faster speed; releases a new foot keypoint dataset: https://cmu-perceptual-computing-lab.github.io/foot_keypoint_dataset
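To make the part-association step concrete, here is a minimal NumPy sketch
of PAF scoring: the affinity between two body-part candidates is approximated
by sampling the predicted Part Affinity Field along the segment joining them
and averaging its dot product with the limb direction. The field shape and
the candidate coordinates are illustrative assumptions, not the released
OpenPose code.

    # Sketch of PAF association scoring; `paf` is a hypothetical
    # (H, W, 2) field for one limb type, not OpenPose's actual output.
    import numpy as np

    def paf_score(paf, p1, p2, n_samples=10):
        """Approximate the PAF line integral between candidates p1, p2."""
        p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
        v = p2 - p1
        norm = np.linalg.norm(v)
        if norm < 1e-6:
            return 0.0
        v_unit = v / norm
        score = 0.0
        for t in np.linspace(0.0, 1.0, n_samples):
            x, y = (p1 + t * v).astype(int)   # sample along the segment
            score += paf[y, x] @ v_unit       # dot with limb direction
        return score / n_samples

    # Toy field pointing along +x everywhere: candidates on a horizontal
    # segment get a score near 1.0, i.e. a strong association.
    paf = np.zeros((64, 64, 2)); paf[..., 0] = 1.0
    print(paf_score(paf, (10, 20), (40, 20)))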
Towards Real-time Eyeblink Detection in the Wild: Dataset, Theory and Practices
Effective and real-time eyeblink detection has a wide range of applications,
such as deception detection, driver fatigue detection, face anti-spoofing,
etc. Although numerous efforts have been made, most of them focus on eyeblink
detection under constrained indoor conditions with relatively consistent
subject and environment setups. For practical applications, however, eyeblink
detection in the wild is more relevant and considerably more challenging, and
to our knowledge it has not been well studied before. In this paper, we shed
light on this research topic. We first establish a labelled in-the-wild
eyeblink dataset (HUST-LEBW) of 673 eyeblink video samples (381 positives and
292 negatives). These samples are captured from unconstrained movies, with
dramatic variation in human attributes, human pose, illumination conditions,
imaging configuration, etc. We then formulate the eyeblink detection task as
a spatial-temporal pattern recognition problem. After locating and tracking
the human eye using the SeetaFace engine and a KCF tracker, respectively, we
propose a modified LSTM model that captures multi-scale temporal information
to perform eyeblink verification. We also propose a feature extraction
approach that reveals appearance and motion characteristics simultaneously.
Experiments on HUST-LEBW demonstrate the superiority and efficiency of our
approach. They also verify that existing eyeblink detection methods cannot
achieve satisfactory performance in the wild.
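As an illustration of the verification stage, the following is a minimal
sketch, not the authors' model: a plain LSTM classifier over per-frame eye
feature sequences. The feature dimension and clip length are assumptions
made for the example.

    # Hedged sketch of sequence-level eyeblink verification in PyTorch.
    import torch
    import torch.nn as nn

    class EyeblinkLSTM(nn.Module):
        def __init__(self, feat_dim=256, hidden=128):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 2)   # blink / no-blink

        def forward(self, x):                  # x: (batch, frames, feat)
            out, _ = self.lstm(x)
            return self.head(out[:, -1])       # classify from last step

    model = EyeblinkLSTM()
    clip = torch.randn(4, 16, 256)             # 4 clips of 16 frames
    print(model(clip).shape)                   # torch.Size([4, 2])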
On Face Segmentation, Face Swapping, and Face Perception
We show that even when face images are unconstrained and arbitrarily paired,
face swapping between them is actually quite simple. To this end, we make the
following contributions. (a) Instead of tailoring systems for face
segmentation, as others previously proposed, we show that a standard fully
convolutional network (FCN) can achieve remarkably fast and accurate
segmentations, provided that it is trained on a rich enough example set. For
this purpose, we describe novel data collection and generation routines which
provide challenging segmented face examples. (b) We use our segmentations to
enable robust face swapping under unprecedented conditions. (c) Unlike previous
work, our swapping is robust enough to allow for extensive quantitative tests.
To this end, we use the Labeled Faces in the Wild (LFW) benchmark and measure
the effect of intra- and inter-subject face swapping on recognition. We show
that our intra-subject swapped faces remain as recognizable as their sources,
testifying to the effectiveness of our method. In line with well-known
perceptual studies, we show that better face swapping produces less
recognizable inter-subject results. This is the first time this effect has
been quantitatively demonstrated for machine vision systems.
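For intuition, here is a minimal sketch of the compositing step only: given
an FCN-style segmentation mask for an already-aligned source face, it
alpha-blends the face region onto the target. Alignment, color correction,
and the segmentation network itself are outside the snippet, and all arrays
are illustrative.

    # Hedged compositing sketch; inputs are toy arrays, not real faces.
    import numpy as np

    def composite(source, target, mask, feather=0.8):
        """Blend the masked source face region onto the target image.

        source, target: (H, W, 3) floats in [0, 1], pre-aligned.
        mask: (H, W) face-segmentation mask in [0, 1].
        """
        alpha = (mask * feather)[..., None]   # soften the seam slightly
        return alpha * source + (1.0 - alpha) * target

    src = np.random.rand(128, 128, 3)
    tgt = np.random.rand(128, 128, 3)
    m = np.zeros((128, 128)); m[32:96, 32:96] = 1.0
    print(composite(src, tgt, m).shape)        # (128, 128, 3)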
LiveCap: Real-time Human Performance Capture from Monocular Video
We present the first real-time human performance capture approach that
reconstructs dense, space-time coherent deforming geometry of entire humans in
general everyday clothing from just a single RGB video. We propose a novel
two-stage analysis-by-synthesis optimization whose formulation and
implementation are designed for high performance. In the first stage, a skinned
template model is jointly fitted to the background-subtracted input video, 2D and
3D skeleton joint positions found using a deep neural network, and a set of
sparse facial landmark detections. In the second stage, dense non-rigid 3D
deformations of skin and even loose apparel are captured based on a novel
real-time capable algorithm for non-rigid tracking using dense photometric and
silhouette constraints. Our novel energy formulation leverages automatically
identified material regions on the template to model the differing non-rigid
deformation behavior of skin and apparel. The two resulting non-linear
optimization problems per frame are solved with specially tailored
data-parallel Gauss-Newton solvers. To achieve real-time performance of over
25 Hz, we design a pipelined parallel architecture using the CPU and two
commodity GPUs. Our method is the first real-time monocular approach for
full-body performance capture. It yields accuracy comparable to off-line
performance capture techniques while being orders of magnitude faster.
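To illustrate the kind of solver the pipeline builds on, below is a generic
NumPy Gauss-Newton iteration on a toy curve-fitting problem. The paper's
solvers are data-parallel GPU implementations of the same normal-equations
update at far larger scale; this sketch only shows the update itself.

    # Generic Gauss-Newton: solve (J^T J) dx = -J^T r, then x <- x + dx.
    import numpy as np

    def gauss_newton(residual_fn, jacobian_fn, x0, iters=10):
        x = np.asarray(x0, float)
        for _ in range(iters):
            r = residual_fn(x)
            J = jacobian_fn(x)
            dx = np.linalg.solve(J.T @ J, -J.T @ r)
            x = x + dx
        return x

    # Toy problem: fit y = a * exp(b * t) to noiseless samples.
    t = np.linspace(0, 1, 20)
    y = 2.0 * np.exp(1.5 * t)
    res = lambda p: p[0] * np.exp(p[1] * t) - y
    jac = lambda p: np.stack([np.exp(p[1] * t),
                              p[0] * t * np.exp(p[1] * t)], axis=1)
    print(gauss_newton(res, jac, [1.0, 1.0]))  # approx. [2.0, 1.5]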
Multimodal Polynomial Fusion for Detecting Driver Distraction
Distracted driving is deadly, claiming 3,477 lives in the U.S. in 2015 alone.
Although there has been a considerable amount of research on modeling the
distracted behavior of drivers under various conditions, accurate automatic
detection using multiple modalities, and especially the contribution of the
speech modality to improving accuracy, has received little attention. This
paper introduces a new multimodal dataset for distracted driving behavior and
discusses automatic distraction detection using features from three modalities:
facial expression, speech and car signals. Detailed multimodal feature analysis
shows that adding more modalities monotonically increases the predictive
accuracy of the model. Finally, a simple and effective multimodal fusion
technique using a polynomial fusion layer shows superior distraction detection
results compared to the baseline SVM and neural network models.
Comment: INTERSPEECH 201
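As a rough illustration of the fusion idea, here is a hedged degree-2
polynomial fusion layer in PyTorch: each modality is projected to a shared
space, and unary plus pairwise product terms are summed before
classification. This follows the general idea only, not the paper's exact
formulation; all dimensions are assumptions.

    # Hedged polynomial fusion sketch; feature sizes are made up.
    import torch
    import torch.nn as nn

    class PolynomialFusion(nn.Module):
        def __init__(self, dims, hidden=64, n_classes=2):
            super().__init__()
            self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, feats):              # one tensor per modality
            hs = [p(f) for p, f in zip(self.proj, feats)]
            fused = sum(hs)                    # degree-1 terms
            for i in range(len(hs)):
                for j in range(i + 1, len(hs)):
                    fused = fused + hs[i] * hs[j]  # degree-2 cross terms
            return self.head(fused)

    # Face (128-d), speech (64-d), and car-signal (16-d) features:
    model = PolynomialFusion([128, 64, 16])
    out = model([torch.randn(8, 128), torch.randn(8, 64),
                 torch.randn(8, 16)])
    print(out.shape)                           # torch.Size([8, 2])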
Towards Fine-grained Human Pose Transfer with Detail Replenishing Network
Human pose transfer (HPT) is an emerging research topic with huge potential
in fashion design, media production, online advertising and virtual reality.
For these applications, the visual realism of fine-grained appearance details
is crucial for production quality and user engagement. However, existing HPT
methods often suffer from three fundamental issues: detail deficiency, content
ambiguity and style inconsistency, which severely degrade the visual quality
and realism of generated images. Aiming towards real-world applications, we
develop a more challenging yet practical HPT setting, termed Fine-grained
Human Pose Transfer (FHPT), with a higher focus on semantic fidelity and
detail replenishment. Concretely, we analyze the potential design flaws of
existing methods via an illustrative example, and establish the core FHPT
methodology by combining the ideas of content synthesis and feature transfer
in a mutually-guided fashion. Thereafter, we substantiate the proposed methodology
with a Detail Replenishing Network (DRN) and a corresponding coarse-to-fine
model training scheme. Moreover, we build up a complete suite of fine-grained
evaluation protocols to address the challenges of FHPT in a comprehensive
manner, including semantic analysis, structural detection and perceptual
quality assessment. Extensive experiments on the DeepFashion benchmark
dataset verify the power of the proposed approach against state-of-the-art
works, with a 12%-14% gain in top-10 retrieval recall, 5% higher joint
localization accuracy, and a nearly 40% gain in face identity preservation.
Moreover, the evaluation results offer further insights into the subject
matter, which could inspire many promising future works along this direction.
Comment: IEEE TIP submission
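To make the mutually-guided design concrete, the sketch below, which is not
the authors' DRN, pairs a content-synthesis stream conditioned on source
appearance and target-pose heatmaps with a detail stream that replenishes
fine appearance. Channel counts and the 18-heatmap pose encoding are
illustrative assumptions.

    # Hedged two-stream sketch: coarse synthesis + detail replenishment.
    import torch
    import torch.nn as nn

    class DetailReplenish(nn.Module):
        def __init__(self, ch=32):
            super().__init__()
            self.synth = nn.Sequential(        # content synthesis
                nn.Conv2d(3 + 18, ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, 3, 3, padding=1))
            self.detail = nn.Sequential(       # detail transfer
                nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, 3, 3, padding=1))

        def forward(self, src_img, target_pose):
            coarse = self.synth(torch.cat([src_img, target_pose], dim=1))
            return coarse + self.detail(src_img)  # replenish fine detail

    model = DetailReplenish()
    src = torch.randn(2, 3, 64, 64)            # source appearance
    pose = torch.randn(2, 18, 64, 64)          # 18 keypoint heatmaps
    print(model(src, pose).shape)              # torch.Size([2, 3, 64, 64])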
Unconstrained Fashion Landmark Detection via Hierarchical Recurrent Transformer Networks
Fashion landmarks are functional key points defined on clothes, such as
corners of neckline, hemline, and cuff. They have been recently introduced as
an effective visual representation for fashion image understanding. However,
detecting fashion landmarks is challenging due to background clutter, human
poses, and scales. To remove the above variations, previous works usually
assumed that bounding boxes of clothes are provided in training and test as
additional annotations, which are expensive to obtain and inapplicable in
practice. This work addresses unconstrained fashion landmark detection, where
clothing bounding boxes are provided in neither training nor test. To this
end, we present a novel Deep LAndmark Network (DLAN), where bounding boxes and
landmarks are jointly estimated and trained iteratively in an end-to-end
manner. DLAN contains two dedicated modules, including a Selective Dilated
Convolution for handling scale discrepancies, and a Hierarchical Recurrent
Spatial Transformer for handling background clutter. To evaluate DLAN, we
present a large-scale fashion landmark dataset, namely Unconstrained Landmark
Database (ULD), consisting of 30K images. Statistics show that ULD is more
challenging than existing datasets in terms of image scales, background
clutter, and human poses. Extensive experiments demonstrate the effectiveness
of DLAN over the state-of-the-art methods. DLAN also exhibits excellent
generalization across different clothing categories and modalities, making it
extremely suitable for real-world fashion analysis.
Comment: To appear in ACM Multimedia (ACM MM) 2017 as a full research paper. More details at the project page: http://personal.ie.cuhk.edu.hk/~lz013/projects/UnconstrainedLandmarks.htm
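As a sketch of one of the two modules, the following illustrates a selective
dilated convolution in the spirit described above: parallel 3x3 convolutions
at several dilation rates, blended by a learned, input-dependent soft
selection over scales. It illustrates the idea only, not the paper's exact
module; the dilation rates are assumptions.

    # Hedged sketch of scale selection across dilated conv branches.
    import torch
    import torch.nn as nn

    class SelectiveDilatedConv(nn.Module):
        def __init__(self, ch, rates=(1, 2, 4)):
            super().__init__()
            self.branches = nn.ModuleList(
                nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates)
            self.gate = nn.Sequential(         # per-image scale weights
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(ch, len(rates)), nn.Softmax(dim=1))

        def forward(self, x):
            w = self.gate(x)                   # (batch, n_rates)
            outs = [b(x) for b in self.branches]
            return sum(w[:, i, None, None, None] * o
                       for i, o in enumerate(outs))

    x = torch.randn(2, 16, 32, 32)
    print(SelectiveDilatedConv(16)(x).shape)   # torch.Size([2, 16, 32, 32])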
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents the futuristic challenges discussed in the
cvpaper.challenge. In 2015 and 2016, we thoroughly studied 1,600+ papers in
several conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
Fingertip in the Eye: A cascaded CNN pipeline for the real-time fingertip detection in egocentric videos
We introduce a new pipeline for hand localization and fingertip detection.
For RGB images captured from an egocentric vision mobile camera, hand and
fingertip detection remains a challenging problem due to factors like
background complexity and hand shape variety. To address these issues
accurately and robustly, we build a large scale dataset named Ego-Fingertip and
propose a bi-level cascaded pipeline of convolutional neural networks, namely,
an Attention-based Hand Detector and a Multi-point Fingertip Detector. The
proposed method effectively addresses these challenges, achieving accurate
predictions and real-time performance compared to previous hand and fingertip
detection methods.
Comment: 5 pages, 8 figures
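The cascade structure can be sketched as follows. The hand-detection stage
is stubbed out with a fixed box, and the fingertip regressor is a
hypothetical stand-in; the point is only to show how the two stages chain.

    # Hedged two-stage cascade sketch; both stages are placeholders.
    import torch
    import torch.nn as nn

    # Placeholder regressor: a 64x64 RGB crop -> 5 (x, y) fingertips.
    fingertip_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))

    def detect_fingertips(frame):
        # Stage 1 (hand detection) is a fixed dummy box here; the paper
        # uses an attention-based CNN detector for this step.
        x0, y0, x1, y1 = 32, 32, 96, 96
        crop = frame[:, :, y0:y1, x0:x1]       # (1, 3, 64, 64)
        # Stage 2: regress fingertips in the crop, then map the points
        # back into full-frame coordinates.
        pts = fingertip_net(crop).view(-1, 5, 2)
        return pts + torch.tensor([x0, y0], dtype=pts.dtype)

    frame = torch.randn(1, 3, 128, 128)
    print(detect_fingertips(frame).shape)      # torch.Size([1, 5, 2])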
A Taxonomy of Deep Convolutional Neural Nets for Computer Vision
Traditional architectures for solving computer vision problems and the degree
of success they enjoyed have been heavily reliant on hand-crafted features.
However, of late, deep learning techniques have offered a compelling
alternative -- that of automatically learning problem-specific features. With
this new paradigm, every problem in computer vision is now being re-examined
from a deep learning perspective. Therefore, it has become important to
understand what kind of deep networks are suitable for a given problem.
Although general surveys of this fast-moving paradigm (i.e., deep networks)
exist, a survey specific to computer vision is missing. We specifically
consider one form of deep networks widely used in computer vision -
convolutional neural networks (CNNs). We start with "AlexNet" as our base CNN
and then examine the broad variations proposed over time to suit different
applications. We hope that our recipe-style survey will serve as a guide,
particularly for novice practitioners intending to use deep-learning techniques
for computer vision.
Comment: Published in Frontiers in Robotics and AI (http://goo.gl/6691Bm)
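Since the survey takes AlexNet as its base CNN, a quick way to inspect that
skeleton, assuming torchvision is available, is shown below; the variants the
survey covers can be read as modifications of these stages.

    # Instantiate an untrained AlexNet and run one dummy image through it.
    import torch
    from torchvision.models import alexnet

    model = alexnet()                          # untrained base CNN
    x = torch.randn(1, 3, 224, 224)            # ImageNet-sized RGB input
    print(model(x).shape)                      # torch.Size([1, 1000])
    print(model.features)                      # the convolutional stages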