Human Pose Regression by Combining Indirect Part Detection and Contextual Information
In this paper, we propose an end-to-end trainable regression approach for
human pose estimation from still images. We use the proposed Soft-argmax
function to convert feature maps directly to joint coordinates, resulting in a
fully differentiable framework. Our method is able to learn heat map
representations indirectly, without additional steps of artificial ground truth
generation. Consequently, contextual information can be incorporated into the pose
predictions in a seamless way. We evaluated our method on two very challenging
datasets, the Leeds Sports Poses (LSP) and the MPII Human Pose datasets,
reaching the best performance among all existing regression methods and
results comparable to state-of-the-art detection-based approaches.
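As a rough illustration of how a soft-argmax can turn a feature map into joint coordinates while staying differentiable, the sketch below computes the expected pixel location under a spatial softmax. It is a generic formulation, not necessarily the exact function proposed in the paper; the function name `soft_argmax_2d` and the sharpness parameter `beta` are assumptions made for this example.

```python
import math

def soft_argmax_2d(heatmap, beta=1.0):
    """Expected (x, y) pixel location under a spatial softmax of `heatmap`.

    `heatmap` is a list of rows of floats; `beta` (hypothetical parameter)
    controls how sharply the softmax concentrates on the maximum.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(v for row in heatmap for v in row)
    weights = [[math.exp(beta * (v - peak)) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)

    # Probability-weighted average of column indices (x) and row indices (y).
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y

# Toy 3x3 map with one strong response at row 1, column 2.
toy_map = [[0.0, 0.0, 0.0],
           [0.0, 0.0, 10.0],
           [0.0, 0.0, 0.0]]
print(soft_argmax_2d(toy_map))
```

For the toy map the result lies close to (2, 1), the location of the peak, and a flat map yields the grid centre. Because every step is a smooth function of the input, gradients of a coordinate loss can flow back into the map.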
DeCaFA: Deep Convolutional Cascade for Face Alignment In The Wild
Face alignment is an active computer vision domain that consists of
localizing a number of facial landmarks that vary across datasets.
State-of-the-art face alignment methods consist either of end-to-end
regression or of refining the shape in a cascaded manner, starting from an
initial guess. In this paper, we introduce DeCaFA, an end-to-end deep
convolutional cascade architecture for face alignment. DeCaFA uses
fully-convolutional stages to keep full spatial resolution throughout the
cascade. Between each cascade stage, DeCaFA uses multiple chained transfer
layers with spatial softmax to produce landmark-wise attention maps for each of
several landmark alignment tasks. Weighted intermediate supervision and
efficient feature fusion between the stages allow the network to learn to
progressively refine the attention maps in an end-to-end manner. We show experimentally that
DeCaFA significantly outperforms existing approaches on 300W, CelebA and WFLW
databases. In addition, we show that DeCaFA can learn fine alignment with
reasonable accuracy from very few images using coarsely annotated data.
When 3D-Aided 2D Face Recognition Meets Deep Learning: An extended UR2D for Pose-Invariant Face Recognition
Most face recognition work focuses on specific modules or demonstrates a
research idea. This paper presents a pose-invariant 3D-aided 2D face
recognition system (UR2D) that is robust to pose variations as large as 90° by
leveraging deep learning technology. The architecture and the interface of UR2D
are described, and each module is introduced in detail. Extensive experiments
are conducted on the UHDB31 and IJB-A datasets, demonstrating that UR2D outperforms
existing 2D face recognition systems such as VGG-Face, FaceNet, and
commercial off-the-shelf (COTS) software by at least 9% on the UHDB31 dataset
and 3% on the IJB-A dataset on average in face identification tasks. UR2D also
achieves state-of-the-art performance of 85% on the IJB-A dataset by comparing
the Rank-1 accuracy score from template matching. It fills a gap by providing a
3D-aided 2D face recognition system that achieves results comparable to 2D face
recognition systems using deep learning techniques.
Comment: Submitted to Special Issue on Biometrics in the Wild, Image and
Vision Computing
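To make the Rank-1 identification metric mentioned above concrete, the following sketch counts how often the best-scoring gallery template carries the probe's true identity. It assumes a closed-set setting with a precomputed similarity matrix; the function name, the labels, and the scores are made up for illustration and are not taken from the paper.

```python
def rank1_accuracy(scores, probe_labels, gallery_labels):
    """Fraction of probes whose top-scoring gallery entry has the right label.

    `scores[i][j]` is the similarity between probe i and gallery template j
    (higher means more similar).
    """
    hits = 0
    for row, true_label in zip(scores, probe_labels):
        # Index of the gallery template with the highest similarity.
        best = max(range(len(row)), key=row.__getitem__)
        if gallery_labels[best] == true_label:
            hits += 1
    return hits / len(probe_labels)

# Made-up similarity matrix: 3 probes against 3 gallery templates.
toy_scores = [[0.9, 0.1, 0.2],
              [0.3, 0.2, 0.8],
              [0.6, 0.5, 0.1]]
print(rank1_accuracy(toy_scores, ["a", "b", "c"], ["a", "b", "c"]))
```

In this toy case only the first probe is matched to its own identity, so the score is one third.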
A vision based system for underwater docking
Autonomous underwater vehicles (AUVs) have been deployed for underwater
exploration. However, their potential is constrained by limited on-board battery
energy and data storage capacity. This problem has been addressed by docking
systems that provide underwater recharging and data transfer for AUVs. In this work, we
propose a vision-based framework for underwater docking with such
systems. The proposed framework comprises two modules: (i) a detection module
which provides location information on underwater docking stations in 2D images
captured by an on-board camera, and (ii) a pose estimation module which
recovers the relative 3D position and orientation between docking stations and
AUVs from the 2D images. For robust and credible detection of docking stations,
we propose a convolutional neural network called Docking Neural Network (DoNN).
For accurate pose estimation, a perspective-n-point algorithm is integrated
into our framework. In order to examine our framework in underwater docking
tasks, we collected a dataset of 2D images, named Underwater Docking Images
Dataset (UDID), in an experimental water pool. To the best of our knowledge,
UDID is the first publicly available underwater docking dataset. In the
experiments, we first evaluate the performance of the proposed detection module on
UDID and its deformed variations. Next, we assess the accuracy of the pose
estimation module in ground experiments, since it is not feasible to obtain
the true relative position and orientation between docking stations and AUVs under
water. Then, we examine the pose estimation module in underwater experiments in
our experimental water pool. Experimental results show that the proposed
framework can be used to detect docking stations and estimate their relative
pose efficiently and successfully, compared to the state-of-the-art baseline
systems.
Joint Multi-Person Pose Estimation and Semantic Part Segmentation
Human pose estimation and semantic part segmentation are two complementary
tasks in computer vision. In this paper, we propose to solve the two tasks
jointly for natural multi-person images, in which the estimated pose provides
an object-level shape prior to regularize part segments while the part-level
segments constrain the variation of pose locations. Specifically, we first
train two fully convolutional neural networks (FCNs), namely Pose FCN and Part
FCN, to provide initial estimates of the pose joint potential and semantic part
potential. Then, to refine pose joint location, the two types of potentials are
fused with a fully-connected conditional random field (FCRF), where a novel
segment-joint smoothness term is used to encourage semantic and spatial
consistency between parts and joints. To refine part segments, the refined pose
and the original part potential are integrated through a Part FCN, where the
skeleton feature from pose serves as additional regularization cues for part
segments. Finally, to reduce the complexity of the FCRF, we induce human
detection boxes and infer the graph inside each box, making the inference forty
times faster.
Since no dataset contains both part segments and pose labels, we
extend the PASCAL VOC part dataset with human pose joints and perform extensive
experiments to compare our method against several recent strategies. We
show that on this dataset our algorithm surpasses competing methods by a large
margin in both tasks.
Comment: This paper has been accepted by CVPR 201
Real-time Facial Expression Recognition "In The Wild" by Disentangling 3D Expression from Identity
Human emotion analysis has been the focus of many studies, especially in the
field of Affective Computing, and is important for many applications, e.g.
human-computer intelligent interaction, stress analysis, interactive games,
animations, etc. Solutions for automatic emotion analysis have also benefited
from the development of deep learning approaches and the availability of vast
amounts of visual facial data on the internet. This paper proposes a novel
method for human emotion recognition from a single RGB image. We construct a
large-scale dataset of facial videos (FaceVid), rich in facial
dynamics, identities, expressions, appearance and 3D pose variations. We use
this dataset to train a deep Convolutional Neural Network for estimating
expression parameters of a 3D Morphable Model and combine it with an effective
back-end emotion classifier. Our proposed framework runs at 50 frames per
second and is capable of robustly estimating parameters of 3D expression
variation and accurately recognizing facial expressions from in-the-wild
images. We present extensive experimental evaluation that shows that the
proposed method outperforms the compared techniques in estimating the 3D
expression parameters and achieves state-of-the-art performance in recognising
the basic emotions from facial images, as well as recognising stress from
facial videos.
Comment: to be published in 15th IEEE International Conference on Automatic
Face and Gesture Recognition (FG 2020)
Never Mind the Bounding Boxes, Here's the SAND Filters
Perception is the main bottleneck in performing autonomous mobile manipulation
tasks, especially in cluttered and unstructured environments. In this paper, we
propose a novel two-stage paradigm that leverages both a CNN object prior and
generative sampling to perform object detection and 6D pose estimation. Our
two-stage approach builds upon both a CNN and a generative sampling-based local
search method to sample the network density, which we call the SAND filter. We
show quantitative results indicating that SAND effectively improves object detection
by reducing false positive and false negative recognitions, and further
produces accurate pose estimates. We also conduct extensive categorical object
sorting experiments to show that our method is able to produce accurate and reliable
detections and object poses.
Compositional Human Pose Regression
Regression-based methods do not perform as well as detection-based
methods for human pose estimation. A central problem is that the structural
information in the pose is not well exploited in previous regression
methods. In this work, we propose a structure-aware regression approach. It
adopts a reparameterized pose representation using bones instead of joints. It
exploits the joint connection structure to define a compositional loss function
that encodes the long range interactions in the pose. It is simple, effective,
and general for both 2D and 3D pose estimation in a unified setting.
Comprehensive evaluation validates the effectiveness of our approach. It
significantly advances the state-of-the-art on Human3.6M and is competitive
with state-of-the-art results on MPII.
Comment: Accepted by International Conference on Computer Vision (ICCV) 201
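The bone-based reparameterization described above can be sketched as follows: each joint is stored as an offset from its parent, and the relative position of any joint with respect to an ancestor is recovered by summing bones along the connecting path, which is the kind of long-range quantity a compositional loss can supervise. The four-joint chain, the `PARENTS` table, and all function names are hypothetical and chosen only for this example; this is not the paper's implementation.

```python
# Hypothetical kinematic chain: parent index of each joint, -1 marks the root.
PARENTS = [-1, 0, 1, 2]

def joints_to_bones(joints, parents):
    """Represent each joint as an offset (bone) from its parent joint."""
    bones = []
    for k, p in enumerate(parents):
        if p < 0:
            bones.append(tuple(joints[k]))  # root keeps its absolute position
        else:
            bones.append(tuple(a - b for a, b in zip(joints[k], joints[p])))
    return bones

def bones_to_joints(bones, parents):
    """Invert joints_to_bones; assumes every parent index precedes its child."""
    joints = [None] * len(bones)
    for k, p in enumerate(parents):
        if p < 0:
            joints[k] = tuple(bones[k])
        else:
            joints[k] = tuple(a + b for a, b in zip(joints[p], bones[k]))
    return joints

def relative_position(bones, parents, joint, ancestor):
    """Position of `joint` relative to `ancestor`, summed along the bone path."""
    total = [0] * len(bones[joint])
    k = joint
    while k != ancestor:
        if parents[k] < 0:
            raise ValueError("`ancestor` is not on the path to the root")
        total = [t + b for t, b in zip(total, bones[k])]
        k = parents[k]
    return tuple(total)

toy_joints = [(0, 0), (1, 0), (1, 2), (3, 2)]
toy_bones = joints_to_bones(toy_joints, PARENTS)
print(toy_bones)
print(relative_position(toy_bones, PARENTS, 3, 1))
```

A loss defined on such summed bone paths penalizes errors that accumulate along the chain, rather than treating each joint or bone independently.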
Orientation Driven Bag of Appearances for Person Re-identification
Person re-identification (re-id) consists of associating individuals across
a camera network, which is valuable for intelligent video surveillance and has
drawn wide attention. Although person re-identification research is making
progress, it still faces some challenges such as varying poses, illumination
and viewpoints. For feature representation in re-identification, existing works
usually use low-level descriptors which do not take full advantage of body
structure information, resulting in low representation ability.
To solve this problem, this paper proposes the mid-level
body-structure based feature representation (BSFR), which introduces a body
structure pyramid for codebook learning and feature pooling in the vertical
direction of the human body. Besides, varying viewpoints in the horizontal
direction of the human body usually cause the missing-data problem, i.e., the
appearances obtained in different orientations of the same person can
vary significantly. To address this problem, the orientation driven bag of
appearances (ODBoA) is proposed to utilize person orientation information
extracted by an orientation estimation technique. To properly evaluate the proposed
approach, we introduce a new re-identification dataset (Market-1203) based on
the Market-1501 dataset and propose another new re-identification dataset (PKU-Reid).
Both datasets contain multiple images captured in different body orientations
for each person. Experimental results on three public datasets and two proposed
datasets demonstrate the superiority of the proposed approach, indicating the
effectiveness of body structure and orientation information for improving
re-identification performance.
Comment: 13 pages, 15 figures, 3 tables, submitted to IEEE Transactions on
Circuits and Systems for Video Technology
Deep Facial Expression Recognition: A Survey
With the transition of facial expression recognition (FER) from
laboratory-controlled to challenging in-the-wild conditions and the recent
success of deep learning techniques in various fields, deep neural networks
have increasingly been leveraged to learn discriminative representations for
automatic FER. Recent deep FER systems generally focus on two important issues:
overfitting caused by a lack of sufficient training data and
expression-unrelated variations, such as illumination, head pose and identity
bias. In this paper, we provide a comprehensive survey on deep FER, including
datasets and algorithms that provide insights into these intrinsic problems.
First, we describe the standard pipeline of a deep FER system with the related
background knowledge and suggestions of applicable implementations for each
stage. We then introduce the available datasets that are widely used in the
literature and provide accepted data selection and evaluation principles for
these datasets. For the state of the art in deep FER, we review existing novel
deep neural networks and related training strategies that are designed for FER
based on both static images and dynamic image sequences, and discuss their
advantages and limitations. Competitive performances on widely used benchmarks
are also summarized in this section. We then extend our survey to additional
related issues and application scenarios. Finally, we review the remaining
challenges and corresponding opportunities in this field as well as future
directions for the design of robust deep FER systems.