cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
This paper presents the futuristic challenges discussed in cvpaper.challenge. In
2015 and 2016, we thoroughly studied 1,600+ papers from several
conferences and journals, including CVPR, ICCV, ECCV, NIPS, PAMI, and IJCV.
Deep Facial Expression Recognition: A Survey
With the transition of facial expression recognition (FER) from
laboratory-controlled to challenging in-the-wild conditions and the recent
success of deep learning techniques in various fields, deep neural networks
have increasingly been leveraged to learn discriminative representations for
automatic FER. Recent deep FER systems generally focus on two important issues:
overfitting caused by a lack of sufficient training data and
expression-unrelated variations, such as illumination, head pose and identity
bias. In this paper, we provide a comprehensive survey on deep FER, including
datasets and algorithms that provide insights into these intrinsic problems.
First, we describe the standard pipeline of a deep FER system with the related
background knowledge and suggestions of applicable implementations for each
stage. We then introduce the available datasets that are widely used in the
literature and provide accepted data selection and evaluation principles for
these datasets. For the state of the art in deep FER, we review existing novel
deep neural networks and related training strategies that are designed for FER
based on both static images and dynamic image sequences, and discuss their
advantages and limitations. Competitive performances on widely used benchmarks
are also summarized in this section. We then extend our survey to additional
related issues and application scenarios. Finally, we review the remaining
challenges and corresponding opportunities in this field, as well as future
directions for the design of robust deep FER systems.
Detailed 2D-3D Joint Representation for Human-Object Interaction
Human-Object Interaction (HOI) detection lies at the core of action
understanding. Besides 2D information such as human/object appearance and
locations, 3D pose is also commonly utilized in HOI learning owing to its
view independence. However, coarse 3D body joints carry only sparse body
information and are not sufficient for understanding complex interactions. Thus,
detailed 3D body shape is needed to go further. Meanwhile, the 3D representation
of the interacted object has also not been fully studied in HOI learning. In
light of these observations, we propose a
detailed 2D-3D joint representation learning method. First, we utilize the
single-view human body capture method to obtain detailed 3D body, face and hand
shapes. Next, we estimate the 3D object location and size with reference to the
2D human-object spatial configuration and object category priors. Finally, a
joint learning framework and cross-modal consistency tasks are proposed to
learn the joint HOI representation. To better evaluate the 2D ambiguity
processing capacity of models, we propose a new benchmark named Ambiguous-HOI
consisting of hard ambiguous images. Extensive experiments on a large-scale HOI
benchmark and on Ambiguous-HOI show the effectiveness of our method. Code
and data are available at https://github.com/DirtyHarryLYL/DJ-RN.
Comment: Accepted to CVPR 2020, supplementary materials included, code
available: https://github.com/DirtyHarryLYL/DJ-RN
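The second step of the abstract above, estimating 3D object location and size from the 2D detection and a category size prior, can be sketched with a simple pinhole camera model. This is an illustrative assumption, not the paper's exact formulation (DJ-RN also exploits the 2D human-object spatial configuration); the function name and parameters are hypothetical.

```python
def estimate_object_3d(box_2d, real_size_m, focal_px, cx, cy):
    """Back-project a 2D detection box to a rough 3D location using a
    pinhole camera and a category size prior (similar-triangles depth).

    box_2d      : (x1, y1, x2, y2) in pixels
    real_size_m : prior physical size of the object category, metres
    focal_px    : focal length in pixels; cx, cy: principal point
    """
    x1, y1, x2, y2 = box_2d
    pixel_size = max(x2 - x1, y2 - y1)           # apparent size in pixels
    depth = focal_px * real_size_m / pixel_size  # depth from size prior
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0      # box centre
    # Back-project the centre ray to the estimated depth.
    x = (u - cx) * depth / focal_px
    y = (v - cy) * depth / focal_px
    return (x, y, depth), real_size_m

# A 100-px-wide object with a 0.25 m size prior, f = 500 px -> 1.25 m away.
center, size = estimate_object_3d((300, 200, 400, 300), 0.25, 500.0, 320.0, 240.0)
```

The size prior is what makes a single view sufficient: without it, scale and depth are inseparable in a monocular image.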
Gait Recognition via Disentangled Representation Learning
Gait, the walking pattern of individuals, is one of the most important
biometrics modalities. Most of the existing gait recognition methods take
silhouettes or articulated body models as the gait features. These methods
suffer from degraded recognition performance when handling confounding
variables such as clothing, carrying conditions, and view angle. To remedy this
issue, we propose a novel autoencoder framework that explicitly disentangles
pose and appearance features from RGB imagery; LSTM-based integration of the
pose features over time then produces the gait feature. In addition, we collect a
Frontal-View Gait (FVG) dataset to focus on gait recognition from frontal-view
walking, which is a challenging problem since it contains minimal gait cues
compared to other views. FVG also includes other important variations, e.g.,
walking speed, carrying, and clothing. With extensive experiments on CASIA-B,
USF, and FVG datasets, our method demonstrates superior performance to the state
of the art quantitatively, the ability to disentangle features qualitatively,
and promising computational efficiency.
Comment: To appear at CVPR 2019 as an oral presentation
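The aggregation idea in the abstract above, per-frame pose features integrated over time into a single gait feature that can be matched across sequences, can be sketched as follows. The paper uses an LSTM for the temporal integration; mean pooling stands in for it here, and all feature values are toy data.

```python
import math

def gait_feature(pose_seq):
    """Aggregate per-frame pose feature vectors into one gait feature.
    (The paper integrates with an LSTM; mean pooling is a stand-in.)"""
    dim = len(pose_seq[0])
    return [sum(f[i] for f in pose_seq) / len(pose_seq) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity, the usual matching score between embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Two walking sequences of the same subject: per-frame 3-D "pose features".
seq_a = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3], [1.1, -0.1, 0.1]]
seq_b = [[1.0, 0.05, 0.2], [1.0, 0.0, 0.25]]
sim = cosine(gait_feature(seq_a), gait_feature(seq_b))  # close to 1.0
```

Because only the pose branch feeds the gait feature, appearance confounders such as clothing are excluded from the match by construction.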
Towards the Design of an End-to-End Automated System for Image and Video-based Recognition
Over many decades, researchers working in object recognition have longed for
an end-to-end automated system that will simply accept 2D or 3D image or videos
as inputs and output the labels of objects in the input data. Two families of
approaches have competed on image recognition problems: computer vision methods
that pair representations derived from geometric, radiometric, and neural
considerations with statistical and structural matchers, and artificial neural
network-based methods in which a multi-layer network learns the mapping from
inputs to class labels. Over the last four years, methods based on Deep
Convolutional Neural Networks (DCNNs) have shown impressive performance
improvements on object detection/recognition challenge problems. This has been
made possible due to the availability of large annotated data, a better
understanding of the non-linear mapping between image and class labels as well
as the affordability of GPUs. In this paper, we present a brief history of
developments in computer vision and artificial neural networks over the last
forty years for the problem of image-based recognition. We then present the
design details of a deep learning system for end-to-end unconstrained face
verification/recognition. Some open issues regarding DCNNs for object
recognition problems are then discussed. We caution the readers that the views
expressed in this paper are from the authors and authors only!Comment: 7 page
Pedestrian Alignment Network for Large-scale Person Re-identification
Person re-identification (person re-ID) is mostly viewed as an image
retrieval problem. This task aims to search for a query person in a large image
pool. In practice, person re-ID usually adopts automatic detectors to obtain
cropped pedestrian images. However, this process suffers from two types of
detector errors: excessive background and part missing. Both errors deteriorate
the quality of pedestrian alignment and may compromise pedestrian matching due
to the position and scale variances. To address the misalignment problem, we
propose that alignment can be learned from an identification procedure. We
introduce the pedestrian alignment network (PAN) which allows discriminative
embedding learning and pedestrian alignment without extra annotations. Our key
observation is that when the convolutional neural network (CNN) learns to
discriminate between different identities, the learned feature maps usually
exhibit strong activations on the human body rather than the background. The
proposed network thus takes advantage of this attention mechanism to adaptively
locate and align pedestrians within a bounding box. Visual examples show that
pedestrians are better aligned with PAN. Experiments on three large-scale re-ID
datasets confirm that PAN improves the discriminative ability of the feature
embeddings and yields accuracy competitive with state-of-the-art methods.
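The key observation in the abstract above is that identity-discriminative feature maps activate strongly on the body rather than the background, and that this can drive re-alignment. The sketch below only recovers the bounding box implied by a thresholded activation map; PAN itself predicts an affine transform, and the threshold and toy map here are assumptions.

```python
def locate_from_activation(act_map, thresh=0.5):
    """Tight bounding box of strong activations in a 2-D map.
    Returns (x1, y1, x2, y2) in map coordinates, or None if empty."""
    rows = [r for r, row in enumerate(act_map)
            if any(v >= thresh for v in row)]
    cols = [c for c in range(len(act_map[0]))
            if any(row[c] >= thresh for row in act_map)]
    if not rows or not cols:
        return None
    return (min(cols), min(rows), max(cols), max(rows))

# Toy 4x4 activation map: the "pedestrian" occupies the central 2x2 block.
amap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.8, 0.9, 0.1],
    [0.0, 0.7, 0.6, 0.2],
    [0.0, 0.1, 0.2, 0.1],
]
box = locate_from_activation(amap)  # -> (1, 1, 2, 2)
```

Cropping the input to this box and re-running the embedding network is the intuition behind learning alignment from identification, with no extra alignment annotations needed.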
Learning to Infer the Depth Map of a Hand from its Color Image
We propose the first approach to the problem of inferring the depth map of a
human hand based on a single RGB image. We achieve this with a Convolutional
Neural Network (CNN) that employs a stacked hourglass model as its main
building block. Intermediate supervision is used in several outputs of the
proposed architecture in a staged approach. To aid the process of training and
inference, hand segmentation masks are also estimated in such an intermediate
supervision step, and used to guide the subsequent depth estimation process. In
order to train and evaluate the proposed method we compile and make publicly
available HandRGBD, a new dataset of 20,601 views of hands, each consisting of
an RGB image and an aligned depth map. Based on HandRGBD, we explore variants
of the proposed approach in an ablative study and determine the best performing
one. The results of an extensive experimental evaluation demonstrate that hand
depth estimation from a single RGB frame can be achieved with an accuracy of
22mm, which is comparable to the accuracy achieved by contemporary low-cost
depth cameras. Such a 3D reconstruction of hands from RGB information is
valuable not only as a final result in its own right, but also as an input to
several other hand analysis and perception algorithms that require depth input.
Essentially, in this context, the proposed approach bridges the gap between
RGB and RGBD by making all existing RGBD-based methods applicable to RGB
input.
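The 22 mm accuracy figure in the abstract above is naturally computed only over hand pixels, which is where the intermediate segmentation mask comes in. Below is a hedged sketch of such a masked depth-error metric; the exact evaluation protocol of the paper is not specified here, and the toy depth values are invented.

```python
def masked_depth_error_mm(pred, gt, mask):
    """Mean absolute depth error in mm, restricted to mask pixels.
    pred, gt : 2-D grids of depth values (mm); mask : same-shape 0/1 grid."""
    errs = [abs(p - g)
            for p_row, g_row, m_row in zip(pred, gt, mask)
            for p, g, m in zip(p_row, g_row, m_row) if m]
    return sum(errs) / len(errs)

pred = [[410.0, 432.0], [455.0, 470.0]]   # predicted depth, mm
gt   = [[400.0, 430.0], [450.0, 900.0]]   # ground truth; 900 = background
mask = [[1, 1], [1, 0]]                   # evaluate hand pixels only
err = masked_depth_error_mm(pred, gt, mask)  # (10 + 2 + 5) / 3 mm
```

Without the mask, the large background discrepancy (470 vs 900 mm) would dominate the average, which is exactly why mask-guided evaluation and training make sense for hand-centric depth.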
Survey on RGB, 3D, Thermal, and Multimodal Approaches for Facial Expression Recognition: History, Trends, and Affect-related Applications
Facial expressions are an important way through which humans interact
socially. Building a system capable of automatically recognizing facial
expressions from images and video has been an intense field of study in recent
years. Interpreting such expressions remains challenging and much research is
needed about the way they relate to human affect. This paper presents a general
overview of automatic RGB, 3D, thermal and multimodal facial expression
analysis. We define a new taxonomy for the field, encompassing all steps from
face detection to facial expression recognition, and describe and classify the
state-of-the-art methods accordingly. We also present the important datasets
and the benchmarking of the most influential methods. We conclude with a general
discussion about trends, important questions, and future lines of research.
A Survey of the Trends in Facial and Expression Recognition Databases and Methods
Automated facial identification and facial expression recognition have been
topics of active research over the past few decades. Facial and expression
recognition find applications in human-computer interfaces, subject tracking,
real-time security surveillance systems and social networking. Several holistic
and geometric methods have been developed to identify faces and expressions
using public and local facial image databases. In this work we present the
evolution in facial image data sets and the methodologies for facial
identification and recognition of expressions such as anger, sadness,
happiness, disgust, fear and surprise. We observe that most of the earlier
methods for facial and expression recognition aimed at improving the
recognition rates for facial feature-based methods using static images.
However, the recent methodologies have shifted focus towards robust
implementation of facial/expression recognition from large image databases that
vary with space (gathered from the internet) and time (video recordings). The
evolution trends in databases and methodologies for facial and expression
recognition can be useful for assessing the next-generation topics that may
have applications in security systems or personal identification systems that
involve "Quantitative face" assessments.
Comment: 16 pages, 4 figures, 3 tables, International Journal of Computer
Science and Engineering Survey, October, 201
Deep Face Recognition: A Survey
Deep learning applies multiple processing layers to learn representations of
data with multiple levels of feature extraction. This emerging technique has
reshaped the research landscape of face recognition (FR) since 2014, launched
by the breakthroughs of DeepFace and DeepID. Since then, deep learning
techniques, characterized by hierarchical architectures that stitch together
pixels into an invariant face representation, have dramatically improved
state-of-the-art performance and fostered successful real-world applications.
In this survey, we provide a comprehensive review of the recent developments on
deep FR, covering broad topics on algorithm designs, databases, protocols, and
application scenes. First, we summarize different network architectures and
loss functions proposed in the rapid evolution of the deep FR methods. Second,
the related face processing methods are categorized into two classes:
"one-to-many augmentation" and "many-to-one normalization". Then, we summarize
and compare the commonly used databases for both model training and evaluation.
Third, we review miscellaneous scenes in deep FR, such as cross-factor,
heterogeneous, multiple-media, and industrial scenes. Finally, the technical
challenges and several promising directions are highlighted.
Comment: Neurocomputing