Multi-Person Pose Estimation with Enhanced Feature Aggregation and Selection
We propose a novel Enhanced Feature Aggregation and Selection network
(EFASNet) for multi-person 2D human pose estimation. Due to enhanced feature
representation, our method effectively handles crowded, cluttered, and occluded
scenes. More specifically, a Feature Aggregation and Selection Module (FASM),
which constructs hierarchical multi-scale feature aggregation and makes the
aggregated features discriminative, is proposed to obtain a more accurate
fine-grained representation, leading to more precise joint locations. Then, we
apply a simple Feature Fusion (FF) strategy that effectively fuses
high-resolution spatial features and low-resolution semantic features to obtain
more reliable context information for better joint estimation. Finally, we build
a Dense Upsampling Convolution (DUC) module to generate more precise
predictions; it recovers fine joint details that are usually lost in common
upsampling processes. As a result, the predicted keypoint
heatmaps are more accurate. Comprehensive experiments demonstrate that the
proposed approach outperforms state-of-the-art methods and achieves superior
performance on three benchmark datasets: the recent large-scale CrowdPose dataset,
the COCO keypoint detection dataset, and the MPII Human Pose dataset.
Our code will be released upon acceptance. Comment: arXiv admin note: text overlap with arXiv:1905.03466 by other authors.
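The DUC module described above follows the general dense-upsampling idea of predicting sub-pixel detail with a convolution and then rearranging channels into space. Below is a minimal, hypothetical PyTorch sketch of such a block; the channel sizes and layer details are assumptions, not the authors' implementation:

```python
# Illustrative sketch (not the authors' code): a Dense Upsampling Convolution (DUC)
# style block, assuming PyTorch and hypothetical channel sizes.
import torch
import torch.nn as nn

class DUC(nn.Module):
    """Predict sub-pixel detail with a convolution, then rearrange channels
    into space with PixelShuffle to upsample the feature map."""
    def __init__(self, in_channels, out_channels, upscale_factor=2):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels * upscale_factor ** 2,
                              kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels * upscale_factor ** 2)
        self.relu = nn.ReLU(inplace=True)
        self.pixel_shuffle = nn.PixelShuffle(upscale_factor)

    def forward(self, x):
        return self.pixel_shuffle(self.relu(self.bn(self.conv(x))))

# Example: turn a 256-channel 32x32 feature map into 17 keypoint heatmaps at 64x64.
heatmaps = DUC(256, 17, upscale_factor=2)(torch.randn(1, 256, 32, 32))
print(heatmaps.shape)  # torch.Size([1, 17, 64, 64])
```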
Deep Facial Expression Recognition: A Survey
With the transition of facial expression recognition (FER) from
laboratory-controlled to challenging in-the-wild conditions and the recent
success of deep learning techniques in various fields, deep neural networks
have increasingly been leveraged to learn discriminative representations for
automatic FER. Recent deep FER systems generally focus on two important issues:
overfitting caused by a lack of sufficient training data and
expression-unrelated variations, such as illumination, head pose and identity
bias. In this paper, we provide a comprehensive survey on deep FER, including
datasets and algorithms that provide insights into these intrinsic problems.
First, we describe the standard pipeline of a deep FER system with the related
background knowledge and suggestions for applicable implementations at each
stage. We then introduce the available datasets that are widely used in the
literature and provide accepted data selection and evaluation principles for
these datasets. For the state of the art in deep FER, we review existing novel
deep neural networks and related training strategies that are designed for FER
based on both static images and dynamic image sequences, and discuss their
advantages and limitations. Competitive performances on widely used benchmarks
are also summarized in this section. We then extend our survey to additional
related issues and application scenarios. Finally, we review the remaining
challenges and corresponding opportunities in this field as well as future
directions for the design of robust deep FER systems.
Gait Recognition via Disentangled Representation Learning
Gait, the walking pattern of individuals, is one of the most important
biometric modalities. Most of the existing gait recognition methods take
silhouettes or articulated body models as the gait features. These methods
suffer from degraded recognition performance when handling confounding
variables, such as clothing, carrying condition, and view angle. To remedy this issue, we
propose a novel AutoEncoder framework to explicitly disentangle pose and
appearance features from RGB imagery; an LSTM-based integration of the pose
features over time then produces the gait feature. In addition, we collect a
Frontal-View Gait (FVG) dataset to focus on gait recognition from frontal-view
walking, which is a challenging problem since it contains minimal gait cues
compared to other views. FVG also includes other important variations, e.g.,
walking speed, carrying condition, and clothing. With extensive experiments on CASIA-B,
USF and FVG datasets, our method demonstrates superior performance to the state
of the art quantitatively, qualitative evidence of feature disentanglement, and
promising computational efficiency. Comment: To appear at CVPR 2019 as an oral presentation.
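As a rough illustration of the disentangle-then-aggregate idea described above, the following sketch splits a per-frame embedding into appearance and pose parts and aggregates the pose part over time with an LSTM. All layer choices and dimensions are assumptions for illustration, not the authors' code:

```python
# Minimal sketch (hypothetical dimensions and layers, not the paper's implementation).
import torch
import torch.nn as nn

class DisentangledGaitNet(nn.Module):
    def __init__(self, feat_dim=256, pose_dim=64, gait_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in frame encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.pose_head = nn.Linear(feat_dim, pose_dim)                   # dynamic factor
        self.appearance_head = nn.Linear(feat_dim, feat_dim - pose_dim)  # static factor
        self.lstm = nn.LSTM(pose_dim, gait_dim, batch_first=True)

    def forward(self, frames):                   # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        pose = self.pose_head(feats)             # per-frame pose features
        appearance = self.appearance_head(feats) # per-frame appearance features
        _, (h_n, _) = self.lstm(pose)            # aggregate pose features over time
        gait_feature = h_n[-1]                   # (B, gait_dim) identity embedding
        return gait_feature, pose, appearance

gait, _, _ = DisentangledGaitNet()(torch.randn(2, 8, 3, 64, 64))
print(gait.shape)  # torch.Size([2, 128])
```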
GAN-based Pose-aware Regulation for Video-based Person Re-identification
Video-based person re-identification deals with the inherent difficulty of
matching unregulated sequences of different lengths and with incomplete target
pose/viewpoint structure. Common approaches operate either by reducing the
problem to the still-image case, incurring a significant information loss, or by
exploiting inter-sequence temporal dependencies as in Siamese Recurrent Neural
Networks or in gait analysis. However, in all cases, the inter-sequence
pose/viewpoint misalignment is not considered, and the existing spatial
approaches are mostly limited to the still-image context. To this end, we
propose a novel approach that can exploit more effectively the rich video
information, by accounting for the role that the changing pose/viewpoint factor
plays in the sequences matching process. Specifically, our approach consists of
two components. The first complements the original pose-incomplete information
carried by the sequences with synthetic GAN-generated images and fuses their
feature vectors into a more discriminative, viewpoint-insensitive embedding,
namely Weighted Fusion (WF). The second
performs an explicit pose-based alignment of sequence pairs to promote coherent
feature matching, namely Weighted-Pose Regulation (WPR). Extensive experiments
on two large video-based benchmark datasets show that our approach considerably
outperforms existing methods.
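A minimal sketch of the Weighted Fusion idea, assuming learned per-frame confidence weights over real and GAN-completed frame features; the module name, scoring head, and dimensions here are hypothetical, not the paper's implementation:

```python
# Illustrative sketch: pool per-frame features from real and GAN-generated frames
# into one sequence embedding using learned per-frame weights.
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)   # predicts a confidence per frame feature

    def forward(self, real_feats, synth_feats):
        # real_feats: (N_r, D) from observed frames; synth_feats: (N_s, D) from GAN images
        feats = torch.cat([real_feats, synth_feats], dim=0)      # (N_r + N_s, D)
        weights = torch.softmax(self.scorer(feats).squeeze(-1), dim=0)
        return (weights.unsqueeze(-1) * feats).sum(dim=0)        # (D,) fused embedding

fused = WeightedFusion()(torch.randn(6, 512), torch.randn(4, 512))
print(fused.shape)  # torch.Size([512])
```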
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
This paper presents the futuristic challenges discussed in the cvpaper.challenge.
In 2015 and 2016, we thoroughly studied 1,600+ papers from several conferences and
journals, such as CVPR, ICCV, ECCV, NIPS, PAMI, and IJCV.
Bio-Inspired Human Action Recognition using Hybrid Max-Product Neuro-Fuzzy Classifier and Quantum-Behaved PSO
Studies in computational neuroscience based on functional magnetic resonance
imaging (fMRI), and the biologically inspired systems that follow them, indicate
that human action recognition in the mammalian brain proceeds along two distinct
pathways, specialized for the analysis of motion (optic flow) and form
information, respectively. We define novel and robust form features by applying
the active basis model as the form extractor in the form pathway of the
biologically inspired model. An unbalanced synergetic neural network classifies
the shapes and structures of human objects, with its attention parameter tuned by
quantum-behaved particle swarm optimization (QPSO) initialized via Centroidal
Voronoi Tessellations. These tools prove effective for modeling the form pathway
of the biological system. The final decision is made by combining the outputs of
both pathways via fuzzy inference, which adds to the novelty of the proposed
model: each pathway's feature set is represented by Gaussian membership functions
and combined with the fuzzy product inference method. Two configurations are
proposed for the form pathway: the first applies multi-prototype human action
templates with a two-time synergetic neural network to obtain a uniform template
for each action, and the second abstracts each human action into four key frames.
Experimental results show promising accuracy on different datasets (KTH and
Weizmann). Comment: author's version, SWJ 201
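The pathway-combination step can be illustrated with a toy example of Gaussian membership functions and product inference. The class parameters and score values below are invented purely for illustration and do not come from the paper:

```python
# Toy sketch of Gaussian-membership fuzzy product inference over two pathway scores.
import numpy as np

def gaussian_membership(x, mean, sigma):
    """Degree to which a pathway score x belongs to an action class."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2)

def fuzzy_product_decision(form_score, motion_score, class_params):
    """class_params: {action: ((form_mean, form_sigma), (motion_mean, motion_sigma))}"""
    firing = {}
    for action, ((fm, fs), (mm, ms)) in class_params.items():
        # product inference: multiply the memberships of both pathways
        firing[action] = (gaussian_membership(form_score, fm, fs) *
                          gaussian_membership(motion_score, mm, ms))
    return max(firing, key=firing.get), firing

params = {"walking": ((0.8, 0.2), (0.7, 0.2)), "boxing": ((0.3, 0.2), (0.9, 0.2))}
print(fuzzy_product_decision(0.75, 0.65, params))
```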
An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition
Skeleton-based action recognition is an important task that requires an
adequate understanding of the movement characteristics of a human action from the
given skeleton sequence. Recent studies have shown that exploring spatial and
temporal features of the skeleton sequence is vital for this task.
Nevertheless, how to effectively extract discriminative spatial and temporal
features is still a challenging problem. In this paper, we propose a novel
Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM) for human action
recognition from skeleton data. The proposed AGC-LSTM can not only capture
discriminative features in spatial configuration and temporal dynamics but also
explore the co-occurrence relationship between spatial and temporal domains. We
also present a temporal hierarchical architecture to increase the temporal
receptive field of the top AGC-LSTM layer, which boosts the ability to learn
the high-level semantic representation and significantly reduces the
computation cost. Furthermore, to select discriminative spatial information,
the attention mechanism is employed to enhance information of key joints in
each AGC-LSTM layer. Experimental results are provided on two datasets: the NTU
RGB+D dataset and the Northwestern-UCLA dataset. The comparison results demonstrate
the effectiveness of our approach and show that our approach outperforms the
state-of-the-art methods on both datasets. Comment: Accepted by CVPR 2019.
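A simplified sketch of the per-layer joint attention described above, assuming skeleton node features of shape (batch, joints, channels); the particular gating form used here is an assumption rather than the released AGC-LSTM code:

```python
# Illustrative joint-attention step: re-weight skeleton node features so that
# informative (key) joints are emphasised before the next AGC-LSTM layer.
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.query = nn.Linear(feat_dim, feat_dim)  # global context -> query
        self.key = nn.Linear(feat_dim, feat_dim)    # per-joint key

    def forward(self, node_feats):                  # (B, V, D): V skeleton joints
        context = node_feats.mean(dim=1)            # global pooling over joints
        scores = torch.einsum('bvd,bd->bv',
                              self.key(node_feats), self.query(context))
        attn = torch.sigmoid(scores).unsqueeze(-1)  # per-joint gates in [0, 1]
        return node_feats + attn * node_feats       # enhance key joints, keep the rest

out = JointAttention()(torch.randn(2, 25, 64))      # e.g. 25 NTU RGB+D joints
print(out.shape)  # torch.Size([2, 25, 64])
```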
G2DA: Geometry-Guided Dual-Alignment Learning for RGB-Infrared Person Re-Identification
RGB-Infrared (IR) person re-identification aims to retrieve persons of interest
across heterogeneous cameras and easily suffers from a large image modality
discrepancy caused by the different sensing wavelength ranges.
Existing work usually minimizes this discrepancy by aligning the domain
distributions of global features, while neglecting the intra-modality structural
relations between semantic parts. This could result in the network overly
focusing on local cues, without considering long-range body part dependencies,
leading to meaningless region representations. In this paper, we propose a
graph-enabled distribution matching solution, dubbed Geometry-Guided
Dual-Alignment (G2DA) learning, for RGB-IR ReID. It can jointly encourage the
cross-modal consistency between part semantics and structural relations for
fine-grained modality alignment by solving a graph matching task within a
multi-scale skeleton graph that embeds human topology information.
Specifically, we propose to build a semantic-aligned complete graph into which
all cross-modality images can be mapped via a pose-adaptive graph construction
mechanism. This graph represents extracted whole-part features by nodes and
expresses the node-wise similarities with associated edges. To achieve the
graph-based dual-alignment learning, an Optimal Transport (OT) based structured
metric is further introduced to simultaneously measure point-wise relations and
group-wise structural similarities across modalities. By minimizing the cost of
an inter-modality transport plan, G2DA can learn a consistent and
discriminative feature subspace for cross-modality image retrieval.
Furthermore, we advance a Message Fusion Attention (MFA) mechanism to
adaptively reweight the information flow of semantic propagation, effectively
strengthening the discriminability of the extracted semantic features. Comment: 14 pages, 7 figures.
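The OT-based matching can be illustrated with plain entropic-regularized Sinkhorn iterations between two sets of part features. This is a generic sketch under assumed cost and marginal choices, not the paper's structured metric:

```python
# Illustrative Sinkhorn sketch: compute a transport plan between RGB and IR part
# features; its cost could then serve as a cross-modality alignment objective.
import torch

def sinkhorn_plan(rgb_feats, ir_feats, eps=0.1, n_iters=50):
    # rgb_feats: (N, D), ir_feats: (M, D); cost = pairwise feature distance
    cost = torch.cdist(rgb_feats, ir_feats, p=2)
    cost = cost / cost.max()                              # normalise for a stable exp()
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)                        # uniform node masses
    nu = torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)                            # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                              # alternating scaling updates
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)            # transport plan (N, M)
    return plan, (plan * cost).sum()                      # plan and its OT cost

plan, ot_cost = sinkhorn_plan(torch.randn(6, 128), torch.randn(6, 128))
print(plan.shape, float(ot_cost))
```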
A Survey on Truth Discovery
Thanks to the information explosion, data about objects of interest can be
collected from an increasing number of sources. However, for the same object, there
usually exist conflicts among the collected multi-source information. To tackle
this challenge, truth discovery, which integrates multi-source noisy
information by estimating the reliability of each source, has emerged as a hot
topic. Several truth discovery methods have been proposed for various
scenarios, and they have been successfully applied in diverse application
domains. In this survey, we focus on providing a comprehensive overview of
truth discovery methods, and summarizing them from different aspects. We also
discuss some future directions of truth discovery research. We hope that this
survey will promote a better understanding of the current progress on truth
discovery, and offer some guidelines on how to apply these approaches in
application domains.
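A minimal sketch of the core truth-discovery loop, alternating between estimating each source's reliability and inferring the most likely truth by weighted voting. This is a generic illustrative scheme, not any specific method from the survey:

```python
# Generic weighted-voting truth discovery: iterate truths <-> source reliabilities.
from collections import defaultdict

def truth_discovery(claims, n_iters=10):
    """claims: {object: {source: value}} -> (truths, source_weights)"""
    sources = {s for obs in claims.values() for s in obs}
    weights = {s: 1.0 for s in sources}                     # start: all sources equal
    truths = {}
    for _ in range(n_iters):
        # 1) infer truths by weighted voting over the sources' claims
        for obj, obs in claims.items():
            votes = defaultdict(float)
            for src, val in obs.items():
                votes[val] += weights[src]
            truths[obj] = max(votes, key=votes.get)
        # 2) re-estimate reliability as each source's agreement with current truths
        for src in sources:
            answered = [(obj, val) for obj, obs in claims.items()
                        for s, val in obs.items() if s == src]
            correct = sum(val == truths[obj] for obj, val in answered)
            weights[src] = (correct + 1) / (len(answered) + 2)  # smoothed accuracy
    return truths, weights

claims = {"capital_of_fr": {"A": "Paris", "B": "Paris", "C": "Lyon"},
          "capital_of_de": {"A": "Berlin", "C": "Bonn"}}
print(truth_discovery(claims))
```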
Enhanced 3D Human Pose Estimation from Videos by using Attention-Based Neural Network with Dilated Convolutions
The attention mechanism provides a sequential prediction framework for
learning spatial models with enhanced implicit temporal consistency. In this
work, we show a systematic design (from 2D to 3D) for how conventional networks
and other forms of constraints can be incorporated into the attention framework
for learning long-range dependencies for the task of pose estimation. The
contribution of this paper is to provide a systematic approach for designing
and training attention-based models for end-to-end pose estimation, with the
flexibility and scalability to handle arbitrary video sequences as input. We
achieve this by adapting the temporal receptive field via a multi-scale structure
of dilated convolutions. In addition, the proposed architecture can easily be
adapted to a causal model, enabling real-time performance. Any off-the-shelf 2D
pose estimation system, e.g., Mocap libraries, can easily be integrated in an
ad-hoc fashion. Our method achieves state-of-the-art performance and outperforms
existing methods by reducing the mean per-joint position error to 33.4 mm on the
Human3.6M dataset.
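A schematic sketch of the multi-scale dilated temporal convolution idea, regressing per-frame 3D joints from a 2D keypoint sequence; the layer sizes and dilation schedule are assumptions, not the authors' architecture:

```python
# Illustrative stacked 1D temporal convolutions with growing dilation: the receptive
# field over the 2D keypoint sequence expands before regressing 3D joint positions.
import torch
import torch.nn as nn

class DilatedTemporalNet(nn.Module):
    def __init__(self, n_joints=17, channels=128, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers, in_ch = [], n_joints * 2                 # input: (x, y) per joint
        for d in dilations:
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3,
                                 dilation=d, padding=d),  # same-length output
                       nn.ReLU()]
            in_ch = channels
        self.temporal = nn.Sequential(*layers)
        self.head = nn.Conv1d(channels, n_joints * 3, kernel_size=1)

    def forward(self, pose2d):                           # (B, T, J, 2)
        b, t, j, _ = pose2d.shape
        x = pose2d.view(b, t, -1).transpose(1, 2)        # (B, J*2, T)
        out = self.head(self.temporal(x))                # (B, J*3, T)
        return out.transpose(1, 2).view(b, t, j, 3)      # 3D pose per frame

pose3d = DilatedTemporalNet()(torch.randn(2, 81, 17, 2))
print(pose3d.shape)  # torch.Size([2, 81, 17, 3])
```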