Deep High-Resolution Representation Learning for Human Pose Estimation
This is an official pytorch implementation of Deep High-Resolution
Representation Learning for Human Pose Estimation. In this work, we are
interested in the human pose estimation problem with a focus on learning
reliable high-resolution representations. Most existing methods recover
high-resolution representations from low-resolution representations produced by
a high-to-low resolution network. Instead, our proposed network maintains
high-resolution representations through the whole process. We start from a
high-resolution subnetwork as the first stage, gradually add high-to-low
resolution subnetworks one by one to form more stages, and connect the
multi-resolution subnetworks in parallel. We conduct repeated multi-scale
fusions such that each of the high-to-low resolution representations receives
information from other parallel representations over and over, leading to rich
high-resolution representations. As a result, the predicted keypoint heatmap is
potentially more accurate and spatially more precise. We empirically
demonstrate the effectiveness of our network through the superior pose
estimation results over two benchmark datasets: the COCO keypoint detection
dataset and the MPII Human Pose dataset. The code and models are publicly
available at
\url{https://github.com/leoxiaobin/deep-high-resolution-net.pytorch}. Comment: accepted by CVPR 2019
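For a concrete picture of the repeated multi-scale fusion described above, the following is a minimal, illustrative PyTorch sketch; the two-branch setup, module names, and channel widths are assumptions for exposition, not the official implementation:

```python
# Minimal sketch of HRNet-style fusion across two parallel branches kept at
# high (H x W) and low (H/2 x W/2) resolution. Illustrative only; see the
# official repository for the actual multi-stage, multi-branch implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        self.low_to_high = nn.Conv2d(c_low, c_high, kernel_size=1)   # followed by upsampling
        self.high_to_low = nn.Conv2d(c_high, c_low, kernel_size=3,
                                     stride=2, padding=1)            # strided conv downsamples

    def forward(self, x_high, x_low):
        up = F.interpolate(self.low_to_high(x_low), size=x_high.shape[-2:],
                           mode="bilinear", align_corners=False)
        down = self.high_to_low(x_high)
        # Each branch receives information from the other resolution.
        return x_high + up, x_low + down

x_high = torch.randn(1, 32, 64, 48)
x_low = torch.randn(1, 64, 32, 24)
y_high, y_low = FusionBlock()(x_high, x_low)
print(y_high.shape, y_low.shape)  # torch.Size([1, 32, 64, 48]) torch.Size([1, 64, 32, 24])
```

Repeating such exchange units across stages is what keeps a high-resolution representation available throughout the network.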
Deep Appearance Models: A Deep Boltzmann Machine Approach for Face Modeling
The "interpretation through synthesis" approach to analyze face images,
particularly Active Appearance Models (AAMs) method, has become one of the most
successful face modeling approaches over the last two decades. AAMs can
represent face images through synthesis using a controllable, parameterized
Principal Component Analysis (PCA) model. However, the accuracy and robustness
of the faces synthesized by AAMs depend heavily on the training sets and,
inherently, on the generalizability of the PCA subspaces. This
paper presents a novel Deep Appearance Models (DAMs) approach, an efficient
replacement for AAMs, to accurately capture both shape and texture of face
images under large variations. In this approach, three crucial components
represented in hierarchical layers are modeled using the Deep Boltzmann
Machines (DBM) to robustly capture the variations of facial shapes and
appearances. DAMs are therefore superior to AAMs in inferring a
representation for new face images under various challenging conditions. The
proposed approach is evaluated in various applications to demonstrate its
robustness and capabilities, i.e., facial super-resolution reconstruction,
facial off-angle reconstruction or face frontalization, facial occlusion
removal, and age estimation using challenging face databases, i.e., Labeled
Face Parts in the Wild (LFPW), Helen, and FG-NET. Compared to AAMs and other
deep learning based approaches, the proposed DAMs achieve competitive results
in those applications, demonstrating their advantages in handling occlusions,
facial representation, and reconstruction.
Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation
In this paper, we propose a pose grammar to tackle the problem of 3D human
pose estimation. Our model directly takes 2D pose as input and learns a
generalized 2D-3D mapping function. The proposed model consists of a base
network which efficiently captures pose-aligned features and a hierarchy of
Bi-directional RNNs (BRNN) on the top to explicitly incorporate a set of
knowledge regarding human body configuration (i.e., kinematics, symmetry, motor
coordination). The proposed model thus enforces high-level constraints over
human poses. In learning, we develop a pose sample simulator to augment
training samples in virtual camera views, which further improves our model
generalizability. We validate our method on public 3D human pose benchmarks and
propose a new evaluation protocol working on cross-view setting to verify the
generalization capability of different methods. We empirically observe that
most state-of-the-art methods encounter difficulty under this setting, while
our method handles such challenges well. Comment: Accepted by AAAI 2018
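As a rough illustration of lifting a 2D pose to 3D with a bidirectional RNN over the joint sequence, here is a generic PyTorch sketch; the actual pose-grammar model uses a hierarchy of BRNNs tied to kinematics, symmetry, and motor coordination, which this single-BRNN toy omits, and all names and sizes below are hypothetical:

```python
# Generic 2D-to-3D pose lifting: a per-joint base network followed by a
# bidirectional LSTM over the joints. Illustrative sketch, not the paper's
# pose-grammar architecture.
import torch
import torch.nn as nn

class Lift2Dto3D(nn.Module):
    def __init__(self, num_joints=16, hidden=128):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(2, hidden), nn.ReLU())
        self.brnn = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 3)

    def forward(self, pose_2d):             # pose_2d: [N, J, 2]
        h = self.base(pose_2d)               # per-joint features
        h, _ = self.brnn(h)                  # context shared across the joint sequence
        return self.head(h)                  # [N, J, 3]

model = Lift2Dto3D()
print(model(torch.randn(8, 16, 2)).shape)  # torch.Size([8, 16, 3])
```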
High-Resolution Representations for Labeling Pixels and Regions
High-resolution representation learning plays an essential role in many
vision problems, e.g., pose estimation and semantic segmentation. The
high-resolution network (HRNet)~\cite{SunXLW19}, recently developed for human
pose estimation, maintains high-resolution representations through the whole
process by connecting high-to-low resolution convolutions in \emph{parallel}
and produces strong high-resolution representations by repeatedly conducting
fusions across parallel convolutions.
In this paper, we conduct a further study on high-resolution representations
by introducing a simple yet effective modification and apply it to a wide range
of vision tasks. We augment the high-resolution representation by aggregating
the (upsampled) representations from all the parallel convolutions rather than
only the representation from the high-resolution convolution as done
in~\cite{SunXLW19}. This simple modification leads to stronger representations,
evidenced by superior results. We show top results in semantic segmentation on
Cityscapes, LIP, and PASCAL Context, and facial landmark detection on AFLW,
COFW, 300W, and WFLW. In addition, we build a multi-level representation from
the high-resolution representation and apply it to the Faster R-CNN object
detection framework and the extended frameworks. The proposed approach achieves
superior results to existing single-model networks on COCO object detection.
The code and models are publicly available at \url{https://github.com/HRNet}.
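A hedged sketch of the aggregation modification described above: all parallel-branch outputs are upsampled to the highest resolution and concatenated, instead of keeping only the high-resolution branch. The channel widths and number of branches below are illustrative assumptions:

```python
# Aggregate (upsampled) representations from all parallel branches rather than
# keeping only the high-resolution one. Channel sizes are illustrative.
import torch
import torch.nn.functional as F

def aggregate_branches(features):
    """features: list of tensors [N, C_i, H_i, W_i], highest resolution first."""
    target = features[0].shape[-2:]
    upsampled = [features[0]] + [
        F.interpolate(f, size=target, mode="bilinear", align_corners=False)
        for f in features[1:]
    ]
    return torch.cat(upsampled, dim=1)  # concatenated high-resolution representation

branches = [torch.randn(1, c, 64 // s, 48 // s)
            for c, s in [(32, 1), (64, 2), (128, 4), (256, 8)]]
rep = aggregate_branches(branches)
print(rep.shape)  # torch.Size([1, 480, 64, 48])
```

Downsampling this aggregated map again yields the multi-level representation mentioned for the Faster R-CNN setting.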
Densely Semantically Aligned Person Re-Identification
We propose a densely semantically aligned person re-identification framework.
It fundamentally addresses the body misalignment problem caused by
pose/viewpoint variations, imperfect person detection, occlusion, etc. By
leveraging the estimation of the dense semantics of a person image, we
construct a set of densely semantically aligned part images (DSAP-images),
where the same spatial positions have the same semantics across different
images. We design a two-stream network that consists of a main full image
stream (MF-Stream) and a densely semantically-aligned guiding stream
(DSAG-Stream). The DSAG-Stream, with the DSAP-images as input, acts as a
regulator to guide the MF-Stream to learn densely semantically aligned features
from the original image. At inference, the DSAG-Stream is discarded and only
the MF-Stream is needed, which makes the inference system computationally
efficient and robust. To the best of our knowledge, we are the first to make
use of fine-grained semantics to address the misalignment problem for re-ID.
Our method achieves rank-1 accuracy of 78.9% (new protocol) on the CUHK03
dataset, 90.4% on the CUHK01 dataset, and 95.7% on the Market1501 dataset,
outperforming state-of-the-art methods. Comment: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019)
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents futuristic challenges discussed in the cvpaper.challenge.
In 2015 and 2016, we thoroughly studied 1,600+ papers in several
conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
Improved training of binary networks for human pose estimation and image recognition
Big neural networks trained on large datasets have advanced the
state-of-the-art for a large variety of challenging problems, improving
performance by a large margin. However, under low-memory and limited
computational-power constraints, the accuracy on the same problems drops
considerably. In this paper, we propose a series of techniques that
significantly improve the accuracy of binarized neural networks (i.e., networks
where both the features and the weights are binary). We evaluate the proposed
improvements on two diverse tasks: fine-grained recognition (human pose
estimation) and large-scale image recognition (ImageNet classification).
Specifically, we introduce a series of novel methodological changes including:
(a) more appropriate activation functions, (b) reverse-order initialization,
(c) progressive quantization, and (d) network stacking, and show that these
additions significantly improve existing state-of-the-art network binarization
techniques. Additionally, for the first time, we also investigate the extent
to which network binarization and knowledge distillation can be combined. When
tested on the challenging MPII dataset, our method shows a performance
improvement of more than 4% in absolute terms. Finally, we further validate our
findings by applying the proposed techniques to large-scale object recognition
on the ImageNet dataset, on which we report a 4% reduction in error rate.
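To make the binarized setting concrete, here is a minimal sign-binarization layer with a straight-through estimator in PyTorch; it illustrates the generic binary-features/binary-weights setup mentioned above, not the paper's specific improvements (a)-(d), and all names are hypothetical:

```python
# Minimal sign binarization with a straight-through estimator (STE).
# Illustrates the generic binary-weights/binary-activations setting only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradients through only where |x| <= 1 (standard STE clipping).
        return grad_out * (x.abs() <= 1).float()

class BinaryConv2d(nn.Conv2d):
    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)   # binary weights
        x_bin = BinarizeSTE.apply(x)             # binary features
        return F.conv2d(x_bin, w_bin, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

layer = BinaryConv2d(16, 32, kernel_size=3, padding=1)
print(layer(torch.randn(2, 16, 56, 56)).shape)  # torch.Size([2, 32, 56, 56])
```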
Matrix and tensor decompositions for training binary neural networks
This paper is on improving the training of binary neural networks in which
both activations and weights are binary. While prior methods for neural network
binarization binarize each filter independently, we propose to instead
parametrize the weight tensor of each layer using matrix or tensor
decomposition. The binarization process is then performed using this latent
parametrization, via a quantization function (e.g. sign function) applied to
the reconstructed weights. A key feature of our method is that while the
reconstruction is binarized, the computation in the latent factorized space is
done in the real domain. This has several advantages: (i) the latent
factorization enforces a coupling of the filters before binarization, which
significantly improves the accuracy of the trained models; (ii) while at
training time, the binary weights of each convolutional layer are parametrized
using real-valued matrix or tensor decomposition, during inference we simply
use the reconstructed (binary) weights. As a result, our method does not
sacrifice any advantage of binary networks in terms of model compression and
speeding-up inference. As a further contribution, instead of computing the
binary weight scaling factors analytically, as in prior work, we propose to
learn them discriminatively via back-propagation. Finally, we show that our
approach significantly outperforms existing methods when tested on the
challenging tasks of (a) human pose estimation (more than 4% improvement) and
(b) ImageNet classification (up to 5% performance gains).
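The latent-parametrization idea can be sketched for a linear layer as follows: the real-valued weight matrix is stored as a low-rank product U V, binarized by a sign function applied to the reconstruction, and scaled by a factor learned via back-propagation. The rank, shapes, and straight-through estimator below are illustrative assumptions, not the paper's exact formulation (which also covers tensor decompositions for convolutional layers):

```python
# Binarize weights through a latent low-rank factorization W ~ U @ V, with a
# scaling factor learned by back-propagation. Illustrative sketch only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, g):
        (x,) = ctx.saved_tensors
        return g * (x.abs() <= 1).float()

class LatentBinaryLinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_features, rank) * 0.1)
        self.V = nn.Parameter(torch.randn(rank, in_features) * 0.1)
        self.alpha = nn.Parameter(torch.ones(1))  # learned scaling factor

    def forward(self, x):
        w_real = self.U @ self.V           # latent computation stays real-valued
        w_bin = SignSTE.apply(w_real)      # the reconstruction is binarized
        return F.linear(x, self.alpha * w_bin)

layer = LatentBinaryLinear(128, 64)
print(layer(torch.randn(4, 128)).shape)  # torch.Size([4, 64])
```

At inference only the reconstructed binary weights (and the scalar) are needed, so the factorization adds no cost to deployment.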
Learning Compositional Neural Information Fusion for Human Parsing
This work proposes to combine neural networks with the compositional
hierarchy of human bodies for efficient and complete human parsing. We
formulate the approach as a neural information fusion framework. Our model
assembles the information from three inference processes over the hierarchy:
direct inference (directly predicting each part of a human body using image
information), bottom-up inference (assembling knowledge from constituent
parts), and top-down inference (leveraging context from parent nodes). The
bottom-up and top-down inferences explicitly model the compositional and
decompositional relations in human bodies, respectively. In addition, the
fusion of multi-source information is conditioned on the inputs, i.e., by
estimating and considering the confidence of the sources. The whole model is
end-to-end differentiable, explicitly modeling information flows and
structures. Our approach is extensively evaluated on four popular datasets,
outperforming the state of the art in all cases, with a fast processing speed
of 23 fps. Our code and results have been released to help ease future research
in this direction. Comment: ICCV 2019. Website:
https://github.com/ZzzjzzZ/CompositionalHumanParsin
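As a rough sketch of fusing the three inference sources (direct, bottom-up, top-down) with input-conditioned confidences, the following PyTorch snippet weights each source by a predicted confidence map; the actual model ties these processes to the compositional hierarchy of body parts, and the module and parameter names here are hypothetical:

```python
# Confidence-weighted fusion of three prediction sources. Illustrative only.
import torch
import torch.nn as nn

class ConfidenceFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # One confidence map per source, conditioned on the source predictions.
        self.gate = nn.Conv2d(3 * channels, 3, kernel_size=1)

    def forward(self, direct, bottom_up, top_down):
        stacked = torch.stack([direct, bottom_up, top_down], dim=1)   # [N, 3, C, H, W]
        conf = torch.softmax(
            self.gate(torch.cat([direct, bottom_up, top_down], dim=1)), dim=1
        )                                                             # [N, 3, H, W]
        return (conf.unsqueeze(2) * stacked).sum(dim=1)               # [N, C, H, W]

fuse = ConfidenceFusion(channels=20)
sources = [torch.randn(1, 20, 32, 32) for _ in range(3)]
print(fuse(*sources).shape)  # torch.Size([1, 20, 32, 32])
```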
Modeling of Facial Aging and Kinship: A Survey
Computational facial models that capture properties of facial cues related to
aging and kinship increasingly attract the attention of the research community,
enabling the development of reliable methods for age progression, age
estimation, age-invariant facial characterization, and kinship verification
from visual data. In this paper, we review recent advances in modeling of
facial aging and kinship. In particular, we provide an up-to-date, complete
list of available annotated datasets and an in-depth analysis of geometric,
hand-crafted, and learned facial representations that are used for facial aging
and kinship characterization. Moreover, evaluation protocols and metrics are
reviewed and notable experimental results for each surveyed task are analyzed.
This survey allows us to identify challenges and discuss future research
directions for the development of robust facial models in real-world
conditions.