Hybrid Graph Neural Networks for Crowd Counting
Crowd counting is an important yet challenging task due to the large scale
and density variation. Recent investigations have shown that distilling rich
relations among multi-scale features and exploiting useful information from the
auxiliary task, i.e., localization, are vital for this task. Nevertheless, how
to comprehensively leverage these relations within a unified network
architecture is still a challenging problem. In this paper, we present a novel
network structure called Hybrid Graph Neural Network (HyGnn), which addresses
this problem by interweaving the multi-scale features for crowd density
estimation with those of its auxiliary task (localization) and performing joint
reasoning over a graph. Specifically, HyGnn integrates a hybrid graph to
jointly represent the task-specific feature maps of different scales as nodes,
and two types of relations as edges: (i) multi-scale relations capturing the
feature dependencies across scales and (ii) mutually beneficial relations
building bridges for the cooperation between counting and localization. Thus,
through message passing, HyGnn can distill rich relations between the nodes to
obtain more powerful representations, leading to robust and accurate results.
Our HyGnn performs remarkably well on four challenging datasets:
ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50 and UCF_QNRF, outperforming
the state-of-the-art approaches by a large margin. Comment: To appear in AAAI 2020.
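A minimal PyTorch-style sketch of the message-passing idea described above follows; the class name, the shared 1x1-convolution message/update transforms, and mean aggregation are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn as nn

    class HybridGraphLayer(nn.Module):
        """One round of message passing over counting and localization nodes."""
        def __init__(self, channels):
            super().__init__()
            self.msg = nn.Conv2d(channels, channels, 1)  # edge message transform
            self.upd = nn.Conv2d(channels, channels, 1)  # node update transform

        def forward(self, cnt, loc):
            # cnt, loc: lists of [B, C, H, W] feature maps, one node per scale,
            # assumed pre-resized to a common resolution.
            def step(own, other):
                out = []
                for s, f in enumerate(own):
                    # multi-scale edges (same task) plus the mutual-beneficial
                    # edge to the other task's node at the same scale
                    nbrs = [own[t] for t in range(len(own)) if t != s] + [other[s]]
                    m = torch.stack([self.msg(n) for n in nbrs]).mean(0)
                    out.append(f + torch.relu(self.upd(m)))
                return out
            # synchronous update: both tasks read the pre-update node states
            return step(cnt, loc), step(loc, cnt)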
High-Accuracy Facial Depth Models derived from 3D Synthetic Data
In this paper, we explore how synthetically generated 3D face models can be
used to construct high-accuracy ground truth for depth. This allows us to
train Convolutional Neural Networks (CNNs) to solve facial depth estimation
problems. These models provide sophisticated controls over image variations
including pose, illumination, facial expressions and camera position. 2D
training samples can be rendered from these models, typically in RGB format,
together with depth information. Using synthetic facial animations, dynamic
facial expression or facial action data can be rendered for a sequence of image
frames together with ground truth depth and additional metadata such as head
pose, light direction, etc. The synthetic data is used to train a CNN based
facial depth estimation system which is validated on both synthetic and real
images. Potential fields of application include 3D reconstruction, driver
monitoring systems, robotic vision systems, and advanced scene understanding.
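A minimal PyTorch sketch of the training setup described above, assuming pairs of rendered RGB frames and pixel-aligned synthetic depth maps; the tiny encoder-decoder and L1 loss are placeholder choices, not the paper's architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DepthNet(nn.Module):
        """Tiny encoder-decoder regressing a depth map from an RGB face crop."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            )

        def forward(self, rgb):   # rgb: [B, 3, H, W]
            return self.net(rgb)  # depth: [B, 1, H, W]

    def train_step(model, opt, rgb, depth_gt):
        # rgb: rendered synthetic frame; depth_gt: exact synthetic ground truth
        opt.zero_grad()
        loss = F.l1_loss(model(rgb), depth_gt)
        loss.backward()
        opt.step()
        return loss.item()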
Diagnosing Rarity in Human-Object Interaction Detection
Human-object interaction (HOI) detection is a core task in computer vision.
The goal is to localize all human-object pairs and recognize their
interactions. An interaction defined by a ⟨verb, object⟩ tuple leads to a
long-tailed visual recognition challenge since many combinations are rarely
represented. The performance of existing models is limited, especially for
the tail categories, yet little has been done to understand why. To that
end, in this paper, we propose to diagnose rarity in HOI detection through
a three-step strategy, namely Detection, Identification and Recognition, in
which we carefully analyse the limiting factors by studying state-of-the-art models.
Our findings indicate that the detection and identification steps are affected by
interaction signals such as occlusion and relative location, which in turn
limits recognition accuracy. Comment: Accepted at CVPR'20 Workshop on Learning from Limited Label
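A hedged sketch of how such a three-step diagnosis could be scored, assuming hypothetical inputs (ground-truth interactions, predicted boxes, linked human-object pairs, and verb labels); each missed interaction is attributed to the first stage that fails.

    from collections import Counter

    def iou(a, b):
        # a, b: boxes as (x1, y1, x2, y2)
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    def diagnose(gt_interactions, boxes, pairs, verbs, iou_thr=0.5):
        # gt_interactions: list of (human_box, object_box, verb)
        # boxes: predicted boxes; pairs: set of linked (h_idx, o_idx);
        # verbs: dict mapping (h_idx, o_idx) -> predicted verb label
        def match(box):
            best, best_iou = None, iou_thr
            for i, d in enumerate(boxes):
                v = iou(box, d)
                if v >= best_iou:
                    best, best_iou = i, v
            return best
        stats = Counter()
        for h_box, o_box, verb in gt_interactions:
            h, o = match(h_box), match(o_box)
            if h is None or o is None:
                stats["detection"] += 1       # box never localized
            elif (h, o) not in pairs:
                stats["identification"] += 1  # boxes found, pair not linked
            elif verbs.get((h, o)) != verb:
                stats["recognition"] += 1     # pair linked, verb misclassified
            else:
                stats["correct"] += 1
        return stats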
Context-aware Human Motion Prediction
The problem of predicting human motion given a sequence of past observations
is at the core of many applications in robotics and computer vision. Current
state-of-the-art approaches formulate this problem as a sequence-to-sequence task,
in which a history of 3D skeletons feeds a Recurrent Neural Network (RNN) that
predicts future movements, typically in the order of 1 to 2 seconds. However,
one aspect that has been overlooked so far is the fact that human motion is
inherently driven by interactions with objects and/or other humans in the
environment. In this paper, we explore this scenario using a novel
context-aware motion prediction architecture. We use a semantic-graph model
where the nodes parameterize the human and objects in the scene and the edges
their mutual interactions. These interactions are iteratively learned through a
graph attention layer, fed with the past observations, which now include both
object and human body motions. Once this semantic graph is learned, we inject
it into a standard RNN to predict future movements of the human(s) and object(s).
We consider two variants of our architecture, either freezing the contextual
interactions in the future or updating them. A thorough evaluation on the
"Whole-Body Human Motion Database" shows that in both cases, our context-aware
networks clearly outperform baselines in which the context information is not
considered. Comment: Accepted at CVPR 2020.
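A minimal PyTorch sketch of this pipeline, substituting nn.MultiheadAttention for the paper's graph attention layer and a GRU for the RNN; all shapes and names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ContextAwarePredictor(nn.Module):
        """Attention over human/object nodes, then an RNN over time."""
        def __init__(self, dim, hidden):
            super().__init__()
            self.att = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
            self.rnn = nn.GRU(dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, dim)

        def forward(self, past):
            # past: [B, T, N, D] states of N nodes (human + objects) over T frames
            B, T, N, D = past.shape
            x = past.reshape(B * T, N, D)
            ctx, _ = self.att(x, x, x)  # learned pairwise interactions (graph edges)
            ctx = ctx.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
            h, _ = self.rnn(ctx)        # temporal reasoning per node
            return self.out(h[:, -1]).reshape(B, N, D)  # next-frame state per node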