Self-supervised Multi-view Person Association and Its Applications
Reliable markerless motion tracking of people participating in a complex
group activity from multiple moving cameras is challenging due to frequent
occlusions, strong viewpoint and appearance variations, and asynchronous video
streams. To solve this problem, reliable association of the same person across
distant viewpoints and temporal instances is essential. We present a
self-supervised framework to adapt a generic person appearance descriptor to
the unlabeled videos by exploiting motion tracking, mutual exclusion
constraints, and multi-view geometry. The adapted discriminative descriptor is
used in a tracking-by-clustering formulation. We validate the effectiveness of
our descriptor learning on WILDTRACK [14] and three new complex social scenes
captured by multiple cameras with up to 60 people "in the wild". We report
significant improvement in association accuracy (up to 18%) and in the stability
and coherence of 3D human skeleton tracking (5 to 10 times) over the baseline. Using
the reconstructed 3D skeletons, we cut the input videos into a multi-angle
video where the image of a specified person is shown from the best visible
front-facing camera. Our algorithm detects inter-human occlusion to determine
the camera switching moment while preserving the flow of the action.
Comment: Accepted to IEEE TPAM
A Differential Approach for Gaze Estimation
Non-invasive gaze estimation methods usually regress gaze directions directly
from a single face or eye image. However, due to large variability in eye
shape and inner eye structure among individuals, universal models achieve
limited accuracy, and their outputs usually exhibit high variance as well as
subject-dependent biases. Therefore, accuracy is usually improved
through calibration, which allows gaze predictions for a subject to be mapped
to his/her actual gaze. In this paper, we introduce a novel image differential
method for gaze estimation. We propose to directly train a differential
convolutional neural network to predict the gaze difference between two input
eye images of the same subject. Then, given a set of subject-specific
calibration images, we can use the inferred differences to predict the gaze
direction of a novel eye sample. The assumption is that by allowing the
comparison between two eye images, nuisance factors (alignment, eyelid
closing, illumination perturbations) that usually plague single-image
prediction methods can be greatly reduced, yielding better predictions overall.
Experiments on three public datasets validate our approach, which consistently
outperforms state-of-the-art methods even when using only one calibration
sample or when the latter methods are followed by subject-specific gaze
adaptation.
Comment: Extension of our paper "A differential approach for gaze estimation
with calibration" (BMVC 2018). Submitted to PAMI on Aug. 7th, 2018. Accepted by
PAMI short in Dec. 2019, in IEEE Transactions on Pattern Analysis and Machine
Intelligence
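The calibration step described in the gaze abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the name `predict_gaze`, the (yaw, pitch) representation, the sign convention of `diff_model`, and the plain averaging over calibration samples are all assumptions made for the example.

```python
from typing import Any, Callable, Sequence, Tuple

Gaze = Tuple[float, float]  # (yaw, pitch) in degrees; an assumed representation


def predict_gaze(
    query_image: Any,
    calibration: Sequence[Tuple[Any, Gaze]],
    diff_model: Callable[[Any, Any], Gaze],
) -> Gaze:
    """Estimate the gaze of query_image from calibration samples.

    diff_model(query, reference) is assumed to return an estimate of
    gaze(query) - gaze(reference). Each calibration sample therefore
    yields one estimate of the query gaze; the estimates are averaged.
    """
    if not calibration:
        raise ValueError("at least one calibration sample is required")
    yaw_sum = 0.0
    pitch_sum = 0.0
    for ref_image, (ref_yaw, ref_pitch) in calibration:
        d_yaw, d_pitch = diff_model(query_image, ref_image)
        yaw_sum += ref_yaw + d_yaw
        pitch_sum += ref_pitch + d_pitch
    n = len(calibration)
    return (yaw_sum / n, pitch_sum / n)


# Toy stand-in for the trained differential network: the "images" here are
# simply the true gaze tuples, so the difference can be computed exactly.
def toy_diff_model(query: Gaze, reference: Gaze) -> Gaze:
    return (query[0] - reference[0], query[1] - reference[1])


calibration_set = [((0.0, 0.0), (0.0, 0.0)), ((4.0, 2.0), (4.0, 2.0))]
estimate = predict_gaze((10.0, -5.0), calibration_set, toy_diff_model)
```

With a single calibration sample the loop runs once, which corresponds to the one-sample setting mentioned in the abstract.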
Monocular Total Capture: Posing Face, Body, and Hands in the Wild
We present the first method to capture the 3D total motion of a target person
from a monocular view input. Given an image or a monocular video, our method
reconstructs the motion from body, face, and fingers represented by a 3D
deformable mesh model. We use an efficient representation called 3D Part
Orientation Fields (POFs) to encode the 3D orientations of all body parts in
the common 2D image space. POFs are predicted by a Fully Convolutional Network
(FCN), along with the joint confidence maps. To train our network, we collect a
new 3D human motion dataset capturing diverse total body motion of 40 subjects
in a multiview system. We leverage a 3D deformable human model to reconstruct
total body pose from the CNN outputs by exploiting the pose and shape prior in
the model. We also present a texture-based tracking method to obtain temporally
coherent motion capture output. We perform thorough quantitative evaluations
including comparison with the existing body-specific and hand-specific methods,
and performance analysis on camera viewpoint and human pose changes. Finally,
we demonstrate the results of our total body motion capture on various
challenging in-the-wild videos. Our code and newly collected human motion
dataset will be publicly shared.
Comment: 17 pages, 16 figures
3-D Laser-Based Multiclass and Multiview Object Detection in Cluttered Indoor Scenes
This paper investigates the problem of multiclass and multiview 3-D object detection for service robots operating in a cluttered indoor environment. A novel 3-D object detection system using laser point clouds is proposed to deal with cluttered indoor scenes given limited and imbalanced training data. Raw 3-D point clouds are first transformed to 2-D bearing angle images to reduce the computational cost, and then multiple jointly trained object detectors are deployed to perform the multiclass and multiview 3-D object detection. A reclassification technique is applied to each low-confidence detected bounding box to reduce false alarms. The RUS-SMOTEboost algorithm is used to train a group of independent binary classifiers with imbalanced training data. Dense histograms of oriented gradients and local binary pattern features are combined as the feature set for the reclassification task. Experimental results on the Dalian University of Technology (DUT)-3-D data set, collected from various office and household environments, show the validity and good performance of the proposed method.
RGB-D-based Action Recognition Datasets: A Survey
Human action recognition from RGB-D (Red, Green, Blue and Depth) data has
attracted increasing attention since the first work reported in 2010. Over this
period, many benchmark datasets have been created to facilitate the development
and evaluation of new algorithms. This raises the question of which dataset to
select and how to use it in providing a fair and objective comparative
evaluation against state-of-the-art methods. To address this issue, this paper
provides a comprehensive review of the most commonly used action recognition
related RGB-D video datasets, including 27 single-view datasets, 10 multi-view
datasets, and 7 multi-person datasets. The detailed information and analysis of
these datasets are a useful resource in guiding insightful selection of datasets
for future research. In addition, the issues with current algorithm evaluation
vis-à-vis limitations of the available datasets and evaluation protocols
are also highlighted, resulting in a number of recommendations for the collection
of new datasets and the use of evaluation protocols.
Robust Visual Tracking using Multi-Frame Multi-Feature Joint Modeling
It remains a huge challenge to design effective and efficient trackers under
complex scenarios, including occlusions, illumination changes and pose
variations. To cope with this problem, a promising solution is to integrate the
temporal consistency across consecutive frames and multiple feature cues in a
unified model. Motivated by this idea, we propose a novel correlation
filter-based tracker in this work, in which the temporal relatedness is
reconciled under a multi-task learning framework and the multiple feature cues
are modeled using a multi-view learning approach. We demonstrate that the resulting
regression model can be learned efficiently by exploiting the structure of a
blockwise diagonal matrix. A fast blockwise diagonal matrix inversion algorithm
is then developed for efficient online tracking. Meanwhile, we
incorporate an adaptive scale estimation mechanism to improve tracking stability
under scale variation. We implement our tracker using two types of
features and test it on two benchmark datasets. Experimental results
demonstrate the superiority of our proposed approach when compared with other
state-of-the-art trackers. Project homepage:
http://bmal.hust.edu.cn/project/KMF2JMTtracking.html
Comment: This paper has been accepted by IEEE Transactions on Circuits and
Systems for Video Technology. The MATLAB code of our method is available from
our project homepage http://bmal.hust.edu.cn/project/KMF2JMTtracking.htm
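Why exploiting block structure makes inversion cheaper can be illustrated with a small sketch. It assumes that "blockwise diagonal" means square blocks along the diagonal and zeros elsewhere; the matrix structure and the fast algorithm in the paper may well differ, and the names `invert` and `invert_block_diagonal` are hypothetical. Under that assumption, the inverse is obtained by inverting each block independently instead of the full matrix.

```python
from typing import List

Matrix = List[List[float]]


def invert(m: Matrix) -> Matrix:
    """Invert a small dense square matrix by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(m)
    # Augment the matrix with the identity: [M | I].
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(m)
    ]
    for col in range(n):
        # Pick the row with the largest absolute entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("block is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The right half now holds the inverse.
    return [row[n:] for row in aug]


def invert_block_diagonal(blocks: List[Matrix]) -> List[Matrix]:
    """Invert a block-diagonal matrix given as the list of its diagonal
    blocks; the result is the list of inverted blocks."""
    return [invert(block) for block in blocks]


blocks = [[[2.0]], [[1.0, 2.0], [3.0, 4.0]]]
inverse_blocks = invert_block_diagonal(blocks)
```

For k blocks of size b, this costs on the order of k·b³ operations rather than (k·b)³ for a dense inversion of the full matrix.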
Deep Facial Expression Recognition: A Survey
With the transition of facial expression recognition (FER) from
laboratory-controlled to challenging in-the-wild conditions and the recent
success of deep learning techniques in various fields, deep neural networks
have increasingly been leveraged to learn discriminative representations for
automatic FER. Recent deep FER systems generally focus on two important issues:
overfitting caused by a lack of sufficient training data and
expression-unrelated variations, such as illumination, head pose and identity
bias. In this paper, we provide a comprehensive survey on deep FER, including
datasets and algorithms that provide insights into these intrinsic problems.
First, we describe the standard pipeline of a deep FER system with the related
background knowledge and suggestions of applicable implementations for each
stage. We then introduce the available datasets that are widely used in the
literature and provide accepted data selection and evaluation principles for
these datasets. For the state of the art in deep FER, we review existing novel
deep neural networks and related training strategies that are designed for FER
based on both static images and dynamic image sequences, and discuss their
advantages and limitations. Competitive performances on widely used benchmarks
are also summarized in this section. We then extend our survey to additional
related issues and application scenarios. Finally, we review the remaining
challenges and corresponding opportunities in this field as well as future
directions for the design of robust deep FER systems.
SG-FCN: A Motion and Memory-Based Deep Learning Model for Video Saliency Detection
Data-driven saliency detection has attracted strong interest as a result of
applying convolutional neural networks to the detection of eye fixations.
Although a number of image-based salient object and fixation detection models
have been proposed, video fixation detection still requires more exploration.
Unlike in image analysis, motion and temporal information are crucial
factors affecting human attention when viewing video sequences. Although
existing models based on local contrast and low-level features have been
extensively researched, they fail to simultaneously consider interframe
motion and temporal information across neighboring video frames, leading to
unsatisfactory performance when handling complex scenes. To this end, we
propose a novel and efficient video eye fixation detection model to improve the
saliency detection performance. By simulating the memory mechanism and visual
attention mechanism of human beings when watching a video, we propose a
step-gained fully convolutional network by combining the memory information on
the time axis with the motion information on the space axis while storing the
saliency information of the current frame. The model is obtained through
hierarchical training, which ensures the accuracy of the detection. Extensive
experiments in comparison with 11 state-of-the-art methods are carried out, and
the results show that our proposed model outperforms all 11 methods across a
number of publicly available datasets.
Compressed Volumetric Heatmaps for Multi-Person 3D Pose Estimation
In this paper we present a novel approach for bottom-up multi-person 3D human
pose estimation from monocular RGB images. We propose to use high resolution
volumetric heatmaps to model joint locations, devising a simple and effective
compression method to drastically reduce the size of this representation. At
the core of the proposed method lies our Volumetric Heatmap Autoencoder, a
fully-convolutional network tasked with the compression of ground-truth
heatmaps into a dense intermediate representation. A second model, the Code
Predictor, is then trained to predict these codes, which can be decompressed at
test time to re-obtain the original representation. Our experimental evaluation
shows that our method performs favorably when compared to state of the art on
both multi-person and single-person 3D human pose estimation datasets and,
thanks to our novel compression strategy, can process full-HD images at the
constant runtime of 8 fps regardless of the number of subjects in the scene.
Code and models are available at https://github.com/fabbrimatteo/LoCO .
Comment: CVPR 202
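The volumetric heatmap representation mentioned in the abstract above can be illustrated with a small sketch. It covers only the uncompressed representation, not the autoencoder or the Code Predictor, and it is not taken from the linked repository: the Gaussian form, the (depth, height, width) layout, and the names `gaussian_volume` and `argmax_3d` are assumptions chosen for the example.

```python
import math
from typing import List, Tuple

Volume = List[List[List[float]]]  # indexed as volume[z][y][x]


def gaussian_volume(
    shape: Tuple[int, int, int],
    center: Tuple[float, float, float],
    sigma: float = 1.0,
) -> Volume:
    """Build a 3D heatmap with a Gaussian peak at the joint location.

    shape is (depth, height, width) and center is (z, y, x) in voxel units.
    """
    depth, height, width = shape
    cz, cy, cx = center
    two_sigma_sq = 2.0 * sigma * sigma
    return [
        [
            [
                math.exp(
                    -((z - cz) ** 2 + (y - cy) ** 2 + (x - cx) ** 2) / two_sigma_sq
                )
                for x in range(width)
            ]
            for y in range(height)
        ]
        for z in range(depth)
    ]


def argmax_3d(volume: Volume) -> Tuple[int, int, int]:
    """Recover the joint location as the voxel with the highest response."""
    best_value = -math.inf
    best_index = (0, 0, 0)
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if value > best_value:
                    best_value = value
                    best_index = (z, y, x)
    return best_index


heatmap = gaussian_volume(shape=(4, 5, 6), center=(2, 3, 1))
joint_voxel = argmax_3d(heatmap)
```

The voxel count grows with the product of the three dimensions, which is why a high-resolution volume of this kind becomes large and why the abstract describes compressing it.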