SALSA: A Novel Dataset for Multimodal Group Behavior Analysis
Studying free-standing conversational groups (FCGs) in unstructured social
settings (e.g., a cocktail party) is gratifying due to the wealth of information
available at the group (mining social networks) and individual (recognizing
native behavioral and personality traits) levels. However, analyzing social
scenes involving FCGs is also highly challenging due to the difficulty in
extracting behavioral cues such as target locations, their speaking activity
and head/body pose due to crowdedness and presence of extreme occlusions. To
this end, we propose SALSA, a novel dataset facilitating multimodal and
Synergetic sociAL Scene Analysis, and make two main contributions to research
on automated social interaction analysis: (1) SALSA records social interactions
among 18 participants in a natural, indoor environment for over 60 minutes,
under the poster presentation and cocktail party contexts presenting
difficulties in the form of low-resolution images, lighting variations,
numerous occlusions, reverberations and interfering sound sources; (2) To
alleviate these problems we facilitate multimodal analysis by recording the
social interplay using four static surveillance cameras and sociometric badges
worn by each participant, comprising the microphone, accelerometer, bluetooth
and infrared sensors. In addition to raw data, we also provide annotations
concerning individuals' personality as well as their position, head, body
orientation and F-formation information over the entire event duration. Through
extensive experiments with state-of-the-art approaches, we show (a) the
limitations of current methods and (b) how the recorded multiple cues
synergetically aid automatic analysis of social interactions. SALSA is
available at http://tev.fbk.eu/salsa.
Comment: 14 pages, 11 figures
Polarimetric Pose Prediction
Light has many properties that vision sensors can passively measure.
Colour-band separated wavelength and intensity are arguably the most commonly
used for monocular 6D object pose estimation. This paper explores how
complementary polarisation information, i.e. the orientation of light wave
oscillations, influences the accuracy of pose predictions. A hybrid model that
leverages physical priors jointly with a data-driven learning strategy is
designed and carefully tested on objects with different levels of photometric
complexity. Our design significantly improves the pose accuracy compared to
state-of-the-art photometric approaches and enables object pose estimation for
highly reflective and transparent objects. A new multi-modal instance-level 6D
object pose dataset with highly accurate pose annotations for multiple objects
with varying photometric complexity is introduced as a benchmark.
Comment: Accepted at ECCV 2022; 25 pages (14 main paper + References + 7 Appendix)
Human Motion Trajectory Prediction: A Survey
With growing numbers of intelligent autonomous systems in human environments,
the ability of such systems to perceive, understand and anticipate human
behavior becomes increasingly important. Specifically, predicting future
positions of dynamic agents and planning considering such predictions are key
tasks for self-driving vehicles, service robots and advanced surveillance
systems. This paper provides a survey of human motion trajectory prediction. We
review, analyze and structure a large selection of work from different
communities and propose a taxonomy that categorizes existing methods based on
the motion modeling approach and level of contextual information used. We
provide an overview of the existing datasets and performance metrics. We
discuss limitations of the state of the art and outline directions for further
research.
Comment: Submitted to the International Journal of Robotics Research (IJRR), 37 pages
Learning to Extract Motion from Videos in Convolutional Neural Networks
This paper shows how to extract dense optical flow from videos with a
convolutional neural network (CNN). The proposed model constitutes a potential
building block for deeper architectures to allow using motion without resorting
to an external algorithm, e.g., for recognition in videos. We derive our network
architecture from signal processing principles to provide desired invariances
to image contrast, phase and texture. We constrain weights within the network
to enforce strict rotation invariance and substantially reduce the number of
parameters to learn. We demonstrate end-to-end training on only 8 sequences of
the Middlebury dataset, orders of magnitude less than competing CNN-based
motion estimation methods, and obtain comparable performance to classical
methods on the Middlebury benchmark. Importantly, our method outputs a
distributed representation of motion that allows representing multiple,
transparent motions, and dynamic textures. Our contributions on network design
and rotation invariance offer insights nonspecific to motion estimation.
Deep Learning for Face Anti-Spoofing: A Survey
Face anti-spoofing (FAS) has lately attracted increasing attention due to its
vital role in securing face recognition systems from presentation attacks
(PAs). As more and more realistic PAs with novel types spring up, traditional
FAS methods based on handcrafted features become unreliable due to their
limited representation capacity. With the emergence of large-scale academic
datasets in the recent decade, deep learning based FAS achieves remarkable
performance and dominates this area. However, existing reviews in this field
mainly focus on the handcrafted features, which are outdated and uninspiring
for the progress of the FAS community. In this paper, to stimulate future research,
we present the first comprehensive review of recent advances in deep learning
based FAS. It covers several novel and insightful components: 1) besides
supervision with binary label (e.g., '0' for bonafide vs. '1' for PAs), we also
investigate recent methods with pixel-wise supervision (e.g., pseudo depth
map); 2) in addition to traditional intra-dataset evaluation, we collect and
analyze the latest methods specially designed for domain generalization and
open-set FAS; and 3) besides commercial RGB cameras, we summarize the deep
learning applications under multi-modal (e.g., depth and infrared) or
specialized (e.g., light field and flash) sensors. We conclude this survey by
emphasizing current open issues and highlighting potential prospects.
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
A Novel Two Stream Decision Level Fusion of Vision and Inertial Sensors Data for Automatic Multimodal Human Activity Recognition System
This paper presents a novel multimodal human activity recognition system. It
uses a two-stream decision level fusion of vision and inertial sensors. In the
first stream, raw RGB frames are passed to a part affinity field-based pose
estimation network to detect the keypoints of the user. These keypoints are
then pre-processed and inputted in a sliding window fashion to a specially
designed convolutional neural network for the spatial feature extraction
followed by regularized LSTMs to calculate the temporal features. The outputs
of LSTM networks are then inputted to fully connected layers for
classification. In the second stream, data obtained from inertial sensors are
pre-processed and inputted to regularized LSTMs for the feature extraction
followed by fully connected layers for the classification. At this stage, the
SoftMax scores of two streams are then fused using the decision level fusion
which gives the final prediction. Extensive experiments are conducted to
evaluate the performance. Four multimodal standard benchmark datasets (UP-Fall
Detection, UTD-MHAD, Berkeley-MHAD, and C-MHAD) are used for experimentation.
The accuracies obtained by the proposed system are 96.9%, 97.6%, 98.7%, and
95.9%, respectively, on the UP-Fall Detection, UTD-MHAD, Berkeley-MHAD, and
C-MHAD datasets. These results are far superior to those of current
state-of-the-art methods.
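
The decision-level fusion step described in this abstract can be illustrated with a minimal sketch. The abstract says only that the SoftMax scores of the two streams are fused. The weighted-average rule, the `vision_weight` parameter, and the function names below are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert raw class logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_decisions(vision_scores, inertial_scores, vision_weight=0.5):
    """Fuse per-class scores from two streams at the decision level.

    Assumption: a weighted average of the two score vectors. The abstract
    does not state which fusion rule the paper actually uses.
    Returns the index of the winning class and the fused score vector.
    """
    if len(vision_scores) != len(inertial_scores):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= vision_weight <= 1.0:
        raise ValueError("vision_weight must lie in [0, 1]")

    w = vision_weight
    fused = [w * v + (1.0 - w) * i
             for v, i in zip(vision_scores, inertial_scores)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused


# Hypothetical logits for three activity classes from each stream.
vision = softmax([2.0, 1.0, 0.1])
inertial = softmax([0.5, 2.5, 0.3])
label, fused = fuse_decisions(vision, inertial)
print(label, [round(s, 3) for s in fused])
```

With equal weights, the stream that is more confident dominates the fused prediction. Other rules, such as taking the product or the maximum of the scores, would fit the same interface.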
- …