Target-absent Human Attention
The prediction of human gaze behavior is important for building
human-computer interactive systems that can anticipate a user's attention.
Computer vision models have been developed to predict the fixations made by
people as they search for target objects. But what about when the image has no
target? Equally important is to know how people search when they cannot find a
target, and when they would stop searching. In this paper, we propose the first
data-driven computational model that addresses the search-termination problem
and predicts the scanpath of search fixations made by people searching for
targets that do not appear in images. We model visual search as an imitation
learning problem and represent the internal knowledge that the viewer acquires
through fixations using a novel state representation that we call Foveated
Feature Maps (FFMs). FFMs integrate a simulated foveated retina into a
pretrained ConvNet that produces an in-network feature pyramid, all with
minimal computational overhead. Our method integrates FFMs as the state
representation in inverse reinforcement learning. Experimentally, we improve
the state of the art in predicting human target-absent search behavior on the
COCO-Search18 dataset.
Comment: Accepted to ECCV202
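The core idea behind Foveated Feature Maps is that visual acuity falls off with eccentricity from the current fixation, so features far from the fixation should be degraded toward a coarser representation. The following is a minimal sketch of that blending step, not the paper's exact formulation: the two-level "pyramid", the Gaussian acuity falloff, and all names are illustrative assumptions.

```python
import numpy as np

def foveated_feature_map(pyramid, fixation, sigma=8.0):
    """Blend a two-level feature pyramid by eccentricity from a fixation.

    `pyramid` is (fine, coarse): two HxWxC arrays at the same spatial size
    (coarse = upsampled low-resolution features). Cells near the fixation
    keep the fine features; distant cells fall back to the coarse ones.
    Illustrative sketch only -- the Gaussian falloff is an assumption.
    """
    fine, coarse = pyramid
    h, w, _ = fine.shape
    ys, xs = np.mgrid[0:h, 0:w]
    fy, fx = fixation
    ecc = np.sqrt((ys - fy) ** 2 + (xs - fx) ** 2)   # eccentricity in cells
    acuity = np.exp(-(ecc ** 2) / (2 * sigma ** 2))  # 1 at the fovea, ->0 far away
    acuity = acuity[..., None]                       # broadcast over channels
    return acuity * fine + (1 - acuity) * coarse

# Toy usage: 16x16 feature maps with 4 channels, fixation at the centre.
fine = np.random.rand(16, 16, 4)
coarse = np.random.rand(16, 16, 4)
ffm = foveated_feature_map((fine, coarse), fixation=(8, 8))
```

At the fixation itself the acuity weight is exactly 1, so the fine features pass through unchanged; as a new fixation is made, only the blending weights need recomputing, which is why such a scheme adds little overhead on top of a pretrained backbone.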
Contextual Encoder-Decoder Network for Visual Saliency Prediction
Predicting salient regions in natural images requires the detection of
objects that are present in a scene. To develop robust representations for this
challenging task, high-level visual features at multiple spatial scales must be
extracted and augmented with contextual information. However, existing models
aimed at explaining human fixation maps do not incorporate such a mechanism
explicitly. Here we propose an approach based on a convolutional neural network
pre-trained on a large-scale image classification task. The architecture forms
an encoder-decoder structure and includes a module with multiple convolutional
layers at different dilation rates to capture multi-scale features in parallel.
Moreover, we combine the resulting representations with global scene
information for accurately predicting visual saliency. Our model achieves
competitive and consistent results across multiple evaluation metrics on two
public saliency benchmarks and we demonstrate the effectiveness of the
suggested approach on five datasets and selected examples. Compared to state of
the art approaches, the network is based on a lightweight image classification
backbone and hence presents a suitable choice for applications with limited
computational resources, such as (virtual) robotic systems, to estimate human
fixations across complex natural scenes.
Comment: Accepted Manuscrip
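The multi-scale module described above applies several convolutions at different dilation rates in parallel and combines their responses (an ASPP-style design). A minimal NumPy sketch of that idea follows; the kernel sizes, rates, and function names are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def dilated_conv3x3(img, kernel, rate):
    """'Same'-padded 3x3 correlation with dilation `rate` on a 2-D array."""
    h, w = img.shape
    padded = np.pad(img, rate)
    out = np.zeros_like(img, dtype=float)
    for ki in range(3):
        for kj in range(3):
            oi, oj = ki * rate, kj * rate  # dilated tap positions
            out += kernel[ki, kj] * padded[oi:oi + h, oj:oj + w]
    return out

def multi_scale_features(img, kernels, rates=(1, 2, 4)):
    """Stack responses of parallel dilated convolutions over one feature map."""
    return np.stack([dilated_conv3x3(img, k, r)
                     for k, r in zip(kernels, rates)])

# Toy usage: three box filters applied at growing receptive-field sizes.
box = np.ones((3, 3)) / 9.0
feats = multi_scale_features(np.random.rand(16, 16), [box, box, box])
```

Larger dilation rates enlarge the receptive field without adding parameters or reducing spatial resolution, which is what lets such a module capture context at multiple scales in parallel.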
EmotiW 2018: Audio-Video, Student Engagement and Group-Level Affect Prediction
This paper details the sixth Emotion Recognition in the Wild (EmotiW)
challenge. EmotiW 2018 is a grand challenge in the ACM International Conference
on Multimodal Interaction 2018, Colorado, USA. The challenge aims at providing
a common platform to researchers working in the affective computing community
to benchmark their algorithms on `in the wild' data. This year EmotiW contains
three sub-challenges: a) Audio-video based emotion recognition; b) Student
engagement prediction; and c) Group-level emotion recognition. The databases,
protocols and baselines are discussed in detail.
A new multi-modal dataset for human affect analysis
In this paper we present a new multi-modal dataset of spontaneous three-way human interactions. Participants were recorded in an unconstrained environment at various locations during a sequence of debates in a video-conference, Skype-style arrangement. An additional depth modality was introduced, permitting the capture of 3D information alongside the video and audio signals. The dataset consists of 16 participants and is subdivided into 6 unique sections. It was manually annotated on a continuous scale across 5 affective dimensions: arousal, valence, agreement, content and interest. The annotation was performed by three human annotators, with the ensemble average calculated for use in the dataset. The corpus enables the analysis of human affect during conversations in a real-life scenario. We first briefly review existing affect datasets and the methodologies related to affect dataset construction, then detail how our unique dataset was constructed.
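The reference labels come from ensemble-averaging the three annotators' continuous traces. A minimal sketch of that step, with made-up numbers purely for illustration:

```python
import numpy as np

# Three annotators' continuous valence traces (one value per frame);
# the values here are invented for illustration.
annotations = np.array([[0.2, 0.4, 0.6],
                        [0.1, 0.5, 0.7],
                        [0.3, 0.3, 0.8]])

# Ensemble average across annotators -> one reference trace per frame.
gold = annotations.mean(axis=0)   # -> [0.2, 0.4, 0.7]
```

Averaging per frame smooths out individual annotator bias while preserving the temporal dynamics of the affective signal.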
Bayesian Networks for the robust and unbiased prediction of depression and its symptoms utilizing speech and multimodal data
Predicting the presence of major depressive disorder (MDD) using behavioural
and cognitive signals is a highly non-trivial task. The heterogeneous clinical
profile of MDD means that any given speech, facial expression and/or observed
cognitive pattern may be associated with a unique combination of depressive
symptoms. Conventional discriminative machine learning models potentially lack
the complexity to robustly model this heterogeneity. Bayesian networks,
however, may instead be well-suited to such a scenario. These networks are
probabilistic graphical models that efficiently describe the joint probability
distribution over a set of random variables by explicitly capturing their
conditional dependencies. This framework provides further advantages over
standard discriminative modelling by offering the possibility to incorporate
expert opinion in the graphical structure of the models, generating explainable
model predictions, informing about the uncertainty of predictions, and
naturally handling missing data. In this study, we apply a Bayesian framework
to capture the relationships between depression, depression symptoms, and
features derived from speech, facial expression and cognitive game data
collected at thymia.
Comment: Accepted for publication at Interspeech 202
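A Bayesian network factorises the joint distribution into a chain of conditional dependencies, which is also what makes missing observations easy to handle by marginalisation. The toy network below (MDD → symptom → speech feature, all binary) is an illustrative assumption: the structure and probabilities are invented, not values from the study.

```python
# Toy Bayesian network: MDD -> Symptom -> SpeechFeature (all binary).
# Structure and probabilities are illustrative assumptions only.
p_mdd = {True: 0.3, False: 0.7}
p_sym_given_mdd = {True: {True: 0.8, False: 0.2},
                   False: {True: 0.1, False: 0.9}}
p_feat_given_sym = {True: {True: 0.7, False: 0.3},
                    False: {True: 0.2, False: 0.8}}

def joint(mdd, sym, feat):
    """Joint probability from the chain of conditional dependencies."""
    return p_mdd[mdd] * p_sym_given_mdd[mdd][sym] * p_feat_given_sym[sym][feat]

def posterior_mdd(feat):
    """P(MDD | feature), marginalising out the unobserved symptom."""
    scores = {m: sum(joint(m, s, feat) for s in (True, False))
              for m in (True, False)}
    return scores[True] / (scores[True] + scores[False])
```

With these invented numbers, observing the speech feature raises the posterior for MDD from the 0.3 prior to roughly 0.507; the same enumeration over hidden variables is how such models naturally cope with missing data.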
Motion planning in dynamic environments using context-aware human trajectory prediction
Over the years, the separate fields of motion planning, mapping, and human trajectory prediction have advanced considerably. However, the literature is still sparse in providing practical frameworks that enable mobile manipulators to perform whole-body movements and account for the predicted motion of moving obstacles. Previous optimisation-based motion planning approaches that use distance fields have suffered from the high computational cost required to update the environment representation. We demonstrate that GPU-accelerated predicted composite distance fields significantly reduce the computation time compared to calculating distance fields from scratch. We integrate this technique with a complete motion planning and perception framework that accounts for the predicted motion of humans in dynamic environments, enabling reactive and pre-emptive motion planning that incorporates predicted motions. To achieve this, we propose and implement a novel human trajectory prediction method that combines intention recognition with trajectory optimisation-based motion planning. We validate our resultant framework on a real-world Toyota Human Support Robot (HSR) using live RGB-D sensor data from the onboard camera. In addition to providing analysis on a publicly available dataset, we release the Oxford Indoor Human Motion (Oxford-IHM) dataset and demonstrate state-of-the-art performance in human trajectory prediction. The Oxford-IHM dataset is a human trajectory prediction dataset in which people walk between regions of interest in an indoor environment. Both static and robot-mounted RGB-D cameras observe the people while they are tracked with a motion-capture system.
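The saving from composite distance fields comes from keeping per-object fields separate: when only the human moves, only the human's field is recomputed, and the scene field is rebuilt with an elementwise minimum. The sketch below illustrates that compositing step on a small grid; the brute-force distance computation is a stand-in for the paper's GPU-accelerated transform, and all names are assumptions.

```python
import numpy as np

def distance_field(shape, obstacle_cells):
    """Euclidean distance from every grid cell to the nearest obstacle cell.

    Brute force for clarity; an illustrative stand-in for a fast
    (e.g. GPU-accelerated) distance transform.
    """
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    pts = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    obs = np.asarray(obstacle_cells, dtype=float)
    d = np.sqrt(((pts[:, None, :] - obs[None, :, :]) ** 2).sum(-1)).min(axis=1)
    return d.reshape(shape)

# The static scene field is computed once; each step, only the moving
# human's field is recomputed and composited with an elementwise minimum.
static_df = distance_field((20, 20), [(0, j) for j in range(20)])  # a wall
human_df = distance_field((20, 20), [(12, 12)])  # predicted human cell
composite = np.minimum(static_df, human_df)
```

An optimisation-based planner can then query `composite` for clearance costs; moving the predicted human cell only invalidates `human_df`, never the static field.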