1,218 research outputs found
WEAR: A Multimodal Dataset for Wearable and Egocentric Video Activity Recognition
Though research has shown the complementarity of camera- and inertial-based
data, datasets which offer both modalities remain scarce. In this paper we
introduce WEAR, a multimodal benchmark dataset for both vision- and
wearable-based Human Activity Recognition (HAR). The dataset comprises data
from 18 participants performing a total of 18 different workout activities with
untrimmed inertial (acceleration) and camera (egocentric video) data recorded
at 10 different outside locations. WEAR features a diverse set of activities
which are low in inter-class similarity and, unlike previous egocentric
datasets, are neither defined by human-object interactions nor drawn from
inherently distinct activity categories. Provided benchmark results reveal that
single-modality architectures have different strengths and weaknesses in their
prediction performance. Further, in light of the recent success of
transformer-based video action detection models, we demonstrate their
versatility by applying them in a plain fashion using vision, inertial and
combined (vision + inertial) features as input. Results show that vision
transformers are not only able to produce competitive results using only
inertial data, but also can function as an architecture to fuse both modalities
by means of simple concatenation, with the multimodal approach being able to
produce the highest average mAP, precision and close-to-best F1-scores. Up
until now, vision-based transformers have been explored neither in inertial nor
in multimodal human activity recognition, making our approach the first to do
so. The dataset and code to reproduce experiments are publicly available via:
mariusbock.github.io/wear
Comment: 12 pages, 2 figures, 2 table
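The "simple concatenation" fusion mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that both modalities have already been turned into one feature vector per time window; the function name `fuse_by_concatenation`, the feature sizes, and the toy values are hypothetical and not taken from the WEAR code base.

```python
def fuse_by_concatenation(vision_feats, inertial_feats):
    """Join per-window feature vectors from two modalities end to end.

    Assumption: both inputs are lists of equal length, with one feature
    vector per time window (window extraction itself is not shown here).
    """
    if len(vision_feats) != len(inertial_feats):
        raise ValueError("both modalities need the same number of windows")
    return [list(v) + list(i) for v, i in zip(vision_feats, inertial_feats)]


# Toy example: 2 windows, 3 vision features and 2 inertial features each.
vision = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
inertial = [[1.0, 2.0], [3.0, 4.0]]

fused = fuse_by_concatenation(vision, inertial)
print(len(fused), len(fused[0]))  # number of windows, fused feature size
```

In a setup like this, the fused vectors would be passed to the downstream model in place of the single-modality features, so the model itself needs no structural change apart from a wider input.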
Egocentric Vision-based Future Vehicle Localization for Intelligent Driving Assistance Systems
Predicting the future location of vehicles is essential for safety-critical
applications such as advanced driver assistance systems (ADAS) and autonomous
driving. This paper introduces a novel approach to simultaneously predict both
the location and scale of target vehicles in the first-person (egocentric) view
of an ego-vehicle. We present a multi-stream recurrent neural network (RNN)
encoder-decoder model that separately captures both object location and scale
and pixel-level observations for future vehicle localization. We show that
incorporating dense optical flow improves prediction results significantly
since it captures information about motion as well as appearance change. We
also find that explicitly modeling future motion of the ego-vehicle improves
the prediction accuracy, which could be especially beneficial in intelligent
and automated vehicles that have motion planning capability. To evaluate the
performance of our approach, we present a new dataset of first-person videos
collected from a variety of scenarios at road intersections, which are
particularly challenging moments for prediction because vehicle trajectories
are diverse and dynamic.
Comment: To appear on ICRA 201
Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video
Manual annotations of temporal bounds for object interactions (i.e. start and
end times) are typical training input to recognition, localization and
detection algorithms. For three publicly available egocentric datasets, we
uncover inconsistencies in ground truth temporal bounds within and across
annotators and datasets. We systematically assess the robustness of
state-of-the-art approaches to changes in labeled temporal bounds, for object
interaction recognition. As boundaries are trespassed, a drop of up to 10% is
observed for both Improved Dense Trajectories and Two-Stream Convolutional
Neural Network.
We demonstrate that such disagreement stems from a limited understanding of
the distinct phases of an action, and propose annotating based on the Rubicon
Boundaries, inspired by a similarly named cognitive model, for consistent
temporal bounds of object interactions. Evaluated on a public dataset, we
report a 4% increase in overall accuracy, and an increase in accuracy for 55%
of classes when Rubicon Boundaries are used for temporal annotations.
Comment: ICCV 201
Seeing and hearing egocentric actions: how much can we learn?
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed by focusing on a single modality. In particular, only a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
Peer Reviewed
Postprint (author's final draft
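Late fusion of several streams, as described in the abstract above, can be sketched as follows. This is only an illustration under stated assumptions: each stream is assumed to output one logit per class, the streams are combined by a weighted average of their softmax probabilities, and equal weights are used by default. The abstract does not specify the fusion rule, and the function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(stream_logits, weights=None):
    """Average per-stream class probabilities.

    Assumption: `stream_logits` maps a stream name to a list of logits,
    one per class; equal weights are used unless `weights` is given.
    """
    names = list(stream_logits)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    n_classes = len(stream_logits[names[0]])
    fused = [0.0] * n_classes
    for name in names:
        probs = softmax(stream_logits[name])
        for c in range(n_classes):
            fused[c] += weights[name] * probs[c]
    return fused


# Toy logits for three classes from three hypothetical streams.
logits = {
    "audio": [2.0, 0.5, 0.1],
    "spatial": [0.2, 1.5, 0.3],
    "temporal": [1.8, 0.4, 0.2],
}
fused = late_fusion(logits)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted, fused)
```

Because each stream is scored independently and only the class probabilities are combined, a stream can be added or removed without retraining the others, which is one common reason to prefer late fusion over feature-level fusion.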