Egocentric Activity Recognition with Multimodal Fisher Vector
With the increasing availability of wearable devices, research on egocentric
activity recognition has received much attention recently. In this paper, we
build a Multimodal Egocentric Activity dataset which includes egocentric videos
and sensor data of 20 fine-grained and diverse activity categories. We present
a novel strategy to extract temporal trajectory-like features from sensor data.
We propose to apply the Fisher Kernel framework to fuse video and temporally
enhanced sensor features. Experimental results show that, with careful design of
the feature extraction and fusion algorithm, sensor data can enhance
information-rich video data. We make the Multimodal Egocentric Activity dataset
publicly available to facilitate future research.
Comment: 5 pages, 4 figures, ICASSP 2016 accepted
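To illustrate the Fisher Kernel fusion idea described above, the following is a minimal sketch, not the paper's implementation: Fisher Vector encoding of per-frame descriptors with a diagonal-covariance GMM, then early fusion of the video and sensor encodings by concatenation. All names, dimensions, and the random data are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's code): Fisher Vector encoding of
# local descriptors with a diagonal GMM, then fusion of modalities by concatenation.
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode descriptors (N x D) as a Fisher Vector of length 2*K*D."""
    gamma = gmm.predict_proba(descriptors)                 # posteriors, N x K
    n = descriptors.shape[0]
    mu = gmm.means_                                        # K x D
    sigma = np.sqrt(gmm.covariances_)                      # K x D (diagonal covariances)
    diff = (descriptors[:, None, :] - mu) / sigma          # N x K x D
    w = gmm.weights_[:, None]                              # K x 1
    g_mu = (gamma[:, :, None] * diff).sum(0) / (n * np.sqrt(w))
    g_sig = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * w))
    fv = np.hstack([g_mu.ravel(), g_sig.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                 # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)               # L2 normalisation

rng = np.random.default_rng(0)
video_descs = rng.standard_normal((500, 64))               # e.g. trajectory features
sensor_descs = rng.standard_normal((200, 12))              # e.g. accel/gyro windows

video_gmm = GaussianMixture(n_components=32, covariance_type="diag", random_state=0).fit(video_descs)
sensor_gmm = GaussianMixture(n_components=32, covariance_type="diag", random_state=0).fit(sensor_descs)

# One clip representation: concatenated per-modality Fisher Vectors.
clip_repr = np.hstack([fisher_vector(video_descs, video_gmm),
                       fisher_vector(sensor_descs, sensor_gmm)])
```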
Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video
Manual annotations of temporal bounds for object interactions (i.e. start and
end times) are typical training input to recognition, localization and
detection algorithms. For three publicly available egocentric datasets, we
uncover inconsistencies in ground truth temporal bounds within and across
annotators and datasets. We systematically assess the robustness of
state-of-the-art approaches to changes in labeled temporal bounds, for object
interaction recognition. As boundaries are trespassed, a drop of up to 10% is
observed for both Improved Dense Trajectories and Two-Stream Convolutional
Neural Network.
We demonstrate that such disagreement stems from a limited understanding of
the distinct phases of an action, and propose annotating based on the Rubicon
Boundaries, inspired by a similarly named cognitive model, for consistent
temporal bounds of object interactions. Evaluated on a public dataset, we
report a 4% increase in overall accuracy, and an increase in accuracy for 55%
of classes when Rubicon Boundaries are used for temporal annotations.
Comment: ICCV 2017
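As an illustration of the robustness test described above, here is a minimal sketch of perturbing labeled temporal bounds before re-evaluating a recognizer. The jitter range and the classifier interface are assumptions, not the paper's protocol.

```python
# Illustrative sketch (assumed protocol, not the paper's code): jitter annotated
# start/end times and measure how recognition accuracy degrades.
import random

def jitter_bounds(start, end, max_shift, video_len):
    """Shift an annotated segment's bounds by up to max_shift seconds each way."""
    new_start = max(0.0, start + random.uniform(-max_shift, max_shift))
    new_end = min(video_len, end + random.uniform(-max_shift, max_shift))
    if new_end <= new_start:                     # keep a non-empty segment
        new_start, new_end = start, end
    return new_start, new_end

def accuracy_under_jitter(annotations, classify, max_shift):
    """annotations: list of (video_id, start, end, video_len, label);
    classify: callable (video_id, start, end) -> predicted label."""
    correct = 0
    for vid, start, end, vlen, label in annotations:
        s, e = jitter_bounds(start, end, max_shift, vlen)
        correct += classify(vid, s, e) == label
    return correct / len(annotations)
```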
Boosted Multiple Kernel Learning for First-Person Activity Recognition
Activity recognition from first-person (egocentric) videos has recently
gained attention due to the increasing ubiquity of wearable cameras. There
has been a surge of efforts adapting existing feature descriptors and designing
new descriptors for first-person videos. An effective activity recognition
system requires selection and use of complementary features and appropriate
kernels for each feature. In this study, we propose a data-driven framework for
first-person activity recognition which effectively selects and combines
features and their respective kernels during training. Our experimental
results show that the use of Multiple Kernel Learning (MKL) and Boosted MKL for
first-person activity recognition yields improved results in
comparison to the state of the art. In addition, these techniques enable the
expansion of the framework with new features in an efficient and convenient
way.
Comment: First published in the Proceedings of the 25th European Signal
Processing Conference (EUSIPCO-2017) in 2017, published by EURASIP
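A minimal sketch of the general idea of combining one kernel per feature type into a single kernel for an SVM follows; fixed weights stand in for the learned MKL/boosted weights, and all feature names, dimensions, and data are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: weighted sum of per-feature RBF kernels fed to a
# precomputed-kernel SVM. Fixed weights replace the learned MKL weights.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_train, n_test = 80, 20
# Two hypothetical feature types per video (e.g. motion and appearance descriptors).
feats_train = {"motion": rng.standard_normal((n_train, 128)),
               "appearance": rng.standard_normal((n_train, 256))}
feats_test = {"motion": rng.standard_normal((n_test, 128)),
              "appearance": rng.standard_normal((n_test, 256))}
y_train = rng.integers(0, 4, n_train)

weights = {"motion": 0.6, "appearance": 0.4}    # placeholder for learned weights

def combined_kernel(A, B):
    """Weighted sum of RBF kernels computed per feature type."""
    return sum(w * rbf_kernel(A[k], B[k]) for k, w in weights.items())

clf = SVC(kernel="precomputed").fit(combined_kernel(feats_train, feats_train), y_train)
preds = clf.predict(combined_kernel(feats_test, feats_train))
```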
WEAR: A Multimodal Dataset for Wearable and Egocentric Video Activity Recognition
Though research has shown the complementarity of camera- and inertial-based
data, datasets which offer both modalities remain scarce. In this paper we
introduce WEAR, a multimodal benchmark dataset for both vision- and
wearable-based Human Activity Recognition (HAR). The dataset comprises data
from 18 participants performing a total of 18 different workout activities with
untrimmed inertial (acceleration) and camera (egocentric video) data recorded
at 10 different outside locations. WEAR features a diverse set of activities
which are low in inter-class similarity and, unlike previous egocentric
datasets, are neither defined by human-object interactions nor do they originate
from inherently distinct activity categories. The provided benchmark results reveal that
single-modality architectures have different strengths and weaknesses in their
prediction performance. Further, in light of the recent success of
transformer-based video action detection models, we demonstrate their
versatility by applying them in a plain fashion using vision, inertial and
combined (vision + inertial) features as input. Results show that vision
transformers are not only able to produce competitive results using only
inertial data, but also can function as an architecture to fuse both modalities
by means of simple concatenation, with the multimodal approach being able to
produce the highest average mAP, precision and close-to-best F1-scores. Up to
now, vision-based transformers have been explored in neither inertial nor
multimodal human activity recognition, making our approach the first to do so.
The dataset and code to reproduce the experiments are publicly available via:
mariusbock.github.io/wear
Comment: 12 pages, 2 figures, 2 tables
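As a rough illustration of the concatenation-based fusion mentioned above, a short sketch that joins temporally aligned per-timestep inertial and video features into one sequence that a transformer-based detector could consume; the dimensions and names are assumptions, not taken from the WEAR code.

```python
# Illustrative sketch (not the WEAR repository code): fuse modalities by simple
# concatenation of temporally aligned per-timestep feature vectors.
import numpy as np

T = 120                                      # number of feature timesteps in a clip
video_feats = np.random.randn(T, 2048)       # e.g. per-snippet video embeddings
inertial_feats = np.random.randn(T, 96)      # e.g. windowed accelerometer features

# Early fusion: one (T, 2048 + 96) sequence replacing a single-modality input.
fused = np.concatenate([video_feats, inertial_feats], axis=1)
assert fused.shape == (T, 2144)
```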
Egocentric Vision-based Action Recognition: A survey
The egocentric action recognition (EAR) field has recently grown in popularity due to the affordable and lightweight wearable cameras available nowadays, such as GoPro and similar devices. As a result, the amount of egocentric data generated has increased, triggering interest in the understanding of egocentric videos. More specifically, the recognition of actions in egocentric videos has gained popularity due to the challenge it poses: the wild movement of the camera and the lack of context make it hard to recognise actions with a performance similar to that of third-person vision solutions. This has ignited research interest in the field and, nowadays, many public datasets and competitions can be found in both the machine learning and computer vision communities. In this survey, we aim to analyse the literature on egocentric vision methods and algorithms. To that end, we propose a taxonomy that divides the literature into categories and subcategories, contributing a more fine-grained classification of the available methods. We also provide a review of the zero-shot approaches used by the EAR community, a methodology that could help transfer EAR algorithms to real-world applications. Finally, we summarise the datasets used by researchers in the literature.
We gratefully acknowledge the support of the Basque Government's Department of Education for the predoctoral funding of the first author. This work has been supported by the Spanish Government under the FuturAAL-Context project (RTI2018-101045-B-C21) and by the Basque Government under the Deustek project (IT-1078-16-D).