Multitask Learning to Improve Egocentric Action Recognition
In this work we employ multitask learning to capitalize on the structure that
exists in related supervised tasks to train complex neural networks. It allows
training a network for multiple objectives in parallel, in order to improve
performance on at least one of them by capitalizing on a shared representation
that is developed to accommodate more information than it otherwise would for a
single task. We employ this idea to tackle action recognition in egocentric
videos by introducing additional supervised tasks. We consider learning the
verbs and nouns of which action labels consist, and we predict coordinates
that capture the hand locations and the gaze-based visual saliency for all the
frames of the input video segments. This forces the network to explicitly focus
on cues from secondary tasks that it might otherwise have missed, resulting in
improved inference. Our experiments on EPIC-Kitchens and EGTEA Gaze+ show
consistent improvements when training with multiple tasks over the single-task
baseline. Furthermore, on EGTEA Gaze+ we outperform the state of the art in
action recognition by 3.84%. Apart from actions, our method produces accurate
hand and gaze estimations as side tasks, without requiring any additional input
at test time other than the RGB video clips.
Comment: 10 pages, 3 figures, accepted at the 5th Egocentric Perception, Interaction and Computing (EPIC) workshop at ICCV 2019; code repository: https://github.com/georkap/hand_track_classificatio
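To make the multitask setup concrete, below is a minimal sketch of a shared backbone feeding verb, noun, hand-coordinate, and gaze heads. It assumes a PyTorch-style encoder; the feature dimension, head shapes, loss weights, and the single-point (rather than per-frame) coordinate outputs are illustrative simplifications, not the authors' configuration (see the linked repository for that).

```python
import torch
import torch.nn as nn

class MultitaskEgoNet(nn.Module):
    """Shared video encoder with one head per supervised task (a sketch)."""
    def __init__(self, backbone, feat_dim, num_verbs, num_nouns):
        super().__init__()
        self.backbone = backbone                    # shared video encoder
        self.verb_head = nn.Linear(feat_dim, num_verbs)
        self.noun_head = nn.Linear(feat_dim, num_nouns)
        self.hand_head = nn.Linear(feat_dim, 4)     # (x, y) for left and right hands
        self.gaze_head = nn.Linear(feat_dim, 2)     # (x, y) gaze-saliency point

    def forward(self, clip):
        feat = self.backbone(clip)                  # shared representation
        return (self.verb_head(feat), self.noun_head(feat),
                self.hand_head(feat), self.gaze_head(feat))

def multitask_loss(outputs, targets, w=(1.0, 1.0, 0.5, 0.5)):
    # Weighted sum of classification and regression losses; the weights
    # here are placeholders, not the values used in the paper.
    verb, noun, hands, gaze = outputs
    ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
    return (w[0] * ce(verb, targets["verb"]) + w[1] * ce(noun, targets["noun"])
            + w[2] * mse(hands, targets["hands"]) + w[3] * mse(gaze, targets["gaze"]))
```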
Can Gaze Inform Egocentric Action Recognition?
We investigate the hypothesis that the gaze signal can improve egocentric action recognition on the standard benchmark, the EGTEA Gaze+ dataset. In contrast to prior work, where the gaze signal was used only during training, we formulate a novel neural fusion approach, Cross-modality Attention Blocks (CMA), to leverage the gaze signal for action recognition during inference as well. CMA combines information from different modalities at different levels of abstraction to achieve state-of-the-art performance for egocentric action recognition. Specifically, fusing the video stream with optical flow using CMA outperforms the current state of the art by 3%. However, when CMA is employed to fuse the gaze signal with video-stream data, no improvements are observed. Further investigation of this counter-intuitive finding indicates that the small spatial overlap between the network's attention map and the gaze ground truth renders the gaze signal uninformative for this benchmark. Based on our empirical findings, we recommend improvements to the current benchmark to develop practical systems for egocentric video understanding with gaze.
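The abstract does not specify the internals of CMA, so the following is only a generic cross-attention fusion block of the kind such an approach might use; the dimensions, head count, and residual structure are all assumptions rather than the authors' design.

```python
import torch
import torch.nn as nn

class CrossModalityAttention(nn.Module):
    """Lets features of one modality attend to features of another (a sketch)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (batch, tokens, dim) features from two streams,
        # e.g. RGB tokens and optical-flow (or gaze) tokens.
        fused, _ = self.attn(query=feat_a, key=feat_b, value=feat_b)
        return self.norm(feat_a + fused)            # residual fusion
```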
Bringing Online Egocentric Action Recognition into the wild
To enable safe and effective human-robot cooperation, it is crucial to
develop models for the identification of human activities. Egocentric vision
seems to be a viable solution to this problem, and therefore many works
provide deep learning solutions to infer human actions from first-person
videos. However, although very promising, most of these do not consider the
major challenges that come with a realistic deployment, such as the
portability of the model, the need for real-time inference, and robustness
to novel domains (i.e., new spaces, users, tasks). With this
paper, we set the boundaries that egocentric vision models should consider for
realistic applications, defining a novel setting of egocentric action
recognition in the wild, which encourages researchers to develop novel,
application-aware solutions. We also present a new model-agnostic technique
that enables the rapid repurposing of existing architectures in this new
context, demonstrating the feasibility of deploying a model on a tiny device
(Jetson Nano) and performing the task directly on the edge with very low energy
consumption (2.4 W on average at 50 fps).
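As a rough illustration of the real-time constraint discussed above, here is a simple throughput check one could run on a device such as the Jetson Nano; the model, input shape, and iteration counts are placeholders, and the power figure (2.4 W) must be read from device tooling rather than from Python.

```python
import time
import torch

def measure_fps(model, input_shape=(1, 3, 8, 224, 224), n_iters=100):
    """Estimate clips processed per second for a video model (a sketch)."""
    model.eval()
    clip = torch.randn(*input_shape)                # dummy input clip
    with torch.no_grad():
        for _ in range(10):                         # warm-up iterations
            model(clip)
        start = time.perf_counter()
        for _ in range(n_iters):
            model(clip)
        elapsed = time.perf_counter() - start
    return n_iters / elapsed                        # throughput in clips/s
```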
With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition
In egocentric videos, actions occur in quick succession. We capitalise on the
action's temporal context and propose a method that learns to attend to
surrounding actions in order to improve recognition performance. To incorporate
the temporal context, we propose a transformer-based multimodal model that
ingests video and audio as input modalities, with an explicit language model
providing action sequence context to enhance the predictions. We test our
approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art
performance. Our ablations showcase the advantage of utilising temporal context,
as well as of incorporating the audio input modality and a language model to
rescore predictions. Code and models at: https://github.com/ekazakos/MTCN.
Comment: Accepted at BMVC 202
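As a schematic of the rescoring idea, the sketch below re-ranks candidate actions from the audio-visual model by how plausible the resulting action sequence is under a language model; the interpolation weight and the LM interface are assumptions, not the paper's exact formulation (see the linked repository for that).

```python
def rescore(av_log_probs, context_actions, lm_score, alpha=0.5):
    """Re-rank candidate actions with an action-sequence language model.

    av_log_probs: {action: log p(action | video, audio)} for one segment;
    context_actions: list of surrounding predicted actions (temporal context);
    lm_score(seq): log-probability of an action sequence under the LM.
    """
    rescored = {}
    for action, lp in av_log_probs.items():
        seq = context_actions + [action]            # place candidate in context
        rescored[action] = (1 - alpha) * lp + alpha * lm_score(seq)
    return max(rescored, key=rescored.get)          # best rescored candidate
```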
Measuring hand use in the home after cervical spinal cord injury using egocentric video
Background: Egocentric video has recently emerged as a potential solution for
monitoring hand function in individuals living with tetraplegia in the
community, especially for its ability to detect functional use in the home
environment. Objective: To develop and validate a wearable vision-based system
for measuring hand use in the home among individuals living with tetraplegia.
Methods: Several deep learning algorithms for detecting functional hand-object
interactions were developed and compared. The most accurate algorithm was used
to extract measures of hand function from 65 hours of unscripted video recorded
at home by 20 participants with tetraplegia. These measures were: the
percentage of interaction time over total recording time (Perc); the average
duration of individual interactions (Dur); the number of interactions per hour
(Num). To demonstrate the clinical validity of the technology, egocentric
measures were correlated with validated clinical assessments of hand function
and independence (Graded Redefined Assessment of Strength, Sensibility and
Prehension - GRASSP, Upper Extremity Motor Score - UEMS, and Spinal Cord
Independent Measure - SCIM). Results: Hand-object interactions were
automatically detected with a median F1-score of 0.80 (0.67-0.87). Our results
demonstrated that higher UEMS and better prehension were related to greater
time spent interacting, whereas higher SCIM and better hand sensation were
associated with a higher number of interactions performed during the egocentric video
recordings. Conclusions: For the first time, measures of hand function
automatically estimated in an unconstrained environment in individuals with
tetraplegia have been validated against internationally accepted measures of
hand function. Future work will necessitate a formal evaluation of the
reliability and responsiveness of the egocentric-based performance measures for
hand use.
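The three measures have a direct computational reading; the sketch below derives Perc, Dur, and Num from a per-frame binary interaction track, as a plain illustration rather than the study's actual pipeline.

```python
import itertools

def hand_use_measures(labels, fps=30.0):
    """Compute Perc, Dur, and Num from per-frame interaction flags.

    labels: sequence of 0/1 flags, 1 = active hand-object interaction;
    fps: video frame rate.
    """
    total_s = len(labels) / fps
    # Group consecutive identical flags into runs to delimit interactions.
    runs = [(flag, sum(1 for _ in grp)) for flag, grp in itertools.groupby(labels)]
    durations = [n / fps for flag, n in runs if flag == 1]       # seconds each
    perc = 100.0 * sum(durations) / total_s                      # Perc (%)
    dur = sum(durations) / len(durations) if durations else 0.0  # Dur (s)
    num = len(durations) / (total_s / 3600.0)                    # Num (per hour)
    return perc, dur, num
```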
Egocentric action recognition from noisy videos
Type of degree: Master's. University of Tokyo (東京大学).
Analysis of the hands in egocentric vision: A survey
Egocentric vision (a.k.a. first-person vision - FPV) applications have
thrived over the past few years, thanks to the availability of affordable
wearable cameras and large annotated datasets. The position of the wearable
camera (usually mounted on the head) allows recording exactly what the camera
wearers have in front of them, in particular hands and manipulated objects.
This intrinsic advantage enables the study of the hands from multiple
perspectives: localizing hands and their parts within the images; understanding
what actions and activities the hands are involved in; and developing
human-computer interfaces that rely on hand gestures. In this survey, we review
the literature that focuses on the hands using egocentric vision, categorizing
the existing approaches into: localization (where are the hands or parts of
them?); interpretation (what are the hands doing?); and application (e.g.,
systems that use egocentric hand cues to solve a specific problem).
Moreover, a list of the most prominent datasets with hand-based annotations is
provided.
Egocentric Vision-based Action Recognition: A survey
The egocentric action recognition (EAR) field has recently increased in popularity due to the affordable and lightweight wearable cameras available nowadays, such as GoPro and similar devices. The amount of egocentric data generated has therefore increased, triggering interest in the understanding of egocentric videos. More specifically, the recognition of actions in egocentric videos has gained popularity due to the challenge it poses: the wild movement of the camera and the lack of context make it hard to recognise actions with performance comparable to that of third-person vision solutions. This has ignited research interest in the field and, nowadays, many public datasets and competitions can be found in both the machine learning and computer vision communities. In this survey, we aim to analyse the literature on egocentric vision methods and algorithms. To that end, we propose a taxonomy that divides the literature into various categories with subcategories, contributing a more fine-grained classification of the available methods. We also provide a review of the zero-shot approaches used by the EAR community, a methodology that could help transfer EAR algorithms to real-world applications. Finally, we summarise the datasets used by researchers in the literature.
Acknowledgements: We gratefully acknowledge the support of the Basque Government's Department of Education for the predoctoral funding of the first author. This work has been supported by the Spanish Government under the FuturAAL-Context project (RTI2018-101045-B-C21) and by the Basque Government under the Deustek project (IT-1078-16-D).