Egocentric video description based on temporally-linked sequences
Egocentric vision consists of acquiring images throughout the day from a first-person point of view using wearable cameras. Automatic analysis of this information makes it possible to discover daily patterns that can improve the quality of life of the user. A natural topic that arises in egocentric vision is storytelling, that is, how to understand and tell the story lying behind the pictures. In this paper, we tackle storytelling as an egocentric sequence description problem. We propose a novel methodology that exploits information from temporally neighboring events, which precisely matches the nature of egocentric sequences. Furthermore, we present a new method for multimodal data fusion consisting of a multi-input attention recurrent network. We also release the EDUB-SegDesc dataset, the first dataset for egocentric image sequence description, consisting of 1339 events with 3991 descriptions from 55 days acquired by 11 people. Finally, we show that our proposal outperforms classical attentional encoder-decoder methods for video description.
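A minimal sketch of how such a multi-input attention recurrent decoder might look, assuming PyTorch and illustrative layer names and sizes (this is not the authors' released model): at every decoding step it attends separately over the frame features of the current event and over an encoding of the previous event, then fuses both contexts with the word embedding before updating the recurrent state.

```python
# Sketch only: a multi-input attention decoder fusing two modalities.
# Layer names, dimensions, and the GRU cell are assumptions.
import torch
import torch.nn as nn


class MultiInputAttentionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, prev_dim=512, hid_dim=512, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # one additive-attention scorer per input modality
        self.att_feat = nn.Linear(feat_dim + hid_dim, 1)
        self.att_prev = nn.Linear(prev_dim + hid_dim, 1)
        self.rnn = nn.GRUCell(emb_dim + feat_dim + prev_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def attend(self, scorer, keys, h):
        # keys: (B, T, D); h: (B, H) -> weighted context (B, D)
        expanded = h.unsqueeze(1).expand(-1, keys.size(1), -1)
        scores = scorer(torch.cat([keys, expanded], dim=-1))
        weights = torch.softmax(scores.squeeze(-1), dim=1)
        return (weights.unsqueeze(-1) * keys).sum(dim=1)

    def forward(self, prev_word, h, video_feats, prev_event_feats):
        ctx_v = self.attend(self.att_feat, video_feats, h)       # current-event frames
        ctx_p = self.attend(self.att_prev, prev_event_feats, h)  # previous-event encoding
        x = torch.cat([self.embed(prev_word), ctx_v, ctx_p], dim=-1)
        h = self.rnn(x, h)
        return self.out(h), h
```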
Learning to Localize and Align Fine-Grained Actions to Sparse Instructions
Automatic generation of textual video descriptions that are time-aligned with
video content is a long-standing goal in computer vision. The task is
challenging due to the difficulty of bridging the semantic gap between the
visual and natural language domains. This paper addresses the task of
automatically generating an alignment between a set of instructions and a first
person video demonstrating an activity. The sparse descriptions and ambiguity
of written instructions create significant alignment challenges. The key to our
approach is the use of egocentric cues to generate a concise set of action
proposals, which are then matched to recipe steps using object recognition and
computational linguistic techniques. We obtain promising results on both the
Extended GTEA Gaze+ dataset and the Bristol Egocentric Object Interactions
Dataset.
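As a rough illustration of the matching step, the sketch below aligns temporally ordered action proposals to recipe steps using a simple word-overlap score between recognised objects and the nouns of each step; the paper's pipeline combines object recognition with computational linguistic techniques, so the Jaccard similarity here is only a stand-in.

```python
# Illustrative sketch, not the paper's implementation.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def align(proposals, steps):
    """proposals: list of sets of recognised object labels, in temporal order.
    steps: list of sets of nouns extracted from each instruction step.
    Returns, for each step, the index of the best-matching proposal."""
    return [max(range(len(proposals)),
                key=lambda i: jaccard(proposals[i], step))
            for step in steps]

# toy usage
proposals = [{"knife", "tomato"}, {"pan", "egg"}, {"plate", "egg"}]
steps = [{"tomato", "slice"}, {"egg", "pan", "fry"}]
print(align(proposals, steps))  # -> [0, 1]
```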
Multitask Learning to Improve Egocentric Action Recognition
In this work we employ multitask learning to capitalize on the structure that
exists in related supervised tasks to train complex neural networks. It allows
training a network for multiple objectives in parallel, in order to improve
performance on at least one of them by capitalizing on a shared representation
that is developed to accommodate more information than it otherwise would for a
single task. We employ this idea to tackle action recognition in egocentric
videos by introducing additional supervised tasks. We consider learning the
verbs and nouns of which action labels consist, and predict coordinates
that capture the hand locations and the gaze-based visual saliency for all the
frames of the input video segments. This forces the network to explicitly focus
on cues from secondary tasks that it might otherwise have missed, resulting in
improved inference. Our experiments on EPIC-Kitchens and EGTEA Gaze+ show
consistent improvements when training with multiple tasks over the single-task
baseline. Furthermore, in EGTEA Gaze+ we outperform the state-of-the-art in
action recognition by 3.84%. Apart from actions, our method produces accurate
hand and gaze estimations as side tasks, without requiring any additional input
at test time other than the RGB video clips.
Comment: 10 pages, 3 figures, accepted at the 5th Egocentric Perception, Interaction and Computing (EPIC) workshop at ICCV 2019, code repository: https://github.com/georkap/hand_track_classificatio
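A hedged sketch of the general multitask setup described above (shared backbone plus one head per task, trained with a weighted sum of losses); layer names, sizes, and loss weights are assumptions rather than the configuration used in the paper or its code repository.

```python
# Minimal multitask sketch: shared clip encoder, per-task heads, weighted loss.
import torch
import torch.nn as nn


class MultiTaskEgoNet(nn.Module):
    def __init__(self, backbone, feat_dim, n_verbs, n_nouns, n_frames):
        super().__init__()
        self.backbone = backbone                              # any clip encoder -> (B, feat_dim)
        self.verb_head = nn.Linear(feat_dim, n_verbs)         # verb classification
        self.noun_head = nn.Linear(feat_dim, n_nouns)         # noun classification
        self.hand_head = nn.Linear(feat_dim, n_frames * 4)    # (x, y) per hand per frame
        self.gaze_head = nn.Linear(feat_dim, n_frames * 2)    # gaze point per frame

    def forward(self, clip):
        f = self.backbone(clip)
        return (self.verb_head(f), self.noun_head(f),
                self.hand_head(f), self.gaze_head(f))


def multitask_loss(outputs, targets, weights=(1.0, 1.0, 0.5, 0.5)):
    verb_logits, noun_logits, hands, gaze = outputs
    ce, mse = nn.functional.cross_entropy, nn.functional.mse_loss
    return (weights[0] * ce(verb_logits, targets["verb"]) +
            weights[1] * ce(noun_logits, targets["noun"]) +
            weights[2] * mse(hands, targets["hands"]) +
            weights[3] * mse(gaze, targets["gaze"]))
```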
Unsupervised Mapping and Semantic User Localisation from First-Person Monocular Video
We propose an unsupervised probabilistic framework for learning a human-centred representation of a person’s environment from first-person video. Specifically, non-geometric maps modelled as hierarchies of probabilistic place graphs and view graphs are learned. Place graphs model a user’s patterns of transition between physical locations, whereas view graphs capture an aspect of user behaviour within those locations. Furthermore, we describe an implementation in which the notion of place is divided into stations and the routes that interconnect them. Stations typically correspond to rooms or areas where a user spends time. Visits to stations are temporally segmented based on qualitative visual motion. We describe how to learn maps online in an unsupervised manner, and how to localise the user within these maps. We report experiments on two datasets, including a comparison of performance with and without view graphs, and demonstrate better online mapping than when using offline clustering.
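The toy sketch below, a strong simplification of the probabilistic framework, illustrates the online bookkeeping a place graph implies: each temporally segmented visit is matched against known stations by an assumed appearance-similarity function, new stations are created when no match is found, and transition counts between consecutive stations are updated incrementally.

```python
# Rough sketch of the place-graph idea; the matching function and threshold
# are placeholders, not the paper's probabilistic model.
from collections import defaultdict

class PlaceGraph:
    def __init__(self, match_fn):
        self.transitions = defaultdict(lambda: defaultdict(int))  # station -> station -> count
        self.appearance = {}          # station id -> representative descriptor
        self.match_fn = match_fn      # similarity between two descriptors
        self.current = None

    def observe_visit(self, descriptor, new_station_threshold=0.5):
        # localise: pick the best-matching known station, or create a new one
        best, best_sim = None, 0.0
        for sid, ref in self.appearance.items():
            sim = self.match_fn(descriptor, ref)
            if sim > best_sim:
                best, best_sim = sid, sim
        if best is None or best_sim < new_station_threshold:
            best = len(self.appearance)
            self.appearance[best] = descriptor
        if self.current is not None:
            self.transitions[self.current][best] += 1  # online edge update
        self.current = best
        return best
```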
Predicting visual context for unsupervised event segmentation in continuous photo-streams
Segmenting video content into events provides semantic structures for
indexing, retrieval, and summarization. Since motion cues are not available in
continuous photo-streams, and annotations in lifelogging are scarce and costly,
the frames are usually clustered into events by comparing the visual features
between them in an unsupervised way. However, such methodologies are
ineffective at dealing with heterogeneous events, e.g. taking a walk, and
temporary changes in the sight direction, e.g. at a meeting. To address these
limitations, we propose Contextual Event Segmentation (CES), a novel
segmentation paradigm that uses an LSTM-based generative network to model the
photo-stream sequences, predict their visual context, and track their
evolution. CES decides whether a frame is an event boundary by comparing the
visual context generated from the frames in the past, to the visual context
predicted from the future. We implemented CES on a new and massive lifelogging
dataset consisting of more than 1.5 million images spanning over 1,723 days.
Experiments on the popular EDUB-Seg dataset show that our model outperforms the
state-of-the-art by over 16% in f-measure. Furthermore, CES' performance is
only 3 points below that of human annotators.
Comment: Accepted for publication at the 2018 ACM Multimedia Conference (MM '18)
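A simplified sketch of the boundary decision CES describes, with the context predictors assumed given (the paper trains an LSTM-based generative network for this): a frame is declared an event boundary when the visual context generated from past frames disagrees with the context predicted from future frames.

```python
# Sketch of the boundary test only; the context vectors are assumed to come
# from separate forward (past) and backward (future) predictors.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def event_boundaries(past_context, future_context, threshold=0.7):
    """past_context, future_context: (T, D) arrays of visual-context vectors
    predicted from the frames before / after each position.
    Returns the indices where the two predictions disagree (boundaries)."""
    return [t for t in range(len(past_context))
            if cosine(past_context[t], future_context[t]) < threshold]
```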
EGO-TOPO: Environment Affordances from Egocentric Video
First-person video naturally brings the use of a physical environment to the
forefront, since it shows the camera wearer interacting fluidly in a space
based on his intentions. However, current methods largely separate the observed
actions from the persistent space itself. We introduce a model for environment
affordances that is learned directly from egocentric video. The main idea is to
gain a human-centric model of a physical space (such as a kitchen) that
captures (1) the primary spatial zones of interaction and (2) the likely
activities they support. Our approach decomposes a space into a topological map
derived from first-person activity, organizing an ego-video into a series of
visits to the different zones. Further, we show how to link zones across
multiple related environments (e.g., from videos of multiple kitchens) to
obtain a consolidated representation of environment functionality. On
EPIC-Kitchens and EGTEA+, we demonstrate our approach for learning scene
affordances and anticipating future actions in long-form video.
Comment: Published in CVPR 2020, project page: http://vision.cs.utexas.edu/projects/ego-topo
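As a hypothetical illustration of the topological-map bookkeeping, the sketch below turns an ordered list of zone visits into a graph whose edges count traversals between zones and whose nodes accumulate the actions observed there; the actual EGO-TOPO model learns zone linking and affordances with neural networks, so zone assignments are assumed given here.

```python
# Hypothetical sketch of topological-map construction from zone visits.
from collections import defaultdict

def build_zone_graph(visits):
    """visits: ordered list of (zone_id, actions_observed) for one ego-video."""
    edges = defaultdict(int)             # (zone_a, zone_b) -> traversal count
    affordances = defaultdict(set)       # zone_id -> set of observed actions
    prev = None
    for zone, actions in visits:
        affordances[zone].update(actions)
        if prev is not None and prev != zone:
            edges[tuple(sorted((prev, zone)))] += 1
        prev = zone
    return edges, affordances

# toy usage: a kitchen video visiting sink -> stove -> sink
edges, aff = build_zone_graph([("sink", {"wash hands"}),
                               ("stove", {"fry egg"}),
                               ("sink", {"rinse pan"})])
```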