Egocentric video description based on temporally-linked sequences
Egocentric vision consists of acquiring images throughout the day from a first-person point of view using wearable cameras. Automatic analysis of this information makes it possible to discover daily patterns that can improve the quality of life of the user. A natural topic that arises in egocentric vision is storytelling, that is, how to understand and tell the story lying behind the pictures. In this paper, we tackle storytelling as an egocentric sequence description problem. We propose a novel methodology that exploits information from temporally neighboring events, which precisely matches the nature of egocentric sequences. Furthermore, we present a new method for multimodal data fusion consisting of a multi-input attention recurrent network. We also release the EDUB-SegDesc dataset, the first dataset for egocentric image sequence description, consisting of 1339 events with 3991 descriptions from 55 days acquired by 11 people. Finally, we show that our proposal outperforms classical attentional encoder-decoder methods for video description.
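A minimal sketch of how such a multi-input attention recurrent decoder might look, assuming PyTorch and illustrative layer names and sizes (this is not the authors' released model): at every decoding step it attends separately over the frame features of the current event and over an encoding of the previous event, then fuses both contexts with the word embedding before updating the recurrent state.

```python
# Sketch only: a multi-input attention decoder fusing two modalities.
# Layer names, dimensions, and the GRU cell are assumptions.
import torch
import torch.nn as nn


class MultiInputAttentionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, prev_dim=512, hid_dim=512, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # one additive-attention scorer per input modality
        self.att_feat = nn.Linear(feat_dim + hid_dim, 1)
        self.att_prev = nn.Linear(prev_dim + hid_dim, 1)
        self.rnn = nn.GRUCell(emb_dim + feat_dim + prev_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def attend(self, scorer, keys, h):
        # keys: (B, T, D); h: (B, H) -> weighted context (B, D)
        expanded = h.unsqueeze(1).expand(-1, keys.size(1), -1)
        scores = scorer(torch.cat([keys, expanded], dim=-1))
        weights = torch.softmax(scores.squeeze(-1), dim=1)
        return (weights.unsqueeze(-1) * keys).sum(dim=1)

    def forward(self, prev_word, h, video_feats, prev_event_feats):
        ctx_v = self.attend(self.att_feat, video_feats, h)       # current-event frames
        ctx_p = self.attend(self.att_prev, prev_event_feats, h)  # previous-event encoding
        x = torch.cat([self.embed(prev_word), ctx_v, ctx_p], dim=-1)
        h = self.rnn(x, h)
        return self.out(h), h
```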
Learning to Localize and Align Fine-Grained Actions to Sparse Instructions
Automatic generation of textual video descriptions that are time-aligned with
video content is a long-standing goal in computer vision. The task is
challenging due to the difficulty of bridging the semantic gap between the
visual and natural language domains. This paper addresses the task of
automatically generating an alignment between a set of instructions and a first
person video demonstrating an activity. The sparse descriptions and ambiguity
of written instructions create significant alignment challenges. The key to our
approach is the use of egocentric cues to generate a concise set of action
proposals, which are then matched to recipe steps using object recognition and
computational linguistic techniques. We obtain promising results on both the
Extended GTEA Gaze+ dataset and the Bristol Egocentric Object Interactions
Dataset.
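As a rough illustration of the matching step, the sketch below aligns temporally ordered action proposals to recipe steps using a simple word-overlap score between recognised objects and the nouns of each step; the paper's pipeline combines object recognition with computational linguistic techniques, so the Jaccard similarity here is only a stand-in.

```python
# Illustrative sketch, not the paper's implementation.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def align(proposals, steps):
    """proposals: list of sets of recognised object labels, in temporal order.
    steps: list of sets of nouns extracted from each instruction step.
    Returns, for each step, the index of the best-matching proposal."""
    return [max(range(len(proposals)),
                key=lambda i: jaccard(proposals[i], step))
            for step in steps]

# toy usage
proposals = [{"knife", "tomato"}, {"pan", "egg"}, {"plate", "egg"}]
steps = [{"tomato", "slice"}, {"egg", "pan", "fry"}]
print(align(proposals, steps))  # -> [0, 1]
```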
Multitask Learning to Improve Egocentric Action Recognition
In this work we employ multitask learning to capitalize on the structure that
exists in related supervised tasks to train complex neural networks. It allows
training a network for multiple objectives in parallel, in order to improve
performance on at least one of them by capitalizing on a shared representation
that is developed to accommodate more information than it otherwise would for a
single task. We employ this idea to tackle action recognition in egocentric
videos by introducing additional supervised tasks. We consider learning the
verbs and nouns of which action labels consist, and predict coordinates
that capture the hand locations and the gaze-based visual saliency for all the
frames of the input video segments. This forces the network to explicitly focus
on cues from secondary tasks that it might otherwise have missed, resulting in
improved inference. Our experiments on EPIC-Kitchens and EGTEA Gaze+ show
consistent improvements when training with multiple tasks over the single-task
baseline. Furthermore, in EGTEA Gaze+ we outperform the state-of-the-art in
action recognition by 3.84%. Apart from actions, our method produces accurate
hand and gaze estimations as side tasks, without requiring any additional input
at test time other than the RGB video clips.
Comment: 10 pages, 3 figures, accepted at the 5th Egocentric Perception, Interaction and Computing (EPIC) workshop at ICCV 2019, code repository: https://github.com/georkap/hand_track_classificatio
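A hedged sketch of the general multitask setup described above (shared backbone plus one head per task, trained with a weighted sum of losses); layer names, sizes, and loss weights are assumptions rather than the configuration used in the paper or its code repository.

```python
# Minimal multitask sketch: shared clip encoder, per-task heads, weighted loss.
import torch
import torch.nn as nn


class MultiTaskEgoNet(nn.Module):
    def __init__(self, backbone, feat_dim, n_verbs, n_nouns, n_frames):
        super().__init__()
        self.backbone = backbone                              # any clip encoder -> (B, feat_dim)
        self.verb_head = nn.Linear(feat_dim, n_verbs)         # verb classification
        self.noun_head = nn.Linear(feat_dim, n_nouns)         # noun classification
        self.hand_head = nn.Linear(feat_dim, n_frames * 4)    # (x, y) per hand per frame
        self.gaze_head = nn.Linear(feat_dim, n_frames * 2)    # gaze point per frame

    def forward(self, clip):
        f = self.backbone(clip)
        return (self.verb_head(f), self.noun_head(f),
                self.hand_head(f), self.gaze_head(f))


def multitask_loss(outputs, targets, weights=(1.0, 1.0, 0.5, 0.5)):
    verb_logits, noun_logits, hands, gaze = outputs
    ce, mse = nn.functional.cross_entropy, nn.functional.mse_loss
    return (weights[0] * ce(verb_logits, targets["verb"]) +
            weights[1] * ce(noun_logits, targets["noun"]) +
            weights[2] * mse(hands, targets["hands"]) +
            weights[3] * mse(gaze, targets["gaze"]))
```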
Unsupervised Mapping and Semantic User Localisation from First-Person Monocular Video
We propose an unsupervised probabilistic framework for learning a human-centred representation of a person’s environment from first-person video. Specifically, non-geometric maps modelled as hierarchies of probabilistic place graphs and view graphs are learned. Place graphs model a user’s patterns of transition between physical locations, whereas view graphs capture an aspect of user behaviour within those locations. Furthermore, we describe an implementation in which the notion of place is divided into stations and the routes that interconnect them. Stations typically correspond to rooms or areas where a user spends time. Visits to stations are temporally segmented based on qualitative visual motion. We describe how to learn maps online in an unsupervised manner, and how to localise the user within these maps. We report experiments on two datasets, including a comparison of performance with and without view graphs, and demonstrate better online mapping than when using offline clustering.
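The toy sketch below, a strong simplification of the probabilistic framework, illustrates the online bookkeeping a place graph implies: each temporally segmented visit is matched against known stations by an assumed appearance-similarity function, new stations are created when no match is found, and transition counts between consecutive stations are updated incrementally.

```python
# Rough sketch of the place-graph idea; the matching function and threshold
# are placeholders, not the paper's probabilistic model.
from collections import defaultdict

class PlaceGraph:
    def __init__(self, match_fn):
        self.transitions = defaultdict(lambda: defaultdict(int))  # station -> station -> count
        self.appearance = {}          # station id -> representative descriptor
        self.match_fn = match_fn      # similarity between two descriptors
        self.current = None

    def observe_visit(self, descriptor, new_station_threshold=0.5):
        # localise: pick the best-matching known station, or create a new one
        best, best_sim = None, 0.0
        for sid, ref in self.appearance.items():
            sim = self.match_fn(descriptor, ref)
            if sim > best_sim:
                best, best_sim = sid, sim
        if best is None or best_sim < new_station_threshold:
            best = len(self.appearance)
            self.appearance[best] = descriptor
        if self.current is not None:
            self.transitions[self.current][best] += 1  # online edge update
        self.current = best
        return best
```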
Predicting visual context for unsupervised event segmentation in continuous photo-streams
Segmenting video content into events provides semantic structures for
indexing, retrieval, and summarization. Since motion cues are not available in
continuous photo-streams, and annotations in lifelogging are scarce and costly,
the frames are usually clustered into events by comparing the visual features
between them in an unsupervised way. However, such methodologies are
ineffective at dealing with heterogeneous events, e.g. taking a walk, and
temporary changes in the sight direction, e.g. at a meeting. To address these
limitations, we propose Contextual Event Segmentation (CES), a novel
segmentation paradigm that uses an LSTM-based generative network to model the
photo-stream sequences, predict their visual context, and track their
evolution. CES decides whether a frame is an event boundary by comparing the
visual context generated from the frames in the past, to the visual context
predicted from the future. We implemented CES on a new and massive lifelogging
dataset consisting of more than 1.5 million images spanning over 1,723 days.
Experiments on the popular EDUB-Seg dataset show that our model outperforms the
state-of-the-art by over 16% in f-measure. Furthermore, CES' performance is
only 3 points below that of human annotators.
Comment: Accepted for publication at the 2018 ACM Multimedia Conference (MM '18)
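A simplified sketch of the boundary decision CES describes, with the context predictors assumed given (the paper trains an LSTM-based generative network for this): a frame is declared an event boundary when the visual context generated from past frames disagrees with the context predicted from future frames.

```python
# Sketch of the boundary test only; the context vectors are assumed to come
# from separate forward (past) and backward (future) predictors.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def event_boundaries(past_context, future_context, threshold=0.7):
    """past_context, future_context: (T, D) arrays of visual-context vectors
    predicted from the frames before / after each position.
    Returns the indices where the two predictions disagree (boundaries)."""
    return [t for t in range(len(past_context))
            if cosine(past_context[t], future_context[t]) < threshold]
```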
EGO-TOPO: Environment Affordances from Egocentric Video
First-person video naturally brings the use of a physical environment to the
forefront, since it shows the camera wearer interacting fluidly in a space
based on his intentions. However, current methods largely separate the observed
actions from the persistent space itself. We introduce a model for environment
affordances that is learned directly from egocentric video. The main idea is to
gain a human-centric model of a physical space (such as a kitchen) that
captures (1) the primary spatial zones of interaction and (2) the likely
activities they support. Our approach decomposes a space into a topological map
derived from first-person activity, organizing an ego-video into a series of
visits to the different zones. Further, we show how to link zones across
multiple related environments (e.g., from videos of multiple kitchens) to
obtain a consolidated representation of environment functionality. On
EPIC-Kitchens and EGTEA+, we demonstrate our approach for learning scene
affordances and anticipating future actions in long-form video.
Comment: Published in CVPR 2020, project page: http://vision.cs.utexas.edu/projects/ego-topo
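As a hypothetical illustration of the topological-map bookkeeping, the sketch below turns an ordered list of zone visits into a graph whose edges count traversals between zones and whose nodes accumulate the actions observed there; the actual EGO-TOPO model learns zone linking and affordances with neural networks, so zone assignments are assumed given here.

```python
# Hypothetical sketch of topological-map construction from zone visits.
from collections import defaultdict

def build_zone_graph(visits):
    """visits: ordered list of (zone_id, actions_observed) for one ego-video."""
    edges = defaultdict(int)             # (zone_a, zone_b) -> traversal count
    affordances = defaultdict(set)       # zone_id -> set of observed actions
    prev = None
    for zone, actions in visits:
        affordances[zone].update(actions)
        if prev is not None and prev != zone:
            edges[tuple(sorted((prev, zone)))] += 1
        prev = zone
    return edges, affordances

# toy usage: a kitchen video visiting sink -> stove -> sink
edges, aff = build_zone_graph([("sink", {"wash hands"}),
                               ("stove", {"fry egg"}),
                               ("sink", {"rinse pan"})])
```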