Search CORE

20 research outputs found

Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

Author: Damen Dima
Doughty Hazel
Farinella Giovanni Maria
Fidler Sanja
Furnari Antonino
Kazakos Evangelos
Moltisanti Davide
Munro Jonathan
Perrett Toby
Price Will
Wray Michael
Publication venue
Publication date: 31/07/2018
Field of study

First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict nonscripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labeled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. Dataset and Project page: http://epic-kitchens.github.ioComment: European Conference on Computer Vision (ECCV) 2018 Dataset and Project page: http://epic-kitchens.github.i

arXiv.org e-Print Archive

Crossref

Explore Bristol Research

Encouraging LSTMs to Anticipate Actions Very Early

Author: Aliakbarian Mohammad Sadegh
Andersson Lars
Fernando Basura
Petersson Lars
Saleh Fatemeh Sadat
Salzmann Mathieu
Publication venue
Publication date: 13/08/2017
Field of study

In contrast to the widely studied problem of recognizing an action given a complete sequence, action anticipation aims to identify the action from only partially available videos. As such, it is therefore key to the success of computer vision applications requiring to react as early as possible, such as autonomous navigation. In this paper, we propose a new action anticipation method that achieves high prediction accuracy even in the presence of a very small percentage of a video sequence. To this end, we develop a multi-stage LSTM architecture that leverages context-aware and action-aware features, and introduce a novel loss function that encourages the model to predict the correct class as early as possible. Our experiments on standard benchmark datasets evidence the benefits of our approach; We outperform the state-of-the-art action anticipation methods for early prediction by a relative increase in accuracy of 22.0% on JHMDB-21, 14.0% on UT-Interaction and 49.9% on UCF-101.Comment: 13 Pages, 7 Figures, 11 Tables. Accepted in ICCV 2017. arXiv admin note: text overlap with arXiv:1611.0552

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Anticipating Daily Intention using On-Wrist Motion Triggered Sensing

Author: Chan Cheng-Sheng
Chien Ting-An
Hu Chan-Wei
Sun Min
Wu Tz-Ying
Publication venue
Publication date: 20/10/2017
Field of study

Anticipating human intention by observing one's actions has many applications. For instance, picking up a cellphone, then a charger (actions) implies that one wants to charge the cellphone (intention). By anticipating the intention, an intelligent system can guide the user to the closest power outlet. We propose an on-wrist motion triggered sensing system for anticipating daily intentions, where the on-wrist sensors help us to persistently observe one's actions. The core of the system is a novel Recurrent Neural Network (RNN) and Policy Network (PN), where the RNN encodes visual and motion observation to anticipate intention, and the PN parsimoniously triggers the process of visual observation to reduce computation requirement. We jointly trained the whole network using policy gradient and cross-entropy loss. To evaluate, we collect the first daily "intention" dataset consisting of 2379 videos with 34 intentions and 164 unique action sequences. Our method achieves 92.68%, 90.85%, 97.56% accuracy on three users while processing only 29% of the visual observation on average

arXiv.org e-Print Archive

Crossref

VIENA2: A Driving Anticipation Dataset

Author: A Pentland
FS Saleh
G Ros
HS Koppula
JFP Kooij
L Wang
SR Richter
X Li
X Wang
Publication venue
Publication date: 29/10/2018
Field of study

Action anticipation is critical in scenarios where one needs to react before the action is finalized. This is, for instance, the case in automated driving, where a car needs to, e.g., avoid hitting pedestrians and respect traffic lights. While solutions have been proposed to tackle subsets of the driving anticipation tasks, by making use of diverse, task-specific sensors, there is no single dataset or framework that addresses them all in a consistent manner. In this paper, we therefore introduce a new, large-scale dataset, called VIENA2, covering 5 generic driving scenarios, with a total of 25 distinct action classes. It contains more than 15K full HD, 5s long videos acquired in various driving conditions, weathers, daytimes and environments, complemented with a common and realistic set of sensor measurements. This amounts to more than 2.25M frames, each annotated with an action label, corresponding to 600 samples per action class. We discuss our data acquisition strategy and the statistics of our dataset, and benchmark state-of-the-art action anticipation techniques, including a new multi-modal LSTM architecture with an effective loss function for action anticipation in driving scenarios.Comment: Accepted in ACCV 201

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Multiscale human activity recognition and anticipation network

Author: Everitt Aluna
Golodetz Stuart
Markham Andrew
Trigoni Niki
Xing Yang
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 27/05/2022
Field of study

Deep convolutional neural networks have been leveraged to achieve huge improvements in video understanding and human activity recognition performance in the past decade. However, most existing methods focus on activities that have similar time scales, leaving the task of action recognition on multiscale human behaviors less explored. In this study, a two-stream multiscale human activity recognition and anticipation (MS-HARA) network is proposed, which is jointly optimized using a multitask learning method. The MS-HARA network fuses the two streams of the network using an efficient temporal-channel attention (TCA)-based fusion approach to improve the model's representational ability for both temporal and spatial features. We investigate the multiscale human activities from two basic categories, namely, midterm activities and long-term activities. The network is designed to function as part of a real-time processing framework to support interaction and mutual understanding between humans and intelligent machines. It achieves state-of-the-art results on several datasets for different tasks and different application domains. The midterm and long-term action recognition and anticipation performance, as well as the network fusion, are extensively tested to show the efficiency of the proposed network. The results show that the MS-HARA network can easily be extended to different application domains

Cranfield CERES

Scaling Egocentric Vision:The EPIC-KITCHENS Dataset

Author: Damen Dima
Doughty Hazel
Farinella Giovanni
Fidler Sanja
Furnari Antonio
Kazakos Vangelis
Moltisanti Davide
Munro Jonathan
Perrett Toby
Wray Michael
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 06/10/2018
Field of study

Crossref

Explore Bristol Research