Segmental Spatiotemporal CNNs for Fine-grained Action Segmentation
Joint segmentation and classification of fine-grained actions is important
for applications of human-robot interaction, video surveillance, and human
skill evaluation. However, despite substantial recent progress in large-scale
action classification, the performance of state-of-the-art fine-grained action
recognition approaches remains low. We propose a model for action segmentation
which combines low-level spatiotemporal features with a high-level segmental
classifier. Our spatiotemporal CNN is comprised of a spatial component that
uses convolutional filters to capture information about objects and their
relationships, and a temporal component that uses large 1D convolutional
filters to capture information about how object relationships change across
time. These features are used in tandem with a semi-Markov model that models
transitions from one action to another. We introduce an efficient constrained
segmental inference algorithm for this model that is orders of magnitude faster
than the current approach. We highlight the effectiveness of our Segmental
Spatiotemporal CNN on cooking and surgical action datasets for which we observe
substantially improved performance relative to recent baseline methods.
Comment: Updated from the ECCV 2016 version. We fixed an important mathematical error and made the section on segmental inference clearer.
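To make the two-component design concrete, below is a minimal PyTorch sketch: a per-frame spatial CNN followed by a large 1D temporal convolution over the sequence of frame features. Layer sizes and names are illustrative assumptions, not the authors' architecture, and the semi-Markov segmental inference step is omitted.

```python
# Minimal sketch (not the paper's exact model): spatial features per frame,
# then large 1D temporal filters over how those features change across time.
import torch
import torch.nn as nn

class SpatioTemporalCNN(nn.Module):
    def __init__(self, num_classes, feat_dim=64, temporal_kernel=25):
        super().__init__()
        # Spatial component: captures objects and their relationships per frame.
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # -> (B*T, 32, 1, 1)
            nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        # Temporal component: a wide 1D filter bank over the frame features.
        self.temporal = nn.Conv1d(feat_dim, num_classes,
                                  kernel_size=temporal_kernel,
                                  padding=temporal_kernel // 2)

    def forward(self, video):                        # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)                 # (B*T, 3, H, W)
        feats = self.spatial(frames).view(b, t, -1)  # (B, T, feat_dim)
        return self.temporal(feats.transpose(1, 2))  # (B, num_classes, T)

# Per-frame class scores that a segmental (semi-Markov) model could decode.
scores = SpatioTemporalCNN(num_classes=10)(torch.randn(2, 100, 3, 64, 64))
```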
Stam: a framework for spatio-temporal affordance maps
Affordances have been introduced in the literature as action opportunities that objects offer, and have been used in robotics to semantically represent their interconnection. However, when considering an environment instead of an object, the problem becomes more complex due to the dynamism of its state. To tackle this issue, we introduce the concept of Spatio-Temporal Affordances (STA) and the Spatio-Temporal Affordance Map (STAM). Using this formalism, we encode action semantics related to the environment to improve the task execution capabilities of an autonomous robot. We experimentally validate our approach to support the execution of robot tasks by showing that affordances encode accurate semantics of the environment.
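As an illustration of the STA/STAM idea, the sketch below stores per-cell, per-time-slot action affordance scores. The class name, grid discretization, and interface are hypothetical and not taken from the paper.

```python
# Illustrative only: one way to index action affordances by place and time.
from collections import defaultdict

class SpatioTemporalAffordanceMap:
    """Maps a (grid cell, time slot) pair to action-affordance scores."""

    def __init__(self, cell_size=1.0, time_slot=3600):
        self.cell_size = cell_size           # metres per grid cell (assumed)
        self.time_slot = time_slot           # seconds per temporal bin (assumed)
        self._scores = defaultdict(dict)     # (cx, cy, ct) -> {action: score}

    def _key(self, x, y, t):
        return (int(x // self.cell_size), int(y // self.cell_size),
                int(t // self.time_slot))

    def update(self, x, y, t, action, score):
        self._scores[self._key(x, y, t)][action] = score

    def affordances(self, x, y, t):
        return dict(self._scores.get(self._key(x, y, t), {}))

stam = SpatioTemporalAffordanceMap()
stam.update(2.3, 4.1, t=9 * 3600, action="serve_coffee", score=0.9)
print(stam.affordances(2.3, 4.1, t=9 * 3600))
```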
VIENA2: A Driving Anticipation Dataset
Action anticipation is critical in scenarios where one needs to react before
the action is finalized. This is, for instance, the case in automated driving,
where a car needs to, e.g., avoid hitting pedestrians and respect traffic
lights. While solutions have been proposed to tackle subsets of the driving
anticipation tasks, by making use of diverse, task-specific sensors, there is
no single dataset or framework that addresses them all in a consistent manner.
In this paper, we therefore introduce a new, large-scale dataset, called
VIENA2, covering 5 generic driving scenarios, with a total of 25 distinct
action classes. It contains more than 15K full HD, 5s long videos acquired in
various driving conditions, weather, times of day, and environments, complemented
with a common and realistic set of sensor measurements. This amounts to more
than 2.25M frames, each annotated with an action label, corresponding to 600
samples per action class. We discuss our data acquisition strategy and the
statistics of our dataset, and benchmark state-of-the-art action anticipation
techniques, including a new multi-modal LSTM architecture with an effective
loss function for action anticipation in driving scenarios.
Comment: Accepted in ACCV 2018.
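A minimal sketch of a multi-modal LSTM anticipator of this kind follows, assuming pre-extracted per-frame visual features and raw sensor vectors. Dimensions and the simple concatenation-based fusion are illustrative; the paper's anticipation-specific loss is not reproduced here.

```python
# Hedged sketch: fuse visual and sensor streams, classify at every time step
# so a prediction is available before the action completes.
import torch
import torch.nn as nn

class MultiModalLSTM(nn.Module):
    def __init__(self, visual_dim, sensor_dim, num_classes, hidden=256):
        super().__init__()
        self.fuse = nn.Linear(visual_dim + sensor_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, visual, sensors):  # (B, T, visual_dim), (B, T, sensor_dim)
        x = torch.relu(self.fuse(torch.cat([visual, sensors], dim=-1)))
        h, _ = self.lstm(x)
        return self.head(h)              # (B, T, num_classes): per-step predictions

model = MultiModalLSTM(visual_dim=512, sensor_dim=6, num_classes=25)
logits = model(torch.randn(2, 30, 512), torch.randn(2, 30, 6))
```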
Graph Distillation for Action Detection with Privileged Modalities
We propose a technique that tackles action detection in multimodal videos
under a realistic and challenging condition in which only limited training data
and partially observed modalities are available. Common methods in transfer
learning do not take advantage of the extra modalities potentially available in
the source domain. On the other hand, previous work on multimodal learning only
focuses on a single domain or task and does not handle the modality discrepancy
between training and testing. In this work, we propose a method termed graph
distillation that incorporates rich privileged information from a large-scale
multimodal dataset in the source domain, and improves the learning in the
target domain where training data and modalities are scarce. We evaluate our
approach on action classification and detection tasks in multimodal videos, and
show that our model outperforms the state-of-the-art by a large margin on the
NTU RGB+D and PKU-MMD benchmarks. The code is released at
http://alan.vision/eccv18_graph/.
Comment: ECCV 2018.
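A rough sketch of the distillation idea, under stated assumptions: the target-modality student is pulled toward a graph-weighted mixture of the privileged modalities' predictions, in addition to its supervised loss. The function name, edge parameterization, and loss weighting below are illustrative, not the released implementation.

```python
# Illustrative graph-distillation loss: privileged modalities act as teachers,
# with learnable edge weights deciding how much each one contributes.
import torch
import torch.nn.functional as F

def graph_distillation_loss(student_logits, privileged_logits, edge_weights,
                            labels, temperature=2.0, alpha=0.5):
    """student_logits: (B, C); privileged_logits: list of (B, C); edge_weights: (M,)."""
    w = torch.softmax(edge_weights, dim=0)                   # normalize graph edges
    teacher = sum(wi * pl for wi, pl in zip(w, privileged_logits))
    soft_teacher = F.softmax(teacher / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    distill = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    supervised = F.cross_entropy(student_logits, labels)
    return supervised + alpha * (temperature ** 2) * distill
```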
CAR-Net: Clairvoyant Attentive Recurrent Network
We present an interpretable framework for path prediction that leverages
dependencies between agents' behaviors and their spatial navigation
environment. We exploit two sources of information: the past motion trajectory
of the agent of interest and a wide top-view image of the navigation scene. We
propose a Clairvoyant Attentive Recurrent Network (CAR-Net) that learns where
to look in a large image of the scene when solving the path prediction task.
Our method can attend to any area, or combination of areas, within the raw
image (e.g., road intersections) when predicting the trajectory of the agent.
This allows us to visualize fine-grained semantic elements of navigation scenes
that influence the prediction of trajectories. To study the impact of space on
agents' trajectories, we build a new dataset made of top-view images of
hundreds of scenes (Formula One racing tracks) where agents' behaviors are
heavily influenced by known areas in the images (e.g., upcoming turns). CAR-Net
successfully attends to these salient regions. Additionally, CAR-Net reaches
state-of-the-art accuracy on the standard trajectory forecasting benchmark,
Stanford Drone Dataset (SDD). Finally, we show CAR-Net's ability to generalize
to unseen scenes.
Comment: The 2nd and 3rd authors contributed equally.
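The attention mechanism can be sketched as follows, assuming the past trajectory is encoded by an LSTM and the top-view image has already been turned into a grid of feature vectors. This is a simplified stand-in for CAR-Net, not its exact architecture.

```python
# Minimal sketch: the trajectory encoding attends over scene features, and the
# attended context is used to predict the next displacement.
import torch
import torch.nn as nn

class AttentiveTrajectoryPredictor(nn.Module):
    def __init__(self, scene_dim=64, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(2, hidden, batch_first=True)   # past (x, y) deltas
        self.query = nn.Linear(hidden, scene_dim)
        self.decoder = nn.Linear(hidden + scene_dim, 2)       # next (x, y) delta

    def forward(self, past, scene):      # past: (B, T, 2); scene: (B, N, scene_dim)
        _, (h, _) = self.encoder(past)
        h = h[-1]                                              # (B, hidden)
        attn = torch.softmax(torch.einsum("bd,bnd->bn",
                                          self.query(h), scene), dim=-1)
        context = torch.einsum("bn,bnd->bd", attn, scene)      # where the net "looks"
        return self.decoder(torch.cat([h, context], dim=-1)), attn

pred, attn = AttentiveTrajectoryPredictor()(torch.randn(4, 8, 2),
                                            torch.randn(4, 100, 64))
```

Returning the attention weights alongside the prediction is what makes the framework interpretable: the weights can be projected back onto the top-view image to visualize the salient regions.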
Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video
We address the challenging task of anticipating human-object interaction in
first person videos. Most existing methods ignore how the camera wearer
interacts with the objects, or simply consider body motion as a separate
modality. In contrast, we observe that the intentional hand movement reveals
critical information about the future activity. Motivated by this, we adopt
intentional hand movement as a future representation and propose a novel deep
network that jointly models and predicts the egocentric hand motion,
interaction hotspots and future action. Specifically, we consider the future
hand motion as the motor attention, and model this attention using latent
variables in our deep model. The predicted motor attention is further used to
characterise the discriminative spatial-temporal visual features for predicting
actions and interaction hotspots. We present extensive experiments
demonstrating the benefit of the proposed joint model. Importantly, our model
produces new state-of-the-art results for action anticipation on both EGTEA
Gaze+ and the EPIC-Kitchens datasets. Our project page is available at
https://aptx4869lm.github.io/ForecastingHOI
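A hedged sketch of the joint idea, treating motor attention as a spatial map that re-weights shared video features before action classification; the module names and shapes are assumptions, and the paper's latent-variable formulation is not reproduced.

```python
# Illustrative only: predict a motor-attention map, then use it to pool the
# same features for action anticipation.
import torch
import torch.nn as nn

class MotorAttentionAnticipator(nn.Module):
    def __init__(self, in_ch=256, num_actions=20):
        super().__init__()
        self.attention = nn.Conv2d(in_ch, 1, kernel_size=1)   # future-hand heatmap
        self.classifier = nn.Linear(in_ch, num_actions)

    def forward(self, feats):            # feats: (B, C, H, W) video features
        attn = torch.sigmoid(self.attention(feats))            # (B, 1, H, W)
        pooled = (feats * attn).mean(dim=(2, 3))                # attention-weighted pool
        return self.classifier(pooled), attn

logits, attn = MotorAttentionAnticipator()(torch.randn(2, 256, 7, 7))
```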
A 3D Human Posture Approach for Activity Recognition Based on Depth Camera
Human activity recognition plays an important role in the context of Ambient Assisted Living (AAL), providing useful tools to improve people's quality of life. This work presents an activity recognition algorithm based on the extraction of skeleton joints from a depth camera. The system describes an activity using a small set of basic postures extracted by means of the X-means clustering algorithm. A multi-class Support Vector Machine, trained with Sequential Minimal Optimization (SMO), is employed to perform the classification. The system is evaluated on two public activity recognition datasets with different skeleton models: CAD-60 with 15 joints and TST with 25 joints. The proposed approach achieves precision/recall performances of 99.8% on CAD-60 and 97.2%/91.7% on TST. The results are promising for applied use in the context of AAL.
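The pipeline can be sketched with scikit-learn under two substitutions: KMeans with a fixed k stands in for X-means (which selects the number of clusters automatically), and SVC, whose libsvm solver is SMO-based, stands in for the multi-class SVM. The data below is synthetic and the clip length is arbitrary.

```python
# Sketch of the posture-based pipeline (stand-in components, synthetic data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 45))     # per-frame skeleton vectors (15 joints * 3)
labels = rng.integers(0, 4, size=50)    # one activity label per 10-frame clip

# 1) Cluster frames into a small set of basic postures.
postures = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(frames)

# 2) Describe each clip by its posture histogram.
hist = np.stack([np.bincount(postures[i * 10:(i + 1) * 10], minlength=8)
                 for i in range(50)])

# 3) Classify activities with a multi-class SVM.
clf = SVC(kernel="rbf").fit(hist, labels)
print(clf.score(hist, labels))
```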
Human activity learning for assistive robotics using a classifier ensemble
Assistive robots in ambient assisted living environments can be equipped with learning capabilities to effectively learn and execute human activities. This paper proposes a human activity learning (HAL) system for application in assistive robotics. An RGB-depth sensor is used to acquire information about human activities, and a set of statistical, spatial, and temporal features encoding key aspects of those activities is extracted. Redundant features are removed, and the relevant features are used in the HAL model. An ensemble of three individual classifiers (support vector machines (SVMs), K-nearest neighbour, and random forest) is employed to learn the activities. The proposed system improves on the performance of methods that use a single classifier. The approach is evaluated on an experimental dataset created for this work and on a benchmark dataset, the Cornell Activity Dataset (CAD-60). Experimental results show that the overall performance achieved by the proposed system is comparable to the state of the art and has the potential to benefit applications in assistive robots by reducing the time spent learning activities.
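The ensemble stage maps naturally onto scikit-learn's VotingClassifier, as sketched below. The feature extraction and selection steps described above are not shown, and the synthetic data and hyperparameters are placeholders.

```python
# Illustrative ensemble of SVM, KNN, and random forest with soft voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_classes=4,
                           n_informative=8, random_state=0)

ensemble = VotingClassifier(estimators=[
    ("svm", SVC(probability=True)),      # probability=True enables soft voting
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
], voting="soft")

ensemble.fit(X, y)
print(ensemble.score(X, y))
```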