18 research outputs found
What can a cook in Italy teach a mechanic in India? Action Recognition Generalisation Over Scenarios and Locations
We propose and address a new generalisation problem: can a model trained for
action recognition successfully classify actions when they are performed within
a previously unseen scenario and in a previously unseen location? To answer
this question, we introduce the Action Recognition Generalisation Over
scenarios and locations dataset (ARGO1M), which contains 1.1M video clips from
the large-scale Ego4D dataset, across 10 scenarios and 13 locations. We
demonstrate recognition models struggle to generalise over 10 proposed test
splits, each of an unseen scenario in an unseen location. We thus propose CIR,
a method to represent each video as a Cross-Instance Reconstruction of videos
from other domains. Reconstructions are paired with text narrations to guide
the learning of a domain generalisable representation. We provide extensive
analysis and ablations on ARGO1M that show CIR outperforms prior domain
generalisation works on all test splits. Code and data:
https://chiaraplizz.github.io/what-can-a-cook/.Comment: Accepted at ICCV 2023. Project page:
https://chiaraplizz.github.io/what-can-a-cook
Person Re-ID by Fusion of Video Silhouettes and Wearable Signals for Home Monitoring Applications
The use of visual sensors for monitoring people in their living environments is critical in processing more accurate health measurements, but their use is undermined by the issue of privacy. Silhouettes, generated from RGB video, can help towards alleviating the issue of privacy to some considerable degree. However, the use of silhouettes would make it rather complex to discriminate between different subjects, preventing a subject-tailored analysis of the data within a free-living, multi-occupancy home. This limitation can be overcome with a strategic fusion of sensors that involves wearable accelerometer devices, which can be used in conjunction with the silhouette video data, to match video clips to a specific patient being monitored. The proposed method simultaneously solves the problem of Person ReID using silhouettes and enables home monitoring systems to employ sensor fusion techniques for data analysis. We develop a multimodal deep-learning detection framework that maps short video clips and accelerations into a latent space where the Euclidean distance can be measured to match video and acceleration streams. We train our method on the SPHERE Calorie Dataset, for which we show an average area under the ROC curve of 76.3% and an assignment accuracy of 77.4%. In addition, we propose a novel triplet loss for which we demonstrate improving performances and convergence speed
Temporal-Relational CrossTransformers for Few-Shot Action Recognition
We propose a novel approach to few-shot action recognition, finding
temporally-corresponding frame tuples between the query and videos in the
support set. Distinct from previous few-shot works, we construct class
prototypes using the CrossTransformer attention mechanism to observe relevant
sub-sequences of all support videos, rather than using class averages or single
best matches. Video representations are formed from ordered tuples of varying
numbers of frames, which allows sub-sequences of actions at different speeds
and temporal offsets to be compared.
Our proposed Temporal-Relational CrossTransformers (TRX) achieve
state-of-the-art results on few-shot splits of Kinetics, Something-Something V2
(SSv2), HMDB51 and UCF101. Importantly, our method outperforms prior work on
SSv2 by a wide margin (12%) due to the its ability to model temporal relations.
A detailed ablation showcases the importance of matching to multiple support
set videos and learning higher-order relational CrossTransformers.Comment: Accepted in CVPR 202
Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
First-person vision is gaining interest as it offers a unique viewpoint on
people's interaction with objects, their attention, and even intention.
However, progress in this challenging domain has been relatively slow due to
the lack of sufficiently large datasets. In this paper, we introduce
EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32
participants in their native kitchen environments. Our videos depict
nonscripted daily activities: we simply asked each participant to start
recording every time they entered their kitchen. Recording took place in 4
cities (in North America and Europe) by participants belonging to 10 different
nationalities, resulting in highly diverse cooking styles. Our dataset features
55 hours of video consisting of 11.5M frames, which we densely labeled for a
total of 39.6K action segments and 454.3K object bounding boxes. Our annotation
is unique in that we had the participants narrate their own videos (after
recording), thus reflecting true intention, and we crowd-sourced ground-truths
based on these. We describe our object, action and anticipation challenges, and
evaluate several baselines over two test splits, seen and unseen kitchens.
Dataset and Project page: http://epic-kitchens.github.ioComment: European Conference on Computer Vision (ECCV) 2018 Dataset and
Project page: http://epic-kitchens.github.i