DDLSTM: Dual-Domain LSTM for Cross-Dataset Action Recognition
Domain alignment in convolutional networks aims to learn the degree of
layer-specific feature alignment beneficial to the joint learning of source and
target datasets. While domain alignment is increasingly popular in convolutional
networks, there have been no previous attempts to achieve it in recurrent
networks. Similar to spatial features, both source and target domains are
likely to exhibit temporal dependencies that can be jointly learnt and aligned.
In this paper we introduce Dual-Domain LSTM (DDLSTM), an architecture that is
able to learn temporal dependencies from two domains concurrently. It performs
cross-contaminated batch normalisation on both input-to-hidden and
hidden-to-hidden weights, and learns the parameters for cross-contamination,
for both single-layer and multi-layer LSTM architectures. We evaluate DDLSTM on
frame-level action recognition using three datasets, taking a pair at a time,
and report an average increase in accuracy of 3.5%. The proposed DDLSTM
architecture outperforms standard, fine-tuned, and batch-normalised LSTMs.
Comment: To appear in CVPR 2019
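To make the cross-contamination idea concrete, the following is a minimal PyTorch sketch, assuming a single-layer dual-domain cell with per-domain batch normalisation of the input-to-hidden and hidden-to-hidden pre-activations and a learned 2x2 mixing matrix (`alpha`); it illustrates the mechanism only and is not the authors' implementation.

```python
# Minimal sketch of a dual-domain LSTM cell with cross-contaminated batch
# normalisation. The class name, the mixing matrix `alpha`, and the softmax
# mixing scheme are assumptions made for this example.
import torch
import torch.nn as nn

class DualDomainLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.w_ih = nn.Linear(input_size, 4 * hidden_size, bias=False)
        self.w_hh = nn.Linear(hidden_size, 4 * hidden_size, bias=False)
        # Separate batch-norm statistics per domain for both projections.
        self.bn_ih = nn.ModuleList([nn.BatchNorm1d(4 * hidden_size) for _ in range(2)])
        self.bn_hh = nn.ModuleList([nn.BatchNorm1d(4 * hidden_size) for _ in range(2)])
        # Learned cross-contamination weights: row d mixes the two domains'
        # normalised pre-activations to produce domain d's input.
        self.alpha = nn.Parameter(torch.eye(2))
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))

    def _mix(self, bns, z):
        # z: per-domain pre-activations; normalise each, then cross-contaminate.
        normed = [bn(zi) for bn, zi in zip(bns, z)]
        w = torch.softmax(self.alpha, dim=1)
        return [w[d, 0] * normed[0] + w[d, 1] * normed[1] for d in range(2)]

    def forward(self, x, state):
        # x, h, c: lists of length 2, one (batch, feat) tensor per domain.
        h, c = state
        ih = self._mix(self.bn_ih, [self.w_ih(xd) for xd in x])
        hh = self._mix(self.bn_hh, [self.w_hh(hd) for hd in h])
        new_h, new_c = [], []
        for d in range(2):
            gates = ih[d] + hh[d] + self.bias
            i, f, g, o = gates.chunk(4, dim=1)
            cd = torch.sigmoid(f) * c[d] + torch.sigmoid(i) * torch.tanh(g)
            new_c.append(cd)
            new_h.append(torch.sigmoid(o) * torch.tanh(cd))
        return new_h, new_c

# Toy usage: one timestep with a batch from each domain.
cell = DualDomainLSTMCell(input_size=16, hidden_size=32)
x = [torch.randn(8, 16), torch.randn(8, 16)]
h = [torch.zeros(8, 32), torch.zeros(8, 32)]
c = [torch.zeros(8, 32), torch.zeros(8, 32)]
h, c = cell(x, (h, c))
```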
Centre Stage: Centricity-based Audio-Visual Temporal Action Detection
Previous one-stage action detection approaches have modelled temporal
dependencies using only the visual modality. In this paper, we explore
different strategies to incorporate the audio modality, using multi-scale
cross-attention to fuse the two modalities. We also demonstrate the correlation
between the distance from the timestep to the action centre and the accuracy of
the predicted boundaries. Thus, we propose a novel network head to estimate the
closeness of timesteps to the action centre, which we call the centricity
score. This leads to increased confidence for proposals that exhibit more
precise boundaries. Our method can be integrated with other one-stage
anchor-free architectures and we demonstrate this on three recent baselines on
the EPIC-Kitchens-100 action detection benchmark where we achieve
state-of-the-art performance. Detailed ablation studies showcase the benefits
of fusing audio and our proposed centricity scores. Code and models for our
proposed method are publicly available at
https://github.com/hanielwang/Audio-Visual-TAD.git
Comment: Accepted to VUA workshop at BMVC 2023
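As an illustration of how a centricity score can be used, here is a small PyTorch sketch, assuming fused audio-visual temporal features of shape (B, C, T) and a two-layer convolutional head; both are assumptions for the example, and the released code at the repository above is the authoritative implementation.

```python
# Illustrative centricity head for a one-stage, anchor-free temporal detector.
import torch
import torch.nn as nn

class CentricityHead(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, feats):
        # One score in [0, 1] per timestep: high when the timestep lies close
        # to the centre of an action instance.
        return torch.sigmoid(self.net(feats)).squeeze(1)  # (B, T)

# At inference, per-timestep classification scores are scaled by centricity so
# that proposals anchored near action centres (which tend to have more precise
# boundaries) receive higher confidence.
B, C, T, num_classes = 2, 256, 128, 97
feats = torch.randn(B, C, T)                          # fused audio-visual features
cls_scores = torch.rand(B, num_classes, T)            # from the classification head
centricity = CentricityHead(C)(feats)                 # (B, T)
final_scores = cls_scores * centricity.unsqueeze(1)   # (B, num_classes, T)
```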
What can a cook in Italy teach a mechanic in India? Action Recognition Generalisation Over Scenarios and Locations
We propose and address a new generalisation problem: can a model trained for
action recognition successfully classify actions when they are performed within
a previously unseen scenario and in a previously unseen location? To answer
this question, we introduce the Action Recognition Generalisation Over
scenarios and locations dataset (ARGO1M), which contains 1.1M video clips from
the large-scale Ego4D dataset, across 10 scenarios and 13 locations. We
demonstrate recognition models struggle to generalise over 10 proposed test
splits, each of an unseen scenario in an unseen location. We thus propose CIR,
a method to represent each video as a Cross-Instance Reconstruction of videos
from other domains. Reconstructions are paired with text narrations to guide
the learning of a domain generalisable representation. We provide extensive
analysis and ablations on ARGO1M that show CIR outperforms prior domain
generalisation works on all test splits. Code and data:
https://chiaraplizz.github.io/what-can-a-cook/.
Comment: Accepted at ICCV 2023. Project page: https://chiaraplizz.github.io/what-can-a-cook
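A minimal sketch of the cross-instance reconstruction idea follows, assuming single-head dot-product attention, a fixed temperature, and a mask built from scenario/location labels; these are illustrative choices rather than the paper's exact formulation, and the narration-guided training objective is omitted.

```python
# Each video feature is re-expressed as an attention-weighted combination of
# features from *other* domains only.
import torch
import torch.nn.functional as F

def cross_instance_reconstruction(feats, domain_ids, temperature=0.1):
    """feats: (N, D) video features; domain_ids: (N,) scenario/location labels."""
    sim = feats @ feats.t() / temperature                  # (N, N) similarities
    same_domain = domain_ids.unsqueeze(0) == domain_ids.unsqueeze(1)
    sim = sim.masked_fill(same_domain, float('-inf'))      # keep other domains only
    weights = F.softmax(sim, dim=1)                        # reconstruction weights
    return weights @ feats                                 # (N, D) reconstructions

# Toy usage; in the paper the reconstructions are additionally paired with text
# narrations to guide a domain-generalisable representation.
feats = F.normalize(torch.randn(6, 64), dim=1)
domain_ids = torch.tensor([0, 0, 1, 1, 2, 2])
recon = cross_instance_reconstruction(feats, domain_ids)
```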
Use Your Head: Improving Long-Tail Video Recognition
This paper presents an investigation into long-tail video recognition. We
demonstrate that, unlike naturally-collected video datasets and existing
long-tail image benchmarks, current video benchmarks fall short on multiple
long-tailed properties. Most critically, they lack few-shot classes in their
tails. In response, we propose new video benchmarks that better assess
long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT.
We then propose a method, Long-Tail Mixed Reconstruction (LMR), which reduces
overfitting to instances from few-shot classes by reconstructing them as
weighted combinations of samples from head classes. LMR then employs label
mixing to learn robust decision boundaries. It achieves state-of-the-art
average class accuracy on EPIC-KITCHENS and the proposed SSv2-LT and
VideoLT-LT. Benchmarks and code at: tobyperrett.github.io/lmr
Comment: CVPR 2023
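To illustrate the reconstruction-and-label-mixing step, here is a simplified sketch, assuming similarity-weighted reconstruction from head-class samples within a batch and a fixed mixing coefficient (`mix`); both are assumptions for the example rather than the method's exact details.

```python
# Tail (few-shot) samples in a batch are reconstructed as similarity-weighted
# combinations of head-class samples, and their labels are mixed accordingly.
import torch
import torch.nn.functional as F

def lmr_reconstruct(feats, labels_onehot, is_tail, mix=0.5, temperature=0.1):
    """feats: (N, D); labels_onehot: (N, C) float one-hot; is_tail: (N,) bool."""
    head_feats = feats[~is_tail]                       # (H, D) head-class samples
    head_labels = labels_onehot[~is_tail].float()      # (H, C)
    sim = feats[is_tail] @ head_feats.t() / temperature
    w = F.softmax(sim, dim=1)                          # (T, H) reconstruction weights
    recon_feats = w @ head_feats                       # reconstructed tail features
    recon_labels = w @ head_labels                     # mixed soft labels
    out_feats, out_labels = feats.clone(), labels_onehot.clone().float()
    out_feats[is_tail] = mix * feats[is_tail] + (1 - mix) * recon_feats
    out_labels[is_tail] = mix * labels_onehot[is_tail].float() + (1 - mix) * recon_labels
    return out_feats, out_labels

# Toy usage: the last three samples belong to few-shot (tail) classes.
feats = torch.randn(8, 32)
labels = F.one_hot(torch.tensor([0, 1, 2, 0, 1, 5, 6, 7]), num_classes=10).float()
is_tail = torch.tensor([False, False, False, False, False, True, True, True])
mixed_feats, mixed_labels = lmr_reconstruct(feats, labels, is_tail)
```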
Person Re-ID by Fusion of Video Silhouettes and Wearable Signals for Home Monitoring Applications
Visual sensors for monitoring people in their living environments enable more accurate health measurements, but their use is undermined by privacy concerns. Silhouettes, generated from RGB video, can considerably alleviate these concerns. However, silhouettes make it difficult to discriminate between different subjects, preventing subject-tailored analysis of the data within a free-living, multi-occupancy home. This limitation can be overcome by a strategic fusion of sensors: wearable accelerometer devices, used in conjunction with the silhouette video data, allow video clips to be matched to the specific patient being monitored. The proposed method simultaneously solves the problem of Person Re-ID using silhouettes and enables home monitoring systems to employ sensor fusion techniques for data analysis. We develop a multimodal deep-learning detection framework that maps short video clips and acceleration signals into a latent space, where the Euclidean distance between embeddings is used to match video and acceleration streams. We train our method on the SPHERE Calorie Dataset, for which we show an average area under the ROC curve of 76.3% and an assignment accuracy of 77.4%. In addition, we propose a novel triplet loss, for which we demonstrate improved performance and convergence speed.
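A compact sketch of the kind of two-stream embedding model described above, assuming a 3D-convolutional silhouette-clip encoder, a 1D-convolutional accelerometer encoder, and the standard triplet margin loss; the paper's actual architecture and its proposed triplet loss variant differ.

```python
# Two encoders map silhouette clips and acceleration windows into a shared
# latent space where Euclidean distance matches the two streams.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipEncoder(nn.Module):
    """Encodes a short silhouette clip (B, 1, T, H, W) into an embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(16, dim)

    def forward(self, x):
        return F.normalize(self.fc(self.conv(x).flatten(1)), dim=1)

class AccelEncoder(nn.Module):
    """Encodes a tri-axial acceleration window (B, 3, T) into an embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(3, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(16, dim)

    def forward(self, x):
        return F.normalize(self.fc(self.conv(x).flatten(1)), dim=1)

# Triplet objective: a clip embedding (anchor) should lie closer to the
# matching wearer's acceleration embedding than to another occupant's.
clip_enc, acc_enc = ClipEncoder(), AccelEncoder()
anchor = clip_enc(torch.randn(4, 1, 8, 32, 32))
positive = acc_enc(torch.randn(4, 3, 100))
negative = acc_enc(torch.randn(4, 3, 100))
loss = F.triplet_margin_loss(anchor, positive, negative, margin=0.3)
```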
Meta-Learning with Context-Agnostic Initialisations
Meta-learning approaches have addressed few-shot problems by finding
initialisations suited for fine-tuning to target tasks. Often there are
additional properties within training data (which we refer to as context), not
relevant to the target task, which act as a distractor to meta-learning,
particularly when the target task contains examples from a novel context not
seen during training. We address this oversight by incorporating a
context-adversarial component into the meta-learning process. This produces an
initialisation for fine-tuning to target which is both context-agnostic and
task-generalised. We evaluate our approach on three commonly used meta-learning
algorithms and two problems. We demonstrate our context-agnostic meta-learning
improves results in each case. First, we report on Omniglot few-shot character
classification, using alphabets as context. An average improvement of 4.3% is
observed across methods and tasks when classifying characters from an unseen
alphabet. Second, we evaluate on a dataset for personalised energy expenditure
predictions from video, using participant knowledge as context. We demonstrate
that context-agnostic meta-learning decreases the average mean square error by
30%.
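To illustrate the context-adversarial component, here is a small sketch using a gradient-reversal layer and a context classifier head; the integration with a specific meta-learner (e.g. MAML inner/outer loops) is omitted, and all module names and sizes are illustrative assumptions.

```python
# Reversed gradients push the shared feature extractor to remove context
# information (e.g. alphabet or participant identity) while the task head
# still learns the target task, yielding context-agnostic features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        # Negate (and scale) the gradient flowing back into the features.
        return -ctx.lam * grad_out, None

features = nn.Sequential(nn.Linear(64, 128), nn.ReLU())
task_head = nn.Linear(128, 5)        # e.g. 5-way character classification
context_head = nn.Linear(128, 20)    # e.g. 20 training contexts (alphabets)

x = torch.randn(32, 64)
y_task = torch.randint(0, 5, (32,))
y_context = torch.randint(0, 20, (32,))

z = features(x)
task_loss = F.cross_entropy(task_head(z), y_task)
# The context head tries to predict context; the reversal layer makes the
# feature extractor adversarial to that prediction.
ctx_loss = F.cross_entropy(context_head(GradReverse.apply(z, 1.0)), y_context)
(task_loss + ctx_loss).backward()
```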