DDLSTM: Dual-Domain LSTM for Cross-Dataset Action Recognition
Domain alignment in convolutional networks aims to learn the degree of
layer-specific feature alignment beneficial to the joint learning of source and
target datasets. While increasingly popular in convolutional networks, domain
alignment has not previously been attempted in recurrent networks. Similar to
spatial features, both source and target domains are
likely to exhibit temporal dependencies that can be jointly learnt and aligned.
In this paper we introduce Dual-Domain LSTM (DDLSTM), an architecture that is
able to learn temporal dependencies from two domains concurrently. It performs
cross-contaminated batch normalisation on both input-to-hidden and
hidden-to-hidden weights, and learns the parameters for cross-contamination,
for both single-layer and multi-layer LSTM architectures. We evaluate DDLSTM on
frame-level action recognition using three datasets, taking a pair at a time,
and report an average increase in accuracy of 3.5%. The proposed DDLSTM
architecture outperforms standard, fine-tuned, and batch-normalised LSTMs.
Comment: To appear in CVPR 2019
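The cross-contaminated batch normalisation described above can be pictured as
normalising each domain's pre-activations separately and then mixing the two
streams with learned weights. The sketch below is a minimal illustration under
that assumption; the module name CrossContaminatedBN, the 2x2 mixing matrix,
and its initialisation are illustrative choices, not taken from the paper's
implementation.

    import torch
    import torch.nn as nn

    class CrossContaminatedBN(nn.Module):
        """Illustrative sketch: batch-normalise each domain's pre-activations
        separately, then mix the two streams with a learned 2x2 matrix."""
        def __init__(self, features):
            super().__init__()
            self.bn_a = nn.BatchNorm1d(features)   # source-domain statistics
            self.bn_b = nn.BatchNorm1d(features)   # target-domain statistics
            # learnable cross-contamination weights, initialised near identity
            self.mix = nn.Parameter(torch.tensor([[0.9, 0.1], [0.1, 0.9]]))

        def forward(self, x_a, x_b):
            na, nb = self.bn_a(x_a), self.bn_b(x_b)
            out_a = self.mix[0, 0] * na + self.mix[0, 1] * nb
            out_b = self.mix[1, 0] * na + self.mix[1, 1] * nb
            return out_a, out_b

In a DDLSTM-style layer, a module like this would sit on both the
input-to-hidden and hidden-to-hidden pre-activations, so every LSTM step sees
statistics from both domains; training therefore requires batches drawn from
the two datasets simultaneously.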
Action Recognition from Single Timestamp Supervision in Untrimmed Videos
Recognising actions in videos relies on labelled supervision during training,
typically the start and end times of each action instance. This supervision is
not only subjective, but also expensive to acquire. Weak video-level
supervision has been successfully exploited for recognition in untrimmed
videos; however, it is challenged when the number of different actions in
training videos increases. We propose a method that is supervised by single
timestamps located around each action instance, in untrimmed videos. We replace
expensive action bounds with sampling distributions initialised from these
timestamps. We then use the classifier's response to iteratively update the
sampling distributions. We demonstrate that these distributions converge to the
location and extent of discriminative action segments. We evaluate our method
on three datasets for fine-grained recognition, with an increasing number of
different actions per video, and show that single timestamps offer a reasonable
compromise between recognition performance and labelling effort, performing
comparably to full temporal supervision. Our update method improves top-1 test
accuracy by up to 5.4% across the evaluated datasets.
Comment: CVPR 2019
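As a rough illustration of replacing action bounds with sampling
distributions, the sketch below initialises a per-annotation distribution
around a single timestamp and re-weights it with the classifier's per-frame
response for the annotated class. The Gaussian shape, the update rule, and
parameter names such as sigma and lr are assumptions made for illustration;
the paper's actual distribution and update differ in detail.

    import numpy as np

    def init_distribution(timestamp, num_frames, sigma=15.0):
        """Sampling distribution centred on a single annotated timestamp
        (Gaussian here purely for illustration)."""
        frames = np.arange(num_frames)
        g = np.exp(-0.5 * ((frames - timestamp) / sigma) ** 2)
        return g / g.sum()

    def update_distribution(dist, class_scores, lr=0.5):
        """Move probability mass toward frames the classifier scores highly
        for the annotated class. class_scores: per-frame softmax scores."""
        target = dist * class_scores
        target /= target.sum() + 1e-8
        new_dist = (1.0 - lr) * dist + lr * target
        return new_dist / new_dist.sum()

Frames used to train the classifier are sampled from these distributions, and
repeating the update lets them drift towards the discriminative segment around
each timestamp, which is the convergence behaviour the abstract reports.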
Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination
We present a method for assessing skill from video, applicable to a variety
of tasks, ranging from surgery to drawing and rolling pizza dough. We formulate
the problem as pairwise (who's better?) and overall (who's best?) ranking of
video collections, using supervised deep ranking. We propose a novel loss
function that learns discriminative features when a pair of videos differs in
skill, and learns shared features when the two videos exhibit comparable skill
levels. Results demonstrate our method is applicable across
tasks, with the percentage of correctly ordered pairs of videos ranging from
70% to 83% for four datasets. We demonstrate the robustness of our approach via
sensitivity analysis of its parameters. We see this work as effort toward the
automated organization of how-to video collections and overall, generic skill
determination in video.
Comment: CVPR 2018
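A hedged sketch of the two-part objective described above: a margin ranking
term for pairs with a clear skill difference, and a feature-similarity term
for pairs of comparable skill. The exact loss, the margin, and how pairs are
labelled as comparable are assumptions here, not the paper's precise
formulation.

    import torch
    import torch.nn.functional as F

    def skill_pair_loss(score_hi, score_lo, feat_hi, feat_lo, comparable, margin=1.0):
        """score_hi/score_lo: scalar skill scores for the higher- and
        lower-ranked video in each pair; feat_hi/feat_lo: their embeddings;
        comparable: boolean mask marking pairs judged to have similar skill."""
        comparable = comparable.float()
        # discriminative term: the better video should score higher by a margin
        rank = F.relu(margin - (score_hi - score_lo))
        # shared-feature term: comparable pairs should have similar embeddings
        sim = (feat_hi - feat_lo).pow(2).sum(dim=1)
        return ((1.0 - comparable) * rank + comparable * sim).mean()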
Play It Back: Iterative Attention for Audio Recognition
A key function of auditory cognition is the association of characteristic
sounds with their corresponding semantics over time. Humans attempting to
discriminate between fine-grained audio categories often replay the same
discriminative sounds to increase their prediction confidence. We propose an
end-to-end attention-based architecture that through selective repetition
attends over the most discriminative sounds across the audio sequence. Our
model initially uses the full audio sequence and iteratively refines the
temporal segments replayed based on slot attention. At each playback, the
selected segments are replayed using a smaller hop length, which yields
higher-resolution features within these segments. We show that our method can
consistently achieve state-of-the-art performance across three
audio-classification benchmarks: AudioSet, VGG-Sound, and EPIC-KITCHENS-100.
Comment: Accepted at IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP) 2023
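The replay mechanism can be pictured as attention-based selection of temporal
segments, followed by re-extraction of those segments at a smaller hop length.
The sketch below uses a single learnable query in place of the paper's slot
attention; the module name, the feature dimension, and the top-k selection are
illustrative assumptions.

    import torch
    import torch.nn as nn

    class SegmentSelector(nn.Module):
        """Illustrative stand-in for slot attention: score pooled features of
        fixed temporal segments and pick the top-k to replay at a finer hop."""
        def __init__(self, dim):
            super().__init__()
            self.query = nn.Parameter(torch.randn(dim))
            self.scale = dim ** -0.5

        def forward(self, seg_feats, k=2):
            # seg_feats: (num_segments, dim), one pooled feature per segment
            attn = torch.softmax(seg_feats @ self.query * self.scale, dim=0)
            replay_idx = attn.topk(k).indices
            return replay_idx, attn

    # usage: the selected segments would be re-encoded with a smaller hop
    # length (higher temporal resolution) before the next classification pass
    selector = SegmentSelector(dim=128)
    idx, weights = selector(torch.randn(10, 128), k=2)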
Multi-Modal Domain Adaptation for Fine-Grained Action Recognition
Fine-grained action recognition datasets exhibit environmental bias, where
multiple video sequences are captured from a limited number of environments.
Training a model in one environment and deploying in another results in a drop
in performance due to an unavoidable domain shift. Unsupervised Domain
Adaptation (UDA) approaches have frequently utilised adversarial training
between the source and target domains. However, these approaches have not
explored the multi-modal nature of video within each domain. In this work we
exploit the correspondence of modalities as a self-supervised alignment
approach for UDA in addition to adversarial alignment.
We test our approach on three kitchens from our large-scale dataset,
EPIC-Kitchens, using two modalities commonly employed for action recognition:
RGB and Optical Flow. We show that multi-modal self-supervision alone improves
the performance over source-only training by 2.4% on average. We then combine
adversarial training with multi-modal self-supervision, showing that our
approach outperforms other UDA methods by 3%.
Comment: Accepted to CVPR 2020 for an oral presentation
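The multi-modal self-supervision described above can be sketched as a
correspondence task: predict whether an RGB feature and a Flow feature come
from the same clip, which needs no action labels and can therefore be trained
on both domains. The head architecture, feature dimensions, and
negative-sampling scheme below are illustrative assumptions rather than the
paper's exact design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CorrespondenceHead(nn.Module):
        """Binary classifier: do this RGB feature and this Flow feature
        come from the same clip?"""
        def __init__(self, rgb_dim, flow_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(rgb_dim + flow_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 2))

        def forward(self, rgb_feat, flow_feat):
            return self.net(torch.cat([rgb_feat, flow_feat], dim=1))

    def correspondence_loss(head, rgb, flow):
        """Positives pair matching clips; negatives shuffle Flow within the
        batch. Usable on source and target alike, since no labels are needed."""
        n = rgb.size(0)
        pos = head(rgb, flow)
        neg = head(rgb, flow[torch.randperm(n)])
        labels = torch.cat([torch.ones(n), torch.zeros(n)]).long()
        return F.cross_entropy(torch.cat([pos, neg]), labels)

In the setting the abstract describes, a loss of this kind would be combined
with the usual classification loss on the labelled source domain and with an
adversarial domain-alignment loss.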