6,576 research outputs found
DDLSTM: Dual-Domain LSTM for Cross-Dataset Action Recognition
Domain alignment in convolutional networks aims to learn the degree of
layer-specific feature alignment beneficial to the joint learning of source and
target datasets. While increasingly popular in convolutional networks, there
have been no previous attempts to achieve domain alignment in recurrent
networks. Similar to spatial features, both source and target domains are
likely to exhibit temporal dependencies that can be jointly learnt and aligned.
In this paper we introduce Dual-Domain LSTM (DDLSTM), an architecture that is
able to learn temporal dependencies from two domains concurrently. It performs
cross-contaminated batch normalisation on both input-to-hidden and
hidden-to-hidden weights, and learns the parameters for cross-contamination,
for both single-layer and multi-layer LSTM architectures. We evaluate DDLSTM on
frame-level action recognition using three datasets, taking a pair at a time,
and report an average increase in accuracy of 3.5%. The proposed DDLSTM
architecture outperforms standard, fine-tuned, and batch-normalised LSTMs.
Comment: To appear in CVPR 201
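The cross-contamination idea above can be sketched as batch normalisation in which each domain is normalised by a learned mix of both domains' statistics. This is a minimal illustrative sketch, not the authors' exact DDLSTM formulation; the function name and the single scalar `alpha` are assumptions.

```python
import numpy as np

def cross_contaminated_bn(x_src, x_tgt, alpha, eps=1e-5):
    """Normalise each domain's activations with statistics mixed across
    domains. `alpha` in [0, 1] weights a domain's own statistics;
    (1 - alpha) weights the other domain's. Illustrative sketch only:
    in DDLSTM the mixing weights are learned, not hand-set.
    """
    mu_s, var_s = x_src.mean(0), x_src.var(0)
    mu_t, var_t = x_tgt.mean(0), x_tgt.var(0)
    # Each domain is normalised by a convex mix of both domains' stats.
    mu_for_src = alpha * mu_s + (1 - alpha) * mu_t
    var_for_src = alpha * var_s + (1 - alpha) * var_t
    mu_for_tgt = alpha * mu_t + (1 - alpha) * mu_s
    var_for_tgt = alpha * var_t + (1 - alpha) * var_s
    x_src_n = (x_src - mu_for_src) / np.sqrt(var_for_src + eps)
    x_tgt_n = (x_tgt - mu_for_tgt) / np.sqrt(var_for_tgt + eps)
    return x_src_n, x_tgt_n
```

With `alpha = 1` this reduces to independent per-domain batch normalisation; smaller `alpha` pulls each domain's normalisation toward the other domain's statistics, which is the alignment knob the architecture learns.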
Action Recognition from Single Timestamp Supervision in Untrimmed Videos
Recognising actions in videos relies on labelled supervision during training,
typically the start and end times of each action instance. This supervision is
not only subjective, but also expensive to acquire. Weak video-level
supervision has been successfully exploited for recognition in untrimmed
videos; however, it is challenged when the number of different actions in
training videos increases. We propose a method that is supervised by single
timestamps located around each action instance, in untrimmed videos. We replace
expensive action bounds with sampling distributions initialised from these
timestamps. We then use the classifier's response to iteratively update the
sampling distributions. We demonstrate that these distributions converge to the
location and extent of discriminative action segments. We evaluate our method
on three datasets for fine-grained recognition, with an increasing number of
different actions per video, and show that single timestamps offer a reasonable
compromise between recognition performance and labelling effort, performing
comparably to full temporal supervision. Our update method improves top-1 test
accuracy by up to 5.4% across the evaluated datasets.
Comment: CVPR 201
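The iterative update described above can be sketched as follows: a sampling distribution initialised at a single timestamp is nudged toward frames the classifier scores highly. A Gaussian stand-in is used here for the distribution, and all names are illustrative assumptions rather than the paper's exact plateau model.

```python
import numpy as np

def update_distribution(center, width, frame_scores, lr=0.5):
    """One illustrative update step: move a sampling distribution's
    centre toward frames the classifier scores highly for the action.
    `frame_scores` are per-frame classifier confidences in [0, 1].
    """
    t = np.arange(len(frame_scores), dtype=float)
    # Current sampling distribution over frames (Gaussian stand-in).
    w = np.exp(-0.5 * ((t - center) / width) ** 2)
    w /= w.sum()
    # Classifier-response-weighted expected frame location.
    resp = w * frame_scores
    resp /= resp.sum()
    new_center = (1 - lr) * center + lr * (resp * t).sum()
    return new_center
```

Repeating this step lets the distribution drift from the annotated timestamp toward the discriminative segment, which is the convergence behaviour the abstract reports.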
Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination
We present a method for assessing skill from video, applicable to a variety
of tasks, ranging from surgery to drawing and rolling pizza dough. We formulate
the problem as pairwise (who's better?) and overall (who's best?) ranking of
video collections, using supervised deep ranking. We propose a novel loss
function that learns discriminative features when a pair of videos exhibits a
difference in skill, and learns shared features when a pair of videos exhibits
comparable skill levels. Results demonstrate our method is applicable across
tasks, with the percentage of correctly ordered pairs of videos ranging from
70% to 83% for four datasets. We demonstrate the robustness of our approach via
sensitivity analysis of its parameters. We see this work as effort toward the
automated organization of how-to video collections and overall, generic skill
determination in video.
Comment: CVPR 201
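The two-term loss described above can be sketched as a margin ranking term for pairs with differing skill and a feature-similarity term for comparable-skill pairs. This is a hedged sketch of the general idea; the function signature, argument names, and exact term forms are assumptions, not the paper's published loss.

```python
import numpy as np

def skill_pair_loss(f_hi, f_lo, score_hi, score_lo, comparable, margin=1.0):
    """Illustrative two-term skill-ranking loss.

    For comparable-skill pairs, a similarity term pulls the two videos'
    feature vectors together (encouraging shared features). For pairs
    with a clear skill difference, a hinge term pushes the higher-skill
    video's ranking score above the lower-skill one's by a margin.
    """
    if comparable:
        # Comparable skill: squared distance between features.
        return float(np.sum((f_hi - f_lo) ** 2))
    # Differing skill: margin ranking hinge on the scores.
    return float(max(0.0, margin - (score_hi - score_lo)))
```

In training, such a loss would be summed over sampled video pairs, with the `comparable` flag coming from the pairwise annotations; pairs already ranked correctly by more than the margin contribute zero loss.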