SMART Frame Selection for Action Recognition
Action recognition is computationally expensive. In this paper, we address
the problem of frame selection to improve the accuracy of action recognition.
In particular, we show that selecting good frames improves action recognition
performance even in the trimmed-video domain. Recent work has successfully
leveraged frame selection for long, untrimmed videos, where much of the content
is irrelevant and easy to discard. In this work, however, we focus on the
more standard short, trimmed action recognition problem. We argue that good
frame selection can not only reduce the computational cost of action
recognition but also increase the accuracy by getting rid of frames that are
hard to classify. In contrast to previous work, we propose a method that
considers frames jointly rather than selecting them one at a time. This
results in a more efficient selection, where good frames are more
effectively distributed over the video, like snapshots that tell a story. We
call the proposed frame selection SMART and we test it in combination with
different backbone architectures and on multiple benchmarks (Kinetics,
Something-something, UCF101). We show that the SMART frame selection
consistently improves the accuracy compared to other frame selection strategies
while reducing the computational cost by a factor of 4 to 10.
Additionally, we show that when the primary goal is recognition performance,
our selection strategy can improve over recent state-of-the-art models and
frame selection strategies on various benchmarks (UCF101, HMDB51, FCVID, and
ActivityNet).
Comment: To be published in AAAI-2
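The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the joint-selection idea: score every frame with some lightweight model, then pick k frames greedily while rewarding temporal spread, so selections are distributed over the video rather than clustered. The function name and the spread heuristic are assumptions for illustration, not the paper's method.

```python
import numpy as np

def select_frames_jointly(scores, k, spread_weight=0.5):
    """Greedily pick k frames, trading per-frame score against temporal
    spread. `scores` holds per-frame relevance from a hypothetical cheap
    scoring model; the spread term rewards frames far from those already
    chosen, so picks are distributed over the video (joint selection)."""
    n = len(scores)
    chosen = []
    for _ in range(k):
        best_idx, best_val = None, float("-inf")
        for i in range(n):
            if i in chosen:
                continue
            # Normalized distance to the nearest already-selected frame.
            spread = min((abs(i - j) for j in chosen), default=n) / n
            val = scores[i] + spread_weight * spread
            if val > best_val:
                best_idx, best_val = i, val
        chosen.append(best_idx)
    return sorted(chosen)

# Example: 100 frames with random scores, select 8 spread over the video.
rng = np.random.default_rng(0)
print(select_frames_jointly(rng.random(100), k=8))
```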
Watt For What: Rethinking Deep Learning's Energy-Performance Relationship
Deep learning models have revolutionized various fields, from image
recognition to natural language processing, by achieving unprecedented levels
of accuracy. However, their increasing energy consumption has raised concerns
about their environmental impact, disadvantaging smaller entities in research
and exacerbating global energy demand. In this paper, we explore the
trade-off between model accuracy and electricity consumption, proposing a
metric that penalizes large consumption of electricity. We conduct a
comprehensive study on the electricity consumption of various deep learning
models across different GPUs, presenting a detailed analysis of their
accuracy-efficiency trade-offs. By evaluating accuracy per unit of electricity
consumed, we demonstrate how smaller, more energy-efficient models can
significantly expedite research while mitigating environmental concerns. Our
results highlight the potential for a more sustainable approach to deep
learning, emphasizing the importance of optimizing models for efficiency. This
research also contributes to a more equitable research landscape, where smaller
entities can compete effectively with larger counterparts. We advocate for
the adoption of efficient deep learning practices to reduce electricity
consumption, safeguarding the environment for future generations while also
helping to ensure a fairer competitive landscape.
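As a rough illustration of "accuracy per unit of electricity consumed", here is a minimal sketch; the function name, the kWh units, and the penalty exponent are assumptions for illustration, not the paper's exact metric.

```python
def efficiency_score(accuracy, kwh, alpha=1.0):
    """Hypothetical accuracy-per-electricity metric.

    accuracy: top-1 accuracy in [0, 1]
    kwh:      electricity consumed, in kilowatt-hours
    alpha:    penalty exponent; alpha > 1 punishes energy-hungry models
              harder (an assumed knob, not from the paper)
    """
    return accuracy / (kwh ** alpha)

# Two hypothetical models: large vs. small.
print(efficiency_score(accuracy=0.84, kwh=120.0))  # ~0.0070 per kWh
print(efficiency_score(accuracy=0.80, kwh=15.0))   # ~0.0533 per kWh
```

Under any such ratio, the smaller model wins by a wide margin per unit of energy despite its lower raw accuracy, which is the trade-off the paper quantifies.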
Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition
We address the problem of data augmentation for video action recognition.
Standard augmentation strategies in video are hand-designed and sample the
space of possible augmented data points either at random, without knowing which
augmented points will be better, or through heuristics. We propose to learn
what makes a good video for action recognition and select only high-quality
samples for augmentation. In particular, we choose video compositing of a
foreground and a background video as the data augmentation process, which
results in diverse and realistic new samples. We learn which pairs of videos to
augment without having to actually composite them. This reduces the space of
possible augmentations, which has two advantages: it saves computational cost
and increases the accuracy of the final trained classifier, as the augmented
pairs are of higher quality than average. We present experimental results on
the entire spectrum of training settings: few-shot, semi-supervised and fully
supervised. We observe consistent improvements across all of them over prior
work and baselines on Kinetics, UCF101, and HMDB51, and achieve a new
state-of-the-art in settings with limited data. We see improvements of up to
8.6% in the semi-supervised setting.
Comment: Accepted to ECCV-202
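One way to read "learn which pairs of videos to augment without having to composite them" is a scorer over precomputed clip embeddings; the sketch below is a hypothetical stand-in (PairScorer and all dimensions are assumptions), not the paper's model.

```python
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Hypothetical scorer: predicts how useful compositing a
    (foreground, background) video pair would be, from precomputed clip
    embeddings, without rendering the composite itself."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, fg_emb, bg_emb):
        return self.mlp(torch.cat([fg_emb, bg_emb], dim=-1)).squeeze(-1)

# Rank all candidate pairs, composite only the most promising ones.
scorer = PairScorer()
fg = torch.randn(32, 512)          # dummy foreground clip embeddings
bg = torch.randn(32, 512)          # dummy background clip embeddings
top_pairs = scorer(fg, bg).topk(5).indices   # composite only these 5
```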
A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos
Understanding the steps required to perform a task is an important skill for
AI systems. Learning these steps from instructional videos involves two
subproblems: (i) identifying the temporal boundary of sequentially occurring
segments and (ii) summarizing these steps in natural language. We refer to this
task as Procedure Segmentation and Summarization (PSS). In this paper, we take
a closer look at PSS and propose three fundamental improvements over current
methods. The segmentation task is critical, as generating a correct summary
requires each step of the procedure to be correctly identified. However,
current segmentation metrics often overestimate the segmentation quality
because they do not consider the temporal order of segments. In our first
contribution, we propose a new segmentation metric that takes into account the
order of segments, giving a more reliable measure of the accuracy of a given
predicted segmentation. Current PSS methods are typically trained by proposing
segments, matching them with the ground truth and computing a loss. However,
much like segmentation metrics, existing matching algorithms do not consider
the temporal order of the mapping between candidate segments and the ground
truth. In our second contribution, we propose a matching algorithm that
constrains the temporal order of segment mapping, and is also differentiable.
Lastly, we introduce multi-modal feature training for PSS, which further
improves segmentation. We evaluate our approach on two instructional video
datasets (YouCook2 and Tasty) and observe improvements over the
state-of-the-art for both procedure segmentation and summarization.
Comment: Accepted at BMVC 202
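To make the order constraint concrete, here is a toy order-preserving matching via dynamic programming that maximizes total IoU; it illustrates why out-of-order predictions should score lower, but it is only a sketch (the paper's matching is additionally differentiable, which this hard-max version is not).

```python
def order_constrained_matching(pred, gt):
    """Match predicted segments to ground truth while preserving temporal
    order, maximizing total IoU with dynamic programming. Segments are
    (start, end) tuples in the order the model emits them. A toy stand-in
    for the paper's matching algorithm."""
    def iou(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    n, m = len(pred), len(gt)
    # dp[i][j]: best total IoU matching pred[:i] to gt[:j] monotonically.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j],        # leave pred[i-1] unmatched
                           dp[i][j - 1],        # leave gt[j-1] unmatched
                           dp[i - 1][j - 1] + iou(pred[i - 1], gt[j - 1]))
    return dp[n][m]

gt       = [(0, 10), (10, 20), (20, 30)]
in_order = [(1, 9), (11, 19), (21, 29)]
shuffled = [(21, 29), (11, 19), (1, 9)]   # same segments, wrong order
print(order_constrained_matching(in_order, gt))  # 2.4: all three match
print(order_constrained_matching(shuffled, gt))  # 0.8: order is penalized
```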
CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition
Zero-shot action recognition is the task of recognizing action classes
without visual examples, only with a semantic embedding which relates unseen to
seen classes. The problem can be seen as learning a function which generalizes
well to instances of unseen classes without losing discrimination between
classes. Neural networks can model the complex boundaries between visual
classes, which explains their success as supervised models. However, in
zero-shot learning, these highly specialized class boundaries may not transfer
well from seen to unseen classes. In this paper, we propose a clustering-based
model, which considers all training samples at once, instead of optimizing for
each instance individually. We optimize the clustering using Reinforcement
Learning, which we show is critical for our approach to work. We call the
proposed method CLASTER and observe that it consistently improves over the
state-of-the-art on all standard datasets (UCF101, HMDB51, and Olympic Sports),
both in the standard zero-shot evaluation and in the generalized zero-shot
learning setting.
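The abstract only says the clustering is optimized with Reinforcement Learning; the toy below shows what a REINFORCE-style centroid update could look like (soft assignment as the policy, label agreement as the reward). Every detail here is an assumption for illustration, not CLASTER's actual algorithm.

```python
import numpy as np

def reinforce_cluster_step(x, y, centroids, centroid_labels, lr=0.1):
    """One toy REINFORCE step on cluster centroids. Policy: softmax over
    negative squared distances from sample x to each centroid. Reward: 1
    if the sampled centroid's label matches the sample label y, else 0."""
    logits = -((centroids - x) ** 2).sum(axis=1)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    k = np.random.choice(len(centroids), p=probs)   # sample an assignment
    reward = 1.0 if centroid_labels[k] == y else 0.0
    for j in range(len(centroids)):
        # d log pi(k|x) / d centroid_j = (1[j=k] - p_j) * 2 (x - c_j)
        grad = ((1.0 if j == k else 0.0) - probs[j]) * 2.0 * (x - centroids[j])
        centroids[j] += lr * reward * grad          # policy-gradient update
    return centroids

# Toy usage: three 2-D centroids, each tagged with a class label.
rng = np.random.default_rng(0)
centroids = rng.standard_normal((3, 2))
centroids = reinforce_cluster_step(np.array([0.5, -0.2]), y=1,
                                   centroids=centroids,
                                   centroid_labels=[0, 1, 2])
```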
A New Split for Evaluating True Zero-Shot Action Recognition
Zero-shot action recognition is the task of classifying action categories
that are not available in the training set. In this setting, the standard
evaluation protocol is to use existing action recognition datasets (e.g.
UCF101) and randomly split the classes into seen and unseen. However, most
recent work builds on representations pre-trained on the Kinetics dataset,
where classes largely overlap with classes in the zero-shot evaluation
datasets. As a result, classes that are supposed to be unseen are present
during supervised pre-training, invalidating the condition of the zero-shot
setting. A similar concern was noted several years ago for image-based
zero-shot recognition, but has not been considered by the zero-shot
action recognition community. In this paper, we propose a new split for true
zero-shot action recognition with no overlap between unseen test classes and
training or pre-training classes. We benchmark several recent approaches on the
proposed True Zero-Shot (TruZe) Split for UCF101 and HMDB51, with zero-shot and
generalized zero-shot evaluation. In our extensive analysis we find that our
TruZe splits are significantly harder than comparable random splits as nothing
is leaking from pre-training, i.e. unseen performance is consistently lower, up
to 9.4% for zero-shot action recognition. In an additional evaluation we also
find that similar issues exist in the splits used in few-shot action
recognition; here we see differences of up to 14.1%. We publish our splits and
hope that our benchmark analysis will change how the field evaluates zero-
and few-shot action recognition moving forward.
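As a toy illustration of constructing a leak-free split, the filter below pushes any evaluation class whose name overlaps a pre-training class into the seen side. Real overlap detection needs semantic matching and manual checks; the name-token comparison here is a deliberate simplification, not the TruZe procedure.

```python
def build_true_zero_shot_split(dataset_classes, pretrain_classes):
    """Toy split construction: classes overlapping pre-training classes
    (e.g. Kinetics) become seen; only non-overlapping classes stay unseen,
    so nothing leaks from pre-training into the zero-shot evaluation."""
    def tokens(name):
        return set(name.lower().replace("_", " ").split())

    pretrain = [tokens(c) for c in pretrain_classes]
    seen, unseen = [], []
    for c in dataset_classes:
        overlaps = any(tokens(c) & p for p in pretrain)
        (seen if overlaps else unseen).append(c)
    return seen, unseen

seen, unseen = build_true_zero_shot_split(
    ["archery", "baby crawling", "apply eye makeup"],
    ["archery", "crawling baby", "playing guitar"])
print(unseen)   # ['apply eye makeup']: no overlap with pre-training
```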