Video Action Transformer Network
We introduce the Action Transformer model for recognizing and localizing
human actions in video clips. We repurpose a Transformer-style architecture to
aggregate features from the spatiotemporal context around the person whose
actions we are trying to classify. We show that by using high-resolution,
person-specific, class-agnostic queries, the model spontaneously learns to
track individual people and to pick up on semantic context from the actions of
others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial for discriminating an action, all without explicit
supervision other than boxes and class labels. We train and test our Action
Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming
the state of the art by a significant margin using only raw RGB frames as input.
Comment: CVPR 201
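For intuition, the core operation described above can be read as cross-attention in which a pooled feature for the person of interest forms the query and the surrounding spatio-temporal features form the keys and values. The following is a minimal, hypothetical sketch of that idea in PyTorch; the module name, dimensions, and class count are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: a person-specific query attends over spatio-temporal
# context features. Sizes and names are assumptions, not the paper's config.
import torch
import torch.nn as nn

class PersonContextAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_classes=80):
        super().__init__()
        # Cross-attention: query = person RoI feature, keys/values = context tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, person_feat, context_feats):
        # person_feat:   (B, 1, dim)      pooled RoI feature for the person of interest
        # context_feats: (B, T*H*W, dim)  flattened spatio-temporal feature map
        attended, _ = self.attn(person_feat, context_feats, context_feats)
        return self.classifier(attended.squeeze(1))  # (B, num_classes) action logits

# Example: one person query attending over an 8x14x14 context grid
model = PersonContextAttention()
person = torch.randn(2, 1, 512)
context = torch.randn(2, 8 * 14 * 14, 512)
logits = model(person, context)
```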
Two-Stream Transformer Architecture for Long Video Understanding
Pure vision transformer architectures are highly effective for short video
classification and action recognition tasks. However, due to the quadratic complexity of self-attention and the lack of inductive bias, transformers are resource-intensive and suffer from data inefficiencies. Long-form video understanding tasks amplify the data and memory efficiency problems of transformers, making current approaches infeasible to implement in data- or memory-restricted domains. This paper introduces an efficient Spatio-Temporal Attention Network
(STAN) which uses a two-stream transformer architecture to model dependencies
between static image features and temporal contextual features. Our proposed
approach can classify videos up to two minutes in length on a single GPU, is
data-efficient, and achieves state-of-the-art performance on several long video understanding tasks.
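As a rough illustration of the two-stream idea described above, the sketch below pairs a temporal transformer over per-frame (static) features with a cross-attention fusion step. All module names, depths, and dimensions here are assumptions for illustration; STAN's actual architecture and training setup may differ.

```python
# Illustrative sketch (not STAN's code): a temporal-context stream over
# frame-level features, fused with the static stream via cross-attention.
import torch
import torch.nn as nn

class TwoStreamClassifier(nn.Module):
    def __init__(self, dim=768, num_heads=8, depth=4, num_classes=400):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, depth)
        # Cross-attention lets temporal tokens query the static (per-frame) stream
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, static_feats):
        # static_feats: (B, T, dim) frame-level features from an image backbone
        temporal = self.temporal_encoder(static_feats)       # temporal context stream
        fused, _ = self.cross_attn(temporal, static_feats, static_feats)
        return self.head(fused.mean(dim=1))                  # average over time

frames = torch.randn(2, 240, 768)   # e.g. two minutes of video sampled at 2 fps
logits = TwoStreamClassifier()(frames)
```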
Dynamic Appearance: A Video Representation for Action Recognition with Joint Training
The static appearance of a video may impede the ability of a deep neural network to
learn motion-relevant features in video action recognition. In this paper, we
introduce a new concept, Dynamic Appearance (DA), summarizing the appearance
information relating to movement in a video while filtering out the static
information considered unrelated to motion. We consider distilling the dynamic
appearance from raw video data as a means of efficient video understanding. To
this end, we propose the Pixel-Wise Temporal Projection (PWTP), which projects
the static appearance of a video into a subspace within its original vector
space, while the dynamic appearance is encoded in the projection residual
describing a special motion pattern. Moreover, we integrate the PWTP module
with a CNN or Transformer into an end-to-end training framework, which is
optimized using multi-objective optimization algorithms. We provide extensive experimental results on four action recognition benchmarks: Kinetics400, Something-Something V1, UCF101, and HMDB51.
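To make the projection-residual idea concrete, the toy sketch below projects each pixel's temporal trajectory onto a low-rank "static" subspace and treats the residual as the dynamic appearance. This is an illustrative assumption, not the paper's learned PWTP module.

```python
# Toy illustration (not the PWTP module): per pixel, project the temporal
# trajectory onto a low-rank static subspace; the residual encodes motion.
import torch

def pixelwise_temporal_residual(video, rank=1):
    # video: (T, C, H, W); flatten pixels so each has a length-T trajectory
    T, C, H, W = video.shape
    x = video.reshape(T, -1)                     # (T, C*H*W)
    # Low-rank temporal basis from an SVD of the trajectories
    U, S, Vh = torch.linalg.svd(x, full_matrices=False)
    basis = U[:, :rank]                          # (T, rank)
    static = basis @ (basis.T @ x)               # projection onto the static subspace
    dynamic = x - static                         # residual: motion-related appearance
    return static.reshape(T, C, H, W), dynamic.reshape(T, C, H, W)

video = torch.randn(16, 3, 112, 112)
static_app, dynamic_app = pixelwise_temporal_residual(video)
```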
Efficient Video Transformers with Spatial-Temporal Token Selection
Video transformers have achieved impressive results on major video recognition benchmarks, but they suffer from high computational cost. In
this paper, we present STTS, a token selection framework that dynamically
selects a few informative tokens in both temporal and spatial dimensions
conditioned on input video samples. Specifically, we formulate token selection
as a ranking problem, which estimates the importance of each token through a
lightweight scorer network; only tokens with the top scores are used for downstream evaluation. In the temporal dimension, we keep the frames that are
most relevant to the action categories, while in the spatial dimension, we
identify the most discriminative region in feature maps without affecting the
spatial context used in a hierarchical way in most video transformers. Since
the decision of token selection is non-differentiable, we employ a
perturbed-maximum based differentiable Top-K operator for end-to-end training.
We mainly conduct extensive experiments on Kinetics-400 with a recently
introduced video transformer backbone, MViT. Our framework achieves similar
results while requiring 20% less computation. We also demonstrate our approach
is generic for different transformer architectures and video datasets. Code is
available at https://github.com/wangjk666/STTS.
Comment: Accepted by ECCV 202
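The score-and-select mechanism can be sketched as a lightweight scorer followed by a top-k gather, as below. This is an illustrative assumption rather than the released STTS code, and it uses a hard top-k; the paper's perturbed-maximum differentiable top-k, which makes selection trainable end to end, is not implemented here.

```python
# Illustrative sketch of score-and-select token pruning (hard top-k only;
# the differentiable perturbed-maximum operator used for training is omitted).
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    def __init__(self, dim=768, keep=49):
        super().__init__()
        self.keep = keep
        # Lightweight scorer network estimating the importance of each token
        self.scorer = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1))

    def forward(self, tokens):
        # tokens: (B, N, dim); keep only the top-scoring tokens
        scores = self.scorer(tokens).squeeze(-1)          # (B, N)
        topk = scores.topk(self.keep, dim=1).indices      # (B, keep)
        idx = topk.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                      # (B, keep, dim)

tokens = torch.randn(2, 196, 768)       # e.g. 14x14 spatial tokens per frame
kept = TokenSelector()(tokens)          # (2, 49, 768)
```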
Event-based Vision for Early Prediction of Manipulation Actions
Neuromorphic visual sensors are artificial retinas that output sequences of
asynchronous events when brightness changes occur in the scene. These sensors
offer many advantages, including very high temporal resolution, no motion blur, and smart data compression ideal for real-time processing. In this study, we
introduce an event-based dataset on fine-grained manipulation actions and
perform an experimental study on the use of transformers for action prediction
with events. There is enormous interest in the fields of cognitive robotics and
human-robot interaction in understanding and predicting human actions as early as possible. Early prediction allows anticipating complex stages for planning,
enabling effective and real-time interaction. Our Transformer network uses
events to predict manipulation actions as they occur, using online inference.
The model succeeds at predicting actions early on, building up confidence over
time and achieving state-of-the-art classification. Moreover, the
attention-based transformer architecture allows us to study the role of the
spatio-temporal patterns selected by the model. Our experiments show that the
Transformer network captures the dynamic features of actions, outperforming video-based approaches and succeeding in scenarios where the differences between actions
lie in very subtle cues. Finally, we release the new event dataset, which is
the first in the literature for manipulation action recognition. Code will be
available at https://github.com/DaniDeniz/EventVisionTransformer.
Comment: 15 pages, 9 figures
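As background on how asynchronous events can be fed to a transformer, the sketch below accumulates (x, y, t, polarity) events into a small stack of event frames that could then be patchified into tokens. This is a common, generic preprocessing step and an assumption for illustration, not the authors' released pipeline.

```python
# Hedged sketch (not the released code): bin asynchronous events into a
# stack of two-channel (polarity) event frames for downstream tokenization.
import torch

def events_to_frames(events, num_bins=8, height=180, width=240):
    # events: (N, 4) tensor of [x, y, t, polarity] with t normalized to [0, 1)
    frames = torch.zeros(num_bins, 2, height, width)
    x, y = events[:, 0].long(), events[:, 1].long()
    b = (events[:, 2] * num_bins).long().clamp(0, num_bins - 1)   # temporal bin
    p = (events[:, 3] > 0).long()                                 # polarity channel
    frames.index_put_((b, p, y, x), torch.ones(len(events)), accumulate=True)
    return frames   # (num_bins, 2, H, W), ready to be patchified into tokens

events = torch.rand(10_000, 4)
events[:, 0] *= 239; events[:, 1] *= 179; events[:, 3] = events[:, 3].round()
frames = events_to_frames(events)
```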