17 research outputs found
TubeR: Tubelet Transformer for Video Action Detection
We propose TubeR: a simple solution for spatio-temporal video action
detection. Different from existing methods that depend on either an off-line
actor detector or hand-designed actor-positional hypotheses like proposals or
anchors, we propose to directly detect an action tubelet in a video by
simultaneously performing action localization and recognition from a single
representation. TubeR learns a set of tubelet-queries and utilizes a
tubelet-attention module to model the dynamic spatio-temporal nature of a video
clip, which effectively reinforces the model capacity compared to using
actor-positional hypotheses in the spatio-temporal space. For videos containing
transitional states or scene changes, we propose a context-aware classification
head to utilize short-term and long-term context to strengthen action
classification, and an action switch regression head for detecting the precise
temporal action extent. TubeR directly produces action tubelets with variable
lengths and even maintains good results for long video clips. TubeR outperforms
the previous state-of-the-art on the commonly used action detection datasets AVA,
UCF101-24, and JHMDB51-21.
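For intuition only, below is a minimal PyTorch-style sketch of how a set of learned tubelet-queries could decode a video clip into per-frame boxes and per-tubelet action scores, based solely on the abstract above; the module names, dimensions, and the plain Transformer decoder are assumptions, not TubeR's actual implementation (which adds tubelet-attention, a context-aware classification head, and an action switch head).

```python
import torch
import torch.nn as nn

class TubeletQueryDecoder(nn.Module):
    """Hypothetical decoder: learned tubelet-queries cross-attend to clip
    features and emit per-frame boxes plus per-tubelet action scores."""
    def __init__(self, num_queries=15, d_model=256, num_classes=80, clip_len=32):
        super().__init__()
        self.clip_len = clip_len
        # One learned query embedding per candidate tubelet.
        self.tubelet_queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        # Per-frame box regression (x, y, w, h) and per-tubelet classification.
        self.box_head = nn.Linear(d_model, 4 * clip_len)
        self.cls_head = nn.Linear(d_model, num_classes)

    def forward(self, video_tokens):
        # video_tokens: (batch, T*H*W, d_model) flattened spatio-temporal features.
        b = video_tokens.size(0)
        queries = self.tubelet_queries.weight.unsqueeze(0).expand(b, -1, -1)
        decoded = self.decoder(queries, video_tokens)   # (b, num_queries, d_model)
        boxes = self.box_head(decoded).view(b, -1, self.clip_len, 4)
        logits = self.cls_head(decoded)
        return boxes, logits

# Example usage with random features:
# feats = torch.randn(2, 32 * 7 * 7, 256)
# boxes, logits = TubeletQueryDecoder()(feats)
```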
Asynchronous Interaction Aggregation for Action Detection
Understanding interaction is an essential part of video action detection. We
propose the Asynchronous Interaction Aggregation network (AIA) that leverages
different interactions to boost action detection. It has two key designs: the
Interaction Aggregation structure (IA), which adopts a uniform paradigm to model
and integrate multiple types of interaction, and the Asynchronous Memory Update
algorithm (AMU), which enables better performance by dynamically modeling very
long-term interaction without a huge computation cost. We provide empirical
evidence to show that our network can
gain notable accuracy from the integrative interactions and is easy to train
end-to-end. Our method reports new state-of-the-art performance on the AVA
dataset, with a 3.7 mAP gain (a 12.6% relative improvement) on the validation
split compared to our strong baseline. The results on the UCF101-24 and
EPIC-Kitchens datasets further illustrate the effectiveness of our approach. Source code
will be made public at: https://github.com/MVIG-SJTU/AlphAction
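As a rough illustration of the asynchronous long-term memory idea described above, the sketch below caches detached per-clip actor features and lets the current clip attend to them; the memory layout, window size, and attention block are assumptions for illustration only, and the actual AMU algorithm in the AlphAction code differs.

```python
import torch
import torch.nn as nn

class LongTermMemoryBank:
    """Caches detached per-clip actor features so long-term context can be read
    without backpropagating through (or recomputing) past clips."""
    def __init__(self, window=30):
        self.window = window
        self.bank = {}  # (video_id, clip_idx) -> Tensor of shape (num_actors, d_model)

    def write(self, video_id, clip_idx, actor_feats):
        self.bank[(video_id, clip_idx)] = actor_feats.detach()

    def read(self, video_id, clip_idx):
        feats = [self.bank[(video_id, t)]
                 for t in range(clip_idx - self.window, clip_idx + self.window + 1)
                 if (video_id, t) in self.bank]
        return torch.cat(feats, dim=0) if feats else None

class MemoryInteraction(nn.Module):
    """Lets current-clip actor features attend to the cached long-term memory."""
    def __init__(self, d_model=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, actor_feats, memory_feats):
        # actor_feats: (num_actors, d_model); memory_feats: (num_memory, d_model) or None.
        if memory_feats is None:
            return actor_feats
        q = actor_feats.unsqueeze(0)
        kv = memory_feats.unsqueeze(0)
        out, _ = self.attn(q, kv, kv)
        return actor_feats + out.squeeze(0)
```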
What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions
We propose a novel one-stage Transformer-based semantic and spatial refined
transformer (SSRT) to solve the Human-Object Interaction detection task, which
requires localizing humans and objects and predicting their interactions.
Different from previous Transformer-based HOI approaches, which mostly focus
on improving the design of the decoder outputs for the final detection, SSRT
introduces two new modules to help select the most relevant object-action pairs
within an image and refine the queries' representation using rich semantic and
spatial features. These enhancements lead to state-of-the-art results on the
two most popular HOI benchmarks: V-COCO and HICO-DET.
Comment: CVPR 2022 Oral
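Based only on this abstract, a hypothetical sketch of refining decoder queries with semantic and spatial cues might look as follows; the feature sources, projection layers, and fusion by cross-attention are assumptions rather than the paper's actual module design.

```python
import torch
import torch.nn as nn

class QueryRefiner(nn.Module):
    """Hypothetical refinement step: decoder queries attend to projected
    semantic (e.g. embedded object-action labels) and spatial (e.g. box-pair
    geometry) support features."""
    def __init__(self, d_model=256, d_semantic=512, d_spatial=8):
        super().__init__()
        self.sem_proj = nn.Linear(d_semantic, d_model)
        self.spa_proj = nn.Linear(d_spatial, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, queries, semantic_feats, spatial_feats):
        # queries:        (batch, num_queries, d_model)
        # semantic_feats: (batch, num_pairs, d_semantic)
        # spatial_feats:  (batch, num_pairs, d_spatial)
        support = torch.cat([self.sem_proj(semantic_feats),
                             self.spa_proj(spatial_feats)], dim=1)
        refined, _ = self.attn(queries, support, support)
        return self.norm(queries + refined)
```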