Improving Action Localization by Progressive Cross-stream Cooperation
Spatio-temporal action localization consists of three levels of tasks:
spatial localization, action classification, and temporal segmentation. In this
work, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that
uses both region proposals and features from one stream (i.e. Flow/RGB) to help
the other stream (i.e. RGB/Flow) iteratively improve action localization
results and generate better bounding boxes.
Specifically, we first generate a larger set of region proposals by combining
the latest region proposals from both streams, from which we can readily obtain
a larger set of labelled training samples to help learn better action detection
models. Second, we also propose a new message passing approach to pass
information from one stream to another stream in order to learn better
representations, which also leads to better action detection models. As a
result, our iterative framework progressively improves action localization
results at the frame level. To improve action localization results at the video
level, we additionally propose a new strategy to train class-specific
actionness detectors for better temporal segmentation, which can be readily
learnt by focusing on "confusing" samples from the same action class.
Comprehensive experiments on two benchmark datasets UCF-101-24 and J-HMDB
demonstrate the effectiveness of our newly proposed approaches for
spatio-temporal action localization in realistic scenarios.
Comment: CVPR201
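The abstract above says that the proposals of the two streams are combined into a larger set but does not say how. The following is only a minimal sketch of one plausible reading: take the union of the RGB and Flow boxes and drop near-duplicates by IoU. The names `iou`, `combine_proposals`, and the threshold `dup_iou` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def combine_proposals(rgb: List[Box], flow: List[Box],
                      dup_iou: float = 0.9) -> List[Box]:
    """Union of both proposal sets, skipping near-duplicate boxes.

    Assumption: a Flow box counts as a duplicate when its IoU with an
    already kept box is at least `dup_iou` (hypothetical threshold).
    """
    combined = list(rgb)
    for box in flow:
        if all(iou(box, kept) < dup_iou for kept in combined):
            combined.append(box)
    return combined


rgb_boxes = [(0.0, 0.0, 10.0, 10.0)]
flow_boxes = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
merged = combine_proposals(rgb_boxes, flow_boxes)
```

In this toy example the Flow stream contributes one box the RGB stream missed, so the merged set is larger than either input, which is the effect the abstract describes.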
Spatio-Temporal Action Detection with Multi-Object Interaction
Spatio-temporal action detection in videos requires localizing the action
both spatially and temporally in the form of an "action tube". Nowadays, most
spatio-temporal action detection datasets (e.g. UCF101-24, AVA, DALY) are
annotated with action tubes that contain a single person performing the action,
so the predominant action detection models simply employ a person detection
and tracking pipeline for localization. However, when the action is defined as
an interaction between multiple objects, such methods may fail since each
bounding box in the action tube contains multiple objects instead of one
person. In this paper, we study the spatio-temporal action detection problem
with multi-object interaction. We introduce a new dataset that is annotated
with action tubes containing multi-object interactions. Moreover, we propose an
end-to-end spatio-temporal action detection model that performs both spatial
and temporal regression simultaneously. Our spatial regression may enclose
multiple objects participating in the action. At test time, we connect
the regressed bounding boxes within the predicted temporal duration
using a simple heuristic. We report the baseline results of our proposed model
on this new dataset, and also show competitive results on the standard
benchmark UCF101-24 using only RGB input.
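The abstract above does not specify the linking heuristic. As an illustration only, the sketch below assumes a greedy rule: inside the predicted temporal window, start from the highest-scoring box and, in each later frame, pick the box that maximizes overlap with the previous box plus its own score. The names `link_tube`, `iou`, and the scoring rule are hypothetical and are not the authors' method.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Scored = Tuple[Box, float]               # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tube(frame_boxes: Dict[int, List[Scored]],
              start: int, end: int) -> List[Tuple[int, Box]]:
    """Greedily link one box per frame for frames start..end (inclusive).

    Frames outside the predicted window, or without candidates, are skipped.
    """
    tube: List[Tuple[int, Box]] = []
    prev: Optional[Box] = None
    for t in range(start, end + 1):
        candidates = frame_boxes.get(t, [])
        if not candidates:
            continue
        if prev is None:
            # First box of the tube: take the most confident detection.
            box, _ = max(candidates, key=lambda c: c[1])
        else:
            # Later frames: favor overlap with the previous box plus score.
            last = prev
            box, _ = max(candidates, key=lambda c: iou(last, c[0]) + c[1])
        tube.append((t, box))
        prev = box
    return tube


detections = {
    0: [((0.0, 0.0, 10.0, 10.0), 0.9), ((50.0, 50.0, 60.0, 60.0), 0.5)],
    1: [((1.0, 1.0, 11.0, 11.0), 0.6), ((50.0, 50.0, 60.0, 60.0), 0.7)],
    2: [((2.0, 2.0, 12.0, 12.0), 0.8)],
    5: [((0.0, 0.0, 10.0, 10.0), 0.99)],  # outside the predicted window
}
tube = link_tube(detections, start=0, end=2)
```

In frame 1 the second candidate has the higher score, but the overlap term keeps the tube on the box that continues the track from frame 0. Frame 5 is ignored because it falls outside the predicted temporal duration.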