Finding Action Tubes with a Sparse-to-Dense Framework
The task of spatio-temporal action detection has attracted increasing
attention among researchers. Existing dominant methods solve this problem by
relying on short-term information and dense, serial detection on each
individual frame or clip. Despite their effectiveness, these methods make
inadequate use of long-term information and are prone to inefficiency. In this
paper, we propose, for the first time, an efficient framework that generates
action tube proposals from video streams with a single forward pass in a
sparse-to-dense manner. There are two key characteristics in this framework:
(1) Both long-term and short-term sampled information are explicitly utilized
in our spatiotemporal network, (2) A new dynamic feature sampling module (DTS)
is designed to effectively approximate the tube output while keeping the system
tractable. We evaluate the efficacy of our model on the UCF101-24, JHMDB-21 and
UCFSports benchmark datasets, achieving promising results that are competitive
with state-of-the-art methods. The proposed sparse-to-dense strategy renders our
framework about 7.6 times more efficient than the nearest competitor.
Comment: 5 figures; AAAI 202
Spatio-Temporal Action Detection with Cascade Proposal and Location Anticipation
In this work, we address the problem of spatio-temporal action detection in
temporally untrimmed videos. This is an important and challenging task, as
accurately localizing human actions in both space and time is essential for
analyzing large-scale video data. To tackle this problem, we propose a cascade
proposal and location anticipation (CPLA) model for frame-level action
detection. There are several salient points of our model: (1) a cascade region
proposal network (casRPN) is adopted for action proposal generation and shows
better localization accuracy compared with single region proposal network
(RPN); (2) action spatio-temporal consistencies are exploited via a location
anticipation network (LAN) and thus frame-level action detection is not
conducted independently. Frame-level detections are then linked by solving a
linking score maximization problem, and temporally trimmed into spatio-temporal
action tubes. We demonstrate the effectiveness of our model on the challenging
UCF101 and LIRIS-HARL datasets, achieving state-of-the-art performance on both.
Comment: Accepted at BMVC 2017 (oral)
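The linking step this abstract mentions is typically solved with a Viterbi-style dynamic program over per-frame detections. As a rough sketch (not the authors' exact formulation), the hypothetical `link_detections` below traces the tube maximizing the sum of detection scores plus IoU consistency between consecutive boxes:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def link_detections(frames):
    """frames: list over time of per-frame lists of (box, score).
    Returns one detection index per frame tracing the tube that maximizes
    summed detection scores plus IoU overlap between consecutive boxes."""
    dp = [score for _, score in frames[0]]  # best tube score ending at each detection
    back = []                               # backpointers, one list per transition
    for t in range(1, len(frames)):
        prev = frames[t - 1]
        new_dp, ptrs = [], []
        for box, score in frames[t]:
            # linking score: predecessor tube score + this detection + overlap
            cand = [dp[j] + score + iou(prev[j][0], box)
                    for j in range(len(prev))]
            j_best = int(np.argmax(cand))
            new_dp.append(cand[j_best])
            ptrs.append(j_best)
        dp = new_dp
        back.append(ptrs)
    # backtrack the highest-scoring path
    path = [int(np.argmax(dp))]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]
```

The actual CPLA pipeline additionally trims the linked path in time; this sketch covers only the per-frame linking.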
Am I Done? Predicting Action Progress in Videos
In this paper we deal with the problem of predicting action progress in
videos. We argue that this is an extremely important task since it can be
valuable for a wide range of interaction applications. To this end we introduce
a novel approach, named ProgressNet, capable of predicting when an action takes
place in a video, where it is located within the frames, and how far it has
progressed during its execution. To provide a general definition of action
progress, we ground our work in the linguistics literature, borrowing terms and
concepts to understand which actions can be the subject of progress estimation.
As a result, we define a categorization of actions and their phases. Motivated
by the recent success obtained from the interaction of Convolutional and
Recurrent Neural Networks, our model is based on a combination of the Faster
R-CNN framework, to make frame-wise predictions, and LSTM networks, to estimate
action progress through time. After introducing two evaluation protocols for
the task at hand, we demonstrate the capability of our model to effectively
predict action progress on the UCF-101 and J-HMDB datasets.
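The simplest notion of action progress is a normalized-time target. The sketch below is an illustrative assumption, not ProgressNet's actual supervision signal, whose phase-aware definition comes from the linguistic categorization described above:

```python
def progress_target(t, t_start, t_end):
    """Illustrative ground-truth progress at frame t: 0 before the
    action starts, 1 after it ends, linear in between (a hypothetical
    target; the paper's phase-based definition may differ)."""
    if t <= t_start:
        return 0.0
    if t >= t_end:
        return 1.0
    return (t - t_start) / (t_end - t_start)
```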
Generic Tubelet Proposals for Action Localization
We develop a novel framework for action localization in videos. We propose
the Tube Proposal Network (TPN), which can generate generic, class-independent,
video-level tubelet proposals in videos. The generated tubelet proposals can be
utilized in various video analysis tasks, including recognizing and localizing
actions in videos. In particular, we integrate these generic tubelet proposals
into a unified temporal deep network for action classification. Compared with
other methods, our generic tubelet proposal method is accurate, general, and
fully differentiable under a smooth L1 loss function. We demonstrate the
performance of our algorithm on the standard UCF-Sports, J-HMDB21, and UCF-101
datasets. Our class-independent TPN outperforms other tubelet generation
methods, and our unified temporal deep network achieves state-of-the-art
localization results on all three datasets.
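The smooth L1 loss named in this abstract is the standard box-regression objective. For reference, a minimal NumPy version (the `beta` transition point is a common generalization, not a parameter taken from the paper):

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1 loss: quadratic for residuals |x| < beta, linear
    beyond, so it is differentiable everywhere and less sensitive to
    outliers than plain L2."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)
```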