Finding Action Tubes with a Sparse-to-Dense Framework
The task of spatial-temporal action detection has attracted increasing
attention among researchers. Existing dominant methods solve this problem by
relying on short-term information and dense, serial detection on individual
frames or clips. Despite their effectiveness, these methods make inadequate use
of long-term information and are prone to inefficiency. In this
paper, we propose, for the first time, an efficient framework that generates
action tube proposals from video streams with a single forward pass in a
sparse-to-dense manner. There are two key characteristics in this framework:
(1) both long-term and short-term sampled information are explicitly utilized
in our spatiotemporal network; (2) a new dynamic feature sampling module (DTS)
is designed to effectively approximate the tube output while keeping the system
tractable. We evaluate the efficacy of our model on the UCF101-24, JHMDB-21 and
UCFSports benchmark datasets, achieving promising results that are competitive
with state-of-the-art methods. The proposed sparse-to-dense strategy renders our
framework about 7.6 times more efficient than the nearest competitor.
Comment: 5 figures; AAAI 202
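To make the sparse-to-dense idea concrete, the following is a minimal sketch that turns boxes predicted on a few sampled frames into a dense per-frame tube. It uses plain linear interpolation, which is an assumption made here for illustration and not the paper's DTS module; the function name `densify_tube` and the `(x1, y1, x2, y2)` box layout are likewise hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed layout


def densify_tube(sparse: Dict[int, Box], num_frames: int) -> List[Box]:
    """Fill in one box per frame from boxes known only at sampled frames.

    Illustrative stand-in only: linear interpolation between the nearest
    sampled frames, with the first/last sampled box held at the ends.
    """
    if not sparse:
        raise ValueError("at least one sampled box is required")
    keys = sorted(sparse)
    dense: List[Box] = []
    for t in range(num_frames):
        if t <= keys[0]:
            dense.append(sparse[keys[0]])
        elif t >= keys[-1]:
            dense.append(sparse[keys[-1]])
        else:
            left = max(k for k in keys if k <= t)   # nearest sample at or before t
            right = min(k for k in keys if k >= t)  # nearest sample at or after t
            if left == right:
                dense.append(sparse[left])
            else:
                w = (t - left) / (right - left)
                a, b = sparse[left], sparse[right]
                dense.append((
                    (1 - w) * a[0] + w * b[0],
                    (1 - w) * a[1] + w * b[1],
                    (1 - w) * a[2] + w * b[2],
                    (1 - w) * a[3] + w * b[3],
                ))
    return dense
```

In this sketch the network would only need to localise the action on the sampled frames; the remaining frames are approximated rather than detected one by one.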
TraMNet - Transition Matrix Network for Efficient Action Tube Proposals
Current state-of-the-art methods solve spatiotemporal action localisation by
extending 2D anchors to 3D-cuboid proposals on stacks of frames, to generate
sets of temporally connected bounding boxes called \textit{action micro-tubes}.
However, they fail to consider that the underlying anchor proposal hypotheses
should also move (transition) from frame to frame, as the actor or the camera
does. Assuming we evaluate $n$ 2D anchors in each frame, then the number of
possible transitions from each 2D anchor to the next, for a sequence of $f$
consecutive frames, is in the order of $O(n^f)$, expensive even for small
values of $f$. To avoid this problem, we introduce a Transition-Matrix-based
Network (TraMNet) which relies on computing transition probabilities between
anchor proposals while maximising their overlap with ground truth bounding
boxes across frames, and enforcing sparsity via a transition threshold. As the
resulting transition matrix is sparse and stochastic, this reduces the proposal
hypothesis search space from $O(n^f)$ to the cardinality of the thresholded
matrix. At training time, transitions are specific to cell locations of the
feature maps, so that a sparse (efficient) transition matrix is used to train
the network. At test time, a denser transition matrix can be obtained either by
decreasing the threshold or by adding to it all the relative transitions
originating from any cell location, allowing the network to handle transitions
in the test data that might not have been present in the training data, and
making detection translation-invariant. Finally, we show that our network can
handle sparse annotations such as those available in the DALY dataset. We
report extensive experiments on the DALY, UCF101-24 and Transformed-UCF101-24
datasets to support our claims.
Comment: 15 page
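The thresholded transition matrix can be illustrated with a small sketch. With n anchors per frame and f frames, chaining anchors freely gives on the order of n^f hypotheses; counting which anchor-to-anchor moves actually occur in ground-truth box pairs, and dropping the rare ones, leaves far fewer. Everything below is an assumption made for illustration rather than the authors' implementation: the helper names (`iou`, `best_anchor`, `transition_matrix`, `num_hypotheses`), the box layout, and the choice to assign each ground-truth box to its highest-IoU anchor.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_anchor(box: Box, anchors: Sequence[Box]) -> int:
    """Index of the anchor that overlaps the box most (first one on ties)."""
    return max(range(len(anchors)), key=lambda i: iou(box, anchors[i]))


def transition_matrix(
    anchors: Sequence[Box],
    gt_pairs: Sequence[Tuple[Box, Box]],
    threshold: float,
) -> List[List[float]]:
    """Estimate anchor-to-anchor transition probabilities, then sparsify.

    gt_pairs holds (box in frame t, box in frame t+1) for the same actor.
    Rows are normalised to probabilities before entries below `threshold`
    are zeroed, so thresholded rows may no longer sum to one.
    """
    n = len(anchors)
    counts = [[0.0] * n for _ in range(n)]
    for box_t, box_next in gt_pairs:
        counts[best_anchor(box_t, anchors)][best_anchor(box_next, anchors)] += 1.0
    matrix: List[List[float]] = []
    for row in counts:
        total = sum(row)
        probs = [c / total if total else 0.0 for c in row]
        matrix.append([p if p >= threshold else 0.0 for p in probs])
    return matrix


def num_hypotheses(matrix: Sequence[Sequence[float]]) -> int:
    """Number of anchor pairs kept, i.e. non-zero entries of the matrix."""
    return sum(1 for row in matrix for p in row if p > 0)
```

Lowering `threshold` in this sketch keeps more transitions, which mirrors the abstract's remark that a denser matrix can be obtained at test time by decreasing the threshold.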