Detecting events and key actors in multi-person videos
Multi-person event recognition is a challenging task, often with many people
active in the scene but only a small subset contributing to an actual event. In
this paper, we propose a model which learns to detect events in such videos
while automatically "attending" to the people responsible for the event. Our
model does not use explicit annotations regarding who or where those people are
during training and testing. In particular, we track people in videos and use a
recurrent neural network (RNN) to represent the track features. We learn
time-varying attention weights to combine these features at each time-instant.
The attended features are then processed using another RNN for event
detection/classification. Since most video datasets with multiple people are
restricted to a small number of videos, we also collected a new basketball
dataset comprising 257 basketball games with 14K event annotations
corresponding to 11 event classes. Our model outperforms state-of-the-art
methods for both event classification and detection on this new dataset.
Additionally, we show that the attention mechanism is able to consistently
localize the relevant players.
Comment: Accepted for publication in CVPR'1
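The attention step described in the abstract above (combining per-track features with time-varying weights at each time instant) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the attention scores are computed or normalized, so the softmax normalization, the function names `softmax` and `attend`, and the plain-list feature vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1 (assumed normalization)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(track_features, scores):
    """Combine per-track feature vectors at ONE time step.

    track_features: list of equal-length feature vectors, one per tracked person.
    scores: one raw attention score per track (hypothetical input; in a real
            model these would be produced by a learned function).
    Returns (weights, pooled_feature).
    """
    weights = softmax(scores)
    dim = len(track_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, track_features))
        for d in range(dim)
    ]
    return weights, pooled


if __name__ == "__main__":
    # Two tracks with 2-D features; the second track gets a higher score.
    w, pooled = attend([[1.0, 0.0], [0.0, 1.0]], [0.0, 2.0])
    print(w, pooled)
```

In this sketch the pooled vector would be passed on to the second RNN mentioned in the abstract, and the weights themselves indicate which track the model attends to at that time step.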
Mobile Video Object Detection with Temporally-Aware Feature Maps
This paper introduces an online model for object detection in videos designed
to run in real-time on low-powered mobile and embedded devices. Our approach
combines fast single-image object detection with convolutional long short-term
memory (LSTM) layers to create an interweaved recurrent-convolutional
architecture. Additionally, we propose an efficient Bottleneck-LSTM layer that
significantly reduces computational cost compared to regular LSTMs. Our network
achieves temporal awareness by using Bottleneck-LSTMs to refine and propagate
feature maps across frames. This approach is substantially faster than existing
detection methods in video, outperforming the fastest single-frame models in
model size and computational cost while attaining accuracy comparable to much
more expensive single-frame models on the ImageNet VID 2015 dataset. Our model
reaches a real-time inference speed of up to 15 FPS on a mobile CPU.
Comment: In CVPR 201
Video Object Detection with an Aligned Spatial-Temporal Memory
We introduce Spatial-Temporal Memory Networks for video object detection. At
its core, a novel Spatial-Temporal Memory module (STMM) serves as the recurrent
computation unit to model long-term temporal appearance and motion dynamics.
The STMM's design enables full integration of pretrained backbone CNN weights,
which we find to be critical for accurate detection. Furthermore, in order to
tackle object motion in videos, we propose a novel MatchTrans module to align
the spatial-temporal memory from frame to frame. Our method produces
state-of-the-art results on the benchmark ImageNet VID dataset, and our
ablative studies clearly demonstrate the contribution of our different design
choices. We release our code and models at
http://fanyix.cs.ucdavis.edu/project/stmn/project.html
Single Shot Temporal Action Detection
Temporal action detection is a very important yet challenging problem, since
videos in real applications are usually long, untrimmed and contain multiple
action instances. This problem requires not only recognizing action categories
but also detecting start time and end time of each action instance. Many
state-of-the-art methods adopt the "detection by classification" framework:
first generate proposals, then classify them. The main drawback of this
framework is that the boundaries of action instance proposals have been fixed
during the classification step. To address this issue, we propose a novel
Single Shot Action Detector (SSAD) network based on 1D temporal convolutional
layers to skip the proposal generation step via directly detecting action
instances in untrimmed video. In pursuit of an SSAD network
that works effectively for temporal action detection, we empirically search
for the best SSAD architecture, since no existing model
can be directly adopted. Moreover, we investigate input feature types and
fusion strategies to further improve detection accuracy. We conduct extensive
experiments on two challenging datasets: THUMOS 2014 and MEXaction2. With the
Intersection-over-Union threshold set to 0.5 during evaluation, SSAD
significantly outperforms other state-of-the-art systems, increasing mAP from
19.0% to 24.6% on THUMOS 2014 and from 7.4% to 11.0% on MEXaction2.
Comment: ACM Multimedia 201
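The Intersection-over-Union threshold of 0.5 mentioned in the abstract above refers to temporal overlap between a predicted action segment and a ground-truth segment. The following is a minimal sketch of that overlap criterion only, assuming segments are given as (start, end) pairs in seconds; the function names `temporal_iou` and `is_match` are hypothetical. A full mAP evaluation also involves class labels, confidence ranking, and handling of duplicate detections, none of which is shown here.

```python
def temporal_iou(pred, truth):
    """Temporal IoU of two (start, end) segments."""
    pred_start, pred_end = pred
    true_start, true_end = truth
    # Overlap length is zero when the segments do not intersect.
    inter = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, threshold=0.5):
    """True when the prediction overlaps the ground truth at or above the threshold."""
    return temporal_iou(pred, truth) >= threshold


if __name__ == "__main__":
    # Prediction covers 8 of the 10 ground-truth seconds and nothing outside.
    print(temporal_iou((2.0, 10.0), (0.0, 10.0)))  # 8 / 10
    print(is_match((0.0, 10.0), (5.0, 15.0)))      # IoU is 5 / 15, below 0.5
```

Because the boundaries enter the metric directly, a detector whose segment boundaries are fixed before classification cannot improve its IoU afterwards, which is the drawback the abstract attributes to the "detection by classification" framework.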