3,398 research outputs found
Detecting events and key actors in multi-person videos
Multi-person event recognition is a challenging task, often with many people
active in the scene but only a small subset contributing to an actual event. In
this paper, we propose a model which learns to detect events in such videos
while automatically "attending" to the people responsible for the event. Our
model does not use explicit annotations regarding who or where those people are
during training and testing. In particular, we track people in videos and use a
recurrent neural network (RNN) to represent the track features. We learn
time-varying attention weights to combine these features at each time-instant.
The attended features are then processed using another RNN for event
detection/classification. Since most video datasets with multiple people are
restricted to a small number of videos, we also collected a new basketball
dataset comprising 257 basketball games with 14K event annotations
corresponding to 11 event classes. Our model outperforms state-of-the-art
methods for both event classification and detection on this new dataset.
Additionally, we show that the attention mechanism is able to consistently
localize the relevant players.Comment: Accepted for publication in CVPR'1
Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images
We address the problem of fine-grained action localization from temporally
untrimmed web videos. We assume that only weak video-level annotations are
available for training. The goal is to use these weak labels to identify
temporal segments corresponding to the actions, and learn models that
generalize to unconstrained web videos. We find that web images queried by
action names serve as well-localized highlights for many actions, but are
noisily labeled. To solve this problem, we propose a simple yet effective
method that takes weak video labels and noisy image labels as input, and
generates localized action frames as output. This is achieved by cross-domain
transfer between video frames and web images, using pre-trained deep
convolutional neural networks. We then use the localized action frames to train
action recognition models with long short-term memory networks. We collect a
fine-grained sports action data set FGA-240 of more than 130,000 YouTube
videos. It has 240 fine-grained actions under 85 sports activities. Convincing
results are shown on the FGA-240 data set, as well as the THUMOS 2014
localization data set with untrimmed training videos.Comment: Camera ready version for ACM Multimedia 201
Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
Every moment counts in action recognition. A comprehensive understanding of
human activity in video requires labeling every frame according to the actions
occurring, placing multiple labels densely over a video sequence. To study this
problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new
dataset of dense labels over unconstrained internet videos. Modeling multiple,
dense labels benefits from temporal relations within and across classes. We
define a novel variant of long short-term memory (LSTM) deep networks for
modeling these temporal relations via multiple input and output connections. We
show that this model improves action labeling accuracy and further enables
deeper understanding tasks ranging from structured retrieval to action
prediction.Comment: To appear in IJC
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
Deep models that are both effective and explainable are desirable in many
settings; prior explainable models have been unimodal, offering either
image-based visualization of attention weights or text-based generation of
post-hoc justifications. We propose a multimodal approach to explanation, and
argue that the two modalities provide complementary explanatory strengths. We
collect two new datasets to define and evaluate this task, and propose a novel
model which can provide joint textual rationale generation and attention
visualization. Our datasets define visual and textual justifications of a
classification decision for activity recognition tasks (ACT-X) and for visual
question answering tasks (VQA-X). We quantitatively show that training with the
textual explanations not only yields better textual justification models, but
also better localizes the evidence that supports the decision. We also
qualitatively show cases where visual explanation is more insightful than
textual explanation, and vice versa, supporting our thesis that multimodal
explanation models offer significant benefits over unimodal approaches.Comment: arXiv admin note: text overlap with arXiv:1612.0475
- …