23,714 research outputs found
Concurrence-Aware Long Short-Term Sub-Memories for Person-Person Action Recognition
Recently, Long Short-Term Memory (LSTM) has become a popular choice to model
individual dynamics for single-person action recognition due to its ability of
modeling the temporal information in various ranges of dynamic contexts.
However, existing RNN models only focus on capturing the temporal dynamics of
the person-person interactions by naively combining the activity dynamics of
individuals or modeling them as a whole. This neglects the inter-related
dynamics of how person-person interactions change over time. To this end, we
propose a novel Concurrence-Aware Long Short-Term Sub-Memories (Co-LSTSM) to
model the long-term inter-related dynamics between two interacting people on
the bounding boxes covering people. Specifically, for each frame, two
sub-memory units store individual motion information, while a concurrent LSTM
unit selectively integrates and stores inter-related motion information between
interacting people from these two sub-memory units via a new co-memory cell.
Experimental results on the BIT and UT datasets show the superiority of
Co-LSTSM compared with the state-of-the-art methods
Latent Embeddings for Collective Activity Recognition
Rather than simply recognizing the action of a person individually,
collective activity recognition aims to find out what a group of people is
acting in a collective scene. Previ- ous state-of-the-art methods using
hand-crafted potentials in conventional graphical model which can only define a
limited range of relations. Thus, the complex structural de- pendencies among
individuals involved in a collective sce- nario cannot be fully modeled. In
this paper, we overcome these limitations by embedding latent variables into
feature space and learning the feature mapping functions in a deep learning
framework. The embeddings of latent variables build a global relation
containing person-group interac- tions and richer contextual information by
jointly modeling broader range of individuals. Besides, we assemble atten- tion
mechanism during embedding for achieving more com- pact representations. We
evaluate our method on three col- lective activity datasets, where we
contribute a much larger dataset in this work. The proposed model has achieved
clearly better performance as compared to the state-of-the- art methods in our
experiments.Comment: 6pages, accepted by IEEE-AVSS201
CERN: Confidence-Energy Recurrent Network for Group Activity Recognition
This work is about recognizing human activities occurring in videos at
distinct semantic levels, including individual actions, interactions, and group
activities. The recognition is realized using a two-level hierarchy of Long
Short-Term Memory (LSTM) networks, forming a feed-forward deep architecture,
which can be trained end-to-end. In comparison with existing architectures of
LSTMs, we make two key contributions giving the name to our approach as
Confidence-Energy Recurrent Network -- CERN. First, instead of using the common
softmax layer for prediction, we specify a novel energy layer (EL) for
estimating the energy of our predictions. Second, rather than finding the
common minimum-energy class assignment, which may be numerically unstable under
uncertainty, we specify that the EL additionally computes the p-values of the
solutions, and in this way estimates the most confident energy minimum. The
evaluation on the Collective Activity and Volleyball datasets demonstrates: (i)
advantages of our two contributions relative to the common softmax and
energy-minimization formulations and (ii) a superior performance relative to
the state-of-the-art approaches.Comment: Accepted to IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 201
Detecting events and key actors in multi-person videos
Multi-person event recognition is a challenging task, often with many people
active in the scene but only a small subset contributing to an actual event. In
this paper, we propose a model which learns to detect events in such videos
while automatically "attending" to the people responsible for the event. Our
model does not use explicit annotations regarding who or where those people are
during training and testing. In particular, we track people in videos and use a
recurrent neural network (RNN) to represent the track features. We learn
time-varying attention weights to combine these features at each time-instant.
The attended features are then processed using another RNN for event
detection/classification. Since most video datasets with multiple people are
restricted to a small number of videos, we also collected a new basketball
dataset comprising 257 basketball games with 14K event annotations
corresponding to 11 event classes. Our model outperforms state-of-the-art
methods for both event classification and detection on this new dataset.
Additionally, we show that the attention mechanism is able to consistently
localize the relevant players.Comment: Accepted for publication in CVPR'1
- …