13,985 research outputs found
Detecting events and key actors in multi-person videos
Multi-person event recognition is a challenging task, often with many people
active in the scene but only a small subset contributing to an actual event. In
this paper, we propose a model which learns to detect events in such videos
while automatically "attending" to the people responsible for the event. Our
model does not use explicit annotations regarding who or where those people are
during training and testing. In particular, we track people in videos and use a
recurrent neural network (RNN) to represent the track features. We learn
time-varying attention weights to combine these features at each time-instant.
The attended features are then processed using another RNN for event
detection/classification. Since most video datasets with multiple people are
restricted to a small number of videos, we also collected a new basketball
dataset comprising 257 basketball games with 14K event annotations
corresponding to 11 event classes. Our model outperforms state-of-the-art
methods for both event classification and detection on this new dataset.
Additionally, we show that the attention mechanism is able to consistently
localize the relevant players.Comment: Accepted for publication in CVPR'1
Modeling Taxi Drivers' Behaviour for the Next Destination Prediction
In this paper, we study how to model taxi drivers' behaviour and geographical
information for an interesting and challenging task: the next destination
prediction in a taxi journey. Predicting the next location is a well studied
problem in human mobility, which finds several applications in real-world
scenarios, from optimizing the efficiency of electronic dispatching systems to
predicting and reducing the traffic jam. This task is normally modeled as a
multiclass classification problem, where the goal is to select, among a set of
already known locations, the next taxi destination. We present a Recurrent
Neural Network (RNN) approach that models the taxi drivers' behaviour and
encodes the semantics of visited locations by using geographical information
from Location-Based Social Networks (LBSNs). In particular, RNNs are trained to
predict the exact coordinates of the next destination, overcoming the problem
of producing, in output, a limited set of locations, seen during the training
phase. The proposed approach was tested on the ECML/PKDD Discovery Challenge
2015 dataset - based on the city of Porto -, obtaining better results with
respect to the competition winner, whilst using less information, and on
Manhattan and San Francisco datasets.Comment: preprint version of a paper submitted to IEEE Transactions on
Intelligent Transportation System
A Grid-based Representation for Human Action Recognition
Human action recognition (HAR) in videos is a fundamental research topic in
computer vision. It consists mainly in understanding actions performed by
humans based on a sequence of visual observations. In recent years, HAR have
witnessed significant progress, especially with the emergence of deep learning
models. However, most of existing approaches for action recognition rely on
information that is not always relevant for this task, and are limited in the
way they fuse the temporal information. In this paper, we propose a novel
method for human action recognition that encodes efficiently the most
discriminative appearance information of an action with explicit attention on
representative pose features, into a new compact grid representation. Our GRAR
(Grid-based Representation for Action Recognition) method is tested on several
benchmark datasets demonstrating that our model can accurately recognize human
actions, despite intra-class appearance variations and occlusion challenges.Comment: Accepted on 25th International Conference on Pattern Recognition
(ICPR 2020
Action Recognition in Videos: from Motion Capture Labs to the Web
This paper presents a survey of human action recognition approaches based on
visual data recorded from a single video camera. We propose an organizing
framework which puts in evidence the evolution of the area, with techniques
moving from heavily constrained motion capture scenarios towards more
challenging, realistic, "in the wild" videos. The proposed organization is
based on the representation used as input for the recognition task, emphasizing
the hypothesis assumed and thus, the constraints imposed on the type of video
that each technique is able to address. Expliciting the hypothesis and
constraints makes the framework particularly useful to select a method, given
an application. Another advantage of the proposed organization is that it
allows categorizing newest approaches seamlessly with traditional ones, while
providing an insightful perspective of the evolution of the action recognition
task up to now. That perspective is the basis for the discussion in the end of
the paper, where we also present the main open issues in the area.Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4
table
Skeleton-based Relational Reasoning for Group Activity Analysis
Research on group activity recognition mostly leans on the standard
two-stream approach (RGB and Optical Flow) as their input features. Few have
explored explicit pose information, with none using it directly to reason about
the persons interactions. In this paper, we leverage the skeleton information
to learn the interactions between the individuals straight from it. With our
proposed method GIRN, multiple relationship types are inferred from independent
modules, that describe the relations between the body joints pair-by-pair.
Additionally to the joints relations, we also experiment with the previously
unexplored relationship between individuals and relevant objects (e.g.
volleyball). The individuals distinct relations are then merged through an
attention mechanism, that gives more importance to those individuals more
relevant for distinguishing the group activity. We evaluate our method in the
Volleyball dataset, obtaining competitive results to the state-of-the-art. Our
experiments demonstrate the potential of skeleton-based approaches for modeling
multi-person interactions.Comment: 26 pages, 5 figures, accepted manuscript in Elsevier Pattern
Recognition, minor writing revisions and new reference
Actor-Transformers for Group Activity Recognition
This paper strives to recognize individual actions and group activities from
videos. While existing solutions for this challenging problem explicitly model
spatial and temporal relationships based on location of individual actors, we
propose an actor-transformer model able to learn and selectively extract
information relevant for group activity recognition. We feed the transformer
with rich actor-specific static and dynamic representations expressed by
features from a 2D pose network and 3D CNN, respectively. We empirically study
different ways to combine these representations and show their complementary
benefits. Experiments show what is important to transform and how it should be
transformed. What is more, actor-transformers achieve state-of-the-art results
on two publicly available benchmarks for group activity recognition,
outperforming the previous best published results by a considerable margin.Comment: CVPR 202
Analyzing Human-Human Interactions: A Survey
Many videos depict people, and it is their interactions that inform us of
their activities, relation to one another and the cultural and social setting.
With advances in human action recognition, researchers have begun to address
the automated recognition of these human-human interactions from video. The
main challenges stem from dealing with the considerable variation in recording
setting, the appearance of the people depicted and the coordinated performance
of their interaction. This survey provides a summary of these challenges and
datasets to address these, followed by an in-depth discussion of relevant
vision-based recognition and detection methods. We focus on recent, promising
work based on deep learning and convolutional neural networks (CNNs). Finally,
we outline directions to overcome the limitations of the current
state-of-the-art to analyze and, eventually, understand social human actions
Mecanismos de codificación y procesamiento de información en redes basadas en firmas neuronales
Tesis doctoral inédita leída en la Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Fecha de lectura: 21-02-202
- …