A Short Note on the Kinetics-700 Human Action Dataset
We describe an extension of the DeepMind Kinetics human action dataset from
600 classes to 700 classes, where for each class there are at least 600 video
clips from different YouTube videos. This paper details the changes introduced
for this new release of the dataset, and includes a comprehensive set of
statistics as well as baseline results using the I3D neural network
architecture. Comment: arXiv admin note: substantial text overlap with arXiv:1808.0134
Attend and Interact: Higher-Order Object Interactions for Video Understanding
Human actions often involve complex interactions across several inter-related
objects in the scene. However, existing approaches to fine-grained video
understanding or visual relationship detection often rely on single object
representation or pairwise object relationships. Furthermore, learning
interactions across multiple objects in hundreds of frames for video is
computationally infeasible and performance may suffer since a large
combinatorial space has to be modeled. In this paper, we propose to efficiently
learn higher-order interactions between arbitrary subgroups of objects for
fine-grained video understanding. We demonstrate that modeling object
interactions significantly improves accuracy for both action recognition and
video captioning, while saving more than 3 times the computation over
traditional pairwise relationships. The proposed method is validated on two
large-scale datasets: Kinetics and ActivityNet Captions. Our SINet and
SINet-Caption achieve state-of-the-art performance on both datasets even
though the videos are sampled at a maximum of 1 FPS. To the best of our
knowledge, this is the first work modeling object interactions on open domain
large-scale video datasets, and we additionally model higher-order object
interactions, which improves performance at low computational cost. Comment: CVPR 201
VIENA2: A Driving Anticipation Dataset
Action anticipation is critical in scenarios where one needs to react before
the action is finalized. This is, for instance, the case in automated driving,
where a car needs to, e.g., avoid hitting pedestrians and respect traffic
lights. While solutions have been proposed to tackle subsets of the driving
anticipation tasks, by making use of diverse, task-specific sensors, there is
no single dataset or framework that addresses them all in a consistent manner.
In this paper, we therefore introduce a new, large-scale dataset, called
VIENA2, covering 5 generic driving scenarios, with a total of 25 distinct
action classes. It contains more than 15K full HD, 5s long videos acquired in
various driving conditions, weather, times of day, and environments, complemented
with a common and realistic set of sensor measurements. This amounts to more
than 2.25M frames, each annotated with an action label, corresponding to 600
samples per action class. We discuss our data acquisition strategy and the
statistics of our dataset, and benchmark state-of-the-art action anticipation
techniques, including a new multi-modal LSTM architecture with an effective
loss function for action anticipation in driving scenarios. Comment: Accepted in ACCV 201
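The figures quoted in the VIENA2 abstract can be cross-checked with a little arithmetic. The sketch below is only a consistency check, not anything taken from the paper: the frame rate of 30 fps is an assumption (the abstract does not state it), chosen because it is the value that makes 15K clips of 5 s add up to 2.25M frames. The function names are hypothetical.

```python
# Consistency check of the dataset sizes quoted in the VIENA2 abstract.
# ASSUMED_FPS is an assumption: the abstract gives no frame rate.

NUM_VIDEOS = 15_000        # "more than 15K" videos (lower bound)
CLIP_SECONDS = 5           # each video is 5 s long
ASSUMED_FPS = 30           # assumption, not stated in the abstract
NUM_CLASSES = 25           # distinct action classes
SAMPLES_PER_CLASS = 600    # samples per action class


def total_frames(num_videos: int, clip_seconds: int, fps: int) -> int:
    """Number of frames for fixed-length clips at a constant frame rate."""
    return num_videos * clip_seconds * fps


def total_samples(num_classes: int, samples_per_class: int) -> int:
    """Number of samples when every class has the same sample count."""
    return num_classes * samples_per_class


if __name__ == "__main__":
    print(total_frames(NUM_VIDEOS, CLIP_SECONDS, ASSUMED_FPS))
    print(total_samples(NUM_CLASSES, SAMPLES_PER_CLASS))
```

Under that assumed frame rate, the frame count matches the "2.25M frames" figure and the per-class sample count matches the "15K" video figure; since the abstract says "more than" for both, these are lower bounds.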
RHM: Robot House Multi-view Human Activity Recognition Dataset
© 2023, IARIA. With the recent increased development of deep neural networks and dataset capabilities, the Human Action Recognition (HAR) domain is growing rapidly in terms of both the available datasets and deep models. Despite this, there is a lack of datasets specifically covering the robotics field and human-robot interaction. We prepare and introduce a new multi-view dataset to address this. The Robot House Multi-View dataset (RHM) contains four views: Front, Back, Ceiling, and Robot Views. There are 14 classes with 6701 video clips for each view, making a total of 26804 video clips for the four views. The lengths of the video clips are between 1 and 5 seconds. Videos with the same number and the same class are synchronized across the different views. In the second part of this paper, we consider how single streams afford activity recognition using established state-of-the-art models. We then assess the affordance of each view based on information-theoretic modelling and the concept of mutual information. Furthermore, we benchmark the performance of the different views, thus establishing the strengths and weaknesses of each view with respect to its information content and benchmark performance. Our results lead us to conclude that multi-view and multi-stream activity recognition has the added potential to improve activity recognition results.
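For readers unfamiliar with the mutual information concept mentioned in the RHM abstract, the following is a minimal sketch of the standard plug-in estimate of mutual information between two discrete variables. It is not the authors' procedure, which the abstract does not describe; the function name and the toy label lists are hypothetical and serve only to illustrate the definition I(X;Y) = sum over (x, y) of p(x,y) log2( p(x,y) / (p(x) p(y)) ).

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    count_x = Counter(xs)          # marginal counts of x
    count_y = Counter(ys)          # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    # Hypothetical toy data: true class labels and one view's predictions.
    true_labels = [0, 1, 0, 1]
    view_predictions = [0, 1, 0, 1]
    print(mutual_information(true_labels, view_predictions))
```

In a setting like the one the abstract describes, `xs` could be the ground-truth activity labels and `ys` the predictions obtained from a single view, so that a higher value suggests the view carries more information about the activity. That mapping is an illustrative assumption, not a claim about the paper.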