A robust and efficient video representation for action recognition
This paper introduces a state-of-the-art video representation and applies it
to efficient action recognition and detection. We first propose to improve the
popular dense trajectory features by explicit camera motion estimation. More
specifically, we extract feature point matches between frames using SURF
descriptors and dense optical flow. The matches are used to estimate a
homography with RANSAC. To improve the robustness of homography estimation, a
human detector is employed to remove outlier matches from the human body as
human motion is not constrained by the camera. Trajectories consistent with the
homography are attributed to camera motion and thus removed. We also
use the homography to cancel out camera motion from the optical flow. This
results in a significant improvement for the motion-based HOF and MBH descriptors. We
further explore the recent Fisher vector as an alternative feature encoding
approach to the standard bag-of-words histogram, and consider different ways to
include spatial layout information in these encodings. We present a large and
varied set of evaluations, considering (i) classification of short basic
actions on six datasets, (ii) localization of such actions in feature-length
movies, and (iii) large-scale recognition of complex events. We find that our
improved trajectory features significantly outperform previous dense
trajectories, and that Fisher vectors are superior to bag-of-words encodings
for video recognition tasks. In all three tasks, we show substantial
improvements over the state-of-the-art results.
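The camera-motion step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses NumPy only, assumes point matches are already available (the paper obtains them from SURF and optical flow), uses an unnormalized direct linear transform inside a basic RANSAC loop, and treats a "trajectory" as a single start/end point pair. The function names, the pixel thresholds, and the box format `(x0, y0, x1, y1)` are all hypothetical.

```python
import numpy as np

def apply_homography(H, pts):
    """Map an (N, 2) array of points through a 3x3 homography."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    with np.errstate(divide="ignore", invalid="ignore"):
        return mapped[:, :2] / mapped[:, 2:3]

def fit_homography(src, dst):
    """Direct linear transform from >= 4 matches (no coordinate normalization)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    if abs(H[2, 2]) < 1e-12:  # degenerate sample
        return None
    return H / H[2, 2]

def outside_boxes(pts, boxes):
    """True for points outside every (hypothetical) human-detector box."""
    keep = np.ones(len(pts), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        inside = ((pts[:, 0] >= x0) & (pts[:, 0] <= x1) &
                  (pts[:, 1] >= y0) & (pts[:, 1] <= y1))
        keep &= ~inside
    return keep

def ransac_homography(src, dst, keep, n_iters=200, tol=1.0, seed=0):
    """Estimate H from the matches selected by `keep`, ignoring the rest."""
    rng = np.random.default_rng(seed)
    candidates = np.flatnonzero(keep)
    best_H, best_inliers = None, np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(candidates, size=4, replace=False)
        H = fit_homography(src[sample], dst[sample])
        if H is None:
            continue
        err = np.linalg.norm(apply_homography(H, src) - dst, axis=1)
        inliers = keep & (err < tol)
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    if best_inliers.sum() >= 4:
        # Refit on all inliers of the best model.
        refit = fit_homography(src[best_inliers], dst[best_inliers])
        best_H = refit if refit is not None else best_H
    return best_H, best_inliers

def camera_motion_mask(H, starts, ends, thresh=1.0):
    """True where a point's displacement is explained by H alone."""
    residual = np.linalg.norm(apply_homography(H, starts) - ends, axis=1)
    return residual < thresh
```

In this sketch, points flagged by `camera_motion_mask` would be discarded as background motion, while points with a large residual (for example, on a moving person) would be kept as candidate action trajectories.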
Learning without Prejudice: Avoiding Bias in Webly-Supervised Action Recognition
Webly-supervised learning has recently emerged as an alternative paradigm to
traditional supervised learning based on large-scale datasets with manual
annotations. The key idea is that models such as CNNs can be learned from the
noisy visual data available on the web. In this work we aim to exploit web data
for video understanding tasks such as action recognition and detection. One of
the main problems in webly-supervised learning is cleaning the noisy labeled
data from the web. The state-of-the-art paradigm relies on training a first
classifier on noisy data that is then used to clean the remaining dataset. Our
key insight is that this procedure biases the second classifier towards samples
that the first one understands. Here we train two independent CNNs, an RGB
network on web images and video frames and a second network using temporal
information from optical flow. We show that training the networks independently
is vastly superior to selecting the frames for the flow classifier by using our
RGB network. Moreover, we show benefits in enriching the training set with
different data sources from heterogeneous public web databases. We demonstrate
that our framework outperforms all other webly-supervised methods on two public
benchmarks, UCF-101 and Thumos'14.
Comment: Submitted to CVIU SI: Computer Vision and the We
Forecasting People Trajectories and Head Poses by Jointly Reasoning on Tracklets and Vislets
In this work, we explore the correlation between people's trajectories and
their head orientations. We argue that trajectory and head pose
forecasting can be modelled as a joint problem. Recent approaches to trajectory
forecasting leverage short-term trajectories (aka tracklets) of pedestrians to
predict their future paths. In addition, sociological cues, such as expected
destination or pedestrian interaction, are often combined with tracklets. In
this paper, we propose MiXing-LSTM (MX-LSTM) to capture the interplay between
positions and head orientations (vislets) thanks to a joint unconstrained
optimization of full covariance matrices during the LSTM backpropagation. We
additionally exploit head orientations as a proxy for visual attention
when modeling social interactions. MX-LSTM predicts future pedestrian locations
and head poses, extending the capabilities of current approaches
to long-term trajectory forecasting. Compared to the state of the art, our
approach performs better on an extensive set of public benchmarks.
MX-LSTM is particularly effective when people move slowly, i.e., the most
challenging scenario for all other models. The proposed approach also allows
for accurate predictions on a longer time horizon.
Comment: Accepted at IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE
INTELLIGENCE 2019. arXiv admin note: text overlap with arXiv:1805.0065
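The phrase "unconstrained optimization of full covariance matrices" in the MX-LSTM abstract can be illustrated with a small sketch. It shows one common way to make a covariance matrix trainable without constraints, a Cholesky factor with an exponentiated diagonal. Whether this matches the paper's exact parametrization is an assumption, as are the function names and the choice of a 4-D vector (2-D position plus 2-D head-orientation point).

```python
import numpy as np

def params_to_covariance(theta, dim):
    """Map dim*(dim+1)/2 unconstrained reals to a symmetric positive-definite matrix."""
    L = np.zeros((dim, dim))
    L[np.tril_indices(dim)] = theta   # fill the lower triangle row by row
    d = np.diag_indices(dim)
    L[d] = np.exp(L[d])               # a strictly positive diagonal keeps L invertible
    return L @ L.T                    # L L^T is positive definite by construction

def gaussian_nll(x, mean, cov):
    """Negative log-likelihood of x under a full-covariance Gaussian."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return 0.5 * (maha + logdet + len(x) * np.log(2 * np.pi))

# Hypothetical usage: 10 free parameters describe a 4x4 covariance.
theta = np.zeros(10)
cov = params_to_covariance(theta, 4)  # identity matrix for all-zero parameters
loss = gaussian_nll(np.zeros(4), np.zeros(4), cov)
```

Because any real-valued `theta` yields a valid covariance, a network could output `theta` directly and be trained by plain gradient descent on the negative log-likelihood, with no projection or clipping step.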