Action Recognition by Hierarchical Mid-level Action Elements
Realistic videos of human actions exhibit rich spatiotemporal structures at
multiple levels of granularity: an action can always be decomposed into
multiple finer-grained elements in both space and time. To capture this
intuition, we propose to represent videos by a hierarchy of mid-level action
elements (MAEs), where each MAE corresponds to an action-related spatiotemporal
segment in the video. We introduce an unsupervised method to generate this
representation from videos. Our method is capable of distinguishing
action-related segments from background segments and representing actions at
multiple spatiotemporal resolutions. Given a set of spatiotemporal segments
generated from the training data, we introduce a discriminative clustering
algorithm that automatically discovers MAEs at multiple levels of granularity.
We develop structured models that capture a rich set of spatial, temporal and
hierarchical relations among the segments, where the action label and multiple
levels of MAE labels are jointly inferred. The proposed model achieves
state-of-the-art performance in multiple action recognition benchmarks.
Moreover, we demonstrate the effectiveness of our model in real-world
applications such as action recognition in large-scale untrimmed videos and
action parsing.
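To make the discovery step concrete, below is a minimal, hypothetical sketch of one common discriminative-clustering pattern over spatiotemporal segment features: alternate between cluster assignment and per-cluster linear classifiers whose scores refine the assignments. The feature dimensions, cluster count, and the alternation scheme are illustrative assumptions, not the paper's actual algorithm or its multi-level hierarchy construction.

```python
# Hypothetical sketch of discriminative clustering over spatiotemporal
# segment features (not the paper's exact algorithm): alternate between
# cluster assignment and per-cluster linear classifiers whose scores
# are used to reassign segments.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def discover_clusters(segment_feats, n_clusters=10, n_iters=3):
    """segment_feats: (num_segments, feat_dim) array of segment descriptors."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(segment_feats)
    for _ in range(n_iters):
        scores = np.zeros((len(segment_feats), n_clusters))
        for c in range(n_clusters):
            y = (labels == c).astype(int)
            if y.sum() == 0 or y.sum() == len(y):
                continue  # degenerate cluster: leave its scores at zero
            clf = LinearSVC(C=1.0).fit(segment_feats, y)
            scores[:, c] = clf.decision_function(segment_feats)
        labels = scores.argmax(axis=1)  # reassign each segment to its best-scoring cluster
    return labels

if __name__ == "__main__":
    feats = np.random.randn(200, 64)   # stand-in for real segment features
    print(discover_clusters(feats)[:10])
```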
Two-Stream Action Recognition-Oriented Video Super-Resolution
We study the video super-resolution (SR) problem for facilitating video
analytics tasks, e.g. action recognition, instead of for visual quality. The
popular action recognition methods based on convolutional networks, exemplified
by two-stream networks, are not directly applicable on video of low spatial
resolution. This can be remedied by performing video SR prior to recognition,
which motivates us to improve the SR procedure for recognition accuracy.
Tailored for two-stream action recognition networks, we propose two video SR
methods for the spatial and temporal streams respectively. On the one hand, we
observe that regions with action are more important to recognition, and we
propose an optical-flow guided weighted mean-squared-error loss for our
spatial-oriented SR (SoSR) network to emphasize the reconstruction of moving
objects. On the other hand, we observe that existing video SR methods incur
temporal discontinuity between frames, which also worsens the recognition
accuracy, and we propose a siamese network for our temporal-oriented SR (ToSR)
training that emphasizes the temporal continuity between consecutive frames. We
perform experiments using two state-of-the-art action recognition networks and
two well-known datasets--UCF101 and HMDB51. Results demonstrate the
effectiveness of our proposed SoSR and ToSR in improving recognition accuracy.
Comment: Accepted to ICCV 2019. Code: https://github.com/AlanZhang1995/TwoStreamS
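As an illustration of the flow-guided weighting idea behind SoSR, here is a minimal sketch of a loss that emphasises moving regions by weighting per-pixel squared error with normalised optical-flow magnitude. The per-sample normalisation and the base weight are assumptions made for this sketch, not the exact loss formulation used in the paper.

```python
# Hedged sketch of an optical-flow guided weighted MSE loss: per-pixel
# squared error is weighted by normalised flow magnitude so that moving
# regions contribute more to the reconstruction objective.
import torch

def flow_weighted_mse(sr, hr, flow, base_weight=0.1):
    """sr, hr: (B, 3, H, W) super-resolved and ground-truth frames.
    flow: (B, 2, H, W) optical flow associated with the ground-truth frame."""
    mag = torch.sqrt((flow ** 2).sum(dim=1, keepdim=True))      # (B, 1, H, W)
    mag = mag / (mag.amax(dim=(2, 3), keepdim=True) + 1e-8)     # normalise per sample
    weight = base_weight + mag                                  # emphasise moving regions
    return (weight * (sr - hr) ** 2).mean()

if __name__ == "__main__":
    sr = torch.rand(2, 3, 64, 64, requires_grad=True)
    hr, flow = torch.rand(2, 3, 64, 64), torch.randn(2, 2, 64, 64)
    loss = flow_weighted_mse(sr, hr, flow)
    loss.backward()
    print(float(loss))
```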
Perceptual Perspective Taking and Action Recognition
Robots that operate in social environments need to be able to recognise and understand the actions of other robots, and of humans, in order to facilitate learning through imitation and collaboration. The success of the simulation-theory approach to action recognition and imitation relies on the ability to take the perspective of other people, so as to generate simulated actions from their point of view. In this paper, simulation of visual perception is used to re-create the visual egocentric sensory space and egocentric behaviour space of an observed agent, and thereby increase the accuracy of action recognition. To demonstrate the approach, experiments are performed with a robot attributing perceptions to, and recognising the actions of, a second robot.
NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
Research on depth-based human activity analysis has achieved outstanding
performance and demonstrated the effectiveness of 3D representation for action
recognition. The existing depth-based and RGB+D-based action recognition
benchmarks have a number of limitations, including the lack of large-scale
training samples, realistic numbers of distinct classes, diversity in
camera views, varied environmental conditions, and variety of human subjects.
In this work, we introduce a large-scale dataset for RGB+D human action
recognition, which is collected from 106 distinct subjects and contains more
than 114 thousand video samples and 8 million frames. This dataset contains 120
different action classes including daily, mutual, and health-related
activities. We evaluate the performance of a series of existing 3D activity
analysis methods on this dataset, and show the advantage of applying deep
learning methods for 3D-based human action recognition. Furthermore, we
investigate a novel one-shot 3D activity recognition problem on our dataset,
and a simple yet effective Action-Part Semantic Relevance-aware (APSR)
framework is proposed for this task, which yields promising results for
recognition of the novel action classes. We believe the introduction of this
large-scale dataset will enable the community to apply, adapt, and develop
various data-hungry learning techniques for depth-based and RGB+D-based human
activity understanding. [The dataset is available at:
http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
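For context on the one-shot protocol, the following is a minimal nearest-neighbour baseline over pre-computed video embeddings: each novel class is represented by a single exemplar, and queries are labelled by cosine similarity. This only illustrates the task setup; it is not the proposed APSR framework, and the embedding source is left unspecified.

```python
# Minimal sketch of a one-shot recognition baseline (not APSR): label each
# query by its nearest novel-class exemplar under cosine similarity.
import numpy as np

def one_shot_classify(exemplar_embs, exemplar_labels, query_embs):
    """exemplar_embs: (num_novel_classes, dim), one embedding per novel class.
    query_embs: (num_queries, dim). Returns predicted labels."""
    e = exemplar_embs / np.linalg.norm(exemplar_embs, axis=1, keepdims=True)
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    sims = q @ e.T                                  # (num_queries, num_novel_classes)
    return np.asarray(exemplar_labels)[sims.argmax(axis=1)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    exemplars = rng.normal(size=(20, 256))          # one exemplar per novel class
    queries = rng.normal(size=(5, 256))             # embeddings of test videos
    print(one_shot_classify(exemplars, list(range(20)), queries))
```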
Histogram of Oriented Principal Components for Cross-View Action Recognition
Existing techniques for 3D action recognition are sensitive to viewpoint
variations because they extract features from depth images which are viewpoint
dependent. In contrast, we directly process pointclouds for cross-view action
recognition from unknown and unseen views. We propose the Histogram of Oriented
Principal Components (HOPC) descriptor that is robust to noise, viewpoint,
scale and action speed variations. At a 3D point, HOPC is computed by
projecting the three scaled eigenvectors of the pointcloud within its local
spatio-temporal support volume onto the vertices of a regular dodecahedron.
HOPC is also used for the detection of Spatio-Temporal Keypoints (STK) in 3D
pointcloud sequences so that view-invariant STK descriptors (or Local HOPC
descriptors) at these key locations only are used for action recognition. We
also propose a global descriptor computed from the normalized spatio-temporal
distribution of STKs in 4-D, which we refer to as STK-D. We have evaluated the
performance of our proposed descriptors against nine existing techniques on two
cross-view and three single-view human action recognition datasets.
Experimental results show that our techniques provide significant improvement
over state-of-the-art methods.
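The geometric core of HOPC can be sketched as follows: the eigenvectors of the covariance of a point's local neighbourhood, scaled by their eigenvalues and projected onto the 20 vertex directions of a regular dodecahedron. The neighbourhood radius, the clipping of negative projections, and the reduction of the spatio-temporal support to a purely spatial one are simplifying assumptions; the published descriptor's quantisation and normalisation are not reproduced here.

```python
# Hedged sketch of the HOPC construction: eigenvalue-scaled eigenvectors of
# the local covariance, projected onto the 20 vertices of a regular
# dodecahedron and concatenated into a 60-dim descriptor.
import numpy as np

PHI = (1 + np.sqrt(5)) / 2

def dodecahedron_vertices():
    """Return the 20 unit vertex directions of a regular dodecahedron."""
    v = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
    a, b = 1 / PHI, PHI
    for s1 in (-1, 1):
        for s2 in (-1, 1):
            v += [(0, s1 * a, s2 * b), (s1 * a, s2 * b, 0), (s2 * b, 0, s1 * a)]
    v = np.array(v, dtype=float)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def hopc(points, center, radius=0.3):
    """Descriptor for one 3D point, using its neighbourhood within `radius`."""
    nbrs = points[np.linalg.norm(points - center, axis=1) <= radius]
    cov = np.cov((nbrs - nbrs.mean(axis=0)).T)
    eigvals, eigvecs = np.linalg.eigh(cov)               # ascending eigenvalues
    verts = dodecahedron_vertices()                      # (20, 3)
    # project each eigenvalue-scaled eigenvector onto the vertex directions
    hist = [np.clip(verts @ (eigvals[i] * eigvecs[:, i]), 0, None) for i in (2, 1, 0)]
    return np.concatenate(hist)                          # 60-dim descriptor

if __name__ == "__main__":
    pts = np.random.rand(500, 3)
    print(hopc(pts, pts[0]).shape)   # (60,)
```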
