NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
Research on depth-based human activity analysis has achieved outstanding
performance and demonstrated the effectiveness of 3D representations for
action recognition. However, existing depth-based and RGB+D-based action
recognition benchmarks have a number of limitations, including the lack of
large-scale training samples, a realistic number of distinct class categories,
diversity in camera views, varied environmental conditions, and variety in
human subjects.
In this work, we introduce a large-scale dataset for RGB+D human action
recognition, which is collected from 106 distinct subjects and contains more
than 114 thousand video samples and 8 million frames. This dataset contains 120
different action classes including daily, mutual, and health-related
activities. We evaluate the performance of a series of existing 3D activity
analysis methods on this dataset, and show the advantage of applying deep
learning methods for 3D-based human action recognition. Furthermore, we
investigate a novel one-shot 3D activity recognition problem on our dataset
and propose a simple yet effective Action-Part Semantic Relevance-aware (APSR)
framework for this task, which yields promising results in recognizing the
novel action classes. We believe the introduction of this
large-scale dataset will enable the community to apply, adapt, and develop
various data-hungry learning techniques for depth-based and RGB+D-based human
activity understanding. [The dataset is available at:
http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
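The one-shot protocol above evaluates recognition of novel action classes from a single exemplar per class. The APSR framework itself is not detailed in this abstract; purely as an illustrative baseline, a nearest-neighbour matcher over learned skeleton-sequence embeddings might look as follows (the embedding step, cosine metric, and function names are assumptions, not the paper's method).

```python
import numpy as np

def one_shot_classify(query_feat, exemplar_feats, exemplar_labels):
    """Assign a query sample to the novel class whose single exemplar is
    closest in embedding space (cosine similarity).

    query_feat:      (D,) embedding of the query sequence
    exemplar_feats:  (K, D) one embedding per novel class
    exemplar_labels: length-K list of novel-class ids
    """
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    e = exemplar_feats / (np.linalg.norm(exemplar_feats, axis=1, keepdims=True) + 1e-8)
    sims = e @ q  # cosine similarity of the query to each class exemplar
    return exemplar_labels[int(np.argmax(sims))]
```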
TS-RGBD Dataset: a Novel Dataset for Theatre Scenes Description for People with Visual Impairments
Computer vision has long been used to help visually impaired people move
around their environment and avoid obstacles and falls. Existing solutions are
limited to either indoor or outdoor scenes, which restricts the kinds of places
visually impaired people can be in, including entertainment venues such as
theatres. Furthermore, most of the proposed computer-vision-based methods rely
on RGB benchmarks to train their models, resulting in limited performance due
to the absence of the depth modality.
In this paper, we propose TS-RGBD, a novel RGB-D dataset containing theatre
scenes with ground-truth human action and dense caption annotations for image
captioning and human action recognition. It includes three types of data, RGB,
depth, and skeleton sequences, captured with a Microsoft Kinect.
We evaluate image captioning models as well as several skeleton-based human
action recognition models on our dataset, extending the range of environments
a visually impaired person can be in by detecting human actions and textually
describing regions of interest in theatre scenes.
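Skeleton-based recognition models of the kind evaluated above typically consume fixed-size joint-coordinate tensors. A minimal packing sketch, assuming Kinect-style skeletons with 25 joints and x, y, z coordinates per frame; the array layout and `max_frames` value are illustrative assumptions rather than the dataset's official format.

```python
import numpy as np

def pack_skeleton_batch(sequences, max_frames=300, num_joints=25):
    """Pad or truncate variable-length skeleton sequences into a fixed-size
    (N, T, J, 3) tensor suitable for a skeleton-based action recognition model.

    sequences: list of arrays shaped (T_i, num_joints, 3) holding x, y, z
               joint coordinates per frame (Kinect-style skeletons).
    """
    batch = np.zeros((len(sequences), max_frames, num_joints, 3), dtype=np.float32)
    for i, seq in enumerate(sequences):
        t = min(len(seq), max_frames)
        batch[i, :t] = seq[:t]  # zero-pad sequences shorter than max_frames
    return batch
```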
Egocentric RGB+Depth Action Recognition in Industry-Like Settings
Action recognition from an egocentric viewpoint is a crucial perception task
in robotics and enables a wide range of human-robot interactions. While most
computer vision approaches prioritize the RGB camera, the Depth modality,
which can further amplify the subtleties of actions from an egocentric
perspective, remains underexplored. Our work focuses on recognizing actions
from egocentric RGB and Depth modalities in an industry-like environment. To
study this problem, we consider the recent MECCANO dataset, which provides a
wide range of assembling actions. Our framework is based on the 3D Video SWIN
Transformer to encode both RGB and Depth modalities effectively. To address the
inherent skewness in real-world multimodal action occurrences, we propose a
training strategy using an exponentially decaying variant of the focal loss
modulating factor. Additionally, to leverage the information in both RGB and
Depth modalities, we opt for late fusion to combine the predictions from each
modality. We thoroughly evaluate our method on the action recognition task of
the MECCANO dataset, and it significantly outperforms the prior work. Notably,
our method also secured first place at the multimodal action recognition
challenge at ICIAP 2023
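Two of the ingredients described above, a focal loss whose modulating factor decays exponentially over training and late fusion of per-modality predictions, can be sketched roughly as follows. The decay schedule, fusion weight, and function names are illustrative assumptions; the abstract does not specify them.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma):
    """Focal loss with a tunable modulating exponent gamma (no class weighting)."""
    log_p = F.log_softmax(logits, dim=-1)
    log_p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    return (-(1.0 - log_p_t.exp()) ** gamma * log_p_t).mean()

def decayed_gamma(epoch, gamma0=2.0, decay=0.9):
    """Illustrative exponential decay of the modulating factor across epochs."""
    return gamma0 * (decay ** epoch)

def late_fusion(rgb_logits, depth_logits, w_rgb=0.5):
    """Combine per-modality predictions by averaging class probabilities."""
    probs = w_rgb * rgb_logits.softmax(-1) + (1.0 - w_rgb) * depth_logits.softmax(-1)
    return probs.argmax(-1)
```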
The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose
The availability of a large labeled dataset is a key requirement for applying
deep learning methods to solve various computer vision tasks. In the context of
understanding human activities, existing public datasets, while large in size,
are often limited to a single RGB camera and provide only per-frame or per-clip
action annotations. To enable richer analysis and understanding of human
activities, we introduce IKEA ASM, a three-million-frame, multi-view,
furniture assembly video dataset that includes depth, atomic actions, object
segmentation, and human pose. Additionally, we benchmark prominent methods for
video action recognition, object segmentation and human pose estimation tasks
on this challenging dataset. The dataset enables the development of holistic
methods that integrate multi-modal and multi-view data to better perform on
these tasks.
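Dense per-frame action labels such as those provided here can be collapsed into contiguous action segments for clip-level training or evaluation. A small sketch, assuming the annotations are given as one integer label per frame; the actual IKEA ASM annotation files may use a different layout.

```python
def frames_to_segments(frame_labels):
    """Group a per-frame label sequence into (label, start, end) segments,
    where end is exclusive, e.g. [0, 0, 3, 3, 3, 1] -> [(0, 0, 2), (3, 2, 5), (1, 5, 6)]."""
    segments, start = [], 0
    for t in range(1, len(frame_labels) + 1):
        if t == len(frame_labels) or frame_labels[t] != frame_labels[start]:
            segments.append((frame_labels[start], start, t))
            start = t
    return segments
```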
- …