EGOFALLS: A visual-audio dataset and benchmark for fall detection using egocentric cameras
Falls are a significant and often fatal hazard for vulnerable populations such
as the elderly. Previous work has addressed fall detection by relying on data
captured by a single sensor, such as images or accelerometers. In this work, we
rely on multimodal descriptors extracted from videos captured by egocentric
cameras. Our proposed method includes a late decision fusion layer that builds
on top of the extracted descriptors. Furthermore, we collect a new dataset on
which we assess our proposed approach. We believe this is the first public
dataset of its kind. The dataset comprises 10,948 video samples recorded by 14 subjects.
We conducted ablation experiments to assess the performance of individual
feature extractors, fusion of visual information, and fusion of both visual and
audio information. Moreover, we experimented with internal and external
cross-validation. Our results demonstrate that the fusion of audio and visual
information through late decision fusion improves detection performance, making
it a promising tool for fall prevention and mitigation.
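A minimal PyTorch sketch of the late decision fusion idea described above: per-modality class probabilities are combined with learnable weights. The class name, the two-modality usage, and the weighting scheme are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class LateDecisionFusion(nn.Module):
    # Hypothetical sketch: fuses fall/no-fall scores at the decision level.
    def __init__(self, num_modalities: int):
        super().__init__()
        # Learnable per-modality weights, normalized at fusion time.
        self.weights = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, logits_per_modality):
        # Each entry: (batch, num_classes) logits from one modality head.
        probs = torch.stack([l.softmax(dim=-1) for l in logits_per_modality])  # (M, B, C)
        w = self.weights.softmax(dim=0).view(-1, 1, 1)                         # (M, 1, 1)
        return (w * probs).sum(dim=0)                                          # (B, C)

# Usage with two hypothetical modality heads (visual and audio):
visual_logits = torch.randn(8, 2)
audio_logits = torch.randn(8, 2)
fused_probs = LateDecisionFusion(num_modalities=2)([visual_logits, audio_logits])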
Forecasting Hands and Objects in Future Frames
This paper presents an approach to forecasting the future presence and location
of human hands and objects. Given an image frame, the goal is to predict which
objects will appear in a future frame (e.g., 5 seconds later) and where they
will be located, even when they are not visible in the current frame. The
key idea is that (1) an intermediate representation of a convolutional object
recognition model abstracts scene information in its frame and that (2) we can
predict (i.e., regress) such representations corresponding to the future frames
based on that of the current frame. We design a new two-stream convolutional
neural network (CNN) architecture for videos by extending a state-of-the-art
convolutional object detection network, and present a new fully convolutional
regression network for predicting future scene representations. Our experiments
confirm that combining the regressed future representation with our detection
network allows reliable estimation of future hands and objects in videos. We
obtain much higher accuracy than the state-of-the-art future object presence
forecasting method on a public dataset.
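As a hedged sketch of idea (2), the following fully convolutional PyTorch network regresses a future feature map from the current one; the channel count, depth, and L2 loss are assumptions for exposition, not the authors' exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FutureRepresentationRegressor(nn.Module):
    # Maps the detector's current intermediate representation to a
    # predicted future one, preserving spatial resolution.
    def __init__(self, channels: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),  # project back
        )

    def forward(self, current_feat):
        # current_feat: (B, C, H, W) intermediate detector feature map.
        return self.net(current_feat)

# Training target: the same detector's feature map from the frame N seconds
# later (here a random stand-in); regression with an L2 loss is one option.
regressor = FutureRepresentationRegressor()
feat_now = torch.randn(2, 256, 32, 32)
loss = F.mse_loss(regressor(feat_now), torch.randn(2, 256, 32, 32))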
Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition
Recognizing interactive actions plays an important role in human-robot
interaction and collaboration. Previous methods use late fusion and
co-attention mechanisms to capture interactive relations, but these have
limited learning capability or adapt inefficiently to a larger number of
interacting entities. Because they assume that priors for each entity are
already known, they also lack evaluation in a more general setting that
addresses the diversity of subjects. To
address these problems, we propose an Interactive Spatiotemporal Token
Attention Network (ISTA-Net), which simultaneously models spatial, temporal,
and interactive relations. Specifically, our network contains a tokenizer that
partitions Interactive Spatiotemporal Tokens (ISTs), a unified way to
represent motions of multiple diverse entities. By extending the entity
dimension, ISTs provide better interactive representations. To jointly learn
along all three dimensions of ISTs, multi-head self-attention blocks integrated
with 3D convolutions are designed to capture inter-token correlations. When
modeling these correlations, a strict entity ordering is usually irrelevant to
recognizing interactive actions. To this end, Entity Rearrangement is proposed
to eliminate the ordering of interchangeable entities in ISTs. Extensive
experiments on four datasets verify the effectiveness of ISTA-Net, which
outperforms state-of-the-art methods. Our code is publicly available at
https://github.com/Necolizer/ISTA-Net
Comment: IROS 2023 camera-ready version. Project website: https://necolizer.github.io/ISTA-Net
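To make the token-attention idea concrete, here is a toy PyTorch sketch that embeds every (entity, frame, joint) triple as one token and applies multi-head self-attention across the flattened sequence, so spatial, temporal, and inter-entity relations can mix. All dimensions and the module name are hypothetical; the real ISTA-Net additionally integrates 3D convolutions and Entity Rearrangement.

import torch
import torch.nn as nn

class ISTSelfAttention(nn.Module):
    # Hypothetical sketch of attention over interactive spatiotemporal tokens.
    def __init__(self, in_dim: int = 3, embed_dim: int = 64, heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(in_dim, embed_dim)  # per-joint tokenizer
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)

    def forward(self, skel):
        # skel: (batch, entities, frames, joints, coords)
        b, e, t, j, c = skel.shape
        tokens = self.embed(skel).reshape(b, e * t * j, -1)  # one token per (entity, frame, joint)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.mean(dim=1)  # pooled clip-level feature

# Two interacting skeletons, 16 frames, 25 joints, 3D coordinates:
feat = ISTSelfAttention()(torch.randn(2, 2, 16, 25, 3))  # (2, 64)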
Mutual Context Network for Jointly Estimating Egocentric Gaze and Actions
In this work, we address the two coupled tasks of gaze prediction and action
recognition in egocentric videos by exploring their mutual context. Our
assumption is that, while performing a manipulation task, what a person is
doing determines where the person is looking, and the gaze point reveals gaze
and non-gaze regions that contain important and complementary information
about the ongoing action. We propose a novel mutual context
network (MCN) that jointly learns action-dependent gaze prediction and
gaze-guided action recognition in an end-to-end manner. Experiments on public
egocentric video datasets demonstrate that our MCN achieves state-of-the-art
performance on both gaze prediction and action recognition.
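As a rough sketch of the gaze-guided half of this coupling, the PyTorch snippet below predicts a gaze heatmap from a shared video feature map, then pools gaze and non-gaze regions separately for action classification; the module name and dimensions are assumptions, and the actual MCN's mutual, action-dependent conditioning is more involved.

import torch
import torch.nn as nn

class MutualContextHeads(nn.Module):
    # Hypothetical sketch: a gaze heatmap guides spatial pooling for actions.
    def __init__(self, channels: int = 256, num_actions: int = 10):
        super().__init__()
        self.gaze_head = nn.Conv2d(channels, 1, kernel_size=1)   # gaze heatmap
        self.action_head = nn.Linear(channels * 2, num_actions)  # gaze + non-gaze

    def forward(self, feat):
        # feat: (B, C, H, W) from a shared video backbone.
        gaze = self.gaze_head(feat).flatten(1).softmax(dim=-1)   # (B, H*W)
        g = gaze.view(feat.size(0), 1, *feat.shape[-2:])         # (B, 1, H, W)
        gaze_feat = (feat * g).sum(dim=(2, 3))                   # gaze-region pooling
        nongaze_feat = (feat * (1 - g)).mean(dim=(2, 3))         # complementary context
        action_logits = self.action_head(torch.cat([gaze_feat, nongaze_feat], dim=1))
        return gaze, action_logits

gaze_map, action_logits = MutualContextHeads()(torch.randn(2, 256, 14, 14))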