StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation
The anticipation problem has been studied under different aspects, such as predicting humans' locations, predicting hand and object trajectories, and forecasting actions and human-object interactions. In this paper, we study the short-term object interaction anticipation problem from the egocentric point of view and propose a new end-to-end architecture named StillFast. Our approach simultaneously processes a still image and a video, detecting and localizing next-active objects, predicting the verb which describes the future interaction, and determining when the interaction will start. Experiments on the large-scale egocentric dataset EGO4D show that our method outperforms state-of-the-art approaches on the considered task. Our method ranked first on the public leaderboard of the EGO4D short-term object interaction anticipation challenge 2022. Please see the project web page for code and additional details: https://iplab.dmi.unict.it/stillfast/
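As a rough illustration of the two-branch idea described above, the sketch below pairs a 2D backbone over a high-resolution still frame with a 3D backbone over the lower-resolution video clip, then attaches heads for the next-active-object box, object class, interaction verb and time to contact. The layer choices, feature dimensions and class counts are placeholders for illustration, not the actual StillFast architecture.

```python
# Hedged sketch of a StillFast-style still-image + video model (dimensions and
# class counts below are illustrative placeholders, not the paper's values).
import torch
import torch.nn as nn

class StillFastSketch(nn.Module):
    def __init__(self, num_nouns=87, num_verbs=74, feat_dim=256):
        super().__init__()
        # "Still" branch: 2D CNN over the last high-resolution frame.
        self.still_branch = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # "Fast" branch: 3D CNN over the low-resolution video clip.
        self.fast_branch = nn.Sequential(
            nn.Conv3d(3, feat_dim, kernel_size=(3, 7, 7), stride=(1, 4, 4), padding=(1, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        fused = feat_dim * 2
        # Heads: next-active-object box + noun class, interaction verb, time to contact.
        self.box_head = nn.Linear(fused, 4)
        self.noun_head = nn.Linear(fused, num_nouns)
        self.verb_head = nn.Linear(fused, num_verbs)
        self.ttc_head = nn.Linear(fused, 1)

    def forward(self, still_image, video_clip):
        # still_image: (B, 3, H, W); video_clip: (B, 3, T, h, w) with h < H, w < W.
        s = self.still_branch(still_image).flatten(1)
        f = self.fast_branch(video_clip).flatten(1)
        x = torch.cat([s, f], dim=1)
        return {
            "box": self.box_head(x),    # next-active-object location
            "noun": self.noun_head(x),  # object class
            "verb": self.verb_head(x),  # future interaction verb
            "ttc": self.ttc_head(x),    # when the interaction will start
        }

model = StillFastSketch()
out = model(torch.randn(2, 3, 448, 448), torch.randn(2, 3, 16, 112, 112))
print({k: v.shape for k, v in out.items()})
```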
Going deeper into third-person action anticipation
Analysing human actions in videos is gaining a great deal of interest in the field of computer vision. This paper explores and reviews different deep learning techniques used in third-person action anticipation. In many architectures, the task of action anticipation is divided into feature extraction and a predictive model. This paper outlines a project plan for action anticipation in the third person using step-based activity. We will use several datasets to compare some of these architectures based on their prediction accuracy and their ability to predict actions over varying time frames.
Joint-Based Action Progress Prediction
Action understanding is a fundamental branch of computer vision with applications ranging from surveillance to robotics. Most works deal with localizing and recognizing the action in both time and space, without characterizing its evolution. Recent works have addressed the prediction of action progress, an estimate of how far the action has advanced as it is performed. In this paper, we propose to predict action progress using a different modality compared to previous methods: body joints. Human body joints carry very precise information about human poses, which we believe are a much more lightweight and effective way of characterizing actions and therefore their execution. Action progress can in fact be estimated by understanding how key poses follow each other during the development of an activity. We show how an action progress prediction model can exploit body joints and be integrated with modules providing keypoint and action information in order to run directly on raw pixels. The proposed method is experimentally validated on the Penn Action Dataset.
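A minimal sketch of what progress prediction from body joints can look like, assuming per-frame 2D keypoints (e.g., the 13 joints annotated in Penn Action) and a simple recurrent regressor; the actual model, joint representation and training setup in the paper may differ.

```python
# Hedged sketch: regress a per-frame progress value in [0, 1] from body joints.
# The joint count, recurrent design and loss are illustrative assumptions.
import torch
import torch.nn as nn

class JointProgressSketch(nn.Module):
    def __init__(self, num_joints=13, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(input_size=num_joints * 2, hidden_size=hidden, batch_first=True)
        self.progress_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, joints):
        # joints: (B, T, num_joints, 2) -- (x, y) keypoints per frame, e.g. from a pose estimator.
        b, t, j, c = joints.shape
        h, _ = self.rnn(joints.view(b, t, j * c))
        return self.progress_head(h).squeeze(-1)  # (B, T) progress values in [0, 1]

# Training target: for a clip covering a full action, ground-truth progress
# is simply a linear ramp from 0 to 1 over the T frames.
model = JointProgressSketch()
joints = torch.randn(4, 30, 13, 2)
target = torch.linspace(0, 1, 30).expand(4, 30)
loss = nn.functional.mse_loss(model(joints), target)
loss.backward()
```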
VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation
Egocentric action anticipation is a challenging task that aims to predict future actions in advance from current and historical observations in the first-person view. Most existing methods focus on improving the model architecture and loss function based on visual input and recurrent neural networks to boost anticipation performance. However, these methods, which merely consider visual information and rely on a single network architecture, gradually reach a performance plateau. In order to fully understand what has been observed and to capture the dependencies between current observations and future actions well enough, we propose a novel visual-semantic fusion enhanced, Transformer-GRU-based action anticipation framework. Firstly, high-level semantic information is introduced to improve the performance of action anticipation for the first time: we propose to use semantic features generated from the class labels or directly from the visual observations to augment the original visual features. Secondly, an effective visual-semantic fusion module is proposed to bridge the semantic gap and fully exploit the complementarity of the two modalities. Thirdly, to take advantage of both parallel and autoregressive models, we design a Transformer-based encoder for long-term sequential modeling and a GRU-based decoder for flexible iterative decoding. Extensive experiments on two large-scale first-person view datasets, i.e., EPIC-Kitchens and EGTEA Gaze+, validate the effectiveness of our proposed method, which achieves new state-of-the-art performance, outperforming previous approaches by a large margin.
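To make the encoder/decoder split concrete, here is a hedged sketch of a Transformer encoder over fused visual-semantic features followed by a GRU decoder that iterates over future steps. The fusion module, dimensions (vis_dim, sem_dim, d_model), anticipation horizon and class count are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of a Transformer-encoder + GRU-decoder anticipation model with
# a simple visual-semantic fusion step (all sizes are placeholder assumptions).
import torch
import torch.nn as nn

class VSTransGRUSketch(nn.Module):
    def __init__(self, vis_dim=1024, sem_dim=300, d_model=256, num_actions=2513, horizon=4):
        super().__init__()
        self.horizon = horizon
        # Project visual and semantic features into a shared space and fuse them.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.sem_proj = nn.Linear(sem_dim, d_model)
        self.fusion = nn.Linear(2 * d_model, d_model)
        # Transformer encoder for long-term modeling of the observed sequence.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # GRU decoder for iterative (autoregressive) anticipation of future steps.
        self.decoder = nn.GRUCell(d_model, d_model)
        self.classifier = nn.Linear(d_model, num_actions)

    def forward(self, vis_feats, sem_feats):
        # vis_feats: (B, T, vis_dim) snippet features; sem_feats: (B, T, sem_dim)
        # semantic features, e.g. label embeddings derived from the observations.
        fused = self.fusion(torch.cat([self.vis_proj(vis_feats), self.sem_proj(sem_feats)], dim=-1))
        memory = self.encoder(fused)      # (B, T, d_model)
        h = memory.mean(dim=1)            # summary of the observed segment
        x = memory[:, -1]                 # start decoding from the last observed step
        logits = []
        for _ in range(self.horizon):     # predict future actions step by step
            h = self.decoder(x, h)
            logits.append(self.classifier(h))
            x = h
        return torch.stack(logits, dim=1) # (B, horizon, num_actions)

model = VSTransGRUSketch()
scores = model(torch.randn(2, 8, 1024), torch.randn(2, 8, 300))
print(scores.shape)  # torch.Size([2, 4, 2513])
```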
Anticipating Next Active Objects for Egocentric Videos
This paper addresses the problem of anticipating the future location of the next-active object in a given egocentric video clip, before any contact or action takes place. The problem is considerably hard, as we aim at estimating the position of such objects in a scenario where the observed clip and the action segment are separated by the so-called "time to contact" (TTC) segment. Many methods have been proposed to anticipate the action of a person based on previous hand movements and interactions with the surroundings. However, there have been no attempts to investigate the next possible interactable object and its future location with respect to the first-person's motion and the field-of-view drift during the TTC window. We define this as the task of Anticipating the Next ACTive Object (ANACTO). To this end, we propose a transformer-based self-attention framework to identify and locate the next-active object in an egocentric clip.
We benchmark our method on three datasets: EpicKitchens-100, EGTEA+ and Ego4D, and provide annotations for the first two. Our approach outperforms relevant baseline methods. We also conduct ablation studies to understand the effectiveness of the proposed and baseline methods under varying conditions. Code and ANACTO task annotations will be made available upon paper acceptance.
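A hedged sketch of how a transformer self-attention model might aggregate per-frame features of the observed clip into a prediction of the next-active-object box and class. The learnable query token, feature extractor and head sizes are assumptions for illustration and do not reproduce the ANACTO architecture.

```python
# Hedged sketch: self-attention over frame features plus a learnable query token
# that collects clip-level evidence about the next active object (illustrative only).
import torch
import torch.nn as nn

class NextActiveObjectSketch(nn.Module):
    def __init__(self, feat_dim=512, d_model=256, num_nouns=300):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        # A learnable query token is appended to the frame sequence; after
        # self-attention it summarizes the clip for the prediction heads.
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.box_head = nn.Linear(d_model, 4)      # (cx, cy, w, h) of the next-active object
        self.cls_head = nn.Linear(d_model, num_nouns)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim), e.g. per-frame CNN features of the observed clip.
        x = self.proj(frame_feats)
        q = self.query.expand(x.size(0), -1, -1)
        out = self.encoder(torch.cat([x, q], dim=1))
        token = out[:, -1]                          # the query token after self-attention
        return self.box_head(token).sigmoid(), self.cls_head(token)

model = NextActiveObjectSketch()
box, cls_logits = model(torch.randn(2, 16, 512))
print(box.shape, cls_logits.shape)  # torch.Size([2, 4]) torch.Size([2, 300])
```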
Streaming egocentric action anticipation: An evaluation scheme and approach
Egocentric action anticipation aims to predict the future actions the camera
wearer will perform from the observation of the past. While predictions about
the future should be available before the predicted events take place, most
approaches do not pay attention to the computational time required to make such
predictions. As a result, current evaluation schemes assume that predictions
are available right after the input video is observed, i.e., presuming a
negligible runtime, which may lead to overly optimistic evaluations. We propose
a streaming egocentric action anticipation evaluation scheme which assumes that predictions
are performed online and made available only after the model has processed the
current input segment, which depends on its runtime. To evaluate all models
considering the same prediction horizon, we hence propose that slower models
should base their predictions on temporal segments sampled ahead of time. Based
on the observation that model runtime can affect performance in the considered
streaming evaluation scenario, we further propose a lightweight action
anticipation model based on feed-forward 3D CNNs which is optimized using
knowledge distillation techniques with a novel past-to-future distillation
loss. Experiments on the three popular datasets EPIC-KITCHENS-55,
EPIC-KITCHENS-100 and EGTEA Gaze+ show that (i) the proposed evaluation scheme
induces a different ranking on state-of-the-art methods as compared to classic
evaluations, (ii) lightweight approaches tend to outperform more computationally expensive ones, and (iii) the proposed model based on feed-forward 3D CNNs and knowledge distillation outperforms the current state of the art in the streaming egocentric action anticipation scenario.
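The core of the streaming evaluation idea can be summarized with a small helper: a model's own runtime is subtracted from the available time budget, so slower models must base their predictions on segments sampled earlier. The function and variable names below are illustrative; the paper defines the protocol formally.

```python
# Hedged sketch of runtime-aware (streaming) evaluation: a model with runtime r
# whose prediction must be valid `tau` seconds before the action can only use
# video observed up to t_action - tau - r. Names here are illustrative only.
def observed_segment_end(t_action: float, anticipation_horizon: float, model_runtime: float) -> float:
    """Latest timestamp (seconds) the model may observe so that its prediction
    is ready `anticipation_horizon` seconds before the action starts."""
    return t_action - anticipation_horizon - model_runtime

# Example: action starts at t = 100 s, anticipation horizon tau = 1 s.
for name, runtime in [("lightweight 3D CNN", 0.05), ("heavy multi-branch model", 0.60)]:
    end = observed_segment_end(100.0, 1.0, runtime)
    print(f"{name}: may observe video up to t = {end:.2f} s")
# The heavier model must predict from an input segment ending 0.55 s earlier,
# which is why runtime can change the ranking compared to classic evaluation.
```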
MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain
Wearable cameras allow users to acquire images and videos from their own perspective. These data can be processed to understand human behavior. Although human behavior analysis has been thoroughly investigated in third-person vision, it is still understudied in egocentric settings, and in particular in industrial scenarios. To encourage research in this field, we present MECCANO, a multimodal dataset of egocentric videos to study human behavior understanding in industrial-like settings. The multimodality is characterized by the presence of gaze signals, depth maps and RGB videos acquired simultaneously with a custom headset. The dataset has been explicitly labeled for fundamental tasks in the context of human behavior understanding from a first-person view, such as recognizing and anticipating human-object interactions. With the MECCANO dataset, we explored five different tasks: 1) Action Recognition, 2) Active Objects Detection and Recognition, 3) Egocentric Human-Objects Interaction Detection, 4) Action Anticipation and 5) Next-Active Objects Detection. We propose a benchmark aimed at studying human behavior in the considered industrial-like scenario, which demonstrates that the investigated tasks and the considered scenario are challenging for state-of-the-art algorithms. To support research in this field, we publicly release the dataset at https://iplab.dmi.unict.it/MECCANO/