SKELTER: unsupervised skeleton action denoising and recognition using transformers
Unsupervised Human Action Recognition (U-HAR) methods currently leverage large-scale datasets of human poses to solve this challenging problem. As most approaches are dedicated to reaching the best recognition accuracies, little attention has been paid to analyzing the resilience of such methods to perturbed data, a likely occurrence in real in-the-wild testing scenarios. Our first contribution is to systematically quantify the performance drop of the current U-HAR state of the art on perturbed or altered data (e.g., obtained by removing skeletal joints, rotating the entire pose, or injecting geometrical aberrations). We then propose a novel framework based on a transformer encoder-decoder with strong denoising capabilities to counter such perturbations effectively. Moreover, we present additional losses that make the learned representations robust to rotation variations and enforce temporal motion consistency. Our model, SKELTER, shows limited performance drops in the presence of skeleton noise compared with previous approaches, favoring its use in challenging in-the-wild settings.
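To make the denoising idea concrete, here is a minimal PyTorch sketch of a transformer encoder-decoder trained to reconstruct clean skeleton sequences from perturbed inputs. The dimensions, perturbation scheme, and module names are illustrative assumptions, not SKELTER's actual implementation.

import torch
import torch.nn as nn

class SkeletonDenoiser(nn.Module):
    def __init__(self, num_joints=25, d_model=128, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Linear(num_joints * 3, d_model)  # flatten (x, y, z) per frame
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.head = nn.Linear(d_model, num_joints * 3)   # reconstruct clean poses

    def forward(self, noisy_seq):
        # noisy_seq: (batch, frames, num_joints * 3), perturbed at the input
        z = self.encoder(self.embed(noisy_seq))
        return self.head(self.decoder(z, z))             # decode against the latent

def perturb(seq, joint_drop=0.1, noise_std=0.02):
    # illustrative perturbations: randomly mask whole joints, then add jitter
    b, t, d = seq.shape
    keep = (torch.rand(b, t, d // 3, 1) > joint_drop).float()
    masked = (seq.view(b, t, d // 3, 3) * keep).view(b, t, d)
    return masked + noise_std * torch.randn_like(masked)

model = SkeletonDenoiser()
clean = torch.randn(8, 60, 25 * 3)                       # 8 clips, 60 frames each
loss = nn.functional.mse_loss(model(perturb(clean)), clean)  # denoising objective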
Detection of Abnormal Fish Trajectories Using a Clustering Based Hierarchical Classifier
We address the analysis of fish trajectories in unconstrained underwater videos to help marine biologists detect new or rare fish behaviours and to detect environmental changes that can be observed from the abnormal behaviour of fish. The fish trajectories are separated into normal and abnormal classes, indicating the common behaviour of fish and behaviours that are rare or unusual, respectively. The proposed solution is based on a novel type of hierarchical classifier that builds a tree from clustered, labelled data grouped by similarity, while using different feature sets at different levels of the hierarchy. The paper presents a new method for fish trajectory analysis that outperforms state-of-the-art techniques; the results are significant considering the challenges of underwater environments, low video quality, erratic fish movement, and the highly imbalanced trajectory data we used. Moreover, the proposed method is powerful enough to classify highly imbalanced real-world datasets.
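As a rough illustration of such a classifier, the sketch below clusters trajectory descriptors at the root level and trains one classifier per cluster on a different feature subset at the next level. The two-level depth, the feature split, and the synthetic data are assumptions made for illustration, not the paper's exact construction.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                  # trajectory descriptors
y = (rng.random(300) < 0.1).astype(int)         # imbalanced: ~10% abnormal

level_feats = [slice(0, 5), slice(5, 10)]       # a different feature set per level

# Level 1: cluster the trajectories on the first feature set.
root = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[:, level_feats[0]])

# Level 2: one classifier per cluster, trained on the second feature set;
# class_weight="balanced" counters the heavy class imbalance.
leaf_clf = {}
for c in range(2):
    idx = root.labels_ == c
    leaf_clf[c] = DecisionTreeClassifier(class_weight="balanced", random_state=0)
    leaf_clf[c].fit(X[idx][:, level_feats[1]], y[idx])

def predict(x):
    # route a sample down the tree, then classify with the leaf model
    c = root.predict(x[:, level_feats[0]])[0]
    return leaf_clf[c].predict(x[:, level_feats[1]])[0]

print(predict(X[:1]))                           # 0 = normal, 1 = abnormal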
Unsupervised Video Anomaly Detection with Diffusion Models Conditioned on Compact Motion Representations
This paper aims to address the unsupervised video anomaly detection (VAD) problem, which involves classifying each frame in a video as normal or abnormal without any access to labels. To accomplish this, the proposed method employs conditional diffusion models, where the input data are the spatiotemporal features extracted from a pre-trained network and the condition is the features extracted from compact motion representations that summarize a given video segment in terms of its motion and appearance. Our method utilizes a data-driven threshold and considers a high reconstruction error as an indicator of anomalous events. This study is the first to utilize compact motion representations for VAD, and the experiments conducted on two large-scale VAD benchmarks demonstrate that they supply relevant information to the diffusion model and consequently improve VAD performance with respect to the prior art. Importantly, our method exhibits better generalization performance across different datasets, notably outperforming both the state-of-the-art and baseline methods. The code of our method is available at https://github.com/AnilOsmanTur/conditioned_video_anomaly_diffusion
Comment: Accepted to ICIAP 2023
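A minimal sketch of the scoring logic described above: features are reconstructed by a conditional denoiser, and frames whose reconstruction error exceeds a data-driven threshold are flagged as anomalous. The single-step MLP stand-in for the diffusion model and the 95th-percentile rule are illustrative assumptions.

import torch
import torch.nn as nn

feat_dim, cond_dim = 512, 128

denoiser = nn.Sequential(                       # stand-in for the trained
    nn.Linear(feat_dim + cond_dim, 512),        # conditional diffusion model
    nn.ReLU(),
    nn.Linear(512, feat_dim),
)

def reconstruct(feats, cond, noise_std=0.5):
    noisy = feats + noise_std * torch.randn_like(feats)  # forward diffusion step
    return denoiser(torch.cat([noisy, cond], dim=-1))    # conditioned reverse step

def anomaly_scores(feats, cond):
    recon = reconstruct(feats, cond)
    return (recon - feats).pow(2).mean(dim=-1)           # per-frame error

train_feats, train_cond = torch.randn(1000, feat_dim), torch.randn(1000, cond_dim)
test_feats, test_cond = torch.randn(200, feat_dim), torch.randn(200, cond_dim)

with torch.no_grad():
    threshold = anomaly_scores(train_feats, train_cond).quantile(0.95)
    is_abnormal = anomaly_scores(test_feats, test_cond) > threshold
print(is_abnormal.float().mean())               # fraction of frames flagged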
Unsupervised Human Action Recognition with Skeletal Graph Laplacian and Self-Supervised Viewpoints Invariance
This paper presents a novel end-to-end method for the problem of skeleton-based unsupervised human action recognition. We propose a new architecture with a convolutional autoencoder that uses graph Laplacian regularization to model the skeletal geometry across the temporal dynamics of actions. Our approach is robust to viewpoint variations thanks to a self-supervised gradient reversal layer that ensures generalization across camera views. The proposed method is validated on the large-scale NTU-60 and NTU-120 datasets, on which it outperforms all prior unsupervised skeleton-based approaches on the cross-subject, cross-view, and cross-setup protocols. Although unsupervised, our learnable representation even surpasses a few supervised skeleton-based action recognition methods. The code is available at: www.github.com/IIT-PAVIS/UHAR_Skeletal_Laplacian
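Below is a minimal PyTorch sketch of the two ingredients named in the abstract: a graph Laplacian smoothness penalty over skeletal joints and a gradient reversal layer for view-adversarial training. The toy skeleton graph, loss form, and all names are assumptions for illustration, not the paper's implementation.

import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.clone()
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None    # flip gradients for the view classifier

def laplacian(adj):
    deg = torch.diag(adj.sum(dim=1))
    return deg - adj                    # combinatorial graph Laplacian L = D - A

def laplacian_reg(poses, L):
    # poses: (batch, joints, 3); penalize coordinates that vary sharply
    # across connected joints, i.e. sum over the batch of tr(X^T L X)
    return torch.einsum('bjc,jk,bkc->', poses, L, poses) / poses.shape[0]

# Toy 4-joint chain skeleton (the adjacency is an assumption, not NTU's graph).
adj = torch.tensor([[0., 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
L = laplacian(adj)
poses = torch.randn(8, 4, 3, requires_grad=True)
loss = laplacian_reg(poses, L)
loss.backward()

features = torch.randn(8, 64, requires_grad=True)
reversed_feats = GradReverse.apply(features, 0.5)  # would feed a view classifier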
Enhancing Next Active Object-based Egocentric Action Anticipation with Guided Attention
Short-term action anticipation (STA) in first-person videos is a challenging task that involves understanding the next active object interactions and predicting future actions. Existing action anticipation methods have primarily focused on utilizing features extracted from video clips but often overlooked the importance of objects and their interactions. To this end, we propose a novel approach that applies a guided attention mechanism between the objects and the spatiotemporal features extracted from video clips, enhancing the motion and contextual information, and further decodes the object-centric and motion-centric information to address the problem of STA in egocentric videos. Our method, GANO (Guided Attention for Next active Objects), is a multi-modal, end-to-end, single transformer-based network. The experimental results on the largest egocentric dataset demonstrate that GANO outperforms the existing state-of-the-art methods for the prediction of the next active object label, its bounding box location, the corresponding future action, and the time to contact the object. The ablation study shows the positive contribution of the guided attention mechanism compared to other fusion methods. Moreover, it is possible to improve the next active object location and class label prediction results of GANO by simply appending the learnable object tokens with the region-of-interest embeddings.
Comment: Accepted to IEEE ICIP 2023; see the project page here:
https://sanketsans.github.io/guided-attention-egocentric.html
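A minimal sketch of guided attention as cross-attention: object embeddings query the clip's spatiotemporal features so that each object token gathers the motion context most relevant to it. The dimensions and the single fusion layer are assumptions, not GANO's exact architecture.

import torch
import torch.nn as nn

d_model = 256
num_objects, num_tokens = 10, 196            # detections vs. flattened clip features

guided_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

obj_emb = torch.randn(2, num_objects, d_model)     # from an object detector
clip_feats = torch.randn(2, num_tokens, d_model)   # from a video backbone

# Objects as queries, clip features as keys/values: each object token pools
# the spatiotemporal context most relevant to it.
fused, attn_weights = guided_attn(query=obj_emb, key=clip_feats, value=clip_feats)
print(fused.shape, attn_weights.shape)             # (2, 10, 256), (2, 10, 196)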
Guided Attention for Next Active Object @ EGO4D STA Challenge
In this technical report, we describe our Guided-Attention-based solution for the EGO4D Short-Term Anticipation (STA) challenge. It combines object detections and the spatiotemporal features extracted from video clips, enhancing the motion and contextual information, and further decodes the object-centric and motion-centric information to address the problem of STA in egocentric videos. For the challenge, we build our model on top of StillFast, with Guided Attention applied to the fast network. Our model obtains better performance on the validation set and also achieves state-of-the-art (SOTA) results on the test set of the EGO4D Short-Term Object Interaction Anticipation Challenge.
Comment: Winner of the CVPR 2023 Ego4D STA challenge. arXiv admin note: substantial text overlap with arXiv:2305.1295
Anticipating Next Active Objects for Egocentric Videos
This paper addresses the problem of anticipating the next-active-object location in the future for a given egocentric video clip where the contact might happen, before any action takes place. The problem is considerably hard, as we aim to estimate the position of such objects in a scenario where the observed clip and the action segment are separated by the so-called "time to contact" (TTC) segment. Many methods have been proposed to anticipate the action of a person based on previous hand movements and interactions with the surroundings. However, there have been no attempts to investigate the next possible interactable object and its future location with respect to the first person's motion and the field-of-view drift during the TTC window. We define this as the task of Anticipating the Next ACTive Object (ANACTO). To this end, we propose a transformer-based self-attention framework to identify and locate the next active object in an egocentric clip. We benchmark our method on three datasets: EpicKitchens-100, EGTEA+, and Ego4D. We also provide annotations for the first two datasets. Our approach performs best compared to relevant baseline methods. We also conduct ablation studies to understand the effectiveness of the proposed and baseline methods under varying conditions. Code and ANACTO task annotations will be made available upon paper acceptance.
Comment: 13 pages, 13 figures
Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos
Objects are crucial for understanding human-object interactions. By identifying the relevant objects, one can also predict potential future interactions or actions that may occur with these objects. In this paper, we study the problem of Short-Term Object Interaction Anticipation (STA) and propose NAOGAT (Next-Active-Object Guided Anticipation Transformer), a multi-modal, end-to-end transformer network that attends to objects in observed frames in order to anticipate the next active object (NAO) and, eventually, to guide the model to predict context-aware future actions. The task is challenging since it requires anticipating the future action along with the object with which the action occurs and the time after which the interaction will begin, a.k.a. the time to contact (TTC). Compared to existing video modeling architectures for action anticipation, NAOGAT captures the relationship between objects and the global scene context in order to predict detections for the next active object and anticipate relevant future actions given these detections, leveraging the objects' dynamics to improve accuracy. One of the key strengths of our approach, in fact, is its ability to exploit the motion dynamics of objects within a given clip, which is often ignored by other models, while separately decoding the object-centric and motion-centric information. Through our experiments, we show that our model outperforms existing methods on two separate datasets, Ego4D and EpicKitchens-100 ("Unseen Set"), as measured by several additional metrics, such as time to contact and next-active-object localization. The code will be available upon acceptance.
Comment: Accepted in WACV'24