Multimodal Motion Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection
Anomalies are rare, and anomaly detection is therefore often framed as
One-Class Classification (OCC), i.e., trained solely on normalcy. Leading OCC
techniques constrain the latent representations of normal motions to limited
volumes and detect as abnormal anything outside them, which accounts satisfactorily
for the open-set nature of anomalies. But normalcy shares the same open-set
property, since humans can perform the same action in several ways, a fact that the
leading techniques neglect. We propose a novel generative model for video
anomaly detection (VAD) that assumes both normality and abnormality to be
multimodal. We consider skeletal representations and leverage state-of-the-art
diffusion probabilistic models to generate multimodal future human poses. We
contribute a novel conditioning on the past motion of people and exploit the
improved mode-coverage capabilities of diffusion processes to generate
different-but-plausible future motions. Upon statistical aggregation of the
future modes, an anomaly is detected when the generated set of motions is not
pertinent to the actual future. We validate our model on 4 established
benchmarks, UBnormal, HR-UBnormal, HR-STC, and HR-Avenue, with extensive
experiments surpassing state-of-the-art results.
Comment: Accepted at ICCV 2023
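Below is a minimal sketch of the generate-then-aggregate scoring idea, assuming NumPy arrays of 2D joint coordinates; sample_futures is only a Gaussian placeholder standing in for the conditional diffusion sampler, and every name, shape, and aggregation choice is an illustrative assumption, not the paper's implementation.

import numpy as np

def sample_futures(past, k=20, horizon=6, rng=None):
    """Placeholder for the diffusion sampler conditioned on past motion.

    past: (T_past, J, 2) array of 2D joint coordinates.
    Returns k candidate futures of shape (k, horizon, J, 2).
    """
    rng = np.random.default_rng() if rng is None else rng
    last = past[-1]                            # condition crudely on the last observed pose
    noise = rng.normal(scale=0.05, size=(k, horizon) + last.shape)
    return last[None, None] + noise            # stand-in "generated" future modes

def anomaly_score(past, future, k=20, agg="min"):
    """Score how poorly the generated set of futures covers the actual future."""
    candidates = sample_futures(past, k=k, horizon=future.shape[0])
    # per-candidate error: mean joint displacement w.r.t. the observed future
    errors = np.linalg.norm(candidates - future[None], axis=-1).mean(axis=(1, 2))
    if agg == "min":                           # is *any* plausible mode close?
        return errors.min()
    return np.quantile(errors, 0.25)           # alternative: a robust low quantile

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    past = rng.normal(size=(12, 17, 2))        # 12 observed frames, 17 joints
    future = past[-1][None].repeat(6, axis=0)  # a "normal" continuation of the last pose
    print(anomaly_score(past, future))

The min aggregation expresses the intuition that a motion is normal as long as at least one generated mode stays close to the observed future; a low quantile of the errors is a more robust alternative.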
Contracting Skeletal Kinematic Embeddings for Anomaly Detection
Detecting anomalous human behavior is paramount to the timely recognition of
endangering situations, such as street fights or elderly falls. However,
anomaly detection is complex: anomalous events are rare, and it is
an open-set recognition task, i.e., what is anomalous at inference time has not been
observed at training time. We propose COSKAD, a novel model which encodes skeletal
human motion with an efficient graph convolutional network and learns to COntract
SKeletal kinematic embeddings onto a latent hypersphere of minimum volume for
Anomaly Detection. We propose and analyze three latent space designs for
COSKAD: the commonly adopted Euclidean one, and the new spherical-radial and
hyperbolic volumes. All three variants outperform the state of the art,
including video-based techniques, on the ShanghaiTech Campus and Avenue datasets, and on
the most recent UBnormal dataset, for which we contribute novel skeleton
annotations and a selection of human-related videos. The source code and
dataset will be released upon acceptance.
Comment: Submitted to Pattern Recognition Journal
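Below is a minimal sketch of the contraction objective in the Euclidean latent variant, in the spirit of Deep SVDD: embeddings of normal motion windows are pulled toward a fixed center, and the distance to that center serves as the anomaly score at inference. The MLP encoder, shapes, and hyperparameters are assumptions standing in for the paper's graph convolutional encoder and its spherical-radial and hyperbolic variants.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Small MLP stand-in for the skeletal graph convolutional encoder."""
    def __init__(self, in_dim, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, x):                      # x: (B, in_dim) flattened pose window
        return self.net(x)

def train_contraction(encoder, loader, epochs=10, lr=1e-3, device="cpu"):
    encoder.to(device)
    # Fix the hypersphere center as the mean embedding of the (normal-only) data.
    with torch.no_grad():
        c = torch.cat([encoder(x.to(device)) for (x,) in loader]).mean(0)
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(epochs):
        for (x,) in loader:
            z = encoder(x.to(device))
            loss = ((z - c) ** 2).sum(dim=1).mean()   # contract embeddings toward c
            opt.zero_grad()
            loss.backward()
            opt.step()
    return c

def anomaly_score(encoder, x, c):
    with torch.no_grad():
        return ((encoder(x) - c) ** 2).sum(dim=1)     # distance from the center

if __name__ == "__main__":
    torch.manual_seed(0)
    normal = torch.randn(256, 17 * 2 * 8) * 0.1       # 8-frame, 17-joint windows (assumed)
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(normal), batch_size=32, shuffle=True)
    enc = Encoder(in_dim=normal.shape[1])
    c = train_contraction(enc, loader, epochs=3)
    print(anomaly_score(enc, normal[:4], c))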
Pose Forecasting in Industrial Human-Robot Collaboration
Pushing back the frontiers of collaborative robots in industrial environments, we propose a new Separable-Sparse Graph Convolutional Network (SeS-GCN) for pose forecasting. For the first time, SeS-GCN bottlenecks the interaction of the spatial, temporal and channel-wise dimensions in GCNs, and it learns sparse adjacency matrices by a teacher-student framework. Compared to the state of the art, it uses only 1.72% of the parameters and is ∼4 times faster, while still performing comparably in forecasting accuracy on Human3.6M at 1 s in the future, which enables cobots to be aware of human operators. As a second contribution, we present a new benchmark of Cobots and Humans in Industrial COllaboration (CHICO). CHICO includes multi-view videos, 3D poses and trajectories of 20 human operators and cobots, engaging in 7 realistic industrial actions. Additionally, it reports 226 genuine collisions, which took place during human-cobot interaction. We test SeS-GCN on CHICO for two important perception tasks in robotics: human pose forecasting, where it reaches an average error of 85.3 mm (MPJPE) at 1 s in the future with a run time of 2.3 ms, and collision detection, where comparing the forecasted human motion with the known cobot motion yields an F1-score of 0.64.
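Below is a minimal sketch of the collision-detection step, assuming the forecasted human joints and the known cobot trajectory are available as 3D coordinates in millimetres; the threshold value, the shapes, and the reduction of the cobot to a single end-effector point are illustrative assumptions, not the benchmark's protocol.

import numpy as np

def detect_collision(human_forecast, cobot_traj, threshold_mm=100.0):
    """human_forecast: (T, J, 3) predicted 3D joint positions in mm.
    cobot_traj:      (T, 3) known cobot end-effector positions in mm.
    Returns (collision flag, frame of closest approach, minimum distance)."""
    dists = np.linalg.norm(human_forecast - cobot_traj[:, None, :], axis=-1)  # (T, J)
    per_frame_min = dists.min(axis=1)          # closest joint at each forecast frame
    t_min = int(per_frame_min.argmin())
    d_min = float(per_frame_min[t_min])
    return d_min < threshold_mm, t_min, d_min

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, J = 25, 15                              # 1 s horizon at 25 fps, 15 joints (assumed)
    human = rng.normal(loc=500.0, scale=50.0, size=(T, J, 3))
    cobot = np.linspace([900, 500, 500], [520, 500, 500], T)  # cobot arm approaching
    print(detect_collision(human, cobot))

In practice the forecasted motion would come from the pose forecaster and the cobot trajectory from its controller; thresholding their minimum distance is one simple way to turn the comparison into the binary decision that the F1-score evaluates.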