Video anomaly understanding (VAU) aims to automatically comprehend unusual
occurrences in videos, thereby enabling various applications such as traffic
surveillance and industrial manufacturing. While existing VAU benchmarks
primarily concentrate on anomaly detection and localization, we pursue a more
practical level of understanding, raising three crucial questions: "What
anomaly occurred?", "Why did it happen?", and "How severe is this abnormal
event?". In pursuit of these answers, we present a comprehensive benchmark for
Causation Understanding of Video Anomaly (CUVA). Specifically, each instance
in the proposed benchmark carries three sets of human annotations indicating
the "what", "why", and "how" of an anomaly: 1) the anomaly type, start and end
times, and an event description; 2) a natural-language explanation of the
anomaly's cause; and 3) free text reflecting the effect of the abnormality. In
addition, we introduce MMEval, a novel evaluation metric designed to align
more closely with human preferences on CUVA; it measures how well existing
LLMs comprehend the underlying cause and corresponding effect of video
anomalies. Finally, we propose a novel prompt-based method to serve as a
baseline approach for this challenging benchmark. We conduct extensive
experiments to show the superiority of our evaluation metric and the
prompt-based approach. Our code and dataset are available at
https://github.com/fesvhtr/CUVA.

Accepted at CVPR 2024.
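For concreteness, the "what"/"why"/"how" annotation triple described above could be modeled as in the following minimal, hypothetical sketch; the class and field names (CUVAAnnotation, anomaly_type, cause, effect, and so on) are our own illustrative choices, not the dataset's released schema.

    from dataclasses import dataclass

    @dataclass
    class CUVAAnnotation:
        """Hypothetical per-instance annotation mirroring the abstract's
        what/why/how triple; names are illustrative, not the dataset's
        actual schema."""
        # "what": anomaly type, temporal extent, and event description
        anomaly_type: str   # e.g. "traffic accident"
        start_time: float   # anomaly onset, in seconds
        end_time: float     # anomaly offset, in seconds
        description: str    # free-text description of the abnormal event
        # "why": natural-language explanation of the anomaly's cause
        cause: str
        # "how": free text reflecting the effect of the abnormality
        effect: str

    # Example instance (contents invented for illustration only).
    example = CUVAAnnotation(
        anomaly_type="traffic accident",
        start_time=12.4,
        end_time=21.0,
        description="A car runs a red light and collides with a truck.",
        cause="The driver ignored the traffic signal at the intersection.",
        effect="Both vehicles are damaged and the intersection is blocked.",
    )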