Disentangled Causal Graph Learning for Online Unsupervised Root Cause Analysis
The task of root cause analysis (RCA) is to identify the root causes of
system faults/failures by analyzing system monitoring data. Efficient RCA can
greatly accelerate system failure recovery and mitigate system damages or
financial losses. However, previous research has mostly focused on developing
offline RCA algorithms, which often require the RCA process to be initiated
manually, need a significant amount of time and data to train a robust model,
and must be retrained from scratch for each new system fault.
In this paper, we propose CORAL, a novel online RCA framework that can
automatically trigger the RCA process and incrementally update the RCA model.
CORAL consists of Trigger Point Detection, Incremental Disentangled Causal
Graph Learning, and Network Propagation-based Root Cause Localization. The
Trigger Point Detection component aims to detect system state transitions
automatically and in near-real-time. To achieve this, we develop an online
trigger point detection approach based on multivariate singular spectrum
analysis and cumulative sum statistics. To efficiently update the RCA model, we
propose an incremental disentangled causal graph learning approach to decouple
the state-invariant and state-dependent information. After that, CORAL applies
a random walk with restarts to the updated causal graph to accurately identify
root causes. The online RCA process terminates when the causal graph and the
generated root cause list converge. Extensive experiments on three real-world
datasets with case studies demonstrate the effectiveness and superiority of the
proposed framework.
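
The root cause localization step can be illustrated with a minimal sketch of a random walk with restart on a small causal graph. This is not CORAL's implementation. The toy graph, the edge weights, the restart probability of 0.3, the choice to walk from an effect back to its causes, and the handling of nodes without causes are all assumptions made for illustration, and the function name `random_walk_with_restart` is hypothetical.

```python
import numpy as np

def random_walk_with_restart(adj, start, restart_prob=0.3, tol=1e-10, max_iter=1000):
    """Return stationary visiting probabilities of a random walk with restart.

    adj[i, j] > 0 is the (assumed) weight of the causal edge i -> j.
    The walker starts at `start` and moves against edge direction,
    that is, from an effect towards its causes.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    restart = np.zeros(n)
    restart[start] = 1.0

    # Transposing the adjacency makes row i list the causes of node i.
    back = adj.T.copy()
    row_sums = back.sum(axis=1)
    trans = np.empty_like(back)
    for i in range(n):
        if row_sums[i] > 0:
            trans[i] = back[i] / row_sums[i]  # normalize to probabilities
        else:
            trans[i] = restart  # assumption: a node with no causes jumps back to start

    scores = restart.copy()
    for _ in range(max_iter):
        updated = (1 - restart_prob) * trans.T @ scores + restart_prob * restart
        converged = np.abs(updated - scores).sum() < tol
        scores = updated
        if converged:
            break
    return scores

# Toy graph (hypothetical): node 3 is the monitored KPI.
# Edges: 0 -> 3 (weight 0.9), 1 -> 3 (weight 0.1), 2 -> 1 (weight 1.0).
adj = np.array([
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.1],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
scores = random_walk_with_restart(adj, start=3)
# Rank candidate root causes by score, excluding the KPI node itself.
ranking = [int(i) for i in np.argsort(-scores) if i != 3]
```

In this toy setting the walker visits node 0 most often among the candidates because it has the strongest edge into the KPI. An online procedure like the one described above would rerun the ranking each time the causal graph is updated and stop once both the graph and the ranked list stabilize.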