Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer
Cloud services are omnipresent, and critical cloud service failure is a fact
of life. In order to retain customers and prevent revenue loss, it is important
to provide high reliability guarantees for these services. One way to do this
is by predicting outages in advance, which can help in reducing the severity as
well as time to recovery. It is difficult to forecast critical failures due to
the rarity of these events. Moreover, critical failures are ill-defined in
terms of observable data. Our proposed method, Outage-Watch, defines critical
service outages as deteriorations in the Quality of Service (QoS) captured by a
set of metrics. Outage-Watch detects such outages in advance by using current
system state to predict whether the QoS metrics will cross a threshold and
initiate an extreme event. A mixture of Gaussians is used to model the
distribution of the QoS metrics for flexibility, and an extreme event
regularizer helps improve learning in the tail of the distribution. An outage
is predicted if the probability of any one of the QoS metrics crossing its
threshold changes significantly. Our evaluation on a real-world SaaS company
dataset shows that Outage-Watch significantly outperforms traditional methods
with an average AUC of 0.98. Additionally, Outage-Watch detects all the outages
exhibiting a change in service metrics and reduces the Mean Time To Detection
(MTTD) of outages by up to 88% when deployed in an enterprise cloud-service
system, demonstrating the efficacy of our proposed method.
Comment: Accepted to ESEC/FSE 202
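
The threshold-crossing idea in the abstract can be illustrated with a minimal sketch. It assumes the mixture parameters (weights, means, standard deviations) for one QoS metric have already been produced by some model from the current system state; how Outage-Watch obtains them, and how its extreme event regularizer is trained, is not shown here. The function names and the `min_jump` rule are hypothetical and chosen only for illustration.

```python
import math


def gaussian_tail(threshold, mean, std):
    """P(X > threshold) for X ~ N(mean, std^2), using the complementary error function."""
    return 0.5 * math.erfc((threshold - mean) / (std * math.sqrt(2.0)))


def mixture_exceedance(threshold, weights, means, stds):
    """P(X > threshold) when X follows a Gaussian mixture with the given components."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * gaussian_tail(threshold, m, s)
        for w, m, s in zip(weights, means, stds)
    )


def flags_outage(prev_prob, curr_prob, min_jump=0.2):
    """Hypothetical rule (an assumption, not the paper's): flag an outage when
    the exceedance probability rises by at least min_jump between two steps."""
    return (curr_prob - prev_prob) >= min_jump


# Illustrative numbers only: a latency-like metric with a 500 ms threshold.
before = mixture_exceedance(500.0, [0.9, 0.1], [200.0, 350.0], [40.0, 60.0])
after = mixture_exceedance(500.0, [0.5, 0.5], [220.0, 520.0], [40.0, 60.0])
print(flags_outage(before, after))
```

A mixture is convenient here because the exceedance probability is just the weighted sum of the per-component Gaussian tails, so a heavy second component near the threshold raises the probability even when the dominant component stays far below it.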