Reinforced Approximate Exploratory Data Analysis
Exploratory data analytics (EDA) is a sequential decision making process
where analysts choose subsequent queries that might lead to some interesting
insights based on the previous queries and corresponding results. Data
processing systems often execute the queries on samples to produce results with
low latency. Different downsampling strategies preserve different statistics of
the data and yield different magnitudes of latency reduction. The optimal choice
of sampling strategy often depends on the particular context of the analysis
flow and the hidden intent of the analyst. In this paper, we are the first to
consider the impact of sampling in interactive data exploration settings, given
that sampling introduces approximation errors. We propose a Deep Reinforcement Learning
(DRL) based framework which can optimize the sample selection in order to keep
the analysis and insight generation flow intact. Evaluations with 3 real
datasets show that our technique can preserve the original insight generation
flow while improving the interaction latency, compared to baseline methods.
Comment: Appears in the 37th AAAI Conference on Artificial Intelligence
(AAAI), 202
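The abstract's point that different downsampling strategies preserve different statistics can be illustrated with a minimal sketch. It is not the paper's DRL framework; the helper names (`uniform_sample`, `stratified_sample`, `groups_seen`) and the toy data are assumptions made for illustration. Uniform sampling keeps the sample size fixed but can miss a rare group entirely, whereas a stratified sample keeps at least one row per group.

```python
import random
from collections import defaultdict

def uniform_sample(rows, k, rng):
    """Draw k rows uniformly at random, without replacement."""
    return rng.sample(rows, k)

def stratified_sample(rows, k, key, rng):
    """Draw roughly k rows, keeping at least one row from every group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        # Proportional quota, but never less than one row per group.
        quota = max(1, round(k * len(members) / len(rows)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

def groups_seen(sample, key):
    """Return the set of group labels present in a sample."""
    return {key(row) for row in sample}

# Hypothetical toy table: 990 rows in a common group, 10 in a rare one.
rows = [("common", i) for i in range(990)] + [("rare", i) for i in range(10)]
rng = random.Random(0)
by_group = lambda row: row[0]

uniform = uniform_sample(rows, 20, rng)
stratified = stratified_sample(rows, 20, by_group, rng)
```

A query that groups by the first column is guaranteed to show the rare group on the stratified sample; on the uniform sample that depends on the draw.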
Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer
Cloud services are omnipresent, and critical cloud service failures are a fact
of life. In order to retain customers and prevent revenue loss, it is important
to provide high reliability guarantees for these services. One way to do this
is by predicting outages in advance, which can help in reducing the severity as
well as time to recovery. It is difficult to forecast critical failures due to
the rarity of these events. Moreover, critical failures are ill-defined in
terms of observable data. Our proposed method, Outage-Watch, defines critical
service outages as deteriorations in the Quality of Service (QoS) captured by a
set of metrics. Outage-Watch detects such outages in advance by using current
system state to predict whether the QoS metrics will cross a threshold and
initiate an extreme event. A mixture of Gaussians is used to model the
distribution of the QoS metrics for flexibility, and an extreme event
regularizer helps improve learning in the tail of the distribution. An outage
is predicted if the probability of any one of the QoS metrics crossing
its threshold changes significantly. Our evaluation on a real-world SaaS company
dataset shows that Outage-Watch significantly outperforms traditional methods
with an average AUC of 0.98. Additionally, Outage-Watch detects all the outages
exhibiting a change in service metrics and reduces the Mean Time To Detection
(MTTD) of outages by up to 88% when deployed in an enterprise cloud-service
system, demonstrating the efficacy of our proposed method.
Comment: Accepted to ESEC/FSE 202
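The threshold-crossing probability under a mixture of Gaussians has a closed form, P(X > t) = sum_i w_i * (1 - Phi((t - mu_i) / sigma_i)). The sketch below computes it, assuming the mixture parameters are already available (in Outage-Watch they would come from a model conditioned on system state, which is not shown). The function names and the `min_jump` cutoff are hypothetical; the abstract does not specify how a "significant" change is measured.

```python
import math

def exceedance_probability(weights, means, stds, threshold):
    """P(X > threshold) for a one-dimensional Gaussian mixture."""
    prob = 0.0
    for w, mu, sigma in zip(weights, means, stds):
        z = (threshold - mu) / (sigma * math.sqrt(2.0))
        # Gaussian survival function expressed through erfc.
        prob += w * 0.5 * math.erfc(z)
    return prob

def changed_significantly(previous_prob, current_prob, min_jump=0.2):
    """Hypothetical rule: flag when the crossing probability rises by min_jump."""
    return (current_prob - previous_prob) >= min_jump

# Illustrative values only: a QoS metric whose predicted distribution
# shifts part of its mass toward the threshold between two time steps.
before = exceedance_probability([0.9, 0.1], [100.0, 150.0], [10.0, 20.0], 200.0)
after = exceedance_probability([0.5, 0.5], [100.0, 210.0], [10.0, 20.0], 200.0)
outage_predicted = changed_significantly(before, after)
```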
ESRO: Experience Assisted Service Reliability against Outages
Modern cloud services are prone to failures due to their complex
architecture, making diagnosis a critical process. Site Reliability Engineers
(SREs) spend hours leveraging multiple sources of data, including the alerts,
error logs, and domain expertise through past experiences to locate the root
cause(s). These experiences are documented as natural language text in outage
reports for previous outages. However, utilizing the raw yet rich
semi-structured information in the reports systematically is time-consuming.
Structured information, on the other hand, such as alerts that are often used
during fault diagnosis, is voluminous and requires expert knowledge to discern.
Several strategies have been proposed to use each source of data separately for
root cause analysis. In this work, we build a diagnostic service called ESRO
that recommends root causes and remediation for failures by utilizing
structured as well as semi-structured sources of data systematically. ESRO
constructs a causal graph using alerts and a knowledge graph using outage
reports, and merges them in a novel way to form a unified graph during
training. A retrieval-based mechanism is then used to search the unified graph
and rank the likely root causes and remediation techniques based on the alerts
fired during an outage at inference time. Not only the individual alerts but
also their respective importance in predicting an outage group are taken into
account during recommendation. We evaluated our model on several cloud service
outages of a large SaaS enterprise over the course of ~2 years, and obtained an
average improvement of 27% in ROUGE scores over state-of-the-art baselines when
comparing the likely root causes against the ground truth. We further establish
the effectiveness of ESRO through qualitative analysis on multiple real outage
examples.
Comment: Accepted to 38th IEEE/ACM International Conference on Automated
Software Engineering (ASE 2023)
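The idea of ranking root causes by the alerts fired, weighted by each alert's importance, can be sketched in a much simplified form. This is not ESRO's unified-graph retrieval; it replaces the graph with a flat mapping from past root causes to their associated alerts and scores candidates by importance-weighted overlap. The function name, the alert names, and the importance values are all assumptions made for illustration.

```python
def rank_root_causes(fired_alerts, past_outages, importance):
    """Rank past root causes by importance-weighted overlap with fired alerts.

    past_outages maps a root-cause description to the set of alerts seen with it.
    importance maps an alert name to a weight (default 1.0 if missing).
    """
    fired = set(fired_alerts)
    total = sum(importance.get(a, 1.0) for a in fired)
    ranked = []
    for cause, alerts in past_outages.items():
        matched = fired & set(alerts)
        matched_weight = sum(importance.get(a, 1.0) for a in matched)
        score = matched_weight / total if total else 0.0
        ranked.append((cause, score))
    # Highest score first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked

# Hypothetical history and weights.
past_outages = {
    "db connection pool exhausted": {"db_latency", "conn_errors"},
    "bad deploy": {"http_5xx", "conn_errors"},
}
importance = {"db_latency": 3.0, "conn_errors": 1.0, "http_5xx": 2.0}
ranking = rank_root_causes(["db_latency", "conn_errors"], past_outages, importance)
```

With these toy inputs, the shared `conn_errors` alert contributes little to either candidate, so the ranking is decided by the heavily weighted `db_latency` alert.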