33 research outputs found
DC-Prophet: Predicting Catastrophic Machine Failures in DataCenters
When will a server fail catastrophically in an industrial datacenter? Is it
possible to forecast these failures so preventive actions can be taken to
increase the reliability of a datacenter? To answer these questions, we have
studied what are probably the largest publicly available datacenter traces,
containing more than 104 million events from 12,500 machines. Among these
samples, we observe and categorize three types of machine failures, all of
which are catastrophic and may lead to information loss, or even worse,
reliability degradation of a datacenter. We further propose a two-stage
framework-DC-Prophet-based on One-Class Support Vector Machine and Random
Forest. DC-Prophet extracts surprising patterns and accurately predicts the
next failure of a machine. Experimental results show that DC-Prophet achieves
an AUC of 0.93 in predicting the next machine failure, and an F3-score of 0.88
(out of 1). On average, DC-Prophet outperforms other classical machine learning
methods by 39.45% in F3-score.
Comment: 13 pages, 5 figures, accepted by 2017 ECML PKDD
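The two-stage design and the F3-score can be illustrated with a small sketch. This is a hypothetical reconstruction on synthetic data, not the authors' code: a One-Class SVM trained on healthy machine-state vectors flags anomalies, a Random Forest then classifies the flagged samples, and the F3-score (F-beta with beta = 3, weighting recall heavily) evaluates the result.

```python
# Hedged sketch of a two-stage failure predictor in the spirit of DC-Prophet.
# All data and feature choices here are synthetic illustrations.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score

rng = np.random.default_rng(0)
# Synthetic "machine state" features: a healthy population and a failing one.
X_healthy = rng.normal(0.0, 1.0, size=(900, 8))
X_failing = rng.normal(3.0, 1.0, size=(100, 8))
X = np.vstack([X_healthy, X_failing])
y = np.array([0] * 900 + [1] * 100)  # 1 = failure

# Stage 1: One-Class SVM trained on healthy samples flags outliers (-1).
ocsvm = OneClassSVM(nu=0.1, gamma="scale").fit(X_healthy)
flagged = ocsvm.predict(X) == -1

# Stage 2: Random Forest makes the final call on flagged samples only.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X[flagged], y[flagged])
y_pred = np.zeros_like(y)
y_pred[flagged] = rf.predict(X[flagged])

# F3-score weights recall three times as heavily as precision (beta = 3).
print(round(fbeta_score(y, y_pred, beta=3), 2))
```

The heavy recall weighting reflects the domain: missing a catastrophic failure costs far more than a false alarm.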
A Holistic Approach to Log Data Analysis in High-Performance Computing Systems: The Case of IBM Blue Gene/Q
The complexity and cost of managing high-performance computing
infrastructures are on the rise. Automating management and repair through
predictive models to minimize human interventions is an attempt to increase
system availability and contain these costs. Building predictive models that
are accurate enough to be useful in automatic management cannot be based on
restricted log data from subsystems but requires a holistic approach to data
analysis from disparate sources. Here we provide a detailed multi-scale
characterization study based on four datasets reporting power consumption,
temperature, workload, and hardware/software events for an IBM Blue Gene/Q
installation. We show that the system runs a rich parallel workload, with low
correlation among its components in terms of temperature and power, but higher
correlation in terms of events. As expected, power and temperature correlate
strongly, while events display negative correlations with load and power. Power
and workload show moderate correlations, and only at the scale of components.
The aim of the study is a systematic, integrated characterization of the
computing infrastructure and discovery of correlation sources and levels to
serve as a basis for future predictive modeling efforts.
Comment: 12 pages, 7 figures
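The kind of multi-source correlation analysis described above can be sketched with Pearson correlations over aligned time series. The series below are synthetic and only mirror the qualitative relationships the abstract reports (power tracks load, temperature tracks power, events anti-correlate with load); none of it is the paper's data.

```python
# Illustrative correlation analysis over synthetic power/temperature/event series.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
load = rng.uniform(0.0, 1.0, n)                    # utilization fraction
power = 200 + 100 * load + rng.normal(0, 5, n)     # power tracks load
temp = 30 + 0.05 * power + rng.normal(0, 0.5, n)   # temperature tracks power
events = rng.poisson(5 * (1.0 - load))             # more events when idle

# Rows/columns: power, temperature, events, load.
corr = np.corrcoef(np.vstack([power, temp, events, load]))
print(np.round(corr, 2))
```

Reading off the matrix reproduces the abstract's pattern: a strong positive power-temperature entry and negative event-load and event-power entries.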
Analysis and Clustering of Workload in Google Cluster Trace based on Resource Usage
Cloud computing has gained interest amongst commercial organizations,
research communities, developers, and other individuals during the past few
years. To move research in data management and processing forward, we need
publicly accessible benchmark datasets. In May 2011, Google released a trace
of a cluster of 11k machines, referred to as the Google Cluster Trace. This
trace contains cell information covering about 29 days. This paper provides
an analysis of resource usage and requirements in this trace and attempts to
give insight into this kind of production trace, similar to those in cloud
environments. The major contributions of this paper include a statistical
profile of jobs based on resource usage, clustering of workload patterns, and
classification of jobs into different types based on k-means clustering.
Though there have been earlier analyses of this trace, ours provides several
new findings, such as that jobs in a production trace are trimodal and that
there occurs symmetry in the tasks within a long job type
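The clustering step can be sketched as follows. This is a minimal illustration, not the paper's pipeline: jobs are represented by two assumed features (mean CPU and memory usage), standardized, and grouped with k-means; k = 3 echoes the reported trimodality.

```python
# Hedged sketch of workload clustering by resource usage (synthetic jobs).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Three synthetic job populations: light, medium, and heavy footprints.
jobs = np.vstack([
    rng.normal([0.05, 0.05], 0.01, size=(300, 2)),  # light jobs
    rng.normal([0.30, 0.25], 0.03, size=(150, 2)),  # medium jobs
    rng.normal([0.70, 0.60], 0.05, size=(50, 2)),   # heavy jobs
])
X = StandardScaler().fit_transform(jobs)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_)
print(sorted(sizes.tolist()))  # cluster sizes, smallest first
```

In practice k is chosen by inspecting inertia or silhouette scores rather than assumed up front.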
A Big Data Analyzer for Large Trace Logs
Current generation of Internet-based services are typically hosted on large
data centers that take the form of warehouse-size structures housing tens of
thousands of servers. Continued availability of a modern data center is the
result of a complex orchestration among many internal and external actors
including computing hardware, multiple layers of intricate software, networking
and storage devices, electrical power and cooling plants. During the course of
their operation, many of these components produce large amounts of data in the
form of event and error logs that are essential not only for identifying and
resolving problems but also for improving data center efficiency and
management. Most of these activities would benefit significantly from data
analytics techniques to exploit hidden statistical patterns and correlations
that may be present in the data. The sheer volume of data to be analyzed makes
uncovering these correlations and patterns a challenging task. This paper
presents BiDAl, a prototype Java tool for log-data analysis that incorporates
several Big Data technologies in order to simplify the task of extracting
information from data traces produced by large clusters and server farms. BiDAl
provides the user with several analysis languages (SQL, R and Hadoop MapReduce)
and storage backends (HDFS and SQLite) that can be freely mixed and matched so
that a custom tool for a specific task can be easily constructed. BiDAl has a
modular architecture so that it can be extended with other backends and
analysis languages in the future. In this paper we present the design of BiDAl
and describe our experience using it to analyze publicly-available traces from
Google data clusters, with the goal of building a realistic model of a complex
data center.
Comment: 26 pages, 10 figures
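The mix-and-match idea of pairing a storage backend with an analysis language can be illustrated with BiDAl's SQLite backend and SQL frontend. The table and column names below are invented for the example; this is not BiDAl code, just the flavor of query it enables.

```python
# Toy illustration: load a tiny trace into SQLite, then analyze it in SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_events (machine_id INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO task_events VALUES (?, ?)",
    [(1, "FAIL"), (1, "FINISH"), (2, "FAIL"), (2, "FAIL"), (3, "FINISH")],
)
# Which machines fail most often?
rows = conn.execute(
    "SELECT machine_id, COUNT(*) AS fails FROM task_events "
    "WHERE event = 'FAIL' GROUP BY machine_id ORDER BY fails DESC"
).fetchall()
print(rows)  # [(2, 2), (1, 1)]
```

For trace volumes beyond a single machine, the same query shape would be dispatched to the Hadoop MapReduce backend instead, which is the point of the modular design.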
Predicting Scheduling Failures in the Cloud
Cloud Computing has emerged as a key technology to deliver and manage
computing, platform, and software services over the Internet. Task scheduling
algorithms play an important role in the efficiency of cloud computing services
as they aim to reduce the turnaround time of tasks and improve resource
utilization. Several task scheduling algorithms have been proposed in the
literature for cloud computing systems, the majority relying on the
computational complexity of tasks and the distribution of resources. However,
several tasks scheduled following these algorithms still fail because of
unforeseen changes in the cloud environments. In this paper, using task
execution and resource utilization data extracted from the execution traces of
real world applications at Google, we explore the possibility of predicting the
scheduling outcome of a task using statistical models. If we can successfully
predict tasks failures, we may be able to reduce the execution time of jobs by
rescheduling failed tasks earlier (i.e., before their actual failing time). Our
results show that statistical models can predict task failures with a precision
up to 97.4%, and a recall up to 96.2%. We simulate the potential benefits of
such predictions using the GloudSim toolkit and find that they can increase
the number of finished tasks by up to 40%. We also perform a case study using
the Hadoop framework of Amazon Elastic MapReduce (EMR) and the jobs of a gene
expression correlations analysis study from breast cancer research. We find
that when extending the scheduler of Hadoop with our predictive models, the
percentage of failed jobs can be reduced by up to 45%, with an overhead of less
than 5 minutes.
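The reported precision and recall are the standard confusion-matrix metrics. As a quick sanity check, the sketch below computes both from a toy confusion matrix; the counts are illustrative and chosen only so the results land near the paper's figures, they are not from the study.

```python
# Precision and recall from a toy confusion matrix (illustrative counts).
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Example: 962 failing tasks correctly predicted, 25 false alarms,
# 38 missed failures.
p, r = precision_recall(tp=962, fp=25, fn=38)
print(round(p, 3), round(r, 3))  # 0.975 0.962
```

High precision matters here because every task flagged as doomed gets rescheduled early, so false alarms directly waste scheduler work.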
BiDAl: Big Data Analyzer for Cluster Traces
Modern data centers that provide Internet-scale services are stadium-size structures housing tens of thousands of heterogeneous devices (server clusters, networking equipment, power and cooling infrastructures) that must operate continuously and reliably. As part of their operation, these devices produce large amounts of data in the form of event and error logs that are essential not only for identifying problems but also for improving data center efficiency and management. These activities employ data analytics and often exploit hidden statistical patterns and correlations among different factors present in the data. Uncovering these patterns and correlations is challenging due to the sheer volume of data to be analyzed. This paper presents BiDAl, a prototype “log-data analysis framework” that incorporates various Big Data technologies to simplify the analysis of data traces from large clusters. BiDAl is written in Java with a modular and extensible architecture so that different storage backends (currently, HDFS and SQLite are supported), as well as different analysis languages (the current implementation supports SQL, R and Hadoop MapReduce), can be easily selected as appropriate. We present the design of BiDAl and describe our experience using it to analyze several public traces of Google data clusters for building a simulation model capable of reproducing observed behavior.
Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics
Continued reliance on human operators for managing data centers is a major
impediment for them from ever reaching extreme dimensions. Large computer
systems in general, and data centers in particular, will ultimately be managed
using predictive computational and executable models obtained through
data-science tools, and at that point, the intervention of humans will be
limited to setting high-level goals and policies rather than performing
low-level operations. Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated using live data,
opens one possible path towards limiting the role of operators in data centers.
In this paper, we present a data-science study of a public Google dataset
collected in a 12K-node cluster with the goal of building and evaluating
predictive models for node failures. Our results support the practicality of a
data-driven approach by showing the effectiveness of predictive models based on
data found in typical data center logs. We use BigQuery, the big data SQL
platform from the Google Cloud suite, to process massive amounts of data and
generate a rich feature set characterizing node state over time. We describe
how an ensemble classifier can be built out of many Random Forest classifiers
each trained on these features, to predict if nodes will fail in a future
24-hour window. Our evaluation reveals that if we limit false positive rates to
5%, we can achieve true positive rates between 27% and 88% with precision
varying between 50% and 72%. This level of performance allows us to recover a
large fraction of jobs' executions (by redirecting them to other nodes when a
failure of the present node is predicted) that would otherwise have been wasted
due to failures. [...]
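The evaluation above implies a threshold-selection step: scan the classifier's score threshold and keep the operating point with the best true positive rate subject to a 5% false-positive cap. The sketch below shows that step on synthetic data with a single Random Forest; it is not the study's pipeline (which ensembles many forests over BigQuery-derived features), just the ROC mechanics.

```python
# Hedged sketch: pick the RF score threshold with highest TPR at FPR <= 5%.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(4000, 10))
# Failures depend noisily on the first two features (imbalanced labels).
logits = 1.5 * X[:, 0] + 1.0 * X[:, 1] - 2.0
y = (rng.random(4000) < 1 / (1 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
scores = rf.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)
ok = fpr <= 0.05
best = np.argmax(tpr[ok])  # highest TPR among thresholds with FPR <= 5%
print(round(float(tpr[ok][best]), 2), "TPR at FPR", round(float(fpr[ok][best]), 3))
```

Capping the false positive rate first, then maximizing recall, matches the operational constraint: needlessly migrating jobs off healthy nodes is cheap but not free, so false alarms must stay bounded.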