112 research outputs found
Towards Data-Driven Autonomics in Data Centers
Continued reliance on human operators for managing data centers is a major
impediment to their ever reaching extreme scale. Large computer
systems in general, and data centers in particular, will ultimately be managed
using predictive computational and executable models obtained through
data-science tools, and at that point, the intervention of humans will be
limited to setting high-level goals and policies rather than performing
low-level operations. Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated using generated
data, opens one possible path towards limiting the role of operators in data
centers. In this paper, we present a data-science study of a public Google
dataset collected in a 12K-node cluster with the goal of building and
evaluating a predictive model for node failures. We use BigQuery, the big data
SQL platform from the Google Cloud suite, to process massive amounts of data
and generate a rich feature set characterizing machine state over time. We
describe how an ensemble classifier can be built out of many Random Forest
classifiers each trained on these features, to predict if machines will fail in
a future 24-hour window. Our evaluation reveals that if we limit false positive
rates to 5%, we can achieve true positive rates between 27% and 88% with
precision varying between 50% and 72%. We discuss the practicality of including
our predictive model as the central component of a data-driven autonomic
manager and operating it on-line with live data streams (rather than off-line
on data logs). All of the scripts used for BigQuery and classification analyses
are publicly available from the authors' website. Comment: 12 pages, 6 figures
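The ensemble approach the abstract describes can be sketched in a few lines: several Random Forest classifiers trained on bootstrap samples, their votes averaged, and the decision threshold chosen to cap the false positive rate near 5%. This is a minimal illustration under stated assumptions, not the paper's pipeline; the synthetic data below merely stands in for the BigQuery-generated feature set.

```python
# A minimal sketch (not the paper's pipeline) of the ensemble described
# above: several Random Forest classifiers trained on bootstrap samples,
# their probability estimates averaged, and the decision threshold set so
# that roughly 5% of non-failing nodes are flagged. The synthetic data
# stands in for the BigQuery-generated feature set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))                            # per-node feature vectors
y = (X[:, 0] + rng.normal(size=2000) > 1.5).astype(int)    # 1 = fails within 24h

# Train several forests, each on its own bootstrap sample.
forests = []
for seed in range(5):
    idx = rng.integers(0, len(X), len(X))
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X[idx], y[idx])
    forests.append(clf)

# Ensemble score: average the per-forest failure probabilities.
scores = np.mean([f.predict_proba(X)[:, 1] for f in forests], axis=0)

# Threshold strictly above the 95th percentile of scores on non-failing
# nodes, which caps the false positive rate near 5% (here, in-sample).
threshold = np.quantile(scores[y == 0], 0.95)
predicted_fail = scores > threshold
print("nodes flagged as likely to fail:", predicted_fail.sum())
```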
Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics
Continued reliance on human operators for managing data centers is a major
impediment to their ever reaching extreme scale. Large computer
systems in general, and data centers in particular, will ultimately be managed
using predictive computational and executable models obtained through
data-science tools, and at that point, the intervention of humans will be
limited to setting high-level goals and policies rather than performing
low-level operations. Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated using live data,
opens one possible path towards limiting the role of operators in data centers.
In this paper, we present a data-science study of a public Google dataset
collected in a 12K-node cluster with the goal of building and evaluating
predictive models for node failures. Our results support the practicality of a
data-driven approach by showing the effectiveness of predictive models based on
data found in typical data center logs. We use BigQuery, the big data SQL
platform from the Google Cloud suite, to process massive amounts of data and
generate a rich feature set characterizing node state over time. We describe
how an ensemble classifier can be built out of many Random Forest classifiers
each trained on these features, to predict if nodes will fail in a future
24-hour window. Our evaluation reveals that if we limit false positive rates to
5%, we can achieve true positive rates between 27% and 88% with precision
varying between 50% and 72%. This level of performance allows us to recover a
large fraction of jobs' executions (by redirecting them to other nodes when a
failure of the present node is predicted) that would otherwise have been wasted
due to failures. [...]
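The operating point described here, capping the false positive rate at 5% and reading off the achievable true positive rate, can be selected from a ROC curve. A minimal sketch, with synthetic scores standing in for real classifier outputs:

```python
# A minimal sketch of choosing the operating point the abstract reports:
# cap the false positive rate at 5% and take the best achievable true
# positive rate. The scores below are synthetic stand-ins for real
# classifier outputs.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 5000)                                  # 1 = node failed
scores = np.clip(y * 0.3 + rng.normal(0.4, 0.2, 5000), 0, 1)  # classifier scores

fpr, tpr, thresholds = roc_curve(y, scores)
ok = fpr <= 0.05                      # operating points within the FPR budget
best = np.argmax(tpr[ok])             # highest TPR among those points
print(f"threshold={thresholds[ok][best]:.3f}, TPR={tpr[ok][best]:.2%}")
```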
An Architectural Approach to Autonomics and Self-management of Automotive Embedded Electronic Systems
Embedded electronic systems in vehicles are of rapidly increasing commercial importance for the automotive industry. While current vehicular embedded systems are extremely limited and static, a more dynamic, configurable system would greatly simplify integration work and increase the quality of vehicular systems. This brings in features like separation of concerns, customised software configuration for individual vehicles, seamless connectivity, and plug-and-play capability. Furthermore, such a system can also contribute to increased dependability and resource optimization thanks to its inherent ability to adjust itself dynamically to changes in software, hardware resources, and environmental conditions. This paper describes the architectural approach taken by the EU research project DySCAS to achieving the goals of dynamically self-configuring automotive embedded electronic systems. The architecture solution outlined in this paper captures the application and operational contexts, expected features, middleware services, functions and behaviours, as well as the basic mechanisms and technologies. The paper also covers the architecture conceptualization by presenting the rationale concerning the architecture structuring, control principles, and deployment concept. We also present the adopted architecture V&V strategy and discuss some open issues regarding industrial acceptance.
A Big Data Analyzer for Large Trace Logs
Current generation of Internet-based services are typically hosted on large
data centers that take the form of warehouse-size structures housing tens of
thousands of servers. Continued availability of a modern data center is the
result of a complex orchestration among many internal and external actors
including computing hardware, multiple layers of intricate software, networking
and storage devices, electrical power and cooling plants. During the course of
their operation, many of these components produce large amounts of data in the
form of event and error logs that are essential not only for identifying and
resolving problems but also for improving data center efficiency and
management. Most of these activities would benefit significantly from data
analytics techniques to exploit hidden statistical patterns and correlations
that may be present in the data. The sheer volume of data to be analyzed makes
uncovering these correlations and patterns a challenging task. This paper
presents BiDAl, a prototype Java tool for log-data analysis that incorporates
several Big Data technologies in order to simplify the task of extracting
information from data traces produced by large clusters and server farms. BiDAl
provides the user with several analysis languages (SQL, R and Hadoop MapReduce)
and storage backends (HDFS and SQLite) that can be freely mixed and matched so
that a custom tool for a specific task can be easily constructed. BiDAl has a
modular architecture so that it can be extended with other backends and
analysis languages in the future. In this paper we present the design of BiDAl
and describe our experience using it to analyze publicly-available traces from
Google data clusters, with the goal of building a realistic model of a complex
data center. Comment: 26 pages, 10 figures
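The storage-plus-analysis mix that BiDAl offers can be shown in miniature: put trace records into a storage backend (SQLite here) and summarize them with an analysis language (SQL here). This is not BiDAl's own code; the table schema is invented for illustration.

```python
# Not BiDAl itself, just the idea in miniature: put trace records into a
# storage backend (SQLite here) and query them with an analysis language
# (SQL here). The table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events(node TEXT, kind TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("n1", "ERROR", 10), ("n1", "WARN", 12),
     ("n2", "ERROR", 11), ("n1", "ERROR", 20)],
)

# Error counts per node: the kind of summary such a tool would feed
# into later modeling steps.
rows = conn.execute(
    "SELECT node, COUNT(*) AS errors FROM events "
    "WHERE kind = 'ERROR' GROUP BY node ORDER BY errors DESC"
).fetchall()
print(rows)   # [('n1', 2), ('n2', 1)]
```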
A Holistic Approach to Log Data Analysis in High-Performance Computing Systems: The Case of IBM Blue Gene/Q
The complexity and cost of managing high-performance computing
infrastructures are on the rise. Automating management and repair through
predictive models to minimize human interventions is an attempt to increase
system availability and contain these costs. Building predictive models that
are accurate enough to be useful in automatic management cannot be based on
restricted log data from subsystems but requires a holistic approach to data
analysis from disparate sources. Here we provide a detailed multi-scale
characterization study based on four datasets reporting power consumption,
temperature, workload, and hardware/software events for an IBM Blue Gene/Q
installation. We show that the system runs a rich parallel workload, with low
correlation among its components in terms of temperature and power, but higher
correlation in terms of events. As expected, power and temperature correlate
strongly, while events display negative correlations with load and power. Power
and workload show moderate correlations, and only at the scale of components.
The aim of the study is a systematic, integrated characterization of the
computing infrastructure and discovery of correlation sources and levels to
serve as a basis for future predictive modeling efforts. Comment: 12 pages, 7 figures
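The kind of correlation analysis described above can be sketched with pandas. The data below is synthetic, constructed only to mirror the relationships the study reports, and the column names are assumptions:

```python
# An illustrative sketch of the correlation study: pairwise Pearson
# correlations between load, power, temperature, and event counts.
# The data is synthetic, constructed only to mirror the relationships
# the study reports; the column names are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
load = rng.uniform(0, 1, 500)
df = pd.DataFrame({
    "load":   load,
    "power":  200 + 150 * load + rng.normal(0, 10, 500),    # tracks load
    "temp":   40 + 7.5 * load + rng.normal(0, 1, 500),      # tracks power/load
    "events": rng.poisson(np.maximum(0.1, 2 - 2 * load)),   # rarer under load
})

corr = df.corr(method="pearson")
print(corr.round(2))
```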
The Green Computing Observatory: a data curation approach for green IT
The Green Computing Observatory (GCO) is a collaborative effort to provide the scientific community with a comprehensive set of traces of the energy consumption of a production cluster. These traces include the detailed monitoring of the hardware and software, as well as global site information such as the overall consumption and overall cooling. The acquired data is transformed into an XML format built from a specifically designed ontology and published through the Grid Observatory website.
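The transformation of a monitoring sample into XML can be sketched minimally. The element names below are illustrative and do not reflect the GCO's actual ontology:

```python
# A minimal sketch of publishing a monitoring sample as XML. The element
# names are illustrative only and do not reflect the GCO's actual ontology.
import xml.etree.ElementTree as ET

sample = {"node": "wn042", "power_w": "187.5",
          "timestamp": "2011-06-01T12:00:00Z"}

root = ET.Element("measurement")
for key, value in sample.items():
    ET.SubElement(root, key).text = value

xml_bytes = ET.tostring(root, encoding="utf-8")
print(xml_bytes.decode())
```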
Investigation of a teleo-reactive approach for the development of autonomic manager systems
As the demand for more capable and more feature-rich software increases, the complexity of design, implementation, and maintenance also increases exponentially. This becomes a problem when the complexity prevents developers from writing, improving, fixing, or otherwise maintaining software to meet specified demands whilst still reaching an acceptable level of robustness. When complexity becomes too great, the software becomes impossible to manage effectively even for large teams of people. One way to address the problem is an autonomic approach to software development. Autonomic software aims to tackle complexity by allowing the software to manage itself, thus reducing the need for human intervention and allowing it to reach a maintainable state. Many techniques have been investigated for the development of autonomic systems, including policy-based designs, utility functions, and advanced architectures. A unique approach to the problem is the teleo-reactive programming paradigm. This paradigm offers a robust and simple structure on which to develop systems. It allows developers the freedom to express their intentions in a logical manner, whilst the increased robustness reduces the maintenance cost. Teleo-reactive programming is an established solution to low-level agent-based problems such as robot navigation and obstacle avoidance, but the technique shows behaviour consistent with higher-level autonomic solutions. This project therefore investigates the extent of the applicability of teleo-reactive programming as an autonomic solution. Can the technique be adapted to allow a better 'fitness for purpose' for autonomics whilst causing minimal changes to the tried and tested original structure and meaning? Does the technique introduce any additional problems, and can these be addressed with improvements to the teleo-reactive framework?
Teleo-reactive programming is an interesting approach to autonomic computing because a teleo-reactive program's state is not predetermined at any moment in time: rules execute according to a priority system based on the current environmental context (i.e. not in any strict procedural way) whilst still aiming at the intended goal. This method has been shown to be very robust and exhibits some of the qualities of autonomic software.
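The execution model described here, an ordered list of condition-action rules where the first applicable rule fires on every cycle, can be sketched in a few lines. The rules themselves are hypothetical:

```python
# An illustrative teleo-reactive interpreter: an ordered list of
# (condition, action) rules; on each cycle the first rule whose condition
# holds in the current environment fires, regardless of what fired before.
# The rules below are hypothetical.
def tr_step(rules, env):
    """Fire the highest-priority rule applicable in the current environment."""
    for condition, action in rules:
        if condition(env):
            return action(env)
    raise RuntimeError("no applicable rule")

rules = [
    (lambda e: e["at_goal"],        lambda e: "idle"),          # goal reached
    (lambda e: e["obstacle_ahead"], lambda e: "turn"),          # avoid obstacle
    (lambda e: True,                lambda e: "move_forward"),  # default rule
]

print(tr_step(rules, {"at_goal": False, "obstacle_ahead": True}))   # turn
print(tr_step(rules, {"at_goal": False, "obstacle_ahead": False}))  # move_forward
```

Because the whole rule list is re-evaluated each cycle, the program reacts to environmental change without any explicit state machine, which is the robustness property the project exploits.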
DAIM: a Mechanism to Distribute Control Functions within OpenFlow Switches
Telecommunication networks need to support a wide range of services and functionalities with autonomy, scalability, and adaptability in managing applications to meet business needs. Networking devices from different vendors are growing in complexity across various services and platforms, and managing this complexity requires expert operators. This paper explores an introduction to network programmability through a distributed, independent computing environment, demonstrated by a structured system named the DAIM model (Distributed Active Information Model). It also seeks to enhance the current SDN (Software-Defined Networking) approach, which has some scalability issues. The DAIM model can bring the richness of nature-inspired adaptation algorithms to a complex distributed computing environment. It uses a group of standard switches and databases, with communication between them handled by DAIM agents. These agents serve a set of network applications integrated with the DAIM model's databases. The DAIM model also addresses the challenges of autonomic functionality, where each network device can make its own decisions on the basis of information collected by the DAIM agents, and it is expected to satisfy the requirements of autonomic functionality. Moreover, this paper discusses packet forwarding within the DAIM model as well as its risk scenarios.
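The agent-local decision making proposed here can be illustrated with a toy sketch: an agent records locally collected information in a shared data structure and makes its own forwarding choice from it, without a central controller. The class, the method names, and the least-loaded policy are all assumptions, not the actual DAIM interfaces:

```python
# A toy sketch of the agent-local decision making described above: each
# agent records locally collected information in a shared data structure
# and makes its own forwarding choice from it, with no central controller.
# The class, method names, and least-loaded policy are all assumptions,
# not the actual DAIM interfaces.
class DaimAgent:
    def __init__(self, switch_id, database):
        self.switch_id = switch_id
        self.db = database                 # shared information model (a dict here)

    def collect(self, port, load):
        """Record the locally observed load of a port."""
        self.db[(self.switch_id, port)] = load

    def choose_port(self, candidates):
        """Autonomic local decision: forward via the least-loaded port."""
        return min(candidates, key=lambda p: self.db.get((self.switch_id, p), 0.0))

db = {}
agent = DaimAgent("sw1", db)
agent.collect(1, 0.9)
agent.collect(2, 0.2)
print(agent.choose_port([1, 2]))   # port 2, the less loaded one
```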