15 research outputs found

    Analysis and Clustering of Workload in Google Cluster Trace based on Resource Usage

    Cloud computing has gained interest among commercial organizations, research communities, developers and other individuals during the past few years. To move ahead with research in the field of data management and processing of such data, we need benchmark datasets and freely available, publicly accessible data. In May 2011, Google released a trace of a cluster of 11k machines, referred to as the Google Cluster Trace. This trace contains cell information covering about 29 days. This paper analyzes resource usage and requirements in this trace and attempts to give insight into production traces of this kind, similar to those in cloud environments. The major contributions of this paper include a statistical profile of jobs based on resource usage, clustering of workload patterns, and classification of jobs into different types based on k-means clustering. Although earlier works have analyzed this trace, our analysis provides several new findings, such as that jobs in a production trace are trimodal and that there is symmetry in the tasks within a long job typ
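The k-means step mentioned in this abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the (CPU, memory) feature pairs are made up, the choice of three clusters only mirrors the "trimodal" finding, and the function names are hypothetical rather than taken from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iterations=100):
    """Plain Lloyd-style k-means starting from caller-supplied centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels, centroids


# Hypothetical normalized (cpu, memory) usage per job; not data from the trace.
jobs = [(0.10, 0.10), (0.12, 0.08),
        (0.50, 0.50), (0.52, 0.48),
        (0.90, 0.90), (0.88, 0.92)]
labels, centroids = kmeans(jobs, [jobs[0], jobs[2], jobs[4]])
print(labels)
```

The result depends on the starting centroids, which is why they are passed in explicitly here; a real analysis would more likely use a library implementation with several random restarts.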

    A Big Data Analyzer for Large Trace Logs

    The current generation of Internet-based services is typically hosted in large data centers that take the form of warehouse-size structures housing tens of thousands of servers. Continued availability of a modern data center is the result of a complex orchestration among many internal and external actors including computing hardware, multiple layers of intricate software, networking and storage devices, electrical power and cooling plants. During the course of their operation, many of these components produce large amounts of data in the form of event and error logs that are essential not only for identifying and resolving problems but also for improving data center efficiency and management. Most of these activities would benefit significantly from data analytics techniques to exploit hidden statistical patterns and correlations that may be present in the data. The sheer volume of data to be analyzed makes uncovering these correlations and patterns a challenging task. This paper presents BiDAl, a prototype Java tool for log-data analysis that incorporates several Big Data technologies in order to simplify the task of extracting information from data traces produced by large clusters and server farms. BiDAl provides the user with several analysis languages (SQL, R and Hadoop MapReduce) and storage backends (HDFS and SQLite) that can be freely mixed and matched so that a custom tool for a specific task can be easily constructed. BiDAl has a modular architecture so that it can be extended with other backends and analysis languages in the future. In this paper we present the design of BiDAl and describe our experience using it to analyze publicly available traces from Google data clusters, with the goal of building a realistic model of a complex data center. Comment: 26 pages, 10 figures

    Towards Data-Driven Autonomics in Data Centers

    Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using generated data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating a predictive model for node failures. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing machine state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict if machines will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88% with precision varying between 50% and 72%. We discuss the practicality of including our predictive model as the central component of a data-driven autonomic manager and operating it on-line with live data streams (rather than off-line on data logs). All of the scripts used for BigQuery and classification analyses are publicly available from the authors' website. Comment: 12 pages, 6 figures
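To make the evaluation terms in this abstract concrete, here is a minimal sketch of how a decision threshold can be chosen so that the false positive rate stays under a cap, with the true positive rate and precision then read off at that threshold. The scores and labels are made up, the function names are hypothetical, and this is not the authors' evaluation code.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (tpr, fpr, precision) when predicting failure for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return tpr, fpr, precision


def best_threshold_under_fpr(scores, labels, max_fpr=0.05):
    """Lowest threshold whose FPR is within the cap, or None if there is none.

    FPR can only fall as the threshold rises, so the first threshold that
    satisfies the cap is also the one with the highest TPR.
    """
    for threshold in sorted(set(scores)):
        tpr, fpr, precision = rates_at_threshold(scores, labels, threshold)
        if fpr <= max_fpr:
            return threshold, tpr, fpr, precision
    return None


# Hypothetical classifier scores and ground truth (1 = machine failed).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
result = best_threshold_under_fpr(scores, labels, max_fpr=0.05)
print(result)
```

With so few samples the 5% cap forces a threshold with zero false positives; on a large dataset the trade-off between the cap and the true positive rate is much smoother.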

    Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics

    Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using live data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating predictive models for node failures. Our results support the practicality of a data-driven approach by showing the effectiveness of predictive models based on data found in typical data center logs. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing node state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict if nodes will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88% with precision varying between 50% and 72%. This level of performance allows us to recover a large fraction of jobs' executions (by redirecting them to other nodes when a failure of the present node is predicted) that would otherwise have been wasted due to failures. [...

    Cost reduction bounds of proactive management based on request prediction

    Data Centers (DCs) need to manage their servers periodically to meet user demand efficiently. Since the cost of the energy employed to serve the user demand is lower when DC settings (e.g. number of active servers) are done a priori (proactively), there is a great interest in studying different proactive strategies based on predictions of requests. The amount of savings in energy cost that can be achieved depends not only on the selected proactive strategy but also on the statistics of the demand and the predictors used. Despite its importance, due to the complexity of the problem it is difficult to find studies that quantify the savings that can be obtained. The main contribution of this paper is to propose a generic methodology to quantify the possible cost reduction using proactive management based on predictions. Thus, using this method together with past data it is possible to quantify the efficiency of different predictors as well as optimize proactive strategies. In this paper, the cost reduction is evaluated using both ARMA (Auto Regressive Moving Average) and LV (Last Value) predictors. We then apply this methodology to the Google dataset collected over a period of 29 days to evaluate the benefit that can be obtained with those two predictors in the considered DC
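Of the two predictors named in this abstract, LV (Last Value) is simple enough to sketch directly: the forecast for the next interval is the most recent observation. The request counts below are made up and the function names are hypothetical; this is an illustration of the predictor only, not the paper's cost methodology.

```python
def last_value_forecast(series):
    """One-step-ahead LV forecasts for steps 1..n-1 of the series.

    The forecast for step t is simply the observation at step t-1.
    """
    return series[:-1]


def mean_absolute_error(actual, predicted):
    """Average absolute gap between observations and forecasts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical request counts per interval; not taken from the Google dataset.
requests = [100, 120, 90, 110, 130]
forecasts = last_value_forecast(requests)  # forecasts for intervals 1..4
actual = requests[1:]                      # what was observed in intervals 1..4
mae = mean_absolute_error(actual, forecasts)
print(forecasts, mae)
```

An error measure like this is one way to compare LV against a fitted ARMA model on past data before either is used to drive proactive server settings.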

    Performance analysis : a case study on network management system using machine learning

    Businesses have legacy distributed software systems that are beyond the reach of traditional data analysis methods because of their complexity. In addition, software systems evolve and become difficult to understand even with knowledge of the system architecture. Machine learning and big data analytics techniques are widely used in many technical domains to gain insight from such large business data because of their performance and accuracy. This study was conducted to investigate the applicability of machine learning techniques to performance utilization modelling on Nokia’s network management system. The objective was to study and develop resource utilization models based on system performance data and to study future business needs for capacity analysis of the software performance in order to minimize manual tasks. The performance data was extracted from network management system software and contains resource usage at system level and component-level measurements based on input load. In general, the simulated load on a network management system is uniform with little variance. To overcome this during the research, different load profiles were simulated on the system to assess its performance. The data was then processed and evaluated using a set of machine learning techniques (linear regression, MARS, K-NN, random forest, SVR and feed-forward neural networks) to construct resource utilization models. Further, the goodness of the developed models was evaluated on simulated test data and customer data. Overall, no single algorithm performed best on all resource entities, but neural networks performed well on most response variables as a multivariable output model. However, when comparing performance across customer and test datasets, there were some differences, which were also studied. Overall, the results show the feasibility of modeling system resources for use in capacity analysis. For future iterations, the report makes suggestions, including further analysis of the remaining system nodes