1,047 research outputs found
Technical Report: A Trace-Based Performance Study of Autoscaling Workloads of Workflows in Datacenters
To improve customer experience, datacenter operators offer support for
simplifying application and resource management. For example, running workloads
of workflows on behalf of customers is desirable, but requires increasingly
more sophisticated autoscaling policies, that is, policies that dynamically
provision resources for the customer. Although selecting and tuning autoscaling
policies is a challenging task for datacenter operators, so far relatively few
studies investigate the performance of autoscaling for workloads of workflows.
Complementing previous knowledge, in this work we propose the first
comprehensive performance study in the field. Using trace-based simulation, we
compare state-of-the-art autoscaling policies across multiple application
domains, workload arrival patterns (e.g., burstiness), and system utilization
levels. We further investigate the interplay between autoscaling and regular
allocation policies, and the complexity cost of autoscaling. Our quantitative
study focuses not only on traditional performance metrics and on
state-of-the-art elasticity metrics, but also on time- and memory-related
autoscaling-complexity metrics. Our main results give strong and quantitative
evidence about previously unreported operational behavior, for example, that
autoscaling policies perform differently across application domains and by how
much they differ.Comment: Technical Report for the CCGrid 2018 submission "A Trace-Based
Performance Study of Autoscaling Workloads of Workflows in Datacenters
Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics
Continued reliance on human operators for managing data centers is a major
impediment for them from ever reaching extreme dimensions. Large computer
systems in general, and data centers in particular, will ultimately be managed
using predictive computational and executable models obtained through
data-science tools, and at that point, the intervention of humans will be
limited to setting high-level goals and policies rather than performing
low-level operations. Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated using live data,
opens one possible path towards limiting the role of operators in data centers.
In this paper, we present a data-science study of a public Google dataset
collected in a 12K-node cluster with the goal of building and evaluating
predictive models for node failures. Our results support the practicality of a
data-driven approach by showing the effectiveness of predictive models based on
data found in typical data center logs. We use BigQuery, the big data SQL
platform from the Google Cloud suite, to process massive amounts of data and
generate a rich feature set characterizing node state over time. We describe
how an ensemble classifier can be built out of many Random Forest classifiers
each trained on these features, to predict if nodes will fail in a future
24-hour window. Our evaluation reveals that if we limit false positive rates to
5%, we can achieve true positive rates between 27% and 88% with precision
varying between 50% and 72%.This level of performance allows us to recover
large fraction of jobs' executions (by redirecting them to other nodes when a
failure of the present node is predicted) that would otherwise have been wasted
due to failures. [...
Power Management Techniques for Data Centers: A Survey
With growing use of internet and exponential growth in amount of data to be
stored and processed (known as 'big data'), the size of data centers has
greatly increased. This, however, has resulted in significant increase in the
power consumption of the data centers. For this reason, managing power
consumption of data centers has become essential. In this paper, we highlight
the need of achieving energy efficiency in data centers and survey several
recent architectural techniques designed for power management of data centers.
We also present a classification of these techniques based on their
characteristics. This paper aims to provide insights into the techniques for
improving energy efficiency of data centers and encourage the designers to
invent novel solutions for managing the large power dissipation of data
centers.Comment: Keywords: Data Centers, Power Management, Low-power Design, Energy
Efficiency, Green Computing, DVFS, Server Consolidatio
Recommended from our members
Scheduling, Characterization and Prediction of HPC Workloads for Distributed Computing Environments
As High Performance Computing (HPC) has grown considerably and is expected to grow even more, effective resource management for distributed computing sys- tems is motivated more than ever. As the computational workloads grow in quantity, it is becoming more crucial to apply efficient resource management and workload scheduling to use resources efficiently while keeping the computational performance reasonably good. The problem of efficiently scheduling workloads on resources while meeting performance standards is hard. Additionally, non-clairvoyance of job dimen- sions makes resource management even harder in real-world scenarios. Our research methodology investigates the scheduling problem compliant for HPC and researches the challenges for deploying the scheduling in real world-scenarios using state of the art machine learning and data science techniques.To this end, this Ph.D. dissertation makes the following core contributions: a) We perform a theoretical analysis of space-sharing, non-preemptive scheduling: we studied this scheduling problem and proposed scheduling algorithms with polyno- mial computation time. We also proved constant upper-bounds for the performance of these algorithms. b) We studied the sensitivity of scheduling algorithms to the accuracy of runtime and devised a meta-learning approach to estimate prediction accuracy for newly submitted jobs to the HPC system. c) We studied the runtime prediction problem for HPC applications. For this purpose, we studied the distri- bution of available public workloads and proposed two different solutions that can predict multi-modal distributions: switching state-space models and Mixture Density Networks. d) We studied the effectiveness of recent recurrent neural network models for CPU usage trace prediction for individual VM traces as well as aggregate CPU usage traces. In this dissertation, we explore solutions to improve the performance of scheduling workloads on distributed systems.We begin by looking at the problem from the theoretical perspective. Modeling the problem mathematically, we first propose a scheduling algorithm that finds a constant approximation of the optimal solution for the problem in polynomial time. We prove that the performance of the algorithm (average completion time is the constant approximation of the performance of the optimal scheduling. We next look at the problem in real-world scenarios. Considering High-Performance Computing (HPC) workload computing environments as the most similar real-world equivalent of our mathematical model, we explore the problem of predicting application runtime. We propose an algorithm to handle the existing uncertainties in the real world and show-case our algorithm with demonstrative effectiveness in terms of response time and resource utilization. After looking at the uncertainty problem, we focus on trying to improve the accuracy of existing prediction approaches for HPC application runtime. We propose two solutions, one based on Kalman filters and one based on deep density mixture networks. We showcase the effectiveness of our prediction approaches by comparing with previous prediction approaches in terms of prediction accuracy and impact on improving scheduling performance. In the end, we focus on predicting resource usage for individual applications during their execution. We explore the application of recurrent neural networks for predicting resource usage of applications deployed on individual virtual machines. To validate our proposed models and solutions, we performed extensive trace-driven simulation and measured the effectiveness of our approaches
- …