673 research outputs found
Energy-aware scheduling in distributed computing systems
Distributed computing systems, such as data centers, are key for supporting modern computing demands. However, the energy consumption of data centers has become a major concern over the last decade. Worldwide energy consumption in 2012 was estimated to be around 270 TWh, and grim forecasts predict it will quadruple by 2030. Maximizing energy efficiency while also maximizing computing efficiency is a major challenge for modern data centers. This work addresses this challenge by scheduling the operation of modern data centers, considering a multi-objective approach for simultaneously optimizing both efficiency objectives. Multiple data center scenarios are studied, such as scheduling a single data center and scheduling a federation of several geographically-distributed data centers. Mathematical models are formulated for each scenario, considering the modeling of their most relevant components such as computing resources, computing workload, cooling system, networking, and green energy generators, among others. A set of accurate heuristic and metaheuristic algorithms are designed for addressing the scheduling problem. These scheduling algorithms are comprehensively studied, and compared with each other, using statistical tools to evaluate their efficacy when addressing realistic workloads and scenarios. Experimental results show the designed scheduling algorithms are able to significantly increase the energy efficiency of data centers when compared to traditional scheduling methods, while providing a diverse set of trade-off solutions regarding the computing efficiency of the data center. These results confirm the effectiveness of the proposed algorithmic approaches for data center infrastructures.Los sistemas informáticos distribuidos, como los centros de datos, son clave para satisfacer la demanda informática moderna. Sin embargo, su consumo de energético se ha convertido en una gran preocupación. Se estima que mundialmente su consumo energético rondó los 270 TWh en el año 2012, y algunos prevén que este consumo se cuadruplicará para el año 2030. Maximizar simultáneamente la eficiencia energética y computacional de los centros de datos es un desafío crítico. Esta tesis aborda dicho desafío mediante la planificación de la operativa del centro de datos considerando un enfoque multiobjetivo para optimizar simultáneamente ambos objetivos de eficiencia. En esta tesis se estudian múltiples variantes del problema, desde la planificación de un único centro de datos hasta la de una federación de múltiples centros de datos geográficmentea distribuidos. Para esto, se formulan modelos matemáticos para cada variante del problema, modelado sus componentes más relevantes, como: recursos computacionales, carga de trabajo, refrigeración, redes, energía verde, etc. Para resolver el problema de planificación planteado, se diseñan un conjunto de algoritmos heurísticos y metaheurísticos. Estos son estudiados exhaustivamente y su eficiencia es evaluada utilizando una batería de herramientas estadísticas. Los resultados experimentales muestran que los algoritmos de planificación diseñados son capaces de aumentar significativamente la eficiencia energética de un centros de datos en comparación con métodos tradicionales planificación. A su vez, los métodos propuestos proporcionan un conjunto diverso de soluciones con diferente nivel de compromiso respecto a la eficiencia computacional del centro de datos. Estos resultados confirman la eficacia del enfoque algorítmico propuesto
Enhanced Cluster Computing Performance Through Proportional Fairness
The performance of cluster computing depends on how concurrent jobs share
multiple data center resource types like CPU, RAM and disk storage. Recent
research has discussed efficiency and fairness requirements and identified a
number of desirable scheduling objectives including so-called dominant resource
fairness (DRF). We argue here that proportional fairness (PF), long recognized
as a desirable objective in sharing network bandwidth between ongoing flows, is
preferable to DRF. The superiority of PF is manifest under the realistic
modelling assumption that the population of jobs in progress is a stochastic
process. In random traffic the strategy-proof property of DRF proves
unimportant while PF is shown by analysis and simulation to offer a
significantly better efficiency-fairness tradeoff.Comment: Submitted to Performance 201
Mage: Online Interference-Aware Scheduling in Multi-Scale Heterogeneous Systems
Heterogeneity has grown in popularity both at the core and server level as a
way to improve both performance and energy efficiency. However, despite these
benefits, scheduling applications in heterogeneous machines remains
challenging. Additionally, when these heterogeneous resources accommodate
multiple applications to increase utilization, resources are prone to
contention, destructive interference, and unpredictable performance. Existing
solutions examine heterogeneity either across or within a server, leading to
missed performance and efficiency opportunities. We present Mage, a practical
interference-aware runtime that optimizes performance and efficiency in systems
with intra- and inter-server heterogeneity. Mage leverages fast and online data
mining to quickly explore the space of application placements, and determine
the one that minimizes destructive interference between co-resident
applications. Mage continuously monitors the performance of active
applications, and, upon detecting QoS violations, it determines whether
alternative placements would prove more beneficial, taking into account any
overheads from migration. Across 350 application mixes on a heterogeneous CMP,
Mage improves performance by 38% and up to 2x compared to a greedy scheduler.
Across 160 mixes on a heterogeneous cluster, Mage improves performance by 30%
on average and up to 52% over the greedy scheduler, and by 11% over the
combination of Paragon [15] for inter- and intra-server heterogeneity
Learning Scheduling Algorithms for Data Processing Clusters
Efficiently scheduling data processing jobs on distributed compute clusters
requires complex algorithms. Current systems, however, use simple generalized
heuristics and ignore workload characteristics, since developing and tuning a
scheduling policy for each workload is infeasible. In this paper, we show that
modern machine learning techniques can generate highly-efficient policies
automatically. Decima uses reinforcement learning (RL) and neural networks to
learn workload-specific scheduling algorithms without any human instruction
beyond a high-level objective such as minimizing average job completion time.
Off-the-shelf RL techniques, however, cannot handle the complexity and scale of
the scheduling problem. To build Decima, we had to develop new representations
for jobs' dependency graphs, design scalable RL models, and invent RL training
methods for dealing with continuous stochastic job arrivals. Our prototype
integration with Spark on a 25-node cluster shows that Decima improves the
average job completion time over hand-tuned scheduling heuristics by at least
21%, achieving up to 2x improvement during periods of high cluster load
Data-Driven Intelligent Scheduling For Long Running Workloads In Large-Scale Datacenters
Cloud computing is becoming a fundamental facility of society today. Large-scale public or private cloud datacenters spreading millions of servers, as a warehouse-scale computer, are supporting most business of Fortune-500 companies and serving billions of users around the world. Unfortunately, modern industry-wide average datacenter utilization is as low as 6% to 12%. Low utilization not only negatively impacts operational and capital components of cost efficiency, but also becomes the scaling bottleneck due to the limits of electricity delivered by nearby utility. It is critical and challenge to improve multi-resource efficiency for global datacenters.
Additionally, with the great commercial success of diverse big data analytics services, enterprise datacenters are evolving to host heterogeneous computation workloads including online web services, batch processing, machine learning, streaming computing, interactive query and graph computation on shared clusters. Most of them are long-running workloads that leverage long-lived containers to execute tasks.
We concluded datacenter resource scheduling works over last 15 years. Most previous works are designed to maximize the cluster efficiency for short-lived tasks in batch processing system like Hadoop. They are not suitable for modern long-running workloads of Microservices, Spark, Flink, Pregel, Storm or Tensorflow like systems. It is urgent to develop new effective scheduling and resource allocation approaches to improve efficiency in large-scale enterprise datacenters.
In the dissertation, we are the first of works to define and identify the problems, challenges and scenarios of scheduling and resource management for diverse long-running workloads in modern datacenter. They rely on predictive scheduling techniques to perform reservation, auto-scaling, migration or rescheduling. It forces us to pursue and explore more intelligent scheduling techniques by adequate predictive knowledges. We innovatively specify what is intelligent scheduling, what abilities are necessary towards intelligent scheduling, how to leverage intelligent scheduling to transfer NP-hard online scheduling problems to resolvable offline scheduling issues.
We designed and implemented an intelligent cloud datacenter scheduler, which automatically performs resource-to-performance modeling, predictive optimal reservation estimation, QoS (interference)-aware predictive scheduling to maximize resource efficiency of multi-dimensions (CPU, Memory, Network, Disk I/O), and strictly guarantee service level agreements (SLA) for long-running workloads.
Finally, we introduced a large-scale co-location techniques of executing long-running and other workloads on the shared global datacenter infrastructure of Alibaba Group. It effectively improves cluster utilization from 10% to averagely 50%. It is far more complicated beyond scheduling that involves technique evolutions of IDC, network, physical datacenter topology, storage, server hardwares, operating systems and containerization. We demonstrate its effectiveness by analysis of newest Alibaba public cluster trace in 2017. We are the first of works to reveal the global view of scenarios, challenges and status in Alibaba large-scale global datacenters by data demonstration, including big promotion events like Double 11 .
Data-driven intelligent scheduling methodologies and effective infrastructure co-location techniques are critical and necessary to pursue maximized multi-resource efficiency in modern large-scale datacenter, especially for long-running workloads
- …