567 research outputs found

    A unified model for holistic power usage in cloud datacenter servers

    Cloud datacenters are compute facilities formed by hundreds and thousands of heterogeneous servers that require significant power to operate effectively. Servers are composed of multiple interacting sub-systems, including applications, microelectronic processors, and cooling, which reflect their respective power profiles via different parameters. What is presently unknown is how to accurately model the holistic power usage of the entire server when including all these sub-systems together. This becomes increasingly challenging when considering diverse utilization patterns, server hardware characteristics, air and liquid cooling techniques, and, importantly, quantifying the non-electrical energy cost imposed by cooling operation. Such a challenge arises from the multi-disciplinary expertise required to study server operation holistically. This work provides a unified model for capturing holistic power usage within Cloud datacenter servers. Constructed through controlled laboratory experiments, the model captures the relationship of server power usage between software, hardware, and cooling, agnostic of architecture and cooling type (air and liquid). An exciting prospect is the ability to quantify the amount of non-electrical power consumed through cooling, allowing for more realistic and accurate server power profiles. This work represents the first empirically supported analysis and modeling of holistic power usage for Cloud datacenter servers, and bridges a significant gap between computer science and mechanical engineering research. Model validation through experiments demonstrates an average standard error of 3% for server power usage within both air and liquid cooled environments.

    Strategies for Increased Energy Awareness in Cloud Federations

    This chapter first identifies three scenarios that current energy-aware cloud solutions cannot handle as isolated IaaS systems, but for which federation offers opportunities to be explored. These scenarios are centered around: (i) multi-datacenter cloud operators, (ii) commercial cloud federations, and (iii) academic cloud federations. Based on these scenarios, we identify energy-aware scheduling policies to be applied in the management solutions of cloud federations. Among others, these policies should consider the behavior of independent administrative domains, the frequently contradicting goals of the participating clouds, and federation-wide energy consumption.

    Holistic energy and failure aware workload scheduling in Cloud datacenters

    The global uptake of Cloud computing has attracted increased interest within both academia and industry, resulting in the formation of large-scale and complex distributed systems. This has led to increased failure occurrence within computing systems, which substantially degrades the system performance and task reliability perceived by users. Such systems also consume vast quantities of power, resulting in significant operational costs for providers. Virtualization – a commonly deployed technology within Cloud datacenters – can enable flexible scheduling of virtual machines to maximize system reliability and energy-efficiency. However, existing work addresses these two objectives separately, providing limited understanding of the explicit trade-offs between dependable and energy-efficient compute infrastructure. In this paper, we propose two failure-aware energy-efficient scheduling algorithms that exploit the holistic operational characteristics of the Cloud datacenter, comprising the cooling unit, computing infrastructure, and server failures. By comprehensively modeling the power and failure profiles of a Cloud datacenter, we propose workload scheduling algorithms Ella-W and Ella-B, capable of reducing cooling and compute energy while minimizing the impact of system failures. A novel overall metric is proposed that combines energy efficiency and reliability to specify the performance of various algorithms. We evaluate our algorithms against Random, MaxUtil, TASA, MTTE and OBFIT under various system conditions of failure prediction accuracy and workload intensity. Evaluation results demonstrate that Ella-W can reduce energy usage by 29.5% and improve task completion rate by 3.6%, while Ella-B reduces energy usage by 32.7% with no degradation to task completion rate.

    Data-Driven Intelligent Scheduling For Long Running Workloads In Large-Scale Datacenters

    Cloud computing is becoming a fundamental facility of society today. Large-scale public or private cloud datacenters spanning millions of servers, operating as warehouse-scale computers, support most of the business of Fortune 500 companies and serve billions of users around the world. Unfortunately, modern industry-wide average datacenter utilization is as low as 6% to 12%. Low utilization not only negatively impacts the operational and capital components of cost efficiency, but also becomes a scaling bottleneck due to the limits on electricity delivered by nearby utilities. Improving multi-resource efficiency for global datacenters is both critical and challenging. Additionally, with the great commercial success of diverse big data analytics services, enterprise datacenters are evolving to host heterogeneous computation workloads, including online web services, batch processing, machine learning, streaming computing, interactive query, and graph computation, on shared clusters. Most of them are long-running workloads that leverage long-lived containers to execute tasks. We summarized datacenter resource scheduling work from the last 15 years. Most previous work is designed to maximize cluster efficiency for short-lived tasks in batch processing systems like Hadoop, and is not suitable for modern long-running workloads in systems such as microservices, Spark, Flink, Pregel, Storm, or TensorFlow. It is urgent to develop new, effective scheduling and resource allocation approaches to improve efficiency in large-scale enterprise datacenters. In the dissertation, we are the first to define and identify the problems, challenges, and scenarios of scheduling and resource management for diverse long-running workloads in modern datacenters. These workloads rely on predictive scheduling techniques to perform reservation, auto-scaling, migration, or rescheduling, which forces us to explore more intelligent scheduling techniques supported by adequate predictive knowledge.
We specify what intelligent scheduling is, what abilities it requires, and how to leverage it to transform NP-hard online scheduling problems into solvable offline scheduling problems. We designed and implemented an intelligent cloud datacenter scheduler, which automatically performs resource-to-performance modeling, predictive optimal reservation estimation, and QoS (interference)-aware predictive scheduling to maximize resource efficiency across multiple dimensions (CPU, Memory, Network, Disk I/O) and strictly guarantee service level agreements (SLA) for long-running workloads. Finally, we introduced a large-scale co-location technique for executing long-running and other workloads on the shared global datacenter infrastructure of Alibaba Group. It effectively improves cluster utilization from 10% to 50% on average. This effort goes far beyond scheduling and involves the technical evolution of the IDC, network, physical datacenter topology, storage, server hardware, operating systems, and containerization. We demonstrate its effectiveness by analyzing the newest Alibaba public cluster trace, from 2017. We are the first to reveal a global view of the scenarios, challenges, and status of Alibaba's large-scale global datacenters through data, including big promotion events like Double 11. Data-driven intelligent scheduling methodologies and effective infrastructure co-location techniques are critical and necessary to pursue maximized multi-resource efficiency in modern large-scale datacenters, especially for long-running workloads.

    Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters

    The increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena that substantially affect overall system performance. One such phenomenon is known as the “Long Tail”, whereby a small proportion of task stragglers significantly impede job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical for gaining in-depth knowledge of straggler occurrence and for focusing developmental and research efforts towards solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-cause within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact of stragglers, and propose a method for conducting root-cause analysis. Results demonstrate that approximately 5% of task stragglers impact 50% of total jobs for batch processes, and 53% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline modeling of execution patterns and online analytic agents that monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11% into their execution lifecycle with 95% accuracy for short duration jobs.

    Towards near-threshold server processors

    The popularity of cloud computing has led to a dramatic increase in the number of data centers in the world. The ever-increasing computational demands, along with the slowdown in technology scaling, have ushered in an era of power-limited servers. Techniques such as near-threshold computing (NTC) can be used to improve energy efficiency in the post-Dennard scaling era. This paper describes an architecture based on the FD-SOI process technology for near-threshold operation in servers. Our work explores the trade-offs in energy and performance when running a wide range of applications found in private and public clouds, ranging from traditional scale-out applications, such as web search or media streaming, to virtualized banking applications. Our study demonstrates the benefits of near-threshold operation and proposes several directions to synergistically increase the energy proportionality of a near-threshold server.

    Green Approach for Joint Management of Geo-Distributed Data Centers and Interconnection Networks

    Every time an Internet user downloads a video, shares a picture, or sends an email, his/her device addresses a data center and often several of them. These complex systems feed the web and all Internet applications with their computing power and information storage, but they are very energy hungry. The energy consumed by Information and Communication Technology (ICT) infrastructures is currently more than 4% of worldwide consumption and is expected to double in the next few years. Data centers and communication networks are responsible for a large portion of ICT energy consumption, and in recent years this has stimulated a research effort to reduce or mitigate their environmental impact. Most of the proposed approaches tackle the problem by separately optimizing the power consumption of the servers in data centers and of the network. However, the Cloud computing infrastructure of most providers, including traditional telcos that are extending their offerings, is rapidly evolving toward geographically distributed data centers strongly integrated with the network interconnecting them. Distributed data centers not only bring services closer to users with better quality, but also provide opportunities to improve energy efficiency by exploiting the variation of prices across time zones, locally generated green energy, and the storage systems that are becoming popular in energy networks. In this paper, we propose an energy-aware joint management framework for geo-distributed data centers and their interconnection network. The model is based on virtual machine migration and formulated using mixed integer linear programming. It can be solved in reasonable time using state-of-the-art solvers such as CPLEX. The proposed approach covers various aspects of Cloud computing systems. In addition, it jointly manages the use of green and brown energy using energy storage technologies.
The obtained results show that significant energy cost savings can be achieved compared to a baseline strategy in which data centers do not collaborate to reduce energy and do not use power coming from renewable resources.

    The MANGO FET-HPC Project: an overview

    © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. In this paper, we provide an overview of the MANGO project and its goal. The MANGO project aims at addressing power, performance and predictability (the PPP space) in future High-Performance Computing systems. It starts from the fundamental intuition that effective techniques for all three goals ultimately rely on customization to adapt the computing resources to reach the desired Quality of Service (QoS). From this starting point, MANGO will explore different but interrelated mechanisms at various architectural levels, as well as at the level of the system software. In particular, to explore a new positioning across the PPP space, MANGO will investigate system-wide, holistic, proactive thermal and power management aimed at extreme-scale energy efficiency. The MANGO project starts in October 2015 and is funded by the European Commission under the Horizon 2020 FET-HPC program. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 671668. Flich Cardo, J.; Agosta, G.; Ampletzer, P.; Atienza Alonso, D.; Cilardo, A.; Fornaciari, W.; Kovac, M.... (2015). The MANGO FET-HPC Project: an overview. IEEE Computer Society. https://doi.org/10.1109/CSE.2015.57