
    ATP: a Datacenter Approximate Transmission Protocol

    Many datacenter applications, such as machine learning and streaming systems, do not need the complete set of data to perform their computation. Current approximate applications in datacenters run on a reliable network layer such as TCP. To improve performance, they either let the sender select a subset of the data and transmit it to the receiver, or transmit all the data and let the receiver drop some of it. These approaches are network oblivious and transmit more data than necessary, hurting both application runtime and network bandwidth usage. On the other hand, running approximate applications on a lossy network over UDP cannot guarantee the accuracy of the application's computation. We propose to run approximate applications on a lossy network and to allow packet loss in a controlled manner. Specifically, we designed a new network protocol for datacenter approximate applications, called the Approximate Transmission Protocol (ATP). ATP opportunistically exploits available network bandwidth as much as possible, while performing a loss-based rate control algorithm to avoid bandwidth waste and retransmission. It also ensures fair bandwidth sharing across flows and improves the performance of accurate applications by leaving more switch buffer space to accurate flows. We evaluated ATP with both simulation and a real implementation, using two macro-benchmarks and two real applications, Apache Kafka and Apache Flink. Our evaluation results show that ATP reduces application runtime by 13.9% to 74.6% compared to a TCP-based solution that drops packets at the sender, and improves accuracy by up to 94.0% compared to UDP.
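    As a rough illustration of the kind of loss-based rate control the abstract describes, the sketch below shows an AIMD-style update that probes for spare bandwidth while keeping loss near a tolerable target. This is not the paper's published algorithm; the constants, the target_loss parameter, and the feedback helper are illustrative assumptions.

```python
# Minimal sketch of a loss-based rate controller in the spirit of ATP.
# The constants and the measure_loss_fraction() helper are assumptions
# for illustration, not the authors' implementation.

def measure_loss_fraction(acked: int, sent: int) -> float:
    """Fraction of packets lost in the last control interval (hypothetical feedback)."""
    return 0.0 if sent == 0 else 1.0 - acked / sent

def adjust_rate(rate_mbps: float, loss: float, target_loss: float = 0.02,
                increase_mbps: float = 1.0, decrease_factor: float = 0.8) -> float:
    """AIMD-style update: probe for spare bandwidth while keeping loss near
    a level the approximate application can absorb, so bandwidth waste and
    retransmission are avoided."""
    if loss > target_loss:
        # Losing more than the application tolerates: back off multiplicatively.
        return max(rate_mbps * decrease_factor, 1.0)
    # Loss within tolerance: additively probe for more bandwidth.
    return rate_mbps + increase_mbps
```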

    Energy-Efficient Workload Placement with Bounded Slowdown in Disaggregated Datacenters

    A Disaggregated Data Center (DDC) is a modern datacenter architecture that decouples hardware resources from monolithic servers into pools of resources that can be dynamically composed to match diverse workload requirements. While disaggregation improves resource utilization, it can negatively impact workload slowdown due to the latency of accessing disaggregated resources over the datacenter network. To this end, we consider CPU and memory disaggregation and conduct measurements to experimentally profile several popular datacenter workloads, characterizing the impact of disaggregation on workload execution slowdown. We then develop a workload placement algorithm, called Iterative Rounding-based Placement (IRoP), that, given a set of workloads, determines where to place each workload (i.e., on which CPU) and how much local and remote memory to allocate to it. The key insight in designing IRoP is that the impact of remote memory latency on slowdown can be substantially masked by assigning workloads to higher-performing CPUs, albeit at the cost of higher power consumption. As such, IRoP aims to find a workload placement that minimizes DDC power consumption while respecting a bounded slowdown for each workload. We provide extensive simulation results to demonstrate the flexibility of IRoP in providing a wide range of trade-offs between power consumption and workload slowdown. We also compare IRoP with several existing baselines. Our results indicate that IRoP can reduce power consumption and slowdown in the considered scenarios by up to 8% and 12%, respectively.
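    To make the placement problem concrete, here is a deliberately simplified greedy stand-in for what IRoP computes: assign each workload to the cheapest CPU that keeps its slowdown under the bound. The actual algorithm uses iterative rounding of a relaxed optimization; the data model and the slowdown formula below are illustrative assumptions.

```python
# Simplified greedy stand-in for the bounded-slowdown placement problem.
# The Workload/CPU fields and the slowdown model are assumptions, not IRoP.
from dataclasses import dataclass

@dataclass
class CPU:
    power_w: float        # power draw when hosting a workload
    speedup: float        # relative performance; faster CPUs mask remote-memory latency

@dataclass
class Workload:
    base_slowdown: float  # slowdown with remote memory on a baseline CPU

def place(workloads: list[Workload], cpus: list[CPU], max_slowdown: float):
    """Assign each workload to the lowest-power CPU that keeps its
    slowdown within the bound; returns (workload_idx, cpu_idx) pairs."""
    placement = []
    free = sorted(range(len(cpus)), key=lambda i: cpus[i].power_w)
    for wi, w in enumerate(workloads):
        for j, ci in enumerate(free):
            # A higher-performing CPU reduces the effective slowdown.
            if w.base_slowdown / cpus[ci].speedup <= max_slowdown:
                placement.append((wi, ci))
                free.pop(j)
                break
        else:
            raise ValueError(f"no feasible CPU for workload {wi}")
    return placement
```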

    Datacenter management for on-site intermittent and uncertain renewable energy sources

    In recent years, information and communication technologies (ICT) have become a major energy consumer, with the associated harmful ecological consequences. The emergence of Cloud computing and massive Internet companies has increased the number and importance of datacenters around the world. To mitigate their economic and ecological cost, powering datacenters with renewable energy sources (RES) has emerged as a sustainable solution. Some of the commonly used RES, such as solar and wind energy, depend directly on weather conditions; hence they are both intermittent and partly uncertain. Batteries or other energy storage devices (ESD) are often considered to compensate for this variability, but they introduce additional energy losses and are too costly, economically as well as ecologically, to be used alone without further integration. The power consumption of a datacenter is closely tied to its computing resource usage, which in turn depends on its workload and on the algorithms that schedule it. To use RES as efficiently as possible while preserving the datacenter's quality of service, a coordinated management of computing resources, electrical sources, and storage is required. A wide variety of datacenters exists, each with different hardware, workloads, and purposes. Similarly, each electrical infrastructure is modeled and managed differently, depending on the kind of RES used, the ESD technologies, and the operating objectives (cost or environmental impact). Existing works address this problem for specific pairings of electrical and computing models, but because of this combined diversity, their approaches and results cannot be extrapolated to other infrastructures. This thesis explores novel ways to deal with this coordination problem. A first contribution revisits the batch task scheduling problem by introducing an abstraction of the power sources. A scheduling algorithm is proposed that takes the preferences of the electrical sources into account while remaining independent of the nature of those sources and of the objectives of the electrical infrastructure (cost, environmental impact, or a mix of both). A second contribution addresses the joint power planning problem in a fully infrastructure-agnostic way. The datacenter's computing resources and workload management are encapsulated in a black box implementing scheduling under a variable power constraint. The same holds for the RES and storage management system, which acts as a source-commitment optimization algorithm answering a power demand. A cooperative, multiobjective optimization based on a multi-objective evolutionary algorithm (MOEA) dialogues with the two black boxes to find the best trade-offs between the electrical and computing objectives. Finally, a third contribution targets RES production uncertainty in a more specific infrastructure. Using a Markov Decision Process (MDP) formulation, the structure of the underlying decision problem is studied. For several variants of the problem, tractable methods are proposed to find optimal policies, or bounded approximations of them, at reasonable complexity.
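    The second contribution's black-box dialogue can be illustrated with a toy loop: candidate power profiles are scored independently by an IT-side and an electrical-side black box, and the non-dominated candidates form the trade-off front. A minimal sketch, using plain random search in place of a real MOEA such as NSGA-II; both scoring functions and all constants are simplifying assumptions, not the thesis's models.

```python
# Toy illustration of cooperative black-box power planning.
# Real deployments would plug actual schedulers and source-commitment
# optimizers behind these interfaces; everything here is hypothetical.
import random

def it_blackbox(power_profile: list[float]) -> float:
    """Hypothetical scheduler under a power cap: returns a QoS penalty
    (e.g. accumulated job delay) for the offered hourly power envelope."""
    return sum(max(0.0, 50.0 - p) for p in power_profile)

def electrical_blackbox(power_profile: list[float]) -> float:
    """Hypothetical source-commitment optimizer: returns the cost of
    meeting the demand with renewables, storage, and grid."""
    return sum(0.1 * p for p in power_profile)

def pareto_front(horizon: int = 24, candidates: int = 200):
    """Score random candidate envelopes with both black boxes and keep
    the non-dominated ones (best trade-offs between the two objectives)."""
    scored = []
    for _ in range(candidates):
        profile = [random.uniform(0.0, 100.0) for _ in range(horizon)]
        scored.append((it_blackbox(profile), electrical_blackbox(profile), profile))
    return [a for a in scored
            if not any(b[0] <= a[0] and b[1] <= a[1] and
                       (b[0] < a[0] or b[1] < a[1]) for b in scored)]
```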

    A Survey of FPGA Optimization Methods for Data Center Energy Efficiency

    This article provides a survey of the academic literature on field programmable gate arrays (FPGAs) and their use for energy-efficient acceleration in data centers. The goal is to critically present the existing FPGA energy optimization techniques and discuss how they can be applied to such systems. To do so, the article explores current energy trends and their projection into the future, with particular attention to the requirements set out by the European Code of Conduct for Data Center Energy Efficiency. The article then proposes a complete analysis of over ten years of research in energy optimization techniques, classifying them by purpose, method of application, and impact on the sources of consumption. Finally, we conclude with the challenges and possible innovations we expect for this sector. (Accepted for publication in IEEE Transactions on Sustainable Computing.)

    Timely Long Tail Identification through Agent Based Monitoring and Analytics

    The increasing complexity and scale of distributed systems have resulted in emergent behavior that substantially affects overall system performance. A significant emergent property is the "Long Tail", whereby a small proportion of straggling tasks significantly delays job completion times. To mitigate such behavior, straggling tasks occurring within the system need to be identified accurately and in a timely manner. However, current approaches focus on mitigation rather than identification, and typically identify stragglers too late in the execution lifecycle. This paper presents a method and tool to identify Long Tail behavior within distributed systems in a timely manner, through a combination of online and offline analytics. This is achieved through historical analysis that profiles and models task execution patterns, which then informs online analytic agents that monitor task execution at runtime. Furthermore, we provide an empirical analysis of two large-scale production Cloud datacenters that demonstrates the challenge of data skew within modern distributed systems: approximately 5% of task stragglers, caused by data skew, impact 50% of the total jobs for batch processes. Our results demonstrate that our approach identifies task stragglers less than 11% into their execution lifecycle with 98% accuracy, a significant improvement over current state-of-the-art practice that enables far more effective mitigation strategies in large-scale distributed systems.
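    As a concrete illustration of the identification idea, the sketch below flags a task as a straggler when its observed progress rate falls well below a historically profiled rate, which is what permits detection early in the execution lifecycle rather than after peers finish. The threshold model and the TaskProfile fields are assumptions, not the paper's published model.

```python
# Minimal sketch of early straggler flagging against a historical profile.
# The normal-based threshold and the field names are hypothetical.
from dataclasses import dataclass

@dataclass
class TaskProfile:
    mean_rate: float   # historical mean progress rate (fraction of task / second)
    std_rate: float    # historical standard deviation of that rate

def is_straggler(progress: float, elapsed_s: float, profile: TaskProfile,
                 k: float = 2.0) -> bool:
    """Flag a task whose observed progress rate falls more than k standard
    deviations below the historical mean for tasks of its type."""
    if elapsed_s <= 0.0:
        return False
    observed_rate = progress / elapsed_s
    return observed_rate < profile.mean_rate - k * profile.std_rate

# Example: a task 3% complete after 60 s, against a profile expecting ~0.1%/s.
profile = TaskProfile(mean_rate=0.001, std_rate=0.0002)
print(is_straggler(progress=0.03, elapsed_s=60.0, profile=profile))  # True
```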