In Datacenter Performance, The Only Constant Is Change
All computing infrastructure suffers from performance variability, be it
bare-metal or virtualized. This phenomenon originates from many sources: some
transient, such as noisy neighbors, and others more permanent but sudden, such
as changes or wear in hardware, changes in the underlying hypervisor stack, or
even undocumented interactions between the policies of the computing resource
provider and the active workloads. Thus, performance measurements obtained on
clouds, HPC facilities, and, more generally, datacenter environments are almost
guaranteed to exhibit performance regimes that evolve over time, which leads to
undesirable nonstationarities in application performance. In this paper, we
present our analysis of the performance of the bare-metal hardware available
on the CloudLab testbed, where we focus on quantifying the evolving performance regimes
using changepoint detection. We describe our findings, backed by a dataset with
nearly 6.9M benchmark results collected from over 1600 machines over a period
of 2 years and 9 months. These findings yield a comprehensive characterization
of real-world performance variability patterns in one computing facility, a
methodology for studying such patterns on other infrastructures, and contribute
to a better understanding of performance variability in general.
Comment: To be presented at the 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid, http://cloudbus.org/ccgrid2020/), May 11-14, 2020, Melbourne, Victoria, Australia.
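The core technique here, changepoint detection over benchmark time series, can be illustrated with the `ruptures` Python library; the PELT algorithm, penalty value, and synthetic data below are illustrative assumptions, not the authors' actual pipeline:

```python
# Minimal sketch: detect shifts in a machine's performance regime over time.
# `ruptures` and the PELT/penalty choices are assumptions for illustration.
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(0)
# Synthetic benchmark results with three performance regimes, e.g. before
# and after a hardware change and a later hypervisor-stack update.
signal = np.concatenate([
    rng.normal(100.0, 2.0, 300),  # regime 1: baseline throughput
    rng.normal(92.0, 2.0, 250),   # regime 2: sudden degradation
    rng.normal(105.0, 2.0, 200),  # regime 3: recovery after an upgrade
])

# PELT finds an unknown number of changepoints; `pen` trades off
# sensitivity against the number of regimes reported.
algo = rpt.Pelt(model="rbf").fit(signal.reshape(-1, 1))
print("Regime boundaries at sample indices:", algo.predict(pen=10))
```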
RCD: Rapid Close to Deadline Scheduling for Datacenter Networks
Datacenter-based Cloud Computing services provide a flexible, scalable and
yet economical infrastructure to host online services such as multimedia
streaming, email and bulk storage. Many such services perform geo-replication
to provide the necessary quality of service and reliability to users, resulting in
frequent large inter-datacenter transfers. In order to meet tenant service
level agreements (SLAs), these transfers have to be completed prior to a
deadline. In addition, WAN resources are quite scarce and costly, meaning they
should be fully utilized. Several recently proposed schemes, such as B4,
TEMPUS, and SWAN, have focused on improving the utilization of inter-datacenter
transfers through centralized scheduling; however, they fail to provide a
mechanism to guarantee that admitted requests meet their deadlines. A recent
study proposed Amoeba, a system that allows tenants to define deadlines and
guarantees that those deadlines are met; however, to admit new traffic, Amoeba
has to modify the allocations of already admitted transfers. In this paper, we
propose Rapid Close to Deadline Scheduling (RCD), a close-to-deadline traffic
allocation technique that is fast
and efficient. Through simulations, we show that RCD is up to 15 times faster
than Amoeba, provides high link utilization along with deadline guarantees, and
is able to make quick decisions on whether a new request can be fully satisfied
before its deadline.
Comment: World Automation Congress (WAC), IEEE, 201
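The abstract does not spell out the allocation rule, but the name suggests the core idea: place a new transfer's traffic in the timeslots nearest its deadline, leaving earlier slots free for future requests. Below is a minimal single-link sketch of that principle; the function, slot model, and numbers are hypothetical, not the authors' implementation:

```python
# Sketch of close-to-deadline allocation on one link with unit timeslots.
def allocate_close_to_deadline(residual, demand, deadline):
    """residual: free capacity per timeslot (index 0 = now). Returns a
    {timeslot: amount} allocation, or None if the request cannot be
    fully satisfied before its deadline (fast admission decision)."""
    if sum(residual[:deadline]) < demand:  # single-pass feasibility test
        return None
    allocation = {}
    for t in range(deadline - 1, -1, -1):  # walk backward from the deadline
        if demand <= 0:
            break
        used = min(residual[t], demand)
        if used > 0:
            allocation[t] = used
            residual[t] -= used
            demand -= used
    return allocation

# 8 timeslots of capacity 10; a request for 25 units due by slot 6 lands
# in the latest slots: {5: 10, 4: 10, 3: 5}, keeping slots 0-2 free.
print(allocate_close_to_deadline([10] * 8, demand=25, deadline=6))
```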
Enabling Work-conserving Bandwidth Guarantees for Multi-tenant Datacenters via Dynamic Tenant-Queue Binding
Today's cloud networks are shared among many tenants. Bandwidth guarantees
and work conservation are two key properties to ensure predictable performance
for tenant applications and high network utilization for providers. Despite
significant efforts, very little prior work achieves both properties
simultaneously, even though some of it claims to.
In this paper, we present QShare, an in-network based solution to achieve
bandwidth guarantees and work conservation simultaneously. QShare leverages
weighted fair queuing on commodity switches to slice network bandwidth for
tenants, and solves the challenge of queue scarcity through balanced tenant
placement and dynamic tenant-queue binding. QShare is readily implementable
with existing switching chips. We have implemented a QShare prototype and
evaluated it via both testbed experiments and simulations. Our results show
that QShare ensures bandwidth guarantees while driving network utilization to
over 91% even under unpredictable traffic demands.
Comment: The initial work was published in IEEE INFOCOM 201
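Commodity switches expose only a handful of weighted-fair queues per port, which is the queue scarcity the abstract mentions. A toy sketch of demand-driven tenant-queue binding follows; the ranking heuristic, names, and constants are assumptions, not QShare's actual placement algorithm:

```python
# Toy sketch: give the highest-demand tenants dedicated weighted-fair
# queues and fold the remaining tenants into one shared fallback queue.
NUM_QUEUES = 8  # typical per-port queue count on commodity switching chips

def bind_tenants(tenants):
    """tenants: list of (name, guarantee_mbps, demand_mbps). Returns
    per-queue weights for dedicated tenants plus the fallback weight."""
    # Rank by current demand so bursty tenants get isolation when it matters.
    ranked = sorted(tenants, key=lambda t: t[2], reverse=True)
    dedicated = {name: bw for name, bw, _ in ranked[:NUM_QUEUES - 1]}
    # The shared queue's weight is the sum of the remaining guarantees,
    # so the aggregate still receives its guaranteed share.
    shared = sum(bw for _, bw, _ in ranked[NUM_QUEUES - 1:])
    return dedicated, shared

demands = [900, 50, 700, 10, 300, 5, 420, 80, 60, 200]
dedicated, shared = bind_tenants([(f"T{i}", 100, d) for i, d in enumerate(demands)])
print(dedicated, "shared weight:", shared)  # 7 dedicated queues; rest share 300
```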
ATP: a Datacenter Approximate Transmission Protocol
Many datacenter applications such as machine learning and streaming systems
do not need the complete set of data to perform their computation. Current
approximate applications in datacenters run on a reliable network layer like
TCP. To improve performance, they either let the sender select a subset of the
data and transmit it to the receiver, or transmit all the data and let the
receiver drop some of it. These approaches are network-oblivious and transmit
unnecessarily large amounts of data, affecting both application runtime and
network bandwidth usage. On the other hand, running approximate applications on
a lossy network over UDP cannot guarantee the accuracy of the application's
computation. We propose to run
approximate applications on a lossy network and to allow packet loss in a
controlled manner. Specifically, we designed a new network protocol called
Approximate Transmission Protocol, or ATP, for datacenter approximate
applications. ATP opportunistically exploits available network bandwidth as
much as possible, while running a loss-based rate control algorithm to avoid
bandwidth waste and retransmission. It also ensures fair bandwidth sharing
across flows and improves the performance of accurate applications by leaving more
switch buffer space to accurate flows. We evaluated ATP with both simulation
and a real implementation, using two macro-benchmarks and two real applications,
Apache Kafka and Flink. Our evaluation results show that ATP reduces
application runtime by 13.9% to 74.6% compared to a TCP-based solution that
drops packets at the sender, and improves accuracy by up to 94.0% compared to
UDP.
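The loss-based rate control ATP performs is not specified in the abstract; an AIMD-style controller that targets an application-tolerable loss budget is one plausible shape. The constants and control law below are illustrative assumptions, not ATP's actual algorithm:

```python
# Sketch: probe for bandwidth while loss stays within the application's
# tolerable budget; back off multiplicatively once it is exceeded.
def adjust_rate(rate_mbps, loss_rate, loss_budget=0.02,
                additive_step=5.0, backoff=0.8,
                min_rate=1.0, max_rate=10_000.0):
    """One control interval of an AIMD-style, loss-budgeted controller."""
    if loss_rate <= loss_budget:
        rate_mbps += additive_step  # loss within budget: exploit spare bandwidth
    else:
        rate_mbps *= backoff        # loss over budget: avoid bandwidth waste
    return max(min_rate, min(rate_mbps, max_rate))

# Example trace: loss stays low, then a congestion episode drives it up.
rate = 100.0
for loss in [0.00, 0.01, 0.015, 0.06, 0.05, 0.01]:
    rate = adjust_rate(rate, loss)
    print(f"loss={loss:.3f} -> rate={rate:.1f} Mbps")
```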