In Datacenter Performance, The Only Constant Is Change
All computing infrastructure suffers from performance variability, be it
bare-metal or virtualized. This phenomenon originates from many sources: some
transient, such as noisy neighbors, and others more permanent but sudden, such
as changes or wear in hardware, changes in the underlying hypervisor stack, or
even undocumented interactions between the policies of the computing resource
provider and the active workloads. Thus, performance measurements obtained on
clouds, HPC facilities, and, more generally, datacenter environments are almost
guaranteed to exhibit performance regimes that evolve over time, which leads to
undesirable nonstationarities in application performance. In this paper, we
present our analysis of the performance of the bare-metal hardware available
on the CloudLab testbed, where we focus on quantifying the evolving performance regimes
using changepoint detection. We describe our findings, backed by a dataset with
nearly 6.9M benchmark results collected from over 1600 machines over a period
of 2 years and 9 months. These findings yield a comprehensive characterization
of real-world performance variability patterns in one computing facility, a
methodology for studying such patterns on other infrastructures, and contribute
to a better understanding of performance variability in general.
Comment: To be presented at the 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid, http://cloudbus.org/ccgrid2020/), May 11-14, 2020, Melbourne, Victoria, Australia.
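The core technique here, changepoint detection over benchmark time series, can be illustrated with the `ruptures` Python library; the PELT algorithm, penalty value, and synthetic data below are illustrative assumptions, not the authors' actual pipeline:

```python
# Minimal sketch: detect shifts in a machine's performance regime over time.
# `ruptures` and the PELT/penalty choices are assumptions for illustration.
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(0)
# Synthetic benchmark results with three performance regimes, e.g. before
# and after a hardware change and a later hypervisor-stack update.
signal = np.concatenate([
    rng.normal(100.0, 2.0, 300),  # regime 1: baseline throughput
    rng.normal(92.0, 2.0, 250),   # regime 2: sudden degradation
    rng.normal(105.0, 2.0, 200),  # regime 3: recovery after an upgrade
])

# PELT finds an unknown number of changepoints; `pen` trades off
# sensitivity against the number of regimes reported.
algo = rpt.Pelt(model="rbf").fit(signal.reshape(-1, 1))
print("Regime boundaries at sample indices:", algo.predict(pen=10))
```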
RCD: Rapid Close to Deadline Scheduling for Datacenter Networks
Datacenter-based Cloud Computing services provide a flexible, scalable and
yet economical infrastructure to host online services such as multimedia
streaming, email and bulk storage. Many such services perform geo-replication
to provide the necessary quality of service and reliability to users, resulting in
frequent large inter-datacenter transfers. In order to meet tenant service
level agreements (SLAs), these transfers have to be completed prior to a
deadline. In addition, WAN resources are quite scarce and costly, meaning they
should be fully utilized. Several recently proposed schemes, such as B4,
TEMPUS, and SWAN, have focused on improving the utilization of inter-datacenter
transfers through centralized scheduling; however, they fail to provide a
mechanism to guarantee that admitted requests meet their deadlines. A recent
study proposed Amoeba, a system that allows tenants to define deadlines and
guarantees that those deadlines are met; however, to admit new traffic, Amoeba
has to modify the allocations of already admitted transfers. In this paper, we
propose Rapid Close to Deadline Scheduling (RCD), a close-to-deadline traffic
allocation technique that is fast
and efficient. Through simulations, we show that RCD is up to 15 times faster
than Amoeba, provides high link utilization along with deadline guarantees, and
is able to make quick decisions on whether a new request can be fully satisfied
before its deadline.
Comment: World Automation Congress (WAC), IEEE, 201
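The abstract does not spell out the allocation rule, but the name suggests the core idea: place a new transfer's traffic in the timeslots nearest its deadline, leaving earlier slots free for future requests. Below is a minimal single-link sketch of that principle; the function, slot model, and numbers are hypothetical, not the authors' implementation:

```python
# Sketch of close-to-deadline allocation on one link with unit timeslots.
def allocate_close_to_deadline(residual, demand, deadline):
    """residual: free capacity per timeslot (index 0 = now). Returns a
    {timeslot: amount} allocation, or None if the request cannot be
    fully satisfied before its deadline (fast admission decision)."""
    if sum(residual[:deadline]) < demand:  # single-pass feasibility test
        return None
    allocation = {}
    for t in range(deadline - 1, -1, -1):  # walk backward from the deadline
        if demand <= 0:
            break
        used = min(residual[t], demand)
        if used > 0:
            allocation[t] = used
            residual[t] -= used
            demand -= used
    return allocation

# 8 timeslots of capacity 10; a request for 25 units due by slot 6 lands
# in the latest slots: {5: 10, 4: 10, 3: 5}, keeping slots 0-2 free.
print(allocate_close_to_deadline([10] * 8, demand=25, deadline=6))
```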
Enabling Work-conserving Bandwidth Guarantees for Multi-tenant Datacenters via Dynamic Tenant-Queue Binding
Today's cloud networks are shared among many tenants. Bandwidth guarantees
and work conservation are two key properties to ensure predictable performance
for tenant applications and high network utilization for providers. Despite
significant efforts, very little prior work achieves both properties
simultaneously, even though some of it claims to.
In this paper, we present QShare, an in-network based solution to achieve
bandwidth guarantees and work conservation simultaneously. QShare leverages
weighted fair queuing on commodity switches to slice network bandwidth for
tenants, and solves the challenge of queue scarcity through balanced tenant
placement and dynamic tenant-queue binding. QShare is readily implementable
with existing switching chips. We have implemented a QShare prototype and
evaluated it via both testbed experiments and simulations. Our results show
that QShare ensures bandwidth guarantees while driving network utilization to
over 91% even under unpredictable traffic demands.
Comment: The initial work was published in IEEE INFOCOM 201
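Commodity switches expose only a handful of weighted-fair queues per port, which is the queue scarcity the abstract mentions. A toy sketch of demand-driven tenant-queue binding follows; the ranking heuristic, names, and constants are assumptions, not QShare's actual placement algorithm:

```python
# Toy sketch: give the highest-demand tenants dedicated weighted-fair
# queues and fold the remaining tenants into one shared fallback queue.
NUM_QUEUES = 8  # typical per-port queue count on commodity switching chips

def bind_tenants(tenants):
    """tenants: list of (name, guarantee_mbps, demand_mbps). Returns
    per-queue weights for dedicated tenants plus the fallback weight."""
    # Rank by current demand so bursty tenants get isolation when it matters.
    ranked = sorted(tenants, key=lambda t: t[2], reverse=True)
    dedicated = {name: bw for name, bw, _ in ranked[:NUM_QUEUES - 1]}
    # The shared queue's weight is the sum of the remaining guarantees,
    # so the aggregate still receives its guaranteed share.
    shared = sum(bw for _, bw, _ in ranked[NUM_QUEUES - 1:])
    return dedicated, shared

demands = [900, 50, 700, 10, 300, 5, 420, 80, 60, 200]
dedicated, shared = bind_tenants([(f"T{i}", 100, d) for i, d in enumerate(demands)])
print(dedicated, "shared weight:", shared)  # 7 dedicated queues; rest share 300
```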
ATP: a Datacenter Approximate Transmission Protocol
Many datacenter applications such as machine learning and streaming systems
do not need the complete set of data to perform their computation. Current
approximate applications in datacenters run on a reliable network layer like
TCP. To improve performance, they either let the sender select a subset of the
data and transmit it to the receiver, or transmit all the data and let the
receiver drop some of it. These approaches are network-oblivious and transmit
unnecessarily large amounts of data, affecting both application runtime and
network bandwidth usage. On the other hand, running approximate applications on
a lossy network over UDP cannot guarantee the accuracy of the application's
computation. We propose to run
approximate applications on a lossy network and to allow packet loss in a
controlled manner. Specifically, we designed a new network protocol called
Approximate Transmission Protocol, or ATP, for datacenter approximate
applications. ATP opportunistically exploits available network bandwidth as
much as possible, while running a loss-based rate control algorithm to avoid
bandwidth waste and retransmission. It also ensures fair bandwidth sharing
across flows and improves the performance of accurate applications by leaving more
switch buffer space to accurate flows. We evaluated ATP with both simulation
and a real implementation, using two macro-benchmarks and two real applications,
Apache Kafka and Flink. Our evaluation results show that ATP reduces
application runtime by 13.9% to 74.6% compared to a TCP-based solution that
drops packets at the sender, and improves accuracy by up to 94.0% compared to
UDP.
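The loss-based rate control ATP performs is not specified in the abstract; an AIMD-style controller that targets an application-tolerable loss budget is one plausible shape. The constants and control law below are illustrative assumptions, not ATP's actual algorithm:

```python
# Sketch: probe for bandwidth while loss stays within the application's
# tolerable budget; back off multiplicatively once it is exceeded.
def adjust_rate(rate_mbps, loss_rate, loss_budget=0.02,
                additive_step=5.0, backoff=0.8,
                min_rate=1.0, max_rate=10_000.0):
    """One control interval of an AIMD-style, loss-budgeted controller."""
    if loss_rate <= loss_budget:
        rate_mbps += additive_step  # loss within budget: exploit spare bandwidth
    else:
        rate_mbps *= backoff        # loss over budget: avoid bandwidth waste
    return max(min_rate, min(rate_mbps, max_rate))

# Example trace: loss stays low, then a congestion episode drives it up.
rate = 100.0
for loss in [0.00, 0.01, 0.015, 0.06, 0.05, 0.01]:
    rate = adjust_rate(rate, loss)
    print(f"loss={loss:.3f} -> rate={rate:.1f} Mbps")
```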