Datacenter Traffic Control: Understanding Techniques and Trade-offs
Datacenters provide cost-effective and flexible access to scalable compute
and storage resources necessary for today's cloud computing needs. A typical
datacenter is made up of thousands of servers connected with a large network
and usually managed by one operator. To provide high-quality access to the
variety of applications and services hosted in datacenters and to maximize
performance, datacenter networks must be used effectively and efficiently.
Datacenter traffic is often a mix of several classes with different priorities
and requirements. This includes user-generated interactive traffic, traffic
with deadlines, and long-running traffic. To this end, custom transport
protocols and traffic management techniques have been developed to improve
datacenter network performance.
In this tutorial paper, we review the general architecture of datacenter
networks, various topologies proposed for them, their traffic properties,
general traffic control challenges in datacenters and general traffic control
objectives. The purpose of this paper is to highlight the important
characteristics of traffic control in datacenters, not to survey all existing
solutions (which is virtually impossible given the massive body of existing
research). We aim to give readers a broad view of the options and factors to
weigh when choosing among traffic control mechanisms. We discuss
various characteristics of datacenter traffic control including management
schemes, transmission control, traffic shaping, prioritization, load balancing,
multipathing, and traffic scheduling. Next, we point to several open challenges
as well as new and interesting networking paradigms. At the end of this paper,
we briefly review inter-datacenter networks, which connect geographically
dispersed datacenters, have recently received increasing attention, and pose
interesting and novel research problems.
Comment: Accepted for publication in IEEE Communications Surveys and Tutorials
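Among the mechanisms the survey covers, traffic shaping is perhaps the easiest to make concrete. The sketch below shows a minimal token-bucket shaper of the kind such surveys discuss; it is an illustration only, and the class name, rates, and sizes are hypothetical, not taken from the paper.

```python
# Illustrative sketch of traffic shaping: a token bucket admits packets at a
# configured average rate while permitting short bursts. All names and
# constants here are hypothetical.
import time

class TokenBucket:
    """Admit packets at roughly `rate_bps` on average, bursts up to `burst_bytes`."""
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0          # refill rate in bytes per second
        self.capacity = float(burst_bytes)  # maximum burst size in bytes
        self.tokens = self.capacity         # bucket starts full
        self.last = time.monotonic()

    def allow(self, packet_bytes):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True   # packet conforms: send now
        return False      # budget exhausted: queue or drop the packet

bucket = TokenBucket(rate_bps=8_000, burst_bytes=1500)
print(bucket.allow(1500))  # burst allowance covers one full-size frame
print(bucket.allow(1500))  # an immediate second frame exceeds the budget
```

A shaper like this smooths bursts before they reach the network; real datacenter shapers apply the same idea per traffic class or per flow, typically in hardware or in the hypervisor.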
An edge-queued datagram service for all datacenter traffic
Modern datacenters support a wide range of protocols and in-network switch enhancements aimed at improving performance. Unfortunately, the resulting protocols often do not coexist gracefully because they inevitably interact via queuing in the network. In this paper we describe EQDS, a new datagram service for datacenters that moves almost all of the queuing out of the core network and into the sending host. This enables it to support multiple (conflicting) higher-layer protocols, while only sending packets into the network according to a receiver-driven credit scheme. EQDS can transparently speed up legacy TCP and RDMA stacks, and enables transport protocol evolution, while benefiting from future switch enhancements without needing to modify higher-layer stacks. We show through simulation and multiple implementations that EQDS can reduce the flow completion time (FCT) of legacy TCP by 2x, improve NVMeOF-RDMA throughput by 30%, and safely run TCP alongside RDMA on the same network.
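The receiver-driven credit idea at the heart of this design can be sketched as a toy model: the sender queues packets locally and transmits only bytes for which the receiver has issued credit, so queuing stays at the edge rather than in the network core. This is an illustration of the general technique, not the EQDS implementation; all class and parameter names below are invented.

```python
# Toy sketch of a receiver-driven credit scheme (not the EQDS code):
# the sender may only transmit bytes covered by credit the receiver grants,
# so excess packets wait at the sending host instead of in network queues.
from collections import deque

class CreditReceiver:
    def __init__(self, window_bytes):
        self.window = window_bytes   # in-flight bytes the receiver will absorb

    def grant(self, requested, in_flight):
        # Issue credit only up to the spare receive window.
        return max(0, min(requested, self.window - in_flight))

class CreditSender:
    def __init__(self, receiver):
        self.receiver = receiver
        self.queue = deque()         # queuing stays at the sending host
        self.in_flight = 0
        self.sent = []

    def enqueue(self, pkt_bytes):
        self.queue.append(pkt_bytes)

    def pump(self):
        # Transmit head-of-line packets only while credit is available.
        while self.queue:
            pkt = self.queue[0]
            if self.receiver.grant(pkt, self.in_flight) < pkt:
                break                # no credit: packet waits at the edge
            self.queue.popleft()
            self.in_flight += pkt
            self.sent.append(pkt)

    def ack(self, pkt_bytes):
        self.in_flight -= pkt_bytes  # delivery frees receive window

rx = CreditReceiver(window_bytes=3000)
tx = CreditSender(rx)
for _ in range(4):
    tx.enqueue(1500)
tx.pump()
print(len(tx.sent), len(tx.queue))  # → 2 2: two frames fit the credit window
tx.ack(1500)
tx.pump()
print(len(tx.sent), len(tx.queue))  # → 3 1: freed credit releases one more
```

Because the core network only ever carries credited packets, conflicting transports multiplexed over such a service contend at the host rather than in switch buffers, which is the coexistence property the abstract emphasizes.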
Impact of RoCE Congestion Control Policies on Distributed Training of DNNs
RDMA over Converged Ethernet (RoCE) has gained significant traction in
datacenter networks due to its compatibility with conventional Ethernet-based
fabric. However, the RDMA protocol is efficient only on (nearly) lossless
networks, emphasizing the vital role of congestion control on RoCE networks.
Unfortunately, the native RoCE congestion control scheme, based on Priority
Flow Control (PFC), suffers from many drawbacks such as unfairness,
head-of-line-blocking, and deadlock. Therefore, in recent years many schemes
have been proposed to provide additional congestion control for RoCE networks
to minimize PFC drawbacks. However, these schemes are proposed for general
datacenter environments. In contrast to the general datacenters that are built
using commodity hardware and run general-purpose workloads, high-performance
distributed training platforms deploy high-end accelerators and network
components, and exclusively run training workloads that communicate through
collective (All-Reduce, All-To-All) communication libraries.
Furthermore, these platforms usually have a private network, separating their
communication traffic from the rest of the datacenter traffic. Scalable
topology-aware collective algorithms are inherently designed to avoid incast
patterns and balance traffic optimally. These distinct features necessitate
revisiting previously proposed congestion control schemes for general-purpose
datacenter environments. In this paper, we thoroughly analyze some of the
state-of-the-art RoCE congestion control schemes vs. PFC when running on
distributed training
platforms. Our results indicate that previously proposed RoCE congestion
control schemes have little impact on the end-to-end performance of training
workloads, motivating the necessity of designing an optimized, yet
low-overhead, congestion control scheme based on the characteristics of
distributed training platforms and workloads.
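To make the class of schemes the paper evaluates concrete, the sketch below shows a DCQCN-style ECN reaction loop: the sender keeps a running estimate of the marking fraction, cuts its rate multiplicatively when marks arrive, and recovers additively otherwise. This is a simplified illustration with hypothetical constants, not any specific scheme from the paper.

```python
# Illustrative DCQCN-style reaction loop (hypothetical constants):
# a RoCE sender reduces its rate multiplicatively on ECN marks and
# recovers additively when feedback intervals are unmarked.

class EcnRateController:
    def __init__(self, line_rate_gbps):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps    # current sending rate (Gbps)
        self.alpha = 0.0              # EWMA estimate of the marking fraction
        self.g = 1.0 / 16             # EWMA gain
        self.step = 0.5               # additive-increase step (Gbps)

    def on_interval(self, ecn_marked):
        # Fold the latest feedback interval into the congestion estimate.
        self.alpha = (1 - self.g) * self.alpha + self.g * (1.0 if ecn_marked else 0.0)
        if ecn_marked:
            self.rate *= (1 - self.alpha / 2)                        # back off
        else:
            self.rate = min(self.line_rate, self.rate + self.step)   # recover

cc = EcnRateController(line_rate_gbps=100)
for marked in [True, True, False, False, False]:
    cc.on_interval(marked)
print(round(cc.rate, 2))  # → 92.51: two marked intervals cut the rate,
                          # three clean intervals partially restore it
```

The paper's observation is that on private training fabrics, where topology-aware collectives already avoid incast and balance load, this kind of end-to-end reaction adds little over PFC alone.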
Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs
As communication protocols evolve, datacenter network utilization increases.
As a result, congestion is more frequent, causing higher latency and packet
loss. Combined with the increasing complexity of workloads, manual design of
congestion control (CC) algorithms becomes extremely difficult. This calls for
the development of AI approaches to replace the human effort. Unfortunately, it
is currently not possible to deploy AI models on network devices due to their
limited computational capabilities. Here, we address this problem by building
a computationally light policy based on a recent reinforcement
learning CC algorithm [arXiv:2207.02295]. We reduce the inference time of RL-CC
by 500x by distilling its complex neural network into decision trees. This
transformation enables real-time inference within the µ-sec decision-time
requirement, with a negligible effect on quality. We deploy the transformed
policy on NVIDIA NICs in a live cluster. Compared to popular CC algorithms used
in production, RL-CC is the only method that performs well on all benchmarks
tested, over a wide range of flow counts. It balances multiple metrics
simultaneously: bandwidth, latency, and packet drops. These results suggest
that data-driven methods for CC are feasible, challenging the prior belief that
handcrafted heuristics are necessary to achieve optimal performance.
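The distillation step the abstract describes, replacing a neural-network policy with a decision tree fitted to imitate it, can be sketched as follows. This is a toy illustration, not NVIDIA's RL-CC code: the "teacher" policy is an invented stand-in for a trained network, and the greedy tree fitter is a generic CART-style procedure.

```python
# Toy sketch of policy distillation (not the RL-CC implementation): fit a
# shallow decision tree to imitate a heavier teacher policy on sampled
# congestion states, so inference reduces to a few threshold comparisons.
import random

def teacher_policy(rtt_us, queue_kb):
    # Hypothetical stand-in for a neural-net CC policy: 0 = decrease rate,
    # 1 = increase rate.
    return 0 if rtt_us > 60 or queue_kb > 70 else 1

def fit_tree(X, y, depth):
    # Greedy axis-aligned splits minimising misclassification error.
    majority = int(sum(y) * 2 >= len(y))
    if depth == 0 or len(set(y)) <= 1:
        return ("leaf", majority)
    best = None  # (errors, dim, threshold)
    for dim in (0, 1):
        for thr in sorted({x[dim] for x in X}):
            left = [yy for x, yy in zip(X, y) if x[dim] <= thr]
            right = [yy for x, yy in zip(X, y) if x[dim] > thr]
            if not left or not right:
                continue
            err = (min(sum(left), len(left) - sum(left))
                   + min(sum(right), len(right) - sum(right)))
            if best is None or err < best[0]:
                best = (err, dim, thr)
    if best is None:
        return ("leaf", majority)
    _, dim, thr = best
    lx, ly, rx, ry = [], [], [], []
    for x, yy in zip(X, y):
        if x[dim] <= thr:
            lx.append(x); ly.append(yy)
        else:
            rx.append(x); ry.append(yy)
    return ("split", dim, thr,
            fit_tree(lx, ly, depth - 1), fit_tree(rx, ry, depth - 1))

def predict(tree, x):
    # Inference is a short chain of comparisons: cheap enough for a NIC.
    while tree[0] == "split":
        _, dim, thr, lo, hi = tree
        tree = lo if x[dim] <= thr else hi
    return tree[1]

random.seed(0)
X = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(400)]
y = [teacher_policy(*x) for x in X]
tree = fit_tree(X, y, depth=3)
agree = sum(predict(tree, x) == yy for x, yy in zip(X, y)) / len(X)
print(agree > 0.9)  # the shallow student closely imitates the teacher
```

The appeal of this transformation is that the student's cost model matches constrained hardware: a depth-3 tree is three comparisons per decision, regardless of how expensive the teacher network was to evaluate.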