Impact of RoCE Congestion Control Policies on Distributed Training of DNNs
RDMA over Converged Ethernet (RoCE) has gained significant traction in
datacenter networks due to its compatibility with conventional Ethernet-based
fabric. However, the RDMA protocol is efficient only on (nearly) lossless
networks, emphasizing the vital role of congestion control on RoCE networks.
Unfortunately, the native RoCE congestion control scheme, based on Priority
Flow Control (PFC), suffers from many drawbacks such as unfairness,
head-of-line-blocking, and deadlock. Therefore, in recent years many schemes
have been proposed to provide additional congestion control for RoCE networks
to minimize PFC drawbacks. However, these schemes are proposed for general
datacenter environments. In contrast to the general datacenters that are built
using commodity hardware and run general-purpose workloads, high-performance
distributed training platforms deploy high-end accelerators and network
components and exclusively run training workloads using collective
communication libraries (e.g., All-Reduce, All-To-All).
Furthermore, these platforms usually have a private network, separating their
communication traffic from the rest of the datacenter traffic. Scalable
topology-aware collective algorithms are inherently designed to avoid incast
patterns and balance traffic optimally. These distinct features necessitate
revisiting previously proposed congestion control schemes for general-purpose
datacenter environments. In this paper, we thoroughly analyze several
state-of-the-art RoCE congestion control schemes against PFC when running on distributed training
platforms. Our results indicate that previously proposed RoCE congestion
control schemes have little impact on the end-to-end performance of training
workloads, motivating the necessity of designing an optimized, yet
low-overhead, congestion control scheme based on the characteristics of
distributed training platforms and workloads.
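To make the incast-avoidance point concrete, the communication schedule of a ring All-Reduce can be sketched as below (an illustrative sketch, not taken from the paper): in every step each node sends exactly one chunk to its ring successor, so no receiver ever sees more than one incoming flow and link load stays balanced.

```python
# Sketch of a ring All-Reduce schedule (reduce-scatter phase); the
# all-gather phase repeats the same pattern. Each of the N nodes sends
# exactly one chunk per step, so there is no incast by construction.

def ring_allreduce_schedule(num_nodes: int):
    """Yield (step, sender, receiver, chunk) tuples."""
    for step in range(num_nodes - 1):
        for rank in range(num_nodes):
            chunk = (rank - step) % num_nodes
            yield step, rank, (rank + 1) % num_nodes, chunk

if __name__ == "__main__":
    for step, src, dst, chunk in ring_allreduce_schedule(4):
        print(f"step {step}: node {src} -> node {dst}, chunk {chunk}")
```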
TimeTrader: Exploiting Latency Tail to Save Datacenter Energy for On-line Data-Intensive Applications
Datacenters running on-line, data-intensive applications (OLDIs) consume
significant amounts of energy. However, reducing their energy is challenging
due to their tight response time requirements. A key aspect of OLDIs is that
each user query goes to all or many of the nodes in the cluster, so that the
overall time budget is dictated by the tail of the replies' latency
distribution; replies see latency variations both in the network and compute.
Previous work proposes to achieve load-proportional energy by slowing down the
computation at lower datacenter loads based directly on response times (i.e.,
at lower loads, the proposal exploits the average slack in the time budget
provisioned for the peak load). In contrast, we propose TimeTrader to reduce
energy by exploiting the latency slack in the sub-critical replies which
arrive before the deadline (e.g., 80% of replies are 3-4x faster than the
tail). This slack is present at all loads and subsumes the previous work's
load-related slack. While the previous work shifts the leaves' response time
distribution to consume the slack at lower loads, TimeTrader reshapes the
distribution at all loads by slowing down individual sub-critical nodes without
increasing missed deadlines. TimeTrader exploits slack in both the network and
compute budgets. Further, TimeTrader leverages Earliest Deadline First
scheduling to largely decouple critical requests from the queuing delays of
sub-critical requests, which can then be slowed down without hurting critical
requests. A combination of real-system measurements and at-scale simulations
shows that without adding to missed deadlines, TimeTrader saves 15-19% and
41-49% energy at 90% and 30% loading, respectively, in a datacenter with 512
nodes, whereas previous work saves 0% and 31-37%.
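The core slack-trading decision can be sketched as follows (a minimal sketch with illustrative names and constants, not the paper's implementation): a leaf estimates each reply's slack against its deadline and slows down only sub-critical work, while EDF dequeueing keeps critical requests ahead of the slowed ones.

```python
# Minimal sketch of TimeTrader's slack-exploiting idea (illustrative;
# the speed range and the slack estimator are assumptions, not the
# paper's actual mechanism).

def choose_speed(now: float, deadline: float, est_service_time: float,
                 full_speed: float = 1.0, min_speed: float = 0.5) -> float:
    """Return a relative CPU speed for one request."""
    slack = deadline - now - est_service_time
    if slack <= 0:
        return full_speed  # critical: no slack left, run at full speed
    # Sub-critical: stretch the service time into the slack, but keep a
    # floor so the deadline is still met with margin.
    return max(min_speed, full_speed * est_service_time
               / (est_service_time + slack))
```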
Dual Queue Coupled AQM: Deployable Very Low Queuing Delay for All
On the Internet, sub-millisecond queueing delay and capacity-seeking have
traditionally been considered mutually exclusive. We introduce a service that
offers both: Low Latency Low Loss Scalable throughput (L4S). When tested under
a wide range of conditions emulated on a testbed using real residential
broadband equipment, queue delay remained both low (median 100--300 μs) and
consistent (99th percentile below 2 ms even under highly dynamic workloads),
without compromising other metrics (zero congestion loss and close to full
utilization). L4S exploits the properties of `Scalable' congestion controls
(e.g., DCTCP, TCP Prague). Flows using such congestion control are however very
aggressive, which causes a deployment challenge as L4S has to coexist with
so-called `Classic' flows (e.g., Reno, CUBIC). This paper introduces an
architectural solution: `Dual Queue Coupled Active Queue Management', which
enables balance between Scalable and Classic flows. It counterbalances the more
aggressive response of Scalable flows with more aggressive marking, without
having to inspect flow identifiers. The Dual Queue structure has been
implemented as a Linux queuing discipline. It acts like a semi-permeable
membrane, isolating the latency of Scalable and `Classic' traffic, but coupling
their capacity into a single bandwidth pool. This paper justifies the design
and implementation choices, and visualizes a representative selection of
hundreds of thousands of experiment runs to test our claims.
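The coupling itself is compact enough to sketch (following the published DualPI2 design; this omits the PI controller, queue measurement, and overload handling): the base probability p' computed by the AQM is squared for Classic drop but only scaled by the coupling factor k for L4S marking, which offsets the more aggressive response of Scalable flows.

```python
# Sketch of the DualQ coupling law (simplified from DualPI2).

K_COUPLING = 2.0  # default coupling factor k

def classic_drop_prob(p_base: float) -> float:
    # Classic (Reno/CUBIC) flows are dropped/marked with p'^2.
    return min(1.0, p_base ** 2)

def l4s_mark_prob(p_base: float, p_l4s_native: float = 0.0) -> float:
    # L4S flows are ECN-marked with max(native L4S AQM, k * p').
    return min(1.0, max(p_l4s_native, K_COUPLING * p_base))
```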
Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs
As communication protocols evolve, datacenter network utilization increases.
As a result, congestion is more frequent, causing higher latency and packet
loss. Combined with the increasing complexity of workloads, manual design of
congestion control (CC) algorithms becomes extremely difficult. This calls for
the development of AI approaches to replace the human effort. Unfortunately, it
is currently not possible to deploy AI models on network devices due to their
limited computational capabilities. Here, we offer a solution to this problem
by building a computationally-light solution based on a recent reinforcement
learning CC algorithm [arXiv:2207.02295]. We reduce the inference time of RL-CC
by a factor of 500 by distilling its complex neural network into decision trees.
This transformation enables real-time inference within the μ-sec decision-time
requirement, with a negligible effect on quality. We deploy the transformed
policy on NVIDIA NICs in a live cluster. Compared to popular CC algorithms used
in production, RL-CC is the only method that performs well on all benchmarks
tested over a wide range of flow counts. It balances multiple metrics
simultaneously: bandwidth, latency, and packet drops. These results suggest
that data-driven methods for CC are feasible, challenging the prior belief that
handcrafted heuristics are necessary to achieve optimal performance.
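One way such a distillation step could look, sketched with scikit-learn (the paper's actual features, teacher policy, and tree configuration differ; teacher_policy and the state layout are assumptions here):

```python
# Hedged sketch: imitate an RL congestion-control policy with a small
# decision tree so inference fits a NIC's tight time budget.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def distill(teacher_policy, states: np.ndarray) -> DecisionTreeRegressor:
    """teacher_policy maps a batch of CC states (e.g. RTT, rate) to
    rate-adjustment actions; both are assumptions of this sketch."""
    actions = teacher_policy(states)            # label states via teacher
    tree = DecisionTreeRegressor(max_depth=10)  # small, fast to evaluate
    tree.fit(states, actions)
    return tree

# Usage: tree = distill(policy, logged_states)
#        action = tree.predict(state.reshape(1, -1))
```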
Transport Protocols for Data Center Communication
Data centers are becoming increasingly important, as a growing number of services are hosted primarily on them. At the same time, it is desirable to keep the costs of data centers low from several perspectives.
To this end, many changes to the data center environment can be proposed. While numerous studies focus on different aspects of data centers, one of the most important factors that can be studied and changed is the transport protocol used in the data center environment, since this choice affects many aspects of data center behavior.
For this thesis, a range of transport protocols was studied, from the broadly used TCP to several protocols designed specifically for data centers. These variants were studied for the changes they impose and the benefits they bring.
The study also makes apparent the significance of DCTCP, the most extensively studied and deployed data center transport protocol, and the positive results of its deployment. It underlines the need to understand DCTCP's behavior when coexisting with TCP, since its deployment in the wider Internet could reduce latency, losses, and buffer queue lengths.
To this end, the protocol was studied by emulating network behavior in the Mininet network emulator, and it was found that DCTCP can coexist with TCP without starving TCP traffic, provided certain parameter settings are followed.
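For context, the sender-side rule that makes DCTCP gentle enough to study alongside TCP can be sketched as follows (the standard DCTCP algorithm from the literature, not code from the thesis):

```python
# Sketch of DCTCP's window update: the ECN-marked fraction F of each
# window is smoothed into alpha, and the window is cut in proportion to
# alpha instead of being halved unconditionally as in Reno.

G = 1 / 16  # smoothing gain g

class DctcpSender:
    def __init__(self, cwnd: float = 10.0):
        self.cwnd = cwnd
        self.alpha = 0.0

    def on_window_acked(self, acked: int, marked: int) -> None:
        f = marked / max(acked, 1)                  # marked fraction F
        self.alpha = (1 - G) * self.alpha + G * f   # alpha <- (1-g)a + gF
        if marked:
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1.0                        # additive increase
```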
Datacenter Traffic Control: Understanding Techniques and Trade-offs
Datacenters provide cost-effective and flexible access to scalable compute
and storage resources necessary for today's cloud computing needs. A typical
datacenter is made up of thousands of servers connected with a large network
and usually managed by one operator. To provide quality access to the variety
of applications and services hosted on datacenters and maximize performance, it
is necessary to use datacenter networks effectively and efficiently.
Datacenter traffic is often a mix of several classes with different priorities
and requirements. This includes user-generated interactive traffic, traffic
with deadlines, and long-running traffic. To this end, custom transport
protocols and traffic management techniques have been developed to improve
datacenter network performance.
In this tutorial paper, we review the general architecture of datacenter
networks, various topologies proposed for them, their traffic properties,
general traffic control challenges in datacenters and general traffic control
objectives. The purpose of this paper is to bring out the important
characteristics of traffic control in datacenters and not to survey all
existing solutions (as it is virtually impossible due to the massive body of
existing research). We hope to provide readers with a wide range of options and
factors while considering a variety of traffic control mechanisms. We discuss
various characteristics of datacenter traffic control including management
schemes, transmission control, traffic shaping, prioritization, load balancing,
multipathing, and traffic scheduling. Next, we point to several open challenges
as well as new and interesting networking paradigms. At the end of this paper,
we briefly review inter-datacenter networks that connect geographically
dispersed datacenters which have been receiving increasing attention recently
and pose interesting and novel research problems.
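As a concrete reference point for one of the mechanisms surveyed, traffic shaping, a textbook token-bucket shaper can be sketched as follows (an illustrative example, not a scheme proposed by the paper):

```python
# Sketch of a token-bucket traffic shaper: packets are released only
# while tokens (bytes of credit) are available; tokens refill at the
# configured rate up to a burst-sized cap.

import time

class TokenBucket:
    def __init__(self, rate_bps: float, burst_bytes: float):
        self.rate = rate_bps / 8.0     # refill rate, bytes per second
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, pkt_bytes: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_bytes:
            self.tokens -= pkt_bytes
            return True                # conforms: send now
        return False                   # exceeds profile: queue or drop
```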
Measurement-Driven Algorithm and System Design for Wireless and Datacenter Networks
The growing number of mobile devices and data-intensive applications pose unique challenges for wireless access networks as well as datacenter networks that enable modern cloud-based services. With the enormous increase in volume and complexity of traffic from applications such as video streaming and cloud computing, the interconnection networks have become a major performance bottleneck. In this thesis, we study algorithms and architectures spanning several layers of the networking protocol stack that enable and accelerate novel applications and that are easily deployable and scalable. The design of these algorithms and architectures is motivated by measurements and observations in real world or experimental testbeds.
In the first part of this thesis, we address the challenge of wireless content delivery in crowded areas. We present the AMuSe system, whose objective is to enable scalable and adaptive WiFi multicast. AMuSe is based on accurate receiver feedback and incurs a small control overhead. This feedback information can be used by the multicast sender to optimize multicast service quality, e.g., by dynamically adjusting transmission bitrate. Specifically, we develop an algorithm for dynamic selection of a subset of the multicast receivers as feedback nodes which periodically send information about the channel quality to the multicast sender. Further, we describe the Multicast Dynamic Rate Adaptation (MuDRA) algorithm that utilizes AMuSe's feedback to optimally tune the physical layer multicast rate. MuDRA balances fast adaptation to channel conditions and stability, which is essential for multimedia applications.
We implemented the AMuSe system on the ORBIT testbed and evaluated its performance in large groups with approximately 200 WiFi nodes. Our extensive experiments demonstrate that AMuSe can provide accurate feedback in a dense multicast environment. It outperforms several alternatives even in the case of external interference and changing network conditions. Further, our experimental evaluation of MuDRA on the ORBIT testbed shows that MuDRA outperforms other schemes and supports high throughput multicast flows to hundreds of nodes while meeting quality requirements. As an example application, MuDRA can support multiple high quality video streams, where 90% of the nodes report excellent or very good video quality.
Next, we specifically focus on ensuring high Quality of Experience (QoE) for video streaming over WiFi multicast. We formulate the problem of joint adaptation of multicast transmission rate and video rate for ensuring high video QoE as a utility maximization problem and propose an online control algorithm called DYVR which is based on Lyapunov optimization techniques. We evaluated the performance of DYVR through analysis, simulations, and experiments using a testbed composed of Android devices and off-the-shelf APs. Our evaluation shows that DYVR can ensure high video rates while guaranteeing a low but acceptable number of segment losses, buffer underflows, and video rate switches.
We leverage the lessons learnt from AMuSe for WiFi to address the performance issues with LTE evolved Multimedia Broadcast/Multicast Service (eMBMS). We present the Dynamic Monitoring (DyMo) system which provides low-overhead and real-time feedback about eMBMS performance. DyMo employs eMBMS for broadcasting instructions which indicate the reporting rates as a function of the observed Quality of Service (QoS) for each UE. This simple feedback mechanism collects very limited QoS reports which can be used for network optimization. We evaluated the performance of DyMo analytically and via simulations. DyMo infers the optimal eMBMS settings with extremely low overhead, while meeting strict QoS requirements under different UE mobility patterns and presence of network component failures.
In the second part of the thesis, we study datacenter networks which are key enablers of the end-user applications such as video streaming and storage. Datacenter applications such as distributed file systems, one-to-many virtual machine migrations, and large-scale data processing involve bulk multicast flows. We propose a hardware and software system for enabling physical layer optical multicast in datacenter networks using passive optical splitters. We built a prototype and developed a simulation environment to evaluate the performance of the system for bulk multicasting. Our evaluation shows that the optical multicast architecture can achieve higher throughput and lower latency than IP multicast and peer-to-peer multicast schemes with lower switching energy consumption.
Finally, we study the problem of congestion control in datacenter networks. Quantized Congestion Notification (QCN), a switch-supported standard, utilizes direct multi-bit feedback from the network for hardware rate limiting. Although QCN has been shown to be fast-reacting and effective, being a Layer-2 technology limits its adoption in IP-routed Layer-3 datacenters. We address several design challenges to overcome QCN feedback's Layer-2 limitation and use it to design window-based congestion control (QCN-CC) and load balancing (QCN-LB) schemes. Our extensive simulations, based on real-world workloads, demonstrate the advantages of explicit, multi-bit congestion feedback, especially in a typical environment where intra-datacenter traffic with short Round Trip Times (RTTs of tens of μs) runs in conjunction with web-facing traffic with long RTTs (tens of milliseconds).
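The flavor of QCN's multi-bit feedback, and how a window-based sender in the spirit of QCN-CC might react to it, can be sketched as follows (the constants and the exact window law are assumptions, not the thesis's design):

```python
# Sketch of QCN-style multi-bit feedback driving a window update.

W_DERIVATIVE = 2.0  # weight on queue growth, as in standard QCN
GD = 1 / 128        # multiplicative-decrease gain (assumed)

def qcn_feedback(qlen: int, qlen_old: int, q_eq: int) -> float:
    """Penalize both offset from the equilibrium queue q_eq and queue
    growth since the last sample; Fb < 0 signals congestion."""
    return -((qlen - q_eq) + W_DERIVATIVE * (qlen - qlen_old))

def update_window(cwnd: float, fb: float) -> float:
    if fb < 0:  # congestion: cut proportionally to |Fb| (capped)
        return max(1.0, cwnd * (1 - GD * min(abs(fb), 64)))
    return cwnd + 0.5  # no congestion: probe for more bandwidth
```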
HINT: Supporting Congestion Control Decisions with P4-driven In-Band Network Telemetry
Years of research on congestion control have highlighted how end-to-end and in-network protocols might perform poorly in some contexts. Recent advances in data plane network programmability could also bring advantages to transport protocols, enabling mining and processing of in-network congestion signals. However, the new class of machine learning-based congestion control has only partially used data from the network, favoring a more sophisticated model design but neglecting possibly precious pieces of data. In this paper, we present HINT, an in-band network telemetry architecture designed to provide insights into network congestion to the end-host TCP algorithm during the learning process. In particular, the key idea is to adapt switches' behavior via P4 and instruct them to insert simple device information, such as processing delay and queue occupancy, directly into transferred packets. Initial experimental results show that this approach comes with little network overhead but can improve the visibility and, consequently, the accuracy of the end-host's TCP decisions. At the same time, the programmability of both switches and hosts also enables customization of the default behavior as the user's needs change.
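On the end-host side, consuming such telemetry might look as follows (a rough sketch; the per-hop record layout is hypothetical, not HINT's actual P4 header format):

```python
# Sketch: parse per-hop telemetry appended by P4 switches and fold it
# into a single congestion signal for the host TCP algorithm.

import struct

HOP_FMT = "!IQ"                        # assumed record: queue depth (u32),
HOP_SIZE = struct.calcsize(HOP_FMT)    # processing delay in ns (u64)

def parse_telemetry(blob: bytes, num_hops: int):
    """Return a list of (queue_depth, proc_delay_ns) per traversed hop."""
    return [struct.unpack_from(HOP_FMT, blob, i * HOP_SIZE)
            for i in range(num_hops)]

def congestion_signal(hops) -> float:
    # Illustrative aggregate: the most congested hop dominates.
    return max((q for q, _ in hops), default=0.0)
```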