ATP: a Datacenter Approximate Transmission Protocol
Many datacenter applications such as machine learning and streaming systems
do not need the complete set of data to perform their computation. Current
approximate applications in datacenters run on a reliable network layer like
TCP. To improve performance, they either let the sender select a subset of data
and transmit it to the receiver, or transmit all the data and let the receiver
drop some of it. These approaches are network-oblivious and transmit more data
than necessary, affecting both application runtime and network bandwidth usage.
On the other hand, running approximate applications on a lossy network with UDP
cannot guarantee the accuracy of application computation. We propose to run
approximate applications on a lossy network and to allow packet loss in a
controlled manner. Specifically, we designed a new network protocol called
Approximate Transmission Protocol, or ATP, for datacenter approximate
applications. ATP opportunistically exploits available network bandwidth as
much as possible, while performing a loss-based rate control algorithm to avoid
bandwidth waste and re-transmission. It also ensures fair bandwidth sharing
across flows and improves accurate applications' performance by leaving more
switch buffer space to accurate flows. We evaluated ATP in both simulation
and a real implementation using two macro-benchmarks and two real applications,
Apache Kafka and Flink. Our evaluation results show that ATP reduces
application runtime by 13.9% to 74.6% compared to a TCP-based solution that
drops packets at the sender, and it improves accuracy by up to 94.0% compared to
UDP.
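The abstract does not spell out ATP's rate control, so the following is only a generic sketch of what a loss-based rate controller that tolerates a bounded loss rate might look like. The function name `next_rate` and every parameter (`max_loss`, `step_mbps`, `backoff`, `floor_mbps`) are illustrative assumptions, not ATP's actual algorithm or constants.

```python
def next_rate(rate_mbps, sent, lost,
              max_loss=0.10, step_mbps=10.0, backoff=0.8, floor_mbps=1.0):
    """Pick the next sending rate from the loss seen in the last interval.

    Hypothetical policy (not taken from the ATP paper): probe upward while
    the measured loss stays within the application's tolerance, and back
    off multiplicatively once it exceeds that tolerance.
    """
    loss = lost / sent if sent else 0.0
    if loss > max_loss:
        # Too much loss: sending faster would only waste bandwidth.
        return max(rate_mbps * backoff, floor_mbps)
    # Loss is tolerable: opportunistically claim more bandwidth.
    return rate_mbps + step_mbps
```

With these assumed defaults, an interval with 5% loss raises the rate by one step, while an interval with 20% loss cuts it to 80% of its previous value.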
Datacenter Traffic Control: Understanding Techniques and Trade-offs
Datacenters provide cost-effective and flexible access to scalable compute
and storage resources necessary for today's cloud computing needs. A typical
datacenter is made up of thousands of servers connected with a large network
and usually managed by one operator. To provide quality access to the variety
of applications and services hosted on datacenters and maximize performance, it
is necessary to use datacenter networks effectively and efficiently.
Datacenter traffic is often a mix of several classes with different priorities
and requirements. This includes user-generated interactive traffic, traffic
with deadlines, and long-running traffic. To this end, custom transport
protocols and traffic management techniques have been developed to improve
datacenter network performance.
In this tutorial paper, we review the general architecture of datacenter
networks, various topologies proposed for them, their traffic properties,
general traffic control challenges in datacenters and general traffic control
objectives. The purpose of this paper is to bring out the important
characteristics of traffic control in datacenters and not to survey all
existing solutions (as it is virtually impossible due to the massive body of
existing research). We hope to provide readers with a wide range of options and
factors while considering a variety of traffic control mechanisms. We discuss
various characteristics of datacenter traffic control including management
schemes, transmission control, traffic shaping, prioritization, load balancing,
multipathing, and traffic scheduling. Next, we point to several open challenges
as well as new and interesting networking paradigms. At the end of this paper,
we briefly review inter-datacenter networks that connect geographically
dispersed datacenters, which have been receiving increasing attention recently
and pose interesting and novel research problems.
Comment: Accepted for publication in IEEE Communications Surveys and Tutorials
Real-time Audio-Visual Media Transport over QUIC
We consider the problem of how to transport low-latency, interactive, real-time traffic over QUIC. This is needed to support applications like WebRTC, but is difficult to support due to the reliable, unframed nature of QUIC streams. We review the needs of low-latency real-time applications and how they have been supported in previous protocols, then propose a minimal set of extensions to QUIC to provide such support. Compared to a raw datagram service, our extensions provide meaningful support for partially reliable and real-time flows, in a backwards-compatible manner.
MLTCP: Congestion Control for DNN Training
We present MLTCP, a technique to augment today's congestion control
algorithms to accelerate DNN training jobs in shared GPU clusters. MLTCP
enables the communication phases of jobs that compete for network bandwidth to
interleave with each other, thereby utilizing the network efficiently. At the
heart of MLTCP lies a very simple principle based on a key conceptual insight:
DNN training flows should scale their congestion window size based on the
number of bytes sent at each training iteration. We show that integrating this
principle into today's congestion control protocols is straightforward: by
adding 30-60 lines of code to Reno, CUBIC, or DCQCN, MLTCP stabilizes flows of
different jobs into an interleaved state within a few training iterations,
regardless of the number of competing flows or the start time of each flow. Our
experiments with popular DNN training jobs demonstrate that enabling MLTCP
accelerates the average and 99th percentile training iteration time by up to 2x
and 4x, respectively.
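As a rough illustration of the stated principle, the sketch below scales a Reno-style additive increase by the fraction of the iteration's bytes already sent. The class name, the linear scaling function, and the `slope` and `intercept` values are assumptions made for this example; the abstract says only that the congestion window should scale with bytes sent per training iteration.

```python
from dataclasses import dataclass


@dataclass
class MltcpRenoSketch:
    """Reno-like window whose growth depends on per-iteration progress."""
    total_bytes_per_iter: int   # assumed known for the training job
    slope: float = 1.0          # hypothetical tuning constant
    intercept: float = 0.5      # hypothetical tuning constant
    cwnd: float = 10.0          # congestion window, in segments
    bytes_sent: int = 0         # bytes acknowledged in this iteration

    def scale(self) -> float:
        # Flows further into their communication phase grow faster.
        ratio = min(self.bytes_sent / self.total_bytes_per_iter, 1.0)
        return self.slope * ratio + self.intercept

    def on_ack(self, acked_bytes: int) -> None:
        self.bytes_sent += acked_bytes
        # Standard Reno adds 1/cwnd per ACK; here that step is scaled.
        self.cwnd += self.scale() / self.cwnd

    def on_loss(self) -> None:
        # Unmodified multiplicative decrease.
        self.cwnd = max(self.cwnd / 2.0, 1.0)

    def on_iteration_end(self) -> None:
        # Progress resets when the next training iteration starts.
        self.bytes_sent = 0
```

Under these assumptions, a flow that has sent more of its iteration's data ramps up more aggressively than one that has just started, which is one way competing jobs could drift toward interleaved communication phases.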
Cross-layer signalling and middleware: a survey for inelastic soft real-time applications in MANETs
This paper provides a review of the different cross-layer design and protocol tuning approaches that may be used to meet a growing need to support inelastic soft real-time streams in MANETs. These streams are characterised by critical timing and throughput requirements and low packet loss tolerance levels. Many cross-layer approaches exist either for provision of QoS to soft real-time streams in static wireless networks or to improve the performance of real-time and non-real-time transmissions in MANETs. The common ground and lessons learned from these approaches, with a view to the potential provision of much needed support to real-time applications in MANETs, are therefore discussed.