13 research outputs found

    Datacenter Traffic Control: Understanding Techniques and Trade-offs

    Get PDF
    Datacenters provide cost-effective and flexible access to scalable compute and storage resources necessary for today's cloud computing needs. A typical datacenter is made up of thousands of servers connected with a large network and usually managed by one operator. To provide quality access to the variety of applications and services hosted on datacenters and maximize performance, it deems necessary to use datacenter networks effectively and efficiently. Datacenter traffic is often a mix of several classes with different priorities and requirements. This includes user-generated interactive traffic, traffic with deadlines, and long-running traffic. To this end, custom transport protocols and traffic management techniques have been developed to improve datacenter network performance. In this tutorial paper, we review the general architecture of datacenter networks, various topologies proposed for them, their traffic properties, general traffic control challenges in datacenters and general traffic control objectives. The purpose of this paper is to bring out the important characteristics of traffic control in datacenters and not to survey all existing solutions (as it is virtually impossible due to massive body of existing research). We hope to provide readers with a wide range of options and factors while considering a variety of traffic control mechanisms. We discuss various characteristics of datacenter traffic control including management schemes, transmission control, traffic shaping, prioritization, load balancing, multipathing, and traffic scheduling. Next, we point to several open challenges as well as new and interesting networking paradigms. At the end of this paper, we briefly review inter-datacenter networks that connect geographically dispersed datacenters which have been receiving increasing attention recently and pose interesting and novel research problems.Comment: Accepted for Publication in IEEE Communications Surveys and Tutorial

    An edge-queued datagram service for all datacenter traffic

    Get PDF
    Modern datacenters support a wide range of protocols and in-network switch enhancements aimed at improving performance. Unfortunately, the resulting protocols often do not coexist gracefully because they inevitably interact via queuing in the network. In this paper we describe EQDS, a new datagram service for datacenters that moves almost all of the queuing out of the core network and into the sending host. This enables it to support multiple (conflicting) higher layer protocols, while only sending packets into the network according to any receiver-driven credit scheme. EQDS can transparently speed up legacy TCP and RDMA stacks, and enables transport protocol evolution, while benefiting from future switch enhancements without needing to modify higher layer stacks. We show through simulation and multiple implementations that EQDS can reduce FCT of legacy TCP by 2x, improve the NVMeOF-RDMA throughput by 30%, and safely run TCP alongside RDMA on the same network

    Consistent high performance and flexible congestion control architecture

    Get PDF
    The part of TCP software stack that controls how fast a data sender transfers packets is usually referred as congestion control, because it was originally introduced to avoid network congestion of multiple competing flows. During the recent 30 years of Internet evolution, traditional TCP congestion control architecture, though having a army of specially-engineered implementations and improvements over the original software, suffers increasingly more from surprisingly poor performance in today's complicated network conditions. We argue the traditional TCP congestion control family has little hope of achieving consistent high performance due to a fundamental architectural deficiency: hardwiring packet-level events to control responses. In this thesis, we propose Performance-oriented Congestion Control (PCC), a new congestion control architecture in which each sender continuously observes the connection between its rate control actions and empirically experienced performance, enabling it to use intelligent control algorithms to consistently adopt actions that result in high performance. We first build the above foundation of PCC architecture analytically prove the viability of this new congestion control architecture. Specifically, we show that, controversial to intuition, with certain form of utility function and a theoretically simplified rate control algorithm, selfishly competing senders converge to a fair and stable Nash Equilibrium. With this architectural and theoretical guideline, we then design and implement the first congestion control protocol in PCC family: PCC Allegro. PCC Allegro immediate demonstrates its architectural benefits with significant, often more than 10X, performance gain on a wide spectrum of challenging network conditions. With these very encouraging performance validation, we further advance PCC's architecture on both utilty function framework and the learning rate control algorithm. Taking a principled approach using online learning theory, we designed PCC Vivace with a new strictly socially concave utility function framework and a gradient-ascend based learning rate control algorithm. PCC Vivace significantly improves performance on fast-changing networks, yields better tradeoff in convergence speed and stability and better TCP friendliness comparing to PCC Allegro and other state-of-art new congestion control protocols. Moreover, PCC Vivace's expressive utility function framework can be tuned differently at different competing flows to produce predictable converged throughput ratios for each flow. This opens significant future potential for PCC Vivace in centrally control networking paradigm like Software Defined Networks (SDN). Finally, with all these research advances, we aim to push PCC architecture to production use with a a user-space tunneling proxy and successfully integration with Google's QUIC transport framework

    Network and Server Resource Management Strategies for Data Centre Infrastructures: A Survey

    Get PDF
    The advent of virtualisation and the increasing demand for outsourced, elastic compute charged on a pay-as-you-use basis has stimulated the development of large-scale Cloud Data Centres (DCs) housing tens of thousands of computer clusters. Of the signi�cant capital outlay required for building and operating such infrastructures, server and network equipment account for 45% and 15% of the total cost, respectively, making resource utilisation e�ciency paramount in order to increase the operators' Return-on-Investment (RoI). In this paper, we present an extensive survey on the management of server and network resources over virtualised Cloud DC infrastructures, highlighting key concepts and results, and critically discussing their limitations and implications for future research opportunities. We highlight the need for and bene �ts of adaptive resource provisioning that alleviates reliance on static utilisation prediction models and exploits direct measurement of resource utilisation on servers and network nodes. Coupling such distributed measurement with logically-centralised Software De�ned Networking (SDN) principles, we subsequently discuss the challenges and opportunities for converged resource management over converged ICT environments, through unifying control loops to globally orchestrate adaptive and load-sensitive resource provisioning

    Enhancing programmability for adaptive resource management in next generation data centre networks

    Get PDF
    Recently, Data Centre (DC) infrastructures have been growing rapidly to support a wide range of emerging services, and provide the underlying connectivity and compute resources that facilitate the "*-as-a-Service" model. This has led to the deployment of a multitude of services multiplexed over few, very large-scale centralised infrastructures. In order to cope with the ebb and flow of users, services and traffic, infrastructures have been provisioned for peak-demand resulting in the average utilisation of resources to be low. This overprovisionning has been further motivated by the complexity in predicting traffic demands over diverse timescales and the stringent economic impact of outages. At the same time, the emergence of Software Defined Networking (SDN), is offering new means to monitor and manage the network infrastructure to address this underutilisation. This dissertation aims to show how measurement-based resource management can improve performance and resource utilisation by adaptively tuning the infrastructure to the changing operating conditions. To achieve this dynamicity, the infrastructure must be able to centrally monitor, notify and react based on the current operating state, from per-packet dynamics to longstanding traffic trends and topological changes. However, the management and orchestration abilities of current SDN realisations is too limiting and must evolve for next generation networks. The current focus has been on logically centralising the routing and forwarding decisions. However, in order to achieve the necessary fine-grained insight, the data plane of the individual device must be programmable to collect and disseminate the metrics of interest. The results of this work demonstrates that a logically centralised controller can dynamically collect and measure network operating metrics to subsequently compute and disseminate fine-tuned environment-specific settings. They show how this approach can prevent TCP throughput incast collapse and improve TCP performance by an order of magnitude for partition-aggregate traffic patterns. Futhermore, the paradigm is generalised to show the benefits for other services widely used in DCs such as, e.g, routing, telemetry, and security

    Mitigating interconnect and end host congestion in modern networks

    Get PDF
    One of the most critical building blocks of the Internet is the mechanism to mitigate network congestion. While existing congestion control approaches have served their purpose well in the last decades, the last few years saw a significant increase in new applications and user demand, stressing the network infrastructure to the extent that new ways of handling congestion are required. This dissertation identifies the congestion problems caused by the increased scale of the network usage, both in inter-AS connects and on end hosts in data centers, and presents abstractions and frameworks that allow for improved solutions to mitigate congestion. To mitigate inter-AS congestion, we develop Unison, a framework that allows an ISP to jointly optimize its intra-domain routes and inter-domain routes, in collaboration with content providers. The basic idea is to provide the ISP operator and the neighbors of the ISP with an abstraction of the ISP network in the form of a virtual switch (vSwitch). Unison allows the ISP to provide hints to its neighbors, suggesting alternative routes that can improve their performance. We investigate how the vSwitch abstraction can be used to maximize the throughput of the ISP. To mitigate end-host congestion in data center networks, we develop a backpressure mechanism for queuing architecture in congested end hosts to cope with tens of thousands of flows. We show that current end-host mechanisms can lead to high CPU utilization, high tail latency, and low throughput in cases of congestion of egress traffic. We introduce the design, implementation, and evaluation of zero-drop networking (zD) stack, a new architecture for handling congestion of scheduled buffers. Besides queue overflow, another cause of congestion is CPU resource exhaustion. The CPU cost of processing packets in networking stacks, however, has not been fully investigated in the literature. Much of the focus of the community has been on scaling servers in terms of aggregate traffic intensity, but bottlenecks caused by the increasing number of concurrent flows have received little attention. We conduct a comprehensive analysis on the CPU cost of processing packets and identify the root cause that leads to high CPU overhead and degraded performance in terms of throughput and RTT. Our work highlights considerations beyond packets per second for the design of future stacks that scale to millions of flows.Ph.D

    Improved algorithms for TCP congestion control

    Get PDF
    Reliable and efficient data transfer on the Internet is an important issue. Since late 70’s the protocol responsible for that has been the de facto standard TCP, which has proven to be successful through out the years, its self-managed congestion control algorithms have retained the stability of the Internet for decades. However, the variety of existing new technologies such as high-speed networks (e.g. fibre optics) with high-speed long-delay set-up (e.g. cross-Atlantic links) and wireless technologies have posed lots of challenges to TCP congestion control algorithms. The congestion control research community proposed solutions to most of these challenges. This dissertation adds to the existing work by: firstly tackling the highspeed long-delay problem of TCP, we propose enhancements to one of the existing TCP variants (part of Linux kernel stack). We then propose our own variant: TCP-Gentle. Secondly, tackling the challenge of differentiating the wireless loss from congestive loss in a passive way and we propose a novel loss differentiation algorithm which quantifies the noise in packet inter arrival times and use this information together with the span (ratio of maximum to minimum packet inter arrival times) to adapt the multiplicative decrease factor according to a predefined logical formula. Finally, extending the well-known drift model of TCP to account for wireless loss and some hypothetical cases (e.g. variable multiplicative decrease), we have undertaken stability analysis for the new version of the model

    Application-Aware Network Design Using Software Defined Networking for Application Performance Optimization for Big Data and Video Streaming

    Get PDF
    Title from PDF of title page viewed October 30, 2017Dissertation advisor: Deep MedhiVitaIncludes bibliographical references (pages 122-135)Thesis (Ph.D.)--School of Computing and Engineering. University of Missouri--Kansas City, 2017This dissertation investigates improvement in application performance. For applications, we consider two classes: Hadoop MapReduce and video streaming. The Hadoop MapReduce (M/R) framework has become the de facto standard for Big Data analytics. However, the lack of network-awareness of the default MapReduce resource manager in a traditional IP network can cause unbalanced job scheduling and network bottlenecks; such factors can eventually lead to an increase in the Hadoop MapReduce job completion time. Dynamic Video streaming over the HTTP (MPEG-DASH) is becoming the defacto dominating transport for today’s video applications. It has been implemented in today’s major media carriers such as Youtube and Netflix. It enables new video applications to fully utilize the existing physical IP network infrastructure. For new 3D immersive medias such as Virtual Reality and 360-degree videos are drawing great attentions from both consumers and researchers in recent years. One of the biggest challenges in streaming such 3D media is the high band width demands and video quality. A new Tile-based video is introduced in both video codec and streaming layer to reduce the transferred media size. In this dissertation, we propose a Software-Defined Network (SDN) approach in an Application-Aware Network (AAN) platform. We first present an architecture for our approach and then show how this architecture can be applied to two aforementioned application areas. Our approach provides both underlying network functions and application level forwarding logics for Hadoop MapReduce and video streaming. By incorporating a comprehensive view of the network, the SDN controller can optimize MapReduce work loads and DASH flows for videos by application-aware traffic reroute. We quantify the improvement for both Hadoop and MPEG-DASH in terms of job completion time and user’s quality of experience (QoE), respectively. Based on our experiments, we observed that our AAN platform for Hadoop MapReduce job optimization offer a significant improvement compared to a static, traditional IP network environment by reducing job run time by 16% to 300% for various MapReduce benchmark jobs. As for MPEG-DASH based video streaming, we can increase user perceived video bitrate by 100%.Introduction -- Research survey -- Proposed architecture -- AAN-SDN for Hadoop -- Study of User QoE Improvement for Dynamic Adaptive Streaming over HTTP (MPEG-DASH) -- AAN-SDN For MPEG-DASH -- Conclusion -- Appendix A. Mininet Topology Source Code For DASH Setup -- Appendix B. Hadoop Installation Source Code -- Appendix C. Openvswitch Installation Source Code -- Appendix D. HiBench Installation Guid
    corecore