11 research outputs found
Improving software middleboxes and datacenter task schedulers
Over the last decades, shared systems have contributed to the popularity of many technologies. From Operating Systems to the Internet, they have all brought significant cost savings by allowing the underlying infrastructure to be shared. A common challenge in these systems is to ensure that resources are divided fairly without compromising utilization efficiency. In this thesis, we look at problems in two shared systems, software middleboxes and datacenter task schedulers, and propose ways of improving both efficiency and fairness. We begin by presenting Sprayer, a system that uses packet spraying to load-balance packets across cores in software middleboxes. Sprayer eliminates the imbalance problems of per-flow solutions and addresses the new challenges of handling shared flow state that come with packet spraying. We show that Sprayer significantly improves fairness and seamlessly uses the entire capacity, even when there is a single flow in the system. After that, we present Stateful Dominant Resource Fairness (SDRF), a task scheduling policy for datacenters that looks at past allocations and enforces fairness in the long run. We prove that SDRF keeps the fundamental properties of DRF, the allocation policy it builds on, while benefiting users with lower usage. To implement SDRF efficiently, we also introduce the live tree, a general-purpose data structure that keeps sorted a set of elements with predictable time-varying priorities. Our trace-driven simulations indicate that SDRF reduces users' waiting time on average. This improves fairness by increasing the number of completed tasks for users with lower demands, with small impact on high-demand users.
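The DRF policy on which SDRF builds can be sketched in a few lines: each user's dominant share is their largest per-resource fraction of the cluster, and the scheduler repeatedly serves the user with the smallest dominant share. A minimal illustration (the two-resource capacities and demands are our own example, not from the thesis, and the sketch ignores capacity exhaustion):

```python
# Dominant Resource Fairness (DRF) sketch: repeatedly grant one task to the
# user with the smallest dominant share (their largest per-resource share).
# Illustrative only; resource names and numbers are assumed for the example.
def dominant_share(alloc, capacity):
    return max(alloc[r] / capacity[r] for r in capacity)

def drf_step(allocs, demands, capacity):
    """Grant one task to the user with the lowest dominant share."""
    user = min(allocs, key=lambda u: dominant_share(allocs[u], capacity))
    for r in capacity:
        allocs[user][r] += demands[user][r]
    return user

capacity = {"cpu": 9, "mem": 18}
demands = {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}}
allocs = {u: {"cpu": 0, "mem": 0} for u in demands}
for _ in range(5):
    drf_step(allocs, demands, capacity)
# after five grants, both users end up with equal dominant shares (2/3)
```

SDRF additionally discounts past allocations over time, which is why a structure like the live tree, keeping users sorted under predictably drifting priorities, becomes necessary.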
Techniques for improving the scalability of data center networks
Data centers require highly scalable data and control planes to ensure good performance of distributed applications. Along the data plane, network throughput and latency directly impact application performance metrics. This has led researchers to propose high-bisection-bandwidth network topologies based on multi-rooted trees for data center networks. However, such topologies require efficient traffic splitting algorithms to fully utilize all available bandwidth. Along the control plane, the centralized controller of software-defined networks presents new scalability challenges. The logically centralized controller needs to scale according to network demands. Also, since all services are implemented in the centralized controller, it should allow easy integration of different types of network services.
In this dissertation, we propose techniques to address scalability challenges along the data and control planes of data center networks.
Along the data plane, we propose a fine-grained traffic splitting technique for data center networks organized as multi-rooted trees. Splitting individual flows can provide better load balance but is not preferred because of potential packet reordering, which conventional wisdom suggests may negatively interact with TCP congestion control. We demonstrate that, due to the symmetry of the network topology, TCP is able to tolerate the induced packet reordering and maintain a single estimate of the RTT.
Along the control plane, we design a scalable distributed SDN control plane architecture. We propose algorithms to evenly distribute the load among the controller nodes of the control plane. The algorithms do so by dynamically configuring the switch-to-controller-node mapping and adding/removing controller nodes in response to changing traffic patterns.
Each SDN controller platform may have different performance characteristics.
In such cases, it may be desirable to run different services on different controllers to match controller performance characteristics with service requirements. To address this problem, we propose FlowBricks, an architecture that allows network operators to compose an SDN control plane from services running on top of heterogeneous controller platforms.
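The data-plane contribution above contrasts conventional per-flow splitting with fine-grained per-packet splitting; the difference can be sketched as follows (the hash function and the four-path model are illustrative assumptions, not the dissertation's implementation):

```python
import zlib

PATHS = 4  # assumed number of equal-cost uplinks in a multi-rooted tree

def per_flow_path(pkt):
    """ECMP-style: hash the 5-tuple so every packet of a flow takes one path."""
    key = f"{pkt['src']}:{pkt['dst']}:{pkt['sport']}:{pkt['dport']}:{pkt['proto']}"
    return zlib.crc32(key.encode()) % PATHS

class PerPacketSplitter:
    """Fine-grained: spray successive packets round-robin over all paths."""
    def __init__(self):
        self.next = 0
    def path(self, pkt):            # pkt is ignored: decision is per packet
        p = self.next
        self.next = (self.next + 1) % PATHS
        return p

flow = {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 1234, "dport": 80, "proto": 6}
# per-flow: the same path for every packet of `flow`;
# per-packet: all four paths are used in turn, at the cost of reordering
```

The dissertation's point is that, in a symmetric multi-rooted tree, the reordering introduced by the second scheme stays mild enough for TCP to tolerate.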
State-Compute Replication: Parallelizing High-Speed Stateful Packet Processing
With the slowdown of Moore's law, CPU-oriented packet processing in software
will be significantly outpaced by emerging line speeds of network interface
cards (NICs). Single-core packet-processing throughput has saturated.
We consider the problem of high-speed packet processing with multiple CPU
cores. The key challenge is state: memory that multiple packets must read and
update. The prevailing method of scaling throughput with multiple cores is
state sharding: processing all packets that update the same state, i.e., all
packets of the same flow, at the same core. However, given the heavy-tailed
nature of realistic flow size
distributions, this method will be untenable in the near future, since total
throughput is severely limited by single core performance.
This paper introduces state-compute replication, a principle to scale the
throughput of a single stateful flow across multiple cores using replication.
Our design leverages a packet history sequencer running on a NIC or
top-of-the-rack switch to enable multiple cores to update state without
explicit synchronization. Our experiments with realistic data center and
wide-area Internet traces show that state-compute replication can scale total
packet-processing throughput linearly with cores, deterministically and
independent of flow size distributions, across a range of realistic
packet-processing programs.
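The sequencer idea can be sketched with a toy model (our own simplification: a single byte counter as the replicated state, and an in-process object standing in for the NIC or top-of-rack sequencer):

```python
# Sketch of state-compute replication: a sequencer stamps each dispatched
# packet with the packet history the target core has not yet seen; the core
# replays that history on its local replica, so all replicas stay consistent
# without cross-core locks. (Illustrative toy model, not the paper's system.)
class Sequencer:
    def __init__(self, n_cores):
        self.log = []                  # global packet order
        self.seen = [0] * n_cores      # log index each core has replayed up to

    def dispatch(self, pkt, core):
        self.log.append(pkt)
        history = self.log[self.seen[core]:]   # packets this core must replay
        self.seen[core] = len(self.log)
        return history

class Core:
    def __init__(self):
        self.byte_count = 0            # replicated per-flow state

    def process(self, history):
        for pkt in history:            # replay missed packets, then the new one
            self.byte_count += pkt["len"]

seq = Sequencer(n_cores=2)
cores = [Core(), Core()]
for core_id in [0, 1, 0, 1]:           # spray one flow's packets across cores
    cores[core_id].process(seq.dispatch({"len": 100}, core_id))
# core 0 has replayed packets 1-3 (300 B); core 1 has replayed all four (400 B)
```

Because every core can host a replica of any flow's state, throughput no longer hinges on the largest flow fitting on one core.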
Datacenter Traffic Control: Understanding Techniques and Trade-offs
Datacenters provide cost-effective and flexible access to scalable compute
and storage resources necessary for today's cloud computing needs. A typical
datacenter is made up of thousands of servers connected with a large network
and usually managed by one operator. To provide quality access to the variety
of applications and services hosted in datacenters and maximize performance, it
is necessary to use datacenter networks effectively and efficiently.
Datacenter traffic is often a mix of several classes with different priorities
and requirements. This includes user-generated interactive traffic, traffic
with deadlines, and long-running traffic. To this end, custom transport
protocols and traffic management techniques have been developed to improve
datacenter network performance.
In this tutorial paper, we review the general architecture of datacenter
networks, various topologies proposed for them, their traffic properties,
general traffic control challenges in datacenters and general traffic control
objectives. The purpose of this paper is to bring out the important
characteristics of traffic control in datacenters and not to survey all
existing solutions (as that is virtually impossible due to the massive body of
existing research). We hope to give readers a broad view of the options and
factors to consider when choosing among traffic control mechanisms. We discuss
various characteristics of datacenter traffic control including management
schemes, transmission control, traffic shaping, prioritization, load balancing,
multipathing, and traffic scheduling. Next, we point to several open challenges
as well as new and interesting networking paradigms. At the end of this paper,
we briefly review inter-datacenter networks that connect geographically
dispersed datacenters, which have been receiving increasing attention recently
and pose interesting and novel research problems.
Comment: Accepted for publication in IEEE Communications Surveys and Tutorials.
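Among the traffic-shaping mechanisms this kind of survey covers, the token bucket is the canonical primitive: it caps a sender's sustained rate while permitting bounded bursts. A generic sketch (not tied to any specific scheme in the paper; rates and sizes are assumed for the example):

```python
class TokenBucket:
    """Shape traffic to `rate` bytes/sec with bursts of up to `burst` bytes."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst            # start with a full bucket
        self.last = 0.0

    def allow(self, size, now):
        # refill tokens for the elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size:
            self.tokens -= size
            return True                # packet conforms: send it
        return False                   # non-conforming: queue or drop it

tb = TokenBucket(rate=1000, burst=1500)   # 1 KB/s sustained, 1500 B bursts
tb.allow(1500, now=0.0)   # full-burst packet passes
tb.allow(100, now=0.0)    # bucket now empty: held back
tb.allow(100, now=0.1)    # 100 tokens refilled after 100 ms: passes
```

Prioritization, policing, and scheduling schemes discussed in such surveys are frequently built by composing buckets like this one per class or per queue.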
Software-defined datacenter network debugging
Software-defined Networking (SDN) enables flexible network management, but as networks
evolve toward a large number of end-points with diverse network policies, higher
speed, and higher utilization, the abstraction SDN places over the network makes
monitoring and debugging network problems increasingly hard. While some problems
impact packet processing in the data plane (e.g., congestion), others cause policy
deployment failures (e.g., hardware bugs); both create inconsistency between operator
intent and actual network behavior. Existing debugging tools are not sufficient to
accurately detect, localize, and understand the root cause of problems observed in
large-scale networks; they either lack in-network resources (compute, memory, and/or
network bandwidth) or take a long time to debug network problems.
This thesis presents three debugging tools: PathDump, SwitchPointer, and Scout,
and a technique for tracing packet trajectories called CherryPick. We call for a different
approach to network monitoring and debugging: in contrast to implementing
debugging functionality entirely in-network, we should carefully partition the debugging
tasks between end-hosts and network elements. Towards this direction, we present
CherryPick, PathDump, and SwitchPointer. The core of CherryPick is to cherry-pick the
links that are key to representing an end-to-end path of a packet, and to embed
the picked link IDs into the packet's header on its way to the destination.
PathDump is an end-host based network debugger that traces packet trajectories
and exploits resources at the end-hosts to implement various monitoring and
debugging functionalities. PathDump currently runs over a real network comprising
only commodity hardware and yet can support a surprisingly large class of network
debugging problems with minimal in-network functionality.
The key contribution of SwitchPointer is to efficiently provide network visibility
to end-host based network debuggers like PathDump by using switch memory as a
"directory service": each switch, rather than storing the telemetry data necessary
for debugging functionalities, stores pointers to the end hosts where the relevant
telemetry data is stored. The key design choice of treating switch memory as a
directory service makes it possible to solve performance problems that were hard
or infeasible with existing designs.
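The directory-service split can be sketched with a toy data model (class and field names are our own, not SwitchPointer's API):

```python
# Sketch of "switch memory as a directory service": the switch keeps only
# small pointers (host IDs) per time epoch, while end hosts keep the full
# telemetry records. (Illustrative data model, not SwitchPointer itself.)
class Switch:
    def __init__(self):
        self.directory = {}            # epoch -> set of hosts holding telemetry

    def record(self, epoch, host):
        self.directory.setdefault(epoch, set()).add(host)

class Host:
    def __init__(self):
        self.telemetry = []            # heavy per-packet records live here

    def record(self, pkt_record):
        self.telemetry.append(pkt_record)

sw, h1 = Switch(), Host()
h1.record({"flow": "f1", "bytes": 64})   # full record stays at the host
sw.record(epoch=7, host="h1")            # switch stores only the pointer
# a debugger asking "who has data for epoch 7?" reads the switch's small
# directory, then fetches the heavy telemetry from the named hosts
```

Keeping only pointers in scarce switch memory is what lets visibility scale to high line rates.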
Finally, we present and solve a network policy fault localization problem that arises
in operating policy management frameworks for a production network. We develop
Scout, a fully automated system that localizes faults in a large-scale policy
deployment and further pinpoints the physical-level failures that are the most
likely cause of the observed faults.
Configurable data center switch architectures
In this thesis, we explore alternative architectures for implementing configurable data center switches, along with the advantages such switches can provide. Our first contribution centers on determining switch architectures that can be implemented on a Field Programmable Gate Array (FPGA) to provide configurable switching protocols. In the process, we identify a gap in the availability of frameworks for realistically evaluating the performance of switch architectures in data centers, and we contribute a simulation framework that relies on realistic data center traffic patterns. Our framework is then used to evaluate the performance of existing as well as newly proposed FPGA-amenable switch designs. Through collaborative work with Meng and Papaphilippou, we establish that only small- to medium-range switches can be implemented on today's FPGAs. Our second contribution is a novel switch architecture that integrates a custom in-network hardware accelerator with a generic switch to accelerate Deep Neural Network training applications in data centers. Our proposed accelerator architecture is prototyped on an FPGA, and a scalability study is conducted to demonstrate the trade-offs of an FPGA implementation compared to an ASIC implementation. In addition to the hardware prototype, we contribute a lightweight load-balancing and congestion control protocol that leverages the unique communication patterns of ML data-parallel jobs to enable fair sharing of network resources across different jobs. Our large-scale simulations demonstrate the ability of our novel switch architecture and lightweight congestion control protocol to accelerate the training time of machine learning jobs by up to 1.34x and to benefit other latency-sensitive applications by reducing their 99th-percentile completion time by up to 4.5x.
As for our final contribution, we identify the main requirements of in-network applications and propose a Network-on-Chip (NoC)-based architecture for supporting a heterogeneous set of applications. Observing the lack of tools to support such research, we provide a tool that can be used to evaluate NoC-based switch architectures.
Squeezing the most benefit from network parallelism in datacenters
One big non-blocking switch is one of the most powerful and pervasive abstractions in datacenter networking. As Moore's law begins to wane, using parallelism to scale processing units out, rather than up, is becoming exceedingly popular. The one-big-switch abstraction, for example, is typically implemented by leveraging massive degrees of parallelism behind the scenes. In particular, in today's datacenters, which exhibit a high degree of multi-pathing, each logical path between a communicating pair in the one-big-switch abstraction is mapped to a set of paths that can carry traffic in parallel. Similarly, each one-big-switch abstraction function, such as the firewall functionality, is mapped to a set of distributed hardware and software switches.
Efficiently deploying this pool of networking connectivity and preserving the functional correctness of network functions, in spite of the parallelism, are challenging. Efficiently balancing the load among multiple paths is challenging because microbursts, responsible for the majority of packet loss in datacenters today, usually last for only a few microseconds. Even the fastest traffic engineering schemes today have control loops that are several orders of magnitude slower (a few milliseconds to a few seconds), and are therefore ineffective in controlling microbursts. Correctly implementing network functions in the face of parallelism is hard because the distributed set of elements that in parallel implement a one-big-switch abstraction can inevitably have inconsistent states that may cause them to behave differently than one physical switch.
The first part of this thesis presents DRILL, a datacenter fabric for Clos networks which performs micro load balancing to distribute load as evenly as possible on microsecond timescales. To achieve this, DRILL employs packet-level decisions at each switch based on local queue occupancies and randomized algorithms to distribute load. Despite making per-packet forwarding decisions, by enforcing a tight control on queue occupancies, DRILL manages to keep the degree of packet reordering low. DRILL adapts to topological asymmetry (e.g. failures) in Clos networks by decomposing the network into symmetric components. Using a detailed switch hardware model, we simulate DRILL and show it outperforms recent edge-based load balancers particularly in the tail latency under heavy load, e.g., under 80% load, it reduces the 99.99th percentile of flow completion times of Presto and CONGA by 32% and 35%, respectively. Finally, we analyze DRILL's stability and throughput-efficiency.
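DRILL's per-switch decision combines a few random samples with memory of the previous choice; a toy version (two samples plus one remembered queue follow the spirit of the description above, but the code is our own sketch, not DRILL's implementation):

```python
import random

# Sketch of DRILL-style micro load balancing: for each packet, sample two
# random output queues plus the queue chosen last time, and forward the
# packet to the shortest. (Parameter choices are illustrative assumptions.)
def drill_choose(queues, last_choice):
    candidates = random.sample(range(len(queues)), 2) + [last_choice]
    return min(candidates, key=lambda q: queues[q])

queues = [0] * 8            # occupancies of 8 output queues at one switch
last = 0
for _ in range(1000):       # forward 1000 packets
    last = drill_choose(queues, last)
    queues[last] += 1
# purely local, per-packet decisions keep the queues closely balanced
```

Because queue occupancies stay tightly bounded, packets of one flow sprayed over different queues arrive nearly in order, which is why DRILL's reordering stays low.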
In the second part, we focus on the correctness of one-big-switch abstraction's implementation. We first show that naively using parallelism to scale networking elements can cause incorrect behavior. For example, we show that an IDS system which operates correctly as a single network element can erroneously and permanently block hosts when it is replicated. We then provide a system, COCONUT, for seamless scale-out of network forwarding elements; that is, an SDN application programmer can program to what functionally appears to be a single forwarding element, but which may be replicated behind the scenes. To do this, we identify the key property for seamless scale out, weak causality, and guarantee it through a practical and scalable implementation of vector clocks in the data plane. We build a prototype of COCONUT and experimentally demonstrate its correct behavior. We also show that its abstraction enables a more efficient implementation of seamless scale-out compared to a naive baseline.
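The weak-causality check behind COCONUT can be illustrated with standard vector-clock operations (a generic sketch; the two-replica scenario is our own example, not COCONUT's data-plane encoding):

```python
# Vector-clock sketch of weak causality: a replica may act on a packet's
# state only if every update the packet has observed is already applied
# locally. (Illustrative; replica indices and values are assumed.)
def vc_leq(a, b):
    """True if the event with clock `a` causally precedes or equals `b`."""
    return all(x <= y for x, y in zip(a, b))

def vc_merge(a, b):
    """Component-wise max: the clock after synchronizing two histories."""
    return [max(x, y) for x, y in zip(a, b)]

r1 = [1, 0]          # replica 1 has applied its first local update
r2 = [0, 0]          # replica 2 has not seen that update yet
pkt_clock = [1, 0]   # a packet that already observed replica 1's update
ready_at_r2 = vc_leq(pkt_clock, r2)    # False: acting now breaks causality
r2 = vc_merge(r2, pkt_clock)           # after synchronizing, replica 2 is ready
```

Guaranteeing this check efficiently in the data plane is what makes the replicated element indistinguishable from a single switch.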
Finally, reasoning about network behavior requires a new model that enables us to distinguish between observable and unobservable events. In the last part, we therefore present the Input/Output Automaton (IOA) model and formalize networks' behaviors. Using this framework, we prove that COCONUT enables seamless scale-out of networking elements, i.e., the user-perceived behavior of any COCONUT element implemented with a distributed set of concurrent replicas is provably indistinguishable from that of its singleton implementation.
Hybrid routing in delay tolerant networks
This work addresses the integration of today's infrastructure-based networks with infrastructure-less networks. The resulting Hybrid Routing System allows communication over both network types and can help overcome cost, communication, and overload problems. Mobility aspects arising from infrastructure-less networks are analyzed, and analytical models are developed. For the development and deployment of the Hybrid Routing System, an overlay-based framework is presented.