2,417 research outputs found

    Fabric-on-a-Chip: Toward Consolidating Packet Switching Functions on Silicon

    Get PDF
    The switching capacity of an Internet router is often dictated by the memory bandwidth required to bu¤er arriving packets. With the demand for greater capacity and improved service provisioning, inherent memory bandwidth limitations are encountered rendering input queued (IQ) switches and combined input and output queued (CIOQ) architectures more practical. Output-queued (OQ) switches, on the other hand, offer several highly desirable performance characteristics, including minimal average packet delay, controllable Quality of Service (QoS) provisioning and work-conservation under any admissible traffic conditions. However, the memory bandwidth requirements of such systems is O(NR), where N denotes the number of ports and R the data rate of each port. Clearly, for high port densities and data rates, this constraint dramatically limits the scalability of the switch. In an effort to retain the desirable attributes of output-queued switches, while significantly reducing the memory bandwidth requirements, distributed shared memory architectures, such as the parallel shared memory (PSM) switch/router, have recently received much attention. The principle advantage of the PSM architecture is derived from the use of slow-running memory units operating in parallel to distribute the memory bandwidth requirement. At the core of the PSM architecture is a memory management algorithm that determines, for each arriving packet, the memory unit in which it will be placed. However, to date, the computational complexity of this algorithm is O(N), thereby limiting the scalability of PSM switches. In an effort to overcome the scalability limitations, it is the goal of this dissertation to extend existing shared-memory architecture results while introducing the notion of Fabric on a Chip (FoC). In taking advantage of recent advancements in integrated circuit technologies, FoC aims to facilitate the consolidation of as many packet switching functions as possible on a single chip. Accordingly, this dissertation introduces a novel pipelined memory management algorithm, which plays a key role in the context of on-chip output- queued switch emulation. We discuss in detail the fundamental properties of the proposed scheme, along with hardware-based implementation results that illustrate its scalability and performance attributes. To complement the main effort and further support the notion of FoC, we provide performance analysis of output queued cell switches with heterogeneous traffic. The result is a flexible tool for obtaining bounds on the memory requirements in output queued switches under a wide range of tra¢ c scenarios. Additionally, we present a reconfigurable high-speed hardware architecture for real-time generation of packets for the various traffic scenarios. The work presented in this thesis aims at providing pragmatic foundations for designing next-generation, high-performance Internet switches and routers

    FatPaths: Routing in Supercomputers and Data Centers when Shortest Paths Fall Short

    Full text link
    We introduce FatPaths: a simple, generic, and robust routing architecture that enables state-of-the-art low-diameter topologies such as Slim Fly to achieve unprecedented performance. FatPaths targets Ethernet stacks in both HPC supercomputers as well as cloud data centers and clusters. FatPaths exposes and exploits the rich ("fat") diversity of both minimal and non-minimal paths for high-performance multi-pathing. Moreover, FatPaths uses a redesigned "purified" transport layer that removes virtually all TCP performance issues (e.g., the slow start), and incorporates flowlet switching, a technique used to prevent packet reordering in TCP networks, to enable very simple and effective load balancing. Our design enables recent low-diameter topologies to outperform powerful Clos designs, achieving 15% higher net throughput at 2x lower latency for comparable cost. FatPaths will significantly accelerate Ethernet clusters that form more than 50% of the Top500 list and it may become a standard routing scheme for modern topologies

    Performance modeling of virtual switching systems

    Get PDF
    International audienceVirtual switches are a key elements within the new paradigms of Software Defined Networking (SDN) and Network Function Virtualization (NFV). Unlike proprietary networking appliances, virtual switches come with a high level of flexibility in the management of their physical resources such as the number of CPU cores, their allocation to the switching function, and the capacities of the RX queues, which gives the opportunity for an efficient sizing of the system resources. We propose a model for the performance evaluation of a virtual switch. Our model resorts to servers with vacation to capture the involved interactions between queues resulting from the implemented polling strategies. The solution to the model is found using a simple fixed-point iteration and it provides estimates for customary performance metrics such as the attained throughput, the packet latency, the buffer occupancy and the packet loss rate. In the tens of explored examples, the predictions of the model were found to be accurate, thereby allowing their use for the purpose of sizing problems

    Segment Routing: a Comprehensive Survey of Research Activities, Standardization Efforts and Implementation Results

    Full text link
    Fixed and mobile telecom operators, enterprise network operators and cloud providers strive to face the challenging demands coming from the evolution of IP networks (e.g. huge bandwidth requirements, integration of billions of devices and millions of services in the cloud). Proposed in the early 2010s, Segment Routing (SR) architecture helps face these challenging demands, and it is currently being adopted and deployed. SR architecture is based on the concept of source routing and has interesting scalability properties, as it dramatically reduces the amount of state information to be configured in the core nodes to support complex services. SR architecture was first implemented with the MPLS dataplane and then, quite recently, with the IPv6 dataplane (SRv6). IPv6 SR architecture (SRv6) has been extended from the simple steering of packets across nodes to a general network programming approach, making it very suitable for use cases such as Service Function Chaining and Network Function Virtualization. In this paper we present a tutorial and a comprehensive survey on SR technology, analyzing standardization efforts, patents, research activities and implementation results. We start with an introduction on the motivations for Segment Routing and an overview of its evolution and standardization. Then, we provide a tutorial on Segment Routing technology, with a focus on the novel SRv6 solution. We discuss the standardization efforts and the patents providing details on the most important documents and mentioning other ongoing activities. We then thoroughly analyze research activities according to a taxonomy. We have identified 8 main categories during our analysis of the current state of play: Monitoring, Traffic Engineering, Failure Recovery, Centrally Controlled Architectures, Path Encoding, Network Programming, Performance Evaluation and Miscellaneous...Comment: SUBMITTED TO IEEE COMMUNICATIONS SURVEYS & TUTORIAL

    Architecture design and performance analysis of practical buffered-crossbar packet switches

    Get PDF
    Combined input crosspoint buffered (CICB) packet switches were introduced to relax inputoutput arbitration timing and provide high throughput under admissible traffic. However, the amount of memory required in the crossbar of an N x N switch is N2x k x L, where k is the crosspoint buffer size and needs to be of size RTT in cells, L is the packet size. RTT is the round-trip time which is defined by the distance between line cards and switch fabric. When the switch size is large or RTT is not negligible, the memory amount required makes the implementation costly or infeasible for buffered crossbar switches. To reduce the required memory amount, a family of shared memory combined-input crosspoint-buffered (SMCB) packet switches, where the crosspoint buffers are shared among inputs, are introduced in this thesis. One of the proposed switches uses a memory speedup of in and dynamic memory allocation, and the other switch avoids speedup by arbitrating the access of inputs to the crosspoint buffers. These two switches reduce the required memory of the buffered crossbar by 50% or more and achieve equivalent throughput under independent and identical traffic with uniform distributions when using random selections. The proposed mSMCB switch is extended to support differentiated services and long RTT. To support P traffic classes with different priorities, CICB switches have been reported to use N2x k x L x P amount of memory to avoid blocking of high priority cells.The proposed SMCB switch with support for differentiated services requires 1/mP of the memory amount in the buffered crossbar and achieves similar throughput performance to that of a CICB switch with similar priority management, while using no speedup in the shared memory. The throughput performance of SMCB switch with crosspoint buffers shared by inputs (I-SMCB) is studied under multicast traffic. An output-based shared-memory crosspoint buffered (O-SMCB) packet switch is proposed where the crosspoint buffers are shared by two outputs and use no speedup. The proposed O-SMCB switch provides high performance under admissible uniform and nonuniform multicast traffic models while using 50% of the memory used in CICB switches. Furthermore, the O-SMCB switch provides higher throughput than the I-SMCB switch. As SMCB switches can efficiently support an RTT twice as long as that supported by CICB switches and as the performance of SMCB switches is bounded by a matching between inputs and crosspoint buffers, a new family of CICB switches with flexible access to crosspoint buffers are proposed to support longer RTTs than SMCB switches and to provide higher throughput under a wide variety of admissible traffic models. The CICB switches with flexible access allow an input to use any available crosspoint buffer at a given output. The proposed switches reduce the required crosspoint buffer size by a factor of N , keep the service of cells in sequence, and use no speedup. This new class of switches achieve higher throughput performance than CICB switches under a large variety of traffic models, while supporting long RTTs. Crosspoint buffered switches that are implemented in single chips have limited scalability. To support a large number of ports in crosspoint buffered switches, memory-memory-memory (MMM) Clos-network switches are an alternative. The MMM switches that use minimum memory amount at the central module is studied. Although, this switch can provide a moderate throughput, MMM switch may serve cells out of sequence. As keeping cells in sequence in an MMM switch may require buffers be distributed per flow, an MMM with extended memory in the switch modules is studied. To solve the out of sequence problem in MMM switches, a queuing architecture is proposed for an MMM switch. The service of cells in sequence is analyzed

    Squeezing the most benefit from network parallelism in datacenters

    Get PDF
    One big non-blocking switch is one of the most powerful and pervasive abstractions in datacenter networking. As Moore's law begins to wane, using parallelism to scale out processing units, vs. scale them up, is becoming exceedingly popular. The one-big-switch abstraction, for example, is typically implemented via leveraging massive degrees of parallelism behind the scene. In particular, in today's datacenters that exhibit a high degree of multi-pathing, each logical path between a communicating pair in the one-big-switch abstraction is mapped to a set of paths that can carry traffic in parallel. Similarly, each one-big-switch abstraction function, such as the firewall functionality, is mapped to a set of distributed hardware and software switches. Efficiently deploying this pool of networking connectivity and preserving the functional correctness of network functions, in spite of the parallelism, are challenging. Efficiently balancing the load among multiple paths is challenging because microbursts, responsible for the majority of packet loss in datacenters today, usually last for only a few microseconds. Even the fastest traffic engineering schemes today have control loops that are several orders of magnitude slower (a few milliseconds to a few seconds), and are therefore ineffective in controlling microbursts. Correctly implementing network functions in the face of parallelism is hard because the distributed set of elements that in parallel implement a one-big-switch abstraction can inevitably have inconsistent states that may cause them to behave differently than one physical switch. The first part of this thesis presents DRILL, a datacenter fabric for Clos networks which performs micro load balancing to distribute load as evenly as possible on microsecond timescales. To achieve this, DRILL employs packet-level decisions at each switch based on local queue occupancies and randomized algorithms to distribute load. Despite making per-packet forwarding decisions, by enforcing a tight control on queue occupancies, DRILL manages to keep the degree of packet reordering low. DRILL adapts to topological asymmetry (e.g. failures) in Clos networks by decomposing the network into symmetric components. Using a detailed switch hardware model, we simulate DRILL and show it outperforms recent edge-based load balancers particularly in the tail latency under heavy load, e.g., under 80% load, it reduces the 99.99th percentile of flow completion times of Presto and CONGA by 32% and 35%, respectively. Finally, we analyze DRILL's stability and throughput-efficiency. In the second part, we focus on the correctness of one-big-switch abstraction's implementation. We first show that naively using parallelism to scale networking elements can cause incorrect behavior. For example, we show that an IDS system which operates correctly as a single network element can erroneously and permanently block hosts when it is replicated. We then provide a system, COCONUT, for seamless scale-out of network forwarding elements; that is, an SDN application programmer can program to what functionally appears to be a single forwarding element, but which may be replicated behind the scenes. To do this, we identify the key property for seamless scale out, weak causality, and guarantee it through a practical and scalable implementation of vector clocks in the data plane. We build a prototype of COCONUT and experimentally demonstrate its correct behavior. We also show that its abstraction enables a more efficient implementation of seamless scale-out compared to a naive baseline. Finally, reasoning about network behavior requires a new model that enables us to distinguish between observable and unobservable events. So in the last part, we present the Input/Output Automaton (IOA) model and formalize networks' behaviors. Using this framework, we prove that COCONUT enables seamless scale out of networking elements, i.e., the user-perceived behavior of any COCONUT element implemented with a distributed set of concurrent replicas is provably indistinguishable from its singleton implementation
    • …
    corecore