668 research outputs found

    Datacenter Traffic Control: Understanding Techniques and Trade-offs

    Get PDF
    Datacenters provide cost-effective and flexible access to scalable compute and storage resources necessary for today's cloud computing needs. A typical datacenter is made up of thousands of servers connected with a large network and usually managed by one operator. To provide quality access to the variety of applications and services hosted on datacenters and maximize performance, it deems necessary to use datacenter networks effectively and efficiently. Datacenter traffic is often a mix of several classes with different priorities and requirements. This includes user-generated interactive traffic, traffic with deadlines, and long-running traffic. To this end, custom transport protocols and traffic management techniques have been developed to improve datacenter network performance. In this tutorial paper, we review the general architecture of datacenter networks, various topologies proposed for them, their traffic properties, general traffic control challenges in datacenters and general traffic control objectives. The purpose of this paper is to bring out the important characteristics of traffic control in datacenters and not to survey all existing solutions (as it is virtually impossible due to massive body of existing research). We hope to provide readers with a wide range of options and factors while considering a variety of traffic control mechanisms. We discuss various characteristics of datacenter traffic control including management schemes, transmission control, traffic shaping, prioritization, load balancing, multipathing, and traffic scheduling. Next, we point to several open challenges as well as new and interesting networking paradigms. At the end of this paper, we briefly review inter-datacenter networks that connect geographically dispersed datacenters which have been receiving increasing attention recently and pose interesting and novel research problems.Comment: Accepted for Publication in IEEE Communications Surveys and Tutorial

    FatPaths: Routing in Supercomputers and Data Centers when Shortest Paths Fall Short

    Full text link
    We introduce FatPaths: a simple, generic, and robust routing architecture that enables state-of-the-art low-diameter topologies such as Slim Fly to achieve unprecedented performance. FatPaths targets Ethernet stacks in both HPC supercomputers as well as cloud data centers and clusters. FatPaths exposes and exploits the rich ("fat") diversity of both minimal and non-minimal paths for high-performance multi-pathing. Moreover, FatPaths uses a redesigned "purified" transport layer that removes virtually all TCP performance issues (e.g., the slow start), and incorporates flowlet switching, a technique used to prevent packet reordering in TCP networks, to enable very simple and effective load balancing. Our design enables recent low-diameter topologies to outperform powerful Clos designs, achieving 15% higher net throughput at 2x lower latency for comparable cost. FatPaths will significantly accelerate Ethernet clusters that form more than 50% of the Top500 list and it may become a standard routing scheme for modern topologies

    A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network

    Full text link
    Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over the established Fat Tree, Clos, or Dragonfly. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on deployment, management, and operational aspects of our test cluster with 200 servers and carefully analyze performance. We demonstrate techniques for simple cabling and cabling validation as well as a novel high-performance routing architecture for InfiniBand-based low-diameter topologies. Our real-world benchmarks show SF's strong performance for many modern workloads such as deep neural network training, graph analytics, or linear algebra kernels. SF outperforms non-blocking Fat Trees in scalability while offering comparable or better performance and lower cost for large network sizes. Our work can facilitate deploying SF while the associated (open-source) routing architecture is fully portable and applicable to accelerate any low-diameter interconnect

    Energy-Efficient Interconnection Networks for High-Performance Computing

    Get PDF
    In recent years, energy has become one of the most important factors for de- signing and operating large scale computing systems. This is particularly true in high-performance computing, where systems often consist of thousands of nodes. Especially after the end of Dennard’s scaling, the demand for energy- proportionality in components, where energy is depending linearly on utilization, increases continuously. As the main contributor to the overall power consumption, processors have received the main attention so far. The increasing energy proportionality of processors, however, shifts the focus to other components such as interconnection networks. Their share of the overall power consumption is expected to increase to 20% or more while other components further increase their efficiency in the near future. Hence, it is crucial to improve energy proportionality in interconnection networks likewise to reduce overall power and energy consumption. To facilitate these attempts, this work provides comprehensive studies about energy saving in interconnection networks at different levels. First, interconnection networks differ fundamentally from other components in their underlying technology. To gain a deeper understanding of these differences and to identify targets for energy savings, this work provides a detailed power analysis of current network hardware. Furthermore, various applications at different scales are analyzed regarding their communication patterns and locality properties. The findings show that communication makes up only a small fraction of the execution time and networks are actually idling most of the time. Another observation is that point-to-point communication often only occurs within various small subsets of all participants, which indicates that a coordinated mapping could further decrease network traffic. Based on these studies, three different energy-saving policies are designed, which all differ in their implementation and focus. Then, these policies are evaluated in an event-based, power-aware network simulator. While two policies that operate completely local at link level, enable significant energy savings of more than 90% in most analyses, the hybrid one does not provide further benefits despite significant additional design effort. Additionally, these studies include network design parameters, such as transition time between different link configurations, as well as the three most common topologies in supercomputing systems. The final part of this work addresses the interactions of congestion management and energy-saving policies. Although both network management strategies aim for different goals and use opposite approaches, they complement each other and can increase energy efficiency in all studies as well as improve the performance overhead as opposed to plain energy saving

    Configurable data center switch architectures

    Get PDF
    In this thesis, we explore alternative architectures for implementing con_gurable Data Center Switches along with the advantages that can be provided by such switches. Our first contribution centers around determining switch architectures that can be implemented on Field Programmable Gate Array (FPGA) to provide configurable switching protocols. In the process, we identify a gap in the availability of frameworks to realistically evaluate the performance of switch architectures in data centers and contribute a simulation framework that relies on realistic data center traffic patterns. Our framework is then used to evaluate the performance of currently existing as well as newly proposed FPGA-amenable switch designs. Through collaborative work with Meng and Papaphilippou, we establish that only small-medium range switches can be implemented on today's FPGAs. Our second contribution is a novel switch architecture that integrates a custom in-network hardware accelerator with a generic switch to accelerate Deep Neural Network training applications in data centers. Our proposed accelerator architecture is prototyped on an FPGA, and a scalability study is conducted to demonstrate the trade-offs of an FPGA implementation when compared to an ASIC implementation. In addition to the hardware prototype, we contribute a light weight load-balancing and congestion control protocol that leverages the unique communication patterns of ML data-parallel jobs to enable fair sharing of network resources across different jobs. Our large-scale simulations demonstrate the ability of our novel switch architecture and light weight congestion control protocol to both accelerate the training time of machine learning jobs by up to 1.34x and benefit other latency-sensitive applications by reducing their 99%-tile completion time by up to 4.5x. As for our final contribution, we identify the main requirements of in-network applications and propose a Network-on-Chip (NoC)-based architecture for supporting a heterogeneous set of applications. Observing the lack of tools to support such research, we provide a tool that can be used to evaluate NoC-based switch architectures.Open Acces

    Transactional Data Structures

    Get PDF

    Analysing Mechanisms for Virtual Channel Management in Low-Diameter networks

    Full text link
    To interconnect their growing number of servers, current supercomputers and data centers are starting to adopt low-diameter networks, such as HyperX, Dragonfly and Dragonfly+. These emergent topologies require balancing the load over their links and finding suitable non-minimal routing mechanisms for them becomes particularly challenging. The Valiant load balancing scheme is a very popular choice for non-minimal routing. Evolved adaptive routing mechanisms implemented in real systems are based on this Valiant scheme. All these low-diameter networks are deadlock-prone when non-minimal routing is employed. Routing deadlocks occur when packets cannot progress due to cyclic dependencies. Therefore, developing efficient deadlock-free packet routing mechanisms is critical for the progress of these emergent networks. The routing function includes the routing algorithm for path selection and the buffers management policy that dictates how packets allocate the buffers of the switches on their paths. For the same routing algorithm, a different buffer management mechanism can lead to a very different performance. Moreover, certain mechanisms considered efficient for avoiding deadlocks, may still suffer from hard to pinpoint instabilities that make erratic the network response. This paper focuses on exploring the impact of these buffers management policies on the performance of current interconnection networks, showing a 90\% of performance drop if an incorrect buffers management policy is used. Moreover, this study not only characterizes some of these undesirable scenarios but also proposes practicable solutions

    Optimizing Communication for Massively Parallel Processing

    Get PDF
    The current trends in high performance computing show that large machines with tens of thousands of processors will soon be readily available. The IBM Bluegene-L machine with 128k processors (which is currently being deployed) is an important step in this direction. In this scenario, it is going to be a significant burden for the programmer to manually scale his applications. This task of scaling involves addressing issues like load-imbalance and communication overhead. In this thesis, we explore several communication optimizations to help parallel applications to easily scale on a large number of processors. We also present automatic runtime techniques to relieve the programmer from the burden of optimizing communication in his applications. This thesis explores processor virtualization to improve communication performance in applications. With processor virtualization, the computation is mapped to virtual processors (VPs). After one VP has finished computation and is waiting for responses to its messages, another VP can compute, thus overlapping communication with computation. This overlap is only effective if the processor overhead of the communication operation is a small fraction of the total communication time. Fortunately, with network interfaces having co-processors, this happens to be true and processor virtualization has a natural advantage on such interconnects. The communication optimizations we present in this thesis, are motivated by applications such as NAMD (a classical molecular dynamics application) and CPAIMD (a quantum chemistry application). Applications like NAMD and CPAIMD consume a fair share of the time available on supercomputers. So, improving their performance would be of great value. We have successfully scaled NAMD to 1TF of peak performance on 3000 processors of PSC Lemieux, using the techniques presented in this thesis. We study both point-to-point communication and collective communication (specifically all-to-all communication). On a large number of processors all-to-all communication can take several milli-seconds to finish. With synchronous collectives defined in MPI, the processor idles while the collective messages are in flight. Therefore, we demonstrate an asynchronous collective communication framework, to let the CPU compute while the all-to-all messages are in flight. We also show that the best strategy for all-to-all communication depends on the message size, number of processors and other dynamic parameters. This suggests that these parameters can be observed at runtime and used to choose the optimal strategy for all-to-all communication. In this thesis, we demonstrate adaptive strategy switching for all-to-all communication. The communication optimization framework presented in this thesis, has been designed to optimize communication in the context of processor virtualization and dynamic migrating objects. We present the streaming strategy to optimize fine grained object-to-object communication. In this thesis, we motivate the need for hardware collectives, as processor based collectives can be delayed by intermediate that processors busy with computation. We explore a next generation interconnect that supports collectives in the switching hardware. We show the performance gains of hardware collectives through synthetic benchmarks

    FlexVC: Flexible virtual channel management in low-diameter networks

    Get PDF
    Deadlock avoidance mechanisms for lossless lowdistance networks typically increase the order of virtual channel (VC) index with each hop. This restricts the number of buffer resources depending on the routing mechanism and limits performance due to an inefficient use. Dynamic buffer organizations increase implementation complexity and only provide small gains in this context because a significant amount of buffering needs to be allocated statically to avoid congestion. We introduce FlexVC, a simple buffer management mechanism which permits a more flexible use of VCs. It combines statically partitioned buffers, opportunistic routing and a relaxed distancebased deadlock avoidance policy. FlexVC mitigates Head-of-Line blocking and reduces up to 50% the memory requirements. Simulation results in a Dragonfly network show congestion reduction and up to 37.8% throughput improvement, outperforming more complex dynamic approaches. FlexVC merges different flows of traffic in the same buffers, which in some cases makes more difficult to identify the traffic pattern in order to support nonminimal adaptive routing. An alternative denoted FlexVCminCred improves congestion sensing for adaptive routing by tracking separately packets routed minimally and nonminimally, rising throughput up to 20.4% with 25% savings in buffer area.This work has been supported by the Spanish Government (grant SEV2015-0493 of the Severo Ochoa Program), the Spanish Ministry of Economy, Industry and Competitiveness (contracts TIN2015-65316), the Spanish Research Agency (AEI/FEDER, UE - TIN2016-76635-C2-2-R), the Spanish Ministry of Education (FPU grant FPU13/00337), the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014- SGR-1272), the European Union FP7 programme (RoMoL ERC Advanced Grant GA 321253), the European HiPEAC Network of Excellence and the European Union’s Horizon 2020 research and innovation programme (Mont-Blanc project under grant agreement No 671697).Peer ReviewedPostprint (author's final draft

    DESIGN OF EFFICIENT PACKET MARKING-BASED CONGESTION MANAGEMENT TECHNIQUES FOR CLUSTER INTERCONNECTS

    Full text link
    El crecimiento de los computadores paralelos basados en redes de altas prestaciones ha aumentado el interés y esfuerzo de la comunidad investigadora en desarrollar nuevas técnicas que permitan obtener el mejor rendimiento de estas redes. En particular, el desarrollo de nuevas técnicas que permitan un encaminamiento eficiente y que reduzcan la latencia de los paquetes, aumentando así la productividad de la red. Sin embargo, una alta tasa de utilización de la red podría conllevar el que se conoce como "congestión de red", el cual puede causar una degradación del rendimiento. El control de la congestión en redes multietapa es un problema importante que no está completamente resuelto. Con el fin de evitar la degradación del rendimiento de la red cuando aparece congestión, se han propuesto diferentes mecanismos para el control de la congestión. Muchos de estos mecanismos están basados en notificación explícita de la congestión. Para este propósito, los switches detectan congestión y dependiendo de la estrategia aplicada, los paquetes son marcados con la finalidad de advertir a los nodos origenes. Como respuesta, los nodos origenes aplican acciones correctivas para ajustar su tasa de inyección de paquetes. El propósito de esta tesis es analizar las diferentes estratégias de detección y corrección de la congestión en redes multietapa, y proponer nuevos mecanismos de control de la congestión encaminados a este tipo de redes sin descarte de paquetes. Las nuevas propuestas están basadas en una estrategia más refinada de marcaje de paquetes en combinación con un conjunto de acciones correctivas justas que harán al mecanismo capaz de controlar la congestión de manera efectiva con independencia del grado de congestión y de las condiciones de tráfico.Ferrer Pérez, JL. (2012). DESIGN OF EFFICIENT PACKET MARKING-BASED CONGESTION MANAGEMENT TECHNIQUES FOR CLUSTER INTERCONNECTS [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/18197Palanci
    corecore