183 research outputs found

    Fast, Scalable, and Flexible C++ Hardware Architectures for Network Data Plane Queuing and Traffic Management

    Current and next-generation networks face major challenges in packet switching and routing in the context of software-defined networking (SDN), with a significant increase in traffic from connected users and devices and tight requirements for high-speed networking devices offering high throughput and low latency. The upcoming fifth-generation (5G) communication technology will enable applications and services such as factory automation, intelligent transportation systems, smart grids, remote surgery, and virtual/augmented reality, which demand stringent latencies on the order of 1 ms and data rates as high as 1 Gb/s. All traffic passing through the network is handled by the data plane, also called fast-path processing. The data plane contains the networking devices that handle the forwarding and processing of the different traffic flows. The demand for higher Internet bandwidth has increased the need for more powerful processors specifically designed for fast packet processing, namely network processors or network processing units (NPUs). NPUs are networking devices that can span the entire open systems interconnection (OSI) model thanks to their high-speed capabilities, processing millions of packets per second.
    Given the high-speed requirements of today's networks, NPUs must accelerate packet processing functionalities to meet the required throughput and latency, and to best support the upcoming next-generation networks. NPUs provide various sets of functionalities, from parsing and classification to queuing, traffic management, and buffering of network traffic. In this thesis, we are mainly interested in the queuing and traffic management functionalities in the context of high-speed networking devices. These functionalities are essential to provide and guarantee quality of service (QoS) to the various kinds of traffic, especially in the network data plane of NPUs, routers, and switches, and may become a bottleneck since they reside in the critical path. We present new architectures for NPUs and networking equipment in which these functionalities are integrated as external accelerators implemented on field-programmable gate arrays (FPGAs) to speed up operation. We also propose a high-level coding style that addresses the optimization problems arising from networking requirements, such as parallel processing, pipelined operation, and minimized memory access latency, leading to faster time-to-market and lower development effort than hand-written hardware description language (HDL) designs.
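    The abstract does not spell out the queue microarchitecture, so the following is only a minimal sketch of the register-array style of priority queue commonly used in C++/HLS traffic managers: entries are kept sorted in a small array so that a synthesis tool can unroll the shift loops into parallel compare-and-shift logic. The class name and depth are illustrative assumptions, not from the thesis.

```cpp
// Hedged sketch, not the thesis's actual architecture: a fixed-capacity
// priority queue in the register-array style often used for HLS-based
// traffic managers. Slots stay sorted (highest priority first); push and
// pop shift neighbours, which an HLS tool can unroll into parallel logic.
#include <cstdint>
#include <cstdio>

constexpr int DEPTH = 8;  // queue capacity (assumption for illustration)

struct Entry {
    uint32_t tag;       // packet descriptor / tag
    uint32_t priority;  // lower value = higher priority
    bool     valid;
};

class PriorityQueue {
    Entry slots[DEPTH] = {};  // valid entries packed at the front, sorted
public:
    bool push(uint32_t tag, uint32_t priority) {
        if (slots[DEPTH - 1].valid) return false;  // full
        // Shift lower-priority entries right, insert in sorted position.
        int i = DEPTH - 1;
        for (; i > 0 && (!slots[i - 1].valid || slots[i - 1].priority > priority); --i)
            slots[i] = slots[i - 1];
        slots[i] = {tag, priority, true};
        return true;
    }
    bool pop(Entry &out) {
        if (!slots[0].valid) return false;  // empty
        out = slots[0];
        for (int i = 0; i < DEPTH - 1; ++i) slots[i] = slots[i + 1];
        slots[DEPTH - 1].valid = false;
        return true;
    }
};

int main() {
    PriorityQueue q;
    q.push(/*tag=*/100, /*priority=*/5);
    q.push(200, 2);
    q.push(300, 7);
    Entry e;
    while (q.pop(e)) std::printf("tag=%u prio=%u\n", e.tag, e.priority);
}
```

    In an actual HLS flow, DEPTH would typically be a template parameter and the loops would carry unroll pragmas; plain C++ is shown so the sketch compiles and runs as-is.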

    Module-per-Object: a Human-Driven Methodology for C++-based High-Level Synthesis Design

    High-Level Synthesis (HLS) brings FPGAs to audiences previously unfamiliar with hardware design. However, achieving the highest Quality-of-Results (QoR) with HLS is still unattainable for most programmers, as it requires detailed knowledge of FPGA architecture and hardware design to produce FPGA-friendly code. Moreover, such code normally conflicts with best coding practices, which favor code reuse, modularity, and conciseness. To overcome these limitations, we propose Module-per-Object (MpO), a human-driven HLS design methodology intended for both hardware designers and software developers with limited FPGA expertise. MpO exploits modern C++ to raise the abstraction level while improving QoR, code readability, and modularity. To guide HLS designers, we present the five characteristics of MpO classes. Each characteristic exploits the power of HLS-supported modern C++ features to build C++-based hardware modules. These characteristics lead to high-quality software descriptions and efficient hardware generation. We also present a use case of MpO, where we use C++ as the intermediate language for FPGA-targeted code generation from P4, a packet-processing domain-specific language. The MpO methodology is evaluated using three design experiments: a packet parser, a flow-based traffic manager, and a digital up-converter. Based on these experiments, we show that MpO can be comparable to hand-written VHDL code while keeping a high abstraction level, a human-readable coding style, and modularity. Compared to traditional C-based HLS design, MpO leads to more efficient circuit generation, both in terms of performance and resource utilization. The MpO approach also notably improves software quality, increasing parametrization while eliminating code duplication. Comment: 9 pages. Paper accepted for publication at the 27th IEEE International Symposium on Field-Programmable Custom Computing Machines, San Diego, CA, April 28 - May 1, 2019.
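    The five MpO characteristics are not enumerated in this abstract, so the sketch below only illustrates the general flavor of a class-based, template-parametrized HLS module in modern C++: one hardware module per object, state in data members, per-cycle behavior in a single method, and widths/rates as compile-time parameters. The DownSampler module and its parameters are hypothetical, not from the paper.

```cpp
// Hedged illustration of an MpO-flavored module; names are hypothetical.
#include <cstdint>
#include <cstdio>

// One hardware module per C++ object: state lives in data members, the
// per-cycle behavior in a single public method, and widths/rates are
// compile-time template parameters instead of duplicated code.
template <unsigned WIDTH, unsigned DECIM>
class DownSampler {
    static_assert(WIDTH < 64, "illustrative width bound");
    uint32_t count = 0;  // phase within the decimation cycle
public:
    // Consumes one sample per call; returns true for the 1-of-DECIM kept.
    bool step(uint32_t sample, uint32_t &out) {
        bool keep = (count == 0);
        if (keep) out = sample & (uint32_t)((1ull << WIDTH) - 1);  // trim to WIDTH bits
        count = (count + 1 == DECIM) ? 0 : count + 1;
        return keep;
    }
};

int main() {
    DownSampler<16, 4> ds;  // 16-bit samples, keep every 4th
    for (uint32_t i = 0; i < 8; ++i) {
        uint32_t out;
        if (ds.step(i * 1000, out)) std::printf("kept %u\n", out);
    }
}
```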

    HPQS: A fast, high-capacity, hybrid priority queuing system for high-speed networking devices

    In this paper, we present a fast hybrid priority queue architecture intended for scheduling and prioritizing packets in a network data plane. Due to increasing traffic and the tight requirements of high-speed networking devices, a high-capacity priority queue with constant latency and guaranteed performance is needed. We aim to reduce latency to best support the upcoming 5G wireless standards. The proposed hybrid priority queuing system (HPQS) enables pipelined queue operations with almost constant time complexity in practice. The proposed architecture is implemented in C++ and synthesized with the Vivado High-Level Synthesis (HLS) tool. Two configurations are proposed. The first is intended for scheduling with a multi-queuing system, for which implementation results for 64 to 512 independent queues are reported. The second is intended for large-capacity priority queues, placed and routed on a ZC706 board and a XCVU440-FLGB2377-3-E Xilinx FPGA, supporting a total capacity of half a million packet tags. The reported results are compared across a range of priority queue depths and performance metrics with existing approaches. The proposed HPQS supports links operating at 40 Gb/s.
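    The HPQS microarchitecture itself is not detailed in this abstract. As a hedged illustration of how a queuing system can reach near-constant-time operation, the sketch below bins packet tags by priority level, making enqueue a single bucket index and dequeue a fixed-length scan over the levels. The class name and level count are assumptions.

```cpp
// Hedged sketch (not the paper's exact design): priority binning gives
// O(1) enqueue and a fixed LEVELS-step dequeue scan, independent of the
// number of stored tags.
#include <cstdint>
#include <cstdio>
#include <queue>

constexpr int LEVELS = 8;  // number of priority levels (assumption)

class BucketQueue {
    std::queue<uint32_t> bucket[LEVELS];  // one FIFO per priority level
public:
    void push(uint32_t tag, unsigned level) {
        if (level < LEVELS) bucket[level].push(tag);  // O(1) insert
    }
    bool pop(uint32_t &tag) {
        for (int l = 0; l < LEVELS; ++l) {  // highest priority = level 0
            if (!bucket[l].empty()) {
                tag = bucket[l].front();
                bucket[l].pop();
                return true;
            }
        }
        return false;  // all levels empty
    }
};

int main() {
    BucketQueue q;
    q.push(100, 3);
    q.push(200, 0);
    uint32_t t;
    while (q.pop(t)) std::printf("dequeued tag %u\n", t);  // 200, then 100
}
```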

    Design of testbed and emulation tools

    The research summarized was concerned with the design of testbed and emulation tools suitable for projecting, with reasonable accuracy, the expected performance of highly concurrent computing systems on large, complete applications. Such testbed and emulation tools are intended for the eventual use of those exploring new concurrent system architectures and organizations, either as users or as designers of such systems. While a range of alternatives was considered, a software-based set of hierarchical tools was chosen to provide maximum flexibility, to ease moving to new computers as technology improves, and to take advantage of the inherent reliability and availability of commercially available computing systems.

    System architecture and hardware implementations for a reconfigurable MPLS router

    With extremely wide bandwidth and good channel properties, optical fibers have brought fast and reliable data transmission to today's data communications. However, to handle the heavy traffic flowing through optical physical links, much faster processing speed is required, or congestion can take place at network nodes. Also, to provide voice, data, and all categories of multimedia services, distinguishing between different data flows is a requirement. To address these router performance, Quality of Service/Class of Service, and traffic engineering issues, Multi-Protocol Label Switching (MPLS) was proposed for IP-based internetworks. In addition, routers with flexible hardware architectures, able to support ever-evolving protocols and services without major infrastructure modification or replacement, are also desirable. Therefore, a reconfigurable hardware implementation of MPLS was proposed in this project to obtain overall fast processing speed at network nodes. The long-term goal of this project is to develop a reconfigurable MPLS router, which uniquely integrates the best features of operations conducted in software and in run-time-reconfigurable hardware. The scope of this thesis includes system architecture and service algorithm considerations, plus Verilog coding and testing for an actual device. A hardware/software co-design technique was used to partition and schedule the protocol code for execution on both a general-purpose processor and stream-based hardware. A novel RPS scheme, practically easy to build and able to realize pipelined packet-by-packet data transfer at each output, was proposed to replace traditional crossbar switching. In RPS, packets of variable lengths can be switched intelligently without performing packet segmentation and reassembly. A preliminary theoretical analysis of queuing issues was discussed, and an improved multiple-queue service scheduling policy, UD-WRR, was proposed, which can reduce packet waiting time without sacrificing performance. In order to carry out the tests appropriately, dedicated circuitry for the MPLS functional block to interface with a specific MAC chip was implemented as well. The hardware designs for all functions were realized with a single Field-Programmable Gate Array (FPGA) device. The main result presented in this thesis was the MPLS function implementation realizing a major part of layer-three routing at the reconfigurable hardware level, a significant step toward the goal of building a router that is both fast and flexible.
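    The abstract does not define UD-WRR, so the sketch below shows plain weighted round-robin, the classical policy such schemes refine: each queue is granted service for up to its weight in packets per round. Names and weights are illustrative assumptions.

```cpp
// Hedged sketch of classic weighted round-robin (the baseline UD-WRR
// improves on, per the abstract; UD-WRR's own details are not given).
#include <cstdio>
#include <queue>
#include <vector>

struct Wrr {
    std::vector<std::queue<int>> queues;  // one FIFO per flow/class
    std::vector<int> weight;              // packets served per round

    Wrr(std::vector<int> w) : queues(w.size()), weight(std::move(w)) {}

    // One full WRR round: visit every queue, serve up to its weight.
    void round() {
        for (size_t q = 0; q < queues.size(); ++q)
            for (int i = 0; i < weight[q] && !queues[q].empty(); ++i) {
                std::printf("serve pkt %d from queue %zu\n", queues[q].front(), q);
                queues[q].pop();
            }
    }
};

int main() {
    Wrr s({3, 1});  // queue 0 gets 3x the service of queue 1
    for (int p = 0; p < 4; ++p) { s.queues[0].push(p); s.queues[1].push(100 + p); }
    s.round();  // serves 0,1,2 from queue 0 and 100 from queue 1
    s.round();  // serves 3 from queue 0 and 101 from queue 1
}
```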

    Parallel and Distributed Computing

    The 14 chapters presented in this book cover a wide variety of representative works, ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of GPUs (Graphics Processing Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have a particular interest in each of these topics in parallel and distributed computing.

    A VLSI architecture for a data compression engine in a communications network

    Thesis (M.S.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1989. Includes bibliographical references. By Brian Ta-Cheng Hou.

    Optimizing Sparse Linear Algebra on Reconfigurable Architecture

    The rise of cloud computing and deep machine learning in recent years has led to a tremendous growth of workloads that are not only large, but also have highly sparse representations. A large fraction of machine learning problems are formulated as sparse linear algebra problems in which the entries in the matrices are mostly zeros. Not surprisingly, optimizing linear algebra algorithms to take advantage of this sparseness is critical for efficient computation on these large datasets. This thesis presents a detailed analysis of several approaches to sparse matrix-matrix multiplication, a core computation of linear algebra kernels. While the arithmetic count of operations on the nonzero elements of the matrices is the same regardless of the algorithm used to perform matrix-matrix multiplication, there is significant variation in the overhead of navigating the sparse data structures to match the nonzero elements with the correct indices. This work explores approaches to minimize these overheads as well as the number of memory accesses for sparse matrices. To provide concrete numbers, the thesis examines inner-product, outer-product, and row-wise algorithms on Transmuter, a flexible accelerator that can reconfigure its cache and crossbars at runtime to meet the demands of the program. This thesis shows how the reconfigurability of Transmuter can improve complex workloads that contain multiple kernels with varying compute and memory requirements, such as the GraphSAGE deep neural network and the Sinkhorn algorithm for optimal transport distance. Finally, we examine a novel Transmuter feature: register-to-register queues for efficient communication between its processing elements, enabling systolic-array-style computation for signal processing algorithms. Ph.D. thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/169877/1/dohypark_1.pd
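    As a concrete illustration of the row-wise formulation the thesis analyzes (Gustavson's algorithm), here is a minimal software sketch in CSR form: for each nonzero A(i,k), row k of B is scaled and accumulated into row i of C, so index matching happens along rows rather than by dot products. Struct and function names are mine, not from the thesis.

```cpp
// Hedged sketch of row-wise (Gustavson) SpGEMM over CSR matrices,
// using a dense accumulator per output row to match nonzero indices.
#include <cstdio>
#include <vector>

struct Csr {
    int rows, cols;
    std::vector<int>    rowptr, colidx;  // rowptr[i]..rowptr[i+1] spans row i
    std::vector<double> val;
};

Csr spgemm_rowwise(const Csr &A, const Csr &B) {
    Csr C{A.rows, B.cols, {0}, {}, {}};
    std::vector<double> acc(B.cols, 0.0);   // dense accumulator for one row of C
    std::vector<bool>   used(B.cols, false);
    for (int i = 0; i < A.rows; ++i) {
        std::vector<int> nz;                // columns touched in this row
        for (int p = A.rowptr[i]; p < A.rowptr[i + 1]; ++p) {
            int k = A.colidx[p];
            double a = A.val[p];
            for (int q = B.rowptr[k]; q < B.rowptr[k + 1]; ++q) {
                int j = B.colidx[q];
                if (!used[j]) { used[j] = true; nz.push_back(j); }
                acc[j] += a * B.val[q];     // accumulate a * B(k, j)
            }
        }
        for (int j : nz) {                  // flush accumulator into CSR C
            C.colidx.push_back(j);
            C.val.push_back(acc[j]);
            acc[j] = 0.0; used[j] = false;
        }
        C.rowptr.push_back((int)C.colidx.size());
    }
    return C;
}

int main() {  // 2x2 example: A = [[1,0],[2,3]], B = [[0,4],[5,0]]
    Csr A{2, 2, {0, 1, 3}, {0, 0, 1}, {1, 2, 3}};
    Csr B{2, 2, {0, 1, 2}, {1, 0}, {4, 5}};
    Csr C = spgemm_rowwise(A, B);
    for (int i = 0; i < C.rows; ++i)
        for (int p = C.rowptr[i]; p < C.rowptr[i + 1]; ++p)
            std::printf("C(%d,%d) = %g\n", i, C.colidx[p], C.val[p]);
}
```

    The inner-product and outer-product variants the thesis compares differ mainly in where this index matching happens: per output element and per rank-1 update, respectively, which changes the data-structure traversal overhead the thesis measures.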