23 research outputs found

    Characterizing latency overheads in the deployment of FPGA accelerators

    Get PDF
    FPGA hardware accelerators have recently enjoyed significant attention as platforms for further accelerating computation in the datacenter but they potentially add additional layers of hardware and software interfacing that can further increase communication latency. In this paper, we characterize these overheads for streaming applications where latency can be an important consideration. We examine the latency and throughput characteristics of traditional server-based PCIe connected accelerators, and the more recent approach of network attached FPGA accelerators. We additionally quantify the additional overhead introduced by virtualising accelerators on FPGAs

    HP4 High-Performance Programmable Packet Parser

    Get PDF
    Now, header parsing is the main topic in the modern network systems to support many operations such as packet processing and security functions. The header parser design has a significant effect on the network devices' performances (latency, throughput, and resource utilization). However, the header parser design suffers from a lot number of difficulties, such as the incrementing in network throughput and a variety of protocols. Therefore, the programmable hardware packet parsing is the best solution to meet the dynamic reconfiguration and speed needs. Field Programmable Gate Array (FPGA) is an appropriate device for programmable high-speed packet implementation. This paper introduces a novel FPGA High-Performance Programmable Packet Parser architecture (HP4). HP4 automatically generated by the P4 (Programming protocol-independent Packet Processors) to optimize the speed, dynamic reconfiguration, and resource consumption. The HP4 shows a pipelined packet parser dynamic reconfiguration and low latency. In addition to high throughput (over 600 Gb/s), HP4 resource utilization is less than 7.5 percent of Virtex-7 870HT, and latency is about 88 ns. HP4 can use in a high-speed dynamic packet switch and network security

    Implementation of Ultra-Low Latency and High-Speed Communication Channels for an FPGA-Based HPC Cluster

    Get PDF
    RÉSUMÉ Les clusters basés sur les FPGA bénéficient de leur flexibilité et de leurs performances en termes de puissance de calcul et de faible consommation. Et puisque la consommation de puissance devient un élément de plus en plus importants sur le marché des superordinateurs, le domaine d’exploration multi-FPGA devient chaque année plus populaire. Les performances des ordinateurs n’ont jamais cessé d’augmenter mais la latence des réseaux d’interconnexion n’a pas suivi leur taux d’amélioration. Dans le but d’augmenter le niveau d’abstraction et les fonctionnalités des interconnexions, la complexité des piles de communication atteinte à nos jours engendre des coûts et affecte la latence des communications, ce qui rend ces piles de communication très souvent inefficaces, voire inutiles. Les protocoles de communication commerciaux existants et les contrôleurs d’interfaces réseau FPGA-FPGA n’ont la performance pour supporter ni les applications à temps critique ni un partitionnement étroitement couplé des systèmes sur puce. Au lieu de cela, les approches de communication personnalisées sont souvent préférées. Dans ce travail, nous proposons une implémentation de canaux de communication à haut débit et à faible latence pour une grappe de FPGA. Le système est constitué de deux BEE3, chacun contenant 4 FPGA de la famille Virtex-5 interconnectés par une topologie en anneau. Notre approche exploite la technologie à transducteur à plusieurs gigabits par seconde pour l’obtention d’une bande passante fiable de 8Gbps. Le module de propriété intellectuelle (IP) de communication proposé permet le transfert de données entre des milliers de coprocesseurs sur le réseau, grâce à l’implémentation d’un réseau direct avec capacité de routage de paquets. Les résultats expérimentaux ont montré une latence de seulement 34 cycles d’horloge entre deux noeuds voisins, ce qui est un des plus bas parmi ceux rapportés dans la littérature. En outre, nous proposons une architecture adaptée au calcul à haute performance qui comporte un traitement extensible, parallèle et distribué. Pour une plateforme à 8 FPGA, l’architecture fournit 35.6Go/s de bande passante effective pour la mémoire externe, une bande passante globale de réseau de 128Gbps et une puissance de calcul de 8.9GFLOPS. Un solveur matrice-vecteur de grande taille est partitionné et mis en oeuvre à travers le cluster. Nous avons obtenu une performance et une efficacité de calcul concurrentielles grâce à la faible empreinte du protocole de communication entre les éléments de traitement distribués. Ce travail contribue à soutenir de nouvelles recherches dans le domaine du calcul parallèle intensif et permet le partitionnement de système sur puce à grande taille sur des clusters à base de FPGA.----------ABSTRACT An FPGA-based cluster profits from the flexibility and the performance potential FPGA technology provides. Since price and power consumption are becoming increasingly important elements in the High-Performance Computing market, the multi-FPGA exploration field is getting more popular each year. Network latency has failed to keep up with other improvements in computer performance. Complex communication stacks have sacrificed latency and increased overhead to achieve other goals, being in most of the time inefficient and unnecessary. The existing commercial offthe- shelf communication protocols and Network Interfaces Controllers for FPGA-to-FPGA interconnection lack of performance to support time-critical applications and tightly coupled System-on-Chip partitioning. Instead, custom communication approaches are preferred. In this work, ultra-low latency and high-speed communication channels for an FPGA-based cluster are presented. Two BEE3s grouping 8 FPGAs Virtex-5 interconnected in a ring topology, compose the targeting platform. Our approach exploits Multi-Gigabit Transceiver technology to achieve reliable 8Gbps channel bandwidth. The proposed communication IP supports data transfer from coprocessors over the network, by means of a direct network implementation with hop-by-hop packet routing capability. Experimental results showed a latency of only 34 clock cycles between two neighboring nodes, being one of the lowest in the literature. In addition, it is proposed an architecture suitable for High-Performance Computing which includes performing scalable, parallel, and distributed processing. For an 8 FPGAs platform, the architecture provides 35.6GB/s off-chip memory throughput, 128Gbps network aggregate bandwidth, and 8.9GFLOPS computing power. A large and dense matrix-vector solver is partitioned and implemented across the cluster. We achieved competitive performance and computational efficiency as a result of the low communication overhead among the distributed processing elements. This work contributes to support new researches on the intense parallel computing fields, and enables large System-on-Chip partitioning and scaling on FPGA-based clusters

    Modelling and characterisation of distributed hardware acceleration

    Get PDF
    Hardware acceleration has become more commonly utilised in networked computing systems. The growing complexity of applications mean that traditional CPU architectures can no longer meet stringent latency constraints. Alternative computing architectures such as GPUs and FPGAs are increasingly available, along with simpler, more software-like development flows. The work presented in this thesis characterises the overheads associated with these accelerator architectures. A holistic view encompassing both computation and communication latency must be considered. Experimental results obtained through this work show that networkattached accelerators scale better than server-hosted deployments, and that host ingestion overheads are comparable to network traversal times in some cases. Along with the choice of processing platforms, it is becoming more important to consider how workloads are partitioned and where in the network tasks are being performed. Manual allocation and evaluation of tasks to network nodes does not scale with network and workload complexity. A mathematical formulation of this problem is presented within this thesis that takes into account all relevant performance metrics. Unlike other works, this model takes into account growing hardware heterogeneity and workload complexity, and is generalisable to a range of scenarios. This model can be used in an optimisation that generates lower cost results with latency performance close to theoretical maximums compared to naive placement approaches. With the mathematical formulation and experimental results that characterise hardware accelerator overheads, the work presented in this thesis can be used to make informed design decisions about both where to allocate tasks and deploy accelerators in the network, and the associated costs

    A Modular Approach to Adaptive Reactive Streaming Systems

    Get PDF
    The latest generations of FPGA devices offer large resource counts that provide the headroom to implement large-scale and complex systems. However, there are increasing challenges for the designer, not just because of pure size and complexity, but also in harnessing effectively the flexibility and programmability of the FPGA. A central issue is the need to integrate modules from diverse sources to promote modular design and reuse. Further, the capability to perform dynamic partial reconfiguration (DPR) of FPGA devices means that implemented systems can be made reconfigurable, allowing components to be changed during operation. However, use of DPR typically requires low-level planning of the system implementation, adding to the design challenge. This dissertation presents ReShape: a high-level approach for designing systems by interconnecting modules, which gives a ‘plug and play’ look and feel to the designer, is supported by tools that carry out implementation and verification functions, and is carried through to support system reconfiguration during operation. The emphasis is on the inter-module connections and abstracting the communication patterns that are typical between modules – for example, the streaming of data that is common in many FPGA-based systems, or the reading and writing of data to and from memory modules. ShapeUp is also presented as the static precursor to ReShape. In both, the details of wiring and signaling are hidden from view, via metadata associated with individual modules. ReShape allows system reconfiguration at the module level, by supporting type checking of replacement modules and by managing the overall system implementation, via metadata associated with its FPGA floorplan. The methodology and tools have been implemented in a prototype for a broad domain-specific setting – networking systems – and have been validated on real telecommunications design projects

    Consensus protocols exploiting network programmability

    Get PDF
    Services rely on replication mechanisms to be available at all time. The service demanding high availability is replicated on a set of machines called replicas. To maintain the consistency of replicas, a consensus protocol such as Paxos or Raft is used to synchronize the replicas' state. As a result, failures of a minority of replicas will not affect the service as other non-faulty replicas continue serving requests. A consensus protocol is a procedure to achieve an agreement among processors in a distributed system involving unreliable processors. Unfortunately, achieving such an agreement involves extra processing on every request, imposing a substantial performance degradation. Consequently, performance has long been a concern for consensus protocols. Although many efforts have been made to improve consensus performance, it continues to be an important problem for researchers. This dissertation presents a novel approach to improving consensus performance. Essentially, it exploits the programmability of a new breed of network devices to accelerate consensus protocols that traditionally run on commodity servers. The benefits of using programmable network devices to run consensus protocols are twofold: The network switches process packets faster than commodity servers and consensus messages travel fewer hops in the network. It means that the system throughput is increased and the latency of requests is reduced. The evaluation of our network-accelerated consensus approach shows promising results. Individual components of our FPGA- based and switch-based consensus implementations can process 10 million and 2.5 billion consensus messages per second, respectively. Our FPGA-based system as a whole delivers 4.3 times performance of a traditional software consensus implementation. The latency is also better for our system and is only one third of the latency of the software consensus implementation when both systems are under half of their maximum throughputs. In order to drive even higher performance, we apply a partition mechanism to our switch-based system, leading to 11 times better throughput and 5 times better latency. By dynamically switching between software-based and network-based implementations, our consensus systems not only improve performance but also use energy more efficiently. Encouraged by those benefits, we developed a fault-tolerant non-volatile memory system. A prototype using software memory controller demonstrated reasonable overhead over local memory access, showing great promise as scalable main memory. Our network-based consensus approach would have a great impact in data centers. It not only improves performance of replication mechanisms which relied on consensus, but also enhances performance of services built on top of those replication mechanisms. Our approach also motivates others to move new functionalities into the network, such as, key-value store and stream processing. We expect that in the near future, applications that typically run on traditional servers will be folded into networks for performance

    On the Edge of Secure Connectivity via Software-Defined Networking

    Get PDF
    Securing communication in computer networks has been an essential feature ever since the Internet, as we know it today, was started. One of the best known and most common methods for secure communication is to use a Virtual Private Network (VPN) solution, mainly operating with an IP security (IPsec) protocol suite originally published in 1995 (RFC1825). It is clear that the Internet, and networks in general, have changed dramatically since then. In particular, the onset of the Cloud and the Internet-of-Things (IoT) have placed new demands on secure networking. Even though the IPsec suite has been updated over the years, it is starting to reach the limits of its capabilities in its present form. Recent advances in networking have thrown up Software-Defined Networking (SDN), which decouples the control and data planes, and thus centralizes the network control. SDN provides arbitrary network topologies and elastic packet forwarding that have enabled useful innovations at the network level. This thesis studies SDN-powered VPN networking and explains the benefits of this combination. Even though the main context is the Cloud, the approaches described here are also valid for non-Cloud operation and are thus suitable for a variety of other use cases for both SMEs and large corporations. In addition to IPsec, open source TLS-based VPN (e.g. OpenVPN) solutions are often used to establish secure tunnels. Research shows that a full-mesh VPN network between multiple sites can be provided using OpenVPN and it can be utilized by SDN to create a seamless, resilient layer-2 overlay for multiple purposes, including the Cloud. However, such a VPN tunnel suffers from resiliency problems and cannot meet the increasing availability requirements. The network setup proposed here is similar to Software-Defined WAN (SD-WAN) solutions and is extremely useful for applications with strict requirements for resiliency and security, even if best-effort ISP is used. IPsec is still preferred over OpenVPN for some use cases, especially by smaller enterprises. Therefore, this research also examines the possibilities for high availability, load balancing, and faster operational speeds for IPsec. We present a novel approach involving the separation of the Internet Key Exchange (IKE) and the Encapsulation Security Payload (ESP) in SDN fashion to operate from separate devices. This allows central management for the IKE while several separate ESP devices can concentrate on the heavy processing. Initially, our research relied on software solutions for ESP processing. Despite the ingenuity of the architectural concept, and although it provided high availability and good load balancing, there was no anti-replay protection. Since anti-replay protection is vital for secure communication, another approach was required. It thus became clear that the ideal solution for such large IPsec tunneling would be to have a pool of fast ESP devices, but to confine the IKE operation to a single centralized device. This would obviate the need for load balancing but still allow high availability via the device pool. The focus of this research thus turned to the study of pure hardware solutions on an FPGA, and their feasibility and production readiness for application in the Cloud context. Our research shows that FPGA works fluently in an SDN network as a standalone IPsec accelerator for ESP packets. The proposed architecture has 10 Gbps throughput, yet the latency is less than 10 µs, meaning that this architecture is especially efficient for data center use and offers increased performance and latency requirements. The high demands of the network packet processing can be met using several different approaches, so this approach is not just limited to the topics presented in this thesis. Global network traffic is growing all the time, so the development of more efficient methods and devices is inevitable. The increasing number of IoT devices will result in a lot of network traffic utilising the Cloud infrastructures in the near future. Based on the latest research, once SDN and hardware acceleration have become fully integrated into the Cloud, the future for secure networking looks promising. SDN technology will open up a wide range of new possibilities for data forwarding, while hardware acceleration will satisfy the increased performance requirements. Although it still remains to be seen whether SDN can answer all the requirements for performance, high availability and resiliency, this thesis shows that it is a very competent technology, even though we have explored only a minor fraction of its capabilities

    Design of a scalable network interface to support enhanced TCP and UDP processing for high speed networks

    Get PDF
    Communication networks have advanced rapidly in providing additional services, with improvements made to their bandwidth and the integration of advanced technology. As the speed of networks exceeds 10 Gbps, the time frame for completing the processing of TCP and UDP packets has become extremely short. The design and implementation of high performance Network Interfaces (NIs) that can support offload protocol functions for current and next-generation networks is challenging. In this thesis two software approaches are presented to enhance protocol processing of TCP and UDP in the network interface. A novel software Large Receive Offload (LRO) approach for enhancing the receiving side has been proposed. The LRO works by aggregating the incoming TCP and UDP packets into larger packets inside the NI’s buffer. The receiving side software has been improved to support out-of-order packets. The second proposed software solution is applied on the Large Send Offload (LSO). The proposed LSO function processing is implemented by segmenting TCP and UDP messages that are larger than the Maximum Transmission Unit to the Maximum Segment Size. New packet headers are generated for each new outgoing packet. A scalable programmable NI based 32-bit RISC core is presented that can support 100 Gbps network speeds. Acceleration of the processing time frame required at the NI has been implemented to prevent hazards (such as Data Hazard and Control Hazard) during the execution of the LRO and the LSO functions. An R2000/3000 RISC has been used in order to test the LRO and LSO functions and to discover the instruction set that is most suitable. Following this the VHDL NI was implemented with three pipeline RISC cores, a simple DMA controller and Content Addressable Memory. An evaluation of the desired RISC clock rate that is required to process TCP and UDP streams at 100 Gbps was conducted. It was determined that a RISC core running at 752 MHz with a DMA clock of 3753 MHz was able to process packets 512 bytes or larger fast enough to support 100 Gbps network speeds
    corecore