
    Solutions for the optimization of the software interface on an FPGA-based NIC

    The theme of the research is the study of solutions for the optimization of the software interface on FPGA-based Network Interface Cards. The research activity was carried out in the APE group at INFN (Istituto Nazionale di Fisica Nucleare), which has historically been active in the design of high-performance scalable networks for clusters of hybrid (CPU/GPU) nodes. The results of the research are validated on two projects the APE group is currently working on, both of which allow fast prototyping of solutions and hardware-software co-design: APEnet (a PCIe FPGA-based 3D torus network controller) and NaNet (an FPGA-based family of NICs mainly dedicated to real-time, low-latency computing systems such as fast control systems or High Energy Physics Data Acquisition Systems). NaNet is also used to validate a GPU-controlled device driver that improves network performance, i.e. further lowers communication latency, when used in combination with existing user-space software. This research also contributes results to the Horizon 2020 FET-HPC ExaNeSt project, which aims to prototype and develop solutions for some of the crucial problems on the way towards the production of Exascale-level supercomputers, and in which the APE group is actively contributing to the development of the network/interconnect infrastructure.
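
    A concrete, if simplified, illustration of this kind of software-interface optimization is a user-space receive loop that polls a completion queue written by the NIC into memory-mapped memory, instead of paying a system call per packet. The C sketch below is hedged: the device node /dev/nanet0, the nanet_cq_entry layout, and the queue size are hypothetical stand-ins, not the actual NaNet driver interface.

        /* Hedged sketch: a syscall-free, polling receive path over a memory-
         * mapped completion queue.  Device node and record layout are
         * hypothetical, not the real NaNet interface. */
        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        struct nanet_cq_entry {          /* illustrative completion-queue record */
            volatile uint32_t valid;     /* set by the NIC when data has landed  */
            uint32_t length;             /* received payload size in bytes       */
            uint64_t buf_offset;         /* payload offset inside the RX region  */
        };

        #define CQ_ENTRIES 256

        int main(void)
        {
            int fd = open("/dev/nanet0", O_RDWR);              /* hypothetical node */
            if (fd < 0) { perror("open"); return 1; }

            struct nanet_cq_entry *cq =
                mmap(NULL, CQ_ENTRIES * sizeof *cq, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
            if (cq == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

            /* Latency-critical receive loop: no system call per packet. */
            for (uint32_t head = 0; ; head = (head + 1) % CQ_ENTRIES) {
                while (!cq[head].valid)
                    ;                                          /* spin on NIC DMA */
                printf("received %u bytes at offset %llu\n",
                       (unsigned)cq[head].length,
                       (unsigned long long)cq[head].buf_offset);
                cq[head].valid = 0;                            /* recycle entry   */
            }
        }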

    GPU peer-to-peer techniques applied to a cluster interconnect

    Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, essentially by avoiding staging in host memory, they require specific hardware features which are not available on current-generation network adapters. In this paper we describe the architectural modifications required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class GPUs on an FPGA-based cluster interconnect. We also discuss the current software implementation, which integrates this feature by minimally extending the RDMA programming model, as well as some issues raised while employing it in a higher-level API like MPI. Finally, the current limits of the technique are studied by analyzing the performance improvements on low-level benchmarks and on two GPU-accelerated applications, showing when and how they benefit from the GPU peer-to-peer method.
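
    To make the "minimally extended RDMA programming model" concrete, the hedged C sketch below shows the application's point of view: a buffer is allocated directly in GPU memory with the CUDA runtime and then handed to the interconnect for registration, so the NIC can DMA to and from it with no host staging buffer. The apenet_register_buffer()/apenet_put() calls in the comments are hypothetical placeholders; only the CUDA runtime calls are real.

        /* Hedged sketch: preparing a GPU buffer for peer-to-peer RDMA.
         * Build against the CUDA runtime (e.g. link with -lcudart).     */
        #include <cuda_runtime_api.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            const size_t len = 1 << 20;              /* 1 MiB payload buffer */
            void *gpu_buf = NULL;

            /* Plain device allocation: no pinned host staging buffer needed. */
            if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
                fprintf(stderr, "cudaMalloc failed\n");
                return EXIT_FAILURE;
            }

            /* Hypothetical extension of the RDMA model: the NIC driver
             * resolves the GPU virtual address to PCIe bus addresses via the
             * vendor's peer-to-peer interface, and the usual RDMA calls are
             * then reused unchanged on the GPU pointer:
             *
             *   apenet_handle_t h = apenet_register_buffer(gpu_buf, len);
             *   apenet_put(h, remote_node, remote_vaddr, len);
             */

            cudaFree(gpu_buf);
            return EXIT_SUCCESS;
        }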

    Enabling Shared Memory Communication in Networks of MPSoCs

    Ongoing transistor scaling and the growing complexity of embedded system designs have led to the rise of MPSoCs (Multi-Processor Systems-on-Chip), combining multiple hard-core CPUs and accelerators (FPGA, GPU) on the same physical die. These devices are of great interest to the supercomputing community, which is increasingly reliant on heterogeneity to achieve power and performance goals in these closing stages of the race to exascale. In this paper, we present a network interface architecture and networking infrastructure, designed to sit inside the FPGA fabric of a cutting-edge MPSoC device, enabling networks of these devices to communicate in both a distributed- and a shared-memory context, with a reduced need for costly software networking system calls. We present our implementation and prototype system and discuss the main design decisions relevant to the use of the Xilinx Zynq UltraScale+, a state-of-the-art MPSoC, and the challenges to be overcome given the device's limitations and constraints. We demonstrate the working prototype system connecting two MPSoCs, with communication between a processor and a remote memory region and accelerator. We then discuss the limitations of the current implementation and highlight areas of improvement to make this solution production-ready.
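
    The "reduced need for costly software networking system calls" can be illustrated with a minimal user-space sketch: the FPGA network interface exposes a remote-memory window that is mapped once into the process, after which communication is performed with plain loads and stores. The C sketch below assumes a hypothetical UIO device node and window size; a real design would take these from the device tree of the actual interface block.

        /* Hedged sketch: shared-memory style communication through an
         * FPGA-exposed window on a Zynq UltraScale+-class MPSoC.  The
         * /dev/uio0 node and WINDOW_SIZE are hypothetical.           */
        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define WINDOW_SIZE (64 * 1024)   /* size of the mapped remote-memory window */

        int main(void)
        {
            /* One open()+mmap() at start-up; afterwards no per-message syscalls. */
            int fd = open("/dev/uio0", O_RDWR | O_SYNC);
            if (fd < 0) { perror("open"); return 1; }

            volatile uint8_t *win = mmap(NULL, WINDOW_SIZE, PROT_READ | PROT_WRITE,
                                         MAP_SHARED, fd, 0);
            if (win == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

            /* Stores into the window are forwarded by the FPGA fabric to the
             * remote node's memory region.                                   */
            const char msg[] = "hello remote node";
            memcpy((void *)win, msg, sizeof msg);

            munmap((void *)win, WINDOW_SIZE);
            close(fd);
            return 0;
        }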

    Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network

    APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-dimensional torus interconnect optimized for hybrid CPU-GPU clusters dedicated to high-performance scientific computing. The APEnet+ interconnect fabric is built on an FPGA-based PCI Express board with six bidirectional off-board links providing 34 Gbps of raw bandwidth per direction, and leverages the peer-to-peer capabilities of Fermi- and Kepler-class NVIDIA GPUs to obtain true zero-copy, low-latency GPU-to-GPU transfers. APEnet+ transfer latency is minimized through the adoption of an RDMA protocol implemented in the FPGA with specialized hardware blocks tightly coupled with an embedded microprocessor. This architecture provides a high-performance, low-latency offload engine for both the transmit and receive sides of data transactions: preliminary results are encouraging, showing a 50% bandwidth increase for large packet sizes. In this paper we describe the APEnet+ architecture, detailing the hardware implementation, and discuss the impact of this specialized RDMA hardware on host interface latency and bandwidth.
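
    As a rough illustration of how such an RDMA offload engine is driven from the host, the hedged C sketch below posts a PUT descriptor into a transmit ring and rings a doorbell that the FPGA engine watches. The descriptor layout, field names, and the in-memory stand-ins for the ring and doorbell are hypothetical, not the actual APEnet+ register map.

        /* Hedged sketch of the host side of an FPGA RDMA offload engine:
         * fill a descriptor, then ring a doorbell; the hardware fetches and
         * executes the transfer.  Layout and names are hypothetical.        */
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct rdma_put_desc {            /* one transmit-queue entry (illustrative) */
            uint64_t local_addr;          /* source buffer (host or GPU memory)      */
            uint64_t remote_vaddr;        /* destination virtual address             */
            uint32_t length;              /* payload size in bytes                   */
            uint16_t dest_coord;          /* packed X/Y/Z coordinates in the torus   */
            uint16_t flags;               /* e.g. GPU-memory bit, completion request */
        };

        static void post_put(volatile struct rdma_put_desc *ring,
                             volatile uint32_t *doorbell,
                             uint32_t tail,
                             const struct rdma_put_desc *d)
        {
            ring[tail] = *d;              /* write descriptor into the mapped ring   */
            __sync_synchronize();         /* make it visible before the doorbell     */
            *doorbell = tail;             /* the FPGA engine picks it up from here   */
        }

        int main(void)
        {
            /* In-memory stand-ins so the sketch runs stand-alone; a real
             * driver would mmap() the board-resident ring and doorbell.     */
            struct rdma_put_desc *ring = calloc(16, sizeof *ring);
            uint32_t doorbell = 0;
            if (!ring) return 1;

            struct rdma_put_desc d = {
                .local_addr   = 0x1000,   /* hypothetical source address      */
                .remote_vaddr = 0x2000,   /* hypothetical destination address */
                .length       = 4096,
                .dest_coord   = 0x0123,   /* hypothetical packed coordinates  */
                .flags        = 0,
            };
            post_put(ring, &doorbell, 0, &d);
            printf("posted PUT, doorbell = %u\n", (unsigned)doorbell);

            free(ring);
            return 0;
        }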