Solutions for the optimization of the software interface on an FPGA-based NIC
This research studies solutions for optimizing the software interface on FPGA-based Network Interface Cards (NICs). The work was carried out in the APE group at INFN (Istituto Nazionale di Fisica Nucleare), which has historically been active in designing high-performance scalable networks for clusters of hybrid (CPU/GPU) nodes.
The results are validated on two projects the APE group is currently working on, both of which allow fast prototyping of solutions and hardware-software co-design: APEnet (a PCIe FPGA-based 3D torus network controller) and NaNet (an FPGA-based family of NICs mainly dedicated to real-time, low-latency computing systems such as fast control systems or High Energy Physics Data Acquisition Systems). NaNet is also used to validate a GPU-controlled device driver that improves network performance, i.e. further lowers communication latency, when used in combination with existing user-space software.
This research is also producing results in the "Horizon2020 FET-HPC ExaNeSt project", which aims to prototype and develop solutions for some of the crucial problems on the way towards the production of Exascale-level supercomputers, and in which the APE group is actively contributing to the development of the network / interconnection infrastructure.
GPU peer-to-peer techniques applied to a cluster interconnect
Modern GPUs support special protocols to exchange data directly across the
PCI Express bus. While these protocols could be used to reduce GPU data
transmission times, basically by avoiding staging to host memory, they require
specific hardware features which are not available on current generation
network adapters. In this paper we describe the architectural modifications
required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class
GPUs on an FPGA-based cluster interconnect. We also discuss the current
software implementation, which integrates this feature by minimally extending
the RDMA programming model, along with some issues raised when employing it in
a higher-level API such as MPI. Finally, the current limits of the technique
are studied by analyzing the performance improvements on low-level benchmarks
and on two GPU-accelerated applications, showing when and how they seem to
benefit from the GPU peer-to-peer method.
Comment: paper accepted to CASS 201
Enabling Shared Memory Communication in Networks of MPSoCs
Ongoing transistor scaling and the growing complexity of embedded system designs have led to the rise of MPSoCs (Multi‐Processor System‐on‐Chip), which combine multiple hard‐core CPUs and accelerators (FPGA, GPU) on the same physical die. These devices are of great interest to the supercomputing community, which is increasingly reliant on heterogeneity to achieve power and performance goals in these closing stages of the race to exascale. In this paper, we present a network interface architecture and networking infrastructure, designed to sit inside the FPGA fabric of a cutting‐edge MPSoC device, that enables networks of these devices to communicate within both a distributed and a shared memory context, with reduced need for costly software networking system calls. We present our implementation and prototype system and discuss the main design decisions relevant to the use of the Xilinx Zynq Ultrascale+, a state‐of‐the‐art MPSoC, and the challenges to be overcome given the device's limitations and constraints. We demonstrate the working prototype system connecting two MPSoCs, with communication between a processor and a remote memory region and accelerator. We then discuss the limitations of the current implementation and highlight areas of improvement to make this solution production‐ready.
Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network
APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-dimensional torus interconnect network optimized for hybrid CPU-GPU clusters dedicated to high-performance scientific computing. The APEnet+ interconnect fabric is built on an FPGA-based PCI Express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages the peer-to-peer capabilities of Fermi- and Kepler-class NVIDIA GPUs to obtain real zero-copy, low-latency GPU-to-GPU transfers. APEnet+ transfer latency is minimized by adopting an RDMA protocol implemented in the FPGA with specialized hardware blocks tightly coupled with an embedded microprocessor. This architecture provides a high-performance, low-latency offload engine for both the transmit and receive sides of data transactions: preliminary results are encouraging, showing a 50% bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation, and discuss the impact of such specialized RDMA hardware on host interface latency and bandwidth.
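As an illustration of the 3D torus topology mentioned in the abstract, the following minimal sketch computes the six neighbours reached by a node's six links, with wraparound on each axis. The coordinate scheme, the function names and the lattice size are assumptions made for this example; the abstract does not describe how APEnet+ addresses its nodes. The aggregate figure assumes that the quoted 34 Gbps is the raw bandwidth of each link per direction.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]

# Figures quoted in the abstract; treating 34 Gbps as a per-link value
# is an assumption of this sketch.
LINKS_PER_NODE = 6
RAW_GBPS_PER_LINK = 34


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the six neighbours of `node` in a 3D torus of size `dims`.

    Hypothetical addressing: each node is an (x, y, z) triple and every
    axis wraps around, so the last node on an axis links to the first.
    """
    neighbors = []
    for axis in range(3):
        for step in (+1, -1):
            coord = list(node)
            # Modulo arithmetic implements the wraparound link.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


def aggregate_raw_gbps() -> int:
    """Raw bandwidth per direction summed over all six links."""
    return LINKS_PER_NODE * RAW_GBPS_PER_LINK


if __name__ == "__main__":
    print(torus_neighbors((0, 0, 0), (4, 4, 4)))
    print(aggregate_raw_gbps(), "Gbps")
```

On an axis of size 2 both directions lead to the same neighbour, so the returned list contains duplicates in that case.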