Status of the APENet project
Comment: 6 pages, 5 figures, poster presented at Lattice 2005 (Algorithms and Machines), Dublin, July 25-3
APEnet+: a 3D toroidal network enabling Petaflops scale Lattice QCD simulations on commodity clusters
Many scientific computations need multi-node parallelism to meet ever-increasing
requirements in both space (memory) and time (speed). The use of GPUs
as accelerators introduces yet another level of complexity for the programmer
and may potentially result in large overheads due to the complex memory
hierarchy. Additionally, leading-edge problems may easily demand more than one
Petaflops of sustained computing power, requiring thousands of GPUs
orchestrated with some parallel programming model. Here we describe APEnet+,
the new generation of our interconnect, which scales up to tens of thousands of
nodes with linear cost, thus improving the price/performance ratio on large
clusters. The project target is the development of the Apelink+ host adapter
featuring a low latency, high bandwidth direct network, state-of-the-art wire
speeds on the links and a PCIe X8 gen2 host interface. It features hardware
support for the RDMA programming model and experimental acceleration of GPU
networking. A Linux kernel driver, a set of low-level RDMA APIs and an OpenMPI
library driver are available, allowing for painless porting of standard
applications. Finally, we give an insight into future work and intended
developments.
GPU peer-to-peer techniques applied to a cluster interconnect
Modern GPUs support special protocols to exchange data directly across the
PCI Express bus. While these protocols could be used to reduce GPU data
transmission times, basically by avoiding staging to host memory, they require
specific hardware features which are not available on current generation
network adapters. In this paper we describe the architectural modifications
required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class
GPUs on an FPGA-based cluster interconnect. In addition, the current software
implementation, which integrates this feature by minimally extending the RDMA
programming model, is discussed, as well as some issues raised while employing
it in a higher level API like MPI. Finally, the current limits of the technique
are studied by analyzing the performance improvements on low-level benchmarks
and on two GPU-accelerated applications, showing when and how they seem to
benefit from the GPU peer-to-peer method.
Comment: paper accepted to CASS 201