75 research outputs found
APEnet+: a 3D toroidal network enabling Petaflops scale Lattice QCD simulations on commodity clusters
Many scientific computations need multi-node parallelism for matching up both
space (memory) and time (speed) ever-increasing requirements. The use of GPUs
as accelerators introduces yet another level of complexity for the programmer
and may potentially result in large overheads due to the complex memory
hierarchy. Additionally, top-notch problems may easily employ more than a
Petaflops of sustained computing power, requiring thousands of GPUs
orchestrated with some parallel programming model. Here we describe APEnet+,
the new generation of our interconnect, which scales up to tens of thousands of
nodes with linear cost, thus improving the price/performance ratio on large
clusters. The project target is the development of the Apelink+ host adapter
featuring a low latency, high bandwidth direct network, state-of-the-art wire
speeds on the links and a PCIe X8 gen2 host interface. It features hardware
support for the RDMA programming model and experimental acceleration of GPU
networking. A Linux kernel driver, a set of low-level RDMA APIs and an OpenMPI
library driver are available, allowing for painless porting of standard
applications. Finally, we give an insight of future work and intended
developments
GPU peer-to-peer techniques applied to a cluster interconnect
Modern GPUs support special protocols to exchange data directly across the
PCI Express bus. While these protocols could be used to reduce GPU data
transmission times, basically by avoiding staging to host memory, they require
specific hardware features which are not available on current generation
network adapters. In this paper we describe the architectural modifications
required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class
GPUs on an FPGA-based cluster interconnect. Besides, the current software
implementation, which integrates this feature by minimally extending the RDMA
programming model, is discussed, as well as some issues raised while employing
it in a higher level API like MPI. Finally, the current limits of the technique
are studied by analyzing the performance improvements on low-level benchmarks
and on two GPU-accelerated applications, showing when and how they seem to
benefit from the GPU peer-to-peer method.Comment: paper accepted to CASS 201
The Brain on Low Power Architectures - Efficient Simulation of Cortical Slow Waves and Asynchronous States
Efficient brain simulation is a scientific grand challenge, a
parallel/distributed coding challenge and a source of requirements and
suggestions for future computing architectures. Indeed, the human brain
includes about 10^15 synapses and 10^11 neurons activated at a mean rate of
several Hz. Full brain simulation poses Exascale challenges even if simulated
at the highest abstraction level. The WaveScalES experiment in the Human Brain
Project (HBP) has the goal of matching experimental measures and simulations of
slow waves during deep-sleep and anesthesia and the transition to other brain
states. The focus is the development of dedicated large-scale
parallel/distributed simulation technologies. The ExaNeSt project designs an
ARM-based, low-power HPC architecture scalable to million of cores, developing
a dedicated scalable interconnect system, and SWA/AW simulations are included
among the driving benchmarks. At the joint between both projects is the INFN
proprietary Distributed and Plastic Spiking Neural Networks (DPSNN) simulation
engine. DPSNN can be configured to stress either the networking or the
computation features available on the execution platforms. The simulation
stresses the networking component when the neural net - composed by a
relatively low number of neurons, each one projecting thousands of synapses -
is distributed over a large number of hardware cores. When growing the number
of neurons per core, the computation starts to be the dominating component for
short range connections. This paper reports about preliminary performance
results obtained on an ARM-based HPC prototype developed in the framework of
the ExaNeSt project. Furthermore, a comparison is given of instantaneous power,
total energy consumption, execution time and energetic cost per synaptic event
of SWA/AW DPSNN simulations when executed on either ARM- or Intel-based server
platforms
- …