33,643 research outputs found
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
GROMACS is a widely used package for biomolecular simulation, and over the
last two decades it has evolved from small-scale efficiency to advanced
heterogeneous acceleration and multi-level parallelism targeting some of the
largest supercomputers in the world. Here, we describe some of the ways we have
been able to realize this through the use of parallelization on all levels,
combined with a constant focus on absolute performance. Release 4.6 of GROMACS
uses SIMD acceleration on a wide range of architectures, GPU offloading
acceleration, and both OpenMP and MPI parallelism within and between nodes,
respectively. The recent work on acceleration made it necessary to revisit the
fundamental algorithms of molecular simulation, including the concept of
neighborsearching, and we discuss the present and future challenges we see for
exascale simulation - in particular a very fine-grained task parallelism. We
also discuss the software management, code peer review and continuous
integration testing required for a project of this complexity.Comment: EASC 2014 conference proceedin
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some applications paradigms. Software languages and systems such as NVIDIA's
CUDA and Khronos consortium's open compute language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica-
tions using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the applications
level programmer and o er some suggested areas for language development and integration
between coarse-grained and ne-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas including: partial di erential equations; graph cluster
metric calculations and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scienti c applications developers
Model Coupling between the Weather Research and Forecasting Model and the DPRI Large Eddy Simulator for Urban Flows on GPU-accelerated Multicore Systems
In this report we present a novel approach to model coupling for
shared-memory multicore systems hosting OpenCL-compliant accelerators, which we
call The Glasgow Model Coupling Framework (GMCF). We discuss the implementation
of a prototype of GMCF and its application to coupling the Weather Research and
Forecasting Model and an OpenCL-accelerated version of the Large Eddy Simulator
for Urban Flows (LES) developed at DPRI.
The first stage of this work concerned the OpenCL port of the LES. The
methodology used for the OpenCL port is a combination of automated analysis and
code generation and rule-based manual parallelization. For the evaluation, the
non-OpenCL LES code was compiled using gfortran, fort and pgfortran}, in each
case with auto-parallelization and auto-vectorization. The OpenCL-accelerated
version of the LES achieves a 7 times speed-up on a NVIDIA GeForce GTX 480
GPGPU, compared to the fastest possible compilation of the original code
running on a 12-core Intel Xeon E5-2640.
In the second stage of this work, we built the Glasgow Model Coupling
Framework and successfully used it to couple an OpenMP-parallelized WRF
instance with an OpenCL LES instance which runs the LES code on the GPGPI. The
system requires only very minimal changes to the original code. The report
discusses the rationale, aims, approach and implementation details of this
work.Comment: This work was conducted during a research visit at the Disaster
Prevention Research Institute of Kyoto University, supported by an EPSRC
Overseas Travel Grant, EP/L026201/
FASTCUDA: Open Source FPGA Accelerator & Hardware-Software Codesign Toolset for CUDA Kernels
Using FPGAs as hardware accelerators that communicate with a central CPU is becoming a common practice in the embedded design world but there is no standard methodology and toolset to facilitate this path yet. On the other hand, languages such as CUDA and OpenCL provide standard development environments for Graphical Processing Unit (GPU) programming. FASTCUDA is a platform that provides the necessary software toolset, hardware architecture, and design methodology to efficiently adapt the CUDA approach into a new FPGA design flow. With FASTCUDA, the CUDA kernels of a CUDA-based application are partitioned into two groups with minimal user intervention: those that are compiled and executed in parallel software, and those that are synthesized and implemented in hardware. A modern low power FPGA can provide the processing power (via numerous embedded micro-CPUs) and the logic capacity for both the software and hardware implementations of the CUDA kernels. This paper describes the system requirements and the architectural decisions behind the FASTCUDA approach
NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors
© 2016 Cheung, Schultz and Luk.NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation
Neuromorphic Hardware In The Loop: Training a Deep Spiking Network on the BrainScaleS Wafer-Scale System
Emulating spiking neural networks on analog neuromorphic hardware offers
several advantages over simulating them on conventional computers, particularly
in terms of speed and energy consumption. However, this usually comes at the
cost of reduced control over the dynamics of the emulated networks. In this
paper, we demonstrate how iterative training of a hardware-emulated network can
compensate for anomalies induced by the analog substrate. We first convert a
deep neural network trained in software to a spiking network on the BrainScaleS
wafer-scale neuromorphic system, thereby enabling an acceleration factor of 10
000 compared to the biological time domain. This mapping is followed by the
in-the-loop training, where in each training step, the network activity is
first recorded in hardware and then used to compute the parameter updates in
software via backpropagation. An essential finding is that the parameter
updates do not have to be precise, but only need to approximately follow the
correct gradient, which simplifies the computation of updates. Using this
approach, after only several tens of iterations, the spiking network shows an
accuracy close to the ideal software-emulated prototype. The presented
techniques show that deep spiking networks emulated on analog neuromorphic
devices can attain good computational performance despite the inherent
variations of the analog substrate.Comment: 8 pages, 10 figures, submitted to IJCNN 201
- …