1,786 research outputs found
Optimising Simulation Data Structures for the Xeon Phi
In this paper, we propose a lock-free architecture
to accelerate logic gate circuit simulation using SIMD multi-core
machines. We evaluate its performance on different test circuits
simulated on the Intel Xeon Phi and 2 other machines. Comparisons
are presented of this software/hardware combination with
reported performances of GPU and other multi-core simulation
platforms. Comparisons are also given between the lock free
architecture and a leading commercial simulator running on the
same Intel hardware
QuEST and High Performance Simulation of Quantum Computers
We introduce QuEST, the Quantum Exact Simulation Toolkit, and compare it to
ProjectQ, qHipster and a recent distributed implementation of Quantum++. QuEST
is the first open source, OpenMP and MPI hybridised, GPU accelerated simulator
of universal quantum circuits. Embodied as a C library, it is designed so that
a user's code can be deployed seamlessly to any platform from a laptop to a
supercomputer. QuEST is capable of simulating generic quantum circuits of
general single-qubit gates and multi-qubit controlled gates, on pure and mixed
states, represented as state-vectors and density matrices, and under the
presence of decoherence. Using the ARCUS Phase-B and ARCHER supercomputers, we
benchmark QuEST's simulation of random circuits of up to 38 qubits, distributed
over up to 2048 compute nodes, each with up to 24 cores. We directly compare
QuEST's performance to ProjectQ's on single machines, and discuss the
differences in distribution strategies of QuEST, qHipster and Quantum++. QuEST
shows excellent scaling, both strong and weak, on multicore and distributed
architectures.Comment: 8 pages, 8 figures; fixed typos; updated QuEST URL and fixed typo in
Fig. 4 caption where ProjectQ and QuEST were swapped in speedup subplot
explanation; added explanation of simulation algorithm, updated bibliography;
stressed technical novelty of QuEST; mentioned new density matrix suppor
An ultra low-power hardware accelerator for automatic speech recognition
Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost which is not affordable for the tiny power budget of mobile devices. Hardware acceleration can reduce power consumption of ASR systems, while delivering high-performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, that represents the main bottleneck in an ASR system. The proposed design includes innovative techniques to improve the memory subsystem, since memory is identified as the main bottleneck for performance and power in the design of these accelerators. We propose a prefetching scheme tailored to the needs of an ASR system that hides main memory latency for a large fraction of the memory accesses with a negligible impact on area. In addition, we introduce a novel bandwidth saving technique that removes 20% of the off-chip memory accesses issued during the Viterbi search. The proposed design outperforms software implementations running on the CPU by orders of magnitude and achieves 1.7x speedup over a highly optimized CUDA implementation running on a high-end Geforce GTX 980 GPU, while reducing by two orders of magnitude (287x) the energy required to convert the speech into text.Peer ReviewedPostprint (author's final draft
BrainFrame: A node-level heterogeneous accelerator platform for neuron simulations
Objective: The advent of High-Performance Computing (HPC) in recent years has
led to its increasing use in brain study through computational models. The
scale and complexity of such models are constantly increasing, leading to
challenging computational requirements. Even though modern HPC platforms can
often deal with such challenges, the vast diversity of the modeling field does
not permit for a single acceleration (or homogeneous) platform to effectively
address the complete array of modeling requirements. Approach: In this paper we
propose and build BrainFrame, a heterogeneous acceleration platform,
incorporating three distinct acceleration technologies, a Dataflow Engine, a
Xeon Phi and a GP-GPU. The PyNN framework is also integrated into the platform.
As a challenging proof of concept, we analyze the performance of BrainFrame on
different instances of a state-of-the-art neuron model, modeling the Inferior-
Olivary Nucleus using a biophysically-meaningful, extended Hodgkin-Huxley
representation. The model instances take into account not only the neuronal-
network dimensions but also different network-connectivity circumstances that
can drastically change application workload characteristics. Main results: The
synthetic approach of three HPC technologies demonstrated that BrainFrame is
better able to cope with the modeling diversity encountered. Our performance
analysis shows clearly that the model directly affect performance and all three
technologies are required to cope with all the model use cases.Comment: 16 pages, 18 figures, 5 table
A low-power, high-performance speech recognition accelerator
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at high energy cost, not being affordable for the tiny power-budgeted mobile devices. Hardware acceleration reduces energy-consumption of ASR systems, while delivering high-performance. In this paper, we present an accelerator for largevocabulary, speaker-independent, continuous speech-recognition. It focuses on the Viterbi search algorithm representing the main bottleneck in an ASR system. The proposed design consists of innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in these accelerators' design. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses, negligibly impacting area. Additionally, we introduce a novel bandwidth-saving technique that removes off-chip memory accesses by 20 percent. Finally, we present a power saving technique that significantly reduces the leakage power of the accelerators scratchpad memories, providing between 8.5 and 29.2 percent reduction in entire power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude, and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on Geforce-GTX-980 GPU, while reducing the energy by 123-454x.Peer ReviewedPostprint (author's final draft
Model Coupling between the Weather Research and Forecasting Model and the DPRI Large Eddy Simulator for Urban Flows on GPU-accelerated Multicore Systems
In this report we present a novel approach to model coupling for
shared-memory multicore systems hosting OpenCL-compliant accelerators, which we
call The Glasgow Model Coupling Framework (GMCF). We discuss the implementation
of a prototype of GMCF and its application to coupling the Weather Research and
Forecasting Model and an OpenCL-accelerated version of the Large Eddy Simulator
for Urban Flows (LES) developed at DPRI.
The first stage of this work concerned the OpenCL port of the LES. The
methodology used for the OpenCL port is a combination of automated analysis and
code generation and rule-based manual parallelization. For the evaluation, the
non-OpenCL LES code was compiled using gfortran, fort and pgfortran}, in each
case with auto-parallelization and auto-vectorization. The OpenCL-accelerated
version of the LES achieves a 7 times speed-up on a NVIDIA GeForce GTX 480
GPGPU, compared to the fastest possible compilation of the original code
running on a 12-core Intel Xeon E5-2640.
In the second stage of this work, we built the Glasgow Model Coupling
Framework and successfully used it to couple an OpenMP-parallelized WRF
instance with an OpenCL LES instance which runs the LES code on the GPGPI. The
system requires only very minimal changes to the original code. The report
discusses the rationale, aims, approach and implementation details of this
work.Comment: This work was conducted during a research visit at the Disaster
Prevention Research Institute of Kyoto University, supported by an EPSRC
Overseas Travel Grant, EP/L026201/
Domain-Specific Acceleration and Auto-Parallelization of Legacy Scientific Code in FORTRAN 77 using Source-to-Source Compilation
Massively parallel accelerators such as GPGPUs, manycores and FPGAs represent
a powerful and affordable tool for scientists who look to speed up simulations
of complex systems. However, porting code to such devices requires a detailed
understanding of heterogeneous programming tools and effective strategies for
parallelization. In this paper we present a source to source compilation
approach with whole-program analysis to automatically transform single-threaded
FORTRAN 77 legacy code into OpenCL-accelerated programs with parallelized
kernels.
The main contributions of our work are: (1) whole-source refactoring to allow
any subroutine in the code to be offloaded to an accelerator. (2) Minimization
of the data transfer between the host and the accelerator by eliminating
redundant transfers. (3) Pragmatic auto-parallelization of the code to be
offloaded to the accelerator by identification of parallelizable maps and
reductions.
We have validated the code transformation performance of the compiler on the
NIST FORTRAN 78 test suite and several real-world codes: the Large Eddy
Simulator for Urban Flows, a high-resolution turbulent flow model; the shallow
water component of the ocean model Gmodel; the Linear Baroclinic Model, an
atmospheric climate model and Flexpart-WRF, a particle dispersion simulator.
The automatic parallelization component has been tested on as 2-D Shallow
Water model (2DSW) and on the Large Eddy Simulator for Urban Flows (UFLES) and
produces a complete OpenCL-enabled code base. The fully OpenCL-accelerated
versions of the 2DSW and the UFLES are resp. 9x and 20x faster on GPU than the
original code on CPU, in both cases this is the same performance as manually
ported code.Comment: 12 pages, 5 figures, submitted to "Computers and Fluids" as full
paper from ParCFD conference entr
- …