An investigation of the performance portability of OpenCL
This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single-source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation.
Automatic Computation of Cross Sections in HEP
For the study of reactions in High Energy Physics (HEP), automatic computation
systems have been developed and are now widely used. GRACE is one such
system, and it has achieved much success in analyzing experimental data. Since
we deal with cross sections whose values are obtained by calculating hundreds
of Feynman diagrams, we must manage large-scale calculations, so effective
symbolic manipulation and the treatment of singularities in the numerical
integration are required. The talk will describe the software design of the
GRACE system and the computational techniques used in GRACE.
Comment: 6 pages, Latex, ICCP
Making extreme computations possible with virtual machines
State-of-the-art algorithms generate scattering amplitudes for high-energy
physics at leading order for high-multiplicity processes as compiled code (in
Fortran, C or C++). For complicated processes the size of these libraries can
become tremendous (many GiB). We show that amplitudes can be translated to
byte-code instructions, which reduces the size by an order of magnitude.
The byte-code is interpreted by a Virtual Machine with runtimes comparable to
compiled code and a better scaling with additional legs. We study the
properties of this algorithm, as an extension of the Optimizing Matrix Element
Generator (O'Mega). The bytecode matrix elements are available as alternative
input for the event generator WHIZARD. The bytecode interpreter can be
implemented very compactly, which will help with a future implementation on
massively parallel GPUs.
Comment: 5 pages, 2 figures. arXiv admin note: substantial text overlap with
arXiv:1411.383