
    An investigation of the performance portability of OpenCL

    This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single-source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation.

    Automatic Computation of Cross Sections in HEP

    Automatic computation systems have been developed for the study of reactions in High Energy Physics (HEP) and are now widely used. GRACE is one such system, and it has achieved much success in analyzing experimental data. Because the value of a cross section may require calculating hundreds of Feynman diagrams, the calculation is large in scale, so effective symbolic manipulation and careful treatment of singularities in the numerical integration are required. The talk will describe the software design of the GRACE system and the computational techniques used in GRACE. Comment: 6 pages, LaTeX, ICCP

    Making extreme computations possible with virtual machines

    State-of-the-art algorithms generate scattering amplitudes for high-energy physics at leading order for high-multiplicity processes as compiled code (in Fortran, C or C++). For complicated processes the size of these libraries can become tremendous (many GiB). We show that amplitudes can be translated to bytecode instructions, which reduces the size by an order of magnitude. The bytecode is interpreted by a Virtual Machine with runtimes comparable to compiled code and better scaling with additional legs. We study the properties of this algorithm as an extension of the Optimizing Matrix Element Generator (O'Mega). The bytecode matrix elements are available as alternative input for the event generator WHIZARD. The bytecode interpreter can be implemented very compactly, which will help with a future implementation on massively parallel GPUs. Comment: 5 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:1411.383