Continuing Progress on a Lattice QCD Software Infrastructure
We report on the progress of the software effort in the QCD Application Area
of SciDAC. In particular, we discuss how the software developed under SciDAC
enabled the aggressive exploitation of leadership computers, and we report on
progress in the area of QCD software for multi-core architectures.
Comment: 5 pages, to appear in the Proceedings of SciDAC 2008 conference
(Seattle, July 13-17, 2008), Conference Poster Presentation Proceedings
MILC Code Performance on High End CPU and GPU Supercomputer Clusters
With recent developments in parallel supercomputing architecture, many-core,
multi-core, and GPU processors are now commonplace, resulting in more levels of
parallelism, deeper memory hierarchies, and greater programming complexity. It has been
necessary to adapt the MILC code to these new processors starting with NVIDIA
GPUs, and more recently, the Intel Xeon Phi processors. We report on our
efforts to port and optimize our code for the Intel Knights Landing
architecture. We consider performance of the MILC code with MPI and OpenMP, and
optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on
the staggered conjugate gradient and gauge force. We also consider performance
on recent NVIDIA GPUs using the QUDA library.
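As a rough illustration of the MPI+OpenMP combination mentioned above, the
sketch below shows the typical hybrid structure of a lattice-code kernel:
OpenMP threads reduce over the local sub-lattice and MPI combines the partial
results across ranks. This is a generic example; the array names, the funneled
threading mode, and the dot-product kernel are illustrative assumptions, not
code from MILC, QOPQDP, or QPhiX.

    // Hedged sketch: hybrid MPI+OpenMP layout typical of lattice codes on
    // many-core CPUs (illustrative only; not taken from the MILC sources).
    #include <mpi.h>
    #include <omp.h>
    #include <vector>
    #include <cstdio>

    int main(int argc, char** argv) {
        int provided;
        // Request funneled threading: only the master thread makes MPI calls.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each rank owns a sub-lattice; here just a flat array of site values.
        const std::size_t local_sites = 1 << 20;
        std::vector<double> x(local_sites, 1.0), y(local_sites, 2.0);

        // Thread-parallel local dot product, as needed e.g. inside a CG iteration.
        double local_dot = 0.0;
        #pragma omp parallel for reduction(+ : local_dot)
        for (std::size_t i = 0; i < local_sites; ++i)
            local_dot += x[i] * y[i];

        // Combine the partial results across ranks.
        double global_dot = 0.0;
        MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("global dot product = %f (ranks=%d, threads=%d)\n",
                        global_dot, size, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }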
QPACE 2 and Domain Decomposition on the Intel Xeon Phi
We give an overview of QPACE 2, which is a custom-designed supercomputer
based on Intel Xeon Phi processors, developed in a collaboration of Regensburg
University and Eurotech. We give some general recommendations for how to write
high-performance code for the Xeon Phi and then discuss our implementation of a
domain-decomposition-based solver and present a number of benchmarks.
Comment: plenary talk at Lattice 2014, to appear in the conference proceedings
PoS(LATTICE2014), 15 pages, 9 figures
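One of the usual recommendations for the Xeon Phi is to keep hot loops
vectorizable for its 512-bit vector units and to align data accordingly. The
sketch below illustrates that idea with a simple axpy kernel; the function
name, arrays, and alignment choices are assumptions made for this example and
are not taken from the QPACE 2 solver code.

    // Hedged illustration of a common Xeon Phi recommendation: keep hot loops
    // vectorizable and align data to the 64-byte (512-bit) vector width.
    // Names and sizes are made up for this sketch.
    #include <cstdlib>
    #include <cstdio>

    void axpy(std::size_t n, float a, const float* __restrict x, float* __restrict y) {
        // Ask the compiler to vectorize; aligned(...) promises 64-byte alignment.
        #pragma omp simd aligned(x, y : 64)
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const std::size_t n = 1 << 16;
        // 64-byte aligned allocations so the aligned clause holds.
        float* x = static_cast<float*>(std::aligned_alloc(64, n * sizeof(float)));
        float* y = static_cast<float*>(std::aligned_alloc(64, n * sizeof(float)));
        for (std::size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        axpy(n, 3.0f, x, y);
        std::printf("y[0] = %f\n", y[0]);  // expect 5.0

        std::free(x);
        std::free(y);
        return 0;
    }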
GeantV: Results from the prototype of concurrent vector particle transport simulation in HEP
Full detector simulation was among the largest CPU consumers in all CERN
experiment software stacks for the first two runs of the Large Hadron Collider
(LHC). In the early 2010s, the projections were that simulation demands would
scale linearly with increasing luminosity, compensated only partially by an
increase of computing resources. The extension of fast simulation approaches to
more use cases, covering a larger fraction of the simulation budget, is only
part of the solution due to intrinsic precision limitations. The remainder
must come from speeding up the simulation software by several factors, which is
out of reach using simple optimizations on the current code base. In this
context, the GeantV R&D project was launched, aiming to redesign the legacy
particle transport codes in order to make them benefit from fine-grained
parallelism features such as vectorization, but also from increased code and
data locality. This paper presents in detail the results and achievements of
this R&D, as well as the conclusions and lessons learnt from the beta
prototype.
Comment: 34 pages, 26 figures, 24 tables
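The data-locality and vectorization goals above are commonly met by grouping
tracks in a structure-of-arrays layout, so that a propagation step becomes a
contiguous, vectorizable loop. The sketch below shows that layout in generic
form; the TrackBasket type and its fields are hypothetical and do not reflect
GeantV's actual data model.

    // Hedged sketch of a structure-of-arrays (SoA) track layout that enables
    // fine-grained vectorization; field names are illustrative only.
    #include <vector>
    #include <cstdio>

    // SoA: each coordinate lives in its own contiguous array, so a propagation
    // step over many tracks becomes a simple vectorizable loop.
    struct TrackBasket {
        std::vector<double> x, y, z;    // positions
        std::vector<double> px, py, pz; // direction components
    };

    void propagate(TrackBasket& t, double step) {
        const std::size_t n = t.x.size();
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i) {
            t.x[i] += step * t.px[i];
            t.y[i] += step * t.py[i];
            t.z[i] += step * t.pz[i];
        }
    }

    int main() {
        TrackBasket basket;
        for (int i = 0; i < 1024; ++i) {
            basket.x.push_back(0.0); basket.y.push_back(0.0); basket.z.push_back(0.0);
            basket.px.push_back(1.0); basket.py.push_back(0.0); basket.pz.push_back(0.0);
        }
        propagate(basket, 0.5);
        std::printf("track 0 moved to x = %f\n", basket.x[0]);
        return 0;
    }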
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations
The molecular dynamics simulation package GROMACS runs efficiently on a wide
variety of hardware from commodity workstations to high performance computing
clusters. Hardware features are well exploited with a combination of SIMD,
multi-threading, and MPI-based SPMD/MPMD parallelism, while GPUs can be used as
accelerators to compute interactions offloaded from the CPU. Here we evaluate
which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most
economical way. We have assembled and benchmarked compute nodes with various
CPU/GPU combinations to identify optimal compositions in terms of raw
trajectory production rate, performance-to-price ratio, energy efficiency, and
several other criteria. Though hardware prices are naturally subject to trends
and fluctuations, general tendencies are clearly visible. Adding any type of
GPU significantly boosts a node's simulation performance. For inexpensive
consumer-class GPUs this improvement is equally reflected in the
performance-to-price ratio. Although memory issues in consumer-class GPUs could
pass unnoticed since these cards do not support ECC memory, unreliable GPUs can
be sorted out with memory checking tools. Apart from the obvious determinants
for cost-efficiency like hardware expenses and raw performance, the energy
consumption of a node is a major cost factor. Over a typical hardware lifetime
of a few years until replacement, the costs for electrical power and cooling
can exceed the cost of the hardware itself. Taking that
into account, nodes with a well-balanced ratio of CPU and consumer-class GPU
resources produce the maximum amount of GROMACS trajectory over their lifetime.
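A back-of-envelope calculation makes the lifetime-cost argument concrete. The
sketch below multiplies an assumed node power draw, cooling overhead,
electricity price, and service life to compare operating cost against purchase
price; all numbers are illustrative assumptions, not figures from the study.

    // Hedged back-of-envelope calculation behind the lifetime-cost argument.
    // All numbers (node price, power draw, electricity rate, lifetime) are
    // illustrative assumptions.
    #include <cstdio>

    int main() {
        const double node_price_eur   = 2500.0;  // assumed purchase price
        const double node_power_kw    = 0.5;     // assumed average draw incl. GPU
        const double cooling_overhead = 1.5;     // PUE-like factor for cooling
        const double eur_per_kwh      = 0.25;    // assumed electricity price
        const double lifetime_years   = 4.0;     // assumed replacement cycle

        const double hours   = lifetime_years * 365.0 * 24.0;
        const double energy  = node_power_kw * cooling_overhead * hours;   // kWh
        const double op_cost = energy * eur_per_kwh;

        // With these inputs, power and cooling (~6570 EUR) exceed the hardware price.
        std::printf("hardware: %.0f EUR, power+cooling over %.0f years: %.0f EUR\n",
                    node_price_eur, lifetime_years, op_cost);
        return 0;
    }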
Parallel performance results for the OpenMOC neutron transport code on multicore platforms
The shift toward multicore architectures has ushered in a new era of shared-memory
parallelism for scientific applications. This transition has introduced challenges
for the nuclear engineering community as it seeks to design high-fidelity full-core
reactor physics simulation tools. This article describes the parallel transport
sweep algorithm in the OpenMOC method of characteristics (MOC) neutron transport
code for multicore platforms using OpenMP. Strong and weak scaling studies are
performed for both Intel Xeon and IBM Blue Gene/Q (BG/Q) multicore processors.
The results demonstrate 100% parallel efficiency with 12 threads on 12 cores on
Intel Xeon platforms and over 90% parallel efficiency with 64 threads on 16 cores
on the IBM BG/Q. These results illustrate the potential for hardware acceleration
of MOC neutron transport on modern multicore and future many-core architectures.
In addition, this work highlights the pitfalls of programming for multicore
architectures, with a focal point on false sharing.
Funding: National Science Foundation (U.S.), Graduate Research Fellowship Program
(Grant 1122374); United States Department of Energy, Center for Exascale Simulation
of Advanced Reactors (Contract DE-AC02-06CH11357)
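The false-sharing pitfall mentioned above arises when per-thread accumulators
sit next to each other in memory and therefore share cache lines, so every
write by one thread invalidates its neighbours' cached copies. The generic
OpenMP sketch below contrasts that layout with one padded to the cache-line
size; it illustrates the pitfall and its standard fix, and is not code from
OpenMOC.

    // Hedged illustration of false sharing and the padding fix (generic OpenMP
    // example, not OpenMOC code).
    #include <omp.h>
    #include <cstdio>

    constexpr int kMaxThreads = 64;
    constexpr int kCacheLine  = 64;  // bytes; typical x86 cache-line size

    // Naive layout: adjacent doubles share cache lines -> false sharing.
    double naive[kMaxThreads];

    // Padded layout: each per-thread accumulator owns a full cache line.
    struct alignas(kCacheLine) PaddedDouble { double value = 0.0; };
    PaddedDouble padded[kMaxThreads];

    int main() {
        const long iters = 20'000'000;

        #pragma omp parallel
        {
            const int tid = omp_get_thread_num();

            // This loop suffers false sharing: every increment dirties a cache
            // line that neighbouring threads are also writing.
            for (long i = 0; i < iters; ++i)
                naive[tid] += 1.0;

            // Same work, but each thread writes to its own cache line.
            for (long i = 0; i < iters; ++i)
                padded[tid].value += 1.0;
        }

        std::printf("naive[0] = %.0f, padded[0] = %.0f\n", naive[0], padded[0].value);
        return 0;
    }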