Continuing Progress on a Lattice QCD Software Infrastructure
We report on the progress of the software effort in the QCD Application Area
of SciDAC. In particular, we discuss how the software developed under SciDAC
enabled the aggressive exploitation of leadership computers, and we report on
progress in the area of QCD software for multi-core architectures.
Comment: 5 pages, to appear in the Proceedings of SciDAC 2008 conference
(Seattle, July 13-17, 2008), Conference Poster Presentation Proceedings
MILC Code Performance on High End CPU and GPU Supercomputer Clusters
With recent developments in parallel supercomputing architecture, many-core,
multi-core, and GPU processors are now commonplace, resulting in more levels of
parallelism, deeper memory hierarchies, and greater programming complexity. It has been
necessary to adapt the MILC code to these new processors starting with NVIDIA
GPUs, and more recently, the Intel Xeon Phi processors. We report on our
efforts to port and optimize our code for the Intel Knights Landing
architecture. We consider performance of the MILC code with MPI and OpenMP, and
optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on
the staggered conjugate gradient and gauge force. We also consider performance
on recent NVIDIA GPUs using the QUDA library.
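As a rough illustration of the MPI+OpenMP combination mentioned above, the
sketch below shows the typical hybrid structure of a lattice-code kernel:
OpenMP threads reduce over the local sub-lattice and MPI combines the partial
results across ranks. This is a generic example; the array names, the funneled
threading mode, and the dot-product kernel are illustrative assumptions, not
code from MILC, QOPQDP, or QPhiX.

    // Hedged sketch: hybrid MPI+OpenMP layout typical of lattice codes on
    // many-core CPUs (illustrative only; not taken from the MILC sources).
    #include <mpi.h>
    #include <omp.h>
    #include <vector>
    #include <cstdio>

    int main(int argc, char** argv) {
        int provided;
        // Request funneled threading: only the master thread makes MPI calls.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each rank owns a sub-lattice; here just a flat array of site values.
        const std::size_t local_sites = 1 << 20;
        std::vector<double> x(local_sites, 1.0), y(local_sites, 2.0);

        // Thread-parallel local dot product, as needed e.g. inside a CG iteration.
        double local_dot = 0.0;
        #pragma omp parallel for reduction(+ : local_dot)
        for (std::size_t i = 0; i < local_sites; ++i)
            local_dot += x[i] * y[i];

        // Combine the partial results across ranks.
        double global_dot = 0.0;
        MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("global dot product = %f (ranks=%d, threads=%d)\n",
                        global_dot, size, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }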
QPACE 2 and Domain Decomposition on the Intel Xeon Phi
We give an overview of QPACE 2, which is a custom-designed supercomputer
based on Intel Xeon Phi processors, developed in a collaboration of Regensburg
University and Eurotech. We give some general recommendations for how to write
high-performance code for the Xeon Phi and then discuss our implementation of a
domain-decomposition-based solver and present a number of benchmarks.
Comment: plenary talk at Lattice 2014, to appear in the conference proceedings
PoS(LATTICE2014), 15 pages, 9 figures
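One of the usual recommendations for the Xeon Phi is to keep hot loops
vectorizable for its 512-bit vector units and to align data accordingly. The
sketch below illustrates that idea with a simple axpy kernel; the function
name, arrays, and alignment choices are assumptions made for this example and
are not taken from the QPACE 2 solver code.

    // Hedged illustration of a common Xeon Phi recommendation: keep hot loops
    // vectorizable and align data to the 64-byte (512-bit) vector width.
    // Names and sizes are made up for this sketch.
    #include <cstdlib>
    #include <cstdio>

    void axpy(std::size_t n, float a, const float* __restrict x, float* __restrict y) {
        // Ask the compiler to vectorize; aligned(...) promises 64-byte alignment.
        #pragma omp simd aligned(x, y : 64)
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const std::size_t n = 1 << 16;
        // 64-byte aligned allocations so the aligned clause holds.
        float* x = static_cast<float*>(std::aligned_alloc(64, n * sizeof(float)));
        float* y = static_cast<float*>(std::aligned_alloc(64, n * sizeof(float)));
        for (std::size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        axpy(n, 3.0f, x, y);
        std::printf("y[0] = %f\n", y[0]);  // expect 5.0

        std::free(x);
        std::free(y);
        return 0;
    }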
GeantV: Results from the prototype of concurrent vector particle transport simulation in HEP
Full detector simulation was among the largest CPU consumers in all CERN
experiment software stacks for the first two runs of the Large Hadron Collider
(LHC). In the early 2010s, the projections were that simulation demands would
scale linearly with increasing luminosity, compensated only partially by an
increase of computing resources. The extension of fast simulation approaches to
more use cases, covering a larger fraction of the simulation budget, is only
part of the solution due to intrinsic precision limitations. The remainder
must come from speeding up the simulation software by several factors, which is
out of reach using simple optimizations on the current code base. In this
context, the GeantV R&D project was launched, aiming to redesign the legacy
particle transport codes in order to make them benefit from fine-grained
parallelism features such as vectorization, but also from increased code and
data locality. This paper presents in detail the results and achievements of
this R&D, as well as the conclusions and lessons learnt from the beta
prototype.
Comment: 34 pages, 26 figures, 24 tables
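The data-locality and vectorization goals above are commonly met by grouping
tracks in a structure-of-arrays layout, so that a propagation step becomes a
contiguous, vectorizable loop. The sketch below shows that layout in generic
form; the TrackBasket type and its fields are hypothetical and do not reflect
GeantV's actual data model.

    // Hedged sketch of a structure-of-arrays (SoA) track layout that enables
    // fine-grained vectorization; field names are illustrative only.
    #include <vector>
    #include <cstdio>

    // SoA: each coordinate lives in its own contiguous array, so a propagation
    // step over many tracks becomes a simple vectorizable loop.
    struct TrackBasket {
        std::vector<double> x, y, z;    // positions
        std::vector<double> px, py, pz; // direction components
    };

    void propagate(TrackBasket& t, double step) {
        const std::size_t n = t.x.size();
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i) {
            t.x[i] += step * t.px[i];
            t.y[i] += step * t.py[i];
            t.z[i] += step * t.pz[i];
        }
    }

    int main() {
        TrackBasket basket;
        for (int i = 0; i < 1024; ++i) {
            basket.x.push_back(0.0); basket.y.push_back(0.0); basket.z.push_back(0.0);
            basket.px.push_back(1.0); basket.py.push_back(0.0); basket.pz.push_back(0.0);
        }
        propagate(basket, 0.5);
        std::printf("track 0 moved to x = %f\n", basket.x[0]);
        return 0;
    }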
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations
The molecular dynamics simulation package GROMACS runs efficiently on a wide
variety of hardware from commodity workstations to high performance computing
clusters. Hardware features are well exploited with a combination of SIMD,
multi-threading, and MPI-based SPMD/MPMD parallelism, while GPUs can be used as
accelerators to compute interactions offloaded from the CPU. Here we evaluate
which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most
economical way. We have assembled and benchmarked compute nodes with various
CPU/GPU combinations to identify optimal compositions in terms of raw
trajectory production rate, performance-to-price ratio, energy efficiency, and
several other criteria. Though hardware prices are naturally subject to trends
and fluctuations, general tendencies are clearly visible. Adding any type of
GPU significantly boosts a node's simulation performance. For inexpensive
consumer-class GPUs this improvement is equally reflected in the
performance-to-price ratio. Although memory issues in consumer-class GPUs could
pass unnoticed since these cards do not support ECC memory, unreliable GPUs can
be sorted out with memory checking tools. Apart from the obvious determinants
for cost-efficiency like hardware expenses and raw performance, the energy
consumption of a node is a major cost factor. Over a typical hardware lifetime
of a few years until replacement, the costs for electrical power and cooling
can exceed the cost of the hardware itself. Taking that
into account, nodes with a well-balanced ratio of CPU and consumer-class GPU
resources produce the maximum amount of GROMACS trajectory over their lifetime.
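A back-of-envelope calculation makes the lifetime-cost argument concrete. The
sketch below multiplies an assumed node power draw, cooling overhead,
electricity price, and service life to compare operating cost against purchase
price; all numbers are illustrative assumptions, not figures from the study.

    // Hedged back-of-envelope calculation behind the lifetime-cost argument.
    // All numbers (node price, power draw, electricity rate, lifetime) are
    // illustrative assumptions.
    #include <cstdio>

    int main() {
        const double node_price_eur   = 2500.0;  // assumed purchase price
        const double node_power_kw    = 0.5;     // assumed average draw incl. GPU
        const double cooling_overhead = 1.5;     // PUE-like factor for cooling
        const double eur_per_kwh      = 0.25;    // assumed electricity price
        const double lifetime_years   = 4.0;     // assumed replacement cycle

        const double hours   = lifetime_years * 365.0 * 24.0;
        const double energy  = node_power_kw * cooling_overhead * hours;   // kWh
        const double op_cost = energy * eur_per_kwh;

        // With these inputs, power and cooling (~6570 EUR) exceed the hardware price.
        std::printf("hardware: %.0f EUR, power+cooling over %.0f years: %.0f EUR\n",
                    node_price_eur, lifetime_years, op_cost);
        return 0;
    }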
Parallel performance results for the OpenMOC neutron transport code on multicore platforms
The shift toward multicore architectures has ushered in a new era of shared-memory
parallelism for scientific applications. This transition has introduced challenges
for the nuclear engineering community as it seeks to design high-fidelity full-core
reactor physics simulation tools. This article describes the parallel transport
sweep algorithm in the OpenMOC method of characteristics (MOC) neutron transport
code for multicore platforms using OpenMP. Strong and weak scaling studies are
performed for both Intel Xeon and IBM Blue Gene/Q (BG/Q) multicore processors.
The results demonstrate 100% parallel efficiency with 12 threads on 12 cores on
Intel Xeon platforms and over 90% parallel efficiency with 64 threads on 16 cores
on the IBM BG/Q. These results illustrate the potential for hardware acceleration
of MOC neutron transport on modern multicore and future many-core architectures.
In addition, this work highlights the pitfalls of programming for multicore
architectures, with a focal point on false sharing.
Funding: National Science Foundation (U.S.), Graduate Research Fellowship Program
(Grant 1122374); United States Department of Energy, Center for Exascale Simulation
of Advanced Reactors (Contract DE-AC02-06CH11357)
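The false-sharing pitfall mentioned above arises when per-thread accumulators
sit next to each other in memory and therefore share cache lines, so every
write by one thread invalidates its neighbours' cached copies. The generic
OpenMP sketch below contrasts that layout with one padded to the cache-line
size; it illustrates the pitfall and its standard fix, and is not code from
OpenMOC.

    // Hedged illustration of false sharing and the padding fix (generic OpenMP
    // example, not OpenMOC code).
    #include <omp.h>
    #include <cstdio>

    constexpr int kMaxThreads = 64;
    constexpr int kCacheLine  = 64;  // bytes; typical x86 cache-line size

    // Naive layout: adjacent doubles share cache lines -> false sharing.
    double naive[kMaxThreads];

    // Padded layout: each per-thread accumulator owns a full cache line.
    struct alignas(kCacheLine) PaddedDouble { double value = 0.0; };
    PaddedDouble padded[kMaxThreads];

    int main() {
        const long iters = 20'000'000;

        #pragma omp parallel
        {
            const int tid = omp_get_thread_num();

            // This loop suffers false sharing: every increment dirties a cache
            // line that neighbouring threads are also writing.
            for (long i = 0; i < iters; ++i)
                naive[tid] += 1.0;

            // Same work, but each thread writes to its own cache line.
            for (long i = 0; i < iters; ++i)
                padded[tid].value += 1.0;
        }

        std::printf("naive[0] = %.0f, padded[0] = %.0f\n", naive[0], padded[0].value);
        return 0;
    }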