7,135 research outputs found
Massively Parallel Computing at the Large Hadron Collider up to the HL-LHC
As the Large Hadron Collider (LHC) continues its upward progression in energy
and luminosity towards the planned High-Luminosity LHC (HL-LHC) in 2025, the
challenges of the experiments in processing increasingly complex events will
also continue to increase. Improvements in computing technologies and
algorithms will be a key part of the advances necessary to meet this challenge.
Parallel computing techniques, especially those using massively parallel
computing (MPC), promise to be a significant part of this effort. In these
proceedings, we discuss these algorithms in the specific context of a
particularly important problem: the reconstruction of charged particle tracks
in the trigger algorithms in an experiment, in which high computing performance
is critical for executing the track reconstruction in the available time. We
discuss some areas where parallel computing has already shown benefits to the
LHC experiments, and also demonstrate how a MPC-based trigger at the CMS
experiment could not only improve performance, but also extend the reach of the
CMS trigger system to capture events which are currently not practical to
reconstruct at the trigger level.Comment: 14 pages, 6 figures. Proceedings of 2nd International Summer School
on Intelligent Signal Processing for Frontier Research and Industry
(INFIERI2014), to appear in JINST. Revised version in response to referee
comment
Performance and Power Analysis of HPC Workloads on Heterogenous Multi-Node Clusters
Performance analysis tools allow application developers to identify and characterize the inefficiencies that cause performance degradation in their codes, allowing for application optimizations. Due to the increasing interest in the High Performance Computing (HPC) community towards energy-efficiency issues, it is of paramount importance to be able to correlate performance and power figures within the same profiling and analysis tools. For this reason, we present a performance and energy-efficiency study aimed at demonstrating how a single tool can be used to collect most of the relevant metrics. In particular, we show how the same analysis techniques can be applicable on different architectures, analyzing the same HPC application on a high-end and a low-power cluster. The former cluster embeds Intel Haswell CPUs and NVIDIA K80 GPUs, while the latter is made up of NVIDIA Jetson TX1 boards, each hosting an Arm Cortex-A57 CPU and an NVIDIA Tegra X1 Maxwell GPU.The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects [17], grant agreements n. 288777, 610402 and 671697. E.C. was partially founded by “Contributo 5 per mille assegnato all’Università degli Studi di Ferrara-dichiarazione dei redditi dell’anno 2014”. We thank the University of Ferrara and INFN Ferrara for the access to the COKA Cluster. We warmly thank the BSC tools group, supporting us for the smooth integration and test of our setup within Extrae and Paraver.Peer ReviewedPostprint (published version
A sparse octree gravitational N-body code that runs entirely on the GPU processor
We present parallel algorithms for constructing and traversing sparse octrees
on graphics processing units (GPUs). The algorithms are based on parallel-scan
and sort methods. To test the performance and feasibility, we implemented them
in CUDA in the form of a gravitational tree-code which completely runs on the
GPU.(The code is publicly available at:
http://castle.strw.leidenuniv.nl/software.html) The tree construction and
traverse algorithms are portable to many-core devices which have support for
CUDA or OpenCL programming languages. The gravitational tree-code outperforms
tuned CPU code during the tree-construction and shows a performance improvement
of more than a factor 20 overall, resulting in a processing rate of more than
2.8 million particles per second.Comment: Accepted version. Published in Journal of Computational Physics. 35
pages, 12 figures, single colum
Parallel Tempering Simulation of the three-dimensional Edwards-Anderson Model with Compact Asynchronous Multispin Coding on GPU
Monte Carlo simulations of the Ising model play an important role in the
field of computational statistical physics, and they have revealed many
properties of the model over the past few decades. However, the effect of
frustration due to random disorder, in particular the possible spin glass
phase, remains a crucial but poorly understood problem. One of the obstacles in
the Monte Carlo simulation of random frustrated systems is their long
relaxation time making an efficient parallel implementation on state-of-the-art
computation platforms highly desirable. The Graphics Processing Unit (GPU) is
such a platform that provides an opportunity to significantly enhance the
computational performance and thus gain new insight into this problem. In this
paper, we present optimization and tuning approaches for the CUDA
implementation of the spin glass simulation on GPUs. We discuss the integration
of various design alternatives, such as GPU kernel construction with minimal
communication, memory tiling, and look-up tables. We present a binary data
format, Compact Asynchronous Multispin Coding (CAMSC), which provides an
additional speedup compared with the traditionally used Asynchronous
Multispin Coding (AMSC). Our overall design sustains a performance of 33.5
picoseconds per spin flip attempt for simulating the three-dimensional
Edwards-Anderson model with parallel tempering, which significantly improves
the performance over existing GPU implementations.Comment: 15 pages, 18 figure
- …