17 research outputs found
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some application paradigms. Software languages and systems such as NVIDIA's
CUDA and the Khronos consortium's Open Computing Language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
many-core GPUs for data parallelism is necessary. We describe our use of hybrid applica-
tions using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the application-
level programmer, and offer some suggested areas for language development and integration
between coarse-grained and fine-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas: partial differential equations; graph cluster
metric calculations; and random number generation. We report on programming experiences
and selected performance for these algorithms on single and multiple GPUs, multi-core CPUs,
a CellBE, and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scientific applications developers.
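The coarse-grained pattern described above, host threads controlling independent GPU devices, can be sketched as follows. This is an illustrative sketch only: the function names are ours, and the "device work" is simulated on the CPU; a real CUDA version would bind each thread to its device (e.g. via cudaSetDevice) before launching kernels.

```python
# Sketch: one host thread per accelerator device, each driving its own
# (here simulated) device; the coarse-grained split is done on the host.
import threading

def run_on_device(device_id, chunk, results, lock):
    # Stand-in for launching a data-parallel kernel on device `device_id`.
    partial = sum(x * x for x in chunk)
    with lock:
        results[device_id] = partial

def hybrid_sum_of_squares(data, num_devices):
    """Split `data` into one contiguous chunk per device thread."""
    chunk_size = (len(data) + num_devices - 1) // num_devices
    results, lock, threads = {}, threading.Lock(), []
    for dev in range(num_devices):
        chunk = data[dev * chunk_size:(dev + 1) * chunk_size]
        t = threading.Thread(target=run_on_device,
                             args=(dev, chunk, results, lock))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return sum(results.values())
```

Each host thread owns exactly one device, which avoids contention on a single device context while the per-device work itself is data-parallel.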
COMET-GPU: A GPGPU-Enabled Deterministic Solver for the Continuous-Energy Coarse Mesh Transport Method (COMET)
The Continuous-Energy Coarse Mesh Transport (COMET) method is a neutron transport solution method that uses a unique hybrid stochastic-deterministic approach to obtain high-fidelity whole-core solutions to reactor physics problems with formidable speed. This method involves pre-computing solutions to individual coarse meshes within the global problem, then using a deterministic transport sweep to construct a whole-core solution from these local solutions. In this work, a new implementation of the deterministic transport sweep solver is written which includes the ability to accelerate the calculation using up to 4 Graphics Processing Units (GPUs) on one computational node. To demonstrate the new implementation, three whole-core benchmark problems were solved using the previous serial solver and various configurations of the new solver, with the relative performance compared. In this comparison, it was found that the application of one GPU to the problem resulted in between a 100x-150x speedup (depending on the specific problem) relative to the old serial solver. Excellent scaling up to 4 GPUs was observed, which brought the total speedup to 450x-500x. As an example of a new type of analysis which is enabled by the improved speed of the solver, a sensitivity study was performed on the convergence thresholds used in the inner and outer iteration processes. This study involves repeatedly solving problems using slightly varying thresholds, including computing a “gold-standard” solution to double-precision. These various runs would be prohibitively expensive if run using the old solver, but in this work were completed in around an hour.
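The scaling behaviour reported above can be checked with a line of arithmetic: parallel efficiency is the multi-GPU speedup divided by the ideal linear extrapolation of the single-GPU speedup. Using the midpoints of the quoted ranges (illustrative values, not exact figures from the work):

```python
# Parallel efficiency = speedup on N GPUs / (N * speedup on 1 GPU).
def parallel_efficiency(speedup_n, n_gpus, speedup_1):
    return speedup_n / (n_gpus * speedup_1)

# Midpoints of the ranges quoted in the abstract: ~475x on 4 GPUs vs ~125x on 1.
eff = parallel_efficiency(475.0, 4, 125.0)  # ~0.95, i.e. near-linear scaling
```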
AUTOMATING DATA-LAYOUT DECISIONS IN DOMAIN-SPECIFIC LANGUAGES
A long-standing challenge in High-Performance Computing (HPC) is the simultaneous achievement of programmer productivity and hardware computational efficiency. The challenge has been exacerbated by the onset of multi- and many-core CPUs and accelerators. Only a few expert programmers have been able to hand-code domain-specific data transformations and vectorization schemes needed to extract the best possible performance on such architectures. In this research, we examined the possibility of automating these methods by developing a Domain-Specific Language (DSL) framework. Our DSL approach extends C++14 by embedding into it a high-level data-parallel array language, and by using a domain-specific compiler to compile to hybrid-parallel code. We also implemented an array index-space transformation algebra within this high-level array language to manipulate array data-layouts and data-distributions. The compiler introduces a novel method for SIMD auto-vectorization based on array data-layouts. Our new auto-vectorization technique is shown to outperform the default auto-vectorization strategy by up to 40% for stencil computations. The compiler also automates distributed data movement with overlapping of local compute with remote data movement using polyhedral integer set analysis. Along with these main innovations, we developed a new technique using C++ template metaprogramming for developing embedded DSLs using C++. We also proposed a domain-specific compiler intermediate representation that simplifies data flow analysis of abstract DSL constructs. We evaluated our framework by constructing a DSL for the HPC grand-challenge domain of lattice quantum chromodynamics. Our DSL yielded performance gains of up to twice the flop rate over existing production C code for selected kernels. This gain in performance was obtained while using less than one-tenth the lines of code. 
The performance of this DSL was also competitive with the best hand-optimized and hand-vectorized code, and an order of magnitude better than that of existing production DSLs.
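The core idea of an index-space transformation algebra, that one logical array index can map to different linear storage offsets depending on the chosen data layout, can be sketched briefly. This is a minimal illustration in terms of axis permutations, not the DSL's actual algebra; the function names are ours.

```python
# Sketch: the same logical index maps to different flat offsets depending
# on the data layout, here modelled as a permutation of the axes.
def strides_for(shape, perm):
    """Row-major strides after permuting the axes by `perm`."""
    permuted = [shape[p] for p in perm]
    strides_p = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides_p[i] = strides_p[i + 1] * permuted[i + 1]
    # Undo the permutation so strides line up with the logical axes.
    strides = [0] * len(shape)
    for i, p in enumerate(perm):
        strides[p] = strides_p[i]
    return strides

def offset(index, strides):
    """Flat storage offset of a logical multi-index under given strides."""
    return sum(i * s for i, s in zip(index, strides))
```

For a 4x8 array, the identity permutation gives the usual row-major strides (8, 1), while swapping the axes gives column-major strides (1, 4); a compiler that owns this mapping can change the layout without touching the logical index expressions.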
Fast algorithm for real-time rings reconstruction
The GAP project is dedicated to studying the application of GPUs in several contexts in which
real-time response is important for taking decisions. The definition of real-time depends on
the application under study, ranging from response times of μs up to several hours in the case
of very compute-intensive tasks. During this conference we presented our work on low-
level triggers [1] [2] and high-level triggers [3] in high-energy physics experiments, and
specific applications for nuclear magnetic resonance (NMR) [4] [5] and cone-beam CT [6].
Apart from the study of dedicated solutions to decrease the latency due to data transport
and preparation, the computing algorithms play an essential role in any GPU application.
In this contribution, we show an original algorithm developed for trigger applications to
accelerate ring reconstruction in a RICH detector when it is not possible to obtain seeds
for reconstruction from external trackers.
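When no external seed is available, one textbook building block for forming ring candidates is the circumscribed circle through any three hits. The sketch below shows this primitive only; it is not the original algorithm of the contribution, and the function name is illustrative.

```python
# Sketch: a seedless ring candidate from three detector hits via the
# standard circumcircle formula; collinear hits yield no circle.
def circle_from_three_hits(p1, p2, p3):
    """Return (cx, cy, r) of the circle through three non-collinear points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2.0 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-12:
        return None  # collinear hits: no ring
    cx = ((x1**2 + y1**2) * (y2 - y3) + (x2**2 + y2**2) * (y3 - y1)
          + (x3**2 + y3**2) * (y1 - y2)) / d
    cy = ((x1**2 + y1**2) * (x3 - x2) + (x2**2 + y2**2) * (x1 - x3)
          + (x3**2 + y3**2) * (x2 - x1)) / d
    r = ((x1 - cx)**2 + (y1 - cy)**2) ** 0.5
    return cx, cy, r
```

Because each hit triplet is independent, such per-triplet computations map naturally onto one GPU thread each, which is what makes this class of reconstruction attractive for low-latency triggers.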
Strongly Coupled Theories in Lattice Coulomb Gauge
Quantum chromodynamics, despite its simple and elegant formulation at the Lagrangian
level and numerous experimental verifications, still poses many interesting questions to
particle physicists in the region where perturbation theory breaks down. The origin of
confinement of quarks and gluons is one of these big puzzles. Analytic techniques, based
on Dyson–Schwinger equations or the variational approach, have proven to be useful tools
to study the non-perturbative aspects of field theories. Of the latter, the Hamiltonian
approach in Coulomb gauge offers an appealing physical interpretation of two-point func-
tions of the theory. In recent years, as numerical algorithms improved and more and
more compute power became available to the physics community, lattice gauge theory, a
fully numerical approach, has become established as the main tool for studies in the non-
perturbative sector of field theories. A verification of these different approaches against
each other is of great interest to learn about their limitations.
In the first part of this work we will study the correlation functions of pure SU(2) Yang–
Mills theory at zero and finite temperature. After an introduction to QCD and lattice
gauge theory, we will discuss the Gribov problem and investigate a recent proposal to
resolve it. Then we will turn on temperature to study the deconfinement phase transition.
Based on the center vortex picture of confinement, we will propose an answer to the
question of why the correlators from lattice gauge theory in Coulomb gauge fail to detect
the phase transition. Afterwards we will leave pure Yang–Mills theory and apply our
knowledge to the so-called Minimal Walking Technicolor theory, a possible extension to
the Standard Model. Finally we discuss how lattice gauge theory applications can be
implemented efficiently on graphics processing units used nowadays in high performance
computing.
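The efficient GPU implementation discussed in the final chapter rests on compact storage of the gauge field. A minimal sketch, assuming the standard quaternion representation of SU(2) (U = a0*I + i*(a·sigma), stored as four reals with unit norm), shows the elementary plaquette observable; the function names are illustrative and this is not the thesis's actual code.

```python
# Sketch: SU(2) group elements as unit quaternions (a0, a1, a2, a3);
# the product picks up a minus sign on the cross-product term because
# U = a0*I + i*(a . sigma) and sigma_j sigma_k = delta_jk I + i eps_jkl sigma_l.
def su2_mul(u, v):
    """Product of two SU(2) elements in quaternion form."""
    a0, a1, a2, a3 = u
    b0, b1, b2, b3 = v
    return (a0*b0 - a1*b1 - a2*b2 - a3*b3,
            a0*b1 + b0*a1 - (a2*b3 - a3*b2),
            a0*b2 + b0*a2 - (a3*b1 - a1*b3),
            a0*b3 + b0*a3 - (a1*b2 - a2*b1))

def su2_dagger(u):
    """Hermitian conjugate: negate the vector part."""
    a0, a1, a2, a3 = u
    return (a0, -a1, -a2, -a3)

def plaquette_trace(u_mu_x, u_nu_xmu, u_mu_xnu, u_nu_x):
    """tr[U_mu(x) U_nu(x+mu) U_mu(x+nu)^dag U_nu(x)^dag] = 2 * a0 of the product."""
    p = su2_mul(su2_mul(u_mu_x, u_nu_xmu),
                su2_mul(su2_dagger(u_mu_xnu), su2_dagger(u_nu_x)))
    return 2.0 * p[0]
```

Storing four reals instead of eight (for a full complex 2x2 matrix) halves the memory traffic, which matters on bandwidth-limited accelerators; for a cold (all-identity) configuration the plaquette trace is 2, the SU(2) maximum.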
Hypercubic storage layout and transforms in arbitrary dimensions using GPUs and CUDA
Many simulations in the physical sciences are expressed in terms of rectilinear arrays of variables. It is attractive to develop such simulations for use in 1-, 2-, 3- or arbitrary physical dimensions, and also in a manner that supports exploitation of data-parallelism on fast modern processing devices. We report on data layouts and transformation algorithms that support both conventional and data-parallel memory layouts. We present our implementations expressed both in conventional serial C code and in NVIDIA's Compute Unified Device Architecture concurrent programming language for use on general-purpose graphical processing units. We discuss general memory layouts; specific optimizations possible for dimensions that are powers of two; and common transformations such as inverting, shifting and crinkling. We present performance data for some illustrative scientific applications of these layouts and transforms using several current GPU devices, and discuss the code and speed scalability of this approach.
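One of the transforms named above, a cyclic shift of a rectilinear array in arbitrary dimension, can be sketched purely as index arithmetic on the flat row-major storage. This is our own illustrative sketch of the general technique, not the paper's C or CUDA implementation.

```python
# Sketch: cyclic shift of an N-dimensional rectilinear array done entirely
# on the flat (row-major) storage by decomposing and recomposing indices.
def row_major_strides(shape):
    """Row-major strides for a given hypercubic shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def cyclic_shift(flat, shape, offsets):
    """Move the element at multi-index i to (i + offsets) mod shape."""
    strides = row_major_strides(shape)
    out = [None] * len(flat)
    for lin in range(len(flat)):
        rem, dst = lin, 0
        for s, n, off in zip(strides, shape, offsets):
            coord = rem // s          # decompose the linear index
            rem %= s
            dst += ((coord + off) % n) * s  # shift and recompose
        out[dst] = flat[lin]
    return out
```

For power-of-two dimensions the divisions and moduli reduce to shifts and masks, which is one of the specific optimizations the paper discusses; the index-only formulation also maps directly onto one GPU thread per element.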
Investigation of hadron matter using lattice QCD and implementation of lattice QCD applications on heterogeneous multicore acceleration processors
Observables relevant for the understanding of the structure of baryons
were determined by means of Monte Carlo simulations of Lattice Quantum
Chromodynamics (QCD) using 2+1 dynamical quark flavours. Special
emphasis was placed on how these observables change when flavour
symmetry is broken, in comparison to choosing equal masses for the two
light quarks and the strange quark. The first two moments of unpolarised,
longitudinally, and transversely polarised parton distribution
functions were calculated for the nucleon and hyperons; the latter are
baryons which contain a strange quark.
Lattice QCD simulations tend to be extremely expensive, reaching the
need for petaflop computing and beyond, a regime of computing power we
are only just reaching today. Heterogeneous multicore computing is becoming
increasingly important in high-performance scientific computing. The
strategy of deploying multiple types of processing elements within a
single workflow, allowing each to perform the tasks to which it is
best suited, is likely to be part of the roadmap to exascale. In this
work new design concepts were developed for an active library (QDP++)
harnessing the compute power of a heterogeneous multicore processor
(the IBM PowerXCell 8i). Not only is a proof of concept given; it was
furthermore possible to run a QDP++-based physics application
(Chroma) and achieve reasonable performance on the IBM BladeCenter QS22.
Review of particle physics
The Review summarizes much of particle physics and cosmology. Using data from previous editions, plus 3,283 new measurements from 899 papers, we list, evaluate, and average measured properties of gauge bosons and the recently discovered Higgs boson, leptons, quarks, mesons, and baryons. We summarize searches for hypothetical particles such as heavy neutrinos, supersymmetric and technicolor particles, axions, dark photons, etc. All the particle properties and search limits are listed in Summary Tables. We also give numerous tables, figures, formulae, and reviews of topics such as Supersymmetry, Extra Dimensions, Particle Detectors, Probability, and Statistics. Among the 112 reviews are many that are new or heavily revised including those on: Dark Energy, Higgs Boson Physics, Electroweak Model, Neutrino Cross Section Measurements, Monte Carlo Neutrino Generators, Top Quark, Dark Matter, Dynamical Electroweak Symmetry Breaking, Accelerator Physics of Colliders, High-Energy Collider Parameters, Big Bang Nucleosynthesis, Astrophysical Constants and Cosmological Parameters. A booklet is available containing the Summary Tables and abbreviated versions of some of the other sections of this full Review. All tables, listings, and reviews (and errata) are also available on the Particle Data Group website: http://pdg.lbl.gov