17 research outputs found

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some application paradigms. Software languages and systems such as NVIDIA's CUDA and the Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very-many-core GPUs for data parallelism is necessary. We describe our use of hybrid applications using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications-level programmer, and offer some suggested areas for language development and integration between coarse-grained and fine-grained multi-thread systems. We discuss results from three common simulation algorithmic areas: partial differential equations; graph cluster metric calculations; and random number generation. We report on programming experiences and selected performance for these algorithms on single and multiple GPUs, multi-core CPUs, a CellBE, and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scientific applications developers.
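
    The control pattern described above, host threads on a multi-core CPU each driving an independent GPU, can be pictured with a minimal CUDA sketch. Everything here is illustrative rather than the authors' code: the kernel, sizes and names are assumptions, but one host thread bound to one device via cudaSetDevice was the standard idiom for multi-GPU CUDA of that era.

        #include <cuda_runtime.h>
        #include <cstdio>
        #include <thread>
        #include <vector>

        // Illustrative kernel: fine-grained data parallelism on one device.
        __global__ void scale(float* x, int n, float s) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) x[i] *= s;
        }

        // One host thread per GPU: the coarse-grained side of the hybrid.
        void worker(int dev, int n) {
            cudaSetDevice(dev);                        // bind this thread to one device
            float* d = nullptr;
            cudaMalloc(&d, n * sizeof(float));
            cudaMemset(d, 0, n * sizeof(float));
            scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
            cudaDeviceSynchronize();                   // wait for this device's work
            cudaFree(d);
            std::printf("device %d finished\n", dev);
        }

        int main() {
            int count = 0;
            cudaGetDeviceCount(&count);
            std::vector<std::thread> pool;
            for (int dev = 0; dev < count; ++dev)      // coarse-grained CPU parallelism
                pool.emplace_back(worker, dev, 1 << 20);
            for (auto& t : pool) t.join();
            return 0;
        }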

    COMET-GPU: A GPGPU-Enabled Deterministic Solver for the Continuous-Energy Coarse Mesh Transport Method (COMET)

    The Continuous-Energy Coarse Mesh Transport (COMET) method is a neutron transport solution method that uses a unique hybrid stochastic-deterministic approach to obtain high-fidelity whole-core solutions to reactor physics problems with formidable speed. This method involves pre-computing solutions to individual coarse meshes within the global problem, then using a deterministic transport sweep to construct a whole-core solution from these local solutions. In this work, a new implementation of the deterministic transport sweep solver is written which can accelerate the calculation using up to 4 Graphics Processing Units (GPUs) on one computational node. To demonstrate the new implementation, three whole-core benchmark problems were solved using the previous serial solver and various configurations of the new solver, and the relative performance was compared. In this comparison, it was found that applying one GPU to the problem yielded a 100x-150x speedup (depending on the specific problem) relative to the old serial solver. Excellent scaling up to 4 GPUs was observed, which brought the total speedup to 450x-500x. As an example of a new type of analysis enabled by the improved speed of the solver, a sensitivity study was performed on the convergence thresholds used in the inner and outer iteration processes. This study involves repeatedly solving problems using slightly varying thresholds, including computing a “gold-standard” solution converged to double precision. These runs would be prohibitively expensive with the old solver, but in this work were completed in around an hour.
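
    The threshold sensitivity study lends itself to a small illustration. The sketch below is not the COMET solver; it is a toy two-level iteration (an inverse power iteration standing in for the outer eigenvalue loop, with an inexact stationary inner solve) showing how loosened inner/outer tolerances can be compared against a tightly converged gold-standard run. All names and the 2x2 matrix are assumptions.

        #include <cmath>
        #include <cstdio>

        // Toy stand-in for the global problem: a 2x2 SPD matrix.
        static const double A[2][2] = {{4.0, 1.0}, {1.0, 3.0}};

        // Inner loop: stationary (Gauss-Seidel) iteration for A x = b to eps_inner.
        void inner_solve(const double b[2], double x[2], double eps_inner) {
            x[0] = x[1] = 0.0;
            for (;;) {
                double x0 = (b[0] - A[0][1] * x[1]) / A[0][0];
                double x1 = (b[1] - A[1][0] * x0) / A[1][1];
                double d = std::fabs(x0 - x[0]) + std::fabs(x1 - x[1]);
                x[0] = x0; x[1] = x1;
                if (d < eps_inner) return;
            }
        }

        // Outer loop: inverse power iteration for the smallest eigenvalue to eps_outer.
        double smallest_eig(double eps_inner, double eps_outer) {
            double v[2] = {1.0, 1.0}, lambda = 0.0;
            for (;;) {
                double w[2];
                inner_solve(v, w, eps_inner);              // inexact inner solve
                double n = std::sqrt(w[0] * w[0] + w[1] * w[1]);
                w[0] /= n; w[1] /= n;
                double Aw0 = A[0][0] * w[0] + A[0][1] * w[1];
                double Aw1 = A[1][0] * w[0] + A[1][1] * w[1];
                double next = w[0] * Aw0 + w[1] * Aw1;     // Rayleigh quotient
                if (std::fabs(next - lambda) < eps_outer) return next;
                lambda = next;
                v[0] = w[0]; v[1] = w[1];
            }
        }

        int main() {
            double gold = smallest_eig(1e-14, 1e-14);      // "gold-standard" tight run
            for (double eps : {1e-4, 1e-6, 1e-8})          // vary both thresholds
                std::printf("eps=%.0e  error vs gold=%.3e\n", eps,
                            std::fabs(smallest_eig(eps, eps) - gold));
            return 0;
        }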

    AUTOMATING DATA-LAYOUT DECISIONS IN DOMAIN-SPECIFIC LANGUAGES

    A long-standing challenge in High-Performance Computing (HPC) is the simultaneous achievement of programmer productivity and hardware computational efficiency. The challenge has been exacerbated by the onset of multi- and many-core CPUs and accelerators. Only a few expert programmers have been able to hand-code the domain-specific data transformations and vectorization schemes needed to extract the best possible performance on such architectures. In this research, we examined the possibility of automating these methods by developing a Domain-Specific Language (DSL) framework. Our DSL approach extends C++14 by embedding into it a high-level data-parallel array language, and by using a domain-specific compiler to compile to hybrid-parallel code. We also implemented an array index-space transformation algebra within this high-level array language to manipulate array data-layouts and data-distributions. The compiler introduces a novel method for SIMD auto-vectorization based on array data-layouts. Our new auto-vectorization technique is shown to outperform the default auto-vectorization strategy by up to 40% for stencil computations. The compiler also automates distributed data movement, overlapping local compute with remote data movement using polyhedral integer set analysis. Along with these main innovations, we developed a new technique using C++ template metaprogramming for developing embedded DSLs in C++. We also proposed a domain-specific compiler intermediate representation that simplifies data-flow analysis of abstract DSL constructs. We evaluated our framework by constructing a DSL for the HPC grand-challenge domain of lattice quantum chromodynamics. Our DSL yielded performance gains of up to twice the flop rate of existing production C code for selected kernels, while using less than one-tenth the lines of code. The performance of this DSL was also competitive with the best hand-optimized and hand-vectorized code, and an order of magnitude better than existing production DSLs.
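
    As a hedged illustration of the layout idea (not the DSL's actual API; the type and function names here are invented), the sketch below writes one stencil against logical indices and lets a layout object map them to storage offsets. Swapping the row-major layout for a vector-blocked one changes the memory access pattern, and hence vectorizability, without touching the kernel.

        #include <cstdio>
        #include <vector>

        // A layout maps a logical 2-D index to a storage offset.
        struct RowMajor {
            int nx, ny;
            int operator()(int x, int y) const { return x * ny + y; }
        };

        // Vector-blocked: the fastest axis is split into W lanes, stored as
        // lane-major planes, so logically adjacent y values land in separate
        // SIMD lanes. Requires ny to be a multiple of W.
        template <int W>
        struct Blocked {
            int nx, ny;
            int operator()(int x, int y) const {
                int lane = y % W, blk = y / W;
                return lane * (nx * (ny / W)) + x * (ny / W) + blk;
            }
        };

        // One stencil written once against logical indices; the layout decides
        // how those indices hit memory.
        template <class Layout>
        void stencil(const std::vector<float>& in, std::vector<float>& out, Layout L) {
            for (int x = 1; x < L.nx - 1; ++x)
                for (int y = 0; y < L.ny; ++y)
                    out[L(x, y)] = 0.5f * (in[L(x - 1, y)] + in[L(x + 1, y)]);
        }

        int main() {
            const int nx = 8, ny = 8;
            RowMajor rm{nx, ny};
            Blocked<4> bl{nx, ny};
            std::vector<float> a(nx * ny, 1.0f), r1(nx * ny, 0.0f), r2(nx * ny, 0.0f);
            stencil(a, r1, rm);
            stencil(a, r2, bl);
            std::printf("row-major %f, blocked %f\n", r1[rm(1, 0)], r2[bl(1, 0)]);
            return 0;
        }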

    Fast algorithm for real-time rings reconstruction

    The GAP project is dedicated to studying the application of GPUs in several contexts in which real-time response is important for decision making. The definition of real-time depends on the application under study, ranging from response times of microseconds up to several hours for very compute-intensive tasks. During this conference we presented our work on low-level triggers [1] [2] and high-level triggers [3] in high-energy physics experiments, and specific applications for nuclear magnetic resonance (NMR) [4] [5] and cone-beam CT [6]. Apart from the study of dedicated solutions to decrease the latency due to data transport and preparation, the computing algorithms play an essential role in any GPU application. In this contribution, we show an original algorithm developed for trigger applications, which accelerates ring reconstruction in RICH detectors when seeds for reconstruction from external trackers are not available.
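
    The original algorithm itself is not reproduced here, but a classic seedless ring finder, the circle Hough transform, conveys the idea: every candidate centre accumulates a histogram of hit radii, and a ring shows up as one heavily voted bin. A minimal CUDA sketch with synthetic data follows; the grid and binning constants are assumptions.

        #include <cuda_runtime.h>
        #include <cstdio>
        #include <cmath>

        #define GRID  64        // candidate-centre grid is GRID x GRID
        #define RBINS 32        // radius histogram bins
        #define XYMAX 10.0f     // detector plane spans [0, XYMAX) in x and y
        #define RMAX  5.0f      // largest radius considered

        // One thread per candidate centre; it votes for the radius each hit
        // implies. The histogram is private to the thread, so no atomics.
        __global__ void hough(const float2* hits, int nhits, unsigned* votes) {
            int cx = blockIdx.x * blockDim.x + threadIdx.x;
            int cy = blockIdx.y * blockDim.y + threadIdx.y;
            if (cx >= GRID || cy >= GRID) return;
            float x0 = (cx + 0.5f) * (XYMAX / GRID);
            float y0 = (cy + 0.5f) * (XYMAX / GRID);
            unsigned* mine = votes + (cy * GRID + cx) * RBINS;
            for (int h = 0; h < nhits; ++h) {
                float dx = hits[h].x - x0, dy = hits[h].y - y0;
                int rb = (int)(sqrtf(dx * dx + dy * dy) * (RBINS / RMAX));
                if (rb < RBINS) mine[rb]++;
            }
        }

        int main() {
            const int nhits = 64;
            float2 h[nhits];
            for (int i = 0; i < nhits; ++i) {    // synthetic ring: centre (5,5), r = 3
                float a = 6.2831853f * i / nhits;
                h[i] = make_float2(5.0f + 3.0f * cosf(a), 5.0f + 3.0f * sinf(a));
            }
            float2* dh; unsigned* dv;
            cudaMalloc(&dh, sizeof(h));
            cudaMalloc(&dv, GRID * GRID * RBINS * sizeof(unsigned));
            cudaMemcpy(dh, h, sizeof(h), cudaMemcpyHostToDevice);
            cudaMemset(dv, 0, GRID * GRID * RBINS * sizeof(unsigned));
            hough<<<dim3(GRID / 16, GRID / 16), dim3(16, 16)>>>(dh, nhits, dv);
            static unsigned v[GRID * GRID * RBINS];
            cudaMemcpy(v, dv, sizeof(v), cudaMemcpyDeviceToHost);
            int best = 0;                        // the most-voted (centre, radius) bin
            for (int i = 1; i < GRID * GRID * RBINS; ++i) if (v[i] > v[best]) best = i;
            int rb = best % RBINS, cx = (best / RBINS) % GRID, cy = best / (RBINS * GRID);
            std::printf("ring near (%.2f, %.2f), r ~ %.2f, votes %u\n",
                        (cx + 0.5f) * (XYMAX / GRID), (cy + 0.5f) * (XYMAX / GRID),
                        (rb + 0.5f) * (RMAX / RBINS), v[best]);
            cudaFree(dh); cudaFree(dv);
            return 0;
        }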

    Strongly Coupled Theories in Lattice Coulomb Gauge

    Quantum chromodynamics, despite its simple and elegant formulation at the Lagrangian level and numerous experimental verifications, still poses many interesting questions to particle physicists in the region where perturbation theory breaks down. The origin of confinement of quarks and gluons is one of these big puzzles. Analytic techniques based on Dyson–Schwinger equations or the variational approach have proven to be useful tools to study the non-perturbative aspects of field theories. Of the latter, the Hamiltonian approach in Coulomb gauge offers an appealing physical interpretation of the two-point functions of the theory. In recent years, as numerical algorithms improved and more and more compute power became available to the physics community, lattice gauge theory, a fully numerical approach, has become established as the main tool for studies in the non-perturbative sector of field theories. A verification of these different approaches against each other is of great interest to learn about their limitations. In the first part of this work we study the correlation functions of pure SU(2) Yang–Mills theory at zero and finite temperature. After an introduction to QCD and lattice gauge theory, we discuss the Gribov problem and investigate a recent proposal to resolve it. Then we turn on temperature to study the deconfinement phase transition. Based on the center-vortex picture of confinement, we propose an answer to the question of why the correlators from lattice gauge theory in Coulomb gauge fail to detect the phase transition. Afterwards we leave pure Yang–Mills theory and apply our knowledge to the so-called Minimal Walking Technicolor theory, a possible extension of the Standard Model. Finally, we discuss how lattice gauge theory applications can be implemented efficiently on the graphics processing units used nowadays in high-performance computing.
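
    One well-known device-side optimisation in this area, in the spirit of the final chapter, is the compact storage of SU(2) links: U = a0*1 + i(a1,a2,a3)·sigma fits into a single float4, halving memory traffic against a full complex 2x2 matrix, and the group product reduces to a handful of multiply-adds. The sketch below is illustrative, not the thesis code.

        #include <cuda_runtime.h>
        #include <cstdio>

        // SU(2) element packed as float4: x = a0, (y,z,w) = vector part a.
        // Product: c0 = a0*b0 - a.b ;  c = a0*b + b0*a - a x b
        __host__ __device__ float4 su2_mul(float4 a, float4 b) {
            return make_float4(
                a.x * b.x - a.y * b.y - a.z * b.z - a.w * b.w,
                a.x * b.y + b.x * a.y - (a.z * b.w - a.w * b.z),
                a.x * b.z + b.x * a.z - (a.w * b.y - a.y * b.w),
                a.x * b.w + b.x * a.w - (a.y * b.z - a.z * b.y));
        }

        // Elementwise product of two link fields, one thread per link.
        __global__ void mul_links(const float4* u, const float4* v, float4* w, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) w[i] = su2_mul(u[i], v[i]);
        }

        int main() {
            const int n = 256;
            float4 hu[n], hv[n], hw[n];
            for (int i = 0; i < n; ++i) {
                hu[i] = make_float4(0.6f, 0.8f, 0.0f, 0.0f);  // unit-norm element
                hv[i] = make_float4(0.0f, 0.0f, 1.0f, 0.0f);  // i*sigma_2
            }
            float4 *du, *dv, *dw;
            cudaMalloc(&du, sizeof(hu)); cudaMalloc(&dv, sizeof(hv)); cudaMalloc(&dw, sizeof(hw));
            cudaMemcpy(du, hu, sizeof(hu), cudaMemcpyHostToDevice);
            cudaMemcpy(dv, hv, sizeof(hv), cudaMemcpyHostToDevice);
            mul_links<<<(n + 127) / 128, 128>>>(du, dv, dw, n);
            cudaMemcpy(hw, dw, sizeof(hw), cudaMemcpyDeviceToHost);
            float4 c = hw[0];  // unitarity check: a0^2 + |a|^2 must stay 1
            std::printf("norm = %f\n", c.x * c.x + c.y * c.y + c.z * c.z + c.w * c.w);
            cudaFree(du); cudaFree(dv); cudaFree(dw);
            return 0;
        }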

    Hypercubic storage layout and transforms in arbitrary dimensions using GPUs and CUDA

    Many simulations in the physical sciences are expressed in terms of rectilinear arrays of variables. It is attractive to develop such simulations for use in 1, 2, 3 or arbitrary physical dimensions, and in a manner that supports exploitation of data-parallelism on fast modern processing devices. We report on data layouts and transformation algorithms that support both conventional and data-parallel memory layouts. We present our implementations expressed in both conventional serial C code and in NVIDIA's Compute Unified Device Architecture concurrent programming language for use on general-purpose graphical processing units. We discuss: general memory layouts; specific optimizations possible for dimensions that are powers of two; and common transformations such as inverting, shifting and crinkling. We present performance data for some illustrative scientific applications of these layouts and transforms using several current GPU devices, and discuss the code and speed scalability of this approach.
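
    A hedged sketch of the flavour of transforms involved is given below: my reading of "crinkling" as a block-cyclic interleave, together with the usual power-of-two trick that modulo reduces to a bit-mask. The function names are invented for illustration.

        #include <cstdio>

        // Cyclic shift by s on a power-of-two length N: modulo becomes a mask.
        inline unsigned shift(unsigned i, unsigned s, unsigned N) { return (i + s) & (N - 1); }

        // Inversion (mirror image) of the index range.
        inline unsigned invert(unsigned i, unsigned N) { return (N - 1) - i; }

        // 1-D crinkle with block size B (both powers of two): element i moves to
        // lane-major order so the B members of a block sit one stride apart --
        // a data-parallel friendly interleave.
        inline unsigned crinkle(unsigned i, unsigned B, unsigned N) {
            unsigned blk = i / B, lane = i & (B - 1);
            return lane * (N / B) + blk;
        }

        int main() {
            const unsigned N = 16, B = 4;
            for (unsigned i = 0; i < N; ++i)
                std::printf("%2u -> shift+3: %2u  invert: %2u  crinkle: %2u\n",
                            i, shift(i, 3, N), invert(i, N), crinkle(i, B, N));
            return 0;
        }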

    Investigation of hadron matter using lattice QCD and implementation of lattice QCD applications on heterogeneous multicore acceleration processors

    Observables relevant for understanding the structure of baryons were determined by means of Monte Carlo simulations of Lattice Quantum Chromodynamics (QCD) using 2+1 dynamical quark flavours. Special emphasis was placed on how these observables change when flavour symmetry is broken, compared with choosing equal masses for the two light quarks and the strange quark. The first two moments of unpolarised, longitudinally, and transversely polarised parton distribution functions were calculated for the nucleon and hyperons; the latter are baryons that contain a strange quark. Lattice QCD simulations are extremely expensive, requiring petaflop computing and beyond, a regime of computing power only now becoming available. Heterogeneous multicore computing is becoming increasingly important in high-performance scientific computing. The strategy of deploying multiple types of processing elements within a single workflow, allowing each to perform the tasks to which it is best suited, is likely to be part of the roadmap to exascale. In this work, new design concepts were developed for an active library (QDP++) harnessing the compute power of a heterogeneous multicore processor (the IBM PowerXCell 8i). Not only is a proof of concept given; it was also possible to run a QDP++-based physics application (Chroma) with reasonable performance on the IBM BladeCenter QS22.
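
    Active libraries such as QDP++ typically capture whole array expressions with C++ expression templates, so that the library rather than the user owns the evaluation loop and can retarget it, for instance to an accelerator's processing elements. A minimal sketch of the pattern (illustrative only, far simpler than QDP++ itself):

        #include <cstddef>
        #include <cstdio>
        #include <vector>

        // Leaf type: a plain array. Assigning any expression to it runs ONE
        // fused loop over the whole expression tree -- no temporaries.
        struct Vec {
            std::vector<double> d;
            explicit Vec(std::size_t n, double v = 0.0) : d(n, v) {}
            double operator[](std::size_t i) const { return d[i]; }
            template <class E>
            Vec& operator=(const E& e) {                  // the library owns this
                for (std::size_t i = 0; i < d.size(); ++i)  // loop and could offload it
                    d[i] = e[i];
                return *this;
            }
        };

        // Node type: records "l + r" without computing it.
        template <class L, class R>
        struct Add {
            const L& l; const R& r;
            double operator[](std::size_t i) const { return l[i] + r[i]; }
        };

        // Unconstrained for brevity; a real library restricts this to its own types.
        template <class L, class R>
        Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }

        int main() {
            Vec a(4, 1.0), b(4, 2.0), c(4, 3.0), r(4);
            r = a + b + c;                     // builds Add<Add<Vec,Vec>,Vec>, one loop
            std::printf("r[0] = %f\n", r[0]);  // 6.0
            return 0;
        }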

    Review of particle physics

    The Review summarizes much of particle physics and cosmology. Using data from previous editions, plus 3,283 new measurements from 899 papers, we list, evaluate, and average measured properties of gauge bosons and the recently discovered Higgs boson, leptons, quarks, mesons, and baryons. We summarize searches for hypothetical particles such as heavy neutrinos, supersymmetric and technicolor particles, axions, dark photons, etc. All the particle properties and search limits are listed in Summary Tables. We also give numerous tables, figures, formulae, and reviews of topics such as Supersymmetry, Extra Dimensions, Particle Detectors, Probability, and Statistics. Among the 112 reviews are many that are new or heavily revised, including those on: Dark Energy, Higgs Boson Physics, Electroweak Model, Neutrino Cross Section Measurements, Monte Carlo Neutrino Generators, Top Quark, Dark Matter, Dynamical Electroweak Symmetry Breaking, Accelerator Physics of Colliders, High-Energy Collider Parameters, Big Bang Nucleosynthesis, Astrophysical Constants and Cosmological Parameters. A booklet is available containing the Summary Tables and abbreviated versions of some of the other sections of this full Review. All tables, listings, and reviews (and errata) are also available on the Particle Data Group website: http://pdg.lbl.gov