25 research outputs found

    Design and optimization of a portable LQCD Monte Carlo code using OpenACC

    Full text link
    The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core GPUs, exploiting aggressive data-parallelism and delivering higher performances for streaming computing applications. In this scenario, code portability (and performance portability) become necessary for easy maintainability of applications; this is very relevant in scientific computing where code changes are very frequent, making it tedious and prone to error to keep different code versions aligned. In this work we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance-portability can be reached.Comment: 26 pages, 2 png figures, preprint of an article submitted for consideration in International Journal of Modern Physics

    Status and Future Perspectives for Lattice Gauge Theory Calculations to the Exascale and Beyond

    Full text link
    In this and a set of companion whitepapers, the USQCD Collaboration lays out a program of science and computing for lattice gauge theory. These whitepapers describe how calculation using lattice QCD (and other gauge theories) can aid the interpretation of ongoing and upcoming experiments in particle and nuclear physics, as well as inspire new ones.Comment: 44 pages. 1 of USQCD whitepapers

    Fast algorithm for real-time rings reconstruction

    Get PDF
    The GAP project is dedicated to study the application of GPU in several contexts in which real-time response is important to take decisions. The definition of real-time depends on the application under study, ranging from answer time of ÎĽs up to several hours in case of very computing intensive task. During this conference we presented our work in low level triggers [1] [2] and high level triggers [3] in high energy physics experiments, and specific application for nuclear magnetic resonance (NMR) [4] [5] and cone-beam CT [6]. Apart from the study of dedicated solution to decrease the latency due to data transport and preparation, the computing algorithms play an essential role in any GPU application. In this contribution, we show an original algorithm developed for triggers application, to accelerate the ring reconstruction in RICH detector when it is not possible to have seeds for reconstruction from external trackers

    An Optimized Multiple Right-Hand Side Dslash Kernel for Intel Xeon Phi

    Get PDF
    Lattice quantum chromodynamics (LQCD) stands unique as the only computationally tractable, non-perturbative, and model-independent quantum field theory of the strong nuclear force. The computational core of LQCD is the Wilson Dslash operator, a nearest neighbor stencil operator summing matrix-vector multiplications over lattice points, whose performance is bandwidth-bound on most architectures. Reportedly, up to 90\% of LQCD running time may be spent computing Dslash. In recent years, efforts have been made by researchers to optimize LQCD calculations for floating point coprocessor cards such as GPUs and Intel Xeon Phi Knights Corner (KNC), which boast powerful vector processing units. Most of these efforts in the area of Dslash have focused on single right-hand side solvers. This thesis will present two optimized Dslash kernels which simplify vectorization using multiple right-hand sides and traverse lattices using novel methods. The speedups resulting from these approaches will be explored in the context of KNC\u27s architecture

    AUTOMATING DATA-LAYOUT DECISIONS IN DOMAIN-SPECIFIC LANGUAGES

    Get PDF
    A long-standing challenge in High-Performance Computing (HPC) is the simultaneous achievement of programmer productivity and hardware computational efficiency. The challenge has been exacerbated by the onset of multi- and many-core CPUs and accelerators. Only a few expert programmers have been able to hand-code domain-specific data transformations and vectorization schemes needed to extract the best possible performance on such architectures. In this research, we examined the possibility of automating these methods by developing a Domain-Specific Language (DSL) framework. Our DSL approach extends C++14 by embedding into it a high-level data-parallel array language, and by using a domain-specific compiler to compile to hybrid-parallel code. We also implemented an array index-space transformation algebra within this high-level array language to manipulate array data-layouts and data-distributions. The compiler introduces a novel method for SIMD auto-vectorization based on array data-layouts. Our new auto-vectorization technique is shown to outperform the default auto-vectorization strategy by up to 40% for stencil computations. The compiler also automates distributed data movement with overlapping of local compute with remote data movement using polyhedral integer set analysis. Along with these main innovations, we developed a new technique using C++ template metaprogramming for developing embedded DSLs using C++. We also proposed a domain-specific compiler intermediate representation that simplifies data flow analysis of abstract DSL constructs. We evaluated our framework by constructing a DSL for the HPC grand-challenge domain of lattice quantum chromodynamics. Our DSL yielded performance gains of up to twice the flop rate over existing production C code for selected kernels. This gain in performance was obtained while using less than one-tenth the lines of code. The performance of this DSL was also competitive with the best hand-optimized and hand-vectorized code, and is an order of magnitude better than existing production DSLs.Doctor of Philosoph

    Proceedings, MSVSCC 2015

    Get PDF
    The Virginia Modeling, Analysis and Simulation Center (VMASC) of Old Dominion University hosted the 2015 Modeling, Simulation, & Visualization Student capstone Conference on April 16th. The Capstone Conference features students in Modeling and Simulation, undergraduates and graduate degree programs, and fields from many colleges and/or universities. Students present their research to an audience of fellow students, faculty, judges, and other distinguished guests. For the students, these presentations afford them the opportunity to impart their innovative research to members of the M&S community from academic, industry, and government backgrounds. Also participating in the conference are faculty and judges who have volunteered their time to impart direct support to their students’ research, facilitate the various conference tracks, serve as judges for each of the tracks, and provide overall assistance to this conference. 2015 marks the ninth year of the VMASC Capstone Conference for Modeling, Simulation and Visualization. This year our conference attracted a number of fine student written papers and presentations, resulting in a total of 51 research works that were presented. This year’s conference had record attendance thanks to the support from the various different departments at Old Dominion University, other local Universities, and the United States Military Academy, at West Point. We greatly appreciated all of the work and energy that has gone into this year’s conference, it truly was a highly collaborative effort that has resulted in a very successful symposium for the M&S community and all of those involved. Below you will find a brief summary of the best papers and best presentations with some simple statistics of the overall conference contribution. Followed by that is a table of contents that breaks down by conference track category with a copy of each included body of work. Thank you again for your time and your contribution as this conference is designed to continuously evolve and adapt to better suit the authors and M&S supporters. Dr.Yuzhong Shen Graduate Program Director, MSVE Capstone Conference Chair John ShullGraduate Student, MSVE Capstone Conference Student Chai

    Autotuning for Automatic Parallelization on Heterogeneous Systems

    Get PDF
    corecore