Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures
Efficient parallel implementation of scientific applications on multi-core CPUs with accelerators such as GPUs and Xeon Phis is challenging. It requires exploiting the data-parallel architecture of the accelerator along with the vector pipelines of modern x86 CPU architectures, load balancing, and efficient memory transfer between devices. It is relatively easy to meet these requirements for highly structured scientific applications. In contrast, a number of scientific and engineering applications are unstructured. Getting performance on accelerators for these applications is extremely challenging because many of them employ irregular algorithms that exhibit data-dependent control flow and irregular memory accesses. Furthermore, these applications are often iterative with dependencies between steps, making it hard to parallelize across steps. As a result, parallelism in these applications is often limited to a single step. Numerical simulation of charged particle beam dynamics is one such application, where the distribution of work and the memory access pattern at each time step are irregular. Applications with these properties tend to exhibit significant branch and memory divergence, load imbalance between processor cores, and poor compute and memory utilization. Prior research on parallelizing such irregular applications has focused on optimizing the irregular, data-dependent memory accesses and control flow during a single step of the application, independent of the other steps, under the assumption that these patterns are completely unpredictable. We observed that the structure of computation leading to control-flow divergence and irregular memory accesses in one step is similar to that in the next step. It is possible to predict this structure in the current step by observing the computation structure of previous steps.
In this dissertation, we present novel machine-learning-based optimization techniques to address the parallel implementation challenges of such irregular applications on different HPC architectures. In particular, we use supervised learning to predict the computation structure and use it to address the control-flow and memory access irregularities in the parallel implementation of such applications on GPUs, Xeon Phis, and heterogeneous architectures composed of multi-core CPUs with GPUs or Xeon Phis. We use numerical simulation of charged particle beam dynamics as a motivating example throughout the dissertation to present our new approach, though the techniques should be equally applicable to a wide range of irregular applications. The machine learning approach presented here uses predictive analytics and forecasting techniques to adaptively model and track the irregular memory access pattern at each time step of the simulation and anticipate the future memory access pattern. Access pattern forecasts can then be used to formulate optimization decisions during application execution, improving the performance of the application at a future time step based on observations from earlier time steps. In heterogeneous architectures, forecasts can also be used to improve the memory performance and resource utilization of all the processing units to deliver good aggregate performance. We used these optimization techniques and this anticipation strategy to design a cache-aware, memory-efficient parallel algorithm that addresses the irregularities in the parallel implementation of charged particle beam dynamics simulation on different HPC architectures. Experimental results using a diverse mix of HPC architectures show that our anticipation strategy is effective in maximizing data reuse, ensuring workload balance, minimizing branch and memory divergence, and improving resource utilization.
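The anticipation strategy described above can be sketched in miniature. The snippet below uses a hypothetical persistence forecast (the simplest possible predictor, assuming the access structure changes slowly between steps) followed by a locality-restoring reorder; all function names and the toy data are illustrative assumptions, not the dissertation's actual code.

```python
import numpy as np

def predict_bins(prev_bins):
    """Forecast each particle's spatial bin for the next step from the
    previous step's assignment (persistence forecast: the structure of
    computation changes only slowly between time steps)."""
    return prev_bins  # simplest predictor: last observed bin

def reorder_for_locality(positions, predicted_bins):
    """Sort particles by predicted bin so threads processing neighbouring
    particles touch contiguous memory, reducing divergence."""
    order = np.argsort(predicted_bins, kind="stable")
    return positions[order], order

# Toy example: 8 particles scattered over 3 spatial bins.
positions = np.arange(8, dtype=float)
prev_bins = np.array([2, 0, 1, 2, 0, 1, 0, 2])

pred = predict_bins(prev_bins)
sorted_pos, order = reorder_for_locality(positions, pred)
print(order)  # particles grouped bin-by-bin: [1 4 6 2 5 0 3 7]
```

In a real simulation the predictor would be a learned model rather than pure persistence, but the reordering step, applied before the next time step executes, is what converts a forecast into regular memory access.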
CompF2: Theoretical Calculations and Simulation Topical Group Report
This report summarizes the work of the Computational Frontier topical group
on theoretical calculations and simulation for Snowmass 2021. We discuss the
challenges, potential solutions, and needs facing six diverse but related
topical areas that span the subject of theoretical calculations and simulation
in high energy physics (HEP): cosmic calculations, particle accelerator
modeling, detector simulation, event generators, perturbative calculations, and
lattice QCD (quantum chromodynamics). The challenges arise from the next
generations of HEP experiments, which will include more complex instruments,
provide larger data volumes, and perform more precise measurements.
Calculations and simulations will need to keep up with these increased
requirements. The other aspect of the challenge is the evolution of the computing
landscape away from general-purpose computing on CPUs and toward
special-purpose accelerators and coprocessors such as GPUs and FPGAs. These
newer devices can provide substantial improvements for certain categories of
algorithms, at the expense of more specialized programming and memory and data
access patterns.
Comment: Report of the Computational Frontier Topical Group on Theoretical Calculations and Simulation for Snowmass 2021
Faster inference from state space models via GPU computing
Funding: C.F.-J. is funded via a doctoral scholarship from the University of St Andrews, School of Mathematics and Statistics.
Inexpensive Graphics Processing Units (GPUs) offer the potential to greatly speed up computation by employing their massively parallel architecture to perform arithmetic operations more efficiently. Population dynamics models are important tools in ecology and conservation. Modern Bayesian approaches allow biologically realistic models to be constructed and fitted to multiple data sources in an integrated modelling framework based on a class of statistical models called state space models. However, model fitting is often slow, requiring hours to weeks of computation. We demonstrate the benefits of GPU computing using a model for the population dynamics of British grey seals, fitted with a particle Markov chain Monte Carlo algorithm. Speed-ups of two orders of magnitude were obtained for estimation of the log-likelihood, compared to a traditional ‘CPU-only’ implementation, allowing an accurate method of inference to be used where it was previously too computationally expensive to be viable. GPU computing has enormous potential, but one barrier to further adoption is a steep learning curve, due to GPUs' unique hardware architecture. We provide a detailed description of hardware and software setup, and our case study provides a template for other similar applications. We also provide a detailed tutorial-style description of GPU hardware architectures and examples of important GPU-specific programming practices.
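As a rough illustration of why particle filter likelihood estimation maps so well onto GPUs, here is a minimal CPU-side sketch of a bootstrap particle filter log-likelihood estimate for a toy AR(1) state space model. The model, parameter values, and function names are assumptions for illustration, not the seal population model from the paper; the point is that every per-particle operation is an elementwise array operation, which is exactly what a GPU parallelizes.

```python
import numpy as np

def bootstrap_loglik(y, n_particles=1000, phi=0.9, sigma=1.0, tau=0.5, seed=0):
    """Bootstrap particle filter estimate of the log-likelihood of a toy
    state space model: x_t = phi*x_{t-1} + N(0, sigma^2) hidden states,
    y_t = x_t + N(0, tau^2) observations. All particles advance together
    in vectorized steps."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma, n_particles)  # initial particle cloud
    loglik = 0.0
    for yt in y:
        # Propagate every particle through the state equation at once.
        x = phi * x + rng.normal(0.0, sigma, n_particles)
        # Log observation weights, combined with the log-mean-exp trick.
        logw = -0.5 * ((yt - x) / tau) ** 2 - np.log(tau * np.sqrt(2 * np.pi))
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())
        # Multinomial resampling keeps the cloud focused on likely states.
        x = x[rng.choice(n_particles, n_particles, p=w / w.sum())]
    return loglik

obs = np.array([0.2, -0.1, 0.4, 0.0])
print(bootstrap_loglik(obs))  # a finite, negative log-likelihood estimate
```

On a GPU the same structure applies with the arrays living in device memory; the per-particle propagate/weight/resample steps are the embarrassingly parallel workload that yields the reported speed-ups.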
ASCR/HEP Exascale Requirements Review Report
This draft report summarizes and details the findings, results, and
recommendations derived from the ASCR/HEP Exascale Requirements Review meeting
held in June 2015. The main conclusions are as follows. 1) Larger, more
capable computing and data facilities are needed to support HEP science goals
in all three frontiers: Energy, Intensity, and Cosmic. The expected scale of
the demand at the 2025 timescale is at least two orders of magnitude greater
than that currently available, and in some cases more. 2) The growth rate of data
produced by simulations is overwhelming the current ability, of both facilities
and researchers, to store and analyze it. Additional resources and new
techniques for data analysis are urgently needed. 3) Data rates and volumes
from HEP experimental facilities are also straining the ability to store and
analyze large and complex data volumes. Appropriately configured
leadership-class facilities can play a transformational role in enabling
scientific discovery from these datasets. 4) A close integration of HPC
simulation and data analysis will aid greatly in interpreting results from HEP
experiments. Such an integration will minimize data movement and facilitate
interdependent workflows. 5) Long-range planning between HEP and ASCR will be
required to meet HEP's research needs. To best use ASCR HPC resources the
experimental HEP program needs a) an established long-term plan for access to
ASCR computational and data resources, b) an ability to map workflows onto HPC
resources, c) the ability for ASCR facilities to accommodate workflows run by
collaborations that can have thousands of individual members, d) to transition
codes to the next-generation HPC platforms that will be available at ASCR
facilities, e) to build up and train a workforce capable of developing and
using simulations and analysis to support HEP scientific research on
next-generation systems.
Comment: 77 pages, 13 figures; draft report, subject to further revision
Accelerating Radiation Dose Calculation with High Performance Computing and Machine Learning for Large-scale Radiotherapy Treatment Planning
Radiation therapy is powered by modern techniques in precise planning and execution of radiation delivery, which are being rapidly improved to maximize its benefit to cancer patients. In the last decade, radiotherapy experienced the introduction of advanced methods for automatic beam orientation optimization, real-time tumor tracking, daily plan adaptation, and many others, which improve the radiation delivery precision, planning ease and reproducibility, and treatment efficacy. However, such advanced paradigms necessitate the calculation of orders of magnitude more causal dose deposition data, increasing the time requirement of all pre-planning dose calculation. Principles of high-performance computing and machine learning were applied to address the insufficient speeds of widely used dose calculation algorithms and facilitate translation of these advanced treatment paradigms into clinical practice.
To accelerate CT-guided X-ray therapies, Collapsed-Cone Convolution-Superposition (CCCS), a state-of-the-art analytical dose calculation algorithm, was accelerated through its novel implementation on highly parallelized GPUs. This context-based GPU-CCCS approach takes advantage of X-ray dose deposition compactness to parallelize calculation across hundreds of beamlets, reducing hardware-specific overheads and enabling acceleration by two to three orders of magnitude compared to existing GPU-based beamlet-by-beamlet approaches. Near-linear increases in acceleration are achieved with a distributed, multi-GPU implementation of context-based GPU-CCCS.
Dose calculation for MR-guided treatment is complicated by electron return effects (EREs), exhibited by ionizing electrons in the strong magnetic field of the MRI scanner. EREs necessitate the use of much slower Monte Carlo (MC) dose calculation, limiting the clinical application of advanced treatment paradigms due to time restrictions.
An automatically distributed framework for very-large-scale MC dose calculation was developed, granting linear scaling of dose calculation speed with the number of utilized computational cores. It was then harnessed to efficiently generate a large dataset of paired high- and low-noise MC doses in a 1.5 tesla magnetic field, which were used to train a novel deep convolutional neural network (CNN), DeepMC, to predict low-noise dose from faster high-noise MC simulation. DeepMC enables 38-fold acceleration of MR-guided X-ray beamlet dose calculation, while remaining synergistic with existing MC acceleration techniques to achieve multiplicative speed improvements.
This work redefines the expectation of X-ray dose calculation speed, making it possible to apply new highly beneficial treatment paradigms to standard clinical practice for the first time.
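A toy numeric illustration of the trade-off a denoising approach like DeepMC exploits: MC estimator noise falls only as 1/sqrt(histories), so a run with 100x fewer histories is 100x faster but only 10x noisier, and a learned denoiser bridges the remaining gap. The single-voxel Gaussian model below is a contrived assumption for illustration, not the actual MC physics.

```python
import numpy as np

rng = np.random.default_rng(42)

def mc_dose(n_histories, true_dose=1.0):
    """Toy Monte Carlo dose estimate at one voxel: the mean of
    n_histories noisy single-history energy deposits (assumed unit
    per-history noise). Estimator noise falls as 1/sqrt(n_histories)."""
    return rng.normal(true_dose, 1.0, n_histories).mean()

# Empirical estimator noise at two simulation budgets.
lo_noise_of_cheap = np.std([mc_dose(100) for _ in range(2000)])    # fast run
lo_noise_of_costly = np.std([mc_dose(10000) for _ in range(2000)]) # 100x slower

ratio = lo_noise_of_cheap / lo_noise_of_costly
print(ratio)  # close to sqrt(100) = 10
```

Because halving the noise costs 4x the histories, a network that maps a high-noise dose map to its low-noise counterpart buys a large constant-factor speedup on top of any parallelization of the MC engine itself.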
Large-Scale Optical Neural Networks based on Photoelectric Multiplication
Recent success in deep neural networks has generated strong interest in
hardware accelerators to improve speed and energy efficiency. This paper
presents a new type of photonic accelerator based on coherent detection that is
scalable to large networks and can be operated at high (GHz)
speeds and very low (sub-aJ) energies per multiply-and-accumulate (MAC), using
the massive spatial multiplexing enabled by standard free-space optical
components. In contrast to previous approaches, both weights and inputs are
optically encoded so that the network can be reprogrammed and trained on the
fly. Simulations of the network using models for digit- and
image-classification reveal a "standard quantum limit" for optical neural
networks, set by photodetector shot noise. This bound, which can be as low as
50 zJ/MAC, suggests performance below the thermodynamic (Landauer) limit for
digital irreversible computation is theoretically possible in this device. The
proposed accelerator can implement both fully-connected and convolutional
networks. We also present a scheme for back-propagation and training that can
be performed in the same hardware. This architecture will enable a new class of
ultra-low-energy processors for deep learning.
Comment: Text: 10 pages, 5 figures, 1 table. Supplementary: 8 pages, 5 figures, 2 tables
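The claim that an optical bound as low as 50 zJ/MAC can sit below the Landauer limit for digital irreversible computation can be checked with a back-of-the-envelope calculation. The Landauer limit is per erased bit, and a digital MAC erases many intermediate bits; the per-MAC bit-erasure count below is an assumed round figure for illustration, not taken from the paper.

```python
import math

k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0            # room temperature, K

# Landauer limit: minimum dissipation to irreversibly erase one bit.
landauer_zJ = k_B * T * math.log(2) * 1e21   # in zeptojoules
print(round(landauer_zJ, 2))  # → 2.87

# A digital fixed-point MAC discards on the order of 100 bits of
# intermediate state (assumed round figure), so its Landauer floor is
# hundreds of zJ -- above the ~50 zJ/MAC optical shot-noise bound.
digital_mac_floor_zJ = 100 * landauer_zJ
print(digital_mac_floor_zJ > 50)  # → True
```

So the optical bound undercuts the floor of the *equivalent digital irreversible* computation, not the single-bit Landauer energy itself, which is the sense in which the abstract's claim holds.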