Performance Engineering for Graduate Students: A View from Amsterdam
HPC relies on experts to design, implement, and tune (computational science) applications that can efficiently use current (super)computing systems. As such, we strongly believe we must educate our students to ensure their ability to drive these activities, together with the domain experts. To this end, in 2017 we designed a performance engineering course that, inspired by several conference-like tutorials, covers the principles and practice of performance engineering: benchmarking, performance modeling, and performance improvement. In this paper, we describe the goals, learning objectives, and structure of the course, share student feedback and evaluation data, and discuss the lessons learned. After teaching the course seven times, our results show that the course is tough (as expected) but very well received, with high scores and several students continuing on the path of performance engineering during and after their master studies.
Reduced Simulations for High-Energy Physics, a Middle Ground for Data-Driven Physics Research
Subatomic particle track reconstruction (tracking) is a vital task in High-Energy Physics experiments. Tracking is exceptionally computationally challenging, and fielded solutions, relying on traditional algorithms, do not scale linearly. Machine Learning (ML) assisted solutions are a promising answer. We argue that a complexity-reduced problem description, and the data representing it, will facilitate the solution exploration workflow. We provide the REDuced VIrtual Detector (REDVID) as a combined complexity-reduced detector model and particle collision event simulator. REDVID is intended as a simulation-in-the-loop, both to generate synthetic data efficiently and to simplify the challenge of ML model design. The fully parametric nature of our tool with regard to system-level configuration, in contrast to physics-accurate simulations, allows for the generation of simplified data for research and education at different levels. Owing to the reduced complexity, we showcase the computational efficiency of REDVID by providing computational cost figures for a multitude of simulation benchmarks. As a simulation and a generative tool for ML-assisted solution design, REDVID is highly flexible, reusable, and open-source. Reference data sets generated with REDVID are publicly available. Data generated using REDVID has enabled the rapid development of multiple novel ML model designs, which is currently ongoing.
Modelling Performance Loss due to Thread Imbalance in Stochastic Variable-Length SIMT Workloads
When designing algorithms for single-instruction multiple-thread (SIMT) devices such as general-purpose graphics processing units (GPGPUs), thread imbalance is an important performance consideration. Thread imbalance can emerge in iterative applications where workloads are of variable length, because threads processing larger amounts of work will cause threads with less work to idle. This form of thread imbalance influences the design space of algorithms, particularly in terms of processing granularity, but we lack models to quantify its impact on application performance. In this paper, we present a statistical model for quantifying the performance loss due to thread imbalance for iterative SIMT applications with stochastic, variable-length workloads. Our model is designed to operate with minimal knowledge of the implementation details of the algorithm, relying solely on an understanding of the probability distribution of the lengths of the workloads. We validate our model against a synthetic benchmark based on a Monte Carlo simulation of matrix exponentiation, and show that our model achieves nearly perfect accuracy. Compared to empirical data extracted from real hardware, our model maintains a high degree of accuracy, predicting mean performance loss within a margin of 2%.
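The core effect quantified here can be illustrated with a toy Monte Carlo: lanes in a warp execute in lockstep, so a warp runs until its longest workload finishes, and the loss depends only on the distribution of workload lengths. The warp size, trial count, and loss metric below are illustrative assumptions for a sketch, not the paper's actual model.

```python
import random

def simulated_loss(sample_length, warp_size=32, n_trials=2000, seed=0):
    """Monte Carlo estimate of the SIMT performance loss fraction.

    `sample_length` draws one workload length given an RNG. A warp of
    `warp_size` lockstep lanes occupies the device for max(lengths)
    cycles per lane, while the useful work is sum(lengths); the loss is
    the fraction of issued lane-cycles that are spent idling.
    """
    rng = random.Random(seed)
    idle, total = 0.0, 0.0
    for _ in range(n_trials):
        lengths = [sample_length(rng) for _ in range(warp_size)]
        time = max(lengths) * warp_size  # all lanes busy until the longest finishes
        work = sum(lengths)              # cycles doing useful work
        idle += time - work
        total += time
    return idle / total
```

With a constant workload length the loss is exactly zero, while any spread in the distribution drives it above zero, which matches the intuition that only the shape of the length distribution matters.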
Systematically Exploring High-Performance Representations of Vector Fields Through Compile-Time Composition
We present a novel benchmark suite for implementations of vector fields in high-performance computing environments to aid developers in quantifying and ranking their performance. We decompose the design space of such benchmarks into access patterns and storage backends, the latter of which can be further decomposed into components with different functional and non-functional properties. Through compile-time meta-programming, we generate a large number of benchmarks with minimal effort and ensure the extensibility of our suite. Our empirical analysis, based on real-world applications in high-energy physics, demonstrates the feasibility of our approach on CPU and GPU platforms, and highlights that our suite is able to evaluate performance-critical design choices. Finally, we propose that our work towards composing vector fields from elementary components is not only useful for the purposes of benchmarking, but that it naturally gives rise to a novel library for implementing such fields in domain applications.
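The decomposition into access patterns crossed with storage backends can be mimicked at runtime in a few lines. The component names and the Python enumeration below are illustrative stand-ins for the suite's C++ compile-time meta-programming, chosen only to show how composing independent axes of the design space yields a combinatorial family of benchmarks.

```python
import time
from itertools import product

# Storage backends: different flat memory layouts behind the same (x, y) interface.
def row_major(nx, ny, data):
    return lambda x, y: data[x * ny + y]

def column_major(nx, ny, data):
    return lambda x, y: data[y * nx + x]

# Access patterns: different orders of visiting every grid point.
def sequential(nx, ny):
    return ((x, y) for x in range(nx) for y in range(ny))

def strided(nx, ny):
    return ((x, y) for y in range(ny) for x in range(nx))

def benchmark(backend_factory, pattern, nx=64, ny=64):
    """Time one (backend, pattern) combination; the checksum guards correctness."""
    data = list(range(nx * ny))
    field = backend_factory(nx, ny, data)
    start = time.perf_counter()
    checksum = sum(field(x, y) for x, y in pattern(nx, ny))
    return time.perf_counter() - start, checksum

# Every backend crossed with every pattern yields one benchmark instance.
results = {
    (b.__name__, p.__name__): benchmark(b, p)
    for b, p in product([row_major, column_major], [sequential, strided])
}
```

Each axis can be extended independently, and the product regenerates the full benchmark matrix, which is the extensibility property the abstract claims for the compile-time version.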
Using Evolutionary Algorithms to Find Cache-Friendly Generalized Morton Layouts for Arrays
The layout of multi-dimensional data can have a significant impact on the efficacy of hardware caches and, by extension, the performance of applications. Common multi-dimensional layouts include the canonical row-major and column-major layouts as well as the Morton curve layout. In this paper, we describe how the Morton layout can be generalized to a very large family of multi-dimensional data layouts with widely varying performance characteristics. We posit that this design space can be efficiently explored using a combinatorial evolutionary methodology based on genetic algorithms. To this end, we propose a chromosomal representation for such layouts as well as a methodology for estimating the fitness of array layouts using cache simulation. We show that our fitness function correlates to kernel running time in real hardware, and that our evolutionary strategy allows us to find candidates with favorable simulated cache properties in four out of the eight real-world applications under consideration in a small number of generations. Finally, we demonstrate that the array layouts found using our evolutionary method perform well not only in simulated environments but that they can effect significant performance gains, up to a factor of ten in extreme cases, in real hardware.
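One way to picture such a generalization: a layout can be described by a sequence of dimension numbers that dictates, bit by bit, which coordinate supplies the next bit of the flat index. The encoding below is a hedged sketch of that idea (the paper's chromosomal representation may differ in detail); row-major, column-major, and classic Morton order all fall out as special cases of the pattern.

```python
def layout_index(coords, pattern):
    """Map multi-dimensional coordinates to a flat array index.

    `pattern` is a sequence of dimension numbers, read from the least
    significant output bit upwards: entry i says which coordinate
    supplies bit i of the flat index. [0, 1, 0, 1, ...] reproduces the
    classic 2D Morton (Z-order) curve, while grouping all of one
    dimension's bits together reproduces row- or column-major order.
    """
    consumed = [0] * len(coords)  # bits already taken from each dimension
    index = 0
    for out_bit, dim in enumerate(pattern):
        bit = (coords[dim] >> consumed[dim]) & 1
        index |= bit << out_bit
        consumed[dim] += 1
    return index
```

For example, in an 8x8 grid the pattern [0, 1, 0, 1, 0, 1] interleaves the two coordinates (Morton order), while [0, 0, 0, 1, 1, 1] takes all three bits of the first coordinate first, giving the layout where that coordinate varies fastest; every other permutation of the same pattern entries is a distinct member of the family.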
Finding Morton-Like Layouts for Multi-Dimensional Arrays Using Evolutionary Algorithms
The layout of multi-dimensional data can have a significant impact on the efficacy of hardware caches and, by extension, the performance of applications. Common multi-dimensional layouts include the canonical row-major and column-major layouts as well as the Morton curve layout. In this paper, we describe how the Morton layout can be generalized to a very large family of multi-dimensional data layouts with widely varying performance characteristics. We posit that this design space can be efficiently explored using a combinatorial evolutionary methodology based on genetic algorithms. To this end, we propose a chromosomal representation for such layouts as well as a methodology for estimating the fitness of array layouts using cache simulation. We show that our fitness function correlates to kernel running time in real hardware, and that our evolutionary strategy allows us to find candidates with favorable simulated cache properties in four out of the eight real-world applications under consideration in a small number of generations. Finally, we demonstrate that the array layouts found using our evolutionary method perform well not only in simulated environments but that they can effect significant performance gains, up to a factor of ten in extreme cases, in real hardware.
Performance Engineering in High Energy Physics Software: The ATLAS Offline Track Fitter
As the ATLAS high-energy physics experiment gears up for the HL-LHC upgrade, the performance of the software used to process the experiment's massive amount of data grows more and more important. Models predict that, when the HL-LHC becomes operational in 2026, the amount of CPU time required to process the data produced by the ATLAS experiment will increase by up to a factor of nine; this increase far exceeds the predicted computing budget. In this thesis, we present the results of a research project investigating the feasibility and applicability of performance engineering techniques on HEP software, using the ATLAS offline track fitting software as a guiding example. We discuss the challenges involved in optimising such software from a software and performance engineering perspective. Of particular interest are ways in which traditional performance analysis techniques can be adapted to obtain detailed and meaningful performance data for such software. Indeed, we find that not all techniques can be applied naively to give maximally meaningful results. From the results of performance analysis using techniques modified to fit this particular use case, we design and implement a series of relatively simple optimising code transformations. These optimisations approximately double the performance of the fitting code, thus bringing it closer to the performance required for the HL-LHC.
REDVID Collision Event Data – Linear Tracks and Hits
An example, representative data set is generated using the REDuced VIrtual Detector (REDVID) simulation framework and contains complexity-reduced subatomic particle collision event data. Particle trajectory information and hit coordinates from interactions with reduced-order virtual detector models are included. The data is generated in a 3D domain and follows the cylindrical coordinate system for hit point coordinates in space and for trajectory function parameters.
Each of the five included tarballs belongs to a different data generation recipe. While all recipes include 10000 collision events, the number of tracks per event varies from 1 to 10000, as reflected in the tarball names.
The data set is intended to be used as synthesised input for research involving ML-assisted pipeline design exploration, as well as ML model design exploration, e.g., Neural Architecture Search (NAS). To understand the data and its generation in detail, refer to the provided README file, as well as the related publication.
The derivation of Jacobian matrices for the propagation of track parameter uncertainties in the presence of magnetic fields and detector material
In high-energy physics experiments, the trajectories of charged particles are reconstructed using track reconstruction algorithms. Such algorithms need both to identify the set of measurements from a single charged particle and to fit the track parameters by propagating tracks along the measurements. The propagation of the track parameter uncertainties is an important component of the track fitting, needed to obtain optimal precision in the fitted parameters. The error propagation is performed at intersections between the track and local coordinate frames defined on detector components by calculating a Jacobian matrix corresponding to the local-to-local frame transport. This paper derives the Jacobian matrix in a general manner to harmonize with numerical integration methods developed for inhomogeneous magnetic fields and materials. The Jacobian and transported covariance matrices are validated by simulating the propagation of charged particles between two frames and comparing with the results of numerical methods.
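The covariance transport underlying this validation is the standard similarity transform C' = J C J^T. The dependency-free sketch below estimates J by central finite differences for an arbitrary propagation function; this numerical Jacobian is an illustrative stand-in for the paper's analytically derived one, and the `propagate` callable is a hypothetical placeholder for a real frame-to-frame propagator.

```python
def transport_covariance(propagate, params, cov, eps=1e-6):
    """Transport a parameter covariance across one propagation step.

    `propagate` maps the parameter vector on the start frame to the one
    on the destination frame. Its Jacobian J is estimated column by
    column with central finite differences, and the covariance is
    transported as C' = J C J^T (done with explicit loops to stay
    dependency-free).
    """
    n = len(params)
    m = len(propagate(params))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus = list(params); plus[j] += eps
        minus = list(params); minus[j] -= eps
        fp, fm = propagate(plus), propagate(minus)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    # C' = J C J^T
    JC = [[sum(J[i][k] * cov[k][l] for k in range(n)) for l in range(n)]
          for i in range(m)]
    return [[sum(JC[i][l] * J[r][l] for l in range(n)) for r in range(m)]
            for i in range(m)]
```

For a linear propagator the finite-difference Jacobian is exact, which makes linear test cases a convenient cross-check against closed-form results, much as the paper cross-checks its analytical Jacobians against numerical methods.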
Improvements to ATLAS Inner Detector Track reconstruction for LHC Run-3
This document summarises the main changes to the ATLAS experiment's Inner Detector Track reconstruction software chain in preparation for LHC Run 3 (2022-2024). The work was carried out to ensure that the expected high-activity collisions, with on average 50 simultaneous proton-proton interactions per bunch crossing (pile-up), can be reconstructed promptly using the available computing resources. Performance figures in terms of CPU consumption for the key components of the reconstruction algorithm chain, and their dependence on the pile-up, are shown. For the design pile-up value of 60, the updated track reconstruction is a factor of 2 faster than the previous version.