63 research outputs found
A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit
Autotuning of performance-relevant source-code parameters makes it possible to
tune applications automatically without hard-coding optimizations and thus
helps keep performance portable. In this paper, we introduce a
benchmark set of ten autotunable kernels for important computational problems
implemented in OpenCL or CUDA. Using our Kernel Tuning Toolkit, we show that
with autotuning most of the kernels reach near-peak performance on various GPUs
and outperform baseline implementations on CPUs and Xeon Phis. Our evaluation
also demonstrates that autotuning is key to performance portability. In
addition to offline tuning, we also introduce dynamic autotuning of code
optimization parameters during application runtime. With dynamic tuning, the
Kernel Tuning Toolkit enables applications to re-tune performance-critical
kernels at runtime whenever needed, for example, when input data changes.
Although it is generally believed that autotuning spaces tend to be too large
to be searched during application runtime, we show that it is not necessarily
the case when tuning spaces are designed rationally. Many of our kernels reach
near-peak performance with moderately sized tuning spaces that can be searched
at runtime with acceptable overhead. Finally, we demonstrate how dynamic
performance tuning can be integrated into a real-world application from
the cryo-electron microscopy domain.
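The idea of searching a moderately sized tuning space during application runtime can be illustrated with a minimal Python sketch. It does not use the Kernel Tuning Toolkit API; the `DynamicTuner` class, the `block_size` and `unroll` parameters, and the `simulated_kernel_cost` model are all hypothetical stand-ins for a real timed kernel launch.

```python
from itertools import product


class DynamicTuner:
    """Tries one untested configuration per call, then reuses the best one."""

    def __init__(self, space):
        # space: dict mapping parameter name -> list of allowed values
        names = sorted(space)
        self.configs = [dict(zip(names, values))
                        for values in product(*(space[n] for n in names))]
        self.reset()

    def reset(self):
        """Forget all measurements, e.g. after the input data changes."""
        self.next_index = 0
        self.best_config = None
        self.best_cost = float("inf")

    def next_config(self):
        """Return an untested configuration, or the best one once all are tested."""
        if self.next_index < len(self.configs):
            return self.configs[self.next_index]
        return self.best_config

    def report(self, config, cost):
        """Record the measured cost of the configuration that was just run."""
        if (self.next_index < len(self.configs)
                and config is self.configs[self.next_index]):
            self.next_index += 1
        if cost < self.best_cost:
            self.best_cost, self.best_config = cost, config


def simulated_kernel_cost(config, input_size):
    """Hypothetical cost model standing in for a timed kernel launch."""
    ideal_block = 64 if input_size < 1000 else 256
    return abs(config["block_size"] - ideal_block) + 10 * config["unroll"]


def run(tuner, input_size, calls):
    """Invoke the 'kernel' repeatedly; tuning happens as a side effect."""
    for _ in range(calls):
        config = tuner.next_config()
        tuner.report(config, simulated_kernel_cost(config, input_size))
    return tuner.best_config


tuner = DynamicTuner({"block_size": [64, 128, 256], "unroll": [1, 2]})
small_best = run(tuner, input_size=100, calls=10)
tuner.reset()  # the input changed, so the kernel is re-tuned
large_best = run(tuner, input_size=100000, calls=10)
```

In this sketch the six configurations are exhausted after six calls, so the remaining calls already run with the best configuration found; a real tuner would measure wall-clock or device time instead of a synthetic cost.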
Acceleration of image processing algorithms for single particle analysis with electron microscopy (Aceleración de algoritmos de procesamiento de imágenes para el análisis de partículas individuales con microscopia electrónica)
Unpublished doctoral thesis jointly supervised by Masaryk University (Czech Republic) and the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Defense date: 24-10-2022
Cryogenic Electron Microscopy (Cryo-EM) is a vital field in current structural biology. Unlike X-ray
crystallography and Nuclear Magnetic Resonance, it can be used to analyze membrane proteins and
other samples with overlapping spectral peaks. However, one of the significant limitations of Cryo-EM
is the computational complexity. Modern electron microscopes can produce terabytes of data per single
session, from which hundreds of thousands of particles must be extracted and processed to obtain a
near-atomic resolution of the original sample. Many existing software solutions use High-Performance
Computing (HPC) techniques to bring these computations to the realm of practical usability. The
common approach to acceleration is parallelization of the processing, but in practice, we face many
complications, such as problem decomposition, data distribution, load scheduling, balancing, and
synchronization. Utilization of various accelerators further complicates the situation, as heterogeneous
hardware brings additional caveats, for example, limited portability, under-utilization due to synchronization,
and sub-optimal code performance due to missing specialization.
This dissertation, structured as a compendium of articles, aims to improve the algorithms used
in Cryo-EM, especially Single Particle Analysis (SPA). We focus on single-node performance
optimizations, using techniques either available or developed in the HPC field, such as heterogeneous
computing or autotuning, which may require the formulation of novel algorithms. The
secondary goal of the dissertation is to identify the limitations of state-of-the-art HPC techniques. Since
the Cryo-EM pipeline consists of multiple distinct steps targeting different types of data, there is no
single bottleneck to be solved. As such, the presented articles show a holistic approach to performance
optimization.
First, we give details on the GPU acceleration of the specific programs. The achieved speedup is
due to the higher performance of the GPU, adjustments of the original algorithm to it, and application
of the novel algorithms. More specifically, we provide implementation details of programs for movie
alignment, 2D classification, and 3D reconstruction that have been sped up by an order of magnitude
compared to their original multi-CPU implementation, or sufficiently to be used on the fly. In addition
to these three programs, multiple other programs from an actively used, open-source software package
XMIPP have been accelerated and improved.
Second, we discuss our contribution to HPC in the form of autotuning. Autotuning is the ability of
software to adapt to a changing environment, i.e., input or executing hardware. Towards that goal, we
present cuFFTAdvisor, a tool that proposes and, through autotuning, finds the best configuration of the
cuFFT library for given constraints of input size and plan settings. We also introduce a benchmark set
of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA,
together with the introduction of complex dynamic autotuning to the KTT tool.
Third, we propose an image processing framework Umpalumpa, which combines a task-based
runtime system, data-centric architecture, and dynamic autotuning. The proposed framework allows for
writing complex workflows which automatically use available HW resources and adjust to different HW
and data but at the same time are easy to maintain.
The project that gave rise to these results received the support of a fellowship from the “la Caixa”
Foundation (ID 100010434). The fellowship code is LCF/BQ/DI18/11660021.
This project has received funding from the European Union’s Horizon 2020 research and innovation
programme under the Marie Skłodowska-Curie grant agreement No. 71367
BAT: A Benchmark suite for AutoTuners
the code by finding the best possible values for a given architecture. To our knowledge, there are currently no standardized benchmark suites for comparing and testing autotuners. Developers of autotuners thus make their own when presenting and comparing autotuners. We thus present BAT, a Benchmark suite for AutoTuners with HPC-based parameterized GPU programs. CUDA programs and kernels from "The Scalable Heterogeneous Computing (SHOC) Benchmark" are parameterized. BAT contains a varied selection of benchmarks of different complexity that can utilize multiple GPUs on one system, either by running the same program and computations on multiple nodes, or by splitting the work between nodes. BAT contains 9 different HPC benchmarks that provide a large search space of autotuning parameters, and are modified to suit many different autotuners. BAT also includes a CLI that facilitates autotuning with the benchmarks. Our benchmark suite is tested with four different autotuners: OpenTuner, Kernel Tuner, CLTune and KTT. They differ in setup and how they tune. The impact of the different benchmark parameters on the running time across architectures is analyzed. Test systems used include a DGX-2, an IBM Power System AC922 with Tesla V100-SXM2 32 GB GPUs, an RTX Titan, a GeForce GTX 980 and a server with 20 Tesla T4 GPUs.
Portable performance on heterogeneous architectures
Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of programs across all types of resources in the machine.
To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that allows the space of possible implementations to be searched at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures.
United States. Dept. of Energy (Award DE-SC0005288); United States. Defense Advanced Research Projects Agency (Award HR0011-10-9-0009); National Science Foundation (U.S.) (Award CCF-0632997)
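A poly-algorithm whose switching point is chosen empirically can be sketched in a few lines of Python. This is only an illustration of the general idea, not the programming model or compiler described in the abstract; `hybrid_sort`, `tune_cutoff`, and the candidate cutoffs are hypothetical names and values chosen for the example.

```python
import random
import time


def insertion_sort(items):
    """Simple quadratic sort that tends to win on very short lists."""
    items = list(items)
    for i in range(1, len(items)):
        value, j = items[i], i - 1
        while j >= 0 and items[j] > value:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = value
    return items


def hybrid_sort(items, cutoff):
    """Merge sort that switches to insertion sort at or below `cutoff` elements."""
    if len(items) <= max(cutoff, 1):
        return insertion_sort(items)
    middle = len(items) // 2
    left = hybrid_sort(items[:middle], cutoff)
    right = hybrid_sort(items[middle:], cutoff)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


def tune_cutoff(candidates, sample):
    """Time each candidate cutoff on this machine and return the fastest."""
    timings = {}
    for cutoff in candidates:
        start = time.perf_counter()
        hybrid_sort(sample, cutoff)
        timings[cutoff] = time.perf_counter() - start
    return min(timings, key=timings.get)


rng = random.Random(0)
sample = [rng.randint(0, 10**6) for _ in range(5000)]
candidates = [1, 8, 32, 128]
best_cutoff = tune_cutoff(candidates, sample)  # machine-dependent result
result = hybrid_sort(sample, best_cutoff)
```

The chosen cutoff depends on the machine running the tuning step, which is the point: the same source yields a different combination of algorithms on different hardware.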
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
Starting from a high-level problem description in terms of partial
differential equations using abstract tensor notation, the Chemora framework
discretizes, optimizes, and generates complete high performance codes for a
wide range of compute architectures. Chemora extends the capabilities of
Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient
manner for complex applications, without low-level code tuning. Chemora
achieves parallelism through MPI and multi-threading, combining OpenMP and
CUDA. Optimizations include high-level code transformations, efficient loop
traversal strategies, dynamically selected data and instruction cache usage
strategies, and JIT compilation of GPU code tailored to the problem
characteristics. The discretization is based on higher-order finite differences
on multi-block domains. Chemora's capabilities are demonstrated by simulations
of black hole collisions. This problem provides an acid test of the framework,
as the Einstein equations contain hundreds of variables and thousands of terms.
Comment: 18 pages, 4 figures, accepted for publication in Scientific Programming
Benchmarking optimization algorithms for auto-tuning GPU kernels
Recent years have witnessed phenomenal growth in the application and
capabilities of Graphical Processing Units (GPUs) due to their high parallel
computation power at relatively low cost. However, writing a computationally
efficient GPU program (kernel) is challenging, and generally only certain
specific kernel configurations lead to significant increases in performance.
Auto-tuning is the process of automatically optimizing software for
highly-efficient execution on a target hardware platform. Auto-tuning is
particularly useful for GPU programming, as a single kernel requires re-tuning
after code changes, for different input data, and for different architectures.
However, the discrete and non-convex nature of the search space creates a
challenging optimization problem. In this work, we investigate which algorithm
produces the fastest kernels if the time-budget for the tuning task is varied.
We conduct a survey by performing experiments on 26 different kernel spaces,
from 9 different GPUs, for 16 different evolutionary black-box optimization
algorithms. We then analyze these results and introduce a novel metric based on
the PageRank centrality concept as a tool for gaining insight into the
difficulty of the optimization problem. We demonstrate that our metric
correlates strongly with observed tuning performance.
Comment: in IEEE Transactions on Evolutionary Computation, 202
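How a tuning budget interacts with a black-box optimizer on a discrete, non-convex space can be sketched as follows. The example uses plain random search, which is not one of the algorithms evaluated in the paper, and the `SPACE` parameters and `synthetic_cost` function are hypothetical stand-ins for real kernel measurements.

```python
import random

# Hypothetical discrete tuning space for a GPU kernel.
SPACE = {
    "block_size_x": [16, 32, 64, 128, 256],
    "block_size_y": [1, 2, 4, 8],
    "tile_size": [1, 2, 4, 8, 16],
}


def synthetic_cost(config):
    """Made-up, non-convex cost; a real tuner would time the kernel instead."""
    threads = config["block_size_x"] * config["block_size_y"]
    if threads > 1024:
        return float("inf")  # configuration is invalid on the assumed device
    penalty = 50 if config["tile_size"] in (2, 8) else 0
    return abs(threads - 256) / 8 + abs(config["tile_size"] - 4) + penalty


def random_search(space, cost, budget, seed=0):
    """Evaluate `budget` random configurations and keep the cheapest one."""
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(budget):
        config = {name: rng.choice(values) for name, values in space.items()}
        current = cost(config)
        if current < best_cost:
            best_config, best_cost = config, current
    return best_config, best_cost


# With a fixed seed, a larger budget extends the same sample sequence,
# so the best cost found can only stay equal or improve.
_, cost_small_budget = random_search(SPACE, synthetic_cost, budget=5)
_, cost_large_budget = random_search(SPACE, synthetic_cost, budget=50)
```

Comparing several optimizers would amount to running each of them through the same budgeted loop and ranking the best costs they reach.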
Doctor of Philosophy
Dissertation.
Emerging trends such as growing architectural diversity and increased emphasis on energy and power efficiency motivate the need for code that adapts to its execution context (input dataset and target architecture). Unfortunately, writing such code remains difficult, and is typically attempted only by a small group of motivated expert programmers who are highly knowledgeable about the relationship between software and its hardware mapping. In this dissertation, we introduce novel abstractions and techniques based on automatic performance tuning that enable both experts and nonexperts (application developers) to produce adaptive code. We present two new frameworks for adaptive programming: Nitro and Surge. Nitro enables expert programmers to specify code variants, or alternative implementations of the same computation, together with meta-information for selecting among them. It then utilizes supervised classification to select an optimal code variant at runtime based on characteristics of the execution context. Surge, on the other hand, provides a high-level nested data-parallel programming interface for application developers to specify computations. It then employs a two-level mechanism to automatically generate code variants and then tunes them using Nitro. The resulting code performs on par with or better than handcrafted reference implementations on both CPUs and GPUs. In addition to abstractions for expressing code variants, this dissertation also presents novel strategies for adaptively tuning them. First, we introduce a technique for dynamically selecting an optimal code variant at runtime based on characteristics of the input dataset. On five high-performance GPU applications, variants tuned using this strategy achieve over 93% of the performance of variants selected through exhaustive search.
Next, we present a novel approach based on multitask learning to develop a code variant selection model on a target architecture from training on different source architectures. We evaluate this approach on a set of six benchmark applications and a collection of six NVIDIA GPUs from three distinct architecture generations. Finally, we implement support for combined code variant and frequency selection based on multiple objectives, including power and energy efficiency. Using this strategy, we construct a GPU sorting implementation that provides improved energy and power efficiency with less than a proportional drop in sorting throughput.
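Selecting a code variant at runtime with a supervised classifier can be sketched as below. This is not the Nitro API: the 1-nearest-neighbor classifier, the two toy variants, the feature definition, and the `TRAINING` table (imagined as the outcome of an offline exhaustive search) are all assumptions made for illustration.

```python
import math

# Hypothetical training set: input features -> fastest variant found offline.
# Features are (log10 of element count, fraction of nonzero entries).
TRAINING = [
    ((3.0, 0.90), "dense"),
    ((6.0, 0.80), "dense"),
    ((4.0, 0.05), "sparse"),
    ((7.0, 0.01), "sparse"),
]


def sum_dense(values):
    """Variant assumed to be faster when most entries are nonzero."""
    return sum(values)


def sum_sparse(values):
    """Variant assumed to be faster when most entries are zero."""
    return sum(v for v in values if v != 0)


VARIANTS = {"dense": sum_dense, "sparse": sum_sparse}


def features(values):
    """Cheap characteristics of the input used to pick a variant."""
    count = max(len(values), 1)
    nonzero = sum(1 for v in values if v != 0)
    return (math.log10(count), nonzero / count)


def select_variant(values):
    """1-nearest-neighbor classification over the training examples."""
    point = features(values)
    _, label = min(TRAINING, key=lambda example: math.dist(example[0], point))
    return label


def adaptive_sum(values):
    """Dispatch to whichever variant the classifier predicts is fastest."""
    return VARIANTS[select_variant(values)](values)


dense_input = [1] * 1000
sparse_input = [0] * 9500 + [1] * 500
```

Both variants compute the same result, so a misprediction costs only performance, never correctness, which is what makes this kind of learned dispatch safe to apply.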
Application of future processor architectures in precise particle accelerator modeling (Nākotnes procesoru arhitektūru pielietojums precīzu daļiņu paātrinātāju modelēšanā)
Emerging processor architectures such as graphical processing units (GPUs) and Intel Many Integrated Cores (MICs) provide a huge performance potential for high performance computing. However, developing software that uses these hardware accelerators introduces additional challenges for the developer. These challenges may include exposing increased parallelism, handling different hardware designs, and using multiple development frameworks in order to utilize devices from different vendors. The Dynamic Kernel Scheduler (DKS) is being developed in order to provide a software layer between the host application and different hardware accelerators. DKS handles the communication between the host and the device, schedules task execution, and provides a library of built-in algorithms. Algorithms available in the DKS library will be written in CUDA, OpenCL, and OpenMP. Depending on the available hardware, the DKS can select the appropriate implementation of the algorithm. The DKS was used to enable co-processor usage in applications such as OPAL (Object-oriented Particle Accelerator Library), musrfit and a PET (Positron Emission Tomography) image reconstruction application. These applications are developed at Paul Scherrer Institut and ETH Zurich for particle accelerator modeling and experimental data analysis, and are used by the worldwide user community.
The achieved results show that substantial speedups in application execution times can be achieved using co-processors compared to CPUs, and that with the help of DKS the process of integrating new processors in existing applications is simplified and more maintainable. The potential of the new hardware architectures is further demonstrated by porting to CUDA an application for multibunch tracking (mbtrack) developed at SOLEIL (French national synchrotron facility). This application is used at PSI for detailed study of coupled bunch instabilities and transient beam-loading. By using the computational power of the GPUs, the necessary simulations can be done on the GPU instead of a larger computing cluster that would be required otherwise. Keywords: Hardware acceleration, GPU computing, Intel MIC, CUDA, OpenCL, OpenMP