
    A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit

    Autotuning of performance-relevant source-code parameters makes it possible to tune applications automatically, without hard-coding optimizations, and thus helps keep performance portable. In this paper, we introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA. Using our Kernel Tuning Toolkit, we show that with autotuning most of the kernels reach near-peak performance on various GPUs and outperform baseline implementations on CPUs and Xeon Phis. Our evaluation also demonstrates that autotuning is key to performance portability. In addition to offline tuning, we also introduce dynamic autotuning of code optimization parameters during application runtime. With dynamic tuning, the Kernel Tuning Toolkit enables applications to re-tune performance-critical kernels at runtime whenever needed, for example, when input data change. Although it is generally believed that autotuning spaces tend to be too large to be searched during application runtime, we show that this is not necessarily the case when tuning spaces are designed rationally. Many of our kernels reach near-peak performance with moderately sized tuning spaces that can be searched at runtime with acceptable overhead. Finally, we demonstrate how dynamic performance tuning can be integrated into a real-world application from the cryo-electron microscopy domain.
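
    The offline and dynamic tuning described above can be illustrated with a minimal sketch (Python; run_kernel is a hypothetical stand-in for a measured CUDA/OpenCL launch, and the parameter names are illustrative rather than KTT's actual API): a small, rationally designed tuning space is searched exhaustively, and the search is repeated at runtime whenever the input changes.

```python
import itertools
import time

# Hypothetical stand-in for compiling and timing one kernel configuration;
# a real setup would launch a CUDA/OpenCL kernel (e.g. through a tuner such as KTT).
def run_kernel(data, block_size, vector_width, use_shared_mem):
    time.sleep(0.001)          # placeholder for the measured kernel execution
    return 0.001               # measured time in seconds

# A rationally designed, hence moderately sized, tuning space (24 configurations).
TUNING_SPACE = {
    "block_size": [64, 128, 256, 512],
    "vector_width": [1, 2, 4],
    "use_shared_mem": [False, True],
}

def tune(data):
    """Exhaustively search the tuning space and return the fastest configuration."""
    best_cfg, best_time = None, float("inf")
    names = list(TUNING_SPACE)
    for values in itertools.product(*(TUNING_SPACE[n] for n in names)):
        cfg = dict(zip(names, values))
        elapsed = run_kernel(data, **cfg)
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg

# Dynamic tuning: re-tune only when the input characteristics change at runtime.
best_cfg, tuned_for = None, None
for batch in [{"shape": (1024, 1024)}, {"shape": (1024, 1024)}, {"shape": (4096, 4096)}]:
    if batch["shape"] != tuned_for:
        best_cfg, tuned_for = tune(batch), batch["shape"]
    run_kernel(batch, **best_cfg)
```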

    Acceleration of image processing algorithms for single particle analysis with electron microscopy

    Unpublished doctoral thesis, co-supervised by Masaryk University (Czech Republic) and the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Defense date: 24-10-2022. Cryogenic Electron Microscopy (Cryo-EM) is a vital field in current structural biology. Unlike X-ray crystallography and Nuclear Magnetic Resonance, it can be used to analyze membrane proteins and other samples with overlapping spectral peaks. However, one of the significant limitations of Cryo-EM is its computational complexity. Modern electron microscopes can produce terabytes of data per single session, from which hundreds of thousands of particles must be extracted and processed to obtain a near-atomic resolution of the original sample. Many existing software solutions use High-Performance Computing (HPC) techniques to bring these computations to the realm of practical usability. The common approach to acceleration is parallelization of the processing, but in practice we face many complications, such as problem decomposition, data distribution, load scheduling, balancing, and synchronization. Utilization of various accelerators further complicates the situation, as heterogeneous hardware brings additional caveats, for example, limited portability, under-utilization due to synchronization, and sub-optimal code performance due to missing specialization. This dissertation, structured as a compendium of articles, aims to improve the algorithms used in Cryo-EM, especially in Single Particle Analysis (SPA). We focus on single-node performance optimizations, using techniques either available or developed in the HPC field, such as heterogeneous computing or autotuning, which potentially require the formulation of novel algorithms. The secondary goal of the dissertation is to identify the limitations of state-of-the-art HPC techniques. Since the Cryo-EM pipeline consists of multiple distinct steps targeting different types of data, there is no single bottleneck to be solved. As such, the presented articles take a holistic approach to performance optimization. First, we give details on the GPU acceleration of specific programs. The achieved speedup is due to the higher performance of the GPU, adjustments of the original algorithm to it, and application of novel algorithms. More specifically, we provide implementation details of programs for movie alignment, 2D classification, and 3D reconstruction that have been sped up by an order of magnitude compared to their original multi-CPU implementations, or sufficiently to be usable on the fly. In addition to these three programs, multiple other programs from XMIPP, an actively used open-source software package, have been accelerated and improved. Second, we discuss our contribution to HPC in the form of autotuning. Autotuning is the ability of software to adapt to a changing environment, i.e., input or executing hardware. Towards that goal, we present cuFFTAdvisor, a tool that proposes and, through autotuning, finds the best configuration of the cuFFT library for given constraints of input size and plan settings. We also introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA, together with the introduction of complex dynamic autotuning to the KTT tool. Third, we propose Umpalumpa, an image processing framework that combines a task-based runtime system, data-centric architecture, and dynamic autotuning. The proposed framework allows for writing complex workflows which automatically use available hardware resources and adjust to different hardware and data, but at the same time are easy to maintain. The project that gave rise to these results received the support of a fellowship from the “la Caixa” Foundation (ID 100010434). The fellowship code is LCF/BQ/DI18/11660021. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 71367
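
    One concrete example of the kind of recommendation cuFFTAdvisor automates is padding transforms to sizes that cuFFT handles fastest, i.e. sizes whose only prime factors are 2, 3, 5 and 7. A minimal sketch of that heuristic (Python; the function names are illustrative, not the tool's API):

```python
def is_fft_friendly(n: int) -> bool:
    """True if n factors only into 2, 3, 5 and 7 (the sizes cuFFT handles most efficiently)."""
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    return n == 1

def next_friendly_size(n: int) -> int:
    """Smallest size >= n that is FFT-friendly; padding to it often beats transforming n directly."""
    while not is_fft_friendly(n):
        n += 1
    return n

# Example: a 4097-sample dimension would be padded to 4116 = 2^2 * 3 * 7^3.
print(next_friendly_size(4097))
```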

    BAT: A Benchmark suite for AutoTuners

    …the code by finding the best possible values for a given architecture. To our knowledge, there are currently no standardized benchmark suites for comparing and testing autotuners. Developers of autotuners thus make their own when presenting and comparing autotuners. We therefore present BAT, a Benchmark suite for AutoTuners with HPC-based parameterized GPU programs. CUDA programs and kernels from "The Scalable Heterogeneous Computing (SHOC) Benchmark" are parameterized. BAT contains a varied selection of benchmarks of different complexity that can utilize multiple GPUs on one system, either by running the same program and computations on multiple nodes, or by splitting the work between nodes. BAT contains 9 different HPC benchmarks that provide a large search space of autotuning parameters and are modified to suit many different autotuners. BAT also includes a CLI that facilitates autotuning with the benchmarks. Our benchmark suite is tested with four different autotuners: OpenTuner, Kernel Tuner, CLTune and KTT. They differ in setup and in how they tune. The impact of the different benchmark parameters on the running time across architectures is analyzed. Test systems used include a DGX-2, an IBM Power System AC922 with Tesla V100-SXM2 32 GB GPUs, an RTX Titan, a GeForce GTX 980, and a server with 20 Tesla T4 GPUs.
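
    The idea of a parameterized benchmark can be sketched as follows (Python; the benchmark name, file and parameter names are illustrative, not BAT's actual schema): each benchmark exposes its tunable parameters so that different autotuners can explore the same search space.

```python
import json

# Illustrative description of one parameterized GPU benchmark; the fields shown
# here are hypothetical, not BAT's actual file format.
benchmark = {
    "name": "reduction",
    "kernel_file": "reduction.cu",
    "parameters": {
        "BLOCK_SIZE": [32, 64, 128, 256, 512, 1024],
        "ITEMS_PER_THREAD": [1, 2, 4, 8],
        "USE_SHUFFLE": [0, 1],
    },
}

def search_space_size(bench):
    """Number of configurations an autotuner would have to consider."""
    size = 1
    for values in bench["parameters"].values():
        size *= len(values)
    return size

print(json.dumps(benchmark, indent=2))
print("configurations:", search_space_size(benchmark))   # 6 * 4 * 2 = 48
```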

    Portable performance on heterogeneous architectures

    Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of programs across all types of resources in the machine. To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that allows the space of possible implementations to be searched at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures. United States. Dept. of Energy (Award DE-SC0005288); United States. Defense Advanced Research Projects Agency (Award HR0011-10-9-0009); National Science Foundation (U.S.) (Award CCF-0632997)
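
    The core idea, choices between algorithms and processor mappings resolved by empirical measurement, can be sketched as follows (Python; the sort variants are placeholders, not the framework's actual generated code):

```python
import time

# Placeholder implementations standing in for the compiler-generated CPU, GPU and
# hybrid code paths; real variants would differ in algorithm and memory mapping.
def sort_cpu(data):
    return sorted(data)

def sort_gpu(data):
    return sorted(data)            # stand-in for a GPU code path

def sort_hybrid(data):
    mid = len(data) // 2           # split work, then merge the two halves
    return sorted(sort_cpu(data[:mid]) + sort_gpu(data[mid:]))

CHOICES = {"cpu": sort_cpu, "gpu": sort_gpu, "hybrid": sort_hybrid}

def autotune(sample):
    """Install-time empirical search: time every choice on a representative input."""
    timings = {}
    for name, variant in CHOICES.items():
        start = time.perf_counter()
        variant(sample)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)

print("selected variant:", autotune(list(range(100_000, 0, -1))))
```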

    From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

    Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms. Comment: 18 pages, 4 figures, accepted for publication in Scientific Programming.
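
    As an illustration of the discretization Chemora starts from, a fourth-order centered finite-difference derivative looks like this (Python/NumPy sketch; Chemora itself emits optimized C/CUDA from abstract tensor notation, so this shows only the stencil, not the generated code):

```python
import numpy as np

def dx_4th_order(u, h):
    """Fourth-order centered finite difference of u (interior points only):
    u'(x_i) ~ (-u[i+2] + 8 u[i+1] - 8 u[i-1] + u[i-2]) / (12 h)."""
    return (-u[4:] + 8 * u[3:-1] - 8 * u[1:-3] + u[:-4]) / (12 * h)

x = np.linspace(0.0, 2 * np.pi, 201)
u = np.sin(x)
error = np.max(np.abs(dx_4th_order(u, x[1] - x[0]) - np.cos(x[2:-2])))
print(error)   # on the order of 1e-7, confirming fourth-order accuracy
```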

    Benchmarking optimization algorithms for auto-tuning GPU kernels

    Recent years have witnessed phenomenal growth in the application and capabilities of Graphical Processing Units (GPUs) due to their high parallel computation power at relatively low cost. However, writing a computationally efficient GPU program (kernel) is challenging, and generally only certain specific kernel configurations lead to significant increases in performance. Auto-tuning is the process of automatically optimizing software for highly efficient execution on a target hardware platform. Auto-tuning is particularly useful for GPU programming, as a single kernel requires re-tuning after code changes, for different input data, and for different architectures. However, the discrete and non-convex nature of the search space creates a challenging optimization problem. In this work, we investigate which algorithm produces the fastest kernels if the time budget for the tuning task is varied. We conduct a survey by performing experiments on 26 different kernel spaces, from 9 different GPUs, for 16 different evolutionary black-box optimization algorithms. We then analyze these results and introduce a novel metric based on the PageRank centrality concept as a tool for gaining insight into the difficulty of the optimization problem. We demonstrate that our metric correlates strongly with observed tuning performance. Comment: in IEEE Transactions on Evolutionary Computation, 202
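
    A minimal sketch of the experimental setting, a black-box optimizer searching a discrete, non-convex tuning space under a wall-clock budget, might look like this (Python; the search space and runtime surface are synthetic stand-ins for real kernel measurements, and the simple local search stands in for the evolutionary algorithms studied):

```python
import random
import time

# Toy discrete search space standing in for a GPU kernel's tuning parameters.
SPACE = {"block_x": [1, 2, 4, 8, 16, 32], "block_y": [1, 2, 4, 8], "tile": [1, 2, 4, 8, 16]}

def benchmark(cfg):
    """Synthetic, noisy, non-convex 'runtime'; real experiments time the compiled kernel."""
    return abs(cfg["block_x"] * cfg["block_y"] - 64) + abs(cfg["tile"] - 4) + random.random()

def random_neighbour(cfg):
    """Perturb one randomly chosen parameter."""
    changed = dict(cfg)
    key = random.choice(list(SPACE))
    changed[key] = random.choice(SPACE[key])
    return changed

def tune(budget_seconds):
    """A simple (1+1)-style local search stopped by a time budget."""
    best = {k: random.choice(v) for k, v in SPACE.items()}
    best_time = benchmark(best)
    deadline = time.perf_counter() + budget_seconds
    while time.perf_counter() < deadline:
        candidate = random_neighbour(best)
        candidate_time = benchmark(candidate)
        if candidate_time < best_time:
            best, best_time = candidate, candidate_time
    return best, best_time

print(tune(0.1))
```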

    Doctor of Philosophy

    Emerging trends such as growing architectural diversity and increased emphasis on energy and power efficiency motivate the need for code that adapts to its execution context (input dataset and target architecture). Unfortunately, writing such code remains difficult, and is typically attempted only by a small group of motivated expert programmers who are highly knowledgeable about the relationship between software and its hardware mapping. In this dissertation, we introduce novel abstractions and techniques based on automatic performance tuning that enable both experts and non-experts (application developers) to produce adaptive code. We present two new frameworks for adaptive programming: Nitro and Surge. Nitro enables expert programmers to specify code variants, or alternative implementations of the same computation, together with meta-information for selecting among them. It then utilizes supervised classification to select an optimal code variant at runtime based on characteristics of the execution context. Surge, on the other hand, provides a high-level nested data-parallel programming interface for application developers to specify computations. It then employs a two-level mechanism to automatically generate code variants and then tunes them using Nitro. The resulting code performs on par with or better than handcrafted reference implementations on both CPUs and GPUs. In addition to abstractions for expressing code variants, this dissertation also presents novel strategies for adaptively tuning them. First, we introduce a technique for dynamically selecting an optimal code variant at runtime based on characteristics of the input dataset. On five high-performance GPU applications, variants tuned using this strategy achieve over 93% of the performance of variants selected through exhaustive search. Next, we present a novel approach based on multitask learning to develop a code variant selection model on a target architecture from training on different source architectures. We evaluate this approach on a set of six benchmark applications and a collection of six NVIDIA GPUs from three distinct architecture generations. Finally, we implement support for combined code variant and frequency selection based on multiple objectives, including power and energy efficiency. Using this strategy, we construct a GPU sorting implementation that provides improved energy and power efficiency with less than a proportional drop in sorting throughput.
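
    The runtime variant selection described above can be sketched with a small supervised classifier (Python with scikit-learn; the features, labels and training data are illustrative, not Nitro's actual model or interface):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data standing in for tuning runs: each row holds input features
# (e.g. matrix rows, nonzeros per row), each label the fastest variant for that input.
X_train = [
    [1_000, 5], [2_000, 8], [4_000, 12],          # short, sparse rows
    [1_000, 500], [2_000, 800], [4_000, 1_200],   # long, dense rows
]
y_train = ["csr_scalar", "csr_scalar", "csr_scalar",
           "csr_vector", "csr_vector", "csr_vector"]

model = DecisionTreeClassifier().fit(X_train, y_train)

def select_variant(num_rows, nnz_per_row):
    """Pick a code variant at runtime from characteristics of the input dataset."""
    return model.predict([[num_rows, nnz_per_row]])[0]

print(select_variant(3_000, 10))    # -> csr_scalar
print(select_variant(3_000, 900))   # -> csr_vector
```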

    Application of future processor architectures in precise particle accelerator modelling

    Emerging processor architectures such as graphical processing units (GPUs) and Intel Many Integrated Cores (MICs) provide a huge performance potential for high performance computing. However, developing software that uses these hardware accelerators introduces additional challenges for the developer. These challenges may include exposing increased parallelism, handling different hardware designs, and using multiple development frameworks in order to utilize devices from different vendors. The Dynamic Kernel Scheduler (DKS) is being developed in order to provide a software layer between the host application and different hardware accelerators. DKS handles the communication between the host and the device, schedules task execution, and provides a library of built-in algorithms. Algorithms available in the DKS library will be written in CUDA, OpenCL, and OpenMP. Depending on the available hardware, the DKS can select the appropriate implementation of the algorithm. The DKS was used to enable co-processor usage in applications such as OPAL (Object-oriented Particle Accelerator Library), musrfit, and a PET (Positron Emission Tomography) image reconstruction application. These applications are developed at the Paul Scherrer Institut and ETH Zurich for particle accelerator modeling and experimental data analysis, and are used by a worldwide user community. The achieved results show that substantial speedups in application execution times can be achieved using co-processors compared to CPUs, and that with the help of DKS the process of integrating new processors into existing applications is simplified and more maintainable. The potential of the new hardware architectures is further demonstrated by porting to CUDA the multibunch tracking application (mbtrack) developed at SOLEIL (the French national synchrotron facility). This application is used at PSI for detailed study of coupled bunch instabilities and transient beam loading. By using the computational power of GPUs, the necessary simulations can be done on the GPU instead of the larger computing cluster that would otherwise be required. Keywords: Hardware acceleration, GPU computing, Intel MIC, CUDA, OpenCL, OpenMP
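
    The software-layer idea behind DKS can be sketched as a simple dispatcher that picks whichever backend implementation matches the hardware actually present (Python; the device detection and backends are placeholders, not DKS's real API):

```python
def detect_devices():
    """A real implementation would query CUDA, OpenCL and the CPU topology."""
    return ["cuda"]                               # pretend an NVIDIA GPU was found

# One entry per algorithm, with alternative implementations per backend.
BACKENDS = {
    "fft": {
        "cuda":   lambda data: f"cuFFT path on {len(data)} samples",
        "opencl": lambda data: f"clFFT path on {len(data)} samples",
        "openmp": lambda data: f"multi-threaded CPU FFT on {len(data)} samples",
    },
}

def run(algorithm, data, preference=("cuda", "opencl", "openmp")):
    """Dispatch to the most preferred backend that is available on this machine."""
    available = set(detect_devices()) | {"openmp"}   # the CPU fallback is always present
    for backend in preference:
        if backend in available:
            return BACKENDS[algorithm][backend](data)
    raise RuntimeError("no suitable backend found")

print(run("fft", list(range(1024))))
```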