In this paper, we present Patus, a code generation and auto-tuning framework for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units. Patus, which stands for “Parallel Autotuned Stencils,” generates a compute kernel from a specification of the stencil operation and a strategy which describes the parallelization and optimization to be applied, and leverages the autotuning methodology to optimize strategy-specific parameters for the given hardware architecture.

Burkhart, Helmar

Christen, Matthias

Schenk, Olaf

RERO DOC Digital Library

Comput Sci Res Dev (2011) 26: 205–210
DOI 10.1007/s00450-011-0160-6

SPECIAL ISSUE PAPER

Automatic code generation and tuning for stencil kernels on modern shared memory architectures

Matthias Christen · Olaf Schenk · Helmar Burkhart

Published online: 6 April 2011
© Springer-Verlag 2011

M. Christen · O. Schenk · H. Burkhart
Department of Mathematics and Computer Science, University of Basel, Klingelbergstrasse 50, 4056 Basel, Switzerland
e-mail: m.christen@unibas.ch

Abstract In this paper, we present PATUS, a code generation and auto-tuning framework for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units. PATUS, which stands for “Parallel Autotuned Stencils,” generates a compute kernel from a specification of the stencil operation and a strategy which describes the parallelization and optimization to be applied, and leverages the autotuning methodology to optimize strategy-specific parameters for the given hardware architecture.

Keywords Stencil computations · Code generation · Autotuning · High performance computing

1 Introduction

In many numerical codes, ranging from simple PDE solvers to complex AMR and multigrid solvers, the class of stencil computations is a constituent class of kernels. Oftentimes, stencil computations comprise a dominant part of the compute time. Therefore, in order to minimize the time to solution, it is crucial that the stencil kernels make use of the available computing resources as efficiently as possible. However, microarchitectures have become more and more complex and diverse, and, as a consequence, meticulous architecture- and application-specific tuning is required to elicit the machine’s full compute power. This not only requires deeper understanding of the architecture, but is also both a time consuming and error-prone process.

The PATUS framework is a code generation and autotuning tool for the class of stencil computations.
The idea behind the PATUS framework is twofold: on the one hand, it provides a software infrastructure for generating architecture-specific stencil code from a specification of the stencil, incorporating domain-specific knowledge that enables optimizing the code beyond the abilities of current compilers; on the other hand, it aims at being an experimentation toolbox for parallelization and optimization strategies. Using a small domain specific language (DSL), the user can define the stencil kernel using a C-like syntax, and can choose from predefined strategies how the kernel is optimized and parallelized, or design a custom strategy to experiment with other algorithms or find a better mapping to the hardware in use.

Besides supporting almost arbitrary types of stencils on structured grids and generating code from strategy templates, another goal of PATUS is to be able to support future hardware microarchitectures and programming paradigms. Currently we support traditional CPU architectures using OpenMP for parallelization and NVIDIA CUDA-capable GPUs.

This work is closely related to the more generally applicable approach of loop tiling [4, 5, 7, 9, 10]. Cache-oblivious blocking schemes for iterative stencil computations determining the optimal tile sizes at runtime are proposed in [3, 11]. The autotuning methodology has been applied successfully in diverse libraries and frameworks for various types of kernels which occur frequently in scientific computing, including ATLAS, FLAME, OSKI, FFTW, SPIRAL, and recently in a framework for stencil computations [6]. The customizable strategies are a key feature that sets PATUS apart from other frameworks such as the aforementioned one.

Fig. 1 Stencil examples studied in this paper. The images show the structure of the input and output nodes.
The numbers are the arithmetic intensities in number of floating point operations (FLOPs) per transferred data element (TDE). The numerator is the actual number of FLOPs per stencil computation, the denominator the number of actually transferred data elements.

2 Stencil examples and bandwidth saving algorithms

A stencil is a fixed geometric arrangement defined on a structured grid. A stencil computation assigns the center node a value depending on the values previously assigned to the neighboring nodes in the fixed arrangement. This is done for all the inner nodes of the grid. Examples of applications of stencil computations include finite difference-type PDE solvers and image processing filters.

2.1 Stencil examples

In this paper we concentrate on the stencil examples shown in Fig. 1. These stencil kernels will be used as benchmark examples in Sect. 4. The figure shows a number of examples of stencil structures (“input” nodes to the left of the arrows in the images, and “output” nodes to their right) and highlights their variety. The “Laplacian”, “Divergence”, and “Gradient” stencils are finite difference discretizations of the corresponding basic differential operator, whereas “Hyperthermia”, “Upstream”, and “Tricubic Interpolation” come from real world applications; the former comes from a simulation of the temperature distribution within the human body during hyperthermia cancer treatment [1], and the latter two occur as typical examples in the weather forecast code COSMO.

In stencil computations, the number of floating point operations (FLOPs) per grid point is constant. The number of FLOPs typically is low compared to the (constant) number of memory references. That is, stencil computations have a constant arithmetic intensity with respect to the problem size, unlike, e.g., BLAS3 operations. The arithmetic intensities in FLOPs per transferred data element for the examples are given in the figure.
Here, data transfers mean transfers between RAM and caches (CPUs) or global memory and registers (GPUs). Because of the low arithmetic intensity, we typically expect the performance of stencil computations to be bounded by the available memory bandwidth. Hence, the key for enhancing performance lies in minimizing data transfers.

2.2 Saving bandwidth

There are several approaches to get rid of non-compulsory data transfers and thereby reduce the memory traffic. The key is to reuse data that have been loaded previously, i.e., to exploit data locality.

On cache-based architectures, cache blocking is a well known technique to improve temporal data locality: by decomposing the grid into small subgrids whose size depends on the cache size, it is ensured that data loaded into the cache are reused before being evicted due to capacity misses.

Iterative stencil computations can benefit from blocking not only in space, but also in time, especially if there is a local memory that can be controlled explicitly by the programmer such as on a GPU. Temporal blocking has the advantage of greater temporal data locality and reduced synchronization overhead. The basic idea is to compute multiple timesteps with all the data kept in local memory. This avoids writing the data back to main memory after one timestep and reloading it for the next, and it avoids synchronization within a time block. The technique was previously adapted to different types of hardware architectures and was described in [1, 2, 8].

Another temporal blocking scheme is a “wavefront” parallelization proposed by Wellein et al. [12]. The idea of the wavefront parallelization is having a team of threads cooperate on a chunk of data. While thread i is sweeping through the subgrid, thread i + 1 takes the output of thread i to perform its sweep. Hence, each thread in the team calculates one timestep on the same subgrid.
Because of the data dependencies, thread i + 1 has to wait until thread i has computed the input data it needs before it can start its sweep, which makes the computation look like ripples of waves passing through the subgrid.

Fig. 2 High-level overview of the software architecture of PATUS. The strategy and the stencil specification are input files, which drive the code generation. The code generator creates a set of parametrized hardware-specific kernels that are executed by the autotuner, which determines the optimal parameter set.

3 The PATUS framework

PATUS expects three input files to generate a stencil kernel code. The major input, from the user’s point of view, is the specification of the stencil operation. E.g., the discrete Laplacian

$$u'_{i,j,k} = \alpha u_{i,j,k} + \beta \left( u_{i-1,j,k} + u_{i+1,j,k} + u_{i,j-1,k} + u_{i,j+1,k} + u_{i,j,k-1} + u_{i,j,k+1} \right)$$

is specified in the PATUS stencil DSL like so:

    stencil laplacian {
        operation (double grid u, double param alpha, double param beta) {
            u[x, y, z; t+1] =
                alpha * u[x, y, z; t] +
                beta * (u[x-1, y, z; t] + u[x+1, y, z; t] +
                        u[x, y-1, z; t] + u[x, y+1, z; t] +
                        u[x, y, z-1; t] + u[x, y, z+1; t]);
        }
    }

The second input is a “strategy,” which describes how the kernel source is actually generated: it describes parallelization methods or a bandwidth saving algorithm by means of a second DSL. The description is independent both of the stencil and of the hardware architecture and the programming model used. Strategies are also the interfaces to the autotuner: strategy parameters (e.g., blocking sizes) are picked up by the autotuner, which tries to find the values for which the code has the best performance.

PATUS provides predefined strategies, but users can develop their own strategies and thereby experiment with other parallelization and optimization approaches.
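To make the role of a blocking parameter concrete, the following is a minimal Python sketch of the Laplacian stencil from this section, written once as a naive sweep and once with spatial blocking. It is neither PATUS-generated code nor the strategy DSL. The names `laplacian_naive`, `laplacian_blocked`, `_point`, and the `block` argument are hypothetical and chosen only for illustration; the block sizes stand in for the kind of strategy parameter an autotuner would select.

```python
def _point(u, x, y, z, alpha, beta):
    """Evaluate the 7-point Laplacian stencil at one inner node."""
    return alpha * u[x][y][z] + beta * (
        u[x - 1][y][z] + u[x + 1][y][z]
        + u[x][y - 1][z] + u[x][y + 1][z]
        + u[x][y][z - 1] + u[x][y][z + 1])


def _copy(u):
    """Copy a 3D nested-list grid so boundary values carry over unchanged."""
    return [[row[:] for row in plane] for plane in u]


def laplacian_naive(u, alpha, beta):
    """One timestep over all inner nodes, in plain x-y-z order."""
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    out = _copy(u)
    for x in range(1, nx - 1):
        for y in range(1, ny - 1):
            for z in range(1, nz - 1):
                out[x][y][z] = _point(u, x, y, z, alpha, beta)
    return out


def laplacian_blocked(u, alpha, beta, block=(4, 4, 4)):
    """Same timestep, but the inner nodes are visited block by block.

    `block` is an assumed tuning parameter (bx, by, bz); only the
    traversal order differs from laplacian_naive.
    """
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    bx, by, bz = block
    out = _copy(u)
    for x0 in range(1, nx - 1, bx):          # loops over blocks
        for y0 in range(1, ny - 1, by):
            for z0 in range(1, nz - 1, bz):
                # loops over the nodes inside one block
                for x in range(x0, min(x0 + bx, nx - 1)):
                    for y in range(y0, min(y0 + by, ny - 1)):
                        for z in range(z0, min(z0 + bz, nz - 1)):
                            out[x][y][z] = _point(u, x, y, z, alpha, beta)
    return out
```

Both variants compute the same values for every inner node. In compiled code, the blocked traversal is what changes cache reuse, which is why its block sizes are natural candidates for autotuning.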
The strategy DSL is expressive enough to implement the bandwidth saving algorithms described in Sect. 2.2.

The third input file describes various aspects of the hardware for which the code is generated; e.g., it specifies the programming model, thereby selecting the code generator back-end, whether or not the hardware requires explicit data transfers to local stores, whether explicit SIMDization is required, etc.

3.1 The software infrastructure

PATUS is built from four core components: the parsers for the two input files, the stencil definition and the strategy; the code generator, which is driven by the third input file, the architecture specification; and the autotuner. Figure 2 gives a high-level overview of the software architecture.

The code generator produces C code for variants of the stencil kernel and also creates an initialization routine that implements a NUMA-aware data initialization based on the parallelization used in the kernel routine. The code generator transforms the strategy AST to C code and “instantiates” the stencil, i.e., replaces the formal grids and stencil calls in the strategy by the actual identifiers and stencil computation. Moreover, it handles data transfers to local memory if required and performs optimizations such as explicit SIMDization and loop unrolling. Back-ends for shared memory CPU systems using OpenMP for parallelization and for CUDA-capable single-GPU systems have been implemented so far.

In order for the autotuner to perform the benchmarks, the code generator also creates a benchmark harness. The problem size and the autotuning parameters are expected as command line arguments by the benchmark harness. After building the executable from the kernel code and the benchmark harness, the autotuner seeks to find the optimal configuration for the parameters by repeatedly running the program with the autotuning parameters varying according to some search method.
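As an illustration of such a search, here is a hedged Python sketch of a simple axis-by-axis search over tuning parameters. It assumes a hypothetical `benchmark` callable that returns a performance figure (higher is better) for a parameter tuple. In a real setup this callable would run the generated executable with the parameters passed on the command line; here any function can stand in for it. The names `axis_search`, `axes`, and `start` are illustrative and not part of PATUS.

```python
def axis_search(benchmark, axes, start):
    """Search one parameter axis at a time.

    benchmark: callable taking a tuple of parameters, returning a
               performance number (higher is better); an assumption here.
    axes:      list of candidate value lists, one list per parameter.
    start:     initial parameter tuple.

    Along each axis, all candidates are tried while the other parameters
    stay fixed; the best value is kept before moving to the next axis.
    """
    best = list(start)
    best_perf = benchmark(tuple(best))
    for axis, candidates in enumerate(axes):
        for value in candidates:
            trial = best[:axis] + [value] + best[axis + 1:]
            perf = benchmark(tuple(trial))
            if perf > best_perf:
                best, best_perf = trial, perf
    return tuple(best), best_perf
```

A single pass like this needs only as many benchmark runs as there are candidate values in total (plus one for the start point), rather than one run per combination, at the price of possibly missing the global optimum when the parameters interact strongly.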
The PATUS autotuner supports a variety of search methods.

4 Experimental performance results

The performance benchmarks were carried out on an Intel Nehalem (Intel Xeon E7540 “Beckton”) architecture, the AMD Magny-Cours (AMD Opteron 6172), and an NVIDIA Tesla C2050 GPU Computing Processor. Some architecture characteristics are summarized in Table 1.

Table 1 Characteristics of the hardware architecture used in the performance benchmarks

                               Intel            AMD              NVIDIA
                               Xeon E7540       Opteron 6172     Tesla C2050
  Cores                        2 × 6            4 × 6            14
  Concurrency                  24 HW threads    24 HW threads    448 ALUs
  Clock                        2 GHz            2.1 GHz          1.15 GHz
  L1 Data Cache                32 KB            64 KB            48 + 16 KB
  L2 Cache                     256 KB           512 KB           –
  Shared L3 Cache              18 MB            6 MB             –
  Avg. Shared L3/HW Thread     1.5 MB           1 MB             –
  Measured Bandwidth (STREAM)  35.0 GB/s        53.1 GB/s        79.3 GB/s

Fig. 3 Performance results for 6 types of stencils on 1 to 24 threads of the Intel Nehalem and the AMD Magny-Cours architectures, and on the NVIDIA C2050 Fermi GPU for naïve threading and blocking strategies. The low arithmetic intensity stencil kernels (Laplacian, Divergence, Gradient, and Hyperthermia) were calculated using single precision floating point numbers; the kernels with high arithmetic intensity (Upstream and Tricubic) were calculated in double precision. The figure to the lower right shows the performance variations for block sizes varying in two dimensions. The autotuner picks the block yielding the best performance by searching along one axis, fixing the size with the best performance and continuing the search along the next axis as symbolized by the arrows (Powell search method). The size with the best performance has been highlighted in the figure.

We performed performance benchmarks using the autotuned code with a set of strategies applied to the stencil kernels from Sect. 2.1.
In all the plots, the performance numbers of the first four stencils (Laplacian, Divergence, Gradient, and Hyperthermia) are single precision GFLOP/s; Upstream and Tricubic are double precision GFLOP/s. We used a 128^3 sized grid for all the stencil examples. Five runs with one timestep each were performed and timed. The reported performance numbers are averages over these runs.

The “basic threading” strategy only parallelizes the stencil computation without doing any blocking. On the CPUs, two different blocking strategies differing in the number of blocking levels were applied. Both display similar performance results after autotuning on both architectures. Generating explicit SSE intrinsics (instead of relying on the compiler to do the vectorization) proved to be beneficial for the 7-point stencils (Laplacian and Hyperthermia). The slightly higher absolute performance numbers on the Magny-Cours are due to its superior bandwidth for the bandwidth-limited cases and its slightly higher clock rate for the compute-bound case (“Tricubic”).

On the GPU, besides a parallelization using the same basic strategy as for the CPUs, a blocked strategy with two parallelism levels was chosen. The graph shows substantial speed improvements when the thread block sizes are chosen carefully (in our case by means of the autotuner) over default 4^3 thread block sizes. The bar labeled “+Cache” shows the performance improvement from increasing the GPU cache size from 16 KB to 48 KB per streaming multiprocessor. The GPU code generation is still work in progress, and the figure is merely included to highlight the fact that PATUS is able to generate CUDA code, while code optimizations still need to be improved.

Finally, the lower-right panel of Fig. 3 shows the performance for block sizes varying in two dimensions. It is the autotuner’s job to pick the block yielding the best performance. In Fig. 3, the Powell search method is shown, which searches along one axis, fixes the size with the best performance, and continues the search along the next axis.

5 Conclusion

We presented PATUS, a code generation and autotuning framework for general stencil computations. It is intended as both a productivity tool and a tool for experimenting with parallelization and optimization strategies. We have shown that the approach works for both modern multi- and manycore architectures, and the performance numbers demonstrate the potential of leveraging non-trivial strategies and the autotuning methodology. The current framework still has limitations that we intend to overcome in the future. The framework will be publicly available under an open source type license.

References

1. Christen M, Schenk O, Neufeld E, Paulides M, Burkhart H (2010) Manycore stencil computations in hyperthermia applications. In: Scientific computing with multicore and accelerators. CRC Press, Boca Raton, pp 255–277
2. Datta K, Kamil S, Williams S, Oliker L, Shalf J, Yelick K (2008, to appear) Optimization and performance modeling of stencil computations on modern microprocessors. SIAM Rev
3. Frigo M, Strumpen V (2005) Cache oblivious stencil computations. In: ICS’05: proceedings of the 19th annual international conference on supercomputing. ACM, New York, pp 361–366
4. Goumas G, Athanasaki M, Koziris N (2003) An efficient code generation technique for tiled iteration spaces. IEEE Trans Parallel Distrib Syst 14:1021–1034
5. Hall M, Chame J, Chen C, Shin J, Rudy G, Khan M (2010) Loop transformation recipes for code generation and auto-tuning. In: Gao G, Pollock L, Cavazos J, Li X (eds) Languages and compilers for parallel computing. Lecture Notes in Computer Science, vol 5898. Springer, Berlin, pp 50–64
6. Kamil S, Chan C, Oliker L, Shalf J, Williams S (2010) An auto-tuning framework for parallel multicore stencil computations. In: IEEE International Parallel & Distributed Processing Symposium (IPDPS), pp 1–12
7. Li Z, Song Y (2004) Automatic tiling of iterative stencil loops. ACM Trans Program Lang Syst 26(6):975–1028
8. Meng J, Skadron K (2011) A performance study for iterative stencil loops on GPUs with ghost zone optimizations. Int J Parallel Program 39:115–142. doi:10.1007/s10766-010-0142-5
9. Renganarayanan L, Kim D, Rajopadhye S, Strout M (2007) Parameterized tiled loops for free. ACM SIGPLAN Not 42:405–414
10. Rivera G, Tseng C (2000) Tiling optimizations for 3D scientific computations. In: Supercomputing, ACM/IEEE 2000 conference
11. Strzodka R, Shaheen M, Pajak D, Seidel H (2010) Cache oblivious parallelograms in iterative stencil computations. In: ICS’10: proceedings of the 24th ACM international conference on supercomputing, pp 49–59
12. Wellein G, Hager G, Zeiser T, Wittmann M, Fehske H (2009) Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization. In: COMPSAC(1), pp 579–586

Matthias Christen received his MS degree in Mathematics from the University of Basel, Switzerland, in 2006. Currently he is a PhD student in Computer Science at the University of Basel. His research interests are in code generation, autotuning, and high-performance computing.

Olaf Schenk received a diploma in mathematics from the Karlsruhe Institute of Technology, Germany, a PhD degree from the Swiss Federal Institute of Technology (ETH) and Venia Legendi (Habilitation) from the University of Basel, Switzerland. The research of Olaf Schenk concerns algorithmic and architectural problems in the field of computational mathematics, scientific computing and high-performance computing. In these areas, he has published more than 60 peer-reviewed journal articles and conference contributions. He is an IEEE Senior Member, and a SIAM Member.
He received a highly competitive IBM Faculty Award on Cell Processors for Biomedical Hyperthermia Applications in 2008 and was a finalist of the International Itanium Award in 2009 in the area of computationally intensive applications.

Helmar Burkhart has been a Computer Science Professor at the University of Basel since 1987. He received a diploma in Computer Science from the University of Stuttgart, Germany, and a PhD degree and Venia Legendi (Habilitation) from the Swiss Federal Institute of Technology (ETH) Zurich, Switzerland. He has held several positions such as President of the Swiss Informatics Society SI/Swiss Chapter of the ACM (1990–1992), member of the expert group Swiss Priority Programme in Informatics Research (1991–1996), and cofounder and board member of the SPEEDUP association. His research interests include parallel and distributed processing, web technologies, and e-learning.

Automatic code generation and tuning for stencil kernels on modern shared memory architectures

http://doc.rero.ch/record/313510/files/450_2011_Article_160.pdf
