We are currently investigating the viability of many-core architectures for the acceleration of wavefront applications, and this report focuses on graphics processing units (GPUs) in particular. To this end, we have implemented NASA’s LU benchmark, a real-world production-grade application, on GPUs employing NVIDIA’s Compute Unified Device Architecture (CUDA).

This GPU implementation of the benchmark has been used to investigate the performance of a selection of GPUs, ranging from workstation-grade commodity GPUs to the HPC “Tesla” and “Fermi” GPUs. We have also compared the performance of the GPU solution at scale to that of traditional high performance computing (HPC) clusters based on a range of multi-core CPUs from a number of major vendors, including Intel (Nehalem), AMD (Opteron) and IBM (PowerPC).

In previous work we have developed a predictive “plug-and-play” performance model of this class of application running on such clusters, in which CPUs communicate via the Message Passing Interface (MPI). By extending this model to also capture the performance behaviour of GPUs, we are able to: (1) comment on the effects that architectural changes will have on the performance of single-GPU solutions, and (2) make projections regarding the performance of multi-GPU solutions at larger scale.

Hammond, Simon D.

Jarvis, Stephen A.

Mudalige, Gihan R.

Pennycook, Simon J.

English

Warwick Research Archives Portal Repository

http://wrap.warwick.ac.uk/

Original citation: Pennycook, Simon J., Hammond, Simon D., Mudalige, Gihan R. and Jarvis, Stephen A., 1970- (2010) Experiences with porting and modelling wavefront algorithms on many-core architectures. In: Daresbury GPU Workshop 2010, Daresbury, UK

Permanent WRAP url: http://wrap.warwick.ac.uk/47466

Copyright and reuse: The Warwick Research Archive Portal (WRAP) makes this work by researchers of the University of Warwick available open access under the following conditions. Copyright © and all moral rights to the version of the paper presented here belong to the individual author(s) and/or other copyright owners. To the extent reasonable and practicable the material made available in WRAP has been checked for eligibility before being made available. Copies of full items can be used for personal research or study, educational, or not-for-profit purposes without prior permission or charge, provided that the authors, title and full bibliographic details are credited, a hyperlink and/or URL is given for the original metadata page and the content is not changed in any way.

A note on versions: The version presented in WRAP is the published version, or version of record, and may be cited as it appears here. For more information, please contact the WRAP Team at: publications@warwick.ac.uk

Experiences with Porting and Modelling Wavefront Algorithms on Many-Core Architectures

S.J. Pennycook, S.D. Hammond, G.R. Mudalige, and S.A. Jarvis

1 Introduction

We are currently investigating the viability of many-core architectures for the acceleration of wavefront applications, and this report focuses on graphics processing units (GPUs) in particular.
To this end, we have implemented NASA’s LU benchmark [1], a real-world production-grade application, on GPUs employing NVIDIA’s Compute Unified Device Architecture (CUDA) [2].

This GPU implementation of the benchmark has been used to investigate the performance of a selection of GPUs, ranging from workstation-grade commodity GPUs to the HPC “Tesla” and “Fermi” GPUs. We have also compared the performance of the GPU solution at scale to that of traditional high performance computing (HPC) clusters based on a range of multi-core CPUs from a number of major vendors, including Intel (Nehalem), AMD (Opteron) and IBM (PowerPC).

In previous work we have developed a predictive “plug-and-play” performance model of this class of application running on such clusters, in which CPUs communicate via the Message Passing Interface (MPI) [3, 4]. By extending this model to also capture the performance behaviour of GPUs, we are able to: (1) comment on the effects that architectural changes will have on the performance of single-GPU solutions, and (2) make projections regarding the performance of multi-GPU solutions at larger scale.

2 Wavefront Applications

Typical three-dimensional implementations of the parallel wavefront algorithm operate over a grid of $N_x \times N_y \times N_z$ grid-points. Computation starts at one of the grid’s corner vertices and progresses to the opposite corner; we refer to this traversal henceforth as a sweep.

By way of example, we consider a single sweep through the data-grid in which the computation required by each grid-point $(i, j, k)$ is dependent upon the values of three neighbours: $(i-1, j, k)$, $(i, j-1, k)$ and $(i, j, k-1)$. In [5], Lamport showed that, for a given value of $f$, all grid-points that lie on the hyperplane defined by $i + j + k = f$ can be computed in parallel. Furthermore, all of the grid-points upon which this computation depends satisfy $i + j + k = f - 1$; by incrementing $f$ one step at a time and computing every grid-point that satisfies the equation, the dependency is preserved. We refer to each of these steps as a wavefront step.
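The hyperplane ordering described above can be illustrated with a short sequential sketch. This is not the GPU implementation from this note: the function names `hyperplanes` and `sweep` and the toy update rule are assumptions made purely for illustration, and only the dependency pattern on $(i-1, j, k)$, $(i, j-1, k)$ and $(i, j, k-1)$ follows the text.

```python
from itertools import product


def hyperplanes(nx, ny, nz):
    """Yield, for f = 0, 1, ..., the grid-points satisfying i + j + k == f."""
    # The last hyperplane is f = (nx - 1) + (ny - 1) + (nz - 1).
    for f in range(nx + ny + nz - 2):
        yield [
            (i, j, f - i - j)
            for i, j in product(range(nx), range(ny))
            if 0 <= f - i - j < nz
        ]


def sweep(nx, ny, nz):
    """Toy sweep over the grid, one wavefront step per hyperplane.

    Hypothetical update rule: each point stores 1 + the largest value held
    by its three upstream neighbours (missing neighbours count as 0).
    """
    value = {}
    for plane in hyperplanes(nx, ny, nz):
        # Points on one plane depend only on the previous plane, so the
        # iterations of this inner loop are independent of each other.
        for (i, j, k) in plane:
            upstream = [(i - 1, j, k), (i, j - 1, k), (i, j, k - 1)]
            value[(i, j, k)] = 1 + max(value.get(p, 0) for p in upstream)
    return value
```

For a 4 × 4 × 4 grid, `[len(p) for p in hyperplanes(4, 4, 4)]` gives the hyperplane sizes 1, 3, 6, 10, 12, 12, 10, 6, 3, 1, so the available parallelism grows and then shrinks over the ten wavefront steps. With the toy rule, each point ends up holding the (1-based) index of the wavefront step in which it was computed.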
Figure 1: Two-dimensional mapping of threads onto a three-dimensional data grid.

This algorithm accounts for a large share of execution time at supercomputing centres such as the Los Alamos National Laboratory (LANL) in the United States and the Atomic Weapons Establishment (AWE) in the United Kingdom. Therefore, the acceleration of wavefront applications is of both commercial and academic interest.

2.1 GPU Implementation

Due to the strict data dependency present in wavefront applications, it is necessary to introduce a method of global thread synchronisation. We achieve this through the use of repeated kernel invocation, launching one kernel for the solution of each of the wavefront steps. In each kernel we launch $O(N^2)$ threads, mapping them to grid-points as shown in Figure 1. Those threads assigned to grid-points that are not currently on the active hyperplane exit immediately, without carrying out any computation.

Global memory is rearranged in keeping with this mapping (to ensure coalescence of memory accesses), but our kernels do not make use of shared memory.

2.2 GPU Performance

The graph in Figure 2 shows the performance of the GPU solution running on three HPC-capable GPUs: the Tesla T10, Tesla C1060 and Tesla C2050. Also included is the performance of the original Fortran benchmark executed on a quad-core 2.66GHz Intel “Nehalem” X5550 with 12GB of RAM, with and without simultaneous multithreading (SMT). The execution times reported are for problem classes A, B and C of the benchmark, which use data grids of size $64^3$, $102^3$ and $162^3$ respectively.

The GPU solution outperforms the original Fortran benchmark for all three problem classes, with the Tesla T10/C1060 and C2050 being up to 2.3x and 6.9x faster respectively.

One of the most eagerly anticipated additions to NVIDIA’s Fermi architecture was the inclusion of ECC memory.
ECC memory is not without cost, however: firstly, enabling ECC on the Tesla C2050 decreases the amount of global memory available to the user from 3GB to 2.65GB; secondly, it leads to a significant performance decrease. For a Class C problem run in double precision, execution is almost 1.3x faster when ECC is disabled.

Figure 2: Execution times of LU across different workstation configurations. [Bar chart of execution time in seconds for problem classes A, B and C; series: Intel X5550 (4 cores), Intel X5550 (8 SMT threads), Tesla T10, Tesla C1060, Tesla C2050 (ECC on), Tesla C2050 (ECC off).]

High Performance Systems Group, University of Warwick. September 2010. Research Note UW22-9-2010-2.0. For more information: go.warwick.ac.uk/hpsg

Though these results illustrate the performance benefits of GPU utilisation at the level of a single workstation, they do not answer the question of whether GPUs are ready to be adopted in place of CPUs at the cluster level. The development of performance models is likely to aid us in attempting to answer this question. Furthermore, results from several HPC clusters are to appear in an SC'10 paper later this year.

3 Performance Modelling

Previous work by the group developed a performance model for wavefront applications running at scale on CPUs communicating via MPI [3, 4].
We attempt to adapt this model for use with GPU clusters, which is made possible by its reusable and generic nature.

We can capture changes to compute behaviour by replacing the original CPU “grind time” parameter with a GPU sub-model, whilst the MPI-associated overheads can be incorporated into the message latency parameter.

It is worth noting that, in our experience, the new costs that result from data transfer across the PCI Express (PCIe) bus are unlikely to contribute significantly to the overall execution time in a realistic GPU-enabled application. A well-written application will not transfer the entire data grid back to the CPU for communication; doing so would be wasteful, since only the border data is actually required. Packing and unpacking the MPI buffers on the GPU avoids the transfer of unnecessary data and greatly decreases the effect of these overheads.

Several other studies have demonstrated that it is possible to accurately model the performance of select application kernels based on source code analysis, or through low-level hardware simulation of GPUs [6, 7]. Our previous performance models have been produced at a higher level of abstraction, usually based on timing results from instrumented code and/or benchmark results. However, the concept of executing kernels on a separate device (shared by both CUDA and OpenCL) does not often lend itself well to such instrumentation. The best that we can achieve with CPU timers is to model the time taken for a given application kernel; in our case, this corresponds directly to the time taken for a given hyperplane step.

Figure 3: Execution times for Tesla C1060 and model predictions. [Plot of execution time in seconds against number of grid-points, from 0 to 20,000; series: Tesla, Tesla Model.]

The graph in Figure 3 shows the execution times for each of the hyperplane steps in our GPU implementation of LU when run on a Tesla C1060 card.
The first hyperplane step computes the value of a single grid-point, the second the values of three grid-points, the third six grid-points and so on, up to a maximum of approximately 20,000 grid-points.

We model the execution time for a given number of grid-points, $g$, thus:

$$B(g) = \left\lceil \frac{g}{S \times T} \right\rceil \tag{1}$$

$$t(g) = h + \left( C \times (B(g) - 1) \times h \right) \tag{2}$$

where $B$ is the number of blocks per streaming multiprocessor (SM), $S$ is the number of SMs and $T$ is the number of threads per block. $h$ represents the time taken to process the first active grid-point (calculated through code instrumentation). Finally, the presence of $C$ attempts to compensate for the GPU’s ability to run several thread blocks concurrently. We currently know little more about this “concurrency factor” other than that it is a function of the number of thread blocks that can be scheduled to each SM, itself a function of the number of registers used by a particular kernel. We expect that the exact nature of this term will become clearer as our work progresses.

Essentially, the model states that the execution time of the kernel will stay constant so long as the number of thread blocks assigned to each SM remains the same.
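Equations (1) and (2) can be evaluated directly, as in the minimal sketch below. The function and parameter names are assumptions chosen for illustration. The values $S = 30$, $T = 64$ and $C = 0.5$ are those this note reports for the Tesla C1060 and its wavefront kernel; $h$ is set to 1.0 so that times are expressed in multiples of $h$ rather than in measured seconds.

```python
def blocks_per_sm(g, num_sms, threads_per_block):
    """Equation (1): B(g) = ceil(g / (S * T)), using integer arithmetic."""
    threads_per_fill = num_sms * threads_per_block
    return (g + threads_per_fill - 1) // threads_per_fill


def kernel_time(g, h, c, num_sms, threads_per_block):
    """Equation (2): t(g) = h + C * (B(g) - 1) * h."""
    b = blocks_per_sm(g, num_sms, threads_per_block)
    return h + c * (b - 1) * h


# S, T and C follow the Tesla C1060 figures quoted in this note.
# H = 1.0 is a normalisation (an assumption), not a measured time.
S, T, C = 30, 64, 0.5
H = 1.0

for g in (1, 1920, 1921, 3840, 3841):
    print(g, blocks_per_sm(g, S, T), kernel_time(g, H, C, S, T))
```

With these inputs, $B(g)$ is 1 for up to 1920 grid-points, 2 for up to 3840 and 3 beyond that, so the predicted time is $h$, $1.5h$ and $2h$ respectively: flat between thresholds, with a step each time another block per SM is needed.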
An increase in the number of blocks scheduled to each SM will increase the execution time by some factor (based on the ability to hide memory latency via time-slicing); any further blocks scheduled to an SM after it has been saturated will require additional processing steps, and we model these as occurring serially. $S \times T$ threads are required to fill all SMs once, and the number of threads required to fully saturate all SMs is dependent upon register usage.

The wavefront kernel used in our experiments to date uses 107 registers, limiting the number of concurrent blocks per SM to 2. The Tesla C1060 has 30 SMs, and each thread block contains 64 threads. Our model predicts a small increase in execution time every $30 \times 64 = 1920$ threads and a larger increase every $30 \times 64 \times 2 = 3840$ threads, both of which match the graph. However, our model is less accurate for large numbers of grid-points, which we believe to be the result of unforeseen memory contention issues not yet covered by our parameterisation (e.g. partition camping). For this kernel, we have found 0.5 to be an acceptable value of $C$.

Figure 4 shows the corresponding graph of execution times for a Tesla C2050 card, built on the “Fermi” architecture. It should be noted that the y-axis scale is not the same as in the previous graph; the Fermi card is consistently around 2x to 3x faster than the Tesla. As before, the increases in execution time correspond to increases in the number of blocks scheduled to each SM. The C2050 has 14 SMs, each supporting a maximum of 8 thread blocks, suggesting an increase in execution time every $14 \times 64 \times 8 = 7168$ threads.
However, the remainder of the execution time graph appears to be linear in nature.

The different performance behaviour of Fermi is likely to be due to the architectural improvements made by NVIDIA, including a dual-warp scheduler and a two-tier hardware cache.

Figure 4: Execution times for Tesla C2050. [Plot of execution time in seconds against number of grid-points, from 0 to 20,000; series: Fermi.]

In future work, we intend to extend our existing Tesla model to GPUs based on the Fermi architecture.

References

[1] D. Bailey, T. Harris, W. Saphir, R. Wijngaart, A. Woo, and M. Yarrow. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA, December 1995.

[2] The NVIDIA Compute Unified Device Architecture. http://www.nvidia.com/cuda/, 2010.

[3] G. R. Mudalige, M. K. Vernon, and S. A. Jarvis. A Plug-and-Play Model for Evaluating Wavefront Computations on Parallel Architectures. In IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008). IEEE Computer Society, April 2008.

[4] G. R. Mudalige, S. A. Jarvis, D. P. Spooner, and G. R. Nudd. Predictive Performance Analysis of a Parallel Pipelined Synchronous Wavefront Application for Commodity Processor Cluster Systems. In Proc. IEEE International Conference on Cluster Computing - Cluster2006, Barcelona, September 2006. IEEE Computer Society.

[5] L. Lamport. The Parallel Execution of DO Loops. Commun. ACM, 17(2):83–93, 1974.

[6] S.S. Baghsorkhi, M. Delahaye, S.J. Patel, W.D. Gropp, and W.W. Hwu. An Adaptive Performance Modeling Tool for GPU Architectures. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Computing, pages 105–114. ACM, 2010.

[7] S. Hong and H. Kim. An Analytical Model for a GPU Architecture with Memory-Level and Thread-Level Parallelism Awareness. ACM SIGARCH Computer Architecture News, 37(3):152–163, 2009.


http://wrap.warwick.ac.uk/47466/1/WRAP_Pennycook_uw22-9-2010-2.0.pdf
