The heterogeneous computing paradigm has led to the need for portable and efficient programming solutions that can leverage the capabilities of various hardware devices, such as NVIDIA, Intel, and AMD GPUs. This study evaluates the performance and portability of the SYCL and CUDA languages for a matrix multiplication (MM) application across different GPU architectures. The experimental work showed that, while the CUDA implementation outperforms the SYCL implementation on NVIDIA devices due to optimizations provided by the nvcc compiler, the latter implementation demonstrated remarkable code portability to other GPU architectures, such as AMD and Intel. Furthermore, the architectural efficiency percentages obtained on AMD and Intel GPUs showed consistency with the results observed on NVIDIA devices.

Costanzo, Manuel

García-Sánchez, Carlos

Naiouf, Marcelo

Rucci, Enzo

Servicio de Difusión de la Creación Intelectual

Brief Performance Portability Analysis of a Matrix Multiplication Kernel on Multiple Vendor GPUs

Manuel Costanzo¹, Enzo Rucci¹, Carlos García-Sánchez², and Marcelo Naiouf¹

¹ III-LIDI, Facultad de Informática, UNLP – CIC. La Plata (1900), Bs As, Argentina
{mcostanzo,erucci,mnaiouf}@lidi.info.unlp.edu.ar
² Dpto. Arquitectura de Computadores y Automática, Universidad Complutense de Madrid. Madrid (28040), España
garsanca@dacya.ucm.es

Abstract. The heterogeneous computing paradigm has led to the need for portable and efficient programming solutions that can leverage the capabilities of various hardware devices, such as NVIDIA, Intel, and AMD GPUs. This study evaluates the performance and portability of the SYCL and CUDA languages for a matrix multiplication (MM) application across different GPU architectures. The experimental work showed that, while the CUDA implementation outperforms the SYCL implementation on NVIDIA devices due to optimizations provided by the nvcc compiler, the latter implementation demonstrated remarkable code portability to other GPU architectures, such as AMD and Intel. Furthermore, the architectural efficiency percentages obtained on AMD and Intel GPUs showed consistency with the results observed on NVIDIA devices.

Keywords: oneAPI · SYCL · GPU · CUDA · Performance portability

1 Introduction

In the last decade, the quest to improve the energy efficiency of computing systems has fueled the trend toward heterogeneous computing and massively parallel architectures [1]. Nowadays, GPUs can be considered the dominant accelerator, and Nvidia, Intel, and AMD are the biggest manufacturers. In the 4th quarter of 2022, Nvidia dominated the discrete graphics card market with 82%, while Intel and AMD each held 9%. When integrated and embedded graphics are included, Intel had 71% of the market, Nvidia 17%, and AMD 12% ³.

Focusing on the programming aspect, CUDA is the most popular GPU programming language. However, CUDA codes only run on Nvidia GPUs.
This fact imposes severe limitations on code portability, also affecting maintenance, extension, and development cost. One effort to address this issue is SYCL ⁴, a new open standard from the Khronos Group. SYCL is a royalty-free, cross-platform abstraction layer that allows the programmer to write single-source C++ host code including accelerated code expressed as functors. In this sense, the same SYCL code can run on various hardware platforms, including CPUs, GPUs, and FPGAs. In this way, SYCL seeks to reduce development and maintenance costs and also improve programming productivity.

Corresponding author.
³ https://www.pcgamer.com/intel-is-already-matching-amd-for-gaming-graphics-market-share/
Short Papers of the 11th Conference on Cloud Computing Conference, Big Data & Emerging Topics

In this context, while reaching functional portability is already hard, performance portability becomes a major challenge. In this paper, we evaluate the functional and performance portability of two GPU-accelerated implementations of a matrix multiplication (MM) kernel across Intel, Nvidia, and AMD GPUs using Marowka's method [2].

2 Background

2.1 SYCL and the oneAPI Programming Ecosystem

SYCL is a cross-platform programming model for heterogeneous computing based on the C++ language. It features asynchronous task graphs, hierarchical parallelism, buffers defining location-independent storage, automatic overlapping of kernels and communications, and interoperability with OpenCL, among other characteristics [3]. Recently, Intel announced the oneAPI programming ecosystem, which provides a unified programming model for a wide range of hardware architectures. At the core of the oneAPI environment is the Data Parallel C++ (DPC++) programming language, which can be summarized as C++ with SYCL. Additionally, DPC++ also features some vendor-provided extensions that might be integrated into these standards in the future [4].
Last, oneAPI provides different programming utilities, including a compatibility tool (SYCLomatic) that facilitates the migration to the SYCL-based DPC++ programming language.

2.2 Performance portability

According to Pennycook et al. [5], performance portability refers to “A measurement of an application's performance efficiency for a given problem that can be executed correctly on all platforms in a given set”. These authors define two different performance efficiency metrics: architectural efficiency and application efficiency. The former represents the ability of an application to utilize hardware efficiently and is a fraction of the “peak” theoretical hardware performance, while the latter represents the ability of an application to use the most appropriate implementation and algorithm for each platform and is a fraction of the best-observed performance.

The metric for performance portability presented by Pennycook et al. [5] was later reformulated by Marowka [2] to address some of its flaws. The latter is presented next. Formally, for a given set of platforms H from the same architecture class, the performance portability \bar{\Phi} of a case-study application α solving problem p is:

\[
\bar{\Phi}(\alpha, p, H) =
\begin{cases}
\dfrac{\sum_{i \in H} e_i(\alpha, p)}{|H|} & \text{if } i \text{ is supported } \forall i \in H \\[2ex]
\text{not applicable (NA)} & \text{otherwise}
\end{cases}
\]

where e_i(α, p) is the performance efficiency of case-study application α solving problem p on platform i.

⁴ https://www.khronos.org/registry/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf

3 Experimental Work and Results

3.1 Case-Study Applications: Matrix Multiplication (MM)

Two GPU-accelerated implementations of a matrix multiplication (MM) kernel were considered for the performance portability evaluation:
– CUDA: this version was extracted from the CUDA Demo Suite ⁵.
This app computes an MM using shared memory through a tiled approach and a loop unrolling technique to increase throughput.
– SYCL: this code is based on the implementation provided in [6], which represents a SYCL-equivalent, migrated version of the previous one.

It is important to note what Nvidia states about this code: “It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication.” ⁶

3.2 Experimental Results

The experiments were performed on seven systems equipped with different GPUs. The main features of these systems are described in Table 1. A single workload was configured for MM (nIter = 10; wA, wB, hA, hB = {16384}). Finally, to run SYCL code on Nvidia and AMD GPUs, several modifications had to be made to the build, as support for these GPUs is not enabled by default ⁷. After these modifications, it was possible to run the DPC++ code on Nvidia GPUs, although compiled with clang++ rather than nvcc (nvcc 11.7, clang 16.0).

Table 2 presents the GFLOP/s and architectural efficiencies of both CUDA and SYCL codes on the experimental platforms. On the one hand, it becomes evident that CUDA outperforms SYCL on all Nvidia GPUs. In particular, the CUDA version runs (on average) 1.2× faster than SYCL.
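As a quick check of the figures quoted here, the following sketch recomputes the architectural efficiency (achieved over peak GFLOP/s) and the CUDA-over-SYCL speedup from the Nvidia rows of Table 2. The dictionary layout and the function names are illustrative assumptions and are not part of the evaluated codes.

```python
# GFLOP/s values copied from Table 2: GPU -> (peak, CUDA achieved, SYCL achieved).
# Only Nvidia GPUs are listed, since the CUDA code does not run on the other two.
TABLE2_NVIDIA = {
    "GTX 980": (4980, 552, 430),
    "GTX 1080": (8872, 603, 556),
    "RTX 2070": (7464, 1011, 810),
    "RTX 3070": (20313, 1316, 1084),
    "Tesla V100": (14131, 1582, 1345),
}


def arch_efficiency(achieved, peak):
    """Architectural efficiency as a percentage of the theoretical peak."""
    return 100.0 * achieved / peak


def mean_speedup(table):
    """Arithmetic mean of the per-GPU CUDA/SYCL throughput ratios."""
    ratios = [cuda / sycl for _, cuda, sycl in table.values()]
    return sum(ratios) / len(ratios)


for gpu, (peak, cuda, sycl) in TABLE2_NVIDIA.items():
    print(f"{gpu}: CUDA {arch_efficiency(cuda, peak):.1f}%, "
          f"SYCL {arch_efficiency(sycl, peak):.1f}%, "
          f"speedup {cuda / sycl:.2f}x")
print(f"mean speedup: {mean_speedup(TABLE2_NVIDIA):.2f}x")
```

Because the table reports rounded throughputs, a recomputed efficiency may differ from the published percentage in the last digit.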
This superior performance can be primarily attributed to the fact that nvcc performs a more efficient code translation than clang++ when it comes to shared memory accesses ⁸, causing the SYCL code to use more registers and perform additional computation. On the other hand, architectural efficiencies are low for both code versions (8% on average). This behavior is related to the educational purpose of the original code, as detailed in Section 3.1.

Table 1: Experimental platforms

CPU processor          RAM memory   GPU vendor   GPU model (architecture)   GFLOPS peak (SP)
Intel Xeon E5-2695     16 GB        Nvidia       GTX 980 (Maxwell)          4980
Intel Xeon E5-2695     16 GB        Nvidia       GTX 1080 (Pascal)          8872
Intel Core i5-7400     64 GB        Nvidia       RTX 2070 (Turing)          7464
Intel Core i5-10400F   64 GB        Nvidia       RTX 3070 (Ampere)          20313
Intel Xeon Gold 6138   64 GB        Nvidia       Tesla V100 (Volta)         14131
Intel Core i9-9900K    32 GB        Intel        Arc 770 (Gen 12.5)         19660
Intel Xeon E5-1620     64 GB        AMD          RX 6700 XT (RDNA 2.0)      13214

Table 2: GFLOP/s and architectural efficiencies of both CUDA and SYCL codes on the experimental platforms.

GPU          GFLOP/s peak   CUDA GFLOP/s achieved   CUDA arch. eff.   SYCL GFLOP/s achieved   SYCL arch. eff.
GTX 980      4980           552                     11.1%             430                     8.6%
GTX 1080     8872           603                     6.8%              556                     6.3%
RTX 2070     7464           1011                    13.6%             810                     10.9%
RTX 3070     20313          1316                    6.5%              1084                    5.3%
Tesla V100   14131          1582                    11.2%             1345                    9.5%
Arc 770      19660          ×                       NA                1836                    9.3%
RX 6700 XT   13214          ×                       NA                1553                    11.8%

⁵ https://docs.Nvidia.com/cuda/cuda-samples/
⁶ https://docs.Nvidia.com/cuda/cuda-c-programming-guide/index.html
⁷ https://intel.github.io/llvm-docs/GetStartedGuide.html

Performance portability of both CUDA and SYCL codes is presented in Table 3. First, it becomes evident that the SYCL code provides higher functional portability, successfully running on platforms from different hardware vendors. In contrast, CUDA fails to demonstrate the same level of adaptability, as it is only able to run on Nvidia GPUs.
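The metric of Section 2.2 reduces to an arithmetic mean of the per-platform efficiencies, or NA as soon as one platform in the set is unsupported. The sketch below applies it to the architectural efficiencies reported in Table 2. The data layout, the use of None to stand for NA, and the function name are assumptions made for illustration only.

```python
# Architectural efficiencies (%) copied from Table 2.
# None marks a platform on which the code cannot run.
EFFICIENCY = {
    "CUDA": {"GTX 980": 11.1, "GTX 1080": 6.8, "RTX 2070": 13.6,
             "RTX 3070": 6.5, "Tesla V100": 11.2,
             "Arc 770": None, "RX 6700 XT": None},
    "SYCL": {"GTX 980": 8.6, "GTX 1080": 6.3, "RTX 2070": 10.9,
             "RTX 3070": 5.3, "Tesla V100": 9.5,
             "Arc 770": 9.3, "RX 6700 XT": 11.8},
}
NVIDIA = ["GTX 980", "GTX 1080", "RTX 2070", "RTX 3070", "Tesla V100"]
INTEL = ["Arc 770"]
AMD = ["RX 6700 XT"]


def performance_portability(app, platforms):
    """Mean efficiency over the platform set H; None (NA) if any platform is unsupported."""
    effs = [EFFICIENCY[app][p] for p in platforms]
    if any(e is None for e in effs):
        return None
    return sum(effs) / len(effs)


platform_sets = [
    ("Nvidia", NVIDIA),
    ("Nvidia + AMD", NVIDIA + AMD),
    ("Nvidia + AMD + Intel", NVIDIA + AMD + INTEL),
]
for name, platforms in platform_sets:
    for app in ("CUDA", "SYCL"):
        phi = performance_portability(app, platforms)
        print(name, app, "NA" if phi is None else f"{phi:.1f}%")
```

Any set that includes the Intel or AMD GPU yields NA for CUDA, which is how the metric encodes the lack of functional portability.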
Second, both codes present a similar performance efficiency when executing on the different supported GPUs, demonstrating their performance portability.

⁸ https://support.codeplay.com/t/poor-performance-on-matrix-multiplication/575/2?u=mcostanzo

Table 3: Performance portability \bar{\Phi}(α, p, H) of both CUDA and SYCL codes on the experimental platforms.

Platform set (H)        CUDA    SYCL
Nvidia                  9.8%    8.1%
Intel                   NA      9.3%
AMD                     NA      11.8%
Nvidia ∪ AMD            NA      8.7%
Nvidia ∪ Intel          NA      8.3%
Intel ∪ AMD             NA      10.5%
Nvidia ∪ AMD ∪ Intel    NA      8.8%

3.3 Discussion

While the SYCL code proved to be slower than its CUDA counterpart in this study, it showcased performance portability across a wider range of GPU vendors, highlighting its versatility and potential. However, it is important to note that the observed performance difference between SYCL and CUDA codes does not occur in all cases; [7, 8, 9] show that SYCL codes can achieve the same or even better performance than CUDA versions.

4 Conclusions and Future Work

In this paper, we have evaluated both the performance and portability of the SYCL and CUDA languages for an MM application on Nvidia, Intel, and AMD GPUs. The main findings of this study can be summarized as follows:
– The performance comparison between the SYCL and CUDA implementations on Nvidia devices revealed that the latter outperforms the former due to the optimizations applied by the nvcc compiler.
– We have successfully demonstrated the code portability of the SYCL implementation to other GPU architectures, such as AMD and Intel. Moreover, the architectural efficiency percentages obtained on these GPUs were found to be consistent with those observed on Nvidia devices.

In summary, this brief study highlights the potential of SYCL as a performance-portable alternative to CUDA for the development of high-performance computing applications.
Although the current performance of SYCL on Nvidia GPUs may be lower than that of CUDA, this gap is expected to narrow as SYCL compilers continue to improve.

Future work focuses on exploring the use of SYCL in different application domains. This could provide valuable insights into its performance and portability features in a broader context, enabling a more comprehensive understanding of its strengths and limitations.

References

[1] H. Giefers et al. “Analyzing the energy-efficiency of sparse matrix multiplication on heterogeneous systems: A comparative study of GPU, Xeon Phi and FPGA”. In: 2016 IEEE ISPASS. 2016, pp. 46–56.
[2] Ami Marowka. “Reformulation of the performance portability metric”. In: Software: Practice and Experience 52.1 (2022), pp. 154–171. doi: 10.1002/spe.3002.
[3] Ronan Keryell and Lin-Ya Yu. “Early Experiments Using SYCL Single-Source Modern C++ on Xilinx FPGA”. In: Proceedings of the IWOCL ’18. Oxford, UK: ACM, 2018. doi: 10.1145/3204919.3204937.
[4] S. Christgau and T. Steinke. “Porting a Legacy CUDA Stencil Code to oneAPI”. In: 2020 IEEE IPDPSW. May 2020, pp. 359–367. doi: 10.1109/IPDPSW50202.2020.00070.
[5] S.J. Pennycook, J.D. Sewall, and V.W. Lee. “Implications of a metric for performance portability”. In: Future Generation Computer Systems 92 (2019), pp. 947–958. issn: 0167-739X. doi: 10.1016/j.future.2017.08.007.
[6] Manuel Costanzo et al. “Early Experiences Migrating CUDA codes to oneAPI”. In: IX Jornadas de Cloud Computing, Big Data & Emerging Topics. 2021.
[7] Manuel Costanzo et al. “Migrating CUDA to oneAPI: A Smith-Waterman Case Study”. In: Bioinformatics and Biomedical Engineering. Cham: Springer International Publishing, 2022, pp. 103–116. isbn: 978-3-031-07802-6. doi: 10.1007/978-3-031-07802-6_9.
[8] Zheming Jin and Jeffrey S. Vetter. “Performance Portability Study of Epistasis Detection Using SYCL on NVIDIA GPU”.
In: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. BCB ’22. Northbrook, Illinois: ACM, 2022. isbn: 9781450393867. doi: 10.1145/3535508.3545591.
[9] Goutham Kalikrishna Reddy Kuncham, Rahul Vaidya, and Mahesh Barve. “Performance Study of GPU applications using SYCL and CUDA on Tesla V100 GPU”. In: 2021 IEEE High Performance Extreme Computing Conference (HPEC). 2021, pp. 1–7. doi: 10.1109/HPEC49654.2021.9622813.

Brief performance portability analysis of a matrix multiplication kernel on multiple vendor GPUs

http://sedici.unlp.edu.ar/bitstream/handle/10915/155420/Documento_completo.pdf?sequence=1
