126 research outputs found

    Optimisation of computational fluid dynamics applications on multicore and manycore architectures

    This thesis presents a number of optimisations used for mapping the underlying computational patterns of finite volume CFD applications onto the architectural features of modern multicore and manycore processors. Their effectiveness and impact are demonstrated in a block-structured and an unstructured code, both of a size representative of industrial applications, across a variety of processor architectures that make up contemporary high-performance computing systems. The importance of vectorization and the ways in which it can be achieved are demonstrated in both structured and unstructured solvers, together with the impact that the underlying data layout can have on performance. The utility of auto-tuning for ensuring performance portability across multiple architectures is demonstrated, and auto-tuning is used to select optimal parameters such as prefetch distances for software prefetching or tile sizes for strip mining/loop tiling. On the manycore architectures, running more than one thread per physical core is found to be crucial for good performance on processors with in-order core designs but not required on out-of-order architectures. For architectures with high-bandwidth memory packages, their exploitation, whether explicit or implicit, is shown to be imperative for best performance. The implementation of all of these optimisations led to application speed-ups ranging from 2.7X to 3X on the multicore CPUs and from 5.7X to 24X on the manycore processors.
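The abstract above mentions auto-tuning the tile size used for strip mining/loop tiling. The following is a minimal sketch of that idea, not the thesis's implementation: a matrix transpose stands in for a CFD kernel, and the function names (`transpose_tiled`, `autotune_tile`) and candidate tile sizes are assumptions chosen for illustration. In pure Python the timings do not reflect cache behaviour, so the sketch only shows the structure of an empirical search.

```python
import time


def transpose_naive(a, n):
    """Reference: out-of-place transpose of an n x n row-major matrix (flat list)."""
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            out[j * n + i] = a[i * n + j]
    return out


def transpose_tiled(a, n, tile):
    """Same result, with the i/j loops strip-mined into tile x tile blocks."""
    out = [0.0] * (n * n)
    for ii in range(0, n, tile):            # loop over tile rows
        for jj in range(0, n, tile):        # loop over tile columns
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j * n + i] = a[i * n + j]
    return out


def autotune_tile(a, n, candidates, repeats=3):
    """Time every candidate tile size and return the fastest one."""
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        elapsed = float("inf")
        for _ in range(repeats):            # keep the best of a few runs
            start = time.perf_counter()
            transpose_tiled(a, n, tile)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile


n = 64
matrix = [float(k) for k in range(n * n)]
candidates = [4, 8, 16, 32]                 # hypothetical tuning space
best = autotune_tile(matrix, n, candidates)
```

The selected tile size depends on the machine, which is the point of tuning per architecture rather than hard-coding one value.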

    Offloading strategies for Stencil kernels on the KNC Xeon Phi architecture: Accuracy versus performance

    The ever-increasing computational requirements of HPC and service provider applications are becoming a great challenge for hardware and software designers. These requirements are reaching levels where isolated development in either field is not enough to deal with such a challenge. A holistic view of computational thinking is therefore the only way to succeed in real scenarios. However, this is not a trivial task, as it requires, among other things, hardware-software codesign. On the hardware side, most high-throughput computers are designed for heterogeneity, where accelerators (e.g. Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), etc.) are connected to the host CPUs through a high-bandwidth bus such as PCI-Express. Applications, whether via programmers, compilers, or runtimes, must orchestrate data movement, synchronization, and so on among devices with different compute and memory capabilities. This increases programming complexity and may reduce overall application performance. This article evaluates different offloading strategies to leverage heterogeneous systems based on several cards with first-generation Xeon Phi coprocessors (Knights Corner). We use an 11-point 3-D stencil kernel that models heat dissipation as a case study. Our results reveal substantial performance improvements when using several accelerator cards. 
    Additionally, we show that computing an approximate result by reducing the communication overhead can yield 23% performance gains for double-precision data sets.
    The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is jointly supported by the Fundacion Seneca (Agencia Regional de Ciencia y Tecnologia, Region de Murcia) under grants 15290/PI/2010 and 18946/JLI/13 and by the Spanish MINECO, as well as European Commission FEDER funds, under grants TIN2015-66972-C5-3-R and TIN2016-78799-P (AEI/FEDER, UE). MH was supported by a research grant from the PRODEP under the Professional Development Program for Teachers (UAGro-197), México.
    Hernández, M.; Cebrián, JM.; Cecilia-Canales, JM.; García, JM. (2020). Offloading strategies for Stencil kernels on the KNC Xeon Phi architecture: Accuracy versus performance. International Journal of High Performance Computing Applications. 34(2):199-297. https://doi.org/10.1177/1094342017738352
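The entry above describes a heat-dissipation stencil split across several accelerator cards, with data movement between them. The sketch below is a simplified illustration under stated assumptions, not the article's code: it uses a 7-point stencil as a stand-in (the abstract does not specify the 11-point neighbourhood), two in-process "slabs" in place of real coprocessors, and a hypothetical `exchange_every` parameter to show one possible way that communicating less often yields an approximate result.

```python
def make_grid(nz, ny, nx, hot):
    """Zero-initialised 3-D grid with a single hot cell at index (z, y, x)."""
    grid = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    z, y, x = hot
    grid[z][y][x] = 100.0
    return grid


def step(grid, alpha=0.1):
    """One explicit 7-point diffusion update; outermost cells are held fixed."""
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    new = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                c = grid[z][y][x]
                neighbours = (grid[z - 1][y][x] + grid[z + 1][y][x] +
                              grid[z][y - 1][x] + grid[z][y + 1][x] +
                              grid[z][y][x - 1] + grid[z][y][x + 1])
                new[z][y][x] = c + alpha * (neighbours - 6.0 * c)
    return new


def run_single(grid, steps):
    """Reference run on one undivided domain."""
    for _ in range(steps):
        grid = step(grid)
    return grid


def run_two_slabs(grid, steps, split, exchange_every=1):
    """Split along z into two slabs, each carrying one halo plane.

    Slab 0 owns planes [0, split); slab 1 owns planes [split, nz).
    Halos are refreshed every `exchange_every` steps (assumed knob).
    """
    slab0 = [[row[:] for row in plane] for plane in grid[:split + 1]]
    slab1 = [[row[:] for row in plane] for plane in grid[split - 1:]]
    for t in range(1, steps + 1):
        slab0, slab1 = step(slab0), step(slab1)      # independent "devices"
        if t % exchange_every == 0:                  # halo exchange
            slab0[-1] = [row[:] for row in slab1[1]]
            slab1[0] = [row[:] for row in slab0[-2]]
    return slab0[:-1] + slab1[1:]                    # drop halos, reassemble


def max_abs_diff(a, b):
    """Largest absolute cell-wise difference between two grids."""
    return max(abs(u - v)
               for pa, pb in zip(a, b)
               for ra, rb in zip(pa, pb)
               for u, v in zip(ra, rb))


grid = make_grid(6, 6, 6, hot=(2, 2, 2))
reference = run_single(grid, 4)
exact = run_two_slabs(grid, 4, split=3)                      # exchange each step
approx = run_two_slabs(grid, 4, split=3, exchange_every=2)   # half the exchanges
error = max_abs_diff(approx, reference)
```

Exchanging halos after every step reproduces the single-domain result, while skipping exchanges leaves stale halo values and introduces an error near the slab boundary.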

    Batched Linear Algebra Problems on GPU Accelerators

    The emergence of multicore and heterogeneous architectures requires many linear algebra algorithms to be redesigned to take advantage of accelerators such as GPUs. A particularly challenging class of problems, arising in numerous applications, involves linear algebra operations on many small matrices. These matrices are usually all the same size, up to a few hundred, and there can be thousands or even millions of them. Compared to large matrix problems with more data-parallel computation, which are well suited to GPUs, the challenges of small matrix problems lie in the low computing intensity, the large fraction of sequential operations, and the high PCI-E overhead. These challenges entail redesigning the algorithms instead of merely porting the current LAPACK algorithms. We consider two classes of problems. The first is linear systems with one-sided factorizations (LU, QR, and Cholesky) and their solvers, forward and backward substitution. The second is a two-sided Householder bi-diagonalization. They are challenging to develop and are in high demand in applications. Our main efforts focus on same-sized problems. Variable-sized problems are also considered, though to a lesser extent. Our contributions can be summarized as follows. First, we formulated a batched linear algebra framework to solve many data-parallel, small-sized problems/tasks. Second, we redesigned a set of fundamental linear algebra algorithms for high-performance, batched execution on GPU accelerators. Third, we designed batched BLAS (Basic Linear Algebra Subprograms) and proposed innovative optimization techniques for high-performance computation. Fourth, we illustrated the batched methodology on real-world applications, as in the case of scaling a CFD application up to 4096 nodes on the Titan supercomputer at Oak Ridge National Laboratory (ORNL). Finally, we demonstrated the power, energy and time efficiency of using accelerators as compared to CPUs. 
Our solutions achieved large speedups and high energy efficiency compared to related routines in CUBLAS on NVIDIA GPUs and MKL on Intel Sandy Bridge multicore CPUs. The modern accelerators are all Single-Instruction Multiple-Thread (SIMT) architectures. Our solutions and methods are based on NVIDIA GPUs and can be extended to other accelerators, such as the Intel Xeon Phi and AMD GPUs based on OpenCL.
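To make the batched formulation above concrete, here is a minimal sketch of a batched Cholesky factorization followed by batched forward and backward substitution. It is plain Python for illustration only, not the thesis's GPU code or any library's API; the function names are assumptions. The outer loop over the batch is the independent dimension that a GPU implementation would distribute across thread blocks.

```python
import math


def cholesky_batched(batch):
    """Factor every symmetric positive-definite A in the batch as A = L L^T."""
    factors = []
    for a in batch:                       # each matrix is an independent task
        n = len(a)
        l = [[0.0] * n for _ in range(n)]
        for j in range(n):
            diag = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
            l[j][j] = math.sqrt(diag)
            for i in range(j + 1, n):
                s = sum(l[i][k] * l[j][k] for k in range(j))
                l[i][j] = (a[i][j] - s) / l[j][j]
        factors.append(l)
    return factors


def solve_batched(factors, rhs_batch):
    """Solve L L^T x = b for every (L, b) pair in the batch."""
    solutions = []
    for l, b in zip(factors, rhs_batch):
        n = len(l)
        y = [0.0] * n
        for i in range(n):                # forward substitution: L y = b
            y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
        x = [0.0] * n
        for i in reversed(range(n)):      # backward substitution: L^T x = y
            s = sum(l[k][i] * x[k] for k in range(i + 1, n))
            x[i] = (y[i] - s) / l[i][i]
        solutions.append(x)
    return solutions


# Two small same-sized systems, the common case described in the abstract.
batch = [[[4.0, 2.0], [2.0, 3.0]],
         [[9.0, 3.0], [3.0, 5.0]]]
rhs_batch = [[2.0, 1.0], [3.0, 1.0]]
factors = cholesky_batched(batch)
solutions = solve_batched(factors, rhs_batch)
```

Handling the whole batch in one call is what lets a real implementation amortise launch and transfer overheads over many small problems.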

    A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit

    Autotuning of performance-relevant source-code parameters makes it possible to tune applications automatically without hard-coding optimizations, and thus helps keep performance portable. In this paper, we introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA. Using our Kernel Tuning Toolkit, we show that with autotuning most of the kernels reach near-peak performance on various GPUs and outperform baseline implementations on CPUs and Xeon Phis. Our evaluation also demonstrates that autotuning is key to performance portability. In addition to offline tuning, we also introduce dynamic autotuning of code optimization parameters during application runtime. With dynamic tuning, the Kernel Tuning Toolkit enables applications to re-tune performance-critical kernels at runtime whenever needed, for example, when input data changes. Although it is generally believed that autotuning spaces tend to be too large to be searched during application runtime, we show that this is not necessarily the case when tuning spaces are designed rationally. Many of our kernels reach near-peak performance with moderately sized tuning spaces that can be searched at runtime with acceptable overhead. Finally, we demonstrate how dynamic performance tuning can be integrated into a real-world application from the cryo-electron microscopy domain.
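The idea of searching a moderately sized tuning space, and re-tuning when the input changes, can be sketched as follows. This is a hypothetical illustration and does not use the Kernel Tuning Toolkit API: the names `tuning_space`, `autotune`, and `DynamicTuner`, the parameters `block_size` and `unroll`, and the synthetic cost model are all assumptions. A real tuner would measure kernel run time instead of calling a cost function.

```python
import itertools


def tuning_space(parameters, constraint=lambda cfg: True):
    """Yield every parameter combination that satisfies the constraint."""
    names = list(parameters)
    for values in itertools.product(*(parameters[name] for name in names)):
        cfg = dict(zip(names, values))
        if constraint(cfg):
            yield cfg


def autotune(parameters, measure, constraint=lambda cfg: True):
    """Exhaustively search the space and return (best_config, best_cost)."""
    best_cfg, best_cost = None, float("inf")
    for cfg in tuning_space(parameters, constraint):
        cost = measure(cfg)               # e.g. measured kernel time
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


class DynamicTuner:
    """Re-tunes only when the input signature (e.g. problem size) changes."""

    def __init__(self, parameters, constraint=lambda cfg: True):
        self.parameters = parameters
        self.constraint = constraint
        self.cache = {}                   # signature -> best configuration

    def config_for(self, signature, measure):
        if signature not in self.cache:
            best_cfg, _ = autotune(self.parameters, measure, self.constraint)
            self.cache[signature] = best_cfg
        return self.cache[signature]


def synthetic_cost(problem_size):
    """Stand-in for a timing run: cheapest when block_size * unroll matches."""
    def measure(cfg):
        work_per_block = cfg["block_size"] * cfg["unroll"]
        return abs(work_per_block - problem_size) + cfg["unroll"]
    return measure


parameters = {"block_size": [32, 64, 128, 256], "unroll": [1, 2, 4]}
fits = lambda cfg: cfg["block_size"] * cfg["unroll"] <= 512   # prune the space

tuner = DynamicTuner(parameters, fits)
small_cfg = tuner.config_for(128, synthetic_cost(128))
large_cfg = tuner.config_for(512, synthetic_cost(512))
```

The constraint prunes configurations before they are measured, which is one way a tuning space can be kept small enough to search at run time.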