126 research outputs found
Optimisation of computational fluid dynamics applications on multicore and manycore architectures
This thesis presents a number of optimisations used for mapping the underlying computational patterns of finite volume CFD applications onto the architectural features of modern multicore and manycore processors. Their effectiveness and impact is demonstrated in a block-structured and an unstructured code of representative size to industrial applications and across a variety of processor architectures that make up contemporary high-performance computing systems.
The importance of vectorization and the ways through which this can be achieved is demonstrated in both structured and unstructured solvers together with the impact that the underlying data layout can have on performance. The utility of auto-tuning for ensuring performance portability across multiple architectures is demonstrated and used for selecting optimal parameters such as prefetch distances for software prefetching or tile sizes for strip mining/loop tiling. On the manycore architectures, running more than one thread per physical core is found to be crucial for good performance on processors with in-order core designs but not required on out-of-order architectures. For architectures with high-bandwidth memory packages, their exploitation, whether explicitly or implicitly, is shown to be imperative for best performance.
The implementation of all of these optimisations led to application speed-ups ranging between 2.7X and 3X on the multicore CPUs and 5.7X to 24X on the manycore processors.Open Acces
Offloading strategies for Stencil kernels on the KNC Xeon Phi architecture: Accuracy versus performance
[EN] The ever-increasing computational requirements of HPC and service provider applications are becoming a great challenge
for hardware and software designers. These requirements are reaching levels where the isolated development on either
computational field is not enough to deal with such challenge. A holistic view of the computational thinking is therefore the
only way to success in real scenarios. However, this is not a trivial task as it requires, among others, of hardwareÂżsoftware
codesign. In the hardware side, most high-throughput computers are designed aiming for heterogeneity, where accelerators (e.g. Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), etc.) are connected through
high-bandwidth bus, such as PCI-Express, to the host CPUs. Applications, either via programmers, compilers, or runtime,
should orchestrate data movement, synchronization, and so on among devices with different compute and memory
capabilities. This increases the programming complexity and it may reduce the overall application performance. This
article evaluates different offloading strategies to leverage heterogeneous systems, based on several cards with the firstgeneration Xeon Phi coprocessors (Knights Corner). We use a 11-point 3-D Stencil kernel that models heat dissipation as
a case study. Our results reveal substantial performance improvements when using several accelerator cards. Additionally,
we show that computing of an approximate result by reducing the communication overhead can yield 23% performance
gains for double-precision data sets.The author(s) disclosed receipt of the following financial
support for the research, authorship, and/or publication of
this article: This work is jointly supported by the Fundacion
Seneca (Agencia Regional de Ciencia y Tecnologia,
Region de Murcia) under grants 15290/PI/2010 and
18946/JLI/13 and by the Spanish MINECO, as well as
European Commission FEDER funds, under grants
TIN2015-66972-C5-3-R and TIN2016-78799-P (AEI/
FEDER, UE). MH was supported by a research grant from the PRODEP under the Professional Development Program for Teachers (UAGro-197) MĂ©xicoHernández, M.; Cebrián, JM.; Cecilia-Canales, JM.; GarcĂa, JM. (2020). Offloading strategies for Stencil kernels on the KNC Xeon Phi architecture: Accuracy versus performance. International Journal of High Performance Computing Applications. 34(2):199-297. https://doi.org/10.1177/1094342017738352S199297342Michael Brown, W., Carrillo, J.-M. Y., Gavhane, N., Thakkar, F. M., & Plimpton, S. J. (2015). Optimizing legacy molecular dynamics software with directive-based offload. Computer Physics Communications, 195, 95-101. doi:10.1016/j.cpc.2015.05.004Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., & Burger, D. (2012). Power Limitations and Dark Silicon Challenge the Future of Multicore. ACM Transactions on Computer Systems, 30(3), 1-27. doi:10.1145/2324876.2324879Feng, L. (2015). Data Transfer Using the Intel COI Library. High Performance Parallelism Pearls, 341-348. doi:10.1016/b978-0-12-802118-7.00020-0Jeffers, J., & Reinders, J. (2013). Offload. Intel Xeon Phi Coprocessor High Performance Programming, 189-241. doi:10.1016/b978-0-12-410414-3.00007-4Rahman, R. (2013). Intel® Xeon Phi™ Coprocessor Architecture and Tools. doi:10.1007/978-1-4302-5927-5Reinders J, Jeffers J (2014) High Performance Parallelism Pearls, Multicore and Many-core Programming Approaches (Characterization and Auto-tuning of 3DFD). Morgan Kaufmann, pp. 377–396.Shareef, B., de Doncker, E., & Kapenga, J. (2015). Monte Carlo simulations on Intel Xeon Phi: Offload and native mode. 2015 IEEE High Performance Extreme Computing Conference (HPEC). doi:10.1109/hpec.2015.7322456UjaldĂłn, M. (2016). CUDA Achievements and GPU Challenges Ahead. Lecture Notes in Computer Science, 207-217. doi:10.1007/978-3-319-41778-3_20Wang, E., Zhang, Q., Shen, B., Zhang, G., Lu, X., Wu, Q., & Wang, Y. (2014). High-Performance Computing on the Intel® Xeon Phi™. doi:10.1007/978-3-319-06486-4Wende, F., Klemm, M., Steinke, T., & Reinefeld, A. (2015). Concurrent Kernel Offloading. High Performance Parallelism Pearls, 201-223. doi:10.1016/b978-0-12-802118-7.00012-
Batched Linear Algebra Problems on GPU Accelerators
The emergence of multicore and heterogeneous architectures requires many linear algebra algorithms to be redesigned to take advantage of the accelerators, such as GPUs. A particularly challenging class of problems, arising in numerous applications, involves the use of linear algebra operations on many small-sized matrices. The size of these matrices is usually the same, up to a few hundred. The number of them can be thousands, even millions.
Compared to large matrix problems with more data parallel computation that are well suited on GPUs, the challenges of small matrix problems lie in the low computing intensity, the large sequential operation fractions, and the big PCI-E overhead. These challenges entail redesigning the algorithms instead of merely porting the current LAPACK algorithms.
We consider two classes of problems. The first is linear systems with one-sided factorizations (LU, QR, and Cholesky) and their solver, forward and backward substitution. The second is a two-sided Householder bi-diagonalization. They are challenging to develop and are highly demanded in applications. Our main efforts focus on the same-sized problems. Variable-sized problems are also considered, though to a lesser extent.
Our contributions can be summarized as follows. First, we formulated a batched linear algebra framework to solve many data-parallel, small-sized problems/tasks. Second, we redesigned a set of fundamental linear algebra algorithms for high- performance, batched execution on GPU accelerators. Third, we designed batched BLAS (Basic Linear Algebra Subprograms) and proposed innovative optimization techniques for high-performance computation. Fourth, we illustrated the batched methodology on real-world applications as in the case of scaling a CFD application up to 4096 nodes on the Titan supercomputer at Oak Ridge National Laboratory (ORNL). Finally, we demonstrated the power, energy and time efficiency of using accelerators as compared to CPUs. Our solutions achieved large speedups and high energy efficiency compared to related routines in CUBLAS on NVIDIA GPUs and MKL on Intel Sandy-Bridge multicore CPUs.
The modern accelerators are all Single-Instruction Multiple-Thread (SIMT) architectures. Our solutions and methods are based on NVIDIA GPUs and can be extended to other accelerators, such as the Intel Xeon Phi and AMD GPUs based on OpenCL
A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit
Autotuning of performance-relevant source-code parameters allows to
automatically tune applications without hard coding optimizations and thus
helps with keeping the performance portable. In this paper, we introduce a
benchmark set of ten autotunable kernels for important computational problems
implemented in OpenCL or CUDA. Using our Kernel Tuning Toolkit, we show that
with autotuning most of the kernels reach near-peak performance on various GPUs
and outperform baseline implementations on CPUs and Xeon Phis. Our evaluation
also demonstrates that autotuning is key to performance portability. In
addition to offline tuning, we also introduce dynamic autotuning of code
optimization parameters during application runtime. With dynamic tuning, the
Kernel Tuning Toolkit enables applications to re-tune performance-critical
kernels at runtime whenever needed, for example, when input data changes.
Although it is generally believed that autotuning spaces tend to be too large
to be searched during application runtime, we show that it is not necessarily
the case when tuning spaces are designed rationally. Many of our kernels reach
near peak-performance with moderately sized tuning spaces that can be searched
at runtime with acceptable overhead. Finally we demonstrate, how dynamic
performance tuning can be integrated into a real-world application from
cryo-electron microscopy domain
- …