Doctor of Philosophy

Author: King James Sokhom
Publication venue: University of Utah
Publication date: 01/01/2017

dissertation

Memory access irregularities are a major bottleneck for bandwidth-limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lockstep may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile-time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have a significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing. It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices. Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented.
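To make the idea of in-place updates through explicit memory segmentation more concrete, here is a minimal Python sketch of a CSR-like layout in which every row owns a fixed-capacity segment with slack, so a new nonzero can be written without rebuilding or sorting the arrays. It is an illustration under simplifying assumptions and not the DCSR format of the dissertation: the class name `SlackCSR`, the fixed `row_capacity`, and the overflow error are all hypothetical.

```python
class SlackCSR:
    """CSR-like matrix where each row owns a fixed-capacity segment (hypothetical layout)."""

    def __init__(self, n_rows, row_capacity):
        self.n_rows = n_rows
        self.row_capacity = row_capacity
        # Row i owns slots [i * row_capacity, (i + 1) * row_capacity).
        self.cols = [-1] * (n_rows * row_capacity)
        self.vals = [0.0] * (n_rows * row_capacity)
        self.row_len = [0] * n_rows  # number of used slots in each row

    def add(self, i, j, v):
        """Accumulate v into entry (i, j) in place; entries in a row stay unsorted."""
        start = i * self.row_capacity
        for k in range(start, start + self.row_len[i]):
            if self.cols[k] == j:
                self.vals[k] += v
                return
        if self.row_len[i] == self.row_capacity:
            # Assumption: a real dynamic format would chain a new segment here.
            raise OverflowError("row segment is full")
        k = start + self.row_len[i]
        self.cols[k] = j
        self.vals[k] = v
        self.row_len[i] += 1

    def matvec(self, x):
        """Return y = A x, visiting only the used slots of each row segment."""
        y = [0.0] * self.n_rows
        for i in range(self.n_rows):
            start = i * self.row_capacity
            for k in range(start, start + self.row_len[i]):
                y[i] += self.vals[k] * x[self.cols[k]]
        return y
```

A streaming update is then just a sequence of `add` calls; the trade-off in this sketch is wasted slack space in exchange for never moving existing entries.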

The University of Utah: J. Willard Marriott Digital Library

Term Rewriting on GPUs

Author: Groote Jan Friso, Hijma Pieter, Martens Jan, van Eerd Johri, Wijs Anton
Publication date: 15/09/2020

We present a way to implement term rewriting on a GPU. We do this by letting the GPU repeatedly perform a massively parallel evaluation of all subterms. We find that if the term rewrite systems exhibit sufficient internal parallelism, GPU rewriting substantially outperforms the CPU. Since we expect that our implementation can be further optimized, and because in any case GPUs will become much more powerful in the future, this suggests that GPUs are an interesting platform for term rewriting. As term rewriting can be viewed as a universal programming language, this also opens a route towards programming GPUs by term rewriting, especially for irregular computations.
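The scheme of repeatedly attempting a rewrite at every subterm can be sketched in a few lines of Python. This is a sequential stand-in, not the authors' GPU implementation: the recursion visits subterms one after another where a GPU would evaluate them in parallel, and the tuple encoding and the two Peano addition rules are assumptions chosen only for illustration.

```python
# Terms are nested tuples: ("0",), ("s", t), ("add", t1, t2).  (hypothetical encoding)
ZERO = ("0",)

def s(x):
    return ("s", x)

def add(x, y):
    return ("add", x, y)

def rewrite_root(t):
    """Apply one rule at the root if it matches; otherwise return t unchanged."""
    if t[0] == "add":
        x, y = t[1], t[2]
        if x == ZERO:        # add(0, y) -> y
            return y
        if x[0] == "s":      # add(s(x'), y) -> s(add(x', y))
            return s(add(x[1], y))
    return t

def rewrite_pass(t):
    """One sweep: attempt a rewrite at every subterm position of t."""
    children = tuple(rewrite_pass(c) for c in t[1:])
    return rewrite_root((t[0],) + children)

def normalize(t):
    """Repeat sweeps until a sweep changes nothing (a normal form is reached)."""
    while True:
        nxt = rewrite_pass(t)
        if nxt == t:
            return t
        t = nxt

def to_int(t):
    """Count the successor symbols of a numeral s(s(...0...))."""
    n = 0
    while t != ZERO:
        assert t[0] == "s"
        n += 1
        t = t[1]
    return n
```

For example, `to_int(normalize(add(s(s(ZERO)), s(ZERO))))` evaluates 2 + 1 in this encoding. The more independent redexes a term contains, the more work each sweep can do at once, which is the internal parallelism the abstract refers to.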

arXiv.org e-Print Archive

VU Research Portal

A hierarchical parallel implementation model for algebra-based CFD simulations on hybrid supercomputers

Author: Álvarez Farré Xavier
Publication venue: Universitat Politècnica de Catalunya
Publication date: 05/09/2022

Continuous enhancement in hardware technologies enables scientific computing to advance steadily and pursue more ambitious goals. Since the start of the global race for exascale high-performance computing (HPC), massively parallel devices of various architectures have been incorporated into the newest supercomputers, leading to an increasing hybridization of HPC systems. In this context of accelerated innovation, software portability and efficiency become crucial. Traditionally, scientific computing software development is based on calculations in iterative stencil loops (ISL) over a discretized geometry, the mesh. Despite being intuitive and versatile, the interdependency between algorithms and their computational implementations in stencil applications usually results in a large number of subroutines and introduces an inevitable complexity when it comes to portability and sustainability. An alternative is to break the interdependency between algorithm and implementation to cast the calculations into a minimalist set of kernels. The portable implementation model that is the object of this thesis is not restricted to a particular numerical method or problem. However, owing to the CTTC's long tradition in computational fluid dynamics (CFD) and without loss of generality, this work is targeted to solve transient CFD simulations. By casting discrete operators and mesh functions into (sparse) matrices and vectors, it is shown that all the calculations in a typical CFD algorithm boil down to the following basic linear algebra subroutines: the sparse matrix-vector product, the linear combination of vectors, and the dot product. The proposed formulation eases the deployment of scientific computing software in massively parallel hybrid computing systems and is demonstrated in the large-scale, direct numerical simulation of transient turbulent flows.

Thermal engineering
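The three kernels named in the abstract (sparse matrix-vector product, linear combination of vectors, dot product) can be sketched in plain Python as follows. This is a minimal illustration, not the thesis code: the CSR storage, the function names, and the explicit Euler step built on top of the kernels are assumptions made for the example.

```python
def spmv(row_ptr, col_idx, vals, x):
    """Sparse matrix-vector product y = A x, with A stored in CSR."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

def axpy(a, x, y):
    """Linear combination of vectors: a * x + y."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def dot(x, y):
    """Dot product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

def euler_step(row_ptr, col_idx, vals, u, dt):
    """Hypothetical time step u + dt * (L u), written only in terms of the kernels."""
    return axpy(dt, spmv(row_ptr, col_idx, vals, u), u)

# Small 3x3 discrete Laplacian [[-2, 1, 0], [1, -2, 1], [0, 1, -2]] in CSR form.
ROW_PTR = [0, 2, 5, 7]
COL_IDX = [0, 1, 0, 1, 2, 1, 2]
VALS = [-2.0, 1.0, 1.0, -2.0, 1.0, 1.0, -2.0]
```

Calling `euler_step(ROW_PTR, COL_IDX, VALS, [0.0, 1.0, 0.0], 0.25)` advances a single spike by one diffusion step, and `dot(u, u)` gives the squared norm typically needed for residual checks. Because the time integrator only calls the three kernels, porting it to another device amounts to porting those kernels.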

Tesis Doctorals en Xarxa

UPCommons. Portal del coneixement obert de la UPC

Efficient execution of irregular programs on heterogeneous systems

Author: Kaleem Rashid
Publication date: 02/10/2017

Programmable accelerators such as GPUs, FPGAs, and DSPs enable modern systems to provide higher performance for many workloads than is possible by using conventional processors alone. Traditionally, portability of applications to these accelerators and between accelerators was a major hurdle in utilizing accelerators in a heterogeneous system. With the emergence of standardized programming APIs such as OpenCL, this problem is being ameliorated and many accelerators can now be programmed using a single API. In this work, we address the efficient execution of irregular programs on heterogeneous systems. Irregular programs are used extensively in problem domains like graph analytics and finite-element methods, and they are characterized by data-dependent control flow and memory accesses that cannot be predicted at compile time. We focus on heterogeneous systems that provide a coherent memory to all devices. First, we describe a set of compiler and runtime techniques to support efficient execution of irregular programs on heterogeneous systems composed of a CPU and an integrated GPU. The compiler allows applications written in C++ to be executed on the GPU without any programmer effort. The runtime system solves the load imbalance arising from irregularity in the applications by dynamically assigning work to each device. Next, we present an alternative implementation strategy for irregular applications on a system with more heterogeneity. Specifically, graph applications can be expressed as producer-consumer computations on FPGA+CPU heterogeneous systems. This approach allows for better utilization of the capabilities of each device and suggests a programming model for accelerators that goes beyond the offload model. Finally, we explore efficient execution of irregular applications on accelerators that do not share a coherent memory with the master processor. For discrete GPUs, we explore implementation strategies for graph applications, focusing on synchronization tradeoffs, and present optimizations that address the synchronization overheads both within and across devices.

Computer Science
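The idea of balancing irregular work by assigning it to devices dynamically can be illustrated with a shared chunk queue. The sketch below is a generic work-queue pattern in Python, not the runtime system described in the dissertation: ordinary threads stand in for the CPU and GPU, and the function name `run_dynamic`, the device names, and the chunk size are hypothetical.

```python
import queue
import threading

def run_dynamic(items, work_fn, device_names=("cpu", "gpu"), chunk_size=4):
    """Process items with one worker per 'device'; each worker pulls the next
    chunk as soon as it is free, so faster workers naturally take more chunks."""
    chunks = queue.Queue()
    for start in range(0, len(items), chunk_size):
        chunks.put((start, items[start:start + chunk_size]))

    results = [None] * len(items)
    processed_by = {name: 0 for name in device_names}

    def worker(name):
        while True:
            try:
                start, chunk = chunks.get_nowait()
            except queue.Empty:
                return  # no work left
            for offset, item in enumerate(chunk):
                # Each index is written by exactly one worker.
                results[start + offset] = work_fn(item)
            processed_by[name] += len(chunk)  # only this worker updates its own counter

    threads = [threading.Thread(target=worker, args=(name,)) for name in device_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, processed_by
```

The split recorded in `processed_by` is not fixed in advance and can differ from run to run, which is the point: no static partition has to guess how expensive each item will be.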

Texas ScholarWorks