878 research outputs found
Doctor of Philosophy
Dissertation
Memory access irregularities are a major bottleneck for bandwidth-limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lockstep may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile-time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts.
Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods.
Vectorization widths can have significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing.
It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices. Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented.
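The idea of in-place updates through controlled memory segmentation can be illustrated with a toy layout. The sketch below is not the DCSR format from the dissertation, whose layout the abstract does not describe. It assumes a hypothetical simplification in which every row owns one fixed-size segment with spare slots, so a new nonzero is appended without moving other rows or re-sorting. The class name `SlackCSR` and the parameter `row_capacity` are invented for illustration.

```python
class SlackCSR:
    """Toy CSR-like matrix with spare capacity per row (illustrative only)."""

    def __init__(self, n_rows, row_capacity):
        self.n_rows = n_rows
        self.cap = row_capacity
        # Assumption: row i owns the fixed segment [i*cap, (i+1)*cap).
        self.cols = [-1] * (n_rows * row_capacity)
        self.vals = [0.0] * (n_rows * row_capacity)
        self.row_len = [0] * n_rows  # used slots in each row segment

    def add(self, i, j, v):
        """Accumulate v into entry (i, j) without touching other rows."""
        start = i * self.cap
        for k in range(start, start + self.row_len[i]):
            if self.cols[k] == j:  # entry already exists: update in place
                self.vals[k] += v
                return
        if self.row_len[i] == self.cap:
            # A real dynamic format would allocate a further segment here.
            raise MemoryError("row segment full")
        k = start + self.row_len[i]
        self.cols[k], self.vals[k] = j, v
        self.row_len[i] += 1

    def matvec(self, x):
        """Sparse matrix-vector product over the used slots only."""
        y = [0.0] * self.n_rows
        for i in range(self.n_rows):
            start = i * self.cap
            for k in range(start, start + self.row_len[i]):
                y[i] += self.vals[k] * x[self.cols[k]]
        return y
```

Entries within a row stay unsorted in this sketch, which is what lets a streaming update avoid a rebuild. The cost is the linear scan of the row on each insertion.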
Term Rewriting on GPUs
We present a way to implement term rewriting on a GPU. We do this by letting
the GPU repeatedly perform a massively parallel evaluation of all subterms. We
find that if the term rewrite systems exhibit sufficient internal parallelism,
GPU rewriting substantially outperforms the CPU. Since we expect that our
implementation can be further optimized, and because in any case GPUs will
become much more powerful in the future, this suggests that GPUs are an
interesting platform for term rewriting. As term rewriting can be viewed as a
universal programming language, this also opens a route towards programming
GPUs by term rewriting, especially for irregular computations.
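The repeated evaluation of all subterms described above can be sketched sequentially. The example below is only an illustration and not the authors' implementation. It assumes a hypothetical two-rule system for Peano addition, with terms encoded as nested tuples. Each call to `sweep` gives every subterm position one rewrite attempt. On a GPU these attempts would run in parallel, while here they run one after another.

```python
ZERO = ("0",)

def s(t):
    """Successor term s(t)."""
    return ("s", t)

def add(a, b):
    """Unevaluated addition term add(a, b)."""
    return ("add", a, b)

def rewrite_root(t):
    """Apply one rule at the root if it matches, else return t unchanged.

    Assumed rules: add(0, y) -> y and add(s(x), y) -> s(add(x, y)).
    """
    if t[0] == "add":
        a, b = t[1], t[2]
        if a == ZERO:
            return b
        if a[0] == "s":
            return s(add(a[1], b))
    return t

def sweep(t):
    """One pass that gives each subterm position a single rewrite attempt."""
    children = tuple(sweep(c) for c in t[1:])
    return rewrite_root((t[0],) + children)

def normalize(t):
    """Repeat sweeps until no rule applies anywhere in the term."""
    while True:
        nxt = sweep(t)
        if nxt == t:
            return t
        t = nxt
```

With `two = s(s(ZERO))`, `normalize(add(two, two))` reaches the term for four after three sweeps that change the term, followed by one that confirms the fixed point.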
A hierarchical parallel implementation model for algebra-based CFD simulations on hybrid supercomputers
(English) Continuous enhancement in hardware technologies enables scientific computing to advance incessantly and reach further aims. Since the start of the global race for exascale high-performance computing (HPC), massively-parallel devices of various architectures have been incorporated into the newest supercomputers, leading to an increasing hybridization of HPC systems. In this context of accelerated innovation, software portability and efficiency become crucial.
Traditionally, scientific computing software development is based on calculations in iterative stencil loops (ISL) over a discretized geometry—the mesh. Despite being intuitive and versatile, the interdependency between algorithms and their computational implementations in stencil applications usually results in a large number of subroutines and introduces an inevitable complexity when it comes to portability and sustainability. An alternative is to break the interdependency between algorithm and implementation to cast the calculations into a minimalist set of kernels.
The portable implementation model that is the object of this thesis is not restricted to a particular numerical method or problem. However, owing to the CTTC's long tradition in computational fluid dynamics (CFD) and without loss of generality, this work is targeted to solve transient CFD simulations. By casting discrete operators and mesh functions into (sparse) matrices and vectors, it is shown that all the calculations in a typical CFD algorithm boil down to the following basic linear algebra subroutines: the sparse matrix-vector product, the linear combination of vectors, and the dot product.
The proposed formulation eases the deployment of scientific computing software in massively parallel hybrid computing systems and is demonstrated in the large-scale direct numerical simulation of transient turbulent flows.
Enginyeria tèrmic
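The reduction to three kernels named in the abstract (sparse matrix-vector product, linear combination of vectors, dot product) can be sketched in plain Python. This is a minimal illustration and not the thesis code. The CSR storage, the function names, and the small 1D diffusion operator used in `euler_step` are assumptions chosen for the example.

```python
def spmv(row_ptr, col_idx, vals, x):
    """Sparse matrix-vector product y = A x with A stored in CSR."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y

def axpy(a, x, y):
    """Linear combination of vectors: a * x + y."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def dot(x, y):
    """Dot product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

# Assumed example operator: 3-point 1D Laplacian on 3 nodes, in CSR form.
ROW_PTR = [0, 2, 5, 7]
COL_IDX = [0, 1, 0, 1, 2, 1, 2]
VALS = [-2.0, 1.0, 1.0, -2.0, 1.0, 1.0, -2.0]

def euler_step(u, dt):
    """One explicit Euler step u + dt * (L u), built only from the kernels."""
    lu = spmv(ROW_PTR, COL_IDX, VALS, u)
    return axpy(dt, lu, u)
```

For `u = [0.0, 1.0, 0.0]` and `dt = 0.25`, `euler_step` spreads the peak to its neighbours, and `dot(u_new, u_new)` gives the squared norm a solver would typically monitor.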
A hierarchical parallel implementation model for algebra-based CFD simulations on hybrid supercomputers
Postprint (published version)
Efficient execution of irregular programs on heterogeneous systems
Programmable accelerators such as GPUs, FPGAs, and DSPs enable
modern systems to provide higher performance for many workloads than is
possible by using conventional processors alone.
Traditionally, portability of applications to these accelerators and
between accelerators was a major hurdle in utilizing accelerators
in a heterogeneous system. With the emergence of standardized programming
APIs such as OpenCL, this problem is being ameliorated and
many accelerators can now be programmed using a single API.
In this work, we address the efficient execution of irregular
programs on heterogeneous systems. Irregular programs are used extensively
in problem domains like graph analytics and finite-element methods, and
they are characterized by data-dependent control flow and memory accesses
that cannot be predicted at compile time. We focus on heterogeneous systems
that provide a coherent memory to all devices.
First, we describe a set of compiler and runtime techniques to
support efficient execution of irregular programs on heterogeneous
systems composed of a CPU and an integrated GPU. The compiler allows
applications written in C++ to be executed on the GPU without any
programmer effort. The runtime system solves the load imbalance arising
from irregularity in the applications by dynamically assigning work to
each device.
Next, we present an alternative implementation strategy for
irregular applications on a system with more heterogeneity. Specifically,
graph applications can be expressed as producer-consumer
computations on FPGA+CPU heterogeneous systems. This approach allows for
better utilization of the capabilities of each device and suggests a
programming model for accelerators that goes beyond the offload
model.
Finally, we explore efficient execution of irregular
applications on accelerators that do not share a
coherent memory with the master processor. For discrete GPUs, we explore
implementation strategies for graph applications, focusing on
synchronization tradeoffs, and present optimizations that address the
synchronization overheads both within and across devices.
Computer Science
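The dynamic assignment of work to devices mentioned in the abstract can be sketched with a shared queue of chunks. This is a hedged illustration and not the runtime system of the thesis. Two Python threads stand in for the CPU and the integrated GPU, and the function name `run_dynamic`, the device names, and the chunk size are all assumptions.

```python
import queue
import threading

def run_dynamic(items, work_fn, device_names=("cpu", "gpu"), chunk=4):
    """Process items in chunks that idle workers pull from a shared queue."""
    chunks = queue.Queue()
    for start in range(0, len(items), chunk):
        chunks.put((start, items[start:start + chunk]))

    results = [None] * len(items)
    processed = {name: 0 for name in device_names}

    def worker(name):
        # Each worker keeps taking chunks until none are left, so a faster
        # "device" ends up with more chunks than a slower one.
        while True:
            try:
                start, block = chunks.get_nowait()
            except queue.Empty:
                return
            for off, item in enumerate(block):
                results[start + off] = work_fn(item)
            processed[name] += len(block)  # only this thread writes this key

    threads = [threading.Thread(target=worker, args=(n,)) for n in device_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, processed
```

The split between workers depends on timing and is not fixed in advance, which is the point of the scheme. Only the results and the total item count are deterministic.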