97 research outputs found
Lightweight Massively Parallel Suffix Array Construction
The suffix array is an array of sorted suffixes in lexicographic order, where each sorted suffix is represented by its starting position in the input string. It is a fundamental data structure that finds various applications in areas such as string processing, text indexing, data compression, computational biology, and many more. Over the last three decades, researchers have proposed a broad spectrum of suffix array construction algorithms (SACAs). However, the majority of SACAs were implemented using sequential and parallel programming models. The maturity of GPU programming opened doors to the development of massively parallel GPU SACAs that outperform the fastest versions of suffix sorting algorithms optimized for the CPU parallel computing. Over the last five years, several GPU SACA approaches were proposed and implemented. They prioritized the running time over lightweight design.
In this thesis, we design and implement a lightweight massively parallel SACA on the GPU using the prefix-doubling technique. Our prefix-doubling implementation is memory-efficient and can successfully construct the suffix array for input strings as large as 640 megabytes (MB) on Tesla P100 GPU. On large datasets, our implementation achieves a speedup of 7-16x over the fastest, highly optimized, OpenMP-accelerated suffix array constructor, libdivsufsort, that leverages the CPU shared memory parallelism. The performance of our algorithm relies on several high-performance parallel primitives such as radix sort, conditional filtering, inclusive prefix sum, random memory scattering, and segmented sort. We evaluate the performance of our implementation over a variety of real-world datasets with respect to its runtime, throughput, memory usage, and scalability. We compare our results against libdivsufsort that we run on a Haswell compute node equipped with 24 cores. Our GPU SACA is simple and compact, consisting of less than 300 lines of readable and effective source code. Additionally, we design and implement a fast and lightweight algorithm for checking the correctness of the suffix array
OPTIMIZING LEMPEL-ZIV FACTORIZATION FOR THE GPU ARCHITECTURE
Lossless data compression is used to reduce storage requirements, allowing for the relief of I/O channels and better utilization of bandwidth. The Lempel-Ziv lossless compression algorithms form the basis for many of the most commonly used compression schemes. General purpose computing on graphic processing units (GPGPUs) allows us to take advantage of the massively parallel nature of GPUs for computations other that their original purpose of rendering graphics. Our work targets the use of GPUs for general lossless data compression. Specifically, we developed and ported an algorithm that constructs the Lempel-Ziv factorization directly on the GPU. Our implementation bypasses the sequential nature of the LZ factorization and attempts to compute the factorization in parallel. By breaking down the LZ factorization into what we call the PLZ, we are able to outperform the fastest serial CPU implementations by up to 24x and perform comparatively to a parallel multicore CPU implementation. To achieve these speeds, our implementation outputted LZ factorizations that were on average only 0.01 percent greater than the optimal solution that what could be computed sequentially.
We are also able to reevaluate the fastest GPU suffix array construction algorithm, which is needed to compute the LZ factorization. We are able to find speedups of up to 5x over the fastest CPU implementations
Exploiting BSP Abstractions for Compiler Based Optimizations of GPU Applications on multi-GPU Systems
Graphics Processing Units (GPUs) are accelerators for computers and provide massive amounts of computational power and bandwidth for amenable applications. While effectively utilizing an individual GPU already requires a high level of skill, effectively utilizing multiple GPUs introduces completely new types of challenges. This work sets out to investigate how the hierarchical execution model of GPUs can be exploited to simplify the utilization of such multi-GPU systems.
The investigation starts with an analysis of the memory access patterns exhibited by applications from common GPU benchmark suites. Memory access patterns are collected using custom instrumentation and a simple simulation then analyzes the patterns and identifies implicit communication across the different levels of the execution hierarchy. The analysis reveals that for most GPU applications memory accesses are highly localized and there exists a way to partition the workload so that the communication volume grows slower than the aggregated bandwidth for growing numbers of GPUs.
Next, an application model based on Z-polyhedra is derived that formalizes the distribution of work across multiple GPUs and allows the identification of data dependencies. The model is then used to implement a prototype compiler that consumes single-GPU programs and produces executables that distribute GPU workloads across all available GPUs in a system. It uses static analysis to identify memory access patterns and polyhedral code generation in combination with a dynamic tracking system to efficiently resolve data dependencies. The prototype is implemented as an extension to the LLVM/Clang compiler and published in full source.
The prototype compiler is then evaluated using a set of benchmark applications. While the prototype is limited in its applicability by technical issues, it provides impressive speedups of up to 12.4x on 16 GPUs for amenable applications. An in-depth analysis of the application runtime reveals that dependency resolution takes up less than 10% of the runtime, often significantly less.
A discussion follows and puts the work into context by presenting and differentiating related work, reflecting critically on the work itself and an outlook of the aspects that could be explored as part of this research. The work concludes with a summary and a closing opinion
A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units
We present a hierarchically blocked one-sided Jacobi algorithm for the
singular value decomposition (SVD), targeting both single and multiple graphics
processing units (GPUs). The blocking structure reflects the levels of GPU's
memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining
high relative accuracy. To this end, we developed a family of parallel pivot
strategies on GPU's shared address space, but applicable also to inter-GPU
communication. Unlike common hybrid approaches, our algorithm in a single GPU
setting needs a CPU for the controlling purposes only, while utilizing GPU's
resources to the fullest extent permitted by the hardware. When required by the
problem size, the algorithm, in principle, scales to an arbitrary number of GPU
nodes. The scalability is demonstrated by more than twofold speedup for
sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single
Fermi card.Comment: Accepted for publication in SIAM Journal on Scientific Computin
Algorithm Engineering for fundamental Sorting and Graph Problems
Fundamental Algorithms build a basis knowledge for every computer science undergraduate or a professional programmer. It is a set of basic techniques one can find in any (good) coursebook on algorithms and data structures. In this thesis we try to close the gap between theoretically worst-case optimal classical algorithms and the real-world circumstances one face under the assumptions imposed by the data size, limited main memory or available parallelism
GPU High-Performance Framework for PIC-like Simulation Methods Using the Vulkan® Explicit API
Within computational continuum mechanics there exists a large category of simulation methods which operate by tracking Lagrangian particles over an Eulerian background grid. These Lagrangian/Eulerian hybrid methods, descendants of the Particle-In-Cell method (PIC), have proven highly effective at simulating a broad range of materials and mechanics including fluids, solids, granular materials, and plasma. These methods remain an area of active research after several decades, and their applications can be found across scientific, engineering, and entertainment disciplines.
This thesis presents a GPU driven PIC-like simulation framework created using the Vulkan® API. Vulkan is a cross-platform and open-standard explicit API for graphics and GPU compute programming. Compared to its predecessors, Vulkan offers lower overhead, support for host parallelism, and finer grain control over both device resources and scheduling. This thesis harnesses those advantages to create a programmable GPU compute pipeline backed by a Vulkan adaptation of the SPgrid data-structure and multi-buffered particle arrays. The CPU host system works asynchronously with the GPU to maximize utilization of both the host and device. The framework is demonstrated to be capable of supporting Particle-in-Cell like simulation methods, making it viable for GPU acceleration of many Lagrangian particle on Eulerian grid hybrid methods. This novel framework is the first of its kind to be created using Vulkan® and to take advantage of GPU sparse memory features for grid sparsity
AUTOMATING DATA-LAYOUT DECISIONS IN DOMAIN-SPECIFIC LANGUAGES
A long-standing challenge in High-Performance Computing (HPC) is the simultaneous achievement of programmer productivity and hardware computational efficiency. The challenge has been exacerbated by the onset of multi- and many-core CPUs and accelerators. Only a few expert programmers have been able to hand-code domain-specific data transformations and vectorization schemes needed to extract the best possible performance on such architectures. In this research, we examined the possibility of automating these methods by developing a Domain-Specific Language (DSL) framework. Our DSL approach extends C++14 by embedding into it a high-level data-parallel array language, and by using a domain-specific compiler to compile to hybrid-parallel code. We also implemented an array index-space transformation algebra within this high-level array language to manipulate array data-layouts and data-distributions. The compiler introduces a novel method for SIMD auto-vectorization based on array data-layouts. Our new auto-vectorization technique is shown to outperform the default auto-vectorization strategy by up to 40% for stencil computations. The compiler also automates distributed data movement with overlapping of local compute with remote data movement using polyhedral integer set analysis. Along with these main innovations, we developed a new technique using C++ template metaprogramming for developing embedded DSLs using C++. We also proposed a domain-specific compiler intermediate representation that simplifies data flow analysis of abstract DSL constructs. We evaluated our framework by constructing a DSL for the HPC grand-challenge domain of lattice quantum chromodynamics. Our DSL yielded performance gains of up to twice the flop rate over existing production C code for selected kernels. This gain in performance was obtained while using less than one-tenth the lines of code. The performance of this DSL was also competitive with the best hand-optimized and hand-vectorized code, and is an order of magnitude better than existing production DSLs.Doctor of Philosoph
- …