    Sparse Tensor Transpositions

    We present a new algorithm for transposing sparse tensors called Quesadilla. The algorithm converts the sparse tensor data structure to a list of coordinates and sorts it with a fast multi-pass radix algorithm that exploits knowledge of the requested transposition and the tensor's input partial coordinate ordering to provably minimize the number of parallel partial sorting passes. We evaluate both a serial and a parallel implementation of Quesadilla on a set of 19 tensors from the FROSTT collection, a set of tensors taken from scientific and data analytic applications. We compare Quesadilla and a generalization, Top-2-sadilla, to several state-of-the-art approaches, including the tensor transposition routine used in the SPLATT tensor factorization library. In serial tests, Quesadilla was the best strategy for 60% of all tensor and transposition combinations and improved over SPLATT by at least 19% in half of the combinations. In parallel tests, at least one of Quesadilla or Top-2-sadilla was the best strategy for 52% of all tensor and transposition combinations. Comment: This work will be the subject of a brief announcement at the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '20).
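
The general idea of transposing by sorting a coordinate list can be sketched as follows. This is a minimal illustration, not the Quesadilla algorithm: it runs one stable counting-sort pass per mode (a plain least-significant-digit radix sort), whereas the abstract describes using the input's partial ordering to reduce the number of passes. The names `counting_sort_by_mode` and `transpose_coo`, and the tuple-based coordinate representation, are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# One nonzero entry: (coordinate tuple, value). Hypothetical representation.
Entry = Tuple[Tuple[int, ...], float]


def counting_sort_by_mode(entries: List[Entry], mode: int, dim: int) -> List[Entry]:
    """Stable counting sort of entries by coordinate `mode`, with keys in [0, dim)."""
    counts = [0] * (dim + 1)
    for coords, _ in entries:
        counts[coords[mode] + 1] += 1
    # Prefix sums: counts[k] becomes the first output slot for key k.
    for i in range(dim):
        counts[i + 1] += counts[i]
    out: List[Entry] = [((), 0.0)] * len(entries)
    for entry in entries:
        key = entry[0][mode]
        out[counts[key]] = entry
        counts[key] += 1
    return out


def transpose_coo(
    entries: List[Entry], shape: Sequence[int], perm: Sequence[int]
) -> Tuple[List[Entry], Tuple[int, ...]]:
    """Permute the modes of a coordinate-list tensor and sort it lexicographically.

    Output mode i is input mode perm[i].
    """
    permuted = [(tuple(c[p] for p in perm), v) for c, v in entries]
    new_shape = tuple(shape[p] for p in perm)
    # LSD radix sort: sort by the last mode first, the first mode last.
    for mode in reversed(range(len(perm))):
        permuted = counting_sort_by_mode(permuted, mode, new_shape[mode])
    return permuted, new_shape
```

Because every pass is stable, sorting from the last mode to the first leaves the entries in lexicographic order of the permuted coordinates. A pass can be skipped whenever the input is already ordered on the corresponding mode, which is the kind of saving the abstract refers to.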

    GPU Accelerated Approach to Numerical Linear Algebra and Matrix Analysis with CFD Applications

    A GPU-accelerated approach to numerical linear algebra and matrix analysis with CFD applications is presented. The work's objectives are to (1) develop stable and efficient algorithms that use multiple NVIDIA GPUs with CUDA to accelerate common matrix computations, (2) optimize these algorithms through CPU/GPU memory allocation, GPU kernel development, CPU/GPU communication, data transfer, and bandwidth control, and (3) develop parallel CFD applications for Navier-Stokes and Lattice Boltzmann analysis methods. Special consideration is given to performing the linear algebra algorithms on particular matrix types (banded, dense, diagonal, sparse, symmetric, and triangular). Benchmarks are performed for all analyses, with baseline CPU times used to determine speed-up factors and measure the computational capability of the GPU-accelerated algorithms. The GPU-implemented algorithms and the optimization techniques used in this work are measured against preexisting work and test matrices available in the NIST Matrix Market. The CFD analysis strengthens the assessment of this work by providing a direct engineering application that benefits from matrix optimization techniques and accelerated algorithms. Overall, this work develops optimizations for selected linear algebra and matrix computations on modern GPU architectures with CUDA and applies them directly to mathematical and engineering applications through CFD analysis.

    Performance Optimization With An Integrated View Of Compiler And Application Knowledge

    Compiler optimization is a long-standing research field that enhances program performance with a set of rigorous code analyses and transformations. Traditional compiler optimization focuses on general programs or program structures and makes little use of high-level knowledge about application operations or data structures. In this thesis, we claim that an integrated view of the application and compiler helps to further improve program performance. In particular, we study integrated optimization opportunities for three kinds of applications: irregular tree-based query processing systems such as the B+ tree, security enhancement such as buffer overflow protection, and tensor/matrix-based linear algebra computation. The performance of B+ tree query processing is important for many applications, such as file systems and databases. Latch-free B+ tree query processing is efficient since the queries are processed in batches without locks. To avoid long latency, the batch size cannot be very large. However, modern processors provide opportunities to process larger batches in parallel with acceptable latency. From studying real-world data, we find that there are many redundant and unnecessary queries, especially when the real-world data is highly skewed. We develop a query sequence transformation framework, Qtrans, to reduce the redundancies in queries by applying classic dataflow analysis to queries. To further confirm its effectiveness, we integrate Qtrans into an existing BSP-based B+ tree query processing system, PALM tree. The evaluations show that the throughput can be improved by up to 16X. Heap overflows are still the most common vulnerabilities in C/C++ programs. Common approaches incur high overhead since they check every memory access. By analyzing dozens of bugs, we find that all heap overflows are related to arrays, so only array-related memory accesses need to be checked. We propose Prober to efficiently detect and prevent heap overflows.
It contains Prober-Static to identify the array-related allocations and Prober-Dynamic to protect objects at runtime. In this thesis, our contributions lie on the Prober-Static side. The key challenge is to correctly identify the array-related allocations. We propose a hybrid method. Some objects can be identified as array-related (or not) by static analysis. For the remaining ones, we instrument the basic allocation type size statically and then determine the real allocation size at runtime. The evaluations show that Prober-Static is effective. Tensor algebra is widely used in many applications, such as machine learning and data analytics. Tensors representing real-world data are usually large and sparse. There are many sparse tensor storage formats, and the kernels differ from one format to another. These different kernels make performance optimization for sparse tensor algebra challenging. We propose SPACe, a tensor algebra domain-specific language and compiler that automatically generates kernels for sparse tensor algebra computations. This compiler supports a wide range of sparse tensor formats. To further improve performance, we integrate data reordering into SPACe to improve data locality. The evaluations show that the code generated by SPACe outperforms state-of-the-art sparse tensor algebra compilers.
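
The query-redundancy idea mentioned for Qtrans can be illustrated with a small sketch. This is not the Qtrans framework: it is a backward pass, analogous to dead-store elimination in dataflow analysis, that drops updates whose effect is overwritten before any search of the same key could observe it. The function name `eliminate_dead_updates`, the `(op, key, value)` query encoding, and the assumption that `insert` behaves as an upsert are all hypothetical choices made for this example.

```python
from typing import Hashable, List, Optional, Tuple

# Hypothetical query encoding: op is "search", "insert", or "delete".
Query = Tuple[str, Hashable, Optional[object]]


def eliminate_dead_updates(batch: List[Query]) -> List[Query]:
    """Remove updates that a later update to the same key overwrites unseen.

    Assumes `insert` is an upsert, so only the last update before each
    search (or before the end of the batch) affects observable state.
    """
    overwritten = set()  # keys whose next later query is an update
    kept: List[Query] = []
    # Walk the batch backwards, as in a liveness analysis.
    for op, key, value in reversed(batch):
        if op == "search":
            # A search reads the key, so the update just before it is live.
            overwritten.discard(key)
            kept.append((op, key, value))
        else:
            if key not in overwritten:
                kept.append((op, key, value))
            overwritten.add(key)
    kept.reverse()
    return kept
```

On skewed workloads, where a few keys receive most of the queries, a pass like this shrinks the batch before it reaches the tree, which is the kind of saving the abstract attributes to query sequence transformation.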

    Graphics Processing Units: Abstract Modelling and Applications in Bioinformatics

    The Graphical Processing Unit is a specialised piece of hardware that contains many low powered cores, available on both the consumer and industrial market. The original Graphical Processing Units were designed for processing high quality graphical images, for presentation to the screen, and were therefore marketed to the computer games market segment. More recently, frameworks such as CUDA and OpenCL allowed the specialised highly parallel architecture of the Graphical Processing Unit to be used for not just graphical operations, but for general computation. This is known as General Purpose Programming on Graphical Processing Units, and it has attracted interest from the scientific community, looking for ways to exploit this highly parallel environment, which was cheaper and more accessible than the traditional High Performance Computing platforms, such as the supercomputer. This interest in developing algorithms that exploit the parallel architecture of the Graphical Processing Unit has highlighted the need for scientists to be able to analyse proposed algorithms, just as happens for proposed sequential algorithms. In this thesis, we study the abstract modelling of computation on the Graphical Processing Unit, and the application of Graphical Processing Unit-based algorithms in the field of bioinformatics, the field of using computational algorithms to solve biological problems. We show that existing abstract models for analysing parallel algorithms on the Graphical Processing Unit are not able to sufficiently and accurately model all that is required. We propose a new abstract model, called the Abstract Transferring Graphical Processing Unit Model, which is able to provide analysis of Graphical Processing Unit-based algorithms that is more accurate than existing abstract models. It does this by capturing the data transfer between the Central Processing Unit and the Graphical Processing Unit. 
We demonstrate the accuracy and applicability of our model with several computational problems, showing that our model provides greater accuracy than the existing models, and we verify these claims experimentally. We also contribute novel Graphics Processing Unit-based solutions to two bioinformatics problems, DNA sequence alignment and protein spectral identification, demonstrating promising levels of improvement over the sequential Central Processing Unit experiments.
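
Why capturing CPU-GPU data transfer matters can be seen in a toy cost estimate. The sketch below is not the Abstract Transferring Graphical Processing Unit Model, whose definition is not given in this abstract. It is a simple latency-plus-bandwidth estimate, and every name and parameter in it (`Machine`, `transfer_time`, `estimate_runtime`, the throughput figures) is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical machine parameters for a toy cost estimate."""
    transfer_latency_s: float      # fixed cost per host-device transfer
    transfer_bandwidth_bps: float  # bytes per second over the interconnect
    device_throughput_ops: float   # operations per second on the device


def transfer_time(m: Machine, nbytes: float) -> float:
    """Latency plus size over bandwidth for one transfer."""
    return m.transfer_latency_s + nbytes / m.transfer_bandwidth_bps


def estimate_runtime(
    m: Machine,
    bytes_in: float,
    bytes_out: float,
    ops: float,
    include_transfer: bool = True,
) -> float:
    """Estimated time for copy-in, compute, copy-out (or compute only)."""
    compute = ops / m.device_throughput_ops
    if not include_transfer:
        return compute
    return transfer_time(m, bytes_in) + compute + transfer_time(m, bytes_out)
```

For a kernel that does little work per byte moved, the compute-only estimate can be several times smaller than the estimate that includes both transfers, so a model that ignores transfers would rank such an algorithm too favourably.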

    On the programmability of multi-GPU computing systems

    Multi-GPU systems are widely used in High Performance Computing environments to accelerate scientific computations. This trend is expected to continue as integrated GPUs are introduced to processors used in multi-socket servers and servers pack a higher number of GPUs per node. GPUs are currently connected to the system through the PCI Express interconnect, which provides limited bandwidth (compared to the bandwidth of the memory in GPUs) and often becomes a bottleneck for performance scalability. Current programming models present GPUs as isolated devices with their own memory, even if they share the host memory with the CPU. Programmers explicitly manage allocations in all GPU memories and use primitives to communicate data between GPUs. Furthermore, programmers are required to use mechanisms such as command queues and inter-GPU synchronization. This explicit model harms the maintainability of the code and introduces new sources of potential errors. The first proposal of this thesis is the HPE model. HPE builds a simple, consistent programming interface based on three major features. (1) All device address spaces are combined with the host address space to form a Unified Virtual Address Space. (2) Programs are provided with an Asymmetric Distributed Shared Memory system for all the GPUs in the system, which allows allocating memory objects that can be accessed by any GPU or CPU. (3) Every CPU thread can request a data exchange between any two GPUs through simple memory copy calls. This simple interface allows HPE to always provide the optimal implementation, eliminating the need for application code to handle different system topologies. Experimental results show improvements on real applications that range from 5% in compute-bound benchmarks to 2.6x in communication-bound benchmarks. HPE transparently implements sophisticated communication schemes that can deliver up to a 2.9x speedup in I/O device transfers.
The second proposal of this thesis is a shared memory programming model that exploits the new GPU capabilities for remote memory accesses to remove the need for explicit communication between GPUs. This model turns a multi-GPU system into a shared memory system with NUMA characteristics. To validate the viability of the model, we also perform an exhaustive performance analysis of remote memory accesses over PCIe. We show that the unique characteristics of the GPU execution model and memory hierarchy help to hide the costs of remote memory accesses. Results show that PCI Express 3.0 is able to hide the costs of up to 10% of remote memory accesses, depending on the access pattern, while caching of remote memory accesses can have a large impact on kernel performance. Finally, we introduce AMGE, a programming interface, compiler support, and runtime system that automatically executes computations programmed for a single GPU across all the GPUs in the system. The programming interface provides a data type for multidimensional arrays that allows for robust, transparent distribution of arrays across all GPU memories. The compiler extracts the dimensionality information from the type of each array and is able to determine the access pattern in each dimension of the array. The runtime system uses the compiler-provided information to automatically choose the best computation and data distribution configuration to minimize inter-GPU communication and memory footprint. This model effectively frees programmers from the task of decomposing and distributing computation and data to exploit several GPUs. AMGE achieves almost linear speedups for a wide range of dense computation benchmarks on a real 4-GPU system with an interconnect of moderate bandwidth.
We show that irregular computations can also benefit from AMGE.
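
The kind of decision attributed to the AMGE runtime, choosing a data distribution that minimizes inter-GPU communication, can be sketched for a simple stencil-like case. This is not AMGE's implementation: the function names (`block_ranges`, `halo_volume`, `choose_split`) and the cost measure (elements exchanged across cuts, given a per-dimension halo width) are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def block_ranges(n: int, parts: int) -> List[Tuple[int, int]]:
    """Split range(n) into `parts` contiguous [start, end) blocks of near-equal size."""
    base, extra = divmod(n, parts)
    ranges = []
    start = 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def halo_volume(shape: Sequence[int], halo: Sequence[int], dim: int, parts: int) -> int:
    """Elements exchanged when `dim` is split into `parts` blocks.

    Each of the parts - 1 cuts moves halo[dim] layers in both directions,
    and one layer holds the product of the other dimensions' extents.
    """
    layer = 1
    for d, extent in enumerate(shape):
        if d != dim:
            layer *= extent
    return 2 * halo[dim] * layer * (parts - 1)


def choose_split(shape: Sequence[int], halo: Sequence[int], parts: int) -> int:
    """Pick the dimension whose split exchanges the fewest elements."""
    return min(range(len(shape)), key=lambda d: halo_volume(shape, halo, d, parts))
```

For a 1024 x 512 array with a one-element halo in both dimensions and four GPUs, splitting the first dimension exchanges 3072 elements and splitting the second exchanges 6144, so this sketch would pick the first. Knowing the access pattern per dimension, as the abstract says the compiler provides, is what makes such a comparison possible.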

    Proceedings of the 7th International Conference on PGAS Programming Models

