84 research outputs found

    Accelerating Stencil Computation on GPGPU by Novel Mapping Method Between the Global Memory and the Shared Memory

    Stencil computation can be accelerated effectively by making good use of memory resources. In this paper, to reduce the branch divergence of the traditional mapping method between global memory and shared memory, we devise a new mapping mechanism in which the conditional statements that load the boundary stencil computation points in every XY-tile are removed by aligning the ghost zone to reduce the synchronization overhead. In addition, we make full use of a single XY-tile loaded into registers at every stencil computation point, common sub-expression elimination, and software prefetching to reduce overhead. Finally, a detailed performance evaluation demonstrates that our optimized policies are close to optimal in terms of memory bandwidth utilization and achieve higher stencil computation performance.
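The ghost-zone idea in this abstract can be illustrated with a minimal CPU-side sketch in Python. It is not the paper's GPU mapping: the function names `pad_with_ghost_zone` and `five_point_stencil`, the zero fill value, and the 5-point averaging stencil are all assumptions chosen for illustration. The point it shows is that once the tile is surrounded by ghost cells, the update loop needs no boundary conditionals.

```python
def pad_with_ghost_zone(grid, radius, fill=0.0):
    """Return a copy of grid surrounded by `radius` ghost cells on every side.

    The fill value is an assumption; a real solver would copy neighbor data
    or apply its boundary condition here.
    """
    width = len(grid[0]) + 2 * radius
    padded = [[fill] * width for _ in range(radius)]
    for row in grid:
        padded.append([fill] * radius + list(row) + [fill] * radius)
    padded.extend([fill] * width for _ in range(radius))
    return padded


def five_point_stencil(grid):
    """One 5-point averaging sweep over grid.

    Every neighbor read is valid because of the ghost zone, so the inner
    loop contains no boundary branches.
    """
    padded = pad_with_ghost_zone(grid, radius=1)
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            out[i - 1][j - 1] = (
                padded[i][j]
                + padded[i - 1][j] + padded[i + 1][j]
                + padded[i][j - 1] + padded[i][j + 1]
            ) / 5.0
    return out
```

On a GPU the same padding would let every thread in a tile execute the same load instructions, which is presumably how the conditional statements, and the divergence they cause, are avoided.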

    Performance modeling and optimization techniques for heterogeneous computing

    Since Graphics Processing Units (GPUs) have increasingly gained popularity among non-graphics and computational applications, known as General-Purpose computation on GPU (GPGPU), GPUs have been deployed in many clusters, including the world's fastest supercomputer. However, to get the most efficiency from a GPU system, one should consider both the performance and the reliability of the system. This dissertation makes four major contributions. First, a two-level checkpoint/restart protocol is proposed that aims to reduce checkpoint and recovery costs with a latency-hiding strategy between a CPU (Central Processing Unit) and a GPU. The experimental results and analysis reveal some benefits, especially for long-running applications. Second, a performance model for estimating GPGPU execution time is proposed. This performance model improves operation cost estimation over existing ones by considering varied memory latencies. The proposed model also considers the effects of thread synchronization functions. In addition, the impacts of various issues in GPGPU programming, such as bank conflicts in shared memory and branch divergence, are discussed. Third, the interplay between GPGPU application performance and system reliability in a large GPU system is explored. This includes a checkpoint scheduling model for a certain GPGPU application. The effects of a checkpoint/restart mechanism on application performance are also discussed. Finally, optimization techniques to remedy uncoalesced memory access in the GPU's global memory are proposed. These techniques are memory rearrangement using 2-dimensional matrix transpose and 3-dimensional matrix permutation. The analytical results show that the proposed techniques can reduce memory access time, especially when the transformed array/matrix is frequently accessed.

    Parallel error-correcting output codes classification in volume visualization

    In volume visualization, defining the regions of interest is inherently an iterative trial-and-error process of finding the best parameters to classify and render the final image. Generally, the user requires a lot of expertise to analyze and edit these parameters through multi-dimensional transfer functions. In this thesis, we present a framework of methods to label multiple regions of interest on demand. The methods selected are a combination of 1vs1 Adaboost binary classifiers and an ECOC framework that combines the binary results to generate a multi-class result. In a first step, Adaboost is used to train a set of 1vs1 binary classifiers with a labeled subset of points on the target volume. In a second step, an ECOC framework is used to combine the Adaboost classifiers and classify the rest of the volume, assigning a label to each point from among multiple possible labels. The labels have to be introduced by an expert on the target volume, and these labels have to be a small subset of all the points in the volume we want to classify. That way, we require only a small effort from the expert. But this requires an interactive process in which the classification results are obtained in real or near real time. That is why, in this master's thesis, we implemented the classification step in OpenCL, to exploit the parallelism of modern GPUs. We provide experimental results for both classification accuracy and execution time speedup, comparing GPU to single-core and multi-core CPU. Along with this work, we present some work derived from the use of OpenCL for the experiments, which we shared as open source through Google Code, and some abstraction of the parallelization process for any algorithm. We also comment on future work and present some conclusions in the final sections of this document.
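The second step described above, combining 1vs1 binary results into a multi-class label, can be sketched as follows. This is a minimal illustration, not the thesis implementation: the helper names `one_vs_one_coding_matrix` and `ecoc_decode` are hypothetical, the Adaboost classifiers are replaced by a ready-made list of +1/-1 predictions, and the Hamming-style decoding rule is an assumed choice among several possible ECOC decoders.

```python
from itertools import combinations


def one_vs_one_coding_matrix(num_classes):
    """Build a ternary 1vs1 coding matrix.

    Each column corresponds to one class pair (a, b): class a is coded +1,
    class b is coded -1, and every other class is coded 0 (not involved).
    """
    pairs = list(combinations(range(num_classes), 2))
    matrix = [[0] * len(pairs) for _ in range(num_classes)]
    for col, (a, b) in enumerate(pairs):
        matrix[a][col] = 1
        matrix[b][col] = -1
    return pairs, matrix


def ecoc_decode(predictions, matrix):
    """Return the class whose codeword disagrees least with the predictions.

    predictions holds one +1/-1 output per binary classifier (per column).
    Zero entries in a codeword are ignored, since that classifier was never
    trained on that class.
    """
    def distance(codeword):
        return sum(1 for c, p in zip(codeword, predictions) if c != 0 and c != p)

    return min(range(len(matrix)), key=lambda k: distance(matrix[k]))
```

Each point of the volume is decoded independently of the others, which is what makes this step a natural fit for the per-point GPU parallelism the abstract mentions.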

    Code Generation and Global Optimization Techniques for a Reconfigurable PRAM-NUMA Multicore Architecture

    Towards multiprogrammed GPUs

    Programmable Graphics Processing Units (GPUs) have recently become the most pervasive massively parallel processors. They have come a long way, from fixed-function ASICs designed to accelerate graphics tasks to a programmable architecture that can also execute general-purpose computations. Because of their performance and efficiency, an increasing amount of software relies on them to accelerate data-parallel and computationally intensive sections of code. They have earned a place in many systems, from low-power mobile devices to the biggest data centers in the world. However, GPUs are still plagued by the fact that they essentially have no multiprogramming support, resulting in low system performance if the GPU is shared among multiple programs. In this dissertation we set out to provide rich GPU multiprogramming support by improving the multitasking capabilities and increasing the virtual memory functionality and performance. The main issue hindering multitasking support in GPUs is the non-preemptive execution of GPU kernels. Here we propose two preemption mechanisms with different design philosophies that can be used by a scheduler to preempt execution on GPU cores and make room for some other process. We also argue for spatial sharing of the GPU and propose a concrete hardware scheduler implementation that dynamically partitions the GPU cores among running kernels according to their set priorities. Contrary to the assumptions made in related work, we demonstrate that preemptive execution is feasible and is the desired approach to GPU multitasking. We further show improved system fairness and responsiveness with our scheduling policy. We also pinpoint that at the core of the insufficient virtual memory support lies the exception handling mechanism used by modern GPUs. Currently, GPUs offload the actual exception handling work to the CPU, while the faulting instruction is stalled in the GPU core. This stall-on-fault model prevents some virtual memory features and optimizations and is especially harmful in multiprogrammed environments because it prevents context switching the GPU unless all the in-flight faults are resolved. In this dissertation, we propose three GPU core organizations with varying performance-complexity trade-offs that get rid of stall-on-fault execution and enable preemptible exceptions on the GPU (i.e., the faulting instruction can be squashed and restarted later). Building on this support, we implement two use cases and demonstrate their utility. One is a scheme that context switches the faulted threads and tries to find some other useful work to do in the meantime, hiding the latency of the fault and improving system performance. The other enables the fault handling code to run locally, on the GPU, instead of relying on CPU offloading, and shows that local fault handling can also improve performance. Postprint (published version)

    Extreme scale parallel NBody algorithm with event driven constraint based execution model

    Traditional scientific applications such as Computational Fluid Dynamics and numerical methods based on Partial Differential Equations (like Finite Difference Methods and Finite Element Methods) achieve sufficient efficiency on state-of-the-art high performance computing systems and have been widely studied and implemented using conventional programming models. For emerging application domains such as graph applications, scalability and efficiency are significantly constrained by conventional systems and their supporting programming models. Furthermore, technology trends like multicore, manycore, and heterogeneous system architectures are introducing new challenges and possibilities. Emerging technologies require rethinking approaches to more effectively expose the underlying parallelism to applications and end users. This thesis explores the space of effective parallel execution of ephemeral graphs that are dynamically generated. The standard particle-based simulation, solved using the Barnes-Hut algorithm, is chosen to exemplify the dynamic workloads. In this thesis the workloads are expressed using sequential execution semantics, the shared memory semantics of a conventional parallel programming model, and the semantics of ParalleX, an innovative execution model designed for efficient scalable performance towards Exascale computing. The main outcomes of this research are parallel processing of dynamic ephemeral workloads, dynamic load balancing at runtime, and the use of advanced semantics for exposing parallelism in scaling-constrained applications.

    Tiling Optimization For Nested Loops On GPUs

    Optimizing nested loops is considered an important topic and has been widely studied in parallel programming. With the development of GPU architectures, the performance of these computations can be significantly boosted by massively parallel hardware. General matrix-matrix multiplication is a typical example where executing such an algorithm on GPUs outperforms execution on multicore CPUs. However, achieving ideal performance on GPUs usually requires a lot of human effort to manage the massively parallel computation resources. Therefore, the efficient implementation of nested loops on GPUs has become a popular topic in recent years. We present our work based on the tiling strategy in this dissertation to address three kinds of popular problems. Different kinds of computations bring different latency issues: dependencies in the computation may result in insufficient parallelism, and the performance of computations without dependencies may be degraded by intensive memory accesses. In this thesis, we tackle the challenges for each kind of problem and believe that other computations performed in nested loops can also benefit from the presented techniques. We improve a parallel approximation algorithm for the problem of scheduling jobs on parallel identical machines to minimize makespan with a high-dimensional tiling method. The algorithm is designed and optimized for solving this kind of problem efficiently on GPUs. Because the algorithm is based on a higher-dimensional dynamic programming approach, where dimensionality refers to the number of variables in the dynamic programming equation characterizing the problem, the existing implementation suffers from the curse of dimensionality and cannot fully utilize GPU resources. We design a novel data-partitioning technique to accelerate the higher-dimensional dynamic programming component of the algorithm. 
Both the load imbalance and exceeding memory capacity issues are addressed in our GPU solution. We present performance results to demonstrate how our proposed design improves the GPU utilization and makes it possible to solve large higher-dimensional dynamic programming problems within the limited GPU memory. Experimental results show that the GPU implementation achieves up to 25X speedup compared to the best existing OpenMP implementation. In addition, we focus on optimizing wavefront parallelism on GPUs. Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research on such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to benefit from the massively parallel computing resources. Wavefront parallelism faces the load imbalance issue because the parallelism is passing along the diagonal. The tiling method has been introduced as a popular solution to address this issue. However, the use of hyperplane tiles increases the cost of synchronization and leads to poor data locality. In this paper, we present a highly optimized implementation of the wavefront parallelism technique that harnesses the GPU architecture. A balanced workload and maximum resource utilization are achieved with an extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required and also introduce an inter-block lock to minimize the overhead of each synchronization. We evaluate the performance of our proposed technique for four different applications: Sequence Alignment, Edit Distance, Summed-Area Table, and 2DSOR. The performance results demonstrate that our method achieves speedups of up to six times compared to the previous best-known hyperplane tiling-based GPU implementation. 
Finally, we extend hyperplane tiling to high-order 2D stencil computations. Unlike wavefront parallelism, which has dependences in the spatial dimensions, stencil computations have dependences only across two adjacent time steps along the temporal dimension. Even though this no-dependence property significantly increases the parallelism available in the spatial dimensions, full parallelism may not be efficient on GPUs. Due to the limited cache capacity of each streaming multiprocessor, full parallelism can be obtained only on global memory, which has high access latency. Therefore, the tiling technique can be applied to improve memory efficiency by caching small tiled blocks. Because the widely studied tiling methods, like overlapped tiling and split tiling, have considerable computation overhead caused by load imbalance or extra operations, we propose a time-skewed tiling method designed for the GPU architecture. We work around the serialized computation issue and coordinate intra-tile parallelism and inter-tile parallelism to minimize the load imbalance caused by pipelined processing. Moreover, we address high-order stencil computations in our development, which have not been comprehensively studied. The proposed method achieves up to 3.5X performance improvement when the stencil computation is performed on a Moore neighborhood pattern.
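The wavefront parallelism discussed in this abstract can be illustrated on Edit Distance, one of the applications it lists. The sketch below is sequential Python, not the dissertation's tiled GPU kernel; the function name `edit_distance_wavefront` is hypothetical. It visits the dynamic programming table one anti-diagonal at a time, and all cells on the same anti-diagonal depend only on earlier diagonals, so the inner loop is the part a GPU could run in parallel.

```python
def edit_distance_wavefront(a, b):
    """Levenshtein distance computed anti-diagonal by anti-diagonal."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(m + 1):
        dp[0][j] = j  # insert all of b[:j]

    # Anti-diagonal d holds the interior cells with i + j == d.
    for d in range(2, n + m + 1):
        i_lo = max(1, d - m)
        i_hi = min(n, d - 1)
        # Cells on this diagonal read only diagonals d - 1 and d - 2,
        # so the iterations of this loop are independent of each other.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost  # match or substitution
            )
    return dp[n][m]
```

The diagonal length `i_hi - i_lo + 1` grows and then shrinks as `d` advances, which is the load imbalance the abstract attributes to parallelism "passing along the diagonal".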