207 research outputs found

    IMPROVING THE PERFORMANCE AND TIME-PREDICTABILITY OF GPUs

    Get PDF
    Graphic Processing Units (GPUs) are originally mainly designed to accelerate graphic applications. Now the capability of GPUs to accelerate applications that can be parallelized into a massive number of threads makes GPUs the ideal accelerator for boosting the performance of such kind of general-purpose applications. Meanwhile it is also very promising to apply GPUs to embedded and real-time applications as well, where high throughput and intensive computation are also needed. However, due to the different architecture and programming model of GPUs, how to fully utilize the advanced architectural features of GPUs to boost the performance and how to analyze the worst-case execution time (WCET) of GPU applications are the problems that need to be addressed before exploiting GPUs further in embedded and real-time applications. We propose to apply both architectural modification and static analysis methods to address these problems. First, we propose to study the GPU cache behavior and use bypassing to reduce unnecessary memory traffic and to improve the performance. The results show that the proposed bypassing method can reduce the global memory traffic by about 22% and improve the performance by about 13% on average. Second, we propose a cache access reordering framework based on both architectural extension and static analysis to improve the predictability of GPU L1 data caches. The evaluation results show that the proposed method can provide good predictability in GPU L1 data caches, while allowing the dynamic warp scheduling for good performance. Third, based on the analysis of the architecture and dynamic behavior of GPUs, we propose a WCET timing model based on a predictable warp scheduling policy to enable the WCET estimation on GPUs. The experimental results show that the proposed WCET analyzer can effectively provide WCET estimations for both soft and hard real-time application purposes. Last, we propose to analyze the shared Last Level Cache (LLC) in integrated CPU-GPU architectures and to integrate the analysis of the shared LLC into the WCET analysis of the GPU kernels in such systems. The results show that the proposed shared data LLC analysis method can improve the accuracy of the shared LLC miss rate estimations, which can further improve the WCET estimations of the GPU kernels

    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    Get PDF
    abstract: General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.Dissertation/ThesisDoctoral Dissertation Computer Science 201

    Intelligent Scheduling and Memory Management Techniques for Modern GPU Architectures

    Get PDF
    abstract: With the massive multithreading execution feature, graphics processing units (GPUs) have been widely deployed to accelerate general-purpose parallel workloads (GPGPUs). However, using GPUs to accelerate computation does not always gain good performance improvement. This is mainly due to three inefficiencies in modern GPU and system architectures. First, not all parallel threads have a uniform amount of workload to fully utilize GPU’s computation ability, leading to a sub-optimal performance problem, called warp criticality. To mitigate the degree of warp criticality, I propose a Criticality-Aware Warp Acceleration mechanism, called CAWA. CAWA predicts and accelerates the critical warp execution by allocating larger execution time slices and additional cache resources to the critical warp. The evaluation result shows that with CAWA, GPUs can achieve an average of 1.23x speedup. Second, the shared cache storage in GPUs is often insufficient to accommodate demands of the large number of concurrent threads. As a result, cache thrashing is commonly experienced in GPU’s cache memories, particularly in the L1 data caches. To alleviate the cache contention and thrashing problem, I develop an instruction aware Control Loop Based Adaptive Bypassing algorithm, called Ctrl-C. Ctrl-C learns the cache reuse behavior and bypasses a portion of memory requests with the help of feedback control loops. The evaluation result shows that Ctrl-C can effectively improve cache utilization in GPUs and achieve an average of 1.42x speedup for cache sensitive GPGPU workloads. Finally, GPU workloads and the co-located processes running on the host chip multiprocessor (CMP) in a heterogeneous system setup can contend for memory resources in multiple levels, resulting in significant performance degradation. To maximize the system throughput and balance the performance degradation of all co-located applications, I design a scalable performance degradation predictor specifically for heterogeneous systems, called HeteroPDP. HeteroPDP predicts the application execution time and schedules OpenCL workloads to run on different devices based on the optimization goal. The evaluation result shows HeteroPDP can improve the system fairness from 24% to 65% when an OpenCL application is co-located with other processes, and gain an additional 50% speedup compared with always offloading the OpenCL workload to GPUs. In summary, this dissertation aims to provide insights for the future microarchitecture and system architecture designs by identifying, analyzing, and addressing three critical performance problems in modern GPUs.Dissertation/ThesisDoctoral Dissertation Computer Engineering 201

    Directive-based Approach to Heterogeneous Computing

    Get PDF
    El mundo de la computación de altas prestaciones está sufriendo grandes cambios que incrementan notablemente su complejidad. La incapacidad de los sistemas monoprocesador o incluso multiprocesador de mantener el incremento de la potencia de cómputo para suplir las necesidades de la comunidad científica ha forzado la irrupción de arquitecturas hardware masivamente paralelas y de unidades específicas para realizar operaciones concretas. Un buen ejemplo de este tipo de dispositivos son las GPU (Unidades de procesamiento gráfico). Estos dispositivos, tradicionalmente dedicados a la programación gráfica, se han convertido recientemente en una plataforma ideal para implementar cómputos masivamente paralelos. La combinación de GPUs para realizar tareas intensivas en cómputo con multi-procesadores para llevar tareas menos intensas pero con lógica de control más compleja, se ha convertido en los últimos años en una de las plataformas más comunes para la realización de cálculos científicos a bajo coste, dado que la potencia desplegada en muchos casos puede alcanzar la de clústers de pequeño o mediano tamaño, con un coste inicial y de mantenimiento notablemente inferior. La incorporación de GPUs en clústers ha permitido también aumentar la capacidad de éstos. Sin embargo, la complejidad de la programación de GPUs, y su integración con códigos existentes, dificultan enormemente la introducción de estas tecnologías entre usuarios menos expertos. En esta tésis exploramos la utilización de modelos de programación basados en directivas para este tipo de entornos, multi-core, many-core, GPUs y clústers, donde el usuario medio ve disminuida notablemente su productividad debido a la dificultad de programación en estos entornos. Para explorar la mejor forma de aplicar directivas en estos entornos, hemos desarrollado un conjunto de herramientas software altamente flexibles (un compilador y un runtime), que permiten explorar diversas técnicas con relativamente poco esfuerzo. La irrupción del estándar de programación de directivas de OpenACC nos permitió demostrar la capacidad de estas herramientas, realizando una implementación experimental del estándar (accULL) en muy poco tiempo y con un rendimiento nada desdeñable. Los resultados computacionales aportados nos permiten demostrar: (a) La disminución en el esfuerzo de programación que permiten las aproximaciones basadas en directivas, (b) La capacidad y flexibilidad de las herramientas diseñadas durante esta tésis para explorar estas aproximaciones y finalmente (c) El potencial de desarrollo futuro de accULL como herramienta experimental en OpenACC en base al rendimiento obtenido actualmente frente al rendimiento de otras aproximaciones comerciales

    Massively parallel split-step Fourier techniques for simulating quantum systems on graphics processing units

    Get PDF
    The split-step Fourier method is a powerful technique for solving partial differential equations and simulating ultracold atomic systems of various forms. In this body of work, we focus on several variations of this method to allow for simulations of one, two, and three-dimensional quantum systems, along with several notable methods for controlling these systems. In particular, we use quantum optimal control and shortcuts to adiabaticity to study the non-adiabatic generation of superposition states in strongly correlated one-dimensional systems, analyze chaotic vortex trajectories in two dimensions by using rotation and phase imprinting methods, and create stable, threedimensional vortex structures in Bose–Einstein condensates through artificial magnetic fields generated by the evanescent field of an optical nanofiber. We also discuss algorithmic optimizations for implementing the split-step Fourier method on graphics processing units. All computational methods present in this work are demonstrated on physical systems and have been incorporated into a state-of-the-art and open-source software suite known as GPUE, which is currently the fastest quantum simulator of its kind.Okinawa Institute of Science and Technology Graduate Universit

    On Efficient GPGPU Computing for Integrated Heterogeneous CPU-GPU Microprocessors

    Get PDF
    Heterogeneous microprocessors which integrate a CPU and GPU on a single chip provide low-overhead CPU-GPU communication and permit sharing of on-chip resources that a traditional discrete GPU would not have direct access to. These features allow for the optimization of codes that heretofore would be suitable only for multi-core CPUs or discrete GPUs to be run on a heterogeneous CPU-GPU microprocessor efficiently and in some cases- with increased performance. This thesis discusses previously published work on exploiting nested MIMD-SIMD Parallelization for Heterogeneous microprocessors. We examined loop structures in which one or more regular data parallel loops are nested within a parallel outer loop that can contain irregular code (e.g., with control divergence). By scheduling outer loops on the multicore CPU part of the microprocessor, each thread launches dynamic, independent instances of the inner loop onto the GPU, boosting GPU utilization while simultaneously parallelizing the outer loop. The second portion of the thesis proposal explores heterogeneous producer-consumer data-sharing between the CPU and GPU on the microprocessor. One advantage of tight integration -- the sharing of the on-chip cache system -- could improve the impact that memory accesses have on performance and power. Producer-consumer data sharing commonly occurs between the CPU and GPU portions of programs, but large kernel sizes whose data footprint far exceeds that of a typical CPU cache, cause shared data to be evicted before it is reused. We propose Pipelined CPU-GPU Scheduling for Caches, a locality transformation for producer-consumer relationships between CPUs and GPUs. By intelligently scheduling the execution of the producer and consumer in a software pipeline, evictions can be avoided, saving DRAM accesses, power, and performance. To keep the cached data on chip, we allow the producer to run ahead of the consumer by a certain amount of loop iterations or threads. Choosing this "run-ahead distance" becomes the main constraint in the scheduling of work in this software pipeline, and we provide a method of statically predicting it. We assert that with intelligent scheduling and the hardware and software mechanisms to support it, more workloads can be gainfully executed on integrated heterogeneous CPU-GPU microprocessors than previously assumed

    Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics

    Full text link
    The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Resource disaggregation allows compute and memory resources to be allocated individually as required to each workload. However, it is unclear how to efficiently realize this capability and cost-effectively meet the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonics can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation and fully meet the bit error rate (BER) and high escape bandwidth of all chip types in modern HPC racks. Our photonic-based disaggregated rack provides an average application speedup of 11% (46% maximum) for 25 CPU and 61% for 24 GPU benchmarks compared to a similar system that instead uses modern electronic switches for disaggregation. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4x fewer memory modules and 2x fewer NICs than a non-disaggregated baseline.Comment: 15 pages, 12 figures, 4 tables. Published in IEEE Cluster 202
    corecore