239 research outputs found
GPU Resource Optimization and Scheduling for Shared Execution Environments
General purpose graphics processing units have become a computing workhorse for a variety of data- and compute-intensive applications, from large supercomputing systems for massive data analytics to small, mobile embedded devices for autonomous vehicles. Making effective and efficient use of these processors traditionally relies on extensive programmer expertise to design and develop kernel methods which simultaneously trade off task decomposition and resource exploitation. Often, new architecture designs force code refinements in order to continue to achieve optimal performance. At the same time, not all applications require full utilization of the system to achieve that optimal performance. In this case, the increased capability of new architectures introduces an ever-widening gap between the level of resources necessary for optimal performance and the level necessary to maintain system efficiency.
The ability to schedule and execute multiple independent tasks on a GPU, known generally as concurrent kernel execution, enables application programmers and system developers to balance application performance and system efficiency. Various approaches to develop both coarse- and fine-grained scheduling mechanisms to achieve a high degree of resource utilization and improved application performance have been studied. Most of these works focus on mechanisms for the management of compute resources, while a small percentage consider the data transfer channels. In this dissertation, we propose a pragmatic approach to scheduling and managing both types of resources – data transfer and compute – that is transparent to an application programmer and capable of providing near-optimal system performance.
Furthermore, the approaches described herein rely on reinforcement learning methods, which enable the scheduling solutions to be flexible to a variety of factors, such as transient application behaviors, changing system designs, and tunable objective functions. Finally, we describe a framework for the practical implementation of learned scheduling policies to achieve high resource utilization and efficient system performance
Performance counter-based strategies to improve data locality on multiprocessor systems: reordering and page migration techniques
In this dissertation we approach the study of Precise Event-Based Sampling (PEBS) techniques to improve the performance of applications on a NUMA, Itanium2-based system. We demonstrate that a low-cost, PEBS profiling can support strategies to improve the performance of an important group of computational and scientific codes in runtime. In addition, the accurate information provided by the new Event Adress Registers (EAR) of the Intel Itanium architecture helps foster the development of new data allocation strategies. Following this line, we have also developed a series of dynamic page migration PEBS strategies. Specifically, two problems are addressed: how to improve the performance of locality optimisation techniques for irregular codes in runtime, particularising for the Sparse Matrix-Vector product kernel, and how to develop strategies for dynamic page migration.
To summarise, the main contributions of this dissertation are:
1. A study of the different factors that affect the performance, as well as data and thread allocation policies, in the FinisTerrae supercomputer, the target platform in which this thesis relies on.
2. The implementation of a performance model for FinisTerrae.
3. The development of hardware counter-based strategies to assist reordering techniques
for irregular codes in order to reduce their cost and improve their behaviour.
4. The development of novel hardware counter-guided, dynamic page migration
algorithms that take advantage of the new features provided by the PEBS.
As a software contribution, we present a user-level page-migration framework to monitor,
sample and control an application in runtime
High-Performance In-Memory OLTP via Coroutine-to-Transaction
Data stalls are a major overhead in main-memory database engines due to the use of pointer-rich data structures. Lightweight coroutines ease the implementation of software prefetching to hide data stalls by overlapping computation and asynchronous data prefetching. Prior solutions, however, mainly focused on (1) individual components and operations and (2) intra-transaction batching that requires interface changes, breaking backward compatibility. It was not clear how they apply to a full database engine and how much end-to-end benefit they bring under various workloads. This thesis presents CoroBase, a main-memory database engine that tackles these challenges with a new coroutine-to-transaction paradigm. Coroutine-to-transaction models transactions as coroutines and thus enables inter-transaction batching, avoiding application changes but retaining the benefits of prefetching. We show that on a 48-core server, CoroBase can perform close to 2× better for read-intensive workloads and remain competitive for workloads that inherently do not benefit from software prefetching
Integrated shared-memory and message-passing communication in the Alewife multiprocessor
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.Includes bibliographical references (p. 237-246) and index.by John David Kubiatowicz.Ph.D
Register Optimizations for Stencils on GPUs
International audienceThe recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. In this paper, we develop a statement reordering framework that models stencil computations as a DAG of trees with shared leaves, and adapts an optimal scheduling algorithm for minimizing register usage for expression trees. The effectiveness of the approach is demonstrated through experimental results on a range of stencils extracted from application codes
Recommended from our members
Orchestrating thread scheduling and cache management to improve memory system throughput in throughput processors
textThroughput processors such as GPUs continue to provide higher peak arithmetic capability. Designing a high throughput memory system to keep the computational units busy is very challenging. Future throughput processors must continue to exploit data locality and utilize the on-chip and off-chip resources in the memory system more effectively to further improve the memory system throughput. This dissertation advocates orchestrating the thread scheduler with the cache management algorithms to alleviate GPU cache thrashing and pollution, avoid bandwidth saturation and maximize GPU memory system throughput. Based on this principle, this thesis work proposes three mechanisms to improve the cache efficiency and the memory throughput. This thesis work enhances the thread throttling mechanism with the Priority-based Cache Allocation mechanism (PCAL). By estimating the cache miss ratio with a variable number of cache-feeding threads and monitoring the usage of key memory system resources, PCAL determines the number of threads to share the cache and the minimum number of threads bypassing the cache that saturate memory system resources. This approach reduces the cache thrashing problem and effectively employs chip resources that would otherwise go unused by a pure thread throttling approach. We observe 67% improvement over the original as-is benchmarks and a 18% improvement over a better-tuned warp-throttling baseline. This work proposes the AgeLRU and Dynamic-AgeLRU mechanisms to address the inter-thread cache thrashing problem. AgeLRU prioritizes cache blocks based on the scheduling priority of their fetching warp at replacement. Dynamic-AgeLRU selects the AgeLRU algorithm and the LRU algorithm adaptively to avoid degrading the performance of non-thrashing applications. There are three variants of the AgeLRU algorithm: (1) replacement-only, (2) bypassing, and (3) bypassing with traffic optimization. Compared to the LRU algorithm, the above mentioned three variants of the AgeLRU algorithm enable increases in performance of 4%, 8% and 28% respectively across a set of cache-sensitive benchmarks. This thesis work develops the Reuse-Prediction-based cache Replacement scheme (RPR) for the GPU L1 data cache to address the intra-thread cache pollution problem. By combining the GPU thread scheduling priority together with the fetching Program Counter (PC) to generate a signature as the index of the prediction table, RPR identifies and prioritizes the near-reuse blocks and high-reuse blocks to maximize the cache efficiency. Compared to the AgeLRU algorithm, the experimental results show that the RPR algorithm results in a throughput improvement of 5% on average for regular applications, and a speedup of 3.2% on average across a set of cache-sensitive benchmarks. The techniques proposed in this dissertation are able to alleviate the cache thrashing, cache pollution and resource saturation problems effectively. We believe when these techniques are combined, they will synergistically further improve GPU cache efficiency and the overall memory system throughput.Computer Science
Doctor of Philosophy
dissertationMemory access irregularities are a major bottleneck for bandwidth limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lock step may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing. It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices. Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented
Recommended from our members
Debugging Woven Code
The ability to debug woven programs is critical to the adoption of Aspect Oriented Programming (AOP). Nevertheless, many AOP systems lack adequate support for debugging, making it difficult to diagnose faults and understand the program's structure and control flow. We discuss why debugging aspect behavior is hard and how harvesting results from related research on debugging optimized code can make the problem more tractable. We also specify general debugging criteria that we feel all AOP systems should support. We present a novel solution to the problem of debugging aspect-enabled programs. Our Wicca system is the first dynamic AOP system to support full source-level debugging of woven code. It introduces a new weaving strategy that combines source weaving with online byte-code patching. Changes to the aspect rules, or base or aspect source code are rewoven and recompiled on-the-fly. We present the results of an experiment that show how these features provide the programmer with a powerful interactive debugging experience with relatively little overhead
Managing contamination delay to improve Timing Speculation architectures
Timing Speculation (TS) is a widely known method for realizing better-than-worst-case systems. Aggressive clocking, realizable by TS, enable systems to operate beyond specified safe frequency limits to effectively exploit the data dependent circuit delay. However, the range of aggressive clocking for performance enhancement under TS is restricted by short paths. In this paper, we show that increasing the lengths of short paths of the circuit increases the effectiveness of TS, leading to performance improvement. Also, we propose an algorithm to efficiently add delay buffers to selected short paths while keeping down the area penalty. We present our algorithm results for ISCAS-85 suite and show that it is possible to increase the circuit contamination delay by up to 30% without affecting the propagation delay. We also explore the possibility of increasing short path delays further by relaxing the constraint on propagation delay and analyze the performance impact
Decompose and Conquer: Addressing Evasive Errors in Systems on Chip
Modern computer chips comprise many components, including microprocessor cores, memory modules, on-chip networks, and accelerators. Such system-on-chip (SoC) designs are deployed in a variety of computing devices: from internet-of-things, to smartphones, to personal computers, to data centers. In this dissertation, we discuss evasive errors in SoC designs and how these errors can be addressed efficiently. In particular, we focus on two types of errors: design bugs and permanent faults. Design bugs originate from the limited amount of time allowed for design verification and validation. Thus, they are often found in functional features that are rarely activated. Complete functional verification, which can eliminate design bugs, is extremely time-consuming, thus impractical in modern complex SoC designs. Permanent faults are caused by failures of fragile transistors in nano-scale semiconductor manufacturing processes. Indeed, weak transistors may wear out unexpectedly within the lifespan of the design. Hardware structures that reduce the occurrence of permanent faults incur significant silicon area or performance overheads, thus they are infeasible for most cost-sensitive SoC designs.
To tackle and overcome these evasive errors efficiently, we propose to leverage the principle of decomposition to lower the complexity of the software analysis or the hardware structures involved. To this end, we present several decomposition techniques, specific to major SoC components. We first focus on microprocessor cores, by presenting a lightweight bug-masking analysis that decomposes a program into individual instructions to identify if a design bug would be masked by the program's execution. We then move to memory subsystems: there, we offer an efficient memory consistency testing framework to detect buggy memory-ordering behaviors, which decomposes the memory-ordering graph into small components based on incremental differences. We also propose a microarchitectural patching solution for memory subsystem bugs, which augments each core node with a small distributed programmable logic, instead of including a global patching module. In the context of on-chip networks, we propose two routing reconfiguration algorithms that bypass faulty network resources. The first computes short-term routes in a distributed fashion, localized to the fault region. The second decomposes application-aware routing computation into simple routing rules so to quickly find deadlock-free, application-optimized routes in a fault-ridden network. Finally, we consider general accelerator modules in SoC designs. When a system includes many accelerators, there are a variety of interactions among them that must be verified to catch buggy interactions. To this end, we decompose such inter-module communication into basic interaction elements, which can be reassembled into new, interesting tests.
Overall, we show that the decomposition of complex software algorithms and hardware structures can significantly reduce overheads: up to three orders of magnitude in the bug-masking analysis and the application-aware routing, approximately 50 times in the routing reconfiguration latency, and 5 times on average in the memory-ordering graph checking. These overhead reductions come with losses in error coverage: 23% undetected bug-masking incidents, 39% non-patchable memory bugs, and occasionally we overlook rare patterns of multiple faults. In this dissertation, we discuss the ideas and their trade-offs, and present future research directions.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/147637/1/doowon_1.pd
- …