54 research outputs found

    Performance, Power Modeling and Optimization for High-Performance Computing Systems

    University of Minnesota Ph.D. dissertation. October 2016. Major: Electrical/Computer Engineering. Advisor: John Sartori. 1 computer file (PDF); xi, 154 pages. Heterogeneity abounds in modern high-performance computing systems. Applications are heterogeneous, containing time-varying unbalanced utilization for different resources, and system architectures have become heterogeneous in order to achieve higher levels of performance and energy efficiency. The most powerful, and also the most energy-efficient, high-performance computing systems today consist of many-core CPUs and GPGPUs with a variety of specialized on-chip and off-chip memories. These heterogeneous systems provide a huge amount of computing resources, but it is becoming increasingly challenging to use them effectively and efficiently to maximize their potential. This becomes an even more pressing challenge as energy efficiency becomes the primary barrier to achieving higher levels of performance. This thesis addresses the challenges of performance modeling and optimization in heterogeneous high-performance computing systems. Effective system optimization requires understanding of how performance and power change in response to optimizations. Therefore, we begin by summarizing the impact of modern architectural advances on performance and power modeling for chip multiprocessors (CMPs). We present two models that estimate the performance and power in such systems. The first model, CAMP, is a fast and accurate cache-aware performance model that estimates the performance degradation due to cache contention of processes running on cache-sharing cores. We then propose a system-level power model for a multi-programmed CMP environment that accounts for cache contention. We explain how to integrate the two models to enable power-aware process assignment.
Then, we propose an off-chip memory access-aware runtime DVFS control technique that minimizes energy consumption subject to a constraint on application execution time. The second part of the dissertation focuses on improving performance for GPGPUs. After a thorough analysis of the CPI breakdown, we lay out all the key factors that govern GPU throughput. In order to improve overall performance for GPGPUs, we propose two approaches that address the key factors without introducing extra congestion and degradation to the system. We first propose a new two-level priority scheduling policy to improve overall performance by optimizing the effective degree of parallelism. Then, we propose ICMT, a full, detailed solution for intra-core multitasking for GPGPUs, including architectural support and a contention-aware workload scheduling algorithm that improves all the key factors in a balanced fashion. Furthermore, we propose a new contention-aware analytical performance model that provides fine-grained workload scheduling decisions for intra-core multitasking, including detailed resource allocation from co-scheduled workloads.
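
The idea of memory-access-aware DVFS under an execution-time constraint can be illustrated with a toy model. The sketch below is not the dissertation's technique: the function name `pick_frequency`, the timing model (compute time scales with 1/f while off-chip memory time does not), and the power model (static power plus a cubic dynamic term) are all assumptions chosen for illustration.

```python
def pick_frequency(freqs_ghz, cpu_gcycles, mem_seconds, deadline_s,
                   static_w=1.0, dyn_coeff=1.0):
    """Return the frequency with the lowest modeled energy that meets the deadline.

    Assumed (hypothetical) models:
      time(f)   = cpu_gcycles / f + mem_seconds   # memory time ignores core clock
      power(f)  = static_w + dyn_coeff * f**3     # simple cubic dynamic power
      energy(f) = power(f) * time(f)
    Returns None when no candidate frequency meets the deadline.
    """
    best = None  # (frequency, energy) of the best feasible candidate so far
    for f in freqs_ghz:
        t = cpu_gcycles / f + mem_seconds
        if t > deadline_s:
            continue  # violates the execution-time constraint
        energy = (static_w + dyn_coeff * f ** 3) * t
        if best is None or energy < best[1]:
            best = (f, energy)
    return None if best is None else best[0]


# A memory-heavy phase: 2 Gcycles of compute plus 1 s of off-chip memory time.
print(pick_frequency([1.0, 2.0, 3.0], cpu_gcycles=2.0, mem_seconds=1.0,
                     deadline_s=2.5))
```

In this toy model, the larger the memory-bound share of execution time, the less a higher clock shortens the run, so the lowest feasible frequency tends to win.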

    RowCore: A Processing-Near-Memory Architecture for Big Data Machine Learning

    The technology-push of die stacking and application-pull of Big Data machine learning (BDML) have created a unique opportunity for processing-near-memory (PNM). This paper makes four contributions: (1) While previous PNM work explores general MapReduce workloads, we identify three workload characteristics: (a) irregular-and-compute-light (i.e., perform only a few operations per input word which include data-dependent branches and indirect memory accesses); (b) compact (i.e., the computation has a small intermediate live data and uses only a small amount of contiguous input data); and (c) memory-row-dense (i.e., process the input data without skipping over many bytes). We show that BDMLs have or can be transformed to have these characteristics which, except for irregularity, are necessary for bandwidth- and energy-efficient PNM, irrespective of the architecture. (2) Based on these characteristics, we propose RowCore, a row-oriented PNM architecture, which (pre)fetches and operates on entire memory rows to exploit BDMLs’ row-density. Instead of this row-centric access and compute-schedule, traditional architectures opportunistically improve row locality while fetching and operating on cache blocks. (3) RowCore employs well-known MIMD execution to handle BDMLs’ irregularity, and sequential prefetch of input data to hide memory latency. In RowCore, however, one corelet prefetches a row for all the corelets which may stray far from each other due to their MIMD execution. Consequently, a leading corelet may prematurely evict the prefetched data before a lagging corelet has consumed the data. RowCore employs novel cross-corelet flow-control to prevent such eviction. (4) RowCore further exploits its flow-controlled prefetch for frequency scaling based on novel coarse-grain compute-memory rate-matching which decreases (increases) the processor clock speed when the prefetch buffers are empty (full).
Using simulations, we show that RowCore improves performance and energy by 135% and 20%, respectively, over a GPGPU with prefetch, and by 35% and 34% over a multicore with prefetch, when all three architectures use the same resources (i.e., number of cores and on-processor-die memory) and identical die stacking (i.e., GPGPUs/multicores/RowCore and DRAM).
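
The rate-matching rule in contribution (4), which lowers the clock when the prefetch buffers are empty and raises it when they are full, can be sketched as a simple threshold controller. This is only an illustration: the function name `next_clock_mhz`, the occupancy thresholds, the step size, and the clock range are hypothetical values, not parameters from the paper.

```python
def next_clock_mhz(current_mhz, buffer_fill, low=0.25, high=0.75,
                   step_mhz=100, min_mhz=400, max_mhz=1600):
    """Pick the clock for the next interval from prefetch-buffer occupancy.

    buffer_fill is the fraction of prefetch buffer entries holding unconsumed
    data (0.0 = empty, 1.0 = full). All thresholds and limits are assumed.
    """
    if not 0.0 <= buffer_fill <= 1.0:
        raise ValueError("buffer_fill must be in [0, 1]")
    if buffer_fill <= low:
        # Buffers are draining: compute outpaces memory, so slow the clock.
        return max(min_mhz, current_mhz - step_mhz)
    if buffer_fill >= high:
        # Buffers are filling: memory outpaces compute, so speed the clock up.
        return min(max_mhz, current_mhz + step_mhz)
    return current_mhz  # rates roughly matched; hold the current clock


print(next_clock_mhz(1000, 0.0))  # empty buffers: steps down to 900
print(next_clock_mhz(1000, 1.0))  # full buffers: steps up to 1100
```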

    Time-energy Analysis of Multilevel Parallelism in Heterogeneous Clusters: the Case of EEG Classification in BCI Tasks

    Present heterogeneous architectures interconnect nodes including multiple multi-core microprocessors and accelerators that allow different strategies to accelerate the applications and optimize their energy consumption according to the specific power-performance trade-offs. In this paper, a multi-level parallel procedure is proposed to take advantage of all nodes of a heterogeneous CPU-GPU cluster. Two more alternatives have been implemented and experimentally compared and analyzed in terms of both running time and energy consumption. Although the paper considers an evolutionary master-worker algorithm for feature selection in EEG classification, the conclusions from the experimental analysis provided here can be applied frequently, as many other useful bioinformatics and data mining applications show the same master-worker profile as the classification problem considered here. Our parallel approach reduces the running time by a factor of up to 83, while consuming only about 4.9% of the energy consumed by the sequential procedure, in a cluster with 36 CPU cores and 43 GPU compute units. Spanish Ministerio de Ciencia, Innovación y Universidades under grant PGC2018-098813-B-C31 ERDF fun
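
The two headline figures can be combined into an average-power ratio, since average power is energy divided by time. The snippet below is a back-of-the-envelope check, assuming (the abstract does not say so) that the factor of 83 and the 4.9% figure describe the same run; `relative_average_power` is a hypothetical helper name.

```python
def relative_average_power(speedup, energy_fraction):
    """Average power of the parallel run relative to the sequential run.

    With E_par = energy_fraction * E_seq and T_par = T_seq / speedup,
    P_par / P_seq = (E_par / T_par) / (E_seq / T_seq)
                  = energy_fraction * speedup.
    """
    return energy_fraction * speedup


ratio = relative_average_power(speedup=83, energy_fraction=0.049)
print(f"{ratio:.2f}")  # prints 4.07
```

Under that assumption the cluster draws roughly four times the sequential run's average power, but for a much shorter time, which is why total energy still falls.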

    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of their interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity.
ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPGPUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors. Dissertation/Thesis. Doctoral Dissertation, Computer Science 201

    Efficient and Scalable Computing for Resource-Constrained Cyber-Physical Systems: A Layered Approach

    With the evolution of computing and communication technology, cyber-physical systems such as self-driving cars, unmanned aerial vehicles, and mobile cognitive robots are achieving increasing levels of multifunctionality and miniaturization, enabling them to execute versatile tasks in a resource-constrained environment. Therefore, the computing systems that power these resource-constrained cyber-physical systems (RCCPSs) have to achieve high efficiency and scalability. First of all, given a fixed amount of onboard energy, these computing systems should not only be power-efficient but also exhibit sufficiently high performance to gracefully handle complex algorithms for learning-based perception and AI-driven decision-making. Meanwhile, scalability requires that the current computing system and its components can be extended both horizontally, with more resources, and vertically, with emerging advanced technology. To achieve efficient and scalable computing systems in RCCPSs, my research broadly investigates a set of techniques and solutions via a bottom-up layered approach. This layered approach leverages the characteristics of each system layer (e.g., the circuit, architecture, and operating system layers) and their interactions to discover and explore the optimal system tradeoffs among performance, efficiency, and scalability. At the circuit layer, we investigate the benefits of novel power delivery and management schemes enabled by integrated voltage regulators (IVRs). Then, between the circuit and microarchitecture/architecture layers, we present a voltage-stacked power delivery system that offers best-in-class power delivery efficiency for many-core systems. After this, using Graphics Processing Units (GPUs) as a case study, we develop a real-time resource scheduling framework at the architecture and operating system layers for heterogeneous computing platforms with guaranteed task deadlines. 
Finally, fast dynamic voltage and frequency scaling (DVFS) based power management across the circuit, architecture, and operating system layers is studied through a learning-based hierarchical power management strategy for multi-/many-core systems.

    DRLCap: Runtime GPU Frequency Capping with Deep Reinforcement Learning

    Power and energy consumption is the limiting factor of modern computing systems. As the GPU becomes a mainstream computing device, power management for GPUs becomes increasingly important. Current works focus on GPU kernel-level power management, with challenges in portability due to architecture-specific considerations. We present DRLCap, a general runtime power management framework intended to support power management across various GPU architectures. It periodically monitors system-level information to dynamically detect program phase changes and model the workload and GPU system behavior. This freedom from kernel-specific constraints enhances adaptability and responsiveness. The framework leverages dynamic GPU frequency capping, which is the most widely used power knob, to control the power consumption. DRLCap employs deep reinforcement learning (DRL) to adapt to changes in program phases by automatically adjusting its power policy through online learning, aiming to reduce the GPU power consumption without significantly compromising the application performance. We evaluate DRLCap on three NVIDIA and one AMD GPU architectures. Experimental results show that DRLCap improves on prior GPU power optimization strategies by a large margin. On average, it reduces the GPU energy consumption by 22% with less than 3% performance slowdown on NVIDIA GPUs. This translates to a 20% improvement in the energy efficiency measured by the energy-delay product (EDP) over the NVIDIA default GPU power management strategy. For the AMD GPU architecture, DRLCap saves energy consumption by 10%, on average, with a 4% performance loss, and improves energy efficiency by 8%.
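
The relation between the NVIDIA figures can be checked with the definition of the energy-delay product, EDP = energy × delay. The snippet below is a rough sanity check only: it treats the reported averages as if they applied to a single run, which per-benchmark averages need not satisfy, and `edp_change` is a hypothetical helper name.

```python
def edp_change(energy_reduction, slowdown):
    """Fractional EDP improvement for given energy savings and slowdown.

    EDP = energy * delay, so relative to the baseline:
      EDP_new / EDP_old = (1 - energy_reduction) * (1 + slowdown)
    A positive return value means EDP got smaller (better).
    """
    rel_energy = 1.0 - energy_reduction
    rel_delay = 1.0 + slowdown
    return 1.0 - rel_energy * rel_delay


# 22% energy reduction with a slowdown at the stated 3% upper bound.
print(f"{edp_change(0.22, 0.03):.3f}")  # prints 0.197
```

The result, about 19.7%, is in line with the roughly 20% EDP improvement quoted for NVIDIA GPUs.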