12 research outputs found

    Enhancing Two-Phase Cooling Efficiency through Thermal- Aware Workload Mapping for Power-Hungry Servers

    Get PDF
    The power density and, consequently, power hungriness of server processors is growing by the day. Traditional air cooling systems fail to cope with such high heat densities, whereas single-phase liquid-cooling still requires high mass flowrate, high pumping power, and large facility size. On the contrary, in a micro-scale gravity-driven thermosyphon attached on top of a processor, the refrigerant, absorbing the heat, turns into a two-phase mixture. The vapor-liquid mixture exchanges heat with a coolant at the condenser side, turns back to liquid state, and descends thanks to gravity, eliminating the need for pumping power. However, similar to other cooling technologies, thermosyphon efficiency can considerably vary with respect to workload performance requirements and thermal profile, in addition to the platform features, such as packaging and die floorplan. In this work, we first address the workload- and platform-aware design of a two-phase thermosyphon. Then, we propose a thermal-aware workload mapping strategy considering the potential and limitations of a two-phase thermosyphon to further minimize hot spots and spatial thermal gradients. Our experiments, performed on an 8-core Intel Xeon E5 CPU reveal, on average, up to 10ºC reduction in thermal hot spots, and 45% reduction in the maximum spatial thermal gradient on the die. Moreover, our design and mapping strategy are able to decrease the chiller cooling power at least by 45%

    A Survey of Prediction and Classification Techniques in Multicore Processor Systems

    Get PDF
    In multicore processor systems, being able to accurately predict the future provides new optimization opportunities, which otherwise could not be exploited. For example, an oracle able to predict a certain application\u27s behavior running on a smart phone could direct the power manager to switch to appropriate dynamic voltage and frequency scaling modes that would guarantee minimum levels of desired performance while saving energy consumption and thereby prolonging battery life. Using predictions enables systems to become proactive rather than continue to operate in a reactive manner. This prediction-based proactive approach has become increasingly popular in the design and optimization of integrated circuits and of multicore processor systems. Prediction transforms from simple forecasting to sophisticated machine learning based prediction and classification that learns from existing data, employs data mining, and predicts future behavior. This can be exploited by novel optimization techniques that can span across all layers of the computing stack. In this survey paper, we present a discussion of the most popular techniques on prediction and classification in the general context of computing systems with emphasis on multicore processors. The paper is far from comprehensive, but, it will help the reader interested in employing prediction in optimization of multicore processor systems

    Optimization-based power and thermal management for dark silicon aware 3D chip multiprocessors using heterogeneous cache hierarchy

    Get PDF
    Management of a problem recently known as “dark silicon” is a new challenge in multicore designs. Prior innovative studies have addressed the dark silicon problem in the fields of power-efficient core design. However, addressing dark silicon challenges in uncore component designs such as cache hierarchy, on-chip interconnect etc. that consume significant portion of the on-chip power consumption is largely unexplored. In this paper, for the first time, we propose an integrated approach which considers the impact of power consumption of core and uncore components simultaneously to improve multi/many-core performance in the dark silicon era. The proposed approach dynamically (1) predicts the changing program behavior on each core; (2) re-determines frequency/voltage, cache capacity and technology in each level of the cache hierarchy based on the program's scalability in order to satisfy the power and temperature constraints. In the proposed architecture, for future chip-multiprocessors (CMPs), we exploit emerging technologies such as non-volatile memories (NVMs) and 3D techniques to combat dark silicon. Also, for the first time, we propose a detailed power model which is useful for future dark silicon CMPs power modeling. Experimental results on SPEC 2000/2006 benchmarks show that the proposed method improves throughput by about 54.3% and energy-delay product by about 61% on average, respectively, in comparison with the conventional CMP architecture with homogenous cache system. (A preliminary short version of this work was presented in the 18th Euromicro Conference on Digital System Design (DSD), 2015.) © 2017 Elsevier B.V

    Managing lifetime reliability, performance, and power tradeoffs in multicore microarchitectures

    Get PDF
    The objective of this research is to characterize and manage lifetime reliability, microarchitectural performance, and power tradeoffs in multicore processors. This dissertation is comprised of three research themes; 1) modeling and simulation method of interacting multicore processor physics, 2) characterization and management of performance and lifetime reliability tradeoff, and 3) extending Amdahl’s Law for understanding lifetime reliability, performance, and energy efficiency of heterogeneous processors. With continued technology scaling, processor operations are increasingly dominated by multiple distinct physical phenomena and their coupled interactions. Understanding these behaviors requires the modeling of complex physical interactions. This dissertation first presents a novel simulation framework that orchestrates interactions between multiple physical models and microarchitecture simulators to enable research explorations at the intersection of application, microarchitecture, energy, power, thermal, and reliability. Using this framework, workload-induced variation of device degradation is characterized, and its impacts on processor lifetime and performance are analyzed. This research introduces a new metric to quantify performance-reliability tradeoff. Lastly, the theoretical models of heterogeneous multicore processors are proposed for understanding performance, energy efficiency, and lifetime reliability consequences. It is shown that these system metrics are governed by Amdahl’s Law and correlated as a function of processor composition, scheduling method, and Amdahl’s scaling factor. This dissertation highlights the importance of multidimensional analysis and extends the scope of microarchitectural studies by incorporating the physical aspects of processor operations and designs.Ph.D

    Computational Sprinting: Exceeding Sustainable Power in Thermally Constrained Systems

    Get PDF
    Although process technology trends predict that transistor sizes will continue to shrink for a few more generations, voltage scaling has stalled and thus future chips are projected to be increasingly more power hungry than previous generations. Particularly in mobile devices which are severely cooling constrained, it is estimated that the peak operation of a future chip could generate heat ten times faster than than the device can sustainably vent. However, many mobile applications do not demand sustained performance; rather they comprise short bursts of computation in response to sporadic user activity. To improve responsiveness for such applications, this dissertation proposes computational sprinting, in which a system greatly exceeds sustainable power margins (by up to 10Ã?) to provide up to a few seconds of high-performance computation when a user interacts with the device. Computational sprinting exploits the material property of thermal capacitance to temporarily store the excess heat generated when sprinting. After sprinting, the chip returns to sustainable power levels and dissipates the stored heat when the system is idle. This dissertation: (i) broadly analyzes thermal, electrical, hardware, and software considerations to analyze the feasibility of engineering a system which can provide the responsiveness of a plat- form with 10Ã? higher sustainable power within today\u27s cooling constraints, (ii) leverages existing sources of thermal capacitance to demonstrate sprinting on a real system today, and (iii) identifies the energy-performance characteristics of sprinting operation to determine runtime sprint pacing policies

    Architectural-Physical Co-Design of 3D CPUs with Micro-Fluidic Cooling

    Get PDF
    The performance, energy efficiency and cost improvements due to traditional technology scaling have begun to slow down and present diminishing returns. Underlying reasons for this trend include fundamental physical limits of transistor scaling, the growing significance of quantum effects as transistors shrink, and a growing mismatch between transistors and interconnects regarding size, speed and power. Continued Moore's Law scaling will not come from technology scaling alone, and must involve improvements to design tools and development of new disruptive technologies such as 3D integration. 3D integration presents potential improvements to interconnect power and delay by translating the routing problem into a third dimension, and facilitates transistor density scaling independent of technology node. Furthermore, 3D IC technology opens up a new architectural design space of heterogeneously-integrated high-bandwidth CPUs. Vertical integration promises to provide the CPU architectures of the future by integrating high performance processors with on-chip high-bandwidth memory systems and highly connected network-on-chip structures. Such techniques can overcome the well-known CPU performance bottlenecks referred to as memory and communication wall. However the promising improvements to performance and energy efficiency offered by 3D CPUs does not come without cost, both in the financial investments to develop the technology, and the increased complexity of design. Two main limitations to 3D IC technology have been heat removal and TSV reliability. Transistor stacking creates increases in power density, current density and thermal resistance in air cooled packages. Furthermore the technology introduces vertical through silicon vias (TSVs) that create new points of failure in the chip and require development of new BEOL technologies. Although these issues can be controlled to some extent using thermal-reliability aware physical and architectural 3D design techniques, high performance embedded cooling schemes, such as micro-fluidic (MF) cooling, are fundamentally necessary to unlock the true potential of 3D ICs. A new paradigm is being put forth which integrates the computational, electrical, physical, thermal and reliability views of a system. The unification of these diverse aspects of integrated circuits is called Co-Design. Independent design and optimization of each aspect leads to sub-optimal designs due to a lack of understanding of cross-domain interactions and their impacts on the feasibility region of the architectural design space. Co-Design enables optimization across layers with a multi-domain view and thus unlocks new high-performance and energy efficient configurations. Although the co-design paradigm is becoming increasingly necessary in all fields of IC design, it is even more critical in 3D ICs where, as we show, the inter-layer coupling and higher degree of connectivity between components exacerbates the interdependence between architectural parameters, physical design parameters and the multitude of metrics of interest to the designer (i.e. power, performance, temperature and reliability). In this dissertation we present a framework for multi-domain co-simulation and co-optimization of 3D CPU architectures with both air and MF cooling solutions. Finally we propose an approach for design space exploration and modeling within the new Co-Design paradigm, and discuss the possible avenues for improvement of this work in the future

    Enabling efficient graph computing with near-data processing techniques

    Get PDF
    With the emergence of data science, graph computing is becoming a crucial tool for processing big connected data. However, when mapped to modern computing systems, graph computing typically suffers from poor performance because of inefficiencies in memory subsystems. At the same time, emerging technologies, such as Hybrid Memory Cube (HMC), enable processing-in-memory (PIM) functionality, a promising technique of near-data processing (NDP), by integrating compute units in the 3D-stacked logic layer. The PIM units allows operation offloading at an instruction level, which has considerable potential to overcome the performance bottleneck of graph computing. Nevertheless, studies have not fully explored this functionality for graph workloads or identified its applications and shortcomings. The main objective of this dissertation is to enable NDP techniques for efficient graph computing. Specifically, it investigates the PIM offloading at instruction level. To achieve this goal, it presents a graph benchmark suite for understanding graph computing behaviors, and then proposes architectural techniques for PIM offloading on various host platforms. This dissertation first presents GraphBIG, a comprehensive graph benchmark suite. To cover major graph computation types and data sources, GraphBIG selects representative data representations, workloads, and datasets from 21 real-world use cases of multiple application domains. This dissertation characterized the benchmarks on real machines and observed extremely irregular memory patterns and significant diverse behaviors across various computation types. GraphBIG helps users understand the behavior of modern graph computing on hardware architectures and enables future architecture and system research for graph computing. To achieve better performance of graph computing, this dissertation proposes GraphPIM, a full-stack NDP solution for graph computing. This dissertation performs an analysis on modern graph workloads to assess the applicability of PIM offloading and presents hardware and software mechanisms to efficiently make use of the PIM functionality. Following the real-world HMC 2.0 specification, GraphPIM provides performance benefits for graph applications without any user code modification and ISA changes. In addition, GraphPIM proposes an extension to PIM operations that can further bring performance benefits for more graph applications. The evaluation results show that GraphPIM achieves up to a 2.4X speedup with a 37% reduction in energy consumption. To effectively utilize NDP systems with GPU-based host architectures that can fully utilize hundreds of gigabytes of bandwidth, this dissertation explores managing the thermal constraints of 3D-stacked memory cubes. Based on the real experiment with an HMC prototype, this study observes that the operating temperature of HMC is much higher than conventional DRAM, which can even cause thermal shutdown with a passive cooling solution. In addition, it also shows that even with a commodity-server cooling solution, HMC can fail to maintain the temperature of the memory dies within the normal operating range when in-memory processing is highly utilized, thereby resulting in higher energy consumption and performance overhead. To this end, this dissertation proposes CoolPIM, a thermal-aware source throttling mechanism that controls the intensity of PIM offloading on runtime. The proposed technique keeps the memory dies of HMC within the normal operating temperature using software-based techniques. The evaluation results show that CoolPIM achieves up to 1.4X and 1.37X speedups compared to non-offloading and naïve offloading scenarios.Ph.D

    Hardware/Software Co-design for Multicore Architectures

    Get PDF
    Siirretty Doriast

    Power-constrained aware and latency-aware microarchitectural optimizations in many-core processors

    Get PDF
    As the transistor budgets outpace the power envelope (the power-wall issue), new architectural and microarchitectural techniques are needed to improve, or at least maintain, the power efficiency of next-generation processors. Run-time adaptation, including core, cache and DVFS adaptations, has recently emerged as a promising area to keep the pace for acceptable power efficiency. However, none of the adaptation techniques proposed so far is able to provide good results when we consider the stringent power budgets that will be common in the next decades, so new techniques that attack the problem from several fronts using different specialized mechanisms are necessary. The combination of different power management mechanisms, however, bring extra levels of complexity, since other factors such as workload behavior and run-time conditions must also be considered to properly allocate power among cores and threads. To address the power issue, this thesis first proposes Chrysso, an integrated and scalable model-driven power management that quickly selects the best combination of adaptation methods out of different core and uncore micro-architecture adaptations, per-core DVFS, or any combination thereof. Chrysso can quickly search the adaptation space by making performance/power projections to identify Pareto-optimal configurations, effectively pruning the search space. Chrysso achieves 1.9x better chip performance over core-level gating for multi-programmed workloads, and 1.5x higher performance for multi-threaded workloads. Most existing power management schemes use a centralized approach to regulate power dissipation. Unfortunately, the complexity and overhead of centralized power management increases significantly with core count rendering it in-viable at fine-grain time slices. The work leverages a two-tier hierarchical power manager. This solution is highly scalable with low overhead on a tiled many-core architecture with shared LLC and per-tile DVFS at fine-grain time slices. The global power is first distributed across tiles using GPM and then within a tile (in parallel across all tiles). Additionally, this work also proposes DVFS and cache-aware thread migration (DCTM) to ensure optimum per-tile co-scheduling of compatible threads at runtime over the two-tier hierarchical power manager. DCTM outperforms existing solutions by up to 12% on adaptive many-core tile processor. With the advancements in the core micro-architectural techniques and technology scaling, the performance gap between the computational component and memory component is increasing significantly (the memory-wall issue). To bridge this gap, the architecture community is pushing forward towards multi-core architecture with on-die near-memory DRAM cache memory (faster than conventional DRAM). Gigascale DRAM Caches poses a problem of how to efficiently manage the tags. The Tags-in-DRAM designs aims at efficiently co-locate tags with data, but it still suffer from high latency especially in multi-way associativity. The thesis finally proposes Tag Cache mechanism, an on-chip distributed tag caching mechanism with limited space and latency overhead to bypass the tag read operation in multi-way DRAM Caches, thereby reducing hit latency. Each Tag Cache, stored in L2, stores tag information of the most recently used DRAM Cache ways. The Tag Cache is able to exploit temporal locality of the DRAM Cache, thereby contributing to on average 46% of the DRAM Cache hits.A mesura que el consum dels transistors supera el nivell de potència desitjable es necessiten noves tècniques arquitectòniques i microarquitectòniques per millorar, o almenys mantenir, l'eficiència energètica dels processadors de les pròximes generacions. L'adaptació en temps d'execució, tant de nuclis com de les cachés, així com també adaptacions DVFS són idees que han sorgit recentment que fan preveure que sigui un àrea prometedora per mantenir un ritme d'eficiència energètica acceptable. Tanmateix, cap de les tècniques d'adaptació proposades fins ara és capaç d'oferir bons resultats si tenim en compte les restriccions estrictes de potència que seran comuns a les pròximes dècades. És convenient definir noves tècniques que ataquin el problema des de diversos fronts utilitzant diferents mecanismes especialitzats. La combinació de diferents mecanismes de gestió d'energia porta aparellada nivells addicionals de complexitat, ja que altres factors com ara el comportament de la càrrega de treball així com condicions específiques de temps d'execució també han de ser considerats per assignar adequadament la potència entre els nuclis del sistema computador. Per tractar el tema de la potència, aquesta tesi proposa en primer lloc Chrysso, una administració d'energia integrada i escalable que selecciona ràpidament la millor combinació entre diferents adaptacions microarquitectòniques. Chrysso pot buscar ràpidament l'adaptació adequada al fer projeccions òptimes de rendiment i potència basades en configuracions de Pareto, permetent així reduir de manera efectiva l'espai de cerca. Chrysso arriba a un rendiment de 1,9 sobre tècniques convencionals d'inhibició de portes amb una càrrega d'aplicacions seqüencials; i un rendiment de 1,5 quan les aplicacions corresponen a programes parla·lels. La majoria dels sistemes de gestió d'energia existents utilitzen un enfocament centralitzat per regular la dissipació d'energia. Malauradament, la complexitat i el temps d'administració s'incrementen significativament amb una gran quantitat de nuclis. En aquest treball es defineix un gestor jeràrquic de potència basat en dos nivells. Aquesta solució és altament escalable amb baix cost operatiu en una arquitectura de múltiples nuclis integrats en clústers, amb memòria caché de darrer nivell compartida a nivell de cluster, i DVFS establert en intervals de temps de gra fi a nivell de clúster. La potència global es distribueix en primer lloc a través dels clústers utilitzant GPM i després es distribueix dins un clúster (en paral·lel si es consideren tots els clústers). A més, aquest treball també proposa DVFS i migració de fils conscient de la memòria caché (DCTM) que garanteix una òptima distribució de tasques entre els nuclis. DCTM supera les solucions existents fins a un 12%. Amb els avenços en la tecnologia i les tècniques de micro-arquitectura de nuclis, la diferència de rendiment entre el component computacional i la memòria està augmentant significativament. Per omplir aquest buit, s'està avançant cap a arquitectures de múltiples nuclis amb memòries caché integrades basades en DRAM. Aquestes memòries caché DRAM a gran escala plantegen el problema de com gestionar de forma eficaç les etiquetes. Els dissenys de cachés amb dades i etiquetes juntes són un primer pas, però encara pateixen per tenir una alta latència, especialment en cachés amb un grau alt d'associativitat. En aquesta tesi es proposa l'estudi d'una tècnica anomenada Tag Cache, un mecanisme distribuït d'emmagatzematge d'etiquetes, que redueix la latència de les operacions de lectura d'etiquetes en les memòries caché DRAM. Cada Tag Cache, que resideix a L2, emmagatzema la informació de les vies que s'han accedit recentment de les memòries caché DRAM. D'aquesta manera es pot aprofitar la localitat temporal d'una caché DRAM, fet que contribueix en promig en un 46% dels encerts en les caché DRAM
    corecore