207 research outputs found

    Adaptive Knobs for Resource Efficient Computing

    Get PDF
    Performance demands of emerging domains such as artificial intelligence, machine learning and vision, Internet-of-things etc., continue to grow. Meeting such requirements on modern multi/many core systems with higher power densities, fixed power and energy budgets, and thermal constraints exacerbates the run-time management challenge. This leaves an open problem on extracting the required performance within the power and energy limits, while also ensuring thermal safety. Existing architectural solutions including asymmetric and heterogeneous cores and custom acceleration improve performance-per-watt in specific design time and static scenarios. However, satisfying applications’ performance requirements under dynamic and unknown workload scenarios subject to varying system dynamics of power, temperature and energy requires intelligent run-time management. Adaptive strategies are necessary for maximizing resource efficiency, considering i) diverse requirements and characteristics of concurrent applications, ii) dynamic workload variation, iii) core-level heterogeneity and iv) power, thermal and energy constraints. This dissertation proposes such adaptive techniques for efficient run-time resource management to maximize performance within fixed budgets under unknown and dynamic workload scenarios. Resource management strategies proposed in this dissertation comprehensively consider application and workload characteristics and variable effect of power actuation on performance for pro-active and appropriate allocation decisions. Specific contributions include i) run-time mapping approach to improve power budgets for higher throughput, ii) thermal aware performance boosting for efficient utilization of power budget and higher performance, iii) approximation as a run-time knob exploiting accuracy performance trade-offs for maximizing performance under power caps at minimal loss of accuracy and iv) co-ordinated approximation for heterogeneous systems through joint actuation of dynamic approximation and power knobs for performance guarantees with minimal power consumption. The approaches presented in this dissertation focus on adapting existing mapping techniques, performance boosting strategies, software and dynamic approximations to meet the performance requirements, simultaneously considering system constraints. The proposed strategies are compared against relevant state-of-the-art run-time management frameworks to qualitatively evaluate their efficacy

    Energy-Efficient and Reliable Computing in Dark Silicon Era

    Get PDF
    Dark silicon denotes the phenomenon that, due to thermal and power constraints, the fraction of transistors that can operate at full frequency is decreasing in each technology generation. Moore’s law and Dennard scaling had been backed and coupled appropriately for five decades to bring commensurate exponential performance via single core and later muti-core design. However, recalculating Dennard scaling for recent small technology sizes shows that current ongoing multi-core growth is demanding exponential thermal design power to achieve linear performance increase. This process hits a power wall where raises the amount of dark or dim silicon on future multi/many-core chips more and more. Furthermore, from another perspective, by increasing the number of transistors on the area of a single chip and susceptibility to internal defects alongside aging phenomena, which also is exacerbated by high chip thermal density, monitoring and managing the chip reliability before and after its activation is becoming a necessity. The proposed approaches and experimental investigations in this thesis focus on two main tracks: 1) power awareness and 2) reliability awareness in dark silicon era, where later these two tracks will combine together. In the first track, the main goal is to increase the level of returns in terms of main important features in chip design, such as performance and throughput, while maximum power limit is honored. In fact, we show that by managing the power while having dark silicon, all the traditional benefits that could be achieved by proceeding in Moore’s law can be also achieved in the dark silicon era, however, with a lower amount. Via the track of reliability awareness in dark silicon era, we show that dark silicon can be considered as an opportunity to be exploited for different instances of benefits, namely life-time increase and online testing. We discuss how dark silicon can be exploited to guarantee the system lifetime to be above a certain target value and, furthermore, how dark silicon can be exploited to apply low cost non-intrusive online testing on the cores. After the demonstration of power and reliability awareness while having dark silicon, two approaches will be discussed as the case study where the power and reliability awareness are combined together. The first approach demonstrates how chip reliability can be used as a supplementary metric for power-reliability management. While the second approach provides a trade-off between workload performance and system reliability by simultaneously honoring the given power budget and target reliability

    Power-Performance Modeling and Adaptive Management of Heterogeneous Mobile Platforms​

    Get PDF
    abstract: Nearly 60% of the world population uses a mobile phone, which is typically powered by a system-on-chip (SoC). While the mobile platform capabilities range widely, responsiveness, long battery life and reliability are common design concerns that are crucial to remain competitive. Consequently, state-of-the-art mobile platforms have become highly heterogeneous by combining a powerful SoC with numerous other resources, including display, memory, power management IC, battery and wireless modems. Furthermore, the SoC itself is a heterogeneous resource that integrates many processing elements, such as CPU cores, GPU, video, image, and audio processors. Therefore, CPU cores do not dominate the platform power consumption under many application scenarios. Competitive performance requires higher operating frequency, and leads to larger power consumption. In turn, power consumption increases the junction and skin temperatures, which have adverse effects on the device reliability and user experience. As a result, allocating the power budget among the major platform resources and temperature control have become fundamental consideration for mobile platforms. Dynamic thermal and power management algorithms address this problem by putting a subset of the processing elements or shared resources to sleep states, or throttling their frequencies. However, an adhoc approach could easily cripple the performance, if it slows down the performance-critical processing element. Furthermore, mobile platforms run a wide range of applications with time varying workload characteristics, unlike early generations, which supported only limited functionality. As a result, there is a need for adaptive power and performance management approaches that consider the platform as a whole, rather than focusing on a subset. Towards this need, our specific contributions include (a) a framework to dynamically select the Pareto-optimal frequency and active cores for the heterogeneous CPUs, such as ARM big.Little architecture, (b) a dynamic power budgeting approach for allocating optimal power consumption to the CPU and GPU using performance sensitivity models for each PE, (c) an adaptive GPU frame time sensitivity prediction model to aid power management algorithms, and (d) an online learning algorithm that constructs adaptive run-time models for non-stationary workloads.Dissertation/ThesisDoctoral Dissertation Electrical Engineering 201

    Thermal Management for Dependable On-Chip Systems

    Get PDF
    This thesis addresses the dependability issues in on-chip systems from a thermal perspective. This includes an explanation and analysis of models to show the relationship between dependability and tempature. Additionally, multiple novel methods for on-chip thermal management are introduced aiming to optimize thermal properties. Analysis of the methods is done through simulation and through infrared thermal camera measurements

    Machine Learning for Resource-Constrained Computing Systems

    Get PDF
    Die verfügbaren Ressourcen in Informationsverarbeitungssystemen wie Prozessoren sind in der Regel eingeschränkt. Das umfasst z. B. die elektrische Leistungsaufnahme, den Energieverbrauch, die Wärmeabgabe oder die Chipfläche. Daher ist die Optimierung der Verwaltung der verfügbaren Ressourcen von größter Bedeutung, um Ziele wie maximale Performanz zu erreichen. Insbesondere die Ressourcenverwaltung auf der Systemebene hat über die (dynamische) Zuweisung von Anwendungen zu Prozessorkernen und über die Skalierung der Spannung und Frequenz (dynamic voltage and frequency scaling, DVFS) einen großen Einfluss auf die Performanz, die elektrische Leistung und die Temperatur während der Ausführung von Anwendungen. Die wichtigsten Herausforderungen bei der Ressourcenverwaltung sind die hohe Komplexität von Anwendungen und Plattformen, unvorhergesehene (zur Entwurfszeit nicht bekannte) Anwendungen oder Plattformkonfigurationen, proaktive Optimierung und die Minimierung des Laufzeit-Overheads. Bestehende Techniken, die auf einfachen Heuristiken oder analytischen Modellen basieren, gehen diese Herausforderungen nur unzureichend an. Aus diesem Grund ist der Hauptbeitrag dieser Dissertation der Einsatz maschinellen Lernens (ML) für Ressourcenverwaltung. ML-basierte Lösungen ermöglichen die Bewältigung dieser Herausforderungen durch die Vorhersage der Auswirkungen potenzieller Entscheidungen in der Ressourcenverwaltung, durch Schätzung verborgener (unbeobachtbarer) Eigenschaften von Anwendungen oder durch direktes Lernen einer Ressourcenverwaltungs-Strategie. Diese Dissertation entwickelt mehrere neuartige ML-basierte Ressourcenverwaltung-Techniken für verschiedene Plattformen, Ziele und Randbedingungen. Zunächst wird eine auf Vorhersagen basierende Technik zur Maximierung der Performanz von Mehrkernprozessoren mit verteiltem Last-Level Cache und limitierter Maximaltemperatur vorgestellt. Diese verwendet ein neuronales Netzwerk (NN) zur Vorhersage der Auswirkungen potenzieller Migrationen von Anwendungen zwischen Prozessorkernen auf die Performanz. Diese Vorhersagen erlauben die Bestimmung der bestmöglichen Migration und ermöglichen eine proaktive Verwaltung. Das NN ist so trainiert, dass es mit unbekannten Anwendungen und verschiedenen Temperaturlimits zurechtkommt. Zweitens wird ein Boosting-Verfahren zur Maximierung der Performanz homogener Mehrkernprozessoren mit limitierter Maximaltemperatur mithilfe von DVFS vorgestellt. Dieses basiert auf einer neuartigen {Boostability}-Metrik, die die Abhängigkeiten von Performanz, elektrischer Leistung und Temperatur auf Spannungs/Frequenz-Änderungen in einer Metrik vereint. % ignorerepeated Die Abhängigkeiten von Performanz und elektrischer Leistung hängen von der Anwendung ab und können zur Laufzeit nicht direkt beobachtet (gemessen) werden. Daher wird ein NN verwendet, um diese Werte für unbekannte Anwendungen zu schätzen und so die Komplexität der Boosting-Optimierung zu bewältigen. Drittens wird eine Technik zur Temperaturminimierung von heterogenen Mehrkernprozessoren mit Quality of Service-Zielen vorgestellt. Diese verwendet Imitationslernen, um eine Migrationsstrategie von Anwendungen aus optimalen Orakel-Demonstrationen zu lernen. Dafür wird ein NN eingesetzt, um die Komplexität der Plattform und des Anwendungsverhaltens zu bewältigen. Die Inferenz des NNs wird mit Hilfe eines vorhandenen generischen Beschleunigers, einer Neural Processing Unit (NPU), beschleunigt. Auch die ML Algorithmen selbst müssen auch mit begrenzten Ressourcen ausgeführt werden. Zuletzt wird eine Technik für ressourcenorientiertes Training auf verteilten Geräten vorgestellt, um einen konstanten Trainingsdurchsatz bei sich schnell ändernder Verfügbarkeit von Rechenressourcen aufrechtzuerhalten, wie es z.~B.~aufgrund von Konflikten bei gemeinsam genutzten Ressourcen der Fall ist. Diese Technik verwendet Structured Dropout, welches beim Training zufällige Teile des NNs auslässt. Dadurch können die erforderlichen Ressourcen für das Training dynamisch angepasst werden -- mit vernachlässigbarem Overhead, aber auf Kosten einer langsameren Trainingskonvergenz. Die Pareto-optimalen Dropout-Parameter pro Schicht des NNs werden durch eine Design Space Exploration bestimmt. Evaluierungen dieser Techniken werden sowohl in Simulationen als auch auf realer Hardware durchgeführt und zeigen signifikante Verbesserungen gegenüber dem Stand der Technik, bei vernachlässigbarem Laufzeit-Overhead. Zusammenfassend zeigt diese Dissertation, dass ML eine Schlüsseltechnologie zur Optimierung der Verwaltung der limitierten Ressourcen auf Systemebene ist, indem die damit verbundenen Herausforderungen angegangen werden

    A survey of system level power management schemes in the dark-silicon era for many-core architectures

    Get PDF
    Power consumption in Complementary Metal Oxide Semiconductor (CMOS) technology has escalated to a point that only a fractional part of many-core chips can be powered-on at a time. Fortunately, this fraction can be increased at the expense of performance through the dark-silicon solution. However, with many-core integration set to be heading towards its thousands, power consumption and temperature increases per time, meaning the number of active nodes must be reduced drastically. Therefore, optimized techniques are demanded for continuous advancement in technology. Existing efforts try to overcome this challenge by activating nodes from different parts of the chip at the expense of communication latency. Other efforts on the other hand employ run-time power management techniques to manage the power performance of the cores trading-off performance for power. We found out that, for a significant amount of power to saved and high temperature to be avoided, focus should be on reducing the power consumption of all the on-chip components. Especially, the memory hierarchy and the interconnect. Power consumption can be minimized by, reducing the size of high leakage power dissipating elements, turning-off idle resources and integrating power saving materials

    Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems

    Get PDF
    This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems, reducing the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems

    Towards Successful Application of Phase Change Memories: Addressing Challenges from Write Operations

    Get PDF
    The emerging Phase Change Memory (PCM) technology is drawing increasing attention due to its advantages in non-volatility, byte-addressability and scalability. It is regarded as a promising candidate for future main memory. However, PCM's write operation has some limitations that pose challenges to its application in memory. The disadvantages include long write latency, high write power and limited write endurance. In this thesis, I present my effort towards successful application of PCM memory. My research consists of several optimizing techniques at both the circuit and architecture level. First, at the circuit level, I propose Differential Write to remove unnecessary bit changes in PCM writes. This is not only beneficial to endurance but also to the energy and latency of writes. Second, I propose two memory scheduling enhancements (AWP and RAWP) for a non-blocking bank design. My memory scheduling enhancements can exploit intra-bank parallelism provided by non-blocking bank design, and achieve significant throughput improvement. Third, I propose Bit Level Power Budgeting (BPB), a fine-grained power budgeting technique that leverages the information from Differential Write to achieve even higher memory throughput under the same power budget. Fourth, I propose techniques to improve the QoS tuning ability of high-priority applications when running on PCM memory. In summary, the techniques I propose effectively address the challenges of PCM's write operations. In addition, I present the experimental infrastructure in this work and my visions of potential future research topics, which could be helpful to other researchers in the area

    Power, Performance, and Energy Management of Heterogeneous Architectures

    Get PDF
    abstract: Many core modern multiprocessor systems-on-chip offers tremendous power and performance optimization opportunities by tuning thousands of potential voltage, frequency and core configurations. Applications running on these architectures are becoming increasingly complex. As the basic building blocks, which make up the application, change during runtime, different configurations may become optimal with respect to power, performance or other metrics. Identifying the optimal configuration at runtime is a daunting task due to a large number of workloads and configurations. Therefore, there is a strong need to evaluate the metrics of interest as a function of the supported configurations. This thesis focuses on two different types of modern multiprocessor systems-on-chip (SoC): Mobile heterogeneous systems and tile based Intel Xeon Phi architecture. For mobile heterogeneous systems, this thesis presents a novel methodology that can accurately instrument different types of applications with specific performance monitoring calls. These calls provide a rich set of performance statistics at a basic block level while the application runs on the target platform. The target architecture used for this work (Odroid XU3) is capable of running at 4940 different frequency and core combinations. With the help of instrumented application vast amount of characterization data is collected that provides details about performance, power and CPU state at every instrumented basic block across 19 different types of applications. The vast amount of data collected has enabled two runtime schemes. The first work provides a methodology to find optimal configurations in heterogeneous architecture using classifiers and demonstrates an average increase of 93%, 81% and 6% in performance per watt compared to the interactive, ondemand and powersave governors, respectively. The second work using same data shows a novel imitation learning framework for dynamically controlling the type, number, and the frequencies of active cores to achieve an average of 109% PPW improvement compared to the default governors. This work also presents how to accurately profile tile based Intel Xeon Phi architecture while training different types of neural networks using open image dataset on deep learning framework. The data collected allows deep exploratory analysis. It also showcases how different hardware parameters affect performance of Xeon Phi.Dissertation/ThesisMasters Thesis Engineering 201
    corecore