18 research outputs found

    Design and Analysis of an Adaptive Asynchronous System Architecture for Energy Efficiency

    Get PDF
    Power has become a critical design parameter for digital CMOS integrated circuits. With performance still a central concern, one idea has emerged: minimize power consumption while maintaining performance. Dynamic voltage scaling (DVS) combined with parallelism has been shown to be an effective way of saving power while maintaining performance. However, the potency of DVS and parallelism in traditional, clocked synchronous systems is limited by the strict timing requirements such systems must comply with. Delay-insensitive (DI) asynchronous systems stand to benefit more from these techniques thanks to their flexible timing requirements and high modularity. This dissertation presents the design and analysis of a real-time adaptive DVS architecture for paralleled Multi-Threshold NULL Convention Logic (MTNCL) systems. Results show that energy-efficient systems with low area overhead can be created using this approach.
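The trade-off this abstract relies on follows from the classic dynamic-power model P = C·V²·f: two parallel units at half frequency can run at a lower supply voltage, cutting power at the same nominal throughput. A minimal numeric sketch (the 0.7x voltage figure is an illustrative assumption, not a result from the dissertation):

```python
def dynamic_power(c_eff, vdd, freq):
    """Classic CMOS dynamic-power model: P = C_eff * Vdd^2 * f."""
    return c_eff * vdd ** 2 * freq

# One unit at full voltage/frequency vs. two parallel units at half
# frequency and a (hypothetical) 0.7x supply voltage:
serial = dynamic_power(c_eff=1.0, vdd=1.0, freq=1.0)
parallel = 2 * dynamic_power(c_eff=1.0, vdd=0.7, freq=0.5)
# Same nominal throughput, but the parallel design draws roughly half the power.
```

The quadratic dependence on Vdd is what makes voltage scaling so much more rewarding than frequency scaling alone.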

    Cooperative Power Management for Chip Multiprocessors using Space-Shared Scheduling

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2015. 8. Bernhard Egger.์ตœ๊ทผ Cloud Computing ์„œ๋น„์Šค๋ฅผ ์ œ๊ณตํ•˜๋Š” ๋ฐ์ดํ„ฐ์„ผํ„ฐ ๋“ฑ์—์„œ๋Š” Many-core chip์ด ๊ธฐ์กด Multi-core๋ฅผ ๋Œ€์ฒดํ•˜์—ฌ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ์œผ๋ฉฐ Operating System๋„ Many-core ์‹œ์Šคํ…œ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ Space-sharing ๋ฐฉ์‹์œผ๋กœ ์„ค๊ณ„๊ฐ€ ๋ณ€๊ฒฝ๋˜๊ณ  ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ์ถ”์„ธ์†์—์„œ ๊ธฐ์กด์˜ ์ „ํ†ต์ ์ธ DVFS ๋ฐฉ์‹์„ ์ด์šฉํ•ด์„œ๋Š” Many-core ํ™˜๊ฒฝ์—์„œ ํšจ์œจ์ ์ธ ์ „๋ ฅ ์‚ฌ์šฉ์ด ์–ด๋ ต๊ธฐ ๋•Œ๋ฌธ์— ์ถ”๊ฐ€์ ์ธ ์ „๋ ฅ ๊ด€๋ฆฌ ๋ฐฉ๋ฒ•๊ณผ Many-core์˜ ํŠน์„ฑ์„ ๊ณ ๋ คํ•œ Core ์žฌ๋ฐฐ์น˜ ๊ธฐ์ˆ ์ด ํ•„์š”ํ•˜๋‹ค. Space-shared OS๋Š” Core์™€ ๋ฌผ๋ฆฌ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ์˜ ๊ตฌ์„ฑ์— ๋Œ€ํ•œ ์ž์› ๊ด€๋ฆฌ๋ฅผ ํ•˜๋Š”๋ฐ, ์ตœ๊ทผ์˜ Chip multiprocessor (CMP) ๋“ค์€ ๊ฐ๊ฐ์˜ Core์—์„œ ๋…๋ฆฝ์ ์œผ๋กœ DVFS๋ฅผ ๋™์ž‘ํ•˜๋„๋ก ํ•˜์ง€ ์•Š๊ณ  ๋ช‡๊ฐœ์˜ Core๋“ค์„ ๊ทธ๋ฃนํ™”ํ•˜์—ฌ Voltage ๋˜๋Š” Frequency๋ฅผ ํ•จ๊ป˜ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ๋„๋ก ์ง€์›ํ•˜๊ณ  ์žˆ์œผ๋ฉฐ ๋ฉ”๋ชจ๋ฆฌ ๋˜ํ•œ Coarse-grained ๋ฐฉ์‹์œผ๋กœ ๋…๋ฆฝ๋œ ํŒŒํ‹ฐ์…˜์œผ๋กœ ํ• ๋‹น ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๊ด€๋ฆฌ๋œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ๋Š” ์ด๋Ÿฌํ•œ CMP์˜ ํŠน์„ฑ์„ ๊ณ ๋ คํ•˜์—ฌ Core ์žฌ๋ฐฐ์น˜์™€ DVFS ๊ธฐ์ˆ ์„ ์ด์šฉํ•œ ๊ณ„์ธต์  ์ „๋ ฅ ๊ด€๋ฆฌ ์‹œ์Šคํ…œ์„ ์—ฐ๊ตฌํ•˜๋Š”๋ฐ ๋ชฉํ‘œ๊ฐ€ ์žˆ๋‹ค. ํŠนํžˆ Core ์žฌ๋ฐฐ์น˜ ๊ธฐ์ˆ ์€ Core์˜ ์œ„์น˜์— ๋”ฐ๋ฅธ Data ์„ฑ๋Šฅ๋„ ํ•จ๊ป˜ ๊ณ ๋ คํ•˜๊ณ  ์žˆ๋‹ค. ์ด์— ์ถ”๊ฐ€๋กœ DVFS ์„ฑ๋Šฅ ์†์‹ค์„ ๊ณ ๋ คํ•œ ์—๋„ˆ์ง€ ํšจ์œจ์„ฑ ์ƒ์Šน๊ณผ Core ์žฌ๋ฐฐ์น˜์‹œ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋Š” ํšจ๊ณผ๋ฅผ ๋ฏธ๋ฆฌ ๊ณ„์‚ฐํ•˜์—ฌ ์ตœ์†Œํ•œ์˜ ์„ฑ๋Šฅ์ €ํ•˜๋กœ ๋” ์ข‹์€ ์—๋„ˆ์ง€ ํšจ์œจ์„ฑ์„ ์–ป์„ ์ˆ˜ ์žˆ๋„๋ก ์—ฐ๊ตฌ๋ฅผ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ๋˜ํ•œ ์‹ค์ œ ๊ตฌํ˜„ ๋ฐ ์‹คํ—˜์€ Intel์—์„œ ์ถœ์‹œํ•œ Single-chip Cloud Computer (SCC)์—์„œ ์ง„ํ–‰ํ•˜์˜€์œผ๋ฉฐ ์‹œ๋‚˜๋ฆฌ์˜ค๋ณ„๋กœ 1-2%์˜ ์„ฑ๋Šฅ ์†์‹ค๋กœ Performance per watt ratio๊ฐ€ 27-32% ํ–ฅ์ƒ๋˜์—ˆ๋‹ค. 
๋˜ํ•œ Migration ํšจ๊ณผ์™€ Data ์ง€์—ญ์„ฑ ๋“ฑ์„ ๊ณ ๋ คํ•˜์ง€ ์•Š์•˜๋˜ ๊ธฐ์กด ์—ฐ๊ตฌ๋ณด๋‹ค ์„ฑ๋Šฅ์ด 5-11% ์ข‹์•„์กŒ๋‹ค.Nowadays, many-core chips are especially attractive for data center operators to provide cloud computing service models. The trend in operating system designs, furthermore, is changing from traditional time-sharing to space-shared approaches to support recent many-core architectures. These CPU and OS changes make power and thermal constraints becoming one of most important design issues. Additional power management methods and core re-allocation techniques are necessary to overcome the limitations of traditional dynamic voltage and frequency scaling (DVFS). In this thesis, we present a cooperative hierarchical power management for many-core systems running a space-shared operating system. We consider two levels of space-shared system resources: space in the form of cores and physical memory. Recent chip multiprocessors (CMPs) provide group-level DVFS in which the voltage/frequency of cores is managed at the level of several cores instead of every single core. Memory is also allocated by a coarse-grained resource manager to isolate space partitions. Our research reflects these characteristics of CMPs. We show how to integrate core re-allocation and DVFS techniques through cooperative hierarchical power management. The core re-allocation technique considers the data performance in dependence of the core location. In addition, two important factors are performance loss caused by DVFS and the benefit of core re-allocation. We have implemented this framework on the Intel Single Chip Cloud Computer (SCC) and achieve a 27-32% better performance per watt ratio than naive DVFS policies at the expense of a minimal 1-2% overall performance loss. 
Furthermore, we have achieved a 5-11% higher performance than previous research with a migration technique that uses a naive migration algorithm that does also not consider the migration benefit and data locality.Abstract i Contents iii List of Figures vi List of Tables viii Chapter 1 Introduction 1 Chapter 2 Related Work 4 Chapter 3 Many-core Architectures 6 3.1 The Intel Single-chip Cloud Computer 6 3.1.1 Architecture Overview 6 3.1.2 Memory Addressing 7 3.1.3 DVFS Capabilities 8 3.2 Tilera 10 3.2.1 Architecture Overview 10 3.2.2 Memory Architecture 10 3.2.3 Switch Interface and Mesh 11 Chapter 4 Zero-copy OS Migration 13 4.1 Cooperative OS Migration 14 4.2 Migration Steps 14 4.3 Migration Volatile State 15 4.4 Networking 16 Chapter 5 Cooperative Hierarchical Power Management 17 5.1 Cooperative Core Re-Allocation 17 5.2 Hierarchical Organization 18 Chapter 6 Core Re-Allocation and DVFS Policies 21 6.1 Core Re-Allocation Considerations 22 6.2 Core Re-Allocation Algorithm 24 6.3 Evaluation of Core Re-Allocation 27 6.4 DVFS Policies 28 Chapter 7 Experimentation and Evaluation 29 7.1 Experimental Setup 29 7.2 Power Management Considerations 30 7.2.1 DVFS Performance Loss 31 7.2.2 Migration Benefit 32 7.2.3 Data-location Aware Migration 33 7.3 Results 34 7.3.1 Synthetic Periodic Workload 34 7.3.2 Profiled Workload 37 7.3.3 World Cup Workload 40 7.3.4 Overall Results 40 Chapter 8 Conclusion 43 APPENDICES 43 Chapter A Profiled Workload Benchmark Scenarios 44 A.1 Synthetic Benchmark Scenario based on Periodic Workloads 45 A.1.1 Synthetic Benchmark Scenario 1 45 A.1.2 Synthetic Benchmark Scenario 2 45 A.2 Memory Synthetic Benchmark Scenario based on Periodic Workloads 46 A.2.1 Memory Synthetic Benchmark Scenario 1 46 A.2.2 Memory Synthetic Benchmark Scenario 2 46 A.3 Benchmark Scenario based on Profiled Workloads 47 A.3.1 Profiled Benchmark Scenario 1 47 A.3.2 Profiled Benchmark Scenario 2 47 A.3.3 Profiled Benchmark Scenario 3 48 ์š”์•ฝ 54 Acknowledgements 55Maste
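The idea of precomputing the benefit of a core re-allocation before performing it reduces to an energy comparison. A minimal sketch, assuming a fixed planning horizon, a known one-time migration cost, and a linear model (all names and the model itself are illustrative, not the thesis's actual estimator):

```python
def should_migrate(power_saved_w, horizon_s, migration_cost_j, dvfs_loss_j=0.0):
    """Re-allocate a core only when the energy saved over the planning
    horizon outweighs the one-time migration cost plus the energy
    equivalent of any DVFS-induced performance loss."""
    return power_saved_w * horizon_s > migration_cost_j + dvfs_loss_j

# Saving 2 W over a 10 s horizon justifies a 5 J migration;
# saving only 0.1 W does not.
```

Precomputing this kind of check is what lets the framework avoid migrations whose cost would eat the energy they save.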

    Resource and thermal management in 3D-stacked multi-/many-core systems

    Full text link
    Continuous semiconductor technology scaling and the rapid increase in computational needs have stimulated the emergence of multi-/many-core processors. While up to hundreds of cores can be placed on a single chip, the performance capacity of the cores cannot be fully exploited due to high latencies of interconnects and memory, high power consumption, and low manufacturing yield in traditional (2D) chips. 3D stacking is an emerging technology that aims to overcome these limitations of 2D designs by stacking processor dies over each other and using through-silicon-vias (TSVs) for on-chip communication, and thus, provides a large amount of on-chip resources and shortens communication latency. These benefits, however, are limited by challenges in high power densities and temperatures. 3D stacking also enables integrating heterogeneous technologies into a single chip. One example of heterogeneous integration is building many-core systems with silicon-photonic network-on-chip (PNoC), which reduces on-chip communication latency significantly and provides higher bandwidth compared to electrical links. However, silicon-photonic links are vulnerable to on-chip thermal and process variations. These variations can be countered by actively tuning the temperatures of optical devices through micro-heaters, but at the cost of substantial power overhead. This thesis claims that unearthing the energy efficiency potential of 3D-stacked systems requires intelligent and application-aware resource management. Specifically, the thesis improves energy efficiency of 3D-stacked systems via three major components of computing systems: cache, memory, and on-chip communication. We analyze characteristics of workloads in computation, memory usage, and communication, and present techniques that leverage these characteristics for energy-efficient computing. 
This thesis introduces 3D cache resource pooling, a cache design that allows for flexible heterogeneity in cache configuration across a 3D-stacked system and improves cache utilization and system energy efficiency. We also demonstrate the impact of resource pooling on a real prototype 3D system with scratchpad memory. At the main memory level, we claim that utilizing heterogeneous memory modules and memory object level management significantly helps with energy efficiency. This thesis proposes a memory management scheme at a finer granularity: memory object level, and a page allocation policy to leverage the heterogeneity of available memory modules and cater to the diverse memory requirements of workloads. On the on-chip communication side, we introduce an approach to limit the power overhead of PNoC in (3D) many-core systems through cross-layer thermal management. Our proposed thermally-aware workload allocation policies coupled with an adaptive thermal tuning policy minimize the required thermal tuning power for PNoC, and in this way, help broader integration of PNoC. The thesis also introduces techniques in placement and floorplanning of optical devices to reduce optical loss and, thus, laser source power consumption.
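Object-level management on heterogeneous memory modules can be caricatured as a greedy hotness-based allocator. A minimal sketch (the two-tier fast/slow split, the accesses-per-byte metric, and all names are assumptions for illustration, not the thesis's actual policy):

```python
def place_objects(objects, fast_capacity):
    """Greedily place the hottest memory objects (accesses per byte)
    into the fast module until it is full; the rest go to slow memory.

    objects: list of (name, size_bytes, access_count) tuples.
    """
    placement, used = {}, 0
    for name, size, _ in sorted(objects, key=lambda o: o[2] / o[1], reverse=True):
        if used + size <= fast_capacity:
            placement[name] = "fast"
            used += size
        else:
            placement[name] = "slow"
    return placement
```

Working at object rather than page granularity lets such a policy see access intensity per allocation site instead of averaging it over a whole page.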

    Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks

    Full text link
    On-chip networks are especially vulnerable to within-die parameter variations. Since they connect distant parts of the chip, they need to be designed to work under the most unfavorable parameter values in the chip. This results in energy-inefficient designs. To improve the energy efficiency of on-chip networks, this paper presents a novel approach that relies on monitoring the errors of messages as they traverse the network. Based on the observed errors of messages, the system dynamically decreases or increases the voltage (Vdd) of groups of network routers. With this approach, called Tangle, the different Vdd values applied to different groups of network routers progressively converge to their lowest, variation-aware, error-free values, always keeping the network frequency unchanged. This saves substantial network energy. In a simulated 64-router network with 4 Vdd domains, Tangle reduces the network energy consumption by an average of 22% with negligible performance impact. In a future network design with one Vdd domain per router, Tangle lowers the network Vdd by an average of 21%, reducing the network energy consumption by an average of 28% with negligible performance impact.
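Tangle's per-group convergence can be sketched as a feedback loop that keeps stepping a router group's Vdd down while messages stay error-free (millivolt units, step size, and the `has_errors` probe are illustrative; the real mechanism reacts to message errors observed at runtime, not to a queryable oracle):

```python
def converge_vdd(has_errors, start_mv=1000, step_mv=20, floor_mv=700):
    """Lower a router group's supply voltage step by step until the next
    step would produce message errors, then hold at the last safe value.
    The network frequency is never touched."""
    vdd = start_mv
    while vdd - step_mv >= floor_mv and not has_errors(vdd - step_mv):
        vdd -= step_mv
    return vdd

# A group whose routers start failing below 840 mV settles at 840 mV:
safe = converge_vdd(lambda mv: mv < 840)
```

Because each group converges independently, groups in fast process corners end up at lower voltages than groups in slow corners, which is where the variation-aware savings come from.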

    RUNTIME METHODS TO IMPROVE ENERGY EFFICIENCY IN SUPERCOMPUTING APPLICATIONS

    Get PDF
    Energy efficiency in supercomputing is critical to limit operating costs and carbon footprints. While the energy efficiency of future supercomputing centers needs to improve at all levels, the energy consumed by the processing units is a large fraction of the total energy consumed by High Performance Computing (HPC) systems. HPC applications use a parallel programming paradigm like the Message Passing Interface (MPI) to coordinate computation and communication among thousands of processors. With dynamically-changing factors both in hardware and software affecting energy usage of processors, there exists a need for power monitoring and regulation at runtime to achieve savings in energy. This dissertation highlights an adaptive runtime framework that enables processors with core-specific power control by dynamically adapting to workload characteristics to reduce power with little or no performance impact. Two opportunities to improve the energy efficiency of processors running MPI applications are identified - computational workload imbalance and waiting on memory. Monitoring of performance and power regulation is performed by the framework transparently within the MPI runtime system, eliminating the need for code changes to MPI applications. The effect of enforcing power limits (capping) on processors is also investigated. Experiments on 32 nodes (1024 cores) show that in presence of workload imbalance, the runtime reduces Central Processing Unit (CPU) frequency on cores not on the critical path, thereby reducing power and hence energy usage without deteriorating performance. Using this runtime, six MPI mini-applications and a full MPI application show an overall 20% decrease in energy use with less than 1% increase in execution time. In addition, the lowering of frequency on non-critical cores reduces run-to-run performance variation and improves performance. 
For the full application, an average speedup of 11% is seen, while power is lowered by about 31%, for energy savings of up to 42%. Another experiment on 16 power-capped nodes (256 cores) also shows performance improvement along with power reduction. Thus, energy optimization can also be a performance optimization. For applications that are limited by memory access times, the memory metrics we identify facilitate lowering power by up to 32% without adversely impacting performance.
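The core mechanism, lowering frequency only on cores off the critical path, amounts to picking, per core, the lowest frequency whose slowdown still fits within the critical-path time. A toy sketch assuming work scales linearly with frequency (the frequency set and the measurement model are hypothetical, not the framework's actual policy):

```python
def pick_frequencies(busy_times, freqs=(1.2, 1.8, 2.4)):
    """busy_times: per-core compute time (s) measured at the top
    frequency. Returns, per core, the lowest frequency (GHz) at which
    the core still finishes within the critical-path time."""
    critical = max(busy_times)
    top = max(freqs)
    chosen = []
    for t in busy_times:
        for f in sorted(freqs):
            if t * top / f <= critical:   # projected time at frequency f
                chosen.append(f)
                break
    return chosen
```

Cores on the critical path keep the top frequency, so overall execution time is unchanged while the lightly loaded cores burn less power.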

    Addressing Manufacturing Challenges in NoC-based ULSI Designs

    Full text link
    Hernรกndez Luz, C. (2012). Addressing Manufacturing Challenges in NoC-based ULSI Designs [Tesis doctoral no publicada]. Universitat Politรจcnica de Valรจncia. https://doi.org/10.4995/Thesis/10251/1669

    Design for Reliability and Low Power in Emerging Technologies

    Get PDF
    The continued shrinking of transistor feature sizes is one of the most important drivers of growth in the semiconductor industry. For decades, both the integration density and the complexity of circuits have increased, a trend that spans all modern process nodes. Until recently, shrinking transistors went hand in hand with a reduction of the supply voltage, which lowered power consumption and kept the power density constant. With the advent of nanometer feature sizes, however, this scaling slowed down. Numerous difficulties, such as physical manufacturing limits and non-idealities in supply-voltage scaling, led to rising power densities and, with them, aggravated reliability problems. These include, among others, transistor aging and excessive heating, not least through the increased occurrence of self-heating effects within the transistors. To keep such problems from compromising a circuit's reliability, internal signal delays are usually estimated very pessimistically. The resulting timing guardband ensures correct functionality, but at the cost of performance. Alternatively, reliability can be increased through other techniques, such as operation at the zero-temperature coefficient or approximate computing. Although these techniques can recover a large share of the usual timing guardband, they bring their own consequences and trade-offs.
Persistent challenges in scaling CMOS technologies have also led to a stronger focus on promising emerging technologies. One example is the Negative Capacitance Field-Effect Transistor (NCFET), which shows a remarkable performance gain over conventional FinFET transistors and could replace them in the future. Furthermore, circuit designers increasingly rely on complex, parallel structures instead of higher clock frequencies; such designs require modern power-management techniques in every aspect of the design. With the arrival of novel transistor technologies such as NCFET, these power-management techniques must be re-evaluated, because the underlying dependencies and proportions change. This thesis presents new approaches to both the analysis and the modeling of circuit reliability that address the aforementioned challenges at several design levels. They divide into conventional techniques (a)-(d) and unconventional techniques (e) and (f), as follows:
(a) Analysis of the performance gains obtained when maximizing power efficiency by operating near the transistor threshold voltage, in particular at the optimal performance point. Accurately determining such an optimal performance point is especially challenging in multicore designs, since it shifts with the optimization objectives and the workload.
(b) Revealing hidden interdependencies between transistor aging and supply-voltage fluctuations caused by IR drops. A novel technique is presented that avoids both over- and underestimation when determining the timing guardband and thus finds the smallest yet sufficient guardband.
(c) Mitigation of transistor aging through "graceful approximation", a technique for raising the clock frequency on demand. The aging-induced timing guardband is replaced by approximate-computing techniques, and quantization is used to guarantee sufficient computational accuracy.
(d) Mitigation of temperature-dependent delay degradation by operating near the zero-temperature coefficient (N-ZTC). Operation at N-ZTC minimizes temperature-induced variations in performance and power consumption. Qualitative and quantitative comparisons with the traditional timing guardband are presented.
(e) Modeling of power-management techniques for NCFET-based processors. NCFET technology has unique properties under which conventional runtime voltage and frequency scaling (DVS/DVFS) yields suboptimal results, calling for the NCFET-specific power-management techniques presented in this thesis.
(f) A novel heterogeneous multicore design in NCFET technology. The design consists of identical cores; heterogeneity arises from applying each core's individually optimal configuration. Amdahl's law is extended to cover new system- and application-specific parameters and to demonstrate the advantages of the new design.
The presented techniques are evaluated using gate-level implementations and simulations, and system-level simulators are used to implement and simulate multicore designs. Validation and comparison against the state of the art rely on analytical, gate-level, and system-level simulations covering both synthetic and real applications.

    Cross-Layer Approaches for an Aging-Aware Design of Nanoscale Microprocessors

    Get PDF
    Thanks to aggressive scaling of transistor dimensions, computers have revolutionized our life. However, the increasing unreliability of devices fabricated in nanoscale technologies emerged as a major threat for the future success of computers. In particular, accelerated transistor aging is of great importance, as it reduces the lifetime of digital systems. This thesis addresses this challenge by proposing new methods to model, analyze and mitigate aging at microarchitecture-level and above

    NUMA-Aware Hierarchical Power Management for Chip Multiprocessors

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2017. 8. Bernhard Egger.๋Œ€์นญํ˜• ๋‹ค์ค‘ ์ฒ˜๋ฆฌ ์šด์˜์ฒด์ œ๋ฅผ ์‹คํ–‰ ์‹œํ‚ค๋Š” ์บ์‰ฌ ์ผ๊ด€์„ฑ์„ ๊ฐ€์ง€๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜๋ฅผ ์œ„ํ•œ ์ „ํ†ต์ ์ธ ์ ‘๊ทผ ๋ฐฉ๋ฒ•์€ ์ „๋ ฅ๊ด€๋ฆฌ๊ฐ€ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋ฌธ์ œ ์ค‘ ํ•˜๋‚˜๋กœ ์กด์žฌํ•˜๋Š” ๋ฏธ๋ž˜์˜ ๋งค๋‹ˆ์ฝ”์–ด ์‹œ์Šคํ…œ์—๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋งค๋‹ˆ์ฝ”์–ด ์‹œ์Šคํ…œ์„ ์œ„ํ•œ ๊ณ„์ธต์  ์ „๋ ฅ๊ด€๋ฆฌ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์†Œ๊ฐœํ•œ๋‹ค. ์ œ์•ˆํ•œ ํ”„๋ ˆ์ž„์›Œํฌ๋Š” ์บ์‰ฌ ์ผ๊ด€์„ฑ์„ ๊ฐ€์ง€๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•„์š” ์—†์œผ๋ฉฐ, ๋‹ค์ˆ˜์˜ ์ฝ”์–ด๋“ค์ด ์ „์••/์ฃผํŒŒ์ˆ˜๋ฅผ ๊ณต์œ ํ•˜๊ณ  ๋‹ค์ค‘ ์ „์••/๋‹ค์ค‘ ์ฃผํŒŒ์ˆ˜๋ฅผ ์ง€์›ํ•˜๋Š” ์•„ํ‚คํ…์ฒ˜์—์„œ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•˜๋‹ค. ์ด ํ”„๋ ˆ์ž„์›Œํฌ๋Š” NUMA-์ธ์ง€ ๊ณ„์ธต์  ์ „๋ ฅ๊ด€๋ฆฌ ๊ธฐ์ˆ ๋กœ ๋™์  ์ „์•• ๋ฐ ์ฃผํŒŒ์ˆ˜ ๊ตํ™˜(DVFS)๊ณผ ์›Œํฌ๋กœ๋“œ ๋งˆ์ด๊ทธ๋ž˜์ด์…˜์„ ์‚ฌ์šฉํ•œ๋‹ค. ์—ฌ๊ธฐ์„œ ์›Œํฌ๋กœ๋“œ ๋งˆ์ด๊ทธ๋ž˜์ด์…˜ ๊ณ„ํš์„ ์œ„ํ•ด ์‚ฌ์šฉ๋œ ํƒ์š• ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์„œ๋กœ ์ƒ์ถฉํ•˜๋Š” ๋น„์Šทํ•œ ์ž‘์—…๋Ÿ‰์˜ ํŒจํ„ด์„ ๊ฐ€์ง„ ์ž‘์—…์„ ๊ฐ™์€ ์ „์•• ์˜์—ญ์œผ๋กœ ๋ชจ์œผ๋Š” ๋ชฉํ‘œ์™€ ์ž‘์—…์„ ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๋Š” ์œ„์น˜์™€ ๊ฐ€๊นŒ์šด ๊ณณ์œผ๋กœ ์ด๋™ํ•˜๋Š” ๋ชฉํ‘œ๋ฅผ ๊ณ ๋ คํ•œ๋‹ค. ์ œ์•ˆ๋œ ํ”„๋ ˆ์ž„์›Œํฌ๋Š” ์†Œํ”„ํŠธ์›จ์–ด๋กœ ๊ตฌํ˜„๋˜์–ด ์บ์‰ฌ ์ผ๊ด€์„ฑ์ด ์—†๋Š” 48 ์ฝ”์–ด์˜ ์นฉ ๋ ˆ๋ฒจ ๋ฉ€ํ‹ฐํ”„๋กœ์„ธ์„œ ํ•˜๋“œ์›จ์–ด์—์„œ ํ‰๊ฐ€๋˜์—ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์˜ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๋ฐ ์ดํ„ฐ ์„ผํ„ฐ ์ž‘์—… ํŒจํ„ด์œผ๋กœ ๊ด‘๋ฒ”์œ„์— ๊ฑธ์นœ ์‹คํ—˜์„ ์ˆ˜ํ–‰ํ•œ ๊ฒฐ๊ณผ ์ตœ์ฒจ๋‹จ์˜ DVFS ๊ธฐ์ˆ ๊ณผ DVFS์™€ NUMA-๋น„์ธ์ง€ ์›Œํฌ๋กœ๋“œ ๋งˆ์ด๊ทธ๋ž˜์ด์…˜์„ ๊ฐ™์ด ์‚ฌ์šฉํ•œ ์ „๋ ฅ๊ด€๋ฆฌ ๊ธฐ์ˆ ์— ๋น„ํ•ด ์ƒ๋Œ€์ ์œผ๋กœ ๊ฐ๊ฐ 30%์™€ 5%์˜ ์ „๋ ฅ์†Œ๋ชจ๋‹น ์ฒ˜๋ฆฌ ์ž‘์—…๋Ÿ‰ ํ–ฅ์ƒ์„ ํฐ ์„ฑ๋Šฅ์†์‹ค ์—†์ด ์ด๋ฃจ์—ˆ๋‹ค.Traditional approaches for cache-coherent shared-memory architectures running symmetric multiprocessing (SMP) operating systems are not adequate for future many-core chips where power management presents one of the most important challenges. 
In this thesis, we present a hierarchical power management framework for many-core systems. The framework does not require coherent shared memory and supports multiple voltage/multiple-frequency (MVMF) architectures where several cores share the same voltage/frequency. We propose a hierarchical NUMA-aware power management technique that combines dynamic voltage and frequency scaling (DVFS) with workload migration. A greedy algorithm considers the conflicing goals of grouping workloads with similar utilization patterns in voltage domains and placing workloads as close as possible to their data. We implement the proposed scheme in software and evaluated it on existing hardware, a non-cache-coherent 48-core CMP. Compared to state-of-the-art power management techniques using DVFS-only and DVFS with NUMA-unaware migration, we achieve on average, a relative performance-per-watt improvement of 30 and 5 percent, respectively, for a wide range of datacenter workloads at no significant performance degradation.1 Introduction 1 2 Motivation and RelatedWork 5 2.1 Characteristics of Chip Multiprocessors 5 2.2 Dynamic Voltage and Frequency Scaling 7 2.3 Power Management on CMPs 8 2.4 Related Work 10 3 Cooperative Power Management 13 3.1 Cooperative Workload Migration 13 3.2 Hierarchical Organization 14 3.3 Domain Controllers 15 3.3.1 Core Controller 15 3.3.2 Frequency Controller 15 3.3.3 Voltage Controller 16 3.3.4 Chip Controller 16 3.3.5 Location of the Controllers 16 4 DVFS andWorkload Migration Policies 18 4.1 DVFS Policies 18 4.2 Phase Ordering and Frequency Considerations 19 4.3 Migration of Workloads 20 4.4 Scheduling Workload Migration 20 4.4.1 Schedule migration 21 4.4.2 Level migration 22 4.4.3 Assign target 25 4.4.4 Assign victim 26 4.5 Workload Migration Evaluation Model 27 5 Implementation 29 5.1 The Intel Single-chip Cloud Computer 29 5.2 Implementing Workload Migration 31 5.2.1 Migration Steps 31 5.2.2 Networking 33 5.3 Domain Controller Implementation 33 6 
Experimental Setup 34 6.1 Hardware 34 6.2 Benchmark Scenarios 35 6.3 Comparison of Results 37 7 Results 38 7.1 Synthetic Scenarios 38 7.2 Datacenter Scenarios 42 7.2.1 Varying Number of Workloads 42 7.2.2 Independent Workloads 45 7.3 Overall Results Comparison 46 8 Discussion 48 8.1 Limitations 48 8.2 Extra Hardware Support 49 9 Conclusion 50 Appendices 51 A Benchmark Scenario Details 51 A.1 Synthetic Benchmark 53 A.2 Real World Benchmark 56 Bibliography 67 ์š”์•ฝ 73Maste
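The greedy algorithm's two conflicting goals, grouping similarly utilized workloads into one voltage domain versus keeping a workload near its data, can be captured by a weighted cost over candidate domains. A minimal sketch (the linear cost, the domain-index distance metric, and the weights are assumptions for illustration, not the thesis's actual policy):

```python
def choose_domain(util, data_domain, domains, alpha=1.0, beta=0.5):
    """Pick the voltage domain minimizing a weighted sum of utilization
    mismatch and distance from the workload's data.

    domains: {domain_id: [utilizations of workloads already placed there]}
    """
    def cost(d):
        members = domains[d]
        avg = sum(members) / len(members) if members else 0.0
        return alpha * abs(util - avg) + beta * abs(d - data_domain)
    return min(domains, key=cost)
```

Tuning alpha against beta shifts the policy between pure utilization clustering (best for DVFS) and pure data locality (best for memory performance).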