9 research outputs found

    Cooperative Power Management for Chip Multiprocessors using Space-Shared Scheduling

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2015. 8. Bernhard Egger.์ตœ๊ทผ Cloud Computing ์„œ๋น„์Šค๋ฅผ ์ œ๊ณตํ•˜๋Š” ๋ฐ์ดํ„ฐ์„ผํ„ฐ ๋“ฑ์—์„œ๋Š” Many-core chip์ด ๊ธฐ์กด Multi-core๋ฅผ ๋Œ€์ฒดํ•˜์—ฌ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ์œผ๋ฉฐ Operating System๋„ Many-core ์‹œ์Šคํ…œ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ Space-sharing ๋ฐฉ์‹์œผ๋กœ ์„ค๊ณ„๊ฐ€ ๋ณ€๊ฒฝ๋˜๊ณ  ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ์ถ”์„ธ์†์—์„œ ๊ธฐ์กด์˜ ์ „ํ†ต์ ์ธ DVFS ๋ฐฉ์‹์„ ์ด์šฉํ•ด์„œ๋Š” Many-core ํ™˜๊ฒฝ์—์„œ ํšจ์œจ์ ์ธ ์ „๋ ฅ ์‚ฌ์šฉ์ด ์–ด๋ ต๊ธฐ ๋•Œ๋ฌธ์— ์ถ”๊ฐ€์ ์ธ ์ „๋ ฅ ๊ด€๋ฆฌ ๋ฐฉ๋ฒ•๊ณผ Many-core์˜ ํŠน์„ฑ์„ ๊ณ ๋ คํ•œ Core ์žฌ๋ฐฐ์น˜ ๊ธฐ์ˆ ์ด ํ•„์š”ํ•˜๋‹ค. Space-shared OS๋Š” Core์™€ ๋ฌผ๋ฆฌ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ์˜ ๊ตฌ์„ฑ์— ๋Œ€ํ•œ ์ž์› ๊ด€๋ฆฌ๋ฅผ ํ•˜๋Š”๋ฐ, ์ตœ๊ทผ์˜ Chip multiprocessor (CMP) ๋“ค์€ ๊ฐ๊ฐ์˜ Core์—์„œ ๋…๋ฆฝ์ ์œผ๋กœ DVFS๋ฅผ ๋™์ž‘ํ•˜๋„๋ก ํ•˜์ง€ ์•Š๊ณ  ๋ช‡๊ฐœ์˜ Core๋“ค์„ ๊ทธ๋ฃนํ™”ํ•˜์—ฌ Voltage ๋˜๋Š” Frequency๋ฅผ ํ•จ๊ป˜ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ๋„๋ก ์ง€์›ํ•˜๊ณ  ์žˆ์œผ๋ฉฐ ๋ฉ”๋ชจ๋ฆฌ ๋˜ํ•œ Coarse-grained ๋ฐฉ์‹์œผ๋กœ ๋…๋ฆฝ๋œ ํŒŒํ‹ฐ์…˜์œผ๋กœ ํ• ๋‹น ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๊ด€๋ฆฌ๋œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ๋Š” ์ด๋Ÿฌํ•œ CMP์˜ ํŠน์„ฑ์„ ๊ณ ๋ คํ•˜์—ฌ Core ์žฌ๋ฐฐ์น˜์™€ DVFS ๊ธฐ์ˆ ์„ ์ด์šฉํ•œ ๊ณ„์ธต์  ์ „๋ ฅ ๊ด€๋ฆฌ ์‹œ์Šคํ…œ์„ ์—ฐ๊ตฌํ•˜๋Š”๋ฐ ๋ชฉํ‘œ๊ฐ€ ์žˆ๋‹ค. ํŠนํžˆ Core ์žฌ๋ฐฐ์น˜ ๊ธฐ์ˆ ์€ Core์˜ ์œ„์น˜์— ๋”ฐ๋ฅธ Data ์„ฑ๋Šฅ๋„ ํ•จ๊ป˜ ๊ณ ๋ คํ•˜๊ณ  ์žˆ๋‹ค. ์ด์— ์ถ”๊ฐ€๋กœ DVFS ์„ฑ๋Šฅ ์†์‹ค์„ ๊ณ ๋ คํ•œ ์—๋„ˆ์ง€ ํšจ์œจ์„ฑ ์ƒ์Šน๊ณผ Core ์žฌ๋ฐฐ์น˜์‹œ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋Š” ํšจ๊ณผ๋ฅผ ๋ฏธ๋ฆฌ ๊ณ„์‚ฐํ•˜์—ฌ ์ตœ์†Œํ•œ์˜ ์„ฑ๋Šฅ์ €ํ•˜๋กœ ๋” ์ข‹์€ ์—๋„ˆ์ง€ ํšจ์œจ์„ฑ์„ ์–ป์„ ์ˆ˜ ์žˆ๋„๋ก ์—ฐ๊ตฌ๋ฅผ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ๋˜ํ•œ ์‹ค์ œ ๊ตฌํ˜„ ๋ฐ ์‹คํ—˜์€ Intel์—์„œ ์ถœ์‹œํ•œ Single-chip Cloud Computer (SCC)์—์„œ ์ง„ํ–‰ํ•˜์˜€์œผ๋ฉฐ ์‹œ๋‚˜๋ฆฌ์˜ค๋ณ„๋กœ 1-2%์˜ ์„ฑ๋Šฅ ์†์‹ค๋กœ Performance per watt ratio๊ฐ€ 27-32% ํ–ฅ์ƒ๋˜์—ˆ๋‹ค. 
๋˜ํ•œ Migration ํšจ๊ณผ์™€ Data ์ง€์—ญ์„ฑ ๋“ฑ์„ ๊ณ ๋ คํ•˜์ง€ ์•Š์•˜๋˜ ๊ธฐ์กด ์—ฐ๊ตฌ๋ณด๋‹ค ์„ฑ๋Šฅ์ด 5-11% ์ข‹์•„์กŒ๋‹ค.Nowadays, many-core chips are especially attractive for data center operators to provide cloud computing service models. The trend in operating system designs, furthermore, is changing from traditional time-sharing to space-shared approaches to support recent many-core architectures. These CPU and OS changes make power and thermal constraints becoming one of most important design issues. Additional power management methods and core re-allocation techniques are necessary to overcome the limitations of traditional dynamic voltage and frequency scaling (DVFS). In this thesis, we present a cooperative hierarchical power management for many-core systems running a space-shared operating system. We consider two levels of space-shared system resources: space in the form of cores and physical memory. Recent chip multiprocessors (CMPs) provide group-level DVFS in which the voltage/frequency of cores is managed at the level of several cores instead of every single core. Memory is also allocated by a coarse-grained resource manager to isolate space partitions. Our research reflects these characteristics of CMPs. We show how to integrate core re-allocation and DVFS techniques through cooperative hierarchical power management. The core re-allocation technique considers the data performance in dependence of the core location. In addition, two important factors are performance loss caused by DVFS and the benefit of core re-allocation. We have implemented this framework on the Intel Single Chip Cloud Computer (SCC) and achieve a 27-32% better performance per watt ratio than naive DVFS policies at the expense of a minimal 1-2% overall performance loss. 
Furthermore, we achieve a 5-11% higher performance than previous research whose naive migration algorithm considers neither the migration benefit nor data locality.

Contents:
Abstract
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
Chapter 3 Many-core Architectures
  3.1 The Intel Single-chip Cloud Computer
    3.1.1 Architecture Overview
    3.1.2 Memory Addressing
    3.1.3 DVFS Capabilities
  3.2 Tilera
    3.2.1 Architecture Overview
    3.2.2 Memory Architecture
    3.2.3 Switch Interface and Mesh
Chapter 4 Zero-copy OS Migration
  4.1 Cooperative OS Migration
  4.2 Migration Steps
  4.3 Migrating Volatile State
  4.4 Networking
Chapter 5 Cooperative Hierarchical Power Management
  5.1 Cooperative Core Re-Allocation
  5.2 Hierarchical Organization
Chapter 6 Core Re-Allocation and DVFS Policies
  6.1 Core Re-Allocation Considerations
  6.2 Core Re-Allocation Algorithm
  6.3 Evaluation of Core Re-Allocation
  6.4 DVFS Policies
Chapter 7 Experimentation and Evaluation
  7.1 Experimental Setup
  7.2 Power Management Considerations
    7.2.1 DVFS Performance Loss
    7.2.2 Migration Benefit
    7.2.3 Data-location Aware Migration
  7.3 Results
    7.3.1 Synthetic Periodic Workload
    7.3.2 Profiled Workload
    7.3.3 World Cup Workload
    7.3.4 Overall Results
Chapter 8 Conclusion
Appendices
Chapter A Profiled Workload Benchmark Scenarios
  A.1 Synthetic Benchmark Scenario based on Periodic Workloads
    A.1.1 Synthetic Benchmark Scenario 1
    A.1.2 Synthetic Benchmark Scenario 2
  A.2 Memory Synthetic Benchmark Scenario based on Periodic Workloads
    A.2.1 Memory Synthetic Benchmark Scenario 1
    A.2.2 Memory Synthetic Benchmark Scenario 2
  A.3 Benchmark Scenario based on Profiled Workloads
    A.3.1 Profiled Benchmark Scenario 1
    A.3.2 Profiled Benchmark Scenario 2
    A.3.3 Profiled Benchmark Scenario 3
Abstract (Korean)
Acknowledgements
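The trade-off this abstract describes, accepting a small performance loss from DVFS and migration in exchange for a better performance-per-watt ratio, can be illustrated with a minimal decision sketch. The function names, the cost model, and the numbers below are hypothetical assumptions for illustration, not the thesis's actual algorithm:

```python
# Hypothetical sketch: deciding whether to re-allocate a workload onto a
# shared voltage domain, trading DVFS/migration performance loss against
# power savings. All names and numbers are illustrative assumptions.

def perf_per_watt(throughput, power):
    return throughput / power

def worth_migrating(cur_tp, cur_power, new_tp, new_power,
                    migration_penalty, max_perf_loss=0.02):
    """Migrate only if performance per watt improves and the combined
    slowdown (DVFS plus one-time migration cost) stays within budget."""
    perf_loss = 1.0 - (new_tp - migration_penalty) / cur_tp
    gain = perf_per_watt(new_tp, new_power) / perf_per_watt(cur_tp, cur_power)
    return gain > 1.0 and perf_loss <= max_perf_loss

# Example: consolidation lets the domain run at a lower voltage, saving
# power at a small throughput cost.
print(worth_migrating(cur_tp=100.0, cur_power=50.0,
                      new_tp=99.0, new_power=38.0,
                      migration_penalty=0.5))  # prints True
```

With these invented numbers, the move yields roughly a 30% perf/watt gain for a 1.5% slowdown, which is the shape of result the abstract reports.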

    Physical Planning and Uncore Power Management for Multi-Core Processors

    For the microprocessor technology of today and the foreseeable future, multi-core is a key engine that drives performance growth under very tight power dissipation constraints. While previous research has mostly focused on individual processor cores, there is a compelling need to study how to efficiently manage the resources shared among cores, including physical space, on-chip communication, and on-chip storage. In managing physical space, floorplanning is the first and most critical step, largely determining the communication efficiency and cost-effectiveness of chip designs. We consider floorplanning with regularity constraints that require identical processing/memory cores to form an array. Such regularity can greatly facilitate design modularity and therefore shorten design turn-around time. Very little attention has been paid to automatic floorplanning under regularity constraints, because manual floorplanning has difficulty handling the complexity as the chip core count increases. In this dissertation work, we investigate regularity constraints in a simulated-annealing-based floorplanner for multi/many-core processor designs. A simple and effective technique is proposed to encode the regularity constraints in a sequence pair, a classic data representation in automatic floorplanning. To the best of our knowledge, this is the first work on regularity-constrained floorplanning in the context of multi/many-core processor designs. On-chip communication and the shared last-level cache (LLC) play a role at least as important as that of processor cores in terms of chip performance and power. This dissertation research studies dynamic voltage and frequency scaling for the on-chip network and LLC, which together form a single uncore voltage and frequency domain. This is in contrast to most previous works, where the network and LLC are partitioned and associated with processor cores based on physical proximity.
The single shared domain largely avoids the interfacing overhead across domain boundaries and is practical and very useful for industrial products. Our goal is to minimize uncore energy dissipation with little (e.g., 5% or less) performance degradation. The first part of this study identifies a metric that reflects the chip performance determined by the uncore voltage/frequency. The second part addresses how to monitor this metric with low overhead and high fidelity. The last part is the control policy that sets the uncore voltage/frequency based on the monitoring results. Our approach is validated through full-system simulations on public architecture benchmarks.
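The three-part approach (a performance metric, low-overhead monitoring, and a control policy) can be illustrated with a toy threshold controller. The frequency table, the linear slowdown model, and the wiring of the 5% budget are assumptions made purely for illustration:

```python
# Illustrative sketch of a metric-driven uncore DVFS policy: pick the lowest
# uncore frequency whose predicted slowdown stays under a 5% budget.
# The metric, the model, and the frequency table are assumptions.

FREQS_GHZ = [1.0, 1.5, 2.0, 2.5, 3.0]  # assumed uncore operating points

def predicted_slowdown(metric, freq, f_max=3.0):
    """Toy linear model: slowdown grows with uncore sensitivity (`metric`,
    0..1, e.g. the fraction of cycles stalled on NoC/LLC accesses) and with
    the relative frequency reduction."""
    return metric * (f_max / freq - 1.0)

def choose_uncore_freq(metric, budget=0.05):
    for f in FREQS_GHZ:  # try the lowest frequency first
        if predicted_slowdown(metric, f) <= budget:
            return f
    return FREQS_GHZ[-1]

print(choose_uncore_freq(0.02))  # uncore-insensitive phase -> 1.0
print(choose_uncore_freq(0.60))  # uncore-bound phase -> 3.0
```

The design choice mirrored here is that the controller never guesses an energy optimum directly; it only lowers frequency as far as the monitored metric says the performance budget allows.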

    Performance Controlled Power Optimization for Virtualized Internet Datacenters

    Modern data centers must provide performance assurance for complex system software such as web applications. In addition, the power consumption of data centers needs to be minimized to reduce operating costs and avoid system overheating. In recent years, more and more data centers have started to adopt server virtualization strategies for resource sharing, reducing hardware and operating costs by consolidating applications previously running on multiple physical servers onto a single physical server. In this dissertation, several power-efficient algorithms are proposed to effectively reduce server power consumption while achieving the required application-level performance for virtualized servers. First, at the server level, this dissertation proposes two control solutions based on dynamic voltage and frequency scaling (DVFS) and request batching. The two solutions share a performance-balancing technique that keeps all virtual machines at approximately the same performance level relative to their allowed peak values. When the workload intensity is light, we adopt request batching, using a controller to determine the interval for periodically batching incoming requests while putting the processor into sleep mode. When the workload intensity changes from light to moderate, request batching automatically switches to DVFS, which increases the processor frequency to guarantee performance. Second, at the data center level, this dissertation proposes a performance-controlled power optimization solution for virtualized server clusters running multi-tier applications. The solution utilizes both DVFS and server consolidation for maximized power savings by integrating feedback control with optimization strategies.
At the application level, a multi-input multi-output controller is designed to achieve the desired performance for applications spanning multiple VMs on a short time scale, by reallocating CPU resources and applying DVFS. At the cluster level, a power optimizer incrementally consolidates VMs onto the most power-efficient servers on a longer time scale. Finally, this dissertation proposes a VM scheduling algorithm that exploits core performance heterogeneity to optimize overall system energy efficiency. The four algorithms at the three different levels are demonstrated with empirical results on hardware testbeds and trace-driven simulations, and compared against state-of-the-art baselines.
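The mode-switching idea at the server level, batching requests and sleeping under light load, falling back to DVFS as load rises, can be sketched roughly as below. The intensity threshold and the batching-interval bound are invented for illustration and are not the dissertation's controllers:

```python
# Hedged sketch of light-load request batching vs. DVFS mode selection.
# The threshold and the deadline/service-time model are assumptions.

def power_mode(req_rate, light_threshold=100.0):
    """Batch (and sleep) under light load; use DVFS otherwise."""
    return "batching" if req_rate < light_threshold else "dvfs"

def batching_interval(req_rate, deadline_ms=50.0, service_ms=2.0):
    """Longest interval we can batch while still serving every queued
    request before its deadline (toy back-of-the-envelope bound)."""
    backlog = req_rate * deadline_ms / 1000.0  # requests queued per interval
    return max(0.0, deadline_ms - backlog * service_ms)

rate = 40.0  # requests per second (light load in this toy model)
if power_mode(rate) == "batching":
    print(f"sleep for {batching_interval(rate):.1f} ms per cycle")
```

The point of the sketch is the hand-off: the same performance target constrains both regimes, and the controller only chooses which knob (sleep interval or frequency) enforces it.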

    Distributed IC Power Delivery: Stability-Constrained Design Optimization and Workload-Aware Power Management

    Power delivery presents key design challenges in today's systems, ranging from high-performance microprocessors to mobile systems-on-a-chip (SoCs). A robust power delivery system is essential to ensure reliable operation of on-die devices. It has become an important design trend to place multiple voltage regulators on-chip in a distributed manner to cope with power supply noise. However, stability concerns arise because of the complex interactions between multiple voltage regulators and the bulky network of surrounding passive parasitics. The recently developed hybrid stability theorem (HST) is a promising way to deal with the stability of such systems by efficiently capturing the effects of all interactions; however, the intrinsic conservativeness of the underlying HST framework causes large overdesign and hence severe performance degradation. To address this challenge, this dissertation first extends the HST by proposing a frequency-dependent system partitioning technique that substantially reduces the pessimism in stability evaluation. By systematically exploring the theoretical foundation of the HST framework, we identify all the critical constraints under which the partitioning technique can be performed rigorously to remove conservativeness while maintaining the key theoretical properties of the partitioned subsystems. On this basis, we develop an efficient stability-ensuring automatic design flow for large power delivery systems with distributed on-chip regulation. Using the proposed approach, we further uncover new design insights for circuit designers, such as how the regulator topology, the on-chip decoupling capacitance, and the number of integrated voltage regulators can be optimized to improve system trade-offs between stability and performance. Besides stability, power efficiency must be improved in every possible way while maintaining high power quality.
It can be argued that the ultimate power integrity and efficiency may best be achieved via a heterogeneous chain of voltage processing, starting from on-board switching voltage regulators (VRs), to on-chip switching VRs, and finally to networks of distributed on-chip linear VRs. As such, we propose a heterogeneous voltage regulation (HVR) architecture encompassing regulators with complementary characteristics in response time, size, and efficiency. By exploiting the rich heterogeneity and tunability in HVR, we develop systematic workload-aware control policies that adapt the heterogeneous VRs to workload changes at multiple temporal scales, significantly improving system power efficiency while guaranteeing power integrity. The proposed techniques are further supported by hardware-accelerated machine-learning prediction of non-uniform spatial workload distributions for more accurate HVR adaptation at fine time granularity. Our evaluations based on the PARSEC benchmark suite show that the proposed adaptive 3-stage HVR reduces total system energy dissipation by up to 23.9%, and by 15.7% on average, compared with conventional static two-stage voltage regulation using off- and on-chip switching VRs. Compared with the 3-stage static HVR, our runtime control reduces system energy by up to 17.9%, and by 12.2% on average. Furthermore, the proposed machine-learning prediction offers up to a 4.1% reduction in system energy.
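The multi-time-scale division of labor, slow but efficient switching VRs tracking the average demand while fast on-chip linear VRs absorb transients, might be sketched as follows. The windowed-average split and all numbers are illustrative assumptions, not the dissertation's control policy:

```python
# Toy split of a load trace between a slow switching-VR stage and a fast
# linear-VR stage. The windowing scheme and the trace are assumptions.

def assign_stages(load_trace, switching_step=8):
    """For each sample, the switching VR supplies a windowed average of
    recent demand; linear VRs supply the residual transient."""
    out = []
    for i, load in enumerate(load_trace):
        window = load_trace[max(0, i - switching_step + 1): i + 1]
        base = sum(window) / len(window)            # slow setpoint
        out.append((base, max(0.0, load - base)))   # (switching, linear)
    return out

trace = [1.0, 1.0, 1.0, 1.0, 4.0, 1.0, 1.0, 1.0]   # a single load spike
split = assign_stages(trace, switching_step=4)
print(split[4])  # the spike is absorbed mostly by the fast linear stage
```

A workload predictor, as the abstract describes, would effectively shift the slow setpoint ahead of the spike instead of after it.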

    Multiple clock and voltage domains for chip multi processors

    No full text
    Power and thermal dissipation are major constraints on delivering compute performance in high-end CPUs and are expected to remain so in the future. CMPs are becoming important by delivering more compute performance within these power constraints. Dynamic voltage and frequency scaling (DVFS) has been studied in past work as a means to save power and improve overall processor performance while meeting total power and/or thermal constraints. For such systems, power delivery limitations are becoming a significant practical design consideration; unfortunately, this aspect of the design has been largely ignored by prior research. This paper explores the various possible topologies for building a high-end multi-core CPU and the available policies that maximize performance within the set of physical limitations. It evaluates single and multiple voltage and frequency domains and introduces a new clustered topology that groups several cores together. A hybrid model is introduced, combining measurements of a real CPU, a cycle-accurate simulator, and an analytical model. The results indicate that taking power delivery limitations into account changes the conclusions reached when such limitations are ignored. This paper shows that a single power domain topology performs up to 30% better than multiple power domains on lightly-threaded workloads; for fully threaded applications the results differ. The clustered topology performs well for any number of threads.
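A back-of-the-envelope model hints at why a single shared domain can win on lightly-threaded workloads: the idle cores' share of the power-delivery budget can be pooled into the one active core. The cubic power law and sublinear performance scaling below are common rules of thumb, not the paper's calibrated model, and real chips also cap the maximum frequency, which this sketch ignores:

```python
# Illustrative single- vs. per-core power domain comparison under a fixed
# chip-level power delivery budget. All models and numbers are assumptions.

def perf(freq):
    return freq ** 0.8          # assumed sublinear performance vs frequency

def max_freq(power_budget):
    return power_budget ** (1.0 / 3.0)  # dynamic power ~ f^3, so f ~ P^(1/3)

N_CORES, CHIP_BUDGET = 8, 64.0  # one active thread on an 8-core chip

# Per-core domains: the active core is capped at its equal budget slice.
f_split = max_freq(CHIP_BUDGET / N_CORES)
# Single shared domain: the active core may draw the whole chip budget.
f_shared = max_freq(CHIP_BUDGET)

print(f"speedup of shared domain: {perf(f_shared) / perf(f_split):.2f}x")
```

Under a fully threaded load every core draws its slice in either topology, so the pooling advantage vanishes, which is consistent with the paper's observation that the results differ there.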

    NUMA-aware Hierarchical Power Management for Chip Multiprocessors

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2017. 8. Bernhard Egger.๋Œ€์นญํ˜• ๋‹ค์ค‘ ์ฒ˜๋ฆฌ ์šด์˜์ฒด์ œ๋ฅผ ์‹คํ–‰ ์‹œํ‚ค๋Š” ์บ์‰ฌ ์ผ๊ด€์„ฑ์„ ๊ฐ€์ง€๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜๋ฅผ ์œ„ํ•œ ์ „ํ†ต์ ์ธ ์ ‘๊ทผ ๋ฐฉ๋ฒ•์€ ์ „๋ ฅ๊ด€๋ฆฌ๊ฐ€ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋ฌธ์ œ ์ค‘ ํ•˜๋‚˜๋กœ ์กด์žฌํ•˜๋Š” ๋ฏธ๋ž˜์˜ ๋งค๋‹ˆ์ฝ”์–ด ์‹œ์Šคํ…œ์—๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋งค๋‹ˆ์ฝ”์–ด ์‹œ์Šคํ…œ์„ ์œ„ํ•œ ๊ณ„์ธต์  ์ „๋ ฅ๊ด€๋ฆฌ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์†Œ๊ฐœํ•œ๋‹ค. ์ œ์•ˆํ•œ ํ”„๋ ˆ์ž„์›Œํฌ๋Š” ์บ์‰ฌ ์ผ๊ด€์„ฑ์„ ๊ฐ€์ง€๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•„์š” ์—†์œผ๋ฉฐ, ๋‹ค์ˆ˜์˜ ์ฝ”์–ด๋“ค์ด ์ „์••/์ฃผํŒŒ์ˆ˜๋ฅผ ๊ณต์œ ํ•˜๊ณ  ๋‹ค์ค‘ ์ „์••/๋‹ค์ค‘ ์ฃผํŒŒ์ˆ˜๋ฅผ ์ง€์›ํ•˜๋Š” ์•„ํ‚คํ…์ฒ˜์—์„œ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•˜๋‹ค. ์ด ํ”„๋ ˆ์ž„์›Œํฌ๋Š” NUMA-์ธ์ง€ ๊ณ„์ธต์  ์ „๋ ฅ๊ด€๋ฆฌ ๊ธฐ์ˆ ๋กœ ๋™์  ์ „์•• ๋ฐ ์ฃผํŒŒ์ˆ˜ ๊ตํ™˜(DVFS)๊ณผ ์›Œํฌ๋กœ๋“œ ๋งˆ์ด๊ทธ๋ž˜์ด์…˜์„ ์‚ฌ์šฉํ•œ๋‹ค. ์—ฌ๊ธฐ์„œ ์›Œํฌ๋กœ๋“œ ๋งˆ์ด๊ทธ๋ž˜์ด์…˜ ๊ณ„ํš์„ ์œ„ํ•ด ์‚ฌ์šฉ๋œ ํƒ์š• ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์„œ๋กœ ์ƒ์ถฉํ•˜๋Š” ๋น„์Šทํ•œ ์ž‘์—…๋Ÿ‰์˜ ํŒจํ„ด์„ ๊ฐ€์ง„ ์ž‘์—…์„ ๊ฐ™์€ ์ „์•• ์˜์—ญ์œผ๋กœ ๋ชจ์œผ๋Š” ๋ชฉํ‘œ์™€ ์ž‘์—…์„ ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๋Š” ์œ„์น˜์™€ ๊ฐ€๊นŒ์šด ๊ณณ์œผ๋กœ ์ด๋™ํ•˜๋Š” ๋ชฉํ‘œ๋ฅผ ๊ณ ๋ คํ•œ๋‹ค. ์ œ์•ˆ๋œ ํ”„๋ ˆ์ž„์›Œํฌ๋Š” ์†Œํ”„ํŠธ์›จ์–ด๋กœ ๊ตฌํ˜„๋˜์–ด ์บ์‰ฌ ์ผ๊ด€์„ฑ์ด ์—†๋Š” 48 ์ฝ”์–ด์˜ ์นฉ ๋ ˆ๋ฒจ ๋ฉ€ํ‹ฐํ”„๋กœ์„ธ์„œ ํ•˜๋“œ์›จ์–ด์—์„œ ํ‰๊ฐ€๋˜์—ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์˜ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๋ฐ ์ดํ„ฐ ์„ผํ„ฐ ์ž‘์—… ํŒจํ„ด์œผ๋กœ ๊ด‘๋ฒ”์œ„์— ๊ฑธ์นœ ์‹คํ—˜์„ ์ˆ˜ํ–‰ํ•œ ๊ฒฐ๊ณผ ์ตœ์ฒจ๋‹จ์˜ DVFS ๊ธฐ์ˆ ๊ณผ DVFS์™€ NUMA-๋น„์ธ์ง€ ์›Œํฌ๋กœ๋“œ ๋งˆ์ด๊ทธ๋ž˜์ด์…˜์„ ๊ฐ™์ด ์‚ฌ์šฉํ•œ ์ „๋ ฅ๊ด€๋ฆฌ ๊ธฐ์ˆ ์— ๋น„ํ•ด ์ƒ๋Œ€์ ์œผ๋กœ ๊ฐ๊ฐ 30%์™€ 5%์˜ ์ „๋ ฅ์†Œ๋ชจ๋‹น ์ฒ˜๋ฆฌ ์ž‘์—…๋Ÿ‰ ํ–ฅ์ƒ์„ ํฐ ์„ฑ๋Šฅ์†์‹ค ์—†์ด ์ด๋ฃจ์—ˆ๋‹ค.Traditional approaches for cache-coherent shared-memory architectures running symmetric multiprocessing (SMP) operating systems are not adequate for future many-core chips where power management presents one of the most important challenges. 
In this thesis, we present a hierarchical power management framework for many-core systems. The framework does not require coherent shared memory and supports multiple voltage/multiple-frequency (MVMF) architectures where several cores share the same voltage/frequency. We propose a hierarchical NUMA-aware power management technique that combines dynamic voltage and frequency scaling (DVFS) with workload migration. A greedy algorithm considers the conflicing goals of grouping workloads with similar utilization patterns in voltage domains and placing workloads as close as possible to their data. We implement the proposed scheme in software and evaluated it on existing hardware, a non-cache-coherent 48-core CMP. Compared to state-of-the-art power management techniques using DVFS-only and DVFS with NUMA-unaware migration, we achieve on average, a relative performance-per-watt improvement of 30 and 5 percent, respectively, for a wide range of datacenter workloads at no significant performance degradation.1 Introduction 1 2 Motivation and RelatedWork 5 2.1 Characteristics of Chip Multiprocessors 5 2.2 Dynamic Voltage and Frequency Scaling 7 2.3 Power Management on CMPs 8 2.4 Related Work 10 3 Cooperative Power Management 13 3.1 Cooperative Workload Migration 13 3.2 Hierarchical Organization 14 3.3 Domain Controllers 15 3.3.1 Core Controller 15 3.3.2 Frequency Controller 15 3.3.3 Voltage Controller 16 3.3.4 Chip Controller 16 3.3.5 Location of the Controllers 16 4 DVFS andWorkload Migration Policies 18 4.1 DVFS Policies 18 4.2 Phase Ordering and Frequency Considerations 19 4.3 Migration of Workloads 20 4.4 Scheduling Workload Migration 20 4.4.1 Schedule migration 21 4.4.2 Level migration 22 4.4.3 Assign target 25 4.4.4 Assign victim 26 4.5 Workload Migration Evaluation Model 27 5 Implementation 29 5.1 The Intel Single-chip Cloud Computer 29 5.2 Implementing Workload Migration 31 5.2.1 Migration Steps 31 5.2.2 Networking 33 5.3 Domain Controller Implementation 33 6 
Experimental Setup 34 6.1 Hardware 34 6.2 Benchmark Scenarios 35 6.3 Comparison of Results 37 7 Results 38 7.1 Synthetic Scenarios 38 7.2 Datacenter Scenarios 42 7.2.1 Varying Number of Workloads 42 7.2.2 Independent Workloads 45 7.3 Overall Results Comparison 46 8 Discussion 48 8.1 Limitations 48 8.2 Extra Hardware Support 49 9 Conclusion 50 Appendices 51 A Benchmark Scenario Details 51 A.1 Synthetic Benchmark 53 A.2 Real World Benchmark 56 Bibliography 67 ์š”์•ฝ 73Maste
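The greedy algorithm's two conflicting goals, utilization similarity within a voltage domain versus proximity to data, suggest a weighted placement score. The weighting, the hop-count distance metric, and all data below are hypothetical, not the thesis's implementation:

```python
# Hypothetical greedy scoring of candidate cores for a workload, balancing
# (a) utilization similarity to the candidate's voltage domain against
# (b) NUMA distance to the workload's data. Weights/data are assumptions.

def score(candidate, workload, domains, alpha=0.7):
    domain = domains[candidate["domain"]]
    similarity = 1.0 - abs(workload["util"] - domain["avg_util"])
    proximity = 1.0 / (1.0 + candidate["hops_to_data"])  # mesh hop count
    return alpha * similarity + (1.0 - alpha) * proximity

def best_core(workload, candidates, domains):
    return max(candidates, key=lambda c: score(c, workload, domains))

domains = {0: {"avg_util": 0.2}, 1: {"avg_util": 0.9}}
cores = [
    {"id": 3, "domain": 0, "hops_to_data": 1},  # idle domain, near data
    {"id": 7, "domain": 1, "hops_to_data": 4},  # busy domain, far from data
]
print(best_core({"util": 0.85}, cores, domains)["id"])  # prints 7
```

With this weighting, a busy workload is pulled into the already-busy voltage domain (so the idle domain can be slowed down) even at some cost in data proximity; a lighter workload would land on core 3 instead.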

    Embedded computing systems design: architectural and application perspectives

    This dissertation addresses various problems in the design and implementation of modern embedded computing systems, highlighting, and at times contrasting, the challenges that emerge as technology advances and the requirements that emerge at the application level, driven by the needs of end users and by market trends. The discussion is organized around two viewpoints: hardware design and system-level application. At the hardware level, on-chip interconnection problems are addressed in detail, an aspect that concerns both the parallelization of computation and the integration of heterogeneous functionality. A Network-on-Chip (NoC) interconnection architecture is then discussed. The proposed solution supports advanced networking functionality directly in hardware, while always allowing an optimal trade-off between traffic performance and implementation requirements, depending on the specific application. In discussing this topic, emphasis is placed on the configurability of the blocks that make up a NoC. Configurability is an increasingly pressing problem in the design of complex systems, where the goal is to develop functionality, even highly advanced functionality, that is easily reusable. To this end, a new methodology called Metacoding is introduced, which abstracts configurability problems through high-level programming languages. Based on metacoding, an automatic design flow is also proposed that simplifies the design and configuration of a NoC for the network designer.
As anticipated, the discussion then moves to the system level, addressing the design of such systems from the application perspective and focusing in particular on remote monitoring applications. In this regard, all aspects of designing a system for monitoring patients with chronic heart failure are studied in detail, starting from the definition of requirements, which, as often happens at this level, derive mainly from the needs of the end users, in our case physicians and patients. The problems of acquiring, processing, and managing the measurements are discussed. The proposed system introduces several innovative aspects, including the concept of an operating protocol and a high degree of interoperability. Finally, the results of the experimental evaluation of the implemented system are reported. The topic of remote monitoring concludes with a study of intelligent electric distribution networks, the Smart Grids, surveying the state of the art in the field, proposing a Home Area Network (HAN) architecture, and suggesting a possible implementation based on Commercial Off-the-Shelf (COTS) components.
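The metacoding idea, abstracting block configurability behind a high-level language that generates concrete configurations, might look like the following toy generator for a 2D-mesh NoC. The configuration fields and the mesh layout are assumptions for illustration, not the dissertation's actual design flow:

```python
# Toy "metacoding" sketch: derive per-router configuration records for a
# width x height mesh NoC from a few high-level parameters. The fields
# (ports, data_width) are illustrative assumptions.

def make_mesh_routers(width, height, data_width=32):
    routers = []
    for y in range(height):
        for x in range(width):
            ports = 1                              # local (core) port
            ports += (x > 0) + (x < width - 1)     # west / east links
            ports += (y > 0) + (y < height - 1)    # north / south links
            routers.append({"node": (x, y), "ports": ports,
                            "data_width": data_width})
    return routers

mesh = make_mesh_routers(3, 3)
print(len(mesh), mesh[4]["ports"])  # prints: 9 5 (center node has 5 ports)
```

The reusable part is the generator, not any single configuration: corner, edge, and center routers get different port counts from one high-level description, which is the kind of boilerplate a metacoding flow would automate.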