57 research outputs found

    Empirical CPU power modelling and estimation in the gem5 simulator


    Thermally-aware composite run-time CPU power models

    Accurate and stable CPU power models are fundamental in modern system-on-chips (SoCs) for two main reasons: 1) they enable significant online energy savings by providing a run-time manager with reliable power consumption data for controlling CPU energy-saving techniques; 2) they can be used as accurate and trusted reference models for system design and exploration. We begin by showing the limitations of typical performance monitoring counter (PMC) based power modelling approaches and illustrate how an improved model formulation results in a more stable model that efficiently captures relationships between the input variables and the power consumption. Using this as a solid foundation, we present a methodology for adding thermal-awareness and analytically decomposing the power into its constituent parts. We develop and validate our methodology using data recorded from a quad-core ARM Cortex-A15 mobile CPU and achieve an average prediction error of 3.7% across 39 diverse workloads, 8 Dynamic Voltage-Frequency Scaling (DVFS) levels, and CPU temperatures ranging from 31 °C to 91 °C. Moreover, we measure the effect of switching cores offline and decompose the existing power model to estimate the static power of each CPU and L2 cache, the dynamic power due to constant background (BG) switching, and the dynamic power caused by the activity of each CPU individually. Finally, we provide our model equations and software tools for implementation in a run-time manager or for use with an architectural simulator such as gem5.

    Energy Proportionality in Near-Threshold Computing Servers and Cloud Data Centers: Consolidating or Not?

    Cloud computing aims to efficiently tackle the increasing demand for computing resources, and its popularity has led to a dramatic increase in the number of computing servers and data centers worldwide. However, as an effect of post-Dennard scaling, computing servers have become power-limited, and new system-level approaches must be used to improve their energy efficiency. This paper first presents an accurate power modelling characterization for a new server architecture based on the FD-SOI process technology for near-threshold computing (NTC). Then, we explore the existing energy vs. performance trade-offs when virtualized applications with different CPU utilization and memory footprint characteristics are executed. Finally, based on this analysis, we propose a novel dynamic virtual machine (VM) allocation method that exploits knowledge of VM characteristics together with our accurate server power model for next-generation NTC-based data centers, while guaranteeing quality of service (QoS) requirements. Our results demonstrate the inefficiency of current workload consolidation techniques for new NTC-based data center designs, and show that our proposed method provides up to 45% energy savings when compared to state-of-the-art consolidation-based approaches.

    Gem5-X: A Gem5-Based System Level Simulation Framework to Optimize Many-Core Platforms

    The rapid expansion of online-based services requires novel energy- and performance-efficient architectures to meet power and latency constraints. Fast architectural exploration has become a key enabler in the proposal of architectural innovation. In this paper, we present gem5-X, a gem5-based system-level simulation framework, and a methodology to optimize many-core systems for performance and power. As real-life case studies of many-core server workloads, we use real-time video transcoding and image classification using convolutional neural networks (CNNs). Gem5-X allows us to identify bottlenecks and evaluate the potential benefits of architectural extensions such as in-cache computing and 3D-stacked High Bandwidth Memory (HBM). For real-time video transcoding, we achieve a 15% speed-up using in-order cores with in-cache computing when compared to a baseline in-order system, and 76% energy savings when compared to an out-of-order system. When using HBM, we further accelerate real-time transcoding and CNNs by up to 7% and 8%, respectively.

    Heterogeneity-aware scheduling and data partitioning for system performance acceleration

    Over the past decade, heterogeneous processors and accelerators have become increasingly prevalent in modern computing systems. Compared with previous homogeneous parallel machines, the hardware heterogeneity in modern systems provides new opportunities and challenges for performance acceleration. Classic operating system optimisation problems such as task scheduling, and application-specific optimisation techniques such as the adaptive data partitioning of parallel algorithms, are both required to work together to address hardware heterogeneity. Significant effort has been invested in this problem, but existing work either focuses on a specific type of heterogeneous system or algorithm, or offers a high-level framework without insight into how heterogeneity differs between types of system. A general software framework is required, which can not only be adapted to multiple types of systems and workloads, but is also equipped with the techniques to address a variety of hardware heterogeneity. This thesis presents approaches to design general heterogeneity-aware software frameworks for system performance acceleration. It covers a wide variety of systems, including an OS scheduler targeting on-chip asymmetric multi-core processors (AMPs) on mobile devices, a hierarchical many-core supercomputer, and multi-FPGA systems for high performance computing (HPC) centers. Considering the sources of heterogeneity in on-chip AMPs, such as thread criticality, core sensitivity, and relative fairness, it suggests a collaboration-based approach to co-design the task selector and core allocator of the OS scheduler. Considering the typical sources of heterogeneity in HPC systems, such as the memory hierarchy, bandwidth limitations, and asymmetric physical connections, it proposes an application-specific automatic data partitioning method for a modern supercomputer, and a topological-ranking heuristic based scheduler for a multi-FPGA based reconfigurable cluster. Experiments on both a full system simulator (gem5) and real systems (the Sunway TaihuLight supercomputer and Xilinx multi-FPGA based clusters) demonstrate the significant advantages of the suggested approaches compared against the state of the art on a variety of workloads.
    "This work is supported by St Leonards 7th Century Scholarship and Computer Science PhD funding from University of St Andrews; by UK EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1)." -- Acknowledgement

    Connecting Palladio with multicore CPU simulators

    In software engineering, simulators are typically used for Software Performance Engineering (SPE). It is important that the simulations are accurate in order to allow engineers to predict the performance in detail. Palladio is one of these approaches. Currently, Palladio only supports single-core CPU simulators, but there is also an auxiliary approach for multicore simulation. The main problem of this approach is its huge inaccuracy, which is about 74% with 16 cores. This bachelor thesis aims to investigate and improve Palladio's performance in hardware CPU simulation and performance prediction. This work presents a new approach for connecting a multicore CPU simulator to Palladio to improve the simulation accuracy. The result of this thesis is a conceptual implementation of an embedded multicore CPU simulator in Palladio to enable more accurate multicore performance predictions. The presented approach enables Palladio to connect to a multicore simulator called MaxSim via a Java prototype, but the predictions are not more accurate in general. With a mean speedup deviation of 67.81% at 16 cores, the simulation is only slightly more accurate for the tested system.
    Software engineers typically use simulators for Software Performance Engineering (SPE). The simulation results must be accurate so that engineers can predict performance in detail. Palladio is one of the tools used for SPE. Currently, Palladio supports only single-core CPU simulators, although a workaround approach for multicore simulation also exists. The problem with that approach is its enormous inaccuracy, which is around 74.48% at 16 cores. This bachelor thesis aims to investigate Palladio's capability in mapping complex architectures onto hardware models and the accuracy of its performance predictions. This work presents a new approach for connecting a multicore CPU simulator to an existing Palladio component in order to improve simulation accuracy for multicore performance predictions. The presented approach was implemented using MaxSim and ProtoCom, but the predictions are not more accurate in general. With a mean speedup deviation of −67.81% at 16 cores, the performance prediction deviation is only marginally lower for the tested case.