40 research outputs found

    Dynamic Energy Management for Chip Multi-processors under Performance Constraints

    We introduce a novel algorithm for dynamic energy management (DEM) under performance constraints in chip multi-processors (CMPs). Using the novel concept of a delayed-instruction count, performance loss estimates are calculated at the end of each control period for each core. In addition, a Kalman-filtering-based approach is employed to predict the workload in the next control period, for which voltage-frequency pairs must be selected. This selection is done with a novel dynamic voltage and frequency scaling (DVFS) algorithm whose objective is to reduce energy consumption without degrading performance beyond a user-set threshold. Using our customized Sniper-based CMP system simulation framework, we demonstrate the effectiveness of the proposed algorithm on a variety of benchmarks for 16-core and 64-core network-on-chip-based CMP architectures. Simulation results show consistent energy savings across the board. We present our work as an investigation of the tradeoff between the achievable energy reduction via DVFS and the allowed performance penalty threshold when predictions are made with the Kalman filter.
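    To make the control loop concrete, the sketch below pairs a one-dimensional Kalman filter (predicting each core's workload for the next control period) with a simple V/F selection step. It illustrates the general approach only, not the authors' implementation; the V/F table, noise parameters, and workload trace are hypothetical.

```python
# Minimal sketch of Kalman-filter-based workload prediction driving DVFS.
# All constants (VF_TABLE, noise variances, IPC) are illustrative placeholders.

VF_TABLE = [  # (voltage [V], frequency [GHz]), lowest to highest power
    (0.8, 1.0), (0.9, 1.5), (1.0, 2.0), (1.1, 2.5),
]

class ScalarKalman:
    """1-D Kalman filter with a random-walk model of per-period workload."""
    def __init__(self, process_noise=1e4, measurement_noise=1e5):
        self.x = 0.0          # workload estimate (instructions per period)
        self.p = 1e9          # estimate variance (large: no prior knowledge)
        self.q = process_noise
        self.r = measurement_noise

    def update(self, measured):
        k = self.p / (self.p + self.r)      # Kalman gain
        self.x += k * (measured - self.x)
        self.p *= (1.0 - k)

    def predict(self):
        self.p += self.q
        return self.x                       # random walk: next ~= current

def select_vf(predicted_instructions, period_s, ipc=1.0):
    """Lowest V/F pair whose instruction capacity covers the predicted work."""
    for v, f_ghz in VF_TABLE:
        if ipc * f_ghz * 1e9 * period_s >= predicted_instructions:
            return v, f_ghz
    return VF_TABLE[-1]                     # saturate at the highest setting

# Per-core usage: update with the measured workload of the finished control
# period, then predict and choose the V/F pair for the next period.
kf = ScalarKalman()
for measured in [4.2e8, 4.5e8, 3.9e8, 5.1e8]:   # synthetic workload trace
    kf.update(measured)
    print(select_vf(kf.predict(), period_s=0.5))
```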

    Investigation of LSTM Based Prediction for Dynamic Energy Management in Chip Multiprocessors

    In this paper, we investigate the effectiveness of using long short-term memory (LSTM) instead of Kalman filtering for prediction when constructing dynamic energy management (DEM) algorithms in chip multi-processors (CMPs). Either of the two prediction methods is employed to estimate the workload in the next control period for each of the processor cores. These estimates are then used to select voltage-frequency (VF) pairs for each core of the CMP during the next control period as part of a dynamic voltage and frequency scaling (DVFS) technique. The objective of the DVFS technique is to reduce energy consumption under performance constraints that are set by the user. We conduct our investigation using a custom Sniper system simulation framework. Simulation results for 16- and 64-core network-on-chip-based CMP architectures and several benchmarks demonstrate that LSTM prediction is slightly better than Kalman filtering.
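    As a rough illustration of swapping the predictor, the sketch below trains a small PyTorch LSTM to forecast the next control period's workload from a window of past measurements; its output could then drive the same kind of V/F selection as above. The window size, hidden size, hyperparameters, and synthetic trace are all assumptions, not the paper's configuration.

```python
# Illustrative LSTM workload predictor (assumed PyTorch setup, toy data).
import torch
import torch.nn as nn

class WorkloadLSTM(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # predict the next workload value

def make_windows(trace, window=8):
    xs = [trace[i:i + window] for i in range(len(trace) - window)]
    ys = [trace[i + window] for i in range(len(trace) - window)]
    return torch.tensor(xs).unsqueeze(-1), torch.tensor(ys).unsqueeze(-1)

# Synthetic, normalized per-period workload trace (stand-in for real counters).
trace = [0.5 + 0.3 * torch.sin(torch.tensor(t / 5.0)).item() for t in range(200)]
x, y = make_windows(trace)

model = WorkloadLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):                       # tiny training loop, MSE objective
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

# The prediction for the latest window would feed V/F selection for that core.
print(model(x[-1:]).item())
```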

    Utilizing Criticality Stacks for Dynamic Voltage and Frequency Scaling

    Thread imbalance is inevitable for multithreaded applications due to the synchronization primitives needed to coordinate access to memory and system resources. This imbalance bounds application performance, but, more importantly for mobile devices, it also leads to energy inefficiencies. Recent works have begun to quantify this imbalance and to leverage it not only for performance improvements but for energy savings as well. All these works, though, test the theory through simulators and power estimation tools. Such results may show that the theory is sound, but the complexities of how a real machine handles synchronization may diminish the benefit, through either too large a performance impact or too little energy savings. In this work, we implement one such algorithm, PCSLB, and improve upon it in order to see whether the reported results are achievable on real machines. With the improved algorithm, PCSLB-Max, and the CritScale Linux kernel module, we show that energy savings are indeed available while mitigating the performance impact.
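    The criticality-stack metric that this line of work builds on can be summarized with a short sketch: each elapsed interval is divided evenly among the threads running during it, so a thread that runs while others wait accumulates the most criticality and becomes the natural candidate for a frequency boost. The intervals below are hypothetical, and this is not the CritScale module itself.

```python
# Toy criticality-stack computation (illustrative intervals, not CritScale).

def criticality_stack(intervals):
    """intervals: list of (duration_seconds, set_of_running_thread_ids)."""
    crit = {}
    for duration, running in intervals:
        if not running:
            continue                      # no thread runnable in this interval
        share = duration / len(running)   # split the interval evenly
        for tid in running:
            crit[tid] = crit.get(tid, 0.0) + share
    return crit

# Thread 0 keeps running while threads 1-3 block on a barrier, so it ends up
# the most critical thread and the best candidate for a V/F boost.
intervals = [
    (1.0, {0, 1, 2, 3}),   # all threads running
    (2.0, {0}),            # only thread 0 running; the others wait
    (0.5, {0, 1, 2, 3}),
]
crit = criticality_stack(intervals)
print(crit, "-> boost thread", max(crit, key=crit.get))
```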

    Rubik: fast analytical power management for latency-critical systems

    Latency-critical workloads (e.g., web search), common in datacenters, require stable tail (e.g., 95th percentile) latencies of a few milliseconds. Servers running these workloads are kept lightly loaded to meet these stringent latency targets. This low utilization wastes billions of dollars in energy and equipment annually. Applying dynamic power management to latency-critical workloads is challenging. The fundamental issue is coping with their inherent short-term variability: requests arrive at unpredictable times and have variable lengths. Without knowledge of the future, prior techniques either adapt slowly and conservatively or rely on application-specific heuristics to maintain tail latency. We propose Rubik, a fine-grain DVFS scheme for latency-critical workloads. Rubik copes with variability through a novel, general, and efficient statistical performance model. This model allows Rubik to adjust frequencies at sub-millisecond granularity to save power while meeting the target tail latency. Rubik saves up to 66% of core power, widely outperforms prior techniques, and requires no application-specific tuning. Beyond saving core power, Rubik robustly adapts to sudden changes in load and system performance. We use this capability to design RubikColoc, a colocation scheme that uses Rubik to allow batch and latency-critical work to share hardware resources more aggressively than prior techniques. RubikColoc reduces datacenter power by up to 31% while using 41% fewer servers than a datacenter that segregates latency-critical and batch work, and achieves 100% core utilization. National Science Foundation (U.S.) (Grant CCF-1318384)
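    The control idea behind such fine-grain schemes can be sketched simply: whenever the queue state changes, pick the lowest frequency at which the estimated remaining work still finishes within the remaining tail-latency slack. This is not Rubik's actual statistical model; the frequency levels, cycle estimates, and slack values below are invented for illustration.

```python
# Sketch of slack-driven frequency selection for a latency-critical core.
# Frequency levels, cycle estimates, and slack values are hypothetical.

FREQS_GHZ = [1.0, 1.5, 2.0, 2.5, 3.0]

def pick_frequency(remaining_cycles, slack_s, headroom=0.9):
    """Lowest frequency whose service time fits within the remaining slack.

    remaining_cycles: estimated cycles left for queued requests
    slack_s: time left before the tail-latency target would be violated
    headroom: safety margin against estimation error
    """
    for f_ghz in FREQS_GHZ:
        if remaining_cycles / (f_ghz * 1e9) <= headroom * slack_s:
            return f_ghz
    return FREQS_GHZ[-1]   # cannot make the deadline: run flat out

# Light load with ample slack -> a low frequency suffices; a burst with little
# slack left -> jump to a high frequency at sub-millisecond granularity.
print(pick_frequency(remaining_cycles=1.0e6, slack_s=2e-3))    # 1.0 GHz
print(pick_frequency(remaining_cycles=4.0e6, slack_s=1.5e-3))  # 3.0 GHz
```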

    GDP: using dataflow properties to accurately estimate interference-free performance at runtime

    Multi-core memory systems commonly share resources between processors. Resource sharing improves utilization at the cost of increased inter-application interference which may lead to priority inversion, missed deadlines and unpredictable interactive performance. A key component to effectively manage multi-core resources is performance accounting which aims to accurately estimate interference-free application performance. Previously proposed accounting systems are either invasive or transparent. Invasive accounting systems can be accurate, but slow down latency-sensitive processes. Transparent accounting systems do not affect performance, but tend to provide less accurate performance estimates. We propose a novel class of performance accounting systems that achieve both performance-transparency and superior accuracy. We call the approach dataflow accounting, and the key idea is to track dynamic dataflow properties and use these to estimate interference-free performance. Our main contribution is Graph-based Dynamic Performance (GDP) accounting. GDP dynamically builds a dataflow graph of load requests and periods where the processor commits instructions. This graph concisely represents the relationship between memory loads and forward progress in program execution. More specifically, GDP estimates interference-free stall cycles by multiplying the critical path length of the dataflow graph with the estimated interference-free memory latency. GDP is very accurate with mean IPC estimation errors of 3.4% and 9.8% for our 4- and 8-core processors, respectively. When GDP is used in a cache partitioning policy, we observe average system throughput improvements of 11.9% and 20.8% compared to partitioning using the state-of-the-art Application Slowdown Model.
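    A toy version of the core estimate is easy to write down: take the critical path through the graph of dependent load requests and multiply its length by the interference-free memory latency to approximate interference-free stall cycles. The dependency graph and latency constant below are invented; the real GDP mechanism derives its graph from observed load and commit behaviour rather than an explicit dictionary.

```python
# Toy GDP-style estimate: critical path of dependent loads times the
# interference-free memory latency. Graph and latency are hypothetical.

def critical_path_length(deps):
    """deps: {load_id: [load_ids it depends on]} -> longest dependence chain."""
    memo = {}
    def depth(node):
        if node not in memo:
            memo[node] = 1 + max((depth(p) for p in deps.get(node, [])), default=0)
        return memo[node]
    return max((depth(n) for n in deps), default=0)

# Loads B and C depend on A; D depends on B, so the critical path is A-B-D.
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B"]}
INTERFERENCE_FREE_LATENCY_CYCLES = 200   # assumed private-mode memory latency

stall_estimate = critical_path_length(deps) * INTERFERENCE_FREE_LATENCY_CYCLES
print(stall_estimate)   # 3 * 200 = 600 estimated interference-free stall cycles
```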

    Predicting Thread Criticality and Frequency Scalability for Active Load Balancing in Shared Memory Multithreaded Applications

    With advancements in process technologies, manufacturers are able to pack many processor cores on a single chip. Consequently, there is a paradigm shift from single processors to single-chip multiprocessors (CMPs). As the mobile industry moves towards CMPs, the challenge of working within a low power/energy budget while still delivering good performance has taken precedence. Therefore, energy and power optimization have become primary design constraints, alongside performance, in the CMP design space. CMPs gain performance by running multithreaded programs that exploit thread-level parallelism (TLP). But these parallel programs have inherent load imbalance between threads due to synchronization, which degrades both performance and energy efficiency. The solution is to develop energy-efficient algorithms that can detect and reduce this imbalance to maximize performance while still giving energy savings. Dynamic thread criticality prediction is an approach to detecting the load imbalance by identifying the critical threads that cause performance/energy degradation due to synchronization. Cores running critical threads can then be run at high power states using dynamic voltage and frequency scaling (DVFS), thus balancing the execution times of all threads. But in order to create an accurate balance and utilize DVFS efficiently, one also needs to know the impact of voltage/frequency scaling on a thread's performance gain and power usage. Thread frequency scalability prediction can give a good estimate of the performance improvement that DVFS can provide to each thread. We present an active load balancing algorithm that uses thread criticality and frequency scalability prediction to get the maximum possible performance benefit in an energy-efficient manner. Results show that balancing the load accurately can give energy savings as high as 10% with minimal performance loss compared to running all cores at high frequency.
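    How the two predictions can work together is easy to sketch: treat each thread's criticality as its remaining work and its frequency scalability as the fraction of that work that actually speeds up with frequency, then give every thread the lowest frequency that still lets it reach the barrier when the most critical thread does. This is a simplified illustration, not the paper's algorithm; the frequency levels and per-thread numbers are made up.

```python
# Simplified criticality + frequency-scalability balancing (hypothetical data).

FREQS_GHZ = [1.0, 1.5, 2.0, 2.5]   # assumed per-core DVFS levels

def predicted_runtime(work_s_at_base, scalability, f_ghz, base_ghz=1.0):
    """Amdahl-style split: only the scalable fraction speeds up with frequency."""
    scaled = work_s_at_base * scalability * (base_ghz / f_ghz)
    unscaled = work_s_at_base * (1.0 - scalability)   # memory/sync-bound part
    return scaled + unscaled

def balance(threads):
    """threads: {tid: (work_at_1GHz_s, scalability in [0, 1])} -> {tid: GHz}."""
    # The barrier cannot arrive before the most critical thread finishes even
    # at the top frequency; use that as the target completion time.
    target = max(predicted_runtime(w, s, FREQS_GHZ[-1]) for w, s in threads.values())
    return {tid: next((f for f in FREQS_GHZ
                       if predicted_runtime(w, s, f) <= target), FREQS_GHZ[-1])
            for tid, (w, s) in threads.items()}

# Thread 0 is critical and scalable, so it gets the top frequency; thread 1
# has slack and poor scalability, so a low frequency still meets the barrier.
print(balance({0: (2.0, 0.9), 1: (0.8, 0.5)}))   # e.g. {0: 2.5, 1: 1.0}
```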

    Maximizing heterogeneous processor performance under power constraints
