11 research outputs found

    Towards Energy-Proportional Computing for Enterprise-Class Server Workloads

    Massive data centers housing thousands of computing nodes have become commonplace in enterprise computing, and the power consumption of such data centers is growing at an unprecedented rate. Adding to the problem is the inability of the servers to exhibit energy proportionality, i.e., provide energy-efficient execution under all levels of utilization, which diminishes the overall energy efficiency of the data center. It is imperative that we realize effective strategies to control the power consumption of the server and improve the energy efficiency of data centers. With the advent of Intel Sandy Bridge processors, we have the ability to specify a limit on power consumption during runtime, which creates opportunities to design new power-management techniques for enterprise workloads and make the systems that they run on more energy-proportional. In this paper, we investigate whether it is possible to achieve energy proportionality for an enterprise-class server workload, namely the SPECpower_ssj2008 benchmark, by using Intel's Running Average Power Limit (RAPL) interfaces. First, we analyze the power consumption and characterize the instantaneous power profile of the SPECpower benchmark at a subsystem level using the on-chip energy meters exposed via the RAPL interfaces. We then analyze the impact of RAPL power limiting on the performance, per-transaction response time, power consumption, and energy efficiency of the benchmark under different load levels. Our observations and results shed light on the efficacy of the RAPL interfaces and provide guidance for designing power-management techniques for enterprise-class workloads.
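    For readers unfamiliar with the RAPL interfaces the paper builds on, the following is a minimal sketch of how package energy can be sampled and a power cap applied through the Linux powercap filesystem. The sysfs paths follow the standard intel-rapl layout, but domain numbering and permissions vary by machine; this is an illustration, not the paper's measurement code.

```python
# Minimal sketch: sampling package energy and applying a power cap via the
# Linux powercap interface to Intel RAPL. Paths follow the standard
# /sys/class/powercap layout; domain numbering varies across machines.
import time

RAPL_PKG = "/sys/class/powercap/intel-rapl:0"  # package-0 domain (verify locally)

def read_energy_uj(domain=RAPL_PKG):
    """Return the cumulative energy counter in microjoules."""
    with open(f"{domain}/energy_uj") as f:
        return int(f.read())

def average_power_w(interval_s=1.0, domain=RAPL_PKG):
    """Estimate average package power over an interval from two counter reads."""
    e0 = read_energy_uj(domain)
    time.sleep(interval_s)
    e1 = read_energy_uj(domain)
    max_uj = int(open(f"{domain}/max_energy_range_uj").read())
    delta = (e1 - e0) % max_uj          # handle counter wrap-around
    return delta / interval_s / 1e6     # uJ/s -> W

def set_power_limit_w(watts, domain=RAPL_PKG):
    """Apply a long-term package power cap (requires root)."""
    with open(f"{domain}/constraint_0_power_limit_uw", "w") as f:
        f.write(str(int(watts * 1e6)))

if __name__ == "__main__":
    print(f"package power: {average_power_w():.1f} W")
```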

    Per-task energy metering and accounting in the multicore era

    Chip multi-core processors (CMPs) are the preferred processing platform across domains such as data centers, real-time systems, and mobile devices. In all of these domains, energy is arguably the most expensive resource in a computing system, and the one whose cost is growing fastest, so measuring energy usage has drawn considerable attention. Current studies mostly focus on obtaining finer-granularity energy measurements: measuring power over smaller time intervals, or attributing energy to individual hardware or software components. Such studies measure system energy under the assumption that only one program is running in the system. So far, no hardware-level mechanism has been proposed to distribute system energy exactly among multiple programs running concurrently on a resource-sharing multi-core system. For the first time, we formalize the need for per-task energy measurement in multicores by establishing a two-fold concept: Per-Task Energy Metering (PTEM) and Sensible Energy Accounting (SEA). In a scenario where many tasks run in parallel on a multicore system, PTEM provides, for each task, a runtime estimate of its actual energy consumption based on the resources it uses during execution, while SEA estimates the energy the task would have consumed running in isolation with a particular fraction of the system's resources. Accurately determining the energy consumed by each task will become of prominent importance in future multi-core based systems, as it offers several benefits, including (i) selection of appropriate co-runners, (ii) improved energy-aware task scheduling, and (iii) energy-aware billing in data centers. We show how these two concepts can be applied to the main components of a computing system: the processor and the memory system. First, we apply PTEM to the processor by tracking the activities and occupancy of all resources on a per-task basis, and to the memory system by tracking the activities and state switches of memory banks. We then apply SEA to the processor by predicting each task's activities and execution time when running alone with a fraction of the chip's resources, and to the memory system by predicting each task's activities, execution time, and the times at which it invokes the memory system. Across all of these mechanisms, by trading off hardware cost against estimation accuracy, we obtain implementable, affordable designs with high accuracy. We also show how these techniques can be applied in further scenarios, such as detecting significant energy-usage variations for a particular task and developing more energy-efficient scheduling policies for multi-core systems. The work in this thesis has been published in IEEE/ACM journals and conference proceedings listed in the publications chapter of this thesis.
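    The thesis implements PTEM in hardware; as a rough illustration of the accounting idea only, the toy Python model below splits energy into dynamic energy attributed by per-task activity counters and background energy shared by resource occupancy. The event names and per-event energy values are invented for the example.

```python
# Toy model of Per-Task Energy Metering (PTEM): dynamic energy is attributed
# by per-task activity counts, while static/background energy is shared in
# proportion to resource occupancy. Energy-per-event values and counter names
# are illustrative assumptions, not the thesis's hardware mechanism.

ENERGY_PER_EVENT_NJ = {"l2_access": 0.5, "dram_access": 15.0, "alu_op": 0.05}

def ptem_energy(task_counters, task_occupancy, static_energy_nj):
    """task_counters: {task: {event: count}}; task_occupancy: {task: fraction}."""
    energy = {}
    for task, counters in task_counters.items():
        dynamic = sum(ENERGY_PER_EVENT_NJ[e] * n for e, n in counters.items())
        background = static_energy_nj * task_occupancy[task]  # share by occupancy
        energy[task] = dynamic + background
    return energy

counters = {
    "taskA": {"l2_access": 10_000, "dram_access": 800, "alu_op": 2_000_000},
    "taskB": {"l2_access": 40_000, "dram_access": 100, "alu_op": 500_000},
}
occupancy = {"taskA": 0.6, "taskB": 0.4}   # fraction of shared resources held
print(ptem_energy(counters, occupancy, static_energy_nj=50_000))
```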

    Sensible energy accounting with abstract metering for multicore systems

    Chip multicore processors (CMPs) are the preferred processing platform across different domains such as data centers, real-time systems, and mobile devices. In all those domains, energy is arguably the most expensive resource in a computing system. Accurately quantifying energy usage in a multicore environment presents a challenge as well as an opportunity for optimization. Standard metering approaches are not capable of delivering consistent results with shared resources, since the same task with the same inputs may have different energy consumption based on the mix of co-running tasks. However, it is reasonable for data-center operators to charge on the basis of estimated energy usage rather than time, since energy is more closely correlated with their actual cost. This article introduces the concept of Sensible Energy Accounting (SEA). For a task running in a multicore system, SEA accurately estimates the energy the task would have consumed running in isolation with a given fraction of the CMP shared resources. We explain the potential benefits of SEA in different domains and describe two hardware techniques to implement it for a shared last-level cache and on-core resources in SMT processors. Moreover, with SEA, an energy-aware scheduler can find a highly efficient on-chip resource assignment, reducing total processor energy by up to 39% for a 4-core system.
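    As a sketch of how a scheduler might exploit SEA-style estimates, the brute-force search below assigns last-level-cache ways to tasks so as to minimize total predicted energy. The prediction function here is a stand-in assumption; in the article, such estimates come from the proposed hardware metering, and the 39% figure refers to that setting, not to this toy.

```python
# Sketch of SEA-driven resource assignment: given a per-task model that
# predicts energy as a function of the share of a shared resource (here,
# last-level-cache ways), exhaustively search for the partitioning that
# minimizes total predicted energy. The model below is illustrative only.
from itertools import product

def best_partition(tasks, total_ways, predict_energy):
    """tasks: list of task ids; predict_energy(task, ways) -> joules."""
    best, best_e = None, float("inf")
    for ways in product(range(1, total_ways), repeat=len(tasks)):
        if sum(ways) != total_ways:
            continue
        e = sum(predict_energy(t, w) for t, w in zip(tasks, ways))
        if e < best_e:
            best, best_e = dict(zip(tasks, ways)), e
    return best, best_e

# Illustrative model: energy falls with extra cache ways at task-specific rates.
demo_model = lambda t, w: {"A": 8.0, "B": 3.0}[t] / w + 1.0 * w
print(best_partition(["A", "B"], total_ways=8, predict_energy=demo_model))
```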

    Fast Runtime Characterization of Large Application Deployments

    Doctoral thesis, Universidad Complutense de Madrid, Facultad de Informática, defended 19/01/2021. Data centers are one of the most power-hungry parts of the Information and Communications Technologies (ICT) sector. In 2014, data centers in the U.S. consumed around 1.8% of total U.S. electricity. Worldwide, data centers consumed around 200 TWh of global electricity in 2015, and this consumption is expected to increase to around 1200 TWh by 2025, which would represent 4.5% of global electricity usage. One of the major contributors to overall data center power is the IT or computing power, so there is particular interest in improving its energy efficiency. The scientific community has developed energy-efficiency techniques to reduce the energy consumption of IT equipment, such as resource management, power budgeting, and power capping...

    Design of a Low-Power Deep-Learning Training Accelerator Using Quantized Learning

    Ph.D. dissertation, Seoul National University, Graduate School of Convergence Science and Technology (Intelligent Convergence Systems major), February 2022. Advisor: 전동석.

    With the advent of the deep learning era, the computational needs for processing deep neural networks (DNNs) have increased dramatically, both for training neural networks on various tasks and for performing inference with trained networks in specific use cases. To address those needs, much custom hardware has been proposed in industry and academia, ranging from systems based on field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) for deployment inside data centers to acceleration blocks in systems-on-chip (SoCs) for low-power processing in mobile devices. In this dissertation, custom integrated-circuit hardware for energy-efficient training of neural networks is designed, fabricated, and measured to evaluate methodologies for more energy-efficient processing under the same training-performance constraints. These methodologies fall into three categories: (1) Training algorithms. While standard DNN training is performed with the back-propagation (BP) algorithm, we investigate alternative training algorithms, such as neuromorphic learning algorithms with spiking neurons and bio-plausible algorithms with asymmetric feedback, exploiting their computational properties for more efficient hardware implementation. (2) Low-precision arithmetic. One of the most powerful methods for increasing efficiency in DNN accelerators is scaling numerical precision. While low-precision numerics for the inference phase of DNNs are well studied, training DNNs without performance degradation is considerably more challenging; a novel numerical scheme for training DNNs across various models and scenarios is proposed in this dissertation. (3) System implementation techniques. When actually realizing a custom training system in integrated circuits, a nearly infinite design space yields vastly different quality of results depending on on-chip dataflow, system load balancing, acceleration and gating blocks, et cetera. Design techniques that lead to better performance and efficiency are introduced in this dissertation.

    First, a neuromorphic learning system for classifying handwritten digits (MNIST) is introduced. This learning system aims to deliver low training overhead while maintaining the training performance of classical machine learning. To achieve this goal, a neuromorphic learning algorithm is modified for a lower operation count and smaller memory-buffer requirement while maintaining, or even improving, learning performance. Moreover, implementation techniques such as an update-skipping mechanism and lock-free parameter updates reduce training overhead further, dynamically lowering the training energy overhead from 25.6% to 7.5%. With these methodologies, the system greatly improves the accuracy-energy trade-off of on-chip learning while showing learning performance close to classical DNN training through back-propagation.

    Second, a programmable DNN training processor with a custom numerical format is introduced. While prior DNN inference accelerators have used 8-bit integers, implementing 8-bit numerics in a training accelerator has remained a challenge because of the higher precision required in the backward step of DNN training. To overcome this limitation, a custom 8-bit floating-point format dubbed 8-bit floating point with shared exponent bias (FP8-SEB) is introduced in this dissertation. Moreover, a processing architecture built around 24-way fused-multiply-adder (FMA) trees greatly increases processing energy efficiency per MAC, complemented by a novel two-dimensional routing data-path that exploits spatiality to increase data reuse in the forward, backward, and weight-gradient steps of convolutional neural networks. The processor combines a custom vector processing unit, acceleration instructions, and direct DMA access to external DRAM for end-to-end DNN training on various models and datasets. Compared against a prior low-precision training processor on ResNet-18 training, this work achieves 2.48× higher energy efficiency, 43% fewer DRAM accesses, and 0.8%p higher training accuracy.

    Both designs were fabricated in real silicon and verified in simulation and in physical measurement. The design methodologies are carefully evaluated using simulations of the fabricated chips and measurements of monitored data and power consumption under varying conditions that expose the design techniques in effect. Based on these measurements, the dissertation quantitatively analyzes the efficiency of biologically plausible training algorithms, numerical formats optimized for DNN training, and system implementation techniques.
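    To make the numerical idea concrete, here is a hedged sketch of quantizing a tensor to an 8-bit floating-point format with a shared exponent bias, assuming a 1-4-3 sign/exponent/mantissa split and a per-tensor bias anchored to the largest magnitude. The fabricated chip's exact bit layout and rounding are not specified here, so treat this as an illustration of the shared-bias concept only.

```python
# Sketch of 8-bit floating point with a shared exponent bias (FP8-SEB-style),
# assuming a 1-4-3 sign/exponent/mantissa split. The per-tensor bias is chosen
# so the largest magnitude lands at the top of the dynamic range; the real
# hardware's bit layout and rounding may differ.
import numpy as np

EXP_BITS, MAN_BITS = 4, 3
EXP_MAX = 2**EXP_BITS - 1          # 15
MAN_SCALE = 2**MAN_BITS            # 8

def quantize_fp8_seb(x):
    amax = np.max(np.abs(x))
    # shared bias: map floor(log2(amax)) to the highest exponent code
    bias = EXP_MAX - int(np.floor(np.log2(amax)))
    mag = np.abs(x)
    e = np.clip(np.floor(np.log2(np.maximum(mag, 1e-38))) + bias, 0, EXP_MAX)
    frac = mag / 2.0**(e - bias)                      # in [1, 2) when not clipped
    m = np.clip(np.round(frac * MAN_SCALE) - MAN_SCALE, 0, MAN_SCALE - 1)
    deq = np.sign(x) * 2.0**(e - bias) * (1 + m / MAN_SCALE)
    return np.where(mag == 0, 0.0, deq), bias

w = np.random.randn(4, 4).astype(np.float32)
wq, bias = quantize_fp8_seb(w)
print("shared bias:", bias, " max abs error:", np.max(np.abs(w - wq)))
```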

    System Support For Energy Efficient Mobile Computing

    Mobile devices have developed rapidly and become an integral part of our daily lives, and with the growth of the Internet of Things, mobile computing will become more and more important. However, battery drain is a critical issue that hurts the user experience: high-performance devices require more power, while battery capacity increases only about 5% per year on average. Researchers are working on many kinds of energy-saving approaches. For example, hardware components provide multiple power states to save idle power, and operating systems provide power-management APIs to better control power dissipation. Yet system energy efficiency is still too low to meet users' expectations. To improve energy efficiency, we studied how to provide system support for mobile computing in four different aspects. First, we focused on the influence of user behavior on system energy consumption. We monitored and analyzed users' application-usage information, and from the results we built a battery prediction model that estimates battery time based on user behavior and hardware-component usage. By adjusting user behavior, we can at most double the battery time. To understand why different applications cause such large energy differences, we built a power profiler, Bugu, to figure out where the power goes. Bugu analyzes power and event information for applications with high accuracy and low overhead. We analyzed the power behavior of almost 100 mobile applications and derived several implications for saving energy in applications and systems. In addition, to understand the energy behavior of modern hardware architectures, we analyzed the energy consumption and performance of heterogeneous platforms and compared them with homogeneous platforms. The results show that heterogeneous platforms indeed have great potential for energy saving, mostly in idle and low-workload situations; however, a wrong scheduling decision may cause up to 30% more energy consumption, so scheduling becomes the key to energy-efficient computing. Finally, as increased power density leads to high device temperatures, we investigated the thermal management system and developed an ambient-temperature-aware thermal control policy, Falcon, which saves 4.85% of total system power and is more adaptive to various environments than the default approach. We conclude by discussing several potential directions for future research in this field.
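    As an illustration of the kind of usage-based battery prediction the dissertation describes, the sketch below fits a linear power model to per-component utilization samples and extrapolates the remaining battery time. The component set and all numbers are invented for the example; the real model is built from traces monitored on actual devices.

```python
# Toy usage-based battery-time predictor: fit a linear power model
# P = base + sum(beta_i * u_i) from per-component utilization samples,
# then extrapolate remaining battery time for an expected usage pattern.
import numpy as np

# columns: cpu_util, screen_on, net_active; rows: sampled intervals
usage = np.array([[0.8, 1, 1], [0.2, 1, 0], [0.1, 0, 0], [0.5, 1, 1]])
measured_mw = np.array([2200.0, 1100.0, 300.0, 1700.0])

# least-squares fit with an intercept column for the base (idle) power
X = np.hstack([np.ones((len(usage), 1)), usage])
coef, *_ = np.linalg.lstsq(X, measured_mw, rcond=None)

def battery_hours(remaining_mwh, expected_usage):
    power_mw = coef[0] + coef[1:] @ np.asarray(expected_usage)
    return remaining_mwh / power_mw

print(f"estimated time left: {battery_hours(11000, [0.3, 1, 0]):.1f} h")
```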

    Study and development of innovative strategies for energy-efficient cross-layer design of digital VLSI systems based on Approximate Computing

    The increasing demand for high performance and energy efficiency in modern digital systems has led to research into new design approaches able to go beyond the established energy-performance tradeoff. In the scientific literature, the Approximate Computing paradigm has been particularly prolific. Many applications in the domains of signal processing, multimedia, computer vision, and machine learning are known to be particularly resilient to errors occurring in their input data and during computation, producing outputs that, although degraded, are still largely acceptable from the point of view of quality. The Approximate Computing design paradigm leverages the characteristics of this group of applications to develop circuits, architectures, and algorithms that, by relaxing design constraints, perform their computations in an approximate or inexact manner, reducing energy consumption. This PhD research explores the design of hardware/software architectures based on Approximate Computing techniques, filling the gap in the literature regarding effective applicability and deriving a systematic methodology to characterize its benefits and tradeoffs. The main contributions of this work are:
    - the introduction of approximate memory management inside the Linux OS, allowing dynamic allocation and de-allocation of approximate memory at user level, just as for normal exact memory;
    - the development of an emulation environment for platforms with approximate memory units, where faults are injected during simulation based on models that reproduce the effects of circuit- and architecture-level approximate-memory techniques on memory cells;
    - the implementation and analysis of the impact of approximate memory hardware on real applications: the H.264 video encoder, internally modified to allocate selected data buffers in approximate memory, and signal-processing applications (digital filters) using approximate memory for input/output buffers and tap registers;
    - the development of a fully reconfigurable and combinatorial floating-point unit, which can work with reduced-precision formats.
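    A minimal sketch of the fault-injection idea behind such an emulation environment follows: bits in buffers the application has marked as approximate are flipped at a rate taken from a model of the underlying circuit technique (e.g., reduced DRAM refresh). The uniform, independent bit-error model and all parameters here are assumptions for illustration, not the thesis's actual models.

```python
# Sketch of fault injection for emulated approximate memory: flip each bit
# of an "approximate" buffer independently with a given bit-error rate.
import random

def inject_faults(buf: bytearray, bit_error_rate: float, seed=None):
    """Flip each bit of the buffer independently with probability bit_error_rate."""
    rng = random.Random(seed)
    for i in range(len(buf)):
        for bit in range(8):
            if rng.random() < bit_error_rate:
                buf[i] ^= 1 << bit
    return buf

frame = bytearray(b"\x80" * 64)          # stand-in for a pixel buffer
inject_faults(frame, bit_error_rate=1e-3, seed=42)
print(sum(bin(b ^ 0x80).count("1") for b in frame), "bits flipped")
```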

    A Survey of Performance Optimization for Mobile Applications

    Nowadays there is a mobile application for almost everything a user may think of, from paying bills and gathering information to playing games and watching movies. To ensure user satisfaction and the success of applications, it is important to provide highly performant applications. This is particularly important for resource-constrained systems such as mobile devices, where non-functional performance characteristics, such as energy and memory consumption, play an important role in user satisfaction. This paper provides a comprehensive survey of non-functional performance optimization for Android applications. We collected 155 unique publications, published between 2008 and 2020, that focus on optimizing the non-functional performance of mobile applications. We target our search at four performance characteristics: responsiveness, launch time, memory consumption, and energy consumption. For each performance characteristic, we categorize optimization approaches based on the methods used in the corresponding publications. Furthermore, we identify research gaps in the literature for future work.

    Energy-Performance Optimization for the Cloud


    Memory Power Consumption in Main-Memory Database Systems

    In main-memory database systems, memory can consume a substantial amount of power, comparable to that of the processors. However, existing memory power-saving mechanisms are much less effective than processor power management: unless the system is almost idle, memory power consumption will be high. The reason for poor memory power proportionality is that the bulk of memory power consumption is attributable to background power, which is determined by memory power-state residency. The memory workload in existing systems is evenly distributed over the memory modules and also in time, which precludes long idle intervals. As a result, deep low-power states, which could significantly reduce background power consumption, are rarely entered. In this work, we aim to reduce the memory power consumption of main-memory database systems. We start by investigating and explaining the patterns of memory power consumption under various workloads. We then propose two techniques, implemented at the database system level, that skew memory traffic, creating long periods of idleness in a subset of memory modules. This allows those modules to enter low-power states, reducing overall memory power consumption. We prototyped these techniques in DimmStore, an experimental database system. The first technique is rate-aware data placement, which places data on memory modules according to its access frequency. The background power in the unused or least-used modules is reduced without affecting background power in the most-used modules. Rate-aware placement saves power and has little performance impact: under a TPC-C workload, it yielded memory power savings of up to 44%, with a maximum throughput reduction of 10%. The second technique is memory access gating, which targets background power in less-frequently accessed memory modules by inserting periodic idle intervals. Memory gating reduces the power consumption of memory modules for which rate-aware placement alone does not create sufficient idleness. With gating, memory accesses to these modules become concentrated outside the idle intervals, creating the opportunity to use low-power states. However, because it delays memory accesses, memory gating impacts performance; power savings are higher and performance impact lower in workloads with lower memory access rates. Thus, in the YCSB workload at a medium transaction rate, memory gating reduced memory power by 26% while adding 0.25 ms (30%) of transaction latency, compared with DimmStore without gating. In the more memory-intensive TPC-C workload at low to medium transaction rates, gating can save 5% of memory power while adding 1.5 ms (60%) of transaction latency, compared with DimmStore without gating.
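    As a sketch of the rate-aware placement idea, the toy function below packs the most frequently accessed records into as few memory modules as possible, so the remaining modules stay idle long enough to exploit low-power states. Record sizes, capacities, and the greedy packing are illustrative assumptions, not DimmStore's actual layout logic.

```python
# Sketch of rate-aware data placement: order records by observed access
# frequency and fill modules hottest-first, concentrating traffic on a few
# modules and leaving the rest cold enough for deep low-power states.
def rate_aware_placement(records, module_capacity, n_modules):
    """records: list of (record_id, access_rate); returns module -> record ids."""
    hot_first = sorted(records, key=lambda r: r[1], reverse=True)
    placement = {m: [] for m in range(n_modules)}
    m = 0
    for rid, _rate in hot_first:
        if len(placement[m]) >= module_capacity:
            m += 1                      # spill to the next (colder) module
            if m == n_modules:
                raise ValueError("records exceed total capacity")
        placement[m].append(rid)
    return placement

recs = [("r%d" % i, rate) for i, rate in enumerate([900, 5, 700, 2, 40, 1, 300, 0])]
print(rate_aware_placement(recs, module_capacity=3, n_modules=3))
```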