9 research outputs found

    On-Demand Solution to Minimize I-Cache Leakage Energy with Maintaining Performance

    Full text link

    Exploiting temporal locality in drowsy cache policies

    Full text link
    Technology projections indicate that static power will become a major concern in future generations of high-performance microprocessors. Caches represent a significant percentage of the overall microprocessor die area, so recent research has concentrated on reducing the leakage current dissipated by caches. Techniques to control leakage can be classified as non-state-preserving or state-preserving: non-state-preserving techniques power off selected cache lines, while state-preserving techniques place selected lines into a low-power state. Drowsy caches are a recently proposed state-preserving technique. To keep performance overhead low, drowsy caches must be very selective about which cache lines are moved to the drowsy state. Past research on cache organization has focused on how best to exploit the temporal locality present in the data stream. In this paper we propose a novel drowsy cache policy called Reuse Most Recently used On (RMRO), which uses reuse information to trade off performance against energy consumption. Our proposal improves the hit ratio for drowsy lines by about 67%, while reducing power consumption by about 11.7% (assuming 70 nm technology) with respect to previously proposed drowsy cache policies.
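    As a rough illustration of how a reuse-aware drowsy policy operates, consider the minimal sketch below. The abstract does not spell out RMRO's exact update rules, so the window size, wake-up penalty, and data structures are assumptions chosen for illustration, not the paper's mechanism.

```python
# Minimal sketch of a reuse-aware drowsy policy (a hypothetical
# simplification; the abstract does not give RMRO's exact update rules).
# Every UPDATE_WINDOW accesses, lines that were not touched in the window
# are put into drowsy mode; touching a drowsy line first pays a penalty.

UPDATE_WINDOW = 2000   # accesses between drowsy sweeps (assumed value)
WAKEUP_PENALTY = 1     # extra cycles to restore full voltage (assumed value)

class Line:
    def __init__(self):
        self.drowsy = False
        self.touched = False   # accessed during the current window?

class DrowsyCache:
    def __init__(self, num_lines):
        self.lines = [Line() for _ in range(num_lines)]
        self.accesses = 0
        self.extra_cycles = 0  # accumulated wake-up overhead

    def access(self, index):
        line = self.lines[index]
        if line.drowsy:                      # wake the line before reading it
            self.extra_cycles += WAKEUP_PENALTY
            line.drowsy = False
        line.touched = True
        self.accesses += 1
        if self.accesses % UPDATE_WINDOW == 0:
            self._sweep()

    def _sweep(self):
        # Lines without recent use move to the low-power state; touch bits
        # are cleared so the next window starts fresh.
        for line in self.lines:
            if not line.touched:
                line.drowsy = True
            line.touched = False
```

    Lines touched within a window avoid the wake-up penalty, while cold lines spend the next window in the low-leakage state.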

    Architectural Improvements for Low-power Processor Cache

    Get PDF
    Doctoral dissertation (Ph.D.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2013. Advisor: 전주식.
    Microprocessors are continually being improved to increase execution performance and reduce energy consumption. In most cases the two are in a trade-off relationship: reducing energy consumption lowers execution performance. This dissertation proposes two low-power methods based on architectural improvements to the processor cache: one that reduces dynamic energy without affecting execution performance, and one that combines several energy-reduction techniques, each of which individually carries a significant performance cost, while minimizing their combined overhead. First, I propose the Selective Word Reading (SWR) technique, which reduces the dynamic energy of the processor cache with no loss of performance. The technique exploits the fact that storage unit sizes differ between levels of the memory hierarchy: part of the address is used during the cache access so that only the necessary part of a block is activated. For a 32 kB four-way set-associative L1 cache with a 32 B block size and four mats per sub-bank, the SWR cache saves 67.5% of the dynamic energy, or 56.75% of the total energy when leakage is included, with no performance degradation and negligible area overhead. For a 1 MB 16-way set-associative L2 cache with a 64 B block size and eight mats per sub-bank, the SWR cache saves 27.1% of the dynamic energy. Second, I propose the Sequential-SWR-Drowsy cache with the Word Filter (SSDF), which reduces the total energy of the processor cache by combining a sequential cache, selective word reading, a filter cache, and a drowsy cache. Each of these techniques has a significant impact on execution performance; the proposed combination minimizes the performance overhead while preserving the energy savings. The filter cache reduces dynamic energy consumption by placing a small storage structure between the processor registers and the L1 cache. Unlike when the filter cache was first proposed, higher clock frequencies have increased the L1 cache access time in cycles, so a filter cache now brings a performance gain as well as an energy saving. This gain makes it possible to additionally apply techniques such as the sequential cache and the drowsy cache, which were previously avoided because of their performance cost. The sequential cache delays operation of the data array until the tag array has determined whether the access is a hit: the cache access time grows by the tag-array access time, but only the hit way needs to be driven, reducing the dynamic energy of the data array. When combined with a filter cache, accessing the relatively low-power tag array in parallel with the filter cache hides the tag-array access time that a conventional sequential cache loses. The drowsy cache supplies SRAM cells with two operating voltages, placing each cell in either normal mode (high voltage) or drowsy mode (low voltage), and lowers the voltage of rarely accessed cells to reduce the static energy consumption of the cache. Accessing a cell in drowsy mode requires raising its voltage back to normal, which adds access time. In this dissertation, the wake-up signal that restores the cell voltage is sent in parallel with the filter cache and L1 tag-array accesses, preventing the performance loss of a conventional drowsy cache. Simulating the combination of the drowsy cache, filter cache, sequential cache, and selective word reading yields a 73.4% reduction in dynamic energy, an 83.2% reduction in static energy, and a 71.7% reduction in total energy across the entire processor cache.
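    To make the selective word reading idea concrete, the sketch below treats the word-offset bits of the address as a mat-enable select, so only the mat holding the requested word is driven. The bit layout and names are illustrative assumptions chosen to match the 32 kB, four-way, 32 B-block, four-mats-per-sub-bank L1 configuration above, not the dissertation's actual design.

```python
# Illustrative sketch of selective word reading (SWR): the word-offset bits
# of the address enable only the mat that holds the requested word, instead
# of driving the whole 32 B block. The bit layout is an assumption matching
# the 32 kB / 4-way / 32 B-block / 4-mats-per-sub-bank L1 described above.

BLOCK_SIZE = 32          # bytes per L1 block
MATS_PER_SUBBANK = 4     # mats that together hold one block
MAT_WIDTH = BLOCK_SIZE // MATS_PER_SUBBANK   # 8 bytes per mat

def mat_select(address):
    """Return which mat to enable for this access; the others stay idle."""
    block_offset = address % BLOCK_SIZE        # low 5 bits of the address
    return block_offset // MAT_WIDTH           # 2-bit mat-enable select

def enabled_mats(address):
    """One-hot enable vector for the data array: only 1 of 4 mats is driven."""
    return [int(m == mat_select(address)) for m in range(MATS_PER_SUBBANK)]

# Example: a load from address 0x1234 (block offset 20) drives only mat 2.
assert enabled_mats(0x1234) == [0, 0, 1, 0]
```

    Because the enable select comes straight from address bits already available at access time, this filtering adds no latency, which is consistent with the dissertation's claim of no performance degradation.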

    A low-power cache system for high-performance processors

    Get PDF
    Degree system: new; Report number: Kou 3439 (甲3439号); Degree type: Doctor of Engineering; Date conferred: 12-Sep-2011; Waseda University degree record number: Shin 576 (新576)

    Dynamic cache reconfiguration based techniques for improving cache energy efficiency

    Get PDF
    Modern multicore processors employ large last-level caches; for example, Intel's E7-8800 processor uses a 24 MB L3 cache. Further, with each CMOS technology generation, leakage energy has been increasing dramatically, and it is expected to become a major source of energy dissipation, especially in last-level caches (LLCs). Conventional cache energy-saving schemes either aim at saving dynamic energy or are based on properties specific to first-level caches, and thus have limited utility for last-level caches. Several other techniques require offline profiling or per-application tuning and hence are not suitable for production systems. In this research, we propose novel cache leakage energy-saving schemes for single-core and multicore systems, covering desktop, QoS, real-time, and server systems. We propose software-controlled, hardware-assisted techniques which use dynamic cache reconfiguration to configure the cache to the most energy-efficient configuration while keeping the performance loss bounded. To profile and test a large number of potential configurations, we utilize low-overhead microarchitectural components which can be easily integrated into modern processor chips. We adopt a system-wide approach to saving energy, ensuring that cache reconfiguration does not increase the energy consumption of other components of the processor. We have compared our techniques with state-of-the-art techniques and found that ours outperform them in energy efficiency. This research has important applications in improving the energy efficiency of higher-end embedded, desktop, and server processors and of multitasking systems. We have also proposed a performance estimation approach for efficient design space exploration and implemented a time-sampling-based simulation acceleration approach for full-system architectural simulators.
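    The selection step described above, choosing the most energy-efficient configuration while keeping performance loss bounded, can be sketched as an interval-based loop. The interface, the 3% bound, and the example numbers below are assumptions for illustration, not the thesis' actual algorithm.

```python
# Minimal sketch of the interval-based selection loop behind dynamic cache
# reconfiguration: each interval, pick the configuration with the lowest
# estimated energy whose estimated slowdown stays within a bound. The
# profiling numbers would come from the low-overhead hardware components
# mentioned above; here they are plain function arguments (assumed interface).

MAX_SLOWDOWN = 0.03   # bound on allowed performance loss (assumed: 3%)

def pick_configuration(profiles, baseline_time):
    """profiles: list of (config, est_energy, est_exec_time) tuples
    gathered for candidate cache configurations during the last interval."""
    best = None
    for config, energy, exec_time in profiles:
        slowdown = (exec_time - baseline_time) / baseline_time
        if slowdown <= MAX_SLOWDOWN:            # performance loss stays bounded
            if best is None or energy < best[1]:
                best = (config, energy)
    # Fall back to the baseline (full cache) if nothing meets the bound.
    return best[0] if best else "baseline"

# Example with made-up profiling data: the 12 MB configuration saves energy
# while staying within the 3% slowdown bound, so it is selected.
profiles = [("16-way, 24MB", 100.0, 1.00),
            ("8-way, 12MB",   62.0, 1.02),
            ("4-way, 6MB",    45.0, 1.08)]
print(pick_configuration(profiles, baseline_time=1.00))  # -> "8-way, 12MB"
```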

    Transparent management of scratchpad memories in shared memory programming models

    Get PDF
    Cache-coherent shared memory has traditionally been the favorite memory organization for chip multiprocessors thanks to its high programmability. In this organization the cache hierarchy is in charge of moving data and keeping it coherent across all the caches, enabling shared memory programming models in which the programmer does not need to carry out any data management operations. Unfortunately, performing all data management in hardware causes severe problems, the primary concerns being the power consumed by the caches and the amount of coherence traffic in the interconnection network. A good solution is to introduce ScratchPad Memories (SPMs) alongside the cache hierarchy, forming a hybrid memory hierarchy. SPMs are more power-efficient than caches and do not generate coherence traffic, but they degrade programmability. In particular, SPMs require the programmer to partition the data, program data transfers, and keep coherence between different copies of the data. A promising way to exploit the benefits of SPMs without harming programmability is to let programmers use shared memory programming models and automatically generate the code that manages the SPMs. Unfortunately, current compilers and runtime systems face serious limitations in automatically generating code for hybrid memory hierarchies from shared memory programming models. This thesis proposes to transparently manage the SPMs of hybrid memory hierarchies in shared memory programming models. To achieve this goal, the thesis proposes a combination of hardware and compiler techniques to manage the SPMs in fork-join programming models, and a set of runtime system techniques to manage the SPMs in task programming models. The proposed techniques make it possible to program hybrid memory hierarchies with these two well-known and easy-to-use forms of shared memory programming models, capitalizing on the benefits of hybrid memory hierarchies in power consumption and network traffic without harming programmability. The first contribution of this thesis is a hardware/software co-designed coherence protocol to transparently manage the SPMs of hybrid memory hierarchies in fork-join programming models. The solution allows the compiler to always generate code to manage the SPMs with tiling software caches, even in the presence of unknown memory aliasing hazards between memory references to the SPMs and to the cache hierarchy. On the software side, the compiler generates a special form of memory instruction for memory references with possible aliasing hazards. On the hardware side, the special memory instructions are diverted to the correct copy of the data using a set of directories that track which data is mapped to the SPMs. The second contribution of this thesis is a set of runtime system techniques to manage the SPMs of hybrid memory hierarchies in task programming models. These techniques exploit the characteristics of task programming models to map the data specified in the task dependences to the SPMs. Different policies are proposed to mitigate the communication costs of the data transfers, overlapping them with other execution phases such as the task scheduling phase or the execution of the previous task. The runtime system can also reduce the number of data transfers by using a task scheduler that exploits data locality in the SPMs.
    In addition, the proposed techniques are combined with mechanisms that reduce the impact of fine-grained tasks, such as hardware runtime systems or large SPM sizes. The outcome of this thesis is that hybrid memory hierarchies can be programmed with fork-join and task programming models. Consequently, architectures with hybrid memory hierarchies can be exposed to the programmer as a shared memory multiprocessor, taking advantage of the benefits of the SPMs while maintaining the programming simplicity of shared memory programming models.
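    A minimal sketch of the task-model runtime idea described above: data named in a task's dependences is staged into the SPM before the task runs, transfers are started early so they can overlap with other work, and a transfer is skipped when a previous task already left the data resident, which is the locality the scheduler exploits. The Task and Spm types and the dma_* calls are illustrative stand-ins, not the thesis' actual runtime API.

```python
# Minimal sketch of SPM management for task programming models: dependence
# data is staged into the SPM before a task runs, copies overlap with the
# scheduling phase, and transfers are skipped on locality hits. All names
# here are illustrative assumptions.

def dma_start(region):
    """Stub standing in for an asynchronous copy of `region` into the SPM."""
    return region            # a real runtime would return a transfer handle

def dma_wait(handle):
    """Stub: a real runtime would block until the transfer completes."""

class Task:
    def __init__(self, func, ins, outs):
        self.func, self.ins, self.outs = func, ins, outs

class Spm:
    def __init__(self):
        self.resident = set()            # data regions currently valid in the SPM

    def stage_in(self, region):
        if region in self.resident:
            return None                  # locality hit: no transfer needed
        self.resident.add(region)
        return dma_start(region)         # start the copy as early as possible

def run_task(spm, task):
    # Start every inbound transfer first so the copies overlap with each
    # other (and, in a real runtime, with the scheduling phase), then block
    # only once, right before the task body executes.
    handles = [h for r in task.ins if (h := spm.stage_in(r)) is not None]
    for h in handles:
        dma_wait(h)
    task.func()
    spm.resident.update(task.outs)       # outputs are now live in the SPM
```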

    Leakage Energy Management in Cache Hierarchies

    No full text
    Energy management is important for a spectrum of systems ranging from high-performance architectures to low-end mobile and embedded devices. With increasing transistor counts, smaller feature sizes, and lower supply and threshold voltages, the focus of energy optimization is shifting from dynamic to leakage energy. Leakage energy is of particular concern in dense cache memories, which form a major portion of the transistor budget. In this work, we present several architectural techniques that exploit data duplication across the different levels of the cache hierarchy. Specifically, we apply both state-preserving (data-retaining) and state-destroying leakage control mechanisms to L2 subblocks when their data also exist in L1. Using a set of media and array-dominated applications, we demonstrate the effectiveness of the proposed techniques through cycle-accurate simulation. We also compare our schemes with the previously proposed cache decay policy; this comparison indicates that one of our schemes generates results competitive with cache decay.
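    A minimal sketch of the duplication-based idea follows, under the assumption that the controller tracks which L2 subblocks are duplicated in L1; the structure and hook names are illustrative, not the paper's hardware interface.

```python
# Minimal sketch of duplication-based leakage control: once a subblock's
# data is filled into L1, its L2 copy is redundant, so the L2 subblock can
# be placed in a state-preserving (drowsy) or state-destroying (power-gated)
# mode; the copy is restored when the L1 line is evicted.

STATE_PRESERVING = "drowsy"   # data retained at a low standby voltage
STATE_DESTROYING = "off"      # supply gated off, data lost

class L2Subblock:
    def __init__(self, data=None):
        self.data = data
        self.mode = "active"

def on_l1_fill(subblock, policy):
    # L1 now serves every access to this data, so leakage in the duplicated
    # L2 subblock can be cut with the chosen mechanism.
    subblock.mode = policy
    if policy == STATE_DESTROYING:
        subblock.data = None             # the L2 copy is deliberately lost

def on_l1_evict(subblock, evicted_data):
    # On eviction the L2 copy must become usable again; if it was destroyed,
    # the data coming out of L1 is reinstalled.
    if subblock.mode == STATE_DESTROYING:
        subblock.data = evicted_data
    subblock.mode = "active"
```

    The trade-off between the two modes is the one the abstract names: state-destroying eliminates more leakage but forces a reinstall on every L1 eviction, while state-preserving keeps the data at the cost of a small standby current.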