
    On the suitability of time-randomized processors for secure and reliable high-performance computing

    Time-randomized processor (TRP) architectures have been shown to be one of the most promising approaches for taming the complexity of timing analysis on highly complex processor architectures in safety-related real-time systems. With TRPs, the timing analysis step relies mainly on collecting measurements of the task under analysis rather than on complex timing models of the processor. Additionally, the randomization techniques applied in TRPs provide increased reliability and security. In this thesis, we elaborate on the reliability and security properties of TRPs and on the suitability of extending this processor architecture design paradigm to the high-performance computing domain.
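
    Since the abstract hinges on measurement-based timing analysis, a minimal sketch may help. The snippet below is only an illustration under stated assumptions: run_task_once and pwcet_estimate are hypothetical names, the timing data is synthetic, and industrial measurement-based probabilistic timing analysis fits an extreme value distribution rather than taking a raw empirical quantile as done here.

        import random

        def run_task_once() -> float:
            """Stand-in for executing the task under analysis on a
            time-randomized processor: random cache/arbitration placement
            makes execution time vary from run to run (synthetic data)."""
            return 100.0 + random.expovariate(0.5)

        def pwcet_estimate(samples: list[float], exceedance: float = 1e-3) -> float:
            """Crude probabilistic WCET bound: the empirical
            (1 - exceedance) quantile of the observed execution times."""
            ordered = sorted(samples)
            index = min(len(ordered) - 1, int((1.0 - exceedance) * len(ordered)))
            return ordered[index]

        samples = [run_task_once() for _ in range(10_000)]
        print(f"pWCET at 1e-3 exceedance: {pwcet_estimate(samples):.1f} cycles")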

    High-Performance low-vcc in-order core

    Power density grows with each new technology node, requiring Vcc to scale, especially in mobile platforms where energy is critical. This paper presents a novel approach to decreasing Vcc while keeping the operating frequency high. Our mechanism is referred to as immediate read after write (IRAW) avoidance. We propose an implementation of the mechanism for an Intel® Silverthorne™ in-order core. Furthermore, we show that our mechanism can be adapted dynamically to provide the highest performance and lowest energy-delay product (EDP) at each Vcc level. Results show that IRAW avoidance increases operating frequency by 57% at 500 mV and 99% at 400 mV with negligible area and power overhead (below 1%), which translates into large speedups (48% at 500 mV and 90% at 400 mV) and EDP reductions (0.61× at 500 mV and 0.33× at 400 mV).
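
    The abstract does not detail the mechanism, so the following is only a guess at its spirit rather than the paper's method: if a cache entry written at low Vcc cannot be read reliably in the cycle(s) immediately after the write, a read that hits a just-written address can be served from recent write state instead of the SRAM array. The simulate function, the settle_cycles parameter, and the trace format are all hypothetical.

        def simulate(trace, settle_cycles=1):
            """trace: one (op, addr, data) tuple per cycle, op in {'R', 'W'}."""
            last_write = {}                      # addr -> (cycle written, data)
            for cycle, (op, addr, data) in enumerate(trace):
                if op == "W":
                    last_write[addr] = (cycle, data)
                elif addr in last_write and cycle - last_write[addr][0] <= settle_cycles:
                    wdata = last_write[addr][1]  # entry may not have settled yet
                    print(f"cycle {cycle}: IRAW on {addr:#x}, forward {wdata}")
                else:
                    print(f"cycle {cycle}: safe array read of {addr:#x}")

        simulate([("W", 0x40, 7), ("R", 0x40, None), ("R", 0x40, None)])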

    Gestión de contenidos en caches operando a bajo voltaje (Content management in caches operating at low voltage)

    The energy efficiency of on-chip caches can be improved by reducing their supply voltage (Vdd). However, Vdd scaling is limited to a minimum voltage, Vddmin, below which some SRAM (Static Random Access Memory) cells may not operate reliably. Block disabling (BD) is a microarchitectural technique that enables operation at very low voltages by disabling the cache entries that contain any unreliable cell, at the cost of reducing the effective cache capacity. It is used in last-level caches (LLC), where the potential savings are largest. However, for some applications, the increased consumption due to off-chip memory accesses does not compensate for the energy saved in the LLC. This work leverages resources already present in multiprocessors, namely the on-chip memory hierarchy and the coherence mechanism, to improve the performance of BD. Specifically, we propose to exploit the natural redundancy of an inclusive cache hierarchy to mitigate the performance loss caused by the reduced LLC capacity. We also propose a new content management policy that is aware of the existence of faulty cache entries: using reuse information, the replacement algorithm assigns operational cache entries to the blocks most likely to be referenced. The proposed techniques reduce MPKI by up to 36.4% with respect to block disabling, improving its performance by between 2% and 13%.
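
    As a rough illustration of the reuse-aware, fault-aware allocation idea described above, here is a minimal sketch under assumed data structures (the per-way descriptors and the choose_way helper are hypothetical, not the paper's algorithm): operational entries are reserved for blocks with observed reuse, and a no-reuse block is bypassed rather than displacing a reused one.

        def choose_way(set_ways, incoming_reused: bool):
            """set_ways: per-way dicts with 'faulty' (unreliable cell at low
            Vdd) and 'reused' (block has been re-referenced) flags. Returns
            the way to allocate, or None to bypass the cache set."""
            operational = [i for i, w in enumerate(set_ways) if not w["faulty"]]
            for i in operational:
                if not set_ways[i]["reused"]:
                    return i                 # displace a block with no reuse
            # All operational ways hold reused blocks: displace one only if
            # the incoming block has itself shown reuse; otherwise bypass.
            return operational[0] if incoming_reused else None

        ways = [{"faulty": True, "reused": False},
                {"faulty": False, "reused": True},
                {"faulty": False, "reused": False}]
        print(choose_way(ways, incoming_reused=False))  # -> 2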

    Two-Layer Error Control Codes Combining Rectangular and Hamming Product Codes for Cache Error

    We propose a novel two-layer error control code that combines the error detection capability of rectangular codes with the error correction capability of Hamming product codes in an efficient way, in order to increase cache error resilience for many-core systems while maintaining low power, area, and latency overhead. Exploiting the low latency and overhead of rectangular codes and the high error control capability of Hamming product codes, the two-layer code employs simple rectangular codes on each cache line to detect errors, loading the extra Hamming product code check bits only when an error is detected, thus enabling reliable large-scale cache operation. Analysis and experiments are conducted to evaluate the cache fault tolerance of various existing solutions and of the proposed approach. The results show that the proposed approach can increase Mean-Error-To-Failure (METF) and Mean-Time-To-Failure (MTTF) by up to 2.8×, reduce storage overhead by over 57%, and increase instructions per cycle (IPC) by up to 7% compared to a complex four-way 4EC5ED code; and that it increases METF and MTTF by up to 133×, reduces storage overhead by over 11%, and achieves similar IPC compared to a simple eight-way single-error-correcting double-error-detecting (SECDED) code. The cost of the proposed approach is no more than 4% external memory access overhead.
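
    The key idea, stronger check bits fetched lazily only on detection, can be shown with a toy stand-in. The sketch below is not the paper's code: a single parity bit plays the role of the fast rectangular detection layer, and a Hamming-style position locator (the XOR of the 1-based positions of all set bits) stands in for the product-code correction layer, fetched only when the parity check fails.

        from functools import reduce

        def checks(bits):
            """Layer-1 parity (kept with the line) plus a layer-2 locator
            (stored apart, fetched only on a detected error)."""
            parity = reduce(lambda a, b: a ^ b, bits, 0)
            locator = reduce(lambda a, p: a ^ p,
                             (i + 1 for i, b in enumerate(bits) if b), 0)
            return parity, locator

        def read_line(bits, parity, fetch_locator):
            if reduce(lambda a, b: a ^ b, bits, 0) == parity:
                return bits                  # common case: cheap check only
            stored = fetch_locator()         # rare case: load extra check bits
            current = reduce(lambda a, p: a ^ p,
                             (i + 1 for i, b in enumerate(bits) if b), 0)
            bits[(stored ^ current) - 1] ^= 1  # flip the single bad bit back
            return bits

        line = [1, 0, 1, 1, 0, 0, 1, 0]
        parity, locator = checks(line)
        line[3] ^= 1                         # inject a single-bit fault
        print(read_line(line, parity, lambda: locator))  # corrected line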

    Concertina: Squeezing in cache content to operate at near-threshold voltage

    Scaling the supply voltage to values near the threshold voltage allows a dramatic decrease in the power consumption of processors; however, the lower the voltage, the higher the sensitivity to process variation, and hence the lower the reliability. Large SRAM structures, like the last-level cache (LLC), are extremely vulnerable to process variation because they are aggressively sized to satisfy high density requirements. In this paper, we propose Concertina, an LLC designed to enable reliable operation at low voltages with conventional SRAM cells. Based on the observation that for many applications the LLC contains large amounts of null data, Concertina compresses cache blocks so that they can be allocated to cache entries with faulty cells, enabling use of 100 percent of the LLC capacity. To distribute blocks among cache entries, Concertina implements a compression- and fault-aware insertion/replacement policy that reduces the LLC miss rate. Concertina reaches the performance of an ideal system implementing an LLC that does not suffer from parameter variation, with a modest storage overhead. Specifically, performance degrades by less than 2 percent even when using small SRAM cells, which implies over 90 percent of cache entries having defective cells; this represents a notable improvement on previously proposed techniques.
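
    The allocation test at the heart of this idea is simple enough to sketch. The snippet below is an illustration under stated assumptions, not Concertina's actual implementation: a 64-byte block split into 8-byte subblocks, null (all-zero) subblocks elided by the compressor, and an entry usable whenever its operational subblock slots can hold every non-null subblock.

        SUBBLOCK = 8                          # assumed 8-byte subblocks

        def compress(block: bytes):
            """Split a block into subblocks and elide the null (all-zero)
            ones; the null map is the small per-entry metadata."""
            subs = [block[i:i + SUBBLOCK] for i in range(0, len(block), SUBBLOCK)]
            null_map = [not any(s) for s in subs]
            payload = [s for s, null in zip(subs, null_map) if not null]
            return null_map, payload

        def fits(payload, faulty_map) -> bool:
            """A compressed block fits an entry if its non-null subblocks
            can all be placed in operational (non-faulty) slots."""
            return len(payload) <= faulty_map.count(False)

        null_map, payload = compress(bytes(48) + bytes(range(16)))  # 6 null subblocks
        print(fits(payload, [True, True, False, False, True, False, True, True]))  # True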

    Architectural Improvements for Low-power Processor Cache

    Doctoral dissertation (Ph.D.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2013. Advisor: 전주식. Microprocessor research aims to improve execution performance and reduce energy consumption. In most cases there is a trade-off between the two: reducing energy consumption lowers execution performance. This dissertation proposes two low-power improvements to the architecture of the processor cache: one that lowers dynamic energy without affecting execution performance, and one that combines several energy-reduction techniques, each of which individually carries a significant performance cost, while minimizing their combined overhead. First, I propose the selective word reading (SWR) technique, which reduces the dynamic energy of the processor cache with no loss of performance. The technique exploits the differences in storage unit size across the storage hierarchy: in an SWR cache, part of the address is used during the access so that only the necessary part of a block is activated. For a 32 kB four-way set-associative L1 cache with a 32 B block size and four mats per sub-bank, the SWR cache saves 67.5% of dynamic energy (56.75% when leakage energy is included) with no performance degradation and negligible area cost; for a 1 MB 16-way set-associative L2 cache with a 64 B block size and eight mats per sub-bank, it saves 27.1% of dynamic energy. Second, I propose the sequential, selective-word-reading drowsy cache with a word filter (SSDF), which reduces the total energy of the processor cache by combining a sequential cache, selective word reading, a filter cache, and a drowsy cache while minimizing their performance overhead. The filter cache reduces dynamic energy by placing a small storage structure between the processor registers and the L1 cache. Unlike when it was first proposed, rising clock frequencies have increased L1 cache access times, so a filter cache now improves performance as well as energy, and it also makes it possible to add techniques such as the sequential cache and the drowsy cache, which were previously avoided because of their performance cost. The sequential cache delays the data array access until the tag array reports a hit: the access time grows by the tag lookup latency, but only the hit way is driven, reducing the dynamic energy of the data array. When combined with a filter cache, the relatively low-power tag array can be accessed in parallel with the filter cache, hiding the tag lookup time. The drowsy cache supplies two operating voltages to the SRAM cells, a normal mode (high voltage) and a drowsy, low-power mode (low voltage), and lowers the voltage of rarely accessed cells to reduce static energy; accessing a drowsy cell requires raising its voltage first, which adds access time. In this design, the wake-up signal that raises the cell voltage is issued in parallel with the filter cache and L1 tag array accesses, avoiding the performance loss of a conventional drowsy cache. Simulations combining all four techniques show a 73.4% reduction in dynamic energy, an 83.2% reduction in static energy, and a 71.7% reduction in total processor cache energy. In summary, the extra latency introduced by the drowsy cache is hidden using the filter cache and the sequential cache, and selective word reading, which exploits storage unit size differences, completes the low-power processor cache design.
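
    Since the abstract combines several access-ordering tricks, a small energy back-of-the-envelope may help. The sketch below uses made-up sizes and weights (E_TAG and E_DATA_WORD are illustrative constants, not figures from the dissertation) to show why a sequential access that drives only the hit way, combined with selective word reading that activates only the requested word, consumes far less data array energy than a conventional parallel full-block read.

        WAYS, WORDS_PER_BLOCK = 4, 8
        E_TAG, E_DATA_WORD = 1.0, 2.0        # illustrative energy units

        def access_energy(parallel: bool, selective: bool) -> float:
            """Energy of one L1 hit: all tag ways are probed either way;
            the data array cost depends on how many ways and words are
            actually driven."""
            data_ways = WAYS if parallel else 1   # sequential: hit way only
            words = 1 if selective else WORDS_PER_BLOCK
            return WAYS * E_TAG + data_ways * words * E_DATA_WORD

        print("conventional parallel read:", access_energy(True, False))   # 68.0
        print("sequential + selective word:", access_energy(False, True))  # 6.0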

    Spare Block Cache Architecture to Enable Low-Voltage Operation

    Power consumption is a major concern for modern processors, and voltage scaling is one of the most effective mechanisms to reduce it. However, voltage scaling is limited by large memory structures, such as caches, where many cells can fail at low-voltage operation. As a result, voltage scaling is bounded by a minimum voltage (Vccmin) below which the processor may not operate reliably. Researchers have proposed architectural mechanisms, error detection and correction techniques, and circuit solutions to allow the cache to operate reliably at low voltages. Architectural solutions reduce cache capacity at low voltages at the expense of logic complexity. Circuit solutions change the SRAM cell organization and have the disadvantage of reducing the cache capacity (for the same area) even when the system runs at high voltage. Error detection and correction mechanisms use error correction codes (ECC) to keep cache operation reliable at low voltage but have the disadvantage of increasing cache access time. In this thesis, we propose a novel architectural technique that uses spare cache blocks to back up a set-associative cache at low voltage. In our mechanism, we perform memory tests at low voltage to detect errors in all cache lines and tag them as faulty or fault-free. We have designed shifter and adder circuits for our architecture and evaluated our design using the SimpleScalar simulator. We constructed a fault model for our design to find the cache set failure probability at low voltage. Our evaluation shows that, at 485 mV, our cache operates with a bit failure probability equivalent to that of a conventional cache operating at 782 mV. We have compared instructions per cycle (IPC), miss rates, and cache accesses of our design with a conventional cache operating at nominal voltage, and we have also compared our cache performance with a cache using the previously proposed Bit-Fix mechanism. Our results show that our spare cache mechanism is 15% more area-efficient than Bit-Fix. Our proposed approach provides a significant improvement in power and EPI (energy per instruction) over a conventional cache and Bit-Fix, at the expense of lower performance at high voltage.
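
    The remapping step lends itself to a short sketch. The class below is a minimal illustration under stated assumptions (SpareBlockCache and its fields are hypothetical names, and the real design uses shifter and adder circuits rather than a lookup table): after the low-voltage memory test, each line flagged faulty is redirected to a spare block.

        class SpareBlockCache:
            def __init__(self, n_lines, n_spares, faulty_lines):
                self.data = [bytes(64)] * n_lines
                self.spares = [bytes(64)] * n_spares
                # Built once from the low-voltage memory test results;
                # assumes enough spares, else the design must degrade.
                free = iter(range(n_spares))
                self.remap = {line: next(free) for line in faulty_lines}

            def lookup(self, line):
                """Redirect accesses to faulty lines to their spare block."""
                spare = self.remap.get(line)
                return self.spares[spare] if spare is not None else self.data[line]

        cache = SpareBlockCache(n_lines=512, n_spares=32, faulty_lines=[3, 97, 200])
        print(len(cache.lookup(97)))  # served transparently from a spare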

    Low Vccmin fault-tolerant cache with highly predictable performance

    Transistors per unit area double with every new technology node. However, electric field density and power demand grow if Vcc is not scaled; therefore, Vcc must be scaled in pace with new technology nodes to prevent excessive degradation and keep power demand within reasonable limits. Unfortunately, low-Vcc operation exacerbates the effect of variations and decreases noise and stability margins, increasing the likelihood of errors in SRAM memories such as caches. Those errors translate into performance loss and performance variation across different cores, which is especially undesirable in a multi-core processor. This paper presents (i) a novel scheme to tolerate high faulty-bit rates in caches by disabling only faulty subblocks, (ii) a dynamic address remapping scheme to reduce performance variation across different cores, which is key for performance predictability, and (iii) a comparison with state-of-the-art techniques for faulty-bit tolerance in caches. Results for some typical first-level data cache configurations show a 15% average performance increase and a reduction of the standard deviation from 3.13% to 0.55% when compared to cache line disabling schemes.
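
    Both contributions can be gestured at in a few lines. The sketch below is illustrative only (the set count, the 10% fault rate, and the XOR-key remapping are assumptions, not the paper's exact scheme): disabling at subblock granularity preserves most of a degraded line's capacity, and XOR-ing the set index with a per-core key keeps any one core from deterministically mapping its hot sets onto the most degraded sets.

        import random

        SETS, SUBBLOCKS = 64, 4
        random.seed(1)
        faulty = {(s, b): random.random() < 0.1        # assumed 10% fault rate
                  for s in range(SETS) for b in range(SUBBLOCKS)}

        def usable_subblocks(set_idx):
            """Subblock disabling: only the faulty subblocks are lost;
            the remaining subblocks of the line keep serving data."""
            return [b for b in range(SUBBLOCKS) if not faulty[(set_idx, b)]]

        def remap_set(set_bits, core_key):
            """Dynamic address remapping: a per-core XOR key spreads each
            core's accesses over degraded and healthy sets alike."""
            return (set_bits ^ core_key) % SETS

        print(usable_subblocks(remap_set(5, core_key=0x2A)))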
