49 research outputs found

    STT-RAM์„ ์ด์šฉํ•œ ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ์บ์‹œ ์„ค๊ณ„ ๊ธฐ์ˆ 

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2019. 2. ์ตœ๊ธฐ์˜.์ง€๋‚œ ์ˆ˜์‹ญ ๋…„๊ฐ„ '๋ฉ”๋ชจ๋ฆฌ ๋ฒฝ' ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์˜จ ์นฉ ์บ์‹œ์˜ ํฌ๊ธฐ๋Š” ๊พธ์ค€ํžˆ ์ฆ๊ฐ€ํ•ด์™”๋‹ค. ํ•˜์ง€๋งŒ ์ง€๊ธˆ๊นŒ์ง€ ์บ์‹œ์— ์ฃผ๋กœ ์‚ฌ์šฉ๋˜์–ด ์˜จ ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ˆ ์ธ SRAM์€ ๋‚ฎ์€ ์ง‘์ ๋„์™€ ๋†’์€ ๋Œ€๊ธฐ ์ „๋ ฅ ์†Œ๋ชจ๋กœ ์ธํ•ด ํฐ ์บ์‹œ๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ๋ฐ์—๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š๋‹ค. ์ด๋Ÿฌํ•œ SRAM์˜ ๋‹จ์ ์„ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด ๋” ๋†’์€ ์ง‘์ ๋„์™€ ๋‚ฎ์€ ๋Œ€๊ธฐ ์ „๋ ฅ์„ ์†Œ๋ชจํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ˆ ์ธ STT-RAM์œผ๋กœ SRAM์„ ๋Œ€์ฒดํ•˜๋Š” ๊ฒƒ์ด ์ œ์•ˆ๋˜์—ˆ๋‹ค. ํ•˜์ง€๋งŒ STT-RAM์€ ๋ฐ์ดํ„ฐ๋ฅผ ์“ธ ๋•Œ ๋งŽ์€ ์—๋„ˆ์ง€์™€ ์‹œ๊ฐ„์„ ์†Œ๋น„ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‹จ์ˆœํžˆ SRAM์„ STT-RAM์œผ๋กœ ๋Œ€์ฒดํ•˜๋Š” ๊ฒƒ์€ ์˜คํžˆ๋ ค ์บ์‹œ ์—๋„ˆ์ง€ ์†Œ๋น„๋ฅผ ์ฆ๊ฐ€์‹œํ‚จ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” STT-RAM์„ ์ด์šฉํ•œ ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ์บ์‹œ ์„ค๊ณ„ ๊ธฐ์ˆ ๋“ค์„ ์ œ์•ˆํ•œ๋‹ค. ์ฒซ ๋ฒˆ์งธ, ๋ฐฐํƒ€์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์—์„œ STT-RAM์„ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ๋ฐฐํƒ€์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ๋Š” ๊ณ„์ธต ๊ฐ„์— ์ค‘๋ณต๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์— ํฌํ•จ์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์™€ ๋น„๊ตํ•˜์—ฌ ๋” ํฐ ์œ ํšจ ์šฉ๋Ÿ‰์„ ๊ฐ–์ง€๋งŒ, ๋ฐฐํƒ€์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์—์„œ๋Š” ์ƒ์œ„ ๋ ˆ๋ฒจ ์บ์‹œ์—์„œ ๋‚ด๋ณด๋‚ด์ง„ ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜์œ„ ๋ ˆ๋ฒจ ์บ์‹œ์— ์จ์•ผ ํ•˜๋ฏ€๋กœ ๋” ๋งŽ์€ ์–‘์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์“ฐ๊ฒŒ ๋œ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฐํƒ€์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์˜ ํŠน์„ฑ์€ ์“ฐ๊ธฐ ํŠน์„ฑ์ด ๋‹จ์ ์ธ STT-RAM์„ ํ•จ๊ป˜ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์„ ์–ด๋ ต๊ฒŒ ํ•œ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์žฌ์‚ฌ์šฉ ๊ฑฐ๋ฆฌ ์˜ˆ์ธก์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” SRAM/STT-RAM ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ ๊ตฌ์กฐ๋ฅผ ์„ค๊ณ„ํ•˜์˜€๋‹ค. ๋‘ ๋ฒˆ์งธ, ๋น„ํœ˜๋ฐœ์„ฑ STT-RAM์„ ์ด์šฉํ•ด ์บ์‹œ๋ฅผ ์„ค๊ณ„ํ•  ๋•Œ ๊ณ ๋ คํ•ด์•ผ ํ•  ์ ๋“ค์— ๋Œ€ํ•ด ๋ถ„์„ํ•˜์˜€๋‹ค. STT-RAM์˜ ๋น„ํšจ์œจ์ ์ธ ์“ฐ๊ธฐ ๋™์ž‘์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ํ•ด๊ฒฐ๋ฒ•๋“ค์ด ์ œ์•ˆ๋˜์—ˆ๋‹ค. ๊ทธ์ค‘ ํ•œ ๊ฐ€์ง€๋Š” STT-RAM ์†Œ์ž๊ฐ€ ๋ฐ์ดํ„ฐ๋ฅผ ์œ ์ง€ํ•˜๋Š” ์‹œ๊ฐ„์„ ์ค„์—ฌ (ํœ˜๋ฐœ์„ฑ STT-RAM) ์“ฐ๊ธฐ ํŠน์„ฑ์„ ํ–ฅ์ƒํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. STT-RAM์— ์ €์žฅ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์žƒ๋Š” ๊ฒƒ์€ ํ™•๋ฅ ์ ์œผ๋กœ ๋ฐœ์ƒํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ €์žฅ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์•ˆ์ •์ ์œผ๋กœ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์˜ค๋ฅ˜ ์ •์ • ๋ถ€ํ˜ธ(ECC)๋ฅผ ์ด์šฉํ•ด ์ฃผ๊ธฐ์ ์œผ๋กœ ์˜ค๋ฅ˜๋ฅผ ์ •์ •ํ•ด์ฃผ์–ด์•ผ ํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” STT-RAM ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ ํœ˜๋ฐœ์„ฑ STT-RAM ์„ค๊ณ„ ์š”์†Œ๋“ค์— ๋Œ€ํ•ด ๋ถ„์„ํ•˜์˜€๊ณ  ์‹คํ—˜์„ ํ†ตํ•ด ํ•ด๋‹น ์„ค๊ณ„ ์š”์†Œ๋“ค์ด ์บ์‹œ ์—๋„ˆ์ง€์™€ ์„ฑ๋Šฅ์— ์ฃผ๋Š” ์˜ํ–ฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ๋งค๋‹ˆ์ฝ”์–ด ์‹œ์Šคํ…œ์—์„œ์˜ ๋ถ„์‚ฐ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ ๊ตฌ์กฐ๋ฅผ ์„ค๊ณ„ํ•˜์˜€๋‹ค. ๋‹จ์ˆœํžˆ ๊ธฐ์กด์˜ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ์™€ ๋ถ„์‚ฐ์บ์‹œ๋ฅผ ๊ฒฐํ•ฉํ•˜๋ฉด ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ์˜ ํšจ์œจ์„ฑ์— ํฐ ์˜ํ–ฅ์„ ์ฃผ๋Š” SRAM ํ™œ์šฉ๋„๊ฐ€ ๋‚ฎ์•„์ง„๋‹ค. ๋”ฐ๋ผ์„œ ๊ธฐ์กด์˜ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ ๊ตฌ์กฐ์—์„œ์˜ ์—๋„ˆ์ง€ ๊ฐ์†Œ๋ฅผ ๊ธฐ๋Œ€ํ•  ์ˆ˜ ์—†๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋ถ„์‚ฐ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ ๊ตฌ์กฐ์—์„œ SRAM ํ™œ์šฉ๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ๋Š” ๋‘ ๊ฐ€์ง€ ์ตœ์ ํ™” ๊ธฐ์ˆ ์ธ ๋ฑ…ํฌ-๋‚ด๋ถ€ ์ตœ์ ํ™”์™€ ๋ฑ…ํฌ๊ฐ„ ์ตœ์ ํ™” ๊ธฐ์ˆ ์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ๋ฑ…ํฌ-๋‚ด๋ถ€ ์ตœ์ ํ™”๋Š” highly-associative ์บ์‹œ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ฑ…ํฌ ๋‚ด๋ถ€์—์„œ ์“ฐ๊ธฐ ๋™์ž‘์ด ๋งŽ์€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์‚ฐ์‹œํ‚ค๋Š” ๊ฒƒ์ด๊ณ  ๋ฑ…ํฌ๊ฐ„ ์ตœ์ ํ™”๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์บ์‹œ ๋ฑ…ํฌ์— ์“ฐ๊ธฐ ๋™์ž‘์ด ๋งŽ์€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณ ๋ฅด๊ฒŒ ๋ถ„์‚ฐ์‹œํ‚ค๋Š” ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์ด๋‹ค.Over the last decade, the capacity of on-chip cache is continuously increased to mitigate the memory wall problem. However, SRAM, which is a dominant memory technology for caches, is not suitable for such a large cache because of its low density and large static power. One way to mitigate these downsides of the SRAM cache is replacing SRAM with a more efficient memory technology. Spin-Transfer Torque RAM (STT-RAM), one of the emerging memory technology, is a promising candidate for the alternative of SRAM. As a substitute of SRAM, STT-RAM can compensate drawbacks of SRAM with its non-volatility and small cell size. However, STT-RAM has poor write characteristics such as high write energy and long write latency and thus simply replacing SRAM to STT-RAM increases cache energy. To overcome those poor write characteristics of STT-RAM, this dissertation explores three different design techniques for energy-efficient cache using STT-RAM. The first part of the dissertation focuses on combining STT-RAM with exclusive cache hierarchy. Exclusive caches are known to provide higher effective cache capacity than inclusive caches by removing duplicated copies of cache blocks across hierarchies. However, in exclusive cache hierarchies, every block evicted from the upper-level cache is written back to the last-level cache regardless of its dirtiness thereby incurring extra write overhead. This makes it challenging to use STT-RAM for exclusive last-level caches due to its high write energy and long write latency. To mitigate this problem, we design an SRAM/STT-RAM hybrid cache architecture based on reuse distance prediction. The second part of the dissertation explores trade-offs in the design of volatile STT-RAM cache. Due to the inefficient write operation of STT-RAM, various solutions have been proposed to tackle this inefficiency. One of the proposed solutions is redesigning STT-RAM cell for better write characteristics at the cost of shortened retention time (i.e., volatile STT-RAM). Since the retention failure of STT-RAM has a stochastic property, an extra overhead of periodic scrubbing with error correcting code (ECC) is required to tolerate the failure. With an analysis based on analytic STT-RAM model, we have conducted extensive experiments on various volatile STT-RAM cache design parameters including scrubbing period, ECC strength, and target failure rate. The experimental results show the impact of the parameter variations on last-level cache energy and performance and provide a guideline for designing a volatile STT-RAM with ECC and scrubbing. The last part of the dissertation proposes Benzene, an energy-efficient distributed SRAM/STT-RAM hybrid cache architecture for manycore systems running multiple applications. It is based on the observation that a naive application of hybrid cache techniques to distributed caches in a manycore architecture suffers from limited energy reduction due to uneven utilization of scarce SRAM. We propose two-level optimization techniques: intra-bank and inter-bank. Intra-bank optimization leverages highly-associative cache design, achieving more uniform distribution of writes within a bank. Inter-bank optimization evenly balances the amount of write-intensive data across the banks.Abstract i Contents iii List of Figures vii List of Tables xi Chapter 1 Introduction 1 1.1 Exclusive Last-Level Hybrid Cache 2 1.2 Designing Volatile STT-RAM Cache 4 1.3 Distributed Hybrid Cache 5 Chapter 2 Background 9 2.1 STT-RAM 9 2.1.1 Thermal Stability 10 2.1.2 Read and Write Operation of STT-RAM 11 2.1.3 Failures of STT-RAM 11 2.1.4 Volatile STT-RAM 13 2.1.5 Related Work 14 2.2 Exclusive Last-Level Hybrid Cache 18 2.2.1 Cache Hierarchies 18 2.2.2 Related Work 19 2.3 Distributed Hybrid Cache 21 2.3.1 Prediction Hybrid Cache 21 2.3.2 Distributed Cache Partitioning 22 2.3.3 Related Work 23 Chapter 3 Exclusive Last-Level Hybrid Cache 27 3.1 Motivation 27 3.1.1 Exclusive Cache Hierarchy 27 3.1.2 Reuse Distance 29 3.2 Architecture 30 3.2.1 Reuse Distance Predictor 30 3.2.2 Hybrid Cache Architecture 32 3.3 Evaluation 34 3.3.1 Methodology 34 3.3.2 LLC Energy Consumption 35 3.3.3 Main Memory Energy Consumption 38 3.3.4 Performance 39 3.3.5 Area Overhead 39 3.4 Summary 39 Chapter 4 Designing Volatile STT-RAM Cache 41 4.1 Analysis 41 4.1.1 Retention Failure of a Volatile STT-RAM Cell 41 4.1.2 Memory Array Design 43 4.2 Evaluation 45 4.2.1 Methodology 45 4.2.2 Last-Level Cache Energy 46 4.2.3 Performance 51 4.3 Summary 52 Chapter 5 Distributed Hybrid Cache 55 5.1 Motivation 55 5.2 Architecture 58 5.2.1 Intra-Bank Optimization 59 5.2.2 Inter-Bank Optimization 63 5.2.3 Other Optimizations 67 5.3 Evaluation Methodology 69 5.4 Evaluation Results 73 5.4.1 Energy Consumption and Performance 73 5.4.2 Analysis of Intra-bank Optimization 76 5.4.3 Analysis of Inter-bank Optimization 78 5.4.4 Impact of Inter-Bank Optimization on Network Energy 79 5.4.5 Sensitivity Analysis 80 5.4.6 Implementation Overhead 81 5.5 Summary 82 Chapter 6 Conculsion 85 Bibliography 88 ์ดˆ๋ก 101Docto

    A survey of emerging architectural techniques for improving cache energy consumption

    Get PDF
    The search goes on for another ground breaking phenomenon to reduce the ever-increasing disparity between the CPU performance and storage. There are encouraging breakthroughs in enhancing CPU performance through fabrication technologies and changes in chip designs but not as much luck has been struck with regards to the computer storage resulting in material negative system performance. A lot of research effort has been put on finding techniques that can improve the energy efficiency of cache architectures. This work is a survey of energy saving techniques which are grouped on whether they save the dynamic energy, leakage energy or both. Needless to mention, the aim of this work is to compile a quick reference guide of energy saving techniques from 2013 to 2016 for engineers, researchers and students

    A Study on Performance and Power Efficiency of Dense Non-Volatile Caches in Multi-Core Systems

    Full text link
    In this paper, we present a novel cache design based on Multi-Level Cell Spin-Transfer Torque RAM (MLC STTRAM) that can dynamically adapt the set capacity and associativity to use efficiently the full potential of MLC STTRAM. We exploit the asymmetric nature of the MLC storage scheme to build cache lines featuring heterogeneous performances, that is, half of the cache lines are read-friendly, while the other is write-friendly. Furthermore, we propose to opportunistically deactivate ways in underutilized sets to convert MLC to Single-Level Cell (SLC) mode, which features overall better performance and lifetime. Our ultimate goal is to build a cache architecture that combines the capacity advantages of MLC and performance/energy advantages of SLC. Our experiments show an improvement of 43% in total numbers of conflict misses, 27% in memory access latency, 12% in system performance, and 26% in LLC access energy, with a slight degradation in cache lifetime (about 7%) compared to an SLC cache

    Compression-aware and performance-efficient insertion policies for long-lasting hybrid LLCs

    Get PDF
    Emerging non-volatile memory (NVM) technologies can potentially replace large SRAM memories such as the last-level cache (LLC). However, despite recent advances, NVMs suffer from higher write latency and limited write endurance. Recently, NVM-SRAM hybrid LLCs are proposed to combine the best of both worlds. Several policies have been proposed to improve the performance and lifetime of hybrid LLCs by intelligently steering the incoming LLC blocks into either the SRAM or NVM part, regarding the cache behavior of the LLC blocks and the SRAM/NVM device properties. However, these policies neither consider compressing the contents of the cache block nor using partially worn-out NVM cache blocks.This paper proposes new insertion policies for byte-level fault-tolerant hybrid LLCs that collaboratively optimize for lifetime and performance. Specifically, we leverage data compression to utilize partially defective NVM cache entries, thereby improving the LLC hit rate. The key to our approach is to guide the insertion policy by both the reuse properties of the block and the size resulting from its compression. A block is inserted in NVM only if it is a read-reuse block or its compressed size is lower than a threshold. It will be inserted in SRAM if the block is a write-reuse or its compressed size is greater than the threshold. We use set-dueling to tune the compression threshold at runtime. This compression threshold provides a knob to control the NVM write rate and, together with a rule-based mechanism, allows balancing performance and lifetime.Overall, our evaluation shows that, with affordable hardware overheads, the proposed schemes can nearly reach the performance of an SRAM cache with the same associativity while improving lifetime by 17ร— compared to a hybrid NVM-unaware LLC. Our proposed scheme outperforms the state-of-the-art insertion policies by 9% while achieving a comparative lifetime. The rule-based mechanism shows that by compromising, for instance, 1.1% and 1.9% performance, the NVM lifetime can be further increased by 28% and 44%, respectively.This work was partially funded by the HiPEAC collaboration grant 2020, the Center for Advancing Electronics Dresden (cfaed), the German Research Council (DFG) through the HetCIM project (502388442) under the Priority Program on โ€˜Disruptive Memory Technologiesโ€™ (SPP 2377), and from grants (1) PID2019-105660RB-C21 and PID2019-107255GB- C22/AEI/10.13039/501100011033 from Agencia Estatal de Investigaciรณn (AEI), and (2) gaZ: T5820R research group from Dept. of Science, University and Knowledge Society, Government of Aragon.Peer ReviewedPostprint (author's final draft

    Reuse Detector: improving the management of STT-RAM SLLCs

    Get PDF
    Various constraints of Static Random Access Memory (SRAM) are leading to consider new memory technologies as candidates for building on-chip shared last-level caches (SLLCs). Spin-Transfer Torque RAM (STT-RAM) is currently postulated as the prime contender due to its better energy efficiency, smaller die footprint and higher scalability. However, STT-RAM also exhibits some drawbacks, like slow and energy-hungry write operations that need to be mitigated before it can be used in SLLCs for the next generation of computers. In this work, we address these shortcomings by leveraging a new management mechanism for STT-RAM SLLCs. This approach is based on the previous observation that although the stream of references arriving at the SLLC of a Chip MultiProcessor (CMP) exhibits limited temporal locality, it does exhibit reuse locality, i.e. those blocks referenced several times manifest high probability of forthcoming reuse. As such, conventional STT-RAM SLLC management mechanisms, mainly focused on exploiting temporal locality, result in low efficient behavior. In this paper, we employ a cache management mechanism that selects the contents of the SLLC aimed to exploit reuse locality instead of temporal locality. Specifically, our proposal consists in the inclusion of a Reuse Detector (RD) between private cache levels and the STT-RAM SLLC. Its mission is to detect blocks that do not exhibit reuse, in order to avoid their insertion in the SLLC, hence reducing the number of write operations and the energy consumption in the STT-RAM. Our evaluation, using multiprogrammed workloads in quad-core, eight-core and 16-core systems, reveals that our scheme reports on average, energy reductions in the SLLC in the range of 37โ€“30%, additional energy savings in the main memory in the range of 6โ€“8% and performance improvements of 3% (quad-core), 7% (eight-core) and 14% (16-core) compared with an STT-RAM SLLC baseline where no RD is employed. More importantly, our approach outperforms DASCA, the state-of-the-art STT-RAM SLLC management, reportingโ€”depending on the specific scenario and the kind of applications usedโ€”SLLC energy savings in the range of 4โ€“11% higher than those of DASCA, delivering higher performance in the range of 1.5โ€“14% and additional improvements in DRAM energy consumption in the range of 2โ€“9% higher than DASCA.Peer ReviewedPostprint (author's final draft

    Reuse Detector: Improving the management of STT-RAM SLLCs

    Get PDF
    Various constraints of Static Random Access Memory (SRAM) are leading to consider new memory technologies as candidates for building on-chip shared last-level caches (SLLCs). Spin-Transfer Torque RAM (STT-RAM) is currently postulated as the prime contender due to its better energy efficiency, smaller die footprint and higher scalability. However, STT-RAM also exhibits some drawbacks, like slow and energy-hungry write operations that need to be mitigated before it can be used in SLLCs for the next generation of computers. In this work, we address these shortcomings by leveraging a new management mechanism for STT-RAM SLLCs. This approach is based on the previous observation that although the stream of references arriving at the SLLC of a Chip MultiProcessor (CMP) exhibits limited temporal locality, it does exhibit reuse locality, i.e. those blocks referenced several times manifest high probability of forthcoming reuse. As such, conventional STT-RAM SLLC management mechanisms, mainly focused on exploiting temporal locality, result in low efficient behavior. In this paper, we employ a cache management mechanism that selects the contents of the SLLC aimed to exploit reuse locality instead of temporal locality. Specifically, our proposal consists in the inclusion of a Reuse Detector (RD) between private cache levels and the STT-RAM SLLC. Its mission is to detect blocks that do not exhibit reuse, in order to avoid their insertion in the SLLC, hence reducing the number of write operations and the energy consumption in the STT-RAM. Our evaluation, using multiprogrammed workloads in quad-core, eight-core and 16-core systems, reveals that our scheme reports on average, energy reductions in the SLLC in the range of 37โ€“30%, additional energy savings in the main memory in the range of 6โ€“8% and performance improvements of 3% (quad-core), 7% (eight-core) and 14% (16-core) compared with an STT-RAM SLLC baseline where no RD is employed. More importantly, our approach outperforms DASCA, the state-of-the-art STT-RAM SLLC management, reportingโ€”depending on the specific scenario and the kind of applications usedโ€”SLLC energy savings in the range of 4โ€“11% higher than those of DASCA, delivering higher performance in the range of 1.5โ€“14% and additional improvements in DRAM energy consumption in the range of 2โ€“9% higher than DASCA

    Computing with Spintronics: Circuits and architectures

    Get PDF
    This thesis makes the following contributions towards the design of computing platforms with spintronic devices. 1) It explores the use of spintronic memories in the design of a domain-specific processor for an emerging class of data-intensive applications, namely recognition, mining and synthesis (RMS). Two different spintronic memory technologies โ€” Domain Wall Memory (DWM) and STT-MRAM โ€” are utilized to realize the different levels in the memory hierarchy of the domain-specific processor, based on their respective access characteristics. Architectural tradeoffs created by the use of spintronic memories are analyzed. The proposed design achieves 1.5X-4X improvements in energy-delay product compared to a CMOS baseline. 2) It describes the first attempt to use DWM in the cache hierarchy of general-purpose processors. DWM promises unparalleled density by packing several bits of data into each bit-cell. TapeCache, the proposed DWM-based cache architecture, utilizes suitable circuit and architectural optimizations to address two key challenges (i) the high energy and latency requirement of write operations and (ii) the need for shift operations to access the data stored in each DWM bit-cell. At the circuit level, DWM bit-cells that are tailored to the distinct design requirements of different levels in the cache hierarchy are proposed. At the architecture level, TapeCache proposes suitable cache organization and management policies to alleviate the performance impact of shift operations required to access data stored in DWM bit-cells. TapeCache achieves more than 7X improvements in both cache area and energy with virtually identical performance compared to an SRAM-based cache hierarchy. 3) It investigates the design of the on-chip memory hierarchy of general-purpose graphics processing units (GPGPUs)โ€”massively parallel processors that are optimized for data-intensive high-throughput workloadsโ€”using DWM. STAG, a high density, energy-efficient Spintronic- Tape Architecture for GPGPU cache hierarchies is described. STAG utilizes different DWM bit-cells to realize different memory arrays in the GPGPU cache hierarchy. To address the challenge of high access latencies due to shifts, STAG predicts upcoming cache accesses by leveraging unique characteristics of GPGPU architectures and workloads, and prefetches data that are both likely to be accessed and require large numbers of shift operations. STAG achieves 3.3X energy reduction and 12.1% performance improvement over CMOS SRAM under iso-area conditions. 4) While the potential of spintronic devices for memories is widely recognized, their utility in realizing logic is much less clear. The thesis presents Spintastic, a new paradigm that utilizes Stochastic Computing (SC) to realize spintronic logic. In SC, data is encoded in the form of pseudo-random bitstreams, such that the probability of a \u271\u27 in a bitstream corresponds to the numerical value that it represents. SC can enable compact, low-complexity logic implementations of various arithmetic functions. Spintastic establishes the synergy between stochastic computing and spin-based logic by demonstrating that they mutually alleviate each other\u27s limitations. On the one hand, various building blocks of SC, which incur significant overheads in CMOS implementations, can be efficiently realized by exploiting the physical characteristics of spin devices. On the other hand, the reduced logic complexity and low logic depth of SC circuits alleviates the shortcomings of spintronic logic. Based on this insight, the design of spin-based stochastic arithmetic circuits, bitstream generators, bitstream permuters and stochastic-to-binary converter circuits are presented. Spintastic achieves 7.1X energy reduction over CMOS implementations for a wide range of benchmarks from the image processing, signal processing, and RMS application domains. 5) In order to evaluate the proposed spintronic designs, the thesis describes various device-to-architecture modeling frameworks. Starting with devices models that are calibrated to measurements, the characteristics of spintronic devices are successively abstracted into circuit-level and architectural models, which are incorporated into suitable simulation frameworks. (Abstract shortened by UMI.

    Software Techniques for Energy Efficient Memories

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Software-Oriented Data Access Characterization for Chip Multiprocessor Architecture Optimizations

    Get PDF
    The integration of an increasing amount of on-chip hardware in Chip-Multiprocessors (CMPs) poses a challenge of efficiently utilizing the on-chip resources to maximize performance. Prior research proposals largely rely on additional hardware support to achieve desirable tradeoffs. However, these purely hardware-oriented mechanisms typically result in more generic but less efficient approaches. A new trend is designing adaptive systems by exploiting and leveraging application-level information. In this work a wide range of applications are analyzed and remarkable data access behaviors/patterns are recognized to be useful for architectural and system optimizations. In particular, this dissertation work introduces software-based techniques that can be used to extract data access characteristics for cross-layer optimizations on performance and scalability. The collected information is utilized to guide cache data placement, network configuration, coherence operations, address translation, memory configuration, etc. In particular, an approach is proposed to classify data blocks into different categories to optimize an on-chip coherent cache organization. For applications with compile-time deterministic data access localities, a compiler technique is proposed to determine data partitions that guide the last level cache data placement and communication patterns for network configuration. A page-level data classification is also demonstrated to improve address translation performance. The successful utilization of data access characteristics on traditional CMP architectures demonstrates that the proposed approach is promising and generic and can be potentially applied to future CMP architectures with emerging technologies such as the Spin-transfer torque RAM (STT-RAM)

    HOPE: Holistic STT-RAM Architecture Exploration Framework for Future Cross-Platform Analysis

    Full text link
    Spin Transfer Torque Random Access Memory (STT-RAM) is an emerging Non-Volatile Memory (NVM) technology that has garnered attention to overcome the drawbacks of conventional CMOS-based technologies. However, such technologies must be evaluated before deployment under real workloads and architecture. But there is a lack of available open-source STT-RAM-based system evaluation framework, which hampers research and experimentation and impacts the adoption of STT- RAM in a system. This paper proposes a novel, extendable STT-RAM memory controller design integrated inside the gem5 simulator. Our framework enables understanding various aspects of STT-RAM, i.e., power, delay, clock cycles, energy, and system throughput. We will open-source our HOPE framework, which will fuel research and aid in accelerating the development of future system architectures based on STT-RAM. It will also facilitate the user for further tool enhancement
    corecore