64 research outputs found

    STT-RAM์„ ์ด์šฉํ•œ ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ์บ์‹œ ์„ค๊ณ„ ๊ธฐ์ˆ 

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2019. 2. ์ตœ๊ธฐ์˜.์ง€๋‚œ ์ˆ˜์‹ญ ๋…„๊ฐ„ '๋ฉ”๋ชจ๋ฆฌ ๋ฒฝ' ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์˜จ ์นฉ ์บ์‹œ์˜ ํฌ๊ธฐ๋Š” ๊พธ์ค€ํžˆ ์ฆ๊ฐ€ํ•ด์™”๋‹ค. ํ•˜์ง€๋งŒ ์ง€๊ธˆ๊นŒ์ง€ ์บ์‹œ์— ์ฃผ๋กœ ์‚ฌ์šฉ๋˜์–ด ์˜จ ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ˆ ์ธ SRAM์€ ๋‚ฎ์€ ์ง‘์ ๋„์™€ ๋†’์€ ๋Œ€๊ธฐ ์ „๋ ฅ ์†Œ๋ชจ๋กœ ์ธํ•ด ํฐ ์บ์‹œ๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ๋ฐ์—๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š๋‹ค. ์ด๋Ÿฌํ•œ SRAM์˜ ๋‹จ์ ์„ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด ๋” ๋†’์€ ์ง‘์ ๋„์™€ ๋‚ฎ์€ ๋Œ€๊ธฐ ์ „๋ ฅ์„ ์†Œ๋ชจํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ˆ ์ธ STT-RAM์œผ๋กœ SRAM์„ ๋Œ€์ฒดํ•˜๋Š” ๊ฒƒ์ด ์ œ์•ˆ๋˜์—ˆ๋‹ค. ํ•˜์ง€๋งŒ STT-RAM์€ ๋ฐ์ดํ„ฐ๋ฅผ ์“ธ ๋•Œ ๋งŽ์€ ์—๋„ˆ์ง€์™€ ์‹œ๊ฐ„์„ ์†Œ๋น„ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‹จ์ˆœํžˆ SRAM์„ STT-RAM์œผ๋กœ ๋Œ€์ฒดํ•˜๋Š” ๊ฒƒ์€ ์˜คํžˆ๋ ค ์บ์‹œ ์—๋„ˆ์ง€ ์†Œ๋น„๋ฅผ ์ฆ๊ฐ€์‹œํ‚จ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” STT-RAM์„ ์ด์šฉํ•œ ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ์บ์‹œ ์„ค๊ณ„ ๊ธฐ์ˆ ๋“ค์„ ์ œ์•ˆํ•œ๋‹ค. ์ฒซ ๋ฒˆ์งธ, ๋ฐฐํƒ€์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์—์„œ STT-RAM์„ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ๋ฐฐํƒ€์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ๋Š” ๊ณ„์ธต ๊ฐ„์— ์ค‘๋ณต๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์— ํฌํ•จ์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์™€ ๋น„๊ตํ•˜์—ฌ ๋” ํฐ ์œ ํšจ ์šฉ๋Ÿ‰์„ ๊ฐ–์ง€๋งŒ, ๋ฐฐํƒ€์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์—์„œ๋Š” ์ƒ์œ„ ๋ ˆ๋ฒจ ์บ์‹œ์—์„œ ๋‚ด๋ณด๋‚ด์ง„ ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜์œ„ ๋ ˆ๋ฒจ ์บ์‹œ์— ์จ์•ผ ํ•˜๋ฏ€๋กœ ๋” ๋งŽ์€ ์–‘์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์“ฐ๊ฒŒ ๋œ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฐํƒ€์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์˜ ํŠน์„ฑ์€ ์“ฐ๊ธฐ ํŠน์„ฑ์ด ๋‹จ์ ์ธ STT-RAM์„ ํ•จ๊ป˜ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์„ ์–ด๋ ต๊ฒŒ ํ•œ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์žฌ์‚ฌ์šฉ ๊ฑฐ๋ฆฌ ์˜ˆ์ธก์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” SRAM/STT-RAM ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ ๊ตฌ์กฐ๋ฅผ ์„ค๊ณ„ํ•˜์˜€๋‹ค. ๋‘ ๋ฒˆ์งธ, ๋น„ํœ˜๋ฐœ์„ฑ STT-RAM์„ ์ด์šฉํ•ด ์บ์‹œ๋ฅผ ์„ค๊ณ„ํ•  ๋•Œ ๊ณ ๋ คํ•ด์•ผ ํ•  ์ ๋“ค์— ๋Œ€ํ•ด ๋ถ„์„ํ•˜์˜€๋‹ค. STT-RAM์˜ ๋น„ํšจ์œจ์ ์ธ ์“ฐ๊ธฐ ๋™์ž‘์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ํ•ด๊ฒฐ๋ฒ•๋“ค์ด ์ œ์•ˆ๋˜์—ˆ๋‹ค. ๊ทธ์ค‘ ํ•œ ๊ฐ€์ง€๋Š” STT-RAM ์†Œ์ž๊ฐ€ ๋ฐ์ดํ„ฐ๋ฅผ ์œ ์ง€ํ•˜๋Š” ์‹œ๊ฐ„์„ ์ค„์—ฌ (ํœ˜๋ฐœ์„ฑ STT-RAM) ์“ฐ๊ธฐ ํŠน์„ฑ์„ ํ–ฅ์ƒํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. STT-RAM์— ์ €์žฅ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์žƒ๋Š” ๊ฒƒ์€ ํ™•๋ฅ ์ ์œผ๋กœ ๋ฐœ์ƒํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ €์žฅ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์•ˆ์ •์ ์œผ๋กœ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์˜ค๋ฅ˜ ์ •์ • ๋ถ€ํ˜ธ(ECC)๋ฅผ ์ด์šฉํ•ด ์ฃผ๊ธฐ์ ์œผ๋กœ ์˜ค๋ฅ˜๋ฅผ ์ •์ •ํ•ด์ฃผ์–ด์•ผ ํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” STT-RAM ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ ํœ˜๋ฐœ์„ฑ STT-RAM ์„ค๊ณ„ ์š”์†Œ๋“ค์— ๋Œ€ํ•ด ๋ถ„์„ํ•˜์˜€๊ณ  ์‹คํ—˜์„ ํ†ตํ•ด ํ•ด๋‹น ์„ค๊ณ„ ์š”์†Œ๋“ค์ด ์บ์‹œ ์—๋„ˆ์ง€์™€ ์„ฑ๋Šฅ์— ์ฃผ๋Š” ์˜ํ–ฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ๋งค๋‹ˆ์ฝ”์–ด ์‹œ์Šคํ…œ์—์„œ์˜ ๋ถ„์‚ฐ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ ๊ตฌ์กฐ๋ฅผ ์„ค๊ณ„ํ•˜์˜€๋‹ค. ๋‹จ์ˆœํžˆ ๊ธฐ์กด์˜ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ์™€ ๋ถ„์‚ฐ์บ์‹œ๋ฅผ ๊ฒฐํ•ฉํ•˜๋ฉด ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ์˜ ํšจ์œจ์„ฑ์— ํฐ ์˜ํ–ฅ์„ ์ฃผ๋Š” SRAM ํ™œ์šฉ๋„๊ฐ€ ๋‚ฎ์•„์ง„๋‹ค. ๋”ฐ๋ผ์„œ ๊ธฐ์กด์˜ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ ๊ตฌ์กฐ์—์„œ์˜ ์—๋„ˆ์ง€ ๊ฐ์†Œ๋ฅผ ๊ธฐ๋Œ€ํ•  ์ˆ˜ ์—†๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋ถ„์‚ฐ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ ๊ตฌ์กฐ์—์„œ SRAM ํ™œ์šฉ๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ๋Š” ๋‘ ๊ฐ€์ง€ ์ตœ์ ํ™” ๊ธฐ์ˆ ์ธ ๋ฑ…ํฌ-๋‚ด๋ถ€ ์ตœ์ ํ™”์™€ ๋ฑ…ํฌ๊ฐ„ ์ตœ์ ํ™” ๊ธฐ์ˆ ์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ๋ฑ…ํฌ-๋‚ด๋ถ€ ์ตœ์ ํ™”๋Š” highly-associative ์บ์‹œ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ฑ…ํฌ ๋‚ด๋ถ€์—์„œ ์“ฐ๊ธฐ ๋™์ž‘์ด ๋งŽ์€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์‚ฐ์‹œํ‚ค๋Š” ๊ฒƒ์ด๊ณ  ๋ฑ…ํฌ๊ฐ„ ์ตœ์ ํ™”๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์บ์‹œ ๋ฑ…ํฌ์— ์“ฐ๊ธฐ ๋™์ž‘์ด ๋งŽ์€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณ ๋ฅด๊ฒŒ ๋ถ„์‚ฐ์‹œํ‚ค๋Š” ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์ด๋‹ค.Over the last decade, the capacity of on-chip cache is continuously increased to mitigate the memory wall problem. However, SRAM, which is a dominant memory technology for caches, is not suitable for such a large cache because of its low density and large static power. One way to mitigate these downsides of the SRAM cache is replacing SRAM with a more efficient memory technology. Spin-Transfer Torque RAM (STT-RAM), one of the emerging memory technology, is a promising candidate for the alternative of SRAM. As a substitute of SRAM, STT-RAM can compensate drawbacks of SRAM with its non-volatility and small cell size. However, STT-RAM has poor write characteristics such as high write energy and long write latency and thus simply replacing SRAM to STT-RAM increases cache energy. To overcome those poor write characteristics of STT-RAM, this dissertation explores three different design techniques for energy-efficient cache using STT-RAM. The first part of the dissertation focuses on combining STT-RAM with exclusive cache hierarchy. Exclusive caches are known to provide higher effective cache capacity than inclusive caches by removing duplicated copies of cache blocks across hierarchies. However, in exclusive cache hierarchies, every block evicted from the upper-level cache is written back to the last-level cache regardless of its dirtiness thereby incurring extra write overhead. This makes it challenging to use STT-RAM for exclusive last-level caches due to its high write energy and long write latency. To mitigate this problem, we design an SRAM/STT-RAM hybrid cache architecture based on reuse distance prediction. The second part of the dissertation explores trade-offs in the design of volatile STT-RAM cache. Due to the inefficient write operation of STT-RAM, various solutions have been proposed to tackle this inefficiency. One of the proposed solutions is redesigning STT-RAM cell for better write characteristics at the cost of shortened retention time (i.e., volatile STT-RAM). Since the retention failure of STT-RAM has a stochastic property, an extra overhead of periodic scrubbing with error correcting code (ECC) is required to tolerate the failure. With an analysis based on analytic STT-RAM model, we have conducted extensive experiments on various volatile STT-RAM cache design parameters including scrubbing period, ECC strength, and target failure rate. The experimental results show the impact of the parameter variations on last-level cache energy and performance and provide a guideline for designing a volatile STT-RAM with ECC and scrubbing. The last part of the dissertation proposes Benzene, an energy-efficient distributed SRAM/STT-RAM hybrid cache architecture for manycore systems running multiple applications. It is based on the observation that a naive application of hybrid cache techniques to distributed caches in a manycore architecture suffers from limited energy reduction due to uneven utilization of scarce SRAM. We propose two-level optimization techniques: intra-bank and inter-bank. Intra-bank optimization leverages highly-associative cache design, achieving more uniform distribution of writes within a bank. Inter-bank optimization evenly balances the amount of write-intensive data across the banks.Abstract i Contents iii List of Figures vii List of Tables xi Chapter 1 Introduction 1 1.1 Exclusive Last-Level Hybrid Cache 2 1.2 Designing Volatile STT-RAM Cache 4 1.3 Distributed Hybrid Cache 5 Chapter 2 Background 9 2.1 STT-RAM 9 2.1.1 Thermal Stability 10 2.1.2 Read and Write Operation of STT-RAM 11 2.1.3 Failures of STT-RAM 11 2.1.4 Volatile STT-RAM 13 2.1.5 Related Work 14 2.2 Exclusive Last-Level Hybrid Cache 18 2.2.1 Cache Hierarchies 18 2.2.2 Related Work 19 2.3 Distributed Hybrid Cache 21 2.3.1 Prediction Hybrid Cache 21 2.3.2 Distributed Cache Partitioning 22 2.3.3 Related Work 23 Chapter 3 Exclusive Last-Level Hybrid Cache 27 3.1 Motivation 27 3.1.1 Exclusive Cache Hierarchy 27 3.1.2 Reuse Distance 29 3.2 Architecture 30 3.2.1 Reuse Distance Predictor 30 3.2.2 Hybrid Cache Architecture 32 3.3 Evaluation 34 3.3.1 Methodology 34 3.3.2 LLC Energy Consumption 35 3.3.3 Main Memory Energy Consumption 38 3.3.4 Performance 39 3.3.5 Area Overhead 39 3.4 Summary 39 Chapter 4 Designing Volatile STT-RAM Cache 41 4.1 Analysis 41 4.1.1 Retention Failure of a Volatile STT-RAM Cell 41 4.1.2 Memory Array Design 43 4.2 Evaluation 45 4.2.1 Methodology 45 4.2.2 Last-Level Cache Energy 46 4.2.3 Performance 51 4.3 Summary 52 Chapter 5 Distributed Hybrid Cache 55 5.1 Motivation 55 5.2 Architecture 58 5.2.1 Intra-Bank Optimization 59 5.2.2 Inter-Bank Optimization 63 5.2.3 Other Optimizations 67 5.3 Evaluation Methodology 69 5.4 Evaluation Results 73 5.4.1 Energy Consumption and Performance 73 5.4.2 Analysis of Intra-bank Optimization 76 5.4.3 Analysis of Inter-bank Optimization 78 5.4.4 Impact of Inter-Bank Optimization on Network Energy 79 5.4.5 Sensitivity Analysis 80 5.4.6 Implementation Overhead 81 5.5 Summary 82 Chapter 6 Conculsion 85 Bibliography 88 ์ดˆ๋ก 101Docto

    A Study on Performance and Power Efficiency of Dense Non-Volatile Caches in Multi-Core Systems

    Full text link
    In this paper, we present a novel cache design based on Multi-Level Cell Spin-Transfer Torque RAM (MLC STTRAM) that can dynamically adapt the set capacity and associativity to use efficiently the full potential of MLC STTRAM. We exploit the asymmetric nature of the MLC storage scheme to build cache lines featuring heterogeneous performances, that is, half of the cache lines are read-friendly, while the other is write-friendly. Furthermore, we propose to opportunistically deactivate ways in underutilized sets to convert MLC to Single-Level Cell (SLC) mode, which features overall better performance and lifetime. Our ultimate goal is to build a cache architecture that combines the capacity advantages of MLC and performance/energy advantages of SLC. Our experiments show an improvement of 43% in total numbers of conflict misses, 27% in memory access latency, 12% in system performance, and 26% in LLC access energy, with a slight degradation in cache lifetime (about 7%) compared to an SLC cache

    Gestiรณn de jerarquรญas de memoria hรญbridas a nivel de sistema

    Get PDF
    Tesis inรฉdita de la Universidad Complutense de Madrid, Facultad de Informรกtica, Departamento de Arquitectura de Computadoras y Automรกtica y de Ku Leuven, Arenberg Doctoral School, Faculty of Engineering Science, leรญda el 11/05/2017.In electronics and computer science, the term โ€˜memoryโ€™ generally refers to devices that are used to store information that we use in various appliances ranging from our PCs to all hand-held devices, smart appliances etc. Primary/main memory is used for storage systems that function at a high speed (i.e. RAM). The primary memory is often associated with addressable semiconductor memory, i.e. integrated circuits consisting of silicon-based transistors, used for example as primary memory but also other purposes in computers and other digital electronic devices. The secondary/auxiliary memory, in comparison provides program and data storage that is slower to access but offers larger capacity. Examples include external hard drives, portable flash drives, CDs, and DVDs. These devices and media must be either plugged in or inserted into a computer in order to be accessed by the system. Since secondary storage technology is not always connected to the computer, it is commonly used for backing up data. The term storage is often used to describe secondary memory. Secondary memory stores a large amount of data at lesser cost per byte than primary memory; this makes secondary storage about two orders of magnitude less expensive than primary storage. There are two main types of semiconductor memory: volatile and nonvolatile. Examples of non-volatile memory are โ€˜Flashโ€™ memory (sometimes used as secondary, sometimes primary computer memory) and ROM/PROM/EPROM/EEPROM memory (used for firmware such as boot programs). Examples of volatile memory are primary memory (typically dynamic RAM, DRAM), and fast CPU cache memory (typically static RAM, SRAM, which is fast but energy-consuming and offer lower memory capacity per are a unit than DRAM). Non-volatile memory technologies in Si-based electronics date back to the 1990s. Flash memory is widely used in consumer electronic products such as cellphones and music players and NAND Flash-based solid-state disks (SSDs) are increasingly displacing hard disk drives as the primary storage device in laptops, desktops, and even data centers. The integration limit of Flash memories is approaching, and many new types of memory to replace conventional Flash memories have been proposed. The rapid increase of leakage currents in Silicon CMOS transistors with scaling poses a big challenge for the integration of SRAM memories. There is also the case of susceptibility to read/write failure with low power schemes. As a result of this, over the past decade, there has been an extensive pooling of time, resources and effort towards developing emerging memory technologies like Resistive RAM (ReRAM/RRAM), STT-MRAM, Domain Wall Memory and Phase Change Memory(PRAM). Emerging non-volatile memory technologies promise new memories to store more data at less cost than the expensive-to build silicon chips used by popular consumer gadgets including digital cameras, cell phones and portable music players. These new memory technologies combine the speed of static random-access memory (SRAM), the density of dynamic random-access memory (DRAM), and the non-volatility of Flash memory and so become very attractive as another possibility for future memory hierarchies. The research and information on these Non-Volatile Memory (NVM) technologies has matured over the last decade. These NVMs are now being explored thoroughly nowadays as viable replacements for conventional SRAM based memories even for the higher levels of the memory hierarchy. Many other new classes of emerging memory technologies such as transparent and plastic, three-dimensional(3-D), and quantum dot memory technologies have also gained tremendous popularity in recent years...En el campo de la informรกtica, el tรฉrmino โ€˜memoriaโ€™ se refiere generalmente a dispositivos que son usados para almacenar informaciรณn que posteriormente serรก usada en diversos dispositivos, desde computadoras personales (PC), mรณviles, dispositivos inteligentes, etc. La memoria principal del sistema se utiliza para almacenar los datos e instrucciones de los procesos que se encuentre en ejecuciรณn, por lo que se requiere que funcionen a alta velocidad (por ejemplo, DRAM). La memoria principal estรก implementada habitualmente mediante memorias semiconductoras direccionables, siendo DRAM y SRAM los principales exponentes. Por otro lado, la memoria auxiliar o secundaria proporciona almacenaje(para ficheros, por ejemplo); es mรกs lenta pero ofrece una mayor capacidad. Ejemplos tรญpicos de memoria secundaria son discos duros, memorias flash portables, CDs y DVDs. Debido a que estos dispositivos no necesitan estar conectados a la computadora de forma permanente, son muy utilizados para almacenar copias de seguridad. La memoria secundaria almacena una gran cantidad de datos aun coste menor por bit que la memoria principal, siendo habitualmente dos รณrdenes de magnitud mรกs barata que la memoria primaria. Existen dos tipos de memorias de tipo semiconductor: volรกtiles y no volรกtiles. Ejemplos de memorias no volรกtiles son las memorias Flash (algunas veces usadas como memoria secundaria y otras veces como memoria principal) y memorias ROM/PROM/EPROM/EEPROM (usadas para firmware como programas de arranque). Ejemplos de memoria volรกtil son las memorias DRAM (RAM dinรกmica), actualmente la opciรณn predominante a la hora de implementar la memoria principal, y las memorias SRAM (RAM estรกtica) mรกs rรกpida y costosa, utilizada para los diferentes niveles de cache. Las tecnologรญas de memorias no volรกtiles basadas en electrรณnica de silicio se remontan a la dรฉcada de1990. Una variante de memoria de almacenaje por carga denominada como memoria Flash es mundialmente usada en productos electrรณnicos de consumo como telefonรญa mรณvil y reproductores de mรบsica mientras NAND Flash solid state disks(SSDs) estรกn progresivamente desplazando a los dispositivos de disco duro como principal unidad de almacenamiento en computadoras portรกtiles, de escritorio e incluso en centros de datos. En la actualidad, hay varios factores que amenazan la actual predominancia de memorias semiconductoras basadas en cargas (capacitivas). Por un lado, se estรก alcanzando el lรญmite de integraciรณn de las memorias Flash, lo que compromete su escalado en el medio plazo. Por otra parte, el fuerte incremento de las corrientes de fuga de los transistores de silicio CMOS actuales, supone un enorme desafรญo para la integraciรณn de memorias SRAM. Asimismo, estas memorias son cada vez mรกs susceptibles a fallos de lectura/escritura en diseรฑos de bajo consumo. Como resultado de estos problemas, que se agravan con cada nueva generaciรณn tecnolรณgica, en los รบltimos aรฑos se han intensificado los esfuerzos para desarrollar nuevas tecnologรญas que reemplacen o al menos complementen a las actuales. Los transistores de efecto campo elรฉctrico ferroso (FeFET en sus siglas en inglรฉs) se consideran una de las alternativas mรกs prometedores para sustituir tanto a Flash (por su mayor densidad) como a DRAM (por su mayor velocidad), pero aรบn estรก en una fase muy inicial de su desarrollo. Hay otras tecnologรญas algo mรกs maduras, en el รกmbito de las memorias RAM resistivas, entre las que cabe destacar ReRAM (o RRAM), STT-RAM, Domain Wall Memory y Phase Change Memory (PRAM)...Depto. de Arquitectura de Computadores y AutomรกticaFac. de InformรกticaTRUEunpu

    Energy-Aware Data Movement In Non-Volatile Memory Hierarchies

    Get PDF
    While technology scaling enables increased density for memory cells, the intrinsic high leakage power of conventional CMOS technology and the demand for reduced energy consumption inspires the use of emerging technology alternatives such as eDRAM and Non-Volatile Memory (NVM) including STT-MRAM, PCM, and RRAM. The utilization of emerging technology in Last Level Cache (LLC) designs which occupies a signifcant fraction of total die area in Chip Multi Processors (CMPs) introduces new dimensions of vulnerability, energy consumption, and performance delivery. To be specific, a part of this research focuses on eDRAM Bit Upset Vulnerability Factor (BUVF) to assess vulnerable portion of the eDRAM refresh cycle where the critical charge varies depending on the write voltage, storage and bit-line capacitance. This dissertation broaden the study on vulnerability assessment of LLC through investigating the impact of Process Variations (PV) on narrow resistive sensing margins in high-density NVM arrays, including on-chip cache and primary memory. Large-latency and power-hungry Sense Amplifers (SAs) have been adapted to combat PV in the past. Herein, a novel approach is proposed to leverage the PV in NVM arrays using Self-Organized Sub-bank (SOS) design. SOS engages the preferred SA alternative based on the intrinsic as-built behavior of the resistive sensing timing margin to reduce the latency and power consumption while maintaining acceptable access time. On the other hand, this dissertation investigates a novel technique to prioritize the service to 1) Extensive Read Reused Accessed blocks of the LLC that are silently dropped from higher levels of cache, and 2) the portion of the working set that may exhibit distant re-reference interval in L2. In particular, we develop a lightweight Multi-level Access History Profiler to effciently identify ERRA blocks through aggregating the LLC block addresses tagged with identical Most Signifcant Bits into a single entry. Experimental results indicate that the proposed technique can reduce the L2 read miss ratio by 51.7% on average across PARSEC and SPEC2006 workloads. In addition, this dissertation will broaden and apply advancements in theories of subspace recovery to pioneer computationally-aware in-situ operand reconstruction via the novel Logic In Interconnect (LI2) scheme. LI2 will be developed, validated, and re?ned both theoretically and experimentally to realize a radically different approach to post-Moore\u27s Law computing by leveraging low-rank matrices features offering data reconstruction instead of fetching data from main memory to reduce energy/latency cost per data movement. We propose LI2 enhancement to attain high performance delivery in the post-Moore\u27s Law era through equipping the contemporary micro-architecture design with a customized memory controller which orchestrates the memory request for fetching low-rank matrices to customized Fine Grain Reconfigurable Accelerator (FGRA) for reconstruction while the other memory requests are serviced as before. The goal of LI2 is to conquer the high latency/energy required to traverse main memory arrays in the case of LLC miss, by using in-situ construction of the requested data dealing with low-rank matrices. Thus, LI2 exchanges a high volume of data transfers with a novel lightweight reconstruction method under specific conditions using a cross-layer hardware/algorithm approach

    High-Performance Energy-Efficient and Reliable Design of Spin-Transfer Torque Magnetic Memory

    Get PDF
    In this dissertation new computing paradigms, architectures and design philosophy are proposed and evaluated for adopting the STT-MRAM technology as highly reliable, energy efficient and fast memory. For this purpose, a novel cross-layer framework from the cell-level all the way up to the system- and application-level has been developed. In these framework, the reliability issues are modeled accurately with appropriate fault models at different abstraction levels in order to analyze the overall failure rates of the entire memory and its Mean Time To Failure (MTTF) along with considering the temperature and process variation effects. Design-time, compile-time and run-time solutions have been provided to address the challenges associated with STT-MRAM. The effectiveness of the proposed solutions is demonstrated in extensive experiments that show significant improvements in comparison to state-of-the-art solutions, i.e. lower-power, higher-performance and more reliable STT-MRAM design

    Computing with Spintronics: Circuits and architectures

    Get PDF
    This thesis makes the following contributions towards the design of computing platforms with spintronic devices. 1) It explores the use of spintronic memories in the design of a domain-specific processor for an emerging class of data-intensive applications, namely recognition, mining and synthesis (RMS). Two different spintronic memory technologies โ€” Domain Wall Memory (DWM) and STT-MRAM โ€” are utilized to realize the different levels in the memory hierarchy of the domain-specific processor, based on their respective access characteristics. Architectural tradeoffs created by the use of spintronic memories are analyzed. The proposed design achieves 1.5X-4X improvements in energy-delay product compared to a CMOS baseline. 2) It describes the first attempt to use DWM in the cache hierarchy of general-purpose processors. DWM promises unparalleled density by packing several bits of data into each bit-cell. TapeCache, the proposed DWM-based cache architecture, utilizes suitable circuit and architectural optimizations to address two key challenges (i) the high energy and latency requirement of write operations and (ii) the need for shift operations to access the data stored in each DWM bit-cell. At the circuit level, DWM bit-cells that are tailored to the distinct design requirements of different levels in the cache hierarchy are proposed. At the architecture level, TapeCache proposes suitable cache organization and management policies to alleviate the performance impact of shift operations required to access data stored in DWM bit-cells. TapeCache achieves more than 7X improvements in both cache area and energy with virtually identical performance compared to an SRAM-based cache hierarchy. 3) It investigates the design of the on-chip memory hierarchy of general-purpose graphics processing units (GPGPUs)โ€”massively parallel processors that are optimized for data-intensive high-throughput workloadsโ€”using DWM. STAG, a high density, energy-efficient Spintronic- Tape Architecture for GPGPU cache hierarchies is described. STAG utilizes different DWM bit-cells to realize different memory arrays in the GPGPU cache hierarchy. To address the challenge of high access latencies due to shifts, STAG predicts upcoming cache accesses by leveraging unique characteristics of GPGPU architectures and workloads, and prefetches data that are both likely to be accessed and require large numbers of shift operations. STAG achieves 3.3X energy reduction and 12.1% performance improvement over CMOS SRAM under iso-area conditions. 4) While the potential of spintronic devices for memories is widely recognized, their utility in realizing logic is much less clear. The thesis presents Spintastic, a new paradigm that utilizes Stochastic Computing (SC) to realize spintronic logic. In SC, data is encoded in the form of pseudo-random bitstreams, such that the probability of a \u271\u27 in a bitstream corresponds to the numerical value that it represents. SC can enable compact, low-complexity logic implementations of various arithmetic functions. Spintastic establishes the synergy between stochastic computing and spin-based logic by demonstrating that they mutually alleviate each other\u27s limitations. On the one hand, various building blocks of SC, which incur significant overheads in CMOS implementations, can be efficiently realized by exploiting the physical characteristics of spin devices. On the other hand, the reduced logic complexity and low logic depth of SC circuits alleviates the shortcomings of spintronic logic. Based on this insight, the design of spin-based stochastic arithmetic circuits, bitstream generators, bitstream permuters and stochastic-to-binary converter circuits are presented. Spintastic achieves 7.1X energy reduction over CMOS implementations for a wide range of benchmarks from the image processing, signal processing, and RMS application domains. 5) In order to evaluate the proposed spintronic designs, the thesis describes various device-to-architecture modeling frameworks. Starting with devices models that are calibrated to measurements, the characteristics of spintronic devices are successively abstracted into circuit-level and architectural models, which are incorporated into suitable simulation frameworks. (Abstract shortened by UMI.

    ์ƒˆ๋กœ์šด ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ˆ ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ ์„ค๊ณ„ ๊ธฐ์ˆ 

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2017. 2. ์ตœ๊ธฐ์˜.Performance and energy efficiency of modern computer systems are largely dominated by the memory system. This memory bottleneck has been exacerbated in the past few years with (1) architectural innovations for improving the efficiency of computation units (e.g., chip multiprocessors), which shift the major cause of inefficiency from processors to memory, and (2) the emergence of data-intensive applications, which demands a large capacity of main memory and an excessive amount of memory bandwidth to efficiently handle such workloads. In order to address this memory wall challenge, this dissertation aims at exploring the potential of emerging memory technologies and designing a high-performance, energy-efficient memory hierarchy that is aware of and leverages the characteristics of such new memory technologies. The first part of this dissertation focuses on energy-efficient on-chip cache design based on a new non-volatile memory technology called Spin-Transfer Torque RAM (STT-RAM). When STT-RAM is used to build on-chip caches, it provides several advantages over conventional charge-based memory (e.g., SRAM or eDRAM), such as non-volatility, lower static power, and higher density. However, simply replacing SRAM caches with STT-RAM rather increases the energy consumption because write operations of STT-RAM are slower and more energy-consuming than those of SRAM. To address this challenge, we propose four novel architectural techniques that can alleviate the impact of inefficient STT-RAM write operations on system performance and energy consumption. First, we apply STT-RAM to instruction caches (where write operations are relatively infrequent) and devise a power-gating mechanism called LASIC, which leverages the non-volatility of STT-RAM to turn off STT-RAM instruction caches inside small loops. Second, we propose lower-bits cache, which exploits the narrow bit-width characteristics of application data by caching frequent bit-flips at lower bits in a small SRAM cache. Third, we present prediction hybrid cache, an SRAM/STT-RAM hybrid cache whose block placement between SRAM and STT-RAM is determined by predicting the write intensity of each cache block with a new hardware structure called write intensity predictor. Fourth, we propose DASCA, which predicts write operations that can bypass the cache without incurring extra cache misses (called dead writes) and lets the last-level cache bypass such dead writes to reduce write energy consumption. The second part of this dissertation architects intelligent main memory and its host architecture support based on logic-enabled DRAM. Traditionally, main memory has served the sole purpose of storing data because the extra manufacturing cost of implementing rich functionality (e.g., computation) on a DRAM die was unacceptably high. However, the advent of 3D die stacking now provides a practical, cost-effective way to integrate complex logic circuits into main memory, thereby opening up the possibilities for intelligent main memory. For example, it can be utilized to implement advanced memory management features (e.g., scheduling, power management, etc.) inside memoryit can be also used to offload computation to main memory, which allows us to overcome the memory bandwidth bottleneck caused by narrow off-chip channels (commonly known as processing-in-memory or PIM). The remaining questions are what to implement inside main memory and how to integrate and expose such new features to existing systems. In order to answer these questions, we propose four system designs that utilize logic-enabled DRAM to improve system performance and energy efficiency. First, we utilize the existing logic layer of a Hybrid Memory Cube (a commercial logic-enabled DRAM product) to (1) dynamically turn off some of its off-chip links by monitoring the actual bandwidth demand and (2) integrate prefetch buffer into main memory to perform aggressive prefetching without consuming off-chip link bandwidth. Second, we propose a scalable accelerator for large-scale graph processing called Tesseract, in which graph processing computation is offloaded to specialized processors inside main memory in order to achieve memory-capacity-proportional performance. Third, we design a low-overhead PIM architecture for near-term adoption called PIM-enabled instructions, where PIM operations are interfaced as cache-coherent, virtually-addressed host processor instructions that can be executed either by the host processor or in main memory depending on the data locality. Fourth, we propose an energy-efficient PIM system called aggregation-in-memory, which can adaptively execute PIM operations at any level of the memory hierarchy and provides a fully automated compiler toolchain that transforms existing applications to use PIM operations without programmer intervention.Chapter 1 Introduction 1 1.1 Inefficiencies in the Current Memory Systems 2 1.1.1 On-Chip Caches 2 1.1.2 Main Memory 2 1.2 New Memory Technologies: Opportunities and Challenges 3 1.2.1 Energy-Efficient On-Chip Caches based on STT-RAM 3 1.2.2 Intelligent Main Memory based on Logic-Enabled DRAM 6 1.3 Dissertation Overview 9 Chapter 2 Previous Work 11 2.1 Energy-Efficient On-Chip Caches based on STT-RAM 11 2.1.1 Hybrid Caches 11 2.1.2 Volatile STT-RAM 13 2.1.3 Redundant Write Elimination 14 2.2 Intelligent Main Memory based on Logic-Enabled DRAM 15 2.2.1 PIM Architectures in the 1990s 15 2.2.2 Modern PIM Architectures based on 3D Stacking 15 2.2.3 Modern PIM Architectures on Memory Dies 17 Chapter 3 Loop-Aware Sleepy Instruction Cache 19 3.1 Architecture 20 3.1.1 Loop Cache 21 3.1.2 Loop-Aware Sleep Controller 22 3.2 Evaluation and Discussion 24 3.2.1 Simulation Environment 24 3.2.2 Energy 25 3.2.3 Performance 27 3.2.4 Sensitivity Analysis 27 3.3 Summary 28 Chapter 4 Lower-Bits Cache 29 4.1 Architecture 29 4.2 Experiments 32 4.2.1 Simulator and Cache Model 32 4.2.2 Results 33 4.3 Summary 34 Chapter 5 Prediction Hybrid Cache 35 5.1 Problem and Motivation 37 5.1.1 Problem Definition 37 5.1.2 Motivation 37 5.2 Write Intensity Predictor 38 5.2.1 Keeping Track of Trigger Instructions 39 5.2.2 Identifying Hot Trigger Instructions 40 5.2.3 Dynamic Set Sampling 41 5.2.4 Summary 42 5.3 Prediction Hybrid Cache 43 5.3.1 Need for Write Intensity Prediction 43 5.3.2 Organization 43 5.3.3 Operations 44 5.3.4 Dynamic Threshold Adjustment 45 5.4 Evaluation Methodology 48 5.4.1 Simulator Configuration 48 5.4.2 Workloads 50 5.5 Single-Core Evaluations 51 5.5.1 Energy Consumption and Speedup 51 5.5.2 Energy Breakdown 53 5.5.3 Coverage and Accuracy 54 5.5.4 Sensitivity to Write Intensity Threshold 55 5.5.5 Impact of Dynamic Set Sampling 55 5.5.6 Results for Non-Write-Intensive Workloads 56 5.6 Multicore Evaluations 57 5.7 Summary 59 Chapter 6 Dead Write Prediction Assisted STT-RAM Cache 61 6.1 Motivation 62 6.1.1 Energy Impact of Inefficient Write Operations 62 6.1.2 Limitations of Existing Approaches 63 6.1.3 Potential of Dead Writes 64 6.2 Dead Write Classification 65 6.2.1 Dead-on-Arrival Fills 65 6.2.2 Dead-Value Fills 66 6.2.3 Closing Writes 66 6.2.4 Decomposition 67 6.3 Dead Write Prediction Assisted STT-RAM Cache Architecture 68 6.3.1 Dead Write Prediction 68 6.3.2 Bidirectional Bypass 71 6.4 Evaluation Methodology 72 6.4.1 Simulation Configuration 72 6.4.2 Workloads 74 6.5 Evaluation for Single-Core Systems 75 6.5.1 Energy Consumption and Speedup 75 6.5.2 Coverage and Accuracy 78 6.5.3 Sensitivity to Signature 78 6.5.4 Sensitivity to Update Policy 80 6.5.5 Implications of Device-/Circuit-Level Techniques for Write Energy Reduction 80 6.5.6 Impact of Prefetching 80 6.6 Evaluation for Multi-Core Systems 81 6.6.1 Energy Consumption and Speedup 81 6.6.2 Application to Inclusive Caches 83 6.6.3 Application to Three-Level Cache Hierarchy 84 6.7 Summary 85 Chapter 7 Link Power Management for Hybrid Memory Cubes 87 7.1 Background and Motivation 88 7.1.1 Hybrid Memory Cube 88 7.1.2 Motivation 89 7.2 HMC Link Power Management 91 7.2.1 Link Delay Monitor 91 7.2.2 Power State Transition 94 7.2.3 Overhead 95 7.3 Two-Level Prefetching 95 7.4 Application to Multi-HMC Systems 97 7.5 Experiments 98 7.5.1 Methodology 98 7.5.2 Link Energy Consumption and Speedup 100 7.5.3 HMC Energy Consumption 102 7.5.4 Runtime Behavior of LPM 102 7.5.5 Sensitivity to Slowdown Threshold 104 7.5.6 LPM without Prefetching 104 7.5.7 Impact of Prefetching on Link Traffic 105 7.5.8 On-Chip Prefetcher Aggressiveness in 2LP 107 7.5.9 Tighter Off-Chip Bandwidth Margin 107 7.5.10 Multithreaded Workloads 108 7.5.11 Multi-HMC Systems 109 7.6 Summary 111 Chapter 8 Tesseract PIM System for Parallel Graph Processing 113 8.1 Background and Motivation 115 8.1.1 Large-Scale Graph Processing 115 8.1.2 Graph Processing on Conventional Systems 117 8.1.3 Processing-in-Memory 118 8.2 Tesseract Architecture 119 8.2.1 Overview 119 8.2.2 Remote Function Call via Message Passing 122 8.2.3 Prefetching 124 8.2.4 Programming Interface 126 8.2.5 Application Mapping 127 8.3 Evaluation Methodology 128 8.3.1 Simulation Configuration 128 8.3.2 Workloads 129 8.4 Evaluation Results 130 8.4.1 Performance 130 8.4.2 Iso-Bandwidth Comparison 133 8.4.3 Execution Time Breakdown 134 8.4.4 Prefetch Efficiency 134 8.4.5 Scalability 135 8.4.6 Effect of Higher Off-Chip Network Bandwidth 136 8.4.7 Effect of Better Graph Distribution 137 8.4.8 Energy/Power Consumption and Thermal Analysis 138 8.5 Summary 139 Chapter 9 PIM-Enabled Instructions 141 9.1 Potential of ISA Extensions as the PIM Interface 143 9.2 PIM Abstraction 145 9.2.1 Operations 145 9.2.2 Memory Model 147 9.2.3 Software Modification 148 9.3 Architecture 148 9.3.1 Overview 148 9.3.2 PEI Computation Unit (PCU) 149 9.3.3 PEI Management Unit (PMU) 150 9.3.4 Virtual Memory Support 153 9.3.5 PEI Execution 153 9.3.6 Comparison with Active Memory Operations 154 9.4 Target Applications for Case Study 155 9.4.1 Large-Scale Graph Processing 155 9.4.2 In-Memory Data Analytics 156 9.4.3 Machine Learning and Data Mining 157 9.4.4 Operation Summary 157 9.5 Evaluation Methodology 158 9.5.1 Simulation Configuration 158 9.5.2 Workloads 159 9.6 Evaluation Results 159 9.6.1 Performance 160 9.6.2 Sensitivity to Input Size 163 9.6.3 Multiprogrammed Workloads 164 9.6.4 Balanced Dispatch: Idea and Evaluation 165 9.6.5 Design Space Exploration for PCUs 165 9.6.6 Performance Overhead of the PMU 167 9.6.7 Energy, Area, and Thermal Issues 167 9.7 Summary 168 Chapter 10 Aggregation-in-Memory 171 10.1 Motivation 173 10.1.1 Rethinking PIM for Energy Efficiency 173 10.1.2 Aggregation as PIM Operations 174 10.2 Architecture 176 10.2.1 Overview 176 10.2.2 Programming Model 177 10.2.3 On-Chip Caches 177 10.2.4 Coherence and Consistency 181 10.2.5 Main Memory 181 10.2.6 Potential Generalization Opportunities 183 10.3 Compiler Support 184 10.4 Contributions over Prior Art 185 10.4.1 PIM-Enabled Instructions 185 10.4.2 Parallel Reduction in Caches 187 10.4.3 Row Buffer Locality of DRAM Writes 188 10.5 Target Applications 188 10.6 Evaluation Methodology 190 10.6.1 Simulation Configuration 190 10.6.2 Hardware Overhead 191 10.6.3 Workloads 192 10.7 Evaluation Results 192 10.7.1 Energy Consumption and Performance 192 10.7.2 Dynamic Energy Breakdown 196 10.7.3 Comparison with Aggressive Writeback 197 10.7.4 Multiprogrammed Workloads 198 10.7.5 Comparison with Intrinsic-based Code 198 10.8 Summary 199 Chapter 11 Conclusion 201 11.1 Energy-Efficient On-Chip Caches based on STT-RAM 202 11.2 Intelligent Main Memory based on Logic-Enabled DRAM 203 Bibliography 205 ์š”์•ฝ 227Docto

    ARCHITECTING EMERGING MEMORY TECHNOLOGIES FOR ENERGY-EFFICIENT COMPUTING IN MODERN PROCESSORS

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH
    • โ€ฆ
    corecore