14 research outputs found
Architectural Techniques for Disturbance Mitigation in Future Memory Systems
With recent advancements in CMOS technology, scaling down the feature size has improved memory capacity, power, performance, and cost. However, this dramatic progress has made precise control of the manufacturing process below 22 nm increasingly difficult. Moreover, the technology scaling roadmap predicts significant cell-to-cell process variation as well as electromagnetic disturbances among memory cells, which can easily deviate circuit characteristics from design goals and pose threats to reliability, energy efficiency, and security.
This dissertation proposes simple, energy-efficient and low-overhead techniques that combat the challenges resulting from technology scaling in future memory systems. Specifically, this dissertation investigates solutions tuned to particular types of disturbance challenges, such as inter-cell or intra-cell disturbance, that are energy efficient while guaranteeing memory reliability.
The contribution of this dissertation is threefold. First, it uses a deterministic counter-based approach that targets the root of inter-cell disturbances in dynamic random-access memory (DRAM), deterministically mitigating them while also reducing overall energy consumption. Second, it uses Markov chains to reason about the reliability of spin-transfer torque magnetic random-access memory (STT-RAM), which suffers from intra-cell disturbances, and investigates on-demand refresh policies to recover from the persistent effects of such disturbances. Third, it leverages an encoding technique integrated with a novel word-level compression scheme to reduce the vulnerability of cells to inter-cell write disturbances in phase-change memory (PCM). However, mitigating inter-cell write disturbances while also minimizing write energy may increase the number of updated PCM cells and degrade endurance; hence, it uses multi-objective optimization to balance write energy and endurance in PCM cells while mitigating inter-cell disturbances.
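For intuition, a deterministic counter-based disturbance mitigation of the kind described in the first contribution can be sketched as follows. The class name, threshold value, and simple adjacent-row refresh policy here are hypothetical simplifications for illustration, not the dissertation's actual design.

```python
# Illustrative sketch of deterministic counter-based inter-cell disturbance
# mitigation in DRAM (names and parameters are hypothetical).

class DisturbanceMitigator:
    def __init__(self, threshold=1000):
        self.threshold = threshold  # activations tolerated before neighbors are refreshed
        self.counters = {}          # per-row activation counters

    def on_activate(self, row):
        """Count an activation; when the aggressor row crosses the threshold,
        deterministically refresh its physical neighbors and reset the counter."""
        self.counters[row] = self.counters.get(row, 0) + 1
        refreshed = []
        if self.counters[row] >= self.threshold:
            for victim in (row - 1, row + 1):  # physically adjacent victim rows
                refreshed.append(victim)
            self.counters[row] = 0             # counter reset after mitigation
        return refreshed
```

Because every activation is counted, the worst-case disturbance any victim row can accumulate between refreshes is bounded by the threshold, which is what makes the mitigation deterministic rather than probabilistic.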
The work in this dissertation provides important insights into how to tackle the critical reliability challenges that high-density memory systems confront in deeply scaled technology nodes. It advocates techniques across memory technologies that guarantee the reliability of future memory systems while incurring nominal costs in energy, area, and performance.
Enabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM
Phase change memory (PCM) has recently emerged as a promising technology to
meet the fast-growing demand for large-capacity memory in computer systems,
replacing DRAM, which is impeded by physical limitations. Multi-level cell (MLC)
PCM offers high density with low per-byte fabrication cost. However, despite
many advantages, such as scalability and low leakage, the energy for
programming intermediate states is considerably larger than for programming
single-level cell PCM. In this paper, we study encoding techniques to reduce
write energy for MLC PCM when the encoding granularity is lowered below the
typical cache line size. We observe that encoding data blocks at small
granularity to reduce write energy actually increases the write energy because
of the auxiliary encoding bits. We mitigate this adverse effect by 1) designing
suitable codeword mappings that use fewer auxiliary bits and 2) proposing a new
Word-Level Compression (WLC) which compresses more than 91% of the memory lines
and provides enough room to store the auxiliary data using a novel restricted
coset encoding applied at small data block granularities.
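For intuition, the classic Flip-N-Write scheme is a one-auxiliary-bit coset code of the kind generalized here: store either the new word or its complement, whichever changes fewer cells. The sketch below shows that baseline at 16-bit granularity; it is an illustrative reference point, not the paper's restricted coset code.

```python
# Flip-N-Write-style coset encoding sketch at 16-bit word granularity
# (illustrative baseline, not the paper's restricted coset code).

def flipnw_encode(old: int, new: int, width: int = 16):
    """Return (stored_word, flip_bit): store `new` directly or complemented,
    whichever flips fewer cells relative to the old stored word."""
    mask = (1 << width) - 1
    direct = bin((old ^ new) & mask).count("1")              # flips if stored as-is
    inverted = bin((old ^ (~new & mask)) & mask).count("1")  # flips if complemented
    if inverted < direct:
        return (~new & mask), 1   # flip bit set: word stored complemented
    return new & mask, 0

def flipnw_decode(stored: int, flip: int, width: int = 16):
    mask = (1 << width) - 1
    return (~stored & mask) if flip else stored
```

Note how the auxiliary flip bit is pure overhead per word, which is why shrinking the encoding granularity inflates auxiliary-bit cost unless those bits can be hidden, e.g., in space freed by compression.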
Experimental results show that the proposed encoding at 16-bit data
granularity reduces the write energy by 39%, on average, versus the leading
encoding approach for write energy reduction. Furthermore, it improves
endurance by 20% and is more reliable than the leading approach. Hardware
synthesis evaluation shows that the proposed encoding can be implemented
on-chip with only a nominal area overhead.
Architectural Techniques to Enable Reliable and Scalable Memory Systems
High capacity and scalable memory systems play a vital role in enabling our
desktops, smartphones, and pervasive technologies like Internet of Things
(IoT). Unfortunately, memory systems are becoming increasingly prone to faults.
This is because we rely on technology scaling to improve memory density, and at
small feature sizes, memory cells tend to break easily. Today, memory
reliability is seen as the key impediment towards using high-density devices,
adopting new technologies, and even building the next Exascale supercomputer.
To ensure even a bare-minimum level of reliability, present-day solutions tend
to have high performance, power and area overheads. Ideally, we would like
memory systems to remain robust, scalable, and implementable while keeping the
overheads to a minimum. This dissertation describes how simple cross-layer
architectural techniques can provide orders of magnitude higher reliability and
enable seamless scalability for memory systems while incurring negligible
overheads. PhD thesis, Georgia Institute of Technology, May 2017.
Dependable Embedded Systems
This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. It introduces the most prominent reliability concerns from today's point of view and briefly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level, such as the circuit level or system level alone, this book addresses reliability challenges across levels, from the physical level all the way to the system level (cross-layer approaches). It aims to demonstrate how new hardware/software co-design solutions can effectively mitigate reliability degradation such as transistor aging, process variation, temperature effects, and soft errors. The book provides readers with the latest insights into novel, cross-layer methods and models for the dependability of embedded systems; describes cross-layer approaches that leverage reliability through techniques proactively designed with respect to techniques at other layers; and explains run-time adaptation and concepts of self-organization for achieving error resiliency in complex, future many-core systems.
Disturbance Error Mitigation and RMW Performance Enhancement Techniques for Phase-Change Memory Systems
PhD thesis, Seoul National University, Department of Electrical and Computer Engineering, August 2021. Phase-change memory (PCM) heralds a new era of memory systems owing to its attractive characteristics, and many memory manufacturers (e.g., Intel, SK Hynix, and Samsung) are developing related products. PCM can be applied in various circumstances; it is not limited to extreme-scale databases. For example, PCM has low standby power due to its non-volatility, so computation-intensive applications and mobile applications (i.e., those with long memory idle times) are well suited to PCM-based computing systems.
Despite these fascinating features, PCM is still far from the general commercial market due to low reliability and long latency. In particular, low reliability has been a persistent problem for PCM over the past decades. As semiconductor process technology has rapidly scaled down over the years, DRAM has reached the 10 nm class. In addition, it is reported that write disturbance errors (WDEs) become a serious issue for PCM below the 54 nm class. Therefore, addressing WDEs is essential to make PCM competitive with DRAM. To overcome this problem, this dissertation proposes a novel approach that can restore meta-stable cells on demand by leveraging two-level SRAM-based tables, thereby significantly reducing the number of WDEs. Furthermore, a novel randomized approach is proposed to implement a replacement policy that would otherwise require hundreds of read ports on the SRAM.
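A minimal sketch of the on-demand restore idea: a small table counts disturbances per victim address and rewrites a victim once its count reaches a threshold. All names, the single-neighbor adjacency model, and the thresholds are invented for illustration; the dissertation's two-level design differs in detail.

```python
# Hypothetical sketch of tracking write-disturbed neighbors in a small
# SRAM-like table and restoring them on demand (illustrative only).

class DisturbanceTable:
    def __init__(self, capacity=4, restore_threshold=2):
        self.capacity = capacity                  # entries the table can hold
        self.restore_threshold = restore_threshold
        self.table = {}                           # victim address -> disturbance count

    def on_write(self, addr):
        """A write disturbs its two neighbors; restore (rewrite) any victim
        whose accumulated disturbance count reaches the threshold."""
        restored = []
        for victim in (addr - 1, addr + 1):
            self.table[victim] = self.table.get(victim, 0) + 1
            if self.table[victim] >= self.restore_threshold:
                restored.append(victim)           # victim cell rewritten to a stable state
                del self.table[victim]
        # evict the coldest entry when the table overflows; a full scan like
        # this is exactly what the randomized replacement policy approximates
        while len(self.table) > self.capacity:
            coldest = min(self.table, key=self.table.get)
            del self.table[coldest]
        return restored
```

The `min` scan over all entries is what would demand many SRAM read ports in hardware, which motivates replacing it with a randomized estimate of the lowest-count entry.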
The second problem of PCM is its long latency compared to DRAM. In particular, PCM tries to enhance its throughput by adopting a larger transaction unit; however, a unit size different from the general-purpose processor's cache line degrades system performance by requiring a read-modify-write (RMW) module. Since no prior research has addressed RMW in PCM-based memory systems, this dissertation proposes a novel architecture to enhance the overall performance and reliability of a PCM-based memory system with an RMW module. The proposed architecture enhances data re-usability without introducing extra storage resources. Furthermore, a novel operation that merges commands regardless of command type is proposed to notably enhance performance.
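The command-merging idea can be illustrated with a toy controller queue in which same-address commands coalesce across types instead of issuing separate media transactions: a later write overwrites a pending write's data, and a read to a pending write's address is answered by forwarding. This is a hypothetical simplification of the proposed typeless merging, not its actual implementation.

```python
# Toy sketch of cross-type command coalescing in a memory-controller queue
# (illustrative simplification; names and semantics are invented).

def enqueue(queue, cmd):
    """cmd = ('READ' | 'WRITE', addr, data_or_None).
    Returns (queue, forwarded_data). Same-address commands coalesce with a
    pending WRITE: a new WRITE replaces its data, a new READ is served by
    forwarding; neither adds a media transaction."""
    for i, (ptype, paddr, pdata) in enumerate(queue):
        if paddr == cmd[1] and ptype == "WRITE":
            if cmd[0] == "WRITE":
                queue[i] = ("WRITE", paddr, cmd[2])   # coalesce the newer write
                return queue, None
            return queue, pdata                       # forward pending data to the read
    queue.append(cmd)                                 # no match: enqueue normally
    return queue, None
```

Coalescing pays off most when the transaction unit is larger than a cache line, since each avoided command is a full RMW cycle on the PCM media.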
Another problem is the absence of a full simulation platform for PCM. While publicly announced features of PCM-related products (e.g., Intel Optane) are scarce for confidentiality reasons, the available information can be integrated to develop an architecture simulator that resembles the actual product. To this end, this dissertation gathers the available features of the modules in a PCM controller and implements a dedicated simulator for future research purposes.
1 INTRODUCTION 1
1.1 Limitation of Traditional Main Memory Systems 1
1.2 Phase-Change Memory as Main Memory 3
1.2.1 Opportunities of PCM-based System 3
1.2.2 Challenges of PCM-based System 4
1.3 Dissertation Overview 7
2 BACKGROUND AND PREVIOUS WORK 8
2.1 Phase-Change Memory 8
2.2 Mitigation Schemes for Write Disturbance Errors 10
2.2.1 Write Disturbance Errors 10
2.2.2 Verification and Correction 12
2.2.3 Lazy Correction 13
2.2.4 Data Encoding-based Schemes 14
2.2.5 Sparse-Insertion Write Cache 16
2.3 Performance Enhancement for Read-Modify-Write 17
2.3.1 Traditional Read-Modify-Write 17
2.3.2 Write Coalescing for RMW 19
2.4 Architecture Simulators for PCM 21
2.4.1 NVMain 21
2.4.2 Ramulator 22
2.4.3 DRAMsim3 22
3 IN-MODULE DISTURBANCE BARRIER 24
3.1 Motivation 25
3.2 IMDB: In Module-Disturbance Barrier 29
3.2.1 Architectural Overview 29
3.2.2 Implementation of Data Structures 30
3.2.3 Modification of Media Controller 36
3.3 Replacement Policy 38
3.3.1 Replacement Policy for IMDB 38
3.3.2 Approximate Lowest Number Estimator 40
3.4 Putting All Together: Case Studies 43
3.5 Evaluation 45
3.5.1 Configuration 45
3.5.2 Architectural Exploration 47
3.5.3 Effectiveness of the Replacement Policy 48
3.5.4 Sensitivity to Main Table Configuration 49
3.5.5 Sensitivity to Barrier Buffer Size 51
3.5.6 Sensitivity to AppLE Group Size 52
3.5.7 Comparison with Other Studies 54
3.6 Discussion 59
3.7 Summary 63
4 INTEGRATION OF AN RMW MODULE IN A PCM-BASED SYSTEM 64
4.1 Motivation 65
4.2 Utilization of DRAM Cache for RMW 67
4.2.1 Architectural Design 67
4.2.2 Algorithm 70
4.3 Typeless Command Merging 73
4.3.1 Architectural Design 73
4.3.2 Algorithm 74
4.4 An Alternative Implementation: SRC-RMW 78
4.4.1 Implementation of SRC-RMW 78
4.4.2 Design Constraint 80
4.5 Case Study 82
4.6 Evaluation 85
4.6.1 Configuration 85
4.6.2 Speedup 88
4.6.3 Read Reliability 91
4.6.4 Energy Consumption: Selecting a Proper Page Size 93
4.6.5 Comparison with Other Studies 95
4.7 Discussion 97
4.8 Summary 99
5 AN ALL-INCLUSIVE SIMULATOR FOR A PCM CONTROLLER 100
5.1 Motivation 101
5.2 PCMCsim: PCM Controller Simulator 103
5.2.1 Architectural Overview 103
5.2.2 Underlying Classes of PCMCsim 104
5.2.3 Implementation of Contention Behavior 108
5.2.4 Modules of PCMCsim 109
5.3 Evaluation 116
5.3.1 Correctness of the Simulator 116
5.3.2 Comparison with Other Simulators 117
5.4 Summary 119
6 Conclusion 120
Abstract (In Korean) 141
Acknowledgment 143
Memory System Design Techniques Based on New Memory Technologies
PhD thesis, Seoul National University, Department of Electrical and Computer Engineering, February 2017. Performance and energy efficiency of modern computer systems are largely dominated by the memory system. This memory bottleneck has been exacerbated in the past few years by (1) architectural innovations for improving the efficiency of computation units (e.g., chip multiprocessors), which shift the major cause of inefficiency from processors to memory, and (2) the emergence of data-intensive applications, which demand a large main memory capacity and an excessive amount of memory bandwidth to be handled efficiently. In order to address this memory wall challenge, this dissertation aims at exploring the potential of emerging memory technologies and designing a high-performance, energy-efficient memory hierarchy that is aware of and leverages the characteristics of such new memory technologies.
The first part of this dissertation focuses on energy-efficient on-chip cache design based on a new non-volatile memory technology called spin-transfer torque RAM (STT-RAM). When STT-RAM is used to build on-chip caches, it provides several advantages over conventional charge-based memories (e.g., SRAM or eDRAM), such as non-volatility, lower static power, and higher density. However, simply replacing SRAM caches with STT-RAM actually increases energy consumption because write operations of STT-RAM are slower and more energy-consuming than those of SRAM.
To address this challenge, we propose four novel architectural techniques that can alleviate the impact of inefficient STT-RAM write operations on system performance and energy consumption. First, we apply STT-RAM to instruction caches (where write operations are relatively infrequent) and devise a power-gating mechanism called LASIC, which leverages the non-volatility of STT-RAM to turn off STT-RAM instruction caches inside small loops. Second, we propose lower-bits cache, which exploits the narrow bit-width characteristics of application data by caching frequent bit-flips at lower bits in a small SRAM cache. Third, we present prediction hybrid cache, an SRAM/STT-RAM hybrid cache whose block placement between SRAM and STT-RAM is determined by predicting the write intensity of each cache block with a new hardware structure called write intensity predictor. Fourth, we propose DASCA, which predicts write operations that can bypass the cache without incurring extra cache misses (called dead writes) and lets the last-level cache bypass such dead writes to reduce write energy consumption.
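As a rough illustration of the third technique, a PC-indexed table can estimate write intensity and steer block placement between SRAM and STT-RAM. The structure and threshold below are invented for illustration; the actual predictor tracks trigger instructions with dynamic set sampling and threshold adjustment.

```python
# Toy sketch of write-intensity prediction for SRAM/STT-RAM hybrid cache
# placement (hypothetical parameters, not the dissertation's predictor).

class WriteIntensityPredictor:
    def __init__(self, hot_threshold=4):
        self.hot_threshold = hot_threshold
        self.counts = {}                 # trigger-instruction PC -> observed writebacks

    def train(self, pc):
        """Record that a block brought in by instruction `pc` was written back dirty."""
        self.counts[pc] = self.counts.get(pc, 0) + 1

    def place(self, pc):
        """Predict placement for a block first touched by instruction `pc`."""
        if self.counts.get(pc, 0) >= self.hot_threshold:
            return "SRAM"       # predicted write-intensive: avoid costly STT-RAM writes
        return "STT-RAM"        # predicted read-mostly: exploit density and low leakage
```

The same PC tends to touch blocks with similar write behavior, which is why indexing the predictor by instruction address rather than block address keeps the table small.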
The second part of this dissertation architects intelligent main memory and its host architecture support based on logic-enabled DRAM. Traditionally, main memory has served the sole purpose of storing data because the extra manufacturing cost of implementing rich functionality (e.g., computation) on a DRAM die was unacceptably high. However, the advent of 3D die stacking now provides a practical, cost-effective way to integrate complex logic circuits into main memory, thereby opening up possibilities for intelligent main memory. For example, it can be utilized to implement advanced memory management features (e.g., scheduling, power management) inside memory; it can also be used to offload computation to main memory, which allows us to overcome the memory bandwidth bottleneck caused by narrow off-chip channels (commonly known as processing-in-memory, or PIM). The remaining questions are what to implement inside main memory and how to integrate and expose such new features to existing systems.
In order to answer these questions, we propose four system designs that utilize logic-enabled DRAM to improve system performance and energy efficiency. First, we utilize the existing logic layer of a Hybrid Memory Cube (a commercial logic-enabled DRAM product) to (1) dynamically turn off some of its off-chip links by monitoring the actual bandwidth demand and (2) integrate a prefetch buffer into main memory to perform aggressive prefetching without consuming off-chip link bandwidth. Second, we propose a scalable accelerator for large-scale graph processing called Tesseract, in which graph processing computation is offloaded to specialized processors inside main memory in order to achieve memory-capacity-proportional performance. Third, we design a low-overhead PIM architecture for near-term adoption called PIM-enabled instructions, where PIM operations are interfaced as cache-coherent, virtually addressed host processor instructions that can be executed either by the host processor or in main memory depending on the data locality. Fourth, we propose an energy-efficient PIM system called aggregation-in-memory, which can adaptively execute PIM operations at any level of the memory hierarchy and provides a fully automated compiler toolchain that transforms existing applications to use PIM operations without programmer intervention.
Chapter 1 Introduction 1
1.1 Inefficiencies in the Current Memory Systems 2
1.1.1 On-Chip Caches 2
1.1.2 Main Memory 2
1.2 New Memory Technologies: Opportunities and Challenges 3
1.2.1 Energy-Efficient On-Chip Caches based on STT-RAM 3
1.2.2 Intelligent Main Memory based on Logic-Enabled DRAM 6
1.3 Dissertation Overview 9
Chapter 2 Previous Work 11
2.1 Energy-Efficient On-Chip Caches based on STT-RAM 11
2.1.1 Hybrid Caches 11
2.1.2 Volatile STT-RAM 13
2.1.3 Redundant Write Elimination 14
2.2 Intelligent Main Memory based on Logic-Enabled DRAM 15
2.2.1 PIM Architectures in the 1990s 15
2.2.2 Modern PIM Architectures based on 3D Stacking 15
2.2.3 Modern PIM Architectures on Memory Dies 17
Chapter 3 Loop-Aware Sleepy Instruction Cache 19
3.1 Architecture 20
3.1.1 Loop Cache 21
3.1.2 Loop-Aware Sleep Controller 22
3.2 Evaluation and Discussion 24
3.2.1 Simulation Environment 24
3.2.2 Energy 25
3.2.3 Performance 27
3.2.4 Sensitivity Analysis 27
3.3 Summary 28
Chapter 4 Lower-Bits Cache 29
4.1 Architecture 29
4.2 Experiments 32
4.2.1 Simulator and Cache Model 32
4.2.2 Results 33
4.3 Summary 34
Chapter 5 Prediction Hybrid Cache 35
5.1 Problem and Motivation 37
5.1.1 Problem Definition 37
5.1.2 Motivation 37
5.2 Write Intensity Predictor 38
5.2.1 Keeping Track of Trigger Instructions 39
5.2.2 Identifying Hot Trigger Instructions 40
5.2.3 Dynamic Set Sampling 41
5.2.4 Summary 42
5.3 Prediction Hybrid Cache 43
5.3.1 Need for Write Intensity Prediction 43
5.3.2 Organization 43
5.3.3 Operations 44
5.3.4 Dynamic Threshold Adjustment 45
5.4 Evaluation Methodology 48
5.4.1 Simulator Configuration 48
5.4.2 Workloads 50
5.5 Single-Core Evaluations 51
5.5.1 Energy Consumption and Speedup 51
5.5.2 Energy Breakdown 53
5.5.3 Coverage and Accuracy 54
5.5.4 Sensitivity to Write Intensity Threshold 55
5.5.5 Impact of Dynamic Set Sampling 55
5.5.6 Results for Non-Write-Intensive Workloads 56
5.6 Multicore Evaluations 57
5.7 Summary 59
Chapter 6 Dead Write Prediction Assisted STT-RAM Cache 61
6.1 Motivation 62
6.1.1 Energy Impact of Inefficient Write Operations 62
6.1.2 Limitations of Existing Approaches 63
6.1.3 Potential of Dead Writes 64
6.2 Dead Write Classification 65
6.2.1 Dead-on-Arrival Fills 65
6.2.2 Dead-Value Fills 66
6.2.3 Closing Writes 66
6.2.4 Decomposition 67
6.3 Dead Write Prediction Assisted STT-RAM Cache Architecture 68
6.3.1 Dead Write Prediction 68
6.3.2 Bidirectional Bypass 71
6.4 Evaluation Methodology 72
6.4.1 Simulation Configuration 72
6.4.2 Workloads 74
6.5 Evaluation for Single-Core Systems 75
6.5.1 Energy Consumption and Speedup 75
6.5.2 Coverage and Accuracy 78
6.5.3 Sensitivity to Signature 78
6.5.4 Sensitivity to Update Policy 80
6.5.5 Implications of Device-/Circuit-Level Techniques for Write Energy Reduction 80
6.5.6 Impact of Prefetching 80
6.6 Evaluation for Multi-Core Systems 81
6.6.1 Energy Consumption and Speedup 81
6.6.2 Application to Inclusive Caches 83
6.6.3 Application to Three-Level Cache Hierarchy 84
6.7 Summary 85
Chapter 7 Link Power Management for Hybrid Memory Cubes 87
7.1 Background and Motivation 88
7.1.1 Hybrid Memory Cube 88
7.1.2 Motivation 89
7.2 HMC Link Power Management 91
7.2.1 Link Delay Monitor 91
7.2.2 Power State Transition 94
7.2.3 Overhead 95
7.3 Two-Level Prefetching 95
7.4 Application to Multi-HMC Systems 97
7.5 Experiments 98
7.5.1 Methodology 98
7.5.2 Link Energy Consumption and Speedup 100
7.5.3 HMC Energy Consumption 102
7.5.4 Runtime Behavior of LPM 102
7.5.5 Sensitivity to Slowdown Threshold 104
7.5.6 LPM without Prefetching 104
7.5.7 Impact of Prefetching on Link Traffic 105
7.5.8 On-Chip Prefetcher Aggressiveness in 2LP 107
7.5.9 Tighter Off-Chip Bandwidth Margin 107
7.5.10 Multithreaded Workloads 108
7.5.11 Multi-HMC Systems 109
7.6 Summary 111
Chapter 8 Tesseract PIM System for Parallel Graph Processing 113
8.1 Background and Motivation 115
8.1.1 Large-Scale Graph Processing 115
8.1.2 Graph Processing on Conventional Systems 117
8.1.3 Processing-in-Memory 118
8.2 Tesseract Architecture 119
8.2.1 Overview 119
8.2.2 Remote Function Call via Message Passing 122
8.2.3 Prefetching 124
8.2.4 Programming Interface 126
8.2.5 Application Mapping 127
8.3 Evaluation Methodology 128
8.3.1 Simulation Configuration 128
8.3.2 Workloads 129
8.4 Evaluation Results 130
8.4.1 Performance 130
8.4.2 Iso-Bandwidth Comparison 133
8.4.3 Execution Time Breakdown 134
8.4.4 Prefetch Efficiency 134
8.4.5 Scalability 135
8.4.6 Effect of Higher Off-Chip Network Bandwidth 136
8.4.7 Effect of Better Graph Distribution 137
8.4.8 Energy/Power Consumption and Thermal Analysis 138
8.5 Summary 139
Chapter 9 PIM-Enabled Instructions 141
9.1 Potential of ISA Extensions as the PIM Interface 143
9.2 PIM Abstraction 145
9.2.1 Operations 145
9.2.2 Memory Model 147
9.2.3 Software Modification 148
9.3 Architecture 148
9.3.1 Overview 148
9.3.2 PEI Computation Unit (PCU) 149
9.3.3 PEI Management Unit (PMU) 150
9.3.4 Virtual Memory Support 153
9.3.5 PEI Execution 153
9.3.6 Comparison with Active Memory Operations 154
9.4 Target Applications for Case Study 155
9.4.1 Large-Scale Graph Processing 155
9.4.2 In-Memory Data Analytics 156
9.4.3 Machine Learning and Data Mining 157
9.4.4 Operation Summary 157
9.5 Evaluation Methodology 158
9.5.1 Simulation Configuration 158
9.5.2 Workloads 159
9.6 Evaluation Results 159
9.6.1 Performance 160
9.6.2 Sensitivity to Input Size 163
9.6.3 Multiprogrammed Workloads 164
9.6.4 Balanced Dispatch: Idea and Evaluation 165
9.6.5 Design Space Exploration for PCUs 165
9.6.6 Performance Overhead of the PMU 167
9.6.7 Energy, Area, and Thermal Issues 167
9.7 Summary 168
Chapter 10 Aggregation-in-Memory 171
10.1 Motivation 173
10.1.1 Rethinking PIM for Energy Efficiency 173
10.1.2 Aggregation as PIM Operations 174
10.2 Architecture 176
10.2.1 Overview 176
10.2.2 Programming Model 177
10.2.3 On-Chip Caches 177
10.2.4 Coherence and Consistency 181
10.2.5 Main Memory 181
10.2.6 Potential Generalization Opportunities 183
10.3 Compiler Support 184
10.4 Contributions over Prior Art 185
10.4.1 PIM-Enabled Instructions 185
10.4.2 Parallel Reduction in Caches 187
10.4.3 Row Buffer Locality of DRAM Writes 188
10.5 Target Applications 188
10.6 Evaluation Methodology 190
10.6.1 Simulation Configuration 190
10.6.2 Hardware Overhead 191
10.6.3 Workloads 192
10.7 Evaluation Results 192
10.7.1 Energy Consumption and Performance 192
10.7.2 Dynamic Energy Breakdown 196
10.7.3 Comparison with Aggressive Writeback 197
10.7.4 Multiprogrammed Workloads 198
10.7.5 Comparison with Intrinsic-based Code 198
10.8 Summary 199
Chapter 11 Conclusion 201
11.1 Energy-Efficient On-Chip Caches based on STT-RAM 202
11.2 Intelligent Main Memory based on Logic-Enabled DRAM 203
Bibliography 205
Abstract (In Korean) 227
Near Data Processing for Efficient and Trusted Systems
We live in a world which constantly produces data at a rate that only increases with time. Conventional processor architectures fail to process this abundant data efficiently, as they expend significant energy on instruction processing and on moving data over deep memory hierarchies. Furthermore, to process large amounts of data in a cost-effective manner, there is increased demand for remote computation. While cloud service providers have come up with innovative solutions to cater to this increased demand, the security concerns users have about their data remain a strong impediment to wide-scale adoption.
An exciting technique in our repertoire to deal with these challenges is near-data processing. Near-data processing (NDP) is a data-centric paradigm which moves computation to where data resides. This dissertation exploits NDP to both process the data deluge we face efficiently and design low-overhead secure hardware designs.
To this end, we first propose Compute Caches, a novel NDP technique. Simple augmentations to underlying SRAM design enable caches to perform commonly used operations. In-place computation in caches not only avoids excessive data movement over memory hierarchy, but also significantly reduces instruction processing energy as independent sub-units inside caches perform computation in parallel. Compute Caches significantly improve the performance and reduce energy expended for a suite of data intensive applications.
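Functionally, an in-place cache operation computes a bitwise result over two resident lines without moving them to the core: activating two wordlines in the same subarray lets the shared bitlines sense the AND of the two lines, every bit position in parallel. Modeling cache lines as integers, the observable behavior is:

```python
# Functional model of an in-SRAM AND over two cache lines; the hardware
# computes all bit positions in parallel on the bitlines, so no data
# traverses the memory hierarchy. (Behavioral sketch, not a circuit model.)

def compute_cache_and(line_a: int, line_b: int, line_bits: int = 512) -> int:
    """Return the bitwise AND of two cache-line-sized bit vectors, as the
    sense amplifiers would read it out after dual wordline activation."""
    mask = (1 << line_bits) - 1
    return (line_a & line_b) & mask
```

The energy win comes not from the AND itself but from skipping the loads, stores, and per-word ALU instructions a processor would otherwise spend on the same result.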
Second, this dissertation identifies security advantages of NDP. While memory bus side channel has received much attention, a low-overhead hardware design which defends against it remains elusive. We observe that smart memory, memory with compute capability, can dramatically simplify this problem. To exploit this observation, we propose InvisiMem which uses the logic layer in the smart memory to implement cryptographic primitives, which aid in addressing memory bus side channel efficiently. Our solutions obviate the need for expensive constructs like Oblivious RAM (ORAM) and Merkle trees, and have one to two orders of magnitude lower overheads for performance, space, energy, and memory bandwidth, compared to prior solutions.
This dissertation also addresses a related vulnerability, the page fault side channel, in which the operating system (OS) induces page faults to learn an application's address trace and deduces application secrets from it. To tackle it, we propose Sanctuary, which obfuscates the page fault channel while allowing the OS to manage memory as a resource. To do so, we design a novel construct, Oblivious Page Management (OPAM), which is derived from ORAM but customized for the page management context. We employ near-memory page moves to reduce OPAM overhead and also propose a novel memory partition to reduce the OPAM transactions required. For a suite of cloud applications that process sensitive data, we show that the page fault channel can be tackled at reasonable overheads.
PhD thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144139/1/shaizeen_1.pd