219 research outputs found
Techniques for Mitigating Disturbance Errors and Improving RMW Performance in Phase-Change Memory Systems
Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2021. Advisor: Hyuk-Jae Lee.
Phase-change memory (PCM) marks the beginning of a new era of memory systems, owing to its attractive characteristics. Many memory product manufacturers (e.g., Intel, SK Hynix, and Samsung) are developing related products. PCM can be applied in a variety of circumstances; it is not limited to extreme-scale databases. For example, PCM has low standby power owing to its non-volatility; hence, computation-intensive applications and mobile applications (i.e., those with long memory idle times) are well suited to PCM-based computing systems.
Despite these fascinating features, PCM is still far from the general commercial market due to its low reliability and long latency. In particular, low reliability has been a painful problem for PCM over the past decades. As semiconductor process technology has rapidly scaled down over the years, DRAM has reached the 10 nm class. In addition, it is reported that write disturbance errors (WDEs) would become a serious issue for PCM if it scales below the 54 nm class. Therefore, addressing WDEs is essential to make PCM competitive with DRAM. To overcome this problem, this dissertation proposes a novel approach that restores meta-stable cells on demand by leveraging two-level SRAM-based tables, thereby significantly reducing the number of WDEs. Furthermore, a novel randomized approach is proposed to implement a replacement policy that would otherwise require hundreds of read ports on the SRAM.
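The abstract does not spell out the randomized replacement mechanism; the general idea behind replacing an exact minimum search (one SRAM read port per entry) with random sampling can be sketched as follows. All names, table sizes, and counter values here are illustrative assumptions, not the dissertation's actual design:

```python
import random

def approximate_min_victim(counters, sample_size=8, rng=random.Random(0)):
    """Pick an eviction victim whose counter is (probably) near the minimum.

    Finding the exact minimum over `counters` would require reading every
    SRAM entry simultaneously (hundreds of read ports); sampling a few
    random entries needs only `sample_size` reads while usually landing
    close to the true minimum.
    """
    indices = rng.sample(range(len(counters)), min(sample_size, len(counters)))
    return min(indices, key=lambda i: counters[i])

# Hypothetical table of per-entry counters used by the replacement policy.
counters = [17, 3, 42, 8, 25, 1, 30, 12, 9, 27, 5, 33]
victim = approximate_min_victim(counters)
```

When the sample covers the whole table the result is exact; smaller samples trade a slightly sub-optimal victim for a drastic reduction in read-port cost.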
The second problem of PCM is its long latency compared with that of DRAM. In particular, PCM tries to enhance its throughput by adopting a larger transaction unit; however, a unit size that differs from the general-purpose processor's cache line further degrades system performance because it requires a read-modify-write (RMW) module. Since no prior research has addressed RMW in a PCM-based memory system, this dissertation proposes a novel architecture that enhances the overall performance and reliability of a PCM-based memory system containing an RMW module. The proposed architecture improves data re-usability without introducing extra storage resources. Furthermore, a novel operation that merges commands regardless of their types is proposed to notably enhance performance.
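To illustrate why a transaction unit larger than the cache line forces read-modify-write traffic, and how coalescing commands that target the same unit amortizes it, consider this toy model. The sizes and queue structure are assumptions for illustration only, not the dissertation's architecture:

```python
# Hypothetical sizes: 64 B processor cache lines inside a 256 B PCM unit.
CACHE_LINE = 64
PCM_UNIT = 256

class RMWQueue:
    """Coalesce cache-line writes that target the same PCM transaction unit.

    A naive RMW costs one unit read plus one unit write per cache-line
    store; stores merged into the same pending unit share that cost.
    """
    def __init__(self):
        self.pending = {}  # unit base address -> {offset: cache-line data}

    def write(self, addr, data):
        unit, offset = addr - addr % PCM_UNIT, addr % PCM_UNIT
        self.pending.setdefault(unit, {})[offset] = data

    def flush(self, pcm):
        ops = 0
        for unit, slices in self.pending.items():
            buf = bytearray(pcm.get(unit, bytes(PCM_UNIT)))  # read whole unit
            for offset, data in slices.items():              # modify slices
                buf[offset:offset + CACHE_LINE] = data
            pcm[unit] = bytes(buf)                           # single write-back
            ops += 2  # one unit read + one unit write
        self.pending.clear()
        return ops

q = RMWQueue()
pcm = {}
q.write(0, b"A" * CACHE_LINE)   # two stores to the same 256 B unit...
q.write(64, b"B" * CACHE_LINE)
ops = q.flush(pcm)              # ...served by one read and one write
```

Without merging, the two stores would cost four unit-sized operations; the merged flush performs two.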
Another problem is the absence of a full simulation platform for PCM. Although the publicly disclosed features of PCM-related products (e.g., Intel Optane) are scarce for confidentiality reasons, the available information can be integrated to develop an architecture simulator that resembles the commercial product. To this end, this dissertation gathers all available information on the modules of a PCM controller and implements a dedicated simulator for future research purposes.
1 INTRODUCTION 1
1.1 Limitation of Traditional Main Memory Systems 1
1.2 Phase-Change Memory as Main Memory 3
1.2.1 Opportunities of PCM-based System 3
1.2.2 Challenges of PCM-based System 4
1.3 Dissertation Overview 7
2 BACKGROUND AND PREVIOUS WORK 8
2.1 Phase-Change Memory 8
2.2 Mitigation Schemes for Write Disturbance Errors 10
2.2.1 Write Disturbance Errors 10
2.2.2 Verification and Correction 12
2.2.3 Lazy Correction 13
2.2.4 Data Encoding-based Schemes 14
2.2.5 Sparse-Insertion Write Cache 16
2.3 Performance Enhancement for Read-Modify-Write 17
2.3.1 Traditional Read-Modify-Write 17
2.3.2 Write Coalescing for RMW 19
2.4 Architecture Simulators for PCM 21
2.4.1 NVMain 21
2.4.2 Ramulator 22
2.4.3 DRAMsim3 22
3 IN-MODULE DISTURBANCE BARRIER 24
3.1 Motivation 25
3.2 IMDB: In Module-Disturbance Barrier 29
3.2.1 Architectural Overview 29
3.2.2 Implementation of Data Structures 30
3.2.3 Modification of Media Controller 36
3.3 Replacement Policy 38
3.3.1 Replacement Policy for IMDB 38
3.3.2 Approximate Lowest Number Estimator 40
3.4 Putting All Together: Case Studies 43
3.5 Evaluation 45
3.5.1 Configuration 45
3.5.2 Architectural Exploration 47
3.5.3 Effectiveness of the Replacement Policy 48
3.5.4 Sensitivity to Main Table Configuration 49
3.5.5 Sensitivity to Barrier Buffer Size 51
3.5.6 Sensitivity to AppLE Group Size 52
3.5.7 Comparison with Other Studies 54
3.6 Discussion 59
3.7 Summary 63
4 INTEGRATION OF AN RMW MODULE IN A PCM-BASED SYSTEM 64
4.1 Motivation 65
4.2 Utilization of DRAM Cache for RMW 67
4.2.1 Architectural Design 67
4.2.2 Algorithm 70
4.3 Typeless Command Merging 73
4.3.1 Architectural Design 73
4.3.2 Algorithm 74
4.4 An Alternative Implementation: SRC-RMW 78
4.4.1 Implementation of SRC-RMW 78
4.4.2 Design Constraint 80
4.5 Case Study 82
4.6 Evaluation 85
4.6.1 Configuration 85
4.6.2 Speedup 88
4.6.3 Read Reliability 91
4.6.4 Energy Consumption: Selecting a Proper Page Size 93
4.6.5 Comparison with Other Studies 95
4.7 Discussion 97
4.8 Summary 99
5 AN ALL-INCLUSIVE SIMULATOR FOR A PCM CONTROLLER 100
5.1 Motivation 101
5.2 PCMCsim: PCM Controller Simulator 103
5.2.1 Architectural Overview 103
5.2.2 Underlying Classes of PCMCsim 104
5.2.3 Implementation of Contention Behavior 108
5.2.4 Modules of PCMCsim 109
5.3 Evaluation 116
5.3.1 Correctness of the Simulator 116
5.3.2 Comparison with Other Simulators 117
5.4 Summary 119
6 CONCLUSION 120
Abstract (In Korean) 141
Acknowledgment 143
Energy-Aware Data Movement In Non-Volatile Memory Hierarchies
While technology scaling enables increased density for memory cells, the intrinsic high leakage power of conventional CMOS technology and the demand for reduced energy consumption inspire the use of emerging technology alternatives such as eDRAM and Non-Volatile Memory (NVM), including STT-MRAM, PCM, and RRAM. The use of emerging technologies in Last-Level Cache (LLC) designs, which occupy a significant fraction of the total die area in Chip Multi-Processors (CMPs), introduces new dimensions of vulnerability, energy consumption, and performance delivery. To be specific, part of this research focuses on the eDRAM Bit Upset Vulnerability Factor (BUVF) to assess the vulnerable portion of the eDRAM refresh cycle, where the critical charge varies depending on the write voltage, storage, and bit-line capacitance. This dissertation broadens the study of LLC vulnerability assessment by investigating the impact of Process Variations (PV) on narrow resistive sensing margins in high-density NVM arrays, including on-chip caches and primary memory. Large-latency and power-hungry Sense Amplifiers (SAs) have been adopted to combat PV in the past. Herein, a novel approach is proposed to leverage the PV in NVM arrays using a Self-Organized Sub-bank (SOS) design. SOS engages the preferred SA alternative based on the intrinsic as-built behavior of the resistive sensing timing margin to reduce latency and power consumption while maintaining acceptable access time. On the other hand, this dissertation investigates a novel technique to prioritize service to 1) Extensive Read Reused Accessed (ERRA) blocks of the LLC that are silently dropped from higher levels of cache, and 2) the portion of the working set that may exhibit a distant re-reference interval in L2. In particular, we develop a lightweight Multi-level Access History Profiler to efficiently identify ERRA blocks by aggregating LLC block addresses tagged with identical Most Significant Bits into a single entry.
Experimental results indicate that the proposed technique can reduce the L2 read miss ratio by 51.7% on average across PARSEC and SPEC2006 workloads. In addition, this dissertation will broaden and apply advancements in theories of subspace recovery to pioneer computationally-aware in-situ operand reconstruction via the novel Logic In Interconnect (LI2) scheme. LI2 will be developed, validated, and refined both theoretically and experimentally to realize a radically different approach to post-Moore's-Law computing by leveraging the features of low-rank matrices, offering data reconstruction instead of fetching data from main memory to reduce the energy/latency cost per data movement. We propose the LI2 enhancement to attain high performance delivery in the post-Moore's-Law era by equipping a contemporary micro-architecture design with a customized memory controller that orchestrates memory requests, fetching low-rank matrices to a customized Fine-Grain Reconfigurable Accelerator (FGRA) for reconstruction while other memory requests are serviced as before. The goal of LI2 is to conquer the high latency/energy required to traverse main memory arrays on an LLC miss by using in-situ construction of the requested data when dealing with low-rank matrices. Thus, LI2 exchanges a high volume of data transfers for a novel lightweight reconstruction method under specific conditions, using a cross-layer hardware/algorithm approach.
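The core idea of reconstructing operands from low-rank factors instead of fetching the full matrix can be sketched in a few lines. This is a hypothetical illustration of the principle, not the LI2 hardware; the factor shapes and data are invented:

```python
def reconstruct_row(U, V, i):
    """Rebuild row i of a low-rank matrix M = U @ V from its factors.

    Moving the small factors U (n x r) and V (r x m) costs far less than
    moving the full n x m matrix; the requested row is recomputed in situ
    instead of being fetched from main memory.
    """
    r, m = len(V), len(V[0])
    return [sum(U[i][k] * V[k][j] for k in range(r)) for j in range(m)]

# Hypothetical rank-1 example: M[i][j] = (i + 1) * (j + 1).
U = [[1.0], [2.0], [3.0]]       # 3 x 1 factor
V = [[1.0, 2.0, 3.0, 4.0]]      # 1 x 4 factor
row = reconstruct_row(U, V, 2)  # row 2 of the implied 3 x 4 matrix
```

For an n x m matrix of rank r, the factors occupy r(n + m) words versus nm for the dense matrix, which is where the data-movement saving comes from.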
Repurposing Software Defenses with Specialized Hardware
Computer security has largely been the domain of software for the last few decades. Although this approach has been moderately successful during this period, its problems have started becoming more apparent recently because of one primary reason: performance. Software solutions typically exact a significant toll in terms of program slowdown, especially when applied to large, complex software. In the past, when chips became exponentially faster, this growing burden could be accommodated almost for free. But as Moore's law winds down, security-related slowdowns become more apparent and increasingly intolerable, and the defenses that cause them are subsequently abandoned. As a result, the community has started looking elsewhere for continued protection, as attacks continue to become progressively more sophisticated.
One way to mitigate this problem is to complement these defenses in hardware. Despite lacking the semantic perspective of high-level software, specialized hardware typically is not only faster but also more energy-efficient. However, hardware vendors also have to factor in the cost of integrating security solutions from the perspective of effectiveness, longevity, and cost of development, while allaying the customer's concerns about performance. As a result, although numerous hardware solutions have been proposed in the past, the fact that so few of them have actually transitioned into practice implies that they were unable to strike an optimal balance of the above qualities.
This dissertation proposes the thesis that it is possible to add hardware features that complement and improve program security, traditionally provided by software, without requiring extensive modifications to existing hardware microarchitecture. As such, it marries the collective concerns of not only users and software developers, who demand performant but secure products, but also those of hardware vendors, since implementation simplicity directly relates to a reduction in the time and cost of development and deployment. To support this thesis, this dissertation discusses two hardware security features aimed at securing program code and data separately, details their full system implementations, and presents a study of a negative result in which the design was deemed practically infeasible, given its high implementation complexity.
Firstly, the dissertation discusses code protection by reviving instruction set randomization (ISR), an idea originally proposed for countering code injection and considered impractical in the face of modern attack vectors that reuse existing program code (also known as code reuse attacks). With Polyglot, we introduce ISR with strong AES encryption along with basic code randomization that disallows code decryption at runtime, thus countering most forms of state-of-the-art dynamic code reuse attacks, which read the code at runtime prior to building the code reuse payload. Through various optimizations and corner-case workarounds, we show how Polyglot enables code execution with minimal hardware changes while maintaining a small attack surface and incurring nominal overheads, even when the code is strongly encrypted in the binary and in memory.
Next, the dissertation presents REST, a hardware primitive that allows programs to mark memory regions invalid for regular memory accesses. This is achieved simply by storing a large, pre-determined random value at those locations with a special store instruction and then detecting incoming values at the data cache that match the pre-determined value. Subsequently, we show how this primitive can be used to protect data from common forms of spatial and temporal memory safety attacks. Notably, because of the simplicity of the primitive, REST requires trivial microarchitectural modifications and hence is easy to implement, and it exhibits negligible performance overheads. Additionally, we demonstrate how it is able to provide practical heap safety even for legacy binaries.
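The semantics of the primitive can be modeled in a few lines of software. This is a toy model of the described behavior, not the hardware design; the heap layout, token width, and class names are invented for illustration:

```python
import secrets

# A wide, pre-determined random token: an ordinary data word matching a
# 64-byte random value by accident is astronomically unlikely.
REST_TOKEN = secrets.token_bytes(64)

class Heap:
    """Toy model of the REST primitive: token-marked guard regions."""
    def __init__(self, size):
        self.mem = bytearray(size)

    def arm(self, addr):
        """Special store: mark a 64-byte region invalid for normal access."""
        self.mem[addr:addr + 64] = REST_TOKEN

    def load(self, addr):
        """Normal access: compare the incoming value against the token."""
        word = bytes(self.mem[addr:addr + 64])
        if word == REST_TOKEN:
            raise MemoryError(f"REST violation at {addr:#x}")
        return word

heap = Heap(4096)
heap.arm(0)     # guard region below a hypothetical allocation at 64..128
heap.arm(128)   # ..and above it
heap.mem[64:128] = b"x" * 64
data = heap.load(64)   # in-bounds access succeeds
```

An out-of-bounds load that lands in a guard region (e.g. `heap.load(0)`) hits the token and faults, which is how spatial overflows and stale pointers to freed, re-armed regions are caught.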
For the above proposals, we also detail their hardware implementations on FPGAs and discuss how each fits within a complete multiprocess system. This serves to give the reader an idea of usage and deployment challenges on a broader scale, beyond just the technique's effectiveness within the context of a single program.
Lastly, the dissertation discusses an alternative to the virtual address space that randomizes the sequence of addresses in a manner invisible even to the program, thus achieving transparent randomization of the entire address space at a very fine granularity. The biggest challenge is to achieve this with minimal microarchitectural changes while accommodating linear data structures in the program (e.g., arrays, structs), both of which are fundamentally based on a linear address space. As a result, this modified address space subsumes the benefits of most other spatial randomization schemes, with the additional benefit of ideally making traversal from one data structure to another impossible. Our study of this idea concludes that, although valuable, current memory safety techniques are cheaper to implement and secure enough that there are no perceivable use cases for this model of address space safety.
Exploiting task-based programming models for resilience
Hardware errors become more common as silicon technologies shrink and become more vulnerable, especially in memory cells, which are the most exposed to errors. Permanent and intermittent faults are caused by manufacturing variability and circuits ageing. While these can be mitigated once they are identified, their continuous rate of appearance throughout the lifetime of memory devices will always cause unexpected errors. In addition, transient faults are caused by effects such as radiation or small voltage/frequency margins, and there is no efficient way to shield against these events.
Other constraints related to the diminishing sizes of transistors, such as power consumption and memory latency, have caused the microprocessor industry to turn to increasingly complex processor architectures. To solve the difficulties arising from programming such architectures, programming models have emerged that rely on runtime systems. These systems form a new intermediate layer in the hardware-software abstraction stack that performs tasks such as distributing work across computing resources: processor cores, accelerators, etc. These runtime systems have access to a lot of information, both from the hardware and from the applications, and thus offer many possibilities for optimisation.
This thesis proposes solutions to the increasing fault rates in memory, across multiple resilience disciplines, from algorithm-based fault tolerance to hardware error correcting codes, through OS reliability strategies. These solutions rely for their efficiency on the opportunities presented by runtime systems.
The first contribution of this thesis is an algorithm-based resilience technique that tolerates detected errors in memory. It recovers lost data by performing computations that rely on simple redundancy relations identified in the program. The recovery is demonstrated for a family of iterative solvers, the Krylov subspace methods, and evaluated for the conjugate gradient solver. The runtime can transparently overlap the recovery with the computations of the algorithm, masking the already low overheads of this technique.
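One redundancy relation that Krylov solvers maintain by construction is r = b - Ax, so a lost residual vector can be recomputed from the iterate without any checkpoint. A minimal sketch for conjugate gradient, assuming a tiny invented SPD system (this is not the thesis's implementation, only the underlying algebra):

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def cg_step(A, x, r, p):
    """One unpreconditioned conjugate-gradient iteration."""
    Ap = matvec(A, p)
    alpha = sum(ri * ri for ri in r) / sum(pi * api for pi, api in zip(p, Ap))
    x = [xi + alpha * pi for xi, pi in zip(x, p)]
    r_new = [ri - alpha * api for ri, api in zip(r, Ap)]
    beta = sum(ri * ri for ri in r_new) / sum(ri * ri for ri in r)
    p = [ri + beta * pi for ri, pi in zip(r_new, p)]
    return x, r_new, p

def recover_residual(A, b, x):
    """Redundancy relation r = b - A x: rebuild a lost residual vector."""
    return [bi - axi for bi, axi in zip(b, matvec(A, x))]

A = [[4.0, 1.0], [1.0, 3.0]]            # small SPD system (invented data)
b = [1.0, 2.0]
x, r, p = [0.0, 0.0], list(b), list(b)  # x0 = 0, so r0 = p0 = b
x, r, p = cg_step(A, x, r, p)
recovered = recover_residual(A, b, x)   # equals r, without any checkpoint
```

Because the recovery is an independent matrix-vector product, a runtime system can schedule it concurrently with the solver's own computations, which is what makes the overhead maskable.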
The second part of this thesis proposes a metric to characterise the impact of faults in memory, which outperforms state-of-the-art metrics in precision and in the assurances it provides on the error rate. This metric reveals a key insight: some data in memory is not relevant to the program. We propose an OS-level strategy that ignores errors in such data by delaying the reporting of detected errors, which reduces the failure rates of running programs by discarding errors that have no impact.
The architecture-level contribution of this thesis is a dynamically adaptable Error Correcting Code (ECC) scheme that can increase the protection of memory regions where the impact of errors is highest. A methodology is presented to estimate the fault rate at runtime using our metric, through the performance monitoring tools of current commodity processors. Guiding the dynamic ECC scheme online with the methodology's vulnerability estimates decreases the error rates of programs at a fraction of the redundancy cost required for a uniformly stronger ECC. This provides a useful and wide range of trade-offs between redundancy and error rate.
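The selection step of such a scheme can be sketched as a mapping from per-region vulnerability estimates to ECC strengths. The ECC levels, thresholds, overheads, and region names below are invented for illustration and are not the thesis's actual parameters:

```python
# Hypothetical ECC levels: (name, correctable bits, redundancy overhead).
ECC_LEVELS = [("parity", 0, 0.015), ("SECDED", 1, 0.125), ("DECTED", 2, 0.22)]

def pick_ecc(vulnerability, thresholds=(0.2, 0.6)):
    """Map a per-region vulnerability estimate in [0, 1] to an ECC level."""
    lo, hi = thresholds
    if vulnerability < lo:
        return ECC_LEVELS[0]
    if vulnerability < hi:
        return ECC_LEVELS[1]
    return ECC_LEVELS[2]

# Invented vulnerability estimates, e.g. derived from performance counters.
regions = {"stack": 0.7, "dead_buffer": 0.05, "matrix": 0.4}
plan = {name: pick_ecc(v) for name, v in regions.items()}

avg_cost = sum(level[2] for level in plan.values()) / len(plan)
uniform_cost = ECC_LEVELS[2][2]  # protecting everything at the strongest level
```

Only the high-vulnerability region pays for the strongest code, so the average redundancy cost stays well below that of a uniformly strong ECC while critical data keeps the extra protection.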
The work presented in this thesis demonstrates that runtime systems make it possible to exploit the redundancy stored in memory to help tackle increasing error rates in DRAM. This redundancy can be an inherent part of an algorithm that tolerates higher fault rates, or take the form of dead data stored in memory. Redundancy can also be added to a program, in the form of ECC. In all cases, the runtime decreases failure rates efficiently, by diminishing recovery costs, identifying redundant data, or targeting critical data. It is thus a very valuable tool for future computing systems, as it can perform optimisations across different layers of abstraction.
A Modern Primer on Processing in Memory
Modern computing systems are overwhelmingly designed to move data to
computation. This design choice goes directly against at least three key trends
in computing that cause performance, scalability and energy bottlenecks: (1)
data access is a key bottleneck as many important applications are increasingly
data-intensive, and memory bandwidth and energy do not scale well, (2) energy
consumption is a key limiter in almost all computing platforms, especially
server and mobile systems, (3) data movement, especially off-chip to on-chip,
is very expensive in terms of bandwidth, energy and latency, much more so than
computation. These trends are felt especially severely in the data-intensive
server and energy-constrained mobile systems of today. At the same time,
conventional memory technology is facing many technology scaling challenges in
terms of reliability, energy, and performance. As a result, memory system
architects are open to organizing memory in different ways and making it more
intelligent, at the expense of higher cost. The emergence of 3D-stacked memory
plus logic, the adoption of error correcting codes inside the latest DRAM
chips, proliferation of different main memory standards and chips, specialized
for different purposes (e.g., graphics, low-power, high bandwidth, low
latency), and the necessity of designing new solutions to serious reliability
and security issues, such as the RowHammer phenomenon, are evidence of this
trend. This chapter discusses recent research that aims to practically enable
computation close to data, an approach we call processing-in-memory (PIM). PIM
places computation mechanisms in or near where the data is stored (i.e., inside
the memory chips, in the logic layer of 3D-stacked memory, or in the memory
controllers), so that data movement between the computation units and memory is
reduced or eliminated.
A Statistical View of Architecture Design
Computer architectures are becoming more and more complicated to meet the continuously increasing demands on performance, security, and sustainability from applications. Many factors exist in the design and engineering space of the various components and policies in these architectures, and it is not intuitive how these factors interact with each other and how they impact architecture behaviors. Automatically seeking the best architectures for specific applications and requirements is even more challenging. Meanwhile, architecture design needs to deal with more and more non-determinism from lower-level technologies. Emerging technologies exhibit statistical properties inherently, such as the wearout phenomenon in NEMs, PCM, ReRAM, etc. Due to manufacturing and processing variations, there also exists variability among different devices or within the same device (e.g., different cells on the same memory chip). Hence, to better understand and control architecture behaviors, we introduce a statistical perspective on architecture design: by specifying the architectural design goals and the desired statistical properties, we guide the architecture design with these statistical properties and exploit a series of techniques to achieve them.
In the first part of the thesis, we introduce Herniated Hash Tables. Our architectural design goal is a hash table implementation that is highly scalable in both storage efficiency and performance, while the desired statistical property is to achieve storage efficiency and performance as good as under uniform distributions, given non-uniform distributions across hash buckets. Herniated Hash Tables exploit multi-level phase-change memory (PCM) to expand storage in place for each hash bucket to accommodate asymmetrically chained entries. The organization, coupled with an addressing and prefetching scheme, also improves performance significantly by creating more memory parallelism.
In the second part of the thesis, we introduce Lemonade from Lemons, harnessing device wearout to create limited-use security architectures. The architectural design goal is to create hardware security architectures that resist attacks by statistically enforcing an upper bound on hardware uses, and consequently on attacks. The desired statistical property is that the system-level minimum and maximum uses can be guaranteed with high probability despite device-level variability. We introduce techniques for architecturally controlling these bounds and explore the cost in area, energy, and latency of using these techniques to achieve system-level usage targets given device-level wearout distributions.
In the third part of the thesis, we demonstrate Memory Cocktail Therapy: a general, learning-based framework to optimize dynamic tradeoffs in NVMs. Limited write endurance and long latencies remain the primary challenges of building practical memory systems from NVMs. Researchers have proposed a variety of architectural techniques to achieve different tradeoffs between lifetime, performance, and energy efficiency; however, no individual technique can satisfy the requirements of all applications and objectives. Our architectural design goal is for NVM systems to achieve optimal tradeoffs for specific applications and objectives, and the statistical goal is that the selected NVM configuration is nearly optimal. Memory Cocktail Therapy uses machine learning techniques to model architecture behavior in terms of all the configurable parameters, based on a small number of sample configurations. It then selects the optimal configuration according to user-defined objectives, which leads to the desired tradeoff between performance, lifetime, and energy efficiency.
A Scalable Flash-Based Hardware Architecture for the Hierarchical Temporal Memory Spatial Pooler
Hierarchical temporal memory (HTM) is a biomimetic machine learning algorithm focused upon modeling the structural and algorithmic properties of the neocortex. It is comprised of two components, realizing pattern recognition of spatial and temporal data, respectively. HTM research has gained momentum in recent years, leading to both hardware and software exploration of its algorithmic formulation. Previous work on HTM has centered on addressing performance concerns; however, the memory-bound operation of HTM presents significant challenges to scalability.
In this work, a scalable flash-based storage processor unit, Flash-HTM (FHTM), is presented along with a detailed analysis of its potential scalability. FHTM leverages SSD flash technology to implement the spatial pooler of the HTM cortical learning algorithm. The ability of FHTM to scale with increasing model complexity is addressed with respect to design footprint, memory organization, and power efficiency. Additionally, a mathematical model of the hardware is evaluated against the MNIST dataset, yielding 91.98% classification accuracy. A fully custom layout is developed to validate the design in a TSMC 180 nm process. The area and power footprints of the spatial pooler are 30.538 mm² and 5.171 mW, respectively. Storage processor units have the potential to be viable platforms for implementations of HTM at scale.
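The spatial pooler's core computation, overlap scoring followed by k-winner inhibition, can be sketched in software. This is a simplified model of the published algorithm, not the FHTM hardware; the column pools, boosts, and input are invented:

```python
def spatial_pooler(input_bits, potential, boosts, k=2):
    """One step of an HTM spatial pooler: overlap scoring plus k-winner inhibition.

    Each column counts how many active input bits fall inside its potential
    pool (its overlap); only the k columns with the highest boosted overlap
    become active, yielding a sparse distributed representation.
    """
    overlaps = [boost * len(pool & input_bits)
                for pool, boost in zip(potential, boosts)]
    ranked = sorted(range(len(potential)),
                    key=lambda c: overlaps[c], reverse=True)
    return sorted(ranked[:k])

# Hypothetical 4-column pooler over an 8-bit input space.
potential = [{0, 1, 2}, {2, 3, 4}, {4, 5, 6}, {1, 6, 7}]
boosts = [1.0, 1.0, 1.0, 1.0]
active = spatial_pooler({1, 2, 3}, potential, boosts)
```

The memory-bound part that FHTM targets is exactly this step: the per-column pools and permanences dominate storage, so streaming them from flash to a storage-side processor avoids moving the whole model to the host.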