Memory System Design Techniques Based on New Memory Technologies
Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2017. Advisor: Kiyoung Choi.

Performance and energy efficiency of modern computer systems are largely dominated by the memory system. This memory bottleneck has been exacerbated in the past few years by (1) architectural innovations for improving the efficiency of computation units (e.g., chip multiprocessors), which shift the major cause of inefficiency from processors to memory, and (2) the emergence of data-intensive applications, which demand a large capacity of main memory and an excessive amount of memory bandwidth to be handled efficiently. In order to address this memory wall challenge, this dissertation aims at exploring the potential of emerging memory technologies and designing a high-performance, energy-efficient memory hierarchy that is aware of and leverages the characteristics of such new memory technologies.
The first part of this dissertation focuses on energy-efficient on-chip cache design based on a new non-volatile memory technology called Spin-Transfer Torque RAM (STT-RAM). When STT-RAM is used to build on-chip caches, it provides several advantages over conventional charge-based memory (e.g., SRAM or eDRAM), such as non-volatility, lower static power, and higher density. However, simply replacing SRAM caches with STT-RAM actually increases energy consumption, because write operations of STT-RAM are slower and more energy-consuming than those of SRAM.
To address this challenge, we propose four novel architectural techniques that can alleviate the impact of inefficient STT-RAM write operations on system performance and energy consumption. First, we apply STT-RAM to instruction caches (where write operations are relatively infrequent) and devise a power-gating mechanism called LASIC, which leverages the non-volatility of STT-RAM to turn off STT-RAM instruction caches inside small loops. Second, we propose lower-bits cache, which exploits the narrow bit-width characteristics of application data by caching frequent bit-flips at lower bits in a small SRAM cache. Third, we present prediction hybrid cache, an SRAM/STT-RAM hybrid cache whose block placement between SRAM and STT-RAM is determined by predicting the write intensity of each cache block with a new hardware structure called write intensity predictor. Fourth, we propose DASCA, which predicts write operations that can bypass the cache without incurring extra cache misses (called dead writes) and lets the last-level cache bypass such dead writes to reduce write energy consumption.
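The third technique above hinges on predicting, per cache block, whether it will receive many writes. A minimal sketch of that idea, assuming a PC-indexed table of saturating counters (the table size, counter width, and threshold below are illustrative placeholders, not the dissertation's actual parameters):

```python
# Hedged sketch of PC-indexed write-intensity prediction for an
# SRAM/STT-RAM hybrid cache. All names and constants are illustrative.

class WriteIntensityPredictor:
    """Predicts whether a block will be write-intensive by counting
    writes triggered by each memory instruction (identified by its PC)."""

    def __init__(self, threshold=4, table_size=256):
        self.threshold = threshold
        self.table_size = table_size
        self.write_counts = [0] * table_size  # 4-bit saturating counters

    def _index(self, pc):
        # Hash the PC into the counter table.
        return pc % self.table_size

    def record_write(self, pc):
        i = self._index(pc)
        if self.write_counts[i] < 15:  # saturate at 4-bit maximum
            self.write_counts[i] += 1

    def record_eviction(self, pc, was_write_intensive):
        # Decay the counter when a block from this PC turned out not to
        # be write-intensive, so stale predictions fade out over time.
        i = self._index(pc)
        if not was_write_intensive and self.write_counts[i] > 0:
            self.write_counts[i] -= 1

    def place_in_sram(self, pc):
        # Blocks predicted write-intensive are placed in SRAM;
        # the rest go to the denser, read-friendly STT-RAM ways.
        return self.write_counts[self._index(pc)] >= self.threshold
```

On a cache fill, the controller would consult `place_in_sram` with the PC of the triggering instruction; the real design additionally uses dynamic set sampling and threshold adjustment (Sections 5.2.3 and 5.3.4).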
The second part of this dissertation architects intelligent main memory and its host architecture support based on logic-enabled DRAM. Traditionally, main memory has served the sole purpose of storing data because the extra manufacturing cost of implementing rich functionality (e.g., computation) on a DRAM die was unacceptably high. However, the advent of 3D die stacking now provides a practical, cost-effective way to integrate complex logic circuits into main memory, thereby opening up the possibilities for intelligent main memory. For example, it can be utilized to implement advanced memory management features (e.g., scheduling, power management, etc.) inside memory; it can also be used to offload computation to main memory, which allows us to overcome the memory bandwidth bottleneck caused by narrow off-chip channels (commonly known as processing-in-memory or PIM). The remaining questions are what to implement inside main memory and how to integrate and expose such new features to existing systems.
In order to answer these questions, we propose four system designs that utilize logic-enabled DRAM to improve system performance and energy efficiency. First, we utilize the existing logic layer of a Hybrid Memory Cube (a commercial logic-enabled DRAM product) to (1) dynamically turn off some of its off-chip links by monitoring the actual bandwidth demand and (2) integrate a prefetch buffer into main memory to perform aggressive prefetching without consuming off-chip link bandwidth. Second, we propose a scalable accelerator for large-scale graph processing called Tesseract, in which graph processing computation is offloaded to specialized processors inside main memory in order to achieve memory-capacity-proportional performance. Third, we design a low-overhead PIM architecture for near-term adoption called PIM-enabled instructions, where PIM operations are interfaced as cache-coherent, virtually-addressed host processor instructions that can be executed either by the host processor or in main memory depending on the data locality. Fourth, we propose an energy-efficient PIM system called aggregation-in-memory, which can adaptively execute PIM operations at any level of the memory hierarchy and provides a fully automated compiler toolchain that transforms existing applications to use PIM operations without programmer intervention.

Chapter 1 Introduction 1
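The locality-dependent dispatch of PIM-enabled instructions can be sketched in a few lines: run the operation on the host when its target cache line is likely resident on-chip, and in memory otherwise. The tag-array tracker and all names below are hypothetical simplifications, not the dissertation's actual PMU design:

```python
# Hedged sketch of locality-aware dispatch for PIM-enabled instructions.
# A real implementation tracks locality in hardware next to the cache.

class LocalityMonitor:
    """Tag-only table approximating whether a cache line is
    currently resident in the host's on-chip caches."""

    def __init__(self, num_entries=1024):
        self.num_entries = num_entries
        self.tags = [None] * num_entries

    def _slot(self, line_addr):
        return line_addr % self.num_entries

    def touch(self, line_addr):
        # Called on host cache accesses to record recent residency.
        self.tags[self._slot(line_addr)] = line_addr

    def likely_cached(self, line_addr):
        return self.tags[self._slot(line_addr)] == line_addr

def dispatch_pim_op(monitor, line_addr):
    """Decide where a PIM operation executes: the host wins when the
    data is hot on-chip, memory wins when it would cost off-chip
    bandwidth to bring the data in."""
    if monitor.likely_cached(line_addr):
        return "host"    # avoid flushing hot data out to memory
    return "memory"      # cold data: execute near the DRAM instead
```

Because both execution paths observe the same cache-coherent, virtually-addressed interface, the choice is transparent to software.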
1.1 Inefficiencies in the Current Memory Systems 2
1.1.1 On-Chip Caches 2
1.1.2 Main Memory 2
1.2 New Memory Technologies: Opportunities and Challenges 3
1.2.1 Energy-Efficient On-Chip Caches based on STT-RAM 3
1.2.2 Intelligent Main Memory based on Logic-Enabled DRAM 6
1.3 Dissertation Overview 9
Chapter 2 Previous Work 11
2.1 Energy-Efficient On-Chip Caches based on STT-RAM 11
2.1.1 Hybrid Caches 11
2.1.2 Volatile STT-RAM 13
2.1.3 Redundant Write Elimination 14
2.2 Intelligent Main Memory based on Logic-Enabled DRAM 15
2.2.1 PIM Architectures in the 1990s 15
2.2.2 Modern PIM Architectures based on 3D Stacking 15
2.2.3 Modern PIM Architectures on Memory Dies 17
Chapter 3 Loop-Aware Sleepy Instruction Cache 19
3.1 Architecture 20
3.1.1 Loop Cache 21
3.1.2 Loop-Aware Sleep Controller 22
3.2 Evaluation and Discussion 24
3.2.1 Simulation Environment 24
3.2.2 Energy 25
3.2.3 Performance 27
3.2.4 Sensitivity Analysis 27
3.3 Summary 28
Chapter 4 Lower-Bits Cache 29
4.1 Architecture 29
4.2 Experiments 32
4.2.1 Simulator and Cache Model 32
4.2.2 Results 33
4.3 Summary 34
Chapter 5 Prediction Hybrid Cache 35
5.1 Problem and Motivation 37
5.1.1 Problem Definition 37
5.1.2 Motivation 37
5.2 Write Intensity Predictor 38
5.2.1 Keeping Track of Trigger Instructions 39
5.2.2 Identifying Hot Trigger Instructions 40
5.2.3 Dynamic Set Sampling 41
5.2.4 Summary 42
5.3 Prediction Hybrid Cache 43
5.3.1 Need for Write Intensity Prediction 43
5.3.2 Organization 43
5.3.3 Operations 44
5.3.4 Dynamic Threshold Adjustment 45
5.4 Evaluation Methodology 48
5.4.1 Simulator Configuration 48
5.4.2 Workloads 50
5.5 Single-Core Evaluations 51
5.5.1 Energy Consumption and Speedup 51
5.5.2 Energy Breakdown 53
5.5.3 Coverage and Accuracy 54
5.5.4 Sensitivity to Write Intensity Threshold 55
5.5.5 Impact of Dynamic Set Sampling 55
5.5.6 Results for Non-Write-Intensive Workloads 56
5.6 Multicore Evaluations 57
5.7 Summary 59
Chapter 6 Dead Write Prediction Assisted STT-RAM Cache 61
6.1 Motivation 62
6.1.1 Energy Impact of Inefficient Write Operations 62
6.1.2 Limitations of Existing Approaches 63
6.1.3 Potential of Dead Writes 64
6.2 Dead Write Classification 65
6.2.1 Dead-on-Arrival Fills 65
6.2.2 Dead-Value Fills 66
6.2.3 Closing Writes 66
6.2.4 Decomposition 67
6.3 Dead Write Prediction Assisted STT-RAM Cache Architecture 68
6.3.1 Dead Write Prediction 68
6.3.2 Bidirectional Bypass 71
6.4 Evaluation Methodology 72
6.4.1 Simulation Configuration 72
6.4.2 Workloads 74
6.5 Evaluation for Single-Core Systems 75
6.5.1 Energy Consumption and Speedup 75
6.5.2 Coverage and Accuracy 78
6.5.3 Sensitivity to Signature 78
6.5.4 Sensitivity to Update Policy 80
6.5.5 Implications of Device-/Circuit-Level Techniques for Write Energy Reduction 80
6.5.6 Impact of Prefetching 80
6.6 Evaluation for Multi-Core Systems 81
6.6.1 Energy Consumption and Speedup 81
6.6.2 Application to Inclusive Caches 83
6.6.3 Application to Three-Level Cache Hierarchy 84
6.7 Summary 85
Chapter 7 Link Power Management for Hybrid Memory Cubes 87
7.1 Background and Motivation 88
7.1.1 Hybrid Memory Cube 88
7.1.2 Motivation 89
7.2 HMC Link Power Management 91
7.2.1 Link Delay Monitor 91
7.2.2 Power State Transition 94
7.2.3 Overhead 95
7.3 Two-Level Prefetching 95
7.4 Application to Multi-HMC Systems 97
7.5 Experiments 98
7.5.1 Methodology 98
7.5.2 Link Energy Consumption and Speedup 100
7.5.3 HMC Energy Consumption 102
7.5.4 Runtime Behavior of LPM 102
7.5.5 Sensitivity to Slowdown Threshold 104
7.5.6 LPM without Prefetching 104
7.5.7 Impact of Prefetching on Link Traffic 105
7.5.8 On-Chip Prefetcher Aggressiveness in 2LP 107
7.5.9 Tighter Off-Chip Bandwidth Margin 107
7.5.10 Multithreaded Workloads 108
7.5.11 Multi-HMC Systems 109
7.6 Summary 111
Chapter 8 Tesseract PIM System for Parallel Graph Processing 113
8.1 Background and Motivation 115
8.1.1 Large-Scale Graph Processing 115
8.1.2 Graph Processing on Conventional Systems 117
8.1.3 Processing-in-Memory 118
8.2 Tesseract Architecture 119
8.2.1 Overview 119
8.2.2 Remote Function Call via Message Passing 122
8.2.3 Prefetching 124
8.2.4 Programming Interface 126
8.2.5 Application Mapping 127
8.3 Evaluation Methodology 128
8.3.1 Simulation Configuration 128
8.3.2 Workloads 129
8.4 Evaluation Results 130
8.4.1 Performance 130
8.4.2 Iso-Bandwidth Comparison 133
8.4.3 Execution Time Breakdown 134
8.4.4 Prefetch Efficiency 134
8.4.5 Scalability 135
8.4.6 Effect of Higher Off-Chip Network Bandwidth 136
8.4.7 Effect of Better Graph Distribution 137
8.4.8 Energy/Power Consumption and Thermal Analysis 138
8.5 Summary 139
Chapter 9 PIM-Enabled Instructions 141
9.1 Potential of ISA Extensions as the PIM Interface 143
9.2 PIM Abstraction 145
9.2.1 Operations 145
9.2.2 Memory Model 147
9.2.3 Software Modification 148
9.3 Architecture 148
9.3.1 Overview 148
9.3.2 PEI Computation Unit (PCU) 149
9.3.3 PEI Management Unit (PMU) 150
9.3.4 Virtual Memory Support 153
9.3.5 PEI Execution 153
9.3.6 Comparison with Active Memory Operations 154
9.4 Target Applications for Case Study 155
9.4.1 Large-Scale Graph Processing 155
9.4.2 In-Memory Data Analytics 156
9.4.3 Machine Learning and Data Mining 157
9.4.4 Operation Summary 157
9.5 Evaluation Methodology 158
9.5.1 Simulation Configuration 158
9.5.2 Workloads 159
9.6 Evaluation Results 159
9.6.1 Performance 160
9.6.2 Sensitivity to Input Size 163
9.6.3 Multiprogrammed Workloads 164
9.6.4 Balanced Dispatch: Idea and Evaluation 165
9.6.5 Design Space Exploration for PCUs 165
9.6.6 Performance Overhead of the PMU 167
9.6.7 Energy, Area, and Thermal Issues 167
9.7 Summary 168
Chapter 10 Aggregation-in-Memory 171
10.1 Motivation 173
10.1.1 Rethinking PIM for Energy Efficiency 173
10.1.2 Aggregation as PIM Operations 174
10.2 Architecture 176
10.2.1 Overview 176
10.2.2 Programming Model 177
10.2.3 On-Chip Caches 177
10.2.4 Coherence and Consistency 181
10.2.5 Main Memory 181
10.2.6 Potential Generalization Opportunities 183
10.3 Compiler Support 184
10.4 Contributions over Prior Art 185
10.4.1 PIM-Enabled Instructions 185
10.4.2 Parallel Reduction in Caches 187
10.4.3 Row Buffer Locality of DRAM Writes 188
10.5 Target Applications 188
10.6 Evaluation Methodology 190
10.6.1 Simulation Configuration 190
10.6.2 Hardware Overhead 191
10.6.3 Workloads 192
10.7 Evaluation Results 192
10.7.1 Energy Consumption and Performance 192
10.7.2 Dynamic Energy Breakdown 196
10.7.3 Comparison with Aggressive Writeback 197
10.7.4 Multiprogrammed Workloads 198
10.7.5 Comparison with Intrinsic-based Code 198
10.8 Summary 199
Chapter 11 Conclusion 201
11.1 Energy-Efficient On-Chip Caches based on STT-RAM 202
11.2 Intelligent Main Memory based on Logic-Enabled DRAM 203
Bibliography 205
요약 (Abstract in Korean) 227
Recommended from our members
A Statistical View of Architecture Design
Computer architectures are becoming more and more complicated to meet the continuously increasing demands on performance, security and sustainability from applications. Many factors exist in the design and engineering space of various components and policies in the architectures, and it is not intuitive how these factors interact with each other and how they impact architecture behaviors. Automatically seeking the best architectures for specific applications and requirements is even more challenging. Meanwhile, the architecture design needs to deal with more and more non-determinism from lower-level technologies. Emerging technologies inherently exhibit statistical properties, such as the wearout phenomenon in NEMs, PCM, ReRAM, etc. Due to manufacturing and processing variations, there also exists variability among different devices or within the same device (e.g., different cells on the same memory chip). Hence, to better understand and control the architecture behaviors, we introduce the statistical perspective of architecture design: by specifying the architectural design goals and the desired statistical properties, we guide the architecture design with these statistical properties and exploit a series of techniques to achieve them.

In the first part of the thesis, we introduce Herniated Hash Tables. Our architectural design goal is a hash table implementation that is highly scalable in both storage efficiency and performance, while the desired statistical property is to achieve storage efficiency and performance as good as under uniform distributions, given non-uniform distributions across hash buckets. Herniated Hash Tables exploit multi-level phase change memory (PCM) to expand storage in place for each hash bucket to accommodate asymmetrically chained entries.
The organization, coupled with an addressing and prefetching scheme, also improves performance significantly by creating more memory parallelism.

In the second part of the thesis, we introduce Lemonade from Lemons, harnessing device wearout to create limited-use security architectures. The architectural design goal is to create hardware security architectures that resist attacks by statistically enforcing an upper bound on hardware uses, and consequently on attacks. The desired statistical property is that the system-level minimum and maximum uses can be guaranteed with high probability despite device-level variability. We introduce techniques for architecturally controlling these bounds and explore the cost in area, energy and latency of using these techniques to achieve system-level usage targets given device-level wearout distributions.

In the third part of the thesis, we demonstrate Memory Cocktail Therapy: a general, learning-based framework to optimize dynamic tradeoffs in NVMs. Limited write endurance and long latencies remain the primary challenges of building practical memory systems from NVMs. Researchers have proposed a variety of architectural techniques to achieve different tradeoffs between lifetime, performance and energy efficiency; however, no individual technique can satisfy the requirements of all applications and objectives. Our architectural design goal is that NVM systems achieve optimal tradeoffs for specific applications and objectives, and the statistical goal is that the selected NVM configuration is nearly optimal. Memory Cocktail Therapy uses machine learning techniques to model the architecture behaviors in terms of all the configurable parameters based on a small number of sample configurations. Then, it selects the optimal configuration according to user-defined objectives, which leads to the desired tradeoff between performance, lifetime and energy efficiency.
A Scalable and Adaptive Network on Chip for Many-Core Architectures
In this work, a scalable network on chip (NoC) for future many-core architectures is proposed and investigated. It supports different QoS mechanisms to ensure predictable communication. Self-optimization is introduced to adapt the energy footprint and the performance of the network to the communication requirements. A fault tolerance concept makes it possible to deal with permanent errors. Moreover, a template-based automated evaluation and design methodology and a synthesis flow for NoCs are introduced.
SCV-GNN: Sparse Compressed Vector-based Graph Neural Network Aggregation
Graph neural networks (GNNs) have emerged as a powerful tool to process graph-based data in fields like communication networks, molecular interactions, chemistry, social networks, and neuroscience. GNNs are characterized by the ultra-sparse nature of their adjacency matrix, which necessitates the development of dedicated hardware beyond general-purpose sparse matrix multipliers. While there has been extensive research on designing dedicated hardware accelerators for GNNs, few have extensively explored the impact of the sparse storage format on the efficiency of GNN accelerators. This paper proposes SCV-GNN with the novel sparse compressed vectors (SCV) format optimized for the aggregation operation. We use Z-Morton ordering to derive a data-locality-based computation ordering and partitioning scheme. The paper also presents how the proposed SCV-GNN is scalable on a vector processing system. Experimental results over various datasets show that the proposed method achieves a geometric mean speedup of and over CSC and CSR aggregation operations, respectively. The proposed method also reduces the memory traffic by a factor of and over compressed sparse column (CSC) and compressed sparse row (CSR), respectively. Thus, the proposed novel aggregation format reduces the latency and memory access for GNN inference.
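The Z-Morton ordering mentioned in the abstract interleaves the bits of a (row, column) coordinate so that spatially close nonzeros receive nearby sort keys. A minimal sketch of that encoding (the bit width and the coordinate list are illustrative, not taken from the paper):

```python
# Hedged sketch of Z-Morton (Z-order) key generation for ordering the
# nonzeros of a sparse matrix with good data locality.

def morton_key(row: int, col: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of row and col: column bits land
    in the even key positions, row bits in the odd positions."""
    key = 0
    for i in range(bits):
        key |= ((col >> i) & 1) << (2 * i)
        key |= ((row >> i) & 1) << (2 * i + 1)
    return key

# Sorting coordinates by Morton key groups spatially close entries
# together, which improves cache behavior during aggregation.
coords = [(2, 2), (0, 1), (1, 1), (0, 0), (1, 0)]
ordered = sorted(coords, key=lambda rc: morton_key(*rc))
```

After sorting, `ordered` walks the 2x2 block at the origin before touching the distant entry (2, 2), illustrating the locality-based traversal the SCV format exploits.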
A Survey of Techniques for Architecting TLBs
“Translation lookaside buffer” (TLB) caches virtual to physical address translation information and is used
in systems ranging from embedded devices to high-end servers. Since TLB is accessed very frequently
and a TLB miss is extremely costly, prudent management of TLB is important for improving performance
and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and
managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and
distinctions. We believe that this paper will be useful for chip designers, computer architects and system
engineers
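To make the translation path concrete, here is a toy model of a fully-associative TLB with LRU replacement; the page size, entry count, and field names are illustrative assumptions, not drawn from the survey:

```python
# Hedged toy model of a fully-associative, LRU-replaced TLB.
from collections import OrderedDict

PAGE_SHIFT = 12  # assume 4 KiB pages

class TLB:
    def __init__(self, entries=64):
        self.entries = entries
        self.map = OrderedDict()  # vpn -> pfn, ordered by recency

    def translate(self, vaddr):
        """Return the physical address on a hit, or None on a miss
        (a real system would then walk the page table and refill)."""
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn in self.map:
            self.map.move_to_end(vpn)        # mark most recently used
            return (self.map[vpn] << PAGE_SHIFT) | offset
        return None

    def insert(self, vpn, pfn):
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)     # evict least recently used
        self.map[vpn] = pfn
```

The survey's techniques largely revolve around making `translate` hit more often (prefetching, larger reach) and making the miss path cheaper.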
Effective data parallel computing on multicore processors
The rise of chip multiprocessing, or the integration of multiple general-purpose processing cores on a single chip (multicores), has impacted all computing platforms including high performance, servers, desktops, mobile, and embedded processors. Programmers can no longer expect continued increases in software performance without developing parallel, memory-hierarchy-friendly software that can effectively exploit the chip-level multiprocessing paradigm of multicores. The goal of this dissertation is to demonstrate a design process for data parallel problems that starts with a sequential algorithm and ends with a high-performance implementation on a multicore platform. Our design process combines theoretical algorithm analysis with practical optimization techniques. Our target multicores are quad-core processors from Intel and the eight-SPE IBM Cell B.E. Target applications include Matrix Multiplication (MM), Finite Difference Time Domain (FDTD), LU Decomposition (LUD), and Power Flow Solver based on Gauss-Seidel (PFS-GS) algorithms. These applications are popular computation methods in science and engineering problems and are characterized by unit-stride (MM, LUD, and PFS-GS) or 2-point stencil (FDTD) memory access patterns. The main contributions of this dissertation include a cache- and space-efficient algorithm model, integrated data prefetching and caching strategies, and in-core optimization techniques. Our multicore-efficient implementations of the above-described applications outperform naïve parallel implementations by at least 2x and scale well with problem size and with the number of processing cores.
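Cache blocking (tiling) is the canonical example of the memory-hierarchy-friendly optimizations this design process targets for unit-stride kernels like MM. A minimal sketch, with the block size as a placeholder to be tuned per cache level (this is a generic illustration, not the dissertation's actual implementation):

```python
# Hedged sketch of cache-blocked matrix multiplication: computing C = A @ B
# tile by tile so each sub-block of A and B is reused while it is cached.

def blocked_matmul(A, B, n, block=32):
    """Multiply two n x n matrices stored as lists of lists."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Work on one (block x block) tile of C at a time.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a = A[i][k]  # hoisted: reused across the j loop
                        for j in range(jj, min(jj + block, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The loop order keeps the innermost accesses unit-stride over `B` and `C`, which is precisely the access pattern the abstract highlights for MM, LUD, and PFS-GS.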