209 research outputs found

    Doctor of Philosophy in Computing

    Get PDF
    dissertatio

    Doctor of Philosophy

    Get PDF
    dissertationThe internet-based information infrastructure that has powered the growth of modern personal/mobile computing is composed of powerful, warehouse-scale computers or datacenters. These heavily subscribed datacenters perform data-processing jobs under intense quality of service guarantees. Further, high-performance compute platforms are being used to model and analyze increasingly complex scientific problems and natural phenomena. To ensure that the high-performance needs of these machines are met, it is necessary to increase the efficiency of the memory system that supplies data to the processing cores. Many of the microarchitectural innovations that were designed to scale the memory wall (e.g., out-of-order instruction execution, on-chip caches) are being rendered less effective due to several emerging trends (e.g., increased emphasis on energy consumption, limited access locality). This motivates the optimization of the main memory system itself. The key to an efficient main memory system is the memory controller. In particular, the scheduling algorithm in the memory controller greatly influences its performance. This dissertation explores this hypothesis in several contexts. It develops tools to better understand memory scheduling and develops scheduling innovations for CPUs and GPUs. We propose novel memory scheduling techniques that are strongly aware of the access patterns of the clients as well as the microarchitecture of the memory device. Based on these, we present (i) a Dynamic Random Access Memory (DRAM) chip microarchitecture optimized for reducing write-induced slowdown, (ii) a memory scheduling algorithm that exploits these features, (iii) several memory scheduling algorithms to reduce the memory-related stall experienced by irregular General Purpose Graphics Processing Unit (GPGPU) applications, and (iv) the Utah Simulated Memory Module (USIMM), a detailed, validated simulator for DRAM main memory that we use for analyzing and proposing scheduler algorithms

    Doctor of Philosophy

    Get PDF
    dissertationThe computing landscape is undergoing a major change, primarily enabled by ubiquitous wireless networks and the rapid increase in the use of mobile devices which access a web-based information infrastructure. It is expected that most intensive computing may either happen in servers housed in large datacenters (warehouse- scale computers), e.g., cloud computing and other web services, or in many-core high-performance computing (HPC) platforms in scientific labs. It is clear that the primary challenge to scaling such computing systems into the exascale realm is the efficient supply of large amounts of data to hundreds or thousands of compute cores, i.e., building an efficient memory system. Main memory systems are at an inflection point, due to the convergence of several major application and technology trends. Examples include the increasing importance of energy consumption, reduced access stream locality, increasing failure rates, limited pin counts, increasing heterogeneity and complexity, and the diminished importance of cost-per-bit. In light of these trends, the memory system requires a major overhaul. The key to architecting the next generation of memory systems is a combination of the prudent incorporation of novel technologies, and a fundamental rethinking of certain conventional design decisions. In this dissertation, we study every major element of the memory system - the memory chip, the processor-memory channel, the memory access mechanism, and memory reliability, and identify the key bottlenecks to efficiency. Based on this, we propose a novel main memory system with the following innovative features: (i) overfetch-aware re-organized chips, (ii) low-cost silicon photonic memory channels, (iii) largely autonomous memory modules with a packet-based interface to the proces- sor, and (iv) a RAID-based reliability mechanism. Such a system is energy-efficient, high-performance, low-complexity, reliable, and cost-effective, making it ideally suited to meet the requirements of future large-scale computing systems

    ์ƒ๋ณ€ํ™” ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ๊ฐ„์„ญ ์˜ค๋ฅ˜ ์™„ํ™” ๋ฐ RMW ์„ฑ๋Šฅ ํ–ฅ์ƒ ๊ธฐ๋ฒ•

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2021.8. ์ดํ˜์žฌ.Phase-change memory (PCM) announces the beginning of the new era of memory systems, owing to attractive characteristics. Many memory product manufacturers (e.g., Intel, SK Hynix, and Samsung) are developing related products. PCM can be applied to various circumstances; it is not simply limited to an extra-scale database. For example, PCM has a low standby power due to its non-volatility; hence, computation-intensive applications or mobile applications (i.e., long memory idle time) are suitable to run on PCM-based computing systems. Despite these fascinating features of PCM, PCM is still far from the general commercial market due to low reliability and long latency problems. In particular, low reliability is a painful problem for PCM in past decades. As the semiconductor process technology rapidly scales down over the years, DRAM reaches 10 nm class process technology. In addition, it is reported that the write disturbance error (WDE) would be a serious issue for PCM if it scales down below 54 nm class process technology. Therefore, addressing the problem of WDEs becomes essential to make PCM competitive to DRAM. To overcome this problem, this dissertation proposes a novel approach that can restore meta-stable cells on demand by levering two-level SRAM-based tables, thereby significantly reducing the number WDEs. Furthermore, a novel randomized approach is proposed to implement a replacement policy that originally requires hundreds of read ports on SRAM. The second problem of PCM is a long-latency compared to that of DRAM. In particular, PCM tries to enhance its throughput by adopting a larger transaction unit; however, the different unit size from the general-purpose processor cache line further degrades the system performance due to the introduction of a read-modify-write (RMW) module. Since there has never been any research related to RMW in a PCM-based memory system, this dissertation proposes a novel architecture to enhance the overall system performance and reliability of a PCM-based memory system having an RMW module. The proposed architecture enhances data re-usability without introducing extra storage resources. Furthermore, a novel operation that merges commands regardless of command types is proposed to enhance performance notably. Another problem is the absence of a full simulation platform for PCM. While the announced features of the PCM-related product (i.e., Intel Optane) are scarce due to confidential issues, all priceless information can be integrated to develop an architecture simulator that resembles the available product. To this end, this dissertation tries to scrape up all available features of modules in a PCM controller and implement a dedicated simulator for future research purposes.์ƒ๋ณ€ํ™” ๋ฉ”๋ชจ๋ฆฌ๋Š”(PCM) ๋งค๋ ฅ์ ์ธ ํŠน์„ฑ์„ ํ†ตํ•ด ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ์ƒˆ๋กœ์šด ์‹œ๋Œ€์˜ ์‹œ์ž‘์„ ์•Œ๋ ธ๋‹ค. ๋งŽ์€ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ จ ์ œํ’ˆ ์ œ์กฐ์—…์ฒด(์˜ˆ : ์ธํ…”, SK ํ•˜์ด๋‹‰์Šค, ์‚ผ์„ฑ)๊ฐ€ ๊ด€๋ จ ์ œํ’ˆ ๊ฐœ๋ฐœ์— ๋ฐ•์ฐจ๋ฅผ ๊ฐ€ํ•˜๊ณ  ์žˆ๋‹ค. PCM์€ ๋‹จ์ˆœํžˆ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์—๋งŒ ๊ตญํ•œ๋˜์ง€ ์•Š๊ณ  ๋‹ค์–‘ํ•œ ์ƒํ™ฉ์— ์ ์šฉ๋  ์ˆ˜ ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, PCM์€ ๋น„ํœ˜๋ฐœ์„ฑ์œผ๋กœ ์ธํ•ด ๋Œ€๊ธฐ ์ „๋ ฅ์ด ๋‚ฎ๋‹ค. ๋”ฐ๋ผ์„œ ๊ณ„์‚ฐ ์ง‘์•ฝ์ ์ธ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ๋˜๋Š” ๋ชจ๋ฐ”์ผ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์€(์ฆ‰, ๊ธด ๋ฉ”๋ชจ๋ฆฌ ์œ ํœด ์‹œ๊ฐ„) PCM ๊ธฐ๋ฐ˜ ์ปดํ“จํŒ… ์‹œ์Šคํ…œ์—์„œ ์‹คํ–‰ํ•˜๊ธฐ์— ์ ํ•ฉํ•˜๋‹ค. PCM์˜ ์ด๋Ÿฌํ•œ ๋งค๋ ฅ์ ์ธ ํŠน์„ฑ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  PCM์€ ๋‚ฎ์€ ์‹ ๋ขฐ์„ฑ๊ณผ ๊ธด ๋Œ€๊ธฐ ์‹œ๊ฐ„์œผ๋กœ ์ธํ•ด ์—ฌ์ „ํžˆ ์ผ๋ฐ˜ ์‚ฐ์—… ์‹œ์žฅ์—์„œ๋Š” DRAM๊ณผ ๋‹ค์†Œ ๊ฒฉ์ฐจ๊ฐ€ ์žˆ๋‹ค. ํŠนํžˆ ๋‚ฎ์€ ์‹ ๋ขฐ์„ฑ์€ ์ง€๋‚œ ์ˆ˜์‹ญ ๋…„ ๋™์•ˆ PCM ๊ธฐ์ˆ ์˜ ๋ฐœ์ „์„ ์ €ํ•ดํ•˜๋Š” ๋ฌธ์ œ๋‹ค. ๋ฐ˜๋„์ฒด ๊ณต์ • ๊ธฐ์ˆ ์ด ์ˆ˜๋…„์— ๊ฑธ์ณ ๋น ๋ฅด๊ฒŒ ์ถ•์†Œ๋จ์— ๋”ฐ๋ผ DRAM์€ 10nm ๊ธ‰ ๊ณต์ • ๊ธฐ์ˆ ์— ๋„๋‹ฌํ•˜์˜€๋‹ค. ์ด์–ด์„œ, ์“ฐ๊ธฐ ๋ฐฉํ•ด ์˜ค๋ฅ˜ (WDE)๊ฐ€ 54nm ๋“ฑ๊ธ‰ ํ”„๋กœ์„ธ์Šค ๊ธฐ์ˆ  ์•„๋ž˜๋กœ ์ถ•์†Œ๋˜๋ฉด PCM์— ์‹ฌ๊ฐํ•œ ๋ฌธ์ œ๊ฐ€ ๋  ๊ฒƒ์œผ๋กœ ๋ณด๊ณ ๋˜์—ˆ๋‹ค. ๋”ฐ๋ผ์„œ, WDE ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ์€ PCM์ด DRAM๊ณผ ๋™๋“ฑํ•œ ๊ฒฝ์Ÿ๋ ฅ์„ ๊ฐ–์ถ”๋„๋ก ํ•˜๋Š” ๋ฐ ์žˆ์–ด ํ•„์ˆ˜์ ์ด๋‹ค. ์ด ๋ฌธ์ œ๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ์ด ๋…ผ๋ฌธ์—์„œ๋Š” 2-๋ ˆ๋ฒจ SRAM ๊ธฐ๋ฐ˜ ํ…Œ์ด๋ธ”์„ ํ™œ์šฉํ•˜์—ฌ WDE ์ˆ˜๋ฅผ ํฌ๊ฒŒ ์ค„์—ฌ ํ•„์š”์— ๋”ฐ๋ผ ์ค€ ์•ˆ์ • ์…€์„ ๋ณต์›ํ•  ์ˆ˜ ์žˆ๋Š” ์ƒˆ๋กœ์šด ์ ‘๊ทผ ๋ฐฉ์‹์„ ์ œ์•ˆํ•œ๋‹ค. ๋˜ํ•œ, ์›๋ž˜ SRAM์—์„œ ์ˆ˜๋ฐฑ ๊ฐœ์˜ ์ฝ๊ธฐ ํฌํŠธ๊ฐ€ ํ•„์š”ํ•œ ๋Œ€์ฒด ์ •์ฑ…์„ ๊ตฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ์ƒˆ๋กœ์šด ๋žœ๋ค ๊ธฐ๋ฐ˜์˜ ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. PCM์˜ ๋‘ ๋ฒˆ์งธ ๋ฌธ์ œ๋Š” DRAM์— ๋น„ํ•ด ์ง€์—ฐ ์‹œ๊ฐ„์ด ๊ธธ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ํŠนํžˆ PCM์€ ๋” ํฐ ํŠธ๋žœ์žญ์…˜ ๋‹จ์œ„๋ฅผ ์ฑ„ํƒํ•˜์—ฌ ๋‹จ์œ„์‹œ๊ฐ„ ๋‹น ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋Ÿ‰ ํ–ฅ์ƒ์„ ๋„๋ชจํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ฒ”์šฉ ํ”„๋กœ์„ธ์„œ ์บ์‹œ ๋ผ์ธ๊ณผ ๋‹ค๋ฅธ ์œ ๋‹› ํฌ๊ธฐ๋Š” ์ฝ๊ธฐ-์ˆ˜์ •-์“ฐ๊ธฐ (RMW) ๋ชจ๋“ˆ์˜ ๋„์ž…์œผ๋กœ ์ธํ•ด ์‹œ์Šคํ…œ ์„ฑ๋Šฅ์„ ์ €ํ•˜ํ•˜๊ฒŒ ๋œ๋‹ค. PCM ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์—์„œ RMW ๊ด€๋ จ ์—ฐ๊ตฌ๊ฐ€ ์—†์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ๋ณธ ๋…ผ๋ฌธ์€ RMW ๋ชจ๋“ˆ์„ ํƒ‘์žฌ ํ•œ PCM ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ์ „๋ฐ˜์ ์ธ ์‹œ์Šคํ…œ ์„ฑ๋Šฅ๊ณผ ์‹ ๋ขฐ์„ฑ์„ ํ–ฅ์ƒํ•˜๊ฒŒ ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์ƒˆ๋กœ์šด ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆ๋œ ์•„ํ‚คํ…์ฒ˜๋Š” ์ถ”๊ฐ€ ์Šคํ† ๋ฆฌ์ง€ ๋ฆฌ์†Œ์Šค๋ฅผ ๋„์ž…ํ•˜์ง€ ์•Š๊ณ ๋„ ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ์„ฑ์„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค. ๋˜ํ•œ, ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•ด ๋ช…๋ น ์œ ํ˜•๊ณผ ๊ด€๊ณ„์—†์ด ๋ช…๋ น์„ ๋ณ‘ํ•ฉํ•˜๋Š” ์ƒˆ๋กœ์šด ์ž‘์—…์„ ์ œ์•ˆํ•œ๋‹ค. ๋˜ ๋‹ค๋ฅธ ๋ฌธ์ œ๋Š” PCM์„ ์œ„ํ•œ ์™„์ „ํ•œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ํ”Œ๋žซํผ์ด ๋ถ€์žฌํ•˜๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. PCM ๊ด€๋ จ ์ œํ’ˆ(์˜ˆ : Intel Optane)์— ๋Œ€ํ•ด ๋ฐœํ‘œ๋œ ์ •๋ณด๋Š” ๋Œ€์™ธ๋น„ ๋ฌธ์ œ๋กœ ์ธํ•ด ๋ถ€์กฑํ•˜๋‹ค. ํ•˜์ง€๋งŒ ์•Œ๋ ค์ ธ ์žˆ๋Š” ์ •๋ณด๋ฅผ ์ ์ ˆํžˆ ์ทจํ•ฉํ•˜๋ฉด ์‹œ์ค‘ ์ œํ’ˆ๊ณผ ์œ ์‚ฌํ•œ ์•„ํ‚คํ…์ฒ˜ ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ๋ฅผ ๊ฐœ๋ฐœํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์€ PCM ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ์— ํ•„์š”ํ•œ ๋ชจ๋“  ๋ชจ๋“ˆ ์ •๋ณด๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํ–ฅํ›„ ์ด์™€ ๊ด€๋ จ๋œ ์—ฐ๊ตฌ์—์„œ ์ถฉ๋ถ„ํžˆ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์ „์šฉ ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ๋ฅผ ๊ตฌํ˜„ํ•˜์˜€๋‹ค.1 INTRODUCTION 1 1.1 Limitation of Traditional Main Memory Systems 1 1.2 Phase-Change Memory as Main Memory 3 1.2.1 Opportunities of PCM-based System 3 1.2.2 Challenges of PCM-based System 4 1.3 Dissertation Overview 7 2 BACKGROUND AND PREVIOUS WORK 8 2.1 Phase-Change Memory 8 2.2 Mitigation Schemes for Write Disturbance Errors 10 2.2.1 Write Disturbance Errors 10 2.2.2 Verification and Correction 12 2.2.3 Lazy Correction 13 2.2.4 Data Encoding-based Schemes 14 2.2.5 Sparse-Insertion Write Cache 16 2.3 Performance Enhancement for Read-Modify-Write 17 2.3.1 Traditional Read-Modify-Write 17 2.3.2 Write Coalescing for RMW 19 2.4 Architecture Simulators for PCM 21 2.4.1 NVMain 21 2.4.2 Ramulator 22 2.4.3 DRAMsim3 22 3 IN-MODULE DISTURBANCE BARRIER 24 3.1 Motivation 25 3.2 IMDB: In Module-Disturbance Barrier 29 3.2.1 Architectural Overview 29 3.2.2 Implementation of Data Structures 30 3.2.3 Modification of Media Controller 36 3.3 Replacement Policy 38 3.3.1 Replacement Policy for IMDB 38 3.3.2 Approximate Lowest Number Estimator 40 3.4 Putting All Together: Case Studies 43 3.5 Evaluation 45 3.5.1 Configuration 45 3.5.2 Architectural Exploration 47 3.5.3 Effectiveness of the Replacement Policy 48 3.5.4 Sensitivity to Main Table Configuration 49 3.5.5 Sensitivity to Barrier Buffer Size 51 3.5.6 Sensitivity to AppLE Group Size 52 3.5.7 Comparison with Other Studies 54 3.6 Discussion 59 3.7 Summary 63 4 INTEGRATION OF AN RMW MODULE IN A PCM-BASED SYSTEM 64 4.1 Motivation 65 4.2 Utilization of DRAM Cache for RMW 67 4.2.1 Architectural Design 67 4.2.2 Algorithm 70 4.3 Typeless Command Merging 73 4.3.1 Architectural Design 73 4.3.2 Algorithm 74 4.4 An Alternative Implementation: SRC-RMW 78 4.4.1 Implementation of SRC-RMW 78 4.4.2 Design Constraint 80 4.5 Case Study 82 4.6 Evaluation 85 4.6.1 Configuration 85 4.6.2 Speedup 88 4.6.3 Read Reliability 91 4.6.4 Energy Consumption: Selecting a Proper Page Size 93 4.6.5 Comparison with Other Studies 95 4.7 Discussion 97 4.8 Summary 99 5 AN ALL-INCLUSIVE SIMULATOR FOR A PCM CONTROLLER 100 5.1 Motivation 101 5.2 PCMCsim: PCM Controller Simulator 103 5.2.1 Architectural Overview 103 5.2.2 Underlying Classes of PCMCsim 104 5.2.3 Implementation of Contention Behavior 108 5.2.4 Modules of PCMCsim 109 5.3 Evaluation 116 5.3.1 Correctness of the Simulator 116 5.3.2 Comparison with Other Simulators 117 5.4 Summary 119 6 Conclusion 120 Abstract (In Korean) 141 Acknowledgment 143๋ฐ•

    Architecting Memory Systems for Emerging Technologies

    Full text link
    The advance of traditional dynamic random access memory (DRAM) technology has slowed down, while the capacity and performance needs of memory system have continued to increase. This is a result of increasing data volume from emerging applications, such as machine learning and big data analytics. In addition to such demands, increasing energy consumption is becoming a major constraint on the capabilities of computer systems. As a result, emerging non-volatile memories, for example, Spin Torque Transfer Magnetic RAM (STT-MRAM), and new memory interfaces, for example, High Bandwidth Memory (HBM), have been developed as an alternative. Thus far, most previous studies have retained a DRAM-like memory architecture and management policy. This preserves compatibility but hides the true benefits of those new memory technologies. In this research, we proposed the co-design of memory architectures and their management policies for emerging technologies. First, we introduced a new memory architecture for an STT-MRAM main memory. In particular, we defined a new page mode operation for efficient activation and sensing. By fully exploiting the non-destructive nature of STT- MRAM, our design achieved higher performance, lower energy consumption, and a smaller area than the traditional designs. Second, we developed a cost-effective technique to improve load balancing for HBM memory channels. We showed that the proposed technique was capable of efficiently redistributing memory requests across multiple memory channels to improve the channel utilization, resulting in improved performance.PHDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/145988/1/bcoh_1.pd
    • โ€ฆ
    corecore