306 research outputs found

    Flash Cache Hybrid Storage System For Video-On-Demand Server

    A video-on-demand (VOD) system allows users to access any video at any time and to view videos in real-time streaming mode for extended durations. One of the key components of a VOD system is the hybrid storage server, which combines HDD, SSD, and RAM to fulfil the requirement of fast data access together with large-scale data distribution to numerous simultaneous users. However, current hybrid storage systems still face several challenges: (1) the integration and roles of the HDD, SSD, and RAM are relatively weak at optimizing fast access prior to streaming to a large number of simultaneous users, and (2) the HDD and SSD exhibit poor data layout and streaming control for producing a high number of simultaneous streams. This thesis proposes (1) an enhanced hybrid storage system (EHSS) for VOD servers, in which the HDD, SSD, and RAM are managed and integrated by an effective mechanism that exploits the top 10% most popular videos to achieve fast data access and service to a large number of simultaneous users. (2) The flash cache hybrid storage system (FCHSS) is also proposed. Unlike the EHSS architecture, the FCHSS architecture has no RAM; it is removed and replaced by a flash-based SSD, which serves as a cache for the HDD to achieve fast data access and service to a large number of simultaneous users. (3) The new data layout stores thousands of video segments in the HDD and SSD
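
    The popularity-driven placement described above can be illustrated with a minimal sketch. Only the top-10% threshold comes from the abstract; the request-log popularity signal and the function name pick_cache_set are assumptions for illustration, not the thesis's actual controller:

        from collections import Counter

        def pick_cache_set(request_log, cache_fraction=0.10):
            """Return the set of video IDs that should live on the fast tier (SSD/RAM).

            request_log    : iterable of video IDs, one entry per playback request.
            cache_fraction : fraction of the catalogue treated as "popular"
                             (0.10 mirrors the top-10% policy in the abstract).
            """
            popularity = Counter(request_log)                      # requests per video
            n_cached = max(1, int(len(popularity) * cache_fraction))
            # Most-requested videos first; these are staged on the SSD/RAM tier.
            return {vid for vid, _ in popularity.most_common(n_cached)}

        # Example: three videos, "v1" dominates the request stream.
        requests = ["v1", "v2", "v1", "v3", "v1", "v1", "v2"] * 10
        hot_set = pick_cache_set(requests)
        print(hot_set)   # {'v1'} is served from the fast tier, the rest stay on HDD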

    SSD์˜ ๊ธด ๊ผฌ๋ฆฌ ์ง€์—ฐ์‹œ๊ฐ„ ๋ฌธ์ œ ์™„ํ™”๋ฅผ ์œ„ํ•œ ๊ฐ•ํ™”ํ•™์Šต์˜ ์ ์šฉ

    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ)--์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› :๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€,2020. 2. ์œ ์Šน์ฃผ.NAND flash memory is widely used in a variety of systems, from realtime embedded systems to high-performance enterprise server systems. Flash memory has (1) erase-before-write (write-once) and (2) endurance problems. To handle the erase-before-write feature, apply a flash-translation layer (FTL). Currently, the page-level mapping method is mainly used to reduce the latency increase caused by the write-once and block erase characteristics of flash memory. Garbage collection (GC) is one of the leading causes of long-tail latency, which increases more than 100 times the average latency at 99th percentile. Therefore, real-time systems or quality-critical systems cannot satisfy given requirements such as QoS restrictions. As flash memory capacity increases, GC latency also tends to increase. This is because the block size (the number of pages included in one block) of the flash memory increases as the capacity of the flash memory increases. GC latency is determined by valid page copy and block erase time. Therefore, as block size increases, GC latency also increases. Especially, the block size gets increased from 2D to 3D NAND flash memory, e.g., 256 pages/block in 2D planner NAND flash memory and 768 pages/block in 3D NAND flash memory. Even in 3D NAND flash memory, the block size is expected to continue to increase. Thus, the long write latency problem incurred by GC can become more serious in 3D NAND flash memory-based storage. In this dissertation, we propose three versions of the novel GC scheduling method based on reinforcement learning. The purpose of this method is to reduce the long tail latency caused by GC by utilizing the idle time of the storage system. Also, we perform a quantitative analysis for the RL-assisted GC solution. RL-assisted GC scheduling technique was proposed which learns the storage access behavior online and determines the number of GC operations to exploit the idle time. We also presented aggressive methods, which helps in further reducing the long tail latency by aggressively performing fine-grained GC operations. We also proposed a technique that dynamically manages key states in RL-assisted GC to reduce the long-tail latency. This technique uses many fine-grained pieces of information as state candidates and manages key states that suitably represent the characteristics of the workload using a relatively small amount of memory resource. Thus, the proposed method can reduce the long-tail latency even further. In addition, we presented a Q-value prediction network that predicts the initial Q-value of a newly inserted state in the Q-table cache. The integrated solution of the Q-table cache and Q-value prediction network can exploit the short-term history of the system with a low-cost Q-table cache. It is also equipped with a small network called Q-value prediction network to make use of the long-term history and provide good Q-value initialization for the Q-table cache. The experiments show that our proposed method reduces by 25%-37% the long tail latency compared to the state-of-the-art method.๋‚ธ๋“œ ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ๋Š” ์‹ค์‹œ๊ฐ„ ์ž„๋ฒ ๋””๋“œ ์‹œ์Šคํ…œ์œผ๋กœ๋ถ€ํ„ฐ ๊ณ ์„ฑ๋Šฅ์˜ ์—”ํ„ฐํ”„๋ผ์ด์ฆˆ ์„œ๋ฒ„ ์‹œ์Šคํ…œ๊นŒ์ง€ ๋‹ค์–‘ํ•œ ์‹œ์Šคํ…œ์—์„œ ๋„๋ฆฌ ์‚ฌ์šฉ ๋˜๊ณ  ์žˆ๋‹ค. ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ๋Š” (1) erase-before-write (write-once)์™€ (2) endurance ๋ฌธ์ œ๋ฅผ ๊ฐ–๊ณ  ์žˆ๋‹ค. 
Erase-before-write ํŠน์„ฑ์„ ๋‹ค๋ฃจ๊ธฐ ์œ„ํ•ด flash-translation layer (FTL)์„ ์ ์šฉ ํ•œ๋‹ค. ํ˜„์žฌ ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ์˜ write-once ํŠน์„ฑ๊ณผ block eraseํŠน์„ฑ์œผ๋กœ ์ธํ•œ latency ์ฆ๊ฐ€๋ฅผ ๊ฐ์†Œ ์‹œํ‚ค๊ธฐ ์œ„ํ•˜์—ฌ page-level mapping๋ฐฉ์‹์ด ์ฃผ๋กœ ์‚ฌ์šฉ ๋œ๋‹ค. Garbage collection (GC)์€ 99th percentile์—์„œ ํ‰๊ท  ์ง€์—ฐ์‹œ๊ฐ„์˜ 100๋ฐฐ ์ด์ƒ ์ฆ๊ฐ€ํ•˜๋Š” long tail latency๋ฅผ ์œ ๋ฐœ์‹œํ‚ค๋Š” ์ฃผ์š” ์›์ธ ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๋”ฐ๋ผ์„œ ์‹ค์‹œ๊ฐ„ ์‹œ์Šคํ…œ์ด๋‚˜ quality-critical system์—์„œ๋Š” Quality of Service (QoS) ์ œํ•œ๊ณผ ๊ฐ™์€ ์ฃผ์–ด์ง„ ์š”๊ตฌ ์กฐ๊ฑด์„ ๋งŒ์กฑ ์‹œํ‚ฌ ์ˆ˜ ์—†๋‹ค. ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ์˜ ์šฉ๋Ÿ‰์ด ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ GC latency๋„ ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒฝํ–ฅ์„ ๋ณด์ธ๋‹ค. ์ด๊ฒƒ์€ ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ์˜ ์šฉ๋Ÿ‰์ด ์ฆ๊ฐ€ ํ•จ์— ๋”ฐ๋ผ ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ์˜ ๋ธ”๋ก ํฌ๊ธฐ (ํ•˜๋‚˜์˜ ๋ธ”๋ก์ด ํฌํ•จํ•˜๊ณ  ์žˆ๋Š” ํŽ˜์ด์ง€์˜ ์ˆ˜)๊ฐ€ ์ฆ๊ฐ€ ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. GC latency๋Š” valid page copy์™€ block erase ์‹œ๊ฐ„์— ์˜ํ•ด ๊ฒฐ์ • ๋œ๋‹ค. ๋”ฐ๋ผ์„œ, ๋ธ”๋ก ํฌ๊ธฐ๊ฐ€ ์ฆ๊ฐ€ํ•˜๋ฉด, GC latency๋„ ์ฆ๊ฐ€ ํ•œ๋‹ค. ํŠนํžˆ, ์ตœ๊ทผ 2D planner ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ์—์„œ 3D vertical ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์กฐ๋กœ ์ „ํ™˜๋จ์— ๋”ฐ๋ผ ๋ธ”๋ก ํฌ๊ธฐ๋Š” ์ฆ๊ฐ€ ํ•˜์˜€๋‹ค. ์‹ฌ์ง€์–ด 3D vertical ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ์—์„œ๋„ ๋ธ”๋ก ํฌ๊ธฐ๊ฐ€ ์ง€์†์ ์œผ๋กœ ์ฆ๊ฐ€ ํ•˜๊ณ  ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ 3D vertical ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ์—์„œ long tail latency ๋ฌธ์ œ๋Š” ๋”์šฑ ์‹ฌ๊ฐํ•ด ์ง„๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ ์šฐ๋ฆฌ๋Š” ๊ฐ•ํ™”ํ•™์Šต(Reinforcement learning, RL)์„ ์ด์šฉํ•œ ์„ธ ๊ฐ€์ง€ ๋ฒ„์ „์˜ ์ƒˆ๋กœ์šด GC scheduling ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ œ์•ˆ๋œ ๊ธฐ์ˆ ์˜ ๋ชฉ์ ์€ ์Šคํ† ๋ฆฌ์ง€ ์‹œ์Šคํ…œ์˜ idle ์‹œ๊ฐ„์„ ํ™œ์šฉํ•˜์—ฌ GC์— ์˜ํ•ด ๋ฐœ์ƒ๋œ long tail latency๋ฅผ ๊ฐ์†Œ ์‹œํ‚ค๋Š” ๊ฒƒ์ด๋‹ค. ๋˜ํ•œ, ์šฐ๋ฆฌ๋Š” RL-assisted GC ์†”๋ฃจ์…˜์„ ์œ„ํ•œ ์ •๋Ÿ‰ ๋ถ„์„ ํ•˜์˜€๋‹ค. ์šฐ๋ฆฌ๋Š” ์Šคํ† ๋ฆฌ์ง€์˜ access behavior๋ฅผ ์˜จ๋ผ์ธ์œผ๋กœ ํ•™์Šตํ•˜๊ณ , idle ์‹œ๊ฐ„์„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š” GC operation์˜ ์ˆ˜๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” RL-assisted GC scheduling ๊ธฐ์ˆ ์„ ์ œ์•ˆ ํ•˜์˜€๋‹ค. ์ถ”๊ฐ€์ ์œผ๋กœ ์šฐ๋ฆฌ๋Š” ๊ณต๊ฒฉ์ ์ธ ๋ฐฉ๋ฒ•์„ ์ œ์‹œ ํ•˜์˜€๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ ์ž‘์€ ๋‹จ์œ„์˜ GC operation๋“ค์„ ๊ณต๊ฒฉ์ ์œผ๋กœ ์ˆ˜ํ–‰ ํ•จ์œผ๋กœ์จ, long tail latency๋ฅผ ๋”์šฑ ๊ฐ์†Œ ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋„๋ก ๋„์›€์„ ์ค€๋‹ค. ๋˜ํ•œ ์šฐ๋ฆฌ๋Š” long tail latency๋ฅผ ๋”์šฑ ๊ฐ์†Œ์‹œํ‚ค๊ธฐ ์œ„ํ•˜์—ฌ RL-assisted GC์˜ key state๋“ค์„ ๋™์ ์œผ๋กœ ๊ด€๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” Q-table cache ๊ธฐ์ˆ ์„ ์ œ์•ˆ ํ•˜์˜€๋‹ค. ์ด ๊ธฐ์ˆ ์€ state ํ›„๋ณด๋กœ ๋งค์šฐ ๋งŽ์€ ์ˆ˜์˜ ์„ธ๋ฐ€ํ•œ ์ •๋ณด๋“ค์„ ์‚ฌ์šฉ ํ•˜๊ณ , ์ƒ๋Œ€์ ์œผ๋กœ ์ž‘์€ ๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„์„ ์ด์šฉํ•˜์—ฌ workload์˜ ํŠน์„ฑ์„ ์ ์ ˆํ•˜๊ฒŒ ํ‘œํ˜„ ํ•  ์ˆ˜ ์žˆ๋Š” key state๋“ค์„ ๊ด€๋ฆฌ ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ, ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•์€ long tail latency๋ฅผ ๋”์šฑ ๊ฐ์†Œ ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค. ์ถ”๊ฐ€์ ์œผ๋กœ, ์šฐ๋ฆฌ๋Š” Q-table cache์— ์ƒˆ๋กญ๊ฒŒ ์ถ”๊ฐ€๋˜๋Š” state์˜ ์ดˆ๊ธฐ๊ฐ’์„ ์˜ˆ์ธกํ•˜๋Š” Q-value prediction network (QP Net)๋ฅผ ์ œ์•ˆ ํ•˜์˜€๋‹ค. Q-table cache์™€ QP Net์˜ ํ†ตํ•ฉ ์†”๋ฃจ์…˜์€ ์ € ๋น„์šฉ์˜ Q-table cache๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹จ๊ธฐ๊ฐ„์˜ ๊ณผ๊ฑฐ ์ •๋ณด๋ฅผ ํ™œ์šฉ ํ•  ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ ์ด๊ฒƒ์€ QP Net์ด๋ผ๊ณ  ๋ถ€๋ฅด๋Š” ์ž‘์€ ์‹ ๊ฒฝ๋ง์„ ์ด์šฉํ•˜์—ฌ ํ•™์Šตํ•œ ์žฅ๊ธฐ๊ฐ„์˜ ๊ณผ๊ฑฐ ์ •๋ณด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Q-table cache์— ์ƒˆ๋กญ๊ฒŒ ์‚ฝ์ž…๋˜๋Š” state์— ๋Œ€ํ•ด ์ข‹์€ Q-value ์ดˆ๊ธฐ๊ฐ’์„ ์ œ๊ณตํ•œ๋‹ค. 
์‹คํ—˜๊ฒฐ๊ณผ๋Š” ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•์ด state-of-the-art ๋ฐฉ๋ฒ•์— ๋น„๊ตํ•˜์—ฌ 25%-37%์˜ long tail latency๋ฅผ ๊ฐ์†Œ ์‹œ์ผฐ์Œ์„ ๋ณด์—ฌ์ค€๋‹ค.Chapter 1 Introduction 1 Chapter 2 Background 6 2.1 System Level Tail Latency 6 2.2 Solid State Drive 10 2.2.1 Flash Storage Architecture and Garbage Collection 10 2.3 Reinforcement Learning 13 Chapter 3 Related Work 17 Chapter 4 Small Q-table based Solution to Reduce Long Tail Latency 23 4.1 Problem and Motivation 23 4.1.1 Long Tail Problem in Flash Storage Access Latency 23 4.1.2 Idle Time in Flash Storage 24 4.2 Design and Implementation 26 4.2.1 Solution Overview 26 4.2.2 RL-assisted Garbage Collection Scheduling 27 4.2.3 Aggressive RL-assisted Garbage Collection Scheduling 33 4.3 Evaluation 35 4.3.1 Evaluation Setup 35 4.3.2 Results and Discussion 39 Chapter 5 Q-table Cache to Exploit a Large Number of States at Small Cost 52 5.1 Motivation 52 5.2 Design and Implementation 56 5.2.1 Solution Overview 56 5.2.2 Dynamic Key States Management 61 5.3 Evaluation 67 5.3.1 Evaluation Setup 67 5.3.2 Results and Discussion 67 Chapter 6 Combining Q-table cache and Neural Network to Exploit both Long and Short-term History 73 6.1 Motivation and Problem 73 6.1.1 More State Information can Further Reduce Long Tail Latency 73 6.1.2 Locality Behavior of Workload 74 6.1.3 Zero Initialization Problem 75 6.2 Design and Implementation 77 6.2.1 Solution Overview 77 6.2.2 Q-table Cache for Action Selection 80 6.2.3 Q-value Prediction 83 6.3 Evaluation 87 6.3.1 Evaluation Setup 87 6.3.2 Storage-Intensive Workloads 89 6.3.3 Latency Comparison: Overall 92 6.3.4 Q-value Prediction Network Effects on Latency 97 6.3.5 Q-table Cache Analysis 110 6.3.6 Immature State Analysis 113 6.3.7 Miscellaneous Analysis 116 6.3.8 Multi Channel Analysis 121 Chapter 7 Conculsion and Future Work 138 7.1 Conclusion 138 7.2 Future Work 140 Bibliography 143 ๊ตญ๋ฌธ์ดˆ๋ก 154Docto

    Data Deduplication Technology for Cloud Storage

    With the explosive growth of information, data storage systems have stepped into the cloud storage era. Although the core of a cloud storage system is a distributed file system that solves the problem of mass data storage, large amounts of duplicate data exist in all storage systems. File systems are designed to control how files are stored and retrieved, yet few studies focus on cloud file system deduplication technologies at the application level, especially for the Hadoop distributed file system. In this paper, we design a file deduplication framework on the Hadoop distributed file system for cloud application developers. The proposed RFD-HDFS and FD-HDFS data deduplication solutions perform deduplication online, which improves storage space utilisation and reduces redundancy. At the end of the paper, we test disk utilisation and file upload performance on RFD-HDFS and FD-HDFS, and compare the disk utilisation of the two frameworks with that of plain HDFS. The results show that the two frameworks not only implement the data deduplication function but also effectively reduce the disk utilisation of duplicate files. Thus, the proposed framework can indeed reduce storage space by eliminating redundant HDFS files.
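
    The abstract does not detail how duplicates are detected, so the sketch below assumes the common content-fingerprint approach: a file's SHA-256 hash is looked up in an index before a second physical copy is written. The DedupIndex class and its methods are hypothetical and kept in memory for illustration rather than backed by HDFS:

        import hashlib

        class DedupIndex:
            """Maps SHA-256 fingerprints to the path of the first stored copy."""
            def __init__(self):
                self._index = {}          # fingerprint -> canonical storage path

            @staticmethod
            def fingerprint(data: bytes) -> str:
                return hashlib.sha256(data).hexdigest()

            def put(self, data: bytes, intended_path: str) -> str:
                fp = self.fingerprint(data)
                if fp in self._index:
                    # Duplicate content: record only a reference, no second physical copy.
                    return self._index[fp]
                # First occurrence: store the bytes (here just remembered) and index them.
                self._index[fp] = intended_path
                return intended_path

        index = DedupIndex()
        p1 = index.put(b"same payload", "/user/app/report-jan.bin")
        p2 = index.put(b"same payload", "/user/app/report-feb.bin")
        assert p1 == p2   # the second upload is deduplicated to the first physical copy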

    A (Nearly) Free Lunch: Extending NAND Flash Lifetime by Exploiting Neglected Physical Properties

    NAND flash is a key storage technology in modern computing systems; without it, many devices would probably not exist today or would at least not benefit from as many features. The great success of this technology has motivated massive efforts to scale it down in order to further increase its density. However, NAND flash currently faces physical limitations that prevent it from reaching smaller cell sizes without severely reducing its storage reliability and lifetime. Accordingly, in the present thesis we aim to relieve some constraints from device manufacturing by addressing flash irregularities at a higher level. For example, we acknowledge that process variation, among other factors, renders some regions of a flash device more sensitive than others. This difference usually leads to sensitive regions exhausting their lifetime early, which then causes the device to become unusable while the rest of the device is still healthy yet no longer exploitable. Consequently, we propose to postpone this exhaustion point with new strategies that require minimal resources to implement and effectively extend flash device lifetime. Sometimes our strategies involve unconventional methods of accessing the flash that are not supported by the specification documents and, therefore, should not be used lightly. Hence, we also present thorough characterization experiments on actual NAND flash chips to validate these methods and model their effect on a flash device. Finally, we evaluate the performance of our methods by implementing a trace-driven flash device simulator and executing a large set of realistic disk traces. Overall, we exploit properties that are either neglected or not well understood to propose methods that are nearly free to implement and systematically extend NAND flash lifetime. We are convinced that future NAND flash architectures will regularly bring radical physical changes, which will inevitably come together with a new set of physical properties to investigate and exploit.
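
    The thesis's specific strategies are not spelled out in this abstract; purely as a generic illustration of steering wear away from the more sensitive regions, the sketch below gives each erase block its own endurance budget and always allocates the block with the most budget left. The class name, the budget values, and the allocation policy are assumptions for illustration only:

        import heapq

        class VariationAwareAllocator:
            """Pick the erase block whose remaining endurance budget is largest.

            Blocks in sensitive regions are given smaller budgets, so weak blocks
            are spared from heavy use long before they are exhausted.
            """
            def __init__(self, budgets):
                # max-heap via negated remaining budget: (-remaining, block_id)
                self._heap = [(-b, blk) for blk, b in budgets.items()]
                heapq.heapify(self._heap)

            def allocate(self):
                neg_remaining, blk = heapq.heappop(self._heap)
                remaining = -neg_remaining - 1           # one erase consumed
                if remaining > 0:
                    heapq.heappush(self._heap, (-remaining, blk))
                return blk

        # Block 2 sits in a sensitive region, so it receives a smaller budget.
        alloc = VariationAwareAllocator({0: 3000, 1: 3000, 2: 1200})
        writes = [alloc.allocate() for _ in range(10)]
        print(writes)   # block 2 is not picked until the healthier blocks wear down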

    Dynamic Binary Translation for Embedded Systems with Scratchpad Memory

    Embedded software development has recently changed with advances in computing. Rather than fully co-designing software and hardware to perform a relatively simple task, embedded and mobile devices are nowadays designed as platforms where multiple applications can be run, new applications can be added, and existing applications can be updated. In this scenario, traditional constraints in embedded systems design (i.e., performance, memory and energy consumption, and real-time guarantees) are more difficult to address, while new concerns (e.g., security) have become important and increase software complexity as well. In general-purpose systems, Dynamic Binary Translation (DBT) has been used to address these issues with services such as Just-In-Time (JIT) compilation, dynamic optimization, virtualization, power management and code security. In embedded systems, however, DBT is not usually employed due to its performance, memory and power overhead. This dissertation presents StrataX, a low-overhead DBT framework for embedded systems. StrataX addresses the challenges faced by DBT in embedded systems using novel techniques. To reduce DBT overhead, StrataX loads code from NAND-flash storage and translates it into a Scratchpad Memory (SPM), a software-managed on-chip SRAM with limited capacity. SPM has access latency similar to a hardware cache, but consumes less power and chip area. StrataX manages the SPM as a software instruction cache, and employs victim compression and pinning to reduce retranslation cost and capture frequently executed code in the SPM. To prevent performance loss due to excessive code expansion, StrataX minimizes the amount of code inserted by DBT to maintain control of program execution. When a hardware instruction cache is available, StrataX dynamically partitions translated code between the SPM and main memory. With these techniques, StrataX has low performance overhead relative to native execution for MiBench programs. Further, it simplifies embedded software and hardware design by operating transparently to applications without any special hardware support. StrataX achieves sufficiently low overhead to make it feasible to use DBT in embedded systems to address important design goals and requirements.
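
    In the spirit of the software instruction cache described above, the sketch below keeps translated fragments in a fixed-capacity store, never evicts pinned (hot) fragments, and compresses evicted victims so they can be restored without retranslation. The capacity, the eviction order, and the class and method names are illustrative assumptions, not StrataX internals:

        import zlib
        from collections import OrderedDict

        class SpmCodeCache:
            def __init__(self, capacity_bytes=32 * 1024):
                self.capacity = capacity_bytes
                self.used = 0
                self.entries = OrderedDict()      # addr -> (translated code, pinned flag)
                self.victims = {}                 # addr -> zlib-compressed victim copy

            def lookup(self, addr):
                if addr in self.entries:                         # hit in SPM
                    return self.entries[addr][0]
                if addr in self.victims:                         # hit in compressed store
                    code = zlib.decompress(self.victims.pop(addr))
                    self.insert(addr, code)                      # cheaper than retranslation
                    return code
                return None                                      # miss: must translate

            def insert(self, addr, code, pinned=False):
                while self.used + len(code) > self.capacity:
                    self._evict_one()
                self.entries[addr] = (code, pinned)
                self.used += len(code)

            def _evict_one(self):
                for addr, (code, pinned) in self.entries.items():
                    if not pinned:                               # pinned hot code stays resident
                        del self.entries[addr]
                        self.used -= len(code)
                        self.victims[addr] = zlib.compress(code) # keep a cheap victim copy
                        return
                raise MemoryError("all resident fragments are pinned")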

    ์ƒ๋ณ€ํ™” ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ๊ฐ„์„ญ ์˜ค๋ฅ˜ ์™„ํ™” ๋ฐ RMW ์„ฑ๋Šฅ ํ–ฅ์ƒ ๊ธฐ๋ฒ•

    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2021.8. ์ดํ˜์žฌ.Phase-change memory (PCM) announces the beginning of the new era of memory systems, owing to attractive characteristics. Many memory product manufacturers (e.g., Intel, SK Hynix, and Samsung) are developing related products. PCM can be applied to various circumstances; it is not simply limited to an extra-scale database. For example, PCM has a low standby power due to its non-volatility; hence, computation-intensive applications or mobile applications (i.e., long memory idle time) are suitable to run on PCM-based computing systems. Despite these fascinating features of PCM, PCM is still far from the general commercial market due to low reliability and long latency problems. In particular, low reliability is a painful problem for PCM in past decades. As the semiconductor process technology rapidly scales down over the years, DRAM reaches 10 nm class process technology. In addition, it is reported that the write disturbance error (WDE) would be a serious issue for PCM if it scales down below 54 nm class process technology. Therefore, addressing the problem of WDEs becomes essential to make PCM competitive to DRAM. To overcome this problem, this dissertation proposes a novel approach that can restore meta-stable cells on demand by levering two-level SRAM-based tables, thereby significantly reducing the number WDEs. Furthermore, a novel randomized approach is proposed to implement a replacement policy that originally requires hundreds of read ports on SRAM. The second problem of PCM is a long-latency compared to that of DRAM. In particular, PCM tries to enhance its throughput by adopting a larger transaction unit; however, the different unit size from the general-purpose processor cache line further degrades the system performance due to the introduction of a read-modify-write (RMW) module. Since there has never been any research related to RMW in a PCM-based memory system, this dissertation proposes a novel architecture to enhance the overall system performance and reliability of a PCM-based memory system having an RMW module. The proposed architecture enhances data re-usability without introducing extra storage resources. Furthermore, a novel operation that merges commands regardless of command types is proposed to enhance performance notably. Another problem is the absence of a full simulation platform for PCM. While the announced features of the PCM-related product (i.e., Intel Optane) are scarce due to confidential issues, all priceless information can be integrated to develop an architecture simulator that resembles the available product. To this end, this dissertation tries to scrape up all available features of modules in a PCM controller and implement a dedicated simulator for future research purposes.์ƒ๋ณ€ํ™” ๋ฉ”๋ชจ๋ฆฌ๋Š”(PCM) ๋งค๋ ฅ์ ์ธ ํŠน์„ฑ์„ ํ†ตํ•ด ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ์ƒˆ๋กœ์šด ์‹œ๋Œ€์˜ ์‹œ์ž‘์„ ์•Œ๋ ธ๋‹ค. ๋งŽ์€ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ จ ์ œํ’ˆ ์ œ์กฐ์—…์ฒด(์˜ˆ : ์ธํ…”, SK ํ•˜์ด๋‹‰์Šค, ์‚ผ์„ฑ)๊ฐ€ ๊ด€๋ จ ์ œํ’ˆ ๊ฐœ๋ฐœ์— ๋ฐ•์ฐจ๋ฅผ ๊ฐ€ํ•˜๊ณ  ์žˆ๋‹ค. PCM์€ ๋‹จ์ˆœํžˆ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์—๋งŒ ๊ตญํ•œ๋˜์ง€ ์•Š๊ณ  ๋‹ค์–‘ํ•œ ์ƒํ™ฉ์— ์ ์šฉ๋  ์ˆ˜ ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, PCM์€ ๋น„ํœ˜๋ฐœ์„ฑ์œผ๋กœ ์ธํ•ด ๋Œ€๊ธฐ ์ „๋ ฅ์ด ๋‚ฎ๋‹ค. ๋”ฐ๋ผ์„œ ๊ณ„์‚ฐ ์ง‘์•ฝ์ ์ธ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ๋˜๋Š” ๋ชจ๋ฐ”์ผ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์€(์ฆ‰, ๊ธด ๋ฉ”๋ชจ๋ฆฌ ์œ ํœด ์‹œ๊ฐ„) PCM ๊ธฐ๋ฐ˜ ์ปดํ“จํŒ… ์‹œ์Šคํ…œ์—์„œ ์‹คํ–‰ํ•˜๊ธฐ์— ์ ํ•ฉํ•˜๋‹ค. 
PCM์˜ ์ด๋Ÿฌํ•œ ๋งค๋ ฅ์ ์ธ ํŠน์„ฑ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  PCM์€ ๋‚ฎ์€ ์‹ ๋ขฐ์„ฑ๊ณผ ๊ธด ๋Œ€๊ธฐ ์‹œ๊ฐ„์œผ๋กœ ์ธํ•ด ์—ฌ์ „ํžˆ ์ผ๋ฐ˜ ์‚ฐ์—… ์‹œ์žฅ์—์„œ๋Š” DRAM๊ณผ ๋‹ค์†Œ ๊ฒฉ์ฐจ๊ฐ€ ์žˆ๋‹ค. ํŠนํžˆ ๋‚ฎ์€ ์‹ ๋ขฐ์„ฑ์€ ์ง€๋‚œ ์ˆ˜์‹ญ ๋…„ ๋™์•ˆ PCM ๊ธฐ์ˆ ์˜ ๋ฐœ์ „์„ ์ €ํ•ดํ•˜๋Š” ๋ฌธ์ œ๋‹ค. ๋ฐ˜๋„์ฒด ๊ณต์ • ๊ธฐ์ˆ ์ด ์ˆ˜๋…„์— ๊ฑธ์ณ ๋น ๋ฅด๊ฒŒ ์ถ•์†Œ๋จ์— ๋”ฐ๋ผ DRAM์€ 10nm ๊ธ‰ ๊ณต์ • ๊ธฐ์ˆ ์— ๋„๋‹ฌํ•˜์˜€๋‹ค. ์ด์–ด์„œ, ์“ฐ๊ธฐ ๋ฐฉํ•ด ์˜ค๋ฅ˜ (WDE)๊ฐ€ 54nm ๋“ฑ๊ธ‰ ํ”„๋กœ์„ธ์Šค ๊ธฐ์ˆ  ์•„๋ž˜๋กœ ์ถ•์†Œ๋˜๋ฉด PCM์— ์‹ฌ๊ฐํ•œ ๋ฌธ์ œ๊ฐ€ ๋  ๊ฒƒ์œผ๋กœ ๋ณด๊ณ ๋˜์—ˆ๋‹ค. ๋”ฐ๋ผ์„œ, WDE ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ์€ PCM์ด DRAM๊ณผ ๋™๋“ฑํ•œ ๊ฒฝ์Ÿ๋ ฅ์„ ๊ฐ–์ถ”๋„๋ก ํ•˜๋Š” ๋ฐ ์žˆ์–ด ํ•„์ˆ˜์ ์ด๋‹ค. ์ด ๋ฌธ์ œ๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ์ด ๋…ผ๋ฌธ์—์„œ๋Š” 2-๋ ˆ๋ฒจ SRAM ๊ธฐ๋ฐ˜ ํ…Œ์ด๋ธ”์„ ํ™œ์šฉํ•˜์—ฌ WDE ์ˆ˜๋ฅผ ํฌ๊ฒŒ ์ค„์—ฌ ํ•„์š”์— ๋”ฐ๋ผ ์ค€ ์•ˆ์ • ์…€์„ ๋ณต์›ํ•  ์ˆ˜ ์žˆ๋Š” ์ƒˆ๋กœ์šด ์ ‘๊ทผ ๋ฐฉ์‹์„ ์ œ์•ˆํ•œ๋‹ค. ๋˜ํ•œ, ์›๋ž˜ SRAM์—์„œ ์ˆ˜๋ฐฑ ๊ฐœ์˜ ์ฝ๊ธฐ ํฌํŠธ๊ฐ€ ํ•„์š”ํ•œ ๋Œ€์ฒด ์ •์ฑ…์„ ๊ตฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ์ƒˆ๋กœ์šด ๋žœ๋ค ๊ธฐ๋ฐ˜์˜ ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. PCM์˜ ๋‘ ๋ฒˆ์งธ ๋ฌธ์ œ๋Š” DRAM์— ๋น„ํ•ด ์ง€์—ฐ ์‹œ๊ฐ„์ด ๊ธธ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ํŠนํžˆ PCM์€ ๋” ํฐ ํŠธ๋žœ์žญ์…˜ ๋‹จ์œ„๋ฅผ ์ฑ„ํƒํ•˜์—ฌ ๋‹จ์œ„์‹œ๊ฐ„ ๋‹น ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋Ÿ‰ ํ–ฅ์ƒ์„ ๋„๋ชจํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ฒ”์šฉ ํ”„๋กœ์„ธ์„œ ์บ์‹œ ๋ผ์ธ๊ณผ ๋‹ค๋ฅธ ์œ ๋‹› ํฌ๊ธฐ๋Š” ์ฝ๊ธฐ-์ˆ˜์ •-์“ฐ๊ธฐ (RMW) ๋ชจ๋“ˆ์˜ ๋„์ž…์œผ๋กœ ์ธํ•ด ์‹œ์Šคํ…œ ์„ฑ๋Šฅ์„ ์ €ํ•˜ํ•˜๊ฒŒ ๋œ๋‹ค. PCM ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์—์„œ RMW ๊ด€๋ จ ์—ฐ๊ตฌ๊ฐ€ ์—†์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ๋ณธ ๋…ผ๋ฌธ์€ RMW ๋ชจ๋“ˆ์„ ํƒ‘์žฌ ํ•œ PCM ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ์ „๋ฐ˜์ ์ธ ์‹œ์Šคํ…œ ์„ฑ๋Šฅ๊ณผ ์‹ ๋ขฐ์„ฑ์„ ํ–ฅ์ƒํ•˜๊ฒŒ ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์ƒˆ๋กœ์šด ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆ๋œ ์•„ํ‚คํ…์ฒ˜๋Š” ์ถ”๊ฐ€ ์Šคํ† ๋ฆฌ์ง€ ๋ฆฌ์†Œ์Šค๋ฅผ ๋„์ž…ํ•˜์ง€ ์•Š๊ณ ๋„ ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ์„ฑ์„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค. ๋˜ํ•œ, ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•ด ๋ช…๋ น ์œ ํ˜•๊ณผ ๊ด€๊ณ„์—†์ด ๋ช…๋ น์„ ๋ณ‘ํ•ฉํ•˜๋Š” ์ƒˆ๋กœ์šด ์ž‘์—…์„ ์ œ์•ˆํ•œ๋‹ค. ๋˜ ๋‹ค๋ฅธ ๋ฌธ์ œ๋Š” PCM์„ ์œ„ํ•œ ์™„์ „ํ•œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ํ”Œ๋žซํผ์ด ๋ถ€์žฌํ•˜๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. PCM ๊ด€๋ จ ์ œํ’ˆ(์˜ˆ : Intel Optane)์— ๋Œ€ํ•ด ๋ฐœํ‘œ๋œ ์ •๋ณด๋Š” ๋Œ€์™ธ๋น„ ๋ฌธ์ œ๋กœ ์ธํ•ด ๋ถ€์กฑํ•˜๋‹ค. ํ•˜์ง€๋งŒ ์•Œ๋ ค์ ธ ์žˆ๋Š” ์ •๋ณด๋ฅผ ์ ์ ˆํžˆ ์ทจํ•ฉํ•˜๋ฉด ์‹œ์ค‘ ์ œํ’ˆ๊ณผ ์œ ์‚ฌํ•œ ์•„ํ‚คํ…์ฒ˜ ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ๋ฅผ ๊ฐœ๋ฐœํ•  ์ˆ˜ ์žˆ๋‹ค. 
์ด๋ฅผ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์€ PCM ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ์— ํ•„์š”ํ•œ ๋ชจ๋“  ๋ชจ๋“ˆ ์ •๋ณด๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํ–ฅํ›„ ์ด์™€ ๊ด€๋ จ๋œ ์—ฐ๊ตฌ์—์„œ ์ถฉ๋ถ„ํžˆ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์ „์šฉ ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ๋ฅผ ๊ตฌํ˜„ํ•˜์˜€๋‹ค.1 INTRODUCTION 1 1.1 Limitation of Traditional Main Memory Systems 1 1.2 Phase-Change Memory as Main Memory 3 1.2.1 Opportunities of PCM-based System 3 1.2.2 Challenges of PCM-based System 4 1.3 Dissertation Overview 7 2 BACKGROUND AND PREVIOUS WORK 8 2.1 Phase-Change Memory 8 2.2 Mitigation Schemes for Write Disturbance Errors 10 2.2.1 Write Disturbance Errors 10 2.2.2 Verification and Correction 12 2.2.3 Lazy Correction 13 2.2.4 Data Encoding-based Schemes 14 2.2.5 Sparse-Insertion Write Cache 16 2.3 Performance Enhancement for Read-Modify-Write 17 2.3.1 Traditional Read-Modify-Write 17 2.3.2 Write Coalescing for RMW 19 2.4 Architecture Simulators for PCM 21 2.4.1 NVMain 21 2.4.2 Ramulator 22 2.4.3 DRAMsim3 22 3 IN-MODULE DISTURBANCE BARRIER 24 3.1 Motivation 25 3.2 IMDB: In Module-Disturbance Barrier 29 3.2.1 Architectural Overview 29 3.2.2 Implementation of Data Structures 30 3.2.3 Modification of Media Controller 36 3.3 Replacement Policy 38 3.3.1 Replacement Policy for IMDB 38 3.3.2 Approximate Lowest Number Estimator 40 3.4 Putting All Together: Case Studies 43 3.5 Evaluation 45 3.5.1 Configuration 45 3.5.2 Architectural Exploration 47 3.5.3 Effectiveness of the Replacement Policy 48 3.5.4 Sensitivity to Main Table Configuration 49 3.5.5 Sensitivity to Barrier Buffer Size 51 3.5.6 Sensitivity to AppLE Group Size 52 3.5.7 Comparison with Other Studies 54 3.6 Discussion 59 3.7 Summary 63 4 INTEGRATION OF AN RMW MODULE IN A PCM-BASED SYSTEM 64 4.1 Motivation 65 4.2 Utilization of DRAM Cache for RMW 67 4.2.1 Architectural Design 67 4.2.2 Algorithm 70 4.3 Typeless Command Merging 73 4.3.1 Architectural Design 73 4.3.2 Algorithm 74 4.4 An Alternative Implementation: SRC-RMW 78 4.4.1 Implementation of SRC-RMW 78 4.4.2 Design Constraint 80 4.5 Case Study 82 4.6 Evaluation 85 4.6.1 Configuration 85 4.6.2 Speedup 88 4.6.3 Read Reliability 91 4.6.4 Energy Consumption: Selecting a Proper Page Size 93 4.6.5 Comparison with Other Studies 95 4.7 Discussion 97 4.8 Summary 99 5 AN ALL-INCLUSIVE SIMULATOR FOR A PCM CONTROLLER 100 5.1 Motivation 101 5.2 PCMCsim: PCM Controller Simulator 103 5.2.1 Architectural Overview 103 5.2.2 Underlying Classes of PCMCsim 104 5.2.3 Implementation of Contention Behavior 108 5.2.4 Modules of PCMCsim 109 5.3 Evaluation 116 5.3.1 Correctness of the Simulator 116 5.3.2 Comparison with Other Simulators 117 5.4 Summary 119 6 Conclusion 120 Abstract (In Korean) 141 Acknowledgment 143๋ฐ•

    Sisรคkkรคiset virtuaaliympรคristรถt

    Virtual machines (VMs) have been a common computation platform in cloud computing for some time now. VMs offer a decent amount of isolation for security and system resources, and from the application perspective they behave much like native environments. Software containers are gaining popularity as a new application delivery technology. Just like VMs, applications started inside containers run in isolated environments, but without the performance overhead caused by virtualization of system resources. This makes containers seem like a more efficient option than VMs. In this thesis, different combinations of containers and VMs are benchmarked. For each benchmark, the host environment is also measured, to understand the overhead caused by the underlying virtual environment technology. The benchmarks used include storage and network access benchmarks, as well as an application benchmark of compiling the Linux kernel. In another part of the thesis, a CPU-intensive workload is run on the virtualization host server and the benchmarks are repeated, in order to determine how much the given workload affects the benchmark scores, and whether this effect can be observed from the virtualization guest side by measuring CPU steal time. The results show that containers are only slightly slower than the host in the application benchmark; the main difference is expected to come from the way Docker handles storage accesses. With the default network configuration, the container also loses to the host in performance. In every benchmark, VMs lost to both the host and the containers in performance.
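
    The guest-side observation mentioned above relies on the standard Linux accounting of stolen CPU time. A minimal sketch (Linux-only, assuming the aggregate "cpu" line of /proc/stat exposes the steal counter in its usual eighth field) samples it twice and reports the stolen share of the interval:

        import time

        def read_cpu_times():
            with open("/proc/stat") as f:
                fields = f.readline().split()        # aggregate "cpu" line:
            values = list(map(int, fields[1:]))      # user nice system idle iowait irq softirq steal ...
            return sum(values), values[7]            # (total ticks, steal ticks)

        def steal_ratio(interval_s=1.0):
            total0, steal0 = read_cpu_times()
            time.sleep(interval_s)
            total1, steal1 = read_cpu_times()
            dt = total1 - total0
            return (steal1 - steal0) / dt if dt else 0.0

        if __name__ == "__main__":
            print(f"steal share over 1 s: {steal_ratio():.2%}")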