207 research outputs found

    Exploring Dynamic Compilation and Cross-Layer Object Management Policies for Managed Language Applications

    Get PDF
    Recent years have witnessed the widespread adoption of managed programming languages that are designed to execute on virtual machines. Virtual machine architectures provide several powerful software engineering advantages over statically compiled binaries, such as portable program representations, additional safety guarantees, automatic memory and thread management, and dynamic program composition, which have largely driven their success. To support and facilitate the use of these features, virtual machines implement a number of services that adaptively manage and optimize application behavior during execution. Such runtime services often require tradeoffs between efficiency and effectiveness, and different policies can have major implications for the system's performance and energy requirements. In this work, we extensively explore policies for the two runtime services that are most important for achieving performance and energy efficiency: dynamic (or Just-In-Time (JIT)) compilation and memory management. First, we examine the properties of single-tier and multi-tier JIT compilation policies in order to find strategies that realize the best program performance for existing and future machines. Our analysis performs hundreds of experiments with different compiler aggressiveness and optimization levels to evaluate the performance impact of varying whether and when methods are compiled. We then investigate how to optimize program regions to maximize performance in JIT compilation environments. For this study, we conduct a thorough analysis of the behavior of optimization phases in our dynamic compiler and construct a custom experimental framework to determine the performance limits of phase selection during dynamic compilation. Next, we explore innovative memory management strategies to improve energy efficiency in the memory subsystem. We propose and develop a novel cross-layer approach to memory management that integrates information and analysis in the VM with fine-grained management of memory resources in the operating system. Using custom as well as standard benchmark workloads, we perform a detailed evaluation that demonstrates the energy-saving potential of our approach. We implement and evaluate all of our studies using the industry-standard Oracle HotSpot Java Virtual Machine to ensure that our conclusions are supported by widely used, state-of-the-art runtime technology.
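    The tiered compilation policies swept in this work are typically driven by per-method profiling counters. The sketch below shows such a counter-based trigger with illustrative tier names and thresholds (assumptions, not HotSpot's actual parameters); lowering the thresholds models a more aggressive compiler, which is one axis such experiments vary.

```cpp
// Sketch of a counter-based multi-tier JIT trigger. Tier names and
// thresholds are illustrative assumptions, not HotSpot's parameters.
#include <cstdint>
#include <iostream>

enum class Tier { Interpreted, QuickCompile, OptimizedCompile };

struct MethodProfile {
    uint32_t invocations = 0;
    Tier tier = Tier::Interpreted;
};

// "Aggressiveness" is modeled as how early each tier is entered.
constexpr uint32_t kTier1Threshold = 200;    // cheap compile, few optimizations
constexpr uint32_t kTier2Threshold = 5000;   // aggressive optimizing compile

void onInvocation(MethodProfile& m) {
    ++m.invocations;
    if (m.tier == Tier::Interpreted && m.invocations >= kTier1Threshold)
        m.tier = Tier::QuickCompile;          // compile early at low optimization
    else if (m.tier == Tier::QuickCompile && m.invocations >= kTier2Threshold)
        m.tier = Tier::OptimizedCompile;      // recompile hot method aggressively
}

int main() {
    MethodProfile m;
    for (int i = 0; i < 6000; ++i) onInvocation(m);
    std::cout << "final tier: " << static_cast<int>(m.tier) << "\n"; // prints 2
}
```

    Lower thresholds compile methods sooner, trading compilation overhead for earlier peak performance; that trade-off is exactly the policy space a study like this explores.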

    Buffer-On-Board Memory System

    Get PDF
    The design and implementation of the commodity memory architecture have resulted in significant limitations in a system's speed and capacity. To circumvent these limitations, designers and vendors have begun to place intermediate logic between the CPU and DRAM. This additional logic has two functions: to control the DRAM and to communicate with the CPU over a fast and narrow bus. The benefit provided by this logic is a reduction in pin-out to the memory system from the CPU and increased signal integrity seen by the DRAM, granting faster clock rates while increasing capacity. This new design is reminiscent of the FB-DIMM memory system yet makes key changes to its architecture, including the use of existing DIMMs to reduce cost, a reduction in power (relative to FB-DIMM), and a more stable request latency. The problem is that the few vendors utilizing this design share the same general approach, yet the implementations vary greatly in their non-trivial details. A hardware-verified simulation suite is developed to accurately model and evaluate the behavior of this buffer-on-board memory system. A study of this design space is performed to determine optimal use of the resources involved, including DRAM and bus organization, queue storage, and mapping schemes. Various constraints based on implementation costs are placed on simulated configurations to confirm that these optimizations apply to viable systems. Finally, full-system simulations are performed with MARSSx86 to better understand how this memory system interacts with a CPU, cache, and operating system executing an application. Full-system simulations uncover behaviors not present in simple limit-case simulations, such as the impact of address and channel mapping schemes or the organization of ports and the associated buffers. When applying insights gleaned from these simulations, optimal performance can be achieved while still considering outside constraints (e.g., pin-out, power, and fabrication costs).
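    The address and channel mapping schemes explored in such a design-space study amount to slicing the physical address into DRAM coordinates. The sketch below decodes one hypothetical configuration (4 channels, 2 ranks, 8 banks per rank, 64 B cache lines); the field order and widths are assumptions for illustration, not any vendor's layout.

```cpp
// Illustrative physical-address decoding of the kind a mapping-scheme
// study sweeps. Field widths are assumptions for a hypothetical
// configuration, not a real controller's layout.
#include <cstdint>
#include <iostream>

struct DramCoord { uint32_t channel, rank, bank; uint64_t row, column; };

DramCoord decode(uint64_t paddr) {
    uint64_t line = paddr >> 6;            // drop 64 B cache-line offset
    DramCoord c;
    c.channel = line & 0x3;  line >>= 2;   // low bits -> channel interleave
    c.bank    = line & 0x7;  line >>= 3;
    c.rank    = line & 0x1;  line >>= 1;
    c.column  = line & 0x7F; line >>= 7;   // 128 cache lines per row
    c.row     = line;
    return c;
}

int main() {
    DramCoord c = decode(0x12345678ULL);
    std::cout << "ch " << c.channel << " rank " << c.rank
              << " bank " << c.bank << " row " << c.row
              << " col " << c.column << "\n";
}
```

    Placing the channel bits in the low-order positions interleaves consecutive cache lines across channels; moving those bits is precisely the kind of mapping change the full-system simulations show can matter.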

    Doctor of Philosophy in Computing

    Get PDF
    The demand for main memory capacity has been increasing for many years and will continue to do so. In the past, Dynamic Random Access Memory (DRAM) process scaling has enabled this increase in memory capacity. Along with continued DRAM scaling, the emergence of new technologies like 3D-stacking, buffered Dual Inline Memory Modules (DIMMs), and crosspoint non-volatile memory promise to continue this trend in the years ahead. However, these technologies will bring with them their own gamut of problems. In this dissertation, I look at the problems facing these technologies from a current-delivery perspective. 3D-stacking increases the memory capacity available per package, but the increased current requirement means that more pins on the package now have to be dedicated to providing Vdd/Vss, increasing cost. At the system level, using buffered DIMMs to increase the number of DRAM ranks increases the peak current requirements of the system if all the DRAM chips in the system are refreshed simultaneously. Crosspoint memories promise to greatly increase bit densities but have long read latencies because of sneak currents in the crossbar. In this dissertation, I provide architectural solutions to each of these problems. We observe that smart data placement by the architecture and the Operating System (OS) is a vital ingredient in all of these solutions. We thereby mitigate major bottlenecks in these technologies, enabling higher memory densities.
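    To make the peak-current problem concrete: if all ranks refresh at the same instant, their refresh currents add up. A common countermeasure, sketched below with illustrative DDR-like constants rather than this dissertation's specific mechanism, is to stagger refresh commands across ranks within the tREFI window so the refresh current draw is spread out in time.

```cpp
// Sketch of staggered refresh scheduling: instead of refreshing every rank
// at the same instant (worst-case peak current), refresh commands are
// offset within the tREFI window. Constants are illustrative DDR-like
// values, not taken from the dissertation.
#include <cstdio>
#include <vector>

constexpr double tREFI_ns = 7800.0;   // average refresh interval per rank

int main() {
    const int ranks = 8;
    std::vector<double> nextRefresh(ranks);
    // Offset each rank's refresh by tREFI / ranks so at most one rank
    // draws refresh current at any moment.
    for (int r = 0; r < ranks; ++r)
        nextRefresh[r] = r * (tREFI_ns / ranks);
    for (int r = 0; r < ranks; ++r)
        std::printf("rank %d refreshes at t = %.1f ns\n", r, nextRefresh[r]);
}
```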

    RowPress: Amplifying Read Disturbance in Modern DRAM Chips

    Full text link
    Memory isolation is critical for system reliability, security, and safety. Unfortunately, read disturbance can break memory isolation in modern DRAM chips. For example, RowHammer is a well-studied read-disturb phenomenon where repeatedly opening and closing (i.e., hammering) a DRAM row many times causes bitflips in physically nearby rows. This paper experimentally demonstrates and analyzes another widespread read-disturb phenomenon, RowPress, in real DDR4 DRAM chips. RowPress breaks memory isolation by keeping a DRAM row open for a long period of time, which disturbs physically nearby rows enough to cause bitflips. We show that RowPress amplifies DRAM's vulnerability to read-disturb attacks by significantly reducing the number of row activations needed to induce a bitflip, by one to two orders of magnitude under realistic conditions. In extreme cases, RowPress induces bitflips in a DRAM row when an adjacent row is activated only once. Our detailed characterization of 164 real DDR4 DRAM chips shows that RowPress 1) affects chips from all three major DRAM manufacturers, 2) gets worse as DRAM technology scales down to smaller node sizes, and 3) affects a different set of DRAM cells from RowHammer and behaves differently from RowHammer as temperature and access pattern change. We demonstrate in a real DDR4-based system with RowHammer protection that 1) a user-level program induces bitflips by leveraging RowPress while conventional RowHammer cannot do so, and 2) a memory controller that adaptively keeps the DRAM row open for a longer period of time based on the access pattern can facilitate RowPress-based attacks. To prevent bitflips due to RowPress, we describe and evaluate a new methodology that adapts existing RowHammer mitigation techniques to also mitigate RowPress with low additional performance overhead. We open source all our code and data to facilitate future research on RowPress. (Extended version of the paper "RowPress: Amplifying Read Disturbance in Modern DRAM Chips" at the 50th Annual International Symposium on Computer Architecture (ISCA), 2023.)
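    One way to adapt an activation-count-based RowHammer mitigation to also cover RowPress, in the spirit of the methodology described above, is to charge each activation against a row's disturbance budget in proportion to how long the row stayed open. The sketch below is an illustrative adaptation with assumed constants, not the paper's exact mechanism.

```cpp
// Hedged sketch: weight each activation by row-open time so long-open
// activations (RowPress) consume the disturbance budget faster than
// rapid hammering alone. Thresholds and weights are assumptions.
#include <cstdint>
#include <iostream>
#include <unordered_map>

constexpr double kBaseThreshold = 4800.0;  // assumed RowHammer ACT threshold
std::unordered_map<uint64_t, double> budget; // per-row accumulated disturbance

// Called when a row is closed; tOnNs is how long it was kept open.
bool onRowClose(uint64_t row, double tOnNs) {
    // One "unit" per activation, scaled up when the row stays open longer
    // than the minimum open time (assumed tRAS of ~35 ns).
    budget[row] += tOnNs / 35.0;
    if (budget[row] >= kBaseThreshold) {
        budget[row] = 0;
        return true;   // caller should refresh physically adjacent rows
    }
    return false;
}

int main() {
    // A single long-open activation counts for ~223 hammer-like units here.
    bool act = onRowClose(42, 7800.0);   // row kept open for ~one tREFI
    std::cout << (act ? "refresh neighbors\n" : "ok\n");
}
```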

    ์„ฑ๋Šฅ๊ณผ ์šฉ๋Ÿ‰ ํ–ฅ์ƒ์„ ์œ„ํ•œ ์ ์ธตํ˜• ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์กฐ

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์œตํ•ฉ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™์› ์œตํ•ฉ๊ณผํ•™๋ถ€(์ง€๋Šฅํ˜•์œตํ•ฉ์‹œ์Šคํ…œ์ „๊ณต), 2019. 2. ์•ˆ์ •ํ˜ธ.The advance of DRAM manufacturing technology slows down, whereas the density and performance needs of DRAM continue to increase. This desire has motivated the industry to explore emerging Non-Volatile Memory (e.g., 3D XPoint) and the high-density DRAM (e.g., Managed DRAM Solution). Since such memory technologies increase the density at the cost of longer latency, lower bandwidth, or both, it is essential to use them with fast memory (e.g., conventional DRAM) to which hot pages are transferred at runtime. Nonetheless, we observe that page transfers to fast memory often block memory channels from servicing memory requests from applications for a long period. This in turn significantly increases the high-percentile response time of latency-sensitive applications. In this thesis, we propose a high-density managed DRAM architecture, dubbed 3D-XPath for applications demanding both low latency and high capacity for memory. 3D-XPath DRAM stacks conventional DRAM dies with high-density DRAM dies explored in this thesis and connects these DRAM dies with 3D-XPath. Especially, 3D-XPath allows unused memory channels to service memory requests from applications when primary channels supposed to handle the memory requests are blocked by page transfers at given moments, considerably increasing the high-percentile response time. This can also improve the throughput of applications frequently copying memory blocks between kernel and user memory spaces. Our evaluation shows that 3D-XPath DRAM decreases high-percentile response time of latency-sensitive applications by โˆผ30% while improving the throughput of an I/O-intensive applications by โˆผ39%, compared with DRAM without 3D-XPath. Recent computer systems are evolving toward the integration of more CPU cores into a single socket, which require higher memory bandwidth and capacity. Increasing the number of channels per socket is a common solution to the bandwidth demand and to better utilize these increased channels, data bus width is reduced and burst length is increased. However, this longer burst length brings increased DRAM access latency. On the memory capacity side, process scaling has been the answer for decades, but cell capacitance now limits how small a cell could be. 3D stacked memory solves this problem by stacking dies on top of other dies. We made a key observation in real multicore machine that multiple memory controllers are always not fully utilized on SPEC CPU 2006 rate benchmark. To bring these idle channels into play, we proposed memory channel sharing architecture to boost peak bandwidth of one memory channel and reduce the burst latency on 3D stacked memory. By channel sharing, the total performance on multi-programmed workloads and multi-threaded workloads improved up to respectively 4.3% and 3.6% and the average read latency reduced up to 8.22% and 10.18%.DRAM ์ œ์กฐ ๊ธฐ์ˆ ์˜ ๋ฐœ์ „์€ ์†๋„๊ฐ€ ๋Š๋ ค์ง€๋Š” ๋ฐ˜๋ฉด DRAM์˜ ๋ฐ€๋„ ๋ฐ ์„ฑ๋Šฅ ์š”๊ตฌ๋Š” ๊ณ„์† ์ฆ๊ฐ€ํ•˜๊ณ  ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ์š”๊ตฌ๋กœ ์ธํ•ด ์ƒˆ๋กœ์šด ๋น„ ํœ˜๋ฐœ์„ฑ ๋ฉ”๋ชจ๋ฆฌ(์˜ˆ: 3D-XPoint) ๋ฐ ๊ณ ๋ฐ€๋„ DRAM(์˜ˆ: Managed asymmetric latency DRAM Solution)์ด ๋“ฑ์žฅํ•˜์˜€๋‹ค. 
์ด๋Ÿฌํ•œ ๊ณ ๋ฐ€๋„ ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ˆ ์€ ๊ธด ๋ ˆ์ดํ„ด์‹œ, ๋‚ฎ์€ ๋Œ€์—ญํญ ๋˜๋Š” ๋‘ ๊ฐ€์ง€ ๋ชจ๋‘๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๋ฐ€๋„๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๊ธฐ ๋•Œ๋ฌธ์— ์„ฑ๋Šฅ์ด ์ข‹์ง€ ์•Š์•„, ํ•ซ ํŽ˜์ด์ง€๋ฅผ ๊ณ ์† ๋ฉ”๋ชจ๋ฆฌ(์˜ˆ: ์ผ๋ฐ˜ DRAM)๋กœ ์Šค์™‘๋˜๋Š” ์ €์šฉ๋Ÿ‰์˜ ๊ณ ์† ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋™์‹œ์— ์‚ฌ์šฉ๋˜๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์ ์ด๋‹ค. ์ด๋Ÿฌํ•œ ์Šค์™‘ ๊ณผ์ •์—์„œ ๋น ๋ฅธ ๋ฉ”๋ชจ๋ฆฌ๋กœ์˜ ํŽ˜์ด์ง€ ์ „์†ก์ด ์ผ๋ฐ˜์ ์ธ ์‘์šฉํ”„๋กœ๊ทธ๋žจ์˜ ๋ฉ”๋ชจ๋ฆฌ ์š”์ฒญ์„ ์˜ค๋žซ๋™์•ˆ ์ฒ˜๋ฆฌํ•˜์ง€ ๋ชปํ•˜๋„๋ก ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ๋Œ€๊ธฐ ์‹œ๊ฐ„์— ๋ฏผ๊ฐํ•œ ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์˜ ๋ฐฑ๋ถ„์œ„ ์‘๋‹ต ์‹œ๊ฐ„์„ ํฌ๊ฒŒ ์ฆ๊ฐ€์‹œ์ผœ, ์‘๋‹ต ์‹œ๊ฐ„์˜ ํ‘œ์ค€ ํŽธ์ฐจ๋ฅผ ์ฆ๊ฐ€์‹œํ‚จ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ณธ ํ•™์œ„ ๋…ผ๋ฌธ์—์„œ๋Š” ์ € ์ง€์—ฐ์‹œ๊ฐ„ ๋ฐ ๊ณ ์šฉ๋Ÿ‰ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์š”๊ตฌํ•˜๋Š” ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ์œ„ํ•ด 3D-XPath, ์ฆ‰ ๊ณ ๋ฐ€๋„ ๊ด€๋ฆฌ DRAM ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ด๋Ÿฌํ•œ 3D-ํ†”์†Œ๋ฅผ ์ง‘์ ํ•œ DRAM์€ ์ €์†์˜ ๊ณ ๋ฐ€๋„ DRAM ๋‹ค์ด๋ฅผ ๊ธฐ์กด์˜ ์ผ๋ฐ˜์ ์ธ DRAM ๋‹ค์ด์™€ ๋™์‹œ์— ํ•œ ์นฉ์— ์ ์ธตํ•˜๊ณ , DRAM ๋‹ค์ด๋ผ๋ฆฌ๋Š” ์ œ์•ˆํ•˜๋Š” 3D-XPath ํ•˜๋“œ์›จ์–ด๋ฅผ ํ†ตํ•ด ์—ฐ๊ฒฐ๋œ๋‹ค. ์ด๋Ÿฌํ•œ 3D-XPath๋Š” ํ•ซ ํŽ˜์ด์ง€ ์Šค์™‘์ด ์ผ์–ด๋‚˜๋Š” ๋™์•ˆ ์‘์šฉํ”„๋กœ๊ทธ๋žจ์˜ ๋ฉ”๋ชจ๋ฆฌ ์š”์ฒญ์„ ์ฐจ๋‹จํ•˜์ง€ ์•Š๊ณ  ์‚ฌ์šฉ๋Ÿ‰์ด ์ ์€ ๋ฉ”๋ชจ๋ฆฌ ์ฑ„๋„๋กœ ํ•ซ ํŽ˜์ด์ง€ ์Šค์™‘์„ ์ฒ˜๋ฆฌ ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์—ฌ, ๋ฐ์ดํ„ฐ ์ง‘์ค‘ ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์˜ ๋ฐฑ๋ถ„์œ„ ์‘๋‹ต ์‹œ๊ฐ„์„ ๊ฐœ์„ ์‹œํ‚จ๋‹ค. ๋˜ํ•œ ์ œ์•ˆํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ, ์ถ”๊ฐ€์ ์œผ๋กœ O/S ์ปค๋„๊ณผ ์œ ์ € ์ŠคํŽ˜์ด์Šค ๊ฐ„์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ธ”๋ก์„ ์ž์ฃผ ๋ณต์‚ฌํ•˜๋Š” ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์˜ ์ฒ˜๋ฆฌ๋Ÿ‰์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ 3D-XPath DRAM์€ 3D-XPath๊ฐ€ ์—†๋Š” DRAM์— ๋น„ํ•ด I/O ์ง‘์•ฝ์ ์ธ ์‘์šฉํ”„๋กœ๊ทธ๋žจ์˜ ์ฒ˜๋ฆฌ๋Ÿ‰์„ ์ตœ๋Œ€ 39 % ํ–ฅ์ƒ์‹œํ‚ค๋ฉด์„œ ๋ ˆ์ดํ„ด์‹œ์— ๋ฏผ๊ฐํ•œ ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์˜ ๋†’์€ ๋ฐฑ๋ถ„์œ„ ์‘๋‹ต ์‹œ๊ฐ„์„ ์ตœ๋Œ€ 30 %๊นŒ์ง€ ๊ฐ์†Œ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ ์ตœ๊ทผ์˜ ์ปดํ“จํ„ฐ ์‹œ์Šคํ…œ์€ ๋ณด๋‹ค ๋งŽ์€ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ๊ณผ ์šฉ๋Ÿ‰์„ ํ•„์š”๋กœํ•˜๋Š” ๋” ๋งŽ์€ CPU ์ฝ”์–ด๋ฅผ ๋‹จ์ผ ์†Œ์ผ“์œผ๋กœ ํ†ตํ•ฉํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์ง„ํ™”ํ•˜๊ณ  ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ์†Œ์ผ“ ๋‹น ์ฑ„๋„ ์ˆ˜๋ฅผ ๋Š˜๋ฆฌ๋Š” ๊ฒƒ์€ ๋Œ€์—ญํญ ์š”๊ตฌ์— ๋Œ€ํ•œ ์ผ๋ฐ˜์ ์ธ ํ•ด๊ฒฐ์ฑ…์ด๋ฉฐ, ์ตœ์‹ ์˜ DRAM ์ธํ„ฐํŽ˜์ด์Šค์˜ ๋ฐœ์ „ ์–‘์ƒ์€ ์ฆ๊ฐ€ํ•œ ์ฑ„๋„์„ ๋ณด๋‹ค ์ž˜ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ๋ฒ„์Šค ํญ์ด ๊ฐ์†Œ๋˜๊ณ  ๋ฒ„์ŠคํŠธ ๊ธธ์ด๊ฐ€ ์ฆ๊ฐ€ํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ธธ์–ด์ง„ ๋ฒ„์ŠคํŠธ ๊ธธ์ด๋Š” DRAM ์•ก์„ธ์Šค ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์ฆ๊ฐ€์‹œํ‚จ๋‹ค. ์ถ”๊ฐ€์ ์œผ๋กœ ์ตœ์‹ ์˜ ์‘์šฉํ”„๋กœ๊ทธ๋žจ์€ ๋” ๋งŽ์€ ๋ฉ”๋ชจ๋ฆฌ ์šฉ๋Ÿ‰์„ ์š”๊ตฌํ•˜๋ฉฐ, ๋ฏธ์„ธ ๊ณต์ •์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์šฉ๋Ÿ‰์„ ์ฆ๊ฐ€์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•๋ก ์€ ์ˆ˜์‹ญ ๋…„ ๋™์•ˆ ์‚ฌ์šฉ๋˜์—ˆ์ง€๋งŒ, 20 nm ์ดํ•˜์˜ ๋ฏธ์„ธ๊ณต์ •์—์„œ๋Š” ๋” ์ด์ƒ ๊ณต์ • ๋ฏธ์„ธํ™”๋ฅผ ํ†ตํ•ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ€๋„๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๊ธฐ๊ฐ€ ์–ด๋ ค์šด ์ƒํ™ฉ์ด๋ฉฐ, ์ ์ธตํ˜• ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์šฉ๋Ÿ‰์„ ์ฆ๊ฐ€์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ์ƒํ™ฉ์—์„œ, ์‹ค์ œ ์ตœ์‹ ์˜ ๋ฉ€ํ‹ฐ์ฝ”์–ด ๋จธ์‹ ์—์„œ SPEC CPU 2006 ์‘์šฉํ”„๋กœ๊ทธ๋žจ์„ ๋ฉ€ํ‹ฐ์ฝ”์–ด์—์„œ ์‹คํ–‰ํ•˜์˜€์„ ๋•Œ, ํ•ญ์ƒ ์‹œ์Šคํ…œ์˜ ๋ชจ๋“  ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ๊ฐ€ ์™„์ „ํžˆ ํ™œ์šฉ๋˜์ง€ ์•Š๋Š”๋‹ค๋Š” ์‚ฌ์‹ค์„ ๊ด€์ฐฐํ–ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์œ ํœด ์ฑ„๋„์„ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด ํ•˜๋‚˜์˜ ๋ฉ”๋ชจ๋ฆฌ ์ฑ„๋„์˜ ํ”ผํฌ ๋Œ€์—ญํญ์„ ๋†’์ด๊ณ  3D ์Šคํƒ ๋ฉ”๋ชจ๋ฆฌ์˜ ๋ฒ„์ŠคํŠธ ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ๋ณธ ํ•™์œ„ ๋…ผ๋ฌธ์—์„œ๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ฑ„๋„ ๊ณต์œ  ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ œ์•ˆํ•˜์˜€์œผ๋ฉฐ, ํ•˜๋“œ์›จ์–ด ๋ธ”๋ก์„ ์ œ์•ˆํ•˜์˜€๋‹ค. 
์ด๋Ÿฌํ•œ ์ฑ„๋„ ๊ณต์œ ๋ฅผ ํ†ตํ•ด ๋ฉ€ํ‹ฐ ํ”„๋กœ๊ทธ๋žจ ๋œ ์‘์šฉํ”„๋กœ๊ทธ๋žจ ๋ฐ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ์‘์šฉํ”„๋กœ๊ทธ๋žจ ์„ฑ๋Šฅ์ด ๊ฐ๊ฐ 4.3 % ๋ฐ 3.6 %๋กœ ํ–ฅ์ƒ๋˜์—ˆ์œผ๋ฉฐ ํ‰๊ท  ์ฝ๊ธฐ ๋Œ€๊ธฐ ์‹œ๊ฐ„์€ 8.22 % ๋ฐ 10.18 %๋กœ ๊ฐ์†Œํ•˜์˜€๋‹ค.Contents Abstract i Contents iv List of Figures vi List of Tables viii Introduction 1 1.1 3D-XPath: High-Density Managed DRAM Architecture with Cost-effective Alternative Paths for Memory Transactions 5 1.2 Boosting Bandwidth โ€“ Dynamic Channel Sharing on 3D Stacked Memory 9 1.3 Research contribution 13 1.4 Outline 14 3D-stacked Heterogeneous Memory Architecture with Cost-effective Extra Block Transfer Paths 17 2.1 Background 17 2.1.1 Heterogeneous Main Memory Systems 17 2.1.2 Specialized DRAM 19 2.1.3 3D-stacked Memory 22 2.2 HIGH-DENSITY DRAM ARCHITECTURE 27 2.2.1 Key Design Challenges 29 2.2.2 Plausible High-density DRAM Designs 33 2.3 3D-STACKED DRAM WITH ALTERNATIVE PATHS FOR MEMORY TRANSACTIONS 37 2.3.1 3D-XPath Architecture 41 2.3.2 3D-XPath Management 46 2.4 EXPERIMENTAL METHODOLOGY 52 2.5 EVALUATION 56 2.5.1 OLDI Workloads 56 2.5.2 Non-OLDI Workloads 61 2.5.3 Sensitivity Analysis 66 2.6 RELATED WORK 70 Boosting bandwidth โ€“Dynamic Channel Sharing on 3D Stacked Memory 72 3.1 Background: Memory Operations 72 3.1.1. Memory Controller 72 3.1.2 DRAM column access sequence 73 3.2 Related Work 74 3.3. CHANNEL SHARING ENABLED MEMORY SYSTEM 76 3.3.1 Hardware Requirements 78 3.3.2 Operation Sequence 81 3.4 Analysis 87 3.4.1 Experiment Environment 87 3.4.2 Performance 88 3.4.3 Overhead 90 CONCLUSION 92 REFERENCES 94 ๊ตญ๋ฌธ์ดˆ๋ก 107Docto

    ์ƒ๋ณ€ํ™” ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ๊ฐ„์„ญ ์˜ค๋ฅ˜ ์™„ํ™” ๋ฐ RMW ์„ฑ๋Šฅ ํ–ฅ์ƒ ๊ธฐ๋ฒ•

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2021.8. ์ดํ˜์žฌ.Phase-change memory (PCM) announces the beginning of the new era of memory systems, owing to attractive characteristics. Many memory product manufacturers (e.g., Intel, SK Hynix, and Samsung) are developing related products. PCM can be applied to various circumstances; it is not simply limited to an extra-scale database. For example, PCM has a low standby power due to its non-volatility; hence, computation-intensive applications or mobile applications (i.e., long memory idle time) are suitable to run on PCM-based computing systems. Despite these fascinating features of PCM, PCM is still far from the general commercial market due to low reliability and long latency problems. In particular, low reliability is a painful problem for PCM in past decades. As the semiconductor process technology rapidly scales down over the years, DRAM reaches 10 nm class process technology. In addition, it is reported that the write disturbance error (WDE) would be a serious issue for PCM if it scales down below 54 nm class process technology. Therefore, addressing the problem of WDEs becomes essential to make PCM competitive to DRAM. To overcome this problem, this dissertation proposes a novel approach that can restore meta-stable cells on demand by levering two-level SRAM-based tables, thereby significantly reducing the number WDEs. Furthermore, a novel randomized approach is proposed to implement a replacement policy that originally requires hundreds of read ports on SRAM. The second problem of PCM is a long-latency compared to that of DRAM. In particular, PCM tries to enhance its throughput by adopting a larger transaction unit; however, the different unit size from the general-purpose processor cache line further degrades the system performance due to the introduction of a read-modify-write (RMW) module. Since there has never been any research related to RMW in a PCM-based memory system, this dissertation proposes a novel architecture to enhance the overall system performance and reliability of a PCM-based memory system having an RMW module. The proposed architecture enhances data re-usability without introducing extra storage resources. Furthermore, a novel operation that merges commands regardless of command types is proposed to enhance performance notably. Another problem is the absence of a full simulation platform for PCM. While the announced features of the PCM-related product (i.e., Intel Optane) are scarce due to confidential issues, all priceless information can be integrated to develop an architecture simulator that resembles the available product. To this end, this dissertation tries to scrape up all available features of modules in a PCM controller and implement a dedicated simulator for future research purposes.์ƒ๋ณ€ํ™” ๋ฉ”๋ชจ๋ฆฌ๋Š”(PCM) ๋งค๋ ฅ์ ์ธ ํŠน์„ฑ์„ ํ†ตํ•ด ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ์ƒˆ๋กœ์šด ์‹œ๋Œ€์˜ ์‹œ์ž‘์„ ์•Œ๋ ธ๋‹ค. ๋งŽ์€ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ จ ์ œํ’ˆ ์ œ์กฐ์—…์ฒด(์˜ˆ : ์ธํ…”, SK ํ•˜์ด๋‹‰์Šค, ์‚ผ์„ฑ)๊ฐ€ ๊ด€๋ จ ์ œํ’ˆ ๊ฐœ๋ฐœ์— ๋ฐ•์ฐจ๋ฅผ ๊ฐ€ํ•˜๊ณ  ์žˆ๋‹ค. PCM์€ ๋‹จ์ˆœํžˆ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์—๋งŒ ๊ตญํ•œ๋˜์ง€ ์•Š๊ณ  ๋‹ค์–‘ํ•œ ์ƒํ™ฉ์— ์ ์šฉ๋  ์ˆ˜ ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, PCM์€ ๋น„ํœ˜๋ฐœ์„ฑ์œผ๋กœ ์ธํ•ด ๋Œ€๊ธฐ ์ „๋ ฅ์ด ๋‚ฎ๋‹ค. ๋”ฐ๋ผ์„œ ๊ณ„์‚ฐ ์ง‘์•ฝ์ ์ธ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ๋˜๋Š” ๋ชจ๋ฐ”์ผ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์€(์ฆ‰, ๊ธด ๋ฉ”๋ชจ๋ฆฌ ์œ ํœด ์‹œ๊ฐ„) PCM ๊ธฐ๋ฐ˜ ์ปดํ“จํŒ… ์‹œ์Šคํ…œ์—์„œ ์‹คํ–‰ํ•˜๊ธฐ์— ์ ํ•ฉํ•˜๋‹ค. 
PCM์˜ ์ด๋Ÿฌํ•œ ๋งค๋ ฅ์ ์ธ ํŠน์„ฑ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  PCM์€ ๋‚ฎ์€ ์‹ ๋ขฐ์„ฑ๊ณผ ๊ธด ๋Œ€๊ธฐ ์‹œ๊ฐ„์œผ๋กœ ์ธํ•ด ์—ฌ์ „ํžˆ ์ผ๋ฐ˜ ์‚ฐ์—… ์‹œ์žฅ์—์„œ๋Š” DRAM๊ณผ ๋‹ค์†Œ ๊ฒฉ์ฐจ๊ฐ€ ์žˆ๋‹ค. ํŠนํžˆ ๋‚ฎ์€ ์‹ ๋ขฐ์„ฑ์€ ์ง€๋‚œ ์ˆ˜์‹ญ ๋…„ ๋™์•ˆ PCM ๊ธฐ์ˆ ์˜ ๋ฐœ์ „์„ ์ €ํ•ดํ•˜๋Š” ๋ฌธ์ œ๋‹ค. ๋ฐ˜๋„์ฒด ๊ณต์ • ๊ธฐ์ˆ ์ด ์ˆ˜๋…„์— ๊ฑธ์ณ ๋น ๋ฅด๊ฒŒ ์ถ•์†Œ๋จ์— ๋”ฐ๋ผ DRAM์€ 10nm ๊ธ‰ ๊ณต์ • ๊ธฐ์ˆ ์— ๋„๋‹ฌํ•˜์˜€๋‹ค. ์ด์–ด์„œ, ์“ฐ๊ธฐ ๋ฐฉํ•ด ์˜ค๋ฅ˜ (WDE)๊ฐ€ 54nm ๋“ฑ๊ธ‰ ํ”„๋กœ์„ธ์Šค ๊ธฐ์ˆ  ์•„๋ž˜๋กœ ์ถ•์†Œ๋˜๋ฉด PCM์— ์‹ฌ๊ฐํ•œ ๋ฌธ์ œ๊ฐ€ ๋  ๊ฒƒ์œผ๋กœ ๋ณด๊ณ ๋˜์—ˆ๋‹ค. ๋”ฐ๋ผ์„œ, WDE ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ์€ PCM์ด DRAM๊ณผ ๋™๋“ฑํ•œ ๊ฒฝ์Ÿ๋ ฅ์„ ๊ฐ–์ถ”๋„๋ก ํ•˜๋Š” ๋ฐ ์žˆ์–ด ํ•„์ˆ˜์ ์ด๋‹ค. ์ด ๋ฌธ์ œ๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ์ด ๋…ผ๋ฌธ์—์„œ๋Š” 2-๋ ˆ๋ฒจ SRAM ๊ธฐ๋ฐ˜ ํ…Œ์ด๋ธ”์„ ํ™œ์šฉํ•˜์—ฌ WDE ์ˆ˜๋ฅผ ํฌ๊ฒŒ ์ค„์—ฌ ํ•„์š”์— ๋”ฐ๋ผ ์ค€ ์•ˆ์ • ์…€์„ ๋ณต์›ํ•  ์ˆ˜ ์žˆ๋Š” ์ƒˆ๋กœ์šด ์ ‘๊ทผ ๋ฐฉ์‹์„ ์ œ์•ˆํ•œ๋‹ค. ๋˜ํ•œ, ์›๋ž˜ SRAM์—์„œ ์ˆ˜๋ฐฑ ๊ฐœ์˜ ์ฝ๊ธฐ ํฌํŠธ๊ฐ€ ํ•„์š”ํ•œ ๋Œ€์ฒด ์ •์ฑ…์„ ๊ตฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ์ƒˆ๋กœ์šด ๋žœ๋ค ๊ธฐ๋ฐ˜์˜ ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. PCM์˜ ๋‘ ๋ฒˆ์งธ ๋ฌธ์ œ๋Š” DRAM์— ๋น„ํ•ด ์ง€์—ฐ ์‹œ๊ฐ„์ด ๊ธธ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ํŠนํžˆ PCM์€ ๋” ํฐ ํŠธ๋žœ์žญ์…˜ ๋‹จ์œ„๋ฅผ ์ฑ„ํƒํ•˜์—ฌ ๋‹จ์œ„์‹œ๊ฐ„ ๋‹น ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋Ÿ‰ ํ–ฅ์ƒ์„ ๋„๋ชจํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ฒ”์šฉ ํ”„๋กœ์„ธ์„œ ์บ์‹œ ๋ผ์ธ๊ณผ ๋‹ค๋ฅธ ์œ ๋‹› ํฌ๊ธฐ๋Š” ์ฝ๊ธฐ-์ˆ˜์ •-์“ฐ๊ธฐ (RMW) ๋ชจ๋“ˆ์˜ ๋„์ž…์œผ๋กœ ์ธํ•ด ์‹œ์Šคํ…œ ์„ฑ๋Šฅ์„ ์ €ํ•˜ํ•˜๊ฒŒ ๋œ๋‹ค. PCM ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์—์„œ RMW ๊ด€๋ จ ์—ฐ๊ตฌ๊ฐ€ ์—†์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ๋ณธ ๋…ผ๋ฌธ์€ RMW ๋ชจ๋“ˆ์„ ํƒ‘์žฌ ํ•œ PCM ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ์ „๋ฐ˜์ ์ธ ์‹œ์Šคํ…œ ์„ฑ๋Šฅ๊ณผ ์‹ ๋ขฐ์„ฑ์„ ํ–ฅ์ƒํ•˜๊ฒŒ ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์ƒˆ๋กœ์šด ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆ๋œ ์•„ํ‚คํ…์ฒ˜๋Š” ์ถ”๊ฐ€ ์Šคํ† ๋ฆฌ์ง€ ๋ฆฌ์†Œ์Šค๋ฅผ ๋„์ž…ํ•˜์ง€ ์•Š๊ณ ๋„ ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ์„ฑ์„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค. ๋˜ํ•œ, ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•ด ๋ช…๋ น ์œ ํ˜•๊ณผ ๊ด€๊ณ„์—†์ด ๋ช…๋ น์„ ๋ณ‘ํ•ฉํ•˜๋Š” ์ƒˆ๋กœ์šด ์ž‘์—…์„ ์ œ์•ˆํ•œ๋‹ค. ๋˜ ๋‹ค๋ฅธ ๋ฌธ์ œ๋Š” PCM์„ ์œ„ํ•œ ์™„์ „ํ•œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ํ”Œ๋žซํผ์ด ๋ถ€์žฌํ•˜๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. PCM ๊ด€๋ จ ์ œํ’ˆ(์˜ˆ : Intel Optane)์— ๋Œ€ํ•ด ๋ฐœํ‘œ๋œ ์ •๋ณด๋Š” ๋Œ€์™ธ๋น„ ๋ฌธ์ œ๋กœ ์ธํ•ด ๋ถ€์กฑํ•˜๋‹ค. ํ•˜์ง€๋งŒ ์•Œ๋ ค์ ธ ์žˆ๋Š” ์ •๋ณด๋ฅผ ์ ์ ˆํžˆ ์ทจํ•ฉํ•˜๋ฉด ์‹œ์ค‘ ์ œํ’ˆ๊ณผ ์œ ์‚ฌํ•œ ์•„ํ‚คํ…์ฒ˜ ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ๋ฅผ ๊ฐœ๋ฐœํ•  ์ˆ˜ ์žˆ๋‹ค. 
์ด๋ฅผ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์€ PCM ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ์— ํ•„์š”ํ•œ ๋ชจ๋“  ๋ชจ๋“ˆ ์ •๋ณด๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํ–ฅํ›„ ์ด์™€ ๊ด€๋ จ๋œ ์—ฐ๊ตฌ์—์„œ ์ถฉ๋ถ„ํžˆ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์ „์šฉ ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ๋ฅผ ๊ตฌํ˜„ํ•˜์˜€๋‹ค.1 INTRODUCTION 1 1.1 Limitation of Traditional Main Memory Systems 1 1.2 Phase-Change Memory as Main Memory 3 1.2.1 Opportunities of PCM-based System 3 1.2.2 Challenges of PCM-based System 4 1.3 Dissertation Overview 7 2 BACKGROUND AND PREVIOUS WORK 8 2.1 Phase-Change Memory 8 2.2 Mitigation Schemes for Write Disturbance Errors 10 2.2.1 Write Disturbance Errors 10 2.2.2 Verification and Correction 12 2.2.3 Lazy Correction 13 2.2.4 Data Encoding-based Schemes 14 2.2.5 Sparse-Insertion Write Cache 16 2.3 Performance Enhancement for Read-Modify-Write 17 2.3.1 Traditional Read-Modify-Write 17 2.3.2 Write Coalescing for RMW 19 2.4 Architecture Simulators for PCM 21 2.4.1 NVMain 21 2.4.2 Ramulator 22 2.4.3 DRAMsim3 22 3 IN-MODULE DISTURBANCE BARRIER 24 3.1 Motivation 25 3.2 IMDB: In Module-Disturbance Barrier 29 3.2.1 Architectural Overview 29 3.2.2 Implementation of Data Structures 30 3.2.3 Modification of Media Controller 36 3.3 Replacement Policy 38 3.3.1 Replacement Policy for IMDB 38 3.3.2 Approximate Lowest Number Estimator 40 3.4 Putting All Together: Case Studies 43 3.5 Evaluation 45 3.5.1 Configuration 45 3.5.2 Architectural Exploration 47 3.5.3 Effectiveness of the Replacement Policy 48 3.5.4 Sensitivity to Main Table Configuration 49 3.5.5 Sensitivity to Barrier Buffer Size 51 3.5.6 Sensitivity to AppLE Group Size 52 3.5.7 Comparison with Other Studies 54 3.6 Discussion 59 3.7 Summary 63 4 INTEGRATION OF AN RMW MODULE IN A PCM-BASED SYSTEM 64 4.1 Motivation 65 4.2 Utilization of DRAM Cache for RMW 67 4.2.1 Architectural Design 67 4.2.2 Algorithm 70 4.3 Typeless Command Merging 73 4.3.1 Architectural Design 73 4.3.2 Algorithm 74 4.4 An Alternative Implementation: SRC-RMW 78 4.4.1 Implementation of SRC-RMW 78 4.4.2 Design Constraint 80 4.5 Case Study 82 4.6 Evaluation 85 4.6.1 Configuration 85 4.6.2 Speedup 88 4.6.3 Read Reliability 91 4.6.4 Energy Consumption: Selecting a Proper Page Size 93 4.6.5 Comparison with Other Studies 95 4.7 Discussion 97 4.8 Summary 99 5 AN ALL-INCLUSIVE SIMULATOR FOR A PCM CONTROLLER 100 5.1 Motivation 101 5.2 PCMCsim: PCM Controller Simulator 103 5.2.1 Architectural Overview 103 5.2.2 Underlying Classes of PCMCsim 104 5.2.3 Implementation of Contention Behavior 108 5.2.4 Modules of PCMCsim 109 5.3 Evaluation 116 5.3.1 Correctness of the Simulator 116 5.3.2 Comparison with Other Simulators 117 5.4 Summary 119 6 Conclusion 120 Abstract (In Korean) 141 Acknowledgment 143๋ฐ•

    ์ž„๋ฒ ๋””๋“œ ์‹œ์Šคํ…œ์—์„œ ์—ฌ๋Ÿฌ ์ปจ๋ณผ๋ฃจ์…˜ ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ๋ฅผ ์œ„ํ•œ ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ณ ๋ คํ•˜๋Š” ์†Œํ”„ํŠธ์›จ์–ด ์ตœ์ ํ™” ๊ธฐ๋ฒ•

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021. 2. ํ•˜์ˆœํšŒ.์ž„๋ฒ ๋””๋“œ ๊ธฐ๊ธฐ๋Š” ๋Œ€๊ฐœ ๊ณ„์‚ฐ๋Ÿ‰, ๋ฉ”๋ชจ๋ฆฌ ํฌ๊ธฐ, ์—๋„ˆ์ง€ ์†Œ๋ชจ๋Ÿ‰ ๋“ฑ์˜ ์ œ์•ฝ ์‚ฌํ•ญ์ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ์„ ์ž„๋ฒ ๋””๋“œ ๊ธฐ๊ธฐ์—์„œ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์€ ์‰ฝ์ง€ ์•Š๋‹ค. ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ์˜ ๊ณ„์‚ฐ๋Ÿ‰ ์ฆ๊ฐ€๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ๋ชจ๋ฐ”์ผ GPU, ๋””์ง€ํ„ธ ์‹ ํ˜ธ ์ฒ˜๋ฆฌ ํ”„๋กœ์„ธ์„œ์„ ์‚ฌ์šฉํ•˜๊ฑฐ๋‚˜, ๋˜๋Š” ์ƒˆ๋กœ์šด ๋‰ด๋Ÿด ํ”„๋กœ์„ธ์„œ ์นฉ์„ ๋งŒ๋“œ๋ ค๋Š” ํ•˜๋“œ์›จ์–ด ์˜์—ญ์˜ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์ด ์žˆ๋‹ค. ๋ฐ˜๋ฉด์— ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ ์˜์—ญ์—์„œ๋Š” ์ƒˆ๋กœ์šด ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ์„ ๋งŒ๋“ค๊ฑฐ๋‚˜, ๋”ฅ ๋Ÿฌ๋‹์˜ ํ†ต๊ณ„์ ์ธ ํŠน์„ฑ์„ ์ด์šฉํ•œ ๊ทผ์‚ฌ ๊ณ„์‚ฐ ๋ฐฉ๋ฒ•์„ ์ด์šฉํ•˜์—ฌ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜๊ณ  ์žˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋˜ ๋‹ค๋ฅธ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์œผ๋กœ๋Š” ๋จผ์ € ํ•˜๋“œ์›จ์–ด ํ”Œ๋žซํผ์˜ ์„ฑ๋Šฅ ๋ณ‘๋ชฉ ๋ถ€๋ถ„์„ ์ฐพ๊ณ , ์ผ์„ ๋™๋“ฑํ•˜๊ฒŒ ์—ฌ๋Ÿฌ ๊ณ„์‚ฐ ์ž์›์— ๋ถ„๋ฐฐํ•˜์—ฌ ์ตœ์ ํ™”ํ•˜๋Š” ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ณ ๋ คํ•œ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์ด ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ณ ๋ คํ•œ ์†Œํ”„ํŠธ์›จ์–ด ์ตœ์ ํ™” ๋ฐฉ๋ฒ•๋“ค์„ ๊ณ ์•ˆํ•˜์˜€๋‹ค. ๋จผ์ €, LPIRC ๋Œ€ํšŒ์— ์ฐธ๊ฐ€ํ•œ ๊ฒฝํ—˜์„ ๋ฐ”ํƒ•์œผ๋กœ ์ž„๋ฒ ๋””๋“œ ๋”ฅ ๋Ÿฌ๋‹ ์‹œ์Šคํ…œ์„ ์ตœ์ ํ™”ํ•˜๋Š” ์ฒด๊ณ„์ ์ธ ๋ฐฉ๋ฒ•๋ก ์„ ๊ณ ์•ˆํ•˜๊ณ , ๊ทธ ๋ฐฉ๋ฒ•๋ก ์— ๋”ฐ๋ฅธ C-GOOD์ด๋ผ๋Š” ๋”ฅ ๋Ÿฌ๋‹ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๊ตฌํ˜„ํ•˜์˜€๋‹ค. C-GOOD์€ ํ•˜๋“œ์›จ์–ด ํ”Œ๋žซํผ์— ๋…๋ฆฝ์ ์œผ๋กœ ์ž‘๋™ํ•˜๊ธฐ ์œ„ํ•ด ๋Œ€๋ถ€๋ถ„์˜ ์ž„๋ฒ ๋””๋“œ ๊ธฐ๊ธฐ์—์„œ ์ปดํŒŒ์ผ, ์ˆ˜ํ–‰์ด ๊ฐ€๋Šฅํ•œ C ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. ๋˜ํ•œ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ ์˜์—ญ์˜ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์„ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์˜ต์…˜๊ณผ ์‹œ์Šคํ…œ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•˜์˜€๋‹ค. ์ด ๋ฐฉ๋ฒ•๋ก ์„ Jetson TX2, Odroid XU4, SRP ๋“ฑ์˜ ์„œ๋กœ ๋‹ค๋ฅธ 3๊ฐœ์˜ ๊ธฐ๊ธฐ์— ์ ์šฉํ•ด ๋ด„์œผ๋กœ์จ, ๊ณ ์•ˆ๋œ ๋ฐฉ๋ฒ•๋ก ์ด ํ•˜๋“œ์›จ์–ด ํ”Œ๋žซํผ์— ๋…๋ฆฝ์ ์ด๋ฉฐ C-GOOD์„ ํ†ตํ•ด ์‰ฝ๊ฒŒ ์—ฌ๋Ÿฌ ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์„ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ์ตœ๊ทผ ์ž„๋ฒ ๋””๋“œ ๊ธฐ๊ธฐ์— ์ด์ข… ํ”„๋กœ์„ธ์„œ๋“ค์ด ๋งŽ์ด ํƒ‘์žฌ๋˜๊ณ  ์žˆ๊ณ , ๋™์‹œ์— ์ž์œจ ์ฃผํ–‰ ์ž๋™์ฐจ์™€ ์Šค๋งˆํŠธํฐ ๋“ฑ์˜ ํ•˜๋‚˜์˜ ์ž„๋ฒ ๋””๋“œ ๊ธฐ๊ธฐ์—์„œ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ์„ ๋™์‹œ์— ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์ด ํ•„์š”ํ•ด์ง€๊ณ  ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์—ฌ๋Ÿฌ ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ์„ ์ด์ข… ํ”„๋กœ์„ธ์„œ๋“ค์„ ํƒ‘์žฌํ•œ ์ž„๋ฒ ๋””๋“œ ๊ธฐ๊ธฐ์— ์Šค์ผ€์ค„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๊ณ ์•ˆํ•˜๊ณ , ์Šค์ผ€์ค„๋ง ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๊ตฌํ˜„ํ•˜์˜€๋‹ค. ์ด ๋ฐฉ๋ฒ•๋ก ์€ ์‹ค์ œ ๊ธฐ๊ธฐ์—์„œ์˜ ํ”„๋กœํŒŒ์ผ๋ง๋ถ€ํ„ฐ ์Šค์ผ€์ค„ ๊ฒฐ๊ณผ๋ฅผ ์‹ค์ œ ๊ธฐ๊ธฐ์—์„œ ํ™•์ธํ•˜๋Š” ๊ณผ์ •๊นŒ์ง€ ํฌํ•จํ•˜๋ฉฐ, ์‹ค์ œ ๊ธฐ๊ธฐ์—์„œ ๋ฐœ์ƒํ•˜๋Š” ์ด์Šˆ๋“ค์ธ DVFS, CPU Hot-plug ๋“ฑ์„ ๊ณ ๋ คํ•˜์˜€๋‹ค. ์ด์ข… ํ”„๋กœ์„ธ์„œ๋กœ์˜ ์Šค์ผ€์ค„๋ง ๊ธฐ๋ฒ•์œผ๋กœ๋Š” ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ๋ฉ”ํƒ€ ํœด๋ฆฌ์Šคํ‹ฑ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์œ ์ „ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ํŠนํžˆ, ์„œ๋กœ ๋‹ค๋ฅธ ์ฃผ๊ธฐ์™€ ์ƒ๋Œ€ ์˜คํ”„์…‹์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ์—ฌ๋Ÿฌ ์‘์šฉ์„ ๋™์‹œ์— ์Šค์ผ€์ค„ํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋ชจ๋“  ํƒœ์Šคํฌ๋“ค์˜ ์Šค์ผ€์ค„ ๊ฐ€๋Šฅ์„ฑ์„ ๊ณ ๋ คํ•˜์—ฌ ์Šค์ผ€์ค„ํ•˜์˜€๋‹ค. ์Šค์ผ€์ค„ ๊ฒฐ๊ณผ๋ฅผ ๊ฒ€์ฆํ•˜๊ธฐ ์œ„ํ•ด์„œ, ACL์˜ ์ฝ”์–ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ๋”ฅ ๋Ÿฌ๋‹ ์ถ”๋ก  ์‘์šฉ์„ ๊ตฌํ˜„ํ•˜์˜€์œผ๋ฉฐ, ์Šค์ผ€์ค„ ๊ฒฐ๊ณผ์™€ ๊ฐ™์ด ๊ฐ ๋ ˆ์ด์–ด๋“ค์„ ์‹ค์ œ ํ•˜๋“œ์›จ์–ด์˜ ์„œ๋กœ ๋‹ค๋ฅธ ํ”„๋กœ์„ธ์„œ ๋งคํ•‘ํ•˜๋„๋ก ๊ตฌํ˜„ํ•˜์˜€๋‹ค. ๊ฐค๋Ÿญ์‹œ S9 ์Šค๋งˆํŠธํฐ๊ณผ Hikey 970 ๋ณด๋“œ์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ๋‘๊ฐœ์˜ ๋”ฅ ๋Ÿฌ๋‹ ๋„คํŠธ์›Œํฌ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ , ์Šค์ผ€์ค„ ๊ฒฐ๊ณผ์™€ ๋น„๊ตํ•˜์—ฌ ๋ฐฉ๋ฒ•๋ก ์„ ๊ฒ€์ฆํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. 
์ด์ „ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•๋“ค์ด ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ์˜ ๊ณ„์‚ฐ๋Ÿ‰๊ณผ ํ”„๋กœ์„ธ์„œ๋“ค์— ์ง‘์ค‘ํ•˜์˜€๋Š”๋ฐ, ๋”ฅ ๋Ÿฌ๋‹ ๊ฐ€์†๊ธฐ ๋˜๋Š” NPU์˜ ์„ฑ๋Šฅ ๋ณ‘๋ชฉ์ด ์ƒ๊ธฐ๋Š” ์›์ธ์€ ์˜คํ”„ ์นฉ ๋ฉ”๋ชจ๋ฆฌ์™€ ์˜จ ์นฉ ์‚ฌ์ด์˜ ํ†ต์‹ ์ด๋‹ค. ๋”์šฑ์ด ์˜คํ”„ ์นฉ ๋ฉ”๋ชจ๋ฆฌ DRAM ์ ‘๊ทผ์€ NPU์˜ ์ „๋ ฅ์†Œ๋ชจ์˜ ๋งŽ์€ ๋ถ€๋ถ„์„ ์ฐจ์ง€ํ•œ๋‹ค๊ณ  ์•Œ๋ ค์ ธ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ด์™€ ๊ฐ™์€ ์˜คํ”„ ์นฉ DRAM ์ ‘๊ทผ์œผ๋กœ ์ธํ•œ NPU์˜ ์„ฑ๋Šฅ๊ณผ ์—๋„ˆ์ง€ ์˜ํ–ฅ์„ ์ค„์ด๊ณ ์ž ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์˜จ ์นฉ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ๋ฅผ ๊ด€๋ฆฌํ•˜๋Š” ์ปดํŒŒ์ผ๋Ÿฌ ๊ธฐ๋ฒ•์„ ๊ณ ์•ˆํ•˜์˜€๋‹ค. ์˜จ ์นฉ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋ฑ…ํฌ๋กœ ๊ตฌ์„ฑํ•˜๊ณ  ์—ฐ์‚ฐ ๋„์ค‘์— ์ธํ’‹ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฏธ๋ฆฌ ๋กœ๋“œํ•จ์œผ๋กœ์จ ์—ฐ์‚ฐ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ๊ณผ ๋ ˆ์ด์–ด์˜ ์•„์›ƒํ’‹์„ ์˜จ ์นฉ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์žฌ์‚ฌ์šฉํ•˜์—ฌ ์˜คํ”„ ์นฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์„ ์ด์šฉํ•˜์—ฌ ์„œ๋กœ ๋‹ค๋ฅธ ๋‘ ๊ฐ€์ง€์˜ ๋ชฉ์  ํ•จ์ˆ˜๋ฅผ ๊ฐ€์ง„ ๋‘ ๊ฐ€์ง€ ๊ธฐ๋ฒ•์„ ๊ณ ์•ˆํ•˜์˜€๋‹ค. ๋ชฉ์  ํ•จ์ˆ˜๋Š” ๊ฐ๊ฐ ์˜คํ”„ ์นฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ๊ณผ ์˜คํ”„ ์นฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์œผ๋กœ ์ธํ•œ ํ”„๋กœ์„ธ์„œ๋“ค์˜ ์ฒ˜๋ฆฌ ์ง€์—ฐ์‹œ๊ฐ„์„ ์ค„์ด๋Š” ๊ฒƒ์ด๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ 5๊ฐœ์˜ ๋”ฅ ๋Ÿฌ๋‹ ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์ดํด ๋ ˆ๋ฒจ NPU ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ์—์„œ ์ˆ˜ํ–‰ํ•˜์—ฌ ๋‘ ๋ชฉ์  ํ•จ์ˆ˜์— ๋”ฐ๋ฅธ ์ ˆ์ถฉ (Trade-off) ๊ด€๊ณ„ ๋ฅผ ํ™•์ธํ•˜์˜€๋‹ค. ๋˜ํ•œ ์˜จ ์นฉ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ๊ด€๋ฆฌ ๊ธฐ๋ฒ•์„ ๋ ˆ์ด์–ด ๊ฐ„ ํ”ผ์ฒ˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ตœ๋Œ€ํ•œ ์žฌ์‚ฌ์šฉํ•˜๋Š” ๋ ˆ์ด์–ด ์œตํ•ฉ ๋ฐฉ๋ฒ•์œผ๋กœ ํ™•์žฅํ•˜์˜€๋‹ค. ๊ธฐ์กด์˜ ์ˆœ์ˆ˜ํ•œ ๋ ˆ์ด์–ด ์œตํ•ฉ ๋ฐฉ๋ฒ•์˜ ๊ฒฝ์šฐ์—๋Š” ์ค‘๋ณต ๊ณ„์‚ฐํ•˜๋Š” ์˜ค๋ฒ„ํ—ค๋“œ์™€ ์ถ”๊ฐ€์ ์ธ ํ•„ํ„ฐ ์›จ์ดํŠธ ๋กœ๋“œ๊ฐ€ ์ƒ๊ธด๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๊ธฐ์กด์˜ ๋ ˆ์ด์–ด ๋ณ„๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•๊ณผ ์ˆœ์ˆ˜ํ•œ ๋ ˆ์ด์–ด ์œตํ•ฉ ๋ฐฉ๋ฒ• ์‚ฌ์ด์˜ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ ˆ์ด์–ด ์œตํ•ฉ ๋ฐฉ๋ฒ•์„ ๊ณ ์•ˆํ•˜์˜€๋‹ค. ๋‘ ์˜จ ์นฉ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ๊ด€๋ฆฌ ๊ธฐ๋ฒ•์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ ˆ์ด์–ด ์œตํ•ฉ ๋ฐฉ๋ฒ•์ด ๊ธฐ์กด์˜ ๋ ˆ์ด์–ด ๋ณ„ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ธฐ๋ฒ•๊ณผ ์ˆœ์ˆ˜ํ•œ ๋ ˆ์ด์–ด ์œตํ•ฉ ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.Executing deep learning algorithms on mobile embedded devices is challenging because embedded devices usually have tight constraints on the computational power, memory size, and energy consumption, while the resource requirements of deep learning algorithms achieving high accuracy continue to increase. To cope with increasing computation complexity, it is common to use an energy-efficient accelerator, such as a mobile GPU or digital signal processor (DSP) array, or to develop a customized neural processor chip called neural processing unit (NPU). In the application domain, many optimization techniques have been proposed to change the application algorithm in order to reduce the computational amount and memory usage by developing new deep learning networks or software optimization techniques that take advantage of the statistical nature of deep learning algorithms. Another approach is hardware-ware software optimization, which finds the performance bottleneck first and then distributes the workload evenly by scheduling the workloads. This dissertation covers hardware-aware software optimization, which is based on a hardware processor or platform. First, we devise a systematic optimization methodology through the experience of participating in the Low Power Image Recognition Challenge (LPIRC) and build a deep learning framework called C-GOOD (C-code Generation Framework for Optimized On-device Deep Learning) based on the devised methodology. 
For hardware independence, C-GOOD generates C code that can be compiled for, and run on, virtually any embedded device. C-GOOD also provides various options for applying application-domain optimizations according to the devised methodology, together with facilities for measuring system performance. By applying the devised methodology to three hardware platforms, the NVIDIA Jetson TX2, the Odroid XU4, and the Samsung Reconfigurable Processor (SRP), we demonstrate that the methodology is independent of the hardware platform and that application-domain optimizations can be performed easily with C-GOOD. Recently, embedded devices have come to be equipped with heterogeneous processing elements (PEs), and the need to run multiple deep learning applications concurrently on a single embedded system, such as a self-driving car or a smartphone, is increasing. For such systems, we devise an end-to-end methodology to schedule deep learning applications onto heterogeneous PEs and implement a scheduling framework according to the methodology. It covers everything from profiling on real embedded devices to verifying the schedule results on those devices. In this methodology, we use a genetic algorithm (GA)-based scheduling technique for mapping deep learning applications onto heterogeneous PEs and consider several practical issues in the profiling step, such as DVFS and CPU hot-plug. Furthermore, we schedule multiple applications with different throughput constraints, considering the schedulability of the mapped tasks on each processor. After implementing a deep learning inference engine that can utilize heterogeneous PEs using low-level primitives of the ARM Compute Library (ACL), we verify the devised methodology by running two widely used convolutional neural networks (CNNs) on a Galaxy S9 smartphone and a Hikey970 board. While the previous optimization methods focus on computation and the processing elements, the performance bottleneck of deep learning accelerators is the communication between off-chip and on-chip memory. Moreover, off-chip DRAM accesses account for a significant portion of an NPU's energy consumption. To reduce the impact of off-chip DRAM access on the performance and energy of an NPU, we devise compiler techniques for an NPU to manage multi-bank on-chip memory with two different objectives: one is to minimize the off-chip memory access volume, and the other is to minimize the processing delay caused by unhidden DRAM accesses. The main idea is that by organizing on-chip memory into multiple banks, we can hide the off-chip DRAM access delay by prefetching data into unused banks during computation, and reduce the off-chip DRAM access volume by keeping the output feature map data of each layer in on-chip memory. By running CNN benchmarks on a cycle-level NPU simulator, we demonstrate the trade-off relation between the two objectives. The devised multi-bank on-chip memory management (MOMM) techniques are then extended to consider layer fusion, which aims to maximally reuse feature maps between layers. Since pure layer fusion incurs extra computation overhead and additional DRAM accesses for filter weights, a hybrid fusion technique is presented that lies between per-layer processing and pure layer fusion, based on the devised MOMM techniques with the two objectives.
Experimental results confirm the superiority of the hybrid fusion technique over both the per-layer processing technique and the pure layer fusion technique.
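    The delay-hiding objective described above is essentially software-managed double buffering across on-chip banks. Below is a minimal sketch of that idea under assumed names (dmaPrefetch, compute) and a simple two-bank ping-pong assignment; the dissertation's compiler derives the actual per-layer bank assignment.

```cpp
// Sketch of bank-level prefetching: while the NPU computes layer N from
// one on-chip bank, the DMA engine prefetches layer N+1's input into a
// free bank, overlapping DRAM latency with computation. The two-bank
// ping-pong is an illustrative assumption.
#include <cstdio>

void dmaPrefetch(int bank, int layer) {
    std::printf("  DMA: load input of layer %d into bank %d\n", layer, bank);
}
void compute(int bank, int layer) {
    std::printf("  NPU: run layer %d from bank %d\n", layer, bank);
}

int main() {
    const int layers = 4;
    dmaPrefetch(0, 0);                        // prime the first bank
    for (int l = 0; l < layers; ++l) {
        int cur = l % 2, nxt = (l + 1) % 2;   // ping-pong between two banks
        if (l + 1 < layers) dmaPrefetch(nxt, l + 1); // hide the DRAM access
        compute(cur, l);                      // overlaps with the prefetch
    }
}
```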

    FBsim and the Fully Buffered DIMM Memory System Architecture

    Get PDF
    As DRAM device data rates increase in pursuit of ever-increasing memory request rates, parallel bus limitations and cost constraints require a sharp decrease in the load on the multi-drop buses between the devices and the memory controller, limiting the memory system's scalability and failing to meet the capacity requirements of modern server and workstation applications. A new technology, the Fully Buffered DIMM (FB-DIMM) architecture, is currently being introduced to address these challenges. FB-DIMM uses narrower, faster, buffered point-to-point channels to meet memory capacity and throughput requirements at the price of latency. This study provides a detailed look at the proposed architecture and its adoption, introduces an FB-DIMM simulation model, the FBsim simulator, and uses it to explore the design space of this new technology, identifying and experimentally proving some of its strengths, weaknesses, and limitations, and uncovering future paths of academic research into the field.
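    The latency price of FB-DIMM's daisy-chained channel can be illustrated with a toy model: each Advanced Memory Buffer (AMB) hop adds pass-through delay in both directions, so read latency grows with the DIMM's position on the channel. The per-hop and DRAM-core delays below are assumed values for illustration, not FBsim outputs.

```cpp
// Toy latency model of an FB-DIMM daisy chain. Constants are assumptions.
#include <cstdio>

double readLatencyNs(int dimmIndex) {
    const double hopNs  = 2.0;    // assumed AMB pass-through per direction
    const double coreNs = 50.0;   // assumed DRAM access at the target DIMM
    return 2.0 * hopNs * (dimmIndex + 1) + coreNs;  // southbound and back
}

int main() {
    for (int d = 0; d < 8; ++d)
        std::printf("DIMM %d: ~%.0f ns\n", d, readLatencyNs(d));
}
```

    The model shows why capacity scales while worst-case latency degrades with channel depth, one of the trade-offs the design-space exploration examines.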

    Making non-volatile memory programmable

    Get PDF
    Byte-addressable, non-volatile memory (NVM) is emerging as a revolutionary memory technology that provides persistence, near-DRAM performance, and scalable capacity. By using NVM, applications can directly create and manipulate durable data in place without the need for serialization out to SSDs. Ideally, through NVM, persistent applications will be able to maintain crash consistency at minimal cost. However, before this is possible, improvements must be made at both the hardware and software levels to support persistent applications. Currently, software support for NVM places too high a burden on the developer, introducing many opportunities for mistakes while also being too rigid for compiler optimizations. Likewise, at the hardware level, too little information is passed to the processor about the instruction-level ordering requirements of persistent applications; this forces the use of coarse fences, which significantly slow down execution. To help realize the promise of NVM, this thesis proposes both new software and hardware support that make NVM programmable. From the software side, this thesis proposes a new NVM programming model which relieves the programmer of much of the accounting work in persistent applications, instead relying on the runtime to perform the error-prone tasks. Specifically, within the proposed model, the user only needs to provide minimal markings to identify the persistent data set and to ensure data is updated in a crash-consistent manner. Given this new NVM programming model, this thesis next presents an implementation of the model in Java. I call my implementation AutoPersist and build my support into the Maxine research Java Virtual Machine (JVM). In this thesis I describe how the JVM can be changed to support the proposed NVM programming model, including adding new Java libraries, adding new JVM runtime features, and augmenting the behavior of existing Java bytecodes. In addition to being easy to use, another advantage of the proposed model is that it is amenable to compiler optimizations. In this thesis I highlight two profile-guided optimizations: eagerly allocating objects directly into NVM and speculatively pruning control flow to include only expected-to-be-taken paths. I also describe how to apply these optimizations to AutoPersist and show that they have a substantial performance impact. While designing AutoPersist, I often observed that dependency information known to the compiler cannot be passed down to the underlying hardware; instead, the compiler must insert coarse-grained fences to enforce the needed dependencies. This is because current instruction set architectures (ISAs) cannot describe arbitrary instruction-level execution ordering constraints. To fix this limitation, I introduce the Execution Dependency Extension (EDE) and describe how EDE can be added to an existing ISA as well as implemented in current processor pipelines. Overall, emerging NVM technologies can deliver programmer-friendly high performance, but both software and hardware improvements are necessary for this to happen. This thesis takes steps to address the current software and hardware gaps: I propose new software support to assist in the development of persistent applications and also introduce new instructions which allow arbitrary instruction-level dependencies to be conveyed and enforced by the underlying hardware. With these improvements, hopefully the dream of programmable high-performance NVM is one step closer to being realized.
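    The ordering problem this thesis targets can be seen in the canonical durable-update pattern: the data must reach persistence before the flag that publishes it, and without finer-grained ISA support the compiler must enforce that single dependency with coarse flush-and-fence sequences. The sketch below uses stand-in flushLine/fence helpers (on real x86 hardware these would be instructions such as clwb and sfence); the one ordering edge it protects is the kind of dependency EDE is designed to express directly.

```cpp
// Sketch of a crash-consistent durable update. flushLine/fence are
// stand-ins for cache-line write-back and store-fence instructions;
// the real ordering requirement is only "valid after payload".
#include <atomic>
#include <cstdio>

void flushLine(const void* p) { /* stand-in for clwb(p) */ (void)p; }
void fence()                  { std::atomic_thread_fence(std::memory_order_release); }

struct DurableRecord { int payload; bool valid; };

void durableUpdate(DurableRecord& r, int value) {
    r.payload = value;      // 1. write the data
    flushLine(&r.payload);  // 2. push it out of the volatile cache
    fence();                // 3. order the flush before the publish...
    r.valid = true;         // 4. ...then set the flag a recovery scan trusts
    flushLine(&r.valid);
    fence();
}

int main() {
    DurableRecord r{0, false};
    durableUpdate(r, 42);
    std::printf("payload=%d valid=%d\n", r.payload, r.valid);
}
```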