381 research outputs found
Improving Phase Change Memory Performance with Data Content Aware Access
A prominent characteristic of write operation in Phase-Change Memory (PCM) is
that its latency and energy are sensitive to the data to be written as well as
the content that is overwritten. We observe that overwriting unknown memory
content can incur significantly higher latency and energy compared to
overwriting known all-zeros or all-ones content. This is because all-zeros or
all-ones content is overwritten by programming the PCM cells only in one
direction, i.e., using either SET or RESET operations, not both. In this paper,
we propose data content aware PCM writes (DATACON), a new mechanism that
reduces the latency and energy of PCM writes by redirecting these requests to
overwrite memory locations containing all-zeros or all-ones. DATACON operates
in three steps. First, it estimates how much a PCM write access would benefit
from overwriting known content (e.g., all-zeros, or all-ones) by
comprehensively considering the number of set bits in the data to be written,
and the energy-latency trade-offs for SET and RESET operations in PCM. Second,
it translates the write address to a physical address within memory that
contains the best type of content to overwrite, and records this translation in
a table for future accesses. We exploit data access locality in workloads to
minimize the address translation overhead. Third, it re-initializes unused
memory locations with known all-zeros or all-ones content in a manner that does
not interfere with regular read and write accesses. DATACON overwrites unknown
content only when it is absolutely necessary to do so. We evaluate DATACON with
workloads from state-of-the-art machine learning applications, SPEC CPU2017,
and NAS Parallel Benchmarks. Results demonstrate that DATACON significantly
improves system performance and memory system energy consumption compared to
the best of performance-oriented state-of-the-art techniques.
Comment: 18 pages, 21 figures, accepted at ACM SIGPLAN International Symposium
on Memory Management (ISMM
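The three steps in the abstract above can be illustrated with a minimal sketch. It is not DATACON's implementation: the per-bit costs, the weighted score, the pool lists, and every name (`cost_over_known`, `pick_target`, `Remapper`) are hypothetical. It assumes SET programs a cell to 1 and RESET programs it to 0, so overwriting all-zeros needs only SET operations and overwriting all-ones needs only RESET operations.

```python
from dataclasses import dataclass

# Hypothetical per-bit costs, for illustration only (not taken from the paper).
SET_LATENCY_NS, RESET_LATENCY_NS = 150.0, 50.0
SET_ENERGY_PJ, RESET_ENERGY_PJ = 13.0, 19.0


@dataclass(frozen=True)
class WriteCost:
    latency_ns: float
    energy_pj: float


def cost_over_known(data: bytes, content: str) -> WriteCost:
    """Cost of writing `data` over a location holding known `content`."""
    ones = sum(bin(byte).count("1") for byte in data)
    zeros = 8 * len(data) - ones
    if content == "zeros":   # only the 1-bits need a SET
        cells, latency, energy = ones, SET_LATENCY_NS, SET_ENERGY_PJ
    elif content == "ones":  # only the 0-bits need a RESET
        cells, latency, energy = zeros, RESET_LATENCY_NS, RESET_ENERGY_PJ
    else:
        raise ValueError(f"unknown content type: {content}")
    # Assumption: cells are programmed in parallel, so latency is per write.
    return WriteCost(latency if cells else 0.0, cells * energy)


def pick_target(data: bytes, latency_weight: float = 0.5) -> str:
    """Step 1: choose the content type that is cheapest to overwrite.

    The weighted sum mixes units and is only a placeholder trade-off."""
    def score(content: str) -> float:
        c = cost_over_known(data, content)
        return latency_weight * c.latency_ns + (1 - latency_weight) * c.energy_pj
    return min(("zeros", "ones"), key=score)


class Remapper:
    def __init__(self, zero_pool, one_pool):
        self.free = {"zeros": list(zero_pool), "ones": list(one_pool)}
        self.table = {}   # logical address -> physical address
        self.stale = []   # unused locations awaiting re-initialization

    def write(self, logical: int, data: bytes):
        """Step 2: redirect the write and record the translation."""
        preferred = pick_target(data)
        fallback = "ones" if preferred == "zeros" else "zeros"
        for content in (preferred, fallback):
            if self.free[content]:
                physical = self.free[content].pop()
                old = self.table.get(logical)
                if old is not None:
                    self.stale.append(old)
                self.table[logical] = physical
                return physical, content
        # No known-content location is left: overwrite unknown content.
        return self.table.get(logical, logical), "unknown"

    def reinitialize(self, content: str) -> None:
        """Step 3: return one stale location to a known-content pool."""
        if self.stale:
            self.free[content].append(self.stale.pop())
```

In this sketch a write falls back to unknown content only when both pools are empty, which mirrors the abstract's statement that unknown content is overwritten only when necessary. A real controller would also have to schedule `reinitialize` so that it does not interfere with regular accesses.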
Model-Based Performance Prediction for Concurrent Software on Multicore Architectures
Model-based performance prediction is a well-known concept to ensure the quality of software. Current approaches are based on a single-metric model, which leads to inaccurate predictions for modern architectures.
This thesis presents a multi-strategy approach to extend performance prediction models to support multicore architectures. We implemented the strategies in Palladio and significantly increased the performance prediction power
Doctor of Philosophy
Dissertation
In recent years, a number of trends have started to emerge, both in microprocessor and application characteristics. As per Moore's law, the number of cores on chip will keep doubling every 18-24 months. The International Technology Roadmap for Semiconductors (ITRS) reports that wires will continue to scale poorly, exacerbating the cost of on-chip communication. Cores will have to navigate an on-chip network to access data that may be scattered across many cache banks. The number of pins on the package, and hence the available off-chip bandwidth, will at best increase at a sublinear rate and at worst stagnate. A number of disruptive memory technologies, e.g., phase change memory (PCM), have begun to emerge and will be integrated into the memory hierarchy sooner rather than later, leading to non-uniform memory access (NUMA) hierarchies. This will make the cost of accessing main memory even higher. In previous years, most of the focus has been on deciding the memory hierarchy level where data must be placed (L1 or L2 caches, main memory, disk, etc.). However, in modern and future generations, each level is getting bigger and its design is being subjected to a number of constraints (wire delays, power budget, etc.). It is becoming very important to make an intelligent decision about where data must be placed within a level. For example, in a large non-uniform cache access (NUCA) cache, we must figure out the optimal bank. Similarly, in a NUMA main memory with multiple dual inline memory modules (DIMMs), we must figure out the DIMM that is the optimal home for every data page. Studies have indicated that heterogeneous main memory hierarchies that incorporate multiple memory technologies are on the horizon. We must develop solutions for data management that take heterogeneity into account. For these memory organizations, we must again identify the appropriate home for data.
In this dissertation, we attempt to verify the following thesis statement: "Can low-complexity hardware and OS mechanisms manage data placement within each memory hierarchy level to optimize metrics such as performance and/or throughput?" We argue for a hardware-software codesign approach to tackle the above-mentioned problems at different levels of the memory hierarchy. The proposed methods use techniques such as page coloring and shadow addresses and handle a range of problems, from managing wire delays in large, shared NUCA caches to distributing shared capacity among different cores. We then examine data-placement issues in NUMA main memory for a many-core processor with a moderate number of on-chip memory controllers. Using codesign approaches, we achieve efficient data placement by modifying the operating system's (OS) page allocation algorithm for a wide variety of main memory architectures
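Page coloring, mentioned above, can be sketched in a few lines. This is a generic illustration, not the dissertation's allocator: it assumes 4 KiB pages, 16 colors, and a cache bank selected by the low-order bits of the physical frame number, and the names `page_color` and `ColoredAllocator` are hypothetical.

```python
from collections import defaultdict

PAGE_SHIFT = 12   # assumption: 4 KiB pages
NUM_COLORS = 16   # assumption: one color per cache bank


def page_color(phys_addr: int) -> int:
    """Color = low-order bits of the physical frame number."""
    return (phys_addr >> PAGE_SHIFT) % NUM_COLORS


class ColoredAllocator:
    """Keeps one free list of frame numbers per color."""

    def __init__(self, frames):
        self.free = defaultdict(list)
        for frame in frames:
            self.free[frame % NUM_COLORS].append(frame)

    def alloc(self, preferred_color: int) -> int:
        # Try the preferred color first, then fall back to the next colors.
        for offset in range(NUM_COLORS):
            color = (preferred_color + offset) % NUM_COLORS
            if self.free[color]:
                return self.free[color].pop()
        raise MemoryError("no free frames")
```

An OS that wants a page to live in the bank closest to the requesting core would pass that bank's color as `preferred_color`. The fallback order used here is arbitrary; a real policy would weigh distance and bank pressure.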
Aging-Aware Request Scheduling for Non-Volatile Main Memory
Modern computing systems are embracing non-volatile memory (NVM) to implement
high-capacity and low-cost main memory. Elevated operating voltages of NVM
accelerate the aging of CMOS transistors in the peripheral circuitry of each
memory bank. Aggressive device scaling increases power density and temperature,
which further accelerates aging, challenging the reliable operation of
NVM-based main memory. We propose HEBE, an architectural technique to mitigate
the circuit aging-related problems of NVM-based main memory. HEBE is built on
three contributions. First, we propose a new analytical model that can
dynamically track the aging in the peripheral circuitry of each memory bank
based on the bank's utilization. Second, we develop an intelligent memory
request scheduler that exploits this aging model at run time to de-stress the
peripheral circuitry of a memory bank only when its aging exceeds a critical
threshold. Third, we introduce an isolation transistor to decouple parts of a
peripheral circuit operating at different voltages, allowing the decoupled
logic blocks to undergo long-latency de-stress operations independently and off
the critical path of memory read and write accesses, improving performance. We
evaluate HEBE with workloads from the SPEC CPU2017 Benchmark suite. Our results
show that HEBE significantly improves both performance and lifetime of
NVM-based main memory.
Comment: To appear in ASP-DAC 202
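The threshold-triggered de-stress policy described above can be sketched as follows. This is a simplified illustration rather than HEBE itself: the linear aging model, the recovery factor, the cycle counts, and all names are hypothetical, and the bank simply stalls during de-stress, whereas the abstract describes moving de-stress off the critical path with an isolation transistor.

```python
class Bank:
    def __init__(self):
        self.aging = 0.0       # accumulated stress (arbitrary units)
        self.busy_until = 0    # cycle at which a de-stress operation ends


class AgingAwareScheduler:
    def __init__(self, num_banks, threshold=10.0, stress_per_access=1.0,
                 destress_cycles=100, recovery=0.5):
        self.banks = [Bank() for _ in range(num_banks)]
        self.threshold = threshold
        self.stress_per_access = stress_per_access
        self.destress_cycles = destress_cycles
        self.recovery = recovery   # fraction of aging removed by a de-stress

    def issue(self, bank_id: int, now: int) -> str:
        bank = self.banks[bank_id]
        if now < bank.busy_until:
            return "stall"                      # bank is still de-stressing
        if bank.aging >= self.threshold:
            bank.aging *= 1.0 - self.recovery   # hypothetical recovery model
            bank.busy_until = now + self.destress_cycles
            return "destress"
        bank.aging += self.stress_per_access    # utilization drives aging
        return "serve"
```

Because aging is tracked per bank, a lightly used bank is never de-stressed, which matches the abstract's point that de-stress happens only when a bank's aging exceeds a critical threshold.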
Stacked Memory Architecture for Improved Performance and Capacity
Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligent Convergence Systems major), February 2019. Jung Ho Ahn.
The advance of DRAM manufacturing technology is slowing down, whereas the density and performance demands on DRAM continue to increase. This demand has motivated the industry to explore emerging Non-Volatile Memory (e.g., 3D XPoint) and high-density DRAM (e.g., Managed DRAM Solution). Since such memory technologies increase density at the cost of longer latency, lower bandwidth, or both, it is essential to use them with fast memory (e.g., conventional DRAM) to which hot pages are transferred at runtime. Nonetheless, we observe that page transfers to fast memory often block memory channels from servicing application memory requests for a long period. This in turn significantly increases the high-percentile response time of latency-sensitive applications. In this thesis, we propose a high-density managed DRAM architecture, dubbed 3D-XPath, for applications demanding both low latency and high memory capacity. 3D-XPath DRAM stacks conventional DRAM dies with the high-density DRAM dies explored in this thesis and connects these DRAM dies with 3D-XPath. In particular, 3D-XPath allows unused memory channels to service application memory requests when the primary channels that should handle those requests are blocked by page transfers, considerably reducing the high-percentile response time. This can also improve the throughput of applications that frequently copy memory blocks between kernel and user memory spaces. Our evaluation shows that 3D-XPath DRAM decreases the high-percentile response time of latency-sensitive applications by ~30% while improving the throughput of I/O-intensive applications by ~39%, compared with DRAM without 3D-XPath.
Recent computer systems are evolving toward the integration of more CPU cores into a single socket, which requires higher memory bandwidth and capacity. Increasing the number of channels per socket is a common solution to the bandwidth demand, and to better utilize these additional channels, the data bus width is reduced and the burst length is increased. However, the longer burst length increases DRAM access latency. On the memory capacity side, process scaling has been the answer for decades, but cell capacitance now limits how small a cell can be. 3D stacked memory solves this problem by stacking dies on top of other dies.
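The bus-width and burst-length trade-off above comes down to simple arithmetic, sketched here. The 64-byte cache line and the 3200 MT/s transfer rate are assumptions chosen for illustration, and `burst` is a hypothetical helper.

```python
def burst(line_bytes: int, bus_bits: int, transfers_per_sec: float):
    """Return (burst length, data transfer time in ns) for one cache line."""
    burst_length = line_bytes * 8 // bus_bits   # transfers needed per line
    time_ns = burst_length / transfers_per_sec * 1e9
    return burst_length, time_ns


if __name__ == "__main__":
    RATE = 3200e6   # assumption: 3200 mega-transfers per second
    for bus_bits in (64, 32):
        length, time_ns = burst(64, bus_bits, RATE)
        print(f"{bus_bits}-bit bus: burst length {length}, "
              f"about {time_ns:.2f} ns per line")
```

At a fixed transfer rate, halving the bus width doubles the burst length and therefore the time to move one cache line, which is the latency cost the abstract refers to.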
We made a key observation on a real multicore machine: the multiple memory controllers are not always fully utilized when running the SPEC CPU 2006 rate benchmarks. To bring these idle channels into play, we propose a memory channel sharing architecture that boosts the peak bandwidth of a single memory channel and reduces the burst latency of 3D stacked memory. With channel sharing, performance on multi-programmed and multi-threaded workloads improves by up to 4.3% and 3.6%, respectively, and the average read latency falls by up to 8.22% and 10.18%.
The advance of DRAM manufacturing technology is slowing, while the density and performance demands on DRAM keep increasing. These demands have led to the emergence of new non-volatile memory (e.g., 3D-XPoint) and high-density DRAM (e.g., Managed asymmetric latency DRAM Solution). Because these high-density memory technologies increase density at the cost of long latency, low bandwidth, or both, their performance is poor, so they are commonly used together with a low-capacity fast memory (e.g., conventional DRAM) into which hot pages are swapped. During swapping, page transfers to the fast memory keep ordinary application memory requests from being serviced for a long time, which greatly increases the percentile response time of latency-sensitive applications and increases the standard deviation of the response time. To solve this problem, this thesis proposes 3D-XPath, a high-density managed DRAM architecture, for applications that require low latency and high memory capacity. A DRAM that integrates 3D-XPath stacks slow high-density DRAM dies together with conventional DRAM dies in one chip, and the DRAM dies are connected through the proposed 3D-XPath hardware. 3D-XPath lets hot-page swaps be handled over lightly used memory channels without blocking application memory requests while the swap takes place, improving the percentile response time of data-intensive applications. The proposed hardware can additionally improve the throughput of applications that frequently copy memory blocks between the OS kernel and user space. Compared with DRAM without 3D-XPath, 3D-XPath DRAM improves the throughput of I/O-intensive applications by up to 39% while reducing the high-percentile response time of latency-sensitive applications by up to 30%.
Recent computer systems are also evolving toward integrating more CPU cores, which need more memory bandwidth and capacity, into a single socket. Increasing the number of channels per socket is a common answer to the bandwidth demand, and in recent DRAM interfaces the data bus width has been reduced and the burst length increased to make better use of the additional channels. However, the longer burst length increases DRAM access latency. In addition, recent applications demand more memory capacity. Process scaling has been used to increase memory capacity for decades, but at 20 nm and below it is difficult to increase memory density further through scaling, so stacked memory is used to increase capacity.
In this situation, we observed that when SPEC CPU 2006 applications run on the cores of a recent real multicore machine, the system's memory controllers are not always fully utilized. To use these idle channels, this thesis proposes a memory channel sharing architecture, together with the required hardware blocks, that raises the peak bandwidth of a single memory channel and reduces the burst latency of 3D stacked memory. With channel sharing, the performance of multi-programmed and multi-threaded applications improved by up to 4.3% and 3.6%, respectively, and the average read latency decreased by up to 8.22% and 10.18%.
Contents
Abstract
Contents
List of Figures
List of Tables
Introduction
1.1 3D-XPath: High-Density Managed DRAM Architecture with Cost-effective Alternative Paths for Memory Transactions
1.2 Boosting Bandwidth - Dynamic Channel Sharing on 3D Stacked Memory
1.3 Research contribution
1.4 Outline
3D-stacked Heterogeneous Memory Architecture with Cost-effective Extra Block Transfer Paths
2.1 Background
2.1.1 Heterogeneous Main Memory Systems
2.1.2 Specialized DRAM
2.1.3 3D-stacked Memory
2.2 HIGH-DENSITY DRAM ARCHITECTURE
2.2.1 Key Design Challenges
2.2.2 Plausible High-density DRAM Designs
2.3 3D-STACKED DRAM WITH ALTERNATIVE PATHS FOR MEMORY TRANSACTIONS
2.3.1 3D-XPath Architecture
2.3.2 3D-XPath Management
2.4 EXPERIMENTAL METHODOLOGY
2.5 EVALUATION
2.5.1 OLDI Workloads
2.5.2 Non-OLDI Workloads
2.5.3 Sensitivity Analysis
2.6 RELATED WORK
Boosting Bandwidth - Dynamic Channel Sharing on 3D Stacked Memory
3.1 Background: Memory Operations
3.1.1 Memory Controller
3.1.2 DRAM column access sequence
3.2 Related Work
3.3 CHANNEL SHARING ENABLED MEMORY SYSTEM
3.3.1 Hardware Requirements
3.3.2 Operation Sequence
3.4 Analysis
3.4.1 Experiment Environment
3.4.2 Performance
3.4.3 Overhead
CONCLUSION
REFERENCES
Abstract in Korean
Docto