Exploiting Inter- and Intra-Memory Asymmetries for Data Mapping in Hybrid Tiered-Memories
Modern computing systems are embracing hybrid memory comprising DRAM and
non-volatile memory (NVM) to combine the best properties of both memory
technologies, achieving low latency, high reliability, and high density. A
prominent characteristic of DRAM-NVM hybrid memory is that it has NVM access
latency much higher than DRAM access latency. We call this inter-memory
asymmetry. We observe that parasitic components on a long bitline are a major
source of high latency in both DRAM and NVM, and a significant factor
contributing to high-voltage operations in NVM, which impact their reliability.
We propose an architectural change, where each long bitline in DRAM and NVM is
split into two segments by an isolation transistor. One segment can be accessed
with lower latency and operating voltage than the other. By introducing tiers,
we enable non-uniform accesses within each memory type (which we call
intra-memory asymmetry), leading to performance and reliability trade-offs in
DRAM-NVM hybrid memory. We extend existing NVM-DRAM OS in three ways. First, we
exploit both inter- and intra-memory asymmetries to allocate and migrate memory
pages between the tiers in DRAM and NVM. Second, we improve the OS's page
allocation decisions by predicting the access intensity of a newly-referenced
memory page in a program and placing it in a matching tier during its initial
allocation. This minimizes page migrations during program execution, lowering
the performance overhead. Third, we propose a solution to migrate pages between
the tiers of the same memory without transferring data over the memory channel,
minimizing channel occupancy and improving performance. Our overall approach,
which we call MNEME, to enable and exploit asymmetries in DRAM-NVM hybrid
tiered memory improves both performance and reliability for both single-core
and multi-programmed workloads.
Comment: 15 pages, 29 figures, accepted at ACM SIGPLAN International Symposium
on Memory Managemen
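The initial-placement idea described in the abstract above (predict a new page's access intensity and put it in a matching tier) can be illustrated with a minimal sketch. This is not the MNEME implementation: the tier names, the latency ordering of the four tiers, the intensity scale from 0.0 to 1.0, and the fallback rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical tiers, ordered from lowest to highest assumed access latency.
TIERS = ["dram_near", "dram_far", "nvm_near", "nvm_far"]


@dataclass
class TieredMemory:
    capacity: dict  # tier name -> number of free page frames
    placement: dict = field(default_factory=dict)  # page id -> tier name

    def allocate(self, page, predicted_intensity):
        """Place a newly referenced page in the tier that matches its
        predicted access intensity (0.0 = cold, 1.0 = hot).

        If the matching tier is full, fall back to the next slower tier.
        """
        # Assumed mapping: hotter pages start their search at faster tiers.
        start = min(int((1.0 - predicted_intensity) * len(TIERS)), len(TIERS) - 1)
        for tier in TIERS[start:]:
            if self.capacity[tier] > 0:
                self.capacity[tier] -= 1
                self.placement[page] = tier
                return tier
        raise MemoryError("no free frame in any eligible tier")
```

For example, with one free frame per tier, a page predicted to be hot lands in `dram_near`, the next hot page spills to `dram_far`, and a cold page goes directly to `nvm_far`, so no later migration is needed if the prediction holds.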
Stacked Memory Architecture for Improved Performance and Capacity
Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligent Convergence Systems major), 2019. 2. Jung Ho Ahn.
The advance of DRAM manufacturing technology is slowing down, whereas the density and performance demands on DRAM continue to increase. This has motivated the industry to explore emerging Non-Volatile Memory (e.g., 3D XPoint) and high-density DRAM (e.g., Managed DRAM Solution). Since such memory technologies increase density at the cost of longer latency, lower bandwidth, or both, it is essential to use them together with fast memory (e.g., conventional DRAM) to which hot pages are transferred at runtime. Nonetheless, we observe that page transfers to fast memory often block memory channels from servicing application memory requests for a long period. This in turn significantly increases the high-percentile response time of latency-sensitive applications. In this thesis, we propose a high-density managed DRAM architecture, dubbed 3D-XPath, for applications demanding both low latency and high memory capacity. 3D-XPath DRAM stacks conventional DRAM dies with the high-density DRAM dies explored in this thesis and connects these dies with 3D-XPath. In particular, 3D-XPath allows unused memory channels to service application memory requests when the primary channels that would normally handle those requests are blocked by page transfers, considerably reducing the high-percentile response time. This can also improve the throughput of applications that frequently copy memory blocks between kernel and user memory spaces. Our evaluation shows that 3D-XPath DRAM decreases the high-percentile response time of latency-sensitive applications by ~30% while improving the throughput of I/O-intensive applications by ~39%, compared with DRAM without 3D-XPath.
Recent computer systems are evolving toward integrating more CPU cores into a single socket, which requires higher memory bandwidth and capacity. Increasing the number of channels per socket is a common answer to the bandwidth demand. To better utilize these additional channels, the data bus width is reduced and the burst length is increased. However, the longer burst length increases DRAM access latency. On the capacity side, process scaling has been the answer for decades, but cell capacitance now limits how small a cell can be. 3D stacked memory addresses this problem by stacking dies on top of one another.
On a real multicore machine, we observe that the multiple memory controllers are not fully utilized when running the SPEC CPU 2006 rate benchmarks. To bring these idle channels into play, we propose a memory channel sharing architecture that boosts the peak bandwidth of a single memory channel and reduces the burst latency on 3D stacked memory. With channel sharing, the overall performance of multi-programmed and multi-threaded workloads improves by up to 4.3% and 3.6%, respectively, and the average read latency is reduced by up to 8.22% and 10.18%, respectively.
Contents
Abstract
Contents
List of Figures
List of Tables
Introduction
1.1 3D-XPath: High-Density Managed DRAM Architecture with Cost-effective Alternative Paths for Memory Transactions
1.2 Boosting Bandwidth: Dynamic Channel Sharing on 3D Stacked Memory
1.3 Research contribution
1.4 Outline
3D-stacked Heterogeneous Memory Architecture with Cost-effective Extra Block Transfer Paths
2.1 Background
2.1.1 Heterogeneous Main Memory Systems
2.1.2 Specialized DRAM
2.1.3 3D-stacked Memory
2.2 HIGH-DENSITY DRAM ARCHITECTURE
2.2.1 Key Design Challenges
2.2.2 Plausible High-density DRAM Designs
2.3 3D-STACKED DRAM WITH ALTERNATIVE PATHS FOR MEMORY TRANSACTIONS
2.3.1 3D-XPath Architecture
2.3.2 3D-XPath Management
2.4 EXPERIMENTAL METHODOLOGY
2.5 EVALUATION
2.5.1 OLDI Workloads
2.5.2 Non-OLDI Workloads
2.5.3 Sensitivity Analysis
2.6 RELATED WORK
Boosting Bandwidth: Dynamic Channel Sharing on 3D Stacked Memory
3.1 Background: Memory Operations
3.1.1 Memory Controller
3.1.2 DRAM column access sequence
3.2 Related Work
3.3 CHANNEL SHARING ENABLED MEMORY SYSTEM
3.3.1 Hardware Requirements
3.3.2 Operation Sequence
3.4 Analysis
3.4.1 Experiment Environment
3.4.2 Performance
3.4.3 Overhead
CONCLUSION
REFERENCES
Abstract in Korean
Cache-aware Parallel Programming for Manycore Processors
With rapidly evolving technology, multicore and manycore processors have
emerged as promising architectures to benefit from increasing transistor
numbers. The transition towards these parallel architectures makes today an
exciting time to investigate challenges in parallel computing. The TILEPro64 is
a manycore accelerator, composed of 64 tiles interconnected via multiple 8x8
mesh networks. It contains per-tile caches and supports cache-coherent shared
memory by default. In this paper we present a programming technique to take
advantage of the distributed caching facilities in manycore processors. Unlike
other work in this area, our approach does not use architecture-specific
libraries. Instead, we provide the programmer with a novel technique for
programming future Non-Uniform Cache Architecture (NUCA) manycore systems, bearing
in mind their caching organisation. We show that our localised programming
approach can result in a significant improvement of the parallelisation
efficiency (speed-up).
Comment: This work was presented at the International Symposium on
Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2013),
Edinburgh, Scotland, June 13-14, 201
Stochastic Modeling of Hybrid Cache Systems
In recent years, there has been an increasing demand for big-memory systems to
perform large-scale data analytics. Since DRAM is expensive, some
researchers have suggested using other memory technologies, such as non-volatile
memory (NVM), to build large-memory computing systems. However,
whether NVM technology can be a viable alternative (both economically and
technically) to DRAM remains an open question. To answer this question, it is
important to consider how to design a memory system from a "system
perspective", that is, incorporating the different performance characteristics
and price ratios of hybrid memory devices.
This paper presents an analytical model of a "hybrid page cache system" to
understand the diverse design space and performance impact of a hybrid cache
system. We consider (1) various architectural choices, (2) design strategies,
and (3) configurations of different memory devices. Using this model, we provide
guidelines on how to design a hybrid page cache that reaches a good trade-off
between high system throughput (in I/O operations per second, or IOPS) and fast
cache reactivity, which is defined by the time to fill the cache. We also show
how one can configure the DRAM capacity and NVM capacity under a fixed budget.
We pick PCM as an example of NVM and conduct numerical analysis. Our analysis
indicates that incorporating PCM in a page cache system significantly improves
system performance, and that in some cases allocating more PCM to the page
cache yields a larger benefit. Moreover, for the common setting of the
performance-price ratio of PCM, the "flat architecture" is the better choice,
but the "layered architecture" outperforms it if PCM write performance can be
significantly improved in the future.
Comment: 14 pages; mascots 201
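The fixed-budget configuration question mentioned in the abstract above comes down to simple arithmetic: every unit of budget spent on DRAM is unavailable for PCM. The sketch below only enumerates the feasible capacity splits; it does not reproduce the paper's analytical model, and the function name, the per-GB prices, and the step size are hypothetical.

```python
def capacity_options(budget, dram_price_per_gb, pcm_price_per_gb, step_gb=1):
    """List (dram_gb, pcm_gb) pairs that spend the whole budget.

    DRAM capacity is increased in steps of `step_gb`; whatever budget
    remains after buying DRAM is spent on PCM.
    """
    options = []
    dram_gb = 0
    while dram_gb * dram_price_per_gb <= budget:
        remaining = budget - dram_gb * dram_price_per_gb
        options.append((dram_gb, remaining / pcm_price_per_gb))
        dram_gb += step_gb
    return options
```

With an assumed budget of 100 units, DRAM at 10 units per GB, PCM at 2 units per GB, and a 5 GB step, the candidates are 0 GB DRAM with 50 GB PCM, 5 GB DRAM with 25 GB PCM, and 10 GB DRAM with no PCM. A performance model such as the one the paper describes would then be evaluated on each candidate to pick the best trade-off.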