71 research outputs found
Doctor of Philosophy
dissertation

The computing landscape is undergoing a major change, primarily enabled by ubiquitous wireless networks and the rapid increase in the use of mobile devices that access a web-based information infrastructure. Most intensive computing is expected to happen either in servers housed in large datacenters (warehouse-scale computers), e.g., cloud computing and other web services, or in many-core high-performance computing (HPC) platforms in scientific labs. The primary challenge to scaling such computing systems into the exascale realm is the efficient supply of large amounts of data to hundreds or thousands of compute cores, i.e., building an efficient memory system. Main memory systems are at an inflection point due to the convergence of several major application and technology trends: the increasing importance of energy consumption, reduced access-stream locality, increasing failure rates, limited pin counts, increasing heterogeneity and complexity, and the diminished importance of cost-per-bit. In light of these trends, the memory system requires a major overhaul. The key to architecting the next generation of memory systems is a combination of the prudent incorporation of novel technologies and a fundamental rethinking of certain conventional design decisions. In this dissertation, we study every major element of the memory system - the memory chip, the processor-memory channel, the memory access mechanism, and memory reliability - and identify the key bottlenecks to efficiency. Based on this, we propose a novel main memory system with the following innovative features: (i) overfetch-aware re-organized chips, (ii) low-cost silicon photonic memory channels, (iii) largely autonomous memory modules with a packet-based interface to the processor, and (iv) a RAID-based reliability mechanism. Such a system is energy-efficient, high-performance, low-complexity, reliable, and cost-effective, making it ideally suited to meet the requirements of future large-scale computing systems.
Doctor of Philosophy in Computing
dissertation
A Modern Primer on Processing in Memory
Modern computing systems are overwhelmingly designed to move data to
computation. This design choice goes directly against at least three key trends
in computing that cause performance, scalability and energy bottlenecks: (1)
data access is a key bottleneck as many important applications are increasingly
data-intensive, and memory bandwidth and energy do not scale well, (2) energy
consumption is a key limiter in almost all computing platforms, especially
server and mobile systems, (3) data movement, especially off-chip to on-chip,
is very expensive in terms of bandwidth, energy and latency, much more so than
computation. These trends are felt especially severely in today's data-intensive
server and energy-constrained mobile systems. At the same time,
conventional memory technology is facing many technology scaling challenges in
terms of reliability, energy, and performance. As a result, memory system
architects are open to organizing memory in different ways and making it more
intelligent, at the expense of higher cost. The emergence of 3D-stacked memory
plus logic, the adoption of error-correcting codes inside the latest DRAM
chips, the proliferation of different main memory standards and chips
specialized for different purposes (e.g., graphics, low-power, high-bandwidth,
low-latency), and the necessity of designing new solutions to serious
reliability and security issues, such as the RowHammer phenomenon, are
evidence of this trend. This chapter discusses recent research that aims to practically enable
computation close to data, an approach we call processing-in-memory (PIM). PIM
places computation mechanisms in or near where the data is stored (i.e., inside
the memory chips, in the logic layer of 3D-stacked memory, or in the memory
controllers), so that data movement between the computation units and memory is
reduced or eliminated.
Comment: arXiv admin note: substantial text overlap with arXiv:1903.0398
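The chapter's core premise, that moving data costs far more than computing on it, can be made concrete with a back-of-envelope energy model. The sketch below is illustrative only: the per-operation energy numbers are rough order-of-magnitude assumptions chosen for this example, not figures from the chapter.

```python
# Illustrative back-of-envelope model of why PIM helps: off-chip data
# movement costs far more energy than the computation it feeds.
# Both energy constants are rough assumptions for illustration only.

PJ_PER_64B_DRAM_TRANSFER = 640.0   # assumed: one off-chip 64-byte transfer
PJ_PER_ADD = 1.0                   # assumed: one 32-bit integer add

def energy_cpu_side_sum(n_elems: int) -> float:
    """Sum n 32-bit ints on the CPU: every element crosses the bus."""
    lines = (n_elems * 4 + 63) // 64          # 16 ints per 64-byte line
    return lines * PJ_PER_64B_DRAM_TRANSFER + n_elems * PJ_PER_ADD

def energy_pim_sum(n_elems: int) -> float:
    """Sum in/near memory: only the one-element result crosses the bus."""
    return n_elems * PJ_PER_ADD + PJ_PER_64B_DRAM_TRANSFER

n = 1_000_000
cpu, pim = energy_cpu_side_sum(n), energy_pim_sum(n)
print(f"CPU-side: {cpu/1e6:.1f} uJ, PIM: {pim/1e6:.1f} uJ, ratio {cpu/pim:.0f}x")
```

Under these assumptions, summing a million integers near memory costs roughly 40x less energy than shipping every cache line to the CPU, because only the single result crosses the off-chip bus.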
DRAM Bender: An Extensible and Versatile FPGA-based Infrastructure to Easily Test State-of-the-art DRAM Chips
To understand and improve DRAM performance, reliability, security and energy
efficiency, prior works study characteristics of commodity DRAM chips.
Unfortunately, state-of-the-art open source infrastructures capable of
conducting such studies are obsolete, poorly supported, or difficult to use, or
their inflexibility limits the types of studies they can conduct.
We propose DRAM Bender, a new FPGA-based infrastructure that enables
experimental studies on state-of-the-art DRAM chips. DRAM Bender offers three
key features at the same time. First, DRAM Bender enables directly interfacing
with a DRAM chip through its low-level interface. This allows users to issue
DRAM commands in arbitrary order and with finer-grained time intervals compared
to other open source infrastructures. Second, DRAM Bender exposes easy-to-use
C++ and Python programming interfaces, allowing users to quickly and easily
develop different types of DRAM experiments. Third, DRAM Bender is easily
extensible. The modular design of DRAM Bender allows extending it to (i)
support existing and emerging DRAM interfaces, and (ii) run on new commercial
or custom FPGA boards with little effort.
To demonstrate that DRAM Bender is a versatile infrastructure, we conduct
three case studies, two of which lead to new observations about the DRAM
RowHammer vulnerability. In particular, we show that the data patterns
supported by DRAM Bender uncover a larger set of bit flips on a victim row
than the data patterns commonly used in prior work. We demonstrate the extensibility of
DRAM Bender by implementing it on five different FPGAs with DDR4 and DDR3
support. DRAM Bender is freely and openly available at
https://github.com/CMU-SAFARI/DRAM-Bender.
Comment: To appear in TCAD 202
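To illustrate the shape of a data-pattern sweep that such an infrastructure enables, the sketch below mocks the experiment loop in plain Python. `SimulatedDRAM` and its bit-flip rule are invented here purely for illustration; this is not the DRAM Bender API, and real RowHammer behavior is far more complex than this toy coupling model.

```python
# Toy sketch of a RowHammer data-pattern sweep: write a pattern, hammer
# an aggressor row, read the victim back, count flips. The DRAM model is
# a made-up stand-in, NOT the real DRAM Bender interface.

ROW_BITS = 64

class SimulatedDRAM:
    """Invented model: the victim's LSB flips when hammered hard enough
    while the aggressor holds the opposite value in that bit."""
    def __init__(self):
        self.rows = {}
    def write_row(self, row: int, data: int) -> None:
        self.rows[row] = data
    def read_row(self, row: int) -> int:
        return self.rows.get(row, 0)
    def hammer(self, aggr: int, count: int) -> None:
        victim = aggr + 1
        if count >= 100_000 and victim in self.rows:
            coupling = self.rows.get(aggr, 0) ^ self.rows[victim]
            self.rows[victim] ^= coupling & 0x1  # flip LSB if opposed

def count_flips(dram, pattern_victim: int, pattern_aggr: int) -> int:
    dram.write_row(0, pattern_aggr)      # aggressor row
    dram.write_row(1, pattern_victim)    # adjacent victim row
    dram.hammer(0, count=200_000)
    return bin(dram.read_row(1) ^ pattern_victim).count("1")

all_ones = (1 << ROW_BITS) - 1
for name, victim, aggr in [("solid", all_ones, all_ones),
                           ("inverted", all_ones, 0)]:
    print(name, count_flips(SimulatedDRAM(), victim, aggr))
```

In this toy model the "inverted" pattern (aggressor bits opposite the victim's) exposes a flip that the "solid" pattern misses, mirroring the observation above that the choice of data pattern determines which bit flips an experiment uncovers.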
Stacked Memory Architectures for Improving Performance and Capacity
Thesis (Ph.D.) -- Seoul National University, Graduate School of Convergence Science and Technology (Intelligent Convergence Systems), February 2019. Advisor: Jung Ho Ahn.

The advance of DRAM manufacturing technology is slowing down, whereas the density and performance needs of DRAM continue to increase. This demand has motivated the industry to explore emerging non-volatile memory (e.g., 3D XPoint) and high-density DRAM (e.g., Managed DRAM Solution). Since such memory technologies increase density at the cost of longer latency, lower bandwidth, or both, it is essential to pair them with fast memory (e.g., conventional DRAM) to which hot pages are transferred at runtime. Nonetheless, we observe that page transfers to fast memory often block memory channels from servicing application memory requests for long periods, which in turn significantly increases the high-percentile response time of latency-sensitive applications. In this thesis, we propose a high-density managed DRAM architecture, dubbed 3D-XPath, for applications demanding both low latency and high memory capacity. 3D-XPath DRAM stacks conventional DRAM dies with the high-density DRAM dies explored in this thesis and connects the dies through 3D-XPath. In particular, 3D-XPath allows unused memory channels to service application memory requests when the primary channels that would normally handle them are blocked by page transfers, considerably improving the high-percentile response time. It can also improve the throughput of applications that frequently copy memory blocks between kernel and user memory spaces. Our evaluation shows that 3D-XPath DRAM decreases the high-percentile response time of latency-sensitive applications by ~30% while improving the throughput of I/O-intensive applications by ~39%, compared with DRAM without 3D-XPath.
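The blocking effect described above can be sketched with a minimal discrete-time model. All cycle counts below are assumptions chosen for illustration, not figures from the thesis; the point is only that a long page transfer serializes requests behind it unless an idle alternate channel, in the spirit of 3D-XPath, can absorb them.

```python
# Minimal discrete-time sketch of why a page migration inflates tail
# latency, and how an otherwise-idle alternate channel helps.
# All cycle counts are assumptions made up for illustration.

PAGE_XFER = 100   # cycles the migration blocks the primary channel (assumed)
REQ_COST = 4      # cycles to service one application request (assumed)

def latencies(use_alt_path: bool, n_reqs: int = 50, gap: int = 8):
    """One request arrives every `gap` cycles; a migration starts at t=0
    and holds the primary channel for PAGE_XFER cycles."""
    primary_free, alt_free, lats = PAGE_XFER, 0, []
    for i in range(n_reqs):
        t = i * gap
        if use_alt_path and alt_free <= primary_free:
            start = max(t, alt_free)          # serve on the idle channel
            alt_free = start + REQ_COST
        else:
            start = max(t, primary_free)      # queue behind the migration
            primary_free = start + REQ_COST
        lats.append(start + REQ_COST - t)     # arrival-to-completion latency
    return lats

def p99(xs):
    return sorted(xs)[int(0.99 * len(xs)) - 1]

print("p99 latency, blocked:", p99(latencies(False)))
print("p99 latency, alt path:", p99(latencies(True)))
```

In this toy trace, requests queued behind the 100-cycle transfer see their 99th-percentile latency inflate by roughly 25x relative to the alternate-path case.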
Recent computer systems are evolving toward integrating more CPU cores into a single socket, which requires higher memory bandwidth and capacity. Increasing the number of channels per socket is a common answer to the bandwidth demand; to better utilize these additional channels, the data bus width is reduced and the burst length is increased. However, the longer burst length increases DRAM access latency. On the capacity side, process scaling has been the answer for decades, but cell capacitance now limits how small a cell can be. 3D-stacked memory addresses this problem by stacking dies on top of one another.
We make a key observation on a real multicore machine: the multiple memory controllers are not always fully utilized on the SPEC CPU 2006 rate benchmarks. To bring these idle channels into play, we propose a memory channel sharing architecture that boosts the peak bandwidth of a single memory channel and reduces burst latency on 3D-stacked memory. With channel sharing, overall performance on multi-programmed and multi-threaded workloads improves by up to 4.3% and 3.6%, respectively, and average read latency is reduced by up to 8.22% and 10.18%.

Contents
Abstract
Contents
List of Figures
List of Tables
Introduction
1.1 3D-XPath: High-Density Managed DRAM Architecture with Cost-effective Alternative Paths for Memory Transactions
1.2 Boosting Bandwidth - Dynamic Channel Sharing on 3D Stacked Memory
1.3 Research Contribution
1.4 Outline
3D-stacked Heterogeneous Memory Architecture with Cost-effective Extra Block Transfer Paths
2.1 Background
2.1.1 Heterogeneous Main Memory Systems
2.1.2 Specialized DRAM
2.1.3 3D-stacked Memory
2.2 High-density DRAM Architecture
2.2.1 Key Design Challenges
2.2.2 Plausible High-density DRAM Designs
2.3 3D-stacked DRAM with Alternative Paths for Memory Transactions
2.3.1 3D-XPath Architecture
2.3.2 3D-XPath Management
2.4 Experimental Methodology
2.5 Evaluation
2.5.1 OLDI Workloads
2.5.2 Non-OLDI Workloads
2.5.3 Sensitivity Analysis
2.6 Related Work
Boosting Bandwidth - Dynamic Channel Sharing on 3D Stacked Memory
3.1 Background: Memory Operations
3.1.1 Memory Controller
3.1.2 DRAM Column Access Sequence
3.2 Related Work
3.3 Channel Sharing Enabled Memory System
3.3.1 Hardware Requirements
3.3.2 Operation Sequence
3.4 Analysis
3.4.1 Experiment Environment
3.4.2 Performance
3.4.3 Overhead
Conclusion
References
Korean Abstract
- โฆ