276 research outputs found

    Exploiting Inter- and Intra-Memory Asymmetries for Data Mapping in Hybrid Tiered-Memories

    Modern computing systems are embracing hybrid memory comprising DRAM and non-volatile memory (NVM) to combine the best properties of both memory technologies, achieving low latency, high reliability, and high density. A prominent characteristic of DRAM-NVM hybrid memory is that NVM access latency is much higher than DRAM access latency. We call this inter-memory asymmetry. We observe that parasitic components on a long bitline are a major source of high latency in both DRAM and NVM, and a significant contributor to the high-voltage operations in NVM that degrade its reliability. We propose an architectural change in which each long bitline in DRAM and NVM is split into two segments by an isolation transistor. One segment can be accessed with lower latency and operating voltage than the other. By introducing these tiers, we enable non-uniform accesses within each memory type (which we call intra-memory asymmetry), leading to performance and reliability trade-offs in DRAM-NVM hybrid memory. We extend existing NVM-DRAM OS support in three ways. First, we exploit both inter- and intra-memory asymmetries to allocate and migrate memory pages between the tiers in DRAM and NVM. Second, we improve the OS's page allocation decisions by predicting the access intensity of a newly referenced memory page in a program and placing it in a matching tier during its initial allocation. This minimizes page migrations during program execution, lowering the performance overhead. Third, we propose a solution to migrate pages between the tiers of the same memory without transferring data over the memory channel, minimizing channel occupancy and improving performance. Our overall approach, which we call MNEME, enables and exploits asymmetries in DRAM-NVM hybrid tiered memory and improves both performance and reliability for single-core and multi-programmed workloads. Comment: 15 pages, 29 figures, accepted at ACM SIGPLAN International Symposium on Memory Managemen
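
    The allocation step described in this abstract (predict a new page's access intensity, then place it in a matching tier) can be illustrated with a minimal sketch. The tier names, the intensity score in [0, 1], the thresholds, and the fallback rule are all assumptions made for illustration; they are not taken from the MNEME paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Tier:
    name: str          # hypothetical tier label, fastest tiers listed first
    free_pages: int    # remaining capacity in pages


def choose_tier(predicted_intensity: float, tiers: List[Tier],
                thresholds: Sequence[float] = (0.75, 0.5, 0.25)) -> str:
    """Place a new page in the tier matching its predicted access intensity.

    predicted_intensity is an assumed score in [0, 1]; the thresholds are
    illustrative cut points that split pages into len(thresholds) + 1 classes.
    """
    # Rank 0 is the hottest class; each threshold the score falls below
    # pushes the page one tier further down.
    rank = sum(predicted_intensity < t for t in thresholds)
    # If the matching tier is full, fall back to the next slower tier.
    for tier in tiers[rank:]:
        if tier.free_pages > 0:
            tier.free_pages -= 1
            return tier.name
    raise MemoryError("no free pages in the matching or slower tiers")


tiers = [Tier("dram_near", 1), Tier("dram_far", 1),
         Tier("nvm_near", 1), Tier("nvm_far", 1)]
print(choose_tier(0.9, tiers))  # hottest class goes to the fastest tier
```

    Placing a page correctly at allocation time is what avoids a later migration, which is the overhead the abstract aims to reduce.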

    Stacked Memory Architecture for Improved Performance and Capacity (성능과 용량 향상을 위한 적층형 메모리 구조)

    Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligent Convergence Systems major), 2019. 2. Jung Ho Ahn. The advance of DRAM manufacturing technology is slowing down, whereas the density and performance demands on DRAM continue to increase. This has motivated the industry to explore emerging non-volatile memory (e.g., 3D XPoint) and high-density DRAM (e.g., Managed DRAM Solution). Since such memory technologies increase density at the cost of longer latency, lower bandwidth, or both, it is essential to use them together with fast memory (e.g., conventional DRAM) to which hot pages are transferred at runtime. Nonetheless, we observe that page transfers to fast memory often block memory channels from servicing memory requests from applications for a long period. This in turn significantly increases the high-percentile response time of latency-sensitive applications. In this thesis, we propose a high-density managed DRAM architecture, dubbed 3D-XPath, for applications demanding both low latency and high memory capacity. 3D-XPath DRAM stacks conventional DRAM dies with the high-density DRAM dies explored in this thesis and connects these dies with 3D-XPath. In particular, 3D-XPath allows unused memory channels to service memory requests from applications when the primary channels that should handle those requests are blocked by page transfers, considerably reducing the high-percentile response time. This can also improve the throughput of applications that frequently copy memory blocks between kernel and user memory spaces. Our evaluation shows that 3D-XPath DRAM decreases the high-percentile response time of latency-sensitive applications by ∼30% while improving the throughput of I/O-intensive applications by ∼39%, compared with DRAM without 3D-XPath. Recent computer systems are evolving toward the integration of more CPU cores into a single socket, which requires higher memory bandwidth and capacity.
Increasing the number of channels per socket is a common solution to the bandwidth demand, and to better utilize these additional channels, the data bus width is reduced and the burst length is increased. However, the longer burst length increases DRAM access latency. On the memory capacity side, process scaling has been the answer for decades, but cell capacitance now limits how small a cell can be. 3D stacked memory addresses this problem by stacking dies on top of one another. We made a key observation on a real multicore machine that the multiple memory controllers are not always fully utilized on the SPEC CPU 2006 rate benchmark. To bring these idle channels into play, we propose a memory channel sharing architecture that boosts the peak bandwidth of one memory channel and reduces the burst latency on 3D stacked memory. With channel sharing, the total performance on multi-programmed and multi-threaded workloads improves by up to 4.3% and 3.6%, respectively, and the average read latency is reduced by up to 8.22% and 10.18%.
Korean abstract (국문초록): While advances in DRAM manufacturing technology are slowing, the density and performance demands on DRAM keep growing. These demands have led to the emergence of new non-volatile memory (e.g., 3D-XPoint) and high-density DRAM (e.g., Managed asymmetric latency DRAM Solution). Because these high-density memory technologies gain density at the cost of longer latency, lower bandwidth, or both, their performance is poor, so they are generally used together with a small-capacity fast memory (e.g., conventional DRAM) into which hot pages are swapped. During swapping, page transfers to the fast memory prevent the memory requests of ordinary applications from being serviced for long periods, which significantly increases the percentile response time of latency-sensitive applications and the standard deviation of response time. To address this problem, this thesis proposes 3D-XPath, a high-density managed DRAM architecture, for applications that require low latency and high memory capacity. DRAM integrating 3D-XPath stacks slow high-density DRAM dies together with conventional DRAM dies in one chip, and the DRAM dies are connected through the proposed 3D-XPath hardware. 3D-XPath lets hot-page swaps be handled over lightly used memory channels without blocking application memory requests while a swap is in progress, improving the percentile response time of data-intensive applications. The proposed hardware structure can additionally improve the throughput of applications that frequently copy memory blocks between the OS kernel and user space. Compared with DRAM without 3D-XPath, 3D-XPath DRAM improves the throughput of I/O-intensive applications by up to 39% while reducing the high-percentile response time of latency-sensitive applications by up to 30%. In addition, recent computer systems are evolving toward integrating more CPU cores, which need more memory bandwidth and capacity, into a single socket. Increasing the number of channels per socket is a common solution to the bandwidth demand, and in recent DRAM interfaces the data bus width is reduced and the burst length is increased to make better use of the added channels. However, the longer burst length increases DRAM access latency. Furthermore, recent applications demand more memory capacity. Process scaling has been used for decades to increase capacity, but below 20 nm it is difficult to increase memory density further through scaling, so stacked memory is used to increase capacity instead. In this setting, we observed that when SPEC CPU 2006 applications are run on a real modern multicore machine, the system's memory controllers are not always all fully utilized. To use these idle channels, raise the peak bandwidth of a single memory channel, and reduce the burst latency of 3D stacked memory, this thesis proposes a memory channel sharing architecture and its hardware block. With channel sharing, the performance of multi-programmed and multi-threaded applications improved by 4.3% and 3.6%, respectively, and the average read latency decreased by 8.22% and 10.18%.
Contents:
Abstract
Contents
List of Figures
List of Tables
Introduction
1.1 3D-XPath: High-Density Managed DRAM Architecture with Cost-effective Alternative Paths for Memory Transactions
1.2 Boosting Bandwidth – Dynamic Channel Sharing on 3D Stacked Memory
1.3 Research contribution
1.4 Outline
3D-stacked Heterogeneous Memory Architecture with Cost-effective Extra Block Transfer Paths
2.1 Background
2.1.1 Heterogeneous Main Memory Systems
2.1.2 Specialized DRAM
2.1.3 3D-stacked Memory
2.2 HIGH-DENSITY DRAM ARCHITECTURE
2.2.1 Key Design Challenges
2.2.2 Plausible High-density DRAM Designs
2.3 3D-STACKED DRAM WITH ALTERNATIVE PATHS FOR MEMORY TRANSACTIONS
2.3.1 3D-XPath Architecture
2.3.2 3D-XPath Management
2.4 EXPERIMENTAL METHODOLOGY
2.5 EVALUATION
2.5.1 OLDI Workloads
2.5.2 Non-OLDI Workloads
2.5.3 Sensitivity Analysis
2.6 RELATED WORK
Boosting bandwidth – Dynamic Channel Sharing on 3D Stacked Memory
3.1 Background: Memory Operations
3.1.1 Memory Controller
3.1.2 DRAM column access sequence
3.2 Related Work
3.3 CHANNEL SHARING ENABLED MEMORY SYSTEM
3.3.1 Hardware Requirements
3.3.2 Operation Sequence
3.4 Analysis
3.4.1 Experiment Environment
3.4.2 Performance
3.4.3 Overhead
CONCLUSION
REFERENCES
Korean Abstract (국문초록)
Docto
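
    The trade-off this abstract mentions between a narrower data bus and a longer burst can be checked with simple arithmetic. The sketch below is not from the thesis: the bus widths, burst lengths, and the 3200 MT/s data rate are assumed example values, and the model counts only the data-transfer beats of one burst.

```python
def bytes_per_burst(bus_width_bits: int, burst_length: int) -> int:
    """Bytes delivered by one burst: one bus-width beat per transfer."""
    return bus_width_bits // 8 * burst_length


def burst_time_ns(burst_length: int, data_rate_mtps: float) -> float:
    """Time the data bus is occupied by one burst, in nanoseconds.

    data_rate_mtps is the transfer rate in mega-transfers per second.
    """
    return burst_length / data_rate_mtps * 1000.0


# Assumed example: halving the bus width while doubling the burst length
# keeps the bytes per burst constant but doubles the time the burst takes.
for width, bl in ((64, 8), (32, 16)):
    print(width, bl, bytes_per_burst(width, bl), burst_time_ns(bl, 3200.0))
```

    Under these assumed values, both configurations move a 64-byte block per burst, but the narrower channel needs twice as long to do it, which is the added latency the abstract refers to.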

    Design Guidelines for High-Performance SCM Hierarchies

    With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM. Nevertheless, the latency-sensitive nature of memory-resident services makes seamless integration of SCM in servers questionable. In this paper, we ask how best to introduce SCM for such servers to improve overall performance/cost over existing DRAM-only architectures. We first show that even with the most optimistic latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deploying a modestly sized, high-bandwidth 3D stacked DRAM cache makes the performance of an SCM-mostly memory system competitive. The high degree of spatial locality that memory-resident services exhibit not only simplifies the DRAM cache's design as page-based, but also enables the amortization of increased SCM access latencies and the mitigation of SCM's read/write latency disparity. We identify the set of memory hierarchy design parameters that play a key role in the performance and cost of a memory system combining an SCM technology and a 3D stacked DRAM cache. We then introduce a methodology to drive provisioning for each of these design parameters under a target performance/cost goal. Finally, we use our methodology to derive concrete results for specific SCM technologies. With PCM as a case study, we show that a two bits/cell technology hits the performance/cost sweet spot, reducing the memory subsystem cost by 40% while keeping performance within 3% of the best-performing DRAM-only system, whereas single-level and triple-level cell organizations are impractical for use as memory replacements. Comment: Published at MEMSYS'1
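
    The way a DRAM cache hides SCM latency can be illustrated with a standard hit-rate-weighted average. This is a textbook first-order model, not the methodology of the paper, and the latencies used below are placeholder values chosen only for the example.

```python
def effective_latency_ns(hit_rate: float, dram_cache_ns: float, scm_ns: float) -> float:
    """Average access latency when a fraction hit_rate of accesses hit the DRAM cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * dram_cache_ns + (1.0 - hit_rate) * scm_ns


def required_hit_rate(target_ns: float, dram_cache_ns: float, scm_ns: float) -> float:
    """Hit rate needed to reach target_ns, obtained by solving the formula above."""
    if scm_ns <= dram_cache_ns:
        raise ValueError("this model assumes SCM is slower than the DRAM cache")
    return (scm_ns - target_ns) / (scm_ns - dram_cache_ns)


# Placeholder latencies: 50 ns for the DRAM cache, 250 ns for SCM.
print(effective_latency_ns(0.9, 50.0, 250.0))
print(required_hit_rate(70.0, 50.0, 250.0))
```

    The model ignores bandwidth, queuing, and the read/write asymmetry the abstract mentions, so it only shows why high spatial locality (a high hit rate) makes an SCM-mostly system viable.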

    Principled Approaches to Last-Level Cache Management

    Memory is a critical component of all computing systems. It represents a fundamental performance and energy bottleneck. Ideally, memory aspects such as energy cost, performance, and the cost of implementing management techniques would scale together with the size of all different computing systems; unfortunately, this is not the case. With upcoming trends in applications, new memory technologies, etc., scaling becomes a bigger problem, aggravating the performance bottleneck that memory represents. A memory hierarchy was proposed to alleviate the problem. Each level in the hierarchy tends to have a lower cost per bit, a larger capacity, and a higher access time than the previous level. Ideally, all data would be stored in the fastest level of memory. Unfortunately, faster memory technologies tend to have a higher manufacturing cost, which often limits their capacity. The design challenge is to determine which data is frequently used and to store it in the faster levels of memory. A cache is a small, fast, on-chip chunk of memory. Any data stored in main memory can be stored in the cache. Many programs typically access data that has been accessed previously. Taking advantage of this behavior, a copy of frequently accessed data is kept in the cache to provide a faster access time the next time it is requested. Due to capacity constraints, it is likely that not all of the frequently reused data can fit in the cache. Because of this, cache management policies decide which data is kept in the cache and which is kept in other levels of the memory hierarchy. Under an efficient cache management policy, a large share of memory requests will be serviced from a fast on-chip cache. The disparity in access latency between the last-level cache and main memory motivates the search for efficient cache management policies.
There is a great amount of recently proposed work that strives to utilize cache capacity in the way most favorable to performance. Related work focuses on optimizing cache performance through different possible solutions, e.g., reducing the miss rate, consuming less power, reducing storage overhead, and reducing access latency. Our work focuses on improving the performance of last-level caches by designing policies based on principles adapted from other areas of interest. In this dissertation, we focus on several aspects of cache management policies. We first introduce a space-efficient placement and promotion policy whose goal is to minimize the updates to the replacement policy state on each cache access. We further introduce a mechanism that predicts whether a block in the cache will be reused; it feeds different features of a block to the predictor in order to increase the correlation of a previous access to a future access. We later introduce a technique that tweaks traditional cache indexing, providing fast accesses to a vast majority of requests in the presence of a slow-access memory technology such as DRAM
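
    The role of a cache management policy described in this abstract can be illustrated with a small simulation. The sketch below uses plain least-recently-used (LRU) replacement as a baseline. It is not one of the policies proposed in the dissertation, and the access trace is a made-up example.

```python
from collections import OrderedDict
from typing import Hashable, Sequence


def lru_hit_rate(trace: Sequence[Hashable], capacity: int) -> float:
    """Fraction of accesses in trace that hit a fully associative LRU cache."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits / len(trace) if trace else 0.0


# A cyclic trace over three blocks thrashes a two-entry LRU cache
# but fits entirely in a three-entry one.
trace = [1, 2, 3, 1, 2, 3]
print(lru_hit_rate(trace, 2), lru_hit_rate(trace, 3))
```

    The cyclic trace shows why capacity constraints make the replacement decision matter: with one entry too few, LRU evicts each block just before it is reused, which is the kind of behavior that reuse prediction aims to avoid.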

    HW-Flow: A Multi-Abstraction Level HW-CNN Codesign Pruning Methodology

    Convolutional neural networks (CNNs) have produced unprecedented accuracy for many computer vision problems in the recent past. On power- and compute-constrained embedded platforms, deploying modern CNNs can present many challenges. Most CNN architectures do not run in real time due to the high number of computational operations involved in the inference phase. This emphasizes the role of CNN optimization techniques in early design space exploration. To estimate their efficacy in satisfying the target constraints, existing techniques are either hardware (HW) agnostic, pseudo-HW-aware by considering parameter and operation counts, or HW-aware through inflexible hardware-in-the-loop (HIL) setups. In this work, we introduce HW-Flow, a framework for optimizing and exploring CNN models based on three levels of hardware abstraction: Coarse, Mid and Fine. Through these levels, CNN design and optimization can be iteratively refined towards efficient execution on the target hardware platform. We present HW-Flow in the context of CNN pruning by augmenting a reinforcement learning agent with key metrics to understand the influence of its pruning actions on the inference hardware. With a 2× reduction in energy and latency, we prune ResNet56, ResNet50, and DeepLabv3 with minimal accuracy degradation on the CIFAR-10, ImageNet, and CityScapes datasets, respectively