462 research outputs found

    Doctor of Philosophy in Computing

    dissertatio

    Doctor of Philosophy

    dissertation
    The computing landscape is undergoing a major change, primarily enabled by ubiquitous wireless networks and the rapid increase in the use of mobile devices which access a web-based information infrastructure. It is expected that most intensive computing may either happen in servers housed in large datacenters (warehouse-scale computers), e.g., cloud computing and other web services, or in many-core high-performance computing (HPC) platforms in scientific labs. It is clear that the primary challenge to scaling such computing systems into the exascale realm is the efficient supply of large amounts of data to hundreds or thousands of compute cores, i.e., building an efficient memory system. Main memory systems are at an inflection point, due to the convergence of several major application and technology trends. Examples include the increasing importance of energy consumption, reduced access stream locality, increasing failure rates, limited pin counts, increasing heterogeneity and complexity, and the diminished importance of cost-per-bit. In light of these trends, the memory system requires a major overhaul. The key to architecting the next generation of memory systems is a combination of the prudent incorporation of novel technologies and a fundamental rethinking of certain conventional design decisions. In this dissertation, we study every major element of the memory system (the memory chip, the processor-memory channel, the memory access mechanism, and memory reliability) and identify the key bottlenecks to efficiency. Based on this, we propose a novel main memory system with the following innovative features: (i) overfetch-aware re-organized chips, (ii) low-cost silicon photonic memory channels, (iii) largely autonomous memory modules with a packet-based interface to the processor, and (iv) a RAID-based reliability mechanism. Such a system is energy-efficient, high-performance, low-complexity, reliable, and cost-effective, making it ideally suited to meet the requirements of future large-scale computing systems.

    Parallel convolution processing using an integrated photonic tensor core

    With the proliferation of ultra-high-speed mobile networks and internet-connected devices, along with the rise of artificial intelligence, the world is generating exponentially increasing amounts of data that need to be processed in a fast, efficient and smart way. These developments are pushing the limits of existing computing paradigms, and highly parallelized, fast and scalable hardware concepts are becoming progressively more important. Here, we demonstrate a computation-specific integrated photonic tensor core (the optical analog of an ASIC) capable of operating at speeds of tera multiply-accumulate operations per second (TMAC/s). The photonic core achieves parallelized photonic in-memory computing using phase-change memory arrays and photonic chip-based optical frequency combs (soliton microcombs). The computation is reduced to measuring the optical transmission of reconfigurable and non-resonant passive components and can operate at a bandwidth exceeding 14 GHz, limited only by the speed of the modulators and photodetectors. Given recent advances in hybrid integration of soliton microcombs at microwave line rates, ultra-low-loss silicon nitride waveguides, and high-speed on-chip detectors and modulators, our approach provides a path towards full CMOS wafer-scale integration of the photonic tensor core. While we focus on convolution processing, more generally our results indicate the major potential of integrated photonics for parallel, fast, and efficient computational hardware in demanding AI applications such as autonomous driving, live video processing, and next-generation cloud computing services.
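As a rough illustration of why convolution maps well onto multiply-accumulate hardware, the sketch below rewrites a 2D convolution (in the CNN sense, without kernel flipping) as a series of dot products between flattened image patches and a flattened kernel. It is a software analogy only and says nothing about how the photonic core is built; the function names `im2col` and `conv2d_as_matvec` are hypothetical.

```python
def im2col(image, k):
    """Flatten every k-by-k patch of `image` into one row (row-major order)."""
    h, w = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(k) for j in range(k)]
        for r in range(h - k + 1)
        for c in range(w - k + 1)
    ]


def conv2d_as_matvec(image, kernel):
    """Valid-mode 2D convolution expressed as a matrix-vector product.

    Each output pixel is one dot product, i.e. a chain of
    multiply-accumulate (MAC) operations.
    """
    k = len(kernel)
    weights = [v for row in kernel for v in row]  # flattened kernel
    patches = im2col(image, k)                    # one row per output pixel
    flat = [sum(p * w for p, w in zip(patch, weights)) for patch in patches]
    out_w = len(image[0]) - k + 1
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result = conv2d_as_matvec(image, kernel)
```

Because every output pixel is an independent dot product, the rows of the patch matrix can be processed in parallel, which is the property that parallel MAC hardware exploits.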

    The impact of global communication latency at extreme scales on Krylov methods

    Krylov Subspace Methods (KSMs) are popular numerical tools for solving large linear systems of equations. We consider their role in solving sparse systems on future massively parallel distributed memory machines by estimating the future performance of their constituent operations. To this end we construct a model that is simple but takes topology and network acceleration into account, as they are important considerations. We show that, as the number of nodes of a parallel machine increases to very large numbers, the increasing latency cost of reductions may well become a problematic bottleneck for traditional formulations of these methods. Finally, we discuss how pipelined KSMs can be used to tackle this potential problem, and what pipeline depths are appropriate.
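The scaling argument can be made concrete with a toy cost model. The sketch below is not the model from the paper: it assumes a binary-tree allreduce whose latency grows with the logarithm of the node count, two blocking reductions per iteration in the classical method, and one reduction per iteration in the pipelined variant that is overlapped with `depth` iterations of local work. All function names and the default per-hop latency are illustrative assumptions. Times are integers in nanoseconds to keep the arithmetic exact.

```python
def allreduce_latency_ns(nodes, hop_latency_ns=1000):
    """Latency of a binary-tree allreduce (reduce up, broadcast down)."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    tree_depth = (nodes - 1).bit_length()  # exact ceil(log2(nodes))
    return 2 * tree_depth * hop_latency_ns


def classical_iteration_ns(nodes, local_work_ns, reductions_per_iter=2):
    """Classical formulation: every reduction blocks the iteration."""
    return local_work_ns + reductions_per_iter * allreduce_latency_ns(nodes)


def pipelined_iteration_ns(nodes, local_work_ns, depth):
    """Pipelined formulation: one reduction hidden behind `depth` iterations."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    exposed = -(-allreduce_latency_ns(nodes) // depth)  # ceiling division
    return max(local_work_ns, exposed)


def min_pipeline_depth(nodes, local_work_ns):
    """Smallest depth at which the reduction latency is fully hidden."""
    if local_work_ns < 1:
        raise ValueError("local_work_ns must be at least 1")
    return max(1, -(-allreduce_latency_ns(nodes) // local_work_ns))
```

Under these assumptions, a machine with 2**20 nodes and 10 microseconds of local work per iteration spends most of a classical iteration waiting on reductions, while a pipeline depth of 4 is enough to hide the reduction latency entirely.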

    Stacked memory architecture for improved performance and capacity (성능과 용량 향상을 위한 적층형 메모리 구조)

    Doctoral thesis, Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligent Convergence Systems major), February 2019. 안정호 (Jung Ho Ahn).

    Advances in DRAM manufacturing technology are slowing down, whereas the density and performance demands on DRAM continue to increase. This has motivated the industry to explore emerging non-volatile memory (e.g., 3D XPoint) and high-density DRAM (e.g., Managed DRAM Solution). Since such memory technologies increase density at the cost of longer latency, lower bandwidth, or both, it is essential to use them together with fast memory (e.g., conventional DRAM) to which hot pages are transferred at runtime. Nonetheless, we observe that page transfers to fast memory often block memory channels from servicing memory requests from applications for long periods. This in turn significantly increases the high-percentile response time of latency-sensitive applications. In this thesis, we propose a high-density managed DRAM architecture, dubbed 3D-XPath, for applications demanding both low latency and high memory capacity. 3D-XPath DRAM stacks conventional DRAM dies with the high-density DRAM dies explored in this thesis and connects these dies with 3D-XPath. In particular, 3D-XPath allows unused memory channels to service memory requests from applications when the primary channels that would normally handle those requests are blocked by page transfers, considerably reducing the high-percentile response time. This can also improve the throughput of applications that frequently copy memory blocks between kernel and user memory spaces. Our evaluation shows that 3D-XPath DRAM decreases the high-percentile response time of latency-sensitive applications by ∼30% while improving the throughput of I/O-intensive applications by ∼39%, compared with DRAM without 3D-XPath.

    Recent computer systems are evolving toward the integration of more CPU cores into a single socket, which requires higher memory bandwidth and capacity. Increasing the number of channels per socket is a common solution to the bandwidth demand, and to better utilize the additional channels, data bus width is reduced and burst length is increased. However, the longer burst length increases DRAM access latency. On the memory capacity side, process scaling has been the answer for decades, but cell capacitance now limits how small a cell can be. 3D-stacked memory solves this problem by stacking dies on top of other dies. We made a key observation on a real multicore machine: the multiple memory controllers are not always fully utilized when running the SPEC CPU 2006 rate benchmark. To bring these idle channels into play, we propose a memory channel sharing architecture that boosts the peak bandwidth of a single memory channel and reduces the burst latency of 3D-stacked memory. With channel sharing, the overall performance of multi-programmed and multi-threaded workloads improves by up to 4.3% and 3.6%, respectively, and the average read latency is reduced by up to 8.22% and 10.18%.

    Contents
    Abstract
    Contents
    List of Figures
    List of Tables
    1 Introduction
      1.1 3D-XPath: High-Density Managed DRAM Architecture with Cost-effective Alternative Paths for Memory Transactions
      1.2 Boosting Bandwidth: Dynamic Channel Sharing on 3D Stacked Memory
      1.3 Research Contribution
      1.4 Outline
    2 3D-stacked Heterogeneous Memory Architecture with Cost-effective Extra Block Transfer Paths
      2.1 Background
        2.1.1 Heterogeneous Main Memory Systems
        2.1.2 Specialized DRAM
        2.1.3 3D-stacked Memory
      2.2 High-density DRAM Architecture
        2.2.1 Key Design Challenges
        2.2.2 Plausible High-density DRAM Designs
      2.3 3D-stacked DRAM with Alternative Paths for Memory Transactions
        2.3.1 3D-XPath Architecture
        2.3.2 3D-XPath Management
      2.4 Experimental Methodology
      2.5 Evaluation
        2.5.1 OLDI Workloads
        2.5.2 Non-OLDI Workloads
        2.5.3 Sensitivity Analysis
      2.6 Related Work
    3 Boosting Bandwidth: Dynamic Channel Sharing on 3D Stacked Memory
      3.1 Background: Memory Operations
        3.1.1 Memory Controller
        3.1.2 DRAM Column Access Sequence
      3.2 Related Work
      3.3 Channel Sharing Enabled Memory System
        3.3.1 Hardware Requirements
        3.3.2 Operation Sequence
      3.4 Analysis
        3.4.1 Experiment Environment
        3.4.2 Performance
        3.4.3 Overhead
    Conclusion
    References
    Abstract in Korean

    Resource and thermal management in 3D-stacked multi-/many-core systems

    Continuous semiconductor technology scaling and the rapid increase in computational needs have stimulated the emergence of multi-/many-core processors. While up to hundreds of cores can be placed on a single chip, the performance capacity of the cores cannot be fully exploited due to high latencies of interconnects and memory, high power consumption, and low manufacturing yield in traditional (2D) chips. 3D stacking is an emerging technology that aims to overcome these limitations of 2D designs by stacking processor dies over each other and using through-silicon-vias (TSVs) for on-chip communication, and thus, provides a large amount of on-chip resources and shortens communication latency. These benefits, however, are limited by challenges in high power densities and temperatures. 3D stacking also enables integrating heterogeneous technologies into a single chip. One example of heterogeneous integration is building many-core systems with silicon-photonic network-on-chip (PNoC), which reduces on-chip communication latency significantly and provides higher bandwidth compared to electrical links. However, silicon-photonic links are vulnerable to on-chip thermal and process variations. These variations can be countered by actively tuning the temperatures of optical devices through micro-heaters, but at the cost of substantial power overhead. This thesis claims that unearthing the energy efficiency potential of 3D-stacked systems requires intelligent and application-aware resource management. Specifically, the thesis improves energy efficiency of 3D-stacked systems via three major components of computing systems: cache, memory, and on-chip communication. We analyze characteristics of workloads in computation, memory usage, and communication, and present techniques that leverage these characteristics for energy-efficient computing. 
This thesis introduces 3D cache resource pooling, a cache design that allows for flexible heterogeneity in cache configuration across a 3D-stacked system and improves cache utilization and system energy efficiency. We also demonstrate the impact of resource pooling on a real prototype 3D system with scratchpad memory. At the main memory level, we claim that utilizing heterogeneous memory modules and memory-object-level management significantly helps with energy efficiency. This thesis proposes a memory management scheme at a finer granularity, the memory object level, and a page allocation policy to leverage the heterogeneity of available memory modules and cater to the diverse memory requirements of workloads. On the on-chip communication side, we introduce an approach to limit the power overhead of PNoC in (3D) many-core systems through cross-layer thermal management. Our proposed thermally-aware workload allocation policies, coupled with an adaptive thermal tuning policy, minimize the required thermal tuning power for PNoC and in this way help broader integration of PNoC. The thesis also introduces techniques in placement and floorplanning of optical devices to reduce optical loss and, thus, laser source power consumption.
    2018-03-09

    LOT-ECC: LOcalized and tiered reliability mechanisms for commodity memory systems

    pre-print
    Memory system reliability is a serious and growing concern in modern servers. Existing chipkill-level memory protection mechanisms suffer from several drawbacks. They activate a large number of chips on every memory access, which increases energy consumption and reduces performance due to the reduction in rank-level parallelism. Additionally, they increase access granularity, resulting in wasted bandwidth in the absence of sufficient access locality. They also restrict systems to use narrow-I/O x4 devices, which are known to be less energy-efficient than the wider x8 DRAM devices. In this paper, we present LOT-ECC, a localized and multi-tiered protection scheme that attempts to solve these problems. We separate error detection and error correction functionality, and employ simple checksum and parity codes effectively to provide strong fault-tolerance, while simultaneously simplifying implementation. Data and codes are localized to the same DRAM row to improve access efficiency. We use system firmware to store correction codes in DRAM data memory and modify the memory controller to handle data mapping. We thus build an effective fault-tolerance mechanism that provides strong reliability guarantees, activates as few chips as possible (reducing power consumption by up to 44.8% and reducing latency by up to 46.9%), and reduces circuit complexity, all while working with commodity DRAMs and operating systems. Finally, we propose the novel concept of a heterogeneous DIMM that enables the extension of LOT-ECC to x16 and wider DRAM parts.