309 research outputs found

    ํ”Œ๋ž˜์‹œ ๊ธฐ๋ฐ˜์˜ ๊ณ ์„ฑ๋Šฅ ์ปดํ“จํŒ… ์Šคํ† ๋ฆฌ์ง€ ์‹œ์Šคํ…œ์„ ์œ„ํ•œ ํšจ์œจ์ ์ธ ์ž…์ถœ๋ ฅ ๊ด€๋ฆฌ ๊ธฐ๋ฒ•

    Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: ์—„ํ˜„์ƒ.
    Most I/O traffic in high-performance computing (HPC) storage systems is dominated by the checkpoints and restarts of HPC applications. For such bursty I/O, new all-flash HPC storage systems with an integrated burst buffer (BB) and parallel file system (PFS) have been proposed. However, most of the distributed file systems (DFSs) used to configure these storage systems provide a single connection between a compute node and a server node, which hinders users from utilizing the high I/O bandwidth provided by an all-flash server node. To provide multiple connections, DFSs must be modified to increase the number of sockets, which is an extremely difficult and time-consuming task owing to their complicated structures. Users can increase the number of daemons in the DFSs to forcibly increase the number of connections without modifying the DFS. However, because each daemon has a mount point for its connection, there are multiple mount points in the compute nodes, and significant effort is required from users to distribute file I/O requests across the mount points. In addition, to avoid access to a PFS composed of low-speed storage devices such as hard disks, dedicated BB allocation is preferred despite its severe underutilization. Such an allocation method may be inappropriate, however, because all-flash HPC storage systems speed up access to the PFS. To handle these problems, we propose efficient user-transparent I/O management schemes for all-flash HPC storage systems. The first scheme, I/O transfer management, provides multiple connections between a compute node and a server node without additional effort from DFS developers or users. To do so, we modified the mount procedure and the I/O processing procedures in the virtual file system (VFS). In the second scheme, data management between the BB and the PFS, a BB over-subscription allocation method is adopted to improve BB utilization. Unfortunately, this allocation method aggravates I/O interference and the demotion overhead from the BB to the PFS, degrading checkpoint and restart performance. To minimize this degradation, we developed an I/O scheduler and a new data management policy based on checkpoint and restart characteristics. To prove the effectiveness of the proposed schemes, we evaluated our I/O transfer management and our data management between the BB and the PFS. The I/O transfer management scheme improves the write and read I/O throughputs for checkpoint and restart by up to 6 and 3 times, respectively, compared with a DFS using the original kernel. With the data management scheme, BB utilization improves by at least 2.2-fold, and a more stable and higher checkpoint performance is guaranteed. In addition, we achieved up to a 96.4% hit ratio for restart requests on the BB and up to 3.1 times higher restart performance than that of other existing methods.
    Abstract (translated from Korean): Most of the I/O bandwidth of HPC storage systems is consumed by the checkpoints and restarts of HPC applications. To smoothly handle such bursty I/O, new all-flash HPC storage systems that combine a burst buffer and a PFS built from high-end and low-end flash storage devices have been proposed. However, most of the distributed file systems used to build such storage systems provide only a single network connection between nodes, so the high flash I/O bandwidth that a server node can supply goes underutilized. To provide multiple network connections, either the distributed file system must be modified or the number of its client and server daemons must be increased. Because distributed file systems have very complicated structures, modifying them demands a great deal of time and effort from their developers. With the approach that increases the number of daemons, each network connection has its own mount point, so users must distribute file I/O requests across multiple mount points themselves, which requires enormous effort. When the number of connections is increased by adding server daemons, each server daemon has a different view of the file system directory tree, so users must be aware of the different server daemons and take care to avoid data conflicts. Moreover, users have traditionally preferred the dedicated BB allocation method, sacrificing BB efficiency, in order to avoid accessing a PFS composed of low-speed storage devices such as hard disks. In the new all-flash HPC storage systems, however, access to the parallel file system is fast, so this BB allocation method is no longer appropriate. To solve these problems, this dissertation introduces efficient management schemes for new all-flash HPC storage systems whose internal processing is not exposed to users. The first scheme, I/O transfer management, provides multiple connections between a compute node and a server node without additional effort from distributed file system developers or users. To this end, we modified the mount procedure and the I/O processing procedures of the virtual file system. The second scheme, data management, adopts the BB over-subscription allocation method to improve burst buffer utilization. However, this allocation method causes I/O contention among users and demotion overhead, which lowers checkpoint and restart performance. To prevent this, we manage the data in the burst buffer and the parallel file system based on the characteristics of checkpoints and restarts. To demonstrate the effectiveness of the proposed schemes, we built a real all-flash storage system, applied the proposed schemes, and evaluated their performance. The experiments show that the I/O transfer management scheme delivers up to 6 times higher write and up to 2 times higher read I/O performance than the existing approach. Compared with existing methods, the data management scheme improves burst buffer utilization by 2.2-fold. It also shows high and stable checkpoint performance and provides up to 3.1 times higher restart performance.
    Contents: Chapter 1 Introduction; Chapter 2 Background (2.1 Burst Buffer; 2.2 Virtual File System; 2.3 Network Bandwidth; 2.4 Mean Time Between Failures; 2.5 Checkpoint/Restart Characteristics); Chapter 3 Motivation (3.1 I/O Transfer Management for HPC Storage Systems: 3.1.1 Problems of Existing HPC Storage Systems, 3.1.2 Limitations of Existing Approaches; 3.2 Data Management for HPC Storage Systems: 3.2.1 Problems of Existing HPC Storage Systems, 3.2.2 Limitations with Existing Approaches); Chapter 4 Mulconn: User-Transparent I/O Transfer Management for HPC Storage Systems (4.1 Design and Architecture: 4.1.1 Overview, 4.1.2 Scale Up Connections, 4.1.3 I/O Scheduling, 4.1.4 Automatic Policy Decision; 4.2 Implementation: 4.2.1 File Open and Close, 4.2.2 File Write and Read; 4.3 Evaluation: 4.3.1 Experimental Environment, 4.3.2 I/O Throughputs Improvement, 4.3.3 Comparison between TtoS and TtoM, 4.3.4 Effectiveness of Our System; 4.4 Summary); Chapter 5 BBOS: User-Transparent Data Management for HPC Storage Systems (5.1 Design and Architecture: 5.1.1 Overview, 5.1.2 Data Management Engine; 5.2 Implementation: 5.2.1 In-memory Key-value Store, 5.2.2 I/O Engine, 5.2.3 Data Management Engine, 5.2.4 Stable Checkpoint and Demotion Performance; 5.3 Evaluation: 5.3.1 Experimental Environment, 5.3.2 Burst Buffer Utilization, 5.3.3 Checkpoint Performance, 5.3.4 Restart Performance; 5.4 Summary); Chapter 6 Related Work; Chapter 7 Conclusion; Abstract in Korean; Acknowledgments.
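
    The first scheme can be pictured with a small conceptual sketch: several connections to the same all-flash server node sit behind one mount point, and each file I/O request is routed to one of them, so users never juggle multiple mount points. This is only an illustrative user-space model in Python; the class and method names are hypothetical, and the thesis itself implements the idea inside the kernel VFS mount and I/O paths.

```python
# Conceptual sketch (not the thesis code): one logical mount that fans file
# I/O out over several connections to a single all-flash server node.
# Names (MultiConnectionMount, Connection) are hypothetical.
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Connection:
    conn_id: int
    inflight_bytes: int = 0   # bytes currently queued on this connection

    def submit(self, offset: int, data: bytes) -> None:
        # A real implementation would enqueue the request on this socket;
        # here we only account for the queued bytes.
        self.inflight_bytes += len(data)

class MultiConnectionMount:
    """One mount point backed by several connections to the same server
    node, so the server's aggregate flash bandwidth can be exploited."""

    def __init__(self, num_connections: int = 4):
        self.conns = [Connection(i) for i in range(num_connections)]
        self._rr = cycle(self.conns)

    def write_least_loaded(self, offset: int, data: bytes) -> None:
        # Pick the connection with the fewest in-flight bytes so a single
        # busy socket does not become the bottleneck.
        min(self.conns, key=lambda c: c.inflight_bytes).submit(offset, data)

    def write_round_robin(self, offset: int, data: bytes) -> None:
        next(self._rr).submit(offset, data)

if __name__ == "__main__":
    mnt = MultiConnectionMount(num_connections=4)
    for i in range(8):
        mnt.write_least_loaded(i * (1 << 20), b"x" * (1 << 20))  # 1 MiB chunks
    print([c.inflight_bytes for c in mnt.conns])  # bytes spread across sockets
```

    The two policies mirror the kind of scheduling choice the thesis automates: spreading requests evenly versus reacting to per-connection load, all hidden behind a single namespace.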

    Improving Non-Volatile Memory Lifetime through Temporal Wear-Limiting

    Non-volatile memory technologies provide a low-power, high-density alternative to traditional DRAM main memories, yet all suffer from some degree of limited write endurance. The non-uniformity of write traffic exacerbates this limited endurance, causing write-induced wear to concentrate on a few specific lines. Wear-leveling attempts to mitigate this issue by distributing write-induced wear uniformly across the memory. Orthogonally, wear-limiting attempts to increase memory lifetime by directly reducing wear. In this paper, we present the concept of temporal wear-limiting, in which we exploit the trade-off between write latency and memory lifetime. Using a history of the slack between per-bank write operations, we predict future write latency, allowing for up to a 1.5x memory lifetime improvement. We present two extensions for improving the effectiveness of this history-based mechanism: a method for dynamically determining the optimum history size, and a method for increasing lifetime improvement through address prediction.
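
    The history-based mechanism lends itself to a compact sketch: keep the most recent slack values observed between writes to a bank and, when the predicted slack is large enough to hide a slower write, issue the next write in a slower, lower-wear mode. The history length, the two write modes, and the minimum-slack predictor below are illustrative assumptions rather than the paper's actual parameters.

```python
# Illustrative sketch of temporal wear-limiting (assumed parameters): predict
# the slack after a write to a bank from recent history, and pick a slower,
# lower-wear write mode when the predicted slack can hide it.
from collections import deque

FAST_WRITE_NS = 150   # hypothetical fast (high-wear) write latency
SLOW_WRITE_NS = 300   # hypothetical slow (low-wear) write latency

class BankWearLimiter:
    def __init__(self, history_size: int = 8):
        self.slack_history = deque(maxlen=history_size)
        self.last_write_ns = None

    def record_write(self, now_ns: int) -> None:
        if self.last_write_ns is not None:
            self.slack_history.append(now_ns - self.last_write_ns)
        self.last_write_ns = now_ns

    def predicted_slack(self) -> int:
        # Conservative predictor: the smallest slack seen recently estimates
        # how long the bank will stay idle after the current write.
        return min(self.slack_history, default=0)

    def choose_write_latency(self) -> int:
        # Use the slow, lower-wear write only if the predicted slack hides it.
        if self.predicted_slack() >= SLOW_WRITE_NS:
            return SLOW_WRITE_NS
        return FAST_WRITE_NS

if __name__ == "__main__":
    bank = BankWearLimiter(history_size=8)
    for t in (0, 400, 800, 1200, 1600):
        bank.record_write(t)
        print(t, bank.choose_write_latency())
```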

    Autonomic management of virtualized resources in cloud computing

    The last five years have witnessed a rapid growth of cloud computing in business, governmental, and educational IT deployment. The success of cloud services depends critically on the effective management of virtualized resources. A key requirement of cloud management is the ability to dynamically match resource allocations to actual demands. To this end, we aim to design and implement a cloud resource management mechanism that manages underlying complexity, automates resource provisioning, and controls client-perceived quality of service (QoS) while still achieving resource efficiency. The design of an automatic resource management mechanism centers on two questions: when to adjust resource allocations and how much to adjust. In a cloud, applications have different definitions of capacity, and cloud dynamics make it difficult to determine a static resource-to-performance relationship. In this dissertation, we have proposed a generic metric that measures application capacity, designed model-independent and adaptive approaches to manage resources, and built a cloud management system scalable to a cluster of machines. To understand web system capacity, we propose to use a productivity index (PI), defined as the ratio of yield to cost, to measure the system processing capability online. PI is a generic concept that can be applied at different levels to monitor system progress in order to identify whether more capacity is needed. We applied the concept of PI to the problem of overload prevention in multi-tier websites. The overload predictor built on the PI metric shows more accurate and responsive overload prevention compared to conventional approaches. To address the lack of an accurate server model, we propose a model-independent, fuzzy-control-based approach for CPU allocation. For adaptive and stable control performance, we embed the controller with self-tuning output amplification and flexible rule selection. Finally, we build a QoS provisioning framework that supports multi-objective QoS control and service differentiation. Experiments on a virtual cluster with two service classes show the effectiveness of our approach in both performance and power control. To address the complex interplay between resources and process delays in fine-grained multi-resource allocation, we consider capacity management as a decision-making problem and employ reinforcement learning (RL) to optimize the process. The optimization depends on trial-and-error interactions with the cloud system. In order to improve the initial management performance, we propose a model-based RL algorithm. The neural-network-based environment model, which is learned from previous management history, generates simulated resource allocations for the RL agent. Experimental results on heterogeneous applications show that our approach makes efficient use of limited interactions and finds near-optimal resource configurations within 7 steps. Finally, we present a distributed reinforcement learning approach to cluster-wide cloud resource management. We decompose the cluster-wide resource allocation problem into sub-problems concerning individual VM resource configurations. The cluster-wide allocation is optimized if individual VMs meet their SLAs with high resource utilization. For scalability, we develop an efficient reinforcement learning approach with a continuous state space. For adaptability, we use low-level VM runtime statistics to accommodate workload dynamics. Prototyped in the iBalloon system, the distributed learning approach successfully manages 128 VMs on a 16-node closely correlated cluster.
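
    The productivity index described above is the ratio of yield to cost; a minimal sketch of how such a metric could feed an overload check is shown below. The particular definitions of yield (requests completed within their deadline) and cost (CPU-seconds), the window, and the drop threshold are assumptions for illustration, not the dissertation's exact formulation.

```python
# Minimal sketch of a productivity-index (PI) style check: PI = yield / cost.
# Yield and cost definitions here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class IntervalStats:
    completed_in_time: int   # requests finished within their SLA deadline
    cpu_seconds: float       # CPU time consumed during the interval

def productivity_index(stats: IntervalStats) -> float:
    if stats.cpu_seconds == 0:
        return 0.0
    return stats.completed_in_time / stats.cpu_seconds

def needs_more_capacity(history: list, drop_ratio: float = 0.8) -> bool:
    """Flag overload when the latest PI falls well below the recent average,
    i.e. extra load is no longer producing proportional yield."""
    if len(history) < 2:
        return False
    recent = productivity_index(history[-1])
    baseline = sum(productivity_index(s) for s in history[:-1]) / (len(history) - 1)
    return baseline > 0 and recent < drop_ratio * baseline

if __name__ == "__main__":
    window = [IntervalStats(950, 10.0), IntervalStats(960, 10.5), IntervalStats(600, 12.0)]
    print(needs_more_capacity(window))   # True: PI collapsed in the last interval
```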

    Improving Performance and Endurance for Crossbar Resistive Memory

    Resistive Memory (ReRAM) has emerged as a promising non-volatile memory technology that may replace a significant portion of DRAM in future computer systems. When adopting a crossbar architecture, a ReRAM cell can achieve the smallest theoretical size in fabrication, making it ideal for constructing dense, large-capacity memory. However, the crossbar cell structure suffers from severe performance and endurance degradation caused by large voltage drops on long wires. In this dissertation, I first study the correlation between the ReRAM cell switching latency and the number of cells in the low-resistance state (LRS) along bitlines, and propose to dynamically speed up write operations based on bitline data patterns. By leveraging the intrinsic in-memory processing capability of ReRAM crossbars, a low-overhead runtime profiler that effectively tracks the data patterns in different bitlines is proposed. To achieve further write latency reduction, data compression and a row-address-dependent memory data layout are employed to reduce the number of LRS cells on bitlines. Moreover, two optimization techniques are presented to mitigate the energy overhead introduced by bitline data-pattern tracking. Second, I propose XWL, a novel table-based wear-leveling scheme for ReRAM crossbars, and study the correlation between write endurance and voltage stress in ReRAM crossbars. By estimating and tracking the effective write stress to different rows at runtime, XWL chooses the most heavily stressed rows for mitigation. Additionally, two extended scenarios are examined for the performance and endurance issues in neural-network accelerators as well as 3D vertical ReRAM (3D-VRAM) arrays. For the ReRAM crossbar-based accelerators, by exploiting the wear-out mechanism of the ReRAM cell, a novel comprehensive framework, ReNEW, is proposed to enhance the lifetime of ReRAM crossbar-based accelerators, particularly for neural-network training. To reduce the write latency in 3D-VRAM arrays, a collection of techniques is devised, including an in-memory data encoding scheme, a data-pattern estimator for assessing cell resistance distributions, and a write-time reduction scheme that opportunistically reduces RESET latency based on runtime data patterns.
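
    The bitline data-pattern idea admits a small sketch: the more cells along a bitline sit in the low-resistance state, the larger the voltage drop, so only lightly loaded bitlines can safely take a shortened write pulse. The counters, threshold, and latencies below are illustrative assumptions, not the dissertation's in-memory profiler.

```python
# Illustrative sketch (assumed thresholds/latencies): track how many cells on
# each bitline are in the low-resistance state (LRS) and shorten the write
# pulse only when the LRS count, and hence the IR drop, is small.
SLOW_WRITE_NS = 50
FAST_WRITE_NS = 30

class BitlinePatternTracker:
    def __init__(self, num_bitlines: int, rows: int, lrs_threshold: float = 0.5):
        self.lrs_count = [0] * num_bitlines   # LRS cells per bitline
        self.rows = rows
        self.lrs_threshold = lrs_threshold

    def on_cell_write(self, bitline: int, old_is_lrs: bool, new_is_lrs: bool) -> None:
        # Keep the per-bitline LRS counter in sync with cell state changes.
        self.lrs_count[bitline] += int(new_is_lrs) - int(old_is_lrs)

    def write_latency(self, bitline: int) -> int:
        # Few LRS cells -> small voltage drop on the bitline -> a shorter
        # write pulse still switches the cell reliably.
        if self.lrs_count[bitline] < self.lrs_threshold * self.rows:
            return FAST_WRITE_NS
        return SLOW_WRITE_NS

if __name__ == "__main__":
    tracker = BitlinePatternTracker(num_bitlines=4, rows=128)
    for _ in range(100):
        tracker.on_cell_write(bitline=0, old_is_lrs=False, new_is_lrs=True)
    print(tracker.write_latency(0), tracker.write_latency(1))   # 50 30
```

    Data compression and a row-address-dependent layout, as described above, would act on the same counters by keeping them low in the first place.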

    FPGA acceleration of structured-mesh-based explicit and implicit numerical solvers using SYCL

    We explore the design and development of structured-mesh-based solvers on current Intel FPGA hardware using the SYCL programming model. Two classes of applications are targeted: (1) stencil applications based on explicit numerical methods and (2) multidimensional tridiagonal solvers based on implicit methods. Both classes of solvers appear as core modules in a wide range of real-world applications, from CFD to financial computing. A general, unified workflow is formulated for synthesizing them on Intel FPGAs, together with predictive analytic models to explore the design space and obtain near-optimal performance. The performance of designs synthesized using the above techniques for two non-trivial applications is benchmarked on an Intel PAC D5005 FPGA card. Results are compared to the performance of optimized parallel implementations of the same applications on an Nvidia V100 GPU. Observed runtimes indicate that the FPGA provides better or matching performance compared to the V100 GPU. More importantly, the FPGA solutions consume 59%-76% less energy for their largest configurations, making them highly attractive for solving workloads based on these applications in production settings. The performance model predicts the runtime of designs with high accuracy, with less than 5% error for all cases tested, demonstrating its significant utility for design-space exploration. With these tools and techniques, we discuss the determinants for a given structured-mesh code to be amenable to FPGA implementation, providing insights into the feasibility and profitability of a design, how it can be codified using SYCL, and the resulting performance.
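
    The two solver classes targeted above reduce to well-known numerical kernels: an explicit stencil sweep and a tridiagonal solve (Thomas algorithm) at the heart of the implicit methods. The NumPy sketch below shows only these reference computations as a point of comparison; it is not the SYCL/FPGA implementation described in the paper.

```python
# Reference sketches of the two solver classes (plain NumPy, not SYCL):
# an explicit 3-point stencil sweep and the Thomas algorithm for a
# tridiagonal system, as used inside an implicit solver.
import numpy as np

def stencil_sweep(u: np.ndarray, c: float = 0.25) -> np.ndarray:
    """One explicit 3-point stencil update (e.g. a 1D heat-equation step)."""
    out = u.copy()
    out[1:-1] = u[1:-1] + c * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    return out

def thomas(a: np.ndarray, b: np.ndarray, c: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c and right-hand side d (a[0] and c[-1] are unused)."""
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                  # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):         # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

if __name__ == "__main__":
    u = np.zeros(16); u[8] = 1.0
    print(stencil_sweep(u)[7:10])          # the spike diffuses: [0.25 0.5 0.25]
    n = 8
    a = np.full(n, -1.0); b = np.full(n, 2.0); c = np.full(n, -1.0)
    x = thomas(a, b, c, np.ones(n))
    A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
    print(np.allclose(A @ x, np.ones(n)))  # True: the solve is verified
```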
    • โ€ฆ
    corecore