
    ํ”Œ๋ž˜์‹œ ๊ธฐ๋ฐ˜์˜ ๊ณ ์„ฑ๋Šฅ ์ปดํ“จํŒ… ์Šคํ† ๋ฆฌ์ง€ ์‹œ์Šคํ…œ์„ ์œ„ํ•œ ํšจ์œจ์ ์ธ ์ž…์ถœ๋ ฅ ๊ด€๋ฆฌ ๊ธฐ๋ฒ•

Doctoral dissertation -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, College of Engineering, August 2020. Advisor: ์—„ํ˜„์ƒ.

Most I/O traffic in high performance computing (HPC) storage systems is dominated by the checkpoints and restarts of HPC applications. For such bursty I/O, new all-flash HPC storage systems that integrate a burst buffer (BB) and a parallel file system (PFS) have been proposed. However, most of the distributed file systems (DFS) used to configure these storage systems provide only a single connection between a compute node and a server node, which prevents users from utilizing the high I/O bandwidth an all-flash server node can provide. To provide multiple connections, a DFS must be modified to increase the number of sockets, which is an extremely difficult and time-consuming task owing to its complicated structure. Users can instead increase the number of daemons in the DFS to forcibly increase the number of connections without modifying the DFS. However, because each daemon has its own mount point for its connection, a compute node ends up with multiple mount points, and users must spend significant effort distributing file I/O requests across them. In addition, to avoid accessing a PFS composed of low-speed storage devices such as hard disks, dedicated BB allocation has been preferred despite its severe underutilization of the BB. Such an allocation method may be inappropriate for all-flash HPC storage systems, which speed up access to the PFS.

To handle these problems, we propose efficient, user-transparent I/O management schemes for all-flash HPC storage systems. The first scheme, I/O transfer management, provides multiple connections between a compute node and a server node without additional effort from DFS developers or users. To do so, we modified the mount procedure and the I/O processing procedures in the virtual file system (VFS). The second scheme, data management between the BB and the PFS, adopts a BB over-subscription allocation method to improve BB utilization. Unfortunately, this allocation method aggravates I/O interference and the overhead of demoting data from the BB to the PFS, degrading checkpoint and restart performance. To minimize this degradation, we developed an I/O scheduler and a new data management policy based on checkpoint and restart characteristics.

To prove the effectiveness of the proposed schemes, we evaluated the I/O transfer management scheme and the data management scheme between the BB and the PFS. The I/O transfer management scheme improves the write and read I/O throughput for checkpoint and restart by up to 6x and 3x, respectively, over a DFS running on the original kernel. With the data management scheme, BB utilization improves by at least 2.2x, and a more stable and higher checkpoint performance is guaranteed. In addition, we achieved up to a 96.4% hit ratio for restart requests on the BB and up to 3.1x higher restart performance than existing methods.
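As a concrete picture of the multiple-mount-point workaround described above, the following short C program (purely illustrative; the mount paths, chunk count, chunk size, and round-robin policy are assumptions, not code from the dissertation) shows the placement logic a user would have to write by hand when every DFS daemon exposes its own mount point:

    /*
     * Minimal sketch (not code from the dissertation): manually spreading
     * checkpoint chunks across the mount points that separate DFS daemons
     * expose. The paths, chunk count, chunk size, and round-robin policy
     * are illustrative assumptions.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NUM_MOUNTS 4
    static const char *mounts[NUM_MOUNTS] = {
        "/mnt/dfs0", "/mnt/dfs1", "/mnt/dfs2", "/mnt/dfs3"   /* one per daemon */
    };

    static char buf[1 << 20];   /* 1 MiB of checkpoint payload */

    int main(void)
    {
        char path[256];
        memset(buf, 0xAB, sizeof(buf));

        for (int chunk = 0; chunk < 16; chunk++) {
            /* Round-robin placement: each chunk goes to a different mount
             * point, hence a different connection/daemon. The application,
             * not the file system, carries this logic. */
            snprintf(path, sizeof(path), "%s/ckpt.%d",
                     mounts[chunk % NUM_MOUNTS], chunk);
            int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0) { perror(path); return 1; }
            if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
                perror("write");
            close(fd);
        }
        return 0;
    }

With the proposed I/O transfer management scheme, the same checkpoint would be written to a single mount point, and the modified VFS layer, rather than the application, would spread the traffic over multiple connections.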
(Korean abstract) Most of the I/O bandwidth of high performance computing storage systems is consumed by the checkpoints and restarts of HPC applications. To handle the bursty I/O of such applications smoothly, new all-flash HPC storage systems have been proposed that combine a burst buffer and a PFS built from high-end and low-end flash storage devices. However, most of the distributed file systems used to configure such storage systems provide only a single network connection between nodes, so the high flash I/O bandwidth a server node can offer goes unused. To provide multiple network connections, either the distributed file system must be modified or the number of its client and server daemons must be increased. Because distributed file systems have very complicated structures, however, modifying them demands a great deal of time and effort from their developers. Increasing the number of daemons creates a new mount point for every network connection, so users must expend enormous effort to distribute file I/O requests across the mount points themselves. If the number of connections is increased by adding server daemons, each server daemon has a different view of the file system directory tree, so users must be aware of the different server daemons and take care that data collisions do not occur. In addition, users have traditionally preferred the dedicated BB allocation method, sacrificing burst buffer efficiency, to avoid accessing a PFS composed of low-speed storage devices such as hard disks. In the new all-flash HPC storage systems, however, access to the parallel file system is fast, so this allocation method is no longer appropriate.

To solve these problems, this dissertation presents efficient, user-transparent management schemes for the new all-flash HPC storage systems, whose internal processing is not exposed to users. The first scheme, I/O transfer management, provides multiple connections between a compute node and a server node without additional effort from DFS developers or users. To this end, we modified the mount procedure and the I/O processing procedures of the virtual file system. The second scheme, data management, adopts a BB over-subscription allocation method to improve burst buffer utilization. Because this allocation method introduces I/O contention between users and demotion overhead, however, it yields low checkpoint and restart performance. To prevent this, we manage the data in the burst buffer and the parallel file system based on the characteristics of checkpoints and restarts. To demonstrate the effectiveness of the proposed schemes, we built a real all-flash storage system, applied the schemes, and evaluated their performance. Our experiments show that the I/O transfer management scheme delivers up to 6x higher write and up to 2x higher read I/O throughput than the existing approach. The data management scheme improves burst buffer utilization by 2.2x compared with existing methods. In addition, it shows high and stable checkpoint performance and provides up to 3.1x higher restart performance.
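To make the over-subscription trade-off easier to picture, here is a minimal sketch of one plausible policy, not the dissertation's data management engine: burst buffer allocations are allowed to exceed physical capacity, and under space pressure the oldest resident checkpoint data is demoted to the PFS, on the assumption (typical for checkpoint/restart workloads) that a restart mostly reads the most recent checkpoint. All names, sizes, and the demotion order are illustrative.

    /*
     * Minimal sketch of a BB over-subscription policy (illustrative only,
     * not the data management engine from the dissertation): logical
     * allocations may exceed the physical burst-buffer capacity, and the
     * oldest resident checkpoint is demoted to the PFS under pressure so
     * that the newest checkpoint stays on flash for a likely restart read.
     */
    #include <stdio.h>

    #define MAX_CKPTS 16

    struct ckpt { int id; long size_gb; int in_bb; };

    static long bb_capacity_gb = 100;   /* assumed physical BB space */
    static long bb_used_gb = 0;
    static struct ckpt ckpts[MAX_CKPTS];
    static int nckpts = 0;

    /* Demote the oldest checkpoint still resident in the BB to the PFS. */
    static void demote_oldest(void)
    {
        for (int i = 0; i < nckpts; i++) {
            if (ckpts[i].in_bb) {
                ckpts[i].in_bb = 0;
                bb_used_gb -= ckpts[i].size_gb;
                printf("demote ckpt %d (%ld GB): BB -> PFS\n",
                       ckpts[i].id, ckpts[i].size_gb);
                return;
            }
        }
    }

    /* Admit a new checkpoint into the BB, demoting older data on pressure. */
    static void checkpoint(int id, long size_gb)
    {
        while (bb_used_gb + size_gb > bb_capacity_gb && bb_used_gb > 0)
            demote_oldest();
        ckpts[nckpts++] = (struct ckpt){ id, size_gb, 1 };
        bb_used_gb += size_gb;
        printf("ckpt %d (%ld GB) written to BB, used %ld/%ld GB\n",
               id, size_gb, bb_used_gb, bb_capacity_gb);
    }

    int main(void)
    {
        /* Five 40 GB checkpoints against 100 GB of BB: over-subscribed. */
        for (int id = 0; id < 5; id++)
            checkpoint(id, 40);
        return 0;
    }

This sketch only shows the capacity accounting; scheduling the demotion traffic against incoming checkpoint I/O, which the dissertation addresses with its I/O scheduler, is what keeps checkpoint performance stable under such a policy.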
Table of Contents

Chapter 1 Introduction
Chapter 2 Background
    2.1 Burst Buffer
    2.2 Virtual File System
    2.3 Network Bandwidth
    2.4 Mean Time Between Failures
    2.5 Checkpoint/Restart Characteristics
Chapter 3 Motivation
    3.1 I/O Transfer Management for HPC Storage Systems
        3.1.1 Problems of Existing HPC Storage Systems
        3.1.2 Limitations of Existing Approaches
    3.2 Data Management for HPC Storage Systems
        3.2.1 Problems of Existing HPC Storage Systems
        3.2.2 Limitations with Existing Approaches
Chapter 4 Mulconn: User-Transparent I/O Transfer Management for HPC Storage Systems
    4.1 Design and Architecture
        4.1.1 Overview
        4.1.2 Scale Up Connections
        4.1.3 I/O Scheduling
        4.1.4 Automatic Policy Decision
    4.2 Implementation
        4.2.1 File Open and Close
        4.2.2 File Write and Read
    4.3 Evaluation
        4.3.1 Experimental Environment
        4.3.2 I/O Throughputs Improvement
        4.3.3 Comparison between TtoS and TtoM
        4.3.4 Effectiveness of Our System
    4.4 Summary
Chapter 5 BBOS: User-Transparent Data Management for HPC Storage Systems
    5.1 Design and Architecture
        5.1.1 Overview
        5.1.2 Data Management Engine
    5.2 Implementation
        5.2.1 In-memory Key-value Store
        5.2.2 I/O Engine
        5.2.3 Data Management Engine
        5.2.4 Stable Checkpoint and Demotion Performance
    5.3 Evaluation
        5.3.1 Experimental Environment
        5.3.2 Burst Buffer Utilization
        5.3.3 Checkpoint Performance
        5.3.4 Restart Performance
    5.4 Summary
Chapter 6 Related Work
Chapter 7 Conclusion
Abstract in Korean (์š”์•ฝ)
Acknowledgements (๊ฐ์‚ฌ์˜ ๊ธ€)