18 research outputs found

    Coding for the Clouds: Coding Techniques for Enabling Security, Locality, and Availability in Distributed Storage Systems

    Cloud systems have become the backbone of many applications such as multimedia streaming, e-commerce, and cluster computing. At the foundation of any cloud architecture lies a large-scale, distributed, data storage system. To accommodate the massive amount of data being stored on the cloud, these distributed storage systems (DSS) have been scaled to contain hundreds to thousands of nodes that are connected through a networking infrastructure. Such data-centers are usually built out of commodity components, which make failures the norm rather than the exception. In order to combat node failures, data is typically stored in a redundant fashion. Due to the exponential data growth rate, many DSS are beginning to resort to error control coding over conventional replication methods, as coding offers high storage space efficiency. This paradigm shift from replication to coding, along with the need to guarantee reliability, efficiency, and security in DSS, has created a new set of challenges and opportunities, opening up a new area of research. This thesis addresses several of these challenges and opportunities by broadly making the following contributions. (i) We design practically amenable, low-complexity coding schemes that guarantee security of cloud systems, ensure quick recovery from failures, and provide high availability for retrieving partial information; and (ii) We analyze fundamental performance limits and optimal trade-offs between the key performance metrics of these coding schemes. More specifically, we first consider the problem of achieving information-theoretic security in DSS against an eavesdropper that can observe a limited number of nodes. We present a framework that enables design of secure repair-efficient codes through a joint construction of inner and outer codes. 
Then, we consider a practically appealing notion of weakly secure coding, and construct coset codes that can weakly secure a wide class of regenerating codes that reduce the amount of data downloaded during node repair. Second, we consider the problem of meeting repair locality constraints, which specify the number of nodes participating in the repair process. We propose a notion of unequal locality, which enables different locality values for different nodes, ensuring quick recovery for nodes storing important data. We establish tight upper bounds on the minimum distance of linear codes with unequal locality, and present optimal code constructions. Next, we extend the notion of locality from the Hamming metric to the rank and subspace metrics, with the goal of designing codes for efficient data recovery from special types of correlated failures in DSS. We construct a family of locally recoverable rank-metric codes with optimal data recovery properties. Finally, we consider the problem of providing high availability, which is ensured by enabling node repair from multiple disjoint subsets of nodes of small size. We study codes with availability from a queuing-theoretical perspective by analyzing the average time necessary to download a block of data under the Poisson request arrival model when each node takes a random amount of time to fetch its contents. We compare the delay performance of the availability codes with several alternatives such as conventional erasure codes and replication schemes.
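
The idea of repair locality, and of giving different nodes different locality values, can be illustrated with a toy code. The sketch below is not the construction from the thesis: it assumes a hypothetical layout in which data blocks are split into local groups of chosen sizes and each group gets a single XOR parity, so a lost block is rebuilt by reading only the other blocks of its own group. The names `encode`, `repair`, and `group_sizes` are illustrative assumptions.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def encode(data_blocks, group_sizes):
    """Split data into local groups and append one XOR parity per group.

    group_sizes[i] is the locality of group i: the number of blocks that
    must be read to repair any single lost block in that group.
    """
    assert sum(group_sizes) == len(data_blocks)
    groups, start = [], 0
    for size in group_sizes:
        group = list(data_blocks[start:start + size])
        groups.append(group + [xor_blocks(group)])
        start += size
    return groups


def repair(group, lost_index):
    """Rebuild one lost block from the other blocks of its own group.

    Returns the rebuilt block and the number of helper blocks read.
    """
    helpers = [b for i, b in enumerate(group) if i != lost_index]
    return xor_blocks(helpers), len(helpers)


# Unequal locality: two "important" blocks sit in a small group (locality 2),
# the remaining four blocks share a larger group (locality 4).
data = [b"hot1", b"hot2", b"cold", b"data", b"more", b"text"]
groups = encode(data, [2, 4])
print(repair(groups[0], 0))  # rebuilds b"hot1" from 2 helpers
print(repair(groups[1], 3))  # rebuilds b"text" from 4 helpers
```

This toy layout tolerates only one erasure per group and says nothing about minimum distance; the codes in the thesis are designed to meet distance bounds that a plain XOR grouping does not.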

    Coding Schemes for Distributed Storage Systems

    This thesis is devoted to problems in error-correcting codes motivated by data integrity problems arising in large-scale distributed storage systems. We study properties and constructions of Maximum Distance Separable (MDS) codes, which are widely used in storage applications since they provide the maximum failure tolerance for a given amount of storage overhead. Among the code parameters that are important for storage applications are the amount of data transferred in the system during node repair (the repair bandwidth), which characterizes the network usage, and the volume of accessed data, which corresponds to the number of disk I/O operations. Therefore, recent research on MDS codes for distributed storage has focused on codes that can minimize these two quantities. A lower bound on the repair bandwidth of a code, called the cut-set bound, was proved by Dimakis et al. in 2010, and codes that attain this bound are said to have the optimal repair property. Explicit optimal-repair low-rate (rate $\le 1/2$) MDS codes were constructed by Rashmi et al. in 2011. At the same time, large-scale distributed systems such as the Google File System and Hadoop Distributed File System employ high-rate (rate $> 1/2$) MDS codes because of the need to reduce storage overhead. Until recently, except for some particular cases, no general explicit constructions of high-rate optimal-repair MDS codes were known. In this thesis, we present the first explicit constructions of optimal-repair MDS codes, thereby providing a solution to the general construction problem of such codes for the high-rate regime. More specifically, we construct explicit MDS codes that can repair any number of failed nodes from any number of helper nodes with the smallest possible amount of downloaded/accessed data. For the particular case of repairing a single node failure, we further present an explicit family of MDS codes that minimize the amount of accessed data during the repair. 
This family of codes has the additional favorable property that the node size (the amount of information stored in the node) is also the smallest possible. Reducing the node size directly translates into reducing the complexity of storage systems. While most studies on MDS codes with optimal repair bandwidth focus on array codes, the repair problem of widely used scalar codes such as Reed-Solomon codes has also recently attracted the attention of researchers. It has been an open problem whether scalar linear MDS codes can achieve the cut-set bound. In this thesis, we answer this question in the affirmative by giving explicit constructions of Reed-Solomon codes that can be repaired at the cut-set bound. We also prove a lower bound on the node size of optimally repairable scalar MDS codes, showing that the node size of our RS codes is close to the best possible for scalar linear codes. Finally, we extend the concept of repair bandwidth from erasure correction to error correction, which forms a new problem in coding theory. We prove a bound on the amount of downloaded information for this problem and present explicit code families that attain this bound for a wide range of parameters.
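
To make the cut-set bound mentioned above concrete, here is a minimal sketch for the single-failure case. It assumes the standard form of the bound for an MDS code with dimension $k$: repairing one node from $d \ge k$ helper nodes requires downloading at least $d \cdot \ell / (d - k + 1)$ symbols in total, where $\ell$ is the node size. The function names and the normalization `node_size=1` are illustrative assumptions, and the multi-failure bounds treated in the thesis are not covered.

```python
from fractions import Fraction


def naive_repair_download(k, node_size=1):
    """Download needed when repair decodes the whole file from k full nodes."""
    return Fraction(k * node_size)


def cut_set_repair_download(k, d, node_size=1):
    """Cut-set lower bound on total download for repairing one node.

    k: number of nodes sufficient to recover the data (MDS property)
    d: number of helper nodes contacted during repair (must be >= k)
    node_size: amount of data stored per node (the symbol l in the lead-in)
    """
    if d < k:
        raise ValueError("repair needs at least k helper nodes")
    return Fraction(d * node_size, d - k + 1)


# Hypothetical high-rate example: k = 10 data nodes out of n = 14, d = 13 helpers.
print(naive_repair_download(10))        # 10 node-sizes of traffic
print(cut_set_repair_download(10, 13))  # 13/4 node-sizes of traffic
```

With `d = k` the bound equals the naive cost, which is why contacting more helpers and downloading a fraction of each node's content is the key to bandwidth-efficient repair.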

    A Study on Outer Bounds on the Storage-Bandwidth Tradeoff of Linear Exact-Repair Regenerating Codes (선형 동일 복구 재생 부호의 저장량과 통신량 간 상충 관계의 외부 경계에 관한 연구)

    Doctoral thesis, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2017. 이정우 (Jungwoo Lee).
    Distributed storage systems disperse data to a large number of storage nodes connected in a network. When some of the storage nodes fail, a storage system should be able to repair them by downloading data from other surviving nodes. The amount of data traffic during the repair, called repair bandwidth, is one of the important performance metrics of distributed storage systems. 
Cooperative regenerating codes are a class of recently developed erasure codes which are optimal in terms of minimizing the repair bandwidth. An $(n,k,d,r)$-cooperative regenerating code has $n$ storage nodes, where any $k$ nodes are enough to reconstruct the original data, and $r$ failed nodes can be repaired cooperatively with the help of $d$ arbitrary surviving nodes. In the regenerating-code framework, there exists a tradeoff between the storage capacity of each node $\alpha$ and the repair bandwidth $\gamma$. The tradeoff of functional repair codes is fully characterized by Shum et al., but the problem of specifying the optimal storage-bandwidth tradeoff of exact repair codes remains open. In this dissertation, two outer bounds on the storage-bandwidth tradeoff under the exact repair model are proposed. The outer bounds identify $(\alpha,\gamma)$ pairs that functional repair codes can achieve but no exact repair code can. The first outer bound considers a general set of parameters $(n,k,d,r)$. This result can be regarded as a generalization of the outer bound proposed by Prakash et al., which specifies the optimal tradeoff of exact-repair regenerating codes for the case of $d=k=n-1$ and $r=1$. It is verified that the proposed outer bound becomes more effective when $k$ is large, $r$ is small, or $d~(\geq k)$ is close to $k$. The second outer bound is developed for the case of single node repair ($r=1$). The bound is the union of two independently derived sub-bounds. Each sub-bound has its own condition to be tighter than the other. One sub-bound can be regarded as an extension of the first outer bound for $r=1$, and becomes more effective at high rates ($k/n > \frac{1}{2}$). 
The other sub-bound is derived based on the symmetric property of the storage nodes, and is tight at low rates ($k/n < \frac{1}{2}$).

    Contents:
    1 Introduction
    1.1 The Family of Regenerating Codes
    1.2 The Exact Repair Model
    1.3 Existing Results on the S-B Tradeoff of Exact Repair Codes
    1.4 Main Contribution
    2 An Outer Bound on the Storage-Bandwidth Tradeoff of Cooperative Regenerating Codes
    2.1 Conditions for Parity Check Matrices of Linear Cooperative Regenerating Codes
    2.1.1 Proof of Lemma 1
    2.2 An Alternative Proof of Functional Repair Cutset Bound
    2.2.1 Construction of $H_{\mathrm{repair}}$
    2.2.2 Lower Bounds of rank($H_{\mathrm{repair}}$)
    2.2.3 Upper Bounds of $B$
    2.3 Block Matrices with Full-Rank Diagonal Blocks
    2.3.1 Definitions
    2.3.2 Properties of Block Matrices with Full-Rank Diagonal Blocks
    2.4 An Outer Bound of Linear and Exact-Repair Cooperative Regenerating Codes
    2.4.1 Construction of $H_{\mathrm{repair}}$
    2.4.2 Lower Bound of rank($H_{\mathrm{repair}}$)
    2.4.3 Derivation of the Proposed Outer Bound
    2.5 Evaluation of the Proposed Outer Bound
    3 An Improved Outer Bound for the Case of Single Node Repair
    3.1 Symmetric Exact-Repair Codes
    3.2 Conditions for Parity Check Matrices of Single Repair Codes
    3.3 Construction of $H_{\mathrm{single}}$
    3.4 Derivation of Two Sub-Bounds
    3.4.1 Proof of Theorem 2
    3.4.2 Proof of Theorem 3
    3.5 Performance Evaluation
    4 Conclusion
    Bibliography
    Abstract (In Korean)
    Acknowledgements (In Korean)
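
For orientation, the two endpoints of the functional-repair storage-bandwidth tradeoff have a well-known closed form in the single-failure case ($r=1$), following the cut-set analysis of Dimakis et al. The block below is a sketch of that special case only, not of the cooperative tradeoff characterized by Shum et al. or of the outer bounds proposed in the thesis. The file size $B$ is used here under the usual convention that any $k$ nodes recover a file of size $B$.

```latex
% Minimum-storage (MSR) and minimum-bandwidth (MBR) points for r = 1,
% with k <= d <= n - 1 helper nodes and file size B.
\[
\text{MSR:}\quad (\alpha,\gamma) = \left(\frac{B}{k},\ \frac{dB}{k(d-k+1)}\right),
\qquad
\text{MBR:}\quad \alpha = \gamma = \frac{2dB}{k(2d-k+1)}.
\]
% Worked example with k = 2, d = 3, B = 1:
\[
\text{MSR:}\ (\alpha,\gamma) = \left(\tfrac{1}{2},\ \tfrac{3}{4}\right),
\qquad
\text{MBR:}\ \alpha = \gamma = \tfrac{3}{5}.
\]
```

Functional repair can reach every point on the tradeoff curve between these two endpoints; the outer bounds in the thesis concern which interior points exact repair cannot reach.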

    An erasure-resilient and compute-efficient coding scheme for storage applications

    Driven by rapid technological advancements, the amount of data that is created, captured, communicated, and stored worldwide has grown exponentially over the past decades. Along with this development, it has become critical for many disciplines of science and business to be able to gather and analyze large amounts of data. The sheer volume of the data often exceeds the capabilities of classical storage systems, with the result that current large-scale storage systems are highly distributed and comprise a large number of individual storage components. As with any other electronic device, the reliability of storage hardware is governed by certain probability distributions, which in turn are influenced by the physical processes utilized to store the information. The traditional way to deal with the inherent unreliability of combined storage systems is to replicate the data several times. Another popular approach to achieve failure tolerance is to calculate the block-wise parity in one or more dimensions. With better understanding of the different failure modes of storage components, it has become evident that sophisticated high-level error detection and correction techniques are indispensable for the ever-growing distributed systems. The utilization of powerful cyclic error-correcting codes, however, comes with a high computational penalty, since the required operations over finite fields do not map very well onto current commodity processors. This thesis introduces a versatile coding scheme with fully adjustable fault-tolerance that is tailored specifically to modern processor architectures. To reduce stress on the memory subsystem, the conventional table-based algorithm for multiplication over finite fields has been replaced with a polynomial version. 
This arithmetically intense algorithm is better suited to the wide SIMD units of the currently available general purpose processors, but also displays significant benefits when used with modern many-core accelerator devices (for instance the popular general purpose graphics processing units). A CPU implementation using SSE and a GPU version using CUDA are presented. The performance of the multiplication depends on the distribution of the polynomial coefficients in the finite field elements. This property has been used to create suitable matrices that generate a linear systematic erasure-correcting code which shows a significantly increased multiplication performance for the relevant matrix elements. Several approaches to obtain the optimized generator matrices are elaborated and their implications are discussed. A Monte-Carlo-based construction method makes it possible to influence the specific shape of the generator matrices and thus to adapt them to special storage and archiving workloads. Extensive benchmarks on CPU and GPU demonstrate the superior performance and the future application scenarios of this novel erasure-resilient coding scheme.
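
The contrast between table-based and polynomial finite-field multiplication can be sketched in scalar Python. The abstract does not state the field or reduction polynomial used in the thesis, so the example assumes GF(2^8) with the polynomial 0x11D, a common choice in storage codes, and makes no attempt to reproduce the SSE or CUDA implementations. `gf_mul_poly` uses only shifts and XORs (no memory lookups), while `gf_mul_table` uses precomputed log/exp tables.

```python
POLY = 0x11D  # assumed reduction polynomial: x^8 + x^4 + x^3 + x^2 + 1


def gf_mul_poly(a, b):
    """Multiply in GF(2^8) by shift-and-XOR, reducing modulo POLY."""
    result = 0
    while b:
        if b & 1:          # add (XOR) the current multiple of a
            result ^= a
        a <<= 1            # multiply a by x
        if a & 0x100:      # degree reached 8: reduce modulo POLY
            a ^= POLY
        b >>= 1
    return result


def build_tables():
    """Build exp/log tables for the generator element 2."""
    exp = [0] * 510        # doubled length avoids a modulo in lookups
    log = [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x = gf_mul_poly(x, 2)
    for i in range(255, 510):
        exp[i] = exp[i - 255]
    return exp, log


EXP, LOG = build_tables()


def gf_mul_table(a, b):
    """Multiply in GF(2^8) with two log lookups and one exp lookup."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


# Both methods agree on every pair of field elements.
assert all(gf_mul_poly(a, b) == gf_mul_table(a, b)
           for a in range(256) for b in range(256))
```

The number of XOR steps in `gf_mul_poly` depends on the set bits of `b`, which is in line with the abstract's remark that multiplication performance depends on the distribution of polynomial coefficients in the field elements.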