    Coding for the Clouds: Coding Techniques for Enabling Security, Locality, and Availability in Distributed Storage Systems

    Cloud systems have become the backbone of many applications such as multimedia streaming, e-commerce, and cluster computing. At the foundation of any cloud architecture lies a large-scale, distributed data storage system. To accommodate the massive amount of data being stored on the cloud, these distributed storage systems (DSS) have been scaled to contain hundreds to thousands of nodes connected through a networking infrastructure. Such data centers are usually built out of commodity components, which makes failures the norm rather than the exception. To combat node failures, data is typically stored redundantly. Due to the exponential rate of data growth, many DSS are turning from conventional replication to error-control coding, as coding offers far higher storage efficiency. This paradigm shift from replication to coding, along with the need to guarantee reliability, efficiency, and security in DSS, has created a new set of challenges and opportunities, opening up a new area of research. This thesis addresses several of these challenges and opportunities through two broad contributions: (i) we design practically amenable, low-complexity coding schemes that guarantee security of cloud systems, ensure quick recovery from failures, and provide high availability for retrieving partial information; and (ii) we analyze fundamental performance limits and optimal trade-offs between the key performance metrics of these schemes.

    More specifically, we first consider the problem of achieving information-theoretic security in DSS against an eavesdropper that can observe a limited number of nodes. We present a framework that enables the design of secure, repair-efficient codes through a joint construction of inner and outer codes. Then, we consider the practically appealing notion of weak security, and construct coset codes that can weakly secure a wide class of regenerating codes, which reduce the amount of data downloaded during node repair. Second, we consider the problem of meeting repair-locality constraints, which specify the number of nodes participating in the repair process. We propose a notion of unequal locality, which allows different locality values for different nodes, ensuring quick recovery for nodes storing important data. We establish tight upper bounds on the minimum distance of linear codes with unequal locality, and present optimal code constructions. Next, we extend the notion of locality from the Hamming metric to the rank and subspace metrics, with the goal of designing codes for efficient data recovery from special types of correlated failures in DSS. We construct a family of locally recoverable rank-metric codes with optimal data-recovery properties. Finally, we consider the problem of providing high availability, which is ensured by enabling node repair from multiple disjoint subsets of nodes of small size. We study codes with availability from a queueing-theoretic perspective by analyzing the average time needed to download a block of data under a Poisson request-arrival model when each node takes a random amount of time to fetch its contents. We compare the delay performance of availability codes with several alternatives, such as conventional erasure codes and replication schemes.
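    The unequal-locality bounds above generalize the classical minimum-distance bound for codes with uniform locality. As a hedged illustration (uniform locality r only, with parameters n = 14, k = 10, r = 5 chosen for illustration, not taken from the thesis), the sketch below compares that bound, d <= n - k - ceil(k/r) + 2, against the Singleton bound d <= n - k + 1:

        import math

        def singleton_bound(n, k):
            # Classical Singleton bound on minimum distance: d <= n - k + 1.
            return n - k + 1

        def locality_bound(n, k, r):
            # Bound for an (n, k) linear code with (uniform) locality r,
            # i.e., every symbol is repairable from at most r others:
            # d <= n - k - ceil(k/r) + 2.
            return n - k - math.ceil(k / r) + 2

        n, k, r = 14, 10, 5                  # illustrative parameters
        print(singleton_bound(n, k))         # 5: best possible without locality
        print(locality_bound(n, k, r))       # 4: the price paid for locality 5

    The gap between the two bounds quantifies the minimum-distance penalty incurred by demanding small repair locality; the thesis's unequal-locality bounds refine this trade-off on a per-node basis.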

    Coding for Security and Reliability in Distributed Systems

    This dissertation studies the use of coding techniques to improve the reliability and security of distributed systems. The first three parts focus on distributed storage systems, and study schemes that encode a message into n shares, assigned to n nodes, such that any n - r nodes can decode the message (reliability) and any z colluding nodes cannot infer any information about the message (security). The objective is to optimize the computational, implementation, communication, and access complexity of the schemes during encoding, decoding, and repair. These are the key metrics that determine whether, when the schemes are applied in practical distributed storage systems, the systems are not only reliable and secure but also fast and cost-effective.

    Schemes with highly efficient computation and implementation are studied in Part I. For the practically important high-rate case of r ≤ 3 and z ≤ 3, we construct schemes that require only r + z XORs to encode and z XORs to decode each message bit, based on practical erasure codes including the B, EVENODD, and STAR codes. This encoding and decoding complexity is shown to be optimal. For general r and z, we design schemes over a special ring from Cauchy matrices and Vandermonde matrices. Both schemes can be efficiently encoded and decoded due to the structure of the ring. We also discuss methods to shorten the proposed schemes.

    Part II studies schemes that are efficient in terms of communication and access complexity. We derive a lower bound on the decoding bandwidth, and design schemes achieving the optimal decoding bandwidth and access. We then design schemes that achieve the optimal bandwidth and access not only for decoding but also for repair. Furthermore, we present a family of Shamir schemes with asymptotically optimal decoding bandwidth.

    Part III studies the problem of secure repair, i.e., reconstructing the share of a (failed) node without leaking any information about the message. We present generic secure-repair protocols that can securely repair any linear scheme. We derive a lower bound on the secure-repair bandwidth and show that the proposed protocols are essentially optimal in terms of bandwidth.

    In the final part of the dissertation, we study the use of coding techniques to improve the reliability and security of network communication. Specifically, in Part IV we draw connections between several important problems in network coding. We present reductions that map an arbitrary multiple-unicast network coding instance to a unicast secure network coding instance in which at most one link is eavesdropped, or to a unicast network error-correction instance in which at most one link is erroneous, such that a rate tuple is achievable in the multiple-unicast instance if and only if a corresponding rate is achievable in the unicast secure network coding instance, or in the unicast network error-correction instance. Conversely, we show that an arbitrary unicast secure network coding instance in which at most one link is eavesdropped can be reduced back to a multiple-unicast network coding instance. Additionally, we show that the capacity of a unicast network error-correction instance is in general not (exactly) achievable. We derive upper bounds on the secrecy capacity for the secure network coding problem, based on cut-sets and the connectivity of links. Finally, we study optimal coding schemes for the network error-correction problem in the setting where the network and adversary parameters are not known a priori.
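    To make the share-based model concrete, the following is a minimal sketch of threshold secret sharing in the spirit of Shamir's scheme, which the dissertation builds on and generalizes: any z shares reveal nothing about the message, while z + 1 shares reconstruct it. The prime field and all parameters below are illustrative choices, not the dissertation's constructions.

        import random

        P = 2**127 - 1  # a Mersenne prime; arithmetic is over GF(P)

        def make_shares(secret, n, z):
            # Random degree-z polynomial with constant term = secret.
            # Any z evaluations are independent of the secret; any z + 1
            # evaluations determine the polynomial, hence the secret.
            coeffs = [secret] + [random.randrange(P) for _ in range(z)]
            return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
                    for x in range(1, n + 1)]

        def reconstruct(shares):
            # Lagrange interpolation at x = 0 recovers the constant term.
            secret = 0
            for j, (xj, yj) in enumerate(shares):
                num, den = 1, 1
                for m, (xm, _) in enumerate(shares):
                    if m != j:
                        num = num * (-xm) % P
                        den = den * (xj - xm) % P
            # pow(den, P - 2, P) is the modular inverse, since P is prime
                secret = (secret + yj * num * pow(den, P - 2, P)) % P
            return secret

        shares = make_shares(secret=42, n=5, z=2)  # 5 nodes, 2 colluders tolerated
        print(reconstruct(shares[:3]))             # any 3 shares recover 42

    The dissertation's schemes improve on this baseline in precisely the metrics listed above: XOR-only encoding and decoding, reduced decoding and repair bandwidth and access, and repair of a lost share without leaking the message.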

    Exploration of Erasure-Coded Storage Systems for High Performance, Reliability, and Inter-operability

    The unprecedented growth of data, together with the use of low-cost commodity drives in local disk-based storage systems and remote cloud-based servers, has increased both the risk of data loss and the overall user-perceived system latency. To guarantee high reliability, replication has been the most popular choice for decades because of its simplicity in data management. With the high volume of data being generated every day, however, the storage cost of replication has become prohibitive, and it is no longer a viable approach. Erasure coding is another way of adding redundancy to storage systems, providing high reliability at a fraction of the cost of replication. However, the choice of erasure code affects the storage efficiency, reliability, and overall system performance. At the same time, performance and interoperability are adversely affected by slower device components and by complex central management systems and operations. To address the problems encountered in the various layers of an erasure-coded storage system, this dissertation explores different aspects of storage and designs several techniques to improve reliability, performance, and interoperability. These techniques range from a comprehensive evaluation of erasure codes and the application of erasure codes to a highly reliable, high-performance SSD system, to the design of new erasure coding and caching schemes for the Hadoop Distributed File System, one of the central management systems for distributed storage. Detailed evaluations and results are also provided in this dissertation.
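    As a toy illustration of why erasure coding beats replication on storage cost, the sketch below implements a single-parity code (the simplest erasure code; the block contents are invented): three data blocks plus one XOR parity block tolerate one lost block at 33% overhead, where 3-way replication would cost 200%.

        def xor_blocks(blocks):
            # Bytewise XOR of equal-length blocks.
            out = bytearray(len(blocks[0]))
            for block in blocks:
                for i, byte in enumerate(block):
                    out[i] ^= byte
            return bytes(out)

        data = [b'ABCD', b'EFGH', b'IJKL']   # three data blocks (illustrative)
        parity = xor_blocks(data)            # one parity block

        # Suppose the middle block is lost: XOR of all survivors restores it.
        survivors = [data[0], data[2], parity]
        assert xor_blocks(survivors) == data[1]

    Production systems such as HDFS typically use Reed-Solomon codes rather than a single parity block, so that multiple simultaneous failures can be tolerated, at the price of the more complex encoding and repair paths this dissertation evaluates.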

    Methods for DNA Methylation Sequencing Analysis and their Application on Cancer Data

    The fundamental subject of this thesis is the development of tools for the analysis of DNA methylation data, as well as their application to bisulfite sequencing data comprising a large number of samples. DNA methylation is one of the major epigenetic modifications. It affects the cytosines of the DNA and is essential for the normal development of cells and tissues. Unusual alterations are associated with a variety of diseases, and in cancerous tissues in particular, global changes in the DNA methylation level have been detected. To sequence DNA methylation at single-nucleotide resolution, the DNA is treated with sodium bisulfite before sequencing, whereby unmethylated cytosines are read as thymines. Thus, specialized techniques are required to process and analyze this kind of data.

    Here, the bisulfite analysis toolkit BAT is introduced, which is designed to facilitate the quick analysis of bisulfite-treated DNA methylation sequencing data. It covers all steps from processing raw sequencing data up to calling differential DNA methylation. At the beginning of the analysis, sodium bisulfite-treated sequence data are aligned, and DNA methylation rates are called for each covered cytosine in the reference genome. Subsequently, BAT integrates annotation data and performs basic analyses, i.e., methylation rate distribution plots and hierarchical clustering of the samples. In addition, differentially methylated regions are called and statistics of the called regions are created automatically. Finally, the integration of DNA methylation and gene expression data is covered by the calculation of correlating regions.

    Secondly, a novel algorithm, metilene, for calling differentially methylated regions (DMRs) between two groups of samples is introduced. Existing methods are limited in terms of detection sensitivity as well as time and memory consumption. Our approach is based on circular binary segmentation, using a scoring function to detect sub-regions that show a stronger difference between the mean methylation levels of the two groups than the surrounding background. These sub-regions are tested for significant differences using a two-dimensional Kolmogorov-Smirnov test (2D-KS test) [Fasano 1987], taking all samples of each group into account. The use of the non-parametric 2D-KS test avoids assumptions about a background distribution. Furthermore, the two dimensions of the problem, i.e., (i) the detection of a region such that (ii) the methylation rates of the samples in the two groups differ significantly, are handled in a single test. The algorithm calls DMRs in sufficiently short time on single-sample comparisons as well as on about 50 samples per group. Furthermore, it works on whole-genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS) data, and is able to estimate missing data points from the methylation rates of the other samples in the group. Benchmarks on simulated and real data sets show that metilene outperforms existing methods and is especially suitable for the noisy datasets often found, for example, in cancer analysis.

    In the framework of this thesis, the methods and algorithms introduced above are used to analyze a WGBS dataset of two different subtypes of germinal-center-derived B-cell lymphomas and healthy controls. In both lymphoma subgroups, genome-wide hypomethylation was found, with the exception of a specific type of promoter region, i.e., poised promoters, which were frequently found to be hypermethylated. Using the algorithm presented above, DMRs were called between the three entities. A strong enrichment of DMRs immediately downstream of the transcription start site was observed, indicating the regulatory relevance of these regions. The integration of gene expression data from the same samples revealed that a considerable fraction of the DMRs show significant correlation between gene expression and DNA methylation. Finally, transcription factor binding sites and mutation data were combined with the methylation and expression analysis. This identified strongly altered signaling pathways and cancer-subtype-specific genes. Furthermore, the data integration indicates that mutations and DNA methylation changes may act complementarily to one another.

    Finally, findings from the lymphoma study regarding the hypermethylation of poised promoters in cancer were extended to a large data set comprising a variety of cancers. We could show that the DNA methylation level at a small set of frequently poised regions, relative to the background methylation level, is sufficient to classify almost all samples, based on DNA methylation data from 450k BeadChips, as cancer or non-cancer. In addition, we found that the increase in methylation co-occurs with upregulated expression of several poised-promoter-regulated genes in almost all fresh cancer samples, implying a de-poising of poised regions. This upregulated gene expression is in contrast to the silencing of those genes in cancer cell lines, indicating that the upregulation might be a temporary state and possibly contributes to carcinogenesis.
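    As a simplified sketch of the core quantities involved (not BAT's or metilene's actual implementation; the read counts below are invented), the snippet computes per-cytosine methylation rates from bisulfite read counts and flags a candidate DMR by the mean difference between two groups. metilene replaces this crude cutoff with circular binary segmentation and the 2D-KS test.

        # Methylation rate at a cytosine: reads still showing C (methylated,
        # protected from bisulfite conversion) over all covering reads.
        def meth_rate(c_reads, t_reads):
            return c_reads / (c_reads + t_reads)

        # Invented (C reads, T reads) counts at one CpG for two groups.
        group_a = [(9, 1), (8, 2), (10, 0), (7, 3)]   # e.g. healthy controls
        group_b = [(2, 8), (3, 7), (1, 9), (2, 8)]    # e.g. lymphoma samples

        rates_a = [meth_rate(c, t) for c, t in group_a]
        rates_b = [meth_rate(c, t) for c, t in group_b]
        mean_a = sum(rates_a) / len(rates_a)           # 0.85
        mean_b = sum(rates_b) / len(rates_b)           # 0.20

        # A run of CpGs whose group means differ this strongly becomes a
        # DMR candidate, to be tested for statistical significance.
        if abs(mean_a - mean_b) > 0.1:
            print(f'candidate DMR: delta = {mean_a - mean_b:+.2f}')

    The 2D-KS test then assesses, jointly over genomic position and methylation rate and using all samples of each group, whether the candidate region genuinely separates the two groups.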