
    HTSC and FH_HTSC: XOR-based codes to reduce access latency in distributed storage systems

    A massive distributed storage system is the foundation for big data operations. Access latency is a key performance metric in distributed storage systems since it greatly impacts user experience, whereas existing codes mainly focus on improving other metrics such as storage overhead and repair cost. By generating parity nodes from parity nodes, in this paper we design two new XOR-based erasure codes, the hierarchical tree structure code (HTSC) and the high failure tolerant HTSC (FH_HTSC), to reduce access latency in distributed storage systems. By comparing with other popular and representative codes, we show that, under the same repair cost, HTSC and FH_HTSC can reduce access latency while maintaining favorable performance in other metrics. In particular, under the same repair cost, FH_HTSC achieves lower access latency, higher or equal failure tolerance, and lower computation cost than the representative codes while enjoying similar storage overhead. Accordingly, FH_HTSC is a superior choice for applications requiring both low access latency and outstanding failure tolerance.
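
    As an illustration of the "parity from parity" idea behind these codes, the following sketch builds a small binary tree of XOR parities over data blocks. It is a toy, hedged example: the block sizes, tree shape, and helper names (xor_blocks, build_parity_tree) are assumptions for illustration, not the actual HTSC or FH_HTSC constructions.

        # Toy sketch only: build a binary tree of XOR parities, where each
        # higher-level parity is the XOR of two blocks (data or parity) from
        # the level below. Not the actual HTSC/FH_HTSC construction.
        def xor_blocks(a: bytes, b: bytes) -> bytes:
            return bytes(x ^ y for x, y in zip(a, b))

        def build_parity_tree(data_blocks):
            """Level 0 is the data; each later level XORs adjacent blocks below."""
            levels = [list(data_blocks)]
            while len(levels[-1]) > 1:
                below = levels[-1]
                parities = [xor_blocks(below[i], below[i + 1])
                            for i in range(0, len(below) - 1, 2)]
                if len(below) % 2 == 1:      # odd block is carried up unchanged
                    parities.append(below[-1])
                levels.append(parities)
            return levels

        if __name__ == "__main__":
            data = [bytes([i]) * 4 for i in range(4)]   # four 4-byte data blocks
            tree = build_parity_tree(data)
            # A lost data block is rebuilt from its sibling and their parity:
            assert xor_blocks(tree[1][0], data[1]) == data[0]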

    Effect of replica placement on the reliability of large scale data storage systems

    Replication is a widely used method to protect large-scale data storage systems from data loss when storage nodes fail. It is well known that the placement of replicas of the different data blocks across the nodes affects the rebuild time. Several systems described in the literature are designed on the premise that minimizing rebuild times maximizes system reliability. Our results, however, indicate that reliability is essentially unaffected by the replica placement scheme. We show that, for a replication factor of two, all possible placement schemes have mean times to data loss (MTTDLs) within a factor of two for practical values of the failure rate, storage capacity, and rebuild bandwidth of a storage node. The theoretical results are confirmed by means of event-driven simulation. For higher replication factors, an analytical derivation of MTTDL becomes intractable for a general placement scheme. We therefore use one of the alternative measures of reliability proposed in the literature, namely the probability of data loss during rebuild in the critical mode of the system. Whereas for a replication factor of two this measure can be translated directly into MTTDL, it is only suggestive of the MTTDL behavior for higher replication factors. This measure of reliability is shown to lie within a factor of two for all possible placement schemes and any replication factor. We also show that, for any replication factor, the clustered placement scheme has the lowest probability of data loss during rebuild in critical mode among all possible placement schemes, whereas the declustered placement scheme has the highest. Simulation results reveal, however, that these properties do not hold for the corresponding MTTDLs for replication factors greater than two. This indicates that some alternative measures of reliability may not be appropriate for comparing the MTTDL of different placement schemes.
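
    For intuition on the scale of MTTDL at a replication factor of two, the sketch below gives a toy Monte Carlo estimate for a single mirrored pair of nodes. It is a back-of-the-envelope illustration under assumed exponential failure and rebuild times, not the placement-dependent analysis or event-driven simulator used in the paper.

        # Toy Monte Carlo MTTDL estimate for one mirrored pair of nodes.
        # Assumptions (illustrative only): exponential time to failure with
        # rate lam per node and exponential rebuild time with rate mu; data
        # is lost if the surviving copy fails before the rebuild completes.
        import random

        def mttdl_mirrored_pair(lam, mu, trials=10_000):
            total = 0.0
            for _ in range(trials):
                t = 0.0
                while True:
                    t += random.expovariate(2 * lam)   # time until either copy fails
                    rebuild = random.expovariate(mu)   # time to restore the lost copy
                    second = random.expovariate(lam)   # time until the survivor fails
                    if second < rebuild:               # data loss during rebuild
                        t += second
                        break
                    t += rebuild                       # rebuild won; back to two copies
                total += t
            return total / trials

        if __name__ == "__main__":
            # With rebuilds much faster than failures, this approaches mu / (2 * lam**2).
            print(mttdl_mirrored_pair(lam=0.01, mu=1.0))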

    Failure analysis and reliability-aware resource allocation of parallel applications in High Performance Computing systems

    The demand for more computational power to solve complex scientific problems has been driving the physical size of High Performance Computing (HPC) systems to hundreds and thousands of nodes. Uninterrupted execution of large-scale parallel applications naturally becomes a major challenge because a single node failure interrupts the entire application, and the reliability of job completion decreases as the number of nodes increases. Accurate reliability knowledge of an HPC system enables runtime systems, such as resource managers and applications, to minimize performance loss due to random failures while also providing better Quality of Service (QoS) for computational users. This dissertation makes three major contributions to reliability evaluation and resource management in HPC systems. First, we study the failure properties of HPC systems and observe that the times to failure (TTFs) of individual compute nodes follow a time-varying failure-rate distribution such as the Weibull distribution. We then propose a model for the TTF distribution of a system of k independent nodes when individual nodes exhibit time-varying failure rates. Based on the reliability of the proposed TTF model, we develop reliability-aware resource allocation algorithms and evaluate them on actual parallel workloads and failure data of an HPC system. Our observations indicate that applying a time-varying failure-rate-based reliability function combined with some heuristics reduces the performance loss due to unexpected failures by as much as 30 to 53 percent. Finally, we also study the effect of reliability with respect to the number of nodes and propose a reliability-aware optimal k-node allocation algorithm for large-scale parallel applications. Our simulation results comparing the optimal k-node algorithm indicate that choosing the number of nodes for large-scale parallel applications based on the reliability of compute nodes can reduce the overall completion time and wasted time, where the chosen k may be smaller than the total number of nodes in the system.
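
    A minimal sketch of the kind of calculation such a TTF model enables is given below: the probability that a job of length t completes on k independent nodes whose times to failure are Weibull distributed, where any single node failure aborts the job. The Weibull shape and scale values and the function names are illustrative assumptions, not the dissertation's fitted parameters or allocation algorithms.

        # Illustrative sketch (not the dissertation's fitted model): reliability
        # of a job of length t on k independent nodes with Weibull TTFs. Any
        # single node failure kills the job, so system reliability is the
        # product of the individual node reliabilities.
        import math

        def node_reliability(t, shape, scale):
            """P(a node survives past time t) for a Weibull TTF."""
            return math.exp(-((t / scale) ** shape))

        def job_reliability(t, k, shape, scale):
            """P(all k nodes survive a job of length t)."""
            return node_reliability(t, shape, scale) ** k

        if __name__ == "__main__":
            # Compare a 24-hour job on 64 vs. 128 nodes (toy parameters, in hours).
            for k in (64, 128):
                print(k, round(job_reliability(24.0, k, shape=0.7, scale=5000.0), 4))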

    Causal failures and cost-effective edge augmentation in networks

    Node failures have a severe effect on the connectivity of a network. In traditional models, the failure of a node affects its neighbors and may further trigger the failures of those neighbors, and so on. However, node failures can also indirectly cause the failure of nodes that are not adjacent to the failed one. In a power grid, for example, generators share the load: the failure of one generator induces extra load on the other generators in the network, which may in turn trigger their failures. We call such failures causal failures. In this dissertation, we consider the impact of causal failures on multiple aspects of a network, organized as follows.
    • In Chapter 1, we introduce basic concepts of networks and graphs, classical failure models, and a formal definition of causal failures in a given network.
    • Chapter 2 addresses the network's robustness and aims to find the maximum number of causal failures that can be tolerated while maintaining a connected component of size at least a given threshold. More specifically, we ask how many causal node failures can be tolerated while keeping most of the system connected, parametrized by α (a toy version of this connectivity check is sketched after this abstract).
    • Chapter 3 deals with vulnerability, wherein we aim to find the minimum number of causal failures such that at least k connected components remain; that is, the set of causal failures that disconnects the network into k or more components.
    • In Chapter 4, we consider causal node failures occurring in a cascading manner. Cascading causal node failures affect communication between nodes, which depends on the paths connecting them. In this cascading causal failure model, we therefore study the impact of cascading causal failures on the distance between a pair of nodes. More precisely, given a network G, a set of causal failures (possibly cascading), a pair of nodes s and t, and a constant α ≥ 1, we determine the maximum number of causal failures that can be applied (meaning that the nodes in those causal failures are removed) such that, in the resulting network G′, dG′(s, t) ≤ α · dG(s, t), where dG(s, t) and dG′(s, t) denote the distance between s and t in G and G′, respectively.
    • In Chapter 5, we consider causal edge failures in flow networks and investigate their impact on flow transmission. We formulate an optimization problem to find the maximum number of causal edge failures after which the flow network can still deliver d units from source node s to terminal node t.
    • In Chapter 6, we consider edge-weighted network augmentation in the face of causal failures. We look for a set of edges with minimum total weight such that the network maintains an α-giant component when each causality is applied individually.
    We show that the optimization problems in these chapters are NP-hard and provide the corresponding mixed integer linear programming models. Moreover, we design polynomial-time heuristic algorithms to solve them approximately. In each chapter, we run experiments on multiple synthetic and real networks to compare the performance of the mixed integer linear programming models and the heuristic algorithms. The results demonstrate the efficacy and efficiency of the heuristic algorithms compared with the mixed integer linear programming models.
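
    The sketch below is a toy version of the connectivity check underlying the robustness question of Chapter 2: after removing the nodes named by a set of causal failures, does the surviving graph still contain a connected component covering at least a fraction α of the nodes? The graph representation and function names are illustrative assumptions, not the dissertation's MILP models or heuristics.

        # Toy sketch (not the dissertation's algorithms): check whether removing
        # the nodes in the given causal-failure sets leaves a connected component
        # of size at least alpha * |V|.
        from collections import deque

        def largest_component(adj, removed):
            alive = set(adj) - removed
            seen, best = set(), 0
            for start in alive:
                if start in seen:
                    continue
                queue, size = deque([start]), 0
                seen.add(start)
                while queue:
                    u = queue.popleft()
                    size += 1
                    for v in adj[u]:
                        if v in alive and v not in seen:
                            seen.add(v)
                            queue.append(v)
                best = max(best, size)
            return best

        def tolerates(adj, causal_sets, alpha):
            """True if applying all given causal failures keeps an alpha-giant component."""
            removed = set().union(*causal_sets) if causal_sets else set()
            return largest_component(adj, removed) >= alpha * len(adj)

        if __name__ == "__main__":
            adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
            print(tolerates(adj, [{4}], alpha=0.5))   # {1, 2, 3} survives: True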

    Coding for the Clouds: Coding Techniques for Enabling Security, Locality, and Availability in Distributed Storage Systems

    Cloud systems have become the backbone of many applications such as multimedia streaming, e-commerce, and cluster computing. At the foundation of any cloud architecture lies a large-scale, distributed data storage system. To accommodate the massive amount of data being stored on the cloud, these distributed storage systems (DSS) have been scaled to hundreds or thousands of nodes connected through a networking infrastructure. Such data centers are usually built out of commodity components, making failures the norm rather than the exception. To combat node failures, data is typically stored redundantly. Due to the exponential data growth rate, many DSS are beginning to adopt error control coding over conventional replication, as coding offers higher storage space efficiency. This paradigm shift from replication to coding, along with the need to guarantee reliability, efficiency, and security in DSS, has created a new set of challenges and opportunities, opening up a new area of research. This thesis addresses several of these challenges and opportunities by broadly making the following contributions: (i) we design practically amenable, low-complexity coding schemes that guarantee the security of cloud systems, ensure quick recovery from failures, and provide high availability for retrieving partial information; and (ii) we analyze fundamental performance limits and optimal trade-offs between the key performance metrics of these coding schemes. More specifically, we first consider the problem of achieving information-theoretic security in DSS against an eavesdropper that can observe a limited number of nodes. We present a framework that enables the design of secure, repair-efficient codes through a joint construction of inner and outer codes. We then consider a practically appealing notion of weakly secure coding and construct coset codes that can weakly secure a wide class of regenerating codes, which reduce the amount of data downloaded during node repair. Second, we consider the problem of meeting repair locality constraints, which specify the number of nodes participating in the repair process. We propose a notion of unequal locality, which enables different locality values for different nodes, ensuring quick recovery for nodes storing important data. We establish tight upper bounds on the minimum distance of linear codes with unequal locality and present optimal code constructions. Next, we extend the notion of locality from the Hamming metric to the rank and subspace metrics, with the goal of designing codes for efficient data recovery from special types of correlated failures in DSS. We construct a family of locally recoverable rank-metric codes with optimal data recovery properties. Finally, we consider the problem of providing high availability, which is ensured by enabling node repair from multiple disjoint subsets of nodes of small size. We study codes with availability from a queueing-theoretic perspective by analyzing the average time necessary to download a block of data under a Poisson request arrival model when each node takes a random amount of time to fetch its contents. We compare the delay performance of availability codes with several alternatives such as conventional erasure codes and replication schemes.
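
    As a toy illustration of repair locality (one of the notions studied here), the sketch below splits data blocks into groups of size r and attaches one local XOR parity per group, so that a single erased block is repaired by contacting only the surviving members of its own group. It is a minimal example under simplifying assumptions, not the unequal-locality or rank-metric constructions of the thesis.

        # Toy repair-locality sketch (not the thesis's code constructions):
        # groups of r data blocks, each protected by one local XOR parity.
        def xor_blocks(blocks):
            out = bytearray(len(blocks[0]))
            for b in blocks:
                for i, byte in enumerate(b):
                    out[i] ^= byte
            return bytes(out)

        def encode_with_locality(data_blocks, r):
            """Return (group, local_parity) pairs for groups of size <= r."""
            groups = [data_blocks[i:i + r] for i in range(0, len(data_blocks), r)]
            return [(g, xor_blocks(g)) for g in groups]

        def repair(group, parity, erased_index):
            """Rebuild one erased block from the rest of its group plus the parity."""
            survivors = [b for i, b in enumerate(group) if i != erased_index]
            return xor_blocks(survivors + [parity])

        if __name__ == "__main__":
            data = [bytes([i]) * 4 for i in range(6)]
            group0, parity0 = encode_with_locality(data, r=3)[0]
            assert repair(group0, parity0, erased_index=1) == data[1]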