
    EVENODD: An Efficient Scheme for Tolerating Double Disk Failures in RAID Architectures

    We present a novel method, which we call EVENODD, for tolerating up to two disk failures in RAID architectures. EVENODD employs the addition of only two redundant disks and consists of simple exclusive-OR computations. This redundant storage is optimal, in the sense that two failed disks cannot be retrieved with fewer than two redundant disks. A major advantage of EVENODD is that it only requires parity hardware, which is typically present in standard RAID-5 controllers. Hence, EVENODD can be implemented on standard RAID-5 controllers without any hardware changes. The most commonly used scheme that employs optimal redundant storage (i.e., two extra disks) is based on Reed-Solomon (RS) error-correcting codes. This scheme requires computation over finite fields and results in a more complex implementation. For example, we show that the complexity of implementing EVENODD in a disk array with 15 disks is about 50% of that required by the RS scheme. The new scheme is not limited to RAID architectures: it can be used in any system requiring large symbols and relatively short codes, for instance, in multitrack magnetic recording. To this end, we also present a decoding algorithm for one column (track) in error.
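
    The encoding itself needs nothing beyond XOR, which is what makes the scheme attractive for standard parity hardware. The sketch below is a minimal Python illustration of EVENODD-style encoding for a prime p: data occupies a (p-1) x p array of symbols, one redundant column holds plain row parity, and the other holds diagonal parity adjusted by a shared term S. The function name and indexing convention are my own assumptions, not the paper's notation.

    # Minimal sketch of EVENODD-style encoding (illustrative; the indexing
    # convention and names are assumptions, not the paper's notation).
    def evenodd_encode(data, p):
        """data: (p-1) x p list of lists of integer symbols (XORed bitwise).
        Returns the two parity columns, each of length p-1."""
        assert len(data) == p - 1 and all(len(row) == p for row in data)

        def cell(i, j):
            # The construction adds an imaginary all-zero row p-1.
            return 0 if i == p - 1 else data[i][j]

        # Row parity: XOR across each row of the data columns.
        row_parity = [0] * (p - 1)
        for i in range(p - 1):
            for j in range(p):
                row_parity[i] ^= data[i][j]

        # S: parity of the diagonal whose cells satisfy (row + col) mod p == p - 1.
        s = 0
        for j in range(1, p):
            s ^= cell(p - 1 - j, j)

        # Diagonal parity: entry i covers the diagonal (row + col) mod p == i,
        # adjusted by the shared term S.
        diag_parity = [0] * (p - 1)
        for i in range(p - 1):
            acc = s
            for j in range(p):
                acc ^= cell((i - j) % p, j)
            diag_parity[i] = acc

        return row_parity, diag_parity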

    Alpha Entanglement Codes: Practical Erasure Codes to Archive Data in Unreliable Environments

    Data centres that use consumer-grade disk drives and distributed peer-to-peer systems are unreliable environments in which to archive data without enough redundancy. Most redundancy schemes are not completely effective for providing high availability, durability and integrity in the long term. We propose alpha entanglement codes, a mechanism that creates a virtual layer of highly interconnected storage devices to propagate redundant information across a large-scale storage system. Our motivation is to design flexible and practical erasure codes with high fault tolerance to improve data durability and availability even in catastrophic scenarios. By flexible and practical, we mean code settings that can be adapted to future requirements and practical implementations with reasonable trade-offs between security, resource usage and performance. The codes have three parameters. Alpha increases storage overhead linearly but increases the possible paths to recover data exponentially. Two other parameters increase fault tolerance even further without the need for additional storage. As a result, an entangled storage system can provide high availability and durability and offer additional integrity: it is more difficult to modify data undetectably. We evaluate how several redundancy schemes perform in unreliable environments and show that alpha entanglement codes are flexible and practical codes. Remarkably, they excel at code locality; hence, they reduce repair costs and become less dependent on storage locations with poor availability. Our solution outperforms Reed-Solomon codes in many disaster recovery scenarios. Comment: The publication has 12 pages and 13 figures. This work was partially supported by Swiss National Science Foundation SNSF Doc.Mobility 162014. 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
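
    The abstract does not spell out the construction, so the toy sketch below is only meant to give intuition for how redundant information can be "propagated" through chained parities: each new parity is the XOR of the incoming data block and the previous parity, so every parity depends on all earlier blocks and a missing block can be rebuilt from its two neighbouring parities. This single-chain example is my own simplification, not the alpha entanglement code itself.

    # Toy single-chain entanglement (an illustrative simplification, not the
    # authors' alpha entanglement construction).
    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def entangle_chain(blocks, block_size=4):
        parities = []
        prev = bytes(block_size)       # the chain starts from an all-zero parity
        for d in blocks:
            prev = xor_bytes(d, prev)  # p_i = d_i XOR p_{i-1}
            parities.append(prev)
        return parities

    # A lost data block d_i can be rebuilt from neighbouring parities:
    # d_i = p_i XOR p_{i-1}.
    blocks = [b"\x01\x00\x00\x00", b"\x02\x00\x00\x00", b"\x04\x00\x00\x00"]
    p = entangle_chain(blocks)
    assert xor_bytes(p[1], p[0]) == blocks[1]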

    Redundant disk arrays: Reliable, parallel secondary storage

    During the past decade, advances in processor and memory technology have given rise to increases in computational performance that far outstrip increases in the performance of secondary storage technology. Coupled with emerging small-disk technology, disk arrays provide the cost, volume, and capacity of current disk subsystems but, by leveraging parallelism, many times their performance. Unfortunately, arrays of small disks may have much higher failure rates than the single large disks they replace. Redundant arrays of inexpensive disks (RAID) use simple redundancy schemes to provide high data reliability. The data encoding, performance, and reliability of redundant disk arrays are investigated. Organizing redundant data into a disk array is treated as a coding problem. Among the alternatives examined, codes as simple as parity are shown to effectively correct single, self-identifying disk failures.
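
    As a concrete illustration of how parity corrects a single, self-identifying failure, the hypothetical snippet below (my own example, not code from the report) stores the XOR of all data stripes on a parity disk; whichever single disk is lost, its contents are the XOR of the surviving disks.

    # Hypothetical illustration of single-parity (RAID-5-style) recovery.
    from functools import reduce

    def xor_stripes(stripes):
        """XOR a list of equal-length byte strings column by column."""
        return bytes(reduce(lambda a, b: a ^ b, cols) for cols in zip(*stripes))

    # Three data disks plus one parity disk.
    data = [b"\x01\x02", b"\x0f\x0f", b"\x55\xaa"]
    parity = xor_stripes(data)

    # Disk 1 fails (the failure is self-identifying); the XOR of the survivors
    # reproduces its stripe, whether the lost disk held data or parity.
    rebuilt = xor_stripes([data[0], data[2], parity])
    assert rebuilt == data[1]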

    A Business Continuity Solution for Telecommunications Billing Systems

    The billing system is a critical component in a Telecommunications service provider's suite of business support systems - without the billing system the provider cannot invoice their customers for services provided and therefore cannot generate revenue. Typically, billing systems are hosted on a single large Unix/Oracle system located in the company's data centre. Modern Unix servers with their redundant components and hot-swap parts are highly resilient and can provide high levels of availability when correctly installed in a properly managed data centre with uninterruptible power supplies, cooling, etc. High Availability clustering through the use of HP MC/ServiceGuard, Sun Cluster, IBM HACMP (High Availability Cluster Multi-Processing) or Oracle Clusterware/RAC (Real Application Clusters) can bring this level of availability even higher. This approach, however, can only protect against the failure of a single server or component of the system; it cannot protect against the loss of an entire data centre in the event of a disaster such as a fire, flood or earthquake. In order to protect against such disasters it is necessary to provide some form of backup system on a site sufficiently remote from the primary site that it would not be affected by any disaster that might befall the primary site. This paper proposes a cost-effective business continuity solution to protect a Telecommunications billing system from the effects of unplanned downtime due to server or site outages. It is aimed at the smaller-scale tier 2 and tier 3 providers such as Mobile Virtual Network Operators (MVNOs) and startup Competitive Local Exchange Carriers (CLECs), who are unlikely to have large established IT systems with business continuity features and for whom cost-effectiveness is a key concern when implementing IT systems.

    A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

    Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques. Comment: 11 pages.

    RAID Level 6 and Level 6+ Reliability

    Storage systems are built of fallible components but have to provide high degrees of reliability. Besides mirroring and triplicating data, redundant storage of information using erasure-correcting codes is the only possibility to have data survive device failure. We provide here an exact formula for the data-loss probability of a disk array composed of several RAID Level 6 stripes. This two-failure-tolerant organization is not only used in practice but can also provide a reference point for the assessment of other data organizations.
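
    The exact expression is not reproduced in the abstract. As a hedged sketch of the kind of formula involved, assume (my simplification, not the authors' analysis) that each of the n disks in a stripe fails independently with probability q over the period of interest and that stripes fail independently; then a RAID Level 6 stripe loses data only when three or more of its disks fail, and an array of m such stripes loses data when any stripe does:

    P_{\mathrm{stripe}} = 1 - \sum_{k=0}^{2} \binom{n}{k} q^{k} (1-q)^{n-k},
    \qquad
    P_{\mathrm{array}} = 1 - \bigl(1 - P_{\mathrm{stripe}}\bigr)^{m}

    A static model like this ignores repair and rebuild dynamics, so it is only a stand-in for the exact formula referred to above.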