
    Software reliability through fault-avoidance and fault-tolerance

    The use of back-to-back, or comparison, testing for regression testing or porting is examined. The efficiency and cost of the strategy are compared with manual and table-driven single-version testing. Key parameters that influence the efficiency and cost of the approach are the failure identification effort during single-version program testing, the extent of the implemented changes, the nature of the regression test data (e.g., random), and the nature of the inter-version failure correlation and fault-masking. The advantages and disadvantages of the technique are discussed, together with some suggestions concerning its practical use.
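
    To make the comparison-testing idea concrete, here is a minimal Python sketch (an assumption-based illustration, not the setup evaluated above): two versions of the same routine are run on the same randomly generated inputs and any output mismatches are reported; version_a, version_b and the input generator are placeholders.

import random

# Minimal back-to-back (comparison) test harness: two versions of the same
# routine are executed on identical, randomly generated inputs and their
# outputs are compared, so no manually maintained expected-output table is
# needed. version_a/version_b and the input generator are placeholders.

def version_a(xs):
    # previous ("golden") version
    return sum(x * x for x in xs)

def version_b(xs):
    # ported / modified version under test
    total = 0
    for x in xs:
        total += x ** 2
    return total

def back_to_back_test(n_cases=1000, seed=42):
    rng = random.Random(seed)
    mismatches = []
    for case in range(n_cases):
        data = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        out_a, out_b = version_a(data), version_b(data)
        if out_a != out_b:
            mismatches.append((case, data, out_a, out_b))
    return mismatches

if __name__ == "__main__":
    failures = back_to_back_test()
    print(f"{len(failures)} of 1000 random cases disagree between versions")

    As the abstract notes, inter-version failure correlation and fault-masking limit what such a comparison can reveal: if both versions contain the same fault, their outputs agree and the failure is never flagged.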

    Beam Loss Monitors at LHC

    One of the main functions of the LHC beam loss measurement system is the protection of equipment against damage caused by impacting particles that create secondary showers and dissipate their energy in matter. Reliability requirements are scaled according to the acceptable consequences and the frequency of particle impact events on equipment. Increasing reliability often leads to more complex systems; the downside of complexity is a reduction in availability, so an optimum has to be found between these conflicting requirements. A detailed review of selected concepts and solutions for the LHC system will be given to show the approaches used in various parts of the system, from the sensors, signal processing, and software implementations to the requirements for operation and documentation.
    Comment: 16 pages; contribution to the 2014 Joint International Accelerator School: Beam Loss and Accelerator Protection, Newport Beach, CA, USA, 5-14 Nov 2014.
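
    The tension between protection reliability and machine availability can be sketched with a standard k-out-of-n voting calculation; this is a generic illustration with invented failure probabilities, not a description of the LHC system itself.

from math import comb

# Illustrative k-out-of-n redundancy calculation (not taken from the paper).
# Assumptions: monitoring channels fail independently; per demand, a single
# channel misses a dangerous loss with probability p_miss and raises a
# spurious dump request with probability p_false. Both numbers are invented.

def at_least(k, n, p):
    """Probability that at least k of n independent channels 'fire',
    given each fires with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def evaluate(n, k, p_miss=1e-3, p_false=1e-2):
    # Missed protection: fewer than k channels detect the loss,
    # i.e. at least n - k + 1 channels miss it.
    p_missed_protection = at_least(n - k + 1, n, p_miss)
    # False dump (lost availability): at least k channels fire spuriously.
    p_false_dump = at_least(k, n, p_false)
    return p_missed_protection, p_false_dump

if __name__ == "__main__":
    for n, k in [(1, 1), (2, 2), (3, 2)]:
        missed, false_dump = evaluate(n, k)
        print(f"{k}-out-of-{n}: P(missed protection)={missed:.1e}, "
              f"P(false dump)={false_dump:.1e}")

    With these invented numbers, demanding agreement of two channels (2-out-of-2) reduces false dumps but worsens protection relative to a single channel, while a 2-out-of-3 vote improves both at the price of a third channel; finding such an optimum is the kind of trade-off the abstract refers to.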

    Alpha Entanglement Codes: Practical Erasure Codes to Archive Data in Unreliable Environments

    Data centres that use consumer-grade disk drives and distributed peer-to-peer systems are unreliable environments in which to archive data without enough redundancy. Most redundancy schemes are not completely effective at providing high availability, durability and integrity in the long term. We propose alpha entanglement codes, a mechanism that creates a virtual layer of highly interconnected storage devices to propagate redundant information across a large-scale storage system. Our motivation is to design flexible and practical erasure codes with high fault-tolerance to improve data durability and availability even in catastrophic scenarios. By flexible and practical, we mean code settings that can be adapted to future requirements and practical implementations with reasonable trade-offs between security, resource usage and performance. The codes have three parameters. Alpha increases storage overhead linearly but increases the possible paths to recover data exponentially. Two other parameters increase fault-tolerance even further without the need for additional storage. As a result, an entangled storage system can provide high availability and durability and offer additional integrity: it is more difficult to modify data undetectably. We evaluate how several redundancy schemes perform in unreliable environments and show that alpha entanglement codes are flexible and practical codes. Remarkably, they excel at code locality; hence, they reduce repair costs and become less dependent on storage locations with poor availability. Our solution outperforms Reed-Solomon codes in many disaster recovery scenarios.
    Comment: 12 pages, 13 figures. This work was partially supported by Swiss National Science Foundation SNSF Doc.Mobility 162014. Appeared in the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
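
    As a rough illustration of how entanglement propagates redundancy, the sketch below chains data blocks together with XOR parities; it is a deliberately simplified stand-in, not the alpha entanglement construction described above.

# Much-simplified, single-chain XOR sketch of the entanglement idea
# (illustration only; alpha entanglement codes use several chains governed
# by three parameters). Each data block d_i is XORed into a running parity,
# p_i = p_{i-1} XOR d_i, and every parity block is stored, so a lost data
# block can be rebuilt from its two neighbouring parities.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def entangle(data_blocks, block_size=4):
    parities = [bytes(block_size)]                # p_0: all zeros
    for d in data_blocks:
        parities.append(xor_blocks(parities[-1], d))
    return parities

def repair(i, parities):
    """Rebuild data block d_i (1-indexed) from parities p_{i-1} and p_i."""
    return xor_blocks(parities[i - 1], parities[i])

if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parities = entangle(data)
    assert repair(2, parities) == b"BBBB"         # d_2 recovered without d_2
    print("recovered block:", repair(2, parities))

    In the actual construction, the parameter alpha controls how many chains each block is entangled with, which is why, as stated above, storage overhead grows linearly while the number of alternative repair paths grows exponentially.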

    Flexible provisioning of Web service workflows

    Web services promise to revolutionise the way computational resources and business processes are offered and invoked in open, distributed systems, such as the Internet. These services are described using machine-readable meta-data, which enables consumer applications to automatically discover and provision suitable services for their workflows at run-time. However, current approaches have typically assumed service descriptions are accurate and deterministic, and so have neglected to account for the fact that services in these open systems are inherently unreliable and uncertain. Specifically, network failures, software bugs and competition for services may regularly lead to execution delays or even service failures. To address this problem, the process of provisioning services needs to be performed in a more flexible manner than has so far been considered, in order to proactively deal with failures and to recover workflows that have partially failed. To this end, we devise and present a heuristic strategy that varies the provisioning of services according to their predicted performance. Using simulation, we then benchmark our algorithm and show that it leads to a 700% improvement in average utility, while successfully completing up to eight times as many workflows as approaches that do not consider service failures.
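
    The sketch below illustrates one simple way to provision against predicted performance, redundantly invoking providers until a target success probability is reached; it is a hypothetical heuristic for illustration, not the algorithm benchmarked above, and the provider names, probabilities and target are assumptions.

# Illustrative provisioning heuristic (not the paper's algorithm): for one
# workflow task, provision additional candidate providers in parallel, in
# decreasing order of predicted success probability, until the chance that
# at least one succeeds reaches a target. Provider names, probabilities and
# the 0.99 target are assumptions made for this example.

def provision(candidates, target=0.99):
    """candidates: list of (provider_name, predicted_success_probability)."""
    chosen, p_all_fail = [], 1.0
    for name, p_success in sorted(candidates, key=lambda c: c[1], reverse=True):
        chosen.append(name)
        p_all_fail *= 1.0 - p_success
        if 1.0 - p_all_fail >= target:
            break
    return chosen, 1.0 - p_all_fail

if __name__ == "__main__":
    providers = [("svcA", 0.90), ("svcB", 0.80), ("svcC", 0.60)]
    selected, p_task = provision(providers)
    print(selected, f"predicted probability the task completes: {p_task:.3f}")

    The strategy described in the abstract goes further by re-provisioning at run-time to recover workflows that have partially failed, which this static sketch does not model.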