
    Cross-layer system reliability assessment framework for hardware faults

    System reliability estimation during early design phases facilitates informed decisions for the integration of effective protection mechanisms against different classes of hardware faults. When not all system abstraction layers (technology, circuit, microarchitecture, software) are factored into such an estimation model, the delivered reliability reports are excessively pessimistic and thus lead to unacceptably expensive, over-designed systems. We propose a scalable, cross-layer methodology and supporting suite of tools for accurate but fast estimations of computing systems reliability. The backbone of the methodology is a component-based Bayesian model, which effectively calculates system reliability based on the masking probabilities of individual hardware and software components, considering their complex interactions. Our detailed experimental evaluation for different technologies, microarchitectures, and benchmarks demonstrates that the proposed model delivers very accurate reliability estimations (FIT rates) compared to statistically significant but slow fault injection campaigns at the microarchitecture level.
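
    To make the masking-based view of cross-layer reliability concrete, here is a minimal sketch (not the paper's Bayesian model) of how per-layer masking probabilities could be combined into an effective system FIT rate; all component names, raw FIT values and masking probabilities below are hypothetical.

```python
# Illustrative sketch: a fault contributes to the system FIT rate only if it
# survives (is not masked at) every abstraction layer. All numbers are made up.

LAYERS = ("technology", "circuit", "microarchitecture", "software")

def effective_fit(raw_fit: float, masking: dict) -> float:
    """Scale a component's raw FIT rate by the probability that a fault
    escapes masking at every layer."""
    survive = 1.0
    for layer in LAYERS:
        survive *= 1.0 - masking.get(layer, 0.0)
    return raw_fit * survive

def system_fit(components: dict) -> float:
    """Sum the effective FIT rates of (assumed independent) components."""
    return sum(effective_fit(c["raw_fit"], c["masking"]) for c in components.values())

if __name__ == "__main__":
    components = {
        "register_file": {"raw_fit": 120.0,
                          "masking": {"circuit": 0.10, "microarchitecture": 0.60, "software": 0.30}},
        "alu":           {"raw_fit": 45.0,
                          "masking": {"circuit": 0.05, "microarchitecture": 0.40, "software": 0.50}},
    }
    print(f"Estimated system FIT: {system_fit(components):.2f}")
```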

    Implementation of reliability aware scheduling in an open source scheduling system

    High performance computing clusters provide an efficient and cost-effective solution to tackle large and complex problems. These clusters harness the computing power of widely available and relatively inexpensive commodity hardware. However, commodity hardware is liable to frequent failures, which can cause the processes executing on these components to fail. Hence, high performance clusters often suffer from poor reliability. Each failure incurs additional overhead, which increases the cost of running the cluster. To prevent processes from failing, proactive fault tolerance strategies may be used in these cluster systems. The scheduler in these systems is an appropriate venue for applying proactive strategies to help prevent failures from occurring. In this thesis we have implemented an approach that incorporates reliability awareness in the scheduler. Based on historic system logs, estimates are made about the reliability of resources in the cluster. The scheduler decides where to schedule jobs depending on the reliability need of the job and the predicted reliability of the computing nodes. This reliability need is calculated from the characteristics of the job: typically, jobs which are large and complex have a high reliability need. The scheduler assigns jobs with a high reliability need to resources that can provide an adequate level of reliability, and avoids resources with low reliability. The lower-reliability resources are allocated to jobs with a low reliability need. Thus, by assigning jobs to resources based on reliability characteristics, failures of large and complex jobs become statistically less likely than under a typical node assignment strategy. Hence, by using this approach, the costs associated with failures can be reduced and the overall reliability of the system can be improved.
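
    A hedged sketch of the matching idea described above: rank jobs by a simple reliability-need score and assign each, in turn, to the free node most likely to survive it. The field names, the exponential-survival assumption and the need formula are illustrative assumptions, not the thesis' actual algorithm.

```python
# Toy reliability-aware scheduler: greedy pairing of high-need jobs with
# high-reliability nodes. All field names and formulas are hypothetical.
import math

def node_reliability(mtbf_hours: float, wall_time_hours: float) -> float:
    """Probability the node survives the job, assuming exponentially
    distributed time between failures (MTBF estimated from historic logs)."""
    return math.exp(-wall_time_hours / mtbf_hours)

def job_need(num_cores: int, wall_time_hours: float) -> float:
    """Larger, longer jobs are costlier to lose, so they get a higher need score."""
    return num_cores * wall_time_hours

def schedule(jobs, nodes):
    """Greedy: the most demanding job gets the free node most likely to survive it."""
    assignment, free = [], list(nodes)
    for job in sorted(jobs, key=lambda j: job_need(j["cores"], j["hours"]), reverse=True):
        if not free:
            break
        best = max(free, key=lambda n: node_reliability(n["mtbf"], job["hours"]))
        free.remove(best)
        assignment.append((job["id"], best["id"]))
    return assignment

if __name__ == "__main__":
    jobs = [{"id": "j1", "cores": 256, "hours": 12}, {"id": "j2", "cores": 8, "hours": 1}]
    nodes = [{"id": "n1", "mtbf": 2000.0}, {"id": "n2", "mtbf": 400.0}]
    print(schedule(jobs, nodes))  # -> [('j1', 'n1'), ('j2', 'n2')]
```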

    Reliable and energy efficient resource provisioning in cloud computing systems

    Cloud Computing has revolutionized the Information Technology sector by turning computing into a service. Users can access cloud services through easy-to-use portals without knowing about the underlying system. To provide such an abstract view, cloud computing systems have to perform many complex operations besides managing a large underlying infrastructure. These complex operations confront service providers with many challenges, such as security, sustainability, reliability, energy consumption and resource management. Among these, reliability and energy consumption are the two key challenges this thesis focuses on because of their conflicting nature. Current solutions focus either on reliability techniques or on energy-efficiency methods, but it has been observed that mechanisms providing reliability in cloud computing systems can increase energy consumption. Adding backup resources and running replicated systems provide strong fault tolerance but also increase energy consumption. Reducing energy consumption by running resources at low power-scaling levels or by cutting the number of active but idle resources, such as backups, reduces system reliability. This creates a critical trade-off between the two metrics that is investigated in this thesis. To address this problem, this thesis presents novel resource management policies which provision the best resources in terms of reliability and energy efficiency and allocate them to suitable virtual machines. A mathematical framework showing the interplay between reliability and energy consumption is also proposed, together with a formal method to calculate the finishing time of tasks running in a cloud computing environment subject to independent and correlated failures. The proposed policies adopt various fault-tolerance mechanisms while satisfying constraints such as task deadlines and utility values. This thesis also provides a novel failure-aware VM consolidation method, which takes the failure characteristics of resources into consideration before performing VM consolidation. All the proposed resource management methods are evaluated using real failure traces collected from various distributed computing sites. To perform the evaluation, a cloud computing framework, 'ReliableCloudSim', capable of simulating failure-prone cloud computing systems was developed. The key research findings and contributions of this thesis are:
    1. If the emphasis is placed only on energy optimization, without considering reliability, in a failure-prone cloud computing environment, the results can be contrary to intuitive expectations: rather than reducing energy consumption, the system ends up consuming more energy because of the losses incurred by failure overheads.
    2. While performing VM consolidation in a failure-prone cloud computing environment, a significant improvement in both energy efficiency and reliability can be achieved by considering the failure characteristics of physical resources.
    3. By considering the correlated occurrence of failures during resource provisioning and VM allocation, service downtime or interruption is reduced significantly, by 34% in comparison to environments that assume independent failures. Moreover, as measured by our mathematical model, the ratio of reliability to energy consumption improves by 14%.
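
    One ingredient of the reliability/energy interplay discussed above can be sketched with the classical restart-from-scratch result for exponentially distributed failures: expected wall time grows super-linearly with the failure rate, so failure overheads inflate energy too. This textbook formula is used here only for illustration and is not necessarily the exact model developed in the thesis; the power figure and restart overhead are assumptions.

```python
# E[T] = (1/lam + r) * (exp(lam * W) - 1): expected finishing time of W hours
# of work on a machine with failure rate lam (per hour) and restart cost r,
# when each failure forces full re-execution. Energy scales with wall time.
import math

def expected_finish_time(work_hours: float, lam: float, restart_hours: float = 0.1) -> float:
    return (1.0 / lam + restart_hours) * (math.exp(lam * work_hours) - 1.0)

def energy_kwh(work_hours: float, lam: float, power_watts: float = 200.0) -> float:
    """Hypothetical constant-power node: more re-execution means more energy."""
    return power_watts * expected_finish_time(work_hours, lam) / 1000.0

if __name__ == "__main__":
    for lam in (0.001, 0.01, 0.05):          # failures per hour
        t = expected_finish_time(10.0, lam)  # 10 hours of useful work
        print(f"lambda={lam:.3f}/h -> wall time {t:.1f} h, energy {energy_kwh(10.0, lam):.2f} kWh")
```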

    Design, Verification, Test and In-Field Implications of Approximate Computing Systems

    Today, the concept of approximation in computing is becoming more and more of a “hot topic” for investigating how computing systems can be made more energy efficient, faster, and less complex. Intuitively, instead of performing exact computations and, consequently, requiring a high amount of resources, Approximate Computing aims at selectively relaxing the specifications, trading accuracy off for efficiency. While Approximate Computing holds several promises for system performance, energy efficiency and complexity, it poses significant challenges regarding the design, verification, test and in-field reliability of Approximate Computing systems. This tutorial paper covers these aspects, leveraging the authors' experience in the field to present state-of-the-art solutions to apply during the different development phases of an Approximate Computing system.
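
    As a toy illustration of the accuracy-for-efficiency trade-off mentioned above (not taken from the paper), an "approximate adder" can truncate the low-order bits of its operands, giving simpler hardware at the cost of a bounded error. The function and the bit-width choice are purely hypothetical.

```python
# Approximate addition by dropping the k least-significant bits of each
# operand; the absolute error is bounded by 2**(k+1) - 2.
def approx_add(a: int, b: int, k: int = 4) -> int:
    mask = ~((1 << k) - 1)       # clear the k low-order bits
    return (a & mask) + (b & mask)

if __name__ == "__main__":
    a, b = 1234, 5678
    exact, approx = a + b, approx_add(a, b)
    print(f"exact={exact}, approx={approx}, error={exact - approx}")
```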

    On the suitability of time-randomized processors for secure and reliable high-performance computing

    Time-randomized processor (TRP) architectures have been shown to be one of the most promising approaches for dealing with the overwhelming complexity of the timing analysis of highly complex processor architectures in safety-related real-time systems. With TRPs, the timing analysis step mainly relies on collecting measurements of the task under analysis rather than on complex timing models of the processor. Additionally, the randomization techniques applied in TRPs provide increased reliability and security. In this thesis, we elaborate on the reliability and security properties of TRPs and on the suitability of extending this processor architecture design paradigm to the high-performance computing domain.
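
    A minimal sketch of the measurement-based flavour of timing analysis that TRPs enable: collect execution-time samples of the task on the randomized platform and derive a probabilistic bound. Real measurement-based probabilistic timing analysis applies extreme value theory; the empirical quantile and the synthetic measurements below are only stand-ins for illustration.

```python
# Collect execution-time samples and report the value exceeded with a given
# (empirical) probability. The synthetic timing distribution is hypothetical.
import random

def measure_task(runs: int = 1000) -> list:
    """Placeholder for running the task on a time-randomized processor; the
    randomization makes execution times behave like i.i.d. samples."""
    random.seed(42)
    return [100 + random.expovariate(0.2) for _ in range(runs)]  # pseudo-cycles

def empirical_bound(samples, exceedance: float = 1e-3) -> float:
    """Execution-time value exceeded with empirical probability <= exceedance."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int((1.0 - exceedance) * len(ordered)))
    return ordered[index]

if __name__ == "__main__":
    samples = measure_task()
    print(f"Bound at 10^-3 exceedance: {empirical_bound(samples):.1f} cycles")
```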

    Early Component-Based System Reliability Analysis for Approximate Computing Systems

    A key enabler of real applications on approximate computing systems is the availability of instruments to analyze system reliability early in the design cycle. Accurately measuring the impact on system reliability of any change in the technology, circuits, microarchitecture and software is most of the time a multi-team, multi-objective problem, and reliability must be traded off against other crucial design attributes (or objectives) such as power, performance and cost. Unfortunately, tools and models for cross-layer reliability analysis are still in their early stages compared to other, very mature design tools, and this represents a major issue for mainstream applications. This paper presents preliminary information on a cross-layer framework built on top of a Bayesian model designed to perform component-based reliability analysis of complex systems.
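
    To illustrate the multi-objective trade-off mentioned above (reliability versus power and cost), here is a small sketch that filters candidate design points down to the Pareto-optimal ones. The design points, their names and their scores are made up; this is not a tool from the paper.

```python
# Keep only design points that are not dominated in (reliability, power, cost):
# reliability is maximized, power and cost are minimized.
def dominates(a: dict, b: dict) -> bool:
    """a dominates b if it is no worse in every objective and better in at least one."""
    no_worse = (a["reliability"] >= b["reliability"] and
                a["power"] <= b["power"] and a["cost"] <= b["cost"])
    better = (a["reliability"] > b["reliability"] or
              a["power"] < b["power"] or a["cost"] < b["cost"])
    return no_worse and better

def pareto_front(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

if __name__ == "__main__":
    candidates = [
        {"name": "baseline",       "reliability": 0.90,  "power": 1.0, "cost": 1.0},
        {"name": "ecc+dvfs",       "reliability": 0.99,  "power": 1.2, "cost": 1.1},
        {"name": "ecc-legacy",     "reliability": 0.98,  "power": 1.3, "cost": 1.2},  # dominated
        {"name": "triplicated",    "reliability": 0.999, "power": 2.8, "cost": 3.0},
        {"name": "approx-relaxed", "reliability": 0.85,  "power": 0.7, "cost": 0.9},
    ]
    print([p["name"] for p in pareto_front(candidates)])
```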