23 research outputs found

    Byzantine Fault Tolerance for Distributed Systems

    Get PDF
    The growing reliance on online services imposes a high dependability requirement on the computer systems that provide these services. Byzantine fault tolerance (BFT) is a promising technology to solidify such systems for the much needed high dependability. BFT employs redundant copies of the servers and ensures that a replicated system continues providing correct services despite the attacks on a small portion of the system. In this dissertation research, I developed novel algorithms and mechanisms to control various types of application nondeterminism and to ensure the long-term reliability of BFT systems via a migration-based proactive recovery scheme. I also investigated a new approach to significantly improve the overall system throughput by enabling concurrent processing using Software Transactional Memory (STM). Controlling application nondeterminism is essential to achieve strong replica consistency because the BFT technology is based on state-machine replication, which requires deterministic operation of each replica. Proactive recovery is necessary to ensure that the fundamental assumption of using the BFT technology is not violated over long term, i.e., less than one-third of replicas remain correct. Without proactive recovery, more and more replicas will be compromised under continuously attacks, which would render BFT ineffective. STM based concurrent processing maximized the system throughput by utilizing the power of multi-core CPUs while preserving strong replication consistenc

    Byzantine Fault Tolerance for Nondeterministic Applications

    Get PDF
    The growing reliance on online services accessible on the Internet demands highly reliable system that would not be interrupted when encountering faults. A number of Byzantine fault tolerance (BFT) algorithms have been developed to mask the most complicated type of faults - Byzantine faults such as software bugs,operator mistakes, and malicious attacks, which are usually the major cause of service interruptions. However, it is often difficult to apply these algorithms to practical applications because such applications often exhibit sophisticated non-deterministic behaviors that the existing BFT algorithms could not cope with. In this thesis, we propose a classification of common types of replica nondeterminism with respect to the requirement of achieving Byzantine fault tolerance, and describe the design and implementation of the core mechanisms necessary to handle such replica nondeterminism within a Byzantine fault tolerance framework. In addition, we evaluated the performance of our BFT library, referred to as ND-BFT using both a micro-benchmark application and a more realistic online porker game application. The performance results show that the replicated online poker game performs approximately 13 slower than its nonreplicated counterpart in the presence of small number of player

    Self-management for large-scale distributed systems

    Get PDF
    Autonomic computing aims at making computing systems self-managing by using autonomic managers in order to reduce obstacles caused by management complexity. This thesis presents results of research on self-management for large-scale distributed systems. This research was motivated by the increasing complexity of computing systems and their management. In the first part, we present our platform, called Niche, for programming self-managing component-based distributed applications. In our work on Niche, we have faced and addressed the following four challenges in achieving self-management in a dynamic environment characterized by volatile resources and high churn: resource discovery, robust and efficient sensing and actuation, management bottleneck, and scale. We present results of our research on addressing the above challenges. Niche implements the autonomic computing architecture, proposed by IBM, in a fully decentralized way. Niche supports a network-transparent view of the system architecture simplifying the design of distributed self-management. Niche provides a concise and expressive API for self-management. The implementation of the platform relies on the scalability and robustness of structured overlay networks. We proceed by presenting a methodology for designing the management part of a distributed self-managing application. We define design steps that include partitioning of management functions and orchestration of multiple autonomic managers. In the second part, we discuss robustness of management and data consistency, which are necessary in a distributed system. Dealing with the effect of churn on management increases the complexity of the management logic and thus makes its development time consuming and error prone. We propose the abstraction of Robust Management Elements, which are able to heal themselves under continuous churn. Our approach is based on replicating a management element using finite state machine replication with a reconfigurable replica set. Our algorithm automates the reconfiguration (migration) of the replica set in order to tolerate continuous churn. For data consistency, we propose a majority-based distributed key-value store supporting multiple consistency levels that is based on a peer-to-peer network. The store enables the tradeoff between high availability and data consistency. Using majority allows avoiding potential drawbacks of a master-based consistency control, namely, a single-point of failure and a potential performance bottleneck. In the third part, we investigate self-management for Cloud-based storage systems with the focus on elasticity control using elements of control theory and machine learning. We have conducted research on a number of different designs of an elasticity controller, including a State-Space feedback controller and a controller that combines feedback and feedforward control. We describe our experience in designing an elasticity controller for a Cloud-based key-value store using state-space model that enables to trade-off performance for cost. We describe the steps in designing an elasticity controller. We continue by presenting the design and evaluation of ElastMan, an elasticity controller for Cloud-based elastic key-value stores that combines feedforward and feedback control

    Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems

    Get PDF
    Nanoscale technology nodes bring reliability concerns back to the center stage of digital system design. A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro) architecture, the mapping, and platform software (SW). The field is surveyed in a systematic way based on nonoverlapping categories, which add insight into the ongoing work by exposing similarities and differences. HW and SW solutions are discussed in a similar fashion so that interrelationships become apparent. The presented categories are illustrated by representative literature examples to illustrate their properties. Moreover, it is demonstrated how hybrid schemes can be decomposed into their primitive components

    Safety and security of cyber-physical systems

    Get PDF
    The number of embedded controllers in charge of physical systems has rapidly increased over the past years. Embedded controllers are present in every aspect of our lives, from our homes to our vehicles and factories. The complexity of these systems is also more than ever. These systems are expected to deliver many features and high performance without trading off in robustness and assurance. As systems increase in complexity, however, the cost of formally verifying their correctness and eliminating security vulnerabilities can quickly explode. On top of the unintentional bugs and problems, malicious attacks on cyber-physical systems (CPS) can also lead to adverse outcomes on physical plants. Some of the recent attacks on CPS are focused on causing physical damage to the plants or the environment. Such intruders make their way into the system using cyber exploits but then initiate actions that can destabilize and even damage the underlying (physical) systems. Given the reality mentioned above and the reliability standards of the industry, there is a need to embrace new CPS design paradigms where faults and security vulnerabilities are the norms rather than an anomaly. Such imperfections must be assumed to exist in every system and component unless it is formally verified and scanned. Faults and vulnerabilities should be safely handled and the CPS must be able to recover from them at run-time. Our goal in this work is to introduce and investigate a few designs compatible with this paradigm. The architectures and techniques proposed in this dissertation do not rely on the testing and complete system verification. Instead, they enforce safety at the highest level of the system and extend guaranteed safety from a few certified components to the entire system. These solutions are carefully curated to utilize unverified components and provide guaranteed performance
    corecore