907 research outputs found

    System Description for a Scalable, Fault-Tolerant, Distributed Garbage Collector

    Full text link
    We describe an efficient and fault-tolerant algorithm for distributed cyclic garbage collection. The algorithm imposes few requirements on the local machines and allows for flexibility in the choice of local collector and distributed acyclic garbage collector to use with it. We have emphasized reducing the number and size of network messages without sacrificing the promptness of collection throughout the algorithm. Our proposed collector is a variant of back tracing to avoid extensive synchronization between machines. We have added an explicit forward tracing stage to the standard back tracing stage and designed a tuned heuristic to reduce the total amount of work done by the collector. Of particular note is the development of fault-tolerant cooperation between traces and a heuristic that aggressively reduces the set of suspect objects.Comment: 47 pages, LaTe

    Doctor of Philosophy

    Get PDF
    dissertationOver the last decade, cyber-physical systems (CPSs) have seen significant applications in many safety-critical areas, such as autonomous automotive systems, automatic pilot avionics, wireless sensor networks, etc. A Cps uses networked embedded computers to monitor and control physical processes. The motivating example for this dissertation is the use of fault- tolerant routing protocol for a Network-on-Chip (NoC) architecture that connects electronic control units (Ecus) to regulate sensors and actuators in a vehicle. With a network allowing Ecus to communicate with each other, it is possible for them to share processing power to improve performance. In addition, networked Ecus enable flexible mapping to physical processes (e.g., sensors, actuators), which increases resilience to Ecu failures by reassigning physical processes to spare Ecus. For the on-chip routing protocol, the ability to tolerate network faults is important for hardware reconfiguration to maintain the normal operation of a system. Adding a fault-tolerance feature in a routing protocol, however, increases its design complexity, making it prone to many functional problems. Formal verification techniques are therefore needed to verify its correctness. This dissertation proposes a link-fault-tolerant, multiflit wormhole routing algorithm, and its formal modeling and verification using two different methodologies. An improvement upon the previously published fault-tolerant routing algorithm, a link-fault routing algorithm is proposed to relax the unrealistic node-fault assumptions of these algorithms, while avoiding deadlock conservatively by appropriately dropping network packets. This routing algorithm, together with its routing architecture, is then modeled in a process-algebra language LNT, and compositional verification techniques are used to verify its key functional properties. As a comparison, it is modeled using channel-level VHDL which is compiled to labeled Petri-nets (LPNs). Algorithms for a partial order reduction method on LPNs are given. An optimal result is obtained from heuristics that trace back on LPNs to find causally related enabled predecessor transitions. Key observations are made from the comparison between these two verification methodologies

    FastPay: High-Performance Byzantine Fault Tolerant Settlement

    Get PDF
    FastPay allows a set of distributed authorities, some of which are Byzantine, to maintain a high-integrity and availability settlement system for pre-funded payments. It can be used to settle payments in a native unit of value (crypto-currency), or as a financial side-infrastructure to support retail payments in fiat currencies. FastPay is based on Byzantine Consistent Broadcast as its core primitive, foregoing the expenses of full atomic commit channels (consensus). The resulting system has low-latency for both confirmation and payment finality. Remarkably, each authority can be sharded across many machines to allow unbounded horizontal scalability. Our experiments demonstrate intra-continental confirmation latency of less than 100ms, making FastPay applicable to point of sale payments. In laboratory environments, we achieve over 80,000 transactions per second with 20 authorities---surpassing the requirements of current retail card payment networks, while significantly increasing their robustness

    Byzantine Fault Tolerance for Distributed Systems

    Get PDF
    The growing reliance on online services imposes a high dependability requirement on the computer systems that provide these services. Byzantine fault tolerance (BFT) is a promising technology to solidify such systems for the much needed high dependability. BFT employs redundant copies of the servers and ensures that a replicated system continues providing correct services despite the attacks on a small portion of the system. In this dissertation research, I developed novel algorithms and mechanisms to control various types of application nondeterminism and to ensure the long-term reliability of BFT systems via a migration-based proactive recovery scheme. I also investigated a new approach to significantly improve the overall system throughput by enabling concurrent processing using Software Transactional Memory (STM). Controlling application nondeterminism is essential to achieve strong replica consistency because the BFT technology is based on state-machine replication, which requires deterministic operation of each replica. Proactive recovery is necessary to ensure that the fundamental assumption of using the BFT technology is not violated over long term, i.e., less than one-third of replicas remain correct. Without proactive recovery, more and more replicas will be compromised under continuously attacks, which would render BFT ineffective. STM based concurrent processing maximized the system throughput by utilizing the power of multi-core CPUs while preserving strong replication consistenc

    Byzantine Fault Tolerance for Distributed Systems

    Get PDF
    The growing reliance on online services imposes a high dependability requirement on the computer systems that provide these services. Byzantine fault tolerance (BFT) is a promising technology to solidify such systems for the much needed high dependability. BFT employs redundant copies of the servers and ensures that a replicated system continues providing correct services despite the attacks on a small portion of the system. In this dissertation research, I developed novel algorithms and mechanisms to control various types of application nondeterminism and to ensure the long-term reliability of BFT systems via a migration-based proactive recovery scheme. I also investigated a new approach to significantly improve the overall system throughput by enabling concurrent processing using Software Transactional Memory (STM). Controlling application nondeterminism is essential to achieve strong replica consistency because the BFT technology is based on state-machine replication, which requires deterministic operation of each replica. Proactive recovery is necessary to ensure that the fundamental assumption of using the BFT technology is not violated over long term, i.e., less than one-third of replicas remain correct. Without proactive recovery, more and more replicas will be compromised under continuously attacks, which would render BFT ineffective. STM based concurrent processing maximized the system throughput by utilizing the power of multi-core CPUs while preserving strong replication consistenc

    Portable Inter-workgroup Barrier Synchronisation for GPUs

    Get PDF
    Despite the growing popularity of GPGPU programming, there is not yet a portable and formally-specified barrier that one can use to synchronise across workgroups. Moreover, the occupancy-bound execution model of GPUs breaks assumptions inherent in traditional software execution barriers, exposing them to deadlock. We present an occupancy discovery protocol that dynamically discovers a safe estimate of the occupancy for a given GPU and kernel, allowing for a starvation-free (and hence, deadlock-free) inter-workgroup barrier by restricting the number of workgroups according to this estimate. We implement this idea by adapting an existing, previously non-portable, GPU inter-workgroup barrier to use OpenCL 2.0 atomic operations, and prove that the barrier meets its natural specification in terms of synchronisation. We assess the portability of our approach over eight GPUs spanning four vendors, comparing the performance of our method against alternative methods. Our key findings include: (1) the recall of our discovery protocol is nearly 100%; (2) runtime comparisons vary substantially across GPUs and applications; and (3) our method provides portable and safe inter-workgroup synchronisation across the applications we study

    Unification of Transactions and Replication in Three-Tier Architectures Based on CORBA

    Get PDF
    In this paper, we describe a software infrastructure that unifies transactions and replication in three-tier architectures and provides data consistency and high availability for enterprise applications. The infrastructure uses transactions based on the CORBA object transaction service to protect the application data in databases on stable storage, using a roll-backward recovery strategy, and replication based on the fault tolerant CORBA standard to protect the middle-tier servers, using a roll-forward recovery strategy. The infrastructure replicates the middle-tier servers to protect the application business logic processing. In addition, it replicates the transaction coordinator, which renders the two-phase commit protocol nonblocking and, thus, avoids potentially long service disruptions caused by failure of the coordinator. The infrastructure handles the interactions between the replicated middle-tier servers and the database servers through replicated gateways that prevent duplicate requests from reaching the database servers. It implements automatic client-side failover mechanisms, which guarantee that clients know the outcome of the requests that they have made, and retries aborted transactions automatically on behalf of the clients
    • …
    corecore