5,789 research outputs found
Microprocessor fault-tolerance via on-the-fly partial reconfiguration
This paper presents a novel approach to exploit FPGA dynamic partial reconfiguration to improve the fault tolerance of complex microprocessor-based systems, with no need to statically reserve area to host redundant components. The proposed method not only improves the survivability of the system by allowing the online replacement of defective key parts of the processor, but also provides performance graceful degradation by executing in software the tasks that were executed in hardware before a fault and the subsequent reconfiguration happened. The advantage of the proposed approach is that thanks to a hardware hypervisor, the CPU is totally unaware of the reconfiguration happening in real-time, and there's no dependency on the CPU to perform it. As proof of concept a design using this idea has been developed, using the LEON3 open-source processor, synthesized on a Virtex 4 FPG
Run-time fault detection in monitor based concurrent programming
Software Development and Management Lab., Dept. of ComputingRefereed conference paper2001-2002 > Academic research: refereed > Refereed conference paperVersion of RecordPublishe
Production of Reliable Flight Crucial Software: Validation Methods Research for Fault Tolerant Avionics and Control Systems Sub-Working Group Meeting
The state of the art in the production of crucial software for flight control applications was addressed. The association between reliability metrics and software is considered. Thirteen software development projects are discussed. A short term need for research in the areas of tool development and software fault tolerance was indicated. For the long term, research in format verification or proof methods was recommended. Formal specification and software reliability modeling, were recommended as topics for both short and long term research
System Description for a Scalable, Fault-Tolerant, Distributed Garbage Collector
We describe an efficient and fault-tolerant algorithm for distributed cyclic
garbage collection. The algorithm imposes few requirements on the local
machines and allows for flexibility in the choice of local collector and
distributed acyclic garbage collector to use with it. We have emphasized
reducing the number and size of network messages without sacrificing the
promptness of collection throughout the algorithm. Our proposed collector is a
variant of back tracing to avoid extensive synchronization between machines. We
have added an explicit forward tracing stage to the standard back tracing stage
and designed a tuned heuristic to reduce the total amount of work done by the
collector. Of particular note is the development of fault-tolerant cooperation
between traces and a heuristic that aggressively reduces the set of suspect
objects.Comment: 47 pages, LaTe
Unattended network operations technology assessment study. Technical support for defining advanced satellite systems concepts
The results are summarized of an unattended network operations technology assessment study for the Space Exploration Initiative (SEI). The scope of the work included: (1) identified possible enhancements due to the proposed Mars communications network; (2) identified network operations on Mars; (3) performed a technology assessment of possible supporting technologies based on current and future approaches to network operations; and (4) developed a plan for the testing and development of these technologies. The most important results obtained are as follows: (1) addition of a third Mars Relay Satellite (MRS) and MRS cross link capabilities will enhance the network's fault tolerance capabilities through improved connectivity; (2) network functions can be divided into the six basic ISO network functional groups; (3) distributed artificial intelligence technologies will augment more traditional network management technologies to form the technological infrastructure of a virtually unattended network; and (4) a great effort is required to bring the current network technology levels for manned space communications up to the level needed for an automated fault tolerance Mars communications network
FastPay: High-Performance Byzantine Fault Tolerant Settlement
FastPay allows a set of distributed authorities, some of which are Byzantine,
to maintain a high-integrity and availability settlement system for pre-funded
payments. It can be used to settle payments in a native unit of value
(crypto-currency), or as a financial side-infrastructure to support retail
payments in fiat currencies. FastPay is based on Byzantine Consistent Broadcast
as its core primitive, foregoing the expenses of full atomic commit channels
(consensus). The resulting system has low-latency for both confirmation and
payment finality. Remarkably, each authority can be sharded across many
machines to allow unbounded horizontal scalability. Our experiments demonstrate
intra-continental confirmation latency of less than 100ms, making FastPay
applicable to point of sale payments. In laboratory environments, we achieve
over 80,000 transactions per second with 20 authorities---surpassing the
requirements of current retail card payment networks, while significantly
increasing their robustness
Applying Formal Methods to Networking: Theory, Techniques and Applications
Despite its great importance, modern network infrastructure is remarkable for
the lack of rigor in its engineering. The Internet which began as a research
experiment was never designed to handle the users and applications it hosts
today. The lack of formalization of the Internet architecture meant limited
abstractions and modularity, especially for the control and management planes,
thus requiring for every new need a new protocol built from scratch. This led
to an unwieldy ossified Internet architecture resistant to any attempts at
formal verification, and an Internet culture where expediency and pragmatism
are favored over formal correctness. Fortunately, recent work in the space of
clean slate Internet design---especially, the software defined networking (SDN)
paradigm---offers the Internet community another chance to develop the right
kind of architecture and abstractions. This has also led to a great resurgence
in interest of applying formal methods to specification, verification, and
synthesis of networking protocols and applications. In this paper, we present a
self-contained tutorial of the formidable amount of work that has been done in
formal methods, and present a survey of its applications to networking.Comment: 30 pages, submitted to IEEE Communications Surveys and Tutorial
Counter Attack on Byzantine Generals: Parameterized Model Checking of Fault-tolerant Distributed Algorithms
We introduce an automated parameterized verification method for
fault-tolerant distributed algorithms (FTDA). FTDAs are parameterized by both
the number of processes and the assumed maximum number of Byzantine faulty
processes. At the center of our technique is a parametric interval abstraction
(PIA) where the interval boundaries are arithmetic expressions over parameters.
Using PIA for both data abstraction and a new form of counter abstraction, we
reduce the parameterized problem to finite-state model checking. We demonstrate
the practical feasibility of our method by verifying several variants of the
well-known distributed algorithm by Srikanth and Toueg. Our semi-decision
procedures are complemented and motivated by an undecidability proof for FTDA
verification which holds even in the absence of interprocess communication. To
the best of our knowledge, this is the first paper to achieve parameterized
automated verification of Byzantine FTDA
Formal Generation of Executable Assertions for Application-Oriented Fault Tolerance
Executable assertions embedded into a distributed computing system can provide run-time assurance by ensuring that the program state, in the actual run-time environment, is consistent with the logical stage specified in the assertions; if not, then an error has occurred and a reliable communication of this diagnostic information is provided to the system such that reconfiguration and recovery can take place. Application- oriented fault tolerance is a method that provides fault detection using executable assertions based on the natural constraints of the application.
This paper focuses on giving application-oriented fault tolerance a theoretical foundation by providing a mathematical model for the generation of executable assertions which detect faults in the presence of arbitrary failures. The mathematical model of choice was axiomatic program verification. A method was developed that translates a concurrent verification proof outline into an error-detecting concurrent program. This paper shows the application of the developed method to several applications
- …