
    Resilience in Numerical Methods: A Position on Fault Models and Methodologies

    Future extreme-scale computer systems may expose silent data corruption (SDC) to applications, in order to save energy or increase performance. However, resilience research struggles to come up with useful abstract programming models for reasoning about SDC. Existing work randomly flips bits in running applications, but this only shows average-case behavior for a low-level, artificial hardware model. Algorithm developers need to understand worst-case behavior with the higher-level data types they actually use, in order to make their algorithms more resilient. Also, we know so little about how SDC may manifest in future hardware that it seems premature to draw conclusions about the average case. We argue instead that numerical algorithms can benefit from a numerical unreliability fault model, where faults manifest as unbounded perturbations to floating-point data. Algorithms can use inexpensive "sanity" checks that bound or exclude error in the results of computations. Given a selective reliability programming model that requires reliability only when and where needed, such checks can make algorithms reliable despite unbounded faults. Sanity checks, and in general a healthy skepticism about the correctness of subroutines, are wise even if hardware is perfectly reliable. Comment: Position paper.
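
    The sanity-check idea can be sketched in a few lines of Python. The sketch below assumes a cheap, possibly faulty routine and a reliable fallback (both names hypothetical) and uses a simple norm bound as the inexpensive check; it illustrates the selective-reliability pattern in general rather than any particular algorithm from the paper.

        import numpy as np

        def checked_step(apply_unreliable, apply_reliable, b, growth_bound=1e8):
            # Run the cheap routine that may suffer silent data corruption.
            x = apply_unreliable(b)
            # Inexpensive sanity check: exclude non-finite values and results
            # whose norm grew implausibly relative to the input.
            if (not np.all(np.isfinite(x))
                    or np.linalg.norm(x) > growth_bound * (np.linalg.norm(b) + 1.0)):
                # Selective reliability: pay for the reliable path only when needed.
                x = apply_reliable(b)
            return x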

    Optimised configuration of sensors for fault tolerant control of an electro-magnetic suspension system

    For any given system, the number and location of sensors can affect the closed-loop performance as well as the reliability of the system. Hence, one problem in control system design is the selection of the sensors in some optimum sense that considers both system performance and reliability. Although existing methods address some of these aspects, in this work a design framework dealing with both control and reliability aspects is presented. The proposed framework is able to identify the best sensor set for which optimum performance is achieved even under single or multiple sensor failures, with minimum sensor redundancy. The proposed systematic framework combines linear quadratic Gaussian control, fault tolerant control and multiobjective optimisation. The efficacy of the proposed framework is shown via appropriate simulations on an electro-magnetic suspension system.
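
    As a rough illustration of the kind of search such a framework performs, the Python sketch below enumerates sensor subsets and keeps the Pareto-optimal ones with respect to nominal performance, worst-case performance under sensor failures, and sensor count. The two cost functions stand in for an LQG design and its fault-tolerant re-evaluation; all names are placeholders, not the paper's method.

        from itertools import combinations

        def pareto_sensor_sets(sensors, nominal_cost, worst_case_failure_cost):
            # Evaluate every candidate sensor subset on three costs (lower is better).
            candidates = []
            for r in range(1, len(sensors) + 1):
                for subset in combinations(sensors, r):
                    candidates.append((subset,
                                       nominal_cost(subset),
                                       worst_case_failure_cost(subset),
                                       len(subset)))

            def dominates(a, b):
                # a dominates b: no worse on every cost, strictly better on at least one.
                return (all(x <= y for x, y in zip(a[1:], b[1:]))
                        and any(x < y for x, y in zip(a[1:], b[1:])))

            return [c for c in candidates
                    if not any(dominates(d, c) for d in candidates)]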

    Near Optimal Parallel Algorithms for Dynamic DFS in Undirected Graphs

    A depth first search (DFS) tree is a fundamental data structure for solving graph problems. The classical algorithm [SiComp74] for building a DFS tree requires O(m+n) time for a given graph G having n vertices and m edges. Recently, Baswana et al. [SODA16] presented a simple algorithm for updating the DFS tree of an undirected graph after an edge/vertex update in Õ(n) time. However, their algorithm is strictly sequential. We present an algorithm achieving similar bounds that can be adapted easily to the parallel environment. In the parallel model, a DFS tree can be computed from scratch using m processors in expected Õ(1) time [SiComp90] on an EREW PRAM, whereas the best deterministic algorithm takes Õ(√n) time [SiComp90, JAlg93] on a CRCW PRAM. Our algorithm can be used to develop optimal (up to polylog n factors) deterministic algorithms for maintaining fully dynamic DFS and fault tolerant DFS of an undirected graph. (1) Parallel fully dynamic DFS: given an arbitrary online sequence of vertex/edge updates, we can maintain a DFS tree of an undirected graph in Õ(1) time per update using m processors on an EREW PRAM. (2) Parallel fault tolerant DFS: an undirected graph can be preprocessed to build a data structure of size O(m) such that for a set of k updates (where k is constant) in the graph, the updated DFS tree can be computed in Õ(1) time using n processors on an EREW PRAM. Moreover, our fully dynamic DFS algorithm provides, in a seamless manner, nearly optimal (up to polylog n factors) algorithms for maintaining a DFS tree in the semi-streaming model and a restricted distributed model. These are the first parallel, semi-streaming and distributed algorithms for maintaining a DFS tree in the dynamic setting. Comment: Accepted to appear in SPAA'17, 32 pages, 5 figures.
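
    For reference, the classical static construction the abstract builds on can be written as a short Python routine; it builds a DFS tree of an undirected graph in O(m+n) time. This is only the sequential baseline, not the parallel, dynamic, or fault tolerant algorithms described above.

        def dfs_tree(adj, root):
            # adj maps each vertex to a list of its neighbours.
            # Returns parent[v], the parent of v in the DFS tree (None for the root).
            parent = {}
            visited = set()
            stack = [(root, None)]
            while stack:
                u, p = stack.pop()
                if u in visited:
                    continue
                visited.add(u)
                parent[u] = p
                for v in adj[u]:
                    if v not in visited:
                        stack.append((v, u))
            return parent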

    Kompics: a message-passing component model for building distributed systems

    The Kompics component model and programming framework was designed to simplify the development of increasingly complex distributed systems. Systems built with Kompics leverage multi-core machines out of the box, and they can be dynamically reconfigured to support hot software upgrades. A simulation framework enables deterministic debugging and reproducible performance evaluation of unmodified Kompics distributed systems. We describe the component model and show how to program and compose event-based distributed systems. We present the architectural patterns and abstractions that Kompics facilitates and we highlight a case study of a complex distributed middleware that we have built with Kompics. We show how our approach enables systematic development and evaluation of large-scale and dynamic distributed systems.
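
    A generic flavour of the message-passing component style can be sketched in Python: each component owns an event queue and a handler that runs on its own thread, and components are wired together explicitly. This is a toy illustration of the programming model, not the actual Kompics API, and it omits Kompics's typed ports and channels.

        import queue
        import threading

        class Component:
            # A toy event-driven component: events arrive on an inbox and are
            # handled sequentially on the component's own thread.
            def __init__(self):
                self.inbox = queue.Queue()
                self.outputs = []
                threading.Thread(target=self._loop, daemon=True).start()

            def connect(self, other):
                # Wire this component's output to another component's inbox.
                self.outputs.append(other)

            def trigger(self, event):
                for c in self.outputs:
                    c.inbox.put(event)

            def handle(self, event):
                # Override in subclasses to define event handlers.
                pass

            def _loop(self):
                while True:
                    self.handle(self.inbox.get())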

    Self-stabilizing sorting algorithms

    A distributed system consists of a set of machines which do not share a global memory. Depending on the connectivity of the network, each machine gets a partial view of the global state. Transient failures in one area of the network may go unnoticed in other areas and may cause the system to go to an illegal global state. However, if the system were self-stabilizing, it would be guaranteed that, regardless of the current state, the system would recover to a legal configuration in a finite number of moves. The traditional way of creating reliable systems is to make components redundant; self-stabilization allows systems to be fault tolerant through software as well. This is an evolving paradigm in the design of robust distributed systems. The ability to recover spontaneously from an arbitrary state makes self-stabilizing systems immune to transient failures or perturbations of the system state, such as changes in network topology. This thesis presents an O(nh) fault-tolerant distributed sorting algorithm for a tree network, where n is the number of nodes in the system and h is the height of the tree. Fault-tolerance is achieved using Dijkstra's paradigm of self-stabilization, a method of non-masking fault-tolerance that embeds the fault-tolerance within the algorithm. Varghese's counter flushing method is used to achieve synchronization among processes in the system. In the distributed sorting problem, each node is given a value and an id, both of which are non-corruptible. The idea is to have each node take a specific value based on its id. The algorithm handles transient faults by weeding out false information in the system. Nodes can start with completely false information concerning the values and ids of the system, yet the intended behavior is still achieved. Nodes are also allowed to crash and re-enter the system later, and new nodes may enter the system.
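
    The flavour of self-stabilization can be shown on the simplest possible topology: a chain of nodes that repeatedly apply a local compare-and-swap rule with a neighbour. Starting from any (possibly corrupted) configuration, the system converges to the sorted, legal state. The Python sketch below illustrates that idea only; it is not the thesis's O(nh) tree algorithm or its counter-flushing synchronization.

        def stabilizing_round(values):
            # One round: every node compares its held value with its right
            # neighbour and swaps if they are out of order (a local move).
            moved = False
            for i in range(len(values) - 1):
                if values[i] > values[i + 1]:
                    values[i], values[i + 1] = values[i + 1], values[i]
                    moved = True
            return moved

        def stabilize(values):
            # From an arbitrary starting state, repeated local rounds reach
            # the legal (sorted) configuration in finitely many moves.
            while stabilizing_round(values):
                pass
            return values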