    Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI

    Fault tolerance in MPI has become a main issue in the HPC community. Several approaches are envisioned, from user- or programmer-controlled fault tolerance to fully automatic fault detection and handling. For this last approach, several protocols have been proposed in the literature. In a recent paper, we demonstrated that uncoordinated checkpointing tolerates a higher fault frequency than coordinated checkpointing. Moreover, causal message logging protocols have been shown to be the most efficient message logging technique. These protocols consist of piggybacking nondeterministic events onto computation messages. Several such protocols have been proposed in the literature. Their merits are usually evaluated on four metrics: a) piggybacking computation cost, b) piggyback size, c) application performance, and d) fault recovery performance. In this paper, we investigate the benefit of using stable storage for logging message events in causal message logging protocols. To evaluate the advantage of this technique, we implemented three protocols: 1) a classical causal message protocol proposed in Manetho, 2) a state-of-the-art protocol known as LogOn, and 3) a light-computation-cost protocol called Vcausal. We demonstrate a major impact of this stable storage for the three protocols, on the four criteria, for micro benchmarks as well as for the NAS benchmark.
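
The piggybacking idea in this abstract can be illustrated with a minimal sketch. It is not the Manetho, LogOn, or Vcausal protocol: the `Determinant`, `Process`, and `run` names are hypothetical, the event logger is modeled as a plain set, and the logger is assumed to acknowledge determinants immediately. Under those assumptions it shows only why stable storage can shrink the piggyback carried by each message.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """One nondeterministic event: `receiver` delivered `msg_id` as delivery number `order`."""
    receiver: int
    msg_id: int
    order: int


@dataclass
class Process:
    rank: int
    delivered: int = 0
    # Determinants known to this process that are not yet known to be stable.
    unstable: set = field(default_factory=set)

    def send(self, msg_id, payload):
        # Piggyback every unstable determinant onto the computation message.
        return {"id": msg_id, "payload": payload, "piggyback": frozenset(self.unstable)}

    def deliver(self, message):
        # Adopt the sender's causal past, then record our own delivery event.
        self.unstable |= message["piggyback"]
        self.delivered += 1
        self.unstable.add(Determinant(self.rank, message["id"], self.delivered))

    def flush_to(self, event_logger):
        # Assumption: the logger stores and acknowledges at once, so these
        # determinants no longer need to travel with later messages.
        event_logger |= self.unstable
        self.unstable.clear()


def run(use_event_logger):
    """Pass six messages around a ring of three processes; return piggyback sizes."""
    procs = [Process(rank) for rank in range(3)]
    event_logger = set()  # stands in for stable storage
    sizes = []
    for msg_id in range(6):
        sender, receiver = procs[msg_id % 3], procs[(msg_id + 1) % 3]
        message = sender.send(msg_id, payload=b"data")
        sizes.append(len(message["piggyback"]))
        receiver.deliver(message)
        if use_event_logger:
            receiver.flush_to(event_logger)
    return sizes, event_logger
```

Calling `run(False)` lets the piggyback grow with every hop, while `run(True)` keeps it empty because each determinant is handed to the logger as soon as it is created. A real event logger acknowledges asynchronously, so the reduction would be smaller than in this idealized model.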

    Keeping Authorities "Honest or Bust" with Decentralized Witness Cosigning

    The secret keys of critical network authorities - such as time, name, certificate, and software update services - represent high-value targets for hackers, criminals, and spy agencies wishing to use these keys secretly to compromise other hosts. To protect authorities and their clients proactively from undetected exploits and misuse, we introduce CoSi, a scalable witness cosigning protocol ensuring that every authoritative statement is validated and publicly logged by a diverse group of witnesses before any client will accept it. A statement S collectively signed by W witnesses assures clients that S has been seen, and not immediately found erroneous, by those W observers. Even if S is compromised in a fashion not readily detectable by the witnesses, CoSi still guarantees S's exposure to public scrutiny, forcing secrecy-minded attackers to risk that the compromise will soon be detected by one of the W witnesses. Because clients can verify collective signatures efficiently without communication, CoSi protects clients' privacy, and offers the first transparency mechanism effective against persistent man-in-the-middle attackers who control a victim's Internet access, the authority's secret key, and several witnesses' secret keys. CoSi builds on existing cryptographic multisignature methods, scaling them to support thousands of witnesses via signature aggregation over efficient communication trees. A working prototype demonstrates CoSi in the context of timestamping and logging authorities, enabling groups of over 8,000 distributed witnesses to cosign authoritative statements in under two seconds. Comment: 20 pages, 7 figures
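
The signature aggregation this abstract mentions can be sketched with a toy Schnorr-style multisignature. This is an illustration under loud assumptions, not the CoSi protocol or its API: the group (integers modulo the Mersenne prime 2**127 - 1 with generator 3) is far too weak for real use, the function names are hypothetical, aggregation is flat rather than tree-based, and absent witnesses are not handled.

```python
import hashlib
import secrets

# Toy parameters (assumption): multiplicative group modulo 2**127 - 1.
# Chosen for readability only; this offers no real security.
P = 2**127 - 1
G = 3
ORDER = P - 1  # exponents may be reduced modulo P - 1 (Fermat's little theorem)


def challenge(commitment, statement):
    """Hash the aggregate commitment and the statement into an exponent."""
    digest = hashlib.sha256(commitment.to_bytes(16, "big") + statement).digest()
    return int.from_bytes(digest, "big") % ORDER


def keygen():
    secret = secrets.randbelow(ORDER - 1) + 1
    return secret, pow(G, secret, P)


def cosign(statement, secret_keys):
    # Round 1: each witness picks a nonce; the commitments are multiplied together.
    nonces = [secrets.randbelow(ORDER - 1) + 1 for _ in secret_keys]
    aggregate_commitment = 1
    for nonce in nonces:
        aggregate_commitment = aggregate_commitment * pow(G, nonce, P) % P
    # Round 2: one shared challenge; the per-witness responses are summed.
    c = challenge(aggregate_commitment, statement)
    response = sum(v - c * x for v, x in zip(nonces, secret_keys)) % ORDER
    return c, response


def verify(statement, signature, public_keys):
    c, response = signature
    aggregate_key = 1
    for key in public_keys:
        aggregate_key = aggregate_key * key % P
    # G^response * X^c rebuilds the aggregate commitment when every listed witness signed.
    commitment = pow(G, response, P) * pow(aggregate_key, c, P) % P
    return challenge(commitment, statement) == c
```

The signature is a single pair `(c, response)` whatever the number of witnesses, which is why a client can check it offline with only the witnesses' public keys.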

    Project Final Report: HPC-Colony II

    Data logger for medical device coordination framework

    Master of Science. Department of Computing and Information Sciences. Daniel A. Andresen. A software application or hardware device performs well under favorable conditions. In practice, many factors can affect the performance and functioning of the system. Scenarios in which the system fails or performs better need to be determined, and logging is one of the best ways to determine them. Logging can help identify both the worst and the most effective performance. Logging levels give flexibility in logging different kinds of messages. Determining which messages to log is the key to logging. All important events, state changes, and messages should be logged in order to follow the high-level progress of the system. The Medical Device Coordination Framework (MDCF) deals with device connectivity to the MDCF server. In this report, we propose a logging component for the existing MDCF. The logging component is inspired by the flight data recorder, or “black box”, a device used to log every message passing through the flight's system. This makes it reliable and easy to investigate any failures in the system. We will also be able to simulate the replay of scenarios. The important state changes in MDCF include device connection, scenario instantiation, the initial state of the MDCF server, and destination creation. Logging in MDCF is implemented by wrapping the Log4j logging framework. MDCF logs through the interface provided by the logging component. This implementation facilitates building a more complex logging component for MDCF.

    Unified fault-tolerance framework for hybrid task-parallel message-passing applications

    We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Second, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme. The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the FI-DGR 2013 scholarship and the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and TIN2015-65316-P. Peer Reviewed. Postprint (author's final draft)
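
The abstract does not reproduce its closed formulas, so the sketch below substitutes a different, well-known result: Young's first-order approximation for the optimal checkpoint interval, sqrt(2 * C * M), where C is the cost of one checkpoint and M is the mean time between failures. The function names and the simple overhead model are assumptions for illustration, not the paper's unified scheme.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's approximation: interval that minimizes first-order overhead."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def expected_overhead(interval, checkpoint_cost, mtbf):
    """First-order wasted fraction of time.

    checkpoint_cost / interval : time spent writing checkpoints
    interval / (2 * mtbf)      : expected rework, since a failure loses
                                 half an interval on average
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost, mtbf = 60.0, 86400.0  # hypothetical: 60 s checkpoint, one failure per day
    best = young_interval(cost, mtbf)
    print(f"interval: {best:.0f} s, overhead: {expected_overhead(best, cost, mtbf):.2%}")
```

Setting the derivative of the overhead expression to zero gives the square-root formula directly, which is why checkpointing more or less often than this interval both increase the first-order waste.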

    Gathering experience in trust-based interactions

    As advances in mobile and embedded technologies, coupled with progress in ad hoc networking, fuel the shift towards ubiquitous computing systems, it is becoming increasingly clear that security is a major concern. While this is true of all computing paradigms, the characteristics of ubiquitous systems amplify this concern by promoting spontaneous interaction between diverse heterogeneous entities across administrative boundaries [5]. Entities therefore cannot rely on a specific control authority and will have no global view of the state of the system. Facilitating collaboration with unfamiliar counterparts therefore requires that an entity take a proactive approach to self-protection. We conjecture that trust management is the best way to provide support for such self-protection measures.

    The state of peer-to-peer network simulators

    Networking research often relies on simulation in order to test and evaluate new ideas. An important requirement of this process is that results must be reproducible so that other researchers can replicate, validate, and extend existing work. We look at the landscape of simulators for research in peer-to-peer (P2P) networks by conducting a survey of a combined total of over 280 papers from before and after 2007 (the year of the last survey in this area), and comment on the large quantity of research using bespoke, closed-source simulators. We propose a set of criteria that P2P simulators should meet, and poll the P2P research community for their agreement. We aim to drive the community towards performing their experiments on simulators that allow others to validate their results.

    A survey of checkpointing algorithms for parallel and distributed computers

    A checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving the status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in that article and by extending it to minimise the overheads of coordination and context saving. Checkpointing for shared memory systems primarily extends cache coherence protocols to maintain a consistent memory. All of these algorithms assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They also, however, include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is felt, however, that in the development of parallel programs the user has to do a fair amount of work in distributing tasks, and this information can be effectively used to simplify checkpointing and rollback recovery.
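
Since the survey traces most message-passing checkpointing algorithms back to Chandy and Lamport, a small simulation of their marker-based snapshot may help. It is a single-threaded sketch assuming reliable FIFO channels; the `Node`, `transfer`, and `demo` names and the money-transfer workload are hypothetical and do not come from the survey.

```python
from collections import deque


class Node:
    def __init__(self, name, balance, peers):
        self.name = name
        self.balance = balance
        self.peers = peers            # names of all other nodes
        self.recorded_state = None    # local state saved by the snapshot
        self.channel_log = {}         # in-flight messages per incoming channel
        self.recording = set()        # incoming channels still being recorded

    def start_snapshot(self, network):
        # Record local state, then send a marker on every outgoing channel.
        self.recorded_state = self.balance
        self.recording = set(self.peers)
        self.channel_log = {peer: [] for peer in self.peers}
        for peer in self.peers:
            network[(self.name, peer)].append("MARKER")

    def receive(self, sender, message, network):
        if message == "MARKER":
            if self.recorded_state is None:
                self.start_snapshot(network)
            # The channel from `sender` is now fully recorded.
            self.recording.discard(sender)
        else:
            self.balance += message
            if sender in self.recording:
                # Sent before the sender's snapshot, received after ours.
                self.channel_log[sender].append(message)


def transfer(network, nodes, src, dst, amount):
    nodes[src].balance -= amount
    network[(src, dst)].append(amount)


def demo():
    names = ["A", "B", "C"]
    nodes = {n: Node(n, 100, [m for m in names if m != n]) for n in names}
    network = {(s, d): deque() for s in names for d in names if s != d}

    transfer(network, nodes, "A", "B", 10)  # in flight when the snapshot starts
    transfer(network, nodes, "C", "A", 5)   # in flight when the snapshot starts
    nodes["A"].start_snapshot(network)
    transfer(network, nodes, "B", "C", 7)

    # Deliver messages channel by channel, in FIFO order, until all are empty.
    while any(network.values()):
        for (src, dst), channel in network.items():
            if channel:
                nodes[dst].receive(src, channel.popleft(), network)

    recorded = sum(node.recorded_state for node in nodes.values())
    in_flight = sum(sum(msgs) for node in nodes.values()
                    for msgs in node.channel_log.values())
    return recorded, in_flight
```

The recorded local states plus the recorded channel contents add up to the 300 units the system started with, even though no node stopped to coordinate. That conservation is the consistency property the marker rule is meant to provide.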