919 research outputs found

    FADI: a fault-tolerant environment for open distributed computing

    FADI is a complete programming environment that supports the reliable execution of distributed application programs. FADI encompasses all aspects of modern fault-tolerant distributed computing. The built-in, user-transparent error detection mechanism covers processor node crashes and transient hardware failures. The mechanism also integrates user-assisted error checks into the system failure model. The nucleus non-blocking checkpointing mechanism, combined with a novel selective message logging technique, delivers an efficient, low-overhead backup and recovery mechanism for distributed processes. FADI also provides means for remote automatic process allocation on the distributed system nodes.
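    The combination of checkpointing and message logging mentioned in this abstract can be illustrated with a minimal sketch. It is not FADI's selective logging technique, which the abstract does not describe; it is plain receiver-side logging for a toy deterministic process. The class name `LoggedProcess` and its methods are hypothetical, and the checkpoint and log are assumed to live on storage that survives a crash.

```python
class LoggedProcess:
    """Toy deterministic process whose state is a running total of messages.

    Assumption: `checkpoint` and `log` stand in for stable storage that
    survives a crash; only `total` is volatile.
    """

    def __init__(self):
        self.total = 0        # volatile process state
        self.checkpoint = 0   # last saved copy of the state
        self.log = []         # messages received since that checkpoint

    def receive(self, msg):
        # Log first, then apply, so a crash cannot lose a delivered message.
        self.log.append(msg)
        self.total += msg

    def take_checkpoint(self):
        # Save the state; older messages are now covered by the checkpoint.
        self.checkpoint = self.total
        self.log.clear()

    def recover(self):
        # Roll back to the checkpoint, then replay the logged messages.
        self.total = self.checkpoint
        for msg in self.log:
            self.total += msg
        return self.total
```

    Because the process is deterministic, replaying the logged messages on top of the last checkpoint reproduces the pre-crash state without forcing other processes to roll back.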

    Designing SSI clusters with hierarchical checkpointing and single I/O space

    Adopting a new hierarchical checkpointing architecture, the authors develop a single I/O address space for building highly available clusters of computers. They propose a systematic approach to achieving a single system image by integrating existing middleware support with the newly developed features.

    A New Concurrent Checkpoint Mechanism for Embedded Multi-Core Systems

    This paper presents a new transparent, incremental, concurrent checkpoint mechanism for embedded multi-core systems. It allows the checkpointed process (also called the checkpointee) to keep running, to a large extent, while a checkpoint is being set. While memory pages are being dumped (the most time-consuming step in setting a checkpoint), the mechanism traces TLB misses to block the first access to each target memory page. At that point a kernel thread, called the checkpointer, copies the accessed pages to a designated memory buffer to construct a consistent state of the checkpointee, and then resumes the memory accesses. In the experiments, compared with a traditional concurrent checkpoint system, the proposed mechanism reduces the downtime of the checkpointed process by more than 10.1%. Incremental checkpointing has also been implemented in the new mechanism. Compared with full checkpointing, incremental checkpointing reduces the checkpoint time by more than 95.5% and 89.2% for a matrix multiplication benchmark at checkpoint intervals of 10 seconds and 20 seconds, respectively.
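    The incremental part of this idea, saving only the pages modified since the previous checkpoint, can be sketched in user space as follows. This is a simplified illustration under stated assumptions, not the paper's kernel mechanism: dirty pages are tracked explicitly in `write` rather than through TLB misses, there is no concurrency, and `PagedMemory`, `PAGE_SIZE`, and the helper functions are hypothetical names.

```python
PAGE_SIZE = 4  # hypothetical page size, counted in list slots


class PagedMemory:
    """Toy memory split into pages, with explicit dirty-page tracking."""

    def __init__(self, num_pages):
        self.pages = [[0] * PAGE_SIZE for _ in range(num_pages)]
        self.dirty = set()  # indices of pages written since the last checkpoint

    def write(self, addr, value):
        page, offset = divmod(addr, PAGE_SIZE)
        self.pages[page][offset] = value
        self.dirty.add(page)


def full_checkpoint(mem):
    """Copy every page and reset dirty tracking."""
    mem.dirty.clear()
    return {i: list(p) for i, p in enumerate(mem.pages)}


def incremental_checkpoint(mem):
    """Copy only the pages written since the previous checkpoint."""
    delta = {i: list(mem.pages[i]) for i in sorted(mem.dirty)}
    mem.dirty.clear()
    return delta


def restore(base, deltas):
    """Rebuild memory from a full checkpoint plus deltas, oldest first."""
    state = {i: list(p) for i, p in base.items()}
    for delta in deltas:
        for i, p in delta.items():
            state[i] = list(p)
    return [state[i] for i in sorted(state)]
```

    When a workload touches only a few pages between checkpoints, each delta is much smaller than a full copy, which is the effect the abstract reports as reduced checkpoint time.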

    Solving multiprocessor drawbacks with kilo-instruction processors

    Nowadays, a good multiprocessor system design has to deal with many drawbacks in order to achieve a good tradeoff between complexity and performance. For example, solving problems such as coherence and consistency is essential for correctness, while avoiding processor stalls due to critical sections and synchronization points is desirable for performance. None of these drawbacks has a straightforward solution. We show in our paper how the multi-checkpointing mechanism of Kilo-Instruction Processors can be leveraged to achieve a complexity-effective multiprocessor design. Specifically, we describe a Kilo-Instruction Multiprocessor that transparently, i.e. without any software support, uses transaction-based memory updates. Our model simplifies the coherence and consistency hardware and gives the potential for easily applying different desirable speculative mechanisms to enhance performance when facing some synchronization constructs of current parallel applications.