277 research outputs found

    A Smart Checkpointing Scheme for Improving the Reliability of Clustering Routing Protocols

    Get PDF
    In wireless sensor networks, system architectures and applications are designed to consider both resource constraints and scalability, because such networks are composed of numerous sensor nodes with various sensors and actuators, small memories, low-power microprocessors, radio modules, and batteries. Clustering routing protocols based on data aggregation schemes aimed at minimizing packet numbers have been proposed to meet these requirements. In clustering routing protocols, the cluster head plays an important role. The cluster head collects data from its member nodes and aggregates the collected data. To improve reliability and reduce recovery latency, we propose a checkpointing scheme for the cluster head. In the proposed scheme, backup nodes monitor and checkpoint the current state of the cluster head periodically. We also derive the checkpointing interval that maximizes reliability while using the same amount of energy consumed by clustering routing protocols that operate without checkpointing. Experimental comparisons with existing non-checkpointing schemes show that our scheme reduces both energy consumption and recovery latency

    Recovery-oriented software architecture for grid applications (ROSA-Grids)

    Get PDF
    Grids are distributed systems that dynamically coordinate a large number of heterogeneous resources to execute large-scale projects. Examples of grid resources include high-performance computers, massive data stores, high bandwidth networking, telescopes, and synchrotrons. Failure in grids is arguably inevitable due to the massive scale and the heterogeneity of grid resources, the distribution of these resources over unreliable networks, the complexity of mechanisms that are needed to integrate such resources into a seamless utility, and the dynamic nature of the grid infrastructure that allows continuous changes to happen. To make matters worse, grid applications are generally long running, and these runs repeatedly require coordinated use of many resources at the same time. In this thesis, we propose the Recovery-Aware Components (RAC) approach. The RAC approach enables a grid application to handle failure reactively and proactively at the level of the smallest and independent execution unit of the application. The approach also combines runtime prediction with a proactive fault tolerance strategy. The RAC approach aims at improving the reliability of the grid application with the least overhead possible. Moreover, to allow a grid fault tolerance manager fine-tuned control and trading off of reliability gained and overhead paid, this thesis offers an architecture-aware modelling and simulation of reliability and overhead. The thesis demonstrates for a few of a dozen or so classes of application architecture already identified in prior research, that the typical architectural structure of the class can be captured in a few parameters. The work shows that these parameters suffice to achieve significant insight into, and control of, such tradeoffs. The contributions of our research project are as follows. We defined the RAC approach. We showed the usage of the RAC approach for improving the reliability of MapReduce and Combinational Logic grid applications. We provided Markov models that represent the execution behaviour of these applications for reliability and overhead analyses. We analysed the sensitivity of the reliability-overhead tradeoff of the RAC approach to the type of fault tolerance strategy, the parameters of a fault tolerance strategy, prediction interval and a predictor’s accuracy. The final contribution of our research is an experiment testbed that enables a grid fault tolerance expert to evaluate diverse fault tolerance support configurations, and then choose the one that will satisfy the reliability and cost requirements

    ETAP: Energy-Aware Timing Analysis of Intermittent Programs

    Get PDF
    Energy harvesting battery-free embedded devices rely only on ambient energy harvesting that enables stand-alone and sustainable IoT applications. These devices execute programs when the harvested ambient energy in their energy reservoir is sufficient to operate and stop execution abruptly (and start charging) otherwise. These intermittent programs have varying timing behavior under different energy conditions, hardware configurations, and program structures. This article presents Energy-aware Timing Analysis of intermittent Programs (ETAP), a probabilistic symbolic execution approach that analyzes the timing and energy behavior of intermittent programs at compile time. ETAP symbolically executes the given program while taking time and energy cost models for ambient energy and dynamic energy consumption into account. We evaluate ETAP by comparing the compile-time analysis results of our benchmark codes and real-world application with the results of their executions on real hardware. Our evaluation shows that ETAP’s prediction error rate is between 0.0076% and 10.8%, and it speeds up the timing analysis by at least two orders of magnitude compared to manual testing.acceptedVersio

    Investigation of the applicability of a functional programming model to fault-tolerant parallel processing for knowledge-based systems

    Get PDF
    In a fault-tolerant parallel computer, a functional programming model can facilitate distributed checkpointing, error recovery, load balancing, and graceful degradation. Such a model has been implemented on the Draper Fault-Tolerant Parallel Processor (FTPP). When used in conjunction with the FTPP's fault detection and masking capabilities, this implementation results in a graceful degradation of system performance after faults. Three graceful degradation algorithms have been implemented and are presented. A user interface has been implemented which requires minimal cognitive overhead by the application programmer, masking such complexities as the system's redundancy, distributed nature, variable complement of processing resources, load balancing, fault occurrence and recovery. This user interface is described and its use demonstrated. The applicability of the functional programming style to the Activation Framework, a paradigm for intelligent systems, is then briefly described

    Input/Output of Ab-initio Nuclear Structure Calculations for Improved Performance and Portability

    Get PDF
    Many modern scientific applications rely on highly computation intensive calculations. However, most applications do not concentrate as much on the role that input/output operations can play for improved performance and portability. Parallelizing input/output operations of large files can significantly improve the performance of parallel applications where sequential I/O is a bottleneck. A proper choice of I/O library also offers a scope for making input/output operations portable across different architectures. Thus, use of parallel I/O libraries for organizing I/O of large data files offers great scope in improving performance and portability of applications. In particular, sequential I/O has been identified as a bottleneck for the highly scalable MFDn (Many Fermion Dynamics for nuclear structure) code performing ab-initio nuclear structure calculations. We develop interfaces and parallel I/O procedures to use a well-known parallel I/O library in MFDn. As a result, we gain efficient I/O of large datasets along with their portability and ease of use in the down-stream processing. Even situations where the amount of data to be written is not huge, proper use of input/output operations can boost the performance of scientific applications. Application checkpointing offers enormous performance improvement and flexibility by doing a negligible amount of I/O to disk. Checkpointing saves and resumes application state in such a manner that in most cases the application is unaware that there has been an interruption to its execution. This helps in saving large amount of work that has been previously done and continue application execution. This small amount of I/O provides substantial time saving by offering restart/resume capability to applications. The need for checkpointing in optimization code NEWUOA has been identified and checkpoint/restart capability has been implemented in NEWUOA by using simple file I/O

    High available and fault tolerant mobile communications infrastructure

    Get PDF

    Design and Implementation of a Byzantine Fault Tolerance Framework for Web Services

    Get PDF
    Many Web services are expected to run with high degree of security and dependability. To achieve this goal, it is essential to use a Web services compatible framework that tolerates not only crash faults, but Byzantine faults as well, due to the untrusted communication environment in which the Web services operate. In this paper, we describe the design and implementation of such a framework, called BFT-WS. BFT-WS is designed to operate on top of the standard SOAP messaging framework for maximum interoperability. It is implemented as a pluggable module within the Axis2 architecture, as such, it requires minimum changes to the Web applications. The core fault tolerance mechanisms used in BFT-WS are based on the well-known Castro and Liskov’s BFT algorithm for optimal efficiency. Our performance measurements confirm that BFT-WS incurs only moderate runtime overhead considering the complexity of the mechanisms

    An Experimental Evaluation of the REE SIFT Environment for Spaceborne Applications

    Get PDF
    Coordinated Science Laboratory was formerly known as Control Systems Laborator
    • …
    corecore