10 research outputs found

    PROMON: a profile monitor of software applications

    Software techniques can be used effectively to increase the dependability of safety-critical applications. Many approaches are based on information redundancy to prevent data and code corruption during software execution. This paper presents PROMON, a C++ library that exploits a new methodology based on the concept of "Programming by Contract" to detect system malfunctions. Resorting to assertions, pre- and post-conditions, and marginal programmer interventions, PROMON-based applications can reach a high level of dependability.
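    As an illustration of the contract-based idea described in this abstract, the following is a minimal C++ sketch of pre- and post-condition checks around a routine. It is not PROMON's API: the names `ContractViolation`, `require`, `ensure`, and `isqrt` are hypothetical and chosen only for this example.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical exception type (not part of PROMON): signals a broken contract.
struct ContractViolation : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical helper: check a precondition on entry to a routine.
inline void require(bool cond, const std::string& what) {
    if (!cond) throw ContractViolation("precondition failed: " + what);
}

// Hypothetical helper: check a postcondition before returning a result.
inline void ensure(bool cond, const std::string& what) {
    if (!cond) throw ContractViolation("postcondition failed: " + what);
}

// Integer square root guarded by a contract. A corrupted input or an
// incorrect result is detected at run time instead of propagating silently.
int isqrt(int n) {
    require(n >= 0, "n >= 0");
    int r = 0;
    // Multiply in long long so (r + 1)^2 cannot overflow for large n.
    while (static_cast<long long>(r + 1) * (r + 1) <= n) ++r;
    const long long lo = static_cast<long long>(r) * r;
    const long long hi = static_cast<long long>(r + 1) * (r + 1);
    ensure(lo <= n && n < hi, "r*r <= n < (r+1)*(r+1)");
    return r;
}
```

    In this sketch a violated condition raises an exception, which leaves the choice of recovery action (retry, fail-safe shutdown, logging) to the caller.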

    Validation of a software dependability tool via fault injection experiments

    This paper presents the validation of the strategies employed in the RECCO tool to analyze C/C++ software. The RECCO compiler scans C/C++ source code to extract information about the significance of the variables that populate the program and about the code structure itself. Experimental results gathered on an Open Source Router are used to compare and correlate two sets of critical variables: one obtained by fault injection experiments, the other by applying the RECCO tool. The two sets are then analyzed, compared, and correlated to demonstrate the effectiveness of RECCO's methodology.

    Software dependability techniques validated via fault injection experiments

    The present paper proposes a C/C++ source-to-source compiler able to increase the dependability properties of a given application. The adopted strategy is based on two main techniques: variable duplication/triplication and control flow checking. These techniques are validated by emulating the appearance of faults through software fault injection. The chosen test case is a client-server application in charge of calculating and drawing a Mandelbrot fractal.
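    The variable triplication mentioned in this abstract can be sketched in C++ as a wrapper that keeps three copies of a value and takes a majority vote on every read. This is only an assumed illustration, not the code generated by the compiler described in the paper: the class `Triplicated` and its `corrupt` hook (used here to emulate an injected fault) are hypothetical names.

```cpp
// Hypothetical wrapper: stores three copies of a value and votes on read.
template <typename T>
class Triplicated {
public:
    explicit Triplicated(T v) : a_(v), b_(v), c_(v) {}

    // Write the same value to all three copies.
    void set(T v) { a_ = b_ = c_ = v; }

    // Majority vote. If two copies agree, the third is repaired and the
    // agreed value is returned with ok == true. If all three differ, the
    // fault is detected but cannot be corrected, so ok is set to false.
    T get(bool& ok) {
        ok = true;
        if (a_ == b_ || a_ == c_) { b_ = c_ = a_; return a_; }
        if (b_ == c_) { a_ = b_; return b_; }
        ok = false;
        return a_;
    }

    // Test hook emulating a fault: overwrite one copy (0, 1, or 2).
    void corrupt(int copy, T v) {
        if (copy == 0) a_ = v;
        else if (copy == 1) b_ = v;
        else c_ = v;
    }

private:
    T a_, b_, c_;
};
```

    Duplication alone (two copies) would allow detection of a mismatch but not correction, which is why a third copy is needed for voting.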

    Automated Synthesis of SEU Tolerant Architectures from OO Descriptions

    Single Event Upset (SEU) faults are a well-known problem in aerospace environments, but their relevance has recently grown at ground level as well, in commodity applications that are also subject to strong economic constraints on cost. On the other hand, the latest hardware description languages and synthesis tools narrow the boundary between the software and hardware domains, making high-level descriptions of hardware components very similar to software programs. Starting from these considerations, the present paper analyses the possibility of reusing Software Implemented Hardware Fault Tolerance (SIHFT) techniques, typically exploited in microprocessor-based systems, to design SEU-tolerant architectures. The main characteristics of SIHFT techniques are examined, together with the modifications needed to make them compatible with the synthesis flow. A complete environment is provided to automate the design instrumentation using the proposed techniques, and to perform fault injection experiments at both the behavioural and the gate level. Preliminary results presented in this paper show the effectiveness of the approach in terms of reliability improvement and reduced design effort.

    CPPC: a compiler‐assisted tool for portable checkpointing of message‐passing applications

    This is the peer reviewed version of the following article: Rodríguez, G., Martín, M. J., González, P., Touriño, J. and Doallo, R. (2010), CPPC: a compiler‐assisted tool for portable checkpointing of message‐passing applications. Concurrency Computat.: Pract. Exper., 22: 749-766. doi:10.1002/cpe.1541, which has been published in final form at https://doi.org/10.1002/cpe.1541. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions. [Abstract] With the evolution of high‐performance computing toward heterogeneous, massively parallel systems, parallel applications have developed new checkpoint and restart necessities. Whether due to a failure in the execution or to a migration of the application processes to different machines, checkpointing tools must be able to operate in heterogeneous environments. However, some of the data manipulated by a parallel application are not truly portable. Examples of these include opaque state (e.g. data structures for communications support) or diversity of interfaces for a single feature (e.g. communications, I/O). Directly manipulating the underlying ad hoc representations renders checkpointing tools unable to work on different environments. Portable checkpointers usually work around portability issues at the cost of transparency: the user must provide information such as what data need to be stored, where to store them, or where to checkpoint. CPPC (ComPiler for Portable Checkpointing) is a checkpointing tool designed to feature both portability and transparency. It is made up of a library and a compiler. The CPPC library contains routines for variable level checkpointing, using portable code and protocols. The CPPC compiler helps to achieve transparency by relieving the user from time‐consuming tasks, such as data flow and communications analyses and adding instrumentation code. This paper covers both the operation of the CPPC library and its compiler support. Experimental results using benchmarks and large‐scale real applications are included, demonstrating usability, efficiency, and portability. Ministerio de Educación y Ciencia; TIN2007‐67537‐C03. Xunta de Galicia; 2006/
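    To make the notion of variable-level, portable checkpointing in this abstract more concrete, here is a small C++ sketch in which the application registers the variables to save and the state is written as text rather than as a raw memory image. It does not reproduce the CPPC library interface: `Checkpointer`, `register_var`, `save`, and `restore` are hypothetical names, and the "name value" text format is an assumption made for this example.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical registry of checkpointed variables (not the CPPC API).
class Checkpointer {
public:
    // Register a variable under a name; only registered data are saved.
    void register_var(const std::string& name, long* addr) {
        vars_[name] = addr;
    }

    // Serialize every registered variable as a "name value" text line.
    // A textual format does not depend on byte order or word size, so the
    // state can be restored on a different machine.
    std::string save() const {
        std::ostringstream out;
        for (const auto& kv : vars_) out << kv.first << ' ' << *kv.second << '\n';
        return out.str();
    }

    // Restore registered variables from a saved state; names that are not
    // registered in this run are ignored.
    void restore(const std::string& state) {
        std::istringstream in(state);
        std::string name;
        long value = 0;
        while (in >> name >> value) {
            auto it = vars_.find(name);
            if (it != vars_.end()) *it->second = value;
        }
    }

private:
    std::map<std::string, long*> vars_;
};
```

    In a tool of the kind the abstract describes, the registration calls would be inserted by the compiler rather than written by hand; here they are explicit to keep the sketch self-contained.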

    Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing

    Abstract: Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The sources of the problems are node failures and the need for dynamic configuration over extensive runtime. This paper presents two fault-tolerance mechanisms called Theft-Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multithreaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications with a need for adaptive or reactionary configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocol is very small, and the maximum work lost by a crashed process is small and bounded. Index Terms: Grid computing, rollback recovery, checkpointing, event logging.

    Técnicas de ponto de controlo e adaptação em grelhas computacionais (Checkpointing and adaptation techniques in computational grids)

    Master's dissertation in Informatics Engineering. The recent popularity of grids has introduced the need to support robust execution of applications on a wide range of computing resources. In computational grids, where the reliability and availability of resources are not guaranteed, applications must support two fundamental requirements: fault tolerance and adaptation to the available resources. Traditional techniques use a "black-box" approach, in which the middleware alone is responsible for meeting these two requirements. Such approaches provide these services with minimal programmer intervention, but they limit the use of knowledge about the application's characteristics to optimize the services. This thesis presents aspect-oriented approaches to support fault tolerance and dynamic adaptation to resources in computational grids. In the proposed approaches, applications are enhanced with fault tolerance and dynamic adaptation capabilities through the activation of additional modules. The fault tolerance approach uses a checkpoint-and-restore strategy, while dynamic adaptation uses a variation of the over-decomposition technique. Both are portable across operating systems and limit the amount of changes required in the base code of the applications. Moreover, applications can adapt from a sequential execution to a multi-cluster configuration. Adaptation can be performed by checkpointing the application and restoring it on different machines, or it can be performed at run time.

    Compiler assisted checkpointing of message-passing applications in heterogeneous environments

    [Abstract] With the evolution of high performance computing towards heterogeneous, massively parallel systems, parallel applications have developed new checkpoint and restart necessities. Whether due to a failure in the execution or to a migration of the processes to different machines, checkpointing tools must be able to operate in heterogeneous environments. However, some of the data manipulated by a parallel application are not truly portable. Examples of these include opaque state (e.g. data structures for communications support) or diversity of interfaces for a single feature (e.g. communications, I/O). Directly manipulating the underlying ad-hoc representations renders checkpointing tools incapable of working on different environments. Portable checkpointers usually work around portability issues at the cost of transparency: the user must provide information such as what data need to be stored, where to store them, or where to checkpoint. CPPC (ComPiler for Portable Checkpointing) is a checkpointing tool designed to feature both portability and transparency, while preserving the scalability of the executed applications. It is made up of a library and a compiler. The CPPC library contains routines for variable level checkpointing, using portable code and protocols. The CPPC compiler achieves transparency by relieving the user from time-consuming tasks, such as performing code analyses and adding instrumentation code.