    Execution replay and debugging

    As most parallel and distributed programs are internally non-deterministic -- consecutive runs with the same input might follow different program flows -- vanilla cyclic debugging techniques are of little use on their own. To apply cyclic debugging, we need a tool that records information about an execution so that the execution can be replayed for debugging. Because recording this information interferes with the execution, the amount of information must be limited and its processing kept fast. This paper contains a survey of existing execution replay techniques and tools. Comment: In M. Ducasse (ed.), Proceedings of the Fourth International Workshop on Automated Debugging (AADebug 2000), August 2000, Munich. cs.SE/001003
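    The basic record/replay idea surveyed here can be illustrated with a minimal sketch in C: during the record phase, the result of each non-deterministic operation (here, a read from a file descriptor) is appended to a log; during replay, the same call is answered from the log instead of the real source, so the execution follows the recorded flow. The names used (rr_init, rr_read, replay.log) are illustrative only and not taken from any of the surveyed tools.

        #include <stdio.h>
        #include <unistd.h>

        static FILE *rr_log;        /* log of non-deterministic results */
        static int   rr_replaying;  /* 0 = record mode, 1 = replay mode */

        void rr_init(int replaying) {
            rr_replaying = replaying;
            rr_log = fopen("replay.log", replaying ? "rb" : "wb");
        }

        /* Wrapper around read(): logs its outcome when recording,
           reproduces the logged outcome when replaying. */
        ssize_t rr_read(int fd, void *buf, size_t count) {
            if (rr_replaying) {
                size_t n = 0;
                fread(&n, sizeof n, 1, rr_log);   /* recorded length */
                fread(buf, 1, n, rr_log);         /* recorded bytes */
                return (ssize_t)n;
            }
            ssize_t n = read(fd, buf, count);     /* real, non-deterministic call */
            if (n >= 0) {
                size_t len = (size_t)n;
                fwrite(&len, sizeof len, 1, rr_log);
                fwrite(buf, 1, len, rr_log);
            }
            return n;
        }

        int main(void) {
            char buf[64];
            rr_init(0);                               /* record mode */
            ssize_t n = rr_read(0, buf, sizeof buf);  /* this read is logged */
            return n < 0;
        }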

    AspectGrid: aspect-oriented fault-tolerance in grid platforms

    Migrating traditional scientific applications to computational Grids requires programming tools that can help programmers adapt application behaviour to this kind of platform. Computational Grids are particularly suited for long-running scientific applications, but they are also more prone to faults than desktop machines. The AspectGrid framework aims to develop methodologies and tools that can help Grid-enable scientific applications, particularly focusing on techniques based on aspect-oriented programming. In this paper we present the aspect-oriented approach taken in the AspectGrid framework to address faults in computational Grids. In the proposed approach, scientific applications are enhanced with fault-tolerance capabilities by plugging in additional modules. The proposed technique is portable across operating systems and minimises the changes required to base applications.
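    The underlying idea, keeping the fault-tolerance concern out of the base computation and applying it from the outside, can be sketched in plain C. This is only a conceptual illustration with hypothetical names (with_retry, run_simulation_step); AspectGrid itself relies on aspect-oriented composition rather than explicit wrapper calls.

        #include <stdio.h>

        /* Base application code: knows nothing about fault tolerance. */
        static int run_simulation_step(void *ctx) {
            (void)ctx;
            /* ... Grid task that may fail transiently; returns 0 on success ... */
            return 0;
        }

        /* Fault-tolerance "aspect": a separate module that wraps any step
           with retry behaviour, leaving the base application unchanged. */
        static int with_retry(int (*step)(void *), void *ctx, int max_attempts) {
            for (int attempt = 1; attempt <= max_attempts; attempt++) {
                if (step(ctx) == 0)
                    return 0;                        /* success */
                fprintf(stderr, "attempt %d failed, retrying\n", attempt);
            }
            return -1;                               /* give up */
        }

        int main(void) {
            return with_retry(run_simulation_step, NULL, 3);
        }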

    Dependence-Based Source Level Tracing and Replay for Networked Embedded Systems

    Error detection and diagnosis for networked embedded systems remain challenging and tedious due to issues such as the large number of computing entities, hardware resource constraints, and non-deterministic behaviors. Run-time checking is often necessitated because static verification fails whenever conditions exist that are unknown prior to execution. Complexities in hardware, software and even the operating environment can also defeat static analysis and simulation. Record-and-replay has long been proposed for distributed systems error diagnosis. Under this method, assertions are inserted in the target program for run-time error detection. At run time, the violation of any asserted property triggers actions for reporting an error and saving an execution trace for error replay. This dissertation takes wireless sensor networks, a special but representative type of networked embedded system, as an example to propose a dependence-based source-level tracing-and-replay methodology for detecting and reproducing errors. This work makes three main contributions towards making error detection and replay automatic. First, SensorC, a domain-specific language for wireless sensor networks, is proposed to specify properties at a high level. This property specification approach can not only be used in our record-and-replay methodology but also be integrated with other verification approaches, such as model checking. Second, a greedy heuristic method is developed to decompose global properties into a set of local ones, each checked by a particular sensor node, with the goal of minimizing the communication traffic for state information exchanges. Third, a dependence-based multi-level method for memory-efficient tracing and replay is proposed. In the interest of portability across different hardware platforms, this method is implemented as a source-level tracing and replay tool. To test our methodology, we have built wireless sensor networks using TelosB motes and Zolertia Z1 motes separately. The experiments' results show that our work makes it possible to instrument several test programs on wireless sensor networks under stringent program memory constraints, reduce the data transfer required for error detection, and find and diagnose realistic errors.
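    The run-time checking step described above can be pictured with a small C sketch: a decomposed local property is checked on a node, and a violation triggers error reporting and trace saving for later replay. All names here (LOCAL_CHECK, report_error, save_trace, the temperature range) are hypothetical placeholders, not SensorC syntax or the dissertation's implementation.

        #include <stdio.h>

        /* Hypothetical hooks: on a real mote these would report over the radio
           and flush the dependence-based trace kept in program memory. */
        static void report_error(const char *prop) { fprintf(stderr, "violated: %s\n", prop); }
        static void save_trace(void)               { /* dump recorded events for replay */ }

        /* Run-time check of one local property decomposed from a global one. */
        #define LOCAL_CHECK(cond)          \
            do {                           \
                if (!(cond)) {             \
                    report_error(#cond);   \
                    save_trace();          \
                }                          \
            } while (0)

        int main(void) {
            int sampled_temperature = 21;   /* e.g., last sensor reading */
            /* Local property checked on this node: reading stays in a sane range. */
            LOCAL_CHECK(sampled_temperature > -40 && sampled_temperature < 85);
            return 0;
        }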

    Partial replay of long-running applications

    Bugs in deployed software can be extremely difficult to track down. Invasive logging techniques, such as logging all non-deterministic inputs, can incur substantial runtime overheads. This paper shows how symbolic analysis can be used to re-create path-equivalent executions for very long running programs such as databases and web servers. The goal is to help developers debug such long-running programs by allowing them to walk through an execution of the last few requests or transactions leading up to an error. The challenge is to provide this functionality without the high runtime overheads associated with traditional replay techniques based on input logging or memory snapshots. Our approach achieves this by recording a small amount of information about program execution, such as the direction of branches taken, and then using symbolic analysis to reconstruct the execution of the last few inputs processed by the application, as well as the state of memory before these inputs were executed. We implemented our technique in a new tool called bbr. In this paper, we show that it can be used to replay bugs in long-running single-threaded programs starting from the middle of an execution. We show that bbr incurs low recording overhead (10% on average) during program execution, which is much less than existing replay schemes. We also show that it can reproduce real bugs from web servers, database systems, and other common utilities.
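    The recording side of this kind of scheme is very cheap because only branch directions are kept, one bit per executed conditional branch; a later symbolic re-execution can follow these bits through the program to rebuild a path-equivalent run. The sketch below shows the general idea in C with hypothetical names (record_branch, classify); it is not bbr's actual instrumentation.

        #include <stdio.h>
        #include <stdint.h>

        /* Compact branch-direction log: one bit per executed conditional branch. */
        static uint8_t  branch_bits[1 << 16];
        static uint32_t branch_count;

        static int record_branch(int taken) {
            if (taken)
                branch_bits[branch_count >> 3] |= (uint8_t)(1u << (branch_count & 7));
            branch_count++;
            return taken;               /* pass the condition through unchanged */
        }

        /* Example of instrumented application code. */
        static int classify(int request_size) {
            if (record_branch(request_size > 4096))
                return 1;               /* large request */
            return 0;                   /* small request */
        }

        int main(void) {
            classify(100);
            classify(8192);
            printf("recorded %u branch directions\n", (unsigned)branch_count);
            return 0;
        }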

    An aspect-oriented approach to fault-tolerance in grid platforms

    Migrating traditional scientific applications to computational Grids requires programming tools that can help programmers adapt application behaviour to this kind of platform. Computational Grids are particularly suited for long-running scientific applications, but they are also more prone to faults than desktop machines. The AspectGrid framework aims to develop methodologies and tools that can help to Grid-enable scientific applications, particularly focusing on techniques based on aspect-oriented programming. In this paper we present the aspect-oriented approach taken in the AspectGrid framework to address faults in computational Grids. In the proposed approach, scientific applications are enhanced with fault-tolerance capabilities by plugging in additional modules. The proposed technique is portable across operating systems and minimises the changes required to base applications.

    Doctor of Philosophy

    A modern software system is a composition of parts that are themselves highly complex: operating systems, middleware, libraries, servers, and so on. In principle, compositionality of interfaces means that we can understand any given module independently of the internal workings of other parts. In practice, however, abstractions are leaky, and with every generation, modern software systems grow in complexity. Traditional ways of understanding failures, explaining anomalous executions, and analyzing performance are reaching their limits in the face of emergent behavior, unrepeatability, cross-component execution, software aging, and adversarial changes to the system at run time. Deterministic systems analysis has the potential to change the way we analyze and debug software systems. Recorded once, the execution of the system becomes an independent artifact, which can be analyzed offline. The availability of the complete system state, the guaranteed behavior of re-execution, and the absence of limitations on the run-time complexity of analysis collectively enable deep, iterative, and automatic exploration of the dynamic properties of the system. This dissertation creates a foundation for making deterministic replay a ubiquitous system analysis tool. It defines design and engineering principles for building fast and practical replay machines capable of capturing the complete execution of the entire operating system with an overhead of a few percent, on a realistic workload, and with minimal installation costs. To enable an intuitive interface for constructing replay analysis tools, this work implements a powerful virtual machine introspection layer that allows an analysis algorithm to be programmed against the state of the recorded system through familiar source-level variable and type names. To support performance analysis, the replay engine provides a faithful performance model of the original execution during replay.
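    To give a flavour of what "programming an analysis against the recorded system through source-level names" could look like, here is a minimal C sketch. The entire API shown (replay_state, vmi_read_u64, on_context_switch) is hypothetical and stands in for the introspection layer described above; only the stub makes it self-contained.

        #include <stdio.h>
        #include <stdint.h>

        /* Hypothetical introspection API: resolve a source-level variable name
           in the recorded guest at the current replay position. */
        typedef struct replay_state replay_state;
        uint64_t vmi_read_u64(replay_state *st, const char *symbol);

        /* Analysis callback invoked by the replay engine at every context switch;
           it only observes the recorded state and cannot perturb the execution. */
        void on_context_switch(replay_state *st) {
            uint64_t pid = vmi_read_u64(st, "current_task.pid");
            printf("context switch to pid %llu\n", (unsigned long long)pid);
        }

        /* Stub implementation so this sketch compiles on its own. */
        struct replay_state { uint64_t fake_pid; };
        uint64_t vmi_read_u64(replay_state *st, const char *symbol) {
            (void)symbol;
            return st->fake_pid;
        }

        int main(void) {
            replay_state st = { 1234 };
            on_context_switch(&st);
            return 0;
        }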

    Causal-Consistent Replay Debugging for Message Passing Programs

    Debugging of concurrent systems is a tedious and error-prone activity. A main issue is that there is no guarantee that a bug that appears in the original computation is replayed inside the debugger. This problem is usually tackled by so-called replay debugging, which allows the user to record a program execution and replay it inside the debugger. In this paper, we present a novel technique for replay debugging that we call controlled causal-consistent replay. Controlled causal-consistent replay allows the user to record a program execution and, in contrast to traditional replay debuggers, to reproduce a visible misbehavior inside the debugger, including all and only its causes. In this way, the user is not distracted by the actions of other, unrelated processes.
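    Reproducing "all and only the causes" of a misbehavior presupposes that the recorded events carry causal (happened-before) information. A common way to represent this, sketched below in C with vector clocks, is only an illustration of the kind of causal dependence a causal-consistent replay can exploit; it is not the paper's actual log format or semantics.

        #include <stdio.h>

        #define NPROC 3

        /* Vector clock attached to each logged event; event A is a cause of
           event B iff A's clock is componentwise <= B's and they differ. */
        typedef struct {
            int clock[NPROC];
        } event;

        static int happened_before(const event *a, const event *b) {
            int strictly_smaller = 0;
            for (int i = 0; i < NPROC; i++) {
                if (a->clock[i] > b->clock[i]) return 0;
                if (a->clock[i] < b->clock[i]) strictly_smaller = 1;
            }
            return strictly_smaller;
        }

        int main(void) {
            event send_msg  = { { 1, 0, 0 } };  /* send on process 0 */
            event recv_msg  = { { 1, 1, 0 } };  /* matching receive on process 1 */
            event unrelated = { { 0, 0, 1 } };  /* independent event on process 2 */

            printf("send -> recv: %d\n", happened_before(&send_msg, &recv_msg));       /* 1 */
            printf("unrelated -> recv: %d\n", happened_before(&unrelated, &recv_msg)); /* 0 */
            return 0;
        }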

    Local Rollback for Resilient MPI Applications with Application-Level Checkpointing and Message Logging

    The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many instances the failure has a more localized scope and its impact is usually restricted to a subset of the resources being used. Thus, a global rollback would result in unnecessary overhead and energy consumption, since all processes, including those unaffected by the failure, discard their state and roll back to the last checkpoint to repeat computations that were already done. The User Level Failure Mitigation (ULFM) interface, the latest proposal for the inclusion of resilience features in the Message Passing Interface (MPI) standard, enables the deployment of more flexible recovery strategies, including localized recovery. This work proposes a local rollback approach that can be generally applied to Single Program, Multiple Data (SPMD) applications by combining ULFM, the ComPiler for Portable Checkpointing (CPPC) tool, and the Open MPI VProtocol system-level message logging component. Only failed processes are recovered from the last checkpoint, while consistency before further progress in the execution is achieved through a two-level message logging process. To further optimize this approach, point-to-point communications are logged by the Open MPI VProtocol component, while collective communications are optimally logged at the application level, thereby decoupling the logging protocol from the particular collective implementation. This spatially coordinated protocol applied by CPPC reduces the log size, the log memory requirements and, overall, the resilience impact on the applications. This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (project TIN2016-75845-P and the predoctoral grants of Nuria Losada, refs. BES-2014-068066 and EEBB-I-17-12005); by the EU under the COST Program Action IC1305, Network for Sustainable Ultrascale Computing (NESUS), and a HiPEAC Collaboration Grant; and by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Research (ref. ED431C 2017/04). We gratefully thank the Galicia Supercomputing Center for providing access to the FinisTerrae-II supercomputer. This material is also based upon work supported by the US National Science Foundation, Office of Advanced Cyberinfrastructure, under Grants No. 1664142 and 1339763.
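    A minimal sketch of the ULFM building blocks such a localized recovery relies on is shown below. It assumes a ULFM-enabled Open MPI build (which exposes the MPIX_* extensions via mpi-ext.h) and deliberately omits everything the paper's approach adds on top: CPPC application-level checkpoints, VProtocol message-log replay, and re-spawning of the failed ranks.

        #include <mpi.h>
        #include <mpi-ext.h>   /* ULFM extensions (MPIX_*) in ULFM-enabled Open MPI */

        /* Communicator error handler: interrupt pending communication so every
           surviving rank learns about the failure. */
        static void on_failure(MPI_Comm *comm, int *err, ...) {
            (void)err;
            MPIX_Comm_revoke(*comm);
        }

        int main(int argc, char **argv) {
            MPI_Comm world, survivors;
            MPI_Errhandler eh;

            MPI_Init(&argc, &argv);
            MPI_Comm_dup(MPI_COMM_WORLD, &world);
            MPI_Comm_create_errhandler(on_failure, &eh);
            MPI_Comm_set_errhandler(world, eh);

            /* ... SPMD computation with application-level checkpoints and
               message logging would run here ... */

            /* Recovery path (reached after a failure is detected): build a
               working communicator from the survivors. In a local-rollback
               scheme only re-spawned failed ranks restart from their last
               checkpoint; survivors re-send logged messages instead. */
            MPIX_Comm_shrink(world, &survivors);

            MPI_Comm_free(&survivors);
            MPI_Comm_free(&world);
            MPI_Finalize();
            return 0;
        }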