1,088 research outputs found
Execution replay and debugging
As most parallel and distributed programs are internally non-deterministic --
consecutive runs with the same input might result in a different program flow
-- vanilla cyclic debugging techniques as such are useless. In order to use
cyclic debugging tools, we need a tool that records information about an
execution so that it can be replayed for debugging. Because recording
information interferes with the execution, we must limit the amount of
information and keep the processing of the information fast. This paper
contains a survey of existing execution replay techniques and tools.
Comment: In M. Ducasse (ed), proceedings of the Fourth International Workshop
on Automated Debugging (AADebug 2000), August 2000, Munich. cs.SE/001003
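The record-then-replay idea in this abstract can be illustrated with a minimal sketch. It is not taken from the surveyed tools: the `ReplayLog` class, the `run` function, and the two-sender message scenario are hypothetical names chosen for illustration. The sketch assumes the only source of non-determinism is the order in which pending messages are delivered, and it records just the index of each choice, not the message contents, to keep the log small.

```python
import random


class ReplayLog:
    """Records each non-deterministic choice, or replays a previously recorded log."""

    def __init__(self, recorded=None):
        self.replaying = recorded is not None
        self.events = list(recorded) if recorded is not None else []
        self._cursor = 0

    def choose(self, options, rng):
        """Return an index into options: recorded during replay, random otherwise."""
        if self.replaying:
            choice = self.events[self._cursor]
            self._cursor += 1
            return choice
        choice = rng.randrange(len(options))
        self.events.append(choice)  # log only the decision, not the payload
        return choice


def run(log, seed):
    """Simulate a receiver whose message arrival order is non-deterministic."""
    rng = random.Random(seed)
    queues = {"A": ["a1", "a2"], "B": ["b1", "b2"]}  # hypothetical senders
    order = []
    while any(queues.values()):
        ready = sorted(name for name, q in queues.items() if q)
        idx = log.choose(ready, rng)
        order.append(queues[ready[idx]].pop(0))
    return order


if __name__ == "__main__":
    record = ReplayLog()
    original = run(record, seed=1)
    # A different seed stands in for a different scheduling environment;
    # the replayed run still follows the recorded order.
    replayed = run(ReplayLog(record.events), seed=999)
    print(original == replayed)
```

Logging only the decision index reflects the abstract's point that recording must stay cheap: the replayed run recomputes everything else deterministically.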
MPreplay: Architecture Support for Deterministic Replay of Message Passing Programs on Message Passing Many-Core Processors
Coordinated Science Laboratory was formerly known as Control Systems Laboratory
Doctor of Philosophy
dissertation
A modern software system is a composition of parts that are themselves highly complex: operating systems, middleware, libraries, servers, and so on. In principle, compositionality of interfaces means that we can understand any given module independently of the internal workings of other parts. In practice, however, abstractions are leaky, and with every generation, modern software systems grow in complexity. Traditional ways of understanding failures, explaining anomalous executions, and analyzing performance are reaching their limits in the face of emergent behavior, unrepeatability, cross-component execution, software aging, and adversarial changes to the system at run time. Deterministic systems analysis has the potential to change the way we analyze and debug software systems. Recorded once, the execution of the system becomes an independent artifact, which can be analyzed offline. The availability of the complete system state, the guaranteed behavior of re-execution, and the absence of limitations on the run-time complexity of analysis collectively enable the deep, iterative, and automatic exploration of the dynamic properties of the system. This work creates a foundation for making deterministic replay a ubiquitous system analysis tool. It defines design and engineering principles for building fast and practical replay machines capable of capturing the complete execution of the entire operating system with an overhead of a few percent, on a realistic workload, and with minimal installation costs. To provide an intuitive interface for constructing replay analysis tools, this work implements a powerful virtual machine introspection layer that enables an analysis algorithm to be programmed against the state of the recorded system through the familiar terms of source-level variable and type names. To support performance analysis, the replay engine provides a faithful performance model of the original execution during replay.
Isolation of malicious external inputs in a security focused adaptive execution environment
pre-print
Reliable isolation of malicious application inputs is necessary for preventing the future success of an observed novel attack after the initial incident. In this paper we describe, measure, and analyze Input-Reduction, a technique that can quickly isolate malicious external inputs that embody unforeseen and potentially novel attacks from other benign application inputs. The Input-Reduction technique is integrated into an advanced, security-focused, and adaptive execution environment that automates diagnosis and repair. In experiments we show that Input-Reduction is highly accurate and efficient in isolating attack inputs and determining causal relations between inputs. We also measure and show that the cost incurred by key services that support reliable reproduction and fast attack isolation is reasonable in the adaptive execution environment.
Transactional failure recovery for a distributed key-value store
With the advent of cloud computing, many applications have embraced the ensuing paradigm shift towards modern distributed key-value data stores, like HBase, in order to benefit from the elastic scalability on offer. However, many applications still hesitate to make the leap from the traditional relational database model simply because they cannot compromise on the standard transactional guarantees of atomicity, isolation, and durability. To get the best of both worlds, one option is to integrate an independent transaction management component with a distributed key-value store. In this paper, we discuss the implications of this approach for durability. In particular, if the transaction manager provides durability (e.g., through logging), then we can relax durability constraints in the key-value store. However, if a component fails (e.g., a client or a key-value server), then we need a coordinated recovery procedure to ensure that commits are persisted correctly. In our research, we integrate an independent transaction manager with HBase. Our main contribution is a failure recovery middleware for the integrated system, which tracks the progress of each commit as it is flushed down by the client and persisted within HBase, so that we can recover reliably from failures. During recovery, commits that were interrupted by the failure are replayed from the transaction management log. Importantly, the recovery process does not interrupt transaction processing on the available servers. Using a benchmark, we evaluate the impact of component failure, and subsequent recovery, on application performance
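The recovery idea described in this abstract, tracking which commits have been persisted and replaying the interrupted ones from the transaction log, might be sketched as follows. This is an illustration under simplifying assumptions, not the paper's middleware: `Commit`, `RecoveryTracker`, and the plain dictionary standing in for the key-value store are all hypothetical, and no HBase API is used.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    txn_id: int
    writes: dict  # key -> value written by this transaction


@dataclass
class RecoveryTracker:
    """Tracks commit progress so interrupted commits can be replayed after a failure."""

    log: list = field(default_factory=list)       # stands in for the durable transaction log
    persisted: set = field(default_factory=set)   # txn ids known to be persisted in the store

    def log_commit(self, commit):
        """The transaction manager makes the commit durable before it reaches the store."""
        self.log.append(commit)

    def mark_persisted(self, txn_id):
        """Called once the store has durably applied the commit's writes."""
        self.persisted.add(txn_id)

    def recover(self, store):
        """Replay, in commit order, every logged commit not yet persisted."""
        replayed = []
        for commit in self.log:
            if commit.txn_id not in self.persisted:
                store.update(commit.writes)
                self.persisted.add(commit.txn_id)
                replayed.append(commit.txn_id)
        return replayed


if __name__ == "__main__":
    tracker, store = RecoveryTracker(), {}
    first = Commit(1, {"x": 1})
    second = Commit(2, {"x": 2, "y": 5})
    tracker.log_commit(first)
    store.update(first.writes)
    tracker.mark_persisted(1)
    tracker.log_commit(second)  # failure strikes before the store persists it
    print(tracker.recover(store), store)
```

Because the log already guarantees durability, the store in this sketch never needs its own write-ahead step; recovery only closes the gap between what was logged and what was persisted.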
Trust Management and Security in Satellite Telecommand Processing
New standards and initiatives in satellite system architecture are moving the space industry to more open and efficient mission operations. Primarily, these standards allow multiple missions to share standard ground and space-based resources to reduce mission development and sustainment costs. With the benefits of these new concepts comes added risk associated with threats to the security of our critical space assets in a contested space and cyberspace domain. As one method to mitigate threats to space missions, this research develops, implements, and tests the Consolidated Trust Management System (CTMS) for satellite flight software. The CTMS architecture was developed using design requirements and features of Trust Management Systems (TMS) presented in the field of distributed information systems. This research advances the state of the art with the CTMS by refining and consolidating existing TMS theory and applying it to satellite systems. The feasibility and performance of this new CTMS architecture are demonstrated with a realistic implementation in satellite flight software and testing in an emulated satellite system environment. The system is tested with known threat modeling techniques and a specific forgery attack abuse case of satellite telecommanding functions. The CTMS test results show the promise of this technique to enhance security in satellite flight software telecommand processing. With this work, a new class of satellite protection mechanisms is established, which addresses the complex security issues facing satellite operations today. This work also fills a critical shortfall in validated security mechanisms for implementation in both public and private sector satellite systems.
Application-level Fault Tolerance and Resilience in HPC Applications
Official Doctoral Program in Information Technology Research. 524V01
[Resumo]
The computational needs of the different branches of science have grown enormously
in recent years, which has led to a large increase in the performance delivered
by supercomputers. Ever larger high-performance computing systems are being built,
with more hardware resources of different types, which causes the failure rates of
these systems to grow as well. The study of efficient fault tolerance techniques is
therefore indispensable to guarantee that scientific programs can complete their
execution, while also keeping energy consumption from soaring. Checkpoint/restart
is one of the most popular techniques. However, most of the research carried out in
recent decades focuses on stop-and-restart strategies for distributed-memory
applications after a fail-stop failure occurs. This thesis proposes application-level
checkpoint/restart techniques for the most popular parallel programming models in
supercomputing. Checkpointing protocols were implemented for hybrid MPI-OpenMP
applications and heterogeneous OpenCL-based applications, in both cases paying
special attention to the portability and malleability of the solution. For
distributed-memory applications, a resilience solution is proposed that can be used
generically in SPMD MPI applications, allowing them to detect and react to fail-stop
failures without aborting the execution. In this case, the failed processes are
relaunched and the application state is recovered with a global rollback. In
addition, this resilience solution was optimized by implementing a local rollback,
in which only the failed processes roll back, using a message logging protocol to
guarantee the consistency and progress of the execution. Finally, an extension of a
checkpointing library is proposed to ease the implementation of ad hoc recovery
strategies for memory corruptions. In many cases, these errors can be handled at the
application level, avoiding a fail-stop failure and allowing a more efficient recovery.
[Resumen]
The rapid increase in the computing needs of different branches of science has led
to a large growth in the performance offered by supercomputers. Ever larger
high-performance computing systems are being built, with more hardware resources of
different types, which causes system failure rates to increase. The study of
efficient fault tolerance techniques is therefore indispensable to guarantee that
scientific programs can complete their execution, while also keeping energy
consumption from soaring. Checkpoint/restart is one of the most popular techniques.
However, most of the research in this field has focused on stop-and-restart
strategies for distributed-memory applications after fail-stop failures occur. This
thesis proposes application-level checkpoint/restart techniques for the most popular
parallel programming models in supercomputing. Checkpointing protocols have been
implemented for hybrid MPI-OpenMP applications and heterogeneous OpenCL-based
applications, paying special attention in both cases to the portability and
malleability of the solution. With respect to distributed-memory applications, a
resilience solution is proposed that can be used generically in SPMD MPI
applications, allowing them to detect and react to fail-stop failures without
aborting the execution. Instead, the failed processes are relaunched and the
application state is recovered with a global rollback. In addition, this resilience
solution has been optimized by implementing a local rollback, in which only the
failed processes roll back, using a message logging protocol to guarantee the
consistency and progress of the execution. Finally, an extension of a checkpointing
library is proposed to ease the implementation of ad hoc recovery strategies for
memory corruptions. Often, errors of this kind can be handled at the application
level, avoiding a fail-stop failure and allowing a more efficient recovery.
[Abstract]
The rapid increase in the computational demands of science has led to a pronounced
growth in the performance offered by supercomputers. As High Performance
Computing (HPC) systems grow larger, including more hardware components
of different types, the system's failure rate becomes higher. Efficient fault
tolerance techniques are essential not only to ensure the execution completion but
also to save energy. Checkpoint/restart is one of the most popular fault tolerance
techniques. However, most of the research in this field is focused on stop-and-restart
strategies for distributed-memory applications in the event of fail-stop failures. This
thesis focuses on the implementation of application-level checkpoint/restart solutions
for the most popular parallel programming models used in HPC. Hence, we
have implemented checkpointing solutions to cope with fail-stop failures in hybrid
MPI-OpenMP applications and OpenCL-based programs. Both strategies maximize
the restart portability and malleability, i.e., the recovery can take place on
machines with different CPU/accelerator architectures and/or operating systems,
and can be adapted to the available resources (number of cores/accelerators). Regarding
distributed-memory applications, we propose a resilience solution that can
be generally applied to SPMD MPI programs. Resilient applications can detect and
react to failures without aborting their execution upon fail-stop failures. Instead,
failed processes are re-spawned, and the application state is recovered through a
global rollback. Moreover, we have optimized this resilience proposal by implementing
a local rollback protocol, in which only failed processes roll back to a previous
state, while message logging enables global consistency and further progress of the
computation. Finally, we have extended a checkpointing library to facilitate the
implementation of ad hoc recovery strategies in the event of soft errors caused by
memory corruptions. Many times, these errors can be handled at the software level,
thus avoiding fail-stop failures and enabling a more efficient recovery.
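Application-level checkpoint/restart, as discussed in this abstract, can be illustrated with a minimal single-process sketch. It does not reproduce the thesis's MPI, OpenMP, or OpenCL protocols: the function names, the JSON checkpoint file, and the simulated failure are assumptions made purely for illustration. The application itself decides which state to save (here a step counter and a running total), which is what makes the checkpoint portable across machines.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers never see a half-written checkpoint


def load_checkpoint(path):
    """Return the last saved state, or the initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, every=10, fail_at=None):
    """Sum the integers below n_steps, checkpointing every `every` steps."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fail-stop failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "ckpt.json")
        try:
            run(ckpt, 100, fail_at=37)
        except RuntimeError:
            pass  # the process "dies"; work since the last checkpoint is lost
        print(run(ckpt, 100) == sum(range(100)))  # restart resumes from the checkpoint
```

Only the work done since the most recent checkpoint is repeated after the restart, so the checkpoint interval trades run-time overhead against the amount of recomputation.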
Survey in Smart Grid and Smart Home Security: Issues, Challenges and Countermeasures
The electricity industry is now on the verge of a new era, one that promises, through the evolution of the existing electrical grids to Smart Grids, more efficient and effective power management, better reliability, reduced production costs, and more environmentally friendly energy generation. Numerous initiatives across the globe, led by both industry and academia, reflect the mounting interest around the enormous benefits but also the great risks introduced by this evolution. This paper focuses on issues related to the security of the Smart Grid and the Smart Home, which we present as an integral part of the Smart Grid. Based on several scenarios, we aim to present some of the most representative threats to the Smart Home/Smart Grid environment. The threats detected are categorized according to specific security goals set for the Smart Home/Smart Grid environment, and their impact on the overall system security is evaluated. A review of contemporary literature is then conducted with the aim of presenting promising security countermeasures with respect to the identified specific security goals for each presented scenario. An effort to shed light on open issues and future research directions concludes the paper.