Mitigation of failures in high performance computing via runtime techniques
As machines increase in scale, the failure rates of supercomputers are predicted to increase correspondingly. Even though the mean time to failure (MTTF) of an individual component is high, the large number of components significantly decreases the system MTTF. Meanwhile, the decreasing size of transistors has been critical to the increase in capacity of supercomputers, but the smaller transistors become, the more frequently silent data corruptions (SDCs) are likely to occur. SDCs do not inhibit execution, but may silently lead to incorrect results.
In this thesis, we leverage runtime system and compiler techniques to mitigate a significant fraction of failures automatically with low overhead. The main goals of the system-level fault tolerance strategies designed in this thesis are: reducing the extra cost added to application execution while improving system reliability; automatically adjusting fault tolerance decisions to environmental changes without user intervention; and protecting applications not only from fail-stop failures but also from silent data corruptions.
The main contributions of this thesis are: a semi-blocking checkpoint protocol that overlaps application execution with fault tolerance operations to reduce the overhead of checkpointing; a runtime system technique for automatic checkpoint and restart without user intervention; a holistic framework (ACR) for automatically detecting and recovering from silent data corruptions; and a framework called FlipBack that provides targeted protection against silent data corruption at low cost.
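The overlap at the heart of a semi-blocking checkpoint can be illustrated with the classic fork-based trick: a forked child writes the snapshot while the parent keeps computing, with copy-on-write keeping the child's view of the state consistent. This is a minimal sketch of the general idea only, not the protocol developed in the thesis; the state layout and file name are hypothetical.

```c
/* A minimal sketch of overlapping computation with checkpoint I/O: a forked
 * child writes the snapshot while the parent keeps computing; copy-on-write
 * keeps the child's view of the state consistent. Hypothetical state layout
 * and file name; not the semi-blocking protocol from the thesis. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct { double data[1024]; long iteration; } app_state_t;

static void checkpoint(const app_state_t *state) {
    pid_t pid = fork();
    if (pid == 0) {                          /* child: dump the snapshot */
        FILE *f = fopen("ckpt.bin", "wb");
        if (f) { fwrite(state, sizeof *state, 1, f); fclose(f); }
        _exit(0);
    }                                        /* parent: returns immediately */
}

int main(void) {
    app_state_t st = { .iteration = 0 };
    for (st.iteration = 0; st.iteration < 100000; st.iteration++) {
        st.data[st.iteration % 1024] += 1.0; /* stand-in for real computation */
        if (st.iteration % 10000 == 0)
            checkpoint(&st);                 /* parent does not block on I/O */
    }
    while (wait(NULL) > 0) {}                /* reap finished checkpoint children */
    return 0;
}
```

Only the brief fork() itself sits on the critical path; the checkpoint I/O proceeds in the child while the parent continues executing.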
Reliable massively parallel symbolic computing: fault tolerance for a distributed Haskell
As the number of cores in manycore systems grows exponentially, the number of failures is
also predicted to grow exponentially. Hence massively parallel computations must be able to
tolerate faults. Moreover, new approaches to language design and system architecture are needed
to address the resilience of massively parallel heterogeneous architectures.
Symbolic computation has underpinned key advances in Mathematics and Computer Science,
for example in number theory, cryptography, and coding theory. Computer algebra software
systems facilitate symbolic mathematics. Developing these at scale has its own distinctive
set of challenges, as symbolic algorithms tend to employ complex irregular data and control
structures. SymGridParII is a middleware for parallel symbolic computing on massively parallel
High Performance Computing platforms. A key element of SymGridParII is a domain specific
language (DSL) called Haskell Distributed Parallel Haskell (HdpH). It is explicitly designed for
scalable distributed-memory parallelism, and employs work stealing to load balance dynamically
generated irregular task sizes.
To investigate providing scalable fault tolerant symbolic computation we design, implement
and evaluate a reliable version of HdpH, HdpH-RS. Its reliable scheduler detects and handles
faults, using task replication as a key recovery strategy. The scheduler supports load balancing
with a fault tolerant work stealing protocol. The reliable scheduler is invoked with two fault
tolerance primitives for implicit and explicit work placement, and 10 fault tolerant parallel
skeletons that encapsulate common parallel programming patterns. The user is oblivious to
many failures; they are instead handled by the scheduler.
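The recovery idea behind supervised futures can be caricatured in a few lines: a supervisor runs a task and, if the worker is presumed failed before the future is filled, reschedules the task. A toy illustration in C only; HdpH-RS is Haskell, and its scheduler is far richer than this sketch.

```c
/* A toy rendering of a supervised future: the supervisor runs the task and,
 * if the worker is presumed failed before the future is filled, reschedules
 * it (task replication). Illustrative C only; not HdpH-RS's implementation. */
#include <pthread.h>
#include <stdio.h>

typedef struct { int input; int result; int done; } future_t;

static void *worker(void *arg) {
    future_t *f = (future_t *)arg;
    f->result = f->input * f->input;    /* the task itself */
    f->done = 1;                        /* fill the future */
    return NULL;
}

static int supervised_run(future_t *f) {
    for (int attempt = 0; attempt < 3; attempt++) {
        pthread_t t;
        if (pthread_create(&t, NULL, worker, f) != 0)
            continue;                   /* could not start: try again */
        pthread_join(t, NULL);
        if (f->done)
            return 0;                   /* future is full: done */
        /* worker presumed failed: loop reschedules the task */
    }
    return -1;                          /* gave up after three replicas */
}

int main(void) {
    future_t f = { .input = 7 };
    if (supervised_run(&f) == 0)
        printf("future full: %d\n", f.result);
    return 0;
}
```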
An operational semantics describes small-step reductions on states. A simple abstract machine
for scheduling transitions and task evaluation is presented. It defines the semantics of
supervised futures and the transition rules for recovering tasks in the presence of failure. The
transition rules are demonstrated with a fault-free execution, and three executions that recover
from faults.
The fault tolerant work stealing protocol has been abstracted into a Promela model. The SPIN
model checker is used to exhaustively search the intersection of states in this automaton to
validate a key resiliency property of the protocol. It asserts that an initially empty supervised
future on the supervisor node will eventually be full in the presence of all possible combinations
of failures.
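In temporal-logic terms, a liveness property of this shape could be rendered roughly as follows, where f is the supervised future on the supervisor node; this is an illustrative formulation, not necessarily the exact assertion checked in the thesis:

```latex
% Illustrative LTL rendering (hypothetical notation): whenever the supervised
% future f is empty, it is eventually filled, under all modeled failures.
\[
  \square \bigl( \mathit{empty}(f) \rightarrow \lozenge\, \mathit{full}(f) \bigr)
\]
```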
The performance of HdpH-RS is measured using five benchmarks. Supervised scheduling
achieves a speedup of 757 with explicit task placement and 340 with lazy work stealing when
executing Summatory Liouville on up to 1400 cores of an HPC architecture. Moreover, supervision
overheads are consistently low scaling up to 1400 cores. Low recovery overheads are observed in
the presence of frequent failure when lazy on-demand work stealing is used. A Chaos Monkey
mechanism has been developed for stress testing resiliency with random failure combinations.
All unit tests pass in the presence of random failure, terminating with the expected results.
Towards efficient error detection in large-scale HPC systems
The need for computer systems to be reliable has become increasingly important as users' dependence on their accurate functioning grows. The failure of these systems can be very costly in terms of time and money. However much system designers try to design fault-free systems, it is practically impossible to build them, as many different factors can affect them. To achieve system reliability, fault tolerance methods are usually deployed; these methods help the system produce acceptable results even in the presence of faults. Root cause analysis, a dependability method in which the causes of failures are diagnosed so that they can be corrected or prevented in the future, is less efficient: it is reactive and cannot prevent the first failure from occurring. For this reason, methods with predictive capabilities are preferred; failure prediction methods are employed to anticipate potential failures so that preventive measures can be applied.
Most predictive methods have been supervised, requiring accurate knowledge of the system's failures, errors and faults. However, with changing system components and system updates, supervised methods become ineffective. Error detection methods allow error patterns to be detected early so that preventive measures can be applied. Performing this detection in an unsupervised way can be more effective, since changes or updates to the system affect such a solution less. In this thesis, we introduce an unsupervised approach to detecting error patterns in a system using its own data. More specifically, the thesis investigates the use of both event logs and resource utilization data to detect error patterns, addressing both the spatial and temporal aspects of achieving system dependability. The proposed unsupervised error detection method has been applied to real data from two different production systems. The results are positive, showing an average detection F-measure of about 75%.
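For reference, the F-measure quoted here is the standard harmonic mean of precision P (the fraction of flagged patterns that are real errors) and recall R (the fraction of real errors that are flagged):

```latex
\[
  F = \frac{2PR}{P + R}
\]
```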
Virtual machine scheduling in dedicated computing clusters
Time-critical applications process a continuous stream of input data and have to meet specific timing constraints. A common approach to ensuring that such an application satisfies its constraints is over-provisioning: the application is deployed in a dedicated cluster environment with enough processing power to achieve the target performance for every specified data input rate. This approach comes with a drawback: at times of decreased data input rates, the cluster resources are not fully utilized. A typical use case is the HLT-Chain application, which processes physics data at runtime of the ALICE experiment at CERN. From the perspective of cost and efficiency, it is desirable to exploit temporarily unused cluster resources. Existing approaches pursue this goal by running additional applications, but they a) lack the flexibility to dynamically grant the time-critical application the resources it needs, b) are insufficient for isolating the time-critical application from harmful side effects introduced by additional applications, or c) are not general because they rely on application-specific interfaces. In this thesis, a software framework is presented that makes it possible to exploit unused resources in a dedicated cluster without harming a time-critical application. Additional applications are hosted in Virtual Machines (VMs), and unused cluster resources are allocated to these VMs at runtime. To avoid resource bottlenecks, the resource usage of VMs is dynamically modified according to the needs of the time-critical application. For this purpose, a combination of methods not previously used together is employed. On a global level, appropriate VM manipulations such as hot migration, suspend/resume and start/stop are determined by an informed search heuristic and applied at runtime; locally on cluster nodes, a feedback-controlled adaptation of VM resource usage is carried out in a decentralized manner. Employing this framework increases a cluster's usage by running additional applications while preventing negative impact on the time-critical application. This capability is shown for the HLT-Chain application: in an empirical evaluation, the cluster CPU usage is increased from 49% to 79%, additional results are computed, and no negative effects on the HLT-Chain application are observed.
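The node-local, feedback-controlled adaptation can be pictured as a simple proportional controller that grows a VM's CPU cap while the time-critical application has slack and shrinks it when the application falls behind its target rate. This is a minimal sketch of the general control idea, not the framework's actual algorithm; both helpers are stand-ins for platform-specific hooks (hypervisor capping, application monitoring).

```c
/* A minimal proportional controller for a VM CPU cap: grow the cap while the
 * time-critical application has slack, shrink it when the app falls behind.
 * A sketch of the feedback idea only; both helpers are stand-ins for
 * platform-specific hooks (hypervisor capping, application monitoring). */
#include <stdio.h>

static double vm_cap = 0.5;              /* VM's CPU share, in [0, 1] */

static double measure_app_rate(void) {   /* stub: app slows as the VM takes CPU */
    return 1000.0 * (1.0 - 0.4 * vm_cap);
}
static void set_vm_cpu_cap(double cap) { /* stub for the hypervisor interface */
    vm_cap = cap;
}

int main(void) {
    const double target_rate = 900.0;    /* rate the app must sustain */
    const double kp = 0.5;               /* proportional gain */
    for (int step = 0; step < 20; step++) {
        double err = (measure_app_rate() - target_rate) / target_rate;
        double cap = vm_cap + kp * err;  /* slack -> grow, deficit -> shrink */
        if (cap < 0.0) cap = 0.0;
        if (cap > 1.0) cap = 1.0;
        set_vm_cpu_cap(cap);
        printf("step %2d: cap=%.2f app_rate=%.1f\n", step, vm_cap, measure_app_rate());
    }
    return 0;
}
```

In a real deployment the cap would be applied through the hypervisor's scheduling interface and the rate taken from the application's own monitoring, with the controller run independently on each node.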
Portable Checkpointing for Parallel Applications
High Performance Computing (HPC) systems represent the peak of modern computational capability. As
ever-increasing demands for computational power have fuelled the demand for ever-larger computing systems,
modern HPC systems have grown to incorporate hundreds, thousands or as many as 130,000 processors. At these
scales, the huge number of individual components in a single system makes the probability that a single
component will fail quite high, with today's large HPC systems featuring mean times between failures on the
order of hours or a few days. As many modern computational tasks require days or months to complete, fault
tolerance becomes critical to HPC system design.
The past three decades have seen significant amounts of research on parallel system fault tolerance. However,
as most of it has been either theoretical or has focused on low-level solutions that are embedded into a
particular operating system or type of hardware, this work has had little impact on real HPC systems. This
thesis attempts to address this lack of impact by describing a high-level approach for implementing
checkpoint/restart functionality that decouples the fault tolerance solution from the details of the
operating system, system libraries and the hardware and instead connects it to the APIs implemented by the
above components. The resulting solution enables applications that use these APIs to become
self-checkpointing and self-restarting regardless of the software/hardware platform that may implement
the APIs.
The particular focus of this thesis is on the problem of checkpoint/restart of parallel applications. It
presents two theoretical checkpointing protocols, one for the message passing communication model and one for
the shared memory model. The former is the first protocol to be compatible with application-level
checkpointing of individual processes, while the latter is the first protocol that is compatible with
arbitrary shared memory models, APIs, implementations and consistency protocols. These checkpointing
protocols are used to implement checkpointing systems for applications that use the MPI and OpenMP parallel
APIs, respectively, and are the first to provide checkpoint/restart for arbitrary implementations of these
popular APIs. Both checkpointing systems are extensively evaluated on multiple software/hardware platforms
and are shown to feature low overheads.
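The application-level, API-connected style of checkpointing described here can be illustrated with a generic coordinated MPI checkpoint, in which ranks agree on a checkpoint point and each serializes its own state. A sketch of the general idea only, not the protocols developed in the thesis; it assumes the application has quiesced its own communication, and the file naming and state layout are hypothetical.

```c
/* A generic application-level coordinated checkpoint for MPI: ranks agree on
 * a checkpoint point, then each serializes its own state. Sketch of the idea
 * only, not the thesis's protocols; assumes the application has quiesced its
 * communication. Naming and state layout are hypothetical. */
#include <mpi.h>
#include <stdio.h>

typedef struct { long iteration; double local_sum; } rank_state_t;

static void take_checkpoint(const rank_state_t *st, int rank) {
    char path[64];
    snprintf(path, sizeof path, "ckpt_rank%d.bin", rank);
    MPI_Barrier(MPI_COMM_WORLD);   /* everyone reaches the checkpoint line */
    FILE *f = fopen(path, "wb");
    if (f) { fwrite(st, sizeof *st, 1, f); fclose(f); }
    MPI_Barrier(MPI_COMM_WORLD);   /* everyone finished writing */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    rank_state_t st = { 0, 0.0 };
    for (st.iteration = 0; st.iteration < 1000; st.iteration++) {
        st.local_sum += rank + 1.0;                 /* stand-in computation */
        if (st.iteration % 100 == 0)
            take_checkpoint(&st, rank);             /* coordinated checkpoint */
    }
    MPI_Finalize();
    return 0;
}
```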
Resource management for extreme scale high performance computing systems in the presence of failures
High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale computation of applications over tens or hundreds of thousands of multicore processors. Unfortunately, as the size of HPC systems continues to grow towards exascale complexities, these systems experience an exponential growth in the number of failures occurring in the system. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use as a result of being forced to share resources; in particular, contention from multiple application threads sharing the last-level cache causes performance degradation. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through (a) the optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for predictions. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention. We investigate how to better characterize and model the negative effects of system failures as well as application co-location on large-scale HPC computing systems. Our analysis of application and system behavior also investigates: the interrelated effects of network usage of applications and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of various promising strategies for fault resilience in exascale-sized systems.
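A classical reference point for the checkpoint-interval selection studied here is Young's first-order approximation of the optimal interval, with delta the time to write one checkpoint and M the system's mean time between failures:

```latex
\[
  \tau_{\mathrm{opt}} \approx \sqrt{2\,\delta M}
\]
```

The formula makes the sensitivity to system parameters concrete: as M shrinks at exascale, checkpoints must be taken more often and the relative overhead grows.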
Doctor of Philosophy dissertation
A modern software system is a composition of parts that are themselves highly complex: operating systems, middleware, libraries, servers, and so on. In principle, compositionality of interfaces means that we can understand any given module independently of the internal workings of other parts. In practice, however, abstractions are leaky, and with every generation, modern software systems grow in complexity. Traditional ways of understanding failures, explaining anomalous executions, and analyzing performance are reaching their limits in the face of emergent behavior, unrepeatability, cross-component execution, software aging, and adversarial changes to the system at run time. Deterministic systems analysis has the potential to change the way we analyze and debug software systems. Recorded once, the execution of the system becomes an independent artifact, which can be analyzed offline. The availability of the complete system state, the guaranteed behavior of re-execution, and the absence of limitations on the run-time complexity of analysis collectively enable the deep, iterative, and automatic exploration of the dynamic properties of the system. This work creates a foundation for making deterministic replay a ubiquitous system analysis tool. It defines design and engineering principles for building fast and practical replay machines capable of capturing the complete execution of the entire operating system with an overhead of a few percent on a realistic workload and with minimal installation costs. To provide an intuitive interface for constructing replay analysis tools, this work implements a powerful virtual machine introspection layer that allows an analysis algorithm to be programmed against the state of the recorded system in the familiar terms of source-level variable and type names. To support performance analysis, the replay engine provides a faithful performance model of the original execution during replay.
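The record/replay principle underlying this work can be caricatured in a few lines: log every nondeterministic input during recording, and feed the log back during replay so that the re-execution is repeatable. A toy illustration only; whole-OS replay as in the dissertation must intercept all sources of nondeterminism (devices, interrupts, timers), and the file name here is hypothetical.

```c
/* A toy record/replay of one nondeterministic input source: values are logged
 * during recording and fed back during replay, making the run repeatable.
 * Illustrates the principle only; whole-OS replay intercepts far more.
 * Run once to record, then with any argument to replay. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static FILE *log_file;
static int replaying;

static int nondet_input(void) {
    int v = 0;
    if (replaying) {
        if (fscanf(log_file, "%d", &v) != 1) v = 0;  /* replay: consume log */
    } else {
        v = rand();                                  /* record: log the value */
        fprintf(log_file, "%d\n", v);
    }
    return v;
}

int main(int argc, char **argv) {
    (void)argv;
    replaying = (argc > 1);
    log_file = fopen("replay.log", replaying ? "r" : "w");
    if (!log_file) return 1;
    srand((unsigned)time(NULL));                     /* nondeterministic seed */
    long sum = 0;
    for (int i = 0; i < 5; i++) sum += nondet_input() % 100;
    printf("sum = %ld\n", sum);                      /* identical on replay */
    fclose(log_file);
    return 0;
}
```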
Fine-grained containment domains for throughput processors
Continued scaling of semiconductor technology has made modern processors rely on large design margins to guarantee correct operation under worst-case conditions. Design margins appear in the form of higher supply voltage or lower clock frequency, leading to inefficiency. In practice, it is rare to observe such worst-case conditions, and the processor can run at a reduced voltage or higher frequency while experiencing only a few infrequent errors. Recent proposals have used hardware error detectors and recovery mechanisms to detect and recover from these rare errors, a technique known as timing speculation. While this is effective for out-of-order processors with an inherent capability to recover from misspeculation, implementing similar hardware for throughput processors such as Graphics Processing Units (GPUs) is prohibitively costly due to the massive amount of thread context that needs to be preserved. Furthermore, recovery overhead is much higher, since the SIMD (Single Instruction Multiple Data) execution model of GPUs requires multiple threads to roll back together in case of an error. In this dissertation, I develop a hardware/software co-design approach to enable reduced-margin operation on GPUs that overcomes the limitations of existing techniques. The proposed scheme leverages the hierarchical programming model of GPUs to provide hierarchical and uncoordinated local checkpoint-recovery. By decomposing a program into a hierarchically nested tree of code blocks, which I refer to as containment domains (CDs), the program becomes amenable to automatic analysis and tuning, and an optimal trade-off can be made between preservation and recovery overhead. To aid this optimization process, an analytical model is developed to estimate the performance efficiency of a given application setting at a given error rate. With the analytical model, an exhaustive search can be performed to find the optimal solution. The tunability also allows the proposed scheme to easily adapt to a wide range of error rates, making it future-proof against emerging uncertainties in semiconductor design. The proposed scheme combines software and hardware components to achieve the highest efficiency in preservation, restoration, and recovery. The software components include: 1) an API and runtime that let the programmer describe the hierarchy of containment domains within an application and preserve the state required for rollback recovery, and 2) a compiler analysis that automatically inserts preservation routines for register variables. The hardware components include: 1) a stack structure to keep track of recovery program counters (PCs), 2) a set of error containment mechanisms to guarantee that no erroneous data is propagated outside of a containment domain, and 3) an error reporting architecture that keeps track of affected threads and initiates their recovery.
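The shape of such an API can be pictured as nested preserve/compute/restore blocks; the sketch below uses setjmp/longjmp to stand in for the recovery program counter, and every name in it is a hypothetical stand-in, not the actual CD runtime interface from the dissertation.

```c
/* An illustrative shape for a containment-domain style API: preserve the
 * state a domain needs, run its body, and roll back to the recovery point on
 * a detected error. Every name here is a hypothetical stand-in; the actual
 * CD runtime interface differs. */
#include <setjmp.h>
#include <string.h>

#define CD_MAX 256

typedef struct {
    jmp_buf recovery_pc;       /* stand-in for the recovery program counter */
    double  saved[CD_MAX];     /* state preserved for rollback */
    int     n;
} cd_t;

static void cd_preserve(cd_t *cd, const double *data, int n) {
    cd->n = n;                                  /* assumes n <= CD_MAX */
    memcpy(cd->saved, data, (size_t)n * sizeof *data);
}
static void cd_restore(const cd_t *cd, double *data) {
    memcpy(data, cd->saved, (size_t)cd->n * sizeof *data);
}

void run_domain(double *data, int n) {
    cd_t cd;
    cd_preserve(&cd, data, n);                  /* enter domain: preserve inputs */
    if (setjmp(cd.recovery_pc)) {
        cd_restore(&cd, data);                  /* error detected: contain it by
                                                   restoring preserved state */
    }
    /* domain body; an error detector would longjmp(cd.recovery_pc, 1)
       to re-execute this block from its preserved inputs */
    for (int i = 0; i < n; i++)
        data[i] *= 2.0;
}
```

Because each domain preserves only its own inputs, errors are contained locally and recovery does not require coordinating the whole program, which is the point of the hierarchical decomposition described above.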
Fault-tolerance and malleability in parallel message-passing applications
This Thesis focuses on exploring fault-tolerance and malleability solutions, based on checkpoint and restart techniques, for parallel message-passing applications. In the fault-tolerance field, this Thesis contributes to improving the most important overhead factor in checkpointing performance, that is, the I/O cost of the state file dumping, through the proposal of different techniques to reduce the checkpoint file size. In addition, a process migration mechanism based on checkpointing is also proposed, which allows proactively migrating processes from nodes that are about to fail, avoiding the complete restart of the execution and, thus, improving the application resilience. Finally, this Thesis also includes a proposal to transparently transform MPI applications into malleable jobs, that is, parallel programs that are able to adapt their execution to the number of available processors at runtime, which provides important benefits for the end users and the whole system, such as higher productivity, a better response time, or a greater resilience to node failures. All the solutions proposed in this Thesis have been implemented at the application level, and they are independent of the hardware architecture, the operating system, the MPI implementation used, and any higher-level frameworks, such as job submission frameworks.
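One common way to attack the checkpoint file size, the I/O cost driver named above, is incremental checkpointing: hash fixed-size blocks of the state and write only the blocks that changed since the previous checkpoint. This is a generic sketch of that idea, not the specific size-reduction techniques proposed in the thesis; the block size and hashing scheme are arbitrary choices.

```c
/* A generic incremental-checkpoint sketch: hash fixed-size blocks of the
 * state and write only blocks whose hash changed since the last checkpoint.
 * Illustrates one way to shrink checkpoint files; not the specific
 * size-reduction techniques proposed in the thesis. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK   4096
#define NBLOCKS 64

static uint64_t fnv1a(const unsigned char *p, size_t n) {
    uint64_t h = 1469598103934665603ULL;        /* FNV-1a, 64-bit */
    while (n--) { h ^= *p++; h *= 1099511628211ULL; }
    return h;
}

static uint64_t last_hash[NBLOCKS];             /* hashes at previous checkpoint */

void incremental_checkpoint(const unsigned char *state, FILE *out) {
    for (int b = 0; b < NBLOCKS; b++) {
        uint64_t h = fnv1a(state + (size_t)b * BLOCK, BLOCK);
        if (h != last_hash[b]) {                /* block changed since last time */
            fwrite(&b, sizeof b, 1, out);       /* record the block index... */
            fwrite(state + (size_t)b * BLOCK, 1, BLOCK, out); /* ...and its data */
            last_hash[b] = h;
        }
    }
}
```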