Near-optimal scheduling and decision-making models for reactive and proactive fault tolerance mechanisms
As High Performance Computing (HPC) systems increase in size to meet the demand for computational power, the chance of failure occurrences dramatically increases, resulting in potentially large amounts of lost computing time. Fault Tolerance (FT) mechanisms aim to mitigate the impact of failures on running applications. However, the overhead of FT mechanisms grows with the size of the HPC system. The challenge is therefore to contain the expensive overhead of FT mechanisms while minimizing the computing time lost to failures.
In this dissertation, a near-optimal scheduling model is built to determine when to invoke a hybrid checkpoint mechanism, by means of stochastic processes and the calculus of variations. The obtained schedule minimizes the waste time caused by the checkpoint mechanism and by failure occurrences. In general, checkpoint/restart mechanisms periodically save the application state and, upon failure, reload the most recently saved state. Furthermore, to handle various FT mechanisms, an adaptive decision-making model has been developed to determine the best FT strategy to invoke at each decision point. The best mechanism at each decision point is selected, among the considered FT mechanisms, to globally minimize the total waste time of an application execution by means of a dynamic programming approach. In addition, the model is adaptive, dealing with changes in the failure rate over time.
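The waste-minimization idea can be illustrated with the classical first-order checkpoint-interval approximation (Young/Daly), which scheduling models of this kind refine. The sketch below is illustrative, with made-up cost and failure figures, and is not the dissertation's model.

```python
import math

def young_daly_interval(checkpoint_cost, mtbf):
    """First-order optimal checkpoint period: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def waste_fraction(period, checkpoint_cost, mtbf):
    """Expected fraction of time lost to checkpointing plus rework,
    valid to first order when period << MTBF."""
    return checkpoint_cost / period + period / (2.0 * mtbf)

C = 60.0            # assumed checkpoint overhead: 60 s
MTBF = 24 * 3600.0  # assumed one failure per day on average
T = young_daly_interval(C, MTBF)
print(T)                           # ~3220 s between checkpoints
print(waste_fraction(T, C, MTBF))  # ~3.7% of time wasted at the optimum
```

Checkpointing more often than this wastes time on overhead; less often wastes time on rework after failures, which is the balance the dissertation's schedule optimizes.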
High Performance Computing Systems with Various Checkpointing Schemes
Finding the failure rate of a system is a crucial step in the analysis of high performance computing systems. To mitigate failures, a fault-tolerance mechanism called the checkpoint/restart technique was introduced; however, it carries additional costs. We therefore propose two models for different schemes (full and incremental checkpointing). The models, which are based on the reliability of the system, are used to determine the checkpoint placements. Both proposed models balance checkpoint overhead against re-computing time. Because each incremental checkpoint adds extra cost during the recovery period, a method is given to find the number of incremental checkpoints between two consecutive full checkpoints. Our simulations suggest that in most cases the incremental checkpoint model reduces waste time more than the full checkpoint model does. The waste time produced by both models ranges from 2% to 28% of the application completion time, depending on the checkpoint overheads.
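A minimal sketch of the full-versus-incremental trade-off, assuming a simple low-failure-rate waste model with hypothetical costs (the thesis's reliability-based models are more detailed): more incrementals between full checkpoints cut checkpoint overhead but lengthen recovery, so an interior optimum appears.

```python
def expected_waste_per_unit(m, tau, c_full, c_incr, r_incr, failure_rate):
    """Approximate waste per unit of time for a cycle of one full
    checkpoint followed by m incremental checkpoints at interval tau.
    Assumes at most one failure per cycle (low failure rate); a failure
    loses half an interval of work on average plus the replay of about
    m/2 incremental deltas at r_incr each."""
    cycle = (m + 1) * tau
    overhead = c_full + m * c_incr
    p_fail = failure_rate * cycle
    rework = p_fail * (tau / 2.0 + (m / 2.0) * r_incr)
    return (overhead + rework) / cycle

# hypothetical costs: full checkpoint 60 s, incremental 10 s,
# replay 30 s per delta, one failure every two hours
best = min(range(20), key=lambda m: expected_waste_per_unit(
    m, tau=600.0, c_full=60.0, c_incr=10.0, r_incr=30.0,
    failure_rate=1.0 / 7200.0))
print(best)  # number of incrementals between two full checkpoints
```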
How Xenopus laevis embryos replicate reliably: investigating the random-completion problem
DNA synthesis in Xenopus frog embryos initiates stochastically in time at many sites (origins) along the chromosome. Stochastic initiation implies fluctuations in the time to complete and may lead to cell death if replication takes longer than the cell cycle time ( min). Surprisingly, although the typical replication time is about 20 min, in vivo experiments show that replication fails to complete only about 1 in 300 times. How is replication timing accurately controlled despite the stochasticity? Biologists have proposed two solutions to this "random-completion problem." The first solution uses randomly located origins but increases their rate of initiation as S phase proceeds, while the second uses regularly spaced origins. In this paper, we investigate the random-completion problem using a type of model first developed to describe the kinetics of first-order phase transitions. Using methods from the field of extreme-value statistics, we derive the distribution of replication-completion times for a finite genome. We then argue that the biologists' first solution to the problem is not only consistent with experiment but also nearly optimizes the use of replicative proteins. We also show that spatial regularity in origin placement does not significantly alter the distribution of replication times and, thus, is not needed for the control of replication timing.
Comment: 16 pages, 9 figures, submitted to Physical Review
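The flavor of the stochastic-nucleation model can be sketched with a toy Monte Carlo: origins at random positions fire at random times, bidirectional forks spread outward, and the completion time is set by the last point replicated. Parameters below are illustrative, not the paper's fitted values.

```python
import random

def completion_time(genome_length, n_origins, fork_speed, rate, grid=400):
    """One sample of the replication-completion time: origins at
    uniform random positions fire at Exp(rate) times, forks spread
    bidirectionally at fork_speed, and the genome is done once its
    slowest point has been reached by some fork."""
    origins = [(random.uniform(0.0, genome_length),
                random.expovariate(rate)) for _ in range(n_origins)]
    t_done = 0.0
    for k in range(grid + 1):
        x = genome_length * k / grid
        # point x is replicated by the earliest fork to reach it
        t_x = min(t_i + abs(x - x_i) / fork_speed for x_i, t_i in origins)
        t_done = max(t_done, t_x)
    return t_done

random.seed(1)
samples = sorted(completion_time(100.0, 50, 1.0, 0.5) for _ in range(200))
print(samples[len(samples) // 2])  # typical completion time
print(samples[-1])                 # worst of 200 runs: the fluctuation tail
```

The gap between the median and the maximum is exactly the fluctuation that the paper's extreme-value analysis quantifies.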
A failure index for high performance computing applications
This dissertation introduces a new metric in the area of High Performance Computing (HPC) application reliability and performance modeling. Derived via a time-dependent implementation of an existing inequality measure, the Failure Index (FI) generates a coefficient representing the level of volatility of the failures incurred by an application running on a given HPC system in a given time interval. This coefficient provides a normalized, cross-system representation of the failure volatility of applications running on failure-rich HPC platforms. Further, the origin and ramifications of application failures are investigated, from which certain mathematical conclusions yield greater insight into the behavior of these applications in failure-rich system environments.
This work also includes background information on the problems facing HPC applications at the highest scale, the lack of standardized application-specific metrics within this arena, and a means of generating such metrics in a low-latency manner. A case study containing detailed analysis showcasing the benefits of the FI is also included.
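The abstract does not name the underlying inequality measure, so as a hedged illustration the sketch below uses the Gini coefficient as a stand-in to score the volatility of failure counts across time windows; the names `gini` and `failure_volatility` are hypothetical, not the dissertation's FI.

```python
def gini(values):
    """Gini coefficient: 0 for an even spread, approaching 1 when the
    total is concentrated in a few entries."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    cum = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * cum / (n * total) - (n + 1.0) / n

def failure_volatility(failure_times, window, horizon):
    """Hypothetical FI-style score: inequality of per-window failure
    counts over a fixed horizon (any time unit)."""
    bins = [0] * (horizon // window)
    for t in failure_times:
        if 0 <= t < horizon:
            bins[int(t) // window] += 1
    return gini(bins)

bursty = failure_volatility([5, 7, 9, 11, 300, 900], 100, 1000)
steady = failure_volatility([50, 150, 250, 350, 450, 550, 650, 750, 850, 950],
                            100, 1000)
print(bursty, steady)  # the bursty trace scores far higher
```

Because the score is normalized, traces from systems with very different absolute failure counts become comparable, which is the cross-system property the FI aims for.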
Dynamic interval determination for page-level incremental checkpointing
A distributed system is composed of multiple independent machines that communicate using messages. Faults in a large distributed system are common events. Without fault tolerance mechanisms, an application running on such a system has to be restarted from scratch if a fault happens in the middle of its execution, losing useful computation. Checkpoint and recovery mechanisms are used in distributed systems to provide fault tolerance for such applications. A checkpoint of a process is the information about the state of that process at some instant of time; a checkpoint of a distributed application is a set of checkpoints, one from each of its processes, satisfying certain constraints. If a fault occurs, the application is restarted from an earlier checkpoint instead of from scratch, saving some of the computation. Several checkpoint and recovery protocols have been proposed in the literature. The performance of a checkpoint and recovery protocol depends on the amount of computation it can save weighed against the amount of overhead it incurs, so checkpointing protocols should not add much overhead to the system. Checkpointing overhead is mainly due to coordination among processes and to saving their contexts in stable storage. In coordinated checkpointing, the process initiating a checkpoint coordinates with the other processes through messages; if many messages are used for coordination, network traffic increases, which is undesirable. It is therefore better to reduce the number of messages needed for checkpoint coordination. In this thesis, we present an algorithm that reduces the number of coordination messages per process, thereby decreasing network traffic. The total running time of an application depends on its execution time and on the checkpointing overhead incurred along with it.
This checkpointing overhead should be minimized. It combines context-saving overhead and coordination overhead, and storing the context of the application in stable storage also adds to it. With periodic checkpointing, processes sometimes take checkpoints that are of little use, and these unnecessary checkpoints increase the application's running time. We propose an algorithm that determines the checkpointing interval dynamically, based on the expected recovery time, to avoid unnecessary checkpoints. By eliminating unnecessary checkpoints, the running time of a process can be reduced significantly.
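One way to make the dynamic-interval idea concrete (a sketch under assumed rates, not the thesis's algorithm): postpone each checkpoint until the expected loss from a failure since the last checkpoint exceeds the checkpoint cost. Solving the threshold for equality recovers the familiar sqrt(2C/λ) interval, so the rule skips checkpoints exactly when they would be unnecessary.

```python
def should_checkpoint(work_since_last, checkpoint_cost, failure_rate):
    """Hypothetical dynamic rule: over the last `work_since_last`
    seconds a failure had probability ~ failure_rate * work_since_last
    and would cost half that work on average, so checkpoint once the
    expected loss exceeds the checkpoint cost."""
    expected_loss = failure_rate * work_since_last ** 2 / 2.0
    return expected_loss >= checkpoint_cost

C, lam, step = 60.0, 1.0 / 86400.0, 60.0  # assumed cost, rate, poll step
work, taken = 0.0, []
for t in range(0, 8 * 3600, int(step)):   # an 8-hour execution window
    work += step
    if should_checkpoint(work, C, lam):
        taken.append(t)
        work = 0.0
print(len(taken))  # checkpoints actually taken in 8 hours
```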
On Channel Failures, File Fragmentation Policies, and Heavy-Tailed Completion Times
It has recently been discovered that heavy-tailed completion times can result from protocol interaction even when file sizes are light-tailed. A key to this phenomenon is the use of a restart policy, where a file that is interrupted before completion must restart from the beginning. In this paper, we show that fragmenting a file into pieces whose sizes are either bounded or independently chosen after each interruption guarantees a light-tailed completion time as long as the file size is light-tailed; i.e., in this case, a heavy-tailed completion time can only originate from heavy-tailed file sizes. If the file size is heavy-tailed, then the completion time is necessarily heavy-tailed. For this case, we show that when the file size distribution is regularly varying, then under independent or bounded fragmentation, the completion time tail distribution function is asymptotically bounded above by that of the original file size stretched by a constant factor. We then prove that if the distribution of times between interruptions has a nondecreasing failure rate, the expected completion time is minimized by dividing the file into equal-sized fragments; this optimal fragment size is unique but depends on the file size. We also present a simple blind fragmentation policy, where the fragment sizes are constant and independent of the file size, and prove that it is asymptotically optimal. Both policies are also shown to have desirable completion-time tail behavior. Finally, we bound the error in expected completion time due to errors in modeling the failure process.
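The restart-versus-fragmentation effect is easy to reproduce in a toy Monte Carlo (illustrative parameters; exponential interruptions, zero per-fragment overhead):

```python
import random

def transfer_time(size, failure_rate, fragment=None):
    """Time to ship `size` units over a channel with Exp(failure_rate)
    interruptions. Without `fragment`, an interruption restarts the
    whole file; with it, only the current equal-sized piece restarts."""
    if fragment is None:
        pieces = [size]
    else:
        pieces = [fragment] * int(size // fragment)
        if size % fragment:
            pieces.append(size % fragment)
    total = 0.0
    for piece in pieces:
        while True:
            gap = random.expovariate(failure_rate)
            if gap >= piece:      # piece finishes before the next failure
                total += piece
                break
            total += gap          # interrupted: the partial piece is lost
    return total

random.seed(7)
whole = [transfer_time(10.0, 0.5) for _ in range(300)]
split = [transfer_time(10.0, 0.5, fragment=1.0) for _ in range(300)]
print(max(whole), max(split))  # restart-from-scratch has a far heavier tail
```

With these numbers a whole-file attempt succeeds with probability e^-5, so completion requires ~150 attempts on average, while each unit-sized fragment succeeds most of the time, which is exactly the tail-taming effect the paper formalizes.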
The automation of PDE-constrained optimisation and its applications
This thesis is concerned with the automation of solving optimisation problems
constrained by partial differential equations (PDEs). Gradient-based
optimisation algorithms are the key to solving optimisation problems of practical
interest. The required derivatives can be efficiently computed with
the adjoint approach. However, current methods for the development of
adjoint models often require a significant amount of effort and expertise, in
particular for non-linear time-dependent problems.
This work presents a new high-level reinterpretation of algorithmic differentiation
to develop adjoint models. This reinterpretation considers the
discrete system as a sequence of equation solves. Applying this approach
to a general finite-element framework results in an automatic and robust
way of deriving and solving adjoint models. This drastically reduces the
development effort compared to traditional methods.
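The "sequence of equation solves" view can be illustrated on a scalar toy problem, far removed from the finite-element setting of the thesis: implicit Euler for du/dt = -a*u, where each time step is one (here trivial) equation solve, and the adjoint pass performs one linear solve per forward solve in reverse order.

```python
def forward(a, u0, dt, n):
    """Each time step is one equation solve:
    (1 + a*dt) * u[k] = u[k-1]   (implicit Euler for du/dt = -a*u).
    Returns the whole trajectory u[0..n]."""
    us = [u0]
    for _ in range(n):
        us.append(us[-1] / (1.0 + a * dt))
    return us

def adjoint_gradient(a, u0, dt, n):
    """Reverse pass over the recorded sequence of solves: one linear
    adjoint solve per forward solve, accumulating dJ/da for J = u[n]**2.
    Residual of step k: F = (1 + a*dt)*u[k] - u[k-1], so
    dF/du[k] = 1 + a*dt, dF/da = dt*u[k], dF/du[k-1] = -1."""
    us = forward(a, u0, dt, n)
    lam = 2.0 * us[-1]          # dJ/du[n] seeds the adjoint variable
    dJda = 0.0
    for k in range(n, 0, -1):
        dJda -= lam * dt * us[k] / (1.0 + a * dt)
        lam /= (1.0 + a * dt)
    return dJda

a, u0, dt, n = 0.3, 1.0, 0.1, 50
g = adjoint_gradient(a, u0, dt, n)
eps = 1e-6
fd = (forward(a + eps, u0, dt, n)[-1] ** 2
      - forward(a - eps, u0, dt, n)[-1] ** 2) / (2.0 * eps)
print(g, fd)  # adjoint and finite-difference gradients agree
```

The point of the high-level reinterpretation is that the reverse pass can be derived mechanically from the recorded solves, whatever the equations are; here that derivation fits in a dozen lines.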
Based on this result, a new framework for rapidly defining and solving
optimisation problems constrained by PDEs is developed. The user specifies the discrete optimisation problem in a compact high-level language
that resembles the mathematical structure of the underlying system. All
remaining steps, including parameter updates, PDE solves and derivative
computations, are performed without user intervention. The framework
can be applied to a wide range of governing PDEs, and interfaces to various
gradient-free and gradient-based optimisation algorithms.
The capabilities of this framework are demonstrated through the application
to two PDE-constrained optimisation problems. The first is concerned
with the optimal layout of turbines in tidal stream farms; this optimisation
problem is one of the main challenges facing the marine renewable energy industry. The second uses data assimilation to reconstruct
the profile of tsunami waves based on inundation observations. This provides
the first step towards the general reconstruction of tsunami signals
from satellite information.
Fault Tolerance for Real-Time Systems: Analysis and Optimization of Roll-back Recovery with Checkpointing
Increasing soft error rates in recent semiconductor technologies make the use of fault tolerance necessary. While fault tolerance enables correct operation in the presence of soft errors, it usually introduces a time overhead. This overhead is particularly important for the class of computer systems referred to as real-time systems (RTSs), where correct operation is defined as producing the correct result of a computation while satisfying given time constraints (deadlines). Depending on the consequences of violating deadlines, RTSs are classified into soft and hard RTSs. While violating deadlines in soft RTSs usually results in some performance degradation, violating deadlines in hard RTSs results in catastrophic consequences. To determine whether deadlines are met, RTSs are analyzed with respect to average execution time (AET) and worst-case execution time (WCET), where AET is used for soft RTSs and WCET for hard RTSs. When fault tolerance is employed in either kind of RTS, the time overhead it causes may be the reason that deadlines are violated; therefore, there is a need to optimize the usage of fault tolerance in RTSs. To enable correct operation of RTSs in the presence of soft errors, in this thesis we consider a fault tolerance technique, Roll-back Recovery with Checkpointing (RRC), that efficiently copes with soft errors. The major drawback of RRC is that it introduces a time overhead which depends on the number of checkpoints used. Depending on how the checkpoints are distributed throughout the execution of a job, we consider two checkpointing schemes: equidistant checkpointing, where the checkpoints are evenly spaced, and non-equidistant checkpointing, where they are not. The goal of this thesis is to provide an optimization framework for RRC used in RTSs, considering different optimization objectives that are important for RTSs.
The purpose of such an optimization framework is to assist the designer of an RTS during the early design stage, when different fault tolerance techniques must be explored and one chosen that meets the specification requirements of the RTS to be implemented. Using the optimization framework presented in this thesis, the designer can determine whether RRC is a suitable fault tolerance technique for the RTS at hand. The proposed framework includes the following optimization objectives. For soft RTSs, we consider optimization of RRC with respect to AET. For equidistant checkpointing, the framework provides the optimal number of checkpoints resulting in the minimal AET. For non-equidistant checkpointing, it provides two adaptive techniques that estimate the probability of errors and adjust the checkpointing scheme (the number of checkpoints over time) with the goal of minimizing the AET. While analyses based on AET are sufficient for soft RTSs, for hard RTSs it is more important to maximize the probability that deadlines are met. To evaluate the extent to which a deadline is met, in this thesis we use the statistical concept of Level of Confidence (LoC). The LoC with respect to a given deadline is the probability that a job (or a set of jobs) completes before that deadline. As a metric, LoC is equally applicable to soft and hard RTSs; as an optimization objective, however, it is used in hard RTSs. Therefore, for hard RTSs, we consider optimization of RRC with respect to LoC.
For equidistant checkpointing, the optimization framework provides (1) for a single job, the optimal number of checkpoints resulting in the maximal LoC with respect to a given deadline, and (2) for a set of jobs running in sequence under a global deadline, the number of checkpoints to assign to each job such that the LoC with respect to the global deadline is maximized. For non-equidistant checkpointing, the framework provides the distribution of a given number of checkpoints that maximizes the LoC with respect to a given deadline. Since the specification of an RTS may include a reliability requirement that all deadlines be met with some probability, in this thesis we introduce the concept of Guaranteed Completion Time: a completion time such that the probability that a job completes within it is at least equal to a given reliability requirement. The framework includes the Guaranteed Completion Time as an optimization objective and, assuming equidistant checkpointing, provides the optimal number of checkpoints that results in the minimal Guaranteed Completion Time.
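For intuition about the equidistant AET objective, here is a hedged sketch of a standard retry model (each segment is re-executed until error-free, with errors arriving as a Poisson process); the thesis's system model may differ in its details.

```python
import math

def aet(n, work, checkpoint_cost, error_rate):
    """Average execution time with n equidistant checkpoints: each
    segment lasts work/n + C and is retried until it completes without
    an error, errors arriving as a Poisson process of rate error_rate."""
    seg = work / n + checkpoint_cost
    return n * seg * math.exp(error_rate * seg)

work, C, lam = 3600.0, 5.0, 1.0 / 600.0  # assumed job, cost, error rate
best = min(range(1, 101), key=lambda n: aet(n, work, C, lam))
print(best, aet(best, work, C, lam))  # optimal n and the resulting AET
```

Few checkpoints mean long, failure-prone segments; many checkpoints mean heavy overhead; the optimal n sits between the two, which is exactly the quantity the framework computes in closed or numerical form.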
Checkpointing strategies to protect parallel jobs against failures with general distributions
This paper studies checkpointing strategies for parallel jobs subject to fail-stop errors. The optimal strategy is well known when failure inter-arrival times obey an Exponential law, but it is unknown for non-memoryless failure distributions. We explain why the latter fact is misunderstood in recent literature. We propose a general strategy that maximizes the expected efficiency until the next failure, and we show that this strategy is asymptotically optimal for very long jobs. Through extensive simulations, we show that the new strategy is always at least as good as the Young/Daly strategy for various failure distributions. For distributions with high infant mortality (such as LogNormal 2.51 or Weibull 0.5), the execution time is divided by a factor of 1.9 on average, and by up to a factor of 4.2 for recently deployed platforms.
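A small Monte Carlo, with assumed parameters, shows the kind of experiment the paper performs: fixed-period checkpointing under Weibull failures with high infant mortality, evaluated for periods around the Young/Daly value (this sketch compares fixed periods only; it does not implement the paper's adaptive strategy).

```python
import math, random

def simulate(work, period, C, shape, scale, rng, runs=200):
    """Average makespan of `work` seconds of computation with a fixed
    checkpointing period (checkpoint cost C); failure inter-arrival
    times are Weibull(shape), so shape < 1 gives high infant mortality
    after each failure. Restart/downtime costs are ignored."""
    total = 0.0
    for _ in range(runs):
        done = t = 0.0
        next_fail = rng.weibullvariate(scale, shape)
        while done < work:
            chunk = min(period, work - done) + C
            if t + chunk <= next_fail:      # segment + checkpoint survive
                t += chunk
                done += min(period, work - done)
            else:                           # failure: open segment is lost
                t = next_fail
                next_fail = t + rng.weibullvariate(scale, shape)
        total += t
    return total / runs

rng = random.Random(42)
C, shape, scale, work = 10.0, 0.5, 500.0, 3600.0
mtbf = scale * math.gamma(1.0 + 1.0 / shape)  # Weibull mean: 1000 s here
yd = math.sqrt(2.0 * C * mtbf)                # Young/Daly period
for period in (yd / 2.0, yd, 2.0 * yd):
    print(period, simulate(work, period, C, shape, scale, rng))
```

Because Weibull 0.5 is not memoryless, no single fixed period is optimal throughout execution, which is the gap the paper's efficiency-maximizing strategy exploits.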
The engineering design integration (EDIN) system
A digital computer program complex for the evaluation of aerospace vehicle preliminary designs is described. The system consists of a Univac 1100 series computer and peripherals using the Exec 8 operating system, a set of demand-access terminals of the alphanumeric and graphics types, and a library of independent computer programs. Modification of the partial run streams, data base maintenance and construction, and control of program sequencing are provided by a data manipulation program called the DLG processor. The executive control of library program execution is performed by the Univac Exec 8 operating system through a user-established run stream. A combination of demand and batch operations is employed in the evaluation of preliminary designs. Applications accomplished with the EDIN system are described.