
    Near-optimal scheduling and decision-making models for reactive and proactive fault tolerance mechanisms

    As High Performance Computing (HPC) systems grow to meet computational demand, the chance of failure increases dramatically, resulting in potentially large amounts of lost computing time. Fault Tolerance (FT) mechanisms aim to mitigate the impact of failures on running applications; checkpoint/restart mechanisms, for example, periodically save application state and reload the saved state upon failure. However, the overhead of FT mechanisms grows with HPC system size, so the challenge is to handle this expensive overhead while minimizing the computing time lost to failures. In this dissertation, a near-optimal scheduling model is built, by means of stochastic processes and the calculus of variations, to determine when to invoke a hybrid checkpoint mechanism. The obtained schedule minimizes the waste time caused by the checkpoint mechanism and by failures. Furthermore, to handle various FT mechanisms, an adaptive decision-making model has been developed to determine the best FT strategy to invoke at each decision point. Using a dynamic programming approach, the best mechanism is selected at each decision point so as to globally minimize the total waste time of an application execution. In addition, the model adapts to changes in the failure rate over time.
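
    The dissertation's schedule is derived via stochastic processes and the calculus of variations; as a much simpler illustration of the same trade-off, the classic first-order (Young) approximation balances checkpoint overhead against expected re-computation under exponential failures. The function names and parameters below are illustrative, not taken from the dissertation:

```python
import math

def optimal_checkpoint_interval(checkpoint_cost, mtbf):
    """First-order (Young) approximation of the checkpoint interval
    that minimizes expected waste under exponentially distributed
    failures with mean time between failures `mtbf`."""
    return math.sqrt(2 * checkpoint_cost * mtbf)

def waste_fraction(interval, checkpoint_cost, mtbf):
    """Approximate fraction of time lost: checkpointing overhead per
    interval, plus expected re-computation and recovery on failure."""
    return checkpoint_cost / interval + (interval / 2 + checkpoint_cost) / mtbf
```

    For a 60 s checkpoint on a platform with a one-day MTBF, the approximation suggests checkpointing roughly every 54 minutes; intervals both shorter and longer than this increase the waste fraction.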

    High Performance Computing Systems with Various Checkpointing Schemes

    Finding the failure rate of a system is a crucial step in the analysis of high performance computing systems. To deal with failures, a fault-tolerance mechanism called the checkpoint/restart technique was introduced; however, it carries additional costs. We therefore propose two models for different schemes (full and incremental checkpointing). The models, which are based on the reliability of the system, are used to determine checkpoint placements, and both balance checkpoint overhead against re-computation time. Because each incremental checkpoint adds extra cost during the recovery period, a method is given for finding the number of incremental checkpoints to take between two consecutive full checkpoints. Our simulations suggest that in most cases the incremental checkpoint model reduces waste time more than the full checkpoint model does. The waste time produced by both models ranges from 2% to 28% of the application completion time, depending on the checkpoint overheads.
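
    The balance the models strike can be illustrated with a toy waste model (an assumed sketch, not the paper's reliability-based model): taking k incremental checkpoints between consecutive full checkpoints adds overhead but shrinks the work lost to a failure, at the price of replaying incrementals during recovery.

```python
def waste_per_period(period, k, c_full, c_incr, r_incr, mtbf):
    """Expected waste per full-checkpoint period with k incremental
    checkpoints in between (illustrative model; c_full and c_incr are
    checkpoint costs, r_incr is the cost of replaying one incremental
    checkpoint during recovery, mtbf the mean time between failures)."""
    overhead = c_full + k * c_incr        # time spent taking checkpoints
    seg = period / (k + 1)                # work between checkpoints
    # On failure (probability ~ period/mtbf): lose half a segment on
    # average, plus replay of the incrementals taken so far (~k/2).
    expected_loss = (period / mtbf) * (seg / 2 + (k / 2) * r_incr)
    return overhead + expected_loss
```

    With cheap incremental checkpoints, a few incrementals per period can beat pure full checkpointing; the best k then follows by minimizing this expression over k.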

    How Xenopus laevis embryos replicate reliably: investigating the random-completion problem

    DNA synthesis in Xenopus frog embryos initiates stochastically in time at many sites (origins) along the chromosome. Stochastic initiation implies fluctuations in the time to complete replication and may lead to cell death if replication takes longer than the cell-cycle time (≈25 min). Surprisingly, although the typical replication time is about 20 min, in vivo experiments show that replication fails to complete only about 1 in 300 times. How is replication timing accurately controlled despite the stochasticity? Biologists have proposed two solutions to this "random-completion problem." The first uses randomly located origins but increases their rate of initiation as S phase proceeds, while the second uses regularly spaced origins. In this paper, we investigate the random-completion problem using a type of model first developed to describe the kinetics of first-order phase transitions. Using methods from extreme-value statistics, we derive the distribution of replication-completion times for a finite genome. We then argue that the biologists' first solution is not only consistent with experiment but also nearly optimizes the use of replicative proteins. We also show that spatial regularity in origin placement does not significantly alter the distribution of replication times and thus is not needed for the control of replication timing. (16 pages, 9 figures; submitted to Physical Review.)
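
    The kinetic model can be mimicked with a short Monte Carlo sketch (illustrative only; the paper's analysis is analytic, via extreme-value statistics): origins at random positions fire at random times, replication forks spread outward at constant speed, and the genome finishes when its last point is replicated.

```python
import random

def replication_time(genome_len, n_origins, init_rate, fork_speed, rng):
    """Monte Carlo sketch of stochastic-initiation replication:
    each origin fires at an Exponential(init_rate) time, and the
    point x is replicated at min over origins of t_i + |x - x_i|/v.
    Completion time is the maximum over the genome (sampled on a
    coarse grid, adequate for an illustration)."""
    origins = [(rng.uniform(0, genome_len), rng.expovariate(init_rate))
               for _ in range(n_origins)]
    grid = [genome_len * i / 200 for i in range(201)]
    return max(min(t + abs(x - ox) / fork_speed for ox, t in origins)
               for x in grid)
```

    Repeating this draw many times yields the empirical completion-time distribution, whose far tail is what determines the failure probability the paper studies.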

    A failure index for high performance computing applications

    This dissertation introduces a new metric in the area of High Performance Computing (HPC) application reliability and performance modeling. Derived via a time-dependent implementation of an existing inequality measure, the Failure Index (FI) produces a coefficient representing the volatility of the failures incurred by an application running on a given HPC system in a given time interval. This coefficient gives a normalized, cross-system representation of the failure volatility of applications running on failure-rich HPC platforms. Further, the origin and ramifications of application failures are investigated, from which certain mathematical conclusions yield greater insight into the behavior of these applications in failure-rich system environments. This work also includes background information on the problems facing HPC applications at the highest scale, the lack of standardized application-specific metrics in this arena, and a means of generating such metrics in a low-latency manner. A case study with detailed analysis showcasing the benefits of the FI is also included.
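
    The abstract does not name the underlying inequality measure, so the sketch below uses the Gini coefficient over per-interval failure counts purely as a stand-in: a normalized coefficient near 0 when failures are spread evenly over time and near 1 when they concentrate in a few intervals.

```python
def failure_volatility(failure_counts):
    """Stand-in for an FI-style coefficient (assumed, not the
    dissertation's exact measure): the Gini coefficient of the
    number of failures observed in each time interval."""
    xs = sorted(failure_counts)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0          # no failures: no volatility to measure
    # Standard sorted-rank formula for the Gini coefficient.
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n
```

    Evenly spread failures score 0; all failures landing in one interval score (n − 1)/n, approaching 1 for many intervals.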

    Dynamic interval determination for page-level incremental checkpointing

    A distributed system is composed of multiple independent machines that communicate using messages. Faults in a large distributed system are common events. Without fault-tolerance mechanisms, an application running on such a system must be restarted from scratch if a fault occurs in the middle of its execution, resulting in the loss of useful computation. Checkpoint and recovery mechanisms are used in distributed systems to provide fault tolerance for such applications. A checkpoint of a process is the information about the state of that process at some instant of time; a checkpoint of a distributed application is a set of checkpoints, one from each of its processes, satisfying certain constraints. If a fault occurs, the application is restarted from an earlier checkpoint rather than from scratch, saving some of the computation. Several checkpoint and recovery protocols have been proposed in the literature. The performance of such a protocol depends on the amount of computation it can save weighed against the overhead it incurs, so checkpointing protocols should not add much overhead to the system. Checkpointing overhead is mainly due to coordination among processes and the saving of their contexts to stable storage. In coordinated checkpointing, the process initiating a checkpoint coordinates with the other processes through messages; the more messages used for coordination, the greater the network traffic, which is undesirable. It is therefore better to reduce the number of messages needed for checkpoint coordination. In this thesis, we present an algorithm that reduces the number of messages per process needed for checkpoint coordination, thereby decreasing network traffic.
The total running time of an application depends on its execution time and the checkpointing overhead it incurs, so this overhead should be minimized. Checkpointing overhead is the combination of context-saving overhead and coordination overhead; storing the context of the application in stable storage also adds overhead. With periodic-interval checkpointing, processes sometimes take checkpoints that are of little use, and these unnecessary checkpoints increase the application's running time. We have proposed an algorithm that determines the checkpointing interval dynamically, based on the expected recovery time, to avoid unnecessary checkpoints. By eliminating unnecessary checkpoints, we can significantly reduce the running time of a process.
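
    A minimal decision rule in the spirit of dynamic interval determination (an assumed sketch, not the thesis's algorithm): take the next checkpoint only once the expected re-computation cost of a failure outweighs the cost of the checkpoint itself.

```python
def should_checkpoint(work_since_last, failure_rate, checkpoint_cost):
    """Illustrative rule: with a failure rate per unit time, the
    expected re-computation if we keep running is roughly
    failure_rate * w * (w / 2) for w units of unsaved work.
    Checkpoint once that exceeds the checkpoint cost."""
    expected_rework = failure_rate * work_since_last * work_since_last / 2
    return expected_rework >= checkpoint_cost
```

    Under a constant failure rate this rule triggers at w = sqrt(2 * checkpoint_cost / failure_rate), consistent with the first-order optimal interval; its interest is that with a time-varying estimated failure rate the interval adapts automatically.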

    On Channel Failures, File Fragmentation Policies, and Heavy-Tailed Completion Times

    It has recently been discovered that heavy-tailed completion times can result from protocol interaction even when file sizes are light-tailed. A key to this phenomenon is the use of a restart policy, in which a file whose transfer is interrupted before completion must restart from the beginning. In this paper, we show that fragmenting a file into pieces whose sizes are either bounded or chosen independently after each interruption guarantees a light-tailed completion time as long as the file size is light-tailed; in this case, heavy-tailed completion times can only originate from heavy-tailed file sizes. If the file size is heavy-tailed, then the completion time is necessarily heavy-tailed. For this case, we show that when the file size distribution is regularly varying, then under independent or bounded fragmentation the completion-time tail distribution function is asymptotically bounded above by that of the original file size stretched by a constant factor. We then prove that if the distribution of times between interruptions has a nondecreasing failure rate, the expected completion time is minimized by dividing the file into equal-sized fragments; this optimal fragment size is unique but depends on the file size. We also present a simple blind fragmentation policy, in which the fragment sizes are constant and independent of the file size, and prove that it is asymptotically optimal. Both policies are also shown to have desirable completion-time tail behavior. Finally, we bound the error in expected completion time due to errors in modeling the failure process.
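
    The restart effect is easy to see numerically. Under exponential interruptions at rate λ, a fragment that must restart from its beginning on each interruption takes expected time (e^{λs} − 1)/λ for size s, so with a small per-fragment overhead there is a unique best fragment count. The parameters below are assumed for illustration, not taken from the paper:

```python
import math

def expected_completion(file_size, n_frags, rate, overhead):
    """Expected completion time under RESTART with equal-sized
    fragments: each fragment (plus a fixed per-fragment overhead)
    must complete without interruption, interruptions arriving as a
    Poisson process at `rate`."""
    s = file_size / n_frags + overhead
    return n_frags * (math.exp(rate * s) - 1) / rate

def best_fragment_count(file_size, rate, overhead, max_n=10_000):
    """Brute-force search for the fragment count minimizing the
    expected completion time."""
    return min(range(1, max_n + 1),
               key=lambda n: expected_completion(file_size, n, rate, overhead))
```

    Sending the whole file at once is exponentially bad in λ times the file size, while over-fragmenting pays too much overhead; the minimum sits in between.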

    The automation of PDE-constrained optimisation and its applications

    This thesis is concerned with the automation of solving optimisation problems constrained by partial differential equations (PDEs). Gradient-based optimisation algorithms are the key to solving optimisation problems of practical interest, and the required derivatives can be computed efficiently with the adjoint approach. However, current methods for developing adjoint models often require a significant amount of effort and expertise, in particular for non-linear time-dependent problems. This work presents a new high-level reinterpretation of algorithmic differentiation for developing adjoint models, which considers the discrete system as a sequence of equation solves. Applying this approach to a general finite-element framework results in an automatic and robust way of deriving and solving adjoint models, drastically reducing the development effort compared to traditional methods. Based on this result, a new framework for rapidly defining and solving optimisation problems constrained by PDEs is developed. The user specifies the discrete optimisation problem in a compact high-level language that resembles the mathematical structure of the underlying system; all remaining steps, including parameter updates, PDE solves and derivative computations, are performed without user intervention. The framework can be applied to a wide range of governing PDEs, and it interfaces to various gradient-free and gradient-based optimisation algorithms. Its capabilities are demonstrated on two PDE-constrained optimisation problems. The first concerns the optimal layout of turbines in tidal stream farms, one of the main challenges facing the marine renewable energy industry. The second applies data assimilation to reconstruct the profile of tsunami waves from inundation observations, providing a first step towards the general reconstruction of tsunami signals from satellite information.

    Fault Tolerance for Real-Time Systems: Analysis and Optimization of Roll-back Recovery with Checkpointing

    Increasing soft-error rates in recent semiconductor technologies enforce the use of fault tolerance. While fault tolerance enables correct operation in the presence of soft errors, it usually introduces a time overhead. This overhead is particularly important for the class of computer systems referred to as real-time systems (RTSs), where correct operation is defined as producing the correct result of a computation while satisfying given time constraints (deadlines). Depending on the consequences of violating deadlines, RTSs are classified into soft and hard RTSs: violating deadlines in soft RTSs usually results in some performance degradation, while violating deadlines in hard RTSs results in catastrophic consequences. To determine whether deadlines are met, RTSs are analyzed with respect to average execution time (AET) and worst-case execution time (WCET); AET is used for soft RTSs, and WCET for hard RTSs. When fault tolerance is employed in either kind of RTS, the time overhead it causes may be the reason that deadlines are violated, so there is a need to optimize the usage of fault tolerance in RTSs. To enable correct operation of RTSs in the presence of soft errors, in this thesis we consider a fault-tolerance technique, Roll-back Recovery with Checkpointing (RRC), that efficiently copes with soft errors. The major drawback of RRC is that it introduces a time overhead which depends on the number of checkpoints used. Depending on how the checkpoints are distributed throughout the execution of a job, we consider two checkpointing schemes: equidistant checkpointing, where the checkpoints are evenly distributed, and non-equidistant checkpointing, where they are not. The goal of this thesis is to provide an optimization framework for RRC in RTSs that addresses the optimization objectives important for RTSs.
The purpose of such a framework is to assist the designer of an RTS during the early design stage, when the designer needs to explore different fault-tolerance techniques and choose one that meets the specification requirements of the RTS to be implemented. Using the optimization framework presented in this thesis, the designer can learn whether RRC is a suitable fault-tolerance technique for the RTS at hand. The framework includes the following optimization objectives. For soft RTSs, we optimize RRC with respect to AET. For equidistant checkpointing, the framework provides the optimal number of checkpoints, which results in the minimal AET. For non-equidistant checkpointing, it provides two adaptive techniques that estimate the probability of errors and adjust the checkpointing scheme (the number of checkpoints over time) with the goal of minimizing the AET. While analyses based on AET are sufficient for soft RTSs, for hard RTSs it is more important to maximize the probability that deadlines are met. To evaluate the extent to which a deadline is met, we use the statistical concept of Level of Confidence (LoC): the LoC with respect to a given deadline is the probability that a job (or a set of jobs) completes before that deadline. As a metric, LoC is equally applicable to soft and hard RTSs, but as an optimization objective it is used for hard RTSs. Therefore, for hard RTSs, we optimize RRC with respect to LoC.
For equidistant checkpointing, the framework provides (1) for a single job, the optimal number of checkpoints, which results in the maximal LoC with respect to a given deadline, and (2) for a set of jobs running in sequence under a global deadline, the number of checkpoints to assign to each job so that the LoC with respect to the global deadline is maximized. For non-equidistant checkpointing, the framework determines how a given number of checkpoints should be distributed so that the LoC with respect to a given deadline is maximized. Since the specification of an RTS may include a reliability requirement that all deadlines be met with some probability, we introduce the concept of Guaranteed Completion Time: a completion time such that the probability that a job completes within it is at least equal to a given reliability requirement. The framework includes Guaranteed Completion Time as an optimization objective and, assuming equidistant checkpointing, provides the optimal number of checkpoints that results in the minimal Guaranteed Completion Time.
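
    For equidistant checkpointing, the AET trade-off can be sketched with a simple renewal model (assumed for illustration; the thesis's analysis is more detailed): more checkpoints shrink the segment that must be re-executed after a soft error, but each checkpoint adds overhead.

```python
import math

def aet(job_time, n_checkpoints, cp_cost, error_rate):
    """Average execution time of a job under roll-back recovery with
    n equidistant checkpoints (illustrative model: errors arrive as a
    Poisson process; an erroneous segment is re-executed until it
    completes error-free, giving e^(rate*segment) expected attempts
    weighted by segment length)."""
    seg = job_time / n_checkpoints + cp_cost
    return n_checkpoints * seg * math.exp(error_rate * seg)

def optimal_checkpoints(job_time, cp_cost, error_rate, max_n=1000):
    """Brute-force search for the checkpoint count minimizing AET."""
    return min(range(1, max_n + 1),
               key=lambda n: aet(job_time, n, cp_cost, error_rate))
```

    Too few checkpoints make the exponential re-execution term explode; too many make the overhead term dominate, so the AET curve is convex with a single best n.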

    Checkpoint strategies to protect parallel jobs against failures with general distributions

    This paper studies checkpointing strategies for parallel jobs subject to fail-stop errors. The optimal strategy is well known when failure inter-arrival times obey an Exponential law, but it is unknown for non-memoryless failure distributions; we explain why the latter fact is misunderstood in recent literature. We propose a general strategy that maximizes the expected efficiency until the next failure, and we show that this strategy is asymptotically optimal for very long jobs. Through extensive simulations, we show that the new strategy is always at least as good as the Young/Daly strategy for various failure distributions. For distributions with high infant mortality (such as LogNormal 2.51 or Weibull 0.5), the execution time is divided by a factor of 1.9 on average, and by up to a factor of 4.2 for recently deployed platforms.
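
    The kind of simulation used in such comparisons can be sketched as follows (a minimal sketch with assumed names, not the paper's simulator): failures are drawn from an arbitrary, possibly non-memoryless distribution, and on a failure all work since the last checkpoint is lost.

```python
import random

def run_job(work, period, cp_cost, failure_sampler, rng, restart_cost=0.0):
    """Simulate one execution of a job that checkpoints after every
    `period` units of work; `failure_sampler(rng)` draws the time to
    the next failure (any distribution). Returns total elapsed time."""
    done, elapsed = 0.0, 0.0
    next_fail = failure_sampler(rng)
    while done < work:
        seg = min(period, work - done)
        need = seg + cp_cost              # segment ends with a checkpoint
        if next_fail >= need:             # segment completes safely
            elapsed += need
            next_fail -= need
            done += seg
        else:                             # failure: segment's work is lost
            elapsed += next_fail + restart_cost
            next_fail = failure_sampler(rng)
    return elapsed
```

    Averaging `run_job` over many runs, with `failure_sampler` drawing from Exponential, Weibull, or LogNormal laws, reproduces the kind of head-to-head comparison of checkpointing periods described above.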

    The engineering design integration (EDIN) system

    A digital computer program complex for the evaluation of aerospace vehicle preliminary designs is described. The system consists of a Univac 1100 series computer and peripherals using the Exec 8 operating system, a set of demand-access terminals of the alphanumeric and graphics types, and a library of independent computer programs. Modification of the partial run streams, data base maintenance and construction, and control of program sequencing are provided by a data manipulation program called the DLG processor. The executive control of library program execution is performed by the Univac Exec 8 operating system through a user-established run stream. A combination of demand and batch operations is employed in the evaluation of preliminary designs. Applications accomplished with the EDIN system are described.