1,568 research outputs found
FADI: a fault-tolerant environment for open distributed computing
FADI is a complete programming environment that serves the reliable execution of distributed application programs. FADI encompasses all aspects of modern fault-tolerant distributed computing. The built-in user-transparent error detection mechanism covers processor node crashes and hardware transient failures. The mechanism also integrates user-assisted error checks into the system failure model. The nucleus non-blocking checkpointing mechanism combined with a novel selective message logging technique delivers an efficient, low-overhead backup and recovery mechanism for distributed processes. FADI also provides means for remote automatic process allocation on the distributed system nodes.
An agent-based framework for performance modeling of an optimistic parallel discrete event simulator
Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems
High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the first supercomputer capable of executing more than an exaflop (10^18 floating-point operations per second). On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extreme-scale systems. Coordinated checkpointing is currently the most widely used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains effective on current systems, increasing the scale of today's systems to build next-generation systems will increase the cost of fault tolerance as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniques include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and software-based memory fault correction. In this thesis, I examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, I evaluate the potential impact of rollback avoidance on these systems. I then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, I examine the feasibility of using this technique to protect against memory faults in kernel memory.
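The scaling argument in this abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not taken from the thesis; it uses Young's well-known first-order approximation for the checkpoint interval and assumes independent node failures. The node MTBF (5 years), the checkpoint cost (600 s), and all function names are hypothetical values chosen only for illustration.

```python
import math

def system_mtbf(node_mtbf_s, nodes):
    """MTBF of the whole machine, assuming independent, exponentially
    distributed node failures (an assumption of this sketch)."""
    return node_mtbf_s / nodes

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the best compute time
    between two consecutive checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

def waste_fraction(interval_s, checkpoint_cost_s, mtbf_s):
    """First-order fraction of time lost to writing checkpoints plus
    the expected rework after a failure. Only meaningful while the
    checkpoint cost is small compared with the MTBF."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)

def report(node_mtbf_s=5 * 365 * 86400.0, checkpoint_cost_s=600.0):
    """Return (nodes, system MTBF, interval, waste) for growing machines.
    Both default parameters are hypothetical."""
    rows = []
    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf(node_mtbf_s, nodes)
        tau = young_interval(checkpoint_cost_s, mtbf)
        rows.append((nodes, mtbf, tau, waste_fraction(tau, checkpoint_cost_s, mtbf)))
    return rows

for nodes, mtbf, tau, waste in report():
    print(f"{nodes:>7} nodes: MTBF {mtbf:12.0f} s, interval {tau:9.0f} s, waste {waste:.2f}")
```

Under these assumptions the waste fraction grows as the node count rises, which is the effect rollback avoidance is meant to counter. A waste value close to 1 signals that the first-order model is being pushed beyond its valid range.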
Improving Performance of Iterative Methods by Lossy Checkpointing
Iterative methods are commonly used approaches to solve large, sparse linear
systems, which are fundamental operations for many modern scientific
simulations. When the large-scale iterative methods are running with a large
number of ranks in parallel, they have to checkpoint the dynamic variables
periodically in case of unavoidable fail-stop errors, requiring fast I/O
systems and large storage space. To this end, significantly reducing the
checkpointing overhead is critical to improving the overall performance of
iterative methods. Our contribution is fourfold. (1) We propose a novel lossy
checkpointing scheme that can significantly improve the checkpointing
performance of iterative methods by leveraging lossy compressors. (2) We
formulate a lossy checkpointing performance model and derive theoretically an
upper bound for the extra number of iterations caused by the distortion of data
in lossy checkpoints, in order to guarantee the performance improvement under
the lossy checkpointing scheme. (3) We analyze the impact of lossy
checkpointing (i.e., extra number of iterations caused by lossy checkpointing
files) for multiple types of iterative methods. (4) We evaluate the lossy
checkpointing scheme with optimal checkpointing intervals on a high-performance
computing environment with 2,048 cores, using a well-known scientific
computation package PETSc and a state-of-the-art checkpoint/restart toolkit.
Experiments show that our optimized lossy checkpointing scheme can
significantly reduce the fault tolerance overhead for iterative methods by
23%~70% compared with traditional checkpointing and 20%~58% compared with
lossless-compressed checkpointing, in the presence of system failures. Comment: 14 pages, 10 figures, HPDC'1
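The idea of restarting an iterative solver from a lossy checkpoint can be sketched in a few lines. This is a toy illustration, not the paper's scheme: it uses a Jacobi iteration on a small tridiagonal system (4 on the diagonal, -1 off it) instead of PETSc, and a simple uniform quantizer stands in for a real error-bounded lossy compressor. The failure is assumed to strike immediately after the checkpoint is written, and every name and parameter below is hypothetical.

```python
def jacobi_step(x, b):
    """One Jacobi sweep for the tridiagonal system with 4 on the
    diagonal and -1 on both off-diagonals."""
    n = len(x)
    new = [0.0] * n
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        new[i] = (b[i] + left + right) / 4.0
    return new

def residual_norm(x, b):
    """Infinity norm of b - A x for the same tridiagonal matrix."""
    n = len(x)
    worst = 0.0
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        worst = max(worst, abs(b[i] - (4.0 * x[i] - left - right)))
    return worst

def solve(x, b, tol=1e-10, max_iters=10_000):
    """Iterate until the residual drops below tol; return (x, iterations)."""
    iters = 0
    while residual_norm(x, b) > tol and iters < max_iters:
        x = jacobi_step(x, b)
        iters += 1
    return x, iters

def lossy_compress(x, error_bound):
    """Uniform quantization: store integer codes, absolute error <= error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) for v in x]

def lossy_decompress(codes, error_bound):
    step = 2.0 * error_bound
    return [c * step for c in codes]

def run_with_failure(n=50, checkpoint_iter=10, error_bound=1e-3, tol=1e-10):
    b = [1.0] * n
    x_clean, iters_clean = solve([0.0] * n, b, tol)  # failure-free reference

    x = [0.0] * n
    for _ in range(checkpoint_iter):                 # compute up to the checkpoint
        x = jacobi_step(x, b)
    codes = lossy_compress(x, error_bound)           # write the lossy checkpoint

    # Simulated fail-stop error: the in-memory iterate is lost here.
    x_restored = lossy_decompress(codes, error_bound)
    distortion = max(abs(a - c) for a, c in zip(x, x_restored))
    x_final, iters_after = solve(x_restored, b, tol)

    return {
        "iters_clean": iters_clean,
        "iters_after_restart": iters_after,
        "extra_iters": checkpoint_iter + iters_after - iters_clean,
        "distortion": distortion,
        "max_diff": max(abs(a - c) for a, c in zip(x_clean, x_final)),
    }

print(run_with_failure())
```

The restarted run still converges to the same solution within the solver tolerance; the price of the distortion shows up only as `extra_iters`, the quantity the abstract bounds theoretically.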
Analysis of checkpointing schemes for multiprocessor systems
Parallel computing systems provide hardware redundancy that helps to achieve low-cost fault tolerance by duplicating a task on more than one processor and comparing the states of the processors at checkpoints. This paper suggests a novel technique, based on a Markov Reward Model (MRM), for analyzing the performance of checkpointing schemes with task duplication. We show how this technique can be used to derive the average execution time of a task and other important parameters related to the performance of checkpointing schemes. Our analytical results closely match the values we obtained using a simulation program. We compare the average task execution time and total work of four checkpointing schemes, and show that increasing the number of processors generally reduces the average execution time but increases the total work done by the processors. However, in cases where there is a big difference between the time it takes to perform different operations, those results can change.
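To make the notion of "average execution time under task duplication" concrete, here is a deliberately simplified model, not the paper's Markov Reward Model and not one of its four schemes. It assumes two processors run the same task, states are compared at every checkpoint, a mismatch forces both to repeat the last segment, faults arrive as independent Poisson processes, and no fault occurs during the comparison itself. All names and parameter values are hypothetical.

```python
import math
import random

def expected_time(task_time, intervals, compare_cost, fault_rate):
    """Closed-form mean execution time for the simplified
    duplicate-and-compare model described above."""
    segment = task_time / intervals
    # Probability that neither of the two processors faults during one segment.
    p_ok = math.exp(-2.0 * fault_rate * segment)
    # Each segment is retried until the comparison succeeds (geometric retries).
    return intervals * (segment + compare_cost) / p_ok

def simulate_time(task_time, intervals, compare_cost, fault_rate,
                  trials=2000, seed=0):
    """Monte Carlo estimate of the same quantity, for cross-checking."""
    rng = random.Random(seed)
    segment = task_time / intervals
    p_ok = math.exp(-2.0 * fault_rate * segment)
    total = 0.0
    for _ in range(trials):
        for _ in range(intervals):
            while True:
                total += segment + compare_cost  # run the segment, then compare
                if rng.random() < p_ok:          # states match: move on
                    break                        # otherwise roll back and retry
    return total / trials

analytic = expected_time(100.0, 10, 1.0, 0.01)
simulated = simulate_time(100.0, 10, 1.0, 0.01)
print(f"analytic {analytic:.2f}, simulated {simulated:.2f}")
```

In this toy model the total work is simply twice the execution time because both processors are busy throughout; the schemes compared in the paper trade these two quantities off in more elaborate ways.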
Checkpointing of parallel applications in a Grid environment
The Grid environment is generic, heterogeneous, and dynamic, with many unreliable resources that leave it highly exposed to failures. The environment is unreliable because it is geographically dispersed, involves multiple autonomous administrative domains, and is composed of a large number of components. Examples of failures in the Grid environment include application crashes, Grid node crashes, network failures, and Grid system component failures. These types of failures can affect the execution of parallel/distributed applications in the Grid environment, so protection against these faults is crucial. Therefore, it is essential to develop efficient fault-tolerant mechanisms that allow users to successfully execute Grid applications. One of the research challenges in Grid computing is to develop a fault-tolerant solution that ensures Grid applications are executed reliably with minimum overhead.
While checkpointing is the most common method to achieve fault tolerance, there is still a lot of work to be done to improve the efficiency of the mechanism. This thesis provides an in-depth description of a novel solution for checkpointing parallel applications executed on a Grid. The implemented checkpointing mechanism allows an application to be checkpointed at regions where no interprocess communication is involved, thereby reducing the checkpointing overhead and checkpoint size.
Active Virtual Network Management Prediction: Complexity as a Framework for Prediction, Optimization, and Assurance
Research into active networking has provided the incentive to re-visit what
has traditionally been classified as distinct properties and characteristics of
information transfer such as protocol versus service; at a more fundamental
level this paper considers the blending of computation and communication by
means of complexity. The specific service examined in this paper is network
self-prediction enabled by Active Virtual Network Management Prediction.
Computation/communication is analyzed via Kolmogorov Complexity. The result is
a mechanism to understand and improve the performance of active networking and
Active Virtual Network Management Prediction in particular. The Active Virtual
Network Management Prediction mechanism allows information, in various states
of algorithmic and static form, to be transported in the service of prediction
for network management. The results are generally applicable to algorithmic
transmission of information. Kolmogorov Complexity is used and experimentally
validated as a theory describing the relationship among algorithmic
compression, complexity, and prediction accuracy within an active network.
Finally, the paper concludes with a complexity-based framework for Information
Assurance that attempts to take a holistic view of vulnerability analysis.
POSE: getting over grainsize in parallel discrete event simulation
Parallel discrete event simulations (PDES) encompass a broad range of analytical simulations. Their utility lies in their ability to model a system and provide information about its behavior in a timely manner. Current PDES methods provide limited performance improvements over sequential simulation. Many logical models for applications have fine granularity, making them challenging to parallelize. In POSE, we examine the overhead required for optimistically synchronizing events. We have designed an object model based on the concept of virtualization and new adaptive optimistic methods to improve the performance of fine-grained PDES applications. These novel approaches exploit the speculative nature of optimistic protocols to improve single-processor parallel over sequential performance and achieve scalability for previously hard-to-parallelize fine-grained simulations.