CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
In order to efficiently use the future generations of supercomputers, fault
tolerance and power consumption are two of the prime challenges anticipated by
the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has
been and still is the most widely used technique to deal with hard failures.
Application-level CR is the most effective CR technique in terms of overhead
efficiency, but it takes a lot of implementation effort. This work presents the
implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic
Fault Tolerance), which serves two purposes. First, it provides an extendable
library that significantly eases the implementation of application-level
checkpointing. The most basic and frequently used checkpoint data types are
already part of CRAFT and can be directly used out of the box. The library can
be easily extended to add more data types. As a means of overhead reduction, the
library offers a built-in asynchronous checkpointing mechanism and also
supports the Scalable Checkpoint/Restart (SCR) library for node-level
checkpointing. Second, CRAFT provides an easier interface for User-Level
Failure Mitigation (ULFM) based dynamic process recovery, which significantly
reduces the complexity and effort of failure detection and communication
recovery mechanisms. By utilizing both functionalities together, applications
can write application-level checkpoints and recover dynamically from process
failures with very limited programming effort. This work presents the design
and use of our library in detail. The associated overheads are thoroughly
analyzed using several benchmarks.
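To illustrate what application-level checkpoint/restart means in practice, here is a minimal sketch in Python. It does not use CRAFT or reproduce its C++ interface. The function names (`write_checkpoint`, `read_checkpoint`, `run`), the JSON file format, and the `fail_at` parameter that simulates a failure are all assumptions made for illustration.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def read_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, n_iters, checkpoint_every, fail_at=None):
    """Sum 0..n_iters-1, resuming from the last checkpoint if one exists.

    fail_at is a hypothetical knob that simulates a hard failure just
    before the given iteration is executed.
    """
    state = read_checkpoint(path) or {"iteration": 0, "total": 0}
    while state["iteration"] < n_iters:
        if fail_at is not None and state["iteration"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["iteration"]
        state["iteration"] += 1
        if state["iteration"] % checkpoint_every == 0:
            write_checkpoint(path, state)
    return state["total"]
```

The application decides which variables go into the checkpoint (here only the loop counter and the running total), which is why this approach can have low overhead but requires implementation effort. After a failure, calling `run` again with the same path resumes from the last saved iteration instead of starting over.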
Checkpointing as a Service in Heterogeneous Cloud Environments
A non-invasive, cloud-agnostic approach is demonstrated for extending
existing cloud platforms to include checkpoint-restart capability. Most cloud
platforms currently rely on each application to provide its own fault
tolerance. A uniform mechanism within the cloud itself serves two purposes: (a)
direct support for long-running jobs, which would otherwise require a custom
fault-tolerant mechanism for each application; and (b) the administrative
capability to manage an over-subscribed cloud by temporarily swapping out jobs
when higher priority jobs arrive. An advantage of this uniform approach is that
it also supports parallel and distributed computations, over both TCP and
InfiniBand, thus allowing traditional HPC applications to take advantage of an
existing cloud infrastructure. Additionally, an integrated health-monitoring
mechanism detects when long-running jobs either fail or incur exceptionally low
performance, perhaps due to resource starvation, and proactively suspends the
job. The cloud-agnostic feature is demonstrated by applying the implementation
to two very different cloud platforms: Snooze and OpenStack. The use of a
cloud-agnostic architecture also enables, for the first time, migration of
applications from one cloud platform to another.
Comment: 20 pages, 11 figures, appears in CCGrid, 201
FADI: a fault-tolerant environment for open distributed computing
FADI is a complete programming environment that serves the reliable execution of distributed application programs. FADI encompasses all aspects of modern fault-tolerant distributed computing. The built-in user-transparent error detection mechanism covers processor node crashes and hardware transient failures. The mechanism also integrates user-assisted error checks into the system failure model. The nucleus non-blocking checkpointing mechanism combined with a novel selective message logging technique delivers an efficient, low-overhead backup and recovery mechanism for distributed processes. FADI also provides means for remote automatic process allocation on the distributed system nodes.
Parallel Implementation of Lossy Data Compression for Temporal Data Sets
Many scientific data sets contain temporal dimensions. Such data sets store
information for the same spatial locations at different time stamps.
Some of the biggest temporal datasets are produced by parallel computing
applications such as simulations of climate change and fluid dynamics. Temporal
datasets can be very large and take a long time to transfer among
storage locations. With data compression, files can be transferred
faster and occupy less storage space. NUMARCK is a lossy data compression algorithm
for temporal data sets that can learn emerging distributions of element-wise
change ratios along the temporal dimension and encodes them into an index table
to be concisely represented. This paper presents a parallel implementation of
NUMARCK. Evaluated with six data sets obtained from climate and astrophysics
simulations, parallel NUMARCK achieved scalable speedups of up to 8788 when
running 12800 MPI processes on a parallel computer. We also compare the
compression ratios against two lossy data compression algorithms, ISABELA and
ZFP. The results show that NUMARCK achieved higher compression ratios than
ISABELA and ZFP.
Comment: 10 pages, HiPC 201
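The idea of encoding element-wise change ratios into an index table can be sketched as follows. This is a simplified serial illustration, not the NUMARCK implementation. Equal-width binning is an assumption, as are the function names and the requirement that the previous time step contains no zeros. The published algorithm's way of learning the distribution and of handling hard-to-compress elements is not reproduced here.

```python
def change_ratios(prev, curr):
    """Element-wise relative change between two time steps (prev has no zeros)."""
    return [(c - p) / p for p, c in zip(prev, curr)]


def encode(prev, curr, n_bins):
    """Map each change ratio to the index of an equal-width bin.

    Returns (table, indices): table[i] is the center of bin i, and
    indices[k] is the bin chosen for element k.
    """
    ratios = change_ratios(prev, curr)
    lo, hi = min(ratios), max(ratios)
    width = (hi - lo) / n_bins
    if width == 0.0:
        width = 1.0  # all ratios identical; any positive width works
    table = [lo + (i + 0.5) * width for i in range(n_bins)]
    # Clamp so the maximum ratio falls in the last bin rather than past it.
    indices = [min(int((r - lo) / width), n_bins - 1) for r in ratios]
    return table, indices


def decode(prev, table, indices):
    """Reconstruct the current time step from the previous one and the indices."""
    return [p * (1.0 + table[i]) for p, i in zip(prev, indices)]
```

Only the small table and one short index per element need to be stored for each time step, which is where the compression comes from. The reconstruction is lossy: each ratio is replaced by its bin center, so in this sketch the relative error per element is at most half a bin width.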
HOL(y)Hammer: Online ATP Service for HOL Light
HOL(y)Hammer is an online AI/ATP service for formal (computer-understandable)
mathematics encoded in the HOL Light system. The service allows its users to
upload and automatically process an arbitrary formal development (project)
based on HOL Light, and to attack arbitrary conjectures that use the concepts
defined in some of the uploaded projects. For that, the service uses several
automated reasoning systems combined with several premise selection methods
trained on all the project proofs. The projects that are readily available on
the server for such query answering include the recent versions of the
Flyspeck, Multivariate Analysis and Complex Analysis libraries. The service
runs on a 48-CPU server, currently employing in parallel for each task 7 AI/ATP
combinations and 4 decision procedures that contribute to its overall
performance. The system is also available for local installation by interested
users, who can customize it for their own proof development. An Emacs interface
allowing parallel asynchronous queries to the service is also provided. The
overall structure of the service is outlined, problems that arise and their
solutions are discussed, and an initial account of using the system is given.