Algorithm-Directed Crash Consistence in Non-Volatile Memory for HPC
Fault tolerance is one of the major design goals for HPC. The emergence of
non-volatile memories (NVM) provides a solution to build fault tolerant HPC.
Data in NVM-based main memory are not lost when the system crashes because of
the non-volatile nature of NVM. However, because of volatile caches, data
must be logged and explicitly flushed from caches into NVM to ensure
consistence and correctness before crashes, which can cause large runtime
overhead.
In this paper, we introduce an algorithm-based method to establish crash
consistence in NVM for HPC applications. We slightly extend application data
structures or sparsely flush cache blocks, which introduces negligible runtime
overhead. Such extension or cache flushing allows us to use algorithm knowledge
to reason about data consistence or correct inconsistent data when the
application crashes. We demonstrate the effectiveness of our method for three
algorithms, including an iterative solver, dense matrix multiplication, and
Monte-Carlo simulation. Based on comprehensive performance evaluation on a
variety of test environments, we demonstrate that our approach has very small
runtime overhead (at most 8.2% and less than 3% in most cases), much smaller
than that of traditional checkpointing, while having the same or less
recomputation cost. Comment: 12 page
Reliability-aware optimal checkpoint/restart model in high performance computing
Computational power demand for large challenging problems has increasingly driven the physical size of High Performance Computing (HPC) systems. As the system gets larger, it requires more and more components (processor, memory, disk, switch, power supply and so on). Thus, challenges arise in handling reliability of such large-scale systems. In order to minimize the performance loss due to unexpected failures, fault tolerant mechanisms are vital to sustain computational power in such environment. Checkpoint/restart is a common fault tolerant technique which has been widely applied in the single computer system. However, checkpointing in a large-scale HPC environment is much more challenging due to complexity, coordination, and timing issues. In this dissertation, we present a reliability-aware method for an optimal checkpoint/restart strategy. Our scheme aims to address the fault tolerance challenge, especially in a large-scale HPC system, by providing optimal checkpoint placement techniques derived from the actual system reliability. Unlike existing checkpoint models, which can only handle Poisson failure and a constant checkpoint interval, our model can perform a varying checkpoint interval and deal with different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques
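For context, the baseline that this abstract contrasts itself with (Poisson failures, constant checkpoint interval) is commonly illustrated with Young's first-order approximation, in which the interval is sqrt(2 * C * MTBF) for checkpoint cost C. The sketch below shows that classical baseline only; it is not the dissertation's reliability-aware model, and the function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order optimal constant interval: sqrt(2 * C * MTBF).

    Assumes exponentially distributed failures (a Poisson process) and a
    checkpoint cost that is small relative to the MTBF.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s, checkpoint_cost_s, mtbf_s):
    """First-order wasted-time fraction for a constant interval T.

    C / T accounts for writing checkpoints; T / (2 * MTBF) accounts for the
    expected rework after a failure (half an interval on average).
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost, mtbf = 50.0, 10_000.0  # seconds; illustrative values only
    best = young_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        print(interval, overhead_fraction(interval, cost, mtbf))
```

A model with a varying interval, as the abstract describes, would replace the single constant returned by `young_interval` with a sequence of checkpoint times derived from the failure distribution.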
An autoencoder compression approach for accelerating large-scale inverse problems
PDE-constrained inverse problems are some of the most challenging and
computationally demanding problems in computational science today. Fine meshes
that are required to accurately compute the PDE solution introduce an enormous
number of parameters and require large scale computing resources such as more
processors and more memory to solve such systems in a reasonable time. For
inverse problems constrained by time dependent PDEs, the adjoint method that is
often employed to efficiently compute gradients and higher order derivatives
requires solving a time-reversed, so-called adjoint PDE that depends on the
forward PDE solution at each timestep. This necessitates the storage of a high
dimensional forward solution vector at every timestep. Such a procedure quickly
exhausts the available memory resources. Several approaches that trade
additional computation for reduced memory footprint have been proposed to
mitigate the memory bottleneck, including checkpointing and compression
strategies. In this work, we propose a close-to-ideal scalable compression
approach using autoencoders to eliminate the need for checkpointing and
substantial memory storage, thereby reducing both the time-to-solution and
memory requirements. We compare our approach with checkpointing and an
off-the-shelf compression approach on an earth-scale ill-posed seismic inverse
problem. The results verify the expected close-to-ideal speedup for both the
gradient and Hessian-vector product using the proposed autoencoder compression
approach. To highlight the usefulness of the proposed approach, we combine the
autoencoder compression with the data-informed active subspace (DIAS) prior to
show how the DIAS method can be affordably extended to large scale problems
without the need of checkpointing and large memory
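The checkpointing strategy that this abstract uses as a comparison point can be sketched as follows: keep only every `stride`-th forward state, then recompute each segment when the reverse (adjoint) sweep needs it. This is a minimal uniform-stride illustration with hypothetical names (`step`, `reverse_states`); it is not the paper's autoencoder method, and it assumes `step` is deterministic and that states are immutable values.

```python
def reverse_states(step, x0, n_steps, stride):
    """Yield (t, x_t) for t = n_steps-1 down to 0, where x_{t+1} = step(x_t).

    Only every `stride`-th forward state is kept; the states inside a
    segment are recomputed from its checkpoint during the reverse sweep.
    """
    if stride < 1 or n_steps < 0:
        raise ValueError("stride must be >= 1 and n_steps >= 0")

    # Forward sweep: store sparse checkpoints only.
    checkpoints = {}
    x = x0
    for t in range(n_steps):
        if t % stride == 0:
            checkpoints[t] = x
        x = step(x)

    # Reverse sweep: walk the segments from last to first.
    last_start = ((n_steps - 1) // stride) * stride
    for start in range(last_start, -1, -stride):
        segment = [checkpoints[start]]
        end = min(start + stride, n_steps)
        for _ in range(start + 1, end):
            segment.append(step(segment[-1]))  # recomputation cost
        for offset in range(len(segment) - 1, -1, -1):
            yield start + offset, segment[offset]


if __name__ == "__main__":
    for t, x in reverse_states(lambda v: 2 * v + 1, 1, n_steps=10, stride=4):
        print(t, x)
```

Memory drops from `n_steps` stored states to roughly `n_steps / stride + stride`, at the price of about one extra forward sweep; that extra computation is the cost a compression approach aims to avoid.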
GPUs as Storage System Accelerators
Massively multicore processors, such as Graphics Processing Units (GPUs),
provide, at a comparable price, peak performance an order of magnitude higher
than that of traditional CPUs. This drop in the cost of computation, like any
order-of-magnitude drop in the cost per unit of performance for a class of
system components, triggers the opportunity to redesign systems and to explore
new ways to engineer them to recalibrate the cost-to-performance relation. This
project explores the feasibility of harnessing GPUs' computational power to
improve the performance, reliability, or security of distributed storage
systems. In this context, we present the design of a storage system prototype
that uses GPU offloading to accelerate a number of computationally intensive
primitives based on hashing, and introduce techniques to efficiently leverage
the processing power of GPUs. We evaluate the performance of this prototype
under two configurations: as a content addressable storage system that
facilitates online similarity detection between successive versions of the same
file and as a traditional system that uses hashing to preserve data integrity.
Further, we evaluate the impact of offloading to the GPU on competing
applications' performance. Our results show that this technique can bring
tangible performance gains without negatively impacting the performance of
concurrently running applications. Comment: IEEE Transactions on Parallel and Distributed Systems, 201
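The hashing primitive behind content-addressable similarity detection can be illustrated on the CPU: split each file version into chunks, hash every chunk, and treat chunks whose digests are already known as duplicates. The sketch below assumes fixed-size chunks and SHA-256 from Python's standard `hashlib`; the chunk size, function names, and choice of hash are illustrative assumptions, not details taken from the prototype, which offloads the hashing to a GPU.

```python
import hashlib


def chunk_hashes(data, chunk_size=4096):
    """Return the SHA-256 hex digest of each fixed-size chunk of `data`."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def new_chunk_indices(old_version, new_version, chunk_size=4096):
    """Indices of chunks in `new_version` whose content is absent from `old_version`."""
    known = set(chunk_hashes(old_version, chunk_size))
    return [
        index
        for index, digest in enumerate(chunk_hashes(new_version, chunk_size))
        if digest not in known
    ]


if __name__ == "__main__":
    old = b"a" * 4096 + b"b" * 4096
    new = b"a" * 4096 + b"c" * 4096
    # Only the second chunk differs, so only it would need to be stored.
    print(new_chunk_indices(old, new))
```

Fixed-size chunking is the simplest choice but loses alignment when bytes are inserted; content-defined chunk boundaries avoid that at a higher hashing cost, which is the kind of workload that motivates offloading.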
Checkpoint and run-time adaptation with pluggable parallelisation
Enabling applications for computational Grids requires new approaches to develop applications that can effectively cope with resource volatility. Applications must be resilient to resource faults, adapting their behaviour to available resources. This paper describes an approach to application-level adaptation that efficiently supports application-level checkpointing. The key to this work is the concept of pluggable parallelisation, which localises parallelisation issues into multiple modules that can be (un)plugged to match resource availability. This paper shows how pluggable parallelisation can be extended to effectively support checkpointing and run-time adaptation. We present the developed pluggable mechanism that helps the programmer to include checkpointing in the base (sequential). Based on these mechanisms and on previous work on pluggable parallelisation, our approach is able to automatically add support for checkpointing in parallel execution environments. Moreover, applications can adapt from a sequential execution to a multi-cluster configuration. Adaptation can be performed by checkpointing the application and restarting in a different mode or can be performed during run-time. Pluggable parallelisation intrinsically promotes the separation of software functionality from fault-tolerance and adaptation issues, facilitating their analysis and evolution. The work presented in this paper reinforces this idea by showing the feasibility of the approach and the performance benefits that can be achieved.
Extending Scojo-PECT by migration based on system-level checkpointing
In recent years, a significant amount of research has been done on job scheduling in the high performance computing area. Parallel jobs have different running times and require different numbers of processors, so jobs need to be scheduled and packed to improve system utilization. Scojo-PECT is a job scheduler which provides service guarantees by using coarse-grain time sharing. However, Scojo-PECT does not provide process migration. We extend Scojo-PECT by migrating parallel jobs based on system-level checkpointing. We investigate different cases in the Scojo-PECT scheduling algorithm where migration based on system-level checkpointing can be used to improve resource utilization and reduce job response time. Our experimental results show a reduction in relative response times for medium jobs compared with the original Scojo-PECT scheduler, while long jobs do not suffer any disadvantage
Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems
High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the first supercomputer capable of executing more than an exaflop, 10^18 floating point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extreme-scale systems. Coordinated checkpointing is currently the most widely-used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains effective on current systems, increasing the scale of today's systems to build next-generation systems will increase the cost of fault tolerance as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniques include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and software-based memory fault correction. In this thesis, I examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, I evaluate the potential impact of rollback avoidance on these systems. I then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, I examine the feasibility of using this technique to protect against memory faults in kernel memory
Analysis of Performance-impacting Factors on Checkpointing Frameworks: The CPPC Case Study
This is a post-peer-review, pre-copyedit version of an article published in The Computer Journal. The final authenticated version is available online at: https://doi.org/10.1093/comjnl/bxr018
[Abstract] This paper focuses on the performance evaluation of Compiler for Portable Checkpointing (CPPC), a tool for the checkpointing of parallel message-passing applications. Its performance and the factors that impact it are transparently and rigorously identified and assessed. The tests were performed on a public supercomputing infrastructure, using a large number of very different applications and showing excellent results in terms of performance and effort required for integration into user codes. Statistical analysis techniques have been used to better approximate the performance of the tool. Quantitative and qualitative comparisons with other rollback-recovery approaches to fault tolerance are also included. All these data and comparisons are then discussed in an effort to extract meaningful conclusions about the state-of-the-art and future research trends in the rollback-recovery field.
Ministerio de Ciencia e Innovación; TIN2010-1673
Resource-Efficient Replication and Migration of Virtual Machines.
Continuous replication and live migration of Virtual Machines (VMs) are two vital tools in a virtualized environment, but they are resource-expensive. Continuously replicating a VM's checkpointed state to a backup host maintains high-availability (HA) of the VM despite host failures, but checkpoint replication can generate significant network traffic. Each replicated VM also incurs a 100% memory overhead, since the backup unproductively reserves the same amount of memory to hold the redundant VM state. Live migration, though being widely used for load-balancing, power-saving, etc., can also generate excessive network traffic, by transferring VM state iteratively. In addition, it can incur a long completion time and degrade application performance.
This thesis explores ways to replicate VMs for HA using resources efficiently, and to migrate VMs fast, with minimal execution disruption and using resources efficiently. First, we investigate the tradeoffs in using different compression methods to reduce the network traffic of checkpoint replication in a HA system. We evaluate gzip, delta and similarity compressions based on metrics that are specifically important in a HA system, and then suggest guidelines for their selection.
Next, we propose HydraVM, a storage-based HA approach that eliminates the unproductive memory reservation made in backup hosts. HydraVM maintains a recent image of a protected VM in a shared storage by taking and consolidating incremental VM checkpoints. When a failure occurs, HydraVM quickly resumes the execution of a failed VM by loading a small amount of essential VM state from the storage. As the VM executes, the VM state not yet loaded is supplied on-demand.
Finally, we propose application-assisted live migration, which skips transfer of VM memory that need not be migrated to execute running applications at the destination. We develop a generic framework for the proposed approach, and then use the framework to build JAVMM, a system that migrates VMs running Java applications while skipping the transfer of garbage in Java memory. Our evaluation results show that compared to Xen live migration, which is agnostic of running applications, JAVMM can reduce the completion time, network traffic and application downtime caused by Java VM migration, all by up to over 90%.
PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/111575/1/karenhou_1.pd
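The tradeoff between general-purpose and delta compression of checkpoint traffic can be illustrated with a small sketch: compress a memory page directly, or XOR it against the previously replicated copy and compress the mostly-zero result. This is an assumption-laden illustration, not the thesis's implementation: it uses Python's standard `zlib` (DEFLATE, the algorithm underlying gzip), an XOR delta chosen for simplicity, and hypothetical helper names and page contents.

```python
import random
import zlib


def compressed_size(page):
    """Size in bytes of the page after plain DEFLATE compression."""
    return len(zlib.compress(page))


def make_delta(previous, current):
    """XOR the page against its previous version, then compress the result."""
    if len(previous) != len(current):
        raise ValueError("pages must have the same length")
    xored = bytes(a ^ b for a, b in zip(previous, current))
    return zlib.compress(xored)


def apply_delta(previous, delta):
    """Rebuild the current page on the backup from its old copy and the delta."""
    xored = zlib.decompress(delta)
    return bytes(a ^ b for a, b in zip(previous, xored))


def demo_pages(size=4096, seed=0):
    """Hypothetical test data: a high-entropy page and a lightly modified copy."""
    rng = random.Random(seed)
    previous = bytes(rng.getrandbits(8) for _ in range(size))
    current = bytearray(previous)
    for index in (10, 200, 3000):
        current[index] ^= 0xFF
    return previous, bytes(current)


if __name__ == "__main__":
    prev_page, cur_page = demo_pages()
    delta = make_delta(prev_page, cur_page)
    print("plain:", compressed_size(cur_page), "delta:", len(delta))
```

The delta wins here because the page changed little between checkpoints; it also requires the sender to keep the previous copy, which is one of the costs a selection guideline has to weigh.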
Laadunvarmistustyökalujen varmistusvedostus järjestelmätasolla (System-level checkpointing of quality assurance tools)
In modern software development, many kinds of verification are performed to prevent regressions and to ensure the robustness of the software. Execution of verification tasks is usually automated with continuous delivery (CD) systems built on CD platforms.
Currently available CD platforms (Jenkins, Concourse, GoCD) are essentially job schedulers based on the traditional job scheduling model. They execute tasks to completion in order of arrival. This model is known to cause user dissatisfaction due to long wait times when the variation in task execution times is high. It is also known to exhibit low resource utilization. This prevents the integration of new kinds of verification, reduces cost-effectiveness and decreases developer productivity.
Preemption, that is, task switching, makes scheduling much more flexible. It greatly improves the system's responsiveness by reducing wait times. It solves the problem of short tasks having to wait a long time for long tasks to complete. By enabling time-slicing of resources, it increases their utilization. The result is an interactive service for developers, supporting more kinds of verification in CD and enabling more value to be extracted from the available compute resources.
Implementing preemption requires the ability to suspend and resume the execution of verification tools. We evaluate system-level checkpointing, a technique used for preemption in high performance computing, which does not require modification of the verification tools. We selected Checkpoint and Restore in Userspace (CRIU) as the checkpointing utility to be evaluated. We evaluated CRIU's capability to checkpoint verification tools and measured checkpoint creation time and checkpoint image size. We selected AFL, AddressSanitizer, Valgrind and Android Emulator as the tools to be tested.
Our results show that CRIU is not yet capable of preempting arbitrary verification tools, as only AFL and Valgrind were checkpointable. Checkpoint creation was fast, making it feasible for interactive use in a CD system. The checkpoint image size was found to depend on the verification tool's memory size, as expected, meaning most tools would be feasible for preemption to network storage in a cluster.
Modern software development uses many kinds of quality assurance methods to prevent regressions and to ensure the robustness of software. The execution of such tasks is usually automated with continuous delivery (CD) systems built on a CD platform.
The available CD platforms (Jenkins, Concourse, GoCD) are essentially job schedulers based on the traditional cluster-computing scheduling model. They run tasks from start to finish in order of arrival. When task durations vary, wait times grow long, so the user experience of the model is poor. Resources are also not used efficiently. These problems prevent the adoption of new verification methods and reduce cost-effectiveness and developer productivity.
Task preemption makes scheduling flexible. It shortens wait times considerably. Short tasks no longer have to wait for long-running tasks to finish, and resources are used more efficiently. This gives developers an interactive user experience, allows new kinds of verification methods to be adopted, and extracts more value from compute resources.
To implement preemption, the execution of quality assurance tools must be interruptible. In this work we evaluate system-level checkpointing, a technique used in high performance computing for task preemption. The technique requires no changes to the tools. We examine the Checkpoint and Restore in Userspace (CRIU) checkpointing tool, its ability to preempt quality assurance tools, and the time taken to create a checkpoint and the checkpoint's size. The quality assurance tools tested were AFL, AddressSanitizer, Valgrind and Android Emulator.
It turned out that CRIU is not yet capable of preempting all quality assurance tools, as only AFL and Valgrind among the tested tools could be checkpointed. Checkpoint creation was fast, which makes checkpointing usable in interactive CD systems. As expected, the checkpoint size depended on the memory size of the quality assurance tool, so preempting the most common tools in compute clusters that use network storage would be feasible