91 research outputs found

    Compiler-assisted checkpointing of message-passing applications in heterogeneous environments

    [Abstract] With the evolution of high performance computing towards heterogeneous, massively parallel systems, parallel applications have developed new checkpoint and restart necessities. Whether due to a failure in the execution or to a migration of the processes to different machines, checkpointing tools must be able to operate in heterogeneous environments. However, some of the data manipulated by a parallel application are not truly portable. Examples include opaque state (e.g. data structures for communications support) and diversity of interfaces for a single feature (e.g. communications, I/O). Directly manipulating the underlying ad-hoc representations renders checkpointing tools incapable of working across different environments. Portable checkpointers usually work around portability issues at the cost of transparency: the user must provide information such as what data needs to be stored, where to store it, or where to checkpoint. CPPC (ComPiler for Portable Checkpointing) is a checkpointing tool designed to provide both portability and transparency while preserving the scalability of the executed applications. It is made up of a library and a compiler. The CPPC library contains routines for variable-level checkpointing, using portable code and protocols. The CPPC compiler achieves transparency by relieving the user of time-consuming tasks, such as performing code analyses and adding instrumentation code.
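
    As a loose illustration of the variable-level approach, the sketch below registers individual program variables and saves only those at explicit safe points. The cppc_register and cppc_checkpoint names are hypothetical stand-ins rather than CPPC's actual interface, and a truly portable checkpointer would write an architecture-neutral encoding instead of the raw bytes used here:

```cpp
// Hypothetical variable-level checkpointing sketch (not the real CPPC API).
#include <cstdio>

struct Registered { void *addr; std::size_t size; };
static Registered g_vars[64];
static int g_nvars = 0;

// Mark a variable as part of the recoverable state.
void cppc_register(void *addr, std::size_t size) {
    g_vars[g_nvars++] = { addr, size };
}

// At a safe point, dump only the registered variables.
void cppc_checkpoint(const char *path) {
    FILE *f = std::fopen(path, "wb");
    if (!f) return;
    for (int i = 0; i < g_nvars; ++i)
        std::fwrite(g_vars[i].addr, 1, g_vars[i].size, f);
    std::fclose(f);
}

int main() {
    double sum = 0.0;
    long iter = 0;
    cppc_register(&sum, sizeof sum);   // in CPPC such calls are inserted
    cppc_register(&iter, sizeof iter); // by the compiler, not by the user
    for (iter = 0; iter < 1000000; ++iter) {
        sum += 1.0 / (iter + 1);
        if (iter % 100000 == 0)
            cppc_checkpoint("state.ckpt"); // safe point: state is consistent
    }
    std::printf("sum = %f\n", sum);
}
```

    The transparency claimed above comes precisely from the compiler inserting such registration and checkpoint calls after its own code analysis, so the user never writes them by hand.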

    Standard-compliant snapshotting for SystemC virtual platforms

    The steady increase in complexity of high-end embedded systems goes along with an increasingly complex design process. We are currently still in a transition phase from Hardware Description Language (HDL) based design towards virtual-platform-based design of embedded systems. As design complexity rises faster than developer productivity, a gap forms. Restoring productivity while at the same time managing increased design complexity can be achieved by focusing on the development of new tools and design methodologies. In most application areas, high-level modelling languages such as SystemC are used in early design phases. In modern software development, Continuous Integration (CI) is used to automatically test whether a submitted piece of code breaks functionality. Applying the CI concept to embedded system design and testing requires fast build and test execution times from the virtual platform framework, which in turn makes the ability to save a specific state of a virtual platform necessary. Saving and restoring specific states of a simulation requires the ability to serialize all data structures within the simulation models. Improving the frameworks and establishing better methods will only help to narrow the design gap if these changes are introduced with the needs of the engineers and developers in mind; ultimately, it is their productivity that shall be improved. The ability to save the state of a virtual platform enables developers to run longer test campaigns that can even contain randomized test stimuli, and if the saved states are modifiable, developers can inject faulty states into the simulation models. This work contributes an extension to the SoCRocket virtual platform framework to enable snapshotting. The snapshotting extension can be considered a reference implementation, as the use of current SystemC/TLM standards makes it compatible with other frameworks. Furthermore, integrating the UVM SystemC library into the framework enables test-driven development and fast validation of SystemC/TLM models using snapshots. These extensions narrow the design gap by supporting designers, testers and developers in working more efficiently.
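
    To make the serialization requirement concrete, the following minimal C++ sketch shows the save/restore pattern for a model's data members. TimerModel is an invented stand-in for a SystemC module's state; the actual SoCRocket extension works through current SystemC/TLM standard mechanisms rather than hand-written binary I/O:

```cpp
// Minimal save/restore sketch for snapshotting a model's state.
// TimerModel is a hypothetical stand-in for an SC_MODULE's data members.
#include <cstdint>
#include <fstream>

struct TimerModel {
    uint64_t counter = 0;
    bool     irq_pending = false;

    void save(std::ostream &os) const {   // serialize every state member
        os.write(reinterpret_cast<const char*>(&counter), sizeof counter);
        os.write(reinterpret_cast<const char*>(&irq_pending), sizeof irq_pending);
    }
    void restore(std::istream &is) {      // exact inverse of save()
        is.read(reinterpret_cast<char*>(&counter), sizeof counter);
        is.read(reinterpret_cast<char*>(&irq_pending), sizeof irq_pending);
    }
};

int main() {
    TimerModel t;
    t.counter = 42; t.irq_pending = true;
    { std::ofstream os("snap.bin", std::ios::binary); t.save(os); }        // snapshot
    TimerModel fresh;
    { std::ifstream is("snap.bin", std::ios::binary); fresh.restore(is); } // restore
    return fresh.counter == 42 ? 0 : 1;
}
```

    Restoring a whole virtual platform amounts to doing this consistently for every model in the simulation, which is why serializability of all data structures is called out as a prerequisite.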

    Soft-fault recovery in MPI applications

    [Abstract] Current high-performance computing (HPC) systems comprise thousands of CPU cores, and this number is expected to grow into the millions in the near future. With such an elevated number of processors, the mean time between failures (MTBF) can become so small that most scientific applications will not have time to complete their execution before a failure occurs. It is therefore critical to develop fault tolerance and resilience mechanisms in order to guarantee the completion and integrity of massively parallel applications. The University of A Coruña's Computer Architecture Group (GAC) proposed a solution (Controller/comPiler for Portable Checkpointing - CPPC) to transparently convert generic MPI applications into fault-tolerant applications based on a checkpoint-restart scheme. CPPC was extended by Nuria Losada into CPPC-resilience in order to build resilient MPI applications, that is, applications capable of detecting and reacting to failures without aborting, so that survivor processes do not have to be restarted. This was accomplished by means of a message-logging protocol and the use of a fault tolerance interface proposed as an addition to the MPI standard (User Level Failure Mitigation, ULFM). However, this system cannot handle soft errors efficiently: it kills and respawns the failed processes entirely even though this is unnecessary, as such errors are transient in nature. The objective of this project is to extend and adapt CPPC-resilience to handle soft errors more efficiently, without having to respawn the failed processes. The proposal has been evaluated using 3 MPI applications with different characteristics, achieving a decrease in recovery times after a soft error ranging from 2 to 44 percent, depending on the total number of processes involved.
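
    The following sketch illustrates the general in-place recovery idea with plain MPI and application-level checkpoints: on a (simulated) soft fault, the process reloads its own last checkpoint and resumes instead of being killed and respawned. It is an assumption-laden toy, not CPPC-resilience's actual mechanism, and the soft_fault detection hook is deliberately left as a stub:

```cpp
// In-place soft-error recovery sketch: reload the local checkpoint, no respawn.
#include <mpi.h>
#include <fstream>
#include <string>

static bool load(const std::string &f, long &i, double &acc) {
    std::ifstream is(f, std::ios::binary);
    if (!is) return false;
    is.read(reinterpret_cast<char*>(&i), sizeof i);
    is.read(reinterpret_cast<char*>(&acc), sizeof acc);
    return true;
}

static void store(const std::string &f, long i, double acc) {
    std::ofstream os(f, std::ios::binary);
    os.write(reinterpret_cast<const char*>(&i), sizeof i);
    os.write(reinterpret_cast<const char*>(&acc), sizeof acc);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::string ckpt = "rank" + std::to_string(rank) + ".ckpt";

    long i = 0; double acc = 0.0;
    load(ckpt, i, acc);                           // resume if a checkpoint exists
    for (; i < 1000000; ++i) {
        acc += 1.0 / (i + 1);
        if (i % 100000 == 0) store(ckpt, i, acc); // periodic local checkpoint
        bool soft_fault = false;                  // detection hook (stub)
        if (soft_fault) load(ckpt, i, acc);       // roll back in place, no respawn
    }
    double sum = 0.0;
    MPI_Reduce(&acc, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```

    Because the faulty process itself survives a transient error, reloading its own state avoids the respawn-and-reconnect cost that dominates recovery in the fail-stop case, which is consistent with the 2 to 44 percent reductions reported above.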

    Files as first-class objects in fault-tolerant concurrent systems

    Concurrent systems are used in applications where multiple processors are needed to complete tasks within a reasonable amount of time, or where the data sets involved will not fit within the main memory of a single computer. Because of their reliance on multiple machines, such systems are proportionally more vulnerable to both hardware- and software-induced failures. Fault-tolerance schemes are used to recover some earlier consistent state of the system after such a failure. One important technique used to achieve fault tolerance is checkpointing and rollback-recovery. In this thesis, we present a method for efficiently and transparently incorporating the part of the process state contained in the file system into process checkpoints, and we show how recovery of consistent versions of the file system and processes may be done after a failure. We present the details of a prototype system which implements our method. We show that by using the special properties of the log-structured file system, the class of programs which are amenable to checkpointing and rollback-recovery schemes can be expanded to include those that use files. We impose no a priori restriction on the types of file system operations that can be done, and we demonstrate that our scheme does not impose significant failure-free overhead on the computation.
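
    A rough sketch of what folding file state into a checkpoint can mean in practice: capture each open file's path and offset at checkpoint time, then reopen and seek on rollback. The helper names are invented for illustration; the thesis's log-structured file system additionally allows earlier file versions to be recovered, which this toy does not attempt:

```cpp
// Toy "file state in the checkpoint" sketch; names are hypothetical.
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

struct FileState { std::string path; long offset; };

// At checkpoint time, record each open file's position.
std::vector<FileState> checkpoint_files(
        const std::vector<std::pair<std::string, FILE*>> &open_files) {
    std::vector<FileState> snap;
    for (auto &p : open_files)
        snap.push_back({p.first, std::ftell(p.second)});
    return snap;
}

// On rollback, reopen each file and restore its position.
std::vector<std::pair<std::string, FILE*>> rollback_files(
        const std::vector<FileState> &snap) {
    std::vector<std::pair<std::string, FILE*>> reopened;
    for (auto &s : snap) {
        FILE *f = std::fopen(s.path.c_str(), "r+");
        if (!f) continue;   // a real system would recover the old LFS version
        std::fseek(f, s.offset, SEEK_SET);
        reopened.push_back({s.path, f});
    }
    return reopened;
}

int main() {
    FILE *f = std::fopen("demo.txt", "w+");
    std::fputs("hello", f);
    std::vector<std::pair<std::string, FILE*>> open_files = {{"demo.txt", f}};
    auto snap = checkpoint_files(open_files);   // offset 5 captured
    std::fputs(" world", f);                    // post-checkpoint activity
    std::fclose(f);
    auto restored = rollback_files(snap);       // reopen and seek back to 5
    for (auto &p : restored) std::fclose(p.second);
}
```

    Note that restoring the offset alone is not enough in general; undoing data written after the checkpoint is exactly where the log-structured file system's retained old versions come in.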

    Mitigation of failures in high performance computing via runtime techniques

    As machines increase in scale, failure rates of supercomputers are predicted to increase correspondingly. Even though the mean time to failure (MTTF) of an individual component is high, the large number of components significantly decreases the system MTTF. Meanwhile, the decreasing size of transistors has been critical to the increase in capacity of supercomputers, but the smaller the transistors are, the more frequently silent data corruptions (SDCs) are likely to occur. SDCs do not inhibit execution, but may silently lead to incorrect results. In this thesis, we leverage runtime-system and compiler techniques to mitigate a significant fraction of failures automatically with low overhead. The main goals of the system-level fault tolerance strategies designed in this thesis are: reducing the extra cost added to application execution while improving system reliability; automatically adjusting fault tolerance decisions, without user intervention, based on environmental changes; and protecting applications not only from fail-stop failures but also from silent data corruptions. The main contributions of this thesis are the development of a semi-blocking checkpoint protocol that overlaps application execution with fault tolerance operations to reduce the overhead of checkpointing, a runtime-system technique for automatic checkpoint and restart without user intervention, a holistic framework (ACR) for automatically detecting and recovering from silent data corruptions, and a framework called FlipBack that provides targeted protection against silent data corruption at low cost.
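
    A minimal sketch of the semi-blocking idea, assuming a simple staging-buffer design: the application blocks only long enough to copy its state, and a helper thread writes the copy to disk while computation continues. This illustrates the overlap principle only; it is not the protocol implemented in the thesis:

```cpp
// Semi-blocking checkpoint sketch: stage state, write it on a helper thread.
#include <fstream>
#include <thread>
#include <vector>

int main() {
    std::vector<double> state(1 << 20, 1.0);   // the application's state
    std::thread writer;

    for (int step = 0; step < 10; ++step) {
        for (auto &x : state) x *= 1.000001;   // stand-in for real computation

        if (step % 5 == 0) {
            if (writer.joinable()) writer.join();   // previous dump finished?
            std::vector<double> staged = state;     // short blocking copy only
            writer = std::thread([staged = std::move(staged)] {
                std::ofstream os("step.ckpt", std::ios::binary);
                os.write(reinterpret_cast<const char*>(staged.data()),
                         staged.size() * sizeof(double)); // overlapped with compute
            });
        }
    }
    if (writer.joinable()) writer.join();
}
```

    The blocking window shrinks from the full disk-write time to a memory copy, which is the source of the reduced checkpointing overhead the abstract refers to.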

    Prediction-based failure management for supercomputers

    The growing requirements of a diversity of applications necessitate the deployment of large and powerful computing systems, and failures in these systems may cause severe damage, from loss of human life to harm to the world economy. However, current fault tolerance techniques cannot meet the increasing requirements for reliability, so new solutions are urgently needed; research on proactive schemes is one direction that may offer better efficiency. This thesis proposes a novel proactive failure management framework. Its goal is to reduce failure penalties and improve fault tolerance efficiency in supercomputers running complex applications. The proposed proactive scheme builds on two core components: failure prediction and proactive failure recovery. More specifically, the failure prediction component is based on the assessment of system events and employs semi-Markov models to capture the dependencies between failures and other events for the forecasting of forthcoming failures. Furthermore, a two-level failure prediction strategy is described that not only estimates future failure occurrences but also identifies the specific failure categories. Based on accurate failure forecasting, a prediction-based coordinated checkpoint mechanism is designed to construct extra checkpoints just before each predicted failure occurrence, so that the wasted computational time can be significantly reduced. Moreover, a theoretical model has been developed to assess the proactive scheme, enabling calculation of the overall wasted computational time. The prediction component has been applied to industrial data from the IBM BlueGene/L system. Results show a substantial improvement in prediction accuracy in comparison with three other well-known prediction approaches, and demonstrate that the semi-Markov based predictor, which achieved a precision of 87.41% and a recall of 77.95%, performs better than the other predictors.
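
    For reference, the precision and recall figures quoted above follow the standard definitions over true positives (TP, correctly predicted failures), false positives (FP, predicted failures that did not occur) and false negatives (FN, missed failures). Shown with them, for context only, is Young's classic first-order approximation of the checkpoint interval that minimizes wasted computational time (a textbook result, not necessarily the exact model developed in this thesis):

```latex
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
\tau_{\mathrm{opt}} \approx \sqrt{2\,\delta\,M}
```

    Here $\delta$ is the time to write a single checkpoint and $M$ is the system's mean time between failures; accurate prediction effectively lets extra checkpoints be placed just before failures instead of relying on the periodic interval alone.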

    A checkpointing mechanism for GPU-intensive HPC applications

    Please refer to pdf. James Watt Scholarship; Engineering and Physical Sciences Research Council (EPSRC) grants EP/N028201/1 and EP/L00058X/

    Multiparty session types for dynamic verification of distributed systems

    In large-scale distributed systems, each application is realised through interactions among distributed components. To guarantee safe communication (no deadlocks or communication mismatches) we need programming languages and tools that structure, manage, and policy-check these interactions. Multiparty session types (MPST), a typing discipline for structured interactions between communicating processes, offer a promising approach. To date, however, applications of session types have been limited to static verification, which is not always feasible and is often restrictive in terms of programming API and policy specification. This thesis investigates the design and implementation of a runtime verification framework that ensures conformance between programs and specifications. Specifications are written in Scribble, a protocol description language formally founded on MPST. The central idea of the approach is a dynamic monitor, which takes the form of a communicating finite state machine automatically generated from Scribble specifications, together with a communication runtime stipulating a message format. We extend and apply Scribble-based runtime verification in manifold ways. First, we implement a Python library equipped with session primitives and a verification runtime, and integrate it into a large cyber-infrastructure project for oceanography. Second, we examine multiple communication patterns, which reveal and motivate two novel extensions: asynchronous interrupts for the verification of exception-handling behaviours, and time constraints for the enforcement of real-time protocols. Third, we apply the verification framework to actor programming by augmenting an actor library in Python with protocol annotations. For both implementations, measurements show that Scribble-based dynamic checking incurs minimal overhead and allows expressive specifications. Finally, we explore a static analysis of Scribble specifications to efficiently compute a safe global state from which a monitored system of interacting processes can be recovered after a failure, and we provide an implementation of a verification framework for recovery in Erlang. Benchmarks show that our recovery strategy outperforms Erlang's built-in static recovery strategy on a number of use cases.
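
    The core of the approach can be pictured as a small communicating finite state machine that checks each message label before delivery. The two-message protocol below is invented for illustration; monitors generated from Scribble specifications are, of course, far richer:

```cpp
// Toy dynamic-verification monitor: an FSM checks each message label against
// the protocol before delivery. The protocol here is a made-up example.
#include <iostream>
#include <map>
#include <string>
#include <utility>

int main() {
    // transition table: (state, message label) -> next state
    std::map<std::pair<int, std::string>, int> delta = {
        {{0, "request"},  1},
        {{1, "response"}, 0},
    };
    int state = 0;
    for (const std::string &msg : {"request", "response", "response"}) {
        auto it = delta.find({state, msg});
        if (it == delta.end()) {                   // no transition: violation
            std::cout << "protocol violation at '" << msg << "'\n";
            return 1;
        }
        state = it->second;                        // advance the monitor
    }
    std::cout << "trace conforms to protocol\n";
}
```

    Because checking a transition is a single table lookup per message, this style of monitoring is cheap, which matches the minimal-overhead measurements reported above.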

    SUDS: automatic parallelization for raw processors

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. Includes bibliographical references (p. 177-181). A computer can never be too fast or too cheap. Computer systems pervade nearly every aspect of science, engineering, communications and commerce because they perform certain tasks at rates unachievable by any other kind of system built by humans. A computer system's throughput, however, is constrained by that system's ability to find concurrency. Given a particular target work load, the computer architect's role is to design mechanisms to find and exploit the available concurrency in that work load. This thesis describes SUDS (Software Un-Do System), a compiler and runtime system that can automatically find and exploit the available concurrency of scalar operations in imperative programs with arbitrary unstructured and unpredictable control flow. The core compiler transformation that enables this is scalar queue conversion. Scalar queue conversion makes scalar renaming an explicit operation through a process similar to closure conversion, a technique traditionally used to compile functional languages. The scalar queue conversion compiler transformation is speculative, in the sense that it may introduce dynamic memory allocation operations into code that would not otherwise dynamically allocate memory. Thus, SUDS also includes a transactional runtime system that periodically checkpoints machine state, executes code speculatively, checks if the speculative execution produced results consistent with the original sequential program semantics, and then either commits or rolls back the speculative execution path. In addition to safely running scalar queue converted code, the SUDS runtime system safely permits threads to speculatively run in parallel and concurrently issue memory operations, even when the compiler is unable to prove that the reordered memory operations will always produce correct results. Using this combination of compile-time and runtime techniques, SUDS can find concurrency in programs where previous compiler-based renaming techniques fail because the programs contain unstructured loops, and where Tomasulo's algorithm fails because it sequentializes mispredicted branches. Indeed, we describe three application programs, with unstructured control flow, where the prototype SUDS system, running in software on a Raw microprocessor, achieves speedups equivalent to, or better than, an idealized, and unrealizable, model of a hardware implementation of Tomasulo's algorithm. by Matthew Ian Frank, Ph.D.
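
    The renaming idea behind scalar queue conversion can be pictured with closures: each deferred use of a scalar captures its own copy of the value, so later writes to the original cannot clobber it. This toy mimics the closure-conversion analogy from the abstract; it is not the SUDS compiler's actual transformation:

```cpp
// Explicit scalar renaming via closures, echoing closure conversion.
#include <functional>
#include <iostream>
#include <queue>

int main() {
    std::queue<std::function<void()>> deferred;
    int x = 0;
    for (int i = 0; i < 3; ++i) {
        x = i * 10;                  // a new "version" of scalar x
        deferred.push([x] {          // capture by value = explicit rename
            std::cout << "saw x = " << x << "\n";
        });
    }
    // x has since been overwritten, yet each deferred use sees its own copy.
    while (!deferred.empty()) { deferred.front()(); deferred.pop(); }
}
```

    The dynamic allocation the abstract warns about shows up here too: each queued closure heap-allocates storage for its private copy, which is exactly why the transformation is paired with a transactional runtime that can roll such speculation back.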