    Reliability for exascale computing : system modelling and error mitigation for task-parallel HPC applications

    As high performance computing (HPC) systems continue to grow, their fault rate increases. Applications running on these systems already have to deal with failures occurring on the order of every few hours or days, and some studies of future Exascale systems predict failures on the order of minutes. As a result, efficient fault tolerance solutions are needed to tolerate frequent failures. A fault tolerance solution for future HPC and Exascale systems must be low-cost, efficient and highly scalable. It should have low overhead in fault-free execution and provide fast restart, because long-running applications are expected to experience many faults during execution. Meanwhile, task-based dataflow parallel programming models (PM) are becoming a popular paradigm in HPC applications at large scale. For instance, we see the adoption of task-based dataflow parallelism in OpenMP 4.0, the OmpSs PM, Argobots and Intel Threading Building Blocks. In this thesis we propose fault-tolerance solutions for task-parallel dataflow HPC applications. Specifically, we first design and implement a checkpoint/restart and message-logging framework to recover from errors. We then develop performance models to investigate the benefits of our task-level frameworks when integrated with system-wide checkpointing. Moreover, we design and implement selective task replication mechanisms to detect and recover from silent data corruptions in task-parallel dataflow HPC applications. Finally, we introduce a runtime-based coding scheme to detect and recover from memory errors in these applications. Taken together, our schemes provide fairly high failure coverage, in which both computation and memory are protected against errors. Postprint (published version)
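
    The selective task replication idea in the abstract above can be illustrated with a minimal sketch. It is not the thesis's runtime: the names `run_replicated`, `run_task`, `FlakySum` and the `critical` flag are hypothetical, and the policy shown (run a selected task twice, compare, re-execute on mismatch until two results agree) is one plausible reading of "detect and recover from silent data corruptions".

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_replicated(task: Callable[..., T], args: Sequence, max_retries: int = 2) -> T:
    """Run a task twice on the same inputs; on mismatch, re-execute and vote."""
    first = task(*args)
    second = task(*args)
    if first == second:
        return first  # replicas agree: no corruption detected
    # Replicas disagree: one of them was silently corrupted.
    # Re-execute until a new result matches an earlier one.
    results = [first, second]
    for _ in range(max_retries):
        candidate = task(*args)
        if candidate in results:
            return candidate
        results.append(candidate)
    raise RuntimeError("no two replicas agreed")


def run_task(task: Callable[..., T], args: Sequence, critical: bool) -> T:
    """Replicate only the tasks marked critical (hypothetical selection rule)."""
    return run_replicated(task, args) if critical else task(*args)


class FlakySum:
    """Test helper: a sum whose first call returns a silently wrong value."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, xs):
        self.calls += 1
        total = sum(xs)
        return total + 1 if self.calls == 1 else total  # inject one error
```

    Replicating only selected tasks trades coverage for overhead: unreplicated tasks run once, at the risk of an undetected corruption.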

    Transaction Support for Ada

    This paper describes the transaction support framework OPTIMA and its implementation for Ada 95. First, a transaction model that fits concurrent programming languages is presented. Then the design of the framework is given. Applications from many different domains can benefit from using transactions; it is therefore important to provide means to customize the framework depending on the application requirements. This flexibility is achieved by using design patterns. Class hierarchies with classes implementing standard transactional behavior are provided, but a programmer is free to extend the hierarchies by implementing application-specific functionality. An interface for Ada programmers is presented and its use is demonstrated via a simple example.

    Replication of non-deterministic objects

    This thesis discusses replication of non-deterministic objects in distributed systems to achieve fault tolerance against crash failures. The objects replicated are the virtual nodes of a distributed application. Replication is viewed as an issue that is to be dealt with only during the configuration of a distributed application and that should not affect the development of the application. Hence, replication of virtual nodes should be transparent to the application. Like all measures to achieve fault tolerance, replication introduces redundancy in the system. Not surprisingly, the main difficulty is guaranteeing the consistency of all replicas such that they behave in the same way as if the object was not replicated (replication transparency). This is further complicated if active objects (like virtual nodes) are replicated, and these objects themselves can be clients of still further objects in the distributed application. The problems of replication of active non-deterministic objects are analyzed in the context of distributed Ada 95 applications. The ISO standard for Ada 95 defines a model for distributed execution based on remote procedure calls (RPC). Virtual nodes in Ada 95 use this as their sole communication paradigm, but they may contain tasks to execute activities concurrently, thus making the execution potentially non-deterministic due to implicit timing dependencies. Such non-determinism cannot be avoided by choosing deterministic tasking policies. I present two different approaches to maintain replica consistency despite this non-determinism. In a first approach, I consider the run-time support of Ada 95 as a black box (except for the part handling remote communications). This corresponds to a non-deterministic computation model. I show that replication of non-deterministic virtual nodes requires that remote procedure calls are implemented as nested transactions. 
Unfortunately, effects of failures are not local to the replicas of a virtual node: when a failure occurs, nested remote calls made to other virtual nodes must be undone. Also, using transactional semantics for RPCs necessitates a compromise regarding transparency: the application must identify global state for it cannot be determined reliably in an automatic way. Further study reveals that this approach cannot be implemented in a transparent way at all because the consistency criterion of Ada 95 (linearizability) is much weaker than that of transactions (serializability). An execution of remote procedure calls as transactions may thus lead to incompatibilities with the semantics of the programming language. If remotely called subprograms on a replicated virtual node perform partial operations, i.e., entry calls on global protected objects, deadlocks that cannot be broken can occur in certain cases. Such deadlocks do not occur when the virtual node is not replicated. The transactional semantics of RPCs must therefore be exposed to the application. A second approach is based on a piecewise deterministic computation model, i.e., the execution of a virtual node is seen as a sequence of deterministic state intervals. Whenever a non-deterministic event occurs, a new state interval is started. I study replica organization under this computation model (semi-active replication). In this model, all non-deterministic decisions are made on one distinguished replica (the leader), while all other replicas (the followers) are forced to follow the same sequence of non-deterministic events. I show that it suffices to synchronize the followers with the leader upon each observable event, i.e., when the leader sends a message to some other virtual node. It is not necessary to synchronize upon each and every non-deterministic event — which would incur a prohibitively high overhead. 
Non-deterministic events occurring on the leader between observable events are logged and sent to the followers just before the leader executes an observable event. Consequently, it is guaranteed that the followers will reach the same state as the leader, and thus the effects of failures remain mostly local to the replicas. A prototype implementation called RAPIDS (Replicated Ada Partitions In Distributed Systems) serves as a proof of concept for this second approach, demonstrating its feasibility. RAPIDS is an Ada 95 implementation of a replication manager for semi-active replication, built for the GNAT development system. It is entirely contained within the run-time support and hence largely transparent to the application.
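
    The leader/follower protocol described above (log non-deterministic events on the leader, forward the log to the followers just before an observable event) can be sketched as follows. This is a toy model in Python, not RAPIDS or Ada 95 code; the classes `Replica` and `Leader`, the integer state, and the use of a random number as the "non-deterministic event" are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Replica:
    state: int = 0

    def apply(self, event: int) -> None:
        # Deterministic transition: the same event sequence yields the same state.
        self.state = self.state * 31 + event


@dataclass
class Leader(Replica):
    followers: list = field(default_factory=list)
    log: list = field(default_factory=list)   # events not yet sent to followers
    sent: list = field(default_factory=list)  # messages made visible to others

    def nondeterministic_step(self, rng: random.Random) -> None:
        # The leader alone decides the outcome, records it, and applies it.
        event = rng.randrange(100)
        self.log.append(event)
        self.apply(event)

    def observable_event(self, message: str) -> None:
        # Synchronize followers first, then let the message leave the node.
        for follower in self.followers:
            for event in self.log:
                follower.apply(event)
        self.log.clear()
        self.sent.append(message)
```

    Between observable events the followers lag behind the leader. That lag is harmless in this model because no other node can have observed the leader's intermediate states.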

    Open Multithreaded Transactions: A Transaction Model for Concurrent Object-Oriented Programming

    To read the abstract, please go to my PhD home page

    Integrated Data, Message, and Process Recovery for Failure Masking in Web Services

    Modern Web Services applications encompass multiple distributed interacting components, possibly including millions of lines of code written in different programming languages. With this complexity, some bugs often remain undetected despite extensive testing procedures, and occasionally cause transient system failures. Incorrect failure handling in applications often leads to incomplete or unintentionally repeated request executions. A family of recovery protocols called interaction contracts provides a generic solution to this problem by means of system-integrated data, process, and message recovery for multi-tier applications. It is able to mask failures, and allows programmers to concentrate on the application logic, thus speeding up the development process. This thesis consists of two major parts. The first part formally specifies the interaction contracts using the state-and-activity chart language. Moreover, it presents a formal specification of a concrete Web Service that makes use of interaction contracts and contains no other error-handling actions. The formal specifications undergo verification in which crucial safety and liveness properties, expressed in temporal logics, are mathematically proved by means of model checking. In particular, it is shown that each end-user request is executed exactly once. The second part of the thesis demonstrates the viability of the interaction framework in a real-world system. More specifically, a cascadable Web Service platform, EOS, is built on widely used components, Microsoft Internet Explorer and a PHP application server, with interaction contracts integrated into them.

    Management of object-oriented action-based distributed programs

    PhD thesis. This thesis addresses the problem of managing the runtime behaviour of distributed programs. The thesis of this work is that management is fundamentally an information processing activity and that the object model, as applied to action-based distributed systems and database systems, is an appropriate representation of the management information. In this approach, the basic concepts of classes, objects, relationships, and atomic transition systems are used to form object models of distributed programs. Distributed programs are collections of objects whose methods are structured using atomic actions, i.e., atomic transactions. Object models are formed of two submodels, each representing a fundamental aspect of a distributed program. The structural submodel represents a static perspective of the distributed program, and the control submodel represents a dynamic perspective of it. Structural models represent the program's objects, classes and their relationships. Control models represent the program's object states, events, guards and actions, i.e., a transition system. Resolution of queries on the distributed program's object model enables the management system to control certain activities of distributed programs. At a different level of abstraction, the distributed program can be seen as a reactive system in which two subprograms interact: an application program and a management program. They interact only through sensors and actuators. Sensors are methods used to probe an object's state and actuators are methods used to change an object's state. The management program can prod the application program into action by activating the sensors and actuators available at the interface of the application program. Actions are determined by management policies that are encoded in the management program. This way of structuring the management system encourages a clear modularization of application and management distributed programs, allowing better separation of concerns. Management concerns can be dealt with by the management program, while functional concerns can be assigned to the application program. The object-oriented action-based computational model adopted by the management system provides a natural framework for the implementation of fault-tolerant distributed programs. Object orientation provides modularity and extensibility through object encapsulation. Atomic actions guarantee the consistency of the objects of the distributed program despite concurrency and failures. Replication of the distributed program provides increased fault tolerance by guaranteeing the consistent progress of the computation, even though some of the replicated objects can fail. A prototype management system based on the management theory proposed above has been implemented atop Arjuna, an object-oriented programming system that provides a set of tools for constructing fault-tolerant distributed programs. The management system is composed of two subsystems: Stabilis, a management system for structural information, and Vigil, a management system for control information. Example applications have been implemented to illustrate the use of the management system and to gather experimental evidence in support of the thesis. CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil): BROADCAST (Basic Research On Advanced Distributed Computing: from Algorithms to SysTems)

    JMSGroups: JMS compliant group communication

    Nowadays, computers are an indispensable part of our lives. They evolve rapidly and are more and more versatile. Computer networks have made the remote corners of the world just a click away. But unavoidably, any software or hardware component is subject to failure. Distributed systems spread over tens or hundreds of machines are particularly vulnerable to failures. Consequently, high availability and fault tolerance have become "must have" features for such systems. Software fault tolerance is achieved through a technique called replication, in which several software replicas are executed at the same time. If one or several of them fail, the others still provide the service. Software replication is often implemented using group communication, which provides communication primitives with various semantics and greatly simplifies the development of highly available and fault-tolerant services. However, despite tremendous advances in research and numerous prototypes, group communication remains confined to small niches and academic prototypes. In contrast, another technology, message-oriented middleware such as the Java Message Service (JMS), is widely used in distributed systems and has become a de facto standard. We believe that the lack of a well-defined and easily understandable standard is what hinders the deployment of group communication systems. Since JMS is a well-established technology, we propose to extend JMS by adding group communication primitives to it. Foremost, this requires extending the traditional semantics of group communication to take into account various features of JMS, e.g., durable/non-durable subscriptions and persistent/non-persistent messages. The resulting new group communication specification, together with the corresponding API, defines group communication primitives compatible with JMS, which we call JMSGroups. To validate the specification and API we provide a prototype implementation of JMSGroups. As such, we believe it facilitates the acceptance of group communication by a larger community and provides a powerful environment for building fault-tolerant applications.

    Transactional actors in cooperative information systems

    Transaction management in advanced distributed information systems is a very important issue under research scrutiny, with many technical and open problems. Most research and development activities use conventional database technology to address this issue. The transaction model presented in this thesis combines attractive properties of the actor model of computation with advanced database transaction concepts in an object-oriented environment to address the transactional needs of cooperative information systems. The novel notion of a transaction tree in our model includes subtransactions as well as a rich collection of decision-making, chronological ordering, and communication and synchronization constructs for them. Advanced concepts such as blocking/non-blocking synchronization, vital and non-vital subtransactions, contingency transactions, temporal and value dependencies, and delegation are supported. Compensatable subtransactions are distinguished, and early commit is performed in order to release resources and facilitate cooperative as well as long-duration transactions. Automatic cancel procedures are provided to logically undo the effects of such commits if the global transaction fails. The complexity and semantics-orientation of advanced database applications is our main motivation for designing and implementing a high-level scripting language for the proposed transaction model. Database programming can gain in performance and problem-orientation if the semantic dependencies between transactions can be expressed directly. Simple and flexible mechanisms are provided for advanced users to query the databases, program their transactions accordingly, and accept weak forms of semantic coherence that allow for more concurrency. The transaction model is grafted onto the concurrent object-oriented programming language Sather, developed at UC Berkeley, which has a clean high-level syntax, supports advanced object-oriented concepts, and aims at performance and reusability. We have augmented the language with distributed programming facilities and various types of message passing routines, as well as advanced transaction management constructs. The thesis is organized in three parts. The first part introduces the problem, reviews the state of the art, and presents the transaction model. The second part describes the scripting language and discusses implementation details. The third part presents the formal semantics of the transaction model using mathematical notation and concludes the thesis.
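
    The combination of early commit and automatic cancel procedures described above can be sketched as follows. The sketch is written in Python rather than Sather or the thesis's scripting language, and it omits vital/non-vital subtransactions, contingency transactions and delegation. The function `run_global_transaction` and the reserve/charge example are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A compensatable subtransaction: (action, cancel procedure).
Step = Tuple[Callable[[Dict], None], Callable[[Dict], None]]


def run_global_transaction(db: Dict, steps: List[Step]) -> bool:
    """Commit each subtransaction early; if a later one fails, run the
    cancel procedures of the committed ones in reverse order."""
    cancels: List[Callable[[Dict], None]] = []
    for action, cancel in steps:
        try:
            action(db)  # effects become visible immediately (early commit)
        except Exception:
            for undo in reversed(cancels):
                undo(db)  # logically undo already committed work
            return False
        cancels.append(cancel)
    return True


def reserve(db: Dict) -> None:
    db["seats"] -= 1


def release(db: Dict) -> None:
    db["seats"] += 1


def charge(db: Dict) -> None:
    if db["balance"] < 50:
        raise ValueError("insufficient funds")
    db["balance"] -= 50


def refund(db: Dict) -> None:
    db["balance"] += 50
```

    Because each subtransaction's effects are visible before the global outcome is known, other transactions may observe state that is later cancelled. This matches the weaker forms of semantic coherence the abstract says users may choose to accept.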

    Monitoring and analysis system for performance troubleshooting in data centers

    It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM with a mistaken deletion of the state data of the Amazon Elastic Load Balancing service (ELB for short), which went unnoticed at the time. The mistake first led to a local issue in which a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one in which EC2 customers were significantly affected. For example, Netflix, which was using hundreds of Amazon ELB services, experienced an extensive streaming service outage, and many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours and 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought performance troubleshooting in data centers to the world's attention. As this Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time-consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations that data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed and terminated automatically, on demand. On the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies occur. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations with which operators can analyze the interactions to find out which components are relevant to the performance issue. VScope's capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope's ability to support fast operation and online queries against a comprehensive set of application-level to system/platform-level metrics, and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a 'thin layer' that accounts for no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via application-level monitoring alone, and in one of the use cases explored in the dissertation, it operates with levels of perturbation over 400% lower than those seen for brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has a troubleshooting accuracy of 83% on average. Ph.D.