29 research outputs found
Reliability for exascale computing : system modelling and error mitigation for task-parallel HPC applications
As high performance computing (HPC) systems continue to grow, their fault rate increases. Applications running on these systems have to deal with rates on the order of hours or days. Furthermore, some studies for future Exascale systems predict the rates to be on the order of minutes. As a result, efficient fault tolerance solutions are needed to be able to tolerate frequent failures.
A fault tolerance solution for future HPC and Exascale systems must be low-cost, efficient and highly scalable. It should have low overhead in fault-free execution and provide fast restart because long-running applications are expected to experience many faults during the execution. Meanwhile task-based dataflow parallel programming models (PM) are becoming a popular paradigm in HPC applications at large scale. For instance, we see the adaptation of task-based dataflow parallelism in OpenMP 4.0, OmpSs PM, Argobots and Intel Threading Building Blocks.
In this thesis we propose fault-tolerance solutions for task-parallel dataflow HPC applications. Specifically, first we design and implement a checkpoint/restart and message-logging framework to recover from errors. We then develop performance models to investigate the benefits of our task-level frameworks when integrated with system-wide checkpointing. Moreover, we design and implement selective task replication mechanisms to detect and recover from silent data corruptions in task-parallel dataflow HPC applications. Finally, we introduce a runtime-based coding scheme to detect and recover from memory errors in these applications.
Considering the span of all of our schemes, we see that they provide a fairly high failure coverage where both computation and memory is protected against errors.A medida que los Sistemas de Cómputo de Alto rendimiento (HPC por sus siglas en inglés) siguen creciendo, también las tasas de fallos aumentan. Las aplicaciones que se ejecutan en estos sistemas tienen una tasa de fallos que pueden estar en el orden de horas o días. Además, algunos estudios predicen que los fallos estarán en el orden de minutos en los Sistemas Exascale. Por lo tanto, son necesarias soluciones eficientes para la tolerancia a fallos que puedan tolerar fallos frecuentes.
Las soluciones para tolerancia a fallos en los Sistemas futuros de HPC y Exascale tienen que ser de bajo costo, eficientes y altamente escalable.
El sobrecosto en la ejecución sin fallos debe ser bajo y también se debe proporcionar reinicio rápido, ya que se espera que las aplicaciones de larga duración experimenten muchos fallos durante la ejecución. Por otra parte, los modelos de programación paralelas basados en tareas ordenadas de acuerdo a sus dependencias de datos, se están convirtiendo en un paradigma popular en aplicaciones HPC a gran escala. Por ejemplo, los siguientes modelos de programación paralela incluyen este tipo de modelo de programación OpenMP 4.0, OmpSs, Argobots e Intel Threading Building Blocks.
En esta tesis proponemos soluciones de tolerancia a fallos para aplicaciones de HPC programadas en un modelo de programación paralelo basado tareas. Específicamente, en primer lugar, diseñamos e implementamos mecanismos “checkpoint/restart” y “message-logging” para recuperarse de los errores.
Para investigar los beneficios de nuestras herramientas a nivel de tarea cuando se integra con los “system-wide checkpointing” se han desarrollado modelos de rendimiento. Por otra parte, diseñamos e implementamos mecanismos de replicación selectiva de tareas que permiten detectar y recuperarse de daños de datos silenciosos en aplicaciones programadas siguiendo el modelo de programación paralela basadas en tareas. Por último, se introduce un esquema de codificación que funciona en tiempo de ejecución para detectar y recuperarse de los errores de la memoria en estas aplicaciones.
Todos los esquemas propuestos, en conjunto, proporcionan una cobertura bastante alta a los fallos tanto si estos se producen el cálculo o en la memoria.Postprint (published version
Transaction Support for Ada
This paper describes the transaction support framework OPTIMA and its implementation for Ada 95. First, a transaction model that fits concurrent programming languages is presented. Then the design of the framework is given. Applications from many different domains can benefit from using transactions; it is therefore important to provide means to customize the framework depending on the application requirements. This flexibility is achieved by using design patterns. Class hierarchies with classes implementing standard transactional behavior are provided, but a programmer is free to extend the hierarchies by implementing application-specific functionalities. An interface for Ada programmers is presented and its use demonstrated via a simple example
Replication of non-deterministic objects
This thesis discusses replication of non-deterministic objects in distributed systems to achieve fault tolerance against crash failures. The objects replicated are the virtual nodes of a distributed application. Replication is viewed as an issue that is to be dealt with only during the configuration of a distributed application and that should not affect the development of the application. Hence, replication of virtual nodes should be transparent to the application. Like all measures to achieve fault tolerance, replication introduces redundancy in the system. Not surprisingly, the main difficulty is guaranteeing the consistency of all replicas such that they behave in the same way as if the object was not replicated (replication transparency). This is further complicated if active objects (like virtual nodes) are replicated, and these objects themselves can be clients of still further objects in the distributed application. The problems of replication of active non-deterministic objects are analyzed in the context of distributed Ada 95 applications. The ISO standard for Ada 95 defines a model for distributed execution based on remote procedure calls (RPC). Virtual nodes in Ada 95 use this as their sole communication paradigm, but they may contain tasks to execute activities concurrently, thus making the execution potentially non-deterministic due to implicit timing dependencies. Such non-determinism cannot be avoided by choosing deterministic tasking policies. I present two different approaches to maintain replica consistency despite this non-determinism. In a first approach, I consider the run-time support of Ada 95 as a black box (except for the part handling remote communications). This corresponds to a non-deterministic computation model. I show that replication of non-deterministic virtual nodes requires that remote procedure calls are implemented as nested transactions. Unfortunately, effects of failures are not local to the replicas of a virtual node: when a failure occurs, nested remote calls made to other virtual nodes must be undone. Also, using transactional semantics for RPCs necessitates a compromise regarding transparency: the application must identify global state for it cannot be determined reliably in an automatic way. Further study reveals that this approach cannot be implemented in a transparent way at all because the consistency criterion of Ada 95 (linearizability) is much weaker than that of transactions (serializability). An execution of remote procedure calls as transactions may thus lead to incompatibilities with the semantics of the programming language. If remotely called subprograms on a replicated virtual node perform partial operations, i.e., entry calls on global protected objects, deadlocks that cannot be broken can occur in certain cases. Such deadlocks do not occur when the virtual node is not replicated. The transactional semantics of RPCs must therefore be exposed to the application. A second approach is based on a piecewise deterministic computation model, i.e., the execution of a virtual node is seen as a sequence of deterministic state intervals. Whenever a non-deterministic event occurs, a new state interval is started. I study replica organization under this computation model (semi-active replication). In this model, all non-deterministic decisions are made on one distinguished replica (the leader), while all other replicas (the followers) are forced to follow the same sequence of non-deterministic events. I show that it suffices to synchronize the followers with the leader upon each observable event, i.e., when the leader sends a message to some other virtual node. It is not necessary to synchronize upon each and every non-deterministic event — which would incur a prohibitively high overhead. Non-deterministic events occurring on the leader between observable events are logged and sent to the followers just before the leader executes an observable event. Consequently, it is guaranteed that the followers will reach the same state as the leader, and thus the effects of failures remain mostly local to the replicas. A prototype implementation called RAPIDS (Replicated Ada Partitions In Distributed Systems) serves as a proof of concept for this second approach, demonstrating its feasibility. RAPIDS is an Ada 95 implementation of a replication manager for semi-active replication for the GNAT development system for Ada 95. It is entirely contained within the run-time support and hence largely transparent for the application
Open Multithreaded Transactions: A Transaction Model for Concurrent Object-Oriented Programming
To read the abstract, please go to my PhD home page
Integrated Data, Message, and Process Recovery for Failure Masking in Web Services
Modern Web Services applications encompass multiple distributed interacting components, possibly including millions of lines of code written in different programming languages. With this complexity, some bugs often remain undetected despite extensive testing procedures, and occasionally cause transient system failures. Incorrect failure handling in applications often leads to incomplete or to unintentional request executions. A family of recovery protocols called interaction contracts provides a generic solution to this problem by means of system-integrated data, process, and message recovery for multi-tier applications. It is able to mask failures, and allows programmers to concentrate on the application logic, thus speeding up the development process. This thesis consists of two major parts. The first part formally specifies the interaction contracts using the state-and-activity chart language. Moreover, it presents a formal specification of a concrete Web Service that makes use of interaction contracts, and contains no other error-handling actions. The formal specifications undergo verification where crucial safety and liveness properties expressed in temporal logics are mathematically proved by means of model checking. In particular, it is shown that each end-user request is executed exactly once. The second part of the thesis demonstrates the viability of the interaction framework in a real world system. More specifically, a cascadable Web Service platform, EOS, is built based on widely used components, Microsoft Internet Explorer and PHP application server, with interaction contracts integrated into them.Heutige Web-Service-Anwendungen setzen sich aus mehreren verteilten interagierenden
Komponenten zusammen. Dabei werden oft mehrere Programmiersprachen eingesetzt,
und der Quellcode einer Komponente kann mehrere Millionen Programmzeilen
umfassen. In Anbetracht dieser Komplexität bleiben typischerweise einige
Programmierfehler trotz intensiver Qualitätssicherung unentdeckt und verursachen
vorübergehende Systemsausfälle zur Laufzeit. Eine ungenügende Fehlerbehandlung in
Anwendungen führt oft zur unvollständigen oder unbeabsichtigt wiederholten
Ausführung einer Operation. Eine Familie von Recovery-Protokollen, die so genannten
"Interaction Contracts", bietet eine generische Lösung dieses Problems. Diese Recovery-
Protokolle sorgen für die Fehlermaskierung und ermöglichen somit, dass Entwickler ihre
ganze Konzentration der Anwendungslogik widmen können. Dies trägt zu einer
erheblichen Beschleunigung des Entwicklungsprozesses bei.
Diese Dissertation besteht aus zwei wesentlichen Teilen. Der erste Teil widmet sich der
formalen Spezifikation der Recovery-Protokolle unter Verwendung des Formalismus der
State-and-Activity-Charts. Darüber hinaus entwickeln wir die formale Spezifikation einer
Web-Service-Anwendung, die außer den Recovery-Protokollen keine weitere
Fehlerbehandlung beinhaltet. Die formalen Spezifikationen werden in Bezug auf kritische
Sicherheits- und Lebendigkeitseigenschaften, die als temporallogische Formeln
angegeben sind, mittels "Model Checking" verifiziert. Unter anderem wird somit
mathematisch bewiesen, dass jede Operation eines Endbenutzers genau einmal ausgeführt
wird. Der zweite Teil der Dissertation beschreibt die Implementierung der Recovery-
Protokolle im Rahmen einer beliebig verteilbaren Web-Service-Plattform EOS, die auf
weit verbreiteten Web-Produkten aufbaut: dem Browser "Microsoft Internet Explorer"
und dem PHP-Anwendungsserver
Management of object-oriented action-based distributed programs
Phd ThesisThis thesis addresses the problem of managing the runtime behaviour of distributed
programs. The thesis of this work is that management is fundamentally
an information processing activity and that the object model, as applied to actionbased
distributed systems and database systems, is an appropriate representation
of the management information. In this approach, the basic concepts of classes,
objects, relationships, and atomic transition systems are used to form object
models of distributed programs. Distributed programs are collections of objects
whose methods are structured using atomic actions, i.e., atomic transactions.
Object models are formed of two submodels, each representing a fundamental
aspect of a distributed program. The structural submodel represents a static
perspective of the distributed program, and the control submodel represents a
dynamic perspective of it. Structural models represent the program's objects,
classes and their relationships. Control models represent the program's object
states, events, guards and actions-a transition system. Resolution of queries on
the distributed program's object model enable the management system to control
certain activities of distributed programs.
At a different level of abstraction, the distributed program can be seen as a
reactive system where two subprograms interact: an application program and a
management program; they interact only through sensors and actuators. Sensors
are methods used to probe an object's state and actuators are methods used
to change an object's state. The management program is capable to prod the
application program into action by activating sensors and actuators available at
the interface of the application program. Actions are determined by management
policies that are encoded in the management program. This way of structuring
the management system encourages a clear modularization of application and
management distributed programs, allowing better separation of concerns. Managemental
concerns can be dealt with by the management program, functional
concerns can be assigned to the application program.
The object-oriented action-based computational model adopted by the management
system provides a natural framework for the implementation of faulttolerant
distributed programs. Object orientation provides modularity and extensibility
through object encapsulation. Atomic actions guarantee the consistency of
the objects of the distributed program despite concurrency and failures. Replication
of the distributed program provides increased fault-tolerance by guaranteeing
the consistent progress of the computation, even though some of the replicated
objects can fail.
A prototype management system based on the management theory proposed
above has been implemented atop Arjuna; an object-oriented programming system
which provides a set of tools for constructing fault-tolerant distributed programs. The management system is composed of two subsystems: Stabilis, a
management system for structural information, and Vigil, a management system
for control information. Example applications have been implemented to illustrate
the use of the management system and gather experimental evidence to give
support to the thesis.CNPq (Consellho Nacional de Desenvolvimento Cientifico e Tecnol6gico, Brazil):
BROADCAST (Basic Research On Advanced Distributed Computing: from Algorithms to SysTems)
JMSGroups:JMS compliant group communication
Nowadays, computers are the indispensable part of our life. They evolve rapidly and are more and more versatile. Computer networks made the remote corners of the world just a click away. But unavoidably, any software and hardware component is subject to failure. Distributed systems spread on tens or hundreds of machines are particularly vulnerable to failures. Consequently, high availability and fault tolerance became a "must have" feature for such systems. Software fault tolerance is achieved through the technique called replication. In replication several software replicas are executed at the same time. If one or several of them fail, other still provide the service. Software replication is often implemented using group communication, which provides communication primitives with various semantics and greatly simplifies the development of highly available and fault tolerant services. However, despite tremendous advances in research and numerous prototypes, group communication stays confined to small niches and academic prototypes. In contrast, other technology, called messageoriented middleware such as the Java Message Service (JMS) is widely used in distributed systems, and has become a de-facto standard. We believe that the lack of a well-defined and easily understandable standard is the reason that hinders the deployment of group communication systems. Since JMS is a well-established technology, we propose to extend JMS adding group communication primitives to it. Foremost, this requires to extend the traditional semantics of group communication in order to take into account various features of JMS, e.g., durable/non-durable subscriptions and persistent/non-persistent messages. The resulting new group communication specification, together with the corresponding API, defines group communication primitives compatible with JMS, that we call JMSGroups. To validate the specification and API we provide a prototype implementation of JMSGroups. As such, we believe it facilitates the acceptance of group communication by a larger community and provides a powerful environment for building fault-tolerant applications
Transactional actors in cooperative information systems
Transaction management in advanced distributed information systems is a very
important issue under research scrutiny with many technical and open problems.
Most of the research and development activities use conventional database technology
to address this important issue. The transaction model presented in this
thesis combines attractive properties of the actor model of computation with
advanced database transaction concepts in an object-oriented environment to
address transactional necessities of cooperative information systems. The novel
notion of transaction tree in our model includes subtransactions as well as a
rich collection of decision making, chronological ordering, and communication
and synchronization constructs for them. Advanced concepts such as blocking/
non_blocking synchronization, vital and non_vital subtransactions , contingency
transactions, temporal and value dependencies, and delegation are supported.
Compensatable subtransactions are distinguished and early commit is accomplished
in order to release resources and facilitate cooperative as well as longduration
transactions. Automatic cancel procedures are provided to logically
undo the effects of such commits if the global transaction fails.
The complexity and semantics-orientation of advanced database applications
is our main motivation to design and implement a high-level scripting language for
the proposed transaction model. Database programming can gain in performance
and problem-orientation if the semantic dependencies between transactions can
be expressed directly. Simple and flexible mechanisms are provided for advanced
users to query the databases, program their transactions accordingly, and accept
weak forms of semantic coherence that allows for more concurrency. The transaction
model is grafted onto the concurrent obj ect-oriented programming language
Sather developed at UC Berkeley which has a nice high-level syntax, supports
advanced obj ect-oriented concepts, and aims toward performance and reusability.
W have augmented the language with distributed programming facilities
and various types of message passing routines as well as advanced transactions
management constructs . The thesis is organized in three parts. The first part introduces the problem, reviews state of the art, and presents the transaction model. The second part describes
the scripting language and talks about implementation details. The third
part presents the formal semantics of the transaction model using mathematical
notations and concludes the thesis
Recommended from our members
Replication and Nested Transactions in the Eden Distributed System
Hardware redundancy in distributed systems offers the potential for increased availability and performance, but this requires software support if the full potential is to be realized. We have designed and implemented two mechanisms for such support. The first provides crash-resistant resources, replicated transparently and consistently to increase the availability of distributed data. To update multiple copies despite down nodes, we have introduced the Regeneration method used in the implementation of a replicated system directory. Regeneration restores inaccessible copies elsewhere in the network, maintains the availability of resources, and adapts to configuration changes. The second mechanism is a systell1 supporting nested transactions, which can manage the complex failure modes in a distributed system, synchronize concurrent resource access internal to applications, and facilitate safe module composition. In the tree-structured nesting, each transaction has a Transaction Manager (TM), responsible for the concurrency control and crash recovery of its subtransactions. Many concurrency control and recovery techniques can be combined in this TM Tree design framework. We chose locking and versions for the first implementation. Using Eden objects and the replicated directory, our nested transactions provide consistent concurrent access 10 location-independent, crash- resistant resources. In summary, the principal contributions of this research are the Regeneration method and the TM Tree framework. Regeneration uses the separation of hardware repair from data restoration to increase replicated data availability. TM tree composes existing techniques to derive many difficult designs for nested transaction. Both have been proven in the design and implementation of actual systems
Monitoring and analysis system for performance troubleshooting in data centers
It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM, with an mistaken deletion of the state data of Amazon Elastic Load Balancing Service (ELB for short), which was
not realized at that time. The mistake first led to a local issue that a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one that EC2 customers were significantly affected. One example was that Netflix, which was using hundreds of Amazon ELB services, was experiencing an extensive streaming service outage when many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought the performance
troubleshooting in data centers to world’s attention. As shown in this Amazon ELB case.Troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming.
To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers.
VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel
software architecture for VScope so that the overlay networks can be generated, executed and terminated automatically, on-demand. From the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By
running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze the interactions to find out which components are relevant to the
performance issue.
VScope’s capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope’s ability to support fast operation and online queries against a comprehensive set of application to system/platform level metrics, and a variety of representative analytics functions. When supporting algorithms with high computation complexity, VScope serves as a ‘thin layer’ that occupies no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found
via solely application-level monitoring, and in one of the use cases explored in the dissertation, it operates with levels of perturbation of over 400% less than what is seen for brute-force and most sampling-based approaches. We also validate VFocus
with real-world data center traces. The experimental results show that VFocus has troubleshooting accuracy of 83% on average.Ph.D