
    Integrated Data, Message, and Process Recovery for Failure Masking in Web Services

    Modern Web Services applications encompass multiple distributed, interacting components, possibly including millions of lines of code written in different programming languages. Given this complexity, some bugs typically remain undetected despite extensive testing and occasionally cause transient system failures. Incorrect failure handling in applications often leads to incomplete or unintentionally repeated request executions. A family of recovery protocols called interaction contracts provides a generic solution to this problem by means of system-integrated data, process, and message recovery for multi-tier applications. The protocols mask failures and allow programmers to concentrate on the application logic, thus speeding up the development process. This thesis consists of two major parts. The first part formally specifies the interaction contracts using the state-and-activity chart language. Moreover, it presents a formal specification of a concrete Web Service that makes use of interaction contracts and contains no other error-handling actions. The formal specifications are verified by model checking, which mathematically proves crucial safety and liveness properties expressed in temporal logic. In particular, it is shown that each end-user request is executed exactly once. The second part of the thesis demonstrates the viability of the interaction contract framework in a real-world system. More specifically, a cascadable Web Service platform, EOS, is built on widely used components, Microsoft Internet Explorer and a PHP application server, with interaction contracts integrated into them.

    A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

    High Performance Computing (HPC) systems are widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most of these problems are either data-intensive or computation-intensive and may take hours, days, or even weeks to complete. For example, some computations on traditional HPC systems run on 100,000 processors for weeks. Consequently, traditional HPC systems often require huge capital investments, and scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed on traditional HPC systems can now be executed in the cloud, and the cloud computing pricing model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability, and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud because of resource sharing and distributed system networks. Hence, the need for reliable fault-tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach to HPC systems in the cloud that reduces the wall-clock execution time, as well as the dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud. Our experimental results, obtained from a real cloud execution environment, show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to the checkpoint and redundancy techniques used in traditional HPC systems.

    Automata for Web Services Fault Monitoring and Diagnosis

    Like any software, web services must go through the different phases of the fault management lifecycle. Model-based diagnosis is a well-established practice with several advantages, including being more readily understood by development and testing teams. Automata are simple, formally well-defined models used for monitoring and diagnosing system faults. For this reason, we review work on automata for web service fault management and propose a stochastic automaton model for the purpose.

    A review of experiences with reliable multicast


    Master/worker parallel discrete event simulation

    The execution of parallel discrete event simulation across metacomputing infrastructures is examined. A master/worker architecture for parallel discrete event simulation is proposed that provides robust execution under a dynamic set of services, with system-level support for fault tolerance, semi-automated client-directed load balancing, portability across heterogeneous machines, and the ability to run codes on idle or time-sharing clients without significant interaction by users. Research questions and challenges concerning the limitations of the work distribution paradigm, the targeted computational domain, performance metrics, and the intended class of applications are analyzed and discussed. A portable web services approach to master/worker parallel discrete event simulation is proposed and evaluated, with subsequent optimizations that increase the efficiency of large-scale simulation execution through distributed master service design and intrinsic overhead reduction. New techniques are proposed and examined for addressing the challenges of optimistic parallel discrete event simulation across metacomputing infrastructures, such as rollbacks and message unsending, using an inherently different computation paradigm based on master services and time windows. Results indicate that a master/worker approach utilizing loosely coupled resources is a viable means of high-throughput parallel discrete event simulation, either enhancing existing computational capacity or providing alternate execution capability for less time-critical codes.
    Ph.D. Committee Chair: Fujimoto, Richard; Committee Member: Bader, David; Committee Member: Perumalla, Kalyan; Committee Member: Riley, George; Committee Member: Vuduc, Richar