33,929 research outputs found

    Fault-tolerant distributed computing scheme based on erasure codes

    Get PDF
    Some emerging classes of distributed computing systems, such peer-to-peer or grid computing computing systems, are composed of heterogeneous computing resources potentially unreliable. This paper proposes to use erasure codes to improve the fault-tolerance of parallel distributed computing applications in this context. A general method to generate redundant processes from a set of parallel processes is presented. This scheme allows the recovery of the result of the application even if some of the processes crash

    Computing in the RAIN: a reliable array of independent nodes

    Get PDF
    The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN-technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly-available video server, a highly-available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology

    Dynamic fault tolerant grid workflow in the water threat management project

    Get PDF
    Achieving fault tolerance is an inevitable problem in distributed systems, with it becoming more challenging in decentralized, heterogeneous, and dynamic-environment systems such as a Grid. When deploying applications requires time-criticality, how to allocate resources for jobs in a fault-tolerant manner is an important issue for the delivery of the services. The Water Threat Management project is a research to find solutions for the contamination incidents problems in urban water distribution systems, and it involves the development of the cyberinfrastructure in a Grid environment. To handle such urgent events properly, the deployment of the system demands real-time processing without the failure. Our approach of integrating a fault-tolerant framework into a Water Threat Management system provides fault tolerance at the queuing stage rather than the job-execution stage by scheduling jobs in fault-tolerant ways. This includes the development of the batch queuing system in the Cyberaide Shell project. In addition, we present a dynamic workflow in the Water Threat Management system that can reduce the queue wait time in the changing environment

    Fault Tolerant Real Time Dynamic Scheduling Algorithm For Heterogeneous Distributed System

    Get PDF
    Fault-tolerance becomes an important key to establish dependability in Real Time Distributed Systems (RTDS). In fault-tolerant Real Time Distributed systems, detection of fault and its recovery should be executed in timely manner so that in spite of fault occurrences the intended output of real-time computations always take place on time. Hardware and software redundancy are well-known e ective methods for faulttolerance, where extra hard ware (e.g., processors, communication links) and software (e.g., tasks, messages) are added into the system to deal with faults. Performances of RTDS are mostly guided by eciency of scheduling algorithm and schedulability analysis are performed on the system to ensure the timing constrains. This thesis examines the scenarios where a real time system requires very little redundant hardware resources to tolerate failures in heterogeneous real time distributed systems with point-to-point communication links. Fault tolerance can be achieved by..

    Integrated Design Tools for Embedded Control Systems

    Get PDF
    Currently, computer-based control systems are still being implemented using the same techniques as 10 years ago. The purpose of this project is the development of a design framework, consisting of tools and libraries, which allows the designer to build high reliable heterogeneous real-time embedded systems in a very short time at a fraction of the present day costs. The ultimate focus of current research is on transformation control laws to efficient concurrent algorithms, with concerns about important non-functional real-time control systems demands, such as fault-tolerance, safety,\ud reliability, etc.\ud The approach is based on software implementation of CSP process algebra, in a modern way (pure objectoriented design in Java). Furthermore, it is intended that the tool will support the desirable system-engineering stepwise refinement design approach, relying on past research achievements ¿ the mechatronics design trajectory based on the building-blocks approach, covering all complex (mechatronics) engineering phases: physical system modeling, control law design, embedded control system implementation and real-life realization. Therefore, we expect that this project will result in an\ud adequate tool, with results applicable in a wide range of target hardware platforms, based on common (off-theshelf) distributed heterogeneous (cheap) processing units

    Challenging Anti-fragile Blockchain Applications

    Get PDF
    International audienceFailures in production are a de facto rule for distributed software systems. In particular, modern distributed systems are composed of heterogeneous building blocks contributed by third parties and guaranteeing the end-to-end resilience is becoming a major challenge. Even though each of these software components can embed fault tolerance or dependability protocols, it remains difficult to assess their effectiveness upon the occurrences of unexpected failures. As part of this work, we propose a new generation of fault injection framework that can be deployed in production to challenge Blockchain-based distributed systems. This paper therefore reports on the state of the art in this area and potential opportunities for novel contributions towards building anti-fragile distributed systems on the Blockchain

    A model-based approach for automatic recovery from memory leaks in enterprise applications

    Get PDF
    Large-scale distributed computing systems such as data centers are hosted on heterogeneous and networked servers that execute in a dynamic and uncertain operating environment, caused by factors such as time-varying user workload and various failures. Therefore, achieving stringent quality-of-service goals is a challenging task, requiring a comprehensive approach to performance control, fault diagnosis, and failure recovery. This work presents a model-based approach for fault management, which integrates limited lookahead control (LLC), diagnosis, and fault-tolerance concepts that: (1) enables systems to adapt to environment variations, (2) maintains the availability and reliability of the system, (3) facilitates system recovery from failures. We focused on memory leak errors in this thesis. A characterization function is designed to detect memory leaks. Then, a LLC is applied to enable the computing system to adapt efficiently to variations in the workload, and to enable the system recover from memory leaks and maintain functionality

    A language and toolkit for the specification, execution and monitoring of dependable distributed applications

    Get PDF
    PhD ThesisThis thesis addresses the problem of specifying the composition of distributed applications out of existing applications, possibly legacy ones. With the automation of business processes on the increase, more and more applications of this kind are being constructed. The resulting applications can be quite complex, usually long-lived and are executed in a heterogeneous environment. In a distributed environment, long-lived activities need support for fault tolerance and dynamic reconfiguration. Indeed, it is likely that the environment where they are run will change (nodes may fail, services may be moved elsewhere or withdrawn) during their execution and the specification will have to be modified. There is also a need for modularity, scalability and openness. However, most of the existing systems only consider part of these requirements. A new area of research, called workflow management has been trying to address these issues. This work first looks at what needs to be addressed to support the specification and execution of these new applications in a heterogeneous, distributed environment. A co- ordination language (scripting language) is developed that fulfils the requirements of specifying the composition and inter-dependencies of distributed applications with the properties of dynamic reconfiguration, fault tolerance, modularity, scalability and openness. The architecture of the overall workflow system and its implementation are then presented. The system has been implemented as a set of CORBA services and the execution environment is built using a transactional workflow management system. Next, the thesis describes the design of a toolkit to specify, execute and monitor distributed applications. The design of the co-ordination language and the toolkit represents the main contribution of the thesis.UK Engineering and Physical Sciences Research Council, CaberNet, Northern Telecom (Nortel)

    Checkpointing as a Service in Heterogeneous Cloud Environments

    Get PDF
    A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.Comment: 20 pages, 11 figures, appears in CCGrid, 201
    corecore