Fault-tolerant distributed computing scheme based on erasure codes
Some emerging classes of distributed computing systems, such as peer-to-peer or grid computing systems, are composed of heterogeneous and potentially unreliable computing resources. This paper proposes using erasure codes to improve the fault tolerance of parallel distributed computing applications in this context. A general method for generating redundant processes from a set of parallel processes is presented. This scheme allows the result of the application to be recovered even if some of the processes crash.
Computing in the RAIN: a reliable array of independent nodes
The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly-available video server, a highly-available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology.
Dynamic fault tolerant grid workflow in the water threat management project
Achieving fault tolerance is an unavoidable problem in distributed systems, and it becomes more challenging in decentralized, heterogeneous, and dynamic environments such as a Grid. When applications are time-critical, how to allocate resources to jobs in a fault-tolerant manner is an important issue for service delivery. The Water Threat Management project is a research effort to find solutions to contamination incidents in urban water distribution systems, and it involves the development of cyberinfrastructure in a Grid environment. To handle such urgent events properly, the deployed system demands real-time processing without failure. Our approach of integrating a fault-tolerant framework into a Water Threat Management system provides fault tolerance at the queuing stage rather than the job-execution stage by scheduling jobs in fault-tolerant ways. This includes the development of the batch queuing system in the Cyberaide Shell project. In addition, we present a dynamic workflow in the Water Threat Management system that can reduce the queue wait time in a changing environment.
Fault Tolerant Real Time Dynamic Scheduling Algorithm For Heterogeneous Distributed System
Fault tolerance is an important key to establishing dependability in Real Time Distributed Systems (RTDS). In fault-tolerant real-time distributed systems, fault detection and recovery should be executed in a timely manner so that, in spite of fault occurrences, the intended output of real-time computations always takes place on time. Hardware and software redundancy are well-known effective methods for fault tolerance, where extra hardware (e.g., processors, communication links) and software (e.g., tasks, messages) are added to the system to deal with faults. The performance of an RTDS is mostly governed by the efficiency of its scheduling algorithm, and schedulability analysis is performed on the system to ensure the timing constraints are met. This thesis examines scenarios where a real-time system requires very little redundant hardware to tolerate failures in heterogeneous real-time distributed systems with point-to-point communication links. Fault tolerance can be achieved by..
Integrated Design Tools for Embedded Control Systems
Currently, computer-based control systems are still being implemented using the same techniques as 10 years ago. The purpose of this project is the development of a design framework, consisting of tools and libraries, which allows the designer to build highly reliable heterogeneous real-time embedded systems in a very short time at a fraction of present-day costs. The ultimate focus of current research is on transforming control laws into efficient concurrent algorithms, with attention to important non-functional demands of real-time control systems, such as fault tolerance, safety, reliability, etc.
The approach is based on a software implementation of the CSP process algebra, in a modern way (pure object-oriented design in Java). Furthermore, it is intended that the tool will support the desirable system-engineering stepwise-refinement design approach, relying on past research achievements, namely the mechatronics design trajectory based on the building-blocks approach, covering all complex (mechatronics) engineering phases: physical system modeling, control law design, embedded control system implementation and real-life realization. Therefore, we expect that this project will result in an adequate tool, with results applicable to a wide range of target hardware platforms based on common (off-the-shelf) distributed heterogeneous (cheap) processing units.
Challenging Anti-fragile Blockchain Applications
Failures in production are a de facto rule for distributed software systems. In particular, modern distributed systems are composed of heterogeneous building blocks contributed by third parties, and guaranteeing end-to-end resilience is becoming a major challenge. Even though each of these software components can embed fault tolerance or dependability protocols, it remains difficult to assess their effectiveness when unexpected failures occur. As part of this work, we propose a new generation of fault injection framework that can be deployed in production to challenge Blockchain-based distributed systems. This paper therefore reports on the state of the art in this area and potential opportunities for novel contributions towards building anti-fragile distributed systems on the Blockchain.
A model-based approach for automatic recovery from memory leaks in enterprise applications
Large-scale distributed computing systems such as data centers are hosted on heterogeneous, networked servers that execute in a dynamic and uncertain operating environment, shaped by factors such as time-varying user workload and various failures. Therefore, achieving stringent quality-of-service goals is a challenging task, requiring a comprehensive approach to performance control, fault diagnosis, and failure recovery. This work presents a model-based approach for fault management, which integrates limited lookahead control (LLC), diagnosis, and fault-tolerance concepts to (1) enable systems to adapt to environment variations, (2) maintain the availability and reliability of the system, and (3) facilitate system recovery from failures. We focused on memory leak errors in this thesis. A characterization function is designed to detect memory leaks. Then, an LLC is applied to enable the computing system to adapt efficiently to variations in the workload, and to enable the system to recover from memory leaks and maintain functionality.
A language and toolkit for the specification, execution and monitoring of dependable distributed applications
PhD Thesis. This thesis addresses the problem of specifying the composition of distributed applications
out of existing applications, possibly legacy ones. With the automation of business processes
on the increase, more and more applications of this kind are being constructed. The resulting
applications can be quite complex, usually long-lived and are executed in a heterogeneous
environment. In a distributed environment, long-lived activities need support for fault tolerance
and dynamic reconfiguration. Indeed, it is likely that the environment where they are run will
change (nodes may fail, services may be moved elsewhere or withdrawn) during their
execution and the specification will have to be modified. There is also a need for modularity,
scalability and openness. However, most of the existing systems only consider part of these
requirements. A new area of research, called workflow management, has been trying to address
these issues.
This work first looks at what needs to be addressed to support the specification and
execution of these new applications in a heterogeneous, distributed environment. A
co-ordination language (scripting language) is developed that fulfils the requirements of specifying
the composition and inter-dependencies of distributed applications with the properties of
dynamic reconfiguration, fault tolerance, modularity, scalability and openness. The architecture
of the overall workflow system and its implementation are then presented. The system has been
implemented as a set of CORBA services and the execution environment is built using a
transactional workflow management system. Next, the thesis describes the design of a toolkit
to specify, execute and monitor distributed applications. The design of the co-ordination
language and the toolkit represents the main contribution of the thesis.
UK Engineering and Physical Sciences Research Council, CaberNet, Northern Telecom (Nortel)
Checkpointing as a Service in Heterogeneous Cloud Environments
A non-invasive, cloud-agnostic approach is demonstrated for extending
existing cloud platforms to include checkpoint-restart capability. Most cloud
platforms currently rely on each application to provide its own fault
tolerance. A uniform mechanism within the cloud itself serves two purposes: (a)
direct support for long-running jobs, which would otherwise require a custom
fault-tolerant mechanism for each application; and (b) the administrative
capability to manage an over-subscribed cloud by temporarily swapping out jobs
when higher priority jobs arrive. An advantage of this uniform approach is that
it also supports parallel and distributed computations, over both TCP and
InfiniBand, thus allowing traditional HPC applications to take advantage of an
existing cloud infrastructure. Additionally, an integrated health-monitoring
mechanism detects when long-running jobs either fail or incur exceptionally low
performance, perhaps due to resource starvation, and proactively suspends the
job. The cloud-agnostic feature is demonstrated by applying the implementation
to two very different cloud platforms: Snooze and OpenStack. The use of a
cloud-agnostic architecture also enables, for the first time, migration of
applications from one cloud platform to another.
Comment: 20 pages, 11 figures, appears in CCGrid, 201