Failover in cellular automata
A cellular automata (CA) configuration is constructed that exhibits emergent
failover. The configuration is based on standard Game of Life rules. Gliders
and glider-guns form the core messaging structure in the configuration. The
blinker is represented as the basic computational unit, and it is shown how it
can be recreated in case of a failure. Stateless failover using primary-backup
mechanism is demonstrated. The details of the CA components used in the
configuration and its working are described, and a simulation of the complete
configuration is also presented.
Comment: 16 pages, 15 figures and associated video at
http://dl.dropbox.com/u/7553694/failover_demo.avi and simulation at
http://dl.dropbox.com/u/7553694/failover_simulation.ja
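As a rough illustration of the rules the abstract refers to, the following sketch implements one generation of the standard Game of Life and applies it to a blinker. It is not the authors' configuration: the function name `step` and the coordinate layout are assumptions made for this example, and the glider-based recovery mechanism is not modelled. It only shows that an intact blinker oscillates with period 2, while a blinker that loses one cell (a simple stand-in for a "failure") dies out and would need to be recreated.

```python
from collections import Counter


def step(live):
    """Advance a set of live (x, y) cells by one Game of Life generation."""
    # Count live neighbours for every cell adjacent to at least one live cell.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next generation if it has exactly 3 live neighbours,
    # or if it has exactly 2 and is alive now.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}


# The two phases of a blinker (coordinates chosen for this example).
blinker_h = {(0, 1), (1, 1), (2, 1)}
blinker_v = {(1, 0), (1, 1), (1, 2)}

# A blinker with one cell removed, used here as a hypothetical failure.
damaged = {(0, 1), (1, 1)}

print(step(blinker_h) == blinker_v)   # the blinker flips phase
print(step(damaged))                  # the damaged blinker disappears
```

Representing the grid as a set of live coordinates keeps the sketch short and avoids fixing a board size.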
Comparison of Enhancing Methods for Primary/Backup Approach Meant for Fault Tolerant Scheduling
This report explores algorithms aiming at reducing the algorithm run-time and rejection rate when online scheduling tasks on real-time embedded systems consisting of several processors prone to fault occurrence. The authors introduce a new processor scheduling policy and propose new enhancing methods for the primary/backup approach and analyse their performances. The studied techniques are as follows: (i) the method of restricted scheduling windows within which the primary and backup copies can be scheduled, (ii) the method of limitation on the number of comparisons, accounting for the algorithm run-time, when scheduling a task on a system, and (iii) the method of several scheduling attempts. Last but not least, we inject faults to evaluate the impact on scheduling algorithms. Thorough experiments show that the best proposed method is based on the combination of the limitation on the number of comparisons and two scheduling attempts. When it is compared to the primary/backup approach without this method, the algorithm run-time is reduced by 23% (mean value) and 67% (maximum value) and the rejection rate is decreased by 4%. This improvement in the algorithm run-time is significant, especially for embedded systems dealing with hard real-time tasks. Finally, we found out that the studied algorithm performs well in a harsh environment
Virtual Net: a Decentralized Architecture for Interaction in Mobile Virtual Worlds
With the development of mobile technology, mobile virtual worlds have
attracted massive numbers of users. To improve scalability, a peer-to-peer
virtual world provides a solution that accommodates more users without
increasing hardware investment. In mobile settings, however, existing P2P solutions are not
applicable due to the unreliability of mobile devices and the instability of
mobile networks. To address the issue, a novel infrastructure model, called
Virtual Net, is proposed to provide fault-tolerance in managing user content
and object state. In this paper, the key problem, namely object state update,
is resolved to maintain state consistency and high interaction responsiveness.
This work is important in implementing a scalable mobile virtual world
Recovery Time Considerations in Real-Time Systems Employing Software Fault Tolerance
Safety-critical real-time systems like modern automobiles with advanced driving-assist features must employ redundancy for crucial software tasks to tolerate permanent crash faults. This redundancy can be achieved by using techniques like active replication or the primary-backup approach. In such systems, the recovery time, which is the amount of time it takes for a redundant task to take over execution on the failure of a primary task, becomes a very important design parameter. The recovery time for a given task depends on various factors like task allocation, primary and redundant task priorities, system load and the scheduling policy. Each task can also have a different recovery time requirement (RTR). For example, in automobiles with automated driving features, safety-critical tasks like perception and steering control have strict RTRs, whereas such requirements are more relaxed in the case of tasks like heating control and mission planning. In this paper, we analyze the recovery time for software tasks in a real-time system employing Rate-Monotonic Scheduling (RMS). We derive bounds on the recovery times for different redundant task options and propose techniques to determine the redundant-task type for a task to satisfy its RTR. We also address the fault-tolerant task allocation problem, with the additional constraint of satisfying the RTR of each task in the system. Given that the problem of assigning tasks to processors is a well-known NP-hard bin-packing problem, we propose computationally efficient heuristics to find a feasible allocation of tasks and their redundant copies. We also apply the simulated annealing method to the fault-tolerant task allocation problem with RTR constraints and compare against our heuristics
A Bag-of-Tasks Scheduler Tolerant to Temporal Failures in Clouds
Cloud platforms have emerged as a prominent environment to execute high
performance computing (HPC) applications providing on-demand resources as well
as scalability. They usually offer different classes of Virtual Machines (VMs)
which ensure different guarantees in terms of availability and volatility,
provisioning the same resource through multiple pricing models. For instance,
in Amazon EC2 cloud, the user pays per hour for on-demand VMs while spot VMs
are unused instances available for lower price. Despite the monetary
advantages, a spot VM can be terminated, stopped, or hibernated by EC2 at any
moment.
Using both hibernation-prone spot VMs (to reduce cost) and on-demand VMs, we
propose in this paper a static scheduling for HPC applications which are
composed of independent tasks (bag-of-tasks) with deadline constraints. However,
if a spot VM hibernates and it does not resume within a time which guarantees
the application's deadline, a temporal failure takes place. Our scheduling,
thus, aims at minimizing monetary costs of bag-of-tasks applications in EC2
cloud, respecting its deadline and avoiding temporal failures. To this end, our
algorithm statically creates two scheduling maps: (i) the first one contains,
for each task, its starting time and on which VM (i.e., an available spot or
on-demand VM with the current lowest price) the task should execute; (ii) the
second one contains, for each task allocated on a spot VM in the first map, its
starting time and on which on-demand VM it should be executed to meet the
application deadline in order to avoid temporal failures. The latter will be
used whenever the hibernation period of a spot VM exceeds a time limit.
Performance results from simulation with task execution traces, configuration
of Amazon EC2 VM classes, and VMs market history confirm the effectiveness of
our scheduling and that it tolerates temporal failures
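The two-map idea described above can be sketched roughly as follows. This is a simplified greedy illustration, not the authors' algorithm: the `VM` class, the `build_maps` function, the prices, and the placement rules (cheapest feasible VM for the primary map, earliest-free on-demand VM for the backup map) are all assumptions made for this example, and hibernation itself is not simulated.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    kind: str              # "spot" or "on-demand"
    price: float           # hypothetical price per time unit
    free_at: float = 0.0   # time at which the VM becomes idle


def build_maps(tasks, vms, deadline):
    """Return (primary, backup) maps of task -> (VM name, start time)."""
    primary, backup = {}, {}

    # Primary map: place each task on the cheapest VM that can still
    # finish it before the deadline.
    for task, duration in tasks.items():
        candidates = [vm for vm in vms if vm.free_at + duration <= deadline]
        if not candidates:
            raise ValueError(f"no VM can finish {task} before the deadline")
        vm = min(candidates, key=lambda v: v.price)
        primary[task] = (vm.name, vm.free_at)
        vm.free_at += duration

    # Backup map: every task placed on a spot VM also gets a slot on an
    # on-demand VM, scheduled after that VM's primary work.
    free = {vm.name: vm.free_at for vm in vms if vm.kind == "on-demand"}
    spot_names = {vm.name for vm in vms if vm.kind == "spot"}
    for task, (vm_name, _) in primary.items():
        if vm_name not in spot_names:
            continue
        if not free:
            raise ValueError("no on-demand VM available for backups")
        target = min(free, key=free.get)   # earliest-free on-demand VM
        start = free[target]
        if start + tasks[task] > deadline:
            raise ValueError(f"backup of {task} would miss the deadline")
        backup[task] = (target, start)
        free[target] = start + tasks[task]

    return primary, backup


tasks = {"t1": 2, "t2": 2, "t3": 2}
vms = [VM("s1", "spot", 0.3), VM("d1", "on-demand", 1.0),
       VM("d2", "on-demand", 1.0)]
primary, backup = build_maps(tasks, vms, deadline=4)
print(primary)
print(backup)
```

In this example the spot VM takes `t1` and `t2`, `t3` overflows to an on-demand VM, and the backup map reserves on-demand slots for the two spot-hosted tasks so that both can still finish by the deadline.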
Mechanisms for improving ZooKeeper Atomic Broadcast performance
PhD Thesis
Coordination services are essential for building higher-level primitives that are often
used in today's data-center infrastructures, as they greatly facilitate the operation of
distributed client applications. Examples of typical functionalities offered by coordination
services include the provision of group membership, support for leader election,
distributed synchronization, as well as reliable low-volume storage and naming.
To provide reliable services to the client applications, coordination services in general
are replicated for fault tolerance and should deliver high performance to ensure that
they do not become bottlenecks for dependent applications. Apache ZooKeeper, for
example, is a well-known coordination service and applies a primary-backup approach
in which the leader server processes all state-modifying requests and then forwards
the corresponding state updates to a set of follower servers using an atomic broadcast
protocol called Zab.
Having analyzed state-of-the-art coordination services, we identified two main
limitations that prevent existing systems such as Apache ZooKeeper from achieving a
higher write performance: First, while this approach prevents the data stored by client
applications from being lost as a result of server crashes, it also comes at the cost of a
performance penalty. In particular, the fact that it relies on a leader-based protocol
means that its performance becomes bottlenecked when the leader server has to handle
increased message traffic as the number of client requests and replicas increases.
Second, Zab requires significant communication between instances (as it entails three
communication steps). This can potentially lead to performance overhead and uses up
more computer resources, resulting in fewer guarantees for users who must then build
more complex applications to handle these issues.
To this end, the work makes four contributions. First, we implement ZooKeeper
atomic broadcast, extracting from ZooKeeper in order to make it easier for other
developers to build their applications on top of Zab without the complexity of integrating
the entire ZooKeeper codebase. Second, we propose three variations of Zab, which
are all capable of reaching an agreement in fewer communication steps than Zab. The
variations are built under the restrictive assumptions that server crashes are independent
and a server quorum remains operative at all times. The first variation offers excellent
performance but can only be used for 3-server systems; the other two are built without
this limitation. Then, we redesigned the latter two Zab variations to operate under the
least-restricted Zab fault assumptions. Third, we design and implement a ZooKeeper
coin-tossing protocol, called ZabCT, which addresses the above concerns by having the
other, non-leader server replicas toss a coin and broadcast their acknowledgment of a
leader's proposal only if the toss results in an outcome of Head. We model the ZabCT
process and derive analytical expressions for estimating the coin-tossing probability
of Head for a given arrival rate of service requests such that the dual objectives of
performance gains and traffic reduction can be accomplished. If the coin-tossing protocol,
ZabCT, is judged not to offer performance benefits over Zab, processes should be able to
switch autonomously to Zab. We design protocol switching by letting processes switch
between ZabCT and Zab without stopping message delivery. Finally, an extensive
performance evaluation is provided for Zab and Zab-variant protocols
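The trade-off behind the coin-tossing idea can be illustrated with a simple binomial model. This is not the analytical expression derived in the thesis: the function names, the assumption that each follower tosses independently with the same probability of Head, and the assumption that the leader counts itself toward a majority quorum are all made for this sketch only.

```python
from math import comb


def expected_acks(n_servers, p_head):
    """Expected number of follower acknowledgments per proposal."""
    return (n_servers - 1) * p_head


def quorum_probability(n_servers, p_head):
    """Probability that one round of tosses yields a majority quorum.

    Assumes the leader counts itself, so it needs n_servers // 2
    acknowledgments from the n_servers - 1 followers.
    """
    followers = n_servers - 1
    needed = n_servers // 2
    return sum(
        comb(followers, k) * p_head**k * (1 - p_head)**(followers - k)
        for k in range(needed, followers + 1)
    )


for p in (0.5, 0.8, 1.0):
    print(p, expected_acks(5, p), quorum_probability(5, p))
```

Lowering the probability of Head reduces acknowledgment traffic but also reduces the chance that a single round of tosses produces a quorum, which is the tension the thesis addresses when choosing the probability for a given request arrival rate.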
Designing application software in wide area network settings
Progress in methodologies for developing robust local area network software has not been matched by similar results for wide area settings. The design of application software spanning multiple local area environments is examined. For important classes of applications, simple design techniques are presented that yield fault tolerant wide area programs. An implementation of these techniques as a set of tools for use within the ISIS system is described
GORDA: an open architecture for database replication
Database replication has been a common feature in database management systems (DBMSs) for a long time. In particular, asynchronous or lazy propagation of updates provides a simple yet efficient way of increasing performance and data availability and is widely available across the DBMS product spectrum. High-end systems additionally offer sophisticated conflict resolution and data propagation options as well as synchronous replication based on distributed locking and two-phase commit protocols. This paper presents the GORDA architecture and programming interface (GAPI), which enables different replication strategies to be implemented once and deployed in multiple DBMSs. This is achieved by proposing a reflective interface to transaction processing instead of relying on client interfaces or ad-hoc server extensions. The proposed approach is thus cost-effective, in enabling reuse of replication protocols or components in multiple DBMSs, as well as potentially efficient, as it allows close coupling with DBMS internals.