2,600 research outputs found

    Middleware support for locality-aware wide area replication

    Get PDF
    technical reportCoherent wide-area data caching can improve the scalability and responsiveness of distributed services such as wide-area file access, database and directory services, and content distribution. However, distributed services differ widely in the frequency of read/write sharing, the amount of contention between clients for the same data, and their ability to make tradeoffs between consistency and availability. Aggressive replication enhances the scalability and availability of services with read-mostly data or data that need not be kept strongly consistent. However, for applications that require strong consistency of writeshared data, you must throttle replication to achieve reasonable performance. We have developed a middleware data store called Swarm designed to support the widearea data sharing needs of distributed services. To support the needs of diverse distributed services, Swarm provides: (i) a failure-resilient proximity-aware data replication mechanism that adjusts the replication hierarchy based on observed network characteristics and node availability, (ii) a customizable consistency mechanism that allows applications to specify allowable consistency-availability tradeoffs, and (iii) a contention-aware caching mechanism that monitors contention between replicas and adjusts its replication policies accordingly. On a 240-node P2P file sharing system, Swarm's proximity-aware caching and replica hierarchy maintenance mechanisms improve latency by 80%, reduce WAN bandwidth consumed by 80%, and limit the impact of high node churn (5 node deaths/sec) to roughly one-fifth that of random replication. In addition, Swarm's contention-aware caching mechanism outperforms RPCs and static caching mechanisms at all levels of contention on an enterprise service workload

    Preliminary investigation of the practicality of an industrial training for engineering technology program-industries view

    Get PDF
    One of the important aspects of Engineering Technology (ET) program is the students must be able to apply a significant hands-on job throughout the program. Apart from laboratory work carried out at the university, the industrial training components can also contribute a significant practical work to enhance the skills of the students. In this study, the difference between ET and Engineering program is distinguished by proposing longer periods of industrial training in ET program. However, the effectiveness of longer periods of training must be investigated in order to find out whether this framework has to be retained for future training. For this, the university has structured the industrial training by imposing the students to undergo two (2) months training during the third (3rd) semester of year two (2), another two (2) months during the third (3rd) semester of year three (3) and finally, six (6) months during the last semester of fourth (4th) year (i.e. final semester). An interview has been conducted with two industrial panels to find out the effectiveness of the proposed training. A few suggestions and ideas given by both panels were considered for the development for industrial training syllabus in ET program

    ElfStore: A Resilient Data Storage Service for Federated Edge and Fog Resources

    Full text link
    Edge and fog computing have grown popular as IoT deployments become wide-spread. While application composition and scheduling on such resources are being explored, there exists a gap in a distributed data storage service on the edge and fog layer, instead depending solely on the cloud for data persistence. Such a service should reliably store and manage data on fog and edge devices, even in the presence of failures, and offer transparent discovery and access to data for use by edge computing applications. Here, we present Elfstore, a first-of-its-kind edge-local federated store for streams of data blocks. It uses reliable fog devices as a super-peer overlay to monitor the edge resources, offers federated metadata indexing using Bloom filters, locates data within 2-hops, and maintains approximate global statistics about the reliability and storage capacity of edges. Edges host the actual data blocks, and we use a unique differential replication scheme to select edges on which to replicate blocks, to guarantee a minimum reliability and to balance storage utilization. Our experiments on two IoT virtual deployments with 20 and 272 devices show that ElfStore has low overheads, is bound only by the network bandwidth, has scalable performance, and offers tunable resilience.Comment: 24 pages, 14 figures, To appear in IEEE International Conference on Web Services (ICWS), Milan, Italy, 201

    Operating System Support for Redundant Multithreading

    Get PDF
    Failing hardware is a fact and trends in microprocessor design indicate that the fraction of hardware suffering from permanent and transient faults will continue to increase in future chip generations. Researchers proposed various solutions to this issue with different downsides: Specialized hardware components make hardware more expensive in production and consume additional energy at runtime. Fault-tolerant algorithms and libraries enforce specific programming models on the developer. Compiler-based fault tolerance requires the source code for all applications to be available for recompilation. In this thesis I present ASTEROID, an operating system architecture that integrates applications with different reliability needs. ASTEROID is built on top of the L4/Fiasco.OC microkernel and extends the system with Romain, an operating system service that transparently replicates user applications. Romain supports single- and multi-threaded applications without requiring access to the application's source code. Romain replicates applications and their resources completely and thereby does not rely on hardware extensions, such as ECC-protected memory. In my thesis I describe how to efficiently implement replication as a form of redundant multithreading in software. I develop mechanisms to manage replica resources and to make multi-threaded programs behave deterministically for replication. I furthermore present an approach to handle applications that use shared-memory channels with other programs. My evaluation shows that Romain provides 100% error detection and more than 99.6% error correction for single-bit flips in memory and general-purpose registers. At the same time, Romain's execution time overhead is below 14% for single-threaded applications running in triple-modular redundant mode. The last part of my thesis acknowledges that software-implemented fault tolerance methods often rely on the correct functioning of a certain set of hardware and software components, the Reliable Computing Base (RCB). I introduce the concept of the RCB and discuss what constitutes the RCB of the ASTEROID system and other fault tolerance mechanisms. Thereafter I show three case studies that evaluate approaches to protecting RCB components and thereby aim to achieve a software stack that is fully protected against hardware errors

    Extended Fault Taxonomy of SOA-Based Systems

    Get PDF
    Service Oriented Architecture (SOA) is considered as a standard for enterprise software development. The main characteristics of SOA are dynamic discovery and composition of software services in a heterogeneous environment. These properties pose newer challenges in fault management of SOA-based systems (SBS). A proper understanding of different faults in an SBS is very necessary for effective fault handling. A comprehensive three-fold fault taxonomy is presented here that covers distributed, SOA specific and non-functional faults in a holistic manner. A comprehensive fault taxonomy is a key starting point for providing techniques and methods for accessing the quality of a given system. In this paper, an attempt has been made to outline several SBSs faults into a well-structured taxonomy that may assist developers to plan suitable fault repairing strategies. Some commonly emphasized fault recovery strategies are also discussed. Some challenges that may occur during fault handling of SBSs are also mentioned

    Flexible consistency for wide area peer replication

    Get PDF
    technical reportThe lack of a flexible consistency management solution hinders P2P implementation of applications involving updates, such as read-write file sharing, directory services, online auctions and wide area collaboration. Managing mutable shared data in a P2P setting requires a consistency solution that can operate efficiently over variable-quality failure-prone networks, support pervasive replication for scaling, and give peers autonomy to tune consistency to their sharing needs and resource constraints. Existing solutions lack one or more of these features. In this paper, we describe a new consistency model for P2P sharing of mutable data called composable consistency, and outline its implementation in a wide area middleware file service called Swarm1. Composable consistency lets applications compose consistency semantics appropriate for their sharing needs by combining a small set of primitive options. Swarm implements these options efficiently to support scalable, pervasive, failure-resilient, wide-area replication behind a simple yet flexible interface. We present two applications to demonstrate the expressive power and effectiveness of composable consistency: a wide area file system that outperforms Coda in providing close-to-open consistency overWANs, and a replicated BerkeleyDB database that reaps order-of-magnitude performance gains by relaxing consistency for queries and updates

    Achieving Causal Consistency under Partial Replication for Geo-distributed Cloud Storage

    Get PDF
    Causal consistency has emerged as an attractive middle-ground to architecting cloud storage systems, as it allows for high availability and low latency, while supporting stronger-than-eventual-consistency semantics. However, causally-consistent cloud storage systems have seen limited deployment in practice. A key factor is these systems employ full replication of all the data in all the data centers (DCs), incurring high cost. A simple extension of current causal systems to support partial replication by clustering DCs into rings incurs availability and latency problems. We propose Karma, the first system to enable causal consistency for partitioned data stores while achieving the cost advantages of partial replication without the availability and latency problems of the simple extension. Our evaluation with 64 servers emulating 8 geo-distributed DCs shows that Karma (i) incurs much lower cost than a fully-replicated causal store (obviously due to the lower replication factor); and (ii) offers higher availability and better performance than the above partial-replication extension at similar costs

    DATA REPLICATION IN DISTRIBUTED SYSTEMS USING OLYMPIAD OPTIMIZATION ALGORITHM

    Get PDF
    Achieving timely access to data objects is a major challenge in big distributed systems like the Internet of Things (IoT) platforms. Therefore, minimizing the data read and write operation time in distributed systems has elevated to a higher priority for system designers and mechanical engineers. Replication and the appropriate placement of the replicas on the most accessible data servers is a problem of NP-complete optimization. The key objectives of the current study are minimizing the data access time, reducing the quantity of replicas, and improving the data availability. The current paper employs the Olympiad Optimization Algorithm (OOA) as a novel population-based and discrete heuristic algorithm to solve the replica placement problem which is also applicable to other fields such as mechanical and computer engineering design problems. This discrete algorithm was inspired by the learning process of student groups who are preparing for the Olympiad exams. The proposed algorithm, which is divide-and-conquer-based with local and global search strategies, was used in solving the replica placement problem in a standard simulated distributed system. The 'European Union Database' (EUData) was employed to evaluate the proposed algorithm, which contains 28 nodes as servers and a network architecture in the format of a complete graph. It was revealed that the proposed technique reduces data access time by 39% with around six replicas, which is vastly superior to the earlier methods. Moreover, the standard deviation of the results of the algorithm's different executions is approximately 0.0062, which is lower than the other techniques' standard deviation within the same experiments

    Speculation in Parallel and Distributed Event Processing Systems

    Get PDF
    Event stream processing (ESP) applications enable the real-time processing of continuous flows of data. Algorithmic trading, network monitoring, and processing data from sensor networks are good examples of applications that traditionally rely upon ESP systems. In addition, technological advances are resulting in an increasing number of devices that are network enabled, producing information that can be automatically collected and processed. This increasing availability of on-line data motivates the development of new and more sophisticated applications that require low-latency processing of large volumes of data. ESP applications are composed of an acyclic graph of operators that is traversed by the data. Inside each operator, the events can be transformed, aggregated, enriched, or filtered out. Some of these operations depend only on the current input events, such operations are called stateless. Other operations, however, depend not only on the current event, but also on a state built during the processing of previous events. Such operations are, therefore, named stateful. As the number of ESP applications grows, there are increasingly strong requirements, which are often difficult to satisfy. In this dissertation, we address two challenges created by the use of stateful operations in a ESP application: (i) stateful operators can be bottlenecks because they are sensitive to the order of events and cannot be trivially parallelized by replication; and (ii), if failures are to be tolerated, the accumulated state of an stateful operator needs to be saved, saving this state traditionally imposes considerable performance costs. Our approach is to evaluate the use of speculation to address these two issues. For handling ordering and parallelization issues in a stateful operator, we propose a speculative approach that both reduces latency when the operator must wait for the correct ordering of the events and improves throughput when the operation in hand is parallelizable. In addition, our approach does not require that user understand concurrent programming or that he or she needs to consider out-of-order execution when writing the operations. For fault-tolerant applications, traditional approaches have imposed prohibitive performance costs due to pessimistic schemes. We extend such approaches, using speculation to mask the cost of fault tolerance.:1 Introduction 1 1.1 Event stream processing systems ......................... 1 1.2 Running example ................................. 3 1.3 Challenges and contributions ........................... 4 1.4 Outline ...................................... 6 2 Background 7 2.1 Event stream processing ............................. 7 2.1.1 State in operators: Windows and synopses ............................ 8 2.1.2 Types of operators ............................ 12 2.1.3 Our prototype system........................... 13 2.2 Software transactional memory.......................... 18 2.2.1 Overview ................................. 18 2.2.2 Memory operations............................ 19 2.3 Fault tolerance in distributed systems ...................................... 23 2.3.1 Failure model and failure detection ...................................... 23 2.3.2 Recovery semantics............................ 24 2.3.3 Active and passive replication ...................... 24 2.4 Summary ..................................... 26 3 Extending event stream processing systems with speculation 27 3.1 Motivation..................................... 27 3.2 Goals ....................................... 28 3.3 Local versus distributed speculation ....................... 29 3.4 Models and assumptions ............................. 29 3.4.1 Operators................................. 30 3.4.2 Events................................... 30 3.4.3 Failures .................................. 31 4 Local speculation 33 4.1 Overview ..................................... 33 4.2 Requirements ................................... 35 4.2.1 Order ................................... 35 4.2.2 Aborts................................... 37 4.2.3 Optimism control ............................. 38 4.2.4 Notifications ............................... 39 4.3 Applications.................................... 40 4.3.1 Out-of-order processing ......................... 40 4.3.2 Optimistic parallelization......................... 42 4.4 Extensions..................................... 44 4.4.1 Avoiding unnecessary aborts ....................... 44 4.4.2 Making aborts unnecessary........................ 45 4.5 Evaluation..................................... 47 4.5.1 Overhead of speculation ......................... 47 4.5.2 Cost of misspeculation .......................... 50 4.5.3 Out-of-order and parallel processing micro benchmarks ........... 53 4.5.4 Behavior with example operators .................... 57 4.6 Summary ..................................... 60 5 Distributed speculation 63 5.1 Overview ..................................... 63 5.2 Requirements ................................... 64 5.2.1 Speculative events ............................ 64 5.2.2 Speculative accesses ........................... 69 5.2.3 Reliable ordered broadcast with optimistic delivery .................. 72 5.3 Applications .................................... 75 5.3.1 Passive replication and rollback recovery ................................ 75 5.3.2 Active replication ............................. 80 5.4 Extensions ..................................... 82 5.4.1 Active replication and software bugs ..................................... 82 5.4.2 Enabling operators to output multiple events ........................ 87 5.5 Evaluation .................................... 87 5.5.1 Passive replication ............................ 88 5.5.2 Active replication ............................. 88 5.6 Summary ..................................... 93 6 Related work 95 6.1 Event stream processing engines ......................... 95 6.2 Parallelization and optimistic computing ................................ 97 6.2.1 Speculation ................................ 97 6.2.2 Optimistic parallelization ......................... 98 6.2.3 Parallelization in event processing .................................... 99 6.2.4 Speculation in event processing ..................... 99 6.3 Fault tolerance .................................. 100 6.3.1 Passive replication and rollback recovery ............................... 100 6.3.2 Active replication ............................ 101 6.3.3 Fault tolerance in event stream processing systems ............. 103 7 Conclusions 105 7.1 Summary of contributions ............................ 105 7.2 Challenges and future work ............................ 106 Appendices Publications 107 Pseudocode for the consensus protocol 10
    corecore