167 research outputs found

    Towards Middleware for Fault-tolerance in Distributed Real-time and Embedded Systems

    Abstract. Distributed real-time and embedded (DRE) systems often require support for multiple simultaneous quality of service (QoS) properties, such as real-timeliness and fault tolerance, that operate within resource-constrained environments. These resource constraints motivate the need for a lightweight middleware infrastructure, while the need for simultaneous QoS properties requires the middleware to provide fault tolerance capabilities that respect the time-critical needs of DRE systems. Conventional middleware solutions, such as Fault-tolerant CORBA (FT-CORBA) and Continuous Availability API for J2EE, have limited utility for DRE systems because they are heavyweight (e.g., the complexity of their feature-rich fault tolerance capabilities consumes excessive runtime resources), yet incomplete (e.g., they lack mechanisms that enable fault tolerance while maintaining real-time predictability). This paper provides three contributions to the development and standardization of lightweight real-time and fault-tolerant middleware for DRE systems. First, we discuss the challenges in realizing real-time fault-tolerant solutions for DRE systems using contemporary middleware. Second, we describe recent progress towards standardizing a CORBA lightweight fault-tolerance specification for DRE systems. Third, we present the architecture of FLARe, which is a prototype based on the OMG real-time fault-tolerant CORBA middleware standardization efforts that is lightweight (e.g., leverages only those server- and client-side mechanisms required for real-time systems) and predictable (e.g., provides fault-tolerant mechanisms that respect time-critical performance needs of DRE systems).

    On Maximizing the Efficiency of Multipurpose WSNs Through Avoidance of Over- or Under-Provisioning of Information

    A wireless sensor network (WSN) is a distributed collection of sensor nodes, which are resource-constrained and capable of operating with minimal user attendance. The core function of a WSN is to sample physical phenomena and their environment and transport the information of interest, such as current status or events, as required by the application. Furthermore, the operating conditions and/or user requirements of WSNs are often desired to be evolvable, driven either by changes in the monitored phenomena or by the properties of the WSN itself. Consequently, a key objective in setting up and configuring WSNs is to provide the desired information subject to user-defined quality requirements (accuracy, reliability, timeliness, etc.), while also considering their evolvability. The current state of the art addresses the functional blocks of sampling and information transport only in isolation. Each approach assumes that the respective other block is perfect in maintaining the highest possible information contribution. In addition, some of the approaches concentrate on only a few information attributes, such as accuracy, and ignore other attributes (e.g., reliability, timeliness, etc.). The existing research targeting these blocks usually tries to enhance the information quality (accuracy, reliability, timeliness, etc.) regardless of user requirements and therefore uses more resources, leading to faster energy depletion. However, we argue that it is not always necessary to provide the highest possible information quality. In fact, it is essential to avoid under- or over-provisioning of information in order to save valuable resources such as energy while just satisfying evolvable user requirements. More precisely, we show the interdependence of the different user requirements and how to co-design them in order to tune the level of provisioning. 
To discern the fundamental issues dictating the tunable co-design in WSNs, this thesis models and co-designs the sampling accuracy, information transport reliability and timeliness, and compares existing techniques. We highlight the key problems of existing techniques and provide solutions to achieve desired application requirements without under- or over-provisioning of information. Our first research direction is to provide tunable information transport. We show that it is possible to drastically improve efficiency while satisfying the evolvable user requirements on reliability and timeliness. In this regard, we provide a novel timeliness model and show the tradeoff between reliability and timeliness. In addition, we show that reliability and timeliness can work in composition to maximize efficiency in information transport. Second, we consider the sampling and information transport co-design, considering only the attributes of spatial accuracy and transport reliability. We provide a mathematical model in this regard and then show the optimization of the sampling and information transport co-design. The approach is based on optimally choosing the number of samples in order to minimize the number of retransmissions in the information transport while maintaining the required reliability. Third, we consider representing the physical phenomena accurately while optimizing the network performance. To this end, we jointly model accuracy, reliability and timeliness, and then derive the optimal combination of sampling and information transport. We provide an optimized model to choose the right representative sensor nodes to describe the phenomena and highlight the tunable co-design of sampling and information transport by avoiding over- or under-provisioning of information. Our simulation and experimental results show that the proposed tunable co-design supports evolving user requirements, copes with dynamic network properties and outperforms the state-of-the-art solutions.
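
The link between a reliability requirement and the number of (re)transmissions can be illustrated with a small sketch. This is not the thesis's model: it assumes, purely for illustration, that every transmission attempt succeeds independently with the same probability, and the function name `transmissions_needed` is hypothetical. Under that assumption, k attempts deliver the information with probability 1 - (1 - p)^k, so the sketch searches for the smallest k that meets a given target.

```python
def transmissions_needed(p_success: float, target_reliability: float) -> int:
    """Smallest k such that 1 - (1 - p_success)**k >= target_reliability.

    Illustrative assumption (not from the thesis): attempts are
    independent and each succeeds with the same probability p_success.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    if not 0.0 <= target_reliability < 1.0:
        raise ValueError("target_reliability must be in [0, 1)")

    k = 1
    # Probability that all k attempts so far have failed.
    p_all_fail = 1.0 - p_success
    while 1.0 - p_all_fail < target_reliability:
        k += 1
        p_all_fail *= 1.0 - p_success
    return k


if __name__ == "__main__":
    # A stricter target than the user needs costs extra transmissions.
    for target in (0.9, 0.999):
        print(target, transmissions_needed(0.5, target))
```

Under this simplified model, asking for a reliability of 0.999 instead of 0.9 on a link with a 0.5 success probability raises the number of attempts from 4 to 10, which is the kind of over-provisioning cost the abstract argues should be avoided when the user does not need it.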

    CAP Theorem: Revision of its related consistency models

    The CAP theorem states that only two of these properties can be simultaneously guaranteed in a distributed service: (i) consistency, (ii) availability, and (iii) network partition tolerance. This theorem was stated and proved assuming that "consistency" refers to atomic consistency. However, multiple consistency models exist and atomic consistency is located at the strongest edge of that spectrum. Many distributed services deployed in cloud platforms should be highly available and scalable. Network partitions may arise in those deployments and should be tolerated. One way of dealing with CAP constraints is to relax consistency. Therefore, it is interesting to explore the set of consistency models not supported in an available and partition-tolerant service (CAP-constrained models). Other weaker consistency models could be maintained when scalable services are deployed in partitionable systems (CAP-free models). Three contributions arise: (1) multiple other CAP-constrained models are identified, (2) a borderline between CAP-constrained and CAP-free models is set, and (3) a hierarchy of consistency models depending on their strength and convergence is built. Muñoz-Escoí, FD.; Juan Marín, RD.; García Escriva, JR.; González De Mendívil Moreno, JR.; Bernabeu Aubán, JM. (2019). CAP Theorem: Revision of its related consistency models. The Computer Journal. 62(6):943-960. https://doi.org/10.1093/comjnl/bxy142

    Exploiting cost-performance tradeoffs for modern cloud systems

    The trade-off between cost and performance is a fundamental challenge for modern cloud systems. This thesis explores cost-performance tradeoffs for three types of systems that permeate today's clouds, namely (1) storage, (2) virtualization, and (3) computation. A distributed key-value storage system must choose between the cost of keeping replicas synchronized (consistency) and the performance (latency) of read/write operations. A cloud-based disaster recovery system can reduce the cost of managing a group of VMs as a single unit for recovery by implementing this abstraction in software (instead of hardware), at the risk of impacting application availability. As another example, the run-time performance of graph analytics jobs sharing a multi-tenant cluster can be improved by trading off the cost of replicating the input graph data set stored in the associated distributed file system. Today, cloud system providers have to manually tune the system to meet desired trade-offs. This can be challenging since the optimal trade-off between cost and performance may vary depending on network and workload conditions. Thus our hypothesis is that it is feasible to imbue a wide variety of cloud systems with adaptive and opportunistic mechanisms to efficiently navigate the cost-performance tradeoff space to meet desired tradeoffs. The types of cloud systems considered in this thesis include key-value stores, cloud-based disaster recovery systems, and multi-tenant graph computation engines. Our first contribution, PCAP, is an adaptive distributed storage system. The foundation of the PCAP system is a probabilistic variation of the classical CAP theorem, which quantifies the (un-)achievable envelope of probabilistic consistency and latency under different network conditions characterized by a probabilistic partition model. 
Our PCAP system proposes adaptive mechanisms for tuning control knobs to meet desired consistency-latency tradeoffs expressed as service-level agreements (SLAs). Our second system, GeoPCAP, is a geo-distributed extension of PCAP. In GeoPCAP, we propose generalized probabilistic composition rules for composing consistency-latency tradeoffs across geo-distributed instances of distributed key-value stores, each running in a separate data center. GeoPCAP also includes a geo-distributed adaptive control system that adapts new control knobs to meet SLAs across geo-distributed data centers. Our third system, GCVM, proposes a lightweight hypervisor-managed mechanism for taking crash-consistent snapshots across VMs distributed over servers. This mechanism enables us to move the consistency group abstraction from hardware to software, and thus lowers reconfiguration cost while incurring modest VM pause times, which impact application availability. Finally, our fourth contribution is a new opportunistic graph processing system called OPTiC for efficiently scheduling multiple graph analytics jobs sharing a multi-tenant cluster. By opportunistically creating at most 1 additional replica in the distributed file system (thus incurring cost), we show up to 50% reduction in median job completion time for graph processing jobs under realistic network and workload conditions. Thus, with a modest increase in disk storage and bandwidth cost, we can reduce job completion time (improve performance). For the first two systems (PCAP and GeoPCAP), we exploit the cost-performance tradeoff space through efficient navigation of the tradeoff space to meet SLAs and perform close to the optimal tradeoff. For the third (GCVM) and fourth (OPTiC) systems, we move from one solution point to another solution point in the tradeoff space. For the last two systems, explicitly mapping out the tradeoff space allows us to consider new design tradeoffs for these systems.

    Real-Time Reliable Middleware for Industrial Internet-of-Things

    This dissertation contributes to the area of adaptive real-time and fault-tolerant systems research, applied to Industrial Internet-of-Things (IIoT) systems. Heterogeneous timing and reliability requirements arising from IIoT applications make it challenging for IIoT services to efficiently differentiate and meet such requirements. Specifically, IIoT services must both differentiate processing according to applications' timing requirements (including latency, event freshness, and relative consistency with one another) and enforce the needed levels of assurance for data delivery (even as far as ensuring zero data loss). It is nontrivial for an IIoT service to efficiently differentiate such heterogeneous IIoT timing/reliability requirements to fit each application, especially when facing increasingly large data traffic and when common fault-tolerant mechanisms tend to introduce latency and latency jitter. This dissertation presents a new adaptive real-time fault-tolerant framework for IIoT systems, along with efficient and adaptive strategies to meet each IIoT application's timing/reliability requirements. The contributions of the framework are demonstrated by three new IIoT middleware services: (1) Cyber-Physical Event Processing (CPEP), which both differentiates application-specific latency requirements and enforces cyber-physical timing constraints, by prioritizing, sharing, and shedding event processing. (2) Fault-Tolerant Real-Time Messaging (FRAME), which integrates real-time capabilities with a primary-backup replication system, to fit each application's unique timing and loss-tolerance requirements. 
(3) Adaptive Real-Time Reliable Edge Computing (ARREC), which leverages heterogeneous loss-tolerance requirements and their different temporal laxities, to perform selective and lazy (yet timely) data replication, thus allowing the system to meet needed levels of loss-tolerance while reducing both the latency and bandwidth penalties that are typical of fault-tolerant sub-systems.
