
    A software architecture for consensus based replication

    Advisor: Luiz Eduardo Buzato. Thesis (doctorate), Universidade Estadual de Campinas, Instituto de Computação. Abstract: This thesis explores one of the fundamental tools for the construction of distributed systems: the replication of software components. Specifically, we attempted to solve the problem of simplifying the construction of high-performance, high-availability replicated applications. We developed Treplica, a replication library, as the main tool to reach this research objective. Treplica allows the construction of distributed applications that behave as centralized applications, presenting the programmer with a simple interface based on an object-oriented specification for active replication. The conclusion we reach in this thesis is that it is possible to create modular, simple-to-use support for replication that provides high performance, low latency, and fast recovery in the presence of failures. We believe the proposed software architecture is applicable to any distributed system, but it is of particular interest to systems that remain centralized due to the lack of a simple, efficient, and reliable replication mechanism. Doctorate in Computing Systems; degree: Doctor of Computer Science.
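    The abstract describes Treplica's interface only at a high level, so the following is a minimal, hypothetical sketch of what an object-oriented active-replication interface of this kind can look like; all class and method names are invented for illustration and are not Treplica's actual API. The key idea is that the application is written as a deterministic state machine whose actions are delivered in the same consensus-decided order to every replica.

```python
# Hypothetical sketch of an object-oriented active-replication interface.
# None of these names come from Treplica itself; they only illustrate the
# pattern the abstract describes: the application implements a deterministic
# state machine, and the library delivers the same ordered actions to every
# replica so all copies stay consistent.

from abc import ABC, abstractmethod


class Action(ABC):
    """A deterministic operation applied to the replicated state."""

    @abstractmethod
    def execute_on(self, state):
        ...


class ReplicatedStateMachine:
    """Every replica applies the same actions in the same consensus-decided order."""

    def __init__(self, initial_state):
        self.state = initial_state
        self.log = []  # totally ordered action log

    def execute(self, action: Action):
        # In a real library this call would first run a consensus round
        # (e.g. Paxos) so all replicas agree on the action's log position.
        self.log.append(action)
        return action.execute_on(self.state)


class Deposit(Action):
    """Example application action: deterministic, so replicas stay identical."""

    def __init__(self, amount):
        self.amount = amount

    def execute_on(self, state):
        state["balance"] += self.amount
        return state["balance"]


if __name__ == "__main__":
    replica = ReplicatedStateMachine({"balance": 0})
    print(replica.execute(Deposit(10)))  # -> 10, identically at every replica
```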

    Using embedded hardware monitor cores in critical computer systems

    The integration of FPGA devices in many different architectures and services makes monitoring and real-time detection of errors an important concern in FPGA system design. A monitor is a tool, or a set of tools, that facilitates analytic measurements when observing a given system. The goal of these observations is usually performance analysis and optimisation, or surveillance of the system. However, System-on-Chip (SoC) based designs leave few points at which external tools such as logic analysers can be attached. Thus, an embedded error detection core that allows observation of critical system nodes (such as processor cores and buses) should safeguard the operation of the FPGA-based system in order to prevent system failures. The core should not interfere with system performance and must ensure timely detection of errors. This thesis is an investigation into how a robust hardware-monitoring module can be efficiently integrated into a target PCI board (with FPGA-based application processing features) which is part of a critical computing system. [Continues.]
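    The thesis concerns an embedded hardware core, so no software from it is shown here; the fragment below is only a hypothetical software sketch of the behaviour such a monitor must provide: observe a critical node and raise an error within a bounded time when the node stops making progress.

```python
# Hypothetical software sketch of the monitoring behaviour described above
# (the actual work is an embedded hardware core, not Python): observe a
# critical node and flag an error when it stops advancing within a deadline.


def check_heartbeat(samples, timeout_s):
    """samples: list of (time_s, counter) observations of the monitored node."""
    last_t, last_c = samples[0]
    for t, c in samples[1:]:
        if c == last_c and t - last_t > timeout_s:
            return f"error detected at t={t:.2f}s: monitored node stalled"
        if c != last_c:
            last_t, last_c = t, c
    return "no error detected"


if __name__ == "__main__":
    trace = [(0.0, 0), (0.1, 1), (0.2, 2), (0.5, 2), (0.9, 2)]
    print(check_heartbeat(trace, timeout_s=0.25))
```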

    Autonomic Overload Management For Large-Scale Virtualized Network Functions

    The explosion of data traffic in telecommunication networks has been impressive in the last few years. To keep up with the high demand and stay profitable, telcos are embracing the Network Function Virtualization (NFV) paradigm by shifting from hardware network appliances to software virtual network functions, which are expected to support extremely large-scale architectures, providing both high performance and high reliability. The main objective of this dissertation is to provide frameworks and techniques to enable proper overload detection and mitigation for the emerging virtualized software-based network services. The thesis contribution is threefold. First, it proposes a novel approach to quickly detect performance anomalies in complex and large-scale VNF services. Second, it presents NFV-Throttle, an autonomic overload control framework to protect NFV services from overload within a short period of time, preserving the QoS of traffic flows admitted by network services in response to both traffic spikes (up to 10x the available capacity) and capacity reduction due to infrastructure problems (such as CPU contention). Third, it proposes DRACO, to manage overload problems arising in novel large-scale multi-tier applications, such as complex stateful network functions in which the state is spread across modern key-value stores to achieve both scalability and performance. DRACO performs fine-grained admission control, tuning the amount and type of traffic according to datastore node dependencies among the tiers (which are dynamically discovered at run time) and to the current capacity of individual nodes, in order to mitigate overloads and prevent hot-spots. This thesis presents the implementation details and an extensive experimental evaluation of all the above overload management solutions, by means of a virtualized IP Multimedia Subsystem (IMS), which provides modern multimedia services for telco operators, such as video conferencing and VoLTE, and which is one of the top use cases of NFV technology.
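    DRACO's actual admission-control algorithm is not given in the abstract; the sketch below only illustrates, under invented names and a deliberately simplified drop policy, the general idea of fine-grained, capacity-aware admission control: traffic destined for a backend datastore node is admitted only while that node has spare capacity, and the excess is shed before it can create a hot-spot.

```python
# Minimal, hypothetical sketch of capacity-aware admission control in the
# spirit described above: admit only as much traffic per backend node as the
# node's current capacity allows, dropping the excess to avoid hot-spots.
# Names and the simple drop policy are assumptions, not DRACO's actual design.


def admit(requests, node_capacity):
    """requests: list of (request_id, node) pairs; node_capacity: node -> max load."""
    admitted, rejected = [], []
    load = {node: 0 for node in node_capacity}
    for req_id, node in requests:
        if load.get(node, 0) < node_capacity.get(node, 0):
            load[node] = load.get(node, 0) + 1
            admitted.append(req_id)
        else:
            rejected.append(req_id)  # shed load instead of overloading the node
    return admitted, rejected


if __name__ == "__main__":
    reqs = [(i, "kv-node-1" if i % 2 == 0 else "kv-node-2") for i in range(10)]
    ok, shed = admit(reqs, {"kv-node-1": 3, "kv-node-2": 5})
    print(len(ok), "admitted,", len(shed), "rejected")
```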

    Design of a fault tolerant airborne digital computer. Volume 1: Architecture

    This volume is concerned with the architecture of a fault-tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) the capacity is to be matched to the aircraft environment; (2) the reliability is to be selectively matched to the criticality and deadline requirements of each of the computations; (3) the system is to be readily expandable and contractible; and (4) the design is to be appropriate to post-1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition, SIFT is particularly simple and believable. The other candidates, the Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor, are potentially more efficient than SIFT in their use of redundancy, but otherwise are not as attractive.
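    The abstract gives no design details, but the core mechanism behind SIFT-style software-implemented fault tolerance is redundant execution with majority voting. The fragment below is an illustrative sketch of that voting step only; the function names and the triplex example are assumptions, not the SIFT design.

```python
# Illustrative sketch of the software-voting idea behind SIFT-style fault
# tolerance: run the same computation on several (simulated) processors and
# take the majority result, masking a single faulty unit.

from collections import Counter


def vote(results):
    """Return the majority value among redundant results, or None if no majority."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None


if __name__ == "__main__":
    # Two correct processors and one faulty one: the fault is masked.
    print(vote([42, 42, 17]))  # -> 42
```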

    Doctor of Philosophy

    The next generation mobile network (i.e., the 5G network) is expected to host emerging use cases with a wide range of requirements, from Internet of Things (IoT) devices that prefer a low-overhead and scalable network to remote machine operation or remote healthcare services that require reliable end-to-end communications. Improving scalability and reliability is among the most important challenges in designing the next generation mobile architecture. The current (4G) mobile core network relies heavily on hardware-based proprietary components. The core networks are expensive and therefore available only in a limited number of locations in the country. This leads to high end-to-end latency, because of the long latency between base stations and the mobile core, and limits innovation and the evolvability of the network. Moreover, at the protocol level the current mobile network architecture was designed for a limited number of smartphones streaming a large amount of high-quality traffic, not for a massive number of low-capability devices sending small and sporadic traffic. This results in high-overhead control and data planes in the mobile core network that are not suitable for a massive number of future Internet-of-Things (IoT) devices. In terms of reliability, network operators have already deployed multiple monitoring systems to detect service disruptions and fix problems when they occur. However, detecting all service disruptions is challenging. First, there is a complex relationship between the network status and the user-perceived service experience. Second, service disruptions can happen for reasons beyond the network itself. With technology advancements in Software-Defined Networking (SDN) and Network Function Virtualization (NFV), the next generation mobile network is expected to be NFV-based and deployed on NFV platforms. However, in contrast to telecom-grade hardware with built-in redundancy, the commodity off-the-shelf (COTS) hardware in NFV platforms often cannot match it in terms of reliability. Availability of telecom-grade mobile core network hardware is typically 99.999% (i.e., "five-9s" availability), while most NFV platforms only guarantee "three-9s" availability, orders of magnitude less reliable. Therefore, an NFV-based mobile core network needs extra mechanisms to guarantee its availability. This Ph.D. dissertation focuses on using SDN/NFV, data analytics, and distributed systems techniques to enhance the scalability and reliability of the next generation mobile core network. The dissertation makes the following contributions. First, it presents SMORE, a practical offloading architecture that reduces end-to-end latency and enables new functionalities in mobile networks. It then presents SIMECA, a lightweight and scalable mobile core network designed for a massive number of future IoT devices. Second, it presents ABSENCE, a passive service monitoring system that uses customer usage data and data analytics to detect silent failures in an operational mobile network. Lastly, it presents ECHO, a distributed mobile core network architecture that improves the availability of NFV-based mobile core networks in public clouds.
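    To make the "five-9s" versus "three-9s" gap concrete, the short calculation below shows why replicating a network function over commodity platforms can recover the lost availability, under the simplifying (and optimistic) assumption that replicas fail independently. This is only a back-of-the-envelope illustration of the motivation for extra availability mechanisms such as ECHO, not a result from the dissertation.

```python
# Back-of-the-envelope availability calculation, assuming replicas fail
# independently (a simplifying assumption): the service is down only when all
# replicas are down, so availability = 1 - (1 - a)^n.


def replicated_availability(a, n):
    return 1 - (1 - a) ** n


if __name__ == "__main__":
    three_nines = 0.999
    print(replicated_availability(three_nines, 1))  # 0.999     ("three 9s")
    print(replicated_availability(three_nines, 2))  # 0.999999  (exceeds "five 9s")
```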

    Checkpoint-based forward recovery using lookahead execution and rollback validation in parallel and distributed systems

    This thesis studies a forward recovery strategy using checkpointing and optimistic execution in parallel and distributed systems. The approach uses replicated tasks executing on different processors for forward recovery and checkpoint comparison for error detection. To reduce overall redundancy, this approach employs lower static redundancy in the common error-free situation to detect errors than the standard N-Modular Redundancy (NMR) scheme does to mask them. For the rare occurrence of an error, the approach uses some extra redundancy for recovery. To reduce the run-time recovery overhead, look-ahead processes are used to advance the computation speculatively, and a rollback process is used to produce a diagnosis identifying the correct look-ahead processes without rolling back the whole system. Both analytical and experimental evaluations have shown that this strategy can provide nearly error-free execution time, even under faults, with a lower average redundancy than NMR.
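    The following fragment is a toy sketch, with an invented task, state layout, and hashing choice, of the duplex checkpoint-comparison step described above: two copies of a task run on different processors, their checkpoints are compared, and a mismatch signals an error that would then trigger the look-ahead and rollback-validation machinery.

```python
# Toy sketch of duplex checkpoint comparison for error detection: two copies
# of a task run on different processors and their checkpoints are compared;
# a mismatch signals an error and triggers the look-ahead/rollback recovery.
# The task, state layout, and hashing choice are illustrative assumptions.

import hashlib
import pickle


def checkpoint(state):
    """Fingerprint of a task's state, used for cheap comparison between replicas."""
    return hashlib.sha256(pickle.dumps(state, protocol=4)).hexdigest()


def run_step(state, fault=False):
    state = dict(state)
    state["counter"] += 1
    if fault:  # model a transient error hitting one of the two processors
        state["counter"] += 100
    return state


if __name__ == "__main__":
    a = {"counter": 0}
    b = {"counter": 0}
    a, b = run_step(a), run_step(b, fault=True)
    if checkpoint(a) != checkpoint(b):
        print("checkpoint mismatch: start look-ahead and rollback validation")
```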

    Data-driven resiliency assessment of medical cyber-physical systems

    Advances in computing, networking, and sensing technologies have resulted in the ubiquitous deployment of medical cyber-physical systems in various clinical and personalized settings. The increasing complexity and connectivity of such systems, the tight coupling between their cyber and physical components, and the inevitable involvement of human operators in supervision and control have introduced major challenges in ensuring system reliability, safety, and security. This dissertation takes a data-driven approach to resiliency assessment of medical cyber-physical systems. Driven by large-scale studies of real safety incidents involving medical devices, we develop techniques and tools for (i) a deeper understanding of incident causes and measurement of their impacts, (ii) validation of system safety mechanisms in the presence of realistic hazard scenarios, and (iii) preemptive real-time detection of safety hazards to mitigate adverse impacts on patients. We present a framework for automated analysis of structured and unstructured data from public FDA databases on medical device recalls and adverse events. This framework allows characterization of the safety issues originating from computer failures in terms of fault classes, failure modes, and recovery actions. We develop an approach for constructing ontology models that enable automated extraction of safety-related features from unstructured text. The proposed ontology model is defined based on device-specific human-in-the-loop control structures in order to facilitate systems-theoretic causality analysis of adverse events. Our large-scale analysis of FDA data shows that medical devices are often recalled because of failure to identify all potential safety hazards, use of safety mechanisms that have not been rigorously validated, and limited capability for real-time detection and automated mitigation of hazards. To address these problems, we develop a safety hazard injection framework for experimental validation of safety mechanisms in the presence of accidental failures and malicious attacks. To reduce the test space for safety validation, this framework uses systems-theoretic accident causality models to identify the critical locations within the system at which to target software fault injection. For mitigation of safety hazards at run time, we present a model-based analysis framework that estimates the consequences of control commands sent from the software to the physical system through real-time computation of the system's dynamics, and preemptively detects whether a command is unsafe before its adverse consequences manifest in the physical system. The proposed techniques are evaluated on a real-world cyber-physical system for robot-assisted minimally invasive surgery and are shown to be more effective than existing methods in identifying system vulnerabilities and deficiencies in safety mechanisms, as well as in preemptive detection of safety hazards caused by malicious attacks.
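    The run-time mitigation idea above, checking a command against a model of the physical system before it is executed, can be illustrated with the toy example below; the one-dimensional dynamics and the position limit are invented for illustration and are unrelated to the surgical robot actually studied.

```python
# Toy sketch of preemptive command checking: before forwarding a command to
# the physical system, simulate the system's dynamics one step ahead and
# block the command if the predicted state violates a safety limit.  The
# linear dynamics and the position limit are invented for illustration.


def predict_next(position, velocity_cmd, dt=0.01):
    """Very simple kinematic model: next position after applying the command."""
    return position + velocity_cmd * dt


def is_safe(command, position, limit=1.0):
    return abs(predict_next(position, command)) <= limit


if __name__ == "__main__":
    pos = 0.995
    for cmd in (0.2, 2.0):  # the second command would push past the limit
        print(cmd, "->", "allow" if is_safe(cmd, pos) else "block")
```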

    Survey and Analysis of Production Distributed Computing Infrastructures

    This report has two objectives. First, we describe a set of the production distributed infrastructures currently available, so that the reader has a basic understanding of them. This includes explaining why each infrastructure was created and made available, and how it has succeeded and failed. The set is not complete, but we believe it is representative. Second, we describe the infrastructures in terms of their use, which is a combination of how they were designed to be used and how users have found ways to use them. Applications are often designed and created with specific infrastructures in mind, with both an appreciation of the existing capabilities provided by those infrastructures and an anticipation of their future capabilities. Here, the infrastructures we discuss were often designed and created with specific applications in mind, or at least specific types of applications. The reader should understand how the interplay between the infrastructure providers and the users leads to such usages, which we call usage modalities. These usage modalities are really abstractions that exist between the infrastructures and the applications; they influence the infrastructures by representing the applications, and they influence the applications by representing the infrastructures.