24 research outputs found

    Byzantine fault-tolerant agreement protocols for wireless Ad hoc networks

    Get PDF
    Tese de doutoramento, Informática (Ciências da Computação), Universidade de Lisboa, Faculdade de Ciências, 2010.The thesis investigates the problem of fault- and intrusion-tolerant consensus in resource-constrained wireless ad hoc networks. This is a fundamental problem in distributed computing because it abstracts the need to coordinate activities among various nodes. It has been shown to be a building block for several other important distributed computing problems like state-machine replication and atomic broadcast. The thesis begins by making a thorough performance assessment of existing intrusion-tolerant consensus protocols, which shows that the performance bottlenecks of current solutions are in part related to their system modeling assumptions. Based on these results, the communication failure model is identified as a model that simultaneously captures the reality of wireless ad hoc networks and allows the design of efficient protocols. Unfortunately, the model is subject to an impossibility result stating that there is no deterministic algorithm that allows n nodes to reach agreement if more than n2 omission transmission failures can occur in a communication step. This result is valid even under strict timing assumptions (i.e., a synchronous system). The thesis applies randomization techniques in increasingly weaker variants of this model, until an efficient intrusion-tolerant consensus protocol is achieved. The first variant simplifies the problem by restricting the number of nodes that may be at the source of a transmission failure at each communication step. An algorithm is designed that tolerates f dynamic nodes at the source of faulty transmissions in a system with a total of n 3f + 1 nodes. The second variant imposes no restrictions on the pattern of transmission failures. The proposed algorithm effectively circumvents the Santoro- Widmayer impossibility result for the first time. It allows k out of n nodes to decide despite dn 2 e(nk)+k2 omission failures per communication step. This algorithm also has the interesting property of guaranteeing safety during arbitrary periods of unrestricted message loss. The final variant shares the same properties of the previous one, but relaxes the model in the sense that the system is asynchronous and that a static subset of nodes may be malicious. The obtained algorithm, called Turquois, admits f < n 3 malicious nodes, and ensures progress in communication steps where dnf 2 e(n k f) + k 2. The algorithm is subject to a comparative performance evaluation against other intrusiontolerant protocols. The results show that, as the system scales, Turquois outperforms the other protocols by more than an order of magnitude.Esta tese investiga o problema do consenso tolerante a faltas acidentais e maliciosas em redes ad hoc sem fios. Trata-se de um problema fundamental que captura a essência da coordenação em actividades envolvendo vários nós de um sistema, sendo um bloco construtor de outros importantes problemas dos sistemas distribuídos como a replicação de máquina de estados ou a difusão atómica. A tese começa por efectuar uma avaliação de desempenho a protocolos tolerantes a intrusões já existentes na literatura. Os resultados mostram que as limitações de desempenho das soluções existentes estão em parte relacionadas com o seu modelo de sistema. Baseado nestes resultados, é identificado o modelo de falhas de comunicação como um modelo que simultaneamente permite capturar o ambiente das redes ad hoc sem fios e projectar protocolos eficientes. Todavia, o modelo é restrito por um resultado de impossibilidade que afirma não existir algoritmo algum que permita a n nós chegaram a acordo num sistema que admita mais do que n2 transmissões omissas num dado passo de comunicação. Este resultado é válido mesmo sob fortes hipóteses temporais (i.e., em sistemas síncronos) A tese aplica técnicas de aleatoriedade em variantes progressivamente mais fracas do modelo até ser alcançado um protocolo eficiente e tolerante a intrusões. A primeira variante do modelo, de forma a simplificar o problema, restringe o número de nós que estão na origem de transmissões faltosas. É apresentado um algoritmo que tolera f nós dinâmicos na origem de transmissões faltosas em sistemas com um total de n 3f + 1 nós. A segunda variante do modelo não impõe quaisquer restrições no padrão de transmissões faltosas. É apresentado um algoritmo que contorna efectivamente o resultado de impossibilidade Santoro-Widmayer pela primeira vez e que permite a k de n nós efectuarem progresso nos passos de comunicação em que o número de transmissões omissas seja dn 2 e(n k) + k 2. O algoritmo possui ainda a interessante propriedade de tolerar períodos arbitrários em que o número de transmissões omissas seja superior a . A última variante do modelo partilha das mesmas características da variante anterior, mas com pressupostos mais fracos sobre o sistema. Em particular, assume-se que o sistema é assíncrono e que um subconjunto estático dos nós pode ser malicioso. O algoritmo apresentado, denominado Turquois, admite f < n 3 nós maliciosos e assegura progresso nos passos de comunicação em que dnf 2 e(n k f) + k 2. O algoritmo é sujeito a uma análise de desempenho comparativa com outros protocolos na literatura. Os resultados demonstram que, à medida que o número de nós no sistema aumenta, o desempenho do protocolo Turquois ultrapassa os restantes em mais do que uma ordem de magnitude.FC

    In-Production Continuous Testing for Future Telco Cloud

    Get PDF
    Software Defined Networking (SDN) is an emerging paradigm to design, build and operate networks. The driving motivation of SDN was the need for a major change in network technologies to support a configuration, management, operation, reconfiguration and evolution than in current computer networks. In the SDN world, performance it is not only related to the behaviour of the data plane. As the separation of control plane and data plane makes the latter significantly more agile, it lays off all the complex processing workload to the control plane. This is further exacerbated in distributed network controller, where the control plane is additionally loaded with the state synchronization overhead. Furthermore, the introduction of SDNs technologies has raised advanced challenges in achieving failure resilience, meant as the persistence of service delivery that can justifiably be trusted, when facing changes, and fault tolerance, meant as the ability to avoid service failures in the presence of faults. Therefore, along with the “softwarization” of network services, it is an important goal in the engineering of such services, e.g. SDNs and NFVs, to be able to test and assess the proper functioning not only in emulated conditions before release and deployment, but also “in-production”, when the system is under real operating conditions.   The goal of this thesis is to devise an approach to evaluate not only the performance, but also the effectiveness of the failure detection, and mitigation mechanisms provided by SDN controllers, as well as the capability of the SDNs to ultimately satisfy nonfunctional requirements, especially resiliency, availability, and reliability. The approach consists of exploiting benchmarking techniques, such as the failure injection, to get continuously feedback on the performance as well as capabilities of the SDN services to survive failures, which is of paramount importance to improve the effective- ness of the system internal mechanisms in reacting to anomalous situations potentially occurring in operation, while its services are regularly updated or improved. Within this vision, this dissertation first presents SCP-CLUB (SDN Control Plane CLoUd-based Benchmarking), a benchmarking frame- work designed to automate the characterization of SDN control plane performance, resilience and fault tolerance in telco cloud deployments. The idea is to provide the same level of automation available in deploying NFV function, for the testing of different configuration, using idle cycles of the telco cloud infrastructure. Then, the dissertation proposes an extension of the framework with mechanisms to evaluate the runtime behaviour of a Telco Cloud SDN under (possibly unforeseen) failure conditions, by exploiting the software failure injection

    Efficient fault-injection-based assessment of software-implemented hardware fault tolerance

    Get PDF
    With continuously shrinking semiconductor structure sizes and lower supply voltages, the per-device susceptibility to transient and permanent hardware faults is on the rise. A class of countermeasures with growing popularity is Software-Implemented Hardware Fault Tolerance (SIHFT), which avoids expensive hardware mechanisms and can be applied application-specifically. However, SIHFT can, against intuition, cause more harm than good, because its overhead in execution time and memory space also increases the figurative “attack surface” of the system – it turns out that application-specific configuration of SIHFT is in fact a necessity rather than just an advantage. Consequently, target programs need to be analyzed for particularly critical spots to harden. SIHFT-hardened programs need to be measured and compared throughout all development phases of the program to observe reliability improvements or deteriorations over time. Additionally, SIHFT implementations need to be tested. The contributions of this dissertation focus on Fault Injection (FI) as an assessment technique satisfying all these requirements – analysis, measurement and comparison, and test. I describe the design and implementation of an FI tool, named Fail*, that overcomes several shortcomings in the state of the art, and enables research on the general drawbacks of simulation-based FI. As demonstrated in four case studies in the context of SIHFT research, Fail* provides novel fine-grained analysis techniques that exploit the newly gained possibility to analyze FI results from complete fault-space exploration. These analysis techniques aid SIHFT design decisions on the level of program modules, functions, variables, source-code lines, or single machine instructions. Based on the experience from the case studies, I address the problem of large computation efforts that accompany exhaustive fault-space exploration from two different angles: Firstly, I develop a heuristical fault-space pruning technique that allows to freely trade the total FI-experiment count for result accuracy, while still providing information on all possible faultspace coordinates. Secondly, I speed up individual TAP-based FI experiments by improving the fast-forwarding operation by several orders of magnitude for most workloads. Finally, I dissect current practices in FI-based evaluation of SIHFT-hardened programs, identify three widespread pitfalls in the result interpretation, and advance the state of the art by defining a novel comparison metric

    Improving Network Performance Through Endpoint Diagnosis And Multipath Communications

    Get PDF
    Components of networks, and by extension the internet can fail. It is, therefore, important to find the points of failure and resolve existing issues as quickly as possible. Resolution, however, takes time and its important to maintain high quality of service (QoS) for existing clients while it is in progress. In this work, our goal is to provide clients with means of avoiding failures if/when possible to maintain high QoS while enabling them to assist in the diagnosis process to speed up the time to recovery. Fixing failures relies on first detecting that there is one and then identifying where it occurred so as to be able to remedy it. We take a two-step approach in our solution. First, we identify the entity (Client, Server, Network) responsible for the failure. Next, if a failure is identified as network related additional algorithms are triggered to detect the device responsible. To achieve the first step, we revisit the question: how much can you infer about a failure using TCP statistics collected at one of the endpoints in a connection? Using an agent that captures TCP statistics at one of the end points we devise a classification algorithm that identifies the root cause of failures. Using insights derived from this classification algorithm we identify dominant TCP metrics that indicate where/why problems occur. If/when a failure is identified as a network related problem, the second step is triggered, where the algorithm uses additional information that is collected from ``failed\u27\u27 connections to identify the device which resulted in the failure. Failures are also disruptive to user\u27s performance. Resolution may take time. Therefore, it is important to be able to shield clients from their effects as much as possible. One option for avoiding problems resulting from failures is to rely on multiple paths (they are unlikely to go bad at the same time). The use of multiple paths involves both selecting paths (routing) and using them effectively. The second part of this thesis explores the efficacy of multipath communication in such situations. It is expected that multi-path communications have monetary implications for the ISP\u27s and content providers. Our solution, therefore, aims to minimize such costs to the content providers while significantly improving user performance

    Quarantine-mode based live patching for zero downtime safety-critical systems

    Get PDF
    150 p.En esta tesis se presenta una arquitectura y diseño de software, llamado Cetratus, que permite las actualizaciones en caliente en sistemas críticos, donde se efectúan actualizaciones dinámicas de los componentes de la aplicación. La característica principal es la ejecución y monitorización en modo cuarentena, donde la nueva versión del software es ejecutada y monitorizada hasta que se compruebe la confiabilidad de esta nueva versión. Esta característica también ofrece protección contra posibles fallos de software y actualización, así como la propagación de esos fallos a través del sistema. Para este propósito, se emplean técnicas de particionamiento. Aunque la actualización del software es iniciada por el usuario Updater, se necesita la ratificación del auditor para poder proceder y realizar la actualización dinámica. Estos usuarios son autenticados y registrados antes de continuar con la actualización. También se verifica la autenticidad e integridad del parche dinámico. Cetratus está alineado con las normativas de seguridad funcional y de ciber-seguridad industriales respecto a las actualizaciones de software.Se proporcionan dos casos de estudio. Por una parte, en el caso de uso de energía inteligente, se analiza una aplicación de gestión de energía eléctrica, compuesta por un sistema de gestión de energía (BEMS por sus siglas en ingles) y un servicio de optimización de energía en la nube (BEOS por sus siglas en ingles). El BEMS monitoriza y controla las instalaciones de energía eléctrica en un edificio residencial. Toda la información relacionada con la generación, consumo y ahorro es enviada al BEOS, que estima y optimiza el consumo general del edificio para reducir los costes y aumentar la eficiencia energética. En este caso de estudio se incorpora una nueva capa de ciberseguridad para aumentar la ciber-seguridad y privacidad de los datos de los clientes. Específicamente, se utiliza la criptografía homomorfica. Después de la actualización, todos los datos son enviados encriptados al BEOS.Por otro lado, se presenta un caso de estudio ferroviario. En este ejemplo se actualiza el componente Euroradio, que es la que habilita las comunicaciones entre el tren y el equipamiento instalado en las vías en el sistema de gestión de tráfico ferroviario en Europa (ERTMS por sus siglas en ingles). En el ejemplo se actualiza el algoritmo utilizado para el código de autenticación del mensaje (MAC por sus siglas en inglés) basado en el algoritmo de encriptación AES, debido a los fallos de seguridad del algoritmo actual

    Resiliency of high-performance computing systems: A fault-injection-based characterization of the high-speed network in the blue waters testbed

    Get PDF
    Supercomputers have played an essential role in the progress of science and engineering research. As the high-performance computing (HPC) community moves towards the next generation of HPC computing, it faces several challenges, one of which is reliability of HPC systems. Error rates are expected to significantly increase on exascale systems to the point where traditional application-level checkpointing may no longer be a viable fault tolerance mechanism. This poses serious ramifications for a system's ability to guarantee reliability and availability of its resources. It is becoming increasingly important to understand fault-to-failure propagation and to identify key areas of instrumentation in HPC systems for avoidance, detection, diagnosis, mitigation, and recovery of faults. This thesis presents a software-implemented, prototype-based fault injection tool called HPCArrow and a fault injection methodology as a means to investigate and evaluate HPC application and system resiliency. We demonstrate HPCArrow's capabilities through four fault injection campaigns on a Cray XE/XK hybrid testbed, covering single injections, time-varying or delayed injections, and injections during recovery. These injections emulate failures on network and compute components. The results of these campaigns provide insight into application-level and system-level resiliencies. Across various HPC application frameworks, there are notable deficiencies in fault tolerance. Our experiments also revealed a failure phenomenon that was previously unobserved in field data: application hangs, in which forward progress is not made, but jobs are not terminated until the maximum allowed time has elapsed. At the system level, failover procedures prove highly robust on small-scale systems, able to handle both single and multiple faults in the network

    Data-Driven Detection and Diagnosis of System-Level Failures in Middleware-Based Service Compositions

    Get PDF
    Service-oriented technologies have simplified the development of large, complex software systems that span administrative boundaries. Developers have been enabled to build applications as compositions of services through middleware that hides much of the underlying complexity. The resulting applications inhabit complex, multi-tier operating environments that pose many challenges to their reliable operation and often lead to failures at runtime. Two key aspects of the time to repair a failure are the time to its detection and to the diagnosis of its cause. The prevalent approach to detection and diagnosis is primarily based on ad-hoc monitoring as well as operator experience and intuition. This is inefficient and leads to decreased availability. We propose an approach to data-driven detection and diagnosis in order to decrease the repair time of failures in middleware-based service compositions. Data-driven diagnosis supports system operators with information about the operation and structure of a service composition. We discuss how middleware-based service compositions can be monitored in a comprehensive, yet non-intrusive manner and present a process to discover system structure by processing deployment information that is commonly reified in such systems. We perform a controlled experiment that compares the performance of 22 participants using either a standard or the data-driven approach to diagnose several failures injected into a real-world service composition. We find that system operators using the latter approach are able to achieve significantly higher success rates and lower diagnosis times. Data-driven detection is based on the automation of failure detection through applying an outlier detection technique to multi-variate monitoring data. We evaluate the effectiveness of one-class classification for this purpose and determine a simple approach to select subsets of metrics that afford highly accurate failure detection

    Design of robust scheduling methodologies for high performance computing

    Get PDF
    Scientific applications are often large, complex, computationally-intensive, and irregular. Loops are often an abundant source of parallelism in scientific applications. Due to the ever-increasing computational needs of scientific applications, high performance computing (HPC) systems have become larger and more complex, offering increased parallelism at multiple hardware levels. Load imbalance, caused by irregular computational load per task and unpredictable computing system characteristics (system variability), often degrades the performance of applications. Besides, perturbations, such as reduced computing power, network latency availability, or failures, can severely impact the performance of the applications. System variability and perturbations are only expected to increase in future extreme-scale computing systems. Extrapolating the current failure rate to Exascale would result in a failure every 20 minutes. Such failure rate and perturbations would render the computing systems unusable. This doctoral thesis improves the performance of computationally-intensive scientific applications on HPC systems via robust load balancing. Robust scheduling ensures and maintains improved load balanced execution under unpredictable application and system characteristics. A number of dynamic loop self-scheduling (DLS) techniques have been introduced and successfully used in scientific applications between the 1980s and 2000s. These DLS techniques are not fault-tolerant as they were originally introduced. In this thesis, we identify three major research questions to achieve robust scheduling (1) How to ensure that the DLS techniques employed in scientific applications today adhere to their original design goals and specifications? (2) How to select a DLS technique that will achieve improved performance under perturbations? (3) How to tolerate perturbations during execution and maintain a load balanced execution on HPC systems? To answer the first question, we reproduced the original experiments that introduced the DLS techniques to verify their present implementation. Simulation is used to reproduce experiments on systems from the past. Realistic simulation induces a similar analysis and conclusions to the analysis of the native results. To this end, we devised an approach for bridging the native and simulative executions of parallel applications on HPC systems. This simulation approach is used to reproduce scheduling experiments on past and present systems to verify the implementation of DLS techniques. Given the multiple levels of parallelism offered by the present HPC systems, we analyzed the load imbalance in scientific applications, from computer vision, astrophysics, and mathematical kernels, at both thread and process levels. This analysis revealed a significant interplay between thread level and process level load balancing. We found that dynamic load balancing at the thread level propagates to the process level and vice versa. However, the best application performance is only achieved by two-level dynamic load balancing. Next, we examined the performance of applications under perturbations. We found that the most robust DLS technique does not deliver the best performance under various perturbations. The most efficient DLS technique changes by changing the application, the system, or perturbations during execution. This signifies the algorithm selection problem in the DLS. We leveraged realistic simulations to address the algorithm selection problem of scheduling under perturbations via a simulation assisted approach (SimAS), which answers the second question. SimAS dynamically selects DLS techniques that improve the performance depending on the application, system, and perturbations during the execution. To answer the third question, we introduced a robust dynamic load balancing (rDLB) approach for the robust self-scheduling of scientific applications under failures (question 3). rDLB proactively reschedules already allocated tasks and requires no detection of perturbations. rDLB tolerates up to P −1 processor failures (P is the number of processors allocated to the application) and boosts the flexibility of applications against nonfatal perturbations, such as reduced availability of resources. This thesis is the first to provide insights into the interplay between thread and process level dynamic load balancing in scientific applications. Verified DLS techniques, SimAS, and rDLB are integrated into an MPI-based dynamic load balancing library (DLS4LB), which supports thirteen DLS techniques, for robust dynamic load balancing of scientific applications on HPC systems. Using the methods devised in this thesis, we improved the performance of scientific applications by up to 21% via two-level dynamic load balancing. Under perturbations, we enhanced their performance by a factor of 7 and their flexibility by a factor of 30. This thesis opens up the horizons into understanding the interplay of load balancing between various levels of software parallelism and lays the ground for robust multilevel scheduling for the upcoming Exascale HPC systems and beyond
    corecore