28 research outputs found

    Foreseeing the Impact of the Proposed AI Act on the Sustainability and Safety of Critical Infrastructures

    Get PDF
    The AI Act has been recently proposed by the European Commission to regulate the use of AI in the EU, especially on high-risk applications, i.e. systems intended to be used as safety components in the management and operation of road traffic and the supply of water, gas, heating and electricity. On the other hand, IEC 61508, one of the most adopted international standards for safety-critical electronic components, seem to mostly forbid the use of AI in such systems. Given this conflict between IEC 61508 and the proposed AI Act, also stressed by the fact that IEC 61508 is not an harmonised European standard, with the present paper we study and analyse what is going to happen to industry after the entry into force of the AI Act. In particular, we focus on how the proposed AI Act might positively impact on the sustainability of critical infrastructures by allowing the use of AI on an industry where it was previously forbidden. To do so, we provide several examples of AI-based solutions falling under the umbrella of IEC 61508 that might have a positive impact on sustainability in alignment with the current long-term goals of the EU and the Sustainable Development Goals of the United Nations, i.e., affordable and clean energy, sustainable cities and communities

    Operating System Support for Redundant Multithreading

    Get PDF
    Failing hardware is a fact and trends in microprocessor design indicate that the fraction of hardware suffering from permanent and transient faults will continue to increase in future chip generations. Researchers proposed various solutions to this issue with different downsides: Specialized hardware components make hardware more expensive in production and consume additional energy at runtime. Fault-tolerant algorithms and libraries enforce specific programming models on the developer. Compiler-based fault tolerance requires the source code for all applications to be available for recompilation. In this thesis I present ASTEROID, an operating system architecture that integrates applications with different reliability needs. ASTEROID is built on top of the L4/Fiasco.OC microkernel and extends the system with Romain, an operating system service that transparently replicates user applications. Romain supports single- and multi-threaded applications without requiring access to the application's source code. Romain replicates applications and their resources completely and thereby does not rely on hardware extensions, such as ECC-protected memory. In my thesis I describe how to efficiently implement replication as a form of redundant multithreading in software. I develop mechanisms to manage replica resources and to make multi-threaded programs behave deterministically for replication. I furthermore present an approach to handle applications that use shared-memory channels with other programs. My evaluation shows that Romain provides 100% error detection and more than 99.6% error correction for single-bit flips in memory and general-purpose registers. At the same time, Romain's execution time overhead is below 14% for single-threaded applications running in triple-modular redundant mode. The last part of my thesis acknowledges that software-implemented fault tolerance methods often rely on the correct functioning of a certain set of hardware and software components, the Reliable Computing Base (RCB). I introduce the concept of the RCB and discuss what constitutes the RCB of the ASTEROID system and other fault tolerance mechanisms. Thereafter I show three case studies that evaluate approaches to protecting RCB components and thereby aim to achieve a software stack that is fully protected against hardware errors

    Explanation of the Model Checker Verification Results

    Get PDF
    Immer wenn neue Anforderungen an ein System gestellt werden, müssen die Korrektheit und Konsistenz der Systemspezifikation überprüft werden, was in der Praxis in der Regel manuell erfolgt. Eine mögliche Option, um die Nachteile dieser manuellen Analyse zu überwinden, ist das sogenannte Contract-Based Design. Dieser Entwurfsansatz kann den Verifikationsprozess zur Überprüfung, ob die Anforderungen auf oberster Ebene konsistent verfeinert wurden, automatisieren. Die Verifikation kann somit iterativ durchgeführt werden, um die Korrektheit und Konsistenz des Systems angesichts jeglicher Änderung der Spezifikationen sicherzustellen. Allerdings ist es aufgrund der mangelnden Benutzerfreundlichkeit und der Schwierigkeiten bei der Interpretation von Verifizierungsergebnissen immer noch eine Herausforderung, formale Ansätze in der Industrie einzusetzen. Stellt beispielsweise der Model Checker bei der Verifikation eine Inkonsistenz fest, generiert er ein Gegenbeispiel (Counterexample) und weist gleichzeitig darauf hin, dass die gegebenen Eingabespezifikationen inkonsistent sind. Hier besteht die gewaltige Herausforderung darin, das generierte Gegenbeispiel zu verstehen, das oft sehr lang, kryptisch und komplex ist. Darüber hinaus liegt es in der Verantwortung der Ingenieurin bzw. des Ingenieurs, die inkonsistente Spezifikation in einer potenziell großen Menge von Spezifikationen zu identifizieren. Diese Arbeit schlägt einen Ansatz zur Erklärung von Gegenbeispielen (Counterexample Explanation Approach) vor, der die Verwendung von formalen Methoden vereinfacht und fördert, indem benutzerfreundliche Erklärungen der Verifikationsergebnisse der Ingenieurin bzw. dem Ingenieur präsentiert werden. Der Ansatz zur Erklärung von Gegenbeispielen wird mittels zweier Methoden evaluiert: (1) Evaluation anhand verschiedener Anwendungsbeispiele und (2) eine Benutzerstudie in Form eines One-Group Pretest-Posttest Experiments.Whenever new requirements are introduced for a system, the correctness and consistency of the system specification must be verified, which is often done manually in industrial settings. One viable option to traverse disadvantages of this manual analysis is to employ the contract-based design, which can automate the verification process to determine whether the refinements of top-level requirements are consistent. Thus, verification can be performed iteratively to ensure the system’s correctness and consistency in the face of any change in specifications. Having said that, it is still challenging to deploy formal approaches in industries due to their lack of usability and their difficulties in interpreting verification results. For instance, if the model checker identifies inconsistency during the verification, it generates a counterexample while also indicating that the given input specifications are inconsistent. Here, the formidable challenge is to comprehend the generated counterexample, which is often lengthy, cryptic, and complex. Furthermore, it is the engineer’s responsibility to identify the inconsistent specification among a potentially huge set of specifications. This PhD thesis proposes a counterexample explanation approach for formal methods that simplifies and encourages their use by presenting user-friendly explanations of the verification results. The proposed counterexample explanation approach identifies and explains relevant information from the verification result in what seems like a natural language statement. The counterexample explanation approach extracts relevant information by identifying inconsistent specifications from among the set of specifications, as well as erroneous states and variables from the counterexample. The counterexample explanation approach is evaluated using two methods: (1) evaluation with different application examples, and (2) a user-study known as one-group pretest and posttest experiment

    From Resilience-Building to Resilience-Scaling Technologies: Directions -- ReSIST NoE Deliverable D13

    Get PDF
    This document is the second product of workpackage WP2, "Resilience-building and -scaling technologies", in the programme of jointly executed research (JER) of the ReSIST Network of Excellence. The problem that ReSIST addresses is achieving sufficient resilience in the immense systems of ever evolving networks of computers and mobile devices, tightly integrated with human organisations and other technology, that are increasingly becoming a critical part of the information infrastructure of our society. This second deliverable D13 provides a detailed list of research gaps identified by experts from the four working groups related to assessability, evolvability, usability and diversit

    Flexible management of bandwidth and redundancy in fieldbuses

    Get PDF
    Doutoramento em Engenharia ElectrotécnicaOs sistemas distribuídos embarcados (Distributed Embedded Systems – DES) têm sido usados ao longo dos últimos anos em muitos domínios de aplicação, da robótica, ao controlo de processos industriais passando pela aviónica e pelas aplicações veiculares, esperando-se que esta tendência continue nos próximos anos. A confiança no funcionamento é uma propriedade importante nestes domínios de aplicação, visto que os serviços têm de ser executados em tempo útil e de forma previsível, caso contrário, podem ocorrer danos económicos ou a vida de seres humanos poderá ser posta em causa. Na fase de projecto destes sistemas é impossível prever todos os cenários de falhas devido ao não determinismo do ambiente envolvente, sendo necessária a inclusão de mecanismos de tolerância a falhas. Adicionalmente, algumas destas aplicações requerem muita largura de banda, que também poderá ser usada para a evolução dos sistemas, adicionandolhes novas funcionalidades. A flexibilidade de um sistema é uma propriedade importante, pois permite a sua adaptação às condições e requisitos envolventes, contribuindo também para a simplicidade de manutenção e reparação. Adicionalmente, nos sistemas embarcados, a flexibilidade também é importante por potenciar uma melhor utilização dos, muitas vezes escassos, recursos existentes. Uma forma evidente de aumentar a largura de banda e a tolerância a falhas dos sistemas embarcados distribuídos é a replicação dos barramentos do sistema. Algumas soluções existentes, quer comerciais quer académicas, propõem a replicação dos barramentos para aumento da largura de banda ou para aumento da tolerância a falhas. No entanto e quase invariavelmente, o propósito é apenas um, sendo raras as soluções que disponibilizam uma maior largura de banda e um aumento da tolerância a falhas. Um destes raros exemplos é o FlexRay, com a limitação de apenas ser permitido o uso de dois barramentos. Esta tese apresentada e discute uma proposta para usar a replicação de barramentos de uma forma flexível com o objectivo duplo de aumentar a largura de banda e a tolerância a falhas. A flexibilidade dos protocolos propostos também permite a gestão dinâmica da topologia da rede, sendo o número de barramentos apenas limitado pelo hardware/software. As propostas desta tese foram validadas recorrendo ao barramento de campo CAN – Controller Area Network, escolhido devido à sua grande implantação no mercado. Mais especificamente, as soluções propostas foram implementadas e validadas usando um paradigma que combina flexibilidade com comunicações event-triggered e time-triggered: o FTT – Flexible Time- Triggered. No entanto, uma generalização para CAN nativo é também apresentada e discutida. A inclusão de mecanismos de replicação do barramento impõe a alteração dos antigos protocolos de replicação e substituição do nó mestre, bem como a definição de novos protocolos para esta finalidade. Este trabalho tira partido da arquitectura centralizada e da replicação do nó mestre para suportar de forma eficiente e flexível a replicação de barramentos. Em caso de ocorrência de uma falta num barramento (ou barramentos) que poderia provocar uma falha no sistema, os protocolos e componentes propostos nesta tese fazem com que o sistema reaja, mudando para um modo de funcionamento degradado. As mensagens que estavam a ser transmitidas nos barramentos onde ocorreu a falta são reencaminhadas para os outros barramentos. A replicação do nó mestre baseia-se numa estratégia líder-seguidores (leaderfollowers), onde o líder (leader) controla todo o sistema enquanto os seguidores (followers) servem como nós de reserva. Se um erro ocorrer no nó líder, um dos nós seguidores passará a controlar o sistema de uma forma transparente e mantendo as mesmas funcionalidades. As propostas desta tese foram também generalizadas para CAN nativo, tendo sido para tal propostos dois componentes adicionais. É, desta forma possível ter as mesmas capacidades de tolerância a falhas ao nível dos barramentos juntamente com a gestão dinâmica da topologia de rede. Todas as propostas desta tese foram implementadas e avaliadas. Uma implementação inicial, apenas com um barramento foi avaliada recorrendo a uma aplicação real, uma equipa de futebol robótico onde o protocolo FTT-CAN foi usado no controlo de movimento e da odometria. A avaliação do sistema com múltiplos barramentos foi feita numa plataforma de teste em laboratório. Para tal foi desenvolvido um sistema de injecção de faltas que permite impor faltas nos barramentos e nos nós mestre, e um sistema de medida de atrasos destinado a medir o tempo de resposta após a ocorrência de uma falta.Distributed embedded systems (DES) have been widely used in the last few decades in several application domains, from robotics, industrial process control, avionics and automotive. In fact, it is expectable that this trend will continue in the next years. In some of these application fields the dependability requirements are very important since the fail to provide services in a timely and predictable manner may cause important economic losses or even put humans in risk. In the design phase it is impossible to predict all the possible scenarios of faults, due to the non deterministic behaviour of the surrounding environment. In that way, the fault tolerance mechanisms must be included in the distributed embedded system to prevent failures occurrence. Also, many application domains require a high available bandwidth to perform the desired functions, or to turn possible the scaling with the addition of new features. The flexibility of a system also plays an important role, since it improves the capability to adapt to the surrounding world, and to the simplicity of the repair and maintenance. The flexibility improves the efficiency of all the system by providing a way to efficiently manage the available resources. This is very important in embedded systems due to the limited resources often available. A natural way to improve the bandwidth and the fault tolerance in distributed systems is to use replicated buses. Commercial and academic solutions propose the use of replicated fieldbuses for a single purpose only, either to improve the fault tolerance or to improve the available bandwidth, being the first the most common. One illustrative exception is FlexRay where the bus replica can be used to improve the bandwidth of the overall system, besides enabling redundant communications. However, only one bus replica can be used. In this thesis, a flexible bus replication scheme to improve both the dependability and the throughput of fieldbuses is presented and studied. It can be applied to any number of replicated buses, provided the required hardware support is available. The flexible use of the replicated buses can achieve an also flexible management of the network topology. This claim has been validated using the Controller Area Network (CAN) fieldbus, which has been chosen because it is widely spread in millions of systems. In fact, the proposed solution uses a paradigm that combines flexibility, time and event triggered communication, that is the Flexible Time- Triggered over CAN network (FTT-CAN). However, a generalization to native CAN is also presented and studied. The inclusion of bus replication in FTT-CAN imposes not only new mechanisms but also changes of the mechanisms associated with the master replication, which has been already studied in previous research work. In this work, these mechanisms were combined and take advantage of the centralized architecture and of the redundant masters to support an efficient and flexible bus replication. When considering the system operation, if a fault in the bus (or buses) occurs, and the consequent error leads to a system failure, the system reacts, switching to a degraded mode, where the message flows that were transmitted in the faulty bus (or buses) change to the non-faulty ones. The central node replication uses a leader-follower strategy, where the leader controls the system while the followers serve as backups. If an error occurs in the leader, a backup will take the system control maintaining the system with the same functionalities. The system has been generalized for native CAN, using two additional components that provide the same fault tolerance capabilities at the bus level, and also enable the dynamic management of the network topology. All the referred proposals were implemented and assessed in the scope of this work. The single bus version of FTT-CAN was assessed using a real application, a robotic soccer team, which has obtained excellent results in international competitions. There, the FTT-CAN based embedded system has been applied in the low level control, where, mainly it is responsible for the motion control and odometry. For the case of the multiple buses system, the assessment was performed in a laboratory test bed. For this, a fault injector was developed in order to impose faults in the buses and in the central nodes. To measure the time reaction of the system, a special hardware has been developed: a delay measurement system. It is able to measure delays between two important time marks for posterior offline analysis of the obtained values

    Development and certification of mixed-criticality embedded systems based on probabilistic timing analysis

    Get PDF
    An increasing variety of emerging systems relentlessly replaces or augments the functionality of mechanical subsystems with embedded electronics. For quantity, complexity, and use, the safety of such subsystems is an increasingly important matter. Accordingly, those systems are subject to safety certification to demonstrate system's safety by rigorous development processes and hardware/software constraints. The massive augment in embedded processors' complexity renders the arduous certification task significantly harder to achieve. The focus of this thesis is to address the certification challenges in multicore architectures: despite their potential to integrate several applications on a single platform, their inherent complexity imperils their timing predictability and certification. Recently, the Measurement-Based Probabilistic Timing Analysis (MBPTA) technique emerged as an alternative to deal with hardware/software complexity. The innovation that MBPTA brings about is, however, a major step from current certification procedures and standards. The particular contributions of this Thesis include: (i) the definition of certification arguments for mixed-criticality integration upon multicore processors. In particular we propose a set of safety mechanisms and procedures as required to comply with functional safety standards. For timing predictability, (ii) we present a quantitative approach to assess the likelihood of execution-time exceedance events with respect to the risk reduction requirements on safety standards. To this end, we build upon the MBPTA approach and we present the design of a safety-related source of randomization (SoR), that plays a key role in the platform-level randomization needed by MBPTA. And (iii) we evaluate current certification guidance with respect to emerging high performance design trends like caches. Overall, this Thesis pushes the certification limits in the use of multicore and MBPTA technology in Critical Real-Time Embedded Systems (CRTES) and paves the way towards their adoption in industry.Una creciente variedad de sistemas emergentes reemplazan o aumentan la funcionalidad de subsistemas mecánicos con componentes electrónicos embebidos. El aumento en la cantidad y complejidad de dichos subsistemas electrónicos así como su cometido, hacen de su seguridad una cuestión de creciente importancia. Tanto es así que la comercialización de estos sistemas críticos está sujeta a rigurosos procesos de certificación donde se garantiza la seguridad del sistema mediante estrictas restricciones en el proceso de desarrollo y diseño de su hardware y software. Esta tesis trata de abordar los nuevos retos y dificultades dadas por la introducción de procesadores multi-núcleo en dichos sistemas críticos: aunque su mayor rendimiento despierta el interés de la industria para integrar múltiples aplicaciones en una sola plataforma, suponen una mayor complejidad. Su arquitectura desafía su análisis temporal mediante los métodos tradicionales y, asimismo, su certificación es cada vez más compleja y costosa. Con el fin de lidiar con estas limitaciones, recientemente se ha desarrollado una novedosa técnica de análisis temporal probabilístico basado en medidas (MBPTA). La innovación de esta técnica, sin embargo, supone un gran cambio cultural respecto a los estándares y procedimientos tradicionales de certificación. En esta línea, las contribuciones de esta tesis están agrupadas en tres ejes principales: (i) definición de argumentos de seguridad para la certificación de aplicaciones de criticidad-mixta sobre plataformas multi-núcleo. Se definen, en particular, mecanismos de seguridad, técnicas de diagnóstico y reacción de faltas acorde con el estándar IEC 61508 sobre una arquitectura multi-núcleo de referencia. Respecto al análisis temporal, (ii) presentamos la cuantificación de la probabilidad de exceder un límite temporal y su relación con los requisitos de reducción de riesgos derivados de los estándares de seguridad funcional. Con este fin, nos basamos en la técnica MBPTA y presentamos el diseño de una fuente de números aleatorios segura; un componente clave para conseguir las propiedades aleatorias requeridas por MBPTA a nivel de plataforma. Por último, (iii) extrapolamos las guías actuales para la certificación de arquitecturas multi-núcleo a una solución comercial de 8 núcleos y las evaluamos con respecto a las tendencias emergentes de diseño de alto rendimiento (caches). Con estas contribuciones, esta tesis trata de abordar los retos que el uso de procesadores multi-núcleo y MBPTA implican en el proceso de certificación de sistemas críticos de tiempo real y facilita, de esta forma, su adopción por la industria.Postprint (published version

    Speculation in Parallel and Distributed Event Processing Systems

    Get PDF
    Event stream processing (ESP) applications enable the real-time processing of continuous flows of data. Algorithmic trading, network monitoring, and processing data from sensor networks are good examples of applications that traditionally rely upon ESP systems. In addition, technological advances are resulting in an increasing number of devices that are network enabled, producing information that can be automatically collected and processed. This increasing availability of on-line data motivates the development of new and more sophisticated applications that require low-latency processing of large volumes of data. ESP applications are composed of an acyclic graph of operators that is traversed by the data. Inside each operator, the events can be transformed, aggregated, enriched, or filtered out. Some of these operations depend only on the current input events, such operations are called stateless. Other operations, however, depend not only on the current event, but also on a state built during the processing of previous events. Such operations are, therefore, named stateful. As the number of ESP applications grows, there are increasingly strong requirements, which are often difficult to satisfy. In this dissertation, we address two challenges created by the use of stateful operations in a ESP application: (i) stateful operators can be bottlenecks because they are sensitive to the order of events and cannot be trivially parallelized by replication; and (ii), if failures are to be tolerated, the accumulated state of an stateful operator needs to be saved, saving this state traditionally imposes considerable performance costs. Our approach is to evaluate the use of speculation to address these two issues. For handling ordering and parallelization issues in a stateful operator, we propose a speculative approach that both reduces latency when the operator must wait for the correct ordering of the events and improves throughput when the operation in hand is parallelizable. In addition, our approach does not require that user understand concurrent programming or that he or she needs to consider out-of-order execution when writing the operations. For fault-tolerant applications, traditional approaches have imposed prohibitive performance costs due to pessimistic schemes. We extend such approaches, using speculation to mask the cost of fault tolerance.:1 Introduction 1 1.1 Event stream processing systems ......................... 1 1.2 Running example ................................. 3 1.3 Challenges and contributions ........................... 4 1.4 Outline ...................................... 6 2 Background 7 2.1 Event stream processing ............................. 7 2.1.1 State in operators: Windows and synopses ............................ 8 2.1.2 Types of operators ............................ 12 2.1.3 Our prototype system........................... 13 2.2 Software transactional memory.......................... 18 2.2.1 Overview ................................. 18 2.2.2 Memory operations............................ 19 2.3 Fault tolerance in distributed systems ...................................... 23 2.3.1 Failure model and failure detection ...................................... 23 2.3.2 Recovery semantics............................ 24 2.3.3 Active and passive replication ...................... 24 2.4 Summary ..................................... 26 3 Extending event stream processing systems with speculation 27 3.1 Motivation..................................... 27 3.2 Goals ....................................... 28 3.3 Local versus distributed speculation ....................... 29 3.4 Models and assumptions ............................. 29 3.4.1 Operators................................. 30 3.4.2 Events................................... 30 3.4.3 Failures .................................. 31 4 Local speculation 33 4.1 Overview ..................................... 33 4.2 Requirements ................................... 35 4.2.1 Order ................................... 35 4.2.2 Aborts................................... 37 4.2.3 Optimism control ............................. 38 4.2.4 Notifications ............................... 39 4.3 Applications.................................... 40 4.3.1 Out-of-order processing ......................... 40 4.3.2 Optimistic parallelization......................... 42 4.4 Extensions..................................... 44 4.4.1 Avoiding unnecessary aborts ....................... 44 4.4.2 Making aborts unnecessary........................ 45 4.5 Evaluation..................................... 47 4.5.1 Overhead of speculation ......................... 47 4.5.2 Cost of misspeculation .......................... 50 4.5.3 Out-of-order and parallel processing micro benchmarks ........... 53 4.5.4 Behavior with example operators .................... 57 4.6 Summary ..................................... 60 5 Distributed speculation 63 5.1 Overview ..................................... 63 5.2 Requirements ................................... 64 5.2.1 Speculative events ............................ 64 5.2.2 Speculative accesses ........................... 69 5.2.3 Reliable ordered broadcast with optimistic delivery .................. 72 5.3 Applications .................................... 75 5.3.1 Passive replication and rollback recovery ................................ 75 5.3.2 Active replication ............................. 80 5.4 Extensions ..................................... 82 5.4.1 Active replication and software bugs ..................................... 82 5.4.2 Enabling operators to output multiple events ........................ 87 5.5 Evaluation .................................... 87 5.5.1 Passive replication ............................ 88 5.5.2 Active replication ............................. 88 5.6 Summary ..................................... 93 6 Related work 95 6.1 Event stream processing engines ......................... 95 6.2 Parallelization and optimistic computing ................................ 97 6.2.1 Speculation ................................ 97 6.2.2 Optimistic parallelization ......................... 98 6.2.3 Parallelization in event processing .................................... 99 6.2.4 Speculation in event processing ..................... 99 6.3 Fault tolerance .................................. 100 6.3.1 Passive replication and rollback recovery ............................... 100 6.3.2 Active replication ............................ 101 6.3.3 Fault tolerance in event stream processing systems ............. 103 7 Conclusions 105 7.1 Summary of contributions ............................ 105 7.2 Challenges and future work ............................ 106 Appendices Publications 107 Pseudocode for the consensus protocol 10

    Speculation in Parallel and Distributed Event Processing Systems

    Get PDF
    Event stream processing (ESP) applications enable the real-time processing of continuous flows of data. Algorithmic trading, network monitoring, and processing data from sensor networks are good examples of applications that traditionally rely upon ESP systems. In addition, technological advances are resulting in an increasing number of devices that are network enabled, producing information that can be automatically collected and processed. This increasing availability of on-line data motivates the development of new and more sophisticated applications that require low-latency processing of large volumes of data. ESP applications are composed of an acyclic graph of operators that is traversed by the data. Inside each operator, the events can be transformed, aggregated, enriched, or filtered out. Some of these operations depend only on the current input events, such operations are called stateless. Other operations, however, depend not only on the current event, but also on a state built during the processing of previous events. Such operations are, therefore, named stateful. As the number of ESP applications grows, there are increasingly strong requirements, which are often difficult to satisfy. In this dissertation, we address two challenges created by the use of stateful operations in a ESP application: (i) stateful operators can be bottlenecks because they are sensitive to the order of events and cannot be trivially parallelized by replication; and (ii), if failures are to be tolerated, the accumulated state of an stateful operator needs to be saved, saving this state traditionally imposes considerable performance costs. Our approach is to evaluate the use of speculation to address these two issues. For handling ordering and parallelization issues in a stateful operator, we propose a speculative approach that both reduces latency when the operator must wait for the correct ordering of the events and improves throughput when the operation in hand is parallelizable. In addition, our approach does not require that user understand concurrent programming or that he or she needs to consider out-of-order execution when writing the operations. For fault-tolerant applications, traditional approaches have imposed prohibitive performance costs due to pessimistic schemes. We extend such approaches, using speculation to mask the cost of fault tolerance.:1 Introduction 1 1.1 Event stream processing systems ......................... 1 1.2 Running example ................................. 3 1.3 Challenges and contributions ........................... 4 1.4 Outline ...................................... 6 2 Background 7 2.1 Event stream processing ............................. 7 2.1.1 State in operators: Windows and synopses ............................ 8 2.1.2 Types of operators ............................ 12 2.1.3 Our prototype system........................... 13 2.2 Software transactional memory.......................... 18 2.2.1 Overview ................................. 18 2.2.2 Memory operations............................ 19 2.3 Fault tolerance in distributed systems ...................................... 23 2.3.1 Failure model and failure detection ...................................... 23 2.3.2 Recovery semantics............................ 24 2.3.3 Active and passive replication ...................... 24 2.4 Summary ..................................... 26 3 Extending event stream processing systems with speculation 27 3.1 Motivation..................................... 27 3.2 Goals ....................................... 28 3.3 Local versus distributed speculation ....................... 29 3.4 Models and assumptions ............................. 29 3.4.1 Operators................................. 30 3.4.2 Events................................... 30 3.4.3 Failures .................................. 31 4 Local speculation 33 4.1 Overview ..................................... 33 4.2 Requirements ................................... 35 4.2.1 Order ................................... 35 4.2.2 Aborts................................... 37 4.2.3 Optimism control ............................. 38 4.2.4 Notifications ............................... 39 4.3 Applications.................................... 40 4.3.1 Out-of-order processing ......................... 40 4.3.2 Optimistic parallelization......................... 42 4.4 Extensions..................................... 44 4.4.1 Avoiding unnecessary aborts ....................... 44 4.4.2 Making aborts unnecessary........................ 45 4.5 Evaluation..................................... 47 4.5.1 Overhead of speculation ......................... 47 4.5.2 Cost of misspeculation .......................... 50 4.5.3 Out-of-order and parallel processing micro benchmarks ........... 53 4.5.4 Behavior with example operators .................... 57 4.6 Summary ..................................... 60 5 Distributed speculation 63 5.1 Overview ..................................... 63 5.2 Requirements ................................... 64 5.2.1 Speculative events ............................ 64 5.2.2 Speculative accesses ........................... 69 5.2.3 Reliable ordered broadcast with optimistic delivery .................. 72 5.3 Applications .................................... 75 5.3.1 Passive replication and rollback recovery ................................ 75 5.3.2 Active replication ............................. 80 5.4 Extensions ..................................... 82 5.4.1 Active replication and software bugs ..................................... 82 5.4.2 Enabling operators to output multiple events ........................ 87 5.5 Evaluation .................................... 87 5.5.1 Passive replication ............................ 88 5.5.2 Active replication ............................. 88 5.6 Summary ..................................... 93 6 Related work 95 6.1 Event stream processing engines ......................... 95 6.2 Parallelization and optimistic computing ................................ 97 6.2.1 Speculation ................................ 97 6.2.2 Optimistic parallelization ......................... 98 6.2.3 Parallelization in event processing .................................... 99 6.2.4 Speculation in event processing ..................... 99 6.3 Fault tolerance .................................. 100 6.3.1 Passive replication and rollback recovery ............................... 100 6.3.2 Active replication ............................ 101 6.3.3 Fault tolerance in event stream processing systems ............. 103 7 Conclusions 105 7.1 Summary of contributions ............................ 105 7.2 Challenges and future work ............................ 106 Appendices Publications 107 Pseudocode for the consensus protocol 10
    corecore