30,300 research outputs found
Reliable Linear, Sesquilinear and Bijective Operations On Integer Data Streams Via Numerical Entanglement
A new technique is proposed for fault-tolerant linear, sesquilinear and
bijective (LSB) operations on integer data streams (), such as:
scaling, additions/subtractions, inner or outer vector products, permutations
and convolutions. In the proposed method, the input integer data streams
are linearly superimposed to form numerically-entangled integer data
streams that are stored in-place of the original inputs. A series of LSB
operations can then be performed directly using these entangled data streams.
The results are extracted from the entangled output streams by additions
and arithmetic shifts. Any soft errors affecting any single disentangled output
stream are guaranteed to be detectable via a specific post-computation
reliability check. In addition, when utilizing a separate processor core for
each of the streams, the proposed approach can recover all outputs after
any single fail-stop failure. Importantly, unlike algorithm-based fault
tolerance (ABFT) methods, the number of operations required for the
entanglement, extraction and validation of the results is linearly related to
the number of the inputs and does not depend on the complexity of the performed
LSB operations. We have validated our proposal in an Intel processor (Haswell
architecture with AVX2 support) via fast Fourier transforms, circular
convolutions, and matrix multiplication operations. Our analysis and
experiments reveal that the proposed approach incurs between to
reduction in processing throughput for a wide variety of LSB operations. This
overhead is 5 to 1000 times smaller than that of the equivalent ABFT method
that uses a checksum stream. Thus, our proposal can be used in fault-generating
processor hardware or safety-critical applications, where high reliability is
required without the cost of ABFT or modular redundancy.Comment: to appear in IEEE Trans. on Signal Processing, 201
Speculation in Parallel and Distributed Event Processing Systems
Event stream processing (ESP) applications enable the real-time processing of continuous flows of data. Algorithmic trading, network monitoring, and processing data from sensor networks are good examples of applications that traditionally rely upon ESP systems. In addition, technological advances are resulting in an increasing number of devices that are network enabled, producing information that can be automatically collected and processed. This increasing availability of on-line data motivates the development of new and more sophisticated applications that require low-latency processing of large volumes of data.
ESP applications are composed of an acyclic graph of operators that is traversed by the data. Inside each operator, the events can be transformed, aggregated, enriched, or filtered out. Some of these operations depend only on the current input events, such operations are called stateless. Other operations, however, depend not only on the current event, but also on a state built during the processing of previous events. Such operations are, therefore, named stateful.
As the number of ESP applications grows, there are increasingly strong requirements, which are often difficult to satisfy. In this dissertation, we address two challenges created by the use of stateful operations in a ESP application: (i) stateful operators can be bottlenecks because they are sensitive to the order of events and cannot be trivially parallelized by replication; and (ii), if failures are to be tolerated, the accumulated state of an stateful operator needs to be saved, saving this state traditionally imposes considerable performance costs.
Our approach is to evaluate the use of speculation to address these two issues. For handling ordering and parallelization issues in a stateful operator, we propose a speculative approach that both reduces latency when the operator must wait for the correct ordering of the events and improves throughput when the operation in hand is parallelizable. In addition, our approach does not require that user understand concurrent programming or that he or she needs to consider out-of-order execution when writing the operations.
For fault-tolerant applications, traditional approaches have imposed prohibitive performance costs due to pessimistic schemes. We extend such approaches, using speculation to mask the cost of fault tolerance.:1 Introduction 1
1.1 Event stream processing systems ......................... 1
1.2 Running example ................................. 3
1.3 Challenges and contributions ........................... 4
1.4 Outline ...................................... 6
2 Background 7
2.1 Event stream processing ............................. 7
2.1.1 State in operators: Windows and synopses ............................ 8
2.1.2 Types of operators ............................ 12
2.1.3 Our prototype system........................... 13
2.2 Software transactional memory.......................... 18
2.2.1 Overview ................................. 18
2.2.2 Memory operations............................ 19
2.3 Fault tolerance in distributed systems ...................................... 23
2.3.1 Failure model and failure detection ...................................... 23
2.3.2 Recovery semantics............................ 24
2.3.3 Active and passive replication ...................... 24
2.4 Summary ..................................... 26
3 Extending event stream processing systems with speculation 27
3.1 Motivation..................................... 27
3.2 Goals ....................................... 28
3.3 Local versus distributed speculation ....................... 29
3.4 Models and assumptions ............................. 29
3.4.1 Operators................................. 30
3.4.2 Events................................... 30
3.4.3 Failures .................................. 31
4 Local speculation 33
4.1 Overview ..................................... 33
4.2 Requirements ................................... 35
4.2.1 Order ................................... 35
4.2.2 Aborts................................... 37
4.2.3 Optimism control ............................. 38
4.2.4 Notifications ............................... 39
4.3 Applications.................................... 40
4.3.1 Out-of-order processing ......................... 40
4.3.2 Optimistic parallelization......................... 42
4.4 Extensions..................................... 44
4.4.1 Avoiding unnecessary aborts ....................... 44
4.4.2 Making aborts unnecessary........................ 45
4.5 Evaluation..................................... 47
4.5.1 Overhead of speculation ......................... 47
4.5.2 Cost of misspeculation .......................... 50
4.5.3 Out-of-order and parallel processing micro benchmarks ........... 53
4.5.4 Behavior with example operators .................... 57
4.6 Summary ..................................... 60
5 Distributed speculation 63
5.1 Overview ..................................... 63
5.2 Requirements ................................... 64
5.2.1 Speculative events ............................ 64
5.2.2 Speculative accesses ........................... 69
5.2.3 Reliable ordered broadcast with optimistic delivery .................. 72
5.3 Applications .................................... 75
5.3.1 Passive replication and rollback recovery ................................ 75
5.3.2 Active replication ............................. 80
5.4 Extensions ..................................... 82
5.4.1 Active replication and software bugs ..................................... 82
5.4.2 Enabling operators to output multiple events ........................ 87
5.5 Evaluation .................................... 87
5.5.1 Passive replication ............................ 88
5.5.2 Active replication ............................. 88
5.6 Summary ..................................... 93
6 Related work 95
6.1 Event stream processing engines ......................... 95
6.2 Parallelization and optimistic computing ................................ 97
6.2.1 Speculation ................................ 97
6.2.2 Optimistic parallelization ......................... 98
6.2.3 Parallelization in event processing .................................... 99
6.2.4 Speculation in event processing ..................... 99
6.3 Fault tolerance .................................. 100
6.3.1 Passive replication and rollback recovery ............................... 100
6.3.2 Active replication ............................ 101
6.3.3 Fault tolerance in event stream processing systems ............. 103
7 Conclusions 105
7.1 Summary of contributions ............................ 105
7.2 Challenges and future work ............................ 106
Appendices
Publications 107
Pseudocode for the consensus protocol 10
Speculation in Parallel and Distributed Event Processing Systems
Event stream processing (ESP) applications enable the real-time processing of continuous flows of data. Algorithmic trading, network monitoring, and processing data from sensor networks are good examples of applications that traditionally rely upon ESP systems. In addition, technological advances are resulting in an increasing number of devices that are network enabled, producing information that can be automatically collected and processed. This increasing availability of on-line data motivates the development of new and more sophisticated applications that require low-latency processing of large volumes of data.
ESP applications are composed of an acyclic graph of operators that is traversed by the data. Inside each operator, the events can be transformed, aggregated, enriched, or filtered out. Some of these operations depend only on the current input events, such operations are called stateless. Other operations, however, depend not only on the current event, but also on a state built during the processing of previous events. Such operations are, therefore, named stateful.
As the number of ESP applications grows, there are increasingly strong requirements, which are often difficult to satisfy. In this dissertation, we address two challenges created by the use of stateful operations in a ESP application: (i) stateful operators can be bottlenecks because they are sensitive to the order of events and cannot be trivially parallelized by replication; and (ii), if failures are to be tolerated, the accumulated state of an stateful operator needs to be saved, saving this state traditionally imposes considerable performance costs.
Our approach is to evaluate the use of speculation to address these two issues. For handling ordering and parallelization issues in a stateful operator, we propose a speculative approach that both reduces latency when the operator must wait for the correct ordering of the events and improves throughput when the operation in hand is parallelizable. In addition, our approach does not require that user understand concurrent programming or that he or she needs to consider out-of-order execution when writing the operations.
For fault-tolerant applications, traditional approaches have imposed prohibitive performance costs due to pessimistic schemes. We extend such approaches, using speculation to mask the cost of fault tolerance.:1 Introduction 1
1.1 Event stream processing systems ......................... 1
1.2 Running example ................................. 3
1.3 Challenges and contributions ........................... 4
1.4 Outline ...................................... 6
2 Background 7
2.1 Event stream processing ............................. 7
2.1.1 State in operators: Windows and synopses ............................ 8
2.1.2 Types of operators ............................ 12
2.1.3 Our prototype system........................... 13
2.2 Software transactional memory.......................... 18
2.2.1 Overview ................................. 18
2.2.2 Memory operations............................ 19
2.3 Fault tolerance in distributed systems ...................................... 23
2.3.1 Failure model and failure detection ...................................... 23
2.3.2 Recovery semantics............................ 24
2.3.3 Active and passive replication ...................... 24
2.4 Summary ..................................... 26
3 Extending event stream processing systems with speculation 27
3.1 Motivation..................................... 27
3.2 Goals ....................................... 28
3.3 Local versus distributed speculation ....................... 29
3.4 Models and assumptions ............................. 29
3.4.1 Operators................................. 30
3.4.2 Events................................... 30
3.4.3 Failures .................................. 31
4 Local speculation 33
4.1 Overview ..................................... 33
4.2 Requirements ................................... 35
4.2.1 Order ................................... 35
4.2.2 Aborts................................... 37
4.2.3 Optimism control ............................. 38
4.2.4 Notifications ............................... 39
4.3 Applications.................................... 40
4.3.1 Out-of-order processing ......................... 40
4.3.2 Optimistic parallelization......................... 42
4.4 Extensions..................................... 44
4.4.1 Avoiding unnecessary aborts ....................... 44
4.4.2 Making aborts unnecessary........................ 45
4.5 Evaluation..................................... 47
4.5.1 Overhead of speculation ......................... 47
4.5.2 Cost of misspeculation .......................... 50
4.5.3 Out-of-order and parallel processing micro benchmarks ........... 53
4.5.4 Behavior with example operators .................... 57
4.6 Summary ..................................... 60
5 Distributed speculation 63
5.1 Overview ..................................... 63
5.2 Requirements ................................... 64
5.2.1 Speculative events ............................ 64
5.2.2 Speculative accesses ........................... 69
5.2.3 Reliable ordered broadcast with optimistic delivery .................. 72
5.3 Applications .................................... 75
5.3.1 Passive replication and rollback recovery ................................ 75
5.3.2 Active replication ............................. 80
5.4 Extensions ..................................... 82
5.4.1 Active replication and software bugs ..................................... 82
5.4.2 Enabling operators to output multiple events ........................ 87
5.5 Evaluation .................................... 87
5.5.1 Passive replication ............................ 88
5.5.2 Active replication ............................. 88
5.6 Summary ..................................... 93
6 Related work 95
6.1 Event stream processing engines ......................... 95
6.2 Parallelization and optimistic computing ................................ 97
6.2.1 Speculation ................................ 97
6.2.2 Optimistic parallelization ......................... 98
6.2.3 Parallelization in event processing .................................... 99
6.2.4 Speculation in event processing ..................... 99
6.3 Fault tolerance .................................. 100
6.3.1 Passive replication and rollback recovery ............................... 100
6.3.2 Active replication ............................ 101
6.3.3 Fault tolerance in event stream processing systems ............. 103
7 Conclusions 105
7.1 Summary of contributions ............................ 105
7.2 Challenges and future work ............................ 106
Appendices
Publications 107
Pseudocode for the consensus protocol 10
Reliable Linear, Sesquilinear, and Bijective Operations on Integer Data Streams Via Numerical Entanglement
A new technique is proposed for fault-tolerant linear, sesquilinear and bijective (LSB) operations on integer data streams ( ), such as: scaling, additions/subtractions, inner or outer vector products, permutations and convolutions. In the proposed method, input integer data streams are linearly superimposed to form numerically-entangled integer data streams that are stored in-place of the original inputs. LSB operations can then be performed directly using these entangled data streams. The results are extracted from the entangled output streams by additions and arithmetic shifts. Any soft errors affecting one disentangled output stream are guaranteed to be detectable via a post-computation reliability check. Additionally, when utilizing a separate processor core for each stream, our approach can recover all outputs after any single fail-stop failure. Importantly, unlike algorithm-based fault tolerance (ABFT) methods, the number of operations required for the entire process is linearly related to the number of inputs and does not depend on the complexity of the performed LSB operations. We have validated our proposal in an Intel processor via several types of operations: fast Fourier transforms, convolutions, and matrix multiplication operations. Our analysis and experiments reveal that the proposed approach incurs between 0.03% to 7% reduction in processing throughput for numerous LSB operations. This overhead is 5 to 1000 times smaller than that of the equivalent ABFT method that uses a checksum stream. Thus, our proposal can be used in fault-generating processor hardware or safety-critical applications, where high reliability is required without the cost of ABFT or modular redundancy
Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management
As users of big data applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the pay-as-you-go model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs - systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the check-pointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures. Copyright © 2013 ACM
Tolerating Correlated Failures in Massively Parallel Stream Processing Engines
Fault-tolerance techniques for stream processing engines can be categorized
into passive and active approaches. A typical passive approach periodically
checkpoints a processing task's runtime states and can recover a failed task by
restoring its runtime state using its latest checkpoint. On the other hand, an
active approach usually employs backup nodes to run replicated tasks. Upon
failure, the active replica can take over the processing of the failed task
with minimal latency. However, both approaches have their own inadequacies in
Massively Parallel Stream Processing Engines (MPSPE). The passive approach
incurs a long recovery latency especially when a number of correlated nodes
fail simultaneously, while the active approach requires extra replication
resources. In this paper, we propose a new fault-tolerance framework, which is
Passive and Partially Active (PPA). In a PPA scheme, the passive approach is
applied to all tasks while only a selected set of tasks will be actively
replicated. The number of actively replicated tasks depends on the available
resources. If tasks without active replicas fail, tentative outputs will be
generated before the completion of the recovery process. We also propose
effective and efficient algorithms to optimize a partially active replication
plan to maximize the quality of tentative outputs. We implemented PPA on top of
Storm, an open-source MPSPE and conducted extensive experiments using both real
and synthetic datasets to verify the effectiveness of our approach
Stateful Detection in High Throughput Distributed Systems
With the increasing speed of computers, complexity of applications and large scale of applications, many of today’s distributed systems exchange data at a high rate. It is important to provide error detection capabilities to such applications that provide critical functionality. Significant prior work has been done in software implemented error detection achieved through a fault tolerance system separate from the application system. However, the high rate of data coupled with complex detection can cause the capacity of the fault tolerance system to be exhausted resulting in low detection accuracy. This is particularly the case when the detection is done against rules based on state that has been generated in the system. We present a new stateful detection mechanism which is based on observing messages exchanged between the protocol participants, deducing the application state from them, and matching against anomaly based rules. We have previously shown the capacity constraint of the detection framework called the Monitor. Here we extend the Monitor framework to incorporate a sampling approach which adjusts the rate of messages to be verified by sampling the incoming application stream of messages. The adjustment is such that the breakdown in the Monitor capacity is avoided. The cost of processing each message increases because the application state is no longer accurately known at the Monitor. However, the overall detection cost is reduced due to the lower rate of messages processed. We show that even with sampling, the Monitor is able to track the possible state of the protocol entity and provide stateful detection. We implement the approach and apply it to a reliable multicast protocol called TRAM. We demonstrate the gains of the approach by comparing the latency and accuracy of fault detection to the baseline Monitor system
- …