2,081 research outputs found
Shared Arrangements: practical inter-query sharing for streaming dataflows
Current systems for data-parallel, incremental processing and view
maintenance over high-rate streams isolate the execution of independent
queries. This creates unwanted redundancy and overhead in the presence of
concurrent incrementally maintained queries: each query must independently
maintain the same indexed state over the same input streams, and new queries
must build this state from scratch before they can begin to emit their first
results. This paper introduces shared arrangements: indexed views of maintained
state that allow concurrent queries to reuse the same in-memory state without
compromising data-parallel performance and scaling. We implement shared
arrangements in a modern stream processor and show order-of-magnitude
improvements in query response time and resource consumption for interactive
queries against high-throughput streams, while also significantly improving
performance in other domains including business analytics, graph processing,
and program analysis
Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management
As users of big data applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the pay-as-you-go model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs - systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the check-pointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures. Copyright © 2013 ACM
Benchmarking Distributed Stream Data Processing Systems
The need for scalable and efficient stream analysis has led to the
development of many open-source streaming data processing systems (SDPSs) with
highly diverging capabilities and performance characteristics. While first
initiatives try to compare the systems for simple workloads, there is a clear
gap of detailed analyses of the systems' performance characteristics. In this
paper, we propose a framework for benchmarking distributed stream processing
engines. We use our suite to evaluate the performance of three widely used
SDPSs in detail, namely Apache Storm, Apache Spark, and Apache Flink. Our
evaluation focuses in particular on measuring the throughput and latency of
windowed operations, which are the basic type of operations in stream
analytics. For this benchmark, we design workloads based on real-life,
industrial use-cases inspired by the online gaming industry. The contribution
of our work is threefold. First, we give a definition of latency and throughput
for stateful operators. Second, we carefully separate the system under test and
driver, in order to correctly represent the open world model of typical stream
processing deployments and can, therefore, measure system performance under
realistic conditions. Third, we build the first benchmarking framework to
define and test the sustainable performance of streaming systems.
Our detailed evaluation highlights the individual characteristics and
use-cases of each system.Comment: Published at ICDE 201
Adaptive Energy-aware Scheduling of Dynamic Event Analytics across Edge and Cloud Resources
The growing deployment of sensors as part of Internet of Things (IoT) is
generating thousands of event streams. Complex Event Processing (CEP) queries
offer a useful paradigm for rapid decision-making over such data sources. While
often centralized in the Cloud, the deployment of capable edge devices on the
field motivates the need for cooperative event analytics that span Edge and
Cloud computing. Here, we identify a novel problem of query placement on edge
and Cloud resources for dynamically arriving and departing analytic dataflows.
We define this as an optimization problem to minimize the total makespan for
all event analytics, while meeting energy and compute constraints of the
resources. We propose 4 adaptive heuristics and 3 rebalancing strategies for
such dynamic dataflows, and validate them using detailed simulations for 100 -
1000 edge devices and VMs. The results show that our heuristics offer
O(seconds) planning time, give a valid and high quality solution in all cases,
and reduce the number of query migrations. Furthermore, rebalance strategies
when applied in these heuristics have significantly reduced the makespan by
around 20 - 25%.Comment: 11 pages, 7 figure
Tolerating Correlated Failures in Massively Parallel Stream Processing Engines
Fault-tolerance techniques for stream processing engines can be categorized
into passive and active approaches. A typical passive approach periodically
checkpoints a processing task's runtime states and can recover a failed task by
restoring its runtime state using its latest checkpoint. On the other hand, an
active approach usually employs backup nodes to run replicated tasks. Upon
failure, the active replica can take over the processing of the failed task
with minimal latency. However, both approaches have their own inadequacies in
Massively Parallel Stream Processing Engines (MPSPE). The passive approach
incurs a long recovery latency especially when a number of correlated nodes
fail simultaneously, while the active approach requires extra replication
resources. In this paper, we propose a new fault-tolerance framework, which is
Passive and Partially Active (PPA). In a PPA scheme, the passive approach is
applied to all tasks while only a selected set of tasks will be actively
replicated. The number of actively replicated tasks depends on the available
resources. If tasks without active replicas fail, tentative outputs will be
generated before the completion of the recovery process. We also propose
effective and efficient algorithms to optimize a partially active replication
plan to maximize the quality of tentative outputs. We implemented PPA on top of
Storm, an open-source MPSPE and conducted extensive experiments using both real
and synthetic datasets to verify the effectiveness of our approach
Multi-tenant Pub/Sub processing for real-time data streams
Devices and sensors generate streams of data across a diversity of locations and protocols. That data usually reaches a central platform that is used to store and process the streams. Processing can be done in real time, with transformations and enrichment happening on-the-fly, but it can also happen after data is stored and organized in repositories. In the former case, stream processing technologies are required to operate on the data; in the latter batch analytics and queries are of common use.
This paper introduces a runtime to dynamically construct data stream processing topologies based on user-supplied code. These dynamic topologies are built on-the-fly using a data subscription model defined by the applications that consume data. Each user-defined processing unit is called a Service Object. Every Service Object consumes input data streams and may produce output streams that others can consume. The subscription-based programing model enables multiple users to deploy their own data-processing services. The runtime does the dynamic forwarding of data and execution of Service Objects from different users. Data streams can originate in real-world devices or they can be the outputs of Service Objects.
The runtime leverages Apache STORM for parallel data processing, that combined with dynamic user-code injection provides multi-tenant stream processing topologies. In this work we describe the runtime, its features and implementation details, as well as we include a performance evaluation of some of its core components.This work is partially supported by the European Research Council (ERC) un-
der the EU Horizon 2020 programme (GA 639595), the Spanish Ministry of
Economy, Industry and Competitivity (TIN2015-65316-P) and the Generalitat
de Catalunya (2014-SGR-1051).Peer ReviewedPostprint (author's final draft
Muppet: MapReduce-Style Processing of Fast Data
MapReduce has emerged as a popular method to process big data. In the past
few years, however, not just big data, but fast data has also exploded in
volume and availability. Examples of such data include sensor data streams, the
Twitter Firehose, and Facebook updates. Numerous applications must process fast
data. Can we provide a MapReduce-style framework so that developers can quickly
write such applications and execute them over a cluster of machines, to achieve
low latency and high scalability? In this paper we report on our investigation
of this question, as carried out at Kosmix and WalmartLabs. We describe
MapUpdate, a framework like MapReduce, but specifically developed for fast
data. We describe Muppet, our implementation of MapUpdate. Throughout the
description we highlight the key challenges, argue why MapReduce is not well
suited to address them, and briefly describe our current solutions. Finally, we
describe our experience and lessons learned with Muppet, which has been used
extensively at Kosmix and WalmartLabs to power a broad range of applications in
social media and e-commerce.Comment: VLDB201
- âŠ