Quality-Driven Disorder Handling for M-way Sliding Window Stream Joins
Sliding window join is one of the most important operators for stream
applications. To produce high quality join results, a stream processing system
must deal with the ubiquitous disorder within input streams which is caused by
network delay, asynchronous source clocks, etc. Disorder handling involves an
inevitable tradeoff between the latency and the quality of produced join
results. To meet different requirements of stream applications, it is desirable
to provide a user-configurable result-latency vs. result-quality tradeoff.
Existing disorder handling approaches either do not provide such
configurability, or support only user-specified latency constraints.
In this work, we advocate the idea of quality-driven disorder handling, and
propose a buffer-based disorder handling approach for sliding window joins,
which minimizes sizes of input-sorting buffers, thus the result latency, while
respecting user-specified result-quality requirements. The core of our approach
is an analytical model which directly captures the relationship between sizes
of input buffers and the produced result quality. Our approach is generic. It
supports m-way sliding window joins with arbitrary join conditions. Experiments
on real-world and synthetic datasets show that, compared to the state of the
art, our approach can reduce the result latency incurred by disorder handling
by up to 95% while providing the same level of result quality. Comment: 12 pages, 11 figures, IEEE ICDE 201
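The buffer-size versus latency tradeoff described in this abstract can be illustrated with a minimal sketch of a generic slack-based input-sorting buffer. This is not the paper's analytical model or its quality-driven sizing; the class name `SlackBuffer`, the `slack` parameter, and the helper `run` are assumptions made for illustration only. A tuple is held until it is at least `slack` time units older than the largest timestamp seen so far, so a larger `slack` repairs more disorder at the cost of more latency.

```python
import heapq


class SlackBuffer:
    """Hypothetical input-sorting buffer with a fixed slack (not the paper's method)."""

    def __init__(self, slack):
        self.slack = slack            # how long (in timestamp units) tuples are held back
        self.heap = []                # min-heap ordered by timestamp
        self.max_ts = float("-inf")   # largest timestamp observed so far
        self.seq = 0                  # tie-breaker so payloads are never compared

    def insert(self, ts, payload):
        """Buffer one tuple and return the tuples that may now be released, in order."""
        self.max_ts = max(self.max_ts, ts)
        heapq.heappush(self.heap, (ts, self.seq, payload))
        self.seq += 1
        released = []
        # Release everything that is at least `slack` older than the newest tuple.
        while self.heap and self.heap[0][0] <= self.max_ts - self.slack:
            t, _, p = heapq.heappop(self.heap)
            released.append((t, p))
        return released

    def flush(self):
        """Release all remaining tuples in timestamp order (end of stream)."""
        out = []
        while self.heap:
            t, _, p = heapq.heappop(self.heap)
            out.append((t, p))
        return out


def run(arrivals, slack):
    """Feed (timestamp, payload) pairs in arrival order and collect the output order."""
    buf = SlackBuffer(slack)
    out = []
    for ts, payload in arrivals:
        out.extend(buf.insert(ts, payload))
    out.extend(buf.flush())
    return out


arrivals = [(1, "a"), (3, "b"), (2, "c"), (5, "d"), (4, "e")]
print([ts for ts, _ in run(arrivals, slack=2)])  # fully sorted output
print([ts for ts, _ in run(arrivals, slack=0)])  # no buffering: disorder passes through
```

With `slack=0` every tuple is forwarded immediately and the disorder reaches the join; with `slack=2` this particular input is fully sorted before release. Choosing the smallest slack that still meets a quality target is the problem the abstract addresses.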
SQPR: Stream Query Planning with Reuse
When users submit new queries to a distributed stream processing system (DSPS), a query planner must allocate physical resources, such as CPU cores, memory and network bandwidth, from a set of hosts to queries. Allocation decisions must provide the correct mix of resources required by queries, while achieving an efficient overall allocation to scale in the number of admitted queries. By exploiting overlap between queries and reusing partial results, a query planner can conserve resources but has to carry out more complex planning decisions. In this paper, we describe SQPR, a query planner that targets DSPSs in data centre environments with heterogeneous resources. SQPR models query admission, allocation and reuse as a single constrained optimisation problem and solves an approximate version to achieve scalability. It prevents individual resources from becoming bottlenecks by re-planning past allocation decisions and supports different allocation objectives. As our experimental comparison with a state-of-the-art planner shows, SQPR makes efficient resource allocation decisions, even at high resource utilisation, with acceptable overheads.
Dynamic Optimization and Migration of Continuous Queries Over Data Streams
Continuous queries process real-time streaming data and output results in streams for a wide range of applications. Due to fluctuating stream characteristics, a streaming database system needs to dynamically adapt query execution. This dissertation proposes novel solutions to continuous query adaptation in three core areas, namely dynamic query optimization, dynamic plan migration and partitioned query adaptation. Runtime query optimization needs to efficiently generate plans that satisfy both CPU and memory resource constraints. Existing work focuses on minimizing intermediate query results, which decreases memory and CPU usage simultaneously. However, doing so cannot ensure that both resource constraints are satisfied, because memory and CPU usage can be either positively or negatively correlated. This part of the dissertation proposes efficient optimization strategies that utilize both types of correlation to search the entire query plan space in polynomial time, whereas a typical exhaustive search would take at least exponential time. Extensive experimental evaluations have demonstrated the effectiveness of the proposed strategies. Dynamic plan migration is concerned with the on-the-fly transition from one continuous plan to a semantically equivalent yet more efficient plan. It is required to guarantee the continuation and repeatability of dynamic query optimization. However, this research area has been largely neglected in the current literature. The second part of this dissertation proposes migration strategies that dynamically migrate continuous queries while guaranteeing the integrity of the query results, meaning there are no missing, duplicate or incorrect results. The extensive experimental evaluations show that the proposed strategies vary significantly in terms of output rates and memory usage given distinct system configurations and stream workloads.
Partitioned query processing is effective for processing continuous queries with large stateful operators in a distributed system. Dynamic load redistribution is necessary to balance uneven workload across machines due to changing stream properties. However, existing solutions generally assume static query plans without runtime query optimization. This part of the dissertation evaluates the benefits of applying query optimization in partitioned query processing and shows a dramatic performance improvement of more than 300%. Several load balancing strategies are then proposed that account for the heterogeneity of plan shapes across machines caused by dynamic query optimization. The effectiveness of the proposed strategies is analyzed through extensive experiments using a cluster.
A peer to peer approach to large scale information monitoring
Issued as final report. National Science Foundation (U.S.
A Survey on the Evolution of Stream Processing Systems
Stream processing has been an active research field for more than 20 years,
but it is now witnessing its prime time due to recent successful efforts by the
research community and numerous worldwide open-source communities. This survey
provides a comprehensive overview of fundamental aspects of stream processing
systems and their evolution in the functional areas of out-of-order data
management, state management, fault tolerance, high availability, load
management, elasticity, and reconfiguration. We review noteworthy past research
findings, outline the similarities and differences between early ('00-'10) and
modern ('11-'18) streaming systems, and discuss recent trends and open
problems. Comment: 34 pages, 15 figures, 5 tables
Processing Exact Results for Queries over Data Streams
In a growing number of information-processing applications, such as network-traffic monitoring, sensor networks, financial analysis, and data mining for e-commerce, data takes the form of continuous data streams rather than traditional stored databases or relational tuples. These applications share common features such as the need for real-time analysis, huge volumes of data, and unpredictable and bursty arrivals of stream elements. In all of these applications, it is infeasible to process queries over data streams by loading the data into a traditional database management system (DBMS) or into main memory. Such an approach does not scale with high stream rates. As a consequence, systems that can manage streaming data have gained tremendous importance. The need to process a large number of continuous queries over bursty, high-volume online data streams, potentially in real time, makes it imperative to design algorithms that use limited resources.
This dissertation focuses on processing exact results for join queries over high-speed data streams using limited resources, and proposes several novel techniques for processing join queries that incorporate secondary storage and non-dedicated computers. Existing approaches for stream joins either (a) deal with memory limitations by shedding load, and therefore cannot produce exact or highly accurate results for joins over data streams with time-varying arrivals of stream tuples, or (b) suffer from large I/O overheads due to random disk accesses. The proposed techniques exploit the high bandwidth of a disk subsystem by rendering the data access pattern largely sequential, eliminating small, random disk accesses. This dissertation proposes an I/O-efficient algorithm to process hybrid join queries, which join a fast, time-varying or bursty data stream with a persistent disk relation. Such a hybrid join is the crux of a number of common transformations in an active data warehouse. Experimental results demonstrate that the proposed scheme reduces the response time of output results by exploiting spatio-temporal locality within the input stream, and minimizes disk overhead through disk-I/O amortization.
The dissertation also proposes an algorithm to parallelize a stream join operator over a shared-nothing system. The proposed algorithm distributes the processing load across a number of independent, non-dedicated nodes, based on a fixed or predefined communication pattern; dynamically maintains the degree of declustering in order to minimize communication and processing overheads; and presents mechanisms for reducing storage and communication overheads while scaling over a large number of nodes. We present experimental results showing the efficacy of the proposed algorithms.
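The idea of spreading a stream join over shared-nothing nodes can be sketched with a static hash-partitioned symmetric hash join. This is only an illustration under simplifying assumptions: the names `Node`, `route`, and `partitioned_join` are hypothetical, the join is an unwindowed equi-join, and the number of nodes is fixed, whereas the dissertation describes dynamically maintaining the degree of declustering.

```python
from collections import defaultdict


class Node:
    """One shared-nothing node holding join state for its share of the keys."""

    def __init__(self):
        self.state = {"R": defaultdict(list), "S": defaultdict(list)}

    def process(self, side, key, value):
        """Store the tuple, then probe the opposite stream's state for matches."""
        self.state[side][key].append(value)
        other = "S" if side == "R" else "R"
        matches = self.state[other].get(key, [])
        if side == "R":
            return [(key, value, m) for m in matches]
        return [(key, m, value) for m in matches]


def route(key, num_nodes):
    """Send every tuple with the same key to the same node (static hashing)."""
    return hash(key) % num_nodes


def partitioned_join(events, num_nodes):
    """events: iterable of (side, key, value) with side in {"R", "S"}."""
    nodes = [Node() for _ in range(num_nodes)]
    results = []
    for side, key, value in events:
        results.extend(nodes[route(key, num_nodes)].process(side, key, value))
    return results


events = [("R", 1, "r1"), ("S", 1, "s1"), ("S", 2, "s2"),
          ("R", 2, "r2"), ("R", 1, "r1b")]
print(partitioned_join(events, num_nodes=3))
```

Because tuples with equal keys always land on the same node, each node can join its partition without consulting the others, and the combined output matches that of a single-node join over the same input.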