111 research outputs found

    GreedyDual-Join: Locality-Aware Buffer Management for Approximate Join Processing Over Data Streams

    Full text link
    We investigate adaptive buffer management techniques for approximate evaluation of sliding window joins over multiple data streams. In many applications, data stream processing systems have limited memory or have to deal with very high speed data streams. In both cases, computing the exact results of joins between these streams may not be feasible, mainly because the buffers used to compute the joins contain much smaller number of tuples than the tuples contained in the sliding windows. Therefore, a stream buffer management policy is needed in that case. We show that the buffer replacement policy is an important determinant of the quality of the produced results. To that end, we propose GreedyDual-Join (GDJ) an adaptive and locality-aware buffering technique for managing these buffers. GDJ exploits the temporal correlations (at both long and short time scales), which we found to be prevalent in many real data streams. We note that our algorithm is readily applicable to multiple data streams and multiple joins and requires almost no additional system resources. We report results of an experimental study using both synthetic and real-world data sets. Our results demonstrate the superiority and flexibility of our approach when contrasted to other recently proposed techniques

    Parallelizing Windowed Stream Joins in a Shared-Nothing Cluster

    Full text link
    The availability of large number of processing nodes in a parallel and distributed computing environment enables sophisticated real time processing over high speed data streams, as required by many emerging applications. Sliding window stream joins are among the most important operators in a stream processing system. In this paper, we consider the issue of parallelizing a sliding window stream join operator over a shared nothing cluster. We propose a framework, based on fixed or predefined communication pattern, to distribute the join processing loads over the shared-nothing cluster. We consider various overheads while scaling over a large number of nodes, and propose solution methodologies to cope with the issues. We implement the algorithm over a cluster using a message passing system, and present the experimental results showing the effectiveness of the join processing algorithm.Comment: 11 page

    Quality-Driven Disorder Handling for M-way Sliding Window Stream Joins

    Full text link
    Sliding window join is one of the most important operators for stream applications. To produce high quality join results, a stream processing system must deal with the ubiquitous disorder within input streams which is caused by network delay, asynchronous source clocks, etc. Disorder handling involves an inevitable tradeoff between the latency and the quality of produced join results. To meet different requirements of stream applications, it is desirable to provide a user-configurable result-latency vs. result-quality tradeoff. Existing disorder handling approaches either do not provide such configurability, or support only user-specified latency constraints. In this work, we advocate the idea of quality-driven disorder handling, and propose a buffer-based disorder handling approach for sliding window joins, which minimizes sizes of input-sorting buffers, thus the result latency, while respecting user-specified result-quality requirements. The core of our approach is an analytical model which directly captures the relationship between sizes of input buffers and the produced result quality. Our approach is generic. It supports m-way sliding window joins with arbitrary join conditions. Experiments on real-world and synthetic datasets show that, compared to the state of the art, our approach can reduce the result latency incurred by disorder handling by up to 95% while providing the same level of result quality.Comment: 12 pages, 11 figures, IEEE ICDE 201

    A peer to peer approach to large scale information monitoring

    Get PDF
    Issued as final reportNational Science Foundation (U.S.

    Efficient Processing of Continuous Join Queries using Distributed Hash Tables

    Get PDF
    International audienceThis paper addresses the problem of computing approximate answers to continuous join queries. We present a new method, called DHTJoin, which combines hash-based placement of tuples in a Distributed Hash Table (DHT) and dissemination of queries using a gossip style protocol. We provide a performance evaluation of DHTJoin which shows that DHTJoin can achieve significant performance gains in terms of network traffic

    Metrics and Algorithms for Processing Multiple Continuous Queries

    Get PDF
    Data streams processing is an emerging research area that is driven by the growing need for monitoring applications. A monitoring application continuously processes streams of data for interesting, significant, or anomalous events. Such applications include tracking the stock market, real-time detection of diseaseoutbreaks, and environmental monitoring via sensor networks.Efficient employment of those monitoring applications requires advanced data processing techniques that can support the continuous processing of unbounded rapid data streams. Such techniques go beyond the capabilities of the traditional store-then-query Data BaseManagement Systems. This need has led to a new data processing paradigm and created a new generation of data processing systems,supporting continuous queries (CQ) on data streams.Primary emphasis in the development of first generation Data Stream Management Systems (DSMSs) was given to basic functionality. However, in order to support large-scale heterogeneous applications that are envisioned for subsequent generations of DSMSs, greater attention willhave to be paid to performance issues. Towards this, this thesis introduces new algorithms and metrics to the current design of DSMSs.This thesis identifies a collection of quality ofservice (QoS) and quality of data (QoD) metrics that are suitable for a wide range of monitoring applications. The establishment of well-defined metrics aids in the development of novel algorithms that are optimal with respect to a particular metric. Our proposed algorithms exploit the valuable chances for optimization that arise in the presence of multiple applications. Additionally, they aim to balance the trade-off between the DSMS's overall performance and the performance perceived by individual applications. Furthermore, we provide efficient implementations of the proposed algorithms and we also extend them to exploit sharing in optimized multi-query plans and multi-stream CQs. Finally, we experimentally show that our algorithms consistently outperform the current state of the art

    A PROCRUSTEAN APPROACH TO STREAM PROCESSING

    Get PDF
    The increasing demand for real-time data processing and the constantly growing data volume have contributed to the rapid evolution of Stream Processing Engines (SPEs), which are designed to continuously process data as it arrives. Low operational cost and timely delivery of results are both objectives of paramount importance for SPEs. Given the volatile and uncharted nature of data streams, achieving the aforementioned goals under fixed resources is a challenge. This calls for adaptable SPEs, which can react to fluctuations in processing demands. In the past, three techniques have been developed for improving an SPE’s ability to adapt. Those techniques are classified based on applications’ requirements on exact or approximate results: stream partitioning, and re-partitioning target exact, and load shedding targets approximate processing. Stream partitioning strives to balance load among processors, and previous techniques neglected hidden costs of distributed execution. Load Shedding lowers the accuracy of results by dropping part of the input, and previous techniques did not cope with evolving streams. Stream re-partitioning is used to reconfigure execution while processing takes place, and previous techniques did not fully utilize window semantics. In this dissertation, we put stream processing in a procrustean bed, in terms of the manner and the degree that processing takes place. To this end, we present new approaches, for window-based aggregate operators, which are applicable to both exact and approximate stream processing in modern SPEs. Our stream partitioning, re-partitioning, and load shedding solutions offer improvements in performance and accuracy on real-world data by exploiting the semantics of both data and operations. In addition, we present SPEAr, the design of an SPE that accelerates processing by delivering approximate results with accuracy guarantees and avoiding unnecessary load. Finally, we contribute a hybrid technique, ShedPart, which can further improve load balance and performance of an SPE

    Processing Exact Results for Queries over Data Streams

    Get PDF
    In a growing number of information-processing applications, such as network-traffic monitoring, sensor networks, financial analysis, data mining for e-commerce, etc., data takes the form of continuous data streams rather than traditional stored databases/relational tuples. These applications have some common features like the need for real time analysis, huge volumes of data, and unpredictable and bursty arrivals of stream elements. In all of these applications, it is infeasible to process queries over data streams by loading the data into a traditional database management system (DBMS) or into main memory. Such an approach does not scale with high stream rates. As a consequence, systems that can manage streaming data have gained tremendous importance. The need to process a large number of continuous queries over bursty, high volume online data streams, potentially in real time, makes it imperative to design algorithms that should use limited resources. This dissertation focuses on processing exact results for join queries over high speed data streams using limited resources, and proposes several novel techniques for processing join queries incorporating secondary storages and non-dedicated computers. Existing approaches for stream joins either, (a) deal with memory limitations by shedding loads, and therefore can not produce exact or highly accurate results for the stream joins over data streams with time varying arrivals of stream tuples, or (b) suffer from large I/O-overheads due to random disk accesses. The proposed techniques exploit the high bandwidth of a disk subsystem by rendering the data access pattern largely sequential, eliminating small, random disk accesses. This dissertation proposes an I/O-efficient algorithm to process hybrid join queries, that join a fast, time varying or bursty data stream and a persistent disk relation. Such a hybrid join is the crux of a number of common transformations in an active data warehouse. Experimental results demonstrate that the proposed scheme reduces the response time in output results by exploiting spatio-temporal locality within the input stream, and minimizes disk overhead through disk-I/O amortization. The dissertation also proposes an algorithm to parallelize a stream join operator over a shared-nothing system. The proposed algorithm distributes the processing loads across a number of independent, non-dedicated nodes, based on a fixed or predefined communication pattern; dynamically maintains the degree of declustering in order to minimize communication and processing overheads; and presents mechanisms for reducing storage and communication overheads while scaling over a large number of nodes. We present experimental results showing the efficacy of the proposed algorithms

    gSPICE: Model-Based Event Shedding in Complex Event Processing

    Full text link
    Overload situations, in the presence of resource limitations, in complex event processing (CEP) systems are typically handled using load shedding to maintain a given latency bound. However, load shedding might negatively impact the quality of results (QoR). To minimize the shedding impact on QoR, CEP researchers propose shedding approaches that drop events/internal state with the lowest importances/utilities. In both black-box and white-box shedding approaches, different features are used to predict these utilities. In this work, we propose a novel black-box shedding approach that uses a new set of features to drop events from the input event stream to maintain a given latency bound. Our approach uses a probabilistic model to predict these event utilities. Moreover, our approach uses Zobrist hashing and well-known machine learning models, e.g., decision trees and random forests, to handle the predicted event utilities. Through extensive evaluations on several synthetic and two real-world datasets and a representative set of CEP queries, we show that, in the majority of cases, our load shedding approach outperforms state-of-the-art black-box load shedding approaches, w.r.t. QoR

    Continuously Providing Approximate Results under Limited Resources: Load Shedding and Spilling in XML Streams

    Get PDF
    Because of the high volume and unpredictable arrival rates, stream processing systems may not always be able to keep up with the input data streams, resulting in buffer overflow and uncontrolled loss of data. To continuously supply online results, two alternate solutions to tackle this problem of unpredictable failures of such overloaded systems can be identified. One technique, called load shedding, drops some fractions of data from the input stream to reduce the memory and CPU requirements of the workload. However, dropping some portions of the input data means that the accuracy of the output is reduced since some data is lost. To produce eventually complete results, the second technique, called data spilling, pushes some fractions of data to persistent storage temporarily when the processing speed cannot keep up with the arrival rate. The processing of the disk resident data is then postponed until a later time when system resources become available. This dissertation explores these load reduction technologies in the context of XML stream systems. Load shedding in the specific context of XML streams poses several unique opportunities and challenges. Since XML data is hierarchical, subelements, extracted from different positions of the XML tree structure, may vary in their importance. Further, dropping different subelements may vary in their savings of storage and computation. Hence, unlike prior work in the literature that drops data completely or not at all, in this dissertation we introduce the notion of structure-oriented load shedding, meaning selectively some XML subelements are shed from the possibly complex XML objects in the XML stream. First we develop a preference model that enables users to specify the relative importance of preserving different subelements within the XML result structure. This transforms shedding into the problem of rewriting the user query into shed queries that return approximate answers with their utility as measured by the user preference model. Our optimizer finds the appropriate shed queries to maximize the output utility driven by our structure-based preference model under the limitation of available computation resources. The experimental results demonstrate that our proposed XML-specific shedding solution consistently achieves higher utility results compared to the existing relational shedding techniques. Second, we introduces structure-based spilling, a spilling technique customized for XML streams by considering the spilling of partial substructures of possibly complex XML elements. Several new challenges caused by structure-based spilling are addressed. When a path is spilled, multiple other paths may be affected. We categorize varying types of spilling side effects on the query caused by spilling. How to execute the reduced query to produce the correct runtime output is also studied. Three optimization strategies are developed to select the reduced query that maximizes the output quality. We also examine the clean-up stage to guarantee that an entire result set is eventually generated by producing supplementary results to complement the partial results output earlier. The experimental study demonstrates that our proposed solutions consistently achieve higher quality results compared to the state-of-the-art techniques. Third, we design an integrated framework that combines both shedding and spilling policies into one comprehensive methodology. Decisions on the choice of whether to shed or spill data may be affected by the application needs and data arrival patterns. For some input data, it may be worth to flush it to disk if a delayed output of its result will be important, while other data would best directly dropped from the system given that a delayed delivery of these results would no longer be meaningful to the application. Therefore we need sophisticated technologies capable of deploying both shedding and spilling techniques within one integrated strategy with the ability to deliver the most appropriate decision customers need for each specific circumstance. We propose a novel flexible framework for structure-based shed and spill approaches, applicable in any XML stream system. We propose a solution space that represents all the shed and spill candidates. An age-based quality model is proposed for evaluating the output quality for different reduced query and supplementary query pairs. We also propose a family of four optimization strategies, OptF, OptSmart, HiX and Fex. OptF and OptSmart are both guaranteed to identify an optimal solution of reduced and supplementary query pair, with OptSmart exhibiting significantly less overhead than OptF. HiX and Fex use heuristic-based approaches that are much more efficient than OptF and OptSmart
    • …
    corecore