28,134 research outputs found

    New Algorithms for Distributed Sliding Windows

    Get PDF
    Computing functions over a distributed stream of data is a significant problem with practical applications. The distributed streaming model is a natural computational model to deal with such scenarios. The goal in this model is to maintain an approximate value of a function of interest over a data stream distributed across several computational nodes. These computational nodes have a two-way communication channel with a coordinator node that maintains an approximation of the function over the entire data stream seen so far. The resources of interest, which need to be minimized, are communication (primary), space, and update time. A practical variant of this model is that of distributed sliding window (dsw), where the computation is limited to the last W items, where W is the window size. Important problems such as sampling and counting have been investigated in this model. However, certain problems including computing frequency moments and metric clustering, that are well studied in other streaming models, have not been considered in the distributed sliding window model. We give the first algorithms for computing the frequency moments and metric clustering problems in the distributed sliding window model. Our algorithms for these problems are a result of a general transfer theorem we establish that transforms any algorithm in the distributed infinite window model to an algorithm in the distributed sliding window model, for a large class of functions. In particular, we show an efficient adaptation of the smooth histogram technique of Braverman and Ostrovsky, to the distributed streaming model. Our construction allows trade-offs between communication and space. If we optimize for communication, we get algorithms that are as communication efficient as their infinite window counter parts (upto polylogarithmic factors)

    Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window

    Get PDF
    The past decade has witnessed many interesting algorithms for maintaining statistics over a data stream. This paper initiates a theoretical study of algorithms for monitoring distributed data streams over a time-based sliding window (which contains a variable number of items and possibly out-of-order items). The concern is how to minimize the communication between individual streams and the root, while allowing the root, at any time, to be able to report the global statistics of all streams within a given error bound. This paper presents communication-efficient algorithms for three classical statistics, namely, basic counting, frequent items and quantiles. The worst-case communication cost over a window is O(kϵlogϵNk)O(\frac{k} {\epsilon} \log \frac{\epsilon N}{k}) bits for basic counting and O(kϵlogNk)O(\frac{k}{\epsilon} \log \frac{N}{k}) words for the remainings, where kk is the number of distributed data streams, NN is the total number of items in the streams that arrive or expire in the window, and ϵ<1\epsilon < 1 is the desired error bound. Matching and nearly matching lower bounds are also obtained.Comment: 12 pages, to appear in the 27th International Symposium on Theoretical Aspects of Computer Science (STACS), 201

    Continuous Nearest Neighbor Queries over Sliding Windows

    Get PDF
    Abstract—This paper studies continuous monitoring of nearest neighbor (NN) queries over sliding window streams. According to this model, data points continuously stream in the system, and they are considered valid only while they belong to a sliding window that contains 1) the W most recent arrivals (count-based) or 2) the arrivals within a fixed interval W covering the most recent time stamps (time-based). The task of the query processor is to constantly maintain the result of long-running NN queries among the valid data. We present two processing techniques that apply to both count-based and time-based windows. The first one adapts conceptual partitioning, the best existing method for continuous NN monitoring over update streams, to the sliding window model. The second technique reduces the problem to skyline maintenance in the distance-time space and precomputes the future changes in the NN set. We analyze the performance of both algorithms and extend them to variations of NN search. Finally, we compare their efficiency through a comprehensive experimental evaluation. The skyline-based algorithm achieves lower CPU cost, at the expense of slightly larger space overhead. Index Terms—Location-dependent and sensitive, spatial databases, query processing, nearest neighbors, data streams, sliding windows.
    corecore