12 research outputs found

    STREAMING ALGORITHMS FOR MINING FREQUENT ITEMS

    The streaming model has supplied solutions for handling enormous data flows for over 20 years. The model works with sequential data access and imposes sublinear memory as its primary restriction. Although the majority of the algorithms are randomized and approximate, the field supports numerous applications, from handling network traffic to analyzing cosmology simulations and beyond. This thesis focuses on one of the most foundational and well-studied problems, finding heavy hitters, i.e. frequent items:
    1. We address the long-standing complexity gap in finding heavy hitters with an L2 guarantee in the insertion-only stream and present the first optimal algorithm, with a space complexity of O(1) words and O(1) update time. Our result improves on the Count Sketch algorithm by Charikar et al. 2002 [39], which has space and time complexity of O(log n).
    2. We consider the L2-heavy hitter problem in the interval query setting, which is rapidly emerging in the field. Compared to the well-known sliding window model, where an algorithm is required to report the function of interest computed over the last N updates, the interval query model provides query flexibility: at any moment t one can query the function value on any interval (t1,t2)⊆(t−N,t). We present the first L2-heavy hitter algorithm in that model and extend the result to the estimation of all streamable functions of a frequency vector.
    3. We provide an experimental study of the recent space-optimal result on streaming quantiles by Karnin et al. 2016 [85]. The problem can be considered a generalization of heavy hitters. Additionally, we suggest several variations of the algorithm that improve the running time from O(1/ε) to O(log 1/ε), provide a twofold better space vs. precision trade-off, and extend the algorithm to the case of weighted updates.
    4. We establish the connection between finding "halos", i.e. dense areas, in cosmology N-body simulations and finding heavy hitters. We build the first halo finder and scale it up to handle data sets with up to 10^12 particles via GPU boosting, sampling and parallel I/O. We investigate its behavior and compare it to traditional in-memory halo finders. Our solution pushes the memory footprint from several terabytes down to less than a gigabyte, thereby making the problem feasible for small servers and even desktops.
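For context, the Count Sketch baseline named in this abstract can be illustrated with a toy version. This is a minimal sketch under stated assumptions, not the thesis's O(1)-word algorithm: the class name `CountSketch`, its parameters `rows`, `cols` and `seed`, and the use of Python's salted built-in `hash()` in place of pairwise-independent hash families are all illustrative choices, and items are assumed to be integers.

```python
import random
from statistics import median


class CountSketch:
    """Toy Count Sketch: `rows` independent rows of `cols` signed counters."""

    def __init__(self, rows: int = 5, cols: int = 256, seed: int = 0):
        rng = random.Random(seed)
        self.cols = cols
        self.table = [[0] * cols for _ in range(rows)]
        # Assumption: salted built-in hash() stands in for the
        # pairwise-independent hash families the analysis requires.
        self.salts = [(rng.getrandbits(64), rng.getrandbits(64))
                      for _ in range(rows)]

    def _bucket_and_sign(self, row: int, item: int) -> tuple[int, int]:
        bucket_salt, sign_salt = self.salts[row]
        bucket = hash((bucket_salt, item)) % self.cols
        sign = 1 if hash((sign_salt, item)) & 1 else -1
        return bucket, sign

    def update(self, item: int, count: int = 1) -> None:
        # Add the signed count to one counter in every row.
        for row, counters in enumerate(self.table):
            bucket, sign = self._bucket_and_sign(row, item)
            counters[bucket] += sign * count

    def estimate(self, item: int) -> float:
        # Each row gives a noisy guess; the median damps collisions.
        guesses = []
        for row, counters in enumerate(self.table):
            bucket, sign = self._bucket_and_sign(row, item)
            guesses.append(sign * counters[bucket])
        return median(guesses)
```

Feeding the sketch one item 1000 times plus 50 distinct items once each and then calling `estimate` on the frequent item returns a value close to 1000; the error in any row is bounded by the total count of the items that collide with it.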

    Approximating Properties of Data Streams

    In this dissertation, we present algorithms that approximate properties in the data stream model, where elements of an underlying data set arrive sequentially, but algorithms must use space sublinear in the size of the underlying data set. We first study the problem of finding all k-periods of a length-n string S, presented as a data stream. S is said to have k-period p if its prefix of length n − p differs from its suffix of length n − p in at most k locations. We give algorithms to compute the k-periods of a string S using poly(k, log n) bits of space, and we complement these results with comparable lower bounds. We then study the problem of identifying a longest substring of strings S and T of length n that forms a d-near-alignment under the edit distance, in the simultaneous streaming model. In this model, symbols of strings S and T are streamed at the same time and form a d-near-alignment if the distance between them in some given metric is at most d. We give several algorithms, including an exact one-pass algorithm that uses O(d^2 + d log n) bits of space. We then consider the distinct elements and ℓ_p-heavy hitters problems in the sliding window model, where only the most recent n elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram, along with a careful combination of existing techniques to track either the identity or frequency of a few specific items, suffices to obtain algorithms for both distinct elements and ℓ_p-heavy hitters that are nearly optimal in both n and ε. Finally, we consider the problem of estimating the maximum weighted matching of a graph whose edges are revealed in a streaming fashion. We develop a reduction from the maximum weighted matching problem to the maximum cardinality matching problem that only doubles the approximation factor of a streaming algorithm developed for the maximum cardinality matching problem. As an application, we obtain an estimator for the weight of a maximum weighted matching in bounded-arboricity graphs and, in particular, a (48 + ε)-approximation estimator for the weight of a maximum weighted matching in planar graphs.
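The k-period definition in this abstract can be made concrete with a direct check. The following is a naive illustration only: it stores the whole string and takes quadratic time, so it is not the dissertation's poly(k, log n)-space streaming algorithm, and the function name `k_periods` is a hypothetical choice.

```python
def k_periods(s: str, k: int) -> list[int]:
    """Return every p in 1..n-1 such that s has k-period p.

    s has k-period p if its prefix of length n - p and its suffix of
    length n - p differ in at most k positions.
    """
    n = len(s)
    periods = []
    for p in range(1, n):
        prefix, suffix = s[: n - p], s[p:]
        mismatches = sum(a != b for a, b in zip(prefix, suffix))
        if mismatches <= k:
            periods.append(p)
    return periods


print(k_periods("abcabcabc", 0))  # [3, 6]
print(k_periods("abcabcabd", 1))  # [3, 6, 8]
```

With k = 0 this reduces to the ordinary exact periods of the string; allowing one mismatch lets "abcabcabd" keep the periods that its final character would otherwise break.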

    Resolving the Complexity of Some Fundamental Problems in Computational Social Choice

    This thesis is in computational social choice, an area at the intersection of algorithms and social choice theory. Comment: Ph.D. Thesis