8 research outputs found

    Pseudo-Deterministic Streaming

    A pseudo-deterministic algorithm is a (randomized) algorithm which, when run multiple times on the same input, with high probability outputs the same result on all executions. Classic streaming algorithms, such as those for finding heavy hitters, approximate counting, $\ell_2$ approximation, and finding a nonzero entry in a vector (for turnstile algorithms), are not pseudo-deterministic. For example, in the instance of finding a nonzero entry in a vector, for any known low-space algorithm A, there exists a stream x so that running A twice on x (using different randomness) would with high probability result in two different entries as the output. In this work, we study whether it is inherent that these algorithms output different values on different executions. That is, we ask whether these problems have low-memory pseudo-deterministic algorithms. For instance, we show that there is no low-memory pseudo-deterministic algorithm for finding a nonzero entry in a vector (given in a turnstile fashion), and also that there is no low-dimensional pseudo-deterministic sketching algorithm for $\ell_2$ norm estimation. We also exhibit problems which do have low-memory pseudo-deterministic algorithms but no low-memory deterministic algorithm, such as outputting a nonzero row of a matrix, or outputting a basis for the row-span of a matrix. We also investigate multi-pseudo-deterministic algorithms: algorithms which with high probability output one of a few options. We show the first lower bounds for such algorithms. This implies that there are streaming problems such that every low-space algorithm for the problem must have inputs where there are many valid outputs, all with a significant probability of being output.

    Optimal lower bounds for universal relation, and for samplers and finding duplicates in streams

    In the communication problem $\mathbf{UR}$ (universal relation) [KRW95], Alice and Bob respectively receive $x, y \in \{0,1\}^n$ with the promise that $x \neq y$. The last player to receive a message must output an index $i$ such that $x_i \neq y_i$. We prove that the randomized one-way communication complexity of this problem in the public coin model is exactly $\Theta(\min\{n, \log(1/\delta)\log^2(\frac{n}{\log(1/\delta)})\})$ for failure probability $\delta$. Our lower bound holds even if promised $\operatorname{support}(y) \subset \operatorname{support}(x)$. As a corollary, we obtain optimal lower bounds for $\ell_p$-sampling in strict turnstile streams for $0 \le p < 2$, as well as for the problem of finding duplicates in a stream. Our lower bounds do not need to use large weights, and hold even if promised $x \in \{0,1\}^n$ at all points in the stream. We give two different proofs of our main result. The first proof demonstrates that any algorithm $\mathcal{A}$ solving sampling problems in turnstile streams in low memory can be used to encode subsets of $[n]$ of certain sizes into a number of bits below the information theoretic minimum. Our encoder makes adaptive queries to $\mathcal{A}$ throughout its execution, but done carefully so as to not violate correctness. This is accomplished by injecting random noise into the encoder's interactions with $\mathcal{A}$, which is loosely motivated by techniques in differential privacy. Our second proof is via a novel randomized reduction from Augmented Indexing [MNSW98] which needs to interact with $\mathcal{A}$ adaptively. To handle the adaptivity we identify certain likely interaction patterns and union bound over them to guarantee correct interaction on all of them. To guarantee correctness, it is important that the interaction hides some of its randomness from $\mathcal{A}$ in the reduction. Comment: merge of arXiv:1703.08139 and of work of Kapralov, Woodruff, and Yahyazade

    A Simple Proof of a New Set Disjointness with Applications to Data Streams


    Approximating Properties of Data Streams

    In this dissertation, we present algorithms that approximate properties in the data stream model, where elements of an underlying data set arrive sequentially, but algorithms must use space sublinear in the size of the underlying data set. We first study the problem of finding all k-periods of a length-n string S, presented as a data stream. S is said to have k-period p if its prefix of length n − p differs from its suffix of length n − p in at most k locations. We give algorithms to compute the k-periods of a string S using poly(k, log n) bits of space and we complement these results with comparable lower bounds. We then study the problem of identifying a longest substring of strings S and T of length n that forms a d-near-alignment under the edit distance, in the simultaneous streaming model. In this model, symbols of strings S and T are streamed at the same time and form a d-near-alignment if the distance between them in some given metric is at most d. We give several algorithms, including an exact one-pass algorithm that uses $O(d^2 + d \log n)$ bits of space. We then consider the distinct elements and $\ell_p$-heavy hitters problems in the sliding window model, where only the most recent n elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and $\ell_p$-heavy hitters that are nearly optimal in both n and $\epsilon$. Finally, we consider the problem of estimating the maximum weighted matching of a graph whose edges are revealed in a streaming fashion. We develop a reduction from the maximum weighted matching problem to the maximum cardinality matching problem that only doubles the approximation factor of a streaming algorithm developed for the maximum cardinality matching problem. As an application, we obtain an estimator for the weight of a maximum weighted matching in bounded-arboricity graphs and in particular, a $(48 + \epsilon)$-approximation estimator for the weight of a maximum weighted matching in planar graphs.
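
    To make the k-period definition in the abstract above concrete, here is a minimal offline sketch. It is not the dissertation's streaming algorithm: it stores the whole string and uses linear space, so it only serves as a reference check of the definition. The function names `has_k_period` and `k_periods` are hypothetical, and restricting p to 1..n−1 is an assumption made here to skip the trivial case p = n.

```python
def has_k_period(s: str, p: int, k: int) -> bool:
    """Check the definition directly: s has k-period p if its prefix of
    length n - p and its suffix of length n - p differ in at most k positions.

    Offline reference check only (linear space), not a streaming algorithm.
    """
    n = len(s)
    if not 0 < p < n:
        raise ValueError("p must satisfy 0 < p < len(s)")
    prefix = s[: n - p]   # first n - p symbols
    suffix = s[p:]        # last n - p symbols
    mismatches = sum(a != b for a, b in zip(prefix, suffix))
    return mismatches <= k


def k_periods(s: str, k: int) -> list[int]:
    """Return every p in 1..n-1 that is a k-period of s (assumed range)."""
    return [p for p in range(1, len(s)) if has_k_period(s, p, k)]


if __name__ == "__main__":
    print(k_periods("abcabcabc", 0))
    print(has_k_period("abcabdabc", 3, 2))
```

    With k = 0 this reduces to the usual exact notion of a period; larger k tolerates up to k mismatches between the shifted copies.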

    Finding duplicates in a data stream

    Given a data stream of length n over an alphabet [m] where $n > m$, we consider the problem of finding a duplicate in a single pass. We give a randomized algorithm for this problem that uses $O((\log m)^3)$ space. This answers a question of Muthukrishnan [Mut05] and Tarui [Tar07], who asked if this problem could be solved using sub-linear space and one pass over the input. Our algorithm solves the more general problem of finding a positive frequency element in a stream given by frequency updates where the sum of all frequencies is positive. Our main tool is an Isolation Lemma that reduces this problem to the task of detecting and identifying a Dictatorial variable in a Boolean halfspace. We present various relaxations of the condition $n > m$, under which one can find duplicates efficiently.
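
    The link between finding a duplicate and finding a positive-frequency element can be illustrated with a small sketch. The encoding used here (one −1 update per alphabet symbol, then one +1 update per stream item) is an assumption chosen for illustration and is not taken from the abstract. The code also keeps an explicit frequency table, so it uses linear space and is not the paper's $O((\log m)^3)$-space algorithm. The name `duplicate_via_frequency_updates` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def duplicate_via_frequency_updates(stream: Iterable[int], m: int) -> Optional[int]:
    """Find a duplicate by viewing the input as a stream of frequency updates.

    Illustrative linear-space baseline, NOT a low-space streaming algorithm.
    Assumes the alphabet is [m] = {1, ..., m}.
    """
    freq: Counter = Counter()

    # Assumed encoding: start every symbol at frequency -1 ...
    for symbol in range(1, m + 1):
        freq[symbol] -= 1

    # ... then apply a +1 update for each stream item.
    for item in stream:
        freq[item] += 1

    # If the stream has length n > m, the frequencies sum to n - m > 0,
    # so some symbol has positive frequency, i.e. it occurred at least twice.
    for symbol in range(1, m + 1):
        if freq[symbol] > 0:
            return symbol
    return None  # possible only when the promise n > m does not hold


if __name__ == "__main__":
    print(duplicate_via_frequency_updates([1, 2, 3, 2], m=3))
```

    Under this encoding a symbol ends with positive frequency exactly when it appears at least twice, which is why an algorithm for the positive-frequency problem yields a duplicate finder.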