
    Pseudo-Deterministic Streaming

    A pseudo-deterministic algorithm is a (randomized) algorithm which, when run multiple times on the same input, with high probability outputs the same result on all executions. Classic streaming algorithms, such as those for finding heavy hitters, approximate counting, ℓ_2 approximation, and finding a nonzero entry in a vector (for turnstile algorithms), are not pseudo-deterministic. For example, in the instance of finding a nonzero entry in a vector, for any known low-space algorithm A, there exists a stream x so that running A twice on x (using different randomness) would with high probability result in two different entries as the output. In this work, we study whether it is inherent that these algorithms output different values on different executions. That is, we ask whether these problems have low-memory pseudo-deterministic algorithms. For instance, we show that there is no low-memory pseudo-deterministic algorithm for finding a nonzero entry in a vector (given in a turnstile fashion), and also that there is no low-dimensional pseudo-deterministic sketching algorithm for ℓ_2 norm estimation. We also exhibit problems which do have low-memory pseudo-deterministic algorithms but no low-memory deterministic algorithm, such as outputting a nonzero row of a matrix, or outputting a basis for the row-span of a matrix. We also investigate multi-pseudo-deterministic algorithms: algorithms which with high probability output one of a few options. We show the first lower bounds for such algorithms. This implies that there are streaming problems such that every low-space algorithm for the problem must have inputs where there are many valid outputs, each with a significant probability of being output.
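
    As a toy illustration of the distinction (a minimal sketch, not taken from the paper; the function names are hypothetical), compare a randomized finder that samples a uniformly random nonzero coordinate, which typically disagrees with itself across runs, with one that commits to a canonical answer:

```python
import random

def find_nonzero_sampled(x):
    # Randomized: return a uniformly random nonzero index.
    # Two runs on the same x will often disagree, so this
    # is NOT pseudo-deterministic.
    support = [i for i, v in enumerate(x) if v != 0]
    return random.choice(support)

def find_nonzero_canonical(x):
    # Pseudo-deterministic (here fully deterministic): always
    # return the smallest nonzero index, a canonical answer.
    # The paper shows that in the turnstile streaming model no
    # low-memory algorithm can maintain such a canonical choice.
    return next(i for i, v in enumerate(x) if v != 0)

x = [0, 3, 0, 7, 0, 2]
print({find_nonzero_sampled(x) for _ in range(20)})    # typically {1, 3, 5}
print({find_nonzero_canonical(x) for _ in range(20)})  # always {1}
```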

    On the Pseudo-Deterministic Query Complexity of NP Search Problems

    We study pseudo-deterministic query complexity: randomized query algorithms that are required to output the same answer with high probability on all inputs. We prove Ω(√n) lower bounds on the pseudo-deterministic complexity of a large family of search problems based on unsatisfiable random CNF instances, and also for the promise problem (FIND1) of finding a 1 in a vector in which at least half the entries are ones. This gives an exponential separation between randomized query complexity and pseudo-deterministic complexity, which is tight in the quantum setting. As applications, we partially solve a related combinatorial coloring problem, and we separate random tree-like Resolution from its pseudo-deterministic version. In contrast to our lower bound, we show, surprisingly, that in the zero-error, average-case setting, the three notions (deterministic, randomized, pseudo-deterministic) collapse.
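
    For intuition about FIND1 (a minimal sketch under my own assumptions, not the paper's construction), the natural randomized query algorithm samples positions until it hits a 1, using O(1) expected queries, yet its answer depends on the coins:

```python
import random

def find1_randomized(query, n, rng):
    # Sample uniform positions until a 1 is found. With at least
    # half the entries equal to 1, this takes O(1) expected queries,
    # but the returned index depends on the random coins, so the
    # algorithm is not pseudo-deterministic.
    while True:
        i = rng.randrange(n)
        if query(i) == 1:
            return i

# Toy instance: odd positions hold a 1 (exactly half the entries).
n = 16
x = [i % 2 for i in range(n)]
answers = {find1_randomized(lambda i: x[i], n, random.Random(s)) for s in range(10)}
print(answers)  # several distinct valid answers across seeds
```

    The Ω(√n) lower bound says that forcing all such runs to agree on one canonical index makes the problem substantially harder.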

    Lower Bounds for Pseudo-Deterministic Counting in a Stream

    Many streaming algorithms provide only a high-probability relative approximation. These two relaxations, of allowing approximation and randomization, seem necessary: for many streaming problems, both relaxations must be employed simultaneously, to avoid an exponentially larger (and often trivial) space complexity. A common drawback of these randomized approximate algorithms is that independent executions on the same input have different outputs, which depend on their random coins. Pseudo-deterministic algorithms combat this issue, and for every input, they output with high probability the same ``canonical'' solution. We consider perhaps the most basic problem in data streams: counting the number of items in a stream of length at most n. Morris's counter [CACM, 1978] is a randomized approximation algorithm for this problem that uses O(log log n) bits of space, for every fixed approximation factor (greater than 1). Goldwasser, Grossman, Mohanty and Woodruff [ITCS 2020] asked whether pseudo-deterministic approximation algorithms can match this space complexity. Our main result answers their question negatively, and shows that such algorithms must use Ω(√(log n / log log n)) bits of space. Our approach is based on a problem that we call Shift Finding, which may be of independent interest. In this problem, one has query access to a shifted version of a known string F ∈ {0,1}^{3n}, which is guaranteed to start with n zeros and end with n ones, and the goal is to find the unknown shift using a small number of queries. We provide for this problem an algorithm that uses O(√n) queries. It remains open whether poly(log n) queries suffice; if true, then our techniques immediately imply a nearly-tight Ω(log n / log log n) space bound for pseudo-deterministic approximate counting.
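
    For reference, Morris's counter itself is short; the following is a minimal sketch in its standard textbook form, not code from the paper. It stores only X ≈ log2(count), so X fits in O(log log n) bits, and its estimate 2^X - 1 is unbiased, but independent runs return visibly different values:

```python
import random

def morris_count(stream_length, rng=random):
    # Morris's approximate counter: keep only X ~ log2(count).
    # On each arriving item, increment X with probability 2**(-X).
    # The estimate 2**X - 1 is unbiased for the true count.
    x = 0
    for _ in range(stream_length):
        if rng.random() < 2.0 ** (-x):
            x += 1
    return 2 ** x - 1

n = 100_000
print([morris_count(n) for _ in range(5)])  # run-to-run variation is exactly
                                            # what pseudo-determinism forbids
```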

    Complete Problems for Multi-Pseudodeterministic Computations

    We exhibit several computational problems that are complete for multi-pseudodeterministic computations in the following sense: (1) these problems admit 2-pseudodeterministic algorithms; (2) if there exists a pseudodeterministic algorithm for any of these problems, then any multi-valued function that admits a k-pseudodeterministic algorithm for a constant k also admits a pseudodeterministic algorithm. We also show that these computational problems are complete for Search-BPP: a pseudodeterministic algorithm for any of these problems implies a pseudodeterministic algorithm for all problems in Search-BPP.
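
    To make the k-pseudodeterministic notion concrete (a hypothetical empirical check, not a construction from the paper), one can test whether, on a fixed input, at most k distinct outputs absorb almost all of the probability mass over independent runs:

```python
import random
from collections import Counter

def looks_k_pseudodeterministic(algo, x, k, runs=2000, threshold=0.9):
    # Empirical, one-sided check on a single input: do at most k
    # distinct outputs cover at least a `threshold` fraction of
    # independent executions?
    counts = Counter(algo(x) for _ in range(runs))
    covered = sum(c for _, c in counts.most_common(k))
    return covered / runs >= threshold

# A contrived 2-pseudodeterministic algorithm: one coin flip
# selects between two canonical answers.
def two_valued(x):
    return min(x) if random.random() < 0.5 else max(x)

print(looks_k_pseudodeterministic(two_valued, [3, 1, 4, 1, 5], k=2))  # True
print(looks_k_pseudodeterministic(two_valued, [3, 1, 4, 1, 5], k=1))  # False w.h.p.
```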

    Reproducible genomics analysis pipelines with GNU Guix

    In bioinformatics, as well as other computationally intensive research fields, there is a need for workflows that can reliably produce consistent output, independent of the software environment or configuration settings of the machine on which they are executed. Indeed, this is essential for controlled comparison between different observations or for the wider dissemination of workflows. Providing this type of reproducibility, however, is often complicated by the need to accommodate the myriad dependencies included in a larger body of software, each of which generally comes in various versions. Moreover, in many fields (bioinformatics being a prime example), these versions are subject to continual change due to rapidly evolving technologies, further complicating problems related to reproducibility. Here, we propose a principled approach to building analysis pipelines and managing their dependencies. As a case study to demonstrate the utility of our approach, we present a set of highly reproducible pipelines for the analysis of RNA-seq, ChIP-seq, Bisulfite-seq, and single-cell RNA-seq data. All pipelines process raw experimental data and generate reports containing publication-ready plots and figures, with interactive report elements and standard observables. Users may install these highly reproducible packages and apply them to their own datasets without any special computational expertise beyond use of the command line. We hope such a toolkit will provide immediate benefit to laboratory workers wishing to process their own data sets and to bioinformaticians seeking to automate all, or parts of, their analyses. In the long term, we hope our approach to reproducibility will serve as a blueprint for reproducible workflows in other areas. Our pipelines, along with their corresponding documentation and sample reports, are available at http://bioinformatics.mdc-berlin.de/pig

    List and Certificate Complexities in Replicable Learning

    We investigate replicable learning algorithms. Ideally, we would like to design algorithms that output the same canonical model over multiple runs, even when different runs observe a different set of samples from the unknown data distribution. In general, such a strong notion of replicability is not achievable. Thus we consider two feasible notions of replicability, called list replicability and certificate replicability. Intuitively, these notions capture the degree of (non-)replicability. We design algorithms for certain learning problems that are optimal in list and certificate complexity. We establish matching impossibility results.
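
    One standard trick from the replicability literature (a hedged sketch under my own assumptions; not necessarily the construction in this paper) is to round an empirical statistic to a coarse grid whose offset comes from shared randomness, so that runs on independent sample sets usually land on the same grid point:

```python
import random

def replicable_mean(samples, grid_width, shared_seed):
    # Round the empirical mean to a grid of spacing `grid_width`
    # whose offset is drawn from shared randomness. Two runs that
    # share the seed return the same value whenever both empirical
    # means fall between the same pair of grid boundaries.
    offset = random.Random(shared_seed).uniform(0, grid_width)
    mean = sum(samples) / len(samples)
    return round((mean - offset) / grid_width) * grid_width + offset

rng = random.Random(0)
run1 = [rng.gauss(5.0, 1.0) for _ in range(10_000)]
run2 = [rng.gauss(5.0, 1.0) for _ in range(10_000)]
print(replicable_mean(run1, 0.5, shared_seed=42))
print(replicable_mean(run2, 0.5, shared_seed=42))  # equal with high probability
```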