
    Near-Optimal Closeness Testing of Discrete Histogram Distributions

    We investigate the problem of testing the equivalence between two discrete histograms. A $k$-histogram over $[n]$ is a probability distribution that is piecewise constant over some set of $k$ intervals over $[n]$. Histograms have been extensively studied in computer science and statistics. Given a set of samples from two $k$-histogram distributions $p, q$ over $[n]$, we want to distinguish (with high probability) between the cases that $p = q$ and $\|p-q\|_1 \geq \epsilon$. The main contribution of this paper is a new algorithm for this testing problem and a nearly matching information-theoretic lower bound. Specifically, the sample complexity of our algorithm matches our lower bound up to a logarithmic factor, improving on previous work by polynomial factors in the relevant parameters. Our algorithmic approach applies in a more general setting and yields improved sample upper bounds for testing closeness of other structured distributions as well.
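
    To make the testing task concrete, here is a minimal plug-in sketch in Python: it builds empirical distributions from the two sample sets and thresholds their $\ell_1$ distance. The function name and the $\epsilon/2$ threshold are choices of this sketch, and its sample complexity is far worse than that of the paper's near-optimal tester.

```python
import numpy as np

def plugin_closeness_test(samples_p, samples_q, n, eps):
    """Naive plug-in closeness tester (illustration only, not the
    paper's algorithm): estimate p and q empirically and compare
    the empirical L1 distance against eps/2."""
    emp_p = np.bincount(samples_p, minlength=n) / len(samples_p)
    emp_q = np.bincount(samples_q, minlength=n) / len(samples_q)
    l1 = np.abs(emp_p - emp_q).sum()
    return "p = q" if l1 < eps / 2 else "||p - q||_1 >= eps"
```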

    Sampling Correctors

    In many situations, sample data is obtained from a noisy or imperfect source. In order to address such corruptions, this paper introduces the concept of a sampling corrector. Such algorithms use structure that the distribution is purported to have in order to make "on-the-fly" corrections to samples drawn from probability distributions. These algorithms then act as filters between the noisy data and the end user. We show connections between sampling correctors, distribution learning algorithms, and distribution property testing algorithms. We show that these connections can be utilized to expand the applicability of known distribution learning and property testing algorithms, as well as to achieve improved algorithms for those tasks. As a first step, we show how to design sampling correctors using proper learning algorithms. We then focus on the question of whether sampling correctors can be more efficient in terms of sample complexity than learning algorithms for the analogous families of distributions. When correcting monotonicity, we show that this is indeed the case when the corrector is also granted query access to the cumulative distribution function. We also obtain sampling correctors for monotonicity without this stronger type of access, provided that the distribution is originally very close to monotone (namely, at a distance $O(1/\log^2 n)$). In addition, we consider a restricted error model that aims at capturing "missing data" corruptions. In this model, we show that distributions that are close to monotone have sampling correctors that are significantly more efficient than those achievable by the learning approach. We also consider the question of whether an additional source of independent random bits is required by sampling correctors to implement the correction process.
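
    As a rough illustration of the "filter" view of sampling correctors, the sketch below wraps a noisy sampler in a rejection-style filter. All names here (SamplingCorrector, accept_prob) are hypothetical; the paper's correctors for near-monotone distributions are considerably more involved than plain rejection.

```python
import random

class SamplingCorrector:
    """Conceptual sketch: a filter sitting between a noisy sample
    source and the end user (hypothetical interface, not the
    paper's construction)."""

    def __init__(self, noisy_sampler, accept_prob):
        # noisy_sampler() returns one sample from the corrupted
        # distribution; accept_prob(x) in [0, 1] decides how much of
        # x's mass to keep, chosen so that accepted samples follow a
        # distribution with the purported structure (e.g., monotone).
        self.noisy_sampler = noisy_sampler
        self.accept_prob = accept_prob

    def sample(self):
        # Rejection-style filtering: redraw until a sample is accepted.
        while True:
            x = self.noisy_sampler()
            if random.random() < self.accept_prob(x):
                return x
```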

    Testing probability distributions underlying aggregated data

    In this paper, we analyze and study a hybrid model for testing and learning probability distributions. Here, in addition to samples, the testing algorithm is provided with one of two different types of oracles to the unknown distribution $D$ over $[n]$. More precisely, we define the dual and cumulative dual access models, in which the algorithm $A$ can both sample from $D$ and, for any $i\in[n]$, respectively query the probability mass $D(i)$ (query access) or get the total mass of $\{1,\dots,i\}$, i.e., $\sum_{j=1}^i D(j)$ (cumulative access). These two models, by generalizing the previously studied sampling and query oracle models, allow us to bypass the strong lower bounds established for a number of problems in these settings, while capturing several interesting aspects of these problems -- and providing new insight into the limitations of the models. Finally, we show that while the testing algorithms can in most cases be strictly more efficient, some tasks remain hard even with this additional power.
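
    The two access models map naturally onto a small oracle interface. The sketch below (all names illustrative) builds both oracles from an explicit probability vector; in the actual model, $D$ is unknown and the algorithm may only invoke these operations.

```python
import random
from bisect import bisect_right

class CumulativeDualOracle:
    """Sketch of the cumulative dual access model: the algorithm can
    sample from D, query D(i), and query the prefix mass of {1,...,i}."""

    def __init__(self, probs):
        # probs[i-1] = D(i); precompute the prefix sums once.
        self.probs = probs
        self.cdf = []
        total = 0.0
        for p in probs:
            total += p
            self.cdf.append(total)

    def sample(self):
        # Sampling oracle: one draw from D (elements of [n] are 1-indexed).
        u = random.random()
        return min(bisect_right(self.cdf, u), len(self.probs) - 1) + 1

    def query(self, i):
        # Query access: the probability mass D(i).
        return self.probs[i - 1]

    def cumulative(self, i):
        # Cumulative access: sum_{j <= i} D(j).
        return self.cdf[i - 1]
```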

    Learning mixtures of structured distributions over discrete domains

    Let $\mathfrak{C}$ be a class of probability distributions over the discrete domain $[n] = \{1,\dots,n\}$. We show that if $\mathfrak{C}$ satisfies a rather general condition -- essentially, that each distribution in $\mathfrak{C}$ can be well-approximated by a variable-width histogram with few bins -- then there is a highly efficient (both in terms of running time and sample complexity) algorithm that can learn any mixture of $k$ unknown distributions from $\mathfrak{C}$. We analyze several natural types of distributions over $[n]$, including log-concave, monotone hazard rate, and unimodal distributions, and show that they have the required structural property of being well-approximated by a histogram with few bins. Applying our general algorithm, we obtain near-optimally efficient algorithms for all these mixture learning problems. Comment: preliminary full version of SODA'13 paper.
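
    The structural condition is easy to state in code: given a partition of $[n]$ into intervals, "flattening" a distribution spreads each interval's mass uniformly. The helper below is illustrative only (the paper's algorithm must find good intervals from samples, whereas here they are given). For example, for a monotone distribution, Birgé's oblivious partition into roughly $O(\log n/\epsilon)$ intervals yields $O(\epsilon)$ flattening error in $\ell_1$.

```python
import numpy as np

def flatten_to_histogram(p, intervals):
    """Spread the mass of p uniformly within each given interval
    (intervals are (start, end) index pairs partitioning range(len(p)));
    illustrative helper, not the paper's learning algorithm."""
    q = np.zeros(len(p))
    for s, e in intervals:
        q[s:e] = p[s:e].sum() / (e - s)  # constant value on each bin
    return q

# Example: a 2-bin flattening of a distribution over [8].
p = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02])
q = flatten_to_histogram(p, [(0, 2), (2, 8)])
print(np.abs(p - q).sum())  # L1 flattening error
```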

    New Algorithms for Large Datasets and Distributions

    In this dissertation, we make progress on certain algorithmic problems broadly over two computational models: the streaming model for large datasets and the distribution testing model for large probability distributions. First we consider the streaming model, where a large sequence of data items arrives one by one. The computer needs to make one pass over this sequence, processing every item quickly, in a limited space. In Chapter 2, motivated by a bioinformatics application, we consider the problem of estimating the number of low-frequency items in a stream, which has received only limited theoretical attention so far. We give an efficient streaming algorithm for this problem and show that its complexity is almost optimal. In Chapter 3, we consider a distributed variant of the streaming model, where each item of the data sequence arrives arbitrarily at one of a set of computers, which together need to compute certain functions over the entire stream. In such scenarios, combining all the data at a single computer is infeasible due to the large communication overhead. We give the first algorithm for k-median clustering in this model. Moreover, we give new algorithms for frequency moments and clustering functions in the distributed sliding-window model, where the computation is limited to the most recent W items as the items arrive in the stream. In Chapter 5, we consider the identity testing problem: given two distributions P (unknown; only samples are obtained) and Q (known) over a common sample space of exponential size, we need to distinguish P = Q (output ‘yes’) from P being far from Q (output ‘no’). In general, this problem requires an exponential number of samples. To circumvent this lower bound, the problem was recently studied under certain structural assumptions. In particular, optimally efficient testers were given assuming P and Q are product distributions. For such product distributions, we give the first tolerant testers, which output ‘yes’ not only when P = Q but also when P is close to Q. Likewise, we study the tolerant closeness testing problem for such product distributions, where Q too is accessed only through samples. Adviser: Vinodchandran N. Variyam
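
    As a rough sketch of the product-distribution setting in Chapter 5, consider P and Q as products of d Bernoulli coordinates, so Q is specified by its coordinate means. The naive tester below compares empirical coordinate means of P against Q's; this per-coordinate proxy (and the eps/2 threshold) is purely illustrative and is not the dissertation's tolerant tester.

```python
import numpy as np

def product_identity_test_sketch(samples_p, q_means, eps):
    """Illustrative identity test for Bernoulli product distributions:
    samples_p is an (m, d) 0/1 array of samples from P, and q_means
    holds the d coordinate means of the known product distribution Q.
    Compares the summed coordinate-mean gap to eps/2 (a crude proxy,
    not the dissertation's statistic)."""
    emp_means = np.asarray(samples_p, dtype=float).mean(axis=0)
    gap = np.abs(emp_means - np.asarray(q_means)).sum()
    return "yes (P = Q plausible)" if gap < eps / 2 else "no (P far from Q)"
```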