Computing Approximate Statistical Discrepancy
Consider a geometric range space (X,A) where X is the union of a red set R and a blue set B. Let Phi(A) denote the absolute difference between the fraction of red points and the fraction of blue points that fall in the range A. The maximum discrepancy range is A^* = arg max_{A in (X,A)} Phi(A). Our goal is to find some Â in (X,A) such that Phi(A^*) - Phi(Â) <= epsilon. We develop general algorithms for this approximation problem for range spaces with bounded VC-dimension, as well as significant improvements for specific geometric range spaces defined by balls, halfspaces, and axis-aligned rectangles. This problem has direct applications in discrepancy evaluation and classification, and we also show an improved reduction to a class of problems in spatial scan statistics.
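To make the definition of Phi concrete, the following is a minimal Python sketch for the simplest range space, closed intervals on the real line. It is an exhaustive search over intervals whose endpoints are input points, not the approximation algorithms the abstract describes; the function names and the sample data are illustrative assumptions.

```python
def discrepancy(red, blue, lo, hi):
    """Phi(A) for the closed interval A = [lo, hi]:
    |fraction of red points in A - fraction of blue points in A|."""
    red_frac = sum(lo <= x <= hi for x in red) / len(red)
    blue_frac = sum(lo <= x <= hi for x in blue) / len(blue)
    return abs(red_frac - blue_frac)


def max_discrepancy_interval(red, blue):
    """Exhaustive search for the maximum-discrepancy interval.

    Any non-empty interval can be shrunk until both endpoints are input
    points without changing which points it contains, so it suffices to
    try endpoint pairs drawn from the input.
    """
    pts = sorted(set(red) | set(blue))
    best_phi, best_range = 0.0, None
    for i, lo in enumerate(pts):
        for hi in pts[i:]:
            phi = discrepancy(red, blue, lo, hi)
            if phi > best_phi:
                best_phi, best_range = phi, (lo, hi)
    return best_phi, best_range
```

With n input points this sketch evaluates O(n^2) intervals at O(n) cost each, which is the kind of exact search that an epsilon-approximation is meant to avoid.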
The Hunting of the Bump: On Maximizing Statistical Discrepancy
Anomaly detection has important applications in biosurveillance and
environmental monitoring. When comparing measured data to data drawn from a
baseline distribution, merely finding clusters in the measured data may not
reveal true anomalies, because these clusters may simply be the clusters of
the baseline distribution. Hence, a discrepancy function is often used to
examine how different measured data is from baseline data within a region. An
anomalous region is thus defined to be one with high discrepancy.
In this paper, we present algorithms for maximizing statistical discrepancy
functions over the space of axis-parallel rectangles. We give provable
approximation guarantees, both additive and relative, and our methods apply to
any convex discrepancy function. Our algorithms work by connecting statistical
discrepancy to combinatorial discrepancy; roughly speaking, we show that in
order to maximize a convex discrepancy function over a class of shapes, one
needs only maximize a linear discrepancy function over the same set of shapes.
We derive general discrepancy functions for data generated from a one-
parameter exponential family. This generalizes the widely-used Kulldorff scan
statistic for data from a Poisson distribution. We present an algorithm running
in that computes the maximum
discrepancy rectangle to within additive error , for the Kulldorff
scan statistic. Similar results hold for relative error and for discrepancy
functions for data coming from Gaussian, Bernoulli, and gamma distributions.
Prior to our work, the best known algorithms were exact and ran in time
. Comment: 11 pages. A short version of this paper will appear in SODA06. This
full version contains an additional short appendi
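As an illustration of a convex discrepancy function, here is a hedged Python sketch of the Kulldorff scan statistic for a single region. The abstract does not state the formula; the form below, in terms of the fraction m of measured data and the fraction b of baseline data inside the region, is the commonly cited one and is an assumption here, as are the function name and the choice to score only regions with m > b.

```python
import math


def kulldorff(m, b):
    """Kulldorff-style discrepancy of a region (assumed form).

    m: fraction of the measured data that falls in the region.
    b: fraction of the baseline data that falls in the region.
    Assumed formula: m*ln(m/b) + (1-m)*ln((1-m)/(1-b)) when m > b.
    """
    if not (0.0 < b < 1.0):
        return 0.0  # degenerate region; treated as zero in this sketch
    if m <= b:
        return 0.0  # this sketch only scores regions with excess measured mass

    def xlogy(x, y):
        # Convention 0 * log(0) = 0, so a region with m == 1 is well defined.
        return 0.0 if x == 0.0 else x * math.log(y)

    return xlogy(m, m / b) + xlogy(1.0 - m, (1.0 - m) / (1.0 - b))
```

Maximizing such a function over axis-parallel rectangles would mean evaluating it at the (m, b) pair of every candidate rectangle, which is the exact search that the approximation algorithms in the abstract are designed to speed up.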
A Linear-Time Kernel Goodness-of-Fit Test
We propose a novel adaptive test of goodness-of-fit, with computational cost
linear in the number of samples. We learn the test features that best indicate
the differences between observed samples and a reference model, by minimizing
the false negative rate. These features are constructed via Stein's method,
meaning that it is not necessary to compute the normalising constant of the
model. We analyse the asymptotic Bahadur efficiency of the new test, and prove
that under a mean-shift alternative, our test always has greater relative
efficiency than a previous linear-time kernel test, regardless of the choice of
parameters for that test. In experiments, the performance of our method exceeds
that of the earlier linear-time test, and matches or exceeds the power of a
quadratic-time kernel test. In high dimensions and where model structure may be
exploited, our goodness of fit test performs far better than a quadratic-time
two-sample test based on the Maximum Mean Discrepancy, with samples drawn from
the model. Comment: Accepted to NIPS 201
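For context on the quadratic-time baseline mentioned above, the following Python sketch computes the standard unbiased estimate of the squared Maximum Mean Discrepancy between two one-dimensional samples. It is not the linear-time test proposed in the abstract; the Gaussian kernel, the bandwidth default, and the function names are assumptions made for illustration.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel on real numbers."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased U-statistic estimate of MMD^2 between samples xs and ys.

    Cost is quadratic in the sample sizes because every pair is visited.
    Requires at least two points in each sample.
    """
    m, n = len(xs), len(ys)

    def k(a, b):
        return gaussian_kernel(a, b, bandwidth)

    # Within-sample terms exclude the diagonal (i == j).
    xx = sum(k(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    yy = sum(k(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    # Cross term uses all m * n pairs.
    xy = sum(k(a, b) for a in xs for b in ys)
    return xx / (m * (m - 1)) + yy / (n * (n - 1)) - 2.0 * xy / (m * n)
```

In a two-sample use of this estimate for goodness of fit, ys would be drawn from the model, which is the setup the abstract compares against.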
The Geometry of Differential Privacy: the Sparse and Approximate Cases
In this work, we study trade-offs between accuracy and privacy in the context
of linear queries over histograms. This is a rich class of queries that
includes contingency tables and range queries, and has been a focus of a long
line of work. For a set of linear queries over a database , we
seek to find the differentially private mechanism that has the minimum mean
squared error. For pure differential privacy, an approximation to
the optimal mechanism is known. Our first contribution is to give an approximation guarantee for the case of (epsilon, delta)-differential
privacy. Our mechanism is simple, efficient and adds correlated Gaussian noise
to the answers. We prove its approximation guarantee relative to the hereditary
discrepancy lower bound of Muthukrishnan and Nikolov, using tools from convex
geometry.
We next consider this question in the case when the number of queries exceeds
the number of individuals in the database, i.e. when . It is known that better mechanisms exist in this setting. Our second
main contribution is to give an (epsilon, delta)-differentially private
mechanism which is optimal up to a polylog(d, N) factor for any given query
set and any given upper bound on . This approximation is
achieved by coupling the Gaussian noise addition approach with a linear
regression step. We give an analogous result for the epsilon-differential
privacy setting. We also improve on the mean squared error upper bound for
answering counting queries on a database of size by Blum, Ligett, and Roth,
and match the lower bound implied by the work of Dinur and Nissim up to
logarithmic factors.
The connection between hereditary discrepancy and the privacy mechanism
enables us to derive the first polylogarithmic approximation to the hereditary
discrepancy of a matrix
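The noise-addition idea can be illustrated with the textbook Gaussian mechanism, sketched below in Python. This sketch adds independent noise to each answer, whereas the abstract's mechanism uses correlated Gaussian noise, so it should be read as a simplified stand-in. It assumes that neighboring databases differ by one in a single histogram cell, and it uses the classical calibration sigma = Delta_2 * sqrt(2 ln(1.25/delta)) / epsilon, which is stated for 0 < epsilon < 1. All names are hypothetical.

```python
import math
import random


def gaussian_mechanism(queries, histogram, eps, delta, rng=random):
    """Answer linear queries over a histogram with independent Gaussian noise.

    queries:   list of rows, each a list of coefficients (one per histogram cell).
    histogram: list of cell counts.
    Returns (noisy_answers, sigma).
    """
    if not (0.0 < eps < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("this calibration assumes 0 < eps < 1 and 0 < delta < 1")
    d = len(histogram)
    # Changing one cell by 1 moves the answer vector by one column of the
    # query matrix, so the L2 sensitivity is the largest column norm.
    sensitivity = max(
        math.sqrt(sum(row[j] ** 2 for row in queries)) for j in range(d)
    )
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps
    exact = [sum(c * h for c, h in zip(row, histogram)) for row in queries]
    return [a + rng.gauss(0.0, sigma) for a in exact], sigma
```

Passing a seeded `random.Random` instance as `rng` makes the noise reproducible for testing; a deployed mechanism would need a cryptographically appropriate noise source.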
ABC for Climate: Dealing with Expensive Simulators
A single molecule or molecule complex detection method is disclosed in certain aspects, comprising nano- or micro-fluidic channels.U
Bayesian optimisation for likelihood-free cosmological inference
Many cosmological models have only a finite number of parameters of interest,
but a very expensive data-generating process and an intractable likelihood
function. We address the problem of performing likelihood-free Bayesian
inference from such black-box simulation-based models, under the constraint of
a very limited simulation budget (typically a few thousand). To do so, we adopt
an approach based on the likelihood of an alternative parametric model.
Conventional approaches to approximate Bayesian computation such as
likelihood-free rejection sampling are impractical for the considered problem,
due to the lack of knowledge about how the parameters affect the discrepancy
between observed and simulated data. As a response, we make use of a strategy
previously developed in the machine learning literature (Bayesian optimisation
for likelihood-free inference, BOLFI), which combines Gaussian process
regression of the discrepancy to build a surrogate surface with Bayesian
optimisation to actively acquire training data. We extend the method by
deriving an acquisition function tailored for the purpose of minimising the
expected uncertainty in the approximate posterior density, in the parametric
approach. The resulting algorithm is applied to the problems of summarising
Gaussian signals and inferring cosmological parameters from the Joint
Lightcurve Analysis supernovae data. We show that the number of required
simulations is reduced by several orders of magnitude, and that the proposed
acquisition function produces more accurate posterior approximations, as
compared to common strategies. Comment: 16+9 pages, 12 figures. Matches PRD published version after minor
modification
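For reference, the conventional likelihood-free rejection sampling that the abstract calls impractical under a small simulation budget can be sketched in Python as follows. This is the baseline, not BOLFI or the proposed acquisition function; the toy Gaussian model, the uniform prior, the sample-mean summary, and all function names are assumptions made for illustration.

```python
import random


def abc_rejection(observed, simulate, sample_prior, distance, threshold, n_sims, rng):
    """Keep prior draws whose simulated summary lies within `threshold`
    of the observed summary. Every iteration costs one simulator call."""
    accepted = []
    for _ in range(n_sims):
        theta = sample_prior(rng)
        if distance(simulate(theta, rng), observed) <= threshold:
            accepted.append(theta)
    return accepted


# Toy model (assumed): the summary is the mean of 50 draws from N(theta, 1).
def toy_simulate(theta, rng, n=50):
    return sum(rng.gauss(theta, 1.0) for _ in range(n)) / n


def toy_prior(rng):
    return rng.uniform(-5.0, 5.0)


def abs_distance(a, b):
    return abs(a - b)
```

Most prior draws are rejected in this scheme, so the simulation budget is largely spent on parameters far from the data, which is the inefficiency that motivates modelling the discrepancy with a surrogate.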