
    Computing Approximate Statistical Discrepancy

    Consider a geometric range space (X, 𝒜) where the ground set X is the union of a red set R and a blue set B. Let Phi(A) denote the absolute difference between the fraction of red points and the fraction of blue points that fall in a range A. The maximum discrepancy range is A* = arg max_{A in 𝒜} Phi(A). Our goal is to find some range Â in 𝒜 such that Phi(A*) - Phi(Â) <= epsilon. We develop general algorithms for this approximation problem for range spaces of bounded VC-dimension, as well as significant improvements for specific geometric range spaces defined by balls, halfspaces, and axis-aligned rectangles. This problem has direct applications in discrepancy evaluation and classification, and we also show an improved reduction to a class of problems in spatial scan statistics.
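    As a point of reference for the problem statement above, the sketch below evaluates Phi by brute force over axis-aligned rectangles anchored at data coordinates. It enumerates on the order of n^4 candidate rectangles, so it is an illustrative baseline only, not the paper's faster approximation algorithms; all names are made up for the example.

```python
import itertools
import numpy as np

def max_rectangle_discrepancy(red, blue):
    """Brute-force max of Phi(A) over axis-aligned rectangles whose
    corners are anchored at data coordinates. Illustrative baseline
    only; the paper's algorithms approximate this far more cheaply."""
    pts = np.vstack([red, blue])
    xs, ys = np.unique(pts[:, 0]), np.unique(pts[:, 1])
    best = 0.0
    for x0, x1 in itertools.combinations(xs, 2):
        for y0, y1 in itertools.combinations(ys, 2):
            frac_inside = lambda P: np.mean(
                (P[:, 0] >= x0) & (P[:, 0] <= x1) &
                (P[:, 1] >= y0) & (P[:, 1] <= y1))
            # Phi(A): |fraction of red in A - fraction of blue in A|
            best = max(best, abs(frac_inside(red) - frac_inside(blue)))
    return best

red = np.random.rand(30, 2)
blue = np.random.rand(30, 2) + [0.2, 0.0]   # shifted blue cloud
print(max_rectangle_discrepancy(red, blue))
```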

    The Hunting of the Bump: On Maximizing Statistical Discrepancy

    Anomaly detection has important applications in biosurveillance and environmental monitoring. When comparing measured data to data drawn from a baseline distribution, merely finding clusters in the measured data may not actually reveal true anomalies: those clusters may simply be clusters of the baseline distribution. Hence, a discrepancy function is often used to examine how different the measured data are from the baseline data within a region, and an anomalous region is defined to be one with high discrepancy. In this paper, we present algorithms for maximizing statistical discrepancy functions over the space of axis-parallel rectangles. We give provable approximation guarantees, both additive and relative, and our methods apply to any convex discrepancy function. Our algorithms work by connecting statistical discrepancy to combinatorial discrepancy; roughly speaking, we show that in order to maximize a convex discrepancy function over a class of shapes, one needs only maximize a linear discrepancy function over the same set of shapes. We derive general discrepancy functions for data generated from a one-parameter exponential family. This generalizes the widely-used Kulldorff scan statistic for data from a Poisson distribution. We present an algorithm running in time $O(\frac{1}{\epsilon} n^2 \log^2 n)$ that computes the maximum discrepancy rectangle to within additive error $\epsilon$, for the Kulldorff scan statistic. Similar results hold for relative error and for discrepancy functions for data coming from Gaussian, Bernoulli, and gamma distributions. Prior to our work, the best known algorithms were exact and ran in time $O(n^4)$. Comment: 11 pages. A short version of this paper will appear in SODA06. This full version contains an additional short appendix.
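    The Kulldorff scan statistic mentioned above has a closed form as a convex discrepancy function of two fractions: the fraction m of measured points and the fraction b of baseline points inside a region. Below is a minimal sketch of that function under the standard Poisson likelihood-ratio form; this is our reading of the statistic, not code from the paper.

```python
import numpy as np

def kulldorff(m, b, eps=1e-12):
    """Kulldorff (Poisson likelihood-ratio) discrepancy for a region
    containing a fraction m of the measured data and a fraction b of
    the baseline data."""
    m = float(np.clip(m, eps, 1 - eps))
    b = float(np.clip(b, eps, 1 - eps))
    if m <= b:          # only over-dense regions count as anomalous
        return 0.0
    return m * np.log(m / b) + (1 - m) * np.log((1 - m) / (1 - b))

# The paper's reduction rests on convexity: since the discrepancy is
# convex in (m, b), its maximum over the (m, b) pairs realized by a
# shape class is attained on their convex hull, so it suffices to
# repeatedly maximize linear functions a*m - c*b over the same shapes.
print(kulldorff(0.30, 0.10))   # over-dense region, positive score
print(kulldorff(0.10, 0.30))   # under-dense region, scores zero
```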

    A Linear-Time Kernel Goodness-of-Fit Test

    We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein's method, meaning that it is not necessary to compute the normalising constant of the model. We analyse the asymptotic Bahadur efficiency of the new test, and prove that under a mean-shift alternative, our test always has greater relative efficiency than a previous linear-time kernel test, regardless of the choice of parameters for that test. In experiments, the performance of our method exceeds that of the earlier linear-time test, and matches or exceeds the power of a quadratic-time kernel test. In high dimensions and where model structure may be exploited, our goodness-of-fit test performs far better than a quadratic-time two-sample test based on the Maximum Mean Discrepancy, with samples drawn from the model. Comment: Accepted to NIPS 2017.
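    To make the Stein construction concrete, here is a minimal 1-D sketch of a finite-set Stein-feature statistic in the spirit of the paper's test, using a Gaussian kernel and a standard-normal model (so the score is d/dx log p(x) = -x). The paper additionally tunes the test locations and kernel to maximize test power; this sketch fixes them by hand and uses a simple plug-in estimate.

```python
import numpy as np

def stein_feature_stat(x, score, v, sigma=1.0):
    """Plug-in estimate of a finite-set Stein discrepancy in 1-D:
    average the Stein features xi(x_i, v_j) over the sample, then
    average the squared means over the J test locations v."""
    x = x[:, None]                                   # (n, 1)
    v = v[None, :]                                   # (1, J)
    k = np.exp(-(x - v) ** 2 / (2 * sigma ** 2))     # Gaussian kernel
    dk = -(x - v) / sigma ** 2 * k                   # d/dx k(x, v)
    xi = score(x) * k + dk          # Stein operator applied to k
    tau = xi.mean(axis=0)           # (J,) feature means; ~0 under H0
    return np.mean(tau ** 2)

# Model p = N(0, 1), so score(x) = -x; no normalising constant needed.
rng = np.random.default_rng(0)
x_ok = rng.normal(0.0, 1.0, 500)      # sample matching the model
x_bad = rng.normal(0.5, 1.0, 500)     # mean-shift alternative
v = np.array([-1.0, 0.0, 1.0])        # fixed test locations
print(stein_feature_stat(x_ok, lambda x: -x, v))    # near zero
print(stein_feature_stat(x_bad, lambda x: -x, v))   # clearly larger
```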

    The Geometry of Differential Privacy: the Sparse and Approximate Cases

    In this work, we study trade-offs between accuracy and privacy in the context of linear queries over histograms. This is a rich class of queries that includes contingency tables and range queries, and has been a focus of a long line of work. For a set of $d$ linear queries over a database $x \in \mathbb{R}^N$, we seek to find the differentially private mechanism that has the minimum mean squared error. For pure differential privacy, an $O(\log^2 d)$ approximation to the optimal mechanism is known. Our first contribution is to give an $O(\log^2 d)$ approximation guarantee for the case of $(\epsilon,\delta)$-differential privacy. Our mechanism is simple, efficient and adds correlated Gaussian noise to the answers. We prove its approximation guarantee relative to the hereditary discrepancy lower bound of Muthukrishnan and Nikolov, using tools from convex geometry. We next consider this question in the case when the number of queries exceeds the number of individuals in the database, i.e. when $d > n \triangleq \|x\|_1$. It is known that better mechanisms exist in this setting. Our second main contribution is to give an $(\epsilon,\delta)$-differentially private mechanism which is optimal up to a $\mathrm{polylog}(d,N)$ factor for any given query set $A$ and any given upper bound $n$ on $\|x\|_1$. This approximation is achieved by coupling the Gaussian noise addition approach with a linear regression step. We give an analogous result for the $\epsilon$-differential privacy setting. We also improve on the mean squared error upper bound for answering counting queries on a database of size $n$ by Blum, Ligett, and Roth, and match the lower bound implied by the work of Dinur and Nissim up to logarithmic factors. The connection between hereditary discrepancy and the privacy mechanism enables us to derive the first polylogarithmic approximation to the hereditary discrepancy of a matrix $A$.
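    For contrast with the correlated-noise mechanism described above, the baseline it improves on is the classical isotropic Gaussian mechanism for $(\epsilon,\delta)$-differential privacy. A minimal sketch, assuming histogram neighbours differ by one in a single coordinate (so the $\ell_2$ sensitivity is the largest column norm of the query matrix):

```python
import numpy as np

def gaussian_mechanism(A, x, eps, delta, rng=None):
    """Classical (eps, delta)-DP Gaussian mechanism for linear queries
    y = A x over a histogram x. The paper's mechanism improves on this
    by shaping *correlated* noise via convex geometry; here the noise
    is isotropic, as a baseline sketch (valid for eps < 1)."""
    rng = rng or np.random.default_rng()
    sens = np.linalg.norm(A, axis=0).max()   # max column l2 norm
    sigma = sens * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return A @ x + rng.normal(0.0, sigma, size=A.shape[0])

A = np.tril(np.ones((8, 8)))   # prefix-range queries over 8 bins
x = np.array([3, 0, 1, 4, 2, 0, 0, 5], dtype=float)
print(gaussian_mechanism(A, x, eps=1.0, delta=1e-5))
```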

    Bayesian optimisation for likelihood-free cosmological inference

    Many cosmological models have only a finite number of parameters of interest, but a very expensive data-generating process and an intractable likelihood function. We address the problem of performing likelihood-free Bayesian inference from such black-box simulation-based models, under the constraint of a very limited simulation budget (typically a few thousand). To do so, we adopt an approach based on the likelihood of an alternative parametric model. Conventional approaches to approximate Bayesian computation such as likelihood-free rejection sampling are impractical for the considered problem, due to the lack of knowledge about how the parameters affect the discrepancy between observed and simulated data. As a response, we make use of a strategy previously developed in the machine learning literature (Bayesian optimisation for likelihood-free inference, BOLFI), which combines Gaussian process regression of the discrepancy to build a surrogate surface with Bayesian optimisation to actively acquire training data. We extend the method by deriving an acquisition function tailored for the purpose of minimising the expected uncertainty in the approximate posterior density, in the parametric approach. The resulting algorithm is applied to the problems of summarising Gaussian signals and inferring cosmological parameters from the Joint Lightcurve Analysis supernovae data. We show that the number of required simulations is reduced by several orders of magnitude, and that the proposed acquisition function produces more accurate posterior approximations, as compared to common strategies. Comment: 16+9 pages, 12 figures. Matches PRD published version after minor modifications.
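    A toy sketch of the BOLFI-style loop described above: fit a Gaussian process to the discrepancies observed so far, then acquire the next simulation where a lower confidence bound on the discrepancy is smallest. The paper derives a different, tailored acquisition function (minimising expected posterior uncertainty); this sketch substitutes a generic LCB rule, and the simulator below is a hypothetical stand-in for the expensive model.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def simulate(theta, rng):
    """Hypothetical black-box simulator: returns the discrepancy
    between simulated and observed summaries at parameter theta."""
    return (theta - 1.7) ** 2 + 0.1 * rng.normal()

rng = np.random.default_rng(0)
grid = np.linspace(-3, 3, 200)[:, None]          # candidate thetas
thetas = list(rng.uniform(-3, 3, 5))             # initial design
ds = [simulate(t, rng) for t in thetas]

for _ in range(20):                              # tiny simulation budget
    gp = GaussianProcessRegressor(RBF(1.0), alpha=1e-2).fit(
        np.array(thetas)[:, None], ds)           # surrogate surface
    mu, sd = gp.predict(grid, return_std=True)
    t_next = grid[np.argmin(mu - 2.0 * sd), 0]   # generic LCB acquisition
    thetas.append(t_next)
    ds.append(simulate(t_next, rng))

print("lowest-discrepancy theta:", thetas[int(np.argmin(ds))])
```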