Search CORE

151 research outputs found

ε-Kernel Coresets for Stochastic Points

Author: Huang Lingxiao
Li Jian
Phillips Jeff Mark
Wang Haitao
Publication venue: Hosted by Utah State University Libraries
Publication date: 01/01/2016
Field of study

With the dramatic growth in the number of application domains that generate probabilistic, noisy and uncertain data, there has been an increasing interest in designing algorithms for geometric or combinatorial optimization problems over such data. In this paper, we initiate the study of constructing epsilon-kernel coresets for uncertain points. We consider uncertainty in the existential model where each point\u27s location is fixed but only occurs with a certain probability, and the locational model where each point has a probability distribution describing its location. An epsilon-kernel coreset approximates the width of a point set in any direction. We consider approximating the expected width (an ε-EXP-KERNEL), as well as the probability distribution on the width (an (ε, tau)-QUANT-KERNEL) for any direction. We show that there exists a set of O(ε^{-(d-1)/2}) deterministic points which approximate the expected width under the existential and locational models, and we provide efficient algorithms for constructing such coresets. We show, however, it is not always possible to find a subset of the original uncertain points which provides such an approximation. However, if the existential probability of each point is lower bounded by a constant, an ε-EXP-KERNEL is still possible. We also provide efficient algorithms for construct an (ε, τ)-QUANT-KERNEL coreset in nearly linear time. Our techniques utilize or connect to several important notions in probability and geometry, such as Kolmogorov distances, VC uniform convergence and Tukey depth, and may be useful in other geometric optimization problem in stochastic settings. Finally, combining with known techniques, we show a few applications to approximating the extent of uncertain functions, maintaining extent measures for stochastic moving points and some shape fitting problems under uncertainty

arXiv.org e-Print Archive

DigitalCommons@USU

Dagstuhl Research Online Publication Server

Practical bounds on the error of Bayesian posterior approximations: A nonasymptotic approach

Author: Broderick Tamara
Campbell Trevor
Huggins Jonathan H.
Kasprzak Mikołaj
Publication venue
Publication date: 01/10/2018
Field of study

Bayesian inference typically requires the computation of an approximation to the posterior distribution. An important requirement for an approximate Bayesian inference algorithm is to output high-accuracy posterior mean and uncertainty estimates. Classical Monte Carlo methods, particularly Markov Chain Monte Carlo, remain the gold standard for approximate Bayesian inference because they have a robust finite-sample theory and reliable convergence diagnostics. However, alternative methods, which are more scalable or apply to problems where Markov Chain Monte Carlo cannot be used, lack the same finite-data approximation theory and tools for evaluating their accuracy. In this work, we develop a flexible new approach to bounding the error of mean and uncertainty estimates of scalable inference algorithms. Our strategy is to control the estimation errors in terms of Wasserstein distance, then bound the Wasserstein distance via a generalized notion of Fisher distance. Unlike computing the Wasserstein distance, which requires access to the normalized posterior distribution, the Fisher distance is tractable to compute because it requires access only to the gradient of the log posterior density. We demonstrate the usefulness of our Fisher distance approach by deriving bounds on the Wasserstein error of the Laplace approximation and Hilbert coresets. We anticipate that our approach will be applicable to many other approximate inference methods such as the integrated Laplace approximation, variational inference, and approximate Bayesian computationComment: 22 pages, 2 figure

arXiv.org e-Print Archive

Open Repository and Bibliography - Luxembourg

Recommended from our members

Data Summarizations for Scalable, Robust and Privacy-Aware Learning in High Dimensions

Author: Manousakas Dionysios
Publication venue: University of Cambridge
Publication date: 30/10/2021
Field of study

The advent of large-scale datasets has offered unprecedented amounts of information for building statistically powerful machines, but, at the same time, also introduced a remarkable computational challenge: how can we efficiently process massive data? This thesis presents a suite of data reduction methods that make learning algorithms scale on large datasets, via extracting a succinct model-specific representation that summarizes the full data collection—a coreset. Our frameworks support by design datasets of arbitrary dimensionality, and can be used for general purpose Bayesian inference under real-world constraints, including privacy preservation and robustness to outliers, encompassing diverse uncertainty-aware data analysis tasks, such as density estimation, classification and regression. We motivate the necessity for novel data reduction techniques in the first place by developing a reidentification attack on coarsened representations of private behavioural data. Analysing longitudinal records of human mobility, we detect privacy-revealing structural patterns, that remain preserved in reduced graph representations of individuals’ information with manageable size. These unique patterns enable mounting linkage attacks via structural similarity computations on longitudinal mobility traces, revealing an overlooked, yet existing, privacy threat. We then propose a scalable variational inference scheme for approximating posteriors on large datasets via learnable weighted pseudodata, termed pseudocoresets. We show that the use of pseudodata enables overcoming the constraints on minimum summary size for given approximation quality, that are imposed on all existing Bayesian coreset constructions due to data dimensionality. Moreover, it allows us to develop a scheme for pseudocoresets-based summarization that satisfies the standard framework of differential privacy by construction; in this way, we can release reduced size privacy-preserving representations for sensitive datasets that are amenable to arbitrary post-processing. Subsequently, we consider summarizations for large-scale Bayesian inference in scenarios when observed datapoints depart from the statistical assumptions of our model. Using robust divergences, we develop a method for constructing coresets resilient to model misspecification. Crucially, this method is able to automatically discard outliers from the generated data summaries. Thus we deliver robustified scalable representations for inference, that are suitable for applications involving contaminated and unreliable data sources. We demonstrate the performance of proposed summarization techniques on multiple parametric statistical models, and diverse simulated and real-world datasets, from music genre features to hospital readmission records, considering a wide range of data dimensionalities.Nokia Bell Labs, Lundgren Fund, Darwin College, University of Cambridge Department of Computer Science & Technology, University of Cambridg

Apollo (Cambridge)

Multi-Resolution Hashing for Fast Pairwise Summations

Author: Charikar Moses
Siminelakis Paris
Publication venue
Publication date: 03/11/2018
Field of study

A basic computational primitive in the analysis of massive datasets is summing simple functions over a large number of objects. Modern applications pose an additional challenge in that such functions often depend on a parameter vector

y

(query) that is unknown a priori. Given a set of points

X\subset \mathbb{R}^{d}

and a pairwise function

w:\mathbb{R}^{d}\times \mathbb{R}^{d}\to [0,1]

, we study the problem of designing a data-structure that enables sublinear-time approximation of the summation

Z_{w}(y)=\frac{1}{|X|}\sum_{x\in X}w(x,y)

for any query

y\in \mathbb{R}^{d}

. By combining ideas from Harmonic Analysis (partitions of unity and approximation theory) with Hashing-Based-Estimators [Charikar, Siminelakis FOCS'17], we provide a general framework for designing such data structures through hashing that reaches far beyond what previous techniques allowed. A key design principle is a collection of

T\geq 1

hashing schemes with collision probabilities

p_{1},\ldots, p_{T}

such that

\sup_{t\in [T]}\{p_{t}(x,y)\} = \Theta(\sqrt{w(x,y)})

. This leads to a data-structure that approximates

Z_{w}(y)

using a sub-linear number of samples from each hash family. Using this new framework along with Distance Sensitive Hashing [Aumuller, Christiani, Pagh, Silvestri PODS'18], we show that such a collection can be constructed and evaluated efficiently for any log-convex function

w(x,y)=e^{\phi(\langle x,y\rangle)}

of the inner product on the unit sphere

x,y\in \mathcal{S}^{d-1}

. Our method leads to data structures with sub-linear query time that significantly improve upon random sampling and can be used for Kernel Density or Partition Function Estimation. We provide extensions of our result from the sphere to

\mathbb{R}^{d}

and from scalar functions to vector functions.Comment: 39 pages, 3 figure

arXiv.org e-Print Archive

Crossref

Probabilistic Smallest Enclosing Ball in High Dimensions via Subgradient Sampling

Author: Krivosija Amer
Munteanu Alexander
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 35th International Symposium on Computational Geometry (SoCG 2019)
Publication date: 01/01/2019
Field of study

We study a variant of the median problem for a collection of point sets in high dimensions. This generalizes the geometric median as well as the (probabilistic) smallest enclosing ball (pSEB) problems. Our main objective and motivation is to improve the previously best algorithm for the pSEB problem by reducing its exponential dependence on the dimension to linear. This is achieved via a novel combination of sampling techniques for clustering problems in metric spaces with the framework of stochastic subgradient descent. As a result, the algorithm becomes applicable to shape fitting problems in Hilbert spaces of unbounded dimension via kernel functions. We present an exemplary application by extending the support vector data description (SVDD) shape fitting method to the probabilistic case. This is done by simulating the pSEB algorithm implicitly in the feature space induced by the kernel function

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server