
### Testing Properties of Multiple Distributions with Few Samples

We propose a new setting for testing properties of distributions while
receiving samples from several distributions, but few samples per distribution.
Given samples from $s$ distributions, $p_1, p_2, \ldots, p_s$, we design
testers for the following problems: (1) Uniformity Testing: testing whether all
the $p_i$'s are uniform or $\epsilon$-far from uniform in
$\ell_1$-distance; (2) Identity Testing: testing whether all the $p_i$'s are
equal to an explicitly given distribution $q$ or $\epsilon$-far from $q$ in
$\ell_1$-distance; and (3) Closeness Testing: testing whether all the $p_i$'s
are equal to a distribution $q$ to which we have sample access, or
$\epsilon$-far from $q$ in $\ell_1$-distance. By assuming an additional natural
condition on the source distributions, we provide sample-optimal testers for
all of these problems. Comment: ITCS 202
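The paper's multi-distribution testers are beyond a short sketch, but the classical collision-based uniformity tester for a single distribution, which underlies this line of work, is easy to illustrate. A minimal Python sketch with illustrative (not sample-optimal) sample sizes and an illustrative acceptance threshold:

```python
import random
from itertools import combinations

def collision_stat(samples):
    """Fraction of sample pairs that collide: an unbiased estimate of ||p||_2^2."""
    m = len(samples)
    coll = sum(1 for a, b in combinations(samples, 2) if a == b)
    return coll / (m * (m - 1) / 2)

def is_uniform(samples, n, eps):
    """Uniform p over [n] has ||p||_2^2 = 1/n, while any p that is eps-far
    from uniform in l_1 has ||p||_2^2 >= (1 + eps^2)/n; accept below the
    midpoint threshold between the two."""
    return collision_stat(samples) <= (1 + eps ** 2 / 2) / n

random.seed(0)
n = 50
uniform_samples = [random.randrange(n) for _ in range(2000)]
biased_samples = [random.randrange(n // 4) for _ in range(2000)]
print(is_uniform(uniform_samples, n, eps=0.5))  # accepts (w.h.p.)
print(is_uniform(biased_samples, n, eps=0.5))   # rejects: support is too small
```

The separation driving the tester: $\|p\|_2^2 = 1/n + \|p - U\|_2^2$, and $\|p - U\|_2^2 \ge \|p - U\|_1^2 / n \ge \epsilon^2/n$ when $p$ is $\epsilon$-far from uniform.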

### A Concentration Inequality for the Facility Location Problem

We give a concentration inequality for a stochastic version of the facility
location problem on the plane. We show the objective $C_n(X) = \min_{F
\subseteq [0,1]^2} \, |F| + \sum_{x\in X} \min_{f \in F} \| x-f\|$ is
concentrated in an interval of length $O(n^{1/6})$ and $\mathbb{E}[C_n] =
\Theta(n^{2/3})$ if the input $X$ consists of $n$ i.i.d. uniform points in the
unit square. Our main tool is a suitable geometric quantity, previously used
in the design of approximation algorithms for the facility location problem,
which we apply to analyze a martingale process. Comment: 6 pages, 1 figure
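The objective $C_n$ is easy to evaluate directly. The following sketch (an illustrative heuristic, not part of the paper) evaluates it on $k \times k$ grids of candidate facilities; a grid with $k \approx n^{1/3}$, hence $|F| \approx n^{2/3}$, balances the two terms of the objective and matches the $\Theta(n^{2/3})$ growth of $\mathbb{E}[C_n]$:

```python
import math
import random

def facility_cost(points, facilities):
    """C(X, F) = |F| + sum over x in X of the distance to its nearest facility."""
    return len(facilities) + sum(
        min(math.dist(p, f) for f in facilities) for p in points)

def best_grid_cost(points, max_k=12):
    """Heuristic upper bound on C_n(X): try k-by-k grids of facility sites
    and keep the cheapest.  For n points, k ~ n^(1/3) is the sweet spot."""
    best = float("inf")
    for k in range(1, max_k + 1):
        grid = [((i + 0.5) / k, (j + 0.5) / k)
                for i in range(k) for j in range(k)]
        best = min(best, facility_cost(points, grid))
    return best

random.seed(1)
X = [(random.random(), random.random()) for _ in range(500)]
print(best_grid_cost(X))  # far cheaper than a single central facility
```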

### Property Testing of LP-Type Problems

Given query access to a set of constraints $S$, we wish to quickly check whether some objective function $f$ subject to these constraints is at most a given value $k$. We approach this problem through the framework of property testing, where our goal is to distinguish the case $f(S) \le k$ from the case that at least an $\epsilon$ fraction of the constraints in $S$ must be removed for $f(S) \le k$ to hold. We restrict our attention to instances where $(S, f)$ is an LP-Type problem, a rich family of combinatorial optimization problems with an inherent geometric structure. By utilizing a simple sampling procedure which has been used previously to study these problems, we are able to create property testers for any LP-Type problem whose query complexities are independent of the number of constraints. To the best of our knowledge, this is the first work that connects the area of LP-Type problems and property testing in a systematic way. Among our results are property testers for a variety of LP-Type problems that are new, as well as for problems that have been studied previously, such as a tight upper bound on the query complexity of testing clusterability with one cluster, considered by Alon, Dar, Parnas, and Ron (FOCS 2000). We also supply a corresponding tight lower bound for this problem and for other LP-Type problems using geometric constructions.

### Smoothed Analysis of the Condition Number Under Low-Rank Perturbations

Let $M$ be an arbitrary $n$ by $n$ matrix of rank $n-k$. We study the
condition number of $M$ plus a \emph{low-rank} perturbation $UV^T$ where $U, V$
are $n$ by $k$ random Gaussian matrices. Under some necessary assumptions, it
is shown that $M+UV^T$ is unlikely to have a large condition number. The main
advantages of this kind of perturbation over the well-studied dense Gaussian
perturbation, where every entry is independently perturbed, are the $O(nk)$ cost
to store $U,V$ and the $O(nk)$ increase in time complexity for performing the
matrix-vector multiplication $(M+UV^T)x$. This improves the $\Omega(n^2)$ space
and time complexity increase required by a dense perturbation, which is
especially burdensome if $M$ is originally sparse. Our results also extend to
the case where $U$ and $V$ have rank larger than $k$ and to symmetric and
complex settings. We also give an application to linear systems solving and
perform some numerical experiments. Lastly, barriers in applying low-rank noise
to other problems studied in the smoothed analysis framework are discussed.
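The $O(nk)$ overhead comes from never forming $M + UV^T$ explicitly: one factors the product as $Mx + U(V^T x)$. A minimal sketch in pure Python, with illustrative sizes:

```python
import random

def dense_matvec(M, x):
    """Standard O(n^2) matrix-vector product."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]

def perturbed_matvec(M, U, V, x):
    """(M + U V^T) x = M x + U (V^T x): the low-rank correction adds only
    O(nk) work, and U, V need only O(nk) storage."""
    n, k = len(x), len(U[0])
    Vtx = [sum(V[i][j] * x[i] for i in range(n)) for j in range(k)]
    corr = [sum(U[i][j] * Vtx[j] for j in range(k)) for i in range(n)]
    Mx = dense_matvec(M, x)
    return [Mx[i] + corr[i] for i in range(n)]

random.seed(3)
n, k = 5, 2
M = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
U = [[random.gauss(0, 1) for _ in range(k)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(k)] for _ in range(n)]
x = [random.gauss(0, 1) for _ in range(n)]

# Explicitly forming M + U V^T gives the same product, at O(n^2) extra cost.
P = [[M[i][j] + sum(U[i][t] * V[j][t] for t in range(k)) for j in range(n)]
     for i in range(n)]
fast = perturbed_matvec(M, U, V, x)
slow = dense_matvec(P, x)
print(max(abs(a - b) for a, b in zip(fast, slow)))  # numerically ~0
```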

### Faster Linear Algebra for Distance Matrices

The distance matrix of a dataset $X$ of $n$ points with respect to a distance
function $f$ represents all pairwise distances between points in $X$ induced by
$f$. Due to their wide applicability, distance matrices and related families of
matrices have been the focus of many recent algorithmic works. We continue this
line of research and take a broad view of algorithm design for distance
matrices with the goal of designing fast algorithms, which are specifically
tailored for distance matrices, for fundamental linear algebraic primitives.
Our results include efficient algorithms for computing matrix-vector products
for a wide class of distance matrices, such as the $\ell_1$ metric for which we
get a linear runtime, as well as an $\Omega(n^2)$ lower bound for any algorithm
which computes a matrix-vector product for the $\ell_{\infty}$ case, showing a
separation between the $\ell_1$ and the $\ell_{\infty}$ metrics. Our upper
bound results, in conjunction with recent works on the matrix-vector query
model, have many further downstream applications, including the fastest
algorithm for computing a relative error low-rank approximation for the
distance matrix induced by $\ell_1$ and $\ell_2^2$ functions and the fastest
algorithm for computing an additive error low-rank approximation for the
$\ell_2$ metric, in addition to applications to fast matrix multiplication,
among others. We also give algorithms for constructing distance matrices and
show that one can construct an approximate $\ell_2$ distance matrix in time
faster than the bound implied by the Johnson-Lindenstrauss lemma. Comment: Selected as Oral for NeurIPS 202
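For intuition on how a distance-matrix product can beat $n^2$ time: in one dimension, computing $z_i = \sum_j |x_i - x_j|\, y_j$ for all $i$ reduces to sorting plus prefix sums, and the $d$-dimensional $\ell_1$ product is just the sum of $d$ such one-dimensional products. This classical prefix-sum trick is in the spirit of the linear-time result above, not a transcription of the paper's algorithm:

```python
import random

def l1_matvec(x, y):
    """z_i = sum_j |x_i - x_j| * y_j for all i in O(n log n) time,
    via sorting and prefix sums (the 1-D l_1 distance matrix)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ps_y, ps_xy = [0.0], [0.0]  # prefix sums in sorted order
    for i in order:
        ps_y.append(ps_y[-1] + y[i])
        ps_xy.append(ps_xy[-1] + x[i] * y[i])
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        ly, lxy = ps_y[rank], ps_xy[rank]          # sums over x_j <= x_i
        ry, rxy = ps_y[-1] - ly, ps_xy[-1] - lxy   # sums over x_j > x_i
        z[i] = x[i] * ly - lxy + rxy - x[i] * ry
    return z

random.seed(4)
x = [random.random() for _ in range(200)]
y = [random.gauss(0, 1) for _ in range(200)]
fast = l1_matvec(x, y)
slow = [sum(abs(xi - xj) * yj for xj, yj in zip(x, y)) for xi in x]
print(max(abs(a - b) for a, b in zip(fast, slow)))  # numerically ~0
```

The split in the last loop is $\sum_{x_j \le x_i} (x_i - x_j) y_j + \sum_{x_j > x_i} (x_j - x_i) y_j$, with each sum read off the two prefix-sum arrays in $O(1)$.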

### Randomized Dimensionality Reduction for Facility Location and Single-Linkage Clustering

Random dimensionality reduction is a versatile tool for speeding up
algorithms for high-dimensional problems. We study its application to two
clustering problems: the facility location problem, and the single-linkage
hierarchical clustering problem, which is equivalent to computing the minimum
spanning tree. We show that if we project the input pointset $X$ onto a random
$d = O(d_X)$-dimensional subspace (where $d_X$ is the doubling dimension of
$X$), then the optimum facility location cost in the projected space
approximates the original cost up to a constant factor. We show an analogous
statement for minimum spanning tree, but with the dimension $d$ having an extra
$\log \log n$ term and the approximation factor being arbitrarily close to $1$.
Furthermore, we extend these results to approximating solutions instead of just
their costs. Lastly, we provide experimental results to validate the quality of
solutions and the speedup due to the dimensionality reduction. Unlike several
previous papers studying this approach in the context of $k$-means and
$k$-medians, our dimension bound does not depend on the number of clusters but
only on the intrinsic dimensionality of $X$. Comment: 25 pages. Published as a conference paper in ICML 202
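A minimal sketch of the pipeline being analyzed: project with a scaled Gaussian matrix, then solve in the low-dimensional space (here, the minimum spanning tree via Prim's algorithm, from which single-linkage clustering is built). The dimensions and point counts are illustrative:

```python
import math
import random

def gaussian_project(points, d):
    """Map each point through a random Gaussian matrix scaled by 1/sqrt(d)."""
    dim = len(points[0])
    G = [[random.gauss(0, 1) / math.sqrt(d) for _ in range(dim)]
         for _ in range(d)]
    return [tuple(sum(g[j] * p[j] for j in range(dim)) for g in G)
            for p in points]

def mst_cost(points):
    """Prim's algorithm on the complete Euclidean graph."""
    n = len(points)
    in_tree = [False] * n
    dist = [math.inf] * n
    dist[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=dist.__getitem__)
        in_tree[u] = True
        total += dist[u]
        for v in range(n):
            if not in_tree[v]:
                dist[v] = min(dist[v], math.dist(points[u], points[v]))
    return total

random.seed(5)
X = [tuple(random.gauss(0, 1) for _ in range(50)) for _ in range(80)]
orig = mst_cost(X)
proj = mst_cost(gaussian_project(X, 15))
print(orig, proj, proj / orig)  # costs agree up to a small distortion factor
```

Prim's algorithm costs $O(n^2 d)$ here, so shrinking $d$ translates directly into speedup; the paper's contribution is showing how small $d$ can be while keeping the cost (and the solution) approximately preserved.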

### Efficiently Computing Similarities to Private Datasets

Many methods in differentially private model training rely on computing the
similarity between a query point (such as public or synthetic data) and private
data. We abstract out this common subroutine and study the following
fundamental algorithmic problem: Given a similarity function $f$ and a large
high-dimensional private dataset $X \subset \mathbb{R}^d$, output a
differentially private (DP) data structure which approximates $\sum_{x \in X}
f(x,y)$ for any query $y$. We consider the cases where $f$ is a kernel
function, such as $f(x,y) = e^{-\|x-y\|_2^2/\sigma^2}$ (also known as DP kernel
density estimation), or a distance function such as $f(x,y) = \|x-y\|_2$, among
others.
Our theoretical results improve upon prior work and give better
privacy-utility trade-offs as well as faster query times for a wide range of
kernels and distance functions. The unifying approach behind our results is
leveraging `low-dimensional structures' present in the specific functions $f$
that we study, using tools such as provable dimensionality reduction,
approximation theory, and one-dimensional decomposition of the functions. Our
algorithms empirically exhibit improved query times and accuracy over prior
state of the art. We also present an application to DP classification. Our
experiments demonstrate that the simple methodology of classifying based on
average similarity is orders of magnitude faster than prior DP-SGD based
approaches for comparable accuracy. Comment: To appear at ICLR 202
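A naive baseline helps fix ideas: since the Gaussian kernel lies in $[0,1]$, adding or removing one point changes $\sum_{x \in X} f(x,y)$ by at most 1, so Laplace noise of scale $1/\epsilon$ makes a single query answer $\epsilon$-DP. The sketch below is only that baseline, not the paper's data structure, which must answer many queries from one privacy budget:

```python
import math
import random

def kernel_sum(X, y, sigma):
    """sum over x in X of exp(-||x - y||_2^2 / sigma^2): the Gaussian kernel."""
    return sum(math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma ** 2)
               for x in X)

def laplace(scale):
    """Sample from the Laplace distribution via inverse-CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def dp_kernel_sum(X, y, sigma, epsilon):
    """Each point contributes at most 1 to the sum (sensitivity 1 under
    add/remove of one point), so Laplace(1/epsilon) noise makes this
    single answer epsilon-DP."""
    return kernel_sum(X, y, sigma) + laplace(1.0 / epsilon)

random.seed(6)
X = [tuple(random.gauss(0, 1) for _ in range(5)) for _ in range(1000)]
q = tuple(0.0 for _ in range(5))
exact = kernel_sum(X, q, sigma=2.0)
private = dp_kernel_sum(X, q, sigma=2.0, epsilon=1.0)
print(exact, private)  # the noisy answer tracks the exact sum closely
```

The relative error of this baseline degrades as the number of queries grows (composition forces the per-query budget down), which is the gap the paper's low-dimensional-structure techniques address.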