
    Robust Proximity Search for Balls using Sublinear Space

    Given a set of n disjoint balls b_1, \ldots, b_n in \Re^d, we provide a data structure, of near-linear size, that can answer (1 \pm \epsilon)-approximate kth-nearest neighbor queries in O(log n + 1/\epsilon^d) time, where k and \epsilon are provided at query time. If k and \epsilon are provided in advance, we provide a data structure to answer such queries that requires (roughly) O(n/k) space; that is, the data structure has a sublinear space requirement if k is sufficiently large.
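
    As a point of reference for what such a query returns, here is a minimal brute-force sketch (not the paper's near-linear-size data structure): the distance from a query point to a disjoint ball is the distance to its center minus its radius, clamped at zero, and a (1 \pm \epsilon)-approximate answer may be any value within that factor of the exact kth-smallest such distance. The center/radius data layout below is an assumption made for illustration.

        import math

        def ball_dist(q, center, radius):
            # Distance from query point q to a ball; 0 if q lies inside the ball.
            return max(0.0, math.dist(q, center) - radius)

        def kth_nn_ball_exact(q, balls, k):
            # Exact kth-nearest ball distance; the data structure may return any value
            # within a (1 +/- eps) factor of this, with k and eps given at query time.
            return sorted(ball_dist(q, c, r) for (c, r) in balls)[k - 1]

        balls = [((0.0, 0.0), 1.0), ((5.0, 0.0), 0.5), ((0.0, 7.0), 2.0)]  # toy input
        print(kth_nn_ball_exact((1.0, 1.0), balls, k=2))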

    Down the Rabbit Hole: Robust Proximity Search and Density Estimation in Sublinear Space

    For a set of n points in \Re^d, and parameters k and \epsilon, we present a data structure that answers (1+\epsilon, k)-ANN queries in logarithmic time. Surprisingly, the space used by the data structure is \tilde{O}(n/k); that is, the space used is sublinear in the input size if k is sufficiently large. Our approach provides a novel way to summarize geometric data, such that meaningful proximity queries on the data can be carried out using this sketch. Using this, we provide a sublinear-space data structure that can estimate the density of a point set under various measures, including: (i) the sum of distances of the k closest points to the query point, and (ii) the sum of squared distances of the k closest points to the query point. Our approach generalizes to other distance-based density estimates of a similar flavor. We also study the problem of approximating some of these quantities when using sampling. In particular, we show that a sample of size \tilde{O}(n/k) is sufficient, in some restricted cases, to estimate the above quantities. Remarkably, the sample size has only linear dependency on the dimension.
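
    For concreteness, the following is a hedged reference implementation of the two density measures named above, computed exactly in linear space; the paper's contribution is a \tilde{O}(n/k)-space sketch that approximates these values, which the snippet below does not attempt to reproduce.

        import math

        def knn_density_measures(q, points, k):
            # (i) sum of distances of the k closest points to q, and
            # (ii) sum of their squared distances -- computed exactly.
            d = sorted(math.dist(q, p) for p in points)[:k]
            return sum(d), sum(x * x for x in d)

        pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]  # toy input
        print(knn_density_measures((0.0, 0.0), pts, k=2))       # (1.0, 1.0)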

    Optimal Hashing-based Time-Space Trade-offs for Approximate Near Neighbors

    [See the paper for the full abstract.] We show tight upper and lower bounds for time-space trade-offs for the c-Approximate Near Neighbor Search problem. For the d-dimensional Euclidean space and n-point datasets, we develop a data structure with space n^{1 + \rho_u + o(1)} + O(dn) and query time n^{\rho_q + o(1)} + d n^{o(1)} for every \rho_u, \rho_q \geq 0 such that c^2 \sqrt{\rho_q} + (c^2 - 1) \sqrt{\rho_u} = \sqrt{2c^2 - 1}. This is the first data structure that achieves sublinear query time and near-linear space for every approximation factor c > 1, improving upon [Kapralov, PODS 2015]. The data structure is a culmination of a long line of work on the problem for all space regimes; it builds on Spherical Locality-Sensitive Filtering [Becker, Ducas, Gama, Laarhoven, SODA 2016] and data-dependent hashing [Andoni, Indyk, Nguyen, Razenshteyn, SODA 2014] [Andoni, Razenshteyn, STOC 2015]. Our matching lower bounds are of two types: conditional and unconditional. First, we prove tightness of the whole trade-off above in a restricted model of computation, which captures all known hashing-based approaches. We then show unconditional cell-probe lower bounds for one and two probes that match the above trade-off for \rho_q = 0, improving upon the best known lower bounds from [Panigrahy, Talwar, Wieder, FOCS 2010]. In particular, this is the first space lower bound (for any static data structure) for two probes which is not polynomially smaller than the one-probe bound. To show the result for two probes, we establish and exploit a connection to locally-decodable codes. Comment: 62 pages, 5 figures; a merger of arXiv:1511.07527 [cs.DS] and arXiv:1605.02701 [cs.DS], which subsumes both of the preprints. The new version contains more elaborated proofs and fixes some typos.
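
    The trade-off curve above can be evaluated directly. Solving the stated equation for \rho_q gives \rho_q = ((\sqrt{2c^2 - 1} - (c^2 - 1)\sqrt{\rho_u}) / c^2)^2, which recovers the near-linear-space exponent (2c^2 - 1)/c^4 at \rho_u = 0 and the balanced exponent 1/(2c^2 - 1) at \rho_q = \rho_u. A small sketch (function name and the chosen value of c are illustrative only):

        import math

        def rho_q(c, rho_u):
            # Query exponent implied by c^2*sqrt(rho_q) + (c^2-1)*sqrt(rho_u) = sqrt(2c^2-1).
            s = (math.sqrt(2 * c * c - 1) - (c * c - 1) * math.sqrt(rho_u)) / (c * c)
            return max(0.0, s) ** 2

        c = 2.0
        print(rho_q(c, 0.0))                     # near-linear space: (2c^2-1)/c^4 = 7/16
        print(rho_q(c, 1.0 / (2 * c * c - 1)))   # balanced point: rho_q = rho_u = 1/7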

    Training Support Vector Machines Using Frank-Wolfe Optimization Methods

    Training a Support Vector Machine (SVM) requires the solution of a quadratic programming problem (QP) whose computational cost becomes prohibitive for large-scale datasets. Traditional optimization methods cannot be directly applied in these cases, mainly due to memory restrictions. By adopting a slightly different objective function and under mild conditions on the kernel used within the model, efficient algorithms to train SVMs have been devised under the name of Core Vector Machines (CVMs). This framework exploits the equivalence of the resulting learning problem with the problem of computing a Minimal Enclosing Ball (MEB) in a feature space, where the data is implicitly embedded by a kernel function. In this paper, we improve on the CVM approach by proposing two novel methods to build SVMs based on the Frank-Wolfe algorithm, recently revisited as a fast method to approximate the solution of an MEB problem. In contrast to CVMs, our algorithms do not require solving a sequence of increasingly complex QPs and are defined using only analytic optimization steps. Experiments on a large collection of datasets show that our methods scale better than CVMs in most cases, sometimes at the price of a slightly lower accuracy. Like CVMs, the proposed methods can be easily extended to machine learning problems other than binary classification. Moreover, effective classifiers are also obtained using kernels that do not satisfy the condition required by CVMs, so the methods can be applied to a wider set of problems.
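
    To make the MEB connection concrete, below is a minimal sketch of a Frank-Wolfe-style iteration for the plain, input-space MEB problem: the farthest point from the current center plays the role of the linear-minimization oracle, and the center moves toward it with the classic 1/(t+2) step. This is only an illustration of the general scheme under those assumptions; the paper's methods operate on the kernelized (feature-space) formulation and differ in their stopping and step-size rules.

        import numpy as np

        def meb_frank_wolfe(X, iters=500):
            # Approximate the minimum enclosing ball of the rows of X.
            c = X.mean(axis=0)                                       # feasible starting center
            for t in range(iters):
                far = X[np.argmax(((X - c) ** 2).sum(axis=1))]       # farthest point = FW vertex
                c = c + (far - c) / (t + 2)                          # convex-combination update
            radius = np.sqrt(((X - c) ** 2).sum(axis=1).max())
            return c, radius

        rng = np.random.default_rng(0)
        center, radius = meb_frank_wolfe(rng.standard_normal((1000, 5)))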

    Doctor of Philosophy

    The contributions of this dissertation are centered around designing new algorithms in the general area of sublinear algorithms, such as streaming, coresets and sublinear verification, with a special interest in problems arising from data analysis, including data summarization, clustering, matrix problems and massive graphs.

    In the first part, we focus on summaries and coresets, which are among the main techniques for designing sublinear algorithms for massive data sets. We initiate the study of coresets for uncertain data and study coresets for various types of range counting queries on uncertain data. We focus mainly on the indecisive model of locational uncertainty, since it comes up frequently in real-world applications when multiple readings of the same object are made. In this model, each uncertain point has a probability density describing its location, defined as k distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by examining only this subset. For each type of query we provide coreset constructions with approximation-size trade-offs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based techniques for axis-aligned range queries.

    In the second part, we focus on designing sublinear-space algorithms for approximate computations on massive graphs. In particular, we consider the graph MAX CUT and correlation clustering problems and develop sampling-based approaches to construct truly sublinear (o(n)-sized) coresets for graphs that have polynomial (i.e., n^{\delta} for any \delta > 0) average degree. Our technique is based on analyzing properties of random induced subprograms of the linear program formulations of the problems. We demonstrate this technique with two examples. First, we present a sublinear-sized coreset to approximate the value of the MAX CUT in a graph to a (1+\epsilon) factor; to the best of our knowledge, all previously known methods in this regime rely crucially on near-regularity assumptions. Second, we apply the same framework to construct a sublinear-sized coreset for correlation clustering. Our coreset construction also suggests 2-pass streaming algorithms for computing the MAX CUT and correlation clustering objective values, which are left as future work at the time of writing this dissertation.

    Finally, we focus on streaming verification algorithms as another model for designing sublinear algorithms. We give the first polylog-space and sublinear (in the number of edges) communication protocols for streaming verification problems in graphs. We present efficient streaming interactive proofs that can verify maximum matching exactly. Our results cover all flavors of matchings (bipartite/nonbipartite and weighted). In addition, we also present streaming verifiers for approximate metric TSP and exact triangle counting, as well as for graph primitives such as the number of connected components, bipartiteness, minimum spanning tree and connectivity. In particular, these are the first results for weighted matchings and for metric TSP in any streaming verification model. Our streaming verifiers use only polylogarithmic space while exchanging only polylogarithmic communication with the prover, in addition to the output size of the relevant solution.
    We also initiate a study of streaming interactive proofs (SIPs) for problems in data analysis and present efficient SIPs for some fundamental problems. We present protocols for clustering and shape fitting, including minimum enclosing ball (MEB), width of a point set, k-centers and the k-slab problem. We also present protocols for fundamental matrix analysis problems: we provide an improved protocol for rectangular matrix problems, which in turn can be used to verify k (approximate) eigenvectors of an n \times n integer matrix A. In general, our solutions use polylogarithmic rounds of communication and polylogarithmic total communication and verifier space.
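
    As a toy illustration of the sampling theme in the second part (and explicitly not the dissertation's LP-based coreset construction), the sketch below estimates the normalized MAX CUT value maxcut(G)/n^2 from a small random induced subgraph; for dense graphs this kind of vertex-sampling estimate is known to be accurate, and the brute-force cut routine is only meant for tiny samples. All function names are illustrative.

        import random
        from itertools import combinations

        def maxcut_value(nodes, edges):
            # Brute-force MAX CUT value on a tiny vertex set (exponential time; demo only).
            nodes, best = list(nodes), 0
            for r in range(len(nodes) // 2 + 1):          # smaller side of each bipartition
                for side in combinations(nodes, r):
                    s = set(side)
                    best = max(best, sum((u in s) != (v in s) for u, v in edges))
            return best

        def sampled_maxcut_estimate(n, edges, s):
            # Estimate maxcut(G)/n^2 by the normalized MAX CUT of a random induced subgraph.
            S = set(random.sample(range(n), s))
            induced = [(u, v) for (u, v) in edges if u in S and v in S]
            return maxcut_value(S, induced) / (s * s)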

    Parameter-free Locality Sensitive Hashing for Spherical Range Reporting

    We present a data structure for *spherical range reporting* on a point set S, i.e., reporting all points in S that lie within radius r of a given query point q. Our solution builds upon the Locality-Sensitive Hashing (LSH) framework of Indyk and Motwani, which provides the asymptotically best solutions to near neighbor problems in high dimensions. While traditional LSH data structures have several parameters whose optimal values depend on the distance distribution from q to the points of S, our data structure is parameter-free, except for the space usage, which is configurable by the user. Nevertheless, its expected query time basically matches that of an LSH data structure whose parameters have been *optimally chosen for the data and query* in question under the given space constraints. In particular, our data structure provides a smooth trade-off between hard queries (typically addressed by standard LSH) and easy queries, such as those where the number of points to report is a constant fraction of S, or where almost all points in S are far away from the query point. In contrast, known data structures fix the LSH parameters based on certain properties of the input alone. The algorithm has expected query time bounded by O(t (n/t)^\rho), where t is the number of points to report and \rho \in (0,1) depends on the data distribution and the strength of the LSH family used. We further present a parameter-free way of using multi-probing, for LSH families that support it, and show that for many such families this approach allows us to get expected query time close to O(n^\rho + t), which is the best we can hope to achieve using LSH. The previously best running time in high dimensions was \Omega(t n^\rho). For many data distributions where the intrinsic dimensionality of the point set close to q is low, we can give improved upper bounds on the expected query time. Comment: 21 pages, 5 figures; due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract appearing here is slightly shorter than that in the PDF file.
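
    For contrast with the parameter-free structure described above, here is a hedged sketch of a conventional LSH-based range-reporting baseline with *fixed* parameters (number of tables L, concatenation length m, bucket width w), using standard Gaussian-projection hash functions. The class and parameter names are illustrative; tuning L, m and w per dataset and query is exactly what the paper's construction avoids.

        import numpy as np
        from collections import defaultdict

        class FixedParameterLSH:
            # Conventional LSH for spherical range reporting with fixed parameters.
            def __init__(self, points, r, L=10, m=4, w=None, seed=0):
                rng = np.random.default_rng(seed)
                self.points = np.asarray(points, dtype=float)
                self.r = r
                self.w = w if w is not None else 4.0 * r          # bucket width
                d = self.points.shape[1]
                self.A = rng.standard_normal((L, m, d))           # Gaussian projections
                self.B = rng.uniform(0.0, self.w, size=(L, m))    # random offsets
                self.tables = [defaultdict(list) for _ in range(L)]
                for i, p in enumerate(self.points):
                    for t, table in enumerate(self.tables):
                        table[self._key(t, p)].append(i)

            def _key(self, t, x):
                return tuple(np.floor((self.A[t] @ x + self.B[t]) / self.w).astype(int))

            def query(self, q):
                # Report all indexed points within distance r of q; candidates from
                # colliding buckets are verified exactly, so there are no false positives.
                q = np.asarray(q, dtype=float)
                cand = set()
                for t, table in enumerate(self.tables):
                    cand.update(table.get(self._key(t, q), []))
                return [i for i in cand if np.linalg.norm(self.points[i] - q) <= self.r]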

    Distribution-Free Proofs of Proximity

    Motivated by the fact that input distributions are often unknown in advance, distribution-free property testing considers a setting in which the algorithmic task is to accept functions f : [n] \to \{0,1\} having a certain property \Pi and reject functions that are \epsilon-far from \Pi, where the distance is measured according to an arbitrary and unknown input distribution D over [n]. As usual in property testing, the tester is required to do so while making only a sublinear number of input queries, but as the distribution is unknown, we also allow a sublinear number of samples from the distribution D. In this work we initiate the study of distribution-free interactive proofs of proximity (df-IPP), in which the distribution-free testing algorithm is assisted by an all-powerful but untrusted prover. Our main result is a df-IPP for any problem \Pi \in NC, with \tilde{O}(\sqrt{n}) communication, sample, query, and verification complexities, for any proximity parameter \epsilon > 1/\sqrt{n}. For such proximity parameters, this result matches the parameters of the best-known general-purpose IPPs in the standard uniform setting, and is optimal under reasonable cryptographic assumptions. For general values of the proximity parameter \epsilon, our distribution-free IPP has optimal query complexity O(1/\epsilon), but the communication complexity is \tilde{O}(\epsilon \cdot n + 1/\epsilon), which is worse than what is known for uniform IPPs when \epsilon < 1/\sqrt{n}. With the aim of closing this gap, we further show that for IPPs over specialised but large distribution families, such as sufficiently smooth distributions and product distributions, the communication complexity can be reduced to \epsilon \cdot n \cdot (1/\epsilon)^{o(1)} (keeping the query complexity roughly the same as before), matching the communication complexity of the uniform case.
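
    As a small, hedged illustration of the distance notion involved (and nothing more), the sketch below estimates the disagreement Pr_{x \sim D}[f(x) \ne g(x)] between the tested function f and one fixed function g using only samples from D; the distance of f from a property \Pi is the minimum of this quantity over all g in \Pi, which the snippet does not compute. All names and the choice of D are illustrative.

        import random

        def estimated_distance(f, g, sample_D, trials=10000):
            # Monte Carlo estimate of Pr_{x ~ D}[f(x) != g(x)], using samples from D only.
            disagreements = 0
            for _ in range(trials):
                x = sample_D()
                disagreements += f(x) != g(x)
            return disagreements / trials

        n = 1000
        f = lambda x: x % 2                 # tested function (illustrative)
        g = lambda x: 0                     # a hypothetical function with the property
        print(estimated_distance(f, g, lambda: random.randrange(n)))   # about 0.5 here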