
    Probabilistic Polynomials and Hamming Nearest Neighbors

    We show how to compute any symmetric Boolean function on $n$ variables over any field (as well as the integers) with a probabilistic polynomial of degree $O(\sqrt{n \log(1/\epsilon)})$ and error at most $\epsilon$. The degree dependence on $n$ and $\epsilon$ is optimal, matching a lower bound of Razborov (1987) and Smolensky (1987) for the MAJORITY function. The proof is constructive: a low-degree polynomial can be efficiently sampled from the distribution. This polynomial construction is combined with other algebraic ideas to give the first subquadratic time algorithm for computing a (worst-case) batch of Hamming distances in superlogarithmic dimensions, exactly. To illustrate, let $c(n) : \mathbb{N} \rightarrow \mathbb{N}$. Suppose we are given a database $D$ of $n$ vectors in $\{0,1\}^{c(n) \log n}$ and a collection of $n$ query vectors $Q$ in the same dimension. For all $u \in Q$, we wish to compute a $v \in D$ with minimum Hamming distance from $u$. We solve this problem in $n^{2-1/O(c(n) \log^2 c(n))}$ randomized time. Hence, the problem is in "truly subquadratic" time for $O(\log n)$ dimensions, and in subquadratic time for $d = o((\log^2 n)/(\log \log n)^2)$. We apply the algorithm to computing pairs with maximum inner product, closest pair in $\ell_1$ for vectors with bounded integer entries, and pairs with maximum Jaccard coefficients.
    Comment: 16 pages. To appear in 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2015)
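
    For scale, here is a quadratic-time brute-force baseline for the batch problem just described (a sketch for contrast only; the paper's subquadratic algorithm is algebraic and not reproduced here, and all names below are illustrative):

        import numpy as np

        def batch_hamming_nn(D: np.ndarray, Q: np.ndarray) -> np.ndarray:
            """For each query row of Q, return the index of the row of D with
            minimum Hamming distance. D and Q are 0/1 matrices with the same
            number of columns. Quadratic-time baseline, not the paper's method."""
            # Entrywise inequality plus a sum gives Hamming distance for 0/1 vectors.
            dists = (Q[:, None, :] != D[None, :, :]).sum(axis=2)  # shape (|Q|, |D|)
            return dists.argmin(axis=1)

        # n vectors in dimension c * log2(n), mirroring the abstract's setting.
        rng = np.random.default_rng(0)
        D = rng.integers(0, 2, size=(256, 24))
        Q = rng.integers(0, 2, size=(256, 24))
        print(batch_hamming_nn(D, Q)[:5])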

    Approximate Nearest Neighbor Search for Low Dimensional Queries

    We study the Approximate Nearest Neighbor problem for metric spaces where the query points are constrained to lie on a subspace of low doubling dimension, while the data is high-dimensional. We show that this problem can be solved efficiently despite the high dimensionality of the data.
    Comment: 25 pages

    Efficient Large-scale Approximate Nearest Neighbor Search on the GPU

    We present a new approach for efficient approximate nearest neighbor (ANN) search in high dimensional spaces, extending the idea of Product Quantization. We propose a two-level product and vector quantization tree that reduces the number of vector comparisons required during tree traversal. Our approach also includes a novel, highly parallelizable re-ranking method for candidate vectors that efficiently reuses already computed intermediate values. Due to its small memory footprint during traversal, the method lends itself to an efficient, parallel GPU implementation. This Product Quantization Tree (PQT) approach significantly outperforms recent state-of-the-art methods for high dimensional nearest neighbor queries on standard reference datasets. Ours is the first work to demonstrate GPU performance superior to CPU performance on high dimensional, large-scale ANN problems in time-critical real-world applications, like loop-closing in videos.
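
    As background, here is a minimal sketch of the product quantization idea that the PQT extends: split each vector into m subvectors, learn a small k-means codebook per subspace, and answer queries with per-subspace lookup tables (asymmetric distance computation). All function names and parameters are illustrative; this is not the paper's two-level tree.

        import numpy as np

        def train_pq(X, m=4, k=16, iters=10, seed=0):
            """Learn m codebooks of k centroids each with plain Lloyd's k-means,
            one codebook per contiguous subspace of the input vectors."""
            rng = np.random.default_rng(seed)
            d = X.shape[1] // m
            books = []
            for j in range(m):
                sub = X[:, j*d:(j+1)*d]
                C = sub[rng.choice(len(sub), k, replace=False)]
                for _ in range(iters):
                    assign = ((sub[:, None, :] - C[None]) ** 2).sum(2).argmin(1)
                    for c in range(k):
                        if (assign == c).any():
                            C[c] = sub[assign == c].mean(0)
                books.append(C)
            return books

        def encode(X, books):
            """Replace each subvector by the index of its nearest centroid."""
            m, d = len(books), X.shape[1] // len(books)
            codes = np.empty((len(X), m), dtype=np.int64)
            for j, C in enumerate(books):
                sub = X[:, j*d:(j+1)*d]
                codes[:, j] = ((sub[:, None, :] - C[None]) ** 2).sum(2).argmin(1)
            return codes

        def adc_search(q, codes, books):
            """Asymmetric distance computation: one table of squared distances
            per subspace, then candidate distances are sums of table lookups."""
            m, d = len(books), len(q) // len(books)
            tables = [((q[j*d:(j+1)*d] - books[j]) ** 2).sum(1) for j in range(m)]
            dists = sum(tables[j][codes[:, j]] for j in range(m))
            return int(dists.argmin())

        rng = np.random.default_rng(1)
        X = rng.standard_normal((2000, 32))
        books = train_pq(X)
        codes = encode(X, books)
        print(adc_search(rng.standard_normal(32), codes, books))

    Note that the per-subspace tables are computed once per query and reused across every candidate, loosely analogous to the intermediate-value reuse the paper's re-ranking step exploits at scale.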

    Efficient Algorithms for the Closest Pair Problem and Applications

    The closest pair problem (CPP) is one of the well-studied and fundamental problems in computing. Given a set of points in a metric space, the problem is to identify the pair of closest points. Another closely related problem is the fixed radius nearest neighbors problem (FRNNP). Given a set of points and a radius $R$, the problem is, for every input point $p$, to identify all the other input points that are within a distance of $R$ from $p$. A naive deterministic algorithm can solve these problems in quadratic time. CPP as well as FRNNP play a vital role in computational biology, computational finance, share market analysis, weather prediction, entomology, electrocardiography, N-body simulations, molecular simulations, etc. As a result, any improvements made in solving CPP and FRNNP will have immediate implications for the solution of numerous problems in these domains. We live in an era of big data, and processing these data takes large amounts of time. Speeding up data processing algorithms is thus much more essential now than ever before. In this paper we present algorithms for CPP and FRNNP that improve (in theory and/or practice) the best-known algorithms reported in the literature for CPP and FRNNP. These algorithms also improve the best-known algorithms for related applications, including time series motif mining and the two locus problem in Genome Wide Association Studies (GWAS).
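
    For concreteness, here is the textbook grid-bucketing approach to FRNNP in the plane (a baseline sketch only, not one of the improved algorithms contributed by the paper): hash each point to a cell of side R; any pair within distance R must then fall in the same or an adjacent cell.

        import math
        from collections import defaultdict
        from itertools import product

        def fixed_radius_neighbors(points, R):
            """Map each point index i to all indices j with
            Euclidean distance(points[i], points[j]) <= R (2-D points)."""
            grid = defaultdict(list)
            for i, (x, y) in enumerate(points):
                grid[(math.floor(x / R), math.floor(y / R))].append(i)
            out = defaultdict(list)
            for i, (x, y) in enumerate(points):
                cx, cy = math.floor(x / R), math.floor(y / R)
                # Only the 3x3 block of neighboring cells can hold matches.
                for dx, dy in product((-1, 0, 1), repeat=2):
                    for j in grid.get((cx + dx, cy + dy), []):
                        if j != i and math.dist(points[i], points[j]) <= R:
                            out[i].append(j)
            return out

        pts = [(0.0, 0.0), (0.5, 0.1), (3.0, 3.0)]
        print(dict(fixed_radius_neighbors(pts, 1.0)))  # {0: [1], 1: [0]}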

    Scalable Techniques for Similarity Search

    Document similarity is closely related to the nearest neighbour problem and has applications in various domains. To determine the similarity or dissimilarity of documents, they first need to be converted into sets of shingles: each document is converted into its k-shingles, where k is the length of each shingle. Similarity is calculated using the Jaccard distance between sets and output into a characteristic matrix; the complexity of parsing this matrix is significantly high, especially when the sets are large. In this project we explore approaches such as MinHashing, LSH, and Bloom filters to decrease the matrix size and to improve the time complexity. MinHashing creates a signature matrix that is significantly smaller than the characteristic matrix. We look into the MinHashing implementation and its pros and cons, and we also explore Locality Sensitive Hashing and Bloom filters and their advantages.
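
    A minimal sketch of the MinHash signature idea described above; the hash construction (seeded BLAKE2) and the signature length are illustrative choices.

        import hashlib

        def shingles(text, k=5):
            """All k-character shingles of a document."""
            return {text[i:i + k] for i in range(len(text) - k + 1)}

        def minhash(shingle_set, num_hashes=100):
            """One signature column: for each seeded hash function, keep the
            minimum hash value attained over the set."""
            def h(s, seed):
                d = hashlib.blake2b(s.encode(), digest_size=8,
                                    salt=seed.to_bytes(16, "little")).digest()
                return int.from_bytes(d, "big")
            return [min(h(s, seed) for s in shingle_set)
                    for seed in range(num_hashes)]

        def estimated_jaccard(sig_a, sig_b):
            """The fraction of agreeing signature entries estimates the
            Jaccard similarity of the underlying shingle sets."""
            return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

        a = shingles("the quick brown fox jumps over the lazy dog")
        b = shingles("the quick brown fox jumped over a lazy dog")
        print(len(a & b) / len(a | b), estimated_jaccard(minhash(a), minhash(b)))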

    Average Distance Queries through Weighted Samples in Graphs and Metric Spaces: High Scalability with Tight Statistical Guarantees

    The average distance from a node to all other nodes in a graph, or from a query point in a metric space to a set of points, is a fundamental quantity in data analysis. The inverse of the average distance, known as the (classic) closeness centrality of a node, is a popular importance measure in the study of social networks. We develop novel structural insights on the sparsifiability of the distance relation via weighted sampling. Based on that, we present highly practical algorithms with strong statistical guarantees for fundamental problems. We show that the average distance (and hence the centrality) for all nodes in a graph can be estimated using $O(\epsilon^{-2})$ single-source distance computations. For a set $V$ of $n$ points in a metric space, we show that after preprocessing which uses $O(n)$ distance computations we can compute a weighted sample $S \subset V$ of size $O(\epsilon^{-2})$ such that the average distance from any query point $v$ to $V$ can be estimated from the distances from $v$ to $S$. Finally, we show that for a set of points $V$ in a metric space, we can estimate the average pairwise distance using $O(n+\epsilon^{-2})$ distance computations. The estimate is based on a weighted sample of $O(\epsilon^{-2})$ pairs of points, which is computed using $O(n)$ distance computations. Our estimates are unbiased with normalized root mean square error (NRMSE) of at most $\epsilon$. Increasing the sample size by an $O(\log n)$ factor ensures that the probability that the relative error exceeds $\epsilon$ is polynomially small.
    Comment: 21 pages, will appear in the Proceedings of RANDOM 2015
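
    For contrast with the weighted samples above, the naive uniform-pair estimator of the average pairwise distance looks like this (a sketch only; the paper's weighted sampling is what secures the stated NRMSE guarantee even on skewed distance distributions, where uniform sampling can have high variance):

        import math
        import random

        def avg_pairwise_distance(points, num_samples=1000, seed=0):
            """Unbiased estimate of the mean distance over all unordered pairs,
            from uniformly sampled pairs of distinct indices."""
            rng = random.Random(seed)
            n = len(points)
            total = 0.0
            for _ in range(num_samples):
                i, j = rng.sample(range(n), 2)
                total += math.dist(points[i], points[j])
            return total / num_samples

        pts = [(random.random(), random.random()) for _ in range(500)]
        print(avg_pairwise_distance(pts))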

    Query-driven learning for predictive analytics of data subspace cardinality

    Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces, defined by query selections over datasets. This is crucial for data analysts dealing with, e.g., interactive data subspace explorations, data subspace visualizations, and query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly money-wise, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain and maintain, or (iii) infeasible, e.g., for privacy reasons. We contribute a novel, query-driven, function estimation model of analyst-defined data subspace cardinality. The proposed estimation model is highly accurate and accommodates the well-known selection query types: multi-dimensional range queries and distance-nearest-neighbor (radius) queries. Our function estimation model: (i) quantizes the vectorial query space by learning the analysts’ access patterns over a data space, (ii) associates query vectors with the corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model lies in that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers performance superior to that of data-driven approaches.
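
    As a toy illustration of the query-driven idea, a nearest-neighbor regressor over past query vectors can already predict the cardinality of an unseen query from similar answered ones. The paper's model, with query-space quantization and optimal-stopping-based adaptation, is substantially richer; the data below is made up.

        import math

        # Past workload: each 1-D range query (lo, hi) paired with the
        # observed cardinality of its result set.
        history = [((0.0, 0.2), 140), ((0.1, 0.3), 180), ((0.5, 0.9), 610)]

        def predict_cardinality(query, k=2):
            """Average the cardinalities of the k past queries whose vectors
            are closest (Euclidean) to the unseen query vector."""
            nearest = sorted(history, key=lambda h: math.dist(h[0], query))[:k]
            return sum(card for _, card in nearest) / k

        print(predict_cardinality((0.05, 0.25)))  # -> 160.0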

    Approximation and Streaming Algorithms for Projective Clustering via Random Projections

    Let $P$ be a set of $n$ points in $\mathbb{R}^d$. In the projective clustering problem, given $k$, $q$ and a norm $\rho \in [1,\infty]$, we have to compute a set $\mathcal{F}$ of $k$ $q$-dimensional flats such that $(\sum_{p\in P} d(p, \mathcal{F})^\rho)^{1/\rho}$ is minimized; here $d(p, \mathcal{F})$ represents the (Euclidean) distance of $p$ to the closest flat in $\mathcal{F}$. We let $f_k^q(P,\rho)$ denote the minimal value and interpret $f_k^q(P,\infty)$ to be $\max_{r\in P} d(r, \mathcal{F})$. When $\rho = 1, 2$ and $\infty$ and $q = 0$, the problem corresponds to the $k$-median, $k$-means and the $k$-center clustering problems, respectively. For every $0 < \epsilon < 1$, $S \subset P$ and $\rho \ge 1$, we show that the orthogonal projection of $P$ onto a randomly chosen flat of dimension $O(((q+1)^2 \log(1/\epsilon)/\epsilon^3) \log n)$ will $\epsilon$-approximate $f_1^q(S,\rho)$. This result combines the concepts of geometric coresets and subspace embeddings based on the Johnson-Lindenstrauss Lemma. As a consequence, an orthogonal projection of $P$ to an $O(((q+1)^2 \log((q+1)/\epsilon)/\epsilon^3) \log n)$-dimensional randomly chosen subspace $\epsilon$-approximates projective clusterings for every $k$ and $\rho$ simultaneously. Note that the dimension of this subspace is independent of the number of clusters $k$. Using this dimension reduction result, we obtain new approximation and streaming algorithms for projective clustering problems. For example, given a stream of $n$ points, we show how to compute an $\epsilon$-approximate projective clustering for every $k$ and $\rho$ simultaneously using only $O((n+d)((q+1)^2 \log((q+1)/\epsilon))/\epsilon^3 \log n)$ space. Compared to standard streaming algorithms with $\Omega(kd)$ space requirement, our approach is a significant improvement when the number of input points and their dimensions are of the same order of magnitude.
    Comment: Canadian Conference on Computational Geometry (CCCG 2015)
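
    A minimal sketch of the Gaussian random projection underlying this dimension reduction (a generic Johnson-Lindenstrauss embedding; the paper projects onto randomly chosen flats and combines the embedding with coresets):

        import numpy as np

        def jl_project(P, target_dim, seed=0):
            """Project the rows of P onto target_dim dimensions with a scaled
            Gaussian matrix; pairwise distances are preserved up to a
            (1 +/- eps) factor with high probability."""
            rng = np.random.default_rng(seed)
            G = rng.standard_normal((P.shape[1], target_dim)) / np.sqrt(target_dim)
            return P @ G

        rng = np.random.default_rng(1)
        P = rng.standard_normal((1000, 512))
        Q = jl_project(P, 64)
        # Spot-check one pairwise distance before and after projection.
        print(np.linalg.norm(P[0] - P[1]), np.linalg.norm(Q[0] - Q[1]))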