18 research outputs found

Independent Range Sampling, Revisited

Author: Afshani Peyman
Wei Zhewei
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 25th Annual European Symposium on Algorithms (ESA 2017)
Publication date: 01/01/2017

In the independent range sampling (IRS) problem, given an input set P of n points in R^d, the task is to build a data structure such that, given a range R and an integer t >= 1, it returns t points that are uniformly and independently drawn from P ∩ R. The samples must satisfy inter-query independence, that is, the samples returned by every query must be independent of the samples returned by all previous queries. This problem was first tackled by Hu, Qiao and Tao in 2014, who proposed optimal structures for the one-dimensional dynamic IRS problem in internal memory and the one-dimensional static IRS problem in external memory. In this paper, we study two natural extensions of the independent range sampling problem. In the first extension, we consider the static IRS problem in two and three dimensions in internal memory. We obtain data structures with optimal space-query tradeoffs for 3D halfspace, 3D dominance, and 2D three-sided queries. The second extension considers the weighted IRS problem. Each point is associated with a real-valued weight, and given a query range R, a sample is drawn independently such that each point in P ∩ R is selected with probability proportional to its weight. Walker's alias method is a classic solution to this problem when no query range is specified. We obtain an optimal data structure for the one-dimensional weighted range sampling problem, thereby extending the alias method to allow range queries.
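To make the IRS query in the abstract above concrete, here is a minimal sketch of the simplest case: static, unweighted, one-dimensional points kept in a sorted list. It is an illustrative baseline and not one of the structures from the paper; the names `build` and `irs_query` are hypothetical. Each query draws fresh random indices, so samples from different queries are independent.

```python
import bisect
import random

def build(points):
    """Sort the 1D points once; the sorted list is the whole structure."""
    return sorted(points)

def irs_query(sorted_pts, lo, hi, t, rng=random):
    """Return t independent uniform samples (with replacement) from the
    points lying in the closed range [lo, hi]."""
    left = bisect.bisect_left(sorted_pts, lo)    # first index with value >= lo
    right = bisect.bisect_right(sorted_pts, hi)  # first index with value > hi
    if left == right:                            # empty range: nothing to sample
        return []
    # Every index in [left, right) is equally likely on every draw.
    return [sorted_pts[rng.randrange(left, right)] for _ in range(t)]
```

Under these assumptions a query costs O(log n + t) time: two binary searches plus one random index per sample.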

Dagstuhl Research Online Publication Server

Independent Range Sampling, Revisited Again

Author: Afshani Peyman
Phillips Jeff M.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 35th International Symposium on Computational Geometry (SoCG 2019)
Publication date: 01/01/2019

We revisit the range sampling problem: the input is a set of points where each point is associated with a real-valued weight. The goal is to store them in a structure such that given a query range and an integer k, we can extract k independent random samples from the points inside the query range, where the probability of sampling a point is proportional to its weight. This line of work was initiated in 2014 by Hu, Qiao, and Tao and it was later followed up by Afshani and Wei. The first line of work mostly studied the unweighted but dynamic version of the problem in one dimension, whereas the second result considered the static weighted problem in one dimension as well as the unweighted problem in 3D for halfspace queries. We offer three main results and some interesting insights that were missed by the previous work: We show that it is possible to build efficient data structures for range sampling queries if we allow the query time to hold in expectation (the first result), or obtain efficient worst-case query bounds by allowing the sampling probability to be approximately proportional to the weight (the second result). The third result is a conditional lower bound that shows essentially one of the previous two concessions is needed. For instance, for the 3D range sampling queries, the first two results give efficient data structures with near-linear space and polylogarithmic query time, whereas the lower bound shows that with near-linear space the worst-case query time must be close to n^{2/3}, ignoring polylogarithmic factors. To our knowledge, this is the first such major gap between the expected and worst-case query time of a range searching problem.
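The weighted query described above can be illustrated in one dimension with prefix sums and binary search. This is a hedged baseline sketch, not the structure from the paper: it assumes strictly positive weights, spends O(log n) time per sample, and uses the hypothetical names `build_weighted` and `weighted_range_sample`.

```python
import bisect
import random
from itertools import accumulate

def build_weighted(items):
    """items: iterable of (key, weight) pairs, weight > 0 (assumption).
    Returns the sorted keys and the prefix sums of their weights."""
    items = sorted(items)
    keys = [k for k, _ in items]
    prefix = [0.0] + list(accumulate(w for _, w in items))
    return keys, prefix

def weighted_range_sample(keys, prefix, lo, hi, k, rng=random):
    """Draw k independent samples from keys in [lo, hi], each key chosen
    with probability proportional to its weight."""
    left = bisect.bisect_left(keys, lo)
    right = bisect.bisect_right(keys, hi)
    if left == right:
        return []
    base = prefix[left]
    total = prefix[right] - prefix[left]
    out = []
    for _ in range(k):
        u = base + rng.random() * total          # uniform in the range's weight mass
        i = bisect.bisect_right(prefix, u, left, right + 1) - 1
        i = min(max(i, left), right - 1)         # guard against float rounding
        out.append(keys[i])
    return out
```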

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Fair Near Neighbor Search: Independent Range Sampling in High Dimensions. PODS

Author: Afshani Peyman
Aumüller Martin
Broder Andrei Z.
Dwork Cynthia
Har-Peled Sariel
Hardt Moritz
Ilya
Leonhardt Jurek
Riazi M. Sadegh
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2020

Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the r-near neighbor (r-NN) problem: given a radius r > 0 and a set of points S, construct a data structure that, for any given query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance r from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for r-NN where all points in S that are near q have the same probability to be selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights (un)fairness in a recommendation setting on real-world datasets and discusses the inherent unfairness introduced by solving other variants of the problem. Comment: Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS), Pages 191-204, June 202
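The fairness requirement above (every point within distance r of the query is returned with equal probability) can be pinned down with a brute-force reference. The sketch below is a linear scan with reservoir sampling, not the LSH-based structures the abstract proposes; it assumes Euclidean distance, and `fair_near_neighbor` is a hypothetical name.

```python
import math
import random

def fair_near_neighbor(points, q, r, rng=random):
    """Return a uniformly random point of `points` within Euclidean
    distance r of q, or None if no such point exists."""
    chosen, seen = None, 0
    for p in points:
        if math.dist(p, q) <= r:
            seen += 1
            # Reservoir sampling: keep the i-th near point with probability 1/i,
            # which leaves every near point equally likely at the end.
            if rng.randrange(seen) == 0:
                chosen = p
    return chosen
```

A scan like this takes time linear in the number of points, which is the cost the sublinear structures in the abstract aim to avoid.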

arXiv.org e-Print Archive

Crossref

The IT University of Copenhagen's Repository

Archivio istituzionale della ricerca - Università di Padova

Querying Probabilistic Neighborhoods in Spatial Data Sets Efficiently

Author: D Krioukov
H Samet
H-P Kriegel
HW Hethcote
L Arge
M Looz von
R Aldecoa
V Batagelj
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 16/08/2016

In this paper we define the notion of a probabilistic neighborhood in spatial data: Let a set $P$ of $n$ points in $\mathbb{R}^d$, a query point $q \in \mathbb{R}^d$, a distance metric $\operatorname{dist}$, and a monotonically decreasing function $f : \mathbb{R}^+ \rightarrow [0,1]$ be given. Then a point $p \in P$ belongs to the probabilistic neighborhood $N(q, f)$ of $q$ with respect to $f$ with probability $f(\operatorname{dist}(p,q))$. We envision applications in facility location, sensor networks, and other scenarios where a connection between two entities becomes less likely with increasing distance. A straightforward query algorithm would determine a probabilistic neighborhood in $\Theta(n\cdot d)$ time by probing each point in $P$. To answer the query in sublinear time for the planar case, we augment a quadtree suitably and design a corresponding query algorithm. Our theoretical analysis shows that, for certain distributions of planar $P$, our algorithm answers a query in $O((|N(q,f)| + \sqrt{n})\log n)$ time with high probability (whp). This matches up to a logarithmic factor the cost induced by quadtree-based algorithms for deterministic queries and is asymptotically faster than the straightforward approach whenever $|N(q,f)| \in o(n / \log n)$. As practical proofs of concept we use two applications, one in the Euclidean and one in the hyperbolic plane. In particular, our results yield the first generator for random hyperbolic graphs with arbitrary temperatures in subquadratic time. Moreover, our experimental data show the usefulness of our algorithm even if the point distribution is unknown or not uniform: The running time savings over the pairwise probing approach constitute at least one order of magnitude already for a modest number of points and queries. Comment: The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-44543-4_3
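The straightforward query algorithm mentioned in the abstract above, which probes every point, can be sketched in a few lines. This shows only the linear-time baseline, not the quadtree-based method; it assumes Euclidean distance, and `probabilistic_neighborhood` is a hypothetical name.

```python
import math
import random

def probabilistic_neighborhood(points, q, f, rng=random):
    """Include each point p independently with probability f(dist(p, q)).
    Costs one distance computation per point, so Theta(n * d) time overall."""
    return [p for p in points if rng.random() < f(math.dist(p, q))]
```

With a step function such as `f = lambda d: 1.0 if d <= 1 else 0.0`, the query degenerates to an ordinary deterministic range query.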

arXiv.org e-Print Archive

Crossref

On Range Summary Queries

Author: Afshani Peyman
Basu Roy Aniket
Cheng Pingan
Wei Zhewei
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 50th International Colloquium on Automata, Languages, and Programming (ICALP 2023)
Publication date: 01/01/2023

We study the query version of the approximate heavy hitter and quantile problems. In the former problem, the input is a parameter ε and a set P of n points in ℝ^d where each point is assigned a color from a set C, and the goal is to build a structure such that given any geometric range γ, we can efficiently find a list of approximate heavy hitters in γ∩P, i.e., colors that appear at least ε|γ∩P| times in γ∩P, as well as their frequencies with an additive error of ε|γ∩P|. In the latter problem, each point is assigned a weight from a totally ordered universe and the query must output a sequence S of 1+1/ε weights such that the i-th weight in S has approximate rank iε|γ∩P|, meaning, rank iε|γ∩P| up to an additive error of ε|γ∩P|. Previously, optimal results were only known in 1D [Wei and Yi, 2011] but a few sub-optimal methods were available in higher dimensions [Peyman Afshani and Zhewei Wei, 2017; Pankaj K. Agarwal et al., 2012]. We study the problems for two important classes of geometric ranges: 3D halfspace and 3D dominance queries. It is known that many other important queries can be reduced to these two, e.g., 1D interval stabbing or interval containment, 2D three-sided queries, 2D circular as well as 2D k-nearest neighbors queries. We consider the real RAM model of computation where integer registers of size w bits, w = ?(log n), are also available. For dominance queries, we show optimal solutions for both heavy hitter and quantile problems: using linear space, we can answer both queries in time O(log n + 1/ε). Note that as the output size is 1/ε, after investing the initial O(log n) searching time, our structure takes on average O(1) time to find a heavy hitter or a quantile! For more general halfspace heavy hitter queries, the same optimal query time can be achieved by increasing the space by an extra log_w(1/ε) (resp. log log_w(1/ε)) factor in 3D (resp. 2D). By spending extra log^O(1)(1/ε) factors in both time and space, we can also support quantile queries. We remark that it is hopeless to achieve a similar query bound for dimensions 4 or higher unless significant advances are made in the data structure side of theory of geometric approximations.
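As a reference point for the heavy hitter query defined above, the following brute-force sketch computes the exact heavy hitters for a 3D dominance range by scanning all points. It only illustrates what the query asks for and is unrelated to the linear-space structures in the abstract; `dominance_heavy_hitters` and the choice of a closed dominance range are assumptions.

```python
from collections import Counter

def dominance_heavy_hitters(colored_points, corner, eps):
    """colored_points: iterable of ((x, y, z), color) pairs.
    Returns {color: count} for every color that appears at least eps * m
    times among the m points dominated by `corner` (all coordinates <=)."""
    in_range = [
        color
        for point, color in colored_points
        if all(pc <= cc for pc, cc in zip(point, corner))
    ]
    m = len(in_range)
    counts = Counter(in_range)
    return {color: cnt for color, cnt in counts.items() if cnt >= eps * m}
```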

Dagstuhl Research Online Publication Server

Fair near neighbor search via sampling

Author: Aumüller Martin
Har-Peled Sariel
Mahabadi Sepideh
Pagh Rasmus
Silvestri Francesco
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2021

Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points S and a radius parameter r > 0, the r-near neighbor (r-NN) problem asks for a data structure that, given any query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance r from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee.

Copenhagen University Research Information System

The IT University of Copenhagen's Repository

Archivio istituzionale della ricerca - Università di Padova

Simpler is Much Faster: Fair and Independent Inner Product Search

Author: Amagata Daichi
Aoyama Kazuyoshi
Fujita Sumio
Hara Takahiro
Publication venue: Association for Computing Machinery, Inc
Publication date

The problem of inner product search (IPS) is important in many fields. Although maximum inner product search (MIPS) is often considered, its result is usually skewed and static. It is hence hard for users to obtain diverse and/or new items by using MIPS. Motivated by this, we formulate a new problem, namely the fair and independent IPS problem. Given a query, a threshold, and an output size k, this problem randomly samples k items from the set of items whose inner product with the query is not less than the threshold. The problem is fair because every item that satisfies the threshold has the same probability of being output. This fairness can yield diversity and novelty, but this problem faces a computational challenge. Some existing (M)IPS techniques can be employed in this problem, but they require O(n) or o(n) time, where n is the dataset size. To scale well to large datasets, we propose a simple yet efficient algorithm that runs in O(log n + k) expected time. We conduct experiments using real datasets, and the results demonstrate that our algorithm is up to 330 times faster than baselines.
Aoyama K., Amagata D., Fujita S., et al. Simpler is Much Faster: Fair and Independent Inner Product Search. SIGIR 2023 - Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2379 (2023); https://doi.org/10.1145/3539618.3592061
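The problem statement above can be illustrated with an O(n) scan that filters by the threshold and then samples uniformly. This is a hedged baseline sketch, not the O(log n + k) algorithm the abstract proposes. Sampling without replacement is an assumption, since the abstract does not say which variant is meant, and `fair_ips_baseline` is a hypothetical name.

```python
import random

def fair_ips_baseline(items, query, threshold, k, rng=random):
    """Return up to k items chosen uniformly at random (without replacement,
    an assumption) among items whose inner product with `query` is
    at least `threshold`."""
    qualifying = [
        x for x in items
        if sum(a * b for a, b in zip(x, query)) >= threshold
    ]
    # random.sample gives every qualifying item the same inclusion probability.
    return rng.sample(qualifying, min(k, len(qualifying)))
```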

Osaka University Knowledge Archive

Subset Sampling and Its Extensions

Author: Huang Jinchao
Wang Sibo
Publication venue
Publication date: 21/07/2023

This paper studies the subset sampling problem. The input is a set $\mathcal{S}$ of $n$ records together with a function $\textbf{p}$ that assigns each record $v\in\mathcal{S}$ a probability $\textbf{p}(v)$. A query returns a random subset $X$ of $\mathcal{S}$, where each record $v\in\mathcal{S}$ is sampled into $X$ independently with probability $\textbf{p}(v)$. The goal is to store $\mathcal{S}$ in a data structure to answer queries efficiently. If $\mathcal{S}$ fits in memory, the problem is interesting when $\mathcal{S}$ is dynamic. We develop a dynamic data structure with $\mathcal{O}(1+\mu_{\mathcal{S}})$ expected query time, $\mathcal{O}(n)$ space and $\mathcal{O}(1)$ amortized expected update, insert and delete time, where $\mu_{\mathcal{S}}=\sum_{v\in\mathcal{S}}\textbf{p}(v)$. The query time and space are optimal. If $\mathcal{S}$ does not fit in memory, the problem is difficult even if $\mathcal{S}$ is static. Under this scenario, we present an I/O-efficient algorithm that answers a query in $\mathcal{O}\left((\log^*_B n)/B+(\mu_\mathcal{S}/B)\log_{M/B} (n/B)\right)$ amortized expected I/Os using $\mathcal{O}(n/B)$ space, where $M$ is the memory size, $B$ is the block size and $\log^*_B n$ is the number of iterative $\log_2(\cdot)$ operations we need to perform on $n$ before going below $B$. In addition, when each record is associated with a real-valued key, we extend the subset sampling problem to the range subset sampling problem, in which we require that the keys of the sampled records fall within a specified input range $[a,b]$. For this extension, we provide a solution under the dynamic setting, with $\mathcal{O}(\log n+\mu_{\mathcal{S}\cap[a,b]})$ expected query time, $\mathcal{O}(n)$ space and $\mathcal{O}(\log n)$ amortized expected update, insert and delete time. Comment: 17 page
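The subset sampling query defined above has an obvious O(n) baseline: flip one biased coin per record. The sketch below shows that baseline only; it is what the O(1 + μ) structure in the abstract improves on. `subset_sample` is a hypothetical name, and the mapping from records to probabilities is passed as a plain dict.

```python
import random

def subset_sample(prob, rng=random):
    """prob: dict mapping each record v to its probability p(v) in [0, 1].
    Returns a random subset in which each record appears independently
    with probability p(v). Runs in O(n) time regardless of output size."""
    return [v for v, p in prob.items() if rng.random() < p]
```

The expected output size is the sum of all p(v), so when that sum is much smaller than n most of the coin flips in this baseline are wasted.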

arXiv.org e-Print Archive