18 research outputs found
Independent Range Sampling, Revisited
In the independent range sampling (IRS) problem, given an input set P of n points in R^d, the task is to build a data structure, such that given a range R and an integer t >= 1, it returns t points that are uniformly and independently drawn from P cap R. The samples must satisfy inter-query independence, that is, the samples returned by every query must be independent of the samples returned by all the previous queries. This problem was first tackled by Hu, Qiao and Tao in 2014, who proposed optimal structures for one-dimensional dynamic IRS problem in internal memory and one-dimensional static IRS problem in external memory.
In this paper, we study two natural extensions of the independent range sampling problem. In the first extension, we consider the static IRS problem in two and three dimensions in internal memory. We obtain data structures with optimal space-query tradeoffs for 3D halfspace, 3D dominance, and 2D three-sided queries. The second extension considers weighted IRS problem. Each point is associated with a real-valued weight, and given a query range R, a sample is drawn independently such that each point in P cap R is selected with probability proportional to its weight. Walker\u27s alias method is a classic solution to this problem when no query range is specified. We obtain optimal data structure for one dimensional weighted range sampling problem, thereby extending the alias method to allow range queries
Independent Range Sampling, Revisited Again
We revisit the range sampling problem: the input is a set of points where each point is associated with a real-valued weight. The goal is to store them in a structure such that given a query range and an integer k, we can extract k independent random samples from the points inside the query range, where the probability of sampling a point is proportional to its weight.
This line of work was initiated in 2014 by Hu, Qiao, and Tao and it was later followed up by Afshani and Wei. The first line of work mostly studied unweighted but dynamic version of the problem in one dimension whereas the second result considered the static weighted problem in one dimension as well as the unweighted problem in 3D for halfspace queries.
We offer three main results and some interesting insights that were missed by the previous work: We show that it is possible to build efficient data structures for range sampling queries if we allow the query time to hold in expectation (the first result), or obtain efficient worst-case query bounds by allowing the sampling probability to be approximately proportional to the weight (the second result). The third result is a conditional lower bound that shows essentially one of the previous two concessions is needed. For instance, for the 3D range sampling queries, the first two results give efficient data structures with near-linear space and polylogarithmic query time whereas the lower bound shows with near-linear space the worst-case query time must be close to n^{2/3}, ignoring polylogarithmic factors. Up to our knowledge, this is the first such major gap between the expected and worst-case query time of a range searching problem
Fair Near Neighbor Search: Independent Range Sampling in High Dimensions. PODS
Similarity search is a fundamental algorithmic primitive, widely used in many
computer science disciplines. There are several variants of the similarity
search problem, and one of the most relevant is the -near neighbor (-NN)
problem: given a radius and a set of points , construct a data
structure that, for any given query point , returns a point within
distance at most from . In this paper, we study the -NN problem in
the light of fairness. We consider fairness in the sense of equal opportunity:
all points that are within distance from the query should have the same
probability to be returned. In the low-dimensional case, this problem was first
studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the
theoretically strongest approach to similarity search in high dimensions, does
not provide such a fairness guarantee. To address this, we propose efficient
data structures for -NN where all points in that are near have the
same probability to be selected and returned by the query. Specifically, we
first propose a black-box approach that, given any LSH scheme, constructs a
data structure for uniformly sampling points in the neighborhood of a query.
Then, we develop a data structure for fair similarity search under inner
product that requires nearly-linear space and exploits locality sensitive
filters. The paper concludes with an experimental evaluation that highlights
(un)fairness in a recommendation setting on real-world datasets and discusses
the inherent unfairness introduced by solving other variants of the problem.Comment: Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on
Principles of Database Systems (PODS), Pages 191-204, June 202
Querying Probabilistic Neighborhoods in Spatial Data Sets Efficiently
In this paper we define the notion
of a probabilistic neighborhood in spatial data: Let a set of points in
, a query point , a distance metric \dist,
and a monotonically decreasing function be
given. Then a point belongs to the probabilistic neighborhood of with respect to with probability f(\dist(p,q)). We envision
applications in facility location, sensor networks, and other scenarios where a
connection between two entities becomes less likely with increasing distance. A
straightforward query algorithm would determine a probabilistic neighborhood in
time by probing each point in .
To answer the query in sublinear time for the planar case, we augment a
quadtree suitably and design a corresponding query algorithm. Our theoretical
analysis shows that -- for certain distributions of planar -- our algorithm
answers a query in time with high probability
(whp). This matches up to a logarithmic factor the cost induced by
quadtree-based algorithms for deterministic queries and is asymptotically
faster than the straightforward approach whenever .
As practical proofs of concept we use two applications, one in the Euclidean
and one in the hyperbolic plane. In particular, our results yield the first
generator for random hyperbolic graphs with arbitrary temperatures in
subquadratic time. Moreover, our experimental data show the usefulness of our
algorithm even if the point distribution is unknown or not uniform: The running
time savings over the pairwise probing approach constitute at least one order
of magnitude already for a modest number of points and queries.Comment: The final publication is available at Springer via
http://dx.doi.org/10.1007/978-3-319-44543-4_3
On Range Summary Queries
We study the query version of the approximate heavy hitter and quantile problems. In the former problem, the input is a parameter ? and a set P of n points in ?^d where each point is assigned a color from a set C, and the goal is to build a structure such that given any geometric range ?, we can efficiently find a list of approximate heavy hitters in ??P, i.e., colors that appear at least ? |??P| times in ??P, as well as their frequencies with an additive error of ? |??P|. In the latter problem, each point is assigned a weight from a totally ordered universe and the query must output a sequence S of 1+1/? weights such that the i-th weight in S has approximate rank i?|??P|, meaning, rank i?|??P| up to an additive error of ?|??P|. Previously, optimal results were only known in 1D [Wei and Yi, 2011] but a few sub-optimal methods were available in higher dimensions [Peyman Afshani and Zhewei Wei, 2017; Pankaj K. Agarwal et al., 2012].
We study the problems for two important classes of geometric ranges: 3D halfspace and 3D dominance queries. It is known that many other important queries can be reduced to these two, e.g., 1D interval stabbing or interval containment, 2D three-sided queries, 2D circular as well as 2D k-nearest neighbors queries. We consider the real RAM model of computation where integer registers of size w bits, w = ?(log n), are also available. For dominance queries, we show optimal solutions for both heavy hitter and quantile problems: using linear space, we can answer both queries in time O(log n + 1/?). Note that as the output size is 1/?, after investing the initial O(log n) searching time, our structure takes on average O(1) time to find a heavy hitter or a quantile! For more general halfspace heavy hitter queries, the same optimal query time can be achieved by increasing the space by an extra log_w(1/?) (resp. log log_w(1/?)) factor in 3D (resp. 2D). By spending extra log^O(1)(1/?) factors in both time and space, we can also support quantile queries.
We remark that it is hopeless to achieve a similar query bound for dimensions 4 or higher unless significant advances are made in the data structure side of theory of geometric approximations
Fair near neighbor search via sampling
Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points S and a radius parameter r > 0, the rnear neighbor (r-NN) problem asks for a data structure that, given any query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance r from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee
Simpler is Much Faster: Fair and Independent Inner Product Search
The problem of inner product search (IPS) is important in many fields. Although maximum inner product search (MIPS) is often considered, its result is usually skewed and static. Users are hence hard to obtain diverse and/or new items by using the MIPS problem. Motivated by this, we formulate a new problem, namely the fair and independent IPS problem. Given a query, a threshold, and an output size k, this problem randomly samples k items from a set of items such that the inner product of the query and item is not less than the threshold. For each item that satisfies the threshold, this problem is fair, because the probability that such an item is outputted is equal to that for each other item. This fairness can yield diversity and novelty, but this problem faces a computational challenge. Some existing (M)IPS techniques can be employed in this problem, but they require O(n) or o(n) time, where n is the dataset size. To scale well to large datasets, we propose a simple yet efficient algorithm that runs in O(logn + k) expected time. We conduct experiments using real datasets, and the results demonstrate that our algorithm is up to 330 times faster than baselines.Aoyama K., Amagata D., Fujita S., et al. Simpler is Much Faster: Fair and Independent Inner Product Search. SIGIR 2023 - Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval , 2379 (2023); https://doi.org/10.1145/3539618.3592061
Subset Sampling and Its Extensions
This paper studies the \emph{subset sampling} problem. The input is a set
of records together with a function that assigns
each record a probability . A query returns a
random subset of , where each record is
sampled into independently with probability . The goal is to
store in a data structure to answer queries efficiently. If
fits in memory, the problem is interesting when is
dynamic. We develop a dynamic data structure with
expected \emph{query} time,
space and amortized expected \emph{update}, \emph{insert} and
\emph{delete} time, where
. The query time and
space are optimal. If does not fit in memory, the problem is
difficult even if is static. Under this scenario, we present an
I/O-efficient algorithm that answers a \emph{query} in
amortized expected I/Os using space, where is the memory
size, is the block size and is the number of iterative
operations we need to perform on before going below . In
addition, when each record is associated with a real-valued key, we extend the
\emph{subset sampling} problem to the \emph{range subset sampling} problem, in
which we require that the keys of the sampled records fall within a specified
input range . For this extension, we provide a solution under the
dynamic setting, with expected
\emph{query} time, space and amortized
expected \emph{update}, \emph{insert} and \emph{delete} time.Comment: 17 page