18 research outputs found

    Independent Range Sampling, Revisited

    In the independent range sampling (IRS) problem, given an input set P of n points in R^d, the task is to build a data structure such that, given a range R and an integer t >= 1, it returns t points that are uniformly and independently drawn from P ∩ R. The samples must satisfy inter-query independence, that is, the samples returned by every query must be independent of the samples returned by all previous queries. This problem was first tackled by Hu, Qiao and Tao in 2014, who proposed optimal structures for the one-dimensional dynamic IRS problem in internal memory and the one-dimensional static IRS problem in external memory. In this paper, we study two natural extensions of the independent range sampling problem. In the first extension, we consider the static IRS problem in two and three dimensions in internal memory. We obtain data structures with optimal space-query tradeoffs for 3D halfspace, 3D dominance, and 2D three-sided queries. The second extension considers the weighted IRS problem. Each point is associated with a real-valued weight, and given a query range R, a sample is drawn independently such that each point in P ∩ R is selected with probability proportional to its weight. Walker's alias method is a classic solution to this problem when no query range is specified. We obtain an optimal data structure for the one-dimensional weighted range sampling problem, thereby extending the alias method to allow range queries.
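
The alias method mentioned in the abstract above can be sketched as follows. This is a minimal illustration of the classic technique (in Vose's formulation) for weighted sampling without a query range, not the range-aware structure from the paper; the function names `build_alias` and `sample_alias` are hypothetical.

```python
import random


def build_alias(weights):
    """Build alias tables (Vose's formulation) for a list of positive weights."""
    n = len(weights)
    total = sum(weights)
    # Scale so that the average scaled weight is exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # bucket s keeps its own index with this probability
        alias[s] = l          # otherwise it redirects to the large index l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftovers have scaled weight 1 up to rounding error.
    for i in small + large:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def sample_alias(prob, alias, rng=random):
    """Draw one index in O(1) time, with probability proportional to its weight."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

Preprocessing takes O(n) time and each draw takes O(1) time. The tables cover the whole input, which is why answering range queries needs a different structure.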

    Independent Range Sampling, Revisited Again

    We revisit the range sampling problem: the input is a set of points where each point is associated with a real-valued weight. The goal is to store them in a structure such that given a query range and an integer k, we can extract k independent random samples from the points inside the query range, where the probability of sampling a point is proportional to its weight. This line of work was initiated in 2014 by Hu, Qiao, and Tao and it was later followed up by Afshani and Wei. The first line of work mostly studied the unweighted but dynamic version of the problem in one dimension, whereas the second result considered the static weighted problem in one dimension as well as the unweighted problem in 3D for halfspace queries. We offer three main results and some interesting insights that were missed by the previous work: We show that it is possible to build efficient data structures for range sampling queries if we allow the query time to hold in expectation (the first result), or obtain efficient worst-case query bounds by allowing the sampling probability to be approximately proportional to the weight (the second result). The third result is a conditional lower bound that shows essentially one of the previous two concessions is needed. For instance, for the 3D range sampling queries, the first two results give efficient data structures with near-linear space and polylogarithmic query time, whereas the lower bound shows that with near-linear space the worst-case query time must be close to n^{2/3}, ignoring polylogarithmic factors. To our knowledge, this is the first such major gap between the expected and worst-case query time of a range searching problem.
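
To make the query in the abstract above concrete, here is a simple one-dimensional baseline that draws k weighted samples from an interval using sorted coordinates, prefix sums, and binary search. It is a sketch under stated assumptions, not one of the structures from the paper: the class name `WeightedRangeSampler1D` is hypothetical, weights are assumed positive, and each sample costs O(log n) time.

```python
import bisect
import random
from itertools import accumulate


class WeightedRangeSampler1D:
    """Static 1D baseline: sorted coordinates plus prefix sums of the weights."""

    def __init__(self, points):
        # points: iterable of (coordinate, weight) pairs with weight > 0 (assumed).
        pts = sorted(points)
        self.xs = [x for x, _ in pts]
        # prefix[m] is the total weight of the first m points in sorted order.
        self.prefix = [0.0] + list(accumulate(w for _, w in pts))

    def sample(self, lo, hi, k, rng=random):
        """Return k independent weighted samples with coordinate in [lo, hi]."""
        i = bisect.bisect_left(self.xs, lo)
        j = bisect.bisect_right(self.xs, hi)
        if i >= j:
            return []  # no point falls inside the query range
        base = self.prefix[i]
        total = self.prefix[j] - base
        out = []
        for _ in range(k):
            u = base + rng.random() * total
            # Find idx in [i, j - 1] with prefix[idx] <= u < prefix[idx + 1].
            idx = bisect.bisect_right(self.prefix, u, i + 1, j) - 1
            out.append(self.xs[idx])
        return out
```

Every draw uses fresh randomness, so samples are independent both within a query and across queries.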

    Fair Near Neighbor Search: Independent Range Sampling in High Dimensions. PODS

    Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the $r$-near neighbor ($r$-NN) problem: given a radius $r>0$ and a set of points $S$, construct a data structure that, for any given query point $q$, returns a point $p$ within distance at most $r$ from $q$. In this paper, we study the $r$-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance $r$ from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for $r$-NN where all points in $S$ that are near $q$ have the same probability to be selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights (un)fairness in a recommendation setting on real-world datasets and discusses the inherent unfairness introduced by solving other variants of the problem. Comment: Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS), Pages 191-204, June 202

    Querying Probabilistic Neighborhoods in Spatial Data Sets Efficiently

    In this paper we define the notion of a probabilistic neighborhood in spatial data: Let a set $P$ of $n$ points in $\mathbb{R}^d$, a query point $q \in \mathbb{R}^d$, a distance metric $\operatorname{dist}$, and a monotonically decreasing function $f : \mathbb{R}^+ \rightarrow [0,1]$ be given. Then a point $p \in P$ belongs to the probabilistic neighborhood $N(q, f)$ of $q$ with respect to $f$ with probability $f(\operatorname{dist}(p,q))$. We envision applications in facility location, sensor networks, and other scenarios where a connection between two entities becomes less likely with increasing distance. A straightforward query algorithm would determine a probabilistic neighborhood in $\Theta(n\cdot d)$ time by probing each point in $P$. To answer the query in sublinear time for the planar case, we augment a quadtree suitably and design a corresponding query algorithm. Our theoretical analysis shows that, for certain distributions of planar $P$, our algorithm answers a query in $O((|N(q,f)| + \sqrt{n})\log n)$ time with high probability (whp). This matches up to a logarithmic factor the cost induced by quadtree-based algorithms for deterministic queries and is asymptotically faster than the straightforward approach whenever $|N(q,f)| \in o(n / \log n)$. As practical proofs of concept we use two applications, one in the Euclidean and one in the hyperbolic plane. In particular, our results yield the first generator for random hyperbolic graphs with arbitrary temperatures in subquadratic time. Moreover, our experimental data show the usefulness of our algorithm even if the point distribution is unknown or not uniform: The running time savings over the pairwise probing approach constitute at least one order of magnitude already for a modest number of points and queries. Comment: The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-44543-4_3
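
The straightforward query described in the abstract above, which probes every point once, might look like the sketch below. It assumes Euclidean distance and points given as coordinate tuples; the function name `probabilistic_neighborhood` is hypothetical, and this is the linear-time baseline rather than the quadtree-based algorithm.

```python
import math
import random


def probabilistic_neighborhood(points, q, f, rng=random):
    """Keep each point p independently with probability f(dist(p, q)).

    Probes every point once, so it runs in Theta(n * d) time.
    """
    result = []
    for p in points:
        d = math.dist(p, q)  # Euclidean distance (an assumption of this sketch)
        if rng.random() < f(d):
            result.append(p)
    return result
```

For example, `f = lambda d: math.exp(-d)` makes distant points exponentially less likely to be included.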

    On Range Summary Queries

    We study the query version of the approximate heavy hitter and quantile problems. In the former problem, the input is a parameter ε and a set P of n points in ℝ^d where each point is assigned a color from a set C, and the goal is to build a structure such that given any geometric range γ, we can efficiently find a list of approximate heavy hitters in γ∩P, i.e., colors that appear at least ε|γ∩P| times in γ∩P, as well as their frequencies with an additive error of ε|γ∩P|. In the latter problem, each point is assigned a weight from a totally ordered universe and the query must output a sequence S of 1+1/ε weights such that the i-th weight in S has approximate rank iε|γ∩P|, meaning, rank iε|γ∩P| up to an additive error of ε|γ∩P|. Previously, optimal results were only known in 1D [Wei and Yi, 2011] but a few sub-optimal methods were available in higher dimensions [Peyman Afshani and Zhewei Wei, 2017; Pankaj K. Agarwal et al., 2012]. We study the problems for two important classes of geometric ranges: 3D halfspace and 3D dominance queries. It is known that many other important queries can be reduced to these two, e.g., 1D interval stabbing or interval containment, 2D three-sided queries, 2D circular as well as 2D k-nearest neighbors queries. We consider the real RAM model of computation where integer registers of size w bits, w = Ω(log n), are also available. For dominance queries, we show optimal solutions for both heavy hitter and quantile problems: using linear space, we can answer both queries in time O(log n + 1/ε). Note that as the output size is 1/ε, after investing the initial O(log n) searching time, our structure takes on average O(1) time to find a heavy hitter or a quantile! For more general halfspace heavy hitter queries, the same optimal query time can be achieved by increasing the space by an extra log_w(1/ε) (resp. log log_w(1/ε)) factor in 3D (resp. 2D). By spending extra log^{O(1)}(1/ε) factors in both time and space, we can also support quantile queries. We remark that it is hopeless to achieve a similar query bound for dimensions 4 or higher unless significant advances are made in the data structure side of theory of geometric approximations.

    Fair near neighbor search via sampling

    Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points S and a radius parameter r > 0, the r-near neighbor (r-NN) problem asks for a data structure that, given any query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance r from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee.

    Simpler is Much Faster: Fair and Independent Inner Product Search

    The problem of inner product search (IPS) is important in many fields. Although maximum inner product search (MIPS) is often considered, its result is usually skewed and static. Users hence find it hard to obtain diverse and/or new items with MIPS. Motivated by this, we formulate a new problem, namely the fair and independent IPS problem. Given a query, a threshold, and an output size k, this problem randomly samples k items from the set of items whose inner product with the query is not less than the threshold. The problem is fair because every item that satisfies the threshold has the same probability of being output. This fairness can yield diversity and novelty, but the problem poses a computational challenge. Some existing (M)IPS techniques can be employed for this problem, but they require O(n) or o(n) time, where n is the dataset size. To scale well to large datasets, we propose a simple yet efficient algorithm that runs in O(log n + k) expected time. We conduct experiments using real datasets, and the results demonstrate that our algorithm is up to 330 times faster than baselines. Aoyama K., Amagata D., Fujita S., et al. Simpler is Much Faster: Fair and Independent Inner Product Search. SIGIR 2023 - Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2379 (2023); https://doi.org/10.1145/3539618.3592061
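
The fair and independent IPS problem stated above can be illustrated with an O(n) linear-scan baseline. This sketch is not the proposed O(log n + k) algorithm. The function name `fair_ips_baseline` is hypothetical, and drawing the k samples with replacement is an assumption, since the abstract does not say whether repeats are allowed.

```python
import random


def fair_ips_baseline(items, query, threshold, k, rng=random):
    """Linear-scan baseline: uniform samples among items meeting the threshold."""
    # Keep every item whose inner product with the query reaches the threshold.
    qualifying = [
        x for x in items
        if sum(a * b for a, b in zip(x, query)) >= threshold
    ]
    if not qualifying:
        return []
    # Assumption: samples are drawn independently, so repeats are possible.
    return [rng.choice(qualifying) for _ in range(k)]
```

Each qualifying item is equally likely at every draw, which is the fairness property the abstract describes; the cost is a full pass over the dataset per query.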

    Subset Sampling and Its Extensions

    This paper studies the subset sampling problem. The input is a set $\mathcal{S}$ of $n$ records together with a function $\textbf{p}$ that assigns each record $v\in\mathcal{S}$ a probability $\textbf{p}(v)$. A query returns a random subset $X$ of $\mathcal{S}$, where each record $v\in\mathcal{S}$ is sampled into $X$ independently with probability $\textbf{p}(v)$. The goal is to store $\mathcal{S}$ in a data structure to answer queries efficiently. If $\mathcal{S}$ fits in memory, the problem is interesting when $\mathcal{S}$ is dynamic. We develop a dynamic data structure with $\mathcal{O}(1+\mu_{\mathcal{S}})$ expected query time, $\mathcal{O}(n)$ space and $\mathcal{O}(1)$ amortized expected update, insert and delete time, where $\mu_{\mathcal{S}}=\sum_{v\in\mathcal{S}}\textbf{p}(v)$. The query time and space are optimal. If $\mathcal{S}$ does not fit in memory, the problem is difficult even if $\mathcal{S}$ is static. Under this scenario, we present an I/O-efficient algorithm that answers a query in $\mathcal{O}\left((\log^*_B n)/B+(\mu_\mathcal{S}/B)\log_{M/B} (n/B)\right)$ amortized expected I/Os using $\mathcal{O}(n/B)$ space, where $M$ is the memory size, $B$ is the block size and $\log^*_B n$ is the number of iterative $\log_2(.)$ operations we need to perform on $n$ before going below $B$. In addition, when each record is associated with a real-valued key, we extend the subset sampling problem to the range subset sampling problem, in which we require that the keys of the sampled records fall within a specified input range $[a,b]$. For this extension, we provide a solution under the dynamic setting, with $\mathcal{O}(\log n+\mu_{\mathcal{S}\cap[a,b]})$ expected query time, $\mathcal{O}(n)$ space and $\mathcal{O}(\log n)$ amortized expected update, insert and delete time. Comment: 17 page
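
The subset sampling query defined above has an obvious O(n) baseline: flip one independent coin per record. The sketch below shows that baseline together with the expected output size μ_S. It is not the paper's O(1 + μ_S) structure, and the function names are hypothetical.

```python
import random


def subset_sample_naive(records, p, rng=random):
    """Include each record v independently with probability p(v); O(n) time."""
    return [v for v in records if rng.random() < p(v)]


def expected_size(records, p):
    """mu_S: the expected number of records in one sampled subset."""
    return sum(p(v) for v in records)
```

The gap between this baseline and the bound in the abstract matters when μ_S is much smaller than n, since the naive scan still touches every record.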