Subset Sampling and Its Extensions

Abstract

This paper studies the \emph{subset sampling} problem. The input is a set $\mathcal{S}$ of $n$ records together with a function $\textbf{p}$ that assigns each record $v\in\mathcal{S}$ a probability $\textbf{p}(v)$. A query returns a random subset $X$ of $\mathcal{S}$, where each record $v\in\mathcal{S}$ is sampled into $X$ independently with probability $\textbf{p}(v)$. The goal is to store $\mathcal{S}$ in a data structure to answer queries efficiently. If $\mathcal{S}$ fits in memory, the problem is interesting when $\mathcal{S}$ is dynamic. We develop a dynamic data structure with $\mathcal{O}(1+\mu_{\mathcal{S}})$ expected \emph{query} time, $\mathcal{O}(n)$ space and $\mathcal{O}(1)$ amortized expected \emph{update}, \emph{insert} and \emph{delete} time, where $\mu_{\mathcal{S}}=\sum_{v\in\mathcal{S}}\textbf{p}(v)$. The query time and space are optimal. If $\mathcal{S}$ does not fit in memory, the problem is difficult even if $\mathcal{S}$ is static. Under this scenario, we present an I/O-efficient algorithm that answers a \emph{query} in $\mathcal{O}\left((\log^*_B n)/B+(\mu_\mathcal{S}/B)\log_{M/B} (n/B)\right)$ amortized expected I/Os using $\mathcal{O}(n/B)$ space, where $M$ is the memory size, $B$ is the block size and $\log^*_B n$ is the number of iterative $\log_2(.)$ operations we need to perform on $n$ before going below $B$. In addition, when each record is associated with a real-valued key, we extend the \emph{subset sampling} problem to the \emph{range subset sampling} problem, in which we require that the keys of the sampled records fall within a specified input range $[a,b]$. For this extension, we provide a solution under the dynamic setting, with $\mathcal{O}(\log n+\mu_{\mathcal{S}\cap[a,b]})$ expected \emph{query} time, $\mathcal{O}(n)$ space and $\mathcal{O}(\log n)$ amortized expected \emph{update}, \emph{insert} and \emph{delete} time.

Comment: 17 pages
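To make the query semantics concrete, the following Python sketch may help. It is not the data structure developed in the paper. The first function is the obvious $\mathcal{O}(n)$-per-query baseline. The second covers only the special case in which every record has the same probability $p$, where geometric jumps over unsampled records give $\mathcal{O}(1+np)$ expected time, matching the $\mathcal{O}(1+\mu_{\mathcal{S}})$ bound for that case. The function names, the use of record indices in place of records, and the `rng` parameter are assumptions made for this illustration.

```python
import math
import random


def naive_subset_sample(probs, rng):
    """Baseline query: flip one independent coin per record.

    probs[i] plays the role of p(v) for the i-th record.
    Runs in O(n) time regardless of how small the output is.
    """
    return [i for i, p in enumerate(probs) if rng.random() < p]


def uniform_subset_sample(n, p, rng):
    """Query for the special case p(v) = p for all n records.

    Instead of flipping n coins, draw the gap to the next sampled
    record from a geometric distribution, so the expected running
    time is O(1 + n*p), i.e. proportional to the expected output size.
    """
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(range(n))
    out = []
    log_q = math.log1p(-p)          # log(1 - p), strictly negative
    i = -1                          # index of the last sampled record
    while True:
        u = 1.0 - rng.random()      # uniform on (0, 1], so log(u) is finite
        gap = math.log(u) / log_q   # floor(gap) = number of records skipped
        if gap >= n:                # jump certainly leaves the array
            return out
        i += 1 + int(gap)
        if i >= n:
            return out
        out.append(i)


if __name__ == "__main__":
    rng = random.Random(0)
    print(naive_subset_sample([0.9, 0.1, 0.5, 0.0, 1.0], rng))
    print(len(uniform_subset_sample(1_000_000, 1e-5, rng)))
```

The gap computation uses the fact that $\lfloor \ln U / \ln(1-p) \rfloor$ with $U$ uniform on $(0,1]$ is geometrically distributed with success probability $p$. Handling arbitrary, changing probabilities within the same time bound is the harder problem the paper addresses.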
