Subset Sampling and Its Extensions
Authors
Jinchao Huang
Sibo Wang
Publication date
21 July 2023
Publisher
View on arXiv
Abstract
This paper studies the \emph{subset sampling} problem. The input is a set $\mathcal{S}$ of $n$ records together with a function $\textbf{p}$ that assigns each record $v\in\mathcal{S}$ a probability $\textbf{p}(v)$. A query returns a random subset $X$ of $\mathcal{S}$, where each record $v\in\mathcal{S}$ is sampled into $X$ independently with probability $\textbf{p}(v)$. The goal is to store $\mathcal{S}$ in a data structure to answer queries efficiently. If $\mathcal{S}$ fits in memory, the problem is interesting when $\mathcal{S}$ is dynamic. We develop a dynamic data structure with $\mathcal{O}(1+\mu_{\mathcal{S}})$ expected \emph{query} time, $\mathcal{O}(n)$ space and $\mathcal{O}(1)$ amortized expected \emph{update}, \emph{insert} and \emph{delete} time, where $\mu_{\mathcal{S}}=\sum_{v\in\mathcal{S}}\textbf{p}(v)$. The query time and space are optimal. If $\mathcal{S}$ does not fit in memory, the problem is difficult even if $\mathcal{S}$ is static. Under this scenario, we present an I/O-efficient algorithm that answers a \emph{query} in $\mathcal{O}\left((\log^*_B n)/B+(\mu_\mathcal{S}/B)\log_{M/B} (n/B)\right)$ amortized expected I/Os using $\mathcal{O}(n/B)$ space, where $M$ is the memory size, $B$ is the block size and $\log^*_B n$ is the number of iterative $\log_2(.)$ operations we need to perform on $n$ before going below $B$. In addition, when each record is associated with a real-valued key, we extend the \emph{subset sampling} problem to the \emph{range subset sampling} problem, in which we require that the keys of the sampled records fall within a specified input range $[a,b]$. For this extension, we provide a solution under the dynamic setting, with $\mathcal{O}(\log n+\mu_{\mathcal{S}\cap[a,b]})$ expected \emph{query} time, $\mathcal{O}(n)$ space and $\mathcal{O}(\log n)$ amortized expected \emph{update}, \emph{insert} and \emph{delete} time.
Comment: 17 page
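
The two definitions the abstract relies on can be made concrete with a short Python sketch. It is not the paper's data structure: `naive_subset_sample` is the obvious baseline that flips one coin per record and therefore costs O(n) per query, which is exactly what the O(1 + μ_S) bound improves on. `log_star` follows the abstract's wording for log*_B n literally. All function names and the dictionary representation of the records are assumptions made for illustration.

```python
import math
import random


def naive_subset_sample(probs, rng):
    """Baseline query: include each record v independently with probability probs[v].

    Touches every record, so it runs in O(n) time regardless of the output size.
    """
    return [v for v, p in probs.items() if rng.random() < p]


def expected_output_size(probs):
    """mu_S, the sum of all probabilities, which equals the expected size of X."""
    return sum(probs.values())


def log_star(n, B):
    """Number of times log2 must be applied to n before the value drops below B.

    Assumes B >= 1 so that the iterated value never reaches a non-positive number.
    """
    if B < 1:
        raise ValueError("B must be at least 1")
    count = 0
    x = float(n)
    while x >= B:
        x = math.log2(x)
        count += 1
    return count


if __name__ == "__main__":
    rng = random.Random(2023)
    probs = {"r1": 0.9, "r2": 0.05, "r3": 0.5, "r4": 0.01}
    print(naive_subset_sample(probs, rng))   # one random subset X
    print(expected_output_size(probs))       # mu_S for this toy input
    print(log_star(2**20, 16))               # iterated-log count for n = 2^20, B = 16
```

When μ_S is much smaller than n, most of the coin flips in the baseline come up empty. That gap is the motivation for a structure whose query cost depends on the expected output size rather than on n.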
arXiv.org e-Print Archive
oai:arXiv.org:2307.11585
Last time updated on 28/07/2023