Efficient Parallel Random Sampling: Vectorized, Cache-Efficient, and Online
We consider the problem of sampling n numbers from the range {1, ..., N}
without replacement on modern architectures. The main result
is a simple divide-and-conquer scheme that makes sequential algorithms more
cache efficient and leads to a parallel algorithm running in expected time
O(n/p + log p) on p processors, i.e., it scales to massively parallel
machines even for moderate values of n. The amount of communication between
the processors is very small (at most O(log p)) and independent of
the sample size. We also discuss modifications needed for load balancing,
online sampling, sampling with replacement, Bernoulli sampling, and
vectorization on SIMD units or GPUs.
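The core divide-and-conquer idea can be sketched as follows: split the range in half, draw how many of the n samples fall into the left half (a hypergeometric variate), and recurse independently on both halves. The sketch below is a sequential illustration only; the function name, the small-range base case, and the O(n) Bernoulli-trial hypergeometric draw are simplifications of ours, not the paper's (which uses a fast hypergeometric generator and assigns the two recursive calls to disjoint processor groups).

```python
import random

def sample_range(lo, hi, n, rng=random):
    """Sample n distinct integers uniformly from [lo, hi),
    returned in sorted order, via divide-and-conquer."""
    assert 0 <= n <= hi - lo
    if n == 0:
        return []
    if hi - lo <= 64:
        # Base case: simple rejection sampling into a set.
        chosen = set()
        while len(chosen) < n:
            chosen.add(rng.randrange(lo, hi))
        return sorted(chosen)
    mid = (lo + hi) // 2
    # Draw n_left ~ Hypergeometric: how many of the n samples land
    # in [lo, mid). Done here by n sequential Bernoulli trials with
    # shrinking counts; the paper uses an O(1)-expected-time generator.
    left_size, right_size = mid - lo, hi - mid
    n_left = 0
    for _ in range(n):
        if rng.random() * (left_size + right_size) < left_size:
            n_left += 1
            left_size -= 1
        else:
            right_size -= 1
    # The two recursive calls are independent, which is what makes
    # the scheme parallelizable with only O(log p) communication.
    return (sample_range(lo, mid, n_left, rng)
            + sample_range(mid, hi, n - n_left, rng))
```

Because the recursion localizes work to ever-smaller subranges, the sequential version also becomes more cache friendly, matching the claim in the abstract.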