Efficient cache oblivious algorithms for randomized divide-and-conquer
  on the multicore model by Sharma, Neeraj & Sen, Sandeep
Efficient cache oblivious algorithms for randomized
divide-and-conquer on the multicore model
Neeraj Sharma
Mentor Graphics (INDIA) Pvt. Ltd.
Noida, U.P. 201301, India
neeraj sharma@mentor.com
Sandeep Sen
Department of CSE
I.I.T. Delhi, New Delhi 110016, India
ssen@cse.iitd.ac.in
November 9, 2018
Abstract
In this paper we present randomized algorithms for sorting and convex hull that achieves optimal
performance (for speed-up and cache misses) on the multicore model with private cache model. Our
algorithms are cache oblivious and generalize the randomized divide and conquer strategy given by
Reischuk [14] and Reif and Sen [17]. Although the approach yielded optimal speed-up in the PRAM
model, we require additional techniques to optimize cache-misses in an oblivious setting. Let p, n,M,B
respectively denote number of processors, problem size, the size of individual processor cache memory
and block size respectively, then we obtain expected parallel running time O(np log n + log n log log n)
with expected O( nB logM n) cache misses for sorting n keys and constructing convex hull of n points.
For p ≤ nlog logn , and under the tall-cache assumption M ≥ B2, both speed-up and cache-misses are the
best possible. Since the input-size n ≥ Mp, under a very mild assumption M ≥ log log n, p ≤ nlog logn ,
so in all realistic scenarios, our algorithm will have optimal time and cache misses with high probability.
Although similar results have been obtained recently for sorting [11] , we feel that our approach is simpler
and general and we apply it to obtain an algorithm for 3D convex hulls with similar bounds.
We also present a simple randomized processor allocation technique without the explicit knowledge of
the number of processors that is likely to find additional applications in resource oblivious environments.
SPAA’12, June 25–27, 2012, Pittsburgh, Pennsylvania, USA.
1
ar
X
iv
:1
20
4.
65
08
v2
  [
cs
.D
S]
  2
7 M
ay
 20
12
1 Introduction
The private-cache multicore model and the closely related Parallel External Memory (PEM) model com-
bines several features of parallel computing models like PRAM and the memory hierarchy issues captured
by the External Memory Models. The goal is to capture the relevant aspects of a large scale multiprocess-
ing environment, whose numerous parameters may be unknown to the algorithm designer. Although, it
is not intuitive, this last requirement can be tackled using the strategy called cache obliviousness or more
generally resource obliviousness.
These multiprocessing models consists of p processors (or cores) each having a private cache of size M .
There is a large global memory which acts as a shared memory. The processors communicate with each
other only through shared memory. Initially the input is present in the global memory stored in form of
blocks of size B. So an input of size n uses n/B blocks in the global memory. All the transfers to and
from memory (or cache) are done in blocks of size B. Whenever a core needs some data, say x, if the data
is already present in its private cache then no cost is incurred but if it is not present in the cache then
it copies this block from the shared memory to its private cache. Similarly when a core wants to write a
block, first that block is moved into its private cache then those changes can be updated in shared memory
at that time or later using some write-back policies. When a core modifies a block in its cache, then all
the copies of that block which are present in the cache of other cores are invalidated and for any future
access to that block it needs to be read again from the shared memory. To obtain full parallelism we need
to fill the cache of each core at least once, i.e. n ≥M · p.
Thus we have two types of cache related-cost incurred in this model1.
1. Cache misses : This is the standard type of cache miss which also occurs in sequential computation.
Whenever any core need some data which is not present in its cache, that block is copied from main
memory to its cache. This is treated as one cache miss. If some block B1 is evicted from cache
because of cache replacement policy and processor again need this B1 block, it will again cost one
cache miss to access this block (also known as capacity misses).
2. Block misses : These kind of misses occur only in the case of concurrent writing. It doesn’t have
an analogue in sequential computation and requires extra care in parallel computation. Suppose two
cores C1 and C2 are accessing the same block B1. If C1 modifies the contents of this block then the
copy of B1 present in C2’s cache is invalidated and it has to be read again which will be counted as
one cache miss. Note that block misses increase when a larger number of cores are accessing the same
block and one core modifies it. So the block misses occur when multiple processors are accessing the
same block at same time and at least one of them is writing. An algorithm designer will try to avoid
this situation as much as possible.
Both concurrent reads and concurrent writes are allowed in this model.
Concurrent Reads : If x cores are reading the same block, every core will incur one cache miss and the
total cache cost will be x (unless some processor writes on this block that will be treated as a block miss).
Concurrent Writes : We have two cases:
1. When x cores are reading one block and one core writes to that block at same time. All these cores will
have to read this block again from shared memory, which will lead to a total of x cache misses.
2. If there are x cores and all of them want to write to the same block, this block will migrate across these
x cores to satisfy their write requests. Thus the ith core in sequence will have to wait for time equal to i
cache misses before it can complete its write operation. So the total cache misses across all cores can be∑x−1
i=1 i = Ω(x
2).
In the worst case, any core will incur B cache misses to read or write on any block. So, while distributing
problems between different cores, to save cache cost, we will try to avoid block misses.
1These terms have been previously defined in [11]
2
The performance of any algorithm is characterized by two parameters.
Cache misses : The total number of cache misses and block misses incurred during algorithm. We will
analyze the total cache cost wherever any block miss occurs.
Critical path : It is the maximum time taken by any processor in the overall algorithm. To decrease the
critical path, we have to ensure equitable work-load distribution.
Note : Apart from its close relation to parallel time, this also affects the overall cache misses in a multi-
programming scheduling algorithm based on work-stealing [7].
Further, based on our a priori knowledge of multicore parameters (M,B, p) we have the following
variations
Cache Aware : The algorithm designer makes use of the values of multicore parameters (M,B, p) - for
example, we can divide the problem according to values of M and B to better utilize the cache.
Cache Oblivious : In this model we know the number of processors while designing algorithm but the
values of M and B are unknown. However, the analysis can make use of these parameters and our goal is
to match the performance of Cache aware algorithms.
Resource Oblivious Model : In this model, even the number of processors is unknown to the algorithm
designer. Usually the tasks are shared by processors using some on-line strategy like work-stealing methods
[6]
1.1 Previous Related Work
Our results make use of Reischuk’s [14] parallel randomized sorting algorithm that sorts n elements using
CREW PRAM processors inO(log n) time with high probability. For the external memory model, Aggarwal
and Vitter [2] designed a version of merge sort, that uses a maximum O(NB logM/B N/B) I/O’s and this
is optimal. For the cache oblivious model, Frigo et al. [13] presented a sequential cache-oblivious sorting
algorithm called Funnel sort which can sort n keys in O(n log n) time and O( nB logM n) cache misses
2. Note
that both time and cache misses achieved by this algorithm are optimal.
Arge et al. [4] formalized the PEM model and presented a merge sort algorithm based on [10] that runs
in O(log n) time and has optimal cache misses. Note that their model is both cache aware and processor
aware. This algorithm is very similar to merge sort implemented on [2]. They proved these results assuming
that n ≥ pB2 by using a d-way merge sort which partitions the input into d subsets, then sorts each subset
recursively and merges them. To achieve optimal I/O complexity and parallel speedup, the sorted subsets
are sampled and these sample sets are merged first. Then using the values of M and B, a suitable value
of d is chosen to achieve the claimed bounds.
Blelloch et al. [7] presented a resource oblivious distribution sort algorithm that has O(log2 n) critical
path-length and incurs sub-optimal cache cost in the private-cache multicore model. Their distribution sort
uses merging to divide the input into contiguous subsets and sorts them recursively. For their randomized
version, they proved an expected O(log1.5 n) depth. The potential bottleneck for reducing the depth further
was their technique for using merging to partition into buckets. The algorithm given in [22] is designed
for a BSP-style version of a cache aware, multi-level multicore. It uses a different collection of parameters,
and so it is difficult to compare it directly with the previous results.
Recently Cole and Ramachandran [11] presented a new optimal merge sort algorithm (SPMS) for
resource oblivious multicore model. This algorithm sorts n keys in O(np log n + log n log log n) time using
n processors with optimal number of cache misses on resource oblivious model assuming n ≥ Mp and
M ≥ B2 logB log logB. It works in O(log log n) depth where each stage requires O(log n) time. The
authors addressed a general computational paradigm called Balanced Partitioning Trees and designed a a
resource-oblivious priority work scheduler based on work-stealing to attain the above bounds. It follows
the same approach as merge sort where we splits the input into subsets, sorts the subsets recursively and
then merges them (to achieve the desired bounds, the authors used multi-way merging).
2For tall cache, since logM n is O(logM/B n/B), we will use the shorter expression
3
As sorting is basic problem in algorithms similarly Convex hull problem is a basic problem in Compu-
tational geometry. There is a close relation between sorting and convex hull problem. It has been proved
that sorting can be reduced to convex hull problem. So it means we can’t solve convex hull problem faster
than sorting. On a single processor convex hull problem can be solved easily by reducing it to sorting.
For 2-D convex hull problem, on single processor, there exists a output-sensitive cache oblivious algorithm
which achieve optimal time and cache misses [1]. By saying output sensitive hull, we means that total time
is proportional to number of points in output of convex hull. This algorithm is a modification of Chan’s
output sensitive algorithm [8]. On CREW model, there exists an algorithm which solve 3-D convex hull
problem in expected O(log n) time using n processors [17].
In the table below, we summarize the results discussed above where O˜ is used to represent expected bound.
To sort n elements using p processors
Model Time Cache cost Condition Source
CRCW
O˜(np log n) − n ≥ p Reischuk [14]PRAM
EREW
O(np log n) − n ≥ p Cole [10]PRAM
Cache
O(np log n) O(
n
B logMB
n
B )
n ≥ B2p
Goodrich [4]
Aware M ≥ B2
Cache
O(np log
2 n) O( nB log
n
B )
n ≥Mp
Ramachandran [5]
Oblivious M ≥ B2
Cache
Depth3 = O(log2 n) O(
n
B logMB
n
B ) M = Ω(B
2) Blelloch [7]
Oblivious
Cache
Depth3 = O˜(log1.5 n) O˜(
n
B logMB
n
B ) M = Ω(B
2) Blelloch [7]
Oblivious
Cache O˜(np log n+ O˜( nB logMB
n
B )
n ≥Mp
Our Results
Oblivious log n log log n) M ≥ B2B2/31
Resource O(np log n+ O( nB logMB
n
B )
n ≥Mp, M ≥
[11]
Oblivious log n log log n) B2 logB log logB
3
1.2 Our Work
In this paper, we have presented a randomized distributed sorting algorithm on cache-oblivious multicore
model. Our basic approach is similar to Reischuk [14], but to bound cache misses, we had to modify it
significantly. First, we sample some elements from input, sort them using a brute-force method and use
these elements to divide the input into disjoint buckets. To do this partitioning with minimal cache misses
in a cache-oblivious fashion, we do it in two phases. Roughly speaking, we divide the n input keys into√
n buckets by successive partitioning into n1/4 size buckets using a common procedure. This partitioning
procedure, which is the crux of our distribution sort, in turn invokes an efficient merging procedure to
attain the final bounds.
We are able to sort n elements in expected O(np log n+ log n log logn) time with expected O(
n
B logM n)
cache misses. These cache misses match the optimal cache bound O( nB logMB
n
B ) for tall cache i.e. if
M ≥ B2. So the cache cost is optimal but time is optimal only for p ≤ (n/ log log n). From the natural
condition that n ≥ Mp, if M ≥ log log n it implies that p ≤ (n/ log log n) and our algorithm will have
both time and cache misses optimal. Our bounds for sorting match that of Cole and Ramachandran [11]
in cache-misses and depth and we can obtain matching performance in the cache-oblivious PEM model.
3 It didn’t account for extra costs like block misses
4
We also present a simple technique for processor-obliviousness where the algorithm need not have any
prior knowledge of the number of processors and the processors can generate their ids on the fly. Our
approach is fundamentally different from [11, 7] that designs a scheduler to map tasks to processors in a
resource-oblivious fashion.
In practice, size of cache is 32KB or 64KB and size of block is 256B it means M ≥ log log n condition
is satisfied for input ≤ 2(28000). So our algorithm have optimal time and cache cost with high probability
for input of size ≤ 2(28000).
Moreover, our approach yields a general framework for randomized divide-and-conquer that has other
applications. In the latter part of this paper, we present a parallel multicore algorithm for constructing
convex hulls of point sets. This is based on the approach of of Reif and Sen [17] but we follow a simpler
description given in [19] for planar hulls. We are able to apply a number of subroutines developed for the
parallel sorting algorithm for the construction of the convex hull and achieve a bound similar to sorting.
Since it is known that Voronoi diagrams can be reduced to three dimensional convex hulls, we obtain
identical results as given above for sorting.
2 Some basic algorithms
The reader may recall that we are designing these algorithms without using the parameters M and B.
First we mention some basic algorithms which we will need for our sorting algorithm. We will use T to
represent total time (total operations), T¯ to represent expected time, T ′′ to represent parallel time, T¯ ′′ to
represent expected parallel time and Q to represent total cache misses.
Some times processors may have to read or write in contiguous locations which can lead to block misses.
So we bound total cache misses as follows :
Contiguous Reads and Writes. We have p cores and a total of n keys in the main memory stored in
n/B blocks and every processor has to read equal number of keys. So the total cache misses for every
core will be d npB e. Since a core may start or end reading in the middle of any block, the total cache misses
across all cores should be ≤ n/B + 2p which is O(n/B) given our assumption that (n ≥ Mp). Similarly,
when p cores have to write a total of n keys in n contiguous locations and every processor has to write an
equal number of keys n/p, then there will not be more than two block misses per processor for n ≥ Bp.
We now state results for some basic problems used throughout this paper. The associated algorithms
are based on standard approaches and the bounds assuming that n ≥Mp.
Lemma 2.1 Using p cores, maximum of n keys can be found in time O(n/p+ log p) with total O(dn/Be)
cache misses.
Proof: Assume input has been divided into p contiguous chunks each of size n/p. Every core is assigned
a task to get a maximum element from one chunk. This will take dn/pe time and n/B + p cache misses.
We get p elements. Use p/2 cores, where every core will read two keys and reject one key. This step will
take O(1) time and p cache misses. Repeat the same procedure using p/4 cores and so on until we are left
with one input key. Total time for whole algorithm will be O(n/p+ log p) and total cache misses across all
cores will be O(n/B + p) which can be bounded by O(n/B) if n ≥ Bp. 2
Lemma 2.2 Using p cores, n keys can be added in time O(n/p+ log p) with total O(n/B) cache misses.
Proof: This task can be done using the same approach as used in Lemma 2.2 within same bounds. 2
Lemma 2.3 Using p cores, prefix computation on n keys can be done in time O(n/p+ log p) with a total
of O(n/B) cache misses.
5
Input : (k1, k2, . . . , kn)
Output : (k1, k1 + k2, k1 + k2 + k3, . . . ,
n∑
i=1
ki)
Proof: We will use an algorithm which does prefix computation on n keys in O(n) time and O(n/B)
cache misses using one processor. [7]. Due to its Divide and Conquer nature, this algorithm can be easily
adapted to run on more than one processor. To bound the block misses, we need to makes some small
changes.
We are given an input array A of length n, a temporary array S of length n− 1, and an output array R of
length n. Say K=(n/p). Consider a balanced binary tree laid over input as shown in figure 1 (for n = 8).
This balanced tree will be of size n − 1 and will be saved in array S. To make memory operations more
cache efficient and avoid block misses (explained later), save this tree in infix order in S as shown in figure
1.
Figure 1: For n = 8, A is input array and S is temporary array
This algorithm will work in two phases. In first phase, we will calculate the sum of the left subtree for
each node and in second phase we will push this sum down the tree to calculate prefix sum. The recursive
codes for the two phases are shown below :
Phase 1: phase1(i, size). A is input array and S is temporary array where output of this phase 1 is stored.
(a) if size = 1 then return A[i].
(b) S[i+ size/2] = phase1(i, size/2)
(c) return S[i+ size/2] + phases1(i+ size/2, size/2)
Phase 2: phase2(i, size, s). Output of this phase is final output and is stored in array R.
(a) if size = 1 then set R[i] = s and return.
(b) phase2(i, size/2, s)
(c) phase2(i+ n/2, size/2, s+ S[i+ n/2])
Processor Allocation : After every level, no of processes get multiplied by two. So we know how
many processes we would have at any time which means calls can be distributed easily to processors using
processors numbers. Processors do not need any other information to execute these calls. Thus we do not
have to spend any extra time in processor allocation.
Analysis : Both phases use the same approach and same time and cache misses. We will analyze phase1
and same result follows for phase2 also. Assume K=n/p. Each call to phase1 is divided into two calls
of half sizes after constant computation and these both calls can be executed on different processors in
parallel at same time. But if size of problem is ≤ K then we have enough subproblems such that every
processor can solve one problem individually and sequentially. So if size ≤ K, all processes created by this
call will be executed on same processor. This modification helps us to bound total number of cache misses
6
when size becomes below M , total cache misses will be bounded by M if the problem is solved on single
core.
Block Misses :
Phase 1: Block misses occur when more than one processors are trying to write on same block. All written
operations in phase 1 are done only on S array. Because of the way binary tree is stored in S, we are able
to bound block misses. During phase1 we are basically traversing tree stored in S from top to bottom
(Figure 1) where very node represents one call and two children represents two newly created sub-calls.
We write one value on array S and create two new sub-calls to left and right subtree. All nodes of tree
which are present at same level are being executed in parallel and each node is writing on one location in
S. Distance between any two written positions will be equal to size of nodes on that level. So block misses
are zero till problem size is > B. And when problem size goes ≤ n/p, it is executed on single processor. In
this subtree every processor is going to write contiguous elements (more than M). And when a problem is
being solved on one processor than all calls are executed in depth first order which means that every core
will write its elements in order from left to right in S which gives us zero block misses. So block misses are
zero when problem size is greater than B or less than n/p. On the whole, we will have zero block misses.
Phase 2 : In phase2 also we are traversing through a binary tree in which only leaves are taking part in
writing. So similar to phase1, block misses are zero if n ≥Mp.
Time and Cache misses : Say T (n, p) represents total time, T ′′(n, p) represents parallel time and Q(n, p)
represents total cache misses taken by phase1 for input of size n using p cores.
T (n, 1) =
{
1 if n = 1
O(1) + 2T (n/2, 1) otherwise
Q(n, 1) =
{
n/B if n ≤M
O(1) + 2Q(n/2, 1) otherwise
Solving these both equations, we get T (n, 1) = O(n) and Q(n, 1) = O(n/B).
Say K = n/p, where K ≥M
T ′′(n, p) =
{
n if n ≤ K
O(1) + T ′′(n/2, p/2) otherwise
Q(n, p) =
{
n/B if n ≤ K
O(1) + 2Q(n/2, p/2) otherwise
Solving these both equations, we get T ′′(n, p) = O(n/p+ log p) and Q(n, p) = O(n/B).
This concludes the proof of Theorem 2.3. 2
Lemma 2.4 Using p cores, a matrix of size m × n can be transposed in time O(mn/p + log p) with total
O(mn/B) cache misses given mn ≥Mp.
Proof: As given in [[13]], Using one processor, we can transpose any matrix of size m × n in O(mn)
time and O(mn/B) cache misses. We will implement the same algorithm on multi-core model with slight
variations to bound block misses. Given that matrix-es are saved in row major layout, the algorithm can
be described with this simple recurrence relation :
T (m,n) =

O(1) if Max(m,n) ≤ c
O(1) + 2T (m,n/2) if n ≥ m
O(1) + 2T (m/2, n) otherwise
where c is some constant.
7
1. If both m and n are constant, transpose it directly using some brute-force approach.
2. Else, divide the whole matrix into two equal halves according to larger dimension.
Implementing with p processors : Say K = mn/p and as given K ≥M .
T ′′(m,n, p) =

O(mn) if mn ≤ K
O(1) + T ′′(m,n/2, p/2) if n ≥ m
O(1) + T ′′(m/2, n, p/2) otherwise
which gives us O(mn/p+ log p) parallel time.
After every level, both, size of problem and one of the dimension (row or column) is divided by two. As
we keep on dividing the problem, at some level, mn becomes ≤M and whole problem can fits into cache.
Because of dividing nature of this algorithm, at that time one of the dimension will be >
√
M/2 and other
will be ≥ √M .
Processor Allocation can be done using same method as explained in Prefix Sum Computation (Lemma
2.3) without any extra cost.
Cache Misses :
Q(m,n, p) =

O(mn/B) if mn ≤ K
O(1) + 2Q(m,n/2, p/2) if n > m/4
O(1) + 2Q(m/2, n, p/2) otherwise
which comes to Q(m,n, p) = O(mn/B). We have changed condition from n ≥ m to n > m/4 to make
sure that when every processor gets its individual problem, no. of rows are ≥ no. of columns in that
sub-problem. This will help us to bound block misses.
Block Misses : After dividing given problem into enough sub-problems such that every core get its
individual sub-problem, say of size x× y, where we know that xy = mn/p ≥M .
Depending on m we have two cases:
Case I m ≥ B
Because of dividing conditions of this algorithm, we can claim that x ≥ √M . While writing transpose
of matrix x × y in output, processor will be writing x contiguous elements y times which will incur
zero block misses, if x ≥ B.
Case II m < B
When dividing given problem into sub-problems, we will always divide the matrix by column side
Figure 2: Each core gets matrix of size m× n/p
till every processor get one sub-problem. Every processor will have a sub-problem of size m × n/p
8
which is ≥M as shown in figure 2. Given that these matrix are saved in row major form, it can be
easily seen that keys given to every core will be stored in contiguous locations in output matrix as
shown in figure 3. And every processor have ≥M keys, so we will not have any block misses.
Figure 3: Each core gets write mn/p contiguous elements
This concludes the proof of theorem 2.4. 2
Lemma 2.5 Given n keys, rank of any key can be found in time O
(
n
p + log p
)
with total O(n/B) cache
misses using p cores, given n ≥ Bp.
Proof: The whole task is divided into p processors such that every processor will find rank of given key
among dn/pe keys. It will take dn/pe time and total
(
n
pB × p+ p
)
= (n/B + p) cache misses across all p
cores.
Ranks found by all p cores are added to get the final rank of key among all n keys. For addition, we will
use only p2/n cores(to avoid block misses). As p ≤ n is given, so p2/n will always be ≤ p.
Using p2/n cores, p elements can be added in O(n/p+log p) time and O(p/B) cache misses, given p ≥ Bp2/n
or n ≥ Bp (Theorem 2.2.) Total cache misses are bounded by O(n/B), as p ≤ n. 2
Lemma 2.6 Using p cores, n elements can be written in contiguous locations in O(n/p) time and O(n)
cache misses, if n ≥ Bp.
Proof: The crux of this step is to avoid block misses. For that reason we make sure that every processor
gets more than B contiguous locations to write its keys. This lemma is generally used in cases when
processors work on disjoint problems, get some output and finally write all those outputs at contiguous
locations.
Every processor is assigned an equal number of keys to write. The first processor reads the first dn/pe keys
from the input and writes them in the first n/p locations of the output. Similarly the ith processor reads
the ith block of n/p keys from the input and writes in contiguous locations starting from (n/p×(i−1)+1)th
location of the output. It will take dn/pe time and n cache misses to read and write all keys. Total block
misses are zero if n/p ≥ B. 2
2.1 Brute-force Sorting
Lemma 2.7 Using Brute-force approach, n keys can be sorted in O
(
n2
p + log p
)
time using p cores with
total cache misses O(n2/B + n), given n2 ≥ Bp.
Proof: To sort these n keys, rank of every key is found in parallel using (p/n) processors.
It will take O
(
n2
p +log p
)
time and total n×
(
n
B +
p
n
)
=
(
n2
B +p
)
cache misses across all p cores (Lemma
9
2.5). We have one output array B of length n where every input key will be written at position given by
its rank. To avoid block misses, use the following strategy.
Create one more output array A of length n2 and write the ith key at the location n · Ri of A (where Ri
represents rank of ith key.) This is done in dn/pe phases, by writing p keys in every phase. In the first
phase, all those cores whose keys have ranks cn/p (where c is 0 to n− 1), will write their keys and rest of
the cores will remain idle. It will take O(1) time, p cache misses and zero block misses if n2 ≥ Bp. In the
next phase, all those cores whose keys have ranks cn/p+ 1, (where c is 0 to n− 1) will write their keys. In
total, we will have total of n/p such phases and a total of O(n) cache misses across all phases.
Finally, array A will contain all n input keys spread-ed in n2 locations in sorted order. Using Lemma 2.6,
we can write these n keys at continuous location using dp/ne processors in O(n2/p) time and O(n) cache
misses, if n ≥ Bp/n or n2 ≥ Bp. 2
Corollary 2.1 Using brute-force approach, n keys can be sorted in time O
(
n2
p +log p
)
with total O(n2/B)
cache misses using p cores, given n2 ≥ Bp and n ≥ B.
Proof: This is a simple instance of lemma 2.7 when n ≥ B. 2
2.2 Sampling
Lemma 2.8 Given a set A of n input keys and by using p cores, where n ≥ Mp, a set S of size n1/x
(where x ≥ 4) can be chosen from A in O(n/p+ log p) time and O(n/B) cache misses such that S satisfies
the following property.
Suppose t0, . . . , tf+1 is one of the longest sorted subsequence in sorted A such that t0, tf+1 ∈ S but
t1, t2, . . . , tf /∈ S.
Then Pr(f > (1 + t−1/6)n1−1/x = (t1/22(t1/2))−1, where t =
√
n/n1/x.
Proof: This is done using the sampling technique given by Reif and Valiant[15].
1. Choose another set S∗ of size n1/2 from A randomly. The probability for any element being chosen
in S∗ is equal. To choose S∗, imagine that input has been divided into n1/2 contiguous chunks all of
equal size and one key is selected randomly from every chunk. It will take (d√n/pe) parallel steps
and total n1/2 cache misses which is bounded by n/B if n ≥ B2 or n ≥M and m ≥ B2.
2. Using Lemma 2.1, S∗ can be sorted in time O
(
n
p +log p
)
using p cores with total O( nB ) cache misses
if n ≥ Bp and √n ≥ B which is true as n ≥Mp is given.
3. To get required set S of size n1/x, choose 1(
√
n/n1/x)th, 2(
√
n/n1/x)th, . . . ,
√
n
th
element from S∗. It
will take maximum (dn1/x/pe) parallel steps and total n1/x cache misses which is bounded by n/B if
n ≥ B2.
4. Write these dn1/xe elements at contiguous locations using dpn1/x/ne cores in dn/pe time and dn/Be
cache misses.
From the results in [15], we can claim that Pr(f > (1+t−1/6)n1−1/x = (t1/22(t1/2))−1, where t =
√
n/n1/x. 2
3 Randomized Divide and Conquer
Our basic framework for randomized divide-and-conquer is exemplified by Resichuk’s [14] parallel sorting
algorithm. We are given n keys in beginning A = (k1, k2, . . . , kn). The basic steps of algorithm are :
10
1. (Random Sampling to divide) Choose a subset S of size
√
n from A randomly.
2. Sort S by comparing every pair of keys in S.
3. (Partitioning Step) Using S, divide A into
√
n intervals B0, B1, . . . , B√n where Bi contains those
keys that are bigger than ith smallest but not bigger than (i+ 1)th smallest element of S.
4. (recursive call) Sort each box recursively independent of other boxes.
The above algorithm runs in O(log n) steps with high probability in an n processors CRCW PRAM model
[14].
3.1 Adaptation into Multicore
We now present a variation of the above algorithm on Multi-core cache oblivious model to achieve the
same kind of bounds as the PRAM model. We will use the following notations for the various performance
measures:
T : total time (total operations) T¯ : expected time T ′′ : parallel time T¯ ′′ : expected parallel time
Q : total cache misses Q¯: expected cache misses.
Theorem 3.1 Using p cores, we can sort n general keys in expected time O(np log + log n log log n) and
expected cache misses O( nB logM n), assuming M ≥ B2+α where α = 2/31 and p ≤ nM .
Proof: We are given a set A of n keys (k1, k2, . . . , kn) in shared memory. We have p cores each with cache
size of M .
Our algorithm divides the problem into disjoint subproblems recursively until we have the number of
subproblems matching the number of processors. Every subproblem is assigned processors according to its
size so that (n/p) ratio is same for every sub-problem. As all the subproblems at same level have nearly
equal sizes, so when problem size becomes lower than this ratio then it will mean that we have enough
number of subproblems. (Note that we know p.)
We represent the initial input size as N and total processors given as P . After every step we check whether
problem size ≤ (N/P ) if yes, each subproblem is assigned to an individual core otherwise partition it
further into smaller subproblems.
The algorithms is executed in following steps :
Figure 4: The overall recursive partitioning scheme
11
1. If the problem size ≤ (N/P ) then solve this problem on one processor sequentially.
Note that n keys can be sorted using one processor in time O(n log n) with O( nB logM n) cache misses ([13])
2. We choose a set S of size dn1/xe, where x ≥ 4 from A, using Lemma 2.8. This task can be done in O(n/p+log p)
time and O(n/B) cache misses using p cores, given n ≥ Mp. We will decide the value of x later during the
analysis.
3. Using S, divide A into |S| buckets B1, B2, . . . , B|S| where Bi contains those keys that are greater than ith
smallest but not greater than (i+ 1)th smallest element of S.
In other words, we have max {x | x ∈ Bi} ≤ Si ≤ min {x | x ∈ Bi+1}.
For this step we can follow the same approach as used in Reischuk sorting [14], where we assign every key to
one processor and let it find the bucket number of this key using binary search on given |S| bucket indexes.
Then we do integer sorting4 to move all the elements to the right buckets. But even binary search step could
cost us n log |S| cache misses where our target is nB logM |S| cache misses. To avoid this we use the following
strategy
Using Theorem 5.1, we can do this task using p cores in time O(np log n + log n log log n) with a total of
O( nB logM n) cache misses, provided |S| ≤
√
n, p ≤ nM and n ≥ B2/(1−logn |S|)p. As |S| = n1/x, this condition
can be reduced to n ≥ B2/(1−1/x)p or n ≥ B2B2/(x−1)p.
4. The original problem has been divided into dn1/xe subproblems where each subproblem can be solved inde-
pendently. We claim that the size of largest subproblem will be ≤ (1 + t−1/6)n1−1/x with very high prob-
ability (which is 1 − 1/(t1/22(t1/2)), using Lemma 2.8, where t = √n/n1/x. Therefore, a subproblem size
is ≤ (1 + n(2−x)/12x)n1−1/x w.h.p. otherwise, we repeat the partitioning step again starting from step two.
Distribute processors between subproblems according to their sizes and these subproblems are solved again
recursively using this same procedure. For the n1/x subproblems, processors are assigned to a problem such
that the (n/p) ratio is same. For example, if the size of the ith problem is ni then it gets pi = pni/n processors.
We will do prefix computation for the processor allocation. This takes O
(
n
p + log p
)
time using pn
1/x
n cores
with total cache misses O(n1/x/B), given n1/x ≥M
(
pn
1/x
n
)
or n ≥Mp. This concludes step 4.
Recall that to achieve the claimed bounds in steps 2 to 4, we require np ≥ max{M,B2B2/(x−1)}.
3.2 Detailed Analysis
We represent the initial input size as N and the total number of processors given as P . As ratio n/p remains
same throughout the algorithm, so if it is given that NP ≥ max{M,B2B2/(x−1)}, condition given above will
always be satisfied till the problem size ≥ (N/P ). When problem size becomes N/P , it means we have enough
number of sub-problems so that we do not need to run steps 2 to 4, we solve the problem directly using step 1.
We can summarize the performance of our algorithm as :
Given, NP ≥ max{M,B2B2/(x−1)}, say K = N/P, let r be the probability of bad case (when size of largest
sub-problem is ≥ (1 + n(2−x)/12x)n1−1/x), x > 2 and P ≤ N . The expected parallel time satisfies
T¯ ′′(n) =
O
(
n
p log n+ log n log log n
)
+ (1− r)T¯ ′′(ni) + rT¯ ′′(n) if n ≥ K
O(K logK) otherwise
(1)
For n ≥ 218, we can choose x s.t. ni ≤ n1−1/2x. Since r ≤ 1/n, the previous recurrence can be rewritten as
T¯ ′′(n) =
{
O(K log n+ log n log log n) + T¯ ′′(ni) if n ≥ K
O(K logK) otherwise
(2)
It follows that T¯ ′′(n) = O
(
xnp log n + x log n log log n
)
implying T¯ ′′(n) = O
(
n
p log n + log n log log n
)
. We
have the following two cases depending on number of processors.
Given n ≥Mp and n ≥ B2B2/(x−1)p where T¯ ′′(n) = O(np log n+ log n log log n).
4also known as semi-sorting
12
(a) If M ≥ log log n or B2B2/(x−1) ≥ log log n then np log n ≥ log n log log n
so T¯ ′′(n) = O(np log n).
(b) If p ≤ nlog logn then np log n ≥ log n log log n
so T¯ ′′(n) = O(np log n).
Likewise the total expected cache misses for all p processors
Q¯(n) =

O( nB logM n) +
n1/x∑
i=1
Q¯(ni) if n ≥ K
O(KB logM K) otherwise
(3)
This yields Q¯(ni) = O(
ni
B logM ni) +
n
1/x
i∑
i=1
Q¯(nii).
Since ni ≤ n1−1/32x and
n1/x∑
i=1
ni = n,
n1/x∑
i=1
O(
ni
B
logM ni) ≤ O(
n
B
logM n
(1−1/32x)). It follows that Q¯(n) =
O(x nB logM n) implying Q¯(n) = O(
n
B logM n).
Notice that the bounds on time and caches misses are expected over the choice of the random sample. We
have chosen x = 32, so if we assume that M ≥ B2B2/31 then the condition NP ≥ max{M,B2B2/(x−1)} can be
simplified to N ≥MP. This concludes the proof of Theorem 3.1.
2
Remark : Using the analysis in [15, 17], we can obtain high probability bounds for parallel time and cache
misses.
In the next subsections, we sketch some details of the individual steps.
4 Merging algorithm
Figure 5: Each core write contiguous xy/p elements in output
13
4.0.1 Basic structure of the Merging algorithm
Given x lists each of size y and all divided into m buckets, merge all these lists to get one single list divided
into m buckets.
Let Bij represent the i
th list and jth bucket where 1 ≤ i ≤ x and 1 ≤ j ≤ m. As shown in Figure 5,
initially we are given input in form of B11, B12, . . . , B1m, B21, B22, . . . , B2m, . . . , Bx1, Bx2, . . . , Bxm
and p processors. We will divide the input into p parts such that each core will get to write equal number
of keys and in a manner that in the output, the first processor will write the first xyp keys, the second
processor will write next xyp keys and so on. These processors may have to read their keys from different
locations but while writing, they will write contiguously. To find the keys, every processor will need the
information about the size of every bucket of every list. This will be done using prefix sum computation.
So whole task can be divided into three steps :
1. Every core needs to get the information about xy/p keys it must write. This is done using prefix
sum computation.
2. Every core will read exactly xy/p keys using the information generated in first step.
3. Every core will write its xy/p keys in contiguous locations.
4.0.2 Detailed explanation of Merging
Theorem 4.1 Given x lists of total size y, that are indexed by t buckets, we can merge these lists into a
single list partitioned into t contiguous buckets in time O
(
y
p+log p
)
using p processors incurring O( yB +xt)
cache misses, provided y ≥Mp and y ≥ xt.
Proof: Let Tmerge(x, y, t) represent the total time taken and Qmerge(x, y, t) represent the total number of
cache misses to merge x lists of total size y and all divided into t buckets using p processors.
1. Every core needs to know all buckets size of every list to find y/p keys it has to write. There are a
total of xt such sizes. This information can be processed by first transposing given array of size x× t
and then doing prefix sum computation on these xt elements.
Using (xpt/y) cores, both of these tasks can be done in time O(y/p + log p) and O(xt/B) cache
misses, given xt ≥M(xpt/y) or y ≥Mp (Lemma 2.4 and 2.3). We have enough cores for this task if
y ≥ xt.
2. From the information found in Step 1, we need those sizes which have values y/p, 2y/p, . . . , py/p.
Every core is assigned a task to find these sizes from xt/p sizes. Create an output array of length
y, and if the size cy/p is found, it will be written at location cy/p of this output array. For reading
and searching, it will take dxt/pe time and dxtB + pe cache misses. For writing, the total time will be
dxt/pe, and the total cache misses are p (and zero block misses if y/p ≥ B).
3. These p points found in Step 2 are written at contiguous locations in dy/pe time with O(p) cache
misses, using p2/y cores, given p ≥ Bp2/y or y ≥ Bp (Lemma 2.6).
4. Every core will read exactly y/p keys which will take dy/pe time.
Now we will bound total number of cache misses to read these y/p keys. All the processors will read
their input keys sequentially unless two event happens as explained below :
(a) The bucket that core was reading, ends. So in this case the processor may have to go to next
list or next bucket which can increase cache miss count by at most one over sequential misses
and this can happen only xt times.
14
(b) Processors may have to start or end reading in the middle of some block which can increase
cache miss count by at most by one or two over sequential misses. This can happen a maximum
of p times.
The total cache misses can be bounded by O(y/B + xt+ p) which is O(y/B + xt) for p ≤ y/B..
5. Every core write its y/p keys sequentially that leads to a total of O(y/B) cache misses. As yB ≥ p,
every core has more than one block to write and since every core is writing sequentially so there will
not be any block misses.
This concludes the proof of Theorem 4.1 . 2
The following is a simple corollary of the above Theorem.
Corollary 4.1 Given x lists, each of size y, that are partitioned into t buckets , we can merge these lists
into a single list partitioned into t buckets in time O
(
xy
p + log p
)
using p processors with total O(xyB + xt)
cache misses, given xy ≥Mp and y ≥ t.
Corollary 4.2 Given x lists, of total size Y , we can merge these lists into a single list in time O
(
Y
p +log p
)
using p processors incurring a total of O(YB + x) cache misses, given Y ≥Mp.
Proof: By merging, we imply writing the input lists in contiguous locations without changing the order of
elements, i.e., L1, L2, L3, . . . , Lx. This Lemma is used in cases when different processors work on different
problems and generate some output, and we need to write the output in a sequence. So every list can be
thought of as one bucket and the final list is also one list divided into one bucket. This can be done by
using Theorem 4.1. Given x lists of total size Y , we can merge these lists into a single list (divided into
one bucket) in time O
(
Y
p + log p
)
using p processors incurring a total of O(YB + x) cache misses, given
Y ≥ Mp and Y ≥ x. Note that Y is the total size of x lists and every list has at least one element so
Y ≥ x. 2
5 Dividing input into buckets
5.0.3 Basic structure of algorithm
Here we give a brief description about algorithm and later follow up with detailed time and cache analysis.
Given n keys and m bucket indexes X1, X2, . . . , Xm (given in sorted order), divide n keys into m buckets
B1, B2, . . . , Bm such that Xi ≤ {x | x ∈ Bi} ≤ Xi+1.
Let T (n,m) represents the total time to divide n keys into m buckets.
First we divide n keys into
√
m buckets with splitters (X√m, X2√m, . . . , X√m√m). For every i, 1 ≤ i ≤
√
m,
Bi is again divided into Bi1, Bi2, . . . , Bi
√
m as shown in figure 6 :
1. Divide the n keys into
√
m buckets. This is done in two steps.
(a) Divide n keys into
√
n contiguous chunks of size
√
n each. Now each chunk of size
√
n is divided
into
√
m buckets recursively.
(b) We get
√
n lists, each divided into
√
m buckets. Merge these lists to get a single list divided
into
√
m buckets. The merging algorithm is explained in the appendix.
T (n,
√
m) =
√
nT (
√
n,
√
m) + Tmerge(
√
n,
√
n,
√
m)
15
Figure 6: As shown for Bi, all buckets are divided into
√
m buckets
where Tmerge(
√
n,
√
n,
√
m) represents time needed to merge
√
n lists, each of size
√
n and all
divided into
√
m buckets (explained in section 4) .
2. Now we have
√
m buckets n1,n2,. . .,n√m. Each bucket is again divided into
√
m buckets.
T (n,m) = T (n,
√
m) +
m1/2∑
i=1
T (ni,
√
m)
We follow the same approach as done in step 1 above with some modifications. The following steps
are done for every bucket :
(a) Divide bucket into contiguous chunks of size
√
n. All these chunks(except last) will be of size√
n and last chunk will have size ≤ √n.
(b) All these chunks (except the last) are divided into
√
m buckets recursively.
(c) The last chunk is divided into
√
m buckets directly with the help of binary search and sorting.
We will explain this step in detail later.
(d) We have divided all these chunks in
√
m buckets. Using our merge procedure, merge all these
chunks to get one list divided into
√
m buckets. This concludes step 2.
Figure 4 gives a high level structure of this scheme. Finally, we obtain the following recurrence for the
parallel time.
T ′′(n,m) = O(n/p+ log p) + 2T ′′(
√
n,
√
m).
5.0.4 Detailed analysis
Lemma 5.1 Using one processor, n keys can be divided into x buckets in O(n log n) time with O( nB logM n)
cache misses, given n ≥M and x ≤ n .
Proof: This problem can be easily reduced to sorting, so it will take the same time as that of sorting.
We can sort n keys in time O(n log n) with O( nB logM n) cache misses using one processor ([13]). After
sorting, input is divided according to the bucket indexes. We just need to calculate the size of each bucket.
Assuming bucket indexes are also given sorted. Start from the first bucket index and compare it with every
key from the input and whenever it becomes larger than this bucket index we switch to the next bucket
index and start comparing it with next key from input. It will read both the input and the bucket indexes
contiguously from start. For x ≤ n, the total cache misses for this step are bounded by O(n/B). 2
16
Lemma 5.2 Using p cores, n keys can be divided into x buckets in O
(
n2
p + log p
)
time with O(n2/B)
cache misses, given n2 ≥Mp and x ≤ n.
Proof: We will solve this problem using the same approach as used in Lemma 5.1.
Sort n keys in O
(
n2
p + log p
)
time with total O(n2/B + n) cache misses, given n2 ≥ Bp (Corollary 2.1).
The number of cache misses are bounded by O(n2/B) if n ≥ B (follows from n ≥ √M and M ≥ B2).
After sorting, input keys has been divided according to buckets. To calculate the size of each bucket we
assign every bucket index to one processor; say processor pi is assigned i
th bucket index. pi does binary
search to find the size of its bucket on n keys.
To do this, p processors will take O(x lognp ) time and O(xn/B) cache misses. Given x ≤ n, the time and
cache misses can be bounded by O(n
2
p ) and O(n
2/B) respectively. 2
Lemma 5.3 Using p cores, n keys can be divided into y buckets in O
(
n3/2
p + log p
)
time with total
O(n
3/2
B + y
√
n) cache misses, given n ≥ (Mp)2/3 and y ≤ √n.
Proof: Let T (a, b) represent the total time to divide a keys into b buckets. The whole algorithm can be
described by the following recurrence relation.
T (n, y) =
√
nT (
√
n, y) + Tmerge(
√
n,
√
n, y)
where Tmerge(
√
n,
√
n, y) represents time needed to merge
√
n lists, each of size
√
n and divided into y
buckets (explained in section 4) .
1. Divide n keys into
√
n contiguous chunks, each of size
√
n. All of these chunks are divided into
√
y
buckets in parallel. Every T (
√
n, y) call is assigned to p/
√
n processors and is solved using Lemma
5.2. It will take O
(
n3/2
p + log p
)
time and O(n/B) cache misses if n ≥ M(p/√n) or n ≥ (Mp)2/3
and y ≤ √n. Total cache misses for √n calls will be O(n3/2/B) and the time is O
(
n3/2
p + log p
)
.
2. From Step 1, we get
√
n lists, each divided into y buckets. Merge these
√
n lists to get a single list
divided into y buckets. Only p/
√
n processors are used in this step to avoid block misses.
Using p/
√
n processors, merging will take O
(
n3/2
p + log p
)
time with O(n/B + y
√
n) cache misses,
given n ≥ (Mp/√n) or n ≥ (Mp)2/3 and y ≤ √n (Corollary 4.1).
2
Corollary 5.1 Using (p/n1/4) cores, n1/2 keys can be divided into y buckets in O
(
n
p + log p
)
time with
total O(n3/4/B + y 4
√
n) cache misses, given n ≥Mp and y ≤ 4√n.
Proof: This is an instance of Lemma 5.3 with
√
n input and p/n1/4 cores which lead to total O(n3/4/B+
y 4
√
n) cache misses. To satisfy conditions given in Lemma 5.3, we need
√
n ≥ (Mp/n1/4)2/3 or n ≥ Mp
and y ≤ 4√n. 2
Theorem 5.1 We are given an array A containing n input keys and an array S containing z bucket indexes
(stored in contiguous locations). We can partition A into z buckets B1 , B2 , . . . , Bz such that for
i = 1, 2,. . . z, we have max {x | x ∈ Bi} ≤ Si ≤ min {x | x ∈ Bi+1}, using p processors in time O(np log n+
log n log logn) and a total of O( nB logM n) cache misses, given z ≤
√
n and n ≥max{Mp,B2/(1−logn z)p}.
17
Proof: Let T (a, b) represent the total time, T ′′(a, b) represent the parallel time and Q(a, b) represent the
total cache misses to divide a keys into b buckets.
All the processors work together to divide the problem into smaller size but when no. of sub-problems
becomes more than the no. of processors then every core can have one problem and solve it independently
using Lemma 5.1. During recursion, all the sub-problems at the same level have the same size. Every
sub-problem is assigned processors according to its size so that (n/p) ratio is same for every sub-problem
at any level. When problem size becomes lower than this ratio then it will mean that we have enough
number of sub-problems to solve them on one processor individually.
Let initial input size be represented by N and the total number of processors by P. It is assumed that
N ≥MP . The whole task can be divided into these steps.
1. If problem size ≤ (N/P ) then solve this problem on one processor otherwise break this problem into
smaller sub-problems using the following steps.
Using Lemma 5.1, n keys can be divided into z buckets using one processor in time O(n log n) with
O( nB logM n) cache misses, given z ≤ n.
2. Choose
√
z
th
, 2
√
z
th
, 3
√
z
th
, . . . ,
√
z
√
z
th
element from the given z splitters. We obtain a total of
√
z
elements. First we divide input keys into these buckets and each bucket is divided further into
√
z
buckets. For this step, it will take maximum (dz1/2/pe) parallel steps and one cache miss for every
chosen key. Total cache misses will be z1/2 and can be bounded by n/B as z ≤ √n and n ≥ B2.
3. Write these dz1/2e elements at contiguous locations using dpz1/2/ne cores in dn/pe time, dn/Be cache
misses and zero block misses, if z1/2 ≥ Bpz1/2/n or n ≥ Bp.
4. Divide the n keys into these
√
z buckets as given as follows :
T ′′(n,
√
z) = T (
√
n,
√
z) + T ′′merge(
√
n,
√
n,
√
z)
Q(n,
√
z) =
√
nQ(
√
n,
√
z) +Qmerge(
√
n,
√
n,
√
z)
(a) Divide whole input array into
√
n contiguous sub-arrays of size
√
n. Recursively divide each
sub-array of size
√
n into
√
z buckets.
(b) We get
√
n lists, all divided into
√
z buckets. Using Corollary 4.1, merge these lists into a single
list divided into
√
z buckets. Given, n ≥ Mp and z ≤ n.
we get T ′′(n,
√
z) = T ′′(
√
n,
√
z) +O
(
n
p
+ log p
)
(4)
Similarly for cache misses Q(n,
√
z) =
√
nQ(
√
n,
√
z) +O(n/B +
√
nz)
5. The entire input has been divided into
√
z buckets and we need to divide each bucket again into
√
z
buckets.
T (n, z) = T (n,
√
z) +
z1/2∑
i=1
T (ni,
√
z) (5)
Break this T (ni,
√
z) in chunks of T (
√
n,
√
z) and then merge these chunks using Corollary 4.1. ni
can be broken as ni = xi
√
n+ yi where yi <
√
n.
z1/2∑
i=1
xi ≤
√
n and yi < n
1/2
T (ni,
√
z) = xiT (
√
n,
√
z) + Tmerge(xi,
√
n,
√
z) + T (
√
n,
√
z)
18
Summing this for all z1/2 problems, we get
z1/2∑
i=1
T (ni,
√
z) =
√
nT (
√
n,
√
z) +
z1/2∑
i=1
Tmerge(xi,
√
n,
√
z) + z1/2T (n1/2,
√
z) (6)
We have total p processors and z1/2 calls to T (n1/2,
√
z). Assign p/n1/4 processors to each call and
solve it using Corollary 5.1. As z ≤ √n so we have enough processors for z1/2 calls. Each such
call will take O
(
n
p + log p
)
time and O(n3/4/B +
√
z 4
√
n) cache misses if
√
z ≤ 4√n or z ≤ √n and
n ≥Mp. For all n1/4 calls, total cache misses will be O(n/B +√nz).
Combining equation (4),(5) and (6), we get
T ′′(n, z) = 2T ′′(
√
n,
√
z) +O
(
n
p
+ log p
)
+ T ′′merge(xi,
√
n,
√
z) (7)
Similarly for cache misses.
Q(n, z) = 2
√
nQ(
√
n,
√
z) +O(n/B +
√
nz) +
z1/2∑
i=1
Qmerge(xi,
√
n,
√
z) (8)
Assign pi processors to every bucket ni such that pi = pni/n or to be precise pi =
pxi
√
n
n =⇒ xi
√
n
pi
= np .
Using Corollary 4.1
If xi
√
n ≥Mpi then T ′′merge(xi,
√
n,
√
z) = O
(
xi
√
n
pi
+ log pi
)
(9)
and xi
√
n ≥Mpi if n ≥Mp and z ≤ n. As pi ≤ p, so equation (9) can be reduced to
T ′′merge(xi,
√
n,
√
z)= O
(
n
p + log p
)
For cache misses, Qmerge(xi,
√
n,
√
z)=O(xi
√
n
B + xi
√
z) (Corollary 4.1)
which gives
z1/2∑
i=1
MergeQ(xi,
√
n, pi) =⇒
z1/2∑
i=1
O(
xi
√
n
B
+ xi
√
z)
=⇒ O(
√
n
B +
√
z)
z1/2∑
i=1
xi =⇒ O(n/B +
√
nz).
Finally equation (7) can be rewritten as
T ′′(n, z) = 2T ′′(
√
n,
√
z) +O
(
n
p
+ log p
)
Likewise for cache misses, using n ≥Mp, equation (8) can be reduced to
Q(n,
√
n) = 2
√
nQ(
√
n,
√
z) +O(n/B +
√
nz)
The number of cache misses O(n/B +
√
nz) can be bounded by O(n/B) if
√
nz ≤ n/B or n ≥ zB2.
Say z = n1/t ⇒ t = logz n. Condition n ≥ zB2 can be further reduced to n1−1/t ≥ B2 ⇒ n ≥
B
2
1−1/t ⇒ n ≥ B2/(1−logn z)
Finally, to complete all the steps of algorithm in claimed bounds, we need n ≥ max{Mp,Bq}, where
19
q = 2/(1 − logn z). As mentioned earlier, ratio (n/p) remains same throughout the recursion proce-
dure and the same is true for q also. Since after every iteration both n and z are reduced to
√
n,
√
z
and logn z remain same, which means q will also remain same through the recursion.
If it is given that N ≥ max{MP,BqP}, condition given above (n ≥ max{Mp,Bq}) is true till prob-
lem size ≥ (N/P ) and when problem size becomes N/P , it means we have enough problems such
that every processor can be assigned one problem.
Processor Allocation : During every iteration of this algorithm, one problem of size n is divided
into
√
n sub-problems. Processors are assigned to every sub-problem such that (n/p) ratio is same for
every sub-problem. The ith sub-problem with size ni gets pi processors such that pi = pni/n, where
n1/2∑
i=1
ni = n. After doing prefix sum computation on sizes of all n
1/2 sub-problems, every processor
can be assigned to its respective sub-problem. It will take O
(
n
p + log p
)
time using p/
√
n cores with
total cache misses O(
√
n/B), given n ≥Mp.
The whole algorithm can be summarized as :
if N ≥ max{MP,BqP}, where q = 2/(1− logn z).
T ′′(n, z) =
O
(
n
p + log p
)
+ 2T ′′(
√
n,
√
z) if n ≥ N/P
O(NP log
N
P ) otherwise
(10)
The total number of cache misses for all p processors
Q(n, z) =
{
O(n/B) + 2
√
nQ(
√
n,
√
z) if n ≥ N/P
O( NBP logM
N
P ) otherwise
(11)
Assume NP = K, where K will be constant throughout the algorithm as explained before
T ′′(n, z) =
{
a(K + log n) + 2T ′′(
√
n,
√
z) if n > K
bK logK otherwise
We get, T ′′(n, z) = a(K + log n) + a(2K + log n)+ . . .+ a(2xK + log n) + b(2x+1K logK)
=⇒ T ′′(n, z) = 2x+1aK + ax log n+ 2x+1bK logK
Using (n)1/2
x ≥ K =⇒ 2x ≤
(
logn
logK
)
We get, T ′′(n, z) = O(K lognlogK + log
(
logn
logK
)
log n+K log n), which comes to
T ′′(n, z) = O(K log n+ log n log log n) Or O(np log n+ log n log logn).
The total cache misses for all the p processors
Q(n, z) =
{
an/B + 2
√
nQ(
√
n,
√
z) if n > K
bKB logM K. otherwise
(12)
Q(n, z) = a( nB ) + 2a(
n
B ) + 4a(
n
B ) + . . .+ 2
xa( nB ) + 2
x+1 n
K a(
k
B logM K)
=⇒ Q(n, z) = 2x+1a( nB ) + 2x+1a( nB logM K)
20
Using (n)1/2
x ≥ K =⇒ 2x ≤
(
logn
logK
)
We get, Q(n, z) = O( nB logK n+
n
B logM n) = O(
n
B logM n) if K ≥M or n ≥Mp.
2
5.1 Multi-search
Lemma 5.4 We can simultaneously search for n elements in a sorted set of m elements using p pro-
cessors in O(np log n + log n log logn) time with total O(
n
B logM n) cache misses, given m ≤
√
n and
n ≥max{Mp,B2/(1−lognm)p}.
Proof: This follows directly from Theorem 5.1. The basic approach goes back to Reif and Sen [18].
The condition n ≥ B2/(1−lognm)p can be simplified to n ≥ B4p when m = √n. As the value of m decreases,
this condition becomes progressively weaker. For m = n1/32, condition is reduced to n ≥ B(2+2/31)p. 2
6 Convex Hull construction
An object is convex if for every pair of points within the object, the straight line segment that joins them
lies completely within the object.
Given a set of points P , the convex hull CH(P ) is the smallest convex object which contains P .
Because of close relationship between convex hull and sorting, we will try to prove the same time bound and
cache cost for convex hull problem as our sorting algorithm. Our algorithm is based on the algorithm given
in Reif and Sen [17] and our description follows the randomized divide-and-conquer strategy described
in [19]. It has been known that the convex hull problem can be reduced to the problem of finding the
intersection of half-planes (under a dual transform). So we will solve the problem of finding the intersection
of half-planes as it is relatively easier to divide it into disjoint sub-problems.
7 Basic Algorithms
Before we describe the main algorithm, we briefly describe some of the basic subroutines that are used
frequently.
7.1 2-D Maxima Problem
A point p = (px, py) dominates a point q = (qx, qy) if and only if px > qx and py > qy. The maxima
problem can be defined as : Given a set of points, to report the maximal points within the set i.e. to report
all the points which are not dominated by any other point in this set.
This problem can be divided into following steps :
1. Sort the given points along x-axis to get a set T = {t1, t2, . . . , tn}.
2. It is obvious that tn is a maximal point. Start sweeping from tn towards left (in decreasing value of
x). The first point we encounter is pn−1. We already know that tn−1(x) < tn(x), so if tn−1(y) < tn(y)
then tn−1 is dominated by tn. Similarly, for the next point in left direction tn−2, we just need to
compare tn−2(y) with max{tn(y), tn−1(y)} to verify whether point tn−2 is maximal point or not.
In short, to check any point ti whether it is maximal or not, we just need to compare ti(y) with
max{ti+1(y), ti+2(y), . . . , tn(y)}. While sweeping to left, maintaining the value of the largest y-
coordinate till that point is enough for comparison.
21
Figure 7: (a) Input points (b) Maximal points form the stairs
Lemma 7.1 From a given set S of n points, maximal points can be found in O(n log n) time with total
O( nB logM n) cache misses, using one processor, given n ≥M .
Proof: The whole task can be divided into following two steps :
1. Sort given n keys w.r.t. x-axis in O(n log n) time with total O( nB logM n) cache misses using one
processor, given n ≥M. [[16]].
2. As explained above, this sorted set is read sequentially which will cost O(n/B) cache misses. We
need to maintain one max variable which will be accessed every time next element is read from input,
because of LRU replacement policy, it will not cost any extra cache miss.
2
7.2 Finding maximal points on p cores
Lemma 7.2 From a given set S of n points, maximal points can be found in expected O(np log n+log n log logn)
time with expected O( nB logM n) cache misses using p cores, given n ≥Mp and M ≥ B2B2/31.
Proof: The whole task can be divided into following steps :
1. Sort input points w.r.t. x co-ordinate in expected time O(np log n + log n log logn) with expected
O( nB logM n) cache misses using p cores, given n ≥Mp and M ≥ B2B2/31 (Theorem 3.1).
2. Assume input has been divided into p contiguous chunks k1, k2, . . . , kp, each of size n/p. Each core
is assigned one chunk to find maximum element say y, from it. To do so, every core will read these
n/p keys sequentially. This will take n/p time and total n/B + p cache misses across all cores. We
get total p elements.
3. Write these p elements at p contiguous locations using p2/n cores in n/p time with total p cache
misses, if p ≥ Bp2/n or n ≥ Bp (Lemma 2.6). Total cache misses will be O(n/B) if n ≥ Bp.
4. Do prefix computation using max operator on these p points found is Step 3. To avoid block misses
use only p2/n processors. Using p2/n cores, prefix computation on p keys can be done in O(n/p+log p)
time with total O(p/B) cache misses given p ≥Mp2/n or n ≥Mp (Lemma 2.3).
5. Every core is assigned a task to find maximal points from one chunk. It will take O(np log
n
p ) time
and total O( nB logM n) cache misses across all p cores, if n/p ≥M. (Lemma 7.1).
22
6. Every core have one output list of maximal points from its corresponding chunk. Store these lists at
contiguous locations. Using qp/n cores, we can join these p lists of total size q,(where q ≤ n) in time
O(n/p+ log p) time and O(q/B) cache misses, if q ≥Mqp/n or n ≥Mp (Theorem 4.2).
2
7.3 Dividing Convex hull problem into Upper hull & Lower hull
Lemma 7.3 Given a problem to calculate convex hull of n points, it can be divided into two disjoints
problem of finding Upper hull and Lower hull of size m and n−m respectively in O(n/p+ log p) time with
total O(n/B) cache misses using p cores, given n ≥Mp.
Proof: Divide the given problem into following simple steps :
1. Find the point with maximum x co-ordinate and minimum x co-ordinate [Lemma 2.1]. It will take
O(n/p+ log p) time and total O(n/B) cache misses, if n ≥ Bp.
2. We get two points p1 and p2. For every input point check whether it is on upper side of line p1p2
or lower side of line p1p2. Based on this information, input points can be divided into two disjoint
problem.
3. Every core reads n/p keys and divide them into two buckets. This will take dn/pe time and n/B+ p
cache misses. We will get p lists each of each of n/p size, all divided into two problems.
4. Using lemma 4.1, merge these lists into a single list divided into two buckets in O(n/p+ log p) time
and total O(n/B + 2p) cache misses, given n ≥ Mp and n/p ≥ 2 or n ≥ 2p. Total cache misses are
bounded by O(n/B) if n ≥Mp.
Whole algorithm can be summarized by this recurrence relation.
T (n, 2) = pT (n/p, 2) + Tmerge(p, n/p, 2)
2
7.4 Brute-force algorithm for intersection of half-planes
Corollary 7.1 Given two points p1, p2 and n half-planes, find number of half-planes for which p1 and p2
are on same side in O
(
n
p + log p
)
time and total O(n/B) cache misses, given n ≥ Bp.
Proof: This is simple instance of Lemma 2.5 where rank of point can be defined as the number of half-
planes for which both given points are on the same side. 2
Lemma 7.4 Intersection of n half-planes can be computed in O(n3/p+log p) time using p cores with total
O(n3/B) cache misses, given n3 ≥Mp and a point x which will be present in this intersection.
Proof: We will use a simple brute-force approach. We have n half-planes k1,k2,. . .,kn and a point x for
which we already know that it will be inside intersection. As n3 ≥ M and M ≥ B2, so we can say that
n2 ≥ B.
23
1. As n lines can have maximum n2 intersection points. We will simply check for each point whether
it is a vertex of convex hull or not. A point p is a vertex of convex hull if p and x both are on same
side of every half-plane For every point ki, where 1 ≤ i ≤ n2, find whether it is vertex of convex hull
or not, in parallel using p/n2 cores. Using lemma 7.1, each point will take O
(
n3
p + log p
)
time with
total O(n/B) cache misses, given n ≥ Bp/n2 or n3 ≥ Bp. Total cache misses for n2 points will be
O(n3/B).
2. If point ki is vertex of convex hull, then one of the processors which are working on k
th
n point will
write this point in output. Maximum n cores are ready to write one point each. We need to write
these n points in n contiguous locations. Create an output array A of length n3 and key ki will be
written at position n2ki in A. This will take O(n/p) time, n cache misses and zero block misses if
n2 ≥ B. n cache misses can be bounded by O(n3/B) if n2 ≥ B. We have written these n elements
in n3 locations.
3. Write these n points at n contiguous locations, in time n3/p with total n cache misses, using p/n2
cores if n ≥ Bp/n2 or n3 ≥ Bp (Lemma 2.6). Total cache misses are bounded by O(n3/B) if n2 ≥ B.
4. All these written points are vertices’s of convex hull but they may not be ordered. Divide these
n points into two disjoint problems Upper hull and Lower hull using left-most and rightmost point
(Lemma 7.3) with p/n2 cores in time O(n3/p + log p) with total O(n/B) cache misses, given n ≥
Mp/n2 or n3 ≥Mp.
5. Using lemma 2.7, sort both of these disjoint problems. We can sort n points along x-axis using n2/p
cores in O(n2/p + log p) time and O(n2/B + n) cache misses which can be bounded by O(n3/B) if
n2 ≥ B.
This concludes the lemma. 2
8 Randomized Algorithm For Convex Hull construction
The main result in [17] states that Given n half-planes in R3, and a point t inside their intersection, we
can find intersection of these half-planes in O(log n) expected time using n CREW PRAM processors. This
algorithm achieves the optimal speed up. The basic steps of this algorithm are :
1. Select O(log n) samples each of size bnc from K randomly, where 0 <  < 1.
2. Select a good sample using Polling.
3. Find intersection of half-planes in selected sample using any brute force method to get a convex hull
with points P1, P2, . . . , Px (where x ≤ .)
4. This convex polygon can be divided in x triangles (or sectors) of the form P1tP2, P2tP3, . . . , Px−1tPx, PxtP1,
where t is given point which is present inside this hull.
5. The original problem can be divided into x disjoint sub-problems, one sub-problem for each sector.
For every sector find the intersection of half-planes which intersect this sector.
6. As one half-plane can intersect many sectors so total size of all sub-problems may be much larger
than size of original problem. But because of Polling we can claim that total size of all sub-problems
will be O(n), and size of maximum sub-problem will be ≤ n1− log n.
7. Use a filtering scheme to reduce total size of all sub-problems from O(n) to 2n.
8. If size of sub-problem is very small then solve it directly using any brute-force algorithm otherwise
solve it recursively using this same procedure.
24
8.1 Adaptation to Multi-Core Model
We will use the algorithm explained above with slight modifications on Multi-core cache oblivious model
to achieve the same kind of bounds as the PRAM model.
Theorem 8.1 Using p cores, the intersection of n half-planes can be computed in expected time O(np log n+
log n log logn) incurring an expected O( nB logM n) cache misses, given n ≥Mp and M ≥ B2B2/7.
Proof: The whole algorithm can be divided into following steps:
1. From n half-planes, using polling5, choose a good sample of size n using p cores in expected
time O(np log n + log n log logn) with expected O(
n
B logM n) cache misses, given  ≤ 1/8, n ≥
max{Mp,B2/(1−4)p} and M ≥ B2B2/31 (Theorem 8.2).
2. Compute intersection of n half-planes in O(n/p+ log p) time using n3p/n cores incurring a total of
O(n3/B) cache misses, given n3 ≥Mn3p/n or n ≥Mp and a point O which will be present in this
intersection. (Lemma 7.4). No. of processors needed for this are ≤ p and total cache misses can be
bounded by O(n/B) if  ≤ 1/3.
3. We get a convex hull with points P1, P2, . . . , Px (where x ≤ n.) Divide this convex polygon into x
triangles (or sectors) of the form P1OP2, P2OP3, . . . , Px−1OPx, PxOP1, where O is some given point
which is present inside this hull.
4. Divide the original problem into x = O(n) disjoint sub-problems.
Using p processors, divide n half-planes into n disjoint sub-problems in expected O(np log n +
log n log logn) time with expected O( nB logM n) cache misses , if n ≥ n(x+2), x ≥ 2, M ≥ B2B2/31
and n ≥max{Mp,B2+4/xp} Lemma (8.3).
Choose x = ( 12 − 2). Condition x ≥ 2 is reduced to  ≤ 1/8 and n ≥ B2+4/xp is reduced to
n ≥ B2/(1−4)p.
5. Because of the way sampling was done in Step 1 (using polling), we can claim that total size of all
sub-problems will be O(n) [17]. Use a filtering scheme to reduce total size of all sub-problems from
O(n) to 2n.
Using n sectors, O(n) half-planes can be filtered in expected time O(np log n+ log n log log n) with a
total of O( nB logM n) expected cache misses, given n ≥Mp and M ≥ B2B2/31 (Theorem 8.4).
6. After filtering, total size of all sub-problems is ≤ 2n and all these sub-problems has been divided into
 sectors. To maintain the same n/p ratio throughout the recursion, increase the number of processors
to 2p. As it has been proved ([19]) that at any level of recursion, maximum size of total sub-problems
will be ≤ 2n, 2p processors will be enough for this algorithm. No. of processors are assigned to
each sub-problem according to its size. For e.g. problem with size x will get xp/n processors. For
division of processors, prefix computation is done on n sizes. Using np/n processors, it will take
O
(
n
p + log p
)
time and O(n/B) cache misses, given n ≥Mnp/n or n ≥Mp (Lemma 2.3).
7. Maximum size of any sub-problem can be 2n(1−) log n. Those sub-problems which have size ≤ n/p
get only one processor and can be solved sequentially. We do not need to divide these sub-problems
any further. Using one processor, intersection of b half-planes can be computed in O(b log b) time
and O( bB logM b) cache misses [1]. All other sub-problems are solved recursively using this same
procedure.
5An efficient re-sampling technique that is described later
25
As mentioned above, to complete step 1-7 in claimed bounds we need  ≤ 1/8, n ≥ max{Mp,B2/(1−4)p}
and M ≥ B2B2/31. By choosing  = 1/32, all these can be reduced to n ≥Mp and M ≥ B2B2/7.
This algorithm follows same kind of recurrence as showed in our sorting algorithm (Theorem 3.1).
T¯ ′′(n) =
{
O(K log n+ log n log logn) + T¯ ′′(ni) if n ≥ K
O(K logK) otherwise
(13)
where K = n/p and ni ≤ 2n1− log n or ni ≤ 2n31/32 log n or ni ≤ n63/64 if 2 log n ≤ n1/64 which will be
true for large n.
Likewise total cache misses for all p processors
Q¯(n) =

O( nB logM n) +
n1/x∑
i=1
Q¯(ni) if n ≥ K
O(KB logM K) otherwise
(14)
As solved in Theorem 3.1, we get total expected time O(np log n + log n log log n) and total O(
n
B logM n)
expected cache misses, 2
Corollary 8.1 Given n half-spaces in R3 that contains the origin, we can find their intersection using p
cores in expected time O(np log n + log n log log n) with expected cache misses O(
n
B logM n), given n ≥ Mp
and M ≥ B2B2/7.
Proof: The above algorithm readily generalizes to 3 dimensions using the algorithm of Reif and Sen[17]
where we use a simpler filtering step described in Amato, Goodrich and Ramos [3]. 2
8.2 Intersection of Half-planes and Convex Hull
Given a convex polygon divided into sectors and number of input half-planes, for every half-plane identify
the sectors of convex hull it intersects. As shown in [19], this problem can be reduced to a point location
problem. In dual space these half-planes are mapped to points and vertices’s of convex hull are mapped
to lines. These lines will intersect with each other to create some regions. It can be proved that all those
points which belong to same region in dual space, in primal space they are the half-planes which intersect
same set of sectors in this convex hull. As a part of preprocessing, for every region find the set of sectors it
intersect. After preprocessing, for any half-plane, do point location on these lines in the dual space. (These
lines are vertices’s of convex hull in primal space.) After finding region we can list all the sectors which
this half-plane intersects. For point location problem we will use the algorithm of Dobkin and Lipton [12].
After preprocessing the half-planes, we can perform the following steps:
1. For any query point we can find the unique cell in which it is present using two binary searches.
2. For every region we can list the set of sectors in constant time where listing involves the start sector
and the end sector. Because of convexity of the polygon, we can say that half-plane which pass
through these start and end sectors will also pass through the sectors between these start and end
sector.
Lemma 8.1 Using p cores, m half-planes can be preprocessed in O(m4/p + log p) time with O(m4/B)
cache misses, given m4 ≥ Bp so that point location among the half-planes can be done using two binary
searches.
Proof: The algorithm executes in the following steps :
26
1. For m half-planes we can have a maximum of m2 intersection points. Sort these m2 points w.r.t. x
co-ordinate. Using p cores, It will take O
(
m4
p + log p
)
time and total O(m4/B + m) cache misses,
given m4 ≥ Bp (Theorem 2.7).
2. From Step 1, we get m2 intervals or say vertical slabs. In each slab there will be n non-intersecting
half-planes. We will order these half-planes according to y coordinate. Allocate equal number of
processors to each slab, so that every slab will get p/m2 processors.
3. For every vertical slab find intersection points of m half-planes with both of its sides. This task can
be done easily in dm3/pe time and d(m/B + p/m2)e cache misses by assigning m3/p half-planes to
each processor for every slab. Total cache misses for all vertical slabs will be (m3/B + p) which is
bounded by O(m4/B) if m4 ≥ Bp.
4. All given m half-planes has been divided into m3 regions and as said earlier, all points which belongs
to same region will intersect same set of sectors. We just need to find the set of sectors, any region
will intersect. Take one point from every region, map it to half-plane in primal space and test it
against all sectors using some brute force algorithm to get a set of sectors for every region but as this
polygon is convex we will only save start and end sector.
This is done for all m3 regions. Assign m3/p of these regions to every processor. Every processor
will search linearly through all sectors to find the set of sectors for every half-plane. It will take a
total of dm4/pe time and m3p × mB × p = dm4/Be cache misses across all p cores.
2
Lemma 8.2 Given n points and arrangement of m half-planes, for every point we can identify the re-
gion it will lie in, in O(np log n + log n log log n) time and O(
n
B logM n) cache misses using p cores, given
n ≥max{Mp,B2+4/xp},n ≥ mx+2 and x ≥ 2.
Proof: The entire procedure can be divided into the following steps :
1. Given m4 ≥ Bp, pre-process m half-plane using Lemma 8.1 in O(m4/p + log p) time with total
O(m4/B) cache misses. If n ≥ m4, time and total cache misses are bounded by O(n/p + log p) and
O(n/B) respectively. After pre-processing, m planes have been divided into m2 vertical slabs and
every vertical slab is further divided into m horizontal slabs. Using this information, we will locate
the region for every input point.
2. Divide all the input points into these vertical slabs.
Using Theorem 5.1, n points can be divided into m2 vertical slabs in O(np log n+ log n log logn) time
and total O( nB logM n) cache misses using p cores, given m
4 ≤ n and n ≥max{Mp,B2/(1−2 lognm)p}.
3. All n points have been divided into m2 sub-problems (vertical slabs) where the ith sub-problem
contains si points. Divide every sub-problem again into m sub-problems (horizontal slabs) using
following steps :
(a) First consider sub-problems with si ≥ mx. This x will be chosen after analysis. In worst case,
all the sub-problems may have size ≥ mx. Assign sip/n processors to ith problem according to
size of this problem.
(b) Processor allocation is done by doing prefix computation on size of these m2 sizes of sub-
problems. Using m2p/n cores, prefix computation on m2 keys can be done in O(n/p + log p)
time with total O(m2/B) cache misses, given m2 ≥ Mm2p/n or n ≥ Mp (Theorem 2.3). We
will have enough processors for this task if m2 ≤ n.
27
(c) Divide si points into m slabs in O(
n
p log n + log n log logn) time and total O(
si
B logM n) cache
misses using sip/n cores, given si ≥ m2 and sisip/n ≥ max{M,B
2/(1−logsi m)} or
n/p ≥max{M,B2/(1−logsi m)} (Theorem 5.1).
In the worst case this step is done for all m2 problems. The total number of cache misses are
m2∑
i=1
si
B
logM n =
n
B
logM n for all m
2 problems. The condition si ≥ m2 is true if x ≥ 2 because
we are doing this step only for those problem who have si ≥ mx. For the next condition, we can
see that B2/(1−logsi m) decreases as we increase si, so if condition n/p ≥ B2/(1−logsi m) is true for
minimum si then it will be true for all si. The minimum si is m
x, which reduce the condition
to n/p ≥ B2/(1−log(mx)m) or n/p ≥ B2x/(x−1), where x ≥ 2.
(d) Solve rest of the sub-problems with sizes < mx. In the worst case there will be m2 such problems
and all of them will have size mx.
Using Theorem 5.1, we can divide mx (x is a small constant) points into m slabs in O(np logm+
logm log logm) time and total O(m
x
B logM m) cache misses using m
xp/n cores, given mx ≥ m2
and m
x
mxp/n ≥ B2/(1−log(mx)m) or n/p ≥max{M,B2x/(x−1)}. We will need a total of mx+2p/n
processors for m2 problems. So we have enough processors for this task if n ≥ mx+2. The total
cache misses for all m2 problems will be O(m
x+2
B logM m) which can be bounded by O(
n
B logM n)
if n ≥ mx+2.
After combining all of the conditions mentioned in steps above, we get n ≥ m4, x ≥ 2, n ≥ mx+2, n/p ≥
max{M,B2x/(x−1)} and n/p ≥ max{M,B2/(1−2 lognm)}. As x ≥ 2 so n ≥ mx+2 implies that n ≥ m4.
As B2/(1−lognm) decreases as we increase n, so if condition n/p ≥ B2/(1−lognm) is true for minimum n then
it will be true for all n. Minimum n possible is mx+2 so n/p ≥ B2/(1−2 log(mx+2)m) or n/p ≥ B2+4/x is
enough, where x ≥ 2. As B2+4/x ≥ B2x/(x−1) so n/p ≥ max{M,B2+4/x} is enough.
2
Corollary 8.2 Given n half-planes and a convex polygon divided into m radial sectors, for every half-plane
h, all the sectors intersected by h can be identified in O(np log n+log n log logn) time and O(
n
B logM n) cache
misses using p cores, given n ≥ mx+2, x ≥ 2 and n ≥ max{Mp,B2+4/xp}.
Proof: We get this result simply by combining Lemma 8.1 and Lemma 8.2. Map n half-planes to n points
in dual space and m vertices’s of convex polygon to m half-planes.
Using Lemma 8.2,given n points and m half-planes, for every point we can identify the region it lies in, in
O(np log n + log n log log n) time and O(
n
B logM n) cache misses using p cores, given n ≥ mx+2, x ≥ 2 and
n ≥max{Mp,B2+4/xp}.
Because of preprocessing done on m half-planes in Lemma 8.2, any region can be directly mapped to its
set of sectors. 2
Lemma 8.3 Given n half planes, and a convex polygon divided between m radial sectors, for each sector
Ci where i ≤ m, we want to identify all the half-planes intersecting Ci. If the cumulative output size is
O(n), this task can be done using p processors in expected time O(np log n + log n log logn) with expected
O( nB logM n) cache misses, given n ≥ mx+2, x ≥ 2, M ≥ B2B2/31 and n ≥max{Mp,B2+4/xp}.
Proof: We can divide this task into following steps :
1. Using Lemma 8.2, for every half-plane identify all the sectors this half-plane will intersect inO(np log n+
log n log logn) time and O( nB logM n) cache misses using p cores, given n ≥ mx+2, x ≥ 2 and n ≥
max{Mp,B2+4/xp}.
28
2. For every half-plane Ki (where 1 ≤ i ≤ n) if it intersect with sectors Sj ,Sj+1,Sj+2,. . . ,Sk (where
j ≤ k and both j and k are ≤ m), it means there will be ti = (k − j + 1) copies of this half-plane
in output list. So we need to write ti copies of Ki and assign them number j, j + 1, j + 2, . . . , k
respectively. There will be total
z∑
i=1
ti elements in output which is bounded by an (for some constant
as given). Sort this output list w.r.t. this assigned number.
This step is explained in detail in following steps :
(a) Assign n/p half-planes to every core so that every core will find ti for all of its n/p half-planes
and add them. To add n/p elements, one processor will take O(n/p) time and O(n/pB) cache
misses. The total cache misses across all cores will be n/B.
(b) Every core will write its n/p half-planes in output list as explained above. To find the location
where every core needs to write we will do prefix computation on these p sums found in step
above. Using p2/n cores, prefix computation on p keys can be done in O(n/p+ log p) time with
total O(p/B) cache misses given p ≥Mp2/n or n ≥Mp (Lemma 2.3 ).
(c) Sort these an elements using Theorem 3.1 w.r.t. number assigned to them above to get final
output half-planes divided them into m sectors. It will take O(anp log n+log n log log n) expected
time and O(anB logM n) expected cache misses, given an ≥ Mp and M ≥ B2B2/31 (for some
constant a).
2
8.3 Sampling and polling
To bound the total size of the sub-problems we need to find a good sample. A good sample means when
we divide our n half-planes into disjoints sub-problems using this sample total size of sub-problems will be
≤ cn and size of maximum sub-problem will be ≤ 2n1− log n, where  is number of sub-problems. We will
find this good sample using sampling and polling as shown in [19]. First we choose logn samples each of
size n and a random set of n/ log4 n input half-planes for all of these samples. Then for every sample we
divide its input sample into disjoint problems using this sample. Then based on total problem sizes of all
these samples, we choose a good sample from those log n samples [19].
Lemma 8.4 Using one processor, x elements can be sampled from n elements in O(x log x+n) time with
total O( nB +
x
B log x) cache misses, using x independent trials such that the probability of an element being
chosen is same in each trial.
Proof: We proceed as follows :
1. Processor will generate x random numbers in range [1, n] and save them sequentially in an array of
length x. It will take x time and incur x/B cache misses.
2. Sort these x keys in time O(x log x) with O( xB log x) cache misses using one processor.
3. Read these x numbers and input keys both from start and select those number whose rank is present
in these x numbers. It will take total (x+ n) time and (x+ n)/B cache misses.
2
Theorem 8.2 Using polling, a good sample of size n can be chosen from n half-planes in expected
O(np log n + log n log log n) time with expected O(
n
B logM n) cache misses, using p cores, given  ≤ 1/8
and n ≥max{Mp,B2/(1−4)p} and M ≥ B2B2/31.
29
Proof: We can divide the whole task into following steps :
1. Using p cores, choose log n samples each of size n from given n half-planes randomly, where 0 <  < 1.
For every sample, the probability for choosing any element is equal. Imagine that the input has been
divided into n contiguous chunks all of equal size and choose one key from every chunk.
To choose one sample using p/ log n cores, it will take dlog n. n/pe parallel steps and one cache miss
for every chosen key implying total n cache misses. For all log n samples, total cache misses will
be O(log n n). The time is bounded by O(np log n) if  ≤ 1 and the cache misses are bounded by
≤ nB logM n if n1− ≥ B logM (which is true if n ≥ B2/(1−)).
2. Find intersection of half-planes for every sample.
Using pn3/n log n cores, intersection of n half-planes can be computed in O(np log n + log p) time
with total O(n3/B) cache misses, if n3 ≥Mpn3/n log n or n log n ≥Mp (Lemma 7.4).
The total cache misses for log n samples will be O(n
3
B log n) which is bounded by O(
n
B logM n) if
n3 logM ≤ n or  ≤ 1/6. The total processors needed are pn3/n which is ≤ p if  ≤ 1/3.
3. We have to choose a random subset of n/ log4 n planes for every sample. If we use the same method
as used in Step 1, it will lead to total O(n/ log3 n) cache misses.
To reduce cache misses we use the following strategy :
(a) We have to choose total n/ log3 n elements. Every core is assigned n/p half-planes and gets a
task to choose n
p log3 n
elements from these n/p elements.
(b) Using Lemma 8.4, Every core will sample n
p log3 n
elements from n/p elements in O(n/p) time
and O(n/pB) cache misses. Total cache misses across all cores will be O(n/B).
(c) Every core assigns a random number in the range 1 to log n to all of these n
p log3 n
half-planes.
(d) Divide these n half-planes into log n problems w.r.t. this assigned number using Theorem 5.1.
Using p processors it will take O(np log n + log n log log n) time and total O(
n
B logM n) cache
misses, given n ≥ max{Mp,B2/(1−logn logn)p} and log n ≤ √n.
4. To find a good sample what we need is that every sample should divide its input sample into disjoint
sub-problems and based on the total size of those sub-problems, we can pick a good sample from
these.
5. Using intersection found in step 1, every sample will divide its input sample (chosen in step 3) into
disjoint problems.
Given n/ log4 n half-planes (chosen in step 3) and a convex polygon divided into n sectors (found
in step 1), for every half-plane we can find all sectors this half-plane will intersect in O(np log n +
log n log logn) time and O( n
B log4 n
logM n) cache misses using p/ log
4 n cores, given n
log4 n
≥ n(x+2),
x ≥ 2 and n ≥ max{Mp,B2+4/xp} (Corollary 8.2.)
We get an output list of size n/ log4 n. To find total size of sub-problems we need to just add these
n/ log4 n elements using p/ log4 n processors. It takes time O(n/p+ log p) with a total of O( n
B log4 n
)
cache misses, given n ≥ Bp. As both tasks are done for log n samples, so the total cache misses
across all cores will be O( n
B log3 n
logM n) which is ≤ O(n/B)
6. For every sample we have found the total size of the sub-problems. After finding the maximum size
we have enough information to select a good sample. Using p log n/n cores, maximum of log n keys
can be found in time O(n/p+ log p) with total O(log n/B) cache misses given log n ≥ Bp log n/n or
n ≥ Bp.
All of the above conditions can be combine as n ≥ n(x+2) log4 n, x ≥ 2 and n ≥ max{Mp,B2+4/xp}
M ≥ B2B2/31,  ≤ 1/6, n ≥ B2/(1−), n ≥ max{Mp,B2/(1−logn logn)p}.
By choosing x = ( 12 − 2), we obtain n ≥ B2/(1−4),  ≤ 1/8, M ≥ B2B2/31
30
28.4 Filtering
The total size of sub-problems after obtaining a good sample is O(n). We do some further pre-processing to
reduce this size to exact size of parent problem. This step is a kind of post-processing step after sampling.
For every sector we identify half-planes which intersect it. Some of them may be part of the output while
Figure 8: (a) B dominates A (b) No half-plane is dominated
some of them may not show up in output (in that sector). Since the output size is bounded by input size
so our objective is to eliminate all those half-planes from a sector which do not show up in output. This
task can be reduced [19] to sorting and maximal points problem.
As shown in figure 8(a) we can say that half-plane A is dominated by B through-out the sector so it
would not be part of output in this sector. In filtering step we will remove all half-planes of these kind.
After filtering there would not be any half-plane left in any sector such that it is dominated by any other
half-plane as shown in figure 8(b).
Filtering : We are given some sectors and some half-planes. From these half-planes we will remove most
of those planes which would not be part of output (intersection of given half-planes). If output size if x
(intersection of given half-planes), then after this filtering we will have utmost 2x half-planes left.
Theorem 8.3 Given a sector, n half-planes can be filtered in expected time O(np log n+log n log log n) with
a total of O( nB logM n) expected cache misses , given n ≥Mp and M ≥ B2B2/31.
Proof: This task can be divided into the following steps :
1. Consider one side of a sector and find intersection points of half-planes with this side. This can be
done in dn/pe time and dn/Be cache misses by assigning n/p half-planes to each processor.
2. Sort these half-planes w.r.t. distance of intersection point found above from center of sector. Using
Theorem 3.1 it will take expected time O(np log n + log n log log n) and expected O(
n
B logM n) cache
misses, given n ≥Mp and M ≥ B2B2/31.
31
3. Repeat steps 1 and 2 for other side of sector.
4. We get two sorted lists say x and y. For half-plane ni consider a tuple of the form xi, yi, where
(xi, yi) are rank of these half-planes in these sorted lists. To find this tuple we can do the following
- In list x write rank on every half-plane and then sort this list w.r.t. half-plane number. We get a
list of ranks for all half-planes.
5. We have a tuple (xi, yi) for every half-plane. A half-plane ni is completely occluded by another
half-plane nj if xj ≥ xi and yj ≥ yi. In that case half-plane ni would not be part of output. Find
maximal points on this tuple.
6. Using Theorem 7.2, given n points, we can find the maximal points in expected time O(np log n +
log n log logn) with a total of O( nB logM n) expected cache misses using p cores, given n ≥ Mp and
M ≥ B2B2/31.
2
Theorem 8.4 Given n sectors, we can filter O(n) half-planes in expected time O(np log n+ log n log log n)
with a total of O( nB logM n) expected cache misses , given n ≥Mp and M ≥ B2B2/31.
Proof: We have a total of an half-planes divided into n problems. Each problem has size si, (where
1 ≤ i ≤ n) and
n∑
i=1
si = an. We will do filtering on all of these n
 problems in parallel. Every problem is
assigned number of processors according to its size, so the ith problem is assigned sip/an processors using
prefix computation (Theorem 2.3).
Using Theorem 8.3 described above, one problem of size si(where si ≤ an) using sip/an processors can be
filtered in expected time O(anp log n + log n log log n) with a total of O(
si
B logM n) expected cache misses ,
given si ≥ Msip/an or an ≥ Mp and M ≥ B2B2/31. The total cache misses for all the n problems is
O(anB logM n).
For any constant a, time is bounded by O(np log n + log n log log n) and total number of cache misses is
bounded by O( nB logM n). 2
9 Concurrent Writes and processor oblivious load balancing
Traditionally, parallel programs are written assuming that there is a unique id (an integer in the range
1 . . . p) for each of the p processors. The processors id is used to designate a particular task to a specific
processor. For example, for an input array of n numbers, a processor with id i may be allocated the task
associated with the sub-array np ·i . . . np ·(i+1). This is easy because p is known at the time the parallel code
is generated. However, in some situations p may not known and moreover, the processors may not have an
id associated with them. We illustrate this method for the fundamental problem of prefix computation.
The basic idea is that each of the p processors simultaneously chooses a random number in the range [1..n]
and writes to the corresponding location in an array A. The expected number of processors writing to a
specific location A[i] 1 ≤ i ≤ n, is p/n and no more than O(log n) w.h.p. - note that, from our earlier
assumptions, p ≤ n/B, so the expected number of elements writing into a B element block ≤ 1. Roughly
speaking, a processor writing to a location i assumes responsibility for the block of n/p elements starting
from b in/pc. However, because of conflicts caused by independent random choices, we have to do some
limited redistribution and also estimate the value of p. It can be argued using Chernoff bounds that w.h.p.
Θ(log n) processors will choose a location in the range [bi . . . b(i+ 1)] i = 0, 1 . . . where b = (n/p) · log n.
32
From the previous observation, p can be estimated in the following manner. Let the leftmost processor
in the the array count up to log n processors. If this location is β, then we can estimate p to be within a
constant factor of (n/β) · log n). This procedure requires more elaboration.
To find the leftmost processor, we can let every processor that succeeded in writing to a location in
the first step traverse left in a synchronous fashion. Once it encounters a location that has been already
written into, it terminates the traversal. Only the leftmost processor will be able to continue and reach
location 1.
In the second phase when it counts up to log n processors, we have to ensure that the processors that
conflict in the random choices are counted with the proper multiplicity. If j processors write to a single
location, then we let these processors increase a counter and the final count is the number of the processors -
since j ≤ log n w.h.p., this step completes in O(log n) time. So the leftmost processor can start its rightward
traversal after Ω(log n) steps and report the location β. Note that the expected value of β is (n/p) log n
which can be subsumed in the overall time for a problem like sorting but not for prefix computation. So,
for prefix computation, we can assume that p ≤ (n/ log n) and run the estimation procedure in an array of
length n/ log n, so that the overall time for the estimation is O((n/ log n) · (1/p) · log n) which is O(n/p).
The processors that belong to the block (of size (n/p) log n) must be assigned a unique id which can
be computed as follows. Once p is estimated, each processor knows its most significant log p − log log n
bits of the id. The remaining log log n bits can be assigned on the basis of a simple prefix sum within the
block carried out by the leftmost processor within the block (once p is estimated, the leftmost processor
in a block is easily identified by repeating the procedure within the block). Following this, each processor
is assigned a unique block of θ(n/p) elements.
In the above procedure, we have not accounted for the block misses when processors conflict in writing
to a specific block. For instance, if j processors write to the same block, the block misses will be Ω(j2).
This could be as much as Ω(log2 n) since we can only bound j ≤ log n with high probability. The overall
block misses will be given by
∑n′
i=1 n
2
i where n
′ = n/B and ni is the number of processors writing in block
i. Note that
∑
i ni = p. We can compute the expected block misses as follows. Since nj denotes the
number of processors that chose block j, we are interested in the quantity E[
∑
j n
2
j ] that represents the
total expected block misses. Let r.v. Xi = 1 if processor i
6 writes to block 1 and 0 otherwise (the same
analysis will apply to any fixed block by symmetry). Then E[Xi] = 1/n
′ and moreover E[X2i ] = 1/n
′.
From our earlier notation, n1 =
∑p
i=1Xi and so n
2
1 =
∑p
i=1X
2
i . Taking expectations,
E[
(
p∑
i=1
Xi
)2
] = pE[X2i ] +
(
p
2
)
· 2E[Xi ·Xj ] = p · 1
n′
+ p(p− 1) · E[Xi] · E[Xj ] = p · 1
n′
+ p(p− 1) 1
n′2
The first equality follows from linearity and the second from independence. Therefore, from symmetry,
E[n21 + n
2
2 + . . .] = n
′ · (p/n′ + p(p− 1)/n′2) = p(1 + (p− 1)/n′) which is O(p) for p ≤ n′ = n/B.
For prefix computation, we proceed as follows. Given p ≤ n/ log n processors, we first estimate p
using the first n/ log n locations (i.e. n′/ log n blocks) of the array and also the processor id. If there are
c log n processors for a block of size n/p log n, then a processor with id = < m′, `′ > where m′ denotes
the most significant log p− log log n bits, and `′ denotes the least significant bits is assigned the locations
[α ·m′ + β · `′ . . . α ·m′ + β(`′ + 1)] where α = n/p log n and β = 1c · (n/p).
The first phase takes O(n/p) sequential steps, following which the number of data items is p. We can
run the optimal (cache oblivious) speed-up prefix computation on the p processors.
10 Future Work
For both sorting and convex hull algorithm given here, running time is not optimal when we have n cores.
The bottleneck step in these algorithm is where we divide input into buckets. This step is being imple-
6this is only for the analysis since a processor is not explicitly given any id; also note that we need Concurrent Write
capability for this
33
mented using a deterministic algorithm. We can try to optimize this by using a randomized approach.
References
[1] Peyman Afshani, Arash Farzan, Cache-Oblivious Output-Sensitive Two-Dimensional Convex Hull, 19th Canadian Con-
ference on Computational Geometry, 2007.
[2] A. Aggarwal and J. S. Vitter. The Input/Output Complexity of Sorting and Related problems, Communications of the
ACM, 1988.
[3] N.M. Amato, M.T. Goodrich, and E.A. Ramos, N.M. Amato, M.T. Goodrich, and E.A. Ramos, Parallel algorithms for
higher- dimensional convex hulls, Proc. 35th Annu. IEEE Sympos. Found. Comput. Sci., pp. 683694, 1994.
[4] L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multi-
processors, SPAA’08, June 14-16, 2008, Munich, Germany.
[5] G. Blelloch, R. Chowdhury, P. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache
performance for divide-and-conquer algorithms. In ACM-SIAM SODA, pages 501-510, 2008.
[6] Robert D. Blumofe and Charles E. Leiserson, Scheduling Multithreaded Computations by Work Stealing, Journal of the
ACM 46(5), 1999,pp. 720-748.
[7] Guy E. Blelloch, Phillip B. Gibbons, Harsha Vardhan Simhadri, Low Depth Cache-Oblivious Algorithms, SPAA ’10
Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures.
[8] T. Chan. Optimal output-sensitive convex hull algorithms in two and three dimensions. Discrete and Computational
Geometry, 1996.
[9] R. Cole and U. Vishkin. Approximate and Exact Parallel Scheduling with Application to List, Tree and Graph Problems.
Proc. 27th IEEE Symposium on Foundation of Computer of Science, 1986. pp 478-491.
[10] R. Cole. Parallel Merge Sort. SIAM J. Compute. 17 (1988). pp 770-785.
[11] Richard Cole and Vijaya Ramachandran, Resource Oblivious Sorting on Multicores, ICALP’10 Proceedings of the 37th
international colloquium conference on Automata, languages and programming, 2010.
[12] D. Dobkin and R.J. Lipton, Multidimensional searching problems, SIAM J. Comput., 5(1976), pp.181-186
[13] Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran, Cache-Oblivious Algorithms,
[14] Rudiger Reischuk, A Fast Probabilistic Parallel Sorting Algorithm, Proc. 22nd IEEE Symposium on Foundation of
Computer of Science, 1981.
[15] John H. Reif and Leslie G. Valiant, A Logarithmic Time Sort for Linear Size Networks, Journal of the Association for
Computing Machinery, Vol. 34, No. I, January 1987, pp. 60-76.
Journal of the ACM, Vol. 49, No. 6, November 2002, pp. 828-858.
[16] Harald Prokop, Cache-Oblivious Algorithms, MS Thesis, MIT, June 1999. IEEE 40th Annual Symposium on Foundations
of Computer Science, 1999.
[17] John H. Reif, Sandeep Sen, Optimal Parallel Randomized Algorithms for 3-D Convex Hulls and Related Problems (1992),
SIAM Journal on Computing
[18] John H. Reif and Sandeep Sen, Randomized Algorithms for Binary Search and Load Balancing on Fixed Connection
Networks with Geometric Applications, SIAM J. Comput., vol 23(3), 1994,pp.633-651.
[19] Random Sampling Techniques and Parallel Algorithms Design, Sandeep Sen, S. Rajasekaran, Synthesis of Parallel Algo-
rithms, J. Reif ed. , Morgan Kauffman, 1993.
[20] S. Rajasekaran and S. Sen, On parallel integer sorting, Acta Informatica, 29: 1–15, 1992.
[21] Sandeep Sen, Sidhhartha Chatterjee and Neeraj Dumir, Towards a Theory of Cache-Efficient Algorithms, Journal of the
ACM, 49(6), pp. 828–858, 2002.
[22] L. G. Valiant, A bridging model for multi-core computing, In Proc. of the 16th Annual ESA, volume 5193, pages 13-28,
2008.
34
