In this paper we present a cache-oblivious framework for randomized divide-and-conquer algorithms on the multicore model with private caches. We first derive an algorithm for sorting $n$ numbers with expected parallel depth $O(\frac{n}{p}\log n + \log n \log\log n)$ and expected $O(\frac{n}{B}\log_M n)$ cache misses, where $p$, $M$ and $B$ denote the number of processors, the size of an individual cache memory and the block size, respectively. Although similar results have been obtained recently for sorting [4], we feel that our approach is simpler and more general, and we apply it to obtain an algorithm for 3D convex hulls with similar bounds.
INTRODUCTION
The private-cache multicore model and the closely related Parallel External Memory (PEM) model combine several features of parallel computing models like the PRAM with the memory hierarchy issues captured by the external memory models. The goal is to capture the relevant aspects of a large-scale multiprocessing environment, whose numerous parameters may be unknown to the algorithm designer. Although it is not intuitive, this last requirement can be tackled using the strategy called cache obliviousness or, more generally, resource obliviousness.
These multiprocessing models consist of $p$ processors (or cores), each having a private cache of size $M$, that communicate with each other through a shared memory; for full parallelism, $n \ge M \cdot p$. Initially, the input is present in the global memory, stored in the form of blocks of size $B$, and all transfers are done in blocks of size $B$. We incur two types of cache-related cost in this model (previously defined in [4]).
Cache misses: Whenever any core needs some data which is not present in its cache, that block is copied from the main memory into its cache and it is counted as one cache miss.
Block misses: When a block residing in multiple caches is modified, then every core containing the block incurs a block miss.
Both concurrent reads and concurrent writes are allowed in this model.
Concurrent reads: If $x$ cores are reading the same block, every core will incur one cache miss and the total cache cost will be $x$.
Concurrent writes: If $x$ cores write simultaneously to the same block, this block will migrate across these $x$ cores to satisfy their write requests. Thus the $i$-th core in sequence will have to wait for time equal to $i$ cache misses before it can complete its write operation. So the total cache misses across all cores will be $\sum_{i=1}^{x} i = x(x+1)/2 = \Theta(x^2)$.
The performance of a multicore algorithm is characterized by two parameters: cache misses (including block misses) and the critical path length, which is the maximum time taken by any processor in the overall algorithm. Our goal is to design efficient cache-oblivious algorithms where the parameters $M$, $B$ cannot be used to customize the algorithm, yet we would like to match the performance of cache-aware algorithms. Further, we would like to generate the parallel code without knowledge of the number of processors, which is known as resource obliviousness.
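The cost accounting for concurrent accesses described above can be tallied with a small sketch. The function names are illustrative and not from the paper; the write cost assumes, as the model states, that the i-th writer in sequence waits for exactly i cache misses.

```python
def concurrent_read_cost(x: int) -> int:
    """Total cache misses when x cores read the same block: one miss per core."""
    return x


def concurrent_write_cost(x: int) -> int:
    """Total cache misses when x cores write to the same block.

    Assumes the i-th core in sequence waits for i cache misses,
    so the total is 1 + 2 + ... + x = x(x + 1)/2.
    """
    return sum(range(1, x + 1))


if __name__ == "__main__":
    for x in (1, 4, 16):
        print(x, concurrent_read_cost(x), concurrent_write_cost(x))
```

The quadratic growth of the write cost is what makes conflicting writes to a single block expensive compared with conflicting reads.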
Previous Related Work
For the external memory model, Aggarwal and Vitter [1] Note that both time and cache misses achieved by this algorithm are optimal.
Arge et al. [2] formalized the PEM model and presented a cache-aware mergesort algorithm that runs in $O(\log n)$ time and has optimal cache misses. Blelloch et al. [3] presented a resource-oblivious distribution sort algorithm that has expected $O(\log^{3/2} n)$ critical path length and incurs suboptimal cache cost in the private-cache multicore model. Their distribution sort uses merging to divide the input, which is the potential bottleneck for reducing the depth further. The algorithm given in [9] is designed for a BSP-style version of a cache-aware, multi-level multicore, which makes it difficult to compare directly with the previous results. Recently Cole and Ramachandran [4] presented a new optimal merge sort algorithm (SPMS) for the resource-oblivious multicore model. This algorithm sorts $n$ keys in $O(\frac{n}{p}\log n + \log n \log\log n)$ time using n processors with an optimal number of cache misses on the resource-oblivious model, assuming $n \ge Mp$ and $M \ge B^2 \log B \log\log B$. It works in $O(\log\log n)$ stages where each stage requires $O(\log n)$ time. The authors addressed a general computational paradigm called Balanced Partitioning Trees and designed a resource-oblivious priority work scheduler based on work-stealing to attain the above bounds.
Our Work
In this paper, we present a randomized distribution sort algorithm on the cache-oblivious multicore model that is similar to that of Reischuk [7]. However, to bound the cache misses, we had to modify it significantly. First, we sample an appropriate number of elements from the input, sort them using a brute-force method, and use these elements to divide the input into disjoint buckets. To partition in a cache-oblivious fashion, we do it in two phases. Roughly speaking, we divide the $n$ input keys into $\sqrt{n}$ buckets by successive partitioning into $n^{1/4}$ size buckets. This partitioning procedure, which is the crux of our distribution sort, in turn invokes an efficient merging procedure to attain the final bounds. More specifically, we obtain the following result: $n$ numbers can be sorted in expected parallel time $O(\frac{n}{p}\log n + \log n \log\log n)$ and expected cache misses $O(\frac{n}{B}\log_M n)$, using a cache oblivious algorithm.
So the cache cost is optimal, but the time is optimal only for $p \le n/\log\log n$. Since $n \ge Mp$, under a very weak condition, viz., $M \ge \log\log n$, it follows that $p \le n/\log\log n$ and our algorithm is optimal in both time and cache misses. Our bounds for sorting match those of Cole and Ramachandran [4] in cache misses and depth, and we can obtain matching performance in the cache-oblivious PEM model. Further, our algorithm is based on a general framework for randomized divide-and-conquer that has other applications. In this work we exploit this to obtain an algorithm for constructing three dimensional convex hulls, with bounds similar to sorting, that is based on the approach of Reif and Sen [6]. More specifically, we obtain the following result: the convex hull of $n$ points in three dimensions can be constructed in expected parallel time $O(\frac{n}{p}\log n + \log n \log\log n)$ and expected $O(\frac{n}{B}\log_M n)$ cache misses using a cache oblivious algorithm.
Since it is known that Voronoi diagrams can be reduced to three dimensional convex hulls, we obtain identical results for constructing 2D Voronoi diagrams. We also present a simple technique for processor-obliviousness, where the algorithm need not have any prior knowledge of the number of processors and the processors can generate their ids on the fly. Our approach is fundamentally different from those of [3, 4], which design a scheduler to map tasks to processors in a resource-oblivious fashion.
AN OVERVIEW OF OUR ALGORITHM
The crux of our algorithm is an efficient cache-oblivious partitioning scheme that works as follows. Let $T(n, m)$ represent the total time to divide $n$ keys into $m$ buckets. Instead of dividing the $n$ keys directly into $m$ buckets, we first divide the $n$ keys into $\sqrt{m}$ buckets and then split every bucket further into $\sqrt{m}$ buckets.
1. Divide the $n$ keys into $\sqrt{m}$ buckets. This is done in two steps.
(i) Divide the $n$ keys into $\sqrt{n}$ contiguous chunks of size $\sqrt{n}$ each. Each chunk of size $\sqrt{n}$ is then divided into $\sqrt{m}$ buckets recursively.
(ii) Now we have $\sqrt{n}$ lists, each divided into $\sqrt{m}$ buckets. We merge these lists and get a single list divided into $\sqrt{m}$ buckets. The merging procedure is summarized in Theorem 2.2.
2. Split every bucket further into $\sqrt{m}$ buckets. We follow the same approach as in step 1 above with some modifications and subsequently merge the buckets.
Finally, we obtain the following recurrence for the parallel time: $T(n, m) = O(n/p + \log p) + 2T(\sqrt{n}, \sqrt{m})$. The analysis for cache misses is similar and is omitted from this abstract.
log n + log n log log n) incurring a total of O(
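The two-level idea (coarse buckets first, then refinement of each coarse bucket) can be illustrated with a sequential sketch. This is only an illustration under stated assumptions: the function names are hypothetical, the splitters are assumed to be given in sorted order, and the recursion on chunks and the parallel merge used by the actual scheme are left out.

```python
import bisect
import math


def partition_direct(keys, splitters):
    """Place each key into one of len(splitters) + 1 buckets by binary search."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for k in keys:
        buckets[bisect.bisect_right(splitters, k)].append(k)
    return buckets


def partition_two_level(keys, splitters):
    """Partition via roughly sqrt(m) coarse buckets, then refine each one."""
    m = len(splitters) + 1
    step = max(1, math.isqrt(m))
    # Every step-th splitter defines the boundaries of the coarse buckets.
    coarse_idx = list(range(step - 1, len(splitters), step))
    coarse = [splitters[i] for i in coarse_idx]
    coarse_buckets = partition_direct(keys, coarse)
    result = []
    lo = 0
    for b, bucket in enumerate(coarse_buckets):
        hi = coarse_idx[b] if b < len(coarse_idx) else len(splitters)
        # Splitters strictly between two coarse boundaries refine this bucket.
        result.extend(partition_direct(bucket, splitters[lo:hi]))
        lo = hi + 1
    return result
```

Both functions produce the same $m$ buckets; the two-level version only ever distributes keys among about $\sqrt{m}$ destinations at a time, which is the property the scheme relies on.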
For merging $x$ lists, each divided into $t$ buckets, we assign the $y$ keys equally to the $p$ processors such that the first processor writes the first $y/p$ keys, the second processor writes the next $y/p$ keys, and so on. For this, the input is partitioned using a prefix computation; the processors may have to read keys from multiple lists, but they write contiguously. All the cores read their corresponding keys sequentially unless one of the following events happens:
(i) The bucket that a core was reading ends. This can increase the cache miss count by at most one compared to the sequential cache misses, and it can happen at most $xt$ times.
(ii) Processors may have to start or end reading in the middle of some block. This can increase the cache miss count of a processor by at most one or two over the sequential misses, which is bounded by $O(p)$ overall.
Note that the above bounds are worst-case deterministic. For the overall distribution sort, a random sample of size $n^{1/\alpha}$, $\alpha > 1$, is used to partition the problem, and the recurrence for the expected parallel running time can be written as
otherwise (1)
where $K = N/P$ is the original problem-size-to-processor ratio. For $n \ge 2^{18}$, we can choose $\alpha$ such that $n_i \le n$
with probability $> 1 - 1/n$, i.e. $r < 1/n$. It follows that the expected parallel running time is $O(\frac{n}{p}\log n + \alpha \log n \log\log n)$. Using similar arguments, the expected number of cache misses $Q(n)$ can be bounded by $O(\frac{n}{B}\log_M n)$.
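The merging step described in this section (prefix computation over bucket sizes, contiguous writes by each processor) can be sketched sequentially as follows. The function name and the nested-list data layout are assumptions made for illustration, and the loop over processors simulates what the $p$ cores would do in parallel.

```python
from itertools import accumulate


def merge_bucketed_lists(lists, p):
    """lists[i][j] is bucket j of list i; returns (output, bucket_starts, ranges)."""
    x, t = len(lists), len(lists[0])
    # Segments in bucket-major order: bucket 0 of every list, then bucket 1, ...
    segments = [(j, i) for j in range(t) for i in range(x)]
    sizes = [len(lists[i][j]) for j, i in segments]
    # Prefix sums give the output offset at which each segment starts.
    offsets = [0] + list(accumulate(sizes))
    y = offsets[-1]
    output = [None] * y
    ranges = []
    for proc in range(p):
        # Processor proc writes the contiguous output range [start, end).
        start, end = proc * y // p, (proc + 1) * y // p
        ranges.append((start, end))
        for s, (j, i) in enumerate(segments):
            lo, hi = max(start, offsets[s]), min(end, offsets[s + 1])
            for pos in range(lo, hi):
                output[pos] = lists[i][j][pos - offsets[s]]
    bucket_starts = [offsets[j * x] for j in range(t)]
    return output, bucket_starts, ranges
```

Each simulated processor reads from possibly many segments but writes one contiguous range of about $y/p$ keys, mirroring the read and write pattern whose extra cache misses are bounded in cases (i) and (ii).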
Processor oblivious load balancing
Traditionally, parallel programs are written assuming that there is a unique id (an integer in the range $1 \ldots p$) for each of the $p$ processors. The processor's id is used to designate a particular task to a specific processor. For example, for an input array of $n$ numbers, a processor with id $i$ may be allocated the task associated with the subarray $[(i-1) \cdot n/p + 1, \ldots, i \cdot n/p]$. This is easy because $p$ is known at the time the parallel code is generated. In our situation $p$ is not known and the processors do not have any predefined unique id associated with them.
The basic idea is that each of the $p$ processors simultaneously chooses a random number in the range $[1..n]$ and writes to the corresponding location in an array $A$. The expected number of processors writing to a specific location is $p/n$, and no more than $O(\log n)$ w.h.p. Note that, from our earlier assumptions, $p \le n/B$, so the expected number of processors writing into a $B$-element block is at most 1. Roughly speaking, a processor writing to a location $i$ assumes responsibility for the block of $n/p$ elements starting from i n/p. However, because of conflicts caused by independent random choices, we have to do some limited redistribution and also estimate the value of $p$. It can be argued using Chernoff bounds that w.h.p. $\Theta(\log n)$ processors will choose a location in the range $[bi, \ldots, b(i+1)]$, $i = 0, 1, \ldots$, where $b = (n/p) \cdot \log n$. If there are $c \log n$ processors for a block of size $(n/p) \log n$, then a processor has id $\langle m, \ell \rangle$, where $m$ denotes the most significant $\log p - \log\log n$ bits and $\ell$ denotes the least significant bits. The processor is assigned locations $[\alpha \cdot m + \beta \cdot \ell, \ldots, \alpha \cdot m + \beta(\ell + 1)]$, where $\alpha = (n/p) \log n$ and $\beta = \frac{1}{c} \cdot (n/p)$. The bits $\ell$ can be assigned using a counter when the processors write (concurrently) to a common block. The details of computing $m$ and $\ell$ are omitted from this abstract.
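The random choice of locations can be simulated to see how processors spread over ranges of length $b = (n/p) \log n$. This is a minimal sketch, not the paper's procedure: the function name, the fixed seed and the base-2 logarithm are assumptions, and the simulation uses $p$ only to set up the experiment, whereas the algorithm itself has to estimate it.

```python
import math
import random


def processors_per_range(n, p, seed=0):
    """Simulate p processors each picking a random location in [1..n].

    Returns the range length b and the number of processors per range.
    """
    rng = random.Random(seed)
    b = max(1, round((n / p) * math.log2(n)))
    counts = [0] * (-(-n // b))  # ceil(n / b) ranges
    for _ in range(p):
        loc = rng.randint(1, n)  # inclusive on both ends
        counts[(loc - 1) // b] += 1
    return b, counts


if __name__ == "__main__":
    b, counts = processors_per_range(1 << 16, 1 << 10)
    print("range length:", b)
    print("min/max processors per range:", min(counts), max(counts))
```

With these parameters each range receives $\log_2 n = 16$ processors in expectation; the printed minimum and maximum give a rough sense of the concentration that the Chernoff bound argument formalizes.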
In the above procedure, we have not accounted for the block misses when processors conflict in writing to a specific block. For instance, if $j$ processors write to the same block, the block misses will be $\Omega(j^2)$. This could be as much as $\Omega(\log^2 n)$, since we can only bound $j \le \log n$ with high probability. The overall block misses will be given by $\sum_{i=1}^{n'} n_i^2$,
where $n' = n/B$ and $n_i$ is the number of processors writing in block $i$. Note that $\sum_i n_i = p$. We can compute the expected block misses as follows. If $n_j$ denotes the number of processors that chose block $j$, we are interested in the quantity $E[\sum_j n_j^2]$, which represents the total expected block misses. Let the r.v. $X_i = 1$ if processor $i$ writes to block 1 and 0 otherwise (we need concurrent write capability). Then $E[X_i] = 1/n'$ and moreover E[X
