Over the last five years, major microprocessor manufacturers have released plans for a rapidly increasing number of cores per microprocessor, with upwards of 64 cores by 2015. In this setting, a sequential RAM computer will no longer accurately reflect the architecture on which algorithms are being executed. In this paper we propose a model of low degree parallelism (LoPRAM) which builds upon the RAM and PRAM models yet better reflects recent advances in parallel (multi-core) architectures. This model supports a high level of abstraction that simplifies the design and analysis of parallel programs. More importantly, we show that in many instances it naturally leads to work-optimal parallel algorithms via simple modifications to sequential algorithms.
INTRODUCTION
Modern microprocessor architectures have gradually incorporated support for a certain degree of parallelism. Over the last two decades we have witnessed the introduction of the graphics processor, the multi-pipeline architecture, and vector architectures. More recently major hardware vendors such as Intel, Sun, AMD and IBM have announced multicore architectures in their main line of processors, with two and four core processors being currently available. Until recently the degree of parallelism provided by any of these solutions was rather small and as such it was best studied as a constant speedup over the traditional and/or transdichotomous RAM model. However, recent road maps released by major microprocessor manufacturers predict a rapidly increasing number of cores, reaching 64 to 128 cores by 2015. A newly revised version of "Moore's Law" now states that the number of cores per chip is expected to double every two years. In this scenario, a constant speedup would no longer accurately reflect the amount of resources available.
We propose a new model of low degree parallelism (LoPRAM) which better reflects recent multicore architectural advances. We argue that in current architectures the number of processors available is best described as O(log n) rather than the implicit O(n) of the classic PRAM model. As with the classical RAM model, the LoPRAM supports different degrees of abstraction. Depending on the intended application and the performance parameters required, the design and analysis of an algorithm can consider issues such as the memory hierarchy, interprocess communication cost, low level parallelism, or high-level thread-based parallelism. Our main focus is on this higher level, thread-based parallelism, shifting the classical view of the SIMD PRAM with finely synchronized parallelism to a higher-level multi-threaded view. As we shall see, the design and analysis of algorithms at this higher level is often sufficient to achieve optimal speedup. This, of course, does not preclude the use of low level optimizations when necessary.
We then apply this model to the design and analysis of algorithms for multicore architectures for a sizeable subset of problems and show that we can readily obtain optimal speedups. This is in contrast to the PRAM model, in which even a work-optimal sorting algorithm proved to be a difficult research question [8]. More explicitly, we show that a large class of dynamic programming and divide-and-conquer algorithms can be parallelized using the high level LoPRAM thread model while achieving optimal speedup using an automated scheduler. Interestingly, the assumption that there is a logarithmic bound on the degree of parallelism is key in the analysis of the techniques given. We identify that for certain problems there are sharp thresholds in difficulty when the number of processors grows past log n and n^{1−ε}. This paper provides the first formal argument for the observation that it is easier to develop work-optimal parallel programs when the degree of parallelism is small. While this observation is widely known and acknowledged by many, it had previously been stated only at an empirical level and had not been the subject of formal study.
As such the main contribution of this work is the combination of a series of established facts, as well as new observations and lemmas into a novel model which is simple, effective and better reflects the state of current parallelism.
PARALLEL MODELS
The dominant model for previous theoretical research on parallel computations is the PRAM model [12], which generally assumed Θ(n) processors working synchronously with zero communication delay and often with infinite bandwidth among them. If the number of processors available in practice was smaller, the Θ(n) processor solution could be emulated using Brent's Lemma [6]. The PRAM model, while fruitful from a theoretical perspective, proved unrealistic and various attempts were made to refine it in a way that would better align to what could effectively be achieved in practice (see, for example, [10, 3, 18, 14, 16, 17, 1, 2]). In addition to its lack of fidelity, an important drawback of the PRAM is the enormous difficulty in developing and implementing work-optimal algorithms (i.e. linear speedup) for a computer with Θ(n) processors.
To the best of our knowledge the assumption of a logarithmic level of parallelism as well as its theoretical implications had yet to be noted in the literature.
MODEL
The core of a LoPRAM is a PRAM with O(log n) processors running in multiple-instruction multiple-data (MIMD) mode. The read and write model, while architecture dependent, can generally be assumed to be Concurrent-Read Exclusive-Write (CREW) [15]. To support this model, semaphores and automatic serialization on shared variables are made available (either hardware or software based) in a form that is transparent to the programmer.
The LoPRAM model naturally supports a high level of abstraction that simplifies the design and analysis of parallel programs. The application benefits from parallelism through the use of threads.
Thread Model
Two main types of threads are provided: standard threads and pal-threads (Parallel ALgorithmic threads). Standard threads are executed simultaneously and independently of the number of cores available; they are executed in parallel if enough cores are available or by using multitasking if the thread count exceeds the degree of parallelism, just as in a regular RAM. Pal-threads, on the other hand, are executed at a rate determined by the scheduler. If there are any pal-threads pending, at least one of them must be actively executing, while all others remain at the discretion of the scheduler. They could be assigned resources, if they are available, or they could be made to wait inactive until resources free up. Once a thread has been activated, though, it remains active just like a standard thread (this is important to avoid potential deadlock). Pending pal-threads are activated in a manner consistent with order of creation as resources become available, in a fashion reminiscent of work stealing [7]. While primitives are provided for ad-hoc ordering of pal-thread activation, by default threads are inserted into an ordered tree. The root of the tree is the main thread, and new threads are attached as children of the thread that creates them, in order of creation.
The scheduler executes the nodes in a combination of parallel breadth-first and depth-first order. When a thread issues calls for its children, the calling thread enters a wait state and the child threads are executed in order of creation. If there are enough cores available, each child is assigned to a different core. Otherwise, when all available cores are executing a thread, each core executes the subtree rooted at its thread in depth-first order. If at any point cores become available again, threads are assigned to them in the order given by the preorder traversal of the tree. When no further children remain pending, control is returned to the parent thread. Execution concludes when there are no further threads to execute and the main thread exits.
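The default ordering described above can be illustrated with a small sketch. It only models the ordered thread tree and the preorder in which pending pal-threads would be handed to cores as they free up; it does not model timing, core assignment, or wait states. The names `PalThread`, `spawn`, and `activation_order` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class PalThread:
    """A node of the ordered thread tree (hypothetical illustration type)."""
    name: str
    children: List["PalThread"] = field(default_factory=list)

    def spawn(self, name: str) -> "PalThread":
        # New threads are attached to their creator, in order of creation.
        child = PalThread(name)
        self.children.append(child)
        return child


def activation_order(root: PalThread) -> Iterator[str]:
    """Yield thread names in preorder: a thread first, then each child's subtree."""
    yield root.name
    for child in root.children:
        yield from activation_order(child)


main = PalThread("main")
a = main.spawn("a")
b = main.spawn("b")
a.spawn("a1")
a.spawn("a2")
b.spawn("b1")
print(list(activation_order(main)))
```

In this example, if a core frees up while `a1`, `a2`, and `b1` are pending, the sketch's ordering would offer `a1` first.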
Multiprocessing model
In actuality, the number of cores made available by the operating system may vary as the level of multiprogramming in the system changes. Hence, in the analysis of the algorithm the number of processors available is denoted as p, with the assumption that this number is bounded from above by O(log n), i.e., p = O(log n) but not necessarily Θ(log n). The algorithm must execute properly for any value of p. The running time is, of course, a function of n and p.
WORK-OPTIMAL PARALLELIZATION
We present two classes of problems which allow for ready parallelization under the LoPRAM model. Note that these same classes were not, in general, readily parallelizable under the classic PRAM model.
Divide and Conquer
Consider the class of divide-and-conquer algorithms whose time complexity is described by a recurrence which can be solved using the master theorem. We show that when these algorithms are executed in a straightforward parallel fashion on a LoPRAM, their execution time is given by a parallel version of the master theorem with optimal speedup.
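Consider a recursive divide-and-conquer sequential algorithm whose time complexity T(n) is given by the recurrence:
\[ T(n) = a\,T(n/b) + f(n) \tag{1} \]
where a, b > 1 are constants, and f(n) is a nonnegative function. By the master theorem, T(n) is such that [9]:
\[
T(n) =
\begin{cases}
\Theta(n^{\log_b a}) & \text{if } f(n) = O(n^{\log_b a - \epsilon}) \text{ for some constant } \epsilon > 0, \\
\Theta(n^{\log_b a} \log n) & \text{if } f(n) = \Theta(n^{\log_b a}), \\
\Theta(f(n)) & \text{if } f(n) = \Omega(n^{\log_b a + \epsilon}) \text{ for some constant } \epsilon > 0 \text{ and } a f(n/b) \le c f(n) \\
 & \text{for some constant } c < 1 \text{ and all sufficiently large } n.
\end{cases}
\]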
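For driving functions of the polynomial form f(n) = n^k, the applicable case of the master theorem for a recurrence T(n) = a T(n/b) + f(n) follows from comparing k with the critical exponent log_b a. The following is a minimal sketch under that restriction; the helper name `master_case` is hypothetical, and the sketch does not handle driving functions with logarithmic factors.

```python
import math


def master_case(a: float, b: float, k: float) -> str:
    """Classify T(n) = a*T(n/b) + n**k (polynomial driving function only)."""
    crit = math.log(a, b)  # critical exponent log_b(a)
    if math.isclose(k, crit):
        return f"case 2: Theta(n^{crit:g} log n)"
    if k < crit:
        return f"case 1: Theta(n^{crit:g})"
    # For f(n) = n**k with k > log_b(a), the regularity condition holds,
    # since a * (n/b)**k = (a / b**k) * n**k and a / b**k < 1.
    return f"case 3: Theta(n^{k:g})"


print(master_case(2, 2, 1))  # Mergesort-style recurrence: a = 2, b = 2, f(n) = n
print(master_case(8, 2, 2))  # eight half-size subproblems with quadratic merging
print(master_case(2, 2, 2))  # merging cost dominates the recursion
```

Mergesort falls in case 2, which recovers its familiar Θ(n log n) sequential bound.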
When p processors are available, we assume that recursive calls can be assigned to different processors, which can execute their instances independently of those of others. All of the processors finish their computations before the results are merged. We denote by Tp(n) the running time of an algorithm that uses p processors, and by T (n) that of its sequential version.
We first consider problems for which the merging phase of the algorithm can only be done sequentially in each instance. Multiple processors can still be used to merge subproblems of different instances, but only one processor deals with a particular instance.
Theorem 1. Let Tp(n) be the time taken by a recursive algorithm that uses p = O(log n) processors whose sequential version has time complexity given by Equation (1). Then, the time Tp(n) is a recurrence of the form:
The bounds for Tp(n) are given by:
Alternatively if we can merge the results of subproblems in parallel with optimal speedup, the parallel master theorem for this setting is as before with the exception of case 3 for which we have:
Theorem 1 implies that optimal parallel algorithms can be readily derived for important divide-and-conquer algorithms such as Mergesort, Matrix multiplication, Delaunay triangulation, Polygon triangulation and Convex hull, among others.
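To make the processor-assignment idea concrete for Mergesort, the sketch below splits the available processor budget p between the two recursive calls and falls back to the sequential algorithm once a call holds a single processor. Merging is done sequentially within each instance. This is an illustration of the structure only, not the paper's scheduler: the function names are hypothetical, plain `threading.Thread` objects stand in for pal-threads, and in standard CPython the global interpreter lock prevents actual speedup for CPU-bound work.

```python
import threading
from typing import List


def merge(left: List[int], right: List[int]) -> List[int]:
    """Sequentially merge two sorted lists."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def mergesort(xs: List[int], p: int = 1) -> List[int]:
    """Sort xs using a budget of at most p concurrently active threads."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    if p <= 1:
        # Budget exhausted: plain sequential recursion.
        return merge(mergesort(xs[:mid]), mergesort(xs[mid:]))
    result = {}

    def run_left() -> None:
        result["left"] = mergesort(xs[:mid], p // 2)

    worker = threading.Thread(target=run_left)
    worker.start()                            # left half runs in a new thread
    right = mergesort(xs[mid:], p - p // 2)   # right half runs in this thread
    worker.join()                             # both halves finish before merging
    return merge(result["left"], right)


print(mergesort([5, 2, 9, 1, 5, 6, 0], p=4))
```

Splitting the budget as `p // 2` and `p - p // 2` keeps the number of simultaneously active threads at most p, mirroring the assumption that recursive calls are assigned to different processors.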
Dynamic Programming
In the past, parallel versions of certain dynamic programming algorithms have been proposed (see, for example, [4], [13], [5]). Most of these studies provide parallel algorithms that are specific to a few dynamic programming problems, and assume a classical PRAM model with Θ(n) processors. In our case we restrict ourselves to p = O(log n) processors.
We now describe a general procedure that, given the specification of the dynamic programming solution to a problem, generates a scheduling strategy to solve it in parallel.
Let the specification of the solution be of the form:
(3)
For the dynamic programming solution to be effective we require that the object M which stores partial solutions can be efficiently indexed using a partial input x as key and that the recursive order y_i ≺ x be efficiently constructible in a bottom-up fashion. If this is not the case, we can use memoization, which stores the partial solutions as they are required in the top-down expansion of the recursion. In most cases these two techniques are equivalent, though there are known cases in which one of them is provably superior to the other.
The recursion described in Equation (3) can be modeled by a directed acyclic graph (DAG), where each vertex v_x corresponds to a subproblem x (or equivalently partial result M[x]), and there is an edge from v_x to v_y iff subproblem y depends on subproblem x. The source vertices in this DAG correspond to the base cases. The goal is to compute this DAG in parallel. The speedup will be proportional to the amount of parallelism embedded in the graph. In certain cases, such as one-dimensional dynamic programming, the DAG is a path and hence there is no speedup possible. In others, such as most common examples of two-dimensional tables for dynamic programming, there is a row, column or diagonal order which allows for a high degree of parallelism.
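As an illustration of a diagonal order, the sketch below fills the familiar longest-common-subsequence table one anti-diagonal at a time. The choice of LCS is an assumption made for the example and is not taken from the text above. Every cell on an anti-diagonal depends only on cells of the two previous anti-diagonals, so the cells within one diagonal could be computed in parallel; the code runs them sequentially and simply records how many independent cells each diagonal offers.

```python
from typing import List, Tuple


def lcs_by_antidiagonals(s: str, t: str) -> Tuple[int, List[int]]:
    """Return the LCS length of s and t and the width of each anti-diagonal."""
    n, m = len(s), len(t)
    # Row 0 and column 0 are the base cases and stay at 0.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    widths = []
    for d in range(2, n + m + 1):
        # Cells (i, j) with i + j == d, 1 <= i <= n, 1 <= j <= m.
        cells = [(i, d - i) for i in range(max(1, d - m), min(n, d - 1) + 1)]
        widths.append(len(cells))
        for i, j in cells:  # mutually independent: candidates for parallel work
            if s[i - 1] == t[j - 1]:
                M[i][j] = M[i - 1][j - 1] + 1
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])
    return M[n][m], widths


length, widths = lcs_by_antidiagonals("ABCBDAB", "BDCABA")
print(length, widths)
```

The recorded widths show how the available parallelism grows toward the middle diagonals and shrinks again near the corners of the table.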
Given the specification D of a dynamic programming solution of the form (3) and an input I, we give a parallel algorithm for solving the problem. We do not explicitly create the entire dependency graph in the beginning. Instead, we create the relevant parts of the graph as the computation proceeds. Each vertex v has a counter cv that indicates, at any time, the number of vertices that v depends on directly and that have not been computed yet. The counters of all vertices are initialized based on D and I. Initially a pal-thread is created for each base case vertex. After a thread computes the value corresponding to a vertex v, it determines its outgoing neighbors according to D and I, decreases the values of their counters, and creates pal-threads for each of the vertices whose counter becomes zero.
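The counter-based scheme can be sketched as follows. Several simplifications are assumptions of this sketch rather than part of the described algorithm: the dependency graph is built explicitly up front instead of lazily, a `ThreadPoolExecutor` with p workers stands in for the pal-thread scheduler, a single lock stands in for the serialization of shared counters, and the names `solve_dag` and `lcs_length` are hypothetical. The graph is assumed to be acyclic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def solve_dag(deps, compute, p):
    """deps[v] lists the vertices v depends on; compute(v, M) returns M[v]."""
    if not deps:
        return {}
    counter = {v: len(ds) for v, ds in deps.items()}  # pending dependencies
    succ = {v: [] for v in deps}                      # outgoing neighbours
    for v, ds in deps.items():
        for u in ds:
            succ[u].append(v)
    M, errors = {}, []
    lock = threading.Lock()
    done = threading.Event()
    remaining = len(deps)
    base_cases = [v for v, c in counter.items() if c == 0]

    with ThreadPoolExecutor(max_workers=p) as pool:
        def task(v):
            nonlocal remaining
            try:
                value = compute(v, M)
            except Exception as exc:  # surface failures instead of hanging
                errors.append(exc)
                done.set()
                return
            ready = []
            with lock:  # serialized update of the shared counters
                M[v] = value
                for w in succ[v]:
                    counter[w] -= 1
                    if counter[w] == 0:
                        ready.append(w)
                remaining -= 1
                finished = remaining == 0
            for w in ready:  # create a thread for each newly enabled vertex
                pool.submit(task, w)
            if finished:
                done.set()

        for v in base_cases:  # initially one thread per base case
            pool.submit(task, v)
        done.wait()
    if errors:
        raise errors[0]
    return M


def lcs_length(s, t, p=4):
    """Example use: longest common subsequence length via the DAG solver."""
    n, m = len(s), len(t)
    deps = {(i, j): [] if i == 0 or j == 0
            else [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
            for i in range(n + 1) for j in range(m + 1)}

    def compute(v, M):
        i, j = v
        if i == 0 or j == 0:
            return 0
        if s[i - 1] == t[j - 1]:
            return M[(i - 1, j - 1)] + 1
        return max(M[(i - 1, j)], M[(i, j - 1)])

    return solve_dag(deps, compute, p)[(n, m)]


print(lcs_length("ABCBDAB", "BDCABA", p=3))
```

A vertex is submitted only after its counter reaches zero, so `compute` always finds the values it depends on already stored in `M`.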
Our goal is to compute the solution optimally in time Tp(n) = O(T(n)/p). The speedup factor will depend on the parallelism embedded in the graph: if most antichains have size smaller than p, then we cannot obtain much parallelism. However, for most dynamic programming algorithms of dimension more than one, antichains are usually large enough to support high parallelism. In addition, the concurrent update of the counter values in the graph cannot always be done in parallel in a CREW model. In that case, a standard simulation technique can be used to obtain CRCW behaviour on a CREW PRAM with a log p slowdown factor [11].
