We present a technique for analyzing the number of cache misses incurred by multithreaded cache oblivious algorithms on an idealized parallel machine in which each processor has a private cache. We specialize this technique to computations executed by the Cilk work-stealing scheduler on a machine with dag-consistent shared memory. We show that a multithreaded cache oblivious matrix multiplication incurs O(n 3 / √ Z + (P n) 1/3 n 2 ) cache misses when executed by the Cilk scheduler on a machine with P processors, each with a cache of size Z, with high probability. This bound is tighter than previously published bounds. We also present a new multithreaded cache oblivious algorithm for 1D stencil computations incurring O(n 2 /Z + n + √ P n 3+ ) cache misses with high probability, one for Gaussian elimination and back substitution, and one for the length computation part of the longest common subsequence problem incurring O(n 2 /Z + √ P n 3.58 ) cache misses with high probability.
Introduction
In this paper we derive bounds to the number of cache misses (the cache complexity) incurred by a computation when executed by an idealized parallel machine with multiple processors. We assume that the computation is multithreaded: The computation expresses a partial order on its instructions, and a scheduler external to the computation maps pieces of the computation onto processors. The computation itself has no control over the schedule. Our main focus is on analyzing the cache complexity of parallel multithreaded cache oblivious algorithms [16] , although, as a special case, our bounds also apply to a sequential process migrated from one processor to another by an operating system. Past studies of the cache complexity have considered two complementary settings, each modeling different aspects of real machines. In the distributed-cache model, each processor is connected to a private cache that interacts somehow with the other caches to maintain a desired memory model. In the shared-cache model, a single cache is common to all processors, which are also referred to as (hardware) threads. This paper focuses on the distributed-cache model.
A multithreaded computation defines a partial execution order on its instructions, which we view as a directed acyclic graph (dag) [1, 6, 9, 26] . The work T 1 is the total number of nodes in the dag, and the critical path T ∞ is the length of a longest path in the dag. It is well-known that these two parameters characterize the dag for scheduling purposes: the execution time T P of the dag on P processors satisfies T P ≥ max(T 1 /P , T ∞ ), and a greedy scheduler [10, 21] matches this lower bound within a factor of 2. A greedy scheduler is no longer asymptotically optimal when taking cache effects into account, however, and the best choice of a scheduler depends upon whether caches are distributed or shared. Roughly speaking, on a shared cache, threads that use the same data should be scheduled concurrently so as to maximize data reuse. On distributed caches, threads that do not share data should be scheduled concurrently so as to minimize inter-cache communication.
Multithreaded computations in the shared-cache model have been investigated by Blelloch and Gibbons [5] who proved a strong result: If the cache complexity of a computation is Q 1 on one processor with cache size Z 1 , then a parallel schedule of the computation exists such that the cache complexity Q P on P processors satisfies Q P ≤ Q 1 , assuming that the P processors share a cache of slightly larger size Z P ≥ Z 1 +P T ∞ . Blelloch and Gibbons explicitly show a scheduler that achieves this bound.
The analysis of distributed caches is more involved. Acar et al. [1] construct a family of dags with work (n) and critical path (lg n) whose cache complexity is bounded by O(Z) on one processor with a cache of size Z, but it explodes to (n) when the dag is executed in parallel on distributed caches. This result shows that a low cache complexity on one processor does not imply a low cache complexity on multiple processors for general dags. For series-parallel dags, however, more encouraging results are known. Blumofe et al. [7] prove that the Cilk randomized work-stealing scheduler [9] executes a series-parallel computation on P processors incurring
cache misses with high probability, where Q 1 is the number of cache misses in a sequential execution and Z is the size of one processor's cache. This bound holds for a "dag-consistent" shared memory with LRU caches. Acar et al. [1] prove a similar upper bound for race-free series-parallel computations under more general memory models and cache replacement strategies, taking into account the time of a cache miss and the time to steal a thread. The bound in (1) diverges to infinity as the cache size increases, and is actually tight for certain pathological series-parallel computations [1] .
On the other hand, as we show in this paper, (1) is not tight for those "well-designed" programs whose sequential cache complexity decreases as the cache size increases, including cache oblivious algorithms. In this paper, we introduce the ideal distributed cache model for parallel machines as an extension of the (sequential) ideal cache model [16] , and we give a technique for proving bounds stronger than (1) for cache oblivious algorithms [16] . Our most general result (Theorem 1) has the following form. Consider the sequence of instructions of the computation in program order (the trace). Assume that a parallel scheduler can be modeled as partitioning the trace into S "segments" of consecutive instructions, and that the scheduler assigns each segment to some processor. Cilk's work-stealing scheduler, for example, can be modeled in this way. Assume that the cache complexity of any segment of the trace is bounded by a nondecreasing concave function f of the work of the segment. Then the cache complexity of the parallel execution is at most Sf (T 1 /S). For the majority of existing cache oblivious algorithms, the cache complexity is indeed bounded by a concave function of the work, and therefore this analysis is applicable. 1 Furthermore, for the Cilk scheduler, the number of segments is O(P T ∞ ) with high probability, and thus we derive bounds to the cache complexity in terms of the work T 1 , the critical path T ∞ , and the sequential cache complexity.
For example, a multithreaded program for multiplying two n × n matrices without using additional storage has work T 1 = O(n 3 ), critical path T ∞ = O(n), and sequential cache complexity [7] . When the program is executed on P processors by the Cilk scheduler, we prove that its cache complexity is
with high probability. As another application, we present a new multithreaded cache oblivious algorithm for stencil computations, derived from our sequential algorithm [18] . Our one-dimensional stencil algorithm for a square spacetime region has T 1 = O(n 2 ), T ∞ = O(n), and sequential cache complexity Q 1 = O(n 2 /Z + n). When executed on P processors by the Cilk scheduler, the cache complexity is Q P = O(n 2 /Z + n + √ P n 3+ ) with high probability. These bounds on the cache complexity allow a programmer to determine whether a program has sufficient temporal locality. If this is the case, the programmer can ignore the issues of data distribution and communication schedules without suffering a performance penalty. For example, matrix multiplication requires (n 3 / √ Z + n 2 ) cache misses [24] irrespective of the number of processors, and we prove that a simple recursive matrix multiplication algorithm incurs O(n 3 / √ Z + (nP ) 1/3 n 2 ) cache misses on P processors under the randomized work-stealing scheduler. If n is so large that the first term dominates the cache complexity, then any attempt to carefully orchestrate the communication schedule-which typically leads to complicated message-passing programs-would only yield incremental gains in the negligible second term. Thus, for large n or, equivalently, small P , one can achieve near-optimal performance with a simple program. 1 A trivial concave bound to the cache complexity always exists because the cache complexity is always at most as large as the work, and this linear bound is by definition concave. Our theory yields nontrivial results when the function is strictly concave, as is the case for the algorithms presented in this paper, for the FFT and sorting problems [16] , for dynamic programming [11] , etc., but not for problems such as matrix transposition where each cached datum is reused (1) times.
The remainder of this article is structured as follows. In Sect. 2 we present the ideal distributed cache model. In Sect. 3, we analyze the cache complexity of multithreaded computations on a machine with an ideal distributed cache. Then, in Sect. 4, we apply our cache complexity bounds to the analysis of multithreaded, cache oblivious programs for matrix multiplication, stencil computations, Gaussian elimination and back substitution, and for the length computation part of the longest common subsequence problem.
The Ideal Distributed Cache Model
In this section, we introduce the ideal distributed cache model for parallel machines as an extension of the ideal (sequential) cache model [16] .
An ideal distributed-cache machine has a two-level memory hierarchy. The machine consists of P processors, each equipped with a private ideal cache connected to an arbitrarily large shared main memory. An ideal cache is fully associative and it implements the optimal off-line strategy of replacing the cache line whose next access is farthest in the future [2] ; see [16, 27] for a justification of this assumption.
Each private cache contains Z words (the cache size), and it is partitioned into cache lines consisting of L consecutive words (the line size) that are treated as atomic units of transfers between cache and main memory.
A processor can only access data in its private cache. If an accessed datum is not available in the cache, the processor incurs a cache miss to bring the data from main memory into its cache.
The number of cache misses incurred by a computation running on a processor depends on the initial state of the cache. The cache complexity Q of a computation is defined as the number of cache misses incurred by the computation on an ideal cache starting and ending with an empty cache.
The ideal distributed cache model assumes that the private caches are noninterfering: the number of cache misses incurred by one processor can be analyzed independently of the actions of other processors in the system. Whether this assumption is true in practice depends on the consistency model maintained by the caches. For example, caches are noninterfering in the dag-consistent memory model maintained by the Backer protocol [7] . Alternatively, caches are noninterfering in the HSMS model [1] if the computation is race-free.
Our ideal distributed cache model is almost the same as the dag-consistent model analyzed by Blumofe et al. [7] , except that we assume ideal caches instead of caches with LRU replacement. Bender et al. [3] consider a distributed-cache model, but with cache coherence and atomic operations. This model is harder to analyze than ours but it supports lock-free algorithms that are not possible with noninterfering caches. The shared ideal cache model of Blelloch and Gibbons [5] features an ideal cache which, unlike in our model, is shared by all processors. Like the PRAM [15] and its variants, the ideal distributed cache model aims at supporting a shared-memory programming model. Unlike the lock-step synchronous PRAM, and unlike bulk-synchronous models such as BSP [30] and LogP [14] , our model is asynchronous, and processors operate independently most of the time. Like in the QSM model [19] , each processor in our model features a private memory, but the QSM manages this private memory explicitly in software as part of each application, whereas we envision an automatically managed cache with hardware support.
The Cache Complexity of Multithreaded Computations
In this section, we prove bounds on the cache complexity of a multithreaded computation in terms of its sequential cache complexity, assuming an ideal distributed-cache machine. Specifically, Theorem 1 bounds the cache complexity of a multithreaded computation assuming a "generic" scheduler. Theorem 2 refines the analysis in the case of the Cilk work-stealing scheduler. Finally, Theorem 5 gives a technical result that simplifies the analysis of the cache complexity of divide-and-conquer computations.
Let the trace of a multithreaded computation be the sequence of the computation's instructions in some order consistent with the partial order defined by the multithreaded computation. Let a segment be a subsequence of consecutive instructions of the trace. We denote with |A| the length of segment A, and with Q(A) the number of cache misses incurred by the execution of segment A on an ideal cache that is empty at the beginning and at the end of the segment.
We assume that the computation is executed in parallel by a scheduler whose operation can be modeled as partitioning the trace into segments and assigning segments to processors. For each segment assigned to it, a processor executes the segment fully, and then proceeds to the execution of the next segment. When completing a segment, we assume that a processor completely invalidates and flushes its own cache (but not other caches), and we count the cache misses incurred by these actions as part of the parallel cache complexity. This technical assumption makes our upper-bound proofs easier; a real scheduler may apply optimizations to avoid redundant flushes. For correctness of the parallel execution, the scheduler must ensure that the assignment of segments to processors respects the data dependencies of the multithreaded computation, but our analysis holds for all partitions, including incorrect ones.
Recall
for all x 0 and x 1 in the domain of f . For a concave function f and integer S ≥ 1, Jensen's inequality holds:
Our first result relates the parallel cache complexity to the sequential cache complexity and the number of segments. 
Theorem 1 Let
Proof Let A i , 0 ≤ i < S be the segments generated by the scheduler. Because we assume that the scheduler executes a segment starting and ending with an empty cache, and because caches do not interfere with each other in the ideal-cache model, the parallel execution incurs exactly Q P (M) = 0≤i<S Q(A i ) cache misses. By assumption, we have 0≤i<S Q(A i ) ≤ 0≤i<S f (|A i |). By Jensen's inequality we have 0≤i<S f (|A i |) ≤ Sf ( 0≤i<S |A i |/S) = Sf (|M|/S), and the theorem follows.
Because a segment incurs at most as many cache misses as its number of memory accesses, Theorem 1 can always be applied trivially with f (x) = x. Theorem 1 becomes useful when we can find nontrivial concave functions, as in the examples in Sect. 4 .
We now analyze the cache complexity of multithreaded Cilk [8, 17] programs assuming a dag-consistent shared memory [7] . Cilk extends the C language with fork/join parallelism managed automatically by a provably good work-stealing scheduler [9] . In general, a Cilk procedure is allowed to execute one of three actions: (1) execute sequential C code; (2) spawn a new procedure; or (3) wait until all procedures previously spawned by the same procedure have terminated. The latter operation is called a "sync". When a parent procedure spawns a child procedure, Cilk suspends the parent, makes it available to be "stolen" by another processor, and begins work on the child. When a processor returns from a child procedure, it resumes work on the parent if possible, or otherwise the processor becomes idle. An idle processor attempts to steal work from another, randomly selected processor. A procedure executing a sync may block, in which case the processor executing the procedure suspends it and starts stealing work.
The execution of a Cilk program can be viewed as a dag of dependencies among instructions. Whenever the dag contains an edge from node u to a node v executing on a different processor, the Backer protocol [7] inserts the following actions to enforce dag-consistent memory. The processor executing u, after executing it, writes all dirty locations in its cache back to main memory. The processor executing v, before executing it but after the write back succeeding u, flushes and invalidates its cache and resumes with an empty cache. In terms of the Cilk source program, Backer can be viewed as inserting memory consistency actions in two places: (1) after a spawn at which a procedure is stolen, and (2) before the sync that waits for such a spawn to complete (the sync associated with the steal). If we view each processor as working on a segment, then a steal breaks the segment into four parts such that the noninterference assumption holds within each part: (1) the portion of the segment executed by the victim before the steal; (2) the portion of the segment executed by the victim after the steal; (3) the portion of the segment executed by the thief before the sync associated with the steal; and (4) the portion of the segment executed by the thief after the sync associated with the steal. Thus, each steal operation increases the number of segments by three. Combining this insight with Theorem 1 and the upper bounds of Acar et al. [1] , we obtain the following theorem.
Theorem 2 (Cilk cache complexity) Consider a Cilk computation with work T 1 and critical path T ∞ , executed on an ideal distributed-cache machine with memory consistency maintained by the Backer protocol. Assume that a cache miss and a steal operation take constant time. Let f be a concave function such that Q(A) ≤ f (|A|) holds for all segments A of the trace of the computation.
Then, the parallel execution on P processors incurs
cache misses, where with high probability
Proof Acar et al. [1, Lemma 16] prove that the Cilk scheduler executes O( m/s P T ∞ ) steals with high probability, where m is the time for a cache miss and s is the time for a steal. Their proof depends neither on the cache replacement policy nor on the memory model. By assumption, m/s = (1) and we hide these parameters in the O-notation from now on. Each steal creates three segments, and therefore the number of segments is S = O(P T ∞ ) with high probability, proving (2b).
The length of the trace is the same as the work T 1 . Caches maintained by Backer are noninterfering within each segment and therefore Theorem 1 applies, proving (2a).
While we derived Theorem 2 for Cilk with dag-consistent shared memory, we could have applied the same analysis to race-free computations in the HSMS model of Acar et al. [1] , obtaining the same bound.
In order to apply Theorems 1 and 2, one must prove a bound on the number of cache misses incurred by each of the unique segments of trace M, which is hard to do in general. For example, consider a divide-and-conquer computation that recursively solves a problem of size n by reducing it to problems of size n/r. One can prove bounds on the cache complexity by induction on complete subtrees of the recursion tree, but this proof technique does not work for segments that do not correspond to complete subtrees.
To aid these proofs, we now prove Theorem 5 below, which bounds the cache complexity of an arbitrary segment in terms of the cache complexity of a subset of recursively nested segments. We call such a subset a recursive decomposition of the trace. In a divide-and-conquer computation, segments in the recursive decomposition would correspond to complete subtrees of the recursion tree. Then, Theorem 5 extends a bound on complete subtrees to a bound valid for all segments.
Definition 3 (Recursive segment decomposition) Let A be a segment and r ≥ 2 be an integer. A r-recursive decomposition of A is any set R of subsegments of A produced by the following nondeterministic algorithm:
We say that segment A is the parent of the segments A i .
Before proving Theorem 5, we state a rather obvious property of ideal caches.
Lemma 4 (Monotonicity of an ideal cache) Let A and B be segments of a trace with B ⊂ A. Then Q(B) ≤ Q(A).
Proof Execute B on a cache that incurs exactly the same cache misses as an optimal execution of A, in the same order. In this case, execution of B incurs exactly Q(A) cache misses. An optimal replacement policy for B incurs at most as many cache misses as the policy that we have just discussed.
Theorem 5 Let R be a r-recursive decomposition of trace M. Let f be a nondecreasing function such that
Proof We prove that A is included in the concatenation of at most two segments in R of length at most r|A|. The theorem then follows from Lemma 4. Let B be the smallest segment in R that includes A. Such a segment exists because the entire trace M ∈ R.
If a child B ∈ R of B exists that is included in A, then |B|/r ≤ |B | ≤ |A|. By Lemma 4 we have
and we are done.
Otherwise, two consecutive children B and B of B exist in R such that A is included in the concatenation of B and B . Let A = A ∩ B and A = A ∩ B . Then, by construction, A is a suffix of B and A is a prefix of B .
We now prove that
Otherwise, a child of C exists in R. By construction, A is a suffix of C , and therefore the rightmost child D of C is included in A . Therefore, we have |C |/r ≤ |D | ≤ |A |.
By Lemma 4, we have
A symmetric argument, substituting "prefix" for "suffix," proves that
. By monotonicity of f , we conclude that Q(A) ≤ 2f (r |A|) and the theorem is proven.
By combining Theorems 2 and 5, we obtain the following bound on the parallel cache complexity in terms of the number of segments and of the sequential cache complexity of a recursive decomposition.
Corollary 6 Let R be a r-recursive decomposition of trace M. Let f be a nondecreasing concave function such that Q(A) ≤ f (|A|) for all A ∈ R. Assume a Cilk scheduler with Backer as in Theorem 2.
Then, the total number Q P (M) of cache misses incurred by the parallel execution of the trace on P processors is
cache misses, where, with high probability,
for all segments A of M. The corollary then follows from Theorem 2.
Remark If Sf (rT 1 /S) happens to be a concave function of S, then the bounds hold in expectation as well, because then we have E[Sf (rT 1 
Applications
In this section, we apply Corollary 6 to the analysis of parallel cache oblivious algorithms for matrix multiplication, stencil computations, a linear system solver, and longest common subsequence computations. All applications are programmed in Cilk [17] . A similar analysis could be applied to suitable parallelizations of other cache oblivious algorithms. Figure 1 shows the pseudo code of multithreaded procedure matmul for multiplying two n × n matrices for n = 2 k [7] . This procedure executes T 1 = O(n 3 ) work and its critical path is T ∞ = O(n). 2 If we ignore the spawn annotations and the sync statements, we obtain a special case of the sequential cache oblivious matrix multiplication algorithm, which incurs [16] when executing on one processor with an ideal cache of size Z and cache line size L, assuming a "tall" cache with Z = (L 2 ). This cache complexity is asymptotically optimal [24] .
Matrix Multiplication
The trace of procedure matmul admits a simple 8-recursive decomposition comprising all segments that compute a complete subtree of the call tree. Moreover, the analysis of the sequential case applies to each complete subtree. Hence, on our ideal distributed-cache machine, each subtree that multiplies n × n matrices incurs
To apply Corollary 6, we must find a concave function f that bounds the number of cache misses Q as a function of a segment's length, for all segments in the recursive decomposition. Consider a segment in the recursive decomposition that multiplies n × n matrices. Ignoring constant factors, let w = n 3 be the length of the segment.
Since f is concave, we obtain the cache complexity of a parallel execution of matmul by Corollary 6 as
where S = O(P n) with high probability. 1 Cilk pseudo code for computing C = C + AB, where A, B, and C are n × n matrices. The code for partitioning each matrix into four quadrants is not shown. The spawn keyword declares that the spawned procedure may be executed in parallel with the procedure that executes the spawn. A sync statement waits until all procedures spawned by the current procedure have terminated. Cilk implicitly sync's before returning from a procedure
Comparison with Previous Bounds Assume now for simplicity that
and the Cilk cache complexity is
How does the "new" bound (5) compare to the "old" bound
that was derived by Blumofe et al. [7] ? As Z → ∞, (5) remains bounded, whereas (6) diverges, and thus the new bound is asymptotically tighter than the old bound for some values of the parameters. If
and the old bound is (n 3 / √ Z), and therefore the old bound is not tighter than the new one. Otherwise, we have n 3 / √ Z ≤ (P n) 1/3 n 2 , and thus √ Z ≥ (P n) −1/3 n, from which ZP n ≥ (P n) 1/3 n 2 follows. Consequently, the new bound is O((P n) 1/3 n 2 ), whereas the old bound is (ZP n) = ((P n) 1/3 n 2 ), and therefore the old bound is not tighter than the new one in this case either. Thus, we conclude that the new bound strictly subsumes the old bound.
1D Parallel Stencil Algorithm
In this section, we present a parallel cache oblivious algorithm for stencil computations, derived from our sequential cache oblivious algorithm [18] , and we analyze its cache complexity in the ideal cache model. Bilardi and Preparata [4] analyze a more complicated parallel cache oblivious stencil algorithm in a limiting technology where signal propagation at the speed of light is the primary performance bottleneck.
A stencil defines the computation of an element in an n-dimensional spatial grid at time t as a function of neighboring grid elements at time t − 1, . . . , t − k. The n-dimensional grid plus the time dimension span an (n + 1)-dimensional spacetime. For brevity we restrict our discussion to one-dimensional stencils. The algorithm can be extended to stencils with arbitrary dimensions as discussed in [18] .
Our parallel, cache oblivious stencil algorithm applies to in-place computations that store only a bounded number of spacetime points for each space position, as opposed to storing the entire spacetime. For example, the sequential, non cache oblivious program in Fig. 2 computes (t 1 − t 0 )(x 1 − x 0 ) spacetime points using only 2(x 1 − x 0 ) memory locations. We call a multidimensional array, such as u [2] [N ] in Fig. 2 , with two places to store two versions of each value a toggle array. 3 Most stencil computations used in practice are in-place, predominantly employ variants of toggle arrays, and therefore this restriction is not serious. Reusing storage is necessary to benefit from a cache: A stencil computation that did not reuse storage would incur a number of cache misses proportional to the size of the spacetime irrespective of the cache size.
Description of the 1D Stencil Algorithm
Procedure walk1 in Fig. 6 visits all points (t, x) in a rectangular spacetime region, where 0 ≤ t < T , 0 ≤ x < N, and t and x are integers. The procedure visits point −1), (t, x) , and (t, x + 1).
Although we are ultimately interested in traversing rectangular spacetime regions, the procedure operates on more general trapezoidal regions such as the one shown in Fig. 3 . For integers t 0 , t 1 , x 0 ,ẋ 0 , x 1 , andẋ 1 , we define the trapezoid T (t 0 , t 1 , x 0 ,ẋ 0 , x 1 ,ẋ 1 ) to be the set of integer pairs (t, x) such that t 0 ≤ t < t 1 and
The height of the trapezoid is h = t 1 − t 0 , and we define the width to be the average of the lengths of the two parallel sides, i.e. w = (
The center of the trapezoid is point (t, x), where t = (t 0 + t 1 )/2 and x = (x 0 + x 1 )/2 + (ẋ 0 +ẋ 1 )h/4 (i.e., the average of the four corners). The area of the trapezoid is the number of points in the trapezoid. We only consider well-defined trapezoids, for which these three conditions hold: t 1 ≥ t 0 , x 1 ≥ x 0 , and
Procedure walk1 decomposes T recursively into smaller trapezoids, according to the following rules.
Parallel Space Cut Whenever possible, the procedure executes a parallel space cut, decomposing T into r "black" trapezoids and some number of "gray" trapezoids, as illustrated in Fig. 4 . The procedure spawns the black trapezoids in parallel, waits for all of them to complete, and then spawns the gray trapezoids in parallel. Such an execution order is correct because the procedure operates the cut so that (1) points in different black trapezoids are independent of each other, (2) points in different gray trapezoids are independent of each other, and (3) points in a black trapezoid do not depend on points in a gray trapezoid.
The base of each black trapezoid has length l = (x 1 − x 0 )/r , except for the rightmost one, which may be larger because of rounding. A black trapezoid has the form T (t 0 , t 1 , x, σ, x + l, −σ ). Slope σ of the edges is necessary to guarantee that a point in a black trapezoid does not depend on points in a gray trapezoid. A black trapezoid is well-defined only if the condition l ≥ 2σ h holds, or else the trapezoid The algorithm cuts the trapezoid along the horizontal line through its center, it recursively visits T 1 , and then it visits T 2 would be self-intersecting. Therefore, r black trapezoids fit into T only if x 1 − x 0 ≥ 2rσ h, which is the condition for the applicability of the parallel space cut.
The procedure always generates r + 1 gray trapezoids, of which r − 1 are located between black trapezoids, as in Fig. 4 , and two are located at the left and right edges of T . In Fig. 4 , the trapezoids at the edges happen to be have zero area. The gray trapezoids in the middle are in fact triangles of the form T (t 0 , t 1 , x, −σ, x, σ ).
We leave the constant r unspecified for now. The choice of r involves a tradeoff between the critical path and the cache complexity, which we analyze in Sect. 4.2.2.
Time Cut If h > 1 and the parallel space cut is not applicable, procedure walk1 cuts the trapezoid along the horizontal line through the center, as illustrated in Fig. 5 . The recursion first traverses trapezoid
Base Case If h = 1, then T consists of the line of spacetime points (t 0 , x) with x 0 ≤ x < x 1 . The base case visits these points, calling the application-specific procedure kernel for each of them. The traversal order is immaterial because these points are independent of each other.
The work (sequential execution time) of procedure walk1, when traversing a trapezoid, is proportional to the trapezoid's area, i.e., T 1 = (wh) where w is the width of the trapezoid and h is its height. This fact is not completely obvious because the procedure may spawn up to two empty gray trapezoids in case of a space cut, and the procedure needs nonconstant (h) time to execute an empty trapezoid of height h. This additional work is asymptotically negligible, however. Procedure walk1 obeys the bound T 1 (w, h) ≤ 2rT 1 (w/(2r), h) + O(h) in case of a space cut, 0 const int σ ; /* assert(σ >= 1) */ 1 const int r; /* assert(r >= 2) */
if (h >= 1 && x >= 2 * σ * h * r) {/* parallel space cut */ 7 int l = x / r; /* base of a black trapezoid, rounded down */ 8 for (i = 0; i < r -1; ++i) 6 One-dimensional parallel stencil algorithm implemented in the Cilk language. The procedure is parametrized by two integers σ and r, whose meaning is described in the text. In lines 8-10, we spawn r black trapezoids. Because of the rounding of l in line 7, the length of the base of the last trapezoid is not necessarily l, and we handle this trapezoid separately in line 10. The sync statement in line 11 waits for the black trapezoids to complete, before spawning the gray trapezoids in lines [12] [13] [14] [15] and bound T 1 (w, h) ≤ 2T 1 (w, h/2) + O(1) in case of a time cut. One can verify by induction that T 1 (w, h) ≤ c(wh − w − h) holds for some constant c and sufficiently large w and h. Alternatively, one can modify the procedure to test for empty trapezoids at the beginning, thus avoiding this problem altogether, but we prefer to keep the code simple even if the analysis becomes slightly harder.
We conclude this section with the analysis of the critical path length of walk1.
Theorem 7 The critical path of walk1 when visiting trapezoid T is
T ∞ (T ) = O σ rhw 1/ lg(2(r−1)) ,
where h is the height of T , w is its width, σ is the slope of the stencil, and r is the number of black trapezoids created by the procedure in the space-cut case.
Proof To avoid cluttering the proof with the O-notation, assume that a call to the kernel procedure and a spawn cost at most one unit of critical path. Furthermore, let α = 1/ lg(2(r − 1)) for brevity. Because procedure walk1 uses r ≥ 2 to spawn at least two threads in the space cut, we have α ≤ 1. We now prove that
by induction on the area of the trapezoid.
Base Case If h = 1 and 1 ≤ w < 2σ r, then the procedure enters its base case with a critical path T ∞ (h, w) = w ≤ 2σ r ≤ 2σ r(2w α − 1), and (7) holds.
Inductive
Step Otherwise, the procedure recursively cuts the trapezoid into strictly smaller trapezoids for which we assume inductively that (7) holds. Depending on whether the procedure executes a time cut or a parallel space cut, we distinguish two cases.
Time Cut If the procedure executes a time cut, we have
where w 1 and w 2 are the widths of the two trapezoids produced by the cut. By inductive hypothesis, we have
Since α ≤ 1 holds, w α is a concave function of w. By Jensen's inequality, we have w α 1 + w α 2 ≤ 2((w 1 + w 2 )/2) α = 2w α . Consequently, the following inequalities hold:
thereby proving (7) in the time-cut case.
Space Cut If the procedure executes a parallel space cut, it generates at least r − 1 gray trapezoids of width w g = σ h, and r black trapezoids of width w b . The critical path is the sum of the critical paths of one black and one gray trapezoid, plus an additional critical path 2r for spawning the recursive subproblems. Therefore, we have
The sum of the widths of the black trapezoids is at most w − (r − 1)w g , and therefore we have
Consequently, we have
Again by Jensen's inequality, we have
from which we conclude that
Since (7) holds in the base case, in the time-cut case, and in the space-cut case, the theorem follows by induction.
Cache Complexity of the 1D Stencil Algorithm
We now analyze the cache complexity of our parallel stencil procedure walk1. We assume an ideal cache with line size L = (1), because a general line size only complicates the analysis without yielding further insights. The analysis depends on two geometric invariants which we now state.
Lemma 8 (Aspect ratio) If procedure walk1 traverses a trapezoid of height h 0 , then for each subtrapezoid of height h and width w created by the procedure, the invariant h ≥ min(h 0 , w/(4σ (r + 1))) holds, where σ is the slope of the stencil and r is the number of black trapezoids created by the procedure in the space-cut case.
Proof The proof is by induction on the number of cuts required to produce a subtrapezoid. The invariant holds by definition of h 0 at the beginning of the execution. The base case produces no subtrapezoids, and therefore it trivially preserves the invariant. A parallel space cut does not change h and does not increase w, thus preserving the invariant. In the time-cut case, x = x 1 − x 0 ≤ 2σ rh holds by construction of the procedure. Because |ẋ i | ≤ σ , we have w ≤ x + σ h. The time cut produces trapezoids of height h = h/2 and width w ≤ w + σ h ≤ x + 2σ h ≤ 2σ (r + 1)h. Thus, a time cut preserves the invariant, and the lemma is proven. Proof Let W be the maximum integer such that the working set of any trapezoid of width W fits into the cache. We have W = (Z). If w ≤ W , then the procedure incurs O(w) cache misses to read and write the working set once, and the theorem is proven.
If w > W , consider the set of maximal subtrapezoids of width at most W generated by the procedure while traversing T . These trapezoids are generated either by a space cut or by a time cut. Trapezoids generated by a time cut have width w = (W ) and height h = (min(h 0 , w /(σ r))) by Lemma 8. Trapezoids generated by a space cut have width w = (W/r) and height h = (min(h 0 , w /σ ))) by Lemma 9.
In either case, we have h = (min(h 0 , W/(σ r))) = (min(h 0 , Z/(σ r))).
Execution of each maximal subproblem visits w h spacetime points incurring O(w ) cache misses. Hence, the ratio of useful work to cache misses for the execution of the subproblem is h = (min(h 0 , Z/(σ r))). Thus, the same ratio holds for the entire execution of T which, therefore, incurs at most
wh (min(h 0 , Z/(σ r)))
cache misses, from which the theorem follows.
We are now ready to analyze the parallel cache complexity of our cache oblivious stencil algorithm. We first derive the sequential cache complexity of a trapezoid in terms of its area A, which is proportional to the work of the trapezoid. Since the cache complexity turns out to be a concave function of the work, we can then derive the Cilk cache complexity from Corollary 6. Proof Let w be the width of T . We first prove that cache misses with high probability, where α = 1/ lg(2(r − 1)).
Lemma 11 Let procedure walk1 traverse a trapezoid of height h 0 on a single processor with an ideal cache of size Z and line size L = (1). Then each sub-trapezoid T of area
Proof Consider the trace of the execution with the recursive decomposition consisting of all segments corresponding to trapezoids completely executed by the procedure. We identify the length of a segment in the decomposition with the area of the trapezoid. Then, from Lemma 11, the cache complexity of a segment B in the recursive decomposition is bounded by Q(B) ≤ f (|B|), for some nondecreasing concave function f such that
The critical path is T ∞ = O(σ rh 0 w α 0 ), as proven in Theorem 7. The theorem then follows from Corollary 6.
Remark Practical instances of procedure walk1 operate with a constant value σ , and a relatively large constant value r, such that α = 1/ lg(2(r − 1)) = , where is a "small" constant. Then, the cache complexity of procedure walk1 applied to a trapezoid of width and height n is with high probability
Linear System Solver
In this section, we present and analyze a multithreaded cache oblivious linear system solver based upon Gaussian elimination without pivoting followed by back substitution. Unlike the multithreaded cache oblivious algorithm for Gaussian elimination presented by Blumofe et al. [7] , which employs mutually recursive procedures for LU decomposition, triangular solve, and Schur's complement, our algorithm consists of a single recursive procedure. In this respect, our algorithm is similar to the Gaussian Elimination Paradigm of Chowdhury and Ramachandran [12] , but more general because it handles rectangular matrices of arbitrary integral size. Toledo [29] studies the related but harder problem of cache oblivious Gaussian elimination with pivoting.
Description of the Gaussian Elimination
Our solver can handle multiple right-hand sides simultaneously by solving the system CX = B, where C is an N × N matrix and B consists of M column vectors, each corresponding to one right-hand side associated with one of M solution vectors. We store both matrices C and B in a N × (N + M) matrix A by concatenating the columns of B to the columns of C. Both the Gaussian elimination and the back substitution are implemented in-place, and the solution X can be found in A in place of the right-hand side B.
The Gaussian elimination shown in Fig. 7 consists of procedure walk2, which could be seen as variation of the 2D spacetime traversal for stencil computations specialized for σ = 0, and of the kernel procedure gauss. Except for the order in which it visits points in (k, i, j)-space, our algorithm is equivalent to the familiar iterative Gaussian elimination expressed as triply nested loop [20, Sect. 3 
.2]:
for (k=0; k<N -1; k++) for (i=k+1; i<N ; i++) for (j =k; j <N +M; j ++) gauss(k, i, j );
Here, variable k indexes the pivot element, and i and j traverse the submatrix in the lower right corner of A for the update operation. Procedure walk2 in Fig. 7 traverses all points
Unlike in the stencil algorithm, the slopes of the upper bounds are always zero and we do not represent them explicitly. To make procedure walk2 equivalent to the iterative Gaussian elimination, we invoke it as shown in Fig. 8 .
We now argue informally that procedure walk2 is equivalent to the iterative Gaussian elimination. To establish correctness, we must argue that procedure walk2 traverses the same (k, i, j)-space as the iterative procedure, and that the traversal occurs in an order that respects the kernel dependencies.
To see that the (k, i, j)-space is the same as the iterative procedure, observe first that this property holds trivially in the base case in lines [30] [31] [32] [33] and that this property is preserved by the k-cut in lines 25-28 assuming inductively that the property holds Fig. 7 Multithreaded cache oblivious Gaussian elimination for the two recursive calls. The i-cut in lines 15-18 partitions the (k, i, j) into two disjoint subspaces, which are not self-intersecting because of the condition i ≥ 2 * k, which is analogous for the condition for the space cut in the stencil case. A similar argument applies to the j -cut in lines 20-23. To see that the procedure respects the dependencies of the gauss kernel, observe that i ≥ k and j ≥ k hold by construction, and that the kernel requires that point (k, i, j) depend upon points (k − 1, i, j), (k, k, j), and (k, i, k) . The first dependency is enforced by the sync statement in line 27, which, for fixed i and j , guarantees that the procedure visits points (k, i, j) in ascending order of k. The second dependency is enforced by the sync statement in line 17. If i 0 ≥ k 1 , then if (k, i, j) is in the region traversed by the procedure, then (k, k, j) is not in the region. Thus, the second dependency holds vacuously and it is safe to execute the two subproblems in parallel. Otherwise i 0 < k 1 , and the procedure conservatively respects the dependency by traversing the two subproblems in ascending order of i. A symmetric argument holds for the third dependency in the j -cut case.
Work and Critical Path of Gaussian Elimination
We now determine the work, critical path, and sequential cache complexity of procedure walk2. Then we apply Corollary 6 to derive its parallel cache complexity.
Let k = k 1 − k 0 , i = i 1 − i 0 , and j = j 1 − j 0 . We state without proof that the work T 1 of the procedure is
The analysis of the critical path is more involved. One way to attack the problem is to recognize that, depending on the parameters, procedure walk2 computes an LU decomposition, the solution of a triangular linear system, or a Schur's complement, and then apply the analysis from [7] to conclude that the critical path is T ∞ = O(N lg 2 N) when M = 0. Another possibility would be to coerce procedure walk2 into the Parallel Gaussian Elimination Paradigm [12] , leading to the same conclusion when M = 0. However, since these reductions would force us to multiply entities beyond necessity and they would anyway be limited to a special case, we analyze the critical path of procedure walk2 directly in Theorem 14 for general M.
Lemma 13
When procedure walk2 is invoked as in Fig. 8 , these invariants hold:
Proof Invariants 1 and 2 hold by construction, because the initial conditions specify a region of (k, i, j) space such that i ≥ k and j ≥ k.
Invariant 3 trivially holds initially because di 0 = 1. The recursive calls in lines 16, 21, and 23 trivially preserve the invariant. In the recursive call in line 26, we have k 0 + k 2 ≤ k 1 , and the invariant is preserved. In the recursive call in line 28, we have i 0 + di 0 k 2 ≥ i 0 , and the invariant is preserved. We are left to prove that i m ≥ k 1 in line 18. We have 2
where the last inequality follows from Invariant 1.
The proof of Invariant 4 is the same as Invariant 3 after swapping i and j .
Theorem 14
When procedure walk2 is invoked as in Fig. 8 , its critical path is
Proof Recall Iverson's APL notation [22, Sect.
2.1]:
[A] = 1, if A is true, 0, otherwise.
We now prove that, when procedure walk2 is invoked as in Fig. 8 , a constant c ≥ 1 exists such that the critical path is (10) for all sufficiently large values of k, i, and j , where
The proof is by well-founded induction on the triple ( k, i, j ) under the product order: 2, 2) , a constant c ≥ 1 can be found such that (10) holds.
Inductive
Step Otherwise, depending on the relative values of k, i, and j , the procedure recursively cuts one dimension in half. We have therefore three cases.
i-cut (Lines 15-18) We distinguish two subcases, depending upon whether the sync statement in line 16 is executed or not.
If i 0 < k 1 , then the sync statement is executed and the critical path obeys the relation
where the constant 1 accounts for the critical path of the constant number of instructions executed by the procedure excluding the recursive calls. By Lemma 13, i 0 ≥ k 1 holds inside the recursive call in line 18. Thus we have
where the last inequality holds because c ≥ 1.
If i 0 ≥ k 1 , then the sync statement is not executed and the critical path obeys the relation
Thus we have 
where the constant 1 accounts for the critical path of the constant number of instructions executed by the procedure excluding the recursive calls.
Thus we have
where the last inequality holds because c ≥ 1. Because (10) holds in the base case and in all the inductive cases, the theorem is proven.
Cache Complexity of Gaussian Elimination
Lemma 15 (Aspect ratio) If procedure walk2 is invoked with initial parameters k 0 , i 0 , and j 0 , then for all recursive subproblems the following invariants hold:
Proof The invariants are true initially, and the procedure always cuts the dimension corresponding to the largest of 2 k, i, or j . 
cache misses, where V = k i j .
Proof Procedure walk2 visits a subsequence of the (k, i, j)-space visited by the cache oblivious matrix multiplication procedure from [16] , and thus the sequential cache complexity of each subproblem is [16, Theorem 1] N + M) ). Then, the execution of the Gaussian elimination procedure incurs N + M) ). The number of cache misses is proportional to the work associated with the (k, i, j)-space, which is proportional to its volume V . According to Lemma 16 and for L = (1), the sequential cache complexity of each subproblem of the recursive procedure walk2 is a concave function
. Hence, the theorem follows from Corollary 6, with the critical path length given by Theorem 14.
Remark For the degenerate cases M = 0 (LU decomposition without right-hand side) and M = 1 (single right-hand side), the parallel cache complexity of the Gaussian elimination is
Description of the Back-Substitution
Given an upper triangular matrix, we can apply a column-oriented back-substitution [20, Sect. 3.1.3] to solve a linear system of equations. For multiple right-hand sides the iterative, column-oriented back-substitution consists of the triply nested loop:
We implement a multithreaded cache oblivious version of the back-substitution by reusing traversal procedure walk2 of Fig. 7 . The new kernel procedure backsub is shown in Fig. 9 , and the invocation of walk2 to compute the back-substitution is shown in Fig. 10 .
We can reuse procedure walk2 by observing that the back-substitution and the Gaussian elimination traverse similar iteration spaces. The iterative version traverses the space (k, i, j) such that N > k ≥ 0, N ≤ j < N + M, and k > i ≥ 0. As shown in Fig. 10 , we set i 0 = 1 and j 0 = 0, so that procedure walk2 invokes the backsub kernel with parameters (k , i , j) such that 0 ≤ k < N, k ≤ i < N, and N ≤ j < N + M. With the substitutions k = N − 1 − k and i = N − 1 − i in the backsub kernel, the iteration spaces of walk2 and of the iterative routine are the same.
We state without proof that the work T 1 of the back-substitution is
The critical path T ∞ of the back-substitution is similar to that of the Gaussian elimination determined in Theorem 14. 
cache misses with high probability, where
Proof Analogous to the proof of Theorem 17, the theorem is a consequence of Corollary 6 and the critical path length from Theorem 18.
Remark For a single right-hand side, M = 1, the parallel cache complexity of the back-substitution is
Because in this case the algorithm performs (N 2 ) work on (N 2 ) input elements, the cache is asymptotically useless. The algorithm benefits from the cache when solving for multiple right-hand sides, however.
Longest Common Subsequence
In this section, we analyze a multithreaded algorithm for computing the length of a longest common subsequence (LCS) of two sequences. The length computation is a subproblem of the problem of computing a LCS of two sequences, and its textbook solution is a dynamic programming algorithm [13] that could in principle be formulated as a 1D-stencil computation. The stencil formulation is inconvenient, however, if one wishes to use the length computation as a subroutine in Hirschberg's algorithm [23] for computing the LCS. Here, we present a more direct Cilk program for computing the length of the LCS which can easily be incorporated into Hirschberg's algorithm. Our program is derived from the cache oblivious sequential algorithm of Chowdhury and Ramachandran [11] , and its parallelization is based upon [25] .
Description of the Longest Common Subsequence Algorithm
Cilk procedure lcslen in Fig. 11 The work of the length computation is T 1 
= O(MN).
To compute the critical path and cache complexity, we assume for simplicity w.l.o.g. that N ≥ M. The length of the critical path is T ∞ = O(M lg 3 N/M), as can be seen from the following argument. 5 Procedure lcslen splits both dimensions in half until the base case is reached, which occurs when the shorter of the two dimensions is 1. Because M is the shorter dimension by assumption the recursion depth is lg M. We maintain the length values, which the naive algorithm [13] stores in a M × N tableau, in array c of length M + N − 1. Intuitively, array c covers one row and one column of the M × N dynamic programming tableau spanned by the two sequences x and y. In particular, in the initial state, we assume that array c stores length value 0 in each element, and serves as the top row and leftmost column of the tableau. In the final state, array c contains the lengths of subsequences associated with the bottom row and rightmost column of the tableau. The 3-point stencil of the tableau prescribes the data dependencies such that point C(i, j ) depends on points C(i − 1, j − 1), C(i − 1, j), and C(i, j − 1), cf. base case in Fig. 11. 
Cache Complexity of the Longest Common Subsequence Algorithm
Chowdhury and Ramachandran [11] introduced a cache oblivious version of the longest common subsequence computation. They have shown that the sequential ex-ecution of the length computation has cache complexity Q(M, N, Z) = O(MN/Z + M + N), where we assume that the cache line size L = (1). This cache complexity is asymptotically optimal, and their result holds under the assumption of an ideal cache. Their analysis also applies to the sequential execution of procedure lcslen in Fig. 11 .
The multithreaded version of the length computation in Fig. 11 consists of four subproblems corresponding to tableau quadrants Q 0 , Q 1 , Q 2 , and Q 3 . Hence, the recursive computation of the quadrants produces a 4-recursive decomposition of the trace of procedure lcslen, which consists of four complete subtrees of the call tree, each of which incurs O(MN/Z + M + N) cache misses on subproblems of size M × N . The synchronization before and after the computations of quadrants Q 1 and Q 2 constrains the concatenation of the subsegments, yet does not affect the applicability of Corollary 6.
To bound the number of cache misses Q, let w = MN be the length of a segment. Then Q ≤ f (w) for some concave function f (w) ∈ O(w/Z + w 1/2 ). For our 4-recursive decomposition, the cache complexity of the parallel execution of lcslen is according to Corollary 6
where S = O(P M lg 3 N/M)) with high probability.
Remark The sequential cache complexity of procedure lcslen for square problems of size n = M = N is Q(n, Z) = O(n 2 /Z + n), and the Cilk cache complexity is Q P (n, Z) = O n 2 /Z + P n 3.58 ,
if we approximate 2 + lg 3 ≈ 3.58.
Conclusion
We presented a technique for analyzing the cache complexity of multithreaded cache oblivious algorithms on an idealized parallel machine. While our technique yields stronger upper bounds than previously known, our bounds are not optimal. For example, the matrix multiplication procedure in Fig. 1 can be scheduled statically by partitioning the trace into S = P 3/2 segments, each computing a matrix multiplication of size (n/ √ P ) × (n/ √ P ), thereby yielding cache complexity O(n 3 / √ Z + √ P n 2 ), which is lower than the bound O(n 3 / √ Z + √ nP n 2 ) that we derived for the work-stealing scheduler. Similarly, a matrix multiplication procedure that uses temporary arrays [7] has a shorter critical path and can be scheduled so as to incur O(n 3 / √ Z + 3 √ P n 2 ) cache misses. For the one-dimensional stencil computation, we conjecture that a smart scheduler should incur no more than O(n 2 /Z +P n 1+ ) cache misses. Experiments suggest that these three expressions for the cache complexity constitute a more accurate model of the work-stealing scheduler than our upper bounds in (4) and (9), but we lack proof that this is the case.
