For many cache-oblivious algorithms for dynamic programming and linear algebra, we observe that the key factor that affects the cache complexity is the number of input entries involved in each basic computation cell. In this paper, we propose a level of abstraction to capture this property, and refer to it as the k-d grid computation structure. We then show the computational lower bounds for this grid structure, and propose efficient and highly-parallel algorithms to compute such grid structure that optimize the number of arithmetic operations, parallel depth, and the cache complexity in both the classic setting when reads and writes have the same cost, and the asymmetric variant that considers writes to be more expensive than reads.
Introduction
The ideal-cache model [51] is widely used in designing algorithms that optimize the communication between CPU and memory. The model is comprised of an unbounded memory and a cache of size M . Data are transferred between the two levels using cache lines of size B, and all computation occurs on data in the cache. An algorithm is cache-oblivious if it is unaware of both M and B. The goal of designing such algorithms is to reduce the cache complexity 1 (or the I/O cost indistinguishably) of an algorithm, which is the number of cache lines transferred between the cache and the main memory assuming an optimal (offline) cache replacement policy. Sequential cache-oblivious algorithms are flexible and portable, and adapt to all levels of a multi-level memory hierarchy. Such algorithms are well studied [8, 30, 42] . Regarding parallelism, Blelloch et al. [22] suggest that analyzing the depth and sequential cache complexity of an algorithm is sufficient for deriving upper bounds on parallel cache complexity.
In this paper, we focus on a class of cache-oblivious algorithms that have a similar computation structure as matrix multiplication and can be coded up in nested for-loops. Their implementations are based on a divide-and-conquer approach that partitions the ranges of the loops and recurses on the subproblems until the base case is reached. Such algorithms are in the scope of dynamic programming (e.g., the LWS/GAP/RNA/Parenthsis problems) and linear algebra (e.g., matrix multiplication, Gaussian elimination, LU decomposition) [22, 34, 37, 39, 51, 69, 88, 91, 92, 98] .
However, there are several challenges in designing these cache-oblivious algorithms. The first issue is that the computation of these algorithms usually involves complicated data dependencies that are entangled with the divide-and-conquer approach, making the algorithm design and analysis less obvious. As a result, although these algorithms have been studied for 20 years, some of them are suboptimal regarding the sequential cache complexity and/or parallel depth, or are complicated and can be greatly simplified. For example, previous algorithms to solve the GAP problem (i.e., the GAP recurrence) and protein accordion folding shown in [34, 37, 39, 69, 88, 91] have cache complexity Θ(n 3 /B √ M ). However, because of the complication of data dependencies, as we show in this paper, this bound is not-optimal. Some other problems in higher (greater than three) dimensions (e.g., the RNA problem) are even harder, and directly applying the automatic systems in [34, 69] cannot yield algorithms with optimal cache complexity. In short, we lack a general method to analyze the lower and upper bounds of the cache complexity on these problems. Similar situations show up in terms of parallelism as well, and there exist spaces for improvements.
The second and more critical challenge emerges with the arrival of new non-volatile memory (NVM) technologies [62, 67] . NVMs will likely serve as the new main memories in the near future providing strong capacity-to-performance ratios. However, writes are more expensive than reads on NVMs. Roughly speaking, the reason is that writing to memory requires a change to the state of the material, while reading only requires detecting the current state. This trend poses the need to design write-efficient algorithms. However, these classic cache-oblivious algorithms based on divide-and-conquer use asymptotically the same numbers of reads and writes to the main memory, which is inefficient on asymmetric memories. Meanwhile, given the various combinations of computation structures and data dependencies, proving the lower bound and designing the individual algorithm for each specific problem can take significant effort.
To tackle the challenges, in this paper we propose a level of abstraction of the computation in these cacheoblivious algorithms. Previous algorithms are usually designed and analyzed based on the number of nested loops, or the number of the dimensions of which the input and output are stored and organized. However, we observe that the key underlying factor in determining the cache complexity of these computations is the number of input entries involved in each basic computation cell (e.g., two input values for the multiplication in matrix product). This observation and the associated solution is one of the reasons we show improvements of the bounds. We abstract this relationship as k-d grid computation structures (short for the k-d grids). A more formal definition is given Section 3. We then discuss the lower bounds to compute such k-d grids with and without considering the asymmetric cost between writes and reads. Based on the analysis of the lower bounds, we also propose highly-parallelized algorithms with the matching bound to compute a k-d grid assuming no data dependency within it.
We claim that the extra level of abstraction helps us to better understand the difficulties (lower bounds) of some problems. This is because the structures of the algorithms and recurrences in this paper (without considering data dependence) are always equivalent to one or more k-d grids. Hence the lower bounds on k-d grids apply to these algorithms and recurrences, indicating whether the previous algorithms are optimal, and if not, what improvement can be achieved. For write-efficient algorithms, we can also show the minimum weighted reads and writes required. These lower bounds are summarized in Table 1 . We note that the lower bounds shown in this setting are non-trivial, and raised as an unsolved problem in the first papers in the research of asymmetric algorithms [18, 31] .
For upper bounds, the data dependencies between the computations come in. However, we observe that the computations of all cache-oblivious algorithms in this paper can all be abstracted as or decomposed into multiple k-d grids without any local dependencies within each of them. Then by applying the efficient algorithms on k-d grids to these problems, we show better upper bounds as well as parallel depths. For problems we provide lower bounds in this paper, all but one algorithms can be shown optimal. We note that the decomposition is independent of whether the read and write costs are symmetric. Therefore, we will show
Dimension

Problems
Cache Complexity Symmetric Asymmetric that once we propose an optimal algorithm on k-d grids in the asymmetry setting, we automatically improve the cache complexities of all the problems from the classic algorithms that do not consider such asymmetry.
We believe that the framework for analyzing cache-oblivious algorithms based on k-d grids provides a better understanding of these algorithms. In particular, the contributions include:
• We show algorithms with improved symmetric cache complexity on many problems, including the GAP recurrence, protein accordion folding, and RNA recurrence. We show that the previous cache bound O(n 3 /B √ M ) for the GAP recurrence and protein accordion folding is not optimal, and we improve the bound to O(n 2 /B · (n/M + log min{n/ √ M , √ M })) and Θ(n 2 /B · (1 + n/M )) respectively 2 . For RNA recurrence, we show an optimal cache complexity of Θ(n 4 /BM ), which improves the best existing result by Θ(M 3/4 ).
• We show improved, linear parallel depth for cache-oblivious algorithms solving all-pair shortest-paths, Gaussian elimination, triangular system solver, LWS recurrences and protein accordion folding. For some of these problems, the depth bounds are achieved by previous algorithms [44, 89] , but they assumes a much stronger model and the algorithms are also much more complicated (more discussion in Section 2). Our approaches are under the standard nested-parallel model and are much simpler. Our algorithms are not in-place, but in Section 6.1 (before Theorem 6.6) we show that the extra storage can bounded to arbitrarily small (any value no less than the cache size).
• We provide write-efficient cache-oblivious algorithms for all problems we discussed in this paper, including matrix multiplication, many linear algebra algorithms, all-pair shortest-paths, and a number of dynamic programming recurrences. If a write costs ω times more than a read (the formal computational model shown in Section 2), the asymmetric cache complexity is improved by a factor of Θ(ω 1/2 ) or Θ(ω 2/3 ) on each problem compared to the previous results [20] . We also show that this improvement is optimal under certain assumptions (the CBCO paradigm, defined in Section 4.2).
• The analysis framework is concise. In this single paper, we discuss the lower bounds and parallel algorithms on a dozen or so computations and DP recurrences, which can be further applied to dozens 2 The improvement is O( √ M ) from an asymptotic perspective (i.e., n approaching infinity). For smaller range of n that of real-world problems 3 . The results are shown in both settings with or without considering the asymmetric costs between reads and writes.
Preliminaries and Related Work
Ideal-cache model and cache-oblivious algorithms. In modern computer architecture, a memory access is much more expensive compared to an arithmetic operation due to larger latency and limited bandwidth (especially in the parallel setting). To capture the cost of an algorithm on memory access, the ideal-cache model, a widely-used cost model, is a two-level memory model comprised of an unbounded memory and a cache of size M . 4 Data are transferred between the two levels using cache lines of size B, and all computation occurs on data in the cache. The cache complexity (or the I/O cost indistinguishably) of an algorithm is the number of cache lines transferred between the cache and the main memory assuming an optimal (offline) cache replacement policy. An algorithm on this model is cache-oblivious with the additional feature that it is not aware of the value of M and B. In this paper, we refer to this cost as the symmetric cache complexity (as opposed to asymmetric memory as discussed later). Throughout the paper, we assume that the input and output do not fit into the cache since otherwise the problems become trivial.
The nested-parallel model and work-depth analysis. In this paper, the parallel algorithms are within the standard nested-parallel model, which is a computation model and provides easy analysis of the workefficiency and parallelism. In this model, a computation starts and ends with a single root task. Each task has a constant number of registers and runs a standard instruction set from a random access machine, except it has one additional instruction called FORK, which can create two independent tasks one at a time that can be run in parallel. When the two tasks finish, they join back and the computation continues.
A computation can be viewed as a (series-parallel) DAG in the standard way. The cost measures on this model are the work and depth: we say work W to be the total number of operations in this DAG and depth (span) D equals to the longest path in the DAG. Such a computation can be scheduled on the round-synchronous PRAM model with p processors in W/p + O(D) round (time) using the randomized work-stealing scheduler [28] . All algorithms in this paper assume concurrent reads and exclusive writes (CREW). Here we do not distinguish the extra write cost for asymmetric memory on W and D to simplify the description of the results, and we only capture this asymmetry using cache complexity.
Regarding parallel cache complexity, Blelloch et al. [22] suggest that analyzing the depth and sequential cache complexity of an algorithm is sufficient for deriving upper bounds on parallel cache complexity. In particular, let Q 1 be the sequential cache complexity. Then for a p-processor shared-memory machine with private caches (i.e., each processor has its own cache) using a work-stealing scheduler, the total number of cache misses Q p across all processors is at most Q 1 + O(pDM/B) with high probability [1] . For a p-processor shared-memory machine with a shared cache of size M + pBD using a parallel-depth-first (PDF) scheduler, Q p ≤ Q 1 [17] . We can extend these bounds to multi-level hierarchies of private or shared caches, respectively [22] . Parallel and cache-oblivious algorithms for dynamic programming and linear algebra. Dynamic Programming (DP) is an optimization strategy that decomposes a problem into subproblems with optimal substructure. It has been studied for over sixty years [4, 13, 41] . For the problems that we consider in this paper, the parallel DP algorithms were already discussed by a rich literature in the eighties and nighties (e.g., [47, 55, 57, 64, 65, 86] ). Later work not only considers parallelism, but also optimizes symmetric cache complexity (e.g., [22, 34, 35, 37, 39, 44, 51, 69, 87, 88, 89, 91] ). The algorithms in linear algebra that share the similar computation structures (but with different orders in the computation) are also discussed (e.g., [12, 27, 39, 43, 44, 75, 92, 98] ). Problem definitions. Since we are showing many optimal cache-oblivious algorithms, we need formal problem definitions. It is hard to show general lower bounds that any type of computations or operations is allowed. For example, for matrix multiplication on a semiring, the only known lower bound of operations is just Ω(n 3−o(1) ) assuming SETH (more details of fine-grain complexity in [97] ). Here we make no assumptions of the set of the ring, and we consider both operators to be atomic and unable to be decomposed or batched. We borrow the term combinatorial matrix multiplication to indicate this specific problem. Such an algorithm requires Θ(n 3 ) operations on square matrices of size n (n 2 inner products). Regarding dynamic programming in Section 8, we discuss the recurrences rather than the problems, and again each operation is atomic and non-decomposable or batchable. We abstract such operation as a computation cell in Section 3. Algorithms with asymmetric read and write costs. The future of main memory appears to lie in a new wave of non-volatile memory (NVM) technologies that promise persistence, significantly lower energy costs, and higher density than the DRAM technology used in today's main memories [62, 67, 82, 102] . Despite the advantages, a key property of such memory technologies, however, is their asymmetric read-write costs: compared to reads, writes can incur much higher energy, higher latency, lower (per-module) bandwidth, and prone to wear-out [6, 9, 18, 19, 31, 33, 45, 46, 63, 66, 71, 78, 81, 85, 99, 100, 103, 104] . Moreover, because bits are stored in these technologies as at rest "states" of the given material that can be quickly read but require physical change to update, this asymmetry appears fundamental. This motivates the need for write-efficient algorithms that reduce the number of writes compared to existing algorithms. The early research studied such read-write asymmetry on NAND Flash chips [14, 48, 54, 84] and database operators [32, 93, 94] .
Blelloch et al. [15, 18, 19] formally defined and analyzed several sequential and parallel computation models that take asymmetric read-write costs into account. The model Asymmetric RAM (ARAM) extends the two-level memory model and contains a parameter ω, which corresponds to the cost of a write relative to a read to the non-volatile main memory. In this paper, we refer to the asymmetric cache complexity Q as the number of write transfers to the main memory multiplied by ω, plus the number of read transfers. This model captures different system consideration (latency, bandwidth, or energy) by simply plugging in a different value of ω, and also allows algorithms to be analyzed theoretically and practically. Similar scheduling results (upper bounds) on parallel running time and cache complexity are discussed in [15, 19] based on work W , depth D and asymmetric cache complexity Q of an algorithm. Based on this idea, many interesting algorithms and lower bounds are designed and analyzed by various recent works [15, 16, 18, 19, 21, 23, 25, 58, 70] .
In the analysis, we always assume that the input size is much larger than the cache size (which is usually the case in practice). Otherwise, both the upper and the lower bounds on cache complexity also include the term for output-ω times the output size. For simplicity, this term is ignored in the asymptotic analysis.
Carson et al. [31] also discussed algorithms using less writes. Since their results are under some different assumptions, we discuss the difference in the Appendix E. Discussions to previous work. We now discuss several possible confusions of this paper.
In this paper, we define and analyze the k-d grid computation structure. Similar grid structures and the bounds on computing them have been discussed in many previous work (e.g., [2, 10, 11, 34, 37, 39, 68, 69, 91] ). However, we note that the definition in this paper is different from those papers. In the k-d grid, the dimension of the grid is related to the number of entries per basic computation unit (formal definition in Section 3), not the dimension of the input/output arrays. As a result, the readers should not confuse the k-d grid in this paper with the previous grid structures.
Since we believe that the abstraction and definition are new, the proof of lower bounds in Section 4 focuses on the simplest sequential setting with a limited-size cache, which is different from the parallel or distributed settings with infinite-size local memory discussed in previous work (e.g., [2, 10, 11, 68] ). It is an interesting future work to extend the results in this paper to the more complicated settings, especially in the asymmetric setting.
Although the definition of k-d grid is different, in the special case when the number of input entries per basic block is the same as the dimension of input/output arrays, the analysis based on k-d grid provides the same sequential symmetric cache complexity as the previous work [34, 37, 39, 51, 69, 91] . The sequential symmetric versions of these algorithms based on the k-d grid are the same as or similar to the previous algorithms. One of such examples is the matrix multiplication [51] , and we recommend the reader to use it as an instantiation to understand our improved parallel and asymmetric algorithms in Section 5 and 6. This is also the case for other linear algebra algorithms in Section 7.2, and the LWS and Parenthesis recurrences in Section 8.1 and 8.4. For the algorithms with the same bounds as previous, we give another way to describe and analyze them -this is a minor result, which we do not claim it as a contribution of this paper.
Some previous work [44, 89] achieves the same linear depth bounds in several problems discussed in this paper. We note that they assume a much stronger model (i.e., enabling the DAG to be non-nested), so their algorithms either need a specially designed scheduler [44] , or cannot benefit from the scheduling results discussed in this preliminary section. Our algorithms are much simpler and under the nested-parallel model.
The dynamic programming recurrences discussed in this paper have non-local dependencies (definition given in [57] ), and we point out that they are pretty different from the problems like edit distance or stencil computations (e.g., [36, 52, 59, 77] ) that only have local dependencies. We did not consider other types of dynamic programming approaches like rank convergence or hybrid r-way DAC algorithms [38, 79, 80] that cannot guarantee processor-and cache-obliviousness simultaneously.
The k-d Grid Computation Structure
The k-d grid computation structure (short for the k-d grid) is defined as a k-dimensional grid C of size n 1 × n 2 × · · · × n k . Here we consider k to be a small constant greater than 1. This computation requires k − 1 input arrays I 1 , · · · , I k−1 and generate one output array O. Each array has k − 1 dimensions and is the projection of the grid out of one different dimension. Each cell in the grid represents some certain computation that requires k − 1 inputs and generates a temporary value. This temporary value is "added" to the corresponding location in the output array, and we assume this addition is associative. The k − 1 inputs of this cell are the projections of this cell removing each (but not the last) dimensions, and the output is the projection removing the last dimension. They are referred to as the input and output entries of this cell. When computing each cell, the input and output entries must be in the cache. Figure 1 illustrates such a computation in 2 and 3 dimensions.
We refer to a k-d grid computation structure as a square grid computation structure (short for a square grid) of size n if it has size n 1 = · · · = n k = n. More concisely, we say a k-d grid has size n if it is square and of size n.
For instance, when multiplying two matrices of size n-by-n on a semiring (×, +), the computation of the classic cubic algorithm exactly matches the 3d square grid. A corresponding 2d case is when computing a matrix-vector multiplication where the matrix is implicit (i.e., we can query the value of each cell, but the matrix is not stored explicitly). Such applications are commonly seen in dynamic programming algorithms.
The key aspect to decide the dimension of a computation is the number of inputs that each basic cell element requires. For example, when multiplying two dense tensors, although each tensor may have multiple dimensions, each multiplication operation is only based on two entries, so it is a 3d grid. The cache-oblivious algorithms discussed in this paper are based on k-d grids with k = 2 or 3, but we can also find applications with larger k (e.g., a Nim game with some certain rules on multiple piles [29] ). For some dynamic programming and linear algebra algorithms in this paper, the computations can be abstracted as k-d grids, but a constant fraction of the grid cells are empty. We call such a grid as an α-full grid for some constant 0 < α < 1 if at least an α ± o(1) fraction of the cells are non-empty. We will show that all properties we show for a k-d grid also work for the α-full case, since the constant α affects the analysis of neither the lower bounds nor the algorithms.
The Lower Bounds
We first discuss the lower bounds of the cache complexities for a k-d grid computation structure, which sets the target to design the algorithms in the following sections. In Section 4.1 we show the symmetric cache complexity. This is a direct extension of the classic result by Hong and Kong [61] to an arbitrary dimension. Then in Section 4.2 we discuss the asymmetric cache complexity when writes are more expensive than reads, which is more interesting and has involved analyses.
Symmetric Cache Complexity
The symmetric cache complexity of a k-d grid is simple to analyze, yielding to the following result:
Proof. Let's say the computations of n k cells are sequentialized in an array. For blocks of size S = 2 k M k/(k−1) , the minimum number of inputs and outputs of each block is at least 2 k−1 kM ≥ 2M . Since only a total amount of M entries can be held in the cache at the beginning of the computation, the cache complexity to load the input/output and finish the computation for such a block is Ω(M/B). Since there are n k /S = Θ(n k M −k/(k−1) ) of such blocks, the cache complexity of the entire computation is Ω(M/B) · n k /S = Ω(n k /(M 1/(k−1) B)).
Notice that the proof does not assume cache-obliviousness, but the lower bound is asymptotically tight by applying a sequential cache-oblivious algorithm that is based on 2 k -way divide-and-conquer [51] .
Asymmetric Cache Complexity
We now consider the asymmetric cache complexity of a k-d grid computation structure. Unfortunately, this case is significantly harder than the symmetric setting. Again for simplicity we first analyze the square grid of size n, which can be extended to the more general cases similar to [51] .
Interestingly, there is no specific pattern that a cache-oblivious algorithm has to follow. Some existing algorithms use "buffers" to support cache-obliviousness (e.g., [7] ), and many others use a recursive divideand-conquer framework. For the recursive approaches, when the cache complexity of the computation is not leaf-dominated (like various sorting algorithms [18, 51] ), a larger fan-out in the recursion is more preferable (usually set to O( √ n)). Otherwise, when it is leaf-dominated, existing efficient algorithms all pick a constant fan-out in the recursion in order to reach the base case as soon as possible. All problems we discuss in this paper are in this category, so we make our analysis under the following constraints. More discussion about this constraint in given in Section 9.
Definition 1 (CBCO paradigm). We say a divide-and-conquer algorithm is under the constant-branching cache-oblivious (CBCO) paradigm if it has an input-value independent computational DAG, such that each task has constant 5 fan-outs of its successive subtasks until the base cases, and the partition of each task is decided by the (sub)problem size and independent with the cache parameters (M and B).
Notice that ω is a parameter of the main memory, instead of a cache parameter. One can define resourceobliviousness [40] so that the value of ω is not exposed to the algorithms, but this is out of the scope of this paper.
We now prove the (sequential) lower bound on the asymmetric cache complexity of a k-d grid under the CBCO paradigm. The two places that we use this assumption in the proof are the constant branching of the recursion, and "scale-free" of the cache-oblivious algorithms: the structure in the recursive levels around the boundary of cache size is identical.
Proof. We prove the lower bound using the same approach in Section 4.1-analyzing blocks of size S based on the sequence of the operations executed by the algorithm. The cache can hold M entries as temporary space for the computation. For the lower bound, we only consider the computation in each cell without considering the step adding the calculated value back to the output array. Again when computing each cell, the k input and output entries have to be in the cache. For a block of operations of size S, the cache needs to hold the entries in I 1 , · · · , I k−1 and O corresponding to the cells in this sequence at least once during the computation. The number of entries is minimized when the sequence of operations are within a k-d cuboid of size S = a 1 × a 2 × · · · × a k where the projections on I i and O are (k − 1)-d arrays with sizes a 1 × · · · × a i−1 × a i+1 × · · · × a k and a 1 × · · · × a k−1 . Namely, the number of entries is at least S/B · 1/a i for the corresponding input or output array.
Note that the input arrays are symmetric to each other regarding the access cost, but in the asymmetric setting storing the output is more expensive. As a result, the cache complexity is minimized when a 1 = · · · = a k−1 = a, and let's denote a k = ar where r is the ratio between a k and other a i . Here we assume r ≥ 1 since reads are cheaper. Due to the scale-free property that M and n are arbitrary, r should be fixed (within a small constant range) for the entire recursion.
Similar to the analysis in the proof of Theorem 4.1, for a block of size S, the read transfers required by the cache is Ω n k SB · max{a k−1 r − M, 0} , where n k /S is the number of such blocks, and max{a k−1 r − M, 0}/B lower bounds the number of reads per block because at most M entries can be stored in the cache from the previous block. Similarly, the write cost is Ω ωn k SB · max{a k−1 − M, 0} . In total, the cost is:
The second step is due to S = O(a k r).
The cost decreases as the increase of S, but it has two discontinuous points
and this leads to the theorem.
A Matching Upper Bound on Asymmetric Memory
In the sequential and symmetric setting, the classic cache-oblivious divide-and-conquer algorithms to compute the k-d grid (e.g., 3D case in [51] ) is optimal. In the asymmetric setting, the proof of Theorem 4.2 indicates that the classic algorithm is not optimal. The key to improve the cost is the balancing factor r = ω (k−1)/k that leads to more cheap reads and less expensive writes in each sub-computation. We now show that the lower bound in Theorem 4.2 is tight by a (sequential) cache-oblivious algorithm with such asymmetric cache complexity. The algorithm is given in Algorithm 1, which can be viewed as a variant of the classic approach with minor modifications on how to partition the computation. Notice that in line 6 and 10, "conceptually" means the partitions are used for the ease of algorithm description. In practice, we can just pass the ranges of indices of the subtask in the recursion, instead of actually partitioning the arrays.
Compared to the the classic approaches (e.g., [51] ) that partition the largest input dimension among n i , the only underlying difference in the new algorithm is in line 4: when partitioning the dimension not related to the output array O, n k has to be ω (k−1)/k times larger than n 1 , · · · , n k−1 . This modification introduces an asymmetry between the input size and output size of each subtask, which leads to an improvement in the cache efficiency.
For simplicity, we show the asymmetric cache complexity for square arrays (i.e., n 1 = · · · = n k ) and n = Ω(ω (k−1)/k M ), and the general case can be analyzed similar to [51] .
Algorithm 1: ASYM-ALG(I 1 , · · · , I k−1 , O)
Input: k − 1 input arrays I 1 , · · · , I k−1 , read/write asymmetry ω Output: Output array O 1 The computation has size n 1 × n 2 × · · · × n k 2 if I 1 , · · · , I k−1 , O are small enough then 3 Solve the base case and return
Theorem 5.1. Algorithm 1 computes the k-d grid of size n with asymmetric cache complexity
Proof. We separately analyze the numbers of reads and writes required by Algorithm 1. In the sequential execution of Algorithm 1, each subproblem only requires O(1) extra temporary space. Also, our analysis ignores rounding issues since they will not affect the asymptotic bounds. When starting from the square grid at the beginning, the algorithm first partitions in the first k − 1 dimensions (via line 10 to 12) into ω (k−1)/k -by-ω (k−1)/k subproblems (refer to as second-phase subproblems) each with size (n/ω (k−1)/k ) × · · · × (n/ω (k−1)/k ) × n, and then partition k dimensions in turn.
The number of writes of the algorithm W (n) (to array O) follows the recurrences that:
and
where W (n) is the number of writes required by the second-phase subproblems with the size of O being n × · · · × n. The base case is when W (M 1/(k−1) ) = O(M/B). Solving the recurrences gives
We can analyze the reads similarly by defining R(n) and R (n). The recurrences on reads are:
The difference occurs in the base case since the input fits into the cache when n = M 1/(k−1) /ω 1/k . Namely, R (M 1/(k−1) /ω 1/k ) = O(M/B). Therefore, by solving the recurrences, we have R (n/ω
The overall (sequential) asymmetric cache complexity for Algorithm 1 is:
and combining with the lower bound of Theorem 4.2 proves the theorem.
Comparing to the classic approach, the new algorithm improves the asymmetric cache complexity by a factor of O(ω (k−1)/k ), since the classic algorithm requires Θ(n k /(M 1/(k−1) B)) reads and writes. Again here we assume n k−1 is much larger than M . Otherwise, the lower and upper bounds should include Θ(ωn k−1 /B) for just writing down the output O.
Parallelism
We now show the parallelism in computing the k-d grids. We show that the parallel versions of the cacheoblivious algorithms only has polylogarithmic depth, indicating that they are highly parallelized.
The Symmetric Case
We first discuss the classic algorithm on symmetric memory. For a square grid, the algorithm partition the k-dimensions in turn until the base case is reached.
Notice that in every k consecutive partitions, there are no dependencies in k − 1 of them, which we can take full parallelism from them. The only exception is during the partition in the k-th dimension, whereas both subtasks share the same output array O and cause write concurrence. If such two subtasks are sequentialized (like in [51] ), the depth is D(n) = 2D(n/2) + O(1) = O(n).
We now introduce the algorithm with logarithmic depth. As just explained, to avoid the two subtasks from editing the same elements in the output array O, our algorithm works in the following way when partitioning the k-th dimension:
1. Allocating two stack-allocated temporary arrays with the same size of the output array O before the two recursive function calls.
We first analyze the cost of square grids of size n in the symmetric setting, and will discuss the asymmetric setting later. Lemma 6.1. The overall stack space for a subtask of size n is O(n k−1 ).
Proof. When partitioning the output (k-th) dimension, the algorithm allocates and computes two subtasks of size n/2. This leads to the following recurrence:
which solves to S(n) = O(n k−1 ) when k > 2. When k = 2, we can apply the version that only allocates temporary space for one subtask, which changes the constant before S(n/2) to 1, and yields to the same bound as S(n) = O(n). Note that we only need to analyze one of the branches, since the temporary spaces that are not allocated in the direct ancestor of this subtask have already been deallocated, and will be reused for later computations for the current branch.
With the lemma, we have the following corollary: This corollary indicates that this modified parallel algorithm has the same sequential cache complexity since it fits into the cache in the same level as the classic algorithm (the only minor difference is the required cache size increases by a small constant factor). Therefore we can apply the a similar analysis in [51] (k = 3 in the paper) to show the following lemma: Assuming that we can allocate a chunk of memory in constant time, the depth of this approach is simply O(log 2 n)-O(log n) levels of recursion, each with O(log n) depth for the additions [22] .
We have shown the parallel depth and symmetric cache complexity. Therefore, we have the following result for parallel symmetric cache complexity by applying the scheduling theorem in the preliminary section. Lemma 6.1 shows that the extra space required is S 1 = O(n k−1 ) for sequentially running the parallel algorithm. Naïvely the parallel space requirement is pS 1 , which can be very large. We now show a better upper bound for the extra space. Proof. We analyze the amount of space allocated in total for all processors. Lemma 6.1 indicates that if the root of the computation on one processor has the output array of size (n ) k−1 , then the space requirement for this task is O((n ) k−1 ). There can be at most 2 k processors starting with their computations of size n k−1 /2 k−1 , (2 k ) 2 of size n k−1 /(2 k−1 ) 2 , until (2 k ) q processors of size n k−1 /(2 k−1 ) q where q = log 2 k p, when each of the p processors all get a task. This is the most pessimistic case that maximizes the overall space requirement, which is:
which gives the stated bound.
We believe the space requirement for the parallel cache-oblivious algorithm is acceptable since it is asymptotically the same as the most intuitively (non-cache-oblivious) parallel algorithm that partitions the computation into p square subtasks each with size n/p 1/k . In practice nowadays it is easy to fit several terabyte main memory onto a single powerful machine such that the space requirement can usually be satisfied. For example, when k = 2, p = 100 and the main memory can hold 10 12 entries, the grid needs to contain ≈ 10 22 cells to exceed the memory size: such computation can be too big to run on a single shared-memory machine. For k ≥ 3, the extra term p 1/k only affects a small range of input sizes that cannot fit. Even if it falls into this small region, we can always slightly change the algorithm to bound the extra space. We first partition the input dimensions for log 2 p rounds to bound the largest possible output size to be O(n k−1 /p) that one processor can steal (similar to the case discussed in Section 6.2). Then the overall extra space is limited to O(n k−1 ), the same as the input/output size. If needed, the constant in the big-O can also be bounded. Such change will not affect the cache complexity and the depth as long as the main memory size is larger than pM . In practice, it is always several orders of magnitude larger than pM . As a conclusion, we believe the extra space requirement is acceptable in most cases in practice. Throughout the rest of the paper, we focus on the cache complexity and the depth, similar to the previous work [22] .
Combining all results gives the following theorem: Theorem 6.6. There exists a cache-oblivious algorithm for k-d grid computation structure of size n that requires Θ(n k ) work, Θ n k M 1/(k−1) B symmetric cache complexity, O(log 2 n) depth, and O(p 1/k n k−1 ) main memory size.
The Asymmetric Case
Algorithm 1 considers the write-read asymmetry, which involves some minor changes to the classic cacheoblivious algorithm. In terms of parallelism, the changes only affect the order of the partitioning of the k-d grid in the recurrence, but not the parallel version and the analysis in Section 6.1. As a result, the depth of the parallel variant of Algorithm 1 is also O(log 2 n). The extra space requirement is actually decreased, because the asymmetric algorithm has a higher priority in partitioning the input dimensions that does not requires allocation temporary space.
Lemma 6.7. The space requirement of Algorithm 1 on p processors is O(n k−1 (1 + p 1/k /ω (k−1)/k )).
Proof. Algorithm 1 first partition the input dimensions until q = O(ω (k−1) 2 /k ) subtasks are generated. Then the algorithm will partition k dimensions in turn. If p < q, then each processor requires no more than O(n k−1 /q) extra space at any time, so the overall extra space is O(p · n k−1 /q) = O(n). Otherwise, the worst case appears when O(p/q) processors work on each of the subtasks. Based on Lemma 6.5, the extra space is bounded by O((p/q) 1/k · q · n k−1 /q) = O(p 1/k n k−1 /ω (k−1)/k ). Combining the two cases gives the stated bounds. Lemma 6.7 indicates that Algorithm 1 requires extra space no more than the input/output size asymptotically when p = O(ω k−1 ), which should always be true in practice.
The challenge arises in scheduling this computation. The scheduling theorem for the asymmetric case [15] constraints on the non-leaf stack memory to be a constant size. This contradicts the parallel version in Section 6.1. This problem can be fixed based on Lemma 6.1 that upper bounds the overall extra memory on one task. Therefore the stack-allocated array can be moved to the heap space. Once a task is stolen, the first allocation will annotate a chunk of memory with size order of |O| where O is the current output. Then all successive heap-based memory allocation can be simulated on this chunk of memory. In this manner, the stack memory of each node corresponding to a function call is constant, which allows us to apply the scheduling theorem in [15] . Theorem 6.8. Algorithm 1 with input size n requires Θ(n k ) work, Θ n k ω 1/k M 1/(k−1) B asymmetric cache complexity, and O(log 2 n) depth for a k-d grid of size n.
Numerical Algorithms and All-Pair Shortest Paths
In this section we discuss matrix multiplication, Kleene's algorithm on all pair shortest-paths, and some linear algebra algorithms including Strassen algorithm, Gaussian elimination (LU decomposition), and triangular system solver. The common theme in these algorithms is that their computation structures are very similar to that of matrix multiplication, which is a 3d grid. Strassen algorithm is slightly different and introduced separately in Appendix A. Other algorithms are summarized in Section 7.2 and the details are given in Appendix B-D. We show improved asymmetric cache complexity for all problems. For Gaussian elimination and triangular system solver, we show linear depth algorithms in both symmetric and asymmetric settings which are based on the parallel algorithm discussed in Section 6. There exist work-optimal and sublinear depth algorithm for APSP [90] , but we are unaware of how to make it I/O-efficient. Compared to previous linear depth algorithms in [44] , our algorithm is in the classic nested-parallel model and can be scheduled using the work-stealing scheduler. Also, we believe our algorithms are significantly simpler.
Matrix Multiplication
The combinatorial matrix multiplication (definition in Section 2) is one of the simplest cases of the 3d grid. Given a semiring (×, +), in matrix multiplication each cell corresponds to a "×" operation of the two corresponding input values and the "+" operation is associative. Since there are no dependencies between the operations, we can simply apply Theorem 6.6 and 6.8 to get the following result. 
All-Pair Shortest Paths, Gaussian Elimination, and Triangular System Solver
We now discuss the well-known cache-oblivious algorithms to solve all-pair shortest paths (APSP) on a graph, Gaussian elimination (LU decomposition), and triangular system solver. These algorithms share similar computation structures and can usually be discussed together. Chowdhury and Ramachandran [37, 39] categorized matrix multiplication, APSP, and Gaussian Elimination into the Gaussian Elimination Paradigm (GEP) and discussed a unified framework to analyze complexity, parallelism and actual performance. We show how the parallel depth and the asymmetric cache complexity can be improved using the algorithms we just introduced in Section 5 and 6. We discuss the details of these cache-oblivious algorithms in the appendix. The common theme in these algorithms is that, the computation takes one or two square matrix(ces) of size n × n as input, applies n 3 operations, and generates output as a square matrix of size n × n. Each output entry is computed by an inner product of one column and one row of either the input matrices or the output matrix in some intermediate state. Namely, the output A i,j requires input B i,k and C k,j for 1 ≤ k ≤ n (A, B and C may or may not be the same matrix). Therefore, we can apply the results of 3d grids on these problems. 6 Note that some of the grids are full (e.g., Kleene's algorithm) while others are not, but they are all α-full and contain O(n 3 ) operations.
The data dependencies in these algorithms are quite different from each other, but the recursions for cache complexity Q(n) and depth D(n) for APSP, Gaussian elimination and triangular system solver are all in the following form:
where Q 3C (n) and D 3C (n) are the cache complexity and depth of a 3d grid of size n. Here, as long as the the recursive subtask fits into the cache together with the 3d grid computation and the β, γ and δ are constants and satisfy β < 8, we can show the following bounds. 
Dynamic Programming Recurrences
In this section we discuss a number of new results on dynamic programming (DP). To show interesting lower and upper bounds on parallelism and cache efficiency in either symmetric and asymmetric setting, we focus on the specific DP recurrences instead of the problems. We assume the operations in the recurrences are atomic, and not decomposable or batchable.
The goal of this section is to show how the DP recurrences can be viewed as and decomposed into the k-d grids. Then the lower and upper bounds discussed in Section 4 and 5, as well as the analysis of parallelism in Section 6, can be easily applied to the computation of these DP recurrences. When the dimension of the input/output is the same as the number of entries in each grid cell, then the sequential and symmetric versions of the algorithms in this section are the same as the existing ones discussed in [34, 37, 39, 51, 91] , but the others are new. Also, the asymmetric versions and most parallel versions are new. We improve the existing results on symmetric/asymmetric cache complexity, as well as parallel depth. Symmetric cache complexity. We show improved algorithms on a number of problems when the number of entries per cell differs from the dimension of input/output arrays, including the GAP recurrence, protein accordion folding, and RNA recurrence. We show that the previous cache bound O(n 3 /B √ M ) for the GAP recurrence and protein accordion folding is not optimal, and we improve the bounds in Theorem 8.2 and 8.3. For RNA recurrence, we show an optimal cache complexity of Θ(n 4 /BM ) in Theorem 8.2, which improves the best existing result by O(M 3/4 ). Asymmetric cache complexity. By replacing the computation of the k-d grid using the asymmetric version discussed in Section 5, we show a uniform approach to provide write-efficient algorithms for all problems in this section. We also prove the optimality of all these algorithms expect for the GAP recurrence. Parallelism. The parallelism of these algorithms is provided by using the parallel algorithms discussed in Section 6 as the underlying building blocks for computing the DP recurrences. We can achieve polylogarithmic depth in computing the 2-knapsack recurrence, and linear depth in LWS recurrence and protein accordion folding. The linear depth for LWS can be achieved by previous work [44, 89] , but they are not in the nested parallel model and does not have the guarantee by the randomized work-stealing scheduler. Meanwhile, our algorithms are arguably simpler.
The LWS Recurrence
The LWS (least-weighted subsequence) recurrence [60] is one of the most commonly-used DP recurrences in practice. Given a real-valued function w(i, j) for integers 0 ≤ i < j ≤ n and D 0 , for 1 ≤ j ≤ n,
This recurrence is widely used as a textbook algorithm to compute optimal 1D clustering [73] , line breaking [74] , longest increasing sequence, minimum height B-tree, and many other practical algorithms in molecular biology and geology [56, 57] , computational geometry problems [3] , and more applications in [76] . Here we assume that w(i, j) can be computed in constant work based on a constant size of input associated to i and j, which is true for all these applications. Although different special properties of the weight function w can lead to specific optimizations, the study of recurrence itself is interesting, especially regarding cache efficiency and parallelism.
We note that the computation structure (without considering dependencies) of this recurrence is a standard 2d grid. Each cell requires the input entry of D i , computes D i + w(i, j) and updates D j as the output entry, so Theorem 4.1 and 4.2 show lower bounds on cache complexity on this recurrence (the grid is (1/2)-full).
We now introduce cache-oblivious implementation considering the data dependencies. Recent work by Chowdhury and Ramachandran [37] solves the recurrence with O(n 2 ) work and O(n 2 /BM ) symmetric cache complexity. The algorithm is simply a divide-and-conquer approach and we describe and extend it based on k-d grids. A task of range (p, q) computes the cells (i, j) such that p ≤ i < j ≤ q. To compute it, the algorithm generates two equal-size subtasks (p, r) and (r + 1, q) where r = (p + q)/2, solves the first subtask (p, r) recursively, then computes the cells corresponding to w(i, j) for p ≤ i ≤ r < j ≤ q, and lastly solves the subtask (r + 1, q) recursively. Note that the middle step exactly matches a 2d grid with no dependencies between the cells, which can be directly solved using the algorithms in Section 5. This leads to the cache complexity and depth to be: Q(n) = 2Q(n/2) + Q 2C (n/2) D(n) = 2D(n/2) + D 2C (n/2) Here 2C denotes the computation of a 2d grid. The recurrence is root-dominated and D(1) = 1. This solves to the following theorem.
Theorem 8.1. The LWS recurrence can be computed in Θ(n 2 ) work, Θ n 2 BM and Θ ω 1/2 n 2 BM optimal symmetric and asymmetric cache complexity respectively, and O(n) depth.
The GAP Recurrence
The GAP problem [55, 57] is a generalization of the edit distance problem that has many applications in molecular biology, geology, and speech recognition. Given a source string X and a target string Y , we can apply a sequence of consecutive deletes corresponds to a gap in X, and a sequence of consecutive inserts corresponds to a gap in Y . For simplicity here we assume both strings with length n, but the algorithms and analyses can easily adapt to the more general case. Since the cost of such a gap is not necessarily equal to the sum of the costs of each individual deletion (or insertion) in that gap, we define w(p, q) (0 ≤ p < q ≤ n) to be the cost of deleting the substring of X from (p + 1)-th to q-th character, and w (p, q) for inserting the substring of Y accordingly. Let D i,j be the minimum cost for such transformation from the prefix of X with i characters to the prefix of Y with j characters, the recurrence for i, j > 0 is:
corresponding to either inserting or deleting a substring. The boundary is set to be D 0,0 = 0, D 0,j = w(0, j) and D i,0 = w (0, i). The diagonal dependency from D i−1,j−1 can be added if required, without affecting the asymptotic analysis. The best existing algorithms on GAP Recurrence [37, 88] have symmetric cache complexity of O(n 3 /B √ M ). This bound seems to be reasonable, since in order to compute D i,j , we need the input of two vectors D i,q and D p,j , which is similar to matrix multiplication and other algorithms in Section 7. However, as indicated in this paper, each update in GAP only requires one entry, while matrix multiplication has two. Therefore, if we ignore the data dependencies, the first line of the GAP recurrence can be viewed as n LWS recurrences, independent of the dimension of i (similarly for the second line). This derives a lower bound on cache complexity to be that of an LWS recurrence multiplied by 2n, which is Ω(n 3 /BM ) (assuming n > M ). Hence, the gap is Θ( √ M ) between the lower and upper bounds. We now discuss an I/O-efficient algorithm to reduce this gap. Unfortunately, the algorithm is not optimal, and we leave it as an open problem. Chowdhury and Ramachandran's approach [37] is based on divideand-conquer to compute output D. The algorithm recursively partitions D into four equal-size quadrants D 00 , D 01 , D 10 and D 11 , and starts to compute D 00 recursively. After this is done, it uses the computed value in D 00 to update D 01 and D 10 using the recurrence. This step can be considered to compute 2 × n /2 LWS recurrences (with no data dependencies) each with size n /2 (assuming D has size n × n ). Then the algorithm computes D 01 and D 10 within their own ranges. After that, it updates D 11 using the results from D 01 and D 10 , and solves D 11 recursively at the end.
Our modified version reorganizes the data layout and the order of computation to take advantage of our I/O-efficient and parallel algorithm on 2d grids. Since the GAP recurrence has two independent sections in different directions, we keep two copies of D, one organized in column major and the other in row major. Then when computing on the inter-quadrant updates (e.g., from D 00 to D 01 ) we start n /2 parallel tasks, each to compute a 2d grid on the corresponding row or column, taking the input and output with the correct representation. This update takes the work and cache complexity shown in Theorem 8.1. We also need to keep the consistency of the two copies. After the update of a quadrant D 01 or D 10 is finished, we apply a matrix transpose [22] to update the other copy of this quadrant, and the cost of the transpose is a lower-order term. For the quadrant D 11 , we wait until the two updates from D 01 and D 10 finish, and then apply the matrix transpose to update the values in each other. It is easy to check that by induction, the values in both copies in a quadrant are update-to-date after each recursion.
The updated algorithm still requires Θ(n 3 ) work since it does not require extra asymptotic work. The cache complexity and depth satisfy: Q(n) = 4Q(n/2) + 4n · Q 2C (n/2) D(n) = 3D(n/2) + 2D 2C (n/2) The base cases are Q( Protein accordion folding. The recurrence for protein accordion folding [91] is D i,j = max 1≤k<j−1 {D j−1,k + w(i, j, k)} for 1 ≤ j < i ≤ n, with O(n 2 /B) cost to precompute w(i, j, k). Although there are some minor differences, from the perspective of the computation structure, the recurrence can basically be viewed as only containing the first section of the GAP recurrence. As a result, the same lower bounds of GAP can also apply to this recurrence.
In terms of the algorithm, we can compute n 2d grids with the increasing order of j from 1 to n, such that the input are D j−1,k for 1 ≤ k < j − 1 and the output are D i,j for j < i ≤ n. For short, we refer to a 2d grid as a task. However, the input and output arrays are in different dimensions. To handle it, we use the similar method as the GAP algorithm that keeps two separate matrices, one in column-major and one in row-major. They are used separately to provide the input and output for the 2d grid. We apply the transpose in a divide-and-conquer manner: once the first half of the tasks finish, we transpose all computed values from the output matrix to the input matrix (which is a square matrix), and then compute the second half of the task. Both matrix transposes in the first and second halves are applied recursively with geometrically decreasing sizes. The correctness of this algorithm can be verified by checking the data dependencies so that all required values are computed and moved to the correct positions before they are used for further computations.
The cache complexity is from two subroutines: the computations of 2d grids and matrix transpose. The cost of 2d grids is simply upper bounded by n times the cost of each task, which is O(n 2 /B · (1 + n/M )) and O(n 2 /B · (ω + ω 1/2 n/M )) for symmetric and asymmetric cache complexity, and O(n log 2 n) depth. For matrix transpose, the cost can be verified in the following recursions. The cache bounds in both symmetric and asymmetric cases are optimal with respect to the recurrence.
RNA Recurrence
The RNA problem [57] is a generalization of the GAP problem. In this problem a weight function w(p, q, i, j) is given, which is the cost to delete the substring of X from (p + 1)-th to i-th character and insert the substring of Y from (q + 1)-th to j-th character. Similar to GAP, let D i,j be the minimum cost for such transformation from the prefix of X with i characters to the prefix of Y with j characters, the recurrence for i, j > 0 is:
with the boundary values D 0,0 , D 0,j and D i,0 . This recurrence is widely used in computational biology, like to compute the secondary structure of RNA [96] . While the cache complexity of this recurrence seems to be hard to analysis in previous papers, it fits into the framework of this paper straightforwardly. Since each computation in the recurrence only requires one input value, the whole recurrence can be viewed as a 2d grid, with both the input and output as D. The 2d grid contains a constant fraction of the cells, so we can apply the lower bounds in Section 5 here.
Again for a matching upper bound, we need to consider the data dependency. We can apply the similar technique in GAP algorithm to partition the output D into four quadrants, compute D 00 , then D 01 and D 10 , and finally D 11 . Each inter-quadrant update corresponds to a 1/2-full 2d grid. Here maintaining two copies of the array is not necessary with the tall-cache assumption M = Ω(B 2 ). Applying the similar analysis in GAP gives the following result: 
Parenthesis Recurrence
The Parenthesis recurrence solves the following problem: given a linear sequence of objects, an associative binary operation on those objects, and the cost of performing that operation on any two given (consecutive) objects (as well as all partial results), the goal is to compute the min-cost way to group the objects by applying the operations over the sequence. Let D i,j be the minimum cost to merge the objects indexed from i + 1 to j (1-based), the recurrence for 0 ≤ i < j ≤ n is:
where w(i, k, j) is the cost to merge the two partial results of objects indexed from i + 1 to k and those from k + 1 to j. Here the cost function is only decided by a constant-size input associated to indices i, j and k. D i,i+1 is initialized, usually as 0. The applications of this recurrence include the matrix chain product, construction of optimal binary search trees, triangulation of polygons, and many others shown in [41, 56, 57, 101] . The computation of this recurrence (without considering dependencies) is a (1/3)-full 3d grid, which has the same lower bound shown in Corollary 7.1.
The divide-and-conquer algorithm that computes this recurrence is usually hard to describe (e.g., it takes several pages in [34, 69] although they also describe their systems simultaneously). We claim that under the view of our k-d grids, this algorithm is conceptually as simple as the other algorithms. Again this divide-and-conquer algorithm partitions the state D into quadrants, but at this time one of them (D 10 ) is empty since D i,j does not make sense when i > j. The quadrant D 01 depends on the other two. The algorithm first recursively computes D 00 and D 11 , then updates D 01 using the computed values in D 00 and D 11 , and finally recursively computes D 01 . Here D 01 is square, so the recursive computation of D 01 is almost identical to that in RNA or GAP recurrence (although the labeling of the quadrants is slightly changed): breaking a subtask into four quadrants, recursively solving each of them in the correct order while applying inter-quadrant updates in the middle. The only difference is when the inter-quadrant updates are processed, each update requires two values, one in D 01 and another in D 00 or D 11 . This is the reason that Parenthesis is 3d while RNA and GAP are 2d. The correctness of this algorithm can be shown inductively. 
2-Knapsack Recurrence
Given A i and B i for 0 ≤ i ≤ n, the 2-knapsack recurrence computes
The cost function w(j, i − j, i) relies on constant input values related on indices i, i − j and j.
To the best of our knowledge, this recurrence is first discussed in this paper. We name is the "2-knapsack recurrence" since it can be interpreted as the process of finding the optimal strategy in merging two knapsacks, given the optimal local arrangement of each knapsack stored in A and B. Although this recurrence seems trivial, the computation structure of this recurrence actually forms some more complicated DP recurrence. For example, many problems on trees 7 can be solved using dynamic programming, such that the computation essentially applies the 2-knapsack recurrence a hierarchical (bottom-up) manner. We start by analyzing the lower bound on cache complexity of the 2-knapsack recurrence. The computational grid has two dimensions, corresponding to i and j in the recurrence. If we ignore B in the recurrence, then the recurrence is identical to LWS (with no data dependencies), so we can apply the lower bounds in Section 8.1 here.
Note that each update requires two input values A j and B i−j , but they are not independent. When computing a subtask that corresponding to (i, j) ∈ [i 0 , i 0 + n i ] × [j 0 , j 0 + n j ], the projection sizes on input and output arrays A, B and D are no more than n j , n i + n j and n i . This indicates that the computation of this recurrence is a variant of 2d grid, so we can use the same algorithm discussed in Section 5.
Corollary 8.6. 2-knapsack recurrence can be computed using O(n 2 ) work, optimal symmetric and asymmetric cache complexity of Θ n 2 BM and Θ ω 1/2 n 2 BM , and O(log 2 n) depth.
Conclusion and Future Work
In this paper, we abstract the k-d grid as a basic block to analyze the computation structure of many classic cache-oblivious algorithms. This abstraction provides a more simple and intuitive framework on better understanding the actual computation of these algorithms, proving lower bounds, and designing algorithms that are both I/O-efficient and highly parallelized. It also provides a unified framework to bound the asymmetric cache complexity of these algorithms. Because of the new perspective to review these algorithms, we already provide many new results, but we also observe many new open problems. Among them are:
1. The only non-optimal algorithm regarding cache complexity in this paper is for the GAP recurrence.
The I/O cost has an additional term of O((n 2 log M )/B). Although in practice this term will not dominate the running time (the computation has O(n 3 ) arithmetic operations), it is theoretically interesting to know if we can get rid of this term (even without the constraints of being cache-oblivious or based on divide-and-conquer).
2. We show our algorithms in the asymmetric setting are optimal under the assumption of constantbranching (the CBCO paradigm). Since the cache-oblivious algorithms discussed in this paper are leaf-dominate, we believe this assumption is always true. We wonder if this assumption is necessary (i.e., if there exists a proof without using it, or if there are cache-oblivious algorithms on these problems with non-constant branching but still I/O-optimal).
3. The parallel symmetric cache complexity Q p on p processors is Q 1 + O(pDM/B), which is a loose upper bound when D is large. Although it might be hard to improve this bound on any general computation under randomized work-stealing, it can be a good direction to show tighter bounds on more regular computation structures like the k-d grids or other divide-and-conquer algorithms. We conjecture that the additive term can be shown as optimal (i.e., O(pM/B)) for the k-d grid computation structures. An all-pair shortest-paths (APSP) problem takes a (usually directed) graph G = (V, E) (with no negative cycles) as input. Here we discuss the Kleene's algorithm (first mentioned in [49, 53, 72, 83] , discussed in full details in [5] ). Kleene's algorithm has the same computational DAG as Floyd-Washall algorithm [50, 95] , but it is described in a divide-and-conquer approach, which is already I/O-efficient, cache-oblivious and highly parallelized.
The pseudocode of Kleene's algorithm is provided in Algorithm 2. The matrix A is partitioned into 4 submatrices indexed as A 00 A 01 A 10 A 11 . The matrix multiplication is defined in a closed semi-ring with (+, min).
Kleene's algorithm is a divide-and-conquer algorithm to compute APSP. Its high-level idea is to first compute the APSP between the first half of the vertices only using the paths between these vertices. Then by applying some matrix multiplication we update the shortest-paths between the second half of the vertices using the computed distances from the first half. We then apply another recursive subtask on the second half vertices. The computed distances are finalized, and we use them to again update the shortest-paths from the first-half vertices.
The asymmetric cache complexity Q(n) of this algorithm follows the recursion of: Q(n) = 2Q(n/2) + 6Q 3C (n/2) D(n) = 2D(n/2) + 2D 3C (n/2)
Considering this cost, the recursion is root-dominated, which indicates that computing all-pair shortest paths of a graph has the same upper bound on cache complexity as matrix multiplication.
C Gaussian Elimination
Gaussian elimination (without pivoting) is used in solving of systems of linear equations and computing LU decomposition of symmetric positive-definite or diagonally dominant real matrices. Given a linear system AX = b, the algorithm proceeds in two phases. The first phase modifies A into an upper triangular matrix (updates B accordingly), which is discussed in this section. The second phase solves the values of the variables using back substitution, which is shown in Section D. The process of Gaussian elimination can be viewed as a three nested-loops and computing the value of A i,j requires A i,k , A k,j and A k,k for all 1 ≤ k < i. If required, the corresponding value of the LU decomposition matrix can be computed simultaneously. The underlying idea of divide-and-conquer approach is almost identical to Kleene's algorithm, which partitions A into four quadrants A 00 A 01 A 10 A 11 . The algorithm:
(1) recursively computes A 00 ; (2) updates A 10 and A 01 using A 00 ; (3) updates A 11 using A 10 and A 01 ; and (4) recursively computes A 11 . Note that each inter-quadrant update in step (2) and (3) is a 3d grid, which gives the following recurrence: Q(n) = 2Q(n/2) + 4Q 3C (n/2) D(n) = 2D(n/2) + 3D 3C (n/2)
D Triangular System Solver
A Triangular System Solver computes the back substitution step in solving the linear system. Here we assume that it takes as input a lower triangular n × n matrix T (can be computed using the algorithm discussed in Section C) and a square matrix B and outputs a square matrix X such that T X = B. A triangular system can be recursively decomposed as:
B 00 B 01 B 10 B 11 = T 00 0 T 10 T 11 X 00 X 01 X 10 X 11 = T 00 X 00 T 00 X 01 T 10 X 00 + T 11 X 10 T 10 X 01 + T 11 X 11 such that four equally sized subquadrants X 00 , X 01 , X 10 , and X 11 can be solved recursively. In terms of parallelism, the two subtasks of X 00 and X 01 are independent, and need to be solved prior to the other independent subtasks X 10 , and X 11 .
The asymmetric cache complexity Q(n) of this algorithm follows the recursion of: Q(n) = 4Q(n/2) + 2Q 3C (n/2) D(n) = 2D(n/2) + D 3C (n/2)
E Discussions to Carson et al. [31]
Carson et al. also discussed algorithms using less writes in the paper [31] . They concluded that many cacheoblivious algorithms like matrix multiplication could not be write-avoiding. Their definition of write-avoiding is different from our write-efficiency, and it requires the algorithm to reduce writes without asymptotically increasing reads. Hence, their negative conclusion does not contradict the result in this paper. We now show an example that optimal number of reads leads to worse overall asymmetric cache complexity. Let's say a write is n times more expansive than a read. A smart cache-oblivious matrix multiplication algorithm should apply n 2 inner products, and the asymmetric cache complexity is O(n 3 /B): O(n/B) reads and O(1/B) writes per inner product. However, the algorithms using optimal number of reads requires O(n 3 /B √ M ) reads and writes, so the overall cache complexity is O(n · n 3 /B √ M ). The algorithm requiring more reads is a factor of O(n/ √ M ) better on the asymmetric cache complexity. Notice that we always assume n = ω( √ M ) since otherwise the whole computation is trivially in the cache and has no cost. As a result, an algorithm with a good asymmetric cache complexity does not always need to be write-avoiding.
The algorithms on asymmetric memory in this paper all require extra reads, but can greatly reduce the overall asymmetric cache complexity compared to the previous cache-oblivious algorithms. The goal of this paper is to find the optimal cache-oblivious algorithms for any given write-read asymmetry ω.
