In several emerging technologies for computer memory (main memory) the cost of reading is significantly cheaper than the cost of writing. Such asymmetry in memory costs poses a fundamentally different model from the RAM for algorithm design. In this paper we study lower and upper bounds for various problems under such asymmetric read and write costs. We consider both the case in which all but O(1) memory has asymmetric cost, and the case of a small cache of symmetric memory. We model both cases using the (M, ω)-ARAM, in which there is a small (symmetric) memory of size M , a large unbounded (asymmetric) memory, both random access, and the cost of reading from the large memory is unit, but the cost of writing is ω ≥ 1.
Introduction
Fifty years of algorithms research has focused on settings in which reads and writes (to memory) have similar cost. But what if reads and writes to memory have significantly different costs? How would that impact algorithm design? What new techniques are useful for trading-off doing more cheaper operations (say more reads) in order to do fewer expensive operations (say fewer writes)? What are the fundamental limitations on such trade-offs (lower bounds)? What well-known equivalences for traditional memory fail to hold for asymmetric memories?
Such questions are coming to the fore with the arrival of new memory technologies, such as phasechange memory (PCM), spin-torque transfer magnetic RAM (STT-RAM), and memristor-based resistive RAM (ReRAM), which have the property that the cost of reading is significantly cheaper than the cost of writing [4, 5, 7, 16, 17, 27, 28, 30, 38, 44] . Reads are an order of magnitude or more lower energy, lower latency, and higher (per-module) bandwidth than writes, and this gap appears to be fundamental to this new wave of non-volatile memory technologies [7] . Such memories are starting to appear in the marketplace [29] , and are projected to become the dominant main memory within the decade [33, 46] . This paper provides a first step towards answering these fundamental questions about asymmetric memories. We introduce a simple model for studying such memories, and a number of new results. In the simplest model we consider, there is an asymmetric random-access memory such that reads cost 1 and writes cost ω 1, as well as a constant number of symmetric "registers" that can be read or written at unit cost. More generally, we consider settings in which the amount of symmetric memory is M n, where n is the input size: We define the (M, ω)-Asymmetric RAM (ARAM), comprised of a symmetric small-memory of size M , an asymmetric large-memory of unbounded size, and an integer write cost ω. The ARAM cost Q is the number of reads from large-memory plus ω times the number of writes to large-memory. The time T is Q plus the number of reads and writes to small-memory.
We present a number of lower and upper bounds for the (M, ω)-ARAM, as summarized in Table 1 . These results consider a number of fundamental problems and demonstrate how the asymptotic algorithm costs decrease as a function of M , e.g., polynomially, logarithmically, or not at all.
For FFT we show an Ω(ω(n log ωM n)) lower bound on the ARAM cost, and a matching upper bound. Thus, even allowing for redundant (re)computation of nodes (to save writes), it is only possible to achieve asymptotic improvements with cheaper reads when ω > M . Prior lower bound approaches for FFTs for symmetric memory fail to carry over to asymmetric memory, so a new lower bound technique is required. We use an interesting new accounting argument for fractionally assigning a unit weight for each node of the network to subcomputations that each have cost ωM . The assignment shows that each subcomputation has on average at most M log(ωM ) weight assigned to it, and hence the total cost across all Θ(n log n) nodes yields the lower bound.
For sorting, we show the surprising result that on asymmetric memories, comparison sorting is asymptotically faster than sorting networks. This is in contrast with the RAM model (and parallel models such as the PRAM, I/O models, etc.), in which the asymptotic costs are the same! The lower bound leverages the same key partitioning lemma as in the FFT proof.
We present a tight lower bound for DAG computation on diamond DAGs that shows there is no asymptotic advantage of cheaper reads. On the other hand, we also show that allowing a vertex to be "partially" computed before all its immediate predecessors have been computed (thereby violating a DAG computation rule), we can beat the lower bound and show asymptotic advantage. Specifically, for both the longest common subsequence and edit distance problems (normally thought of as diamond DAG computations), we devise a new "path sketch" technique that leverages partial aggregation on the DAG vertices. Again we know of no other models in which such techniques are needed. problem ARAM cost Q(n) or Q(n, m) time T (n) or T (n, m) section FFT Θ(ωn log n/ log(ωM )) † Θ(Q(n) + n log n) 3, 4 sorting networks Ω(ωn log n/ log(ωM )) † Ω(Q(n) + n log n) 3 sorting (comparison) O(n(log n + ω)) Θ(n(log n + ω)) 4, [7] diamond DAG Θ(n 2 ω/M ) Θ(Q(n) + n 2 ) 3 longest common subsequence, O(n 2 ω/ min(ω 1/3 M, M 3/2 )) † O(n 2 (1+ 5 edit distance ω/ min(ω 1/3 M 2/3 , M 3/2 ))) † search tree, priority queue O(ω + log n) per update O(ω + log n) per update 4 planar convex hull, triangulation O(n(log n + ω)) Θ(n(log n + ω)) 4 BFS, DFS, topological sort, Θ(ωn + m) Θ(ωn + m) 4 biconnected components, SCC all-pairs shortest-path O(n 2 (ω + n/ √ M )) O(Q(n) + n 3 ) 4 single-source shortest path O(min(n(ω + m/M ), ω(m O(Q(n, m) + n log n) 6 +n log n), m(ω + log n))) † minimum spanning tree O(m min(log n, n/M ) + ωn) † O(Q(n, m) + n log n) 7
Finally, we show how to adapt Dijkstra's single-source shortest paths algorithm using phases so that the priority queue is kept in small-memory, and how to adapt Borůvka's minimum spanning tree algorithm to reduce the number of shortcuts and hence writes that are needed. A common theme in many of our algorithms is that they use redundant computations and require a tradeoff between reads and writes. Related Work. Prior work [6, 18, 22, 34, 35, 43] has looked at read-write asymmetries in the context of NAND flash memory, but this work has focused on (i) the fact that in NAND flash chips, bits can only be cleared by incurring the overhead of erasing a large block of memory, and/or (ii) avoiding individual cell wear out due to too many writes to the cell. Emerging memories, in contrast, can write arbitrary bytes in-place and system software can use the virtual-to-physical mapping to balance application writes across individual physical cells. Other prior work [9, 43, 42] has looked at algorithms for database query processing under asymmetric read-write costs in emerging memories, or at other systems considerations [10, 27, 32, 45, 47, 48] . Our recent paper [7] presented models for a variety of settings (sequential, parallel, external memory) under asymmetric read and write costs, and write-efficient algorithms for sorting, FFT and matrix multiplication. It neither presented lower bounds nor considered any of the other problems studied here. Finally, Carson et al. [8] recently presented interesting upper and lower bounds for various linear algebra problems and direct Nbody methods under asymmetric read and write costs, but restricted to the class of "communication-avoiding" algorithms, i.e., algorithms that minimize the (unweighted) sum of reads and writes, or to "bounded data reuse" algorithms in which each input or computed value is used only a constant number of times. For example, they show that under the bounded data reuse assumption, the number of writes to large-memory for FFT is asymptotically the same as the sum of the reads and writes to large-memory.
Model and Preliminaries
We analyze algorithms in an (M, ω)-ARAM. In the model we assume a symmetric small-memory of size M ≥ 1, an asymmetric large-memory of unbounded size, and an integer write cost ω ≥ 1. We assume standard random access machine (RAM) instructions. We consider two cost measures for computations in the model. We define the (asymmetric) ARAM cost Q as the total number of reads from large-memory plus ω times the number of writes to large-memory. We define the (asymmetric) time T as the ARAM cost plus the number of reads from and writes to small-memory. 1 Because all instructions are from memory, this includes any cost of computation. In the paper we present results for both cost measures.
The model contrasts with the widely-studied external-memory model [1] in the asymmetry of the read and write costs. Also for simplicity in this paper we do not partition the memory into blocks of size B. Another difference is that the asymmetry implies that even the case of M = O(1) (studied in [7] for sorting) is interesting.
We use the term value to refer to an object that fits in one word (location) of the memory. For simplicity we assume values cannot be split. We assume words are of size Θ(log n) for input size n. The size M is the number of words in small-memory. All logarithms are base 2 unless otherwise noted. The DAG computation problem is given a DAG and a value for each of its input vertices (in degree = 0), compute the value for each of its output vertices (out degree = 0). The value of any non-input vertex can be computed in unit time given the value of all its immediate predecessors. The DAG computation problem can be modeled as a pebbling game on the DAG [26] . Note that we allow (unbounded) recomputation of a DAG vertex, and indeed recomputation is a useful technique for reducing the number of writes (at the cost of additional reads).
Lower Bounds
We start by showing lower bounds for FFT DAGs, sorting networks and diamond DAGs. The idea in showing the lower bounds is to partition a computation into subcomputations that each have a lower bound on cost, but an upper bound on the number of inputs and outputs they can use. Our lower bound for FFT DAGs then uses an interesting accounting technique that gives every node in the DAG a unit weight, and fractionally assigns this weight across the subcomputations. In the special case ω = 1, this leads to a simpler proof for the lower bound on the I/O complexity of FFT DAGs than the well-known bound by Hong and Kung [25] .
We refer to a subcomputation as any contiguous sequence of instructions. The outputs of a subcomputation are the values written by the subcomputation that are either an output of the full computation or read by a later subcomputation. Symmetrically, the inputs of a subcomputation are the values read by the subcomputation that are either an input of the full computation or written by a previous subcomputation. The space of a computation or subcomputation is the number of memory locations both read and written. An (l, m)-partitioning of a computation is a partitioning of instructions into subcomputations such that each has at most l inputs and at most m outputs. We allow for recomputation-instructions in different subcomputations might compute the same value. Lemma 1. Any computation in the (M, ω)-ARAM has an ((ω + 1)M, 2M )-partitioning such that at most one of the subcomputations has ARAM cost Q < ωM .
Proof. We generate the partitioning constructively. Starting at the beginning, partition the instructions into contiguous blocks such that all but possibly the last block has cost Q ≥ ωM , but removing the last instruction from the block would have cost Q < ωM . To remain within the cost bound each such subcomputation can read at most ωM values from large-memory. It can also read the at most M values that are in the small-memory when the subcomputation starts. Therefore it can read at most (ω + 1)M distinct values from the input or from previous subcomputations. Similarly, each subcomputation can write at most M values to large-memory, and an additional M that remain in small-memory when the subcomputation ends. Therefore it can write at most 2M distinct values that are available to later subcomputations or the output.
FFT. We now consider lower bounds for the DAG computation problem for the family of FFT DAGs (also called FFT networks, or butterfly networks). The FFT DAG of input size n = 2 k consists of k + 1 levels each with n vertices (total of n log 2n vertices). Each vertex (i, j) at level i ∈ 0, . . . , k − 1 and row j has two out edges, which go to vertices (i + 1, j) and (i + 1, j ⊕ 2 i ) (⊕ is the exclusive-or of the bit representation). This is the DAG used by the standard FFT (Fast Fourier Transform) computation. We note that in the FFT DAG there is at most a single path from any vertex to another. Lemma 2. Any (l, m)-partitioning of a computation for simulating an n input FFT DAG has at least n log n/(m log l) subcomputations.
Proof. We refer to all vertices whose values are outputs of any subcomputation, as partition output vertices. We assign each such vertex arbitrarily to one of the subcomputations for which it is an output.
Consider the following accounting scheme for fractionally assigning a unit weight for each non-input vertex to some set of partition output vertices. If a vertex is a partition output vertex, then assign the weight to itself. Otherwise take the weight, divide it evenly between its two immediate descendants (out edges) in the FFT DAG, and recursively assign that weight to each. For example, for a vertex x that is not a partition output vertex, if an immediate descendant y is a partition output vertex, then y gets a weight of 1/2 from x, but if not and one of y's immediate descendants z is, then z gets a weight of 1/4 from x. Since each non-input vertex is fully assigned across some partition output vertices, the sum of the weights assigned across the partition output vertices exactly equals |V | − n = n log n. We now argue that every partition output vertex can have at most log l weight assigned to it. Looking back from an output vertex we see a binary tree rooted at the output. If we follow each branch of the tree until we reach an input for the subcomputation, we get a tree with at most l leaves, since there are at most l inputs and at most a single path from every vertex to the output. The contribution of each vertex in the tree to the output is 1/2 i , where i is its depth (the root is depth 0). The leaves (subcomputation inputs) are not included since they are partition output vertices themselves, or inputs to the whole computation, which we have excluded. By induction on the tree structure, the weight of that tree is maximized when it is perfectly balanced, which gives a total weight of log l.
Therefore since every subcomputation can have at most m outputs, the total weight assigned to each subcomputation is at most m log l. Since the total weight across all subcomputations is n log n, the total number of subcomputations is at least n log n/(m log l).
Theorem 1 (FFT Lower Bound). Any solution to the DAG computation problem on the family of FFT DAGs parametrized by input size n has costs Q(n) = Ω(ωn log n/log(ωM )) and T (n) = Ω(Q(n) + n log n) on the (M, ω)-ARAM.
Proof. By Lemma 1 every computation must have an ((ω + 1)M, 2M )-partitioning with subcomputation cost Q ≥ ωM (except perhaps one). Plugging in Lemma 2 we have Q(n) ≥ ωM n log n/(2M log((ω + 1)M )), which gives our bound on Q(n). For T (n) we just add in the cost of the computation of each vertex.
Note that whichever of ω and M is larger will dominate in the denominator of Q(n). When ω ≤ M , these lower bounds match those for the standard external memory model [25, 1] assuming both reads and writes have cost ω. This implies that cheaper reads do not help asymptotically in this case. When ω > M , however, there is a potential asymptotic advantage for the cheaper reads. Sorting Networks. A sorting network is a acyclic network of comparators, each of which takes two input keys and returns the minimum of the keys on one output, and the maximum on the other. For a family of sorting networks parametrized by n, each network takes n inputs, has n ordered outputs, and when propagating the inputs to the outputs must place the keys in sorted order on the outputs. A sorting network can be modeled as a DAG in the obvious way. Ajtai, Komlós and Szemerédi [3] described a family of sorting networks that have size O(n log n) and depth O(log n). Their algorithm is complicated and the constants are very large. Many simplifications and constant factor improvements have been made, including the well known Patterson variant [37] and a simplification by Seiferas [40] . Recently Goodrich [23] gave a much simpler construction of an O(n log n) size network, but it requires polynomial depth. Here we show lower bounds of simulating any sorting network on the (M, ω)-ARAM.
Theorem 2 (Sorting Lower Bound). Simulating any family of sorting networks parametrized on input size n has Q(n) = Ω ωn log n log(ωM ) and T (n) = Ω(Q(n) + n log n) on the (M, ω)-ARAM.
Proof. Consider an (l, m)-partitioning of the computation. Each subcomputation has at most l inputs from the network, and m outputs for the network. The computation is oblivious to the values in the network (it can only place the min and max on the outputs of each comparator). Therefore locations of the inputs and outputs are fixed independent of input values. The total number of choices the subcomputation has is therefore
Since there are n! possible permutations, we have that the number of subcomputations k must satisfy (l m ) k ≥ n!. Taking logs of both sides, rearranging, and using Stirling's formula we have k > log(n!)/(m log l) > 1 2 n log n/(m log l) (for n > e 2 ). By Lemma 1 we have
These bounds are the same as for simulating an FFT DAG, and, as with FFTs, they indicate that faster reads do not asymptotically affect the lower bound unless ω > M . These lower bounds rely on the sort being done on a network, and in particular that the location of all read and writes are oblivious to the data itself. As discussed in the next section, for general comparison sorting algorithms, we can get better upper bounds than indicated by these lower bounds. Diamond DAG. We consider the family of diamond DAGs parametrized on size n. Each DAG has n 2 vertices arranged in a n × n grid such that every vertex (i, j), 0 ≤ i < (n − 1), 0 ≤ j < (n − 1) has two out-edges to (i + 1, j) and (i, j + 1). The DAG has one input at (0, 0) and one output at (n − 1, n − 1). Diamond DAGs have many applications in dynamic programs, such as for the edit distance (ED), longest common subsequence (LCS), and optimal sequence alignment problems.
Lemma 3 (Cook and Sethi, 1976) . Solving the DAG computation problem on the family of diamond DAGs of input parameter n (size n × n) requires n space to store vertex values from the DAG.
Proof. Cook and Sethi [12] show that evaluating the top half of a diamond DAG (i + j ≥ n − 1) , which they call a pyramid DAG, requires n space to store partial results. Since all paths of the diamond DAG must go through the top half, it follows for the diamond DAG.
Theorem 3 (Diamond DAG Lower Bound). The family of diamond DAGs parametrized on input size n has
By Lemma 3 any subcomputation that computes the last output vertex of the sub-DAG requires 2M memory to store values from the diamond. The extra in-edges along two sides and out-edges along the other two can only make the problem harder. Half of the 2M required memory can be from small-memory, so the remaining M must require writing those values to large-memory. Therefore every 2M × 2M diamond requires M writes of values within the diamond. Partitioning the full diamond DAG into 2M × 2M subdiamonds, gives us n 2 /(2M ) 2 partitions. Therefore the total number of writes is at least
, each with cost ω. For the time we need to add the n 2 calculations for all vertex values.
This lower bound is asymptotically tight since a diamond DAG can be evaluated with matching upper bounds by evaluating each M/2 × M/2 diamond sub-DAG as a subcomputation with M inputs, outputs and memory. These bounds show that for the DAG computation problem on the family of diamond DAGs there is no asymptotic advantage of having cheaper reads. In Section 5 we show that for the ED and LCS problems (normally thought of as a diamond DAG computation), it is possible to do better than the lower bounds. This requires breaking the DAG computation rule by partially computing the values of each vertex before all inputs are ready. The lower bounds are interesting since they show that improving asymptotic performance with cheaper reads requires breaking the DAG computation rules.
Upper Bounds
We start in this section by showing that a variety of problems have reasonably easy optimal upper bounds. In the two sections that follow and Appendix 7 we study problems that are more challenging. FFT. For the FFT we can match the lower bound using the algorithm described elsewhere [7] , although in that case the computation cost was not considered. The idea is to first split the DAG into layers of log(ωM ) levels. Then divide each layer so that the last log M levels are partitioned into FFT networks of output size M . Attach to each partition all needed inputs from the layer and the vertices needed to reach them (note that these vertices will overlap among partitions). Each extended partition will have ωM inputs and M outputs, and can be computed in M small-memory with Q = O(ωM ), and T = O((ω + log M )M ). This gives a total upper bound Q = O(ωM × n log n/(M log(ωM ))) = O(ωn log n/ log(ωM )), and T = O(Q(n) + n log n), which matches the lower bound (asymptotically). All computations are done within the DAG model. Search Trees and Priority Queues. We now consider algorithms for some problems that can be implemented efficiently using balanced binary search trees. In the following discussion we assume M = O(1). Red-black trees with appropriate rebalancing rules require only O(1) amortized time per update (insertion or deletion) once the location for the key is found [41] . For a tree of size n finding a key's location uses O(log n) reads but no writes, so the total amortized cost Q = T = O(ω + log n) per update in the (M, ω)-ARAM. For arbitrary sequences of searches and updates, Ω(ω + log n) is a matching lower bound on the amortized cost per operation when M = O(1). Since priority queues can be implemented with a binary search tree, insertion and delete-min have the same bounds. It seems more difficult, however, to reduce the number of writes for priority queues that support efficient melding or decrease-key. Sorting. Sorting can be implemented with Q = T = O(n(log n + ω)) by inserting all keys into a red-black tree and then reading them off in priority order [7] . We note that this bound on time is better than the sorting network lower bound (Theorem 2). For example, when ω = M = log n it gives a factor of log n/ log log n improvement. The additional power is a consequence of being able to randomly write to one of n locations at the leaves of the tree for each insertion. The bound is optimal for T since n writes are required for the output and comparison-based sorting requires O(n log n) operations. Convex Hull and Triangulation. A variety of problems in computational geometry can be solved optimally using balanced trees and sorting. The planar convex-hull problem can be solved by first sorting the points by x coordinates and then either using Overmars' technique or Graham's scan [14] . In both cases the second part takes linear time so the overall cost is O(Sort(n)). The planar Delaunay triangulation problem can be solved efficiently with the plane sweep method [14] . This involves maintaining a priority queue on x coordinate, and maintaining a balanced binary search tree on the y coordinate. A total of O(n) operations are required on each, again giving bounds O(Sort(n)). BFS and DFS. Breadth-first and depth-first search can be performed with Q = T = O(ωn + m). In particular each vertex only requires a constant number of writes when it is first added to the frontier (the stack or queue) and a constant number of writes when it is finished (removed from the stack or queue). Searches along an edge to an already visited vertex requires no writes. This implies that several problems based on BFS and DFS also only require Q = T = O(DFS(n)). For example, topological sort, biconnected components, and strongly connected components. However when using priority-first search on a weighed graph (e.g., Dijkstra's or Prim's algorithms) then the problem is more difficult to perform optimally. This is because the priority queue might need to be updated for every edge that is visited. We cover single source shortest paths (SSSP) using Dijkstra in Section 6 and minimum spanning trees (MST) in Appendix 7. Dynamic Programming. With regards to dynamic programming, some problems are reasonably easy and some harder. The standard Floyd-Warshal algorithm for the all-pairs shortest-path (APSP) problem uses O(n 3 ) writes. However, by rearranging the loops and carefully scheduling the writes it is possible to implement a variant of the algorithm using only O(n 2 ) writes and O(n 3 ) reads, giving T = O(ωn 2 + n 3 ). We describe this in Appendix C. This version, however, is not efficient in terms of Q. Kleene's divide-andconquer algorithm [2] can be used to reduce the ARAM cost [36] . Each recursive call makes two calls to itself on problems of half the size, and six calls to matrix multiply over the semiring (min, +). Here we analyze the algorithm in the (M, ω)-ARAM. The matrix multiplies on two matrices of size n × n can be done in the model in [7] . This leads to the recurrence:
which solves to Q Kleene (n) = O(Q M (n)) since the cost is dominated at the root of the recurrence. It is not known whether this is optimal. A similar approach can be used for several other problems, including sequence alignment with gaps, optimal binary search trees, and matrix chain multiplication [11] . Longest common subsequence (LCS) and edit distance (ED) are more challenging, and covered next.
Longest Common Subsequence and Edit Distance
This section describes a more efficient dynamic-programming algorithm for longest common subsequence (LCS) and edit distance (ED). The standard approach for these problems (an M × M tiling) results in an ARAM cost of O(mnω/M ) and time of O(mn + mnω/M ), where m and n are the length of the two input strings. Lemma 3 states that the standard bound is optimal when limited to the DAG/pebbling model where all inputs must be available before evaluating a node. Perhaps surprisingly, we are able to beat these bounds by leveraging the fact that dynamic programs do not perform arbitrary functions at each node, and hence we do not necessarily need all inputs to evaluate a node.
Our main result, proved in Appendix B, is captured by the following theorem for large input strings. For smaller strings, we can do even better (see Appendix B).
Then it is possible to compute the ED or length of the LCS with time
Then it is possible to compute the ED or length of the LCS with an ARAM cost of
To understand these bounds, our algorithm beats the ARAM cost of the standard tiling algorithm by a k Q factor. And if ω ≥ M , our algorithm (using different tuning parameters) beats the time of the standard tiling algorithm by a k T factor. Overview. The dynamic programs for LCS and ED correspond to computing the shortest path through an m × n grid with diagonal edges, where m and n are the string lengths. We focus here on computing the length of the shortest path, but Appendix B discusses how to output the path as well. Without loss of generality, we assume that m ≤ n, so the grid is at least as wide as it is tall. For LCS, all horizontal and vertical edges have weight 0; the diagonal edges have weight −1 if the corresponding characters in the strings match, and weight ∞ otherwise. For ED, horizontal and vertical edges have weight 1, and diagonal edges have weights either 0 or 1 depending on whether the characters match. The algorithm of this section is not sensitive to the particular weights of the edges, and thus it applies to both problems and their generalizations.
Note that the m × n grid is not built explicitly since building and storing the graph would take Θ(mn) writes if mn M . To get any improvement, it is important that subgrids reuse the same space. The weights of each edge can be inferred by reading the appropriate characters in each input string.
Our algorithm partitions the implicit grid into size-hM × kM , where h and k are parameters of the algorithm to be set later, and M = M/c for large enough constant c > 1 to give sufficient working space in small-memory. When string lengths m and n ≥ m are both "large", we use h = k and thus usually work with kM × kM square subgrids. If the smaller string length m is small enough, we instead use parameters h < k. To simplify the description of the algorithm, we assume without loss of generality that m and n are divisible by hM and kM , respectively, and that M is divisible by c.
Our algorithm operates on one hM × kM rectangle at a time, where the edges are directed right and down. The shortest-path distances to all nodes along the bottom and right boundary of each rectangle are explicitly written out, but all other intermediate computations are discarded. We label the vertices u i,j for 1 ≤ i ≤ hM and 1 ≤ j ≤ kM according to their row and column in the square, respectively, starting from the top-left corner. We call the vertices immediately above or to the left of the square the input nodes. The input nodes are all outputs for some previously computed rectangle. We call the vertices u hM ,j along the bottom boundary and u i,kM along the right boundary the output nodes.
The goal is to reduce the number of writes, thereby decreasing the overall cost of computing the output nodes, which we do by sacrificing reads and time. It is not hard to see that recomputing internal nodes allows us to reduce the number of writes. Consider, for example, the following simple approach assuming M = Θ(1): For each output node of the k × k square, try all possible paths through the square, keeping track of the best distance seen so far; perform a write at the end to output the best value. 2 Each output node tries 2 Θ(k) paths, but only a Θ(1/k)-fraction of nodes are output nodes. Setting k = Θ(lg ω) reduces the number of writes by a Θ(lg ω)-factor at the cost of ω O(1) reads. This same approach can be extended to larger M , giving the same lg ω improvement, by computing "bands" of nearby paths simultaneously. But our main algorithm, which we discuss next, is much better as M gets larger (see Theorem 4). Path sketch. The key feature of the grid leveraged by our algorithm is that shortest paths do not cross, which allows us to avoid the exponential recomputation of the simple approach. The noncrossing property has been exploited previously for building shortest-path data structures on the grid (e.g., [39] ) and more generally planar graphs (e.g., [19, 31] ). These previous approaches do not consider the cost of writing to large-memory, and they build data structures that would be too large for our use. Our algorithm leverages the available small-memory to compute bands of nearby paths simultaneously. We capture both the noncrossing and band ideas through what we call a path sketch, which we define as follows. The path sketch allows us to cheaply recompute the shortest paths to nodes.
We call every M th row in the square a superrow, meaning there are h superrows in the square. The algorithm partitions the ith superrow into segments i, , r of consecutive elements u iM , , u iM , +1 , . . . , u iM ,r . The main restriction on segments is that r < + M , i.e., each segment consists of at most M consecutive elements in the superrow. Note that the segment boundaries are determined by the algorithm and are input dependent.
A path sketch is a sequence of segments s, s , r s , s + 1, s+1 , r s+1 , s + 2, s+2 , r s+2 , . . . , i, i , r i , summarizing the shortest paths to the segment. Specifically, this sketch means that for each vertex in the last segment, there is a shortest path to that vertex that goes through a vertex in each of the segments in the sketch. If the sketch starts at superrow 1, then the path originates from a node above the 1st superrow (i.e., the top boundary or the topmost M nodes of the left boundary). If the sketch starts with superrow s > 1, then the path originates at one of the M nodes on the left boundary between superrows s − 1 and s. Since paths cannot go left, the path sketch also satisfies s ≤ s+1 ≤ · · · ≤ i . Evaluating a path sketch. Given a path sketch, we refer to the process of determining the shortest-path distances to all nodes in the final segment i, i , r i as evaluating the path sketch or evaluating the segment, with the distances in smallmemory when the process completes. Note that we have not yet described how to build the path sketch, as the building process uses evaluation as a subroutine.
Evaluating a sketch is not expensive. We give more algorithmic details and proof in Appendix B, but the idea is captured by Figure 1 for the example sketch 1, 4, 6 , 2, 6, 6 , 3, 8, 9 . The sketch tells us that shortest paths to u 9,8 and u 9,9 pass through one of u 3,4 , u 3,5 , u 3,6 and the node u 6, 6 . Thus, to compute the distances to u 9,8 and u 9,9 , we need only consider paths through the darker nodes and solid edges -the lighter nodes and dashed edges are not recomputed during evaluation. The necessary computation between each superrow can be done in small-memory by sweeping a short column through horizontally. The following lemma states the overall cost.
Lemma 4. Given a path sketch s, s , r s , . . . , i, i , r i in an hM × kM grid with distances to all input nodes already computed, our algorithm correctly computes the shortest-path distances to all nodes in the segment i, i , r i . Assuming k ≥ h and small-memory size M ≥ 5M + Θ(1), the algorithm requires O(kM 2 ) operations in small-memory, O(kM ) reads, and 0 writes.
Building the path sketch. The main algorithm on each rectangle involves building the set of sketches to segments in the bottom superrow. At some point during the sketch-building process, the distances to each output node is computed, at which point it can be written out. The main idea of the algorithm is a sketch-extension subroutine: given segments in the ith superrow and their sketches, extend the sketches to produce segments in the (i + 1)th superrow along with their sketches. More details are provided in Appendix B, e.g., deciding where to partition the segments in the next superrow. Suffice it to say that the main work of the extension is just a horizontal sweep of a short column between superrows, with each segment in superrow i evaluated as needed. We achieve the following performance lemma:
Lemma 5. Suppose h ≤ k and small-memory M ≥ 11M + Θ(1), and consider an hM × kM grid with distances to input nodes already computed. Then the sketch building algorithm correctly computes the distances to all output boundary nodes using O((hk) 2 M 2 ) operations in small-memory, O((hk) 2 M ) reads from large-memory, and O(h 2 k + X) writes to large-memory, where X = O(kM ) is the number of boundary nodes written out.
Setting h = k = 1 gives the standard M × M tiling with O(M 2 ) time and O(ωM ) ARAM cost per square. As the size of squares increase, the fraction of output nodes and hence writes decreases, at the cost of more overhead for operations in small-memory and reads from large-memory. Assuming both n and m are large enough to do so, choosing h = k = k T or h = k = k Q gives Theorem 4. (See Appendix B for proof.)
Single-Source Shortest Path
The single-source shortest path (SSSP) problem takes a directed weighted graph G = (V, E) and a source vertex s ∈ V , and outputs the shortest distances d(s, v) from s to every other vertex in v ∈ V . For graphs with non-negative edge weights, the most efficient algorithm is Dijkstra's algorithm [15] .
In this section we will study (variants of) Dijkstra's algorithm in the asymmetric setting. We describe and analyze three versions (two classics and one new variant) of Dijkstra's algorithm, and the best version can be chosen based on the values of M , ω, the number of vertices n = |V |, and the number of edges m = |E|.
Theorem 5. The SSSP problem on a graph G = (V, E) with non-negative edge weights can be solved with Q(n, m) = O min n ω + m M , ω(m + n log n), m(ω + log n) and T (n, m) = O(Q(n, m)+n log n), both in expectation on the (M, ω)-ARAM.
We start with the classical Dijkstra's algorithm [15] . In this algorithm, δ(v), a tentative upper bound on the distance, is kept for each vertex v in the graph, and initialized as +∞ (0 for δ(s)). The algorithm consists of n − 1 iterations, and the final distances from the source s are stored in δ(·). In each iteration, the algorithm selects the unvisited vertex u with smallest finite δ(u), marks it as visited, and uses its outgoing edges to relax (update) all its neighbors' distances. A priority queue is required to efficiently select the unvisited vertex with minimum distance. Using a Fibonacci heap [21] , the time of the algorithm is O(m + n log n) in the standard (symmetric) RAM model. In the (M, ω)-ARAM, the costs are T = Q = O(ω(m + n log n)) since the Fibonacci heap requires proportionally as many writes as reads. Alternatively, using a binary search tree for the priority queue reduces the number of writes (see Section 4) at the cost of increasing the number of reads, giving the costs Q = T = O(m log n + ωm). These bounds are better when m = o(ωn). Both these variants store the priority queue in large-memory, requiring at least one write to large-memory per edge.
We now describe an algorithm, which we refer to as phased Dijkstra, that fully maintains the priority queue in small-memory and only requires O(n) writes to large-memory. The idea is to partition the computation into phases such that for a parameter M each phase needs a priority queue of size at most 2M and visits at least M vertices. By selecting M = M/c for a small constant c, the priority queue fits in small-memory, and the only writes to large-memory are the final distances.
Each phase starts and ends with an empty priority queue P and consists of two parts. A Fibonacci heap is used for P , but is kept small by discarding the M largest elements (vertex distances) whenever |P | = 2M . To do this P is flattened into an array, the M -th smallest element d max is found by selection, and the Fibonacci heap is reconstructed from the elements no greater than d max , all taking linear time. All further insertions in a given phase are not added to P if they have a value greater than d max . The first part of each phase loops over all edges in the graph and relaxes any that go from a visited to an unvisited vertex (possibly inserting or decreasing a key in P ). The second part then runs standard Dijkstra's algorithm repeatedly visiting the vertex with minimum distance and relaxing its neighbors until P is empty. To implement relax, the algorithm needs to know whether a vertex is already in P , and if so where in P so that it can do a decrease-key on it. It is too costly to store this information with the vertex in large-memory, but it can be stored in small-memory using a hash table.
The correctness of this phased Dijkstra's follows from the fact that it only ever visits the closest unvisited vertex, as with standard Dijkstra's.
Lemma 6. Phased Dijkstra's has Q(n, m) = O n m M + ω and T (n, m) = O(Q(n, m) + n log n) both in expectation (for M ≤ n).
Proof. During a phase either the size of P will grow to 2M (and hence delete some entries) or it will finish the algorithm. If P grows to 2M then at least M vertices are visited during the phase since that many need to be deleted with delete-min to empty P . Therefore the number of phases is at most n/M . Visiting all edges in the first part of each phase involves at most m insertions and decrease-keys into P , each taking O(1) amortized time in small-memory, and O(1) time to read the edge from large-memory. Since compacting Q when it overflows takes linear time, its cost can be amortized against the insertions that caused the overflow. The cost across all phases for the first part is therefore Q = W = O(m n/M ). For the second part, every vertex is visited once and every edge relaxed at most once across all phases. Visiting a vertex requires a delete-min in small-memory and a write to large-memory, while relaxing an edge requires an insert or decrease-key in small-memory, and O(1) reads from large-memory. We therefore have for this second part (across all phases) that Q = O(ωn + m) and W = O(n(ω + log n) + m). The operations on P each include an expected O(1) cost for the hash table operations. Together this gives our bounds.
Comparing to the first two Dijkstra's algorithms with Q = T = O(ωm + min(ωn log n, m log n)), the new algorithm is strictly better when ωM > n. More specifically, the new algorithm performs better when nm/M < max{ωm, min(ωn log n, m log n)}. Combining these three algorithms proves Theorem 5, when the best one is chosen based on the parameters M , ω, n, and m.
Minimum Spanning Tree (MST)
In this section we discuss several commonly-used algorithms for computing a minimum spanning tree (MST) on a weighted graph G = (V, E) with n = |V | vertices and m = |E| edges. Some of them are optimal in terms of the number of writes (O(n)). Although loading a graph into large-memory requires O(m) writes, the algorithms are still useful on applications that compute multiple MSTs based on one input graph. For example, it can be useful for computing MSTs on subgraphs, such as road maps, or when edge weights are time-varying functions and hence the graph maintains its structure but the MST varies over time. Prim's algorithm. All three versions of Dijkstra's algorithm discussed in Section 6 can be adapted to implement Prim's algorithm [13] . Thus, the upper bounds of ARAM cost and time in Theorem 5 also hold for minimum spanning trees. Kruskal's algorithm. The initial sorting phase requires Q = T = O(m log n + ωm) [7] . The second phase constructs a MST using union-find without path compression in O(m log n) time, and performs O(n) writes (the actual edges of the MST). Thus, the complexity is dominated by the first phase. Neither ARAM cost nor time match our variant of Borůvka's algorithm. Borůvka's algorithm. Borůvka's algorithm consists of at most log n rounds. Initially all vertices belong in their own component, and in each round, the lightest edges that connect each component to another component are added to the edge set of the MST, and components are merged using these edges. This merging can be done using, for example, depth first search among the components and hence takes time proportional to the number of remaining components. However, since edges are between original vertices the algorithm is required to maintain a mapping from vertices to the component they belong to. Shortcutting all vertices on each round to point directly to their component requires O(n) writes per round and hence up to O(n log n) total writes across the rounds. This is not a bottleneck in the standard RAM model but is in the asymmetric case.
We now describe a variant of Borůvka's algorithm which is asymptotically optimal in the number of writes. It requires only O(1) small-memory. The algorithm proceeds in two phases. For the first log log n rounds, the algorithm performs no shortcuts (beyond the merging of components). Thus it will leave chains of length up to log log n that need to be followed to map each vertex to the component it belongs to. Since there are at most O(m log log n) queries during the first log log n rounds and each only require reads, the total time for identifying the minimum edges between components in the first phase is O(m(log log n) 2 ). After the first phase all vertices are shortcut to point to their component. We refer to these components as the phase-one components. In the second phase, on every round, we shortcut the phase-one components to point directly to the component they belong to. Since there can only be at most n/ log n phase-one components, and at most log n − log log n rounds in phase-two, the total number of reads and writes for these updates is O(n). During phase two the mapping from a vertex to its component takes two steps: one to find its phase-one component and another to get to the current component. Therefore the total time for identifying the minimum edges between components in the second phase is O(m log n).
All other work is on the components themselves (i.e. adding the forest of minimum edges and performing DFS to merge components). The number of reads, writes, and other instructions is proportional to the number of components. There are n components on the first round and the number decreases by at least a factor of two on each following round. Therefore the total time on the components is O(ωn). Summing the costs give the following lemma.
Lemma 7. Our variant of Borůvka's algorithm generates a minimum spanning tree on a graph with n vertices and m edges with ARAM cost and time Q(n, m) = T (n, m) = O(m log n + ωn) on the (M, ω)-ARAM.
Theorem 6. A minimum spanning tree on a graph G = (V, E) can be computed with ARAM cost Q(n, m) = O m min n M , log n + ωn and time
The theorem is a combination of the bounds of Prim's and Borůvka's algorithms (the n/M term is in expectation).
[45] Byung-Do Yang, Jae-Eun Lee, Jang-Su Kim, Junghyun Cho, Seung-Yun Lee, and Byoung-Gon Yu. A low power phase-change random access memory using a data-comparison write scheme. In ISCAS, 2007.
[46] Yole Developpement. Emerging non-volatile memory technologies, 2013.
[47] Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. A durable and energy efficient main memory using phase change memory technology. In ISCA, 2009.
[48] O. Zilberberg, S. Weiss, and S. Toledo. Phase-change memory: An architectural perspective. ACM Computing Surveys, 45(3), 2013.
A Motivation from [7] Further motivation for the asymmetry between reads and write costs in emerging memory technologies was provided in [7] . As a convenience to the reviewer, in this appendix we repeat a suitable excerpt from that paper.
"While DRAM stores data in capacitors that typically require refreshing every few milliseconds, and hence must be continuously powered, emerging NVM technologies store data as "states" of the given material that require no external power to retain. Energy is required only to read the cell or change its value (i.e., its state). While there is no significant cost difference between reading and writing DRAM (each DRAM read of a location not currently buffered requires a write of the DRAM row being evicted, and hence is also a write), emerging NVMs such as Phase-Change Memory (PCM), Spin-Torque Transfer Magnetic RAM (STT-RAM), and Memristor-based Resistive RAM (ReRAM) each incur significantly higher cost for writing than reading. This large gap seems fundamental to the technologies themselves: to change the physical state of a material requires relatively significant energy for a sufficient duration, whereas reading the current state can be done quickly and, to ensure the state is left unchanged, with low energy. An STT-RAM cell, for example, can be read in 0.14 ns but uses a 10 ns writing pulse duration, using roughly 10 −15 joules to read versus 10 −12 joules to write [17] (these are the raw numbers at the materials level). A Memristor ReRAM cell uses a 100 ns write pulse duration, and an 8MB Memrister ReRAM chip is projected to have reads with 1.7 ns latency and 0.2 nJ energy versus writes with 200 ns latency and 25 nJ energy [44] -over two orders of magnitude differences in latency and energy. PCM is the most mature of the three technologies, and early generations are already available as I/O devices. A recent paper [30] reported 6.7 µs latency for a 4KB read and 128 µs latency for a 4KB write. Another reported that the sector I/O latency and bandwidth for random 512B writes was a factor of 15 worse than for reads [28] . As a future memory/cache replacement, a 512Mb PCM memory chip is projected to have 16 ns byte reads versus 416 ns byte writes, and writes to a 16MB PCM L3 cache are projected to be up to 40 times slower and use 17 times more energy than reads [16] . While these numbers are speculative and subject to change as the new technologies emerge over time, there seems to be sufficient evidence that writes will be considerably more costly than reads in these NVMs."
Note that, unlike SSDs and earlier versions of phase-change memory products, these emerging memory products will sit on the processor's memory bus and be accessed at byte granularity via loads and stores (like DRAM). Thus, the time and energy for reading can be roughly on par with DRAM, and depends primarily on the properties of the technology itself relative to DRAM.
B Longest Common Subsequence: Further Results
Evaluating a path sketch. We now describe an algorithm for evaluating the path sketch. The algorithm first computes the shortest-path distances to the first segment in the sketch. To do so, horizontally sweep a height M + 1 column across the (M + 1) × kM slab raising above the sth superrow, keeping two columns in small-memory at a time. Also keep the newly computed distances to the first segment in small-memory, and stop the sweep at the right edge of the segment. Given the distances to one segment in small-memory, we can compute the values for the next segment in a similar manner by sweeping a column through the next slab, starting from the left of the previous segment. This algorithm yields the following performance.
Lemma 4 Given a path sketch s, s , r s , . . . , i, i , r i in an hM × kM grid with distances to all input nodes already computed, our algorithm correctly computes the shortest-path distances to all nodes in the segment i, i , r i . Assuming k ≥ h and small-memory size M ≥ 5M + Θ(1), the algorithm requires O(kM 2 ) operations in small-memory, O(kM ) reads, and 0 writes.
Proof. Correctness follows from the definition of the path sketch -the sweep performed by the algorithm considers all possible paths that pass through these segments.
The algorithm requires space in small-memory to store two columns in the current slab, the previous segment in the sketch, and the next segment in the sketch, and the two segment boundaries themselves, totaling 4M + Θ(1) small-memory. An additional M small-memory suffices to store M consecutive characters of the "vertical" input string, allowing us to reduce the number of reads as well. Due to the monotonically increasing left endpoints of each segment, the horizontal sweep repeats at most M columns for each superrow, giving a total of (k(M ) 2 + h(M ) 2 ) = O(k(M ) 2 ) nodes in total. The "vertical" input string is read once for O(kM ) reads. The "horizontal" input string is read one character per sweep step, for O(kM ) reads. An additional k reads suffice to read the sketch itself, which is a lower-order term.
Building the path sketch. Our algorithm builds up an ordered list of consecutive path sketches, one superrow at a time. The first superrow is partitioned into k segments, each containing exactly M consecutive nodes. The list of sketches is initialized to these segments.
Given a list of sketches to the ith superrow, our algorithm extends the list of sketches to the (i + 1)th superrow as follows. The algorithm sweeps a height M + 1 column across the (M + 1) × kM slab between these superrows (inclusive). The sweep begins at the left end of the slab, reading the input values from the left boundary, and continuing across the entire width of the slab. In small-memory, we evaluate the first segment of the ith superrow (using the algorithm from Lemma 4). Whenever the sweep crosses a segment boundary in the ith superrow, again evaluate the next segment in the ith superrow. For each node in the slab, the sweep calculates both the shortest-path distance and a pointer to the segment in the previous superrow from whence the shortest path originates (or a null pointer if it originates from the left boundary). When the originating segment of the bottom node (the node in the (i + 1)th superrow) changes, the algorithm creates a new segment for the (i + 1)th superrow and appends it to the sketch of the originating segment. If the segment in the current segment in the (i + 1)th superrow grows past M elements, a new segment is created instead and the current path sketch is copied and spliced into the list of sketches. 3 Any sketch that is not extended through this process is no longer relevant and may be spliced out of the list of sketches. When the sweep reaches a node on the output boundary (right edge or bottom edge of the square), the distance to that node is written out.
Lemma 8. The sketching algorithm partitions the ith superrow into at most ik segments.
Proof. The proof is by induction over superrows. As a base case, the first superrow consists of exactly k segments. For the inductive step, there are two cases in which a new segment is started in the (i + 1)th superrow. The first case is that the originating segment changes, which can occur at most ik times by inductive assumption. The second case is that the current segment grows too large, which can occur at most k times. We thus have at most (i + 1)k segments in the (i + 1)th superrow.
Lemma 5 Suppose h ≤ k and small-memory M ≥ 11M + Θ(1), and consider an hM × kM grid with distances to input nodes already computed. Then the sketch building algorithm correctly computes the distances to all output boundary nodes using O((hk) 2 M 2 ) operations in small-memory, O((hk) 2 M ) reads from large-memory, and O(h 2 k + X) writes to large-memory, where X = O(kM ) is the number of boundary nodes written out.
Proof. Consider the cost of computing each slab, ignoring the writes to the output nodes. We reserve 5M + Θ(1) small-memory for the process of evaluating segments in the previous superrow. To perform the sweep in the current slab, we reserve M small-memory to store one segment in the previous row, M small-memory to store characters in the "vertical" input string, 4(M + 1) small-memory to store two columns (each with distances and pointers) for the sweep, and an additional Θ(1) small-memory to keep, e.g., the current segment boundaries. Since there are at most hk segments in the previous superrow (Lemma 8, the algorithm evaluates at most hk segments; applying Lemma 4, the cost is O(hk 2 M 2 ) operations, O(hk 2 M ) reads, and 0 writes. There are an additional O(kM 2 ) operations to sweep through the kM 2 nodes in the slab, plus O(kM ) reads to scan the "horizontal" input string. Finally, there are O(hk) writes to extend existing sketches and O(hk) writes to copy at most k sketches.
Summing across all h slabs and accounting for the output nodes, we get O((hk) 2 M 2 +hkM 2 ) operations, O((hk) 2 M + hkM ) reads, and O(h 2 k + X) writes. Removing the lower-order terms gives the lemma.
Combining across all rectangles in the grid, we get the following corollary. Proof. There are Θ(mn/(hkM 2 )) size-(hM/11) × (kM/11) subgrids. Multiplying by the cost of each grid (Lemma 5) with X = Θ(kM ), gives the theorem.
By tuning the values of h and k to reduce the write cost as far as possible without increasing the overall ARAM cost or time, we are achieve the following main theorem. We assume here that both strings are large enough that the grid can be subdivided into large kM × kM squares, i.e., we can choose the best parameter k. We discuss later how to generalize when the grid is too small and hence h and k are restricted.
Then it is possible to compute the length of the LCS or edit distance with total time T (m, n) = O(mn + mnω/(M k T )).
Let k Q = min{ω 1/3 , √ M } and suppose m, n = Ω(k Q M ). Then it is possible to compute the length of the LCS or edit distance with an ARAM cost of Q(m, n) = O(mnω/(k Q M )).
Proof. As long as h ≤ √ M , the number of writes reduces to O(mn/(hM )). (Increasing h further causes the number of writes to increase.) Consider the time bound first. If k T ≤ 1, then just use algorithm with h = k = 1. Otherwise, let M = M/11 and use the algorithm with h = k = k T . As long as h = k ≤ (ω/M ) 1/3 , which is true for k T , the time of operations is less than the time of writes, giving the bound.
For the ARAM cost, use our algorithm with h = k = k Q . As long as h = k ≤ ω 1/3 , then cost of reads is less than the cost of writes.
Improving the bound for smaller string lengths. If m ≤ M , then the standard I/O algorithm becomes even better -simply sweep a column through, which remains in small-memory, using O(m + n) reads, no writes, and O(mn) time. Since there are no writes, we cannot beat that bound. As described already, if m ≥ kM then our algorithm partitions the grid into kM × kM squares, which for larger k saves writes by sacrificing reads and re-computation. The remaining question is what happens when m is larger than small-memory but not too much larger, i.e., M < m < k T M or M < m < k Q M .
When m falls in this range, we apply the algorithm to m × kM rectangles, i.e., setting h = m/M . It turns out we can achieve a better bound than Theorem 4 by increasing k even further. The key observation here is that the bottom of the rectangle no longer needs to be written out because there is no rectangle below it -only the right edge is an output edge. The number of writes per rectangle (Lemma 5 with X = hM ) reduces to O(h 2 k + hM ). We thus have the following modified version of Corollary 1 The following theorem provides the improved time and ARAM cost in the case that one string is short but the other is long. To understand the bounds here, consider the maximum and minimum values of h for h = m/M and large ω. If h = (ω/M ) 1/3 , i.e., m is large enough that we can divide into k T M × k T M squares, then we get k T = (ω/M ) 1/3 matching the bound in Theorem 4. As h decreases, the bound improves. In the limit, h = Θ(1) (or m = Θ(M )), we get k T = ω/M which is better. Theorem 7. Let h = Θ(m/M ) and suppose that h ≤ k T specified in Theorem 4. Let k T = min{ ω/(hM ), M/h} and suppose that n = Ω(k T M ). Then it is possible to compute the length of the LCS or edit distance with total time of T (m, n) = O(mn + mnω/(k T M )).
Let h = Θ(m/M ) and suppose that h ≤ k Q specified in Theorem 4. Let k Q = min{ ω/h, M/h} and suppose n = Ω(k Q M ). Then it is possible to compute the length of the LCS or edit distance with an ARAM cost of Q(m, n) = O(mnω/(k Q M )).
Proof. With the restrictions on h, we have h ≤ k, so Corollary 2 is applicable. As in proof of Theorem 4, the second term of the min has the effect that O(mnh/M 2 + mn/(kM )) = O(mn/(kM )). The rest of the bound follows by choice of k to makes the cost of writes dominate.
When n is also small, the bound improves further. In this case, the algorithm consists of building the sketch on a single m × n grid, so no boundary nodes are output -the only writes that need be performed are the sketch itself. Proof. The bound follows directly from Lemma 5 with X = 0 and substituting one hk term in the read/time bounds.
Recovering the shortest path. The standard approach for outputting the shortest path is to trace backwards through the grid from the bottom-rightmost node. This approach assumes that the distances to all internal nodes are known, but unfortunately our algorithm discards distances to interior nodes.
Fortunately, the sketch provides enough information to cheaply traceback a path through each square without any additional writes (except the path itself) and without asymptotically more reads or time. In particular, for any node v in superrow i + 1, it is not hard to identify a node u in superrow i such that a shortest path to v passes through u. Consider the sketch s, s , r s , . . . , i, i , r i , i + 1, i+1 , r i+1 to the segment i + 1, i+1 , r i+1 that includes v. The vertex u is one of the vertices in the penultimate segment i, i , r i of the sketch, so the goal is to identify which one. To do so, evaluate the sketch to the segment i, i , r i . Then perform a horizontal sweep through the final slab, keeping track of the originating vertex from the penultimate segment. Now suppose we have these vertices u and v that fall along the shortest path and are in consecutive superrows. We also need to identify the path through the slab between u and v. To do that, we apply Hirschberg's [24] recursive low-space algorithm for path recovery in the ED/LCS grid, splitting the horizontal dimension in half on each recursion. Note that the time reduces by a constant fraction in each recursion, but the ARAM cost does not (the only ARAM cost here is from reading the "horizontal" input string), so it may not be immediately obvious that the ARAM cost is cheap enough. Fortunately, after recursing at most lg k times, the length of the horizontal substring is at most M and the remaining path-recovery subproblem can be done with no further reads from large-memory.
Putting it all together, tracing a path to the previous superrow requires one sketch evaluation, followed by time that is linear in the area and an ARAM cost that corresponds to reading the horizontal string from large-memory lg k times. Rounding up loosely, we get time of O(k(M ) 2 + (M )(kM )) = O(kM 2 ) along with O(kM + (kM ) lg k) = O(k 2 M ) reads. Multiplying by the h superrows, we have O(hkM 2 ) time and O(hk 2 M ) reads from large-memory. Both of these are less than the cost of building the sketch in the first place (Lemma 5).
C Write-Efficient Floyd-Warshall Algorithm
The Floyd-Warshall algorithm solves the all-pairs shortest path problem on weighted graphs [20] . It requires O(n 3 ) writes in its original and currently recognized form and hence O(ωn 3 ) time in (M, ω)-ARAM. Here we describe how to reorganize the computation so that it only requires T = O(n 3 + ωn 2 ) time. Compared to the version described in Section 4, this algorithm is easier to program, and has smaller constant coefficient.
Consider a graph G with vertices V numbered 1 through n and a function ShortestPath(i, j, k) that returns the shortest possible path from i to j using vertices only from the set {1, 2, · · · , k} as intermediate points along the path. ShortestPath(i, j, k) can be achieved using the Floyd-Warshall algorithm with the following update rule: ShortestPath(i, j, k) = min{ShortestPath(i, j, k − 1), ShortestPath(i, k, k − 1) + ShortestPath(k, j, k − 1)} Figure 2 : The dependence of the Floyd-Warshall algorithm on a graph with 3 vertices in the second round. Solid nodes correspond to ShortestPath(·, ·, 1) and dash nodes for ShortestPath(·, ·, 2). The black and red arrows indicate the data dependencies of the first and second term in the function respectively, and red arrows form an outer product of the second column and the second row, which is used to update the dash nodes.
Given a graph with 3 vertices, Figure 2 illustrates the second in computing this function (k = 2), and the dependency of the computational DAG. The pseudocode of the Floyd-Warshall algorithm is provided in Algorithm 1. Now we explain how to reduce the number of writes from O(n 3 ) to O(n 2 ). Notice that the third dimension in the function does not exist explicitly in the pseudocode. This is because the path between i and j in the k-th round are still available in the (k + 1)-th round (black edges in Figure 2) . Hence, the left computations in each round, shown in red edges in Figure 2 , can be viewed as an outer product of the k-th column and k-th row that is computed and used to update the whole matrix. In Figure 2 , k = 2 since we are working on the second round.
The main observation used to reduce writes is that the value of d i,j is only used to compute other distances in the i-th and j-th rounds. Therefore, we reorder the computations as to postpone the updates as much as possible. To be more precise, in the k-th iteration, we only compute and update the elements in the k-th column and k-th row in the matrix. We do a batch of O(n) postponed updates for each of these elements, and in total, each round requires O(n 2 ) reads and instructions, but only 2n writes. Then after the last round, we update all distances to their final values using n 2 writes. The number of instructions of the new ordering is
