This paper initiates the study of I/O algorithms (minimizing cache misses) from the perspective of fine-grained complexity (conditional polynomial lower bounds). Specifically, we aim to answer why sparse graph problems are so hard, and why the Longest Common Subsequence problem gets a savings of a factor of the size of cache times the length of a cache line, but no more. We take the reductions and techniques from complexity and fine-grained complexity and apply them to the I/O model to generate new (conditional) lower bounds as well as faster algorithms. We also prove the existence of a time hierarchy for the I/O model, which motivates the fine-grained reductions.
Introduction
The I/O model (or external-memory model) was introduced by Aggarwal and Vitter [AV88] to model the non-uniform access times of memory in modern processors. The model nicely captures the fact that, in many practical scenarios, cache misses between levels of the memory hierarchy (including disk) are the bottleneck for the program. As a result, the I/O model has become a popular model for developing cache-efficient algorithms.
In the I/O model, the expensive operation is bringing a cache line of B contiguous words from the "main memory" (which may alternately represent disk) to the "cache" (local work space). The cache can store up to M words in total, or M/B cache lines. Computation on data in the cache is usually treated as free, and thus the main goal of I/O algorithms is to access memory with locality. That is, when bringing data into cache from main memory in contiguous chunks, we would like to take full advantage of the fetched cache line. This is preferable to, say, randomly accessing noncontiguous words in memory.
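To make the cost measure concrete, the following minimal Python sketch counts cache misses for a sequence of word accesses. It is an illustration only: the LRU replacement policy, the aligned cache lines, and the names `count_misses`, `sequential`, and `strided` are assumptions of this sketch, not part of the model's definition.

```python
from collections import OrderedDict

def count_misses(addresses, M, B):
    """Count cache misses for a sequence of word addresses.

    Assumptions (illustrative): the cache holds M // B lines, each line is the
    aligned block of B words containing the address, and eviction is LRU.
    """
    capacity = M // B
    cache = OrderedDict()  # block index -> None, least recently used first
    misses = 0
    for addr in addresses:
        block = addr // B
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1                    # miss: fetch the whole cache line
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return misses

N, M, B = 1024, 64, 8
# Contiguous scan: every fetched line is fully used.
sequential = count_misses(range(N), M, B)
# Same N words, but visited with stride B: each access lands in a new line.
strided = count_misses([k * B + r for r in range(B) for k in range(N // B)], M, B)
```

Under these assumptions the contiguous scan incurs N/B misses while the strided order, which touches exactly the same words, incurs N misses, which is the gap the paragraph above describes.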
When taking good graph algorithms for the RAM model and analyzing them in the I/O model, the running times are often very bad. Take for example Dijkstra's algorithm or the standard BFS algorithm. These algorithms fundamentally look at an adjacency list, and follow pointers to every adjacent node. Once the new node is reached, the process repeats, accessing all the adjacent nodes in priority order that have not previously been visited. This behavior looks almost like random access! Unless one can efficiently predict the order these nodes will be reached, the nodes will likely be stored far apart in memory. Even worse, this optimal order could be very different depending on what node one starts the algorithm at.
Because of this bad behavior, I/O-efficient algorithms for graph problems take a different approach. For dense graphs, one approach is to reduce the problems to matrix equivalent versions. For example, APSP is solved by (min, +) matrix multiplication [JWK81, PS14b, Sei92] . The locality of matrices leads to efficient algorithms for these problems.
Unfortunately, sparse graph problems are not solved efficiently by (min, +) matrix multiplication. For example, the best algorithms for directed single-source shortest paths in sparse graphs take O(n) time, giving no improvement from the cache line at all [Bro04, CR04, ABD+07]. Even in the undirected case, the best algorithm takes O(n/√B) time in sparse graphs [MZ03]. The Diameter problem in particular has resisted improvement beyond O(|V|/√B) even in undirected unweighted graphs [MM02], and in directed graphs, the best known algorithms still run in time Ω(|V|) [CGG+95, ABD+07]. For this reason, we take the hardness of sparse diameter as a conjecture and build a network of reductions around sparse diameter and other sparse graph problems.
In this paper we seek to explain why these problems, and other problems in the I/O model, are so hard to improve and to get faster algorithms for some of them.
In this paper we use reductions to generate new algorithms, new lower bounds, and a time hierarchy in the I/O model. Specifically, we get new algorithms for computing the diameter and radius in sparse graphs, when the computed radii are small. We generate novel reductions (which work in both the RAM and I/O models) for the Wiener Index problem (a graph centrality measure). We generate further novel reductions which are meaningful in the I/O model related to sparse graph problems. Finally, we show that an I/O time hierarchy exists, similar to the classic Time Hierarchy Theorem.
Caching Model and Related Work. Cache behavior has been studied extensively. In 1988, Aggarwal and Vitter [AV88] developed the I/O model, also known as the external-memory model [Dem02], which now serves as a theoretical formalization for modern caching models. There is a significant body of work on algorithms and data structures in this model, including buffer trees [Arg03], B-trees [BM70], permutations and sorting [AV88], ordered file maintenance [BCD+02], (min, +) matrix multiplication [JWK81, PS14b], and triangle listing [PS14a]. Frigo, Leiserson, Prokop and Ramachandran [FLPR99] proposed the cache-oblivious model. In this model, the algorithm is not given access to the cache size M, nor is it given access to the cache-line size B. Thus the algorithm must be oblivious to the cache, despite being judged on its cache performance. Some surveys of the work include [Vit01, Arg97, Dem02].
When requesting cache lines from main memory in this paper we will only request the B words starting at location xB for integers x. Another common model, which we do not follow in this paper, allows arbitrary offsets for the cache line pulls. This can be simulated with at most twice as many cache misses and twice as much cache.
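The factor of two in this simulation can be seen from a short sketch: a cache line of B words starting at an arbitrary offset overlaps at most two aligned blocks. The helper name `aligned_blocks` is hypothetical and only illustrates the argument.

```python
def aligned_blocks(start, B):
    """Indices x of the aligned blocks [x*B, x*B + B) that cover the B words
    start, start + 1, ..., start + B - 1 (an arbitrary-offset cache line)."""
    first = start // B            # block containing the first word
    last = (start + B - 1) // B   # block containing the last word
    return list(range(first, last + 1))
```

For example, with B = 8 a line starting at word 13 is covered by aligned blocks 1 and 2, while a line starting at word 16 is covered by block 2 alone.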
Fine-grained Complexity. A popular area of recent study is fine-grained complexity. The field uses efficient reductions to uncover relationships between problems whose classical algorithms have not been improved substantially in decades. Significant progress has been made in explaining the lack of progress on many important problems [BI15, WW10, AGW15, AW14, Bri14, ABW15a, RW13] such as APSP, orthogonal vectors (OV), 3-SUM, longest common subsequence (LCS), edit distance and more. Such results focus on finding reductions from these problems to other (perhaps less well-studied) problems, so that an improvement in the upper bound for any of those problems leads to an improvement in the running time of algorithms for the original problems. For example, research around the All-Pairs Shortest Paths problem (APSP) has uncovered that many natural, seemingly simpler graph problems on n-node graphs are fine-grained equivalent to APSP, so that an O(n^{3−ε}) time algorithm for some ε > 0 for one of the problems implies an O(n^{3−ε'}) time algorithm for some ε' > 0 for all of them.
History of Upper Bounds
In the I/O model, the design of algorithms for graph problems is difficult. This is demonstrated by the number of algorithms designed for problems like Sparse All Pairs Shortest Paths, Breadth First Search, Graph Radius and Graph Diameter where only very minor improvements are made (see Table 1 for definitions of problems). Note that the dense and sparse qualifiers in front of problem names indicate that the problem is defined over a dense or sparse graph, respectively.
The Wiener index problem measures the total distance from all points to each other. Intuitively, this measures how close or far points in the graph are from each other. In this respect, Wiener index is similar to the radius, diameter and median measures of graph distance.
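As a small illustration of the definition, the sketch below computes the Wiener index of a connected, undirected, unweighted graph by running BFS from every node. It assumes the convention that each unordered pair is counted once, and it is a plain RAM-style computation rather than an I/O-efficient one; the function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source in an unweighted graph given as an adjacency dict."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def wiener_index(adj):
    """Sum of d(u, v) over all unordered pairs {u, v}; assumes a connected graph."""
    total = 0
    for s in adj:
        total += sum(bfs_distances(adj, s).values())
    return total // 2  # every unordered pair was counted from both endpoints
```

On the path 0 - 1 - 2 the pairwise distances are 1, 1 and 2, so the index is 4.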
The history of improvements to the upper bound of negative triangle in the I/O model is an important example of the difficulty in the design of I/O-efficient algorithms for graph problems (see Table 2 for a summary). For a long time, no improvements in terms of M were made to the upper bound for negative triangle. A key insight was re-interpreting the problem as a repeated scan of lists.
We feel that the history of negative triangle has taught us that upper bounds in the I/O model on graph problems are best achieved by creating efficient reductions from a graph problem to a non-graph problem. Hence the study of fine-grained reductions in the I/O model is crucial to using this approach in solving such graph problems with better I/O efficiency. Graph problems tempt the algorithmic designer into memory access patterns that look like random access, whereas matrix and array problems immediately suggest memory-local approaches to these problems. We consider the history of the negative triangle problem to be an instructive parable of why the matrix and array variants are the right way to view I/O problems.

Table 1 (Problem Name: Problem Definition)
Orthogonal Vectors (OV): Given two sets U and V of n vectors each with elements in {0, 1}^d where d = ω(log n), determine whether there exist vectors u ∈ U and v ∈ V such that Σ_{i=1}^{d} u[i]·v[i] = 0.
Longest Common Subsequence (LCS): Given two strings of n symbols over some alphabet Σ, compute the length of the longest sequence that appears as a subsequence in both input strings.
Edit Distance (ED): Given two strings s_1 and s_2, determine the minimum number of operations that converts s_1 to s_2.
Sparse Diameter: Given a sparse graph G = (V, …
…: Given two n by n matrices A and B, …
Sparse Weighted Diameter: Given a sparse, weighted graph G = (V, E), determine max_{u,v∈V} d(u, v), where d(u, v) is the distance between nodes u and v in V.
Our Results
We will now discuss our results in this paper. In all tables in this paper, our results will be in bold.
We demonstrate the value of reductions as a tool to further progress in the I/O model through our results.
Our results include improved upper bounds, a new technique for lower bounds in the I/O model and the proof of a computational hierarchy. Notably, in this paper we tie the I/O model into both fine-grained complexity and classical complexity.
Upper Bounds
We get improved upper bounds on two sparse graph problems and have a clarifying note to the community about matrix multiplication algorithms.
For both the sparse 2 vs 3 diameter and sparse 2 vs 3 radius problems, we improve the running time from O(n^2/B) to O(n^2/(MB)). We get these results by using an insight from a pre-existing reduction to two very local problems which have trivial O(n^2/(MB)) algorithms that solve them. Note that this follows the pattern we note in Section 1.1 in that we produce a reduction from a graph problem to a non-graph problem to obtain better upper bounds in terms of M.
Furthermore, previous work in the I/O model related to matrix multiplication seems to use the naive n^3 matrix multiplication bound, or the Strassen subdivision. However, fast matrix multiplication algorithms which run in n^ω time imply a nice self-reduction. Thus, we can get better I/O algorithms which run in the most recent fast matrix multiplication time. We want to explicitly add a note in the literature that fast matrix multiplication in the I/O model should run in time
$T_{MM}(n, M, B) = O\left(n^{\omega} / (M^{\omega/2 - 1} B)\right)$,
where ω is the matrix multiplication exponent, if it is derived using techniques bounding the rank of the matrix multiplication tensor. The current best bound is ω < 2.373 [Vas12, Gal14], giving us the I/O running time of T_MM(n, M, B) = O(n^{2.373}/(M^{0.187} B)). We give these results in Section 2.3.
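The exponent 0.187 on M in the stated bound can be checked with a short calculation. The sketch below assumes, up to constant factors, that the recursion stops at blocks of side $\sqrt{M}$, which fit in cache and cost $M/B$ I/Os each, and that the leaves dominate the total cost; these are assumptions of the illustration rather than a restatement of the proof in Section 2.3.

```latex
\[
\underbrace{\left(\frac{n}{\sqrt{M}}\right)^{\omega}}_{\text{number of leaves}}
\cdot \underbrace{\frac{M}{B}}_{\text{cost per leaf}}
= \frac{n^{\omega}}{M^{\omega/2 - 1}\, B},
\qquad
\omega = 2.373 \;\Longrightarrow\; \frac{\omega}{2} - 1 = 0.1865 \approx 0.187 .
\]
```

For the naive algorithm ($\omega = 3$) the same calculation gives the familiar $n^{3}/(\sqrt{M}\,B)$.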
I/O model Conjectures
In the I/O model a common way to get upper bounds is to get a self-reduction where a large problem is solvable by a few copies of a smaller problem. We make the small subproblems so small they fit in cache. If the problem is laid out in a memory local fashion in main memory then it will take M/B I/Os to solve a subproblem that fits in memory M .
In Section 2.2, we give an I/O-based Master Theorem which gives the running time for algorithms with recurrences of the form T (n, M, B) = αT (n/β, M, B) + f (n, M, B) (like the classic Master Theorem from [CLRS09] ) and T (n, M, B) = g 2 T (n/β, M, B) + f (n, M, B) (self-reduction). The running times generated by these recurrences match the best known running times of All-Pairs Shortest Paths (APSP), 3-SUM, Longest Common Subsequence (LCS), Edit Distance, Orthogonal Vectors (OV), and more. Thus, if we conjecture that a recursive algorithm has a running time that is optimal for a problem, we are able to transfer this bound over to the I/O model using our Master Theorem and self-reduction framework in a natural way.
Lower Bounds From Fine-Grained Complexity Assumptions. We demonstrate that many of the reductions in the RAM model between problems of interest and common fine-grained assumptions give lower bounds in the I/O model. We generate reasonable I/O conjectures for these problems and demonstrate that the reductions are I/O-efficient. First, we begin with the conjectures. … M^{1+o(1)} B^{1+o(1)} I/Os. From these conjectures we can generate many lower bounds. Many of our lower bounds are tight with respect to the fastest known algorithms. These reductions have value even if the conjectures are refuted, since many of them also give upper bounds for other problems and thus lead to better algorithms.
Lower Bounds from Sparse Graph Problems
In addition to the upper bounds, lower bounds, and reductions presented in the I/O model for the standard RAM problems listed in Table 3, we introduce novel upper bounds, lower bounds, and reductions between graph problems. The reason for this focus is that, more than the RAM model, the I/O model has a history of particularly slow algorithms on graphs. In particular, sparse graph problems have very slow algorithms. We make novel reductions between sparse graph problems, many of which apply to the RAM model as well, such that solving one of these problems will solve many other variations of hard sparse graph problems in the I/O model.
Table columns: Problem; Upper Bound; UB source; Lower Bound; LB from; LB source.

We provide reductions between problems that currently require Ω(n/√B) time to solve. Thus, these problems specifically require linear time reductions. We show equivalence between the following set of problems for undirected/directed and unweighted graphs: (s, t)-shortest path, finding the girth through an edge, and finding the girth through a vertex.
We additionally generate a new reduction from sparse weighted Diameter to the sparse Wiener Index problem in Section 3.1. This reduction holds in the RAM model as well as the I/O model.
Hierarchy
The time and space hierarchy theorems are fundamental results in computational complexity that tell us there are problems which can be solved on a deterministic Turing Machine with some bounded time or space, which cannot be solved on a deterministic Turing Machine which has access to less time or space. See, notably, the famous time and space hierarchies [Sip06] . For some classes, for example BPP, no time hierarchy is known to exist (e.g., [Wil, Bar02] ).
In Section 5.3, we show that similar separation hierarchies exist in the I/O model, once again using the simulations between the RAM and I/O models and our complexity class CACHE_{M,B}(t(n)), defined in Section 5.2 as the set of problems solvable in O(t(n)) cache misses.
Theorem 1.5. If the memory used by the algorithm is referenceable by O(B) words (i.e. the entire input can be brought into cache by bringing in at most O(B) words), then
Notably, this theorem applies any time we use polynomial-size memory and our word size is w = Ω(lg n), which is the standard case in the RAM model. This separation motivates looking at the complexity of specific problems and trying to understand what computational resources are necessary to solve them.
Improved TM Simulations of RAM Imply Better Algorithms
In Section 5.4, we show that improved simulations of RAM machines by Turing Machines would imply better algorithms in the I/O model. Specifically, if we can simulate RAM more efficiently with either multi-tape Turing machines or multi-dimensional Turing machines, then we can show that we can gain some cache locality and thus save by some factor of B, the cache line size.
Organization
In this paper, we argue that the lens of reductions offers a powerful way to view the I/O model. We show that reductions give novel upper and lower bounds. We also define complexity classes for the I/O model and prove a hierarchy theorem, further motivating the analysis of the I/O model using fine-grained complexity.
We begin with faster algorithms obtained through reductions which are collected in Section 2. Section 2.1 develops such algorithms for small diameter and radius. Section 2.2 develops the I/O Master Theorem, which is more broadly a useful tool for analyzing almost all cache-oblivious algorithms. Section 2.3 uses this theorem to show how all recent improvements to matrix multiplication's RAM running time also give efficient cache-oblivious algorithms.
One can get new lower bounds by using the techniques from fine-grained complexity. Some fine-grained reductions from the RAM model also work in the I/O model; we show examples in
Algorithms in the I/O Model
In this section, we discuss our improved algorithms, algorithm analysis tools, and how reductions generate algorithms. As is typical in the I/O model, we assume that all inputs are stored on disk and any computation on the inputs is done in cache (after some or all of the inputs are brought into cache). Section 2.1 gives better algorithms for the 2 vs 3 Diameter problem and the 2 vs 3 Radius problem in the I/O model. Section 2.3 gives improved algorithms for Matrix Multiplication in the I/O model.
Self-reductions are commonly used for cache-oblivious algorithms, because dividing until the subproblems are arbitrarily small allows for the problems to always fit in cache. In the RAM model, self-reductions allow for easy analysis via the Master Theorem. Despite the amount of attention to analyzing self-reductions in the I/O model, no one has written down the I/O-based Master Theorem. In Section 2.2, we describe and prove a version of the Master Theorem for the I/O model. We present a proof of this theorem to simplify our analysis and to help future papers avoid redoing this analysis.
Finally, in Section 2.4 we explain how some reductions in the RAM model imply faster algorithms in the I/O model.
Algorithms for Sparse 2 vs 3 Radius and Diameter
For both the radius and diameter problems on unweighted and undirected graphs, we show that distinguishing between a diameter or radius of 2 and a larger diameter or radius can be done efficiently. Our algorithm relies on the reinterpretation of the 2 vs 3 problem as a set-disjointness problem. Every node v has an associated set S_v, its adjacency list union itself. If two nodes have disjoint sets S_v and S_u, then they are at distance greater than 2 from each other. Our algorithms for 2 vs 3 diameter and radius save an entire factor of M over the previously best known running times. This is a similar idea to the reduction from 2 vs 3 diameter to OV and from 2 vs 3 radius to Hitting Set in the RAM model. These reductions were introduced by Abboud, Vassilevska-Williams and Wang [AWW16]. While these reductions exist in the RAM model, they do not result in faster algorithms for 2 vs 3 diameter and radius in the I/O model, because they use a hashing step that results in BFS being run from |E|/∆ nodes, for a tunable parameter ∆ that gives the orthogonal vectors instance a dimension of ∆^2. In the I/O model, BFS is quite inefficient: we would need to set ∆ on the order of M to get an efficient algorithm using the approach in [AWW16]. But with a dimension of M^2 the algorithm will run very slowly. Therefore, below we present a solution to the set disjointness problem with no hashing into a smaller dimension.
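The set-disjointness view can be sketched directly. The following Python fragment is a naive, RAM-style illustration of the criterion only (it compares all pairs of closed neighbourhoods and ignores memory layout); it is not the I/O-efficient algorithm of this section, and the function name is hypothetical.

```python
def diameter_exceeds_two(adj):
    """Return True iff some pair of nodes is at distance greater than 2.

    adj maps each node of an undirected, unweighted graph to its neighbours.
    """
    # S_v: the adjacency list of v together with v itself.
    closed = {v: set(nbrs) | {v} for v, nbrs in adj.items()}
    nodes = list(adj)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            # Disjoint closed neighbourhoods: no path of length <= 2 joins u and v.
            if closed[u].isdisjoint(closed[v]):
                return True
    return False
```

On the path 0 - 1 - 2 - 3 the sets S_0 = {0, 1} and S_3 = {2, 3} are disjoint, so the function reports a diameter larger than 2; on a star every pair of sets shares the centre.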
Below we present the cache-aware algorithm for distinguishing 2 vs 3 diameter in an undirected, unweighted graph. The algorithms and proofs for cache-obliviousness are finicky, but fundamentally are self-reductions of the form T(n) = 4T(n/2) + n/B. We leave the proofs to the Appendix because uneven subdivision and tracking bits are not very illuminating to the overall scope of our paper.
We will start by giving a non-oblivious algorithm which relies on a recursive self-reduction. We will then show how to make this oblivious. It is easier to explain the analysis and algorithm when we can rely on the size of cache, but we can avoid that and get an oblivious algorithm anyway. The previous best algorithm is from Arge, Meyer and Toma, which achieves O(|V| sort(|E|)) = O(…). We get an improvement over the previous algorithm in running time whenever … ) time.

Give each node an extra indicator bit alreadyClose. We split the nodes into those with adjacency lists of length less than or equal to M/4 and those with adjacency lists longer than M/4. We call these the short and long adjacency lists, respectively. Let A_S be the ordered set created by concatenating short lists ordered by length from shortest to longest, and let A_L be the ordered set created by concatenating long lists, also ordered by length.
Sub-divide A_S into subsections of length at least M/4 and less than M/2 in the following way. … where i ≤ l 2 < i + 2 then include it in subsection i. A given subsection can have length at most … . Then, create a copy of A_S called C_S where C_S[i] contains the subsection of index i. We create a copy for convenience since we want to maintain the original A_S while modifying C_S, and copying is cheap here. Furthermore, C_S is different from A_S in the sense that C_S is an array of arrays (since it maintains the subsections we created from A_S using the procedure above). However, if one wants to be more efficient, one can perform the rest of the algorithm more carefully and directly modify A_S instead of C_S. Also note there are at most …
… contains the subsection of A_L with index i (again this copy is an array of arrays). Note there are at most …
We would like to check if any of the nodes with long adjacency lists are far from other nodes. Ideally we would just run BFS from each node, but BFS in sparse graphs runs slowly (by a multiplicative factor of √ B) in the I/O model. So, we will instead use a method of scanning through these lists.
First we will check if any two long lists are far from each other. For i ∈ [1, k] we set v_i to be the node associated with adjacency list C_L[i]. For every j ∈ [1, k] we scan through the adjacency lists C_L[i] and C_L[j] in sorted order, progressing simultaneously in both lists, to see if the intersection of the sets of nodes (each list also includes v_i and v_j, respectively) in these two adjacency lists is nonempty. If all are close (intersection non-empty), we move on. If any are far (intersection empty), we return that the diameter is > 2. This takes time … , if all nodes v_j represented by adjacency lists in C_S have alreadyClose == True we move on, but if any of them have alreadyClose == False we return that the diameter is > 2. This takes time … M B.

Now that we have verified all of the long adjacency lists we need only compare short lists to short lists. We do this by bringing in every pair for i, j ∈ [0, …

Now that we have this framework we can give a cache-oblivious version of the algorithm. The proofs of cache-oblivious 2 vs 3 Diameter and 2 vs 3 Radius are included in Appendix A.
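The simultaneous scan of two sorted adjacency lists used above is the standard two-pointer intersection test. A minimal sketch follows, under the assumption that each list is sorted and already includes its own node; the function name is hypothetical.

```python
def sorted_lists_intersect(a, b):
    """Return True iff the sorted lists a and b share an element.

    Both lists are read front to back exactly once, so when they are stored
    contiguously the scan touches each cache line of each list at most once.
    """
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            return True   # common node found: the two nodes are close
        if a[i] < b[j]:
            i += 1        # advance the list with the smaller current element
        else:
            j += 1
    return False          # one list exhausted without a match: the nodes are far
```

Because both pointers only move forward, lists of total length L are compared in O(L) steps and, when laid out contiguously, in O(L/B + 1) I/Os.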
Master Theorem in the I/O Model
In this section, we formally define our Master Theorem framework for the I/O model and provide bounds on the I/O complexity of problems whose I/O complexity fits the specifications of our framework. In addition, we also describe some example uses of our Master Theorem for the I/O model.
The Master Theorem recurrence in the RAM model looks like T (n) = aT (n/b) + f (n). We will use a similar recurrence but all functions will now be defined over n, M and B. The I/O-Master Theorem function f (n, M, B) includes all costs that are incurred in each layer of the recursive call. This includes the I/O complexity of reading in an input, processing the input, processing the output and writing out the output. In this section, we assume that f (n, M, B) is a monotonically increasing function in terms of n in order to apply our Master Theorem framework. What this means is that for any fixed M and B we want the number of I/Os to increase or stay the same as n increases. Given that f (n, M, B) specifies the I/O complexity of reading in the inputs and writing out the outputs, we prove the following version of the Master Theorem in the I/O model.
Theorem 2.2 (I/O Master Theorem).
If f (n, M, B) contains the cost of reading in the input (for each subproblem) and writing output (after computation of each subproblem), then the following holds. Given a recurrence of T (n, M, B) = αT (n/β, M, B) + f (n, M, B), where α ≥ 1 and β > 1 are constants, and a base case of T (n/x, M, B) = t(x, M, B) (where t(x, M, B) = Ω(1)) for some x ≤ n and some function t(x, M, B).
for some ε 1 , ε 2 > 0, then we get the following cases:
and all sufficiently large n, then
.
, and none of the previous cases are satisfied, then
(note that this includes if A and f are incomparable), with tighter upper bounds provided in our proof (specifically Eqns. 8, 9, and 10) depending on characteristics of the actual function, f (n, M, B).
Proof. First, note that, given our condition on f , we do not have to worry about clever maintaining of previous cache computations since f (n, M, B) includes the cost of reading in input and writing out output.
We first show that the final cost of the recurrence is
where the recursion cost of F (n) is given as below
We consider the recursion tree of the recursion defined in Eq. 2. The root is at level 0. At level j of the tree, there exist α^j nodes, each of which costs f(n/β^j, M, B) I/Os to compute. The leaves of the tree each cost f(x, M, B) = t(x, M, B) I/Os to compute. There exist $\Theta\left((n/x)^{\log_\beta \alpha}\right)$ leaves, resulting in a total cost of $\Theta\left((n/x)^{\log_\beta \alpha}\, t(x, M, B)\right)$ I/Os, and the number of I/Os needed in the remaining nodes of the tree is $\sum_{j=0}^{\log_\beta (n/x) - 1} \alpha^j f(n/\beta^j, M, B)$. Summing these costs gives the recursion stated in Eq. 2. Finally, to obtain our final I/O cost given in Eq. 1, we know that reading in the input incurs a fixed cost of Θ(n/B) I/Os regardless of the efficiency of the rest of the algorithm and the size of cache.
Let $g(n, M, B) = \sum_{j=0}^{\log_\beta (n/x) - 1} \alpha^j f(n/\beta^j, M, B)$. We now bound g(n, M, B):
log β α t(x, M, B . Since we know that
log β α−ε t(x, M, B)) for some ε > 0 (and all ε 1 ≤ ε), then we know that
. This then yields
where we choose an arbitrarily small ε = ε 1 such that β ε 1 < 1.
Case 2: We prove g(n, M, B) = Θ(f(n, M, B)). For sufficiently large n, we know that α(T(n/β, M, B)) ≤ cf(n, M, B) for some constant c < 1. Let c = c_0 and n' be the smallest constants such that this is satisfied for all values of … , n] where L_B is the smallest value of B that satisfies the algorithm's tall-cache assumption, and n ≥ n'. Let j = u be the largest exponent of β^j such that n/β^u ≥ n'. Therefore, we rewrite the equation for g(n, M, B) in this case to be
Trivially (by the case when j = 0), we know that g(n, M, B) = Ω(f (n, M, B)).
. Since we know that
, we then also know that
and thus obtain the following g(n, M, B) in this case
We now prove the first three cases using our bounds on g(n, M, B) above.
Case 4: Given our assumption that f (n, M, B) = Ω(n/B), we first need to show an upper bound on g(n, M, B) (it is trivially Ω(f (n, M, B))).
By using Eq. 3, we can obtain the following upper bound on g(n, M, B) when
Thus, this gives us the final expression for T (n) to be
If α f (β) = 1, then
Finally, if α f (β) < 1, then we obtain the following expression for g(n, M, B):
Trivially, g(n, M, B) = Ω(f (n, M, B)). Therefore, we know that the final expression for T (n) to be
Note that we do not present the proofs for when we need to take ⌊log_β(n/x) − 1⌋ or ⌈log_β(n/x) − 1⌉ since the proofs are nearly identical to those presented for the original Master Theorem [CLRS09].
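The recursion tree in the proof above can also be evaluated numerically. The sketch below is illustrative only: it assumes a base case of M/B I/Os once a subproblem fits in cache, and it instantiates the recurrence with the naive recursive n × n matrix multiplication (α = 8, β = 2, f = n²/B). The function and parameter names are hypothetical.

```python
def recurrence_cost(n, M, B, alpha, beta, f, fits_in_cache):
    """Evaluate T(n, M, B) = alpha * T(n / beta, M, B) + f(n, M, B).

    Assumed base case: a subproblem that fits in cache costs M / B I/Os.
    """
    if fits_in_cache(n, M):
        return M / B
    smaller = recurrence_cost(n / beta, M, B, alpha, beta, f, fits_in_cache)
    return alpha * smaller + f(n, M, B)

# Assumed example: 8 subproblems of half the side length, n^2 / B extra I/Os
# per call, and a subproblem fits in cache once n^2 <= M.
n, M, B = 2**10, 2**12, 2**4
cost = recurrence_cost(
    n, M, B, alpha=8, beta=2,
    f=lambda n, M, B: n * n / B,
    fits_in_cache=lambda n, M: n * n <= M,
)
closed_form = n**3 / (M**0.5 * B)  # n^3 / (sqrt(M) * B)
```

With these parameters the leaves contribute 8^4 · M/B = 2^20 I/Os and the internal levels contribute 15 · 2^16, so the evaluated cost stays within a factor of two of n³/(√M · B).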
One-Layer Self-Reductions. We state a relationship between one-layer self-reductions and our Master Theorem framework above. We call the process of solving a problem by reducing it, in a single recursive step, to several smaller problems that can each be solved in cache a one-layer self-reduction. Suppose the runtime of an algorithm in the RAM model is n^{log_β α}; then by dividing the problems into … which is the same result we obtain via our Master Theorem framework above when t(x, M, B) = M/B.
We now prove formally the theorem related to one-layer self-reductions.
Theorem 2.3. Let P be a problem of size n which can be reduced to g(n/M ) sub-problems, each of which takes T (M, M, B) I/Os to process. The runtime of such a one-layer self reduction for the problem
Proof. The number of I/Os needed to process a subproblem of size M is given by T(M, M, B). If g(n/M) is the total number of such subproblems of size M that need to be processed, then the total number of I/Os needed to process all subproblems is O(g(n/M) T(M, M, B)). Then, reading in the input of size n requires Ω(n/B) I/Os. Any other I/Os incurred while processing the g(n/M) subproblems would result in a total of f(n, M, B) I/Os.
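As a worked instance of the counting in this proof, consider a problem with a quadratic number of cache-sized subproblems. The choice $g(k) = k^2$ and the per-subproblem cost $T(M, M, B) = O(M/B)$ are assumptions made for illustration; they correspond to the savings of a factor of $MB$ mentioned in the abstract for Longest Common Subsequence.

```latex
\[
g\!\left(\frac{n}{M}\right) \cdot T(M, M, B)
= \left(\frac{n}{M}\right)^{2} \cdot O\!\left(\frac{M}{B}\right)
= O\!\left(\frac{n^{2}}{M B}\right).
\]
```

Here the subproblem term dominates the $\Omega(n/B)$ cost of reading the input whenever $n = \Omega(M)$.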
Faster I/O Matrix Multiplication via I/O Master Theorem
As we mentioned above, our I/O Master Theorem applies to any I/O algorithm that has a self-reduction of one of the forms stated in Section 2.2. Using it, we can show an I/O matrix multiplication bound comparable to the RAM-model matrix multiplication bound based on finding the rank of the Matrix Multiplication Tensor. Recent improvements to matrix multiplication's running time also imply faster cache-oblivious algorithms. Recent work has improved the bounds on ω, where ω is the constant such that for any 0 < ε < 1 there is an algorithm for n by n matrix multiplication that runs in n^{ω+ε} time. The most recent improvements on these bounds have been achieved by bounding the rank of the Matrix Multiplication Tensor [Vas12, Gal14].
The I/O literature does not seem to have kept pace with these improvements. While previous work discusses the efficiency of naive matrix multiplication and Strassen matrix multiplication, it does not discuss the further improvements that have been generated.
We note in this section that the modern techniques to improve matrix multiplication running time, those of bounding the rank of the Matrix Multiplication Tensor, all imply cache-efficient algorithms.
be the time it takes to do matrix addition on matrices of size n by n. If the matrix multiplication tensor's rank is bounded such that the RAM model running time is n ω +ε for any 0 < ε < 1 then the following self-reduction exists for some constant α,
Self-reductions feed conveniently into cache oblivious algorithms. Notably, when we plug this equation into the I/O Master Theorem from Section 2.2, we obtain the following bound as given in Lemma 2.5. Recursive structures like this tend to result in cache-oblivious algorithms. After all, regardless of the size of cache, the problems will be broken down until they fit in cache. Then, when a problem and the algorithm's execution fit in memory, the time to answer the query is O(M/B), regardless of the size of M and B.
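To illustrate why such recursive structures need no knowledge of M or B, here is a minimal sketch of a self-reduction for matrix multiplication. It deliberately uses the naive eight-way quadrant recursion rather than an algorithm derived from a tensor-rank bound, assumes n is a power of two, and uses Python list slicing, which does not model memory layout; all function names are hypothetical.

```python
def add(X, Y):
    """Entrywise sum of two equally sized square matrices (lists of rows)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def split(X):
    """Return the four quadrants of X: top-left, top-right, bottom-left, bottom-right."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def multiply(X, Y):
    """Recursive n x n product; the recursion never refers to a cache size."""
    n = len(X)
    if n == 1:
        return [[X[0][0] * Y[0][0]]]
    A, Bq, C, D = split(X)
    E, F, G, H = split(Y)
    top_left = add(multiply(A, E), multiply(Bq, G))
    top_right = add(multiply(A, F), multiply(Bq, H))
    bottom_left = add(multiply(C, E), multiply(D, G))
    bottom_right = add(multiply(C, F), multiply(D, H))
    # Stitch the quadrants back together row by row.
    top = [left + right for left, right in zip(top_left, top_right)]
    bottom = [left + right for left, right in zip(bottom_left, bottom_right)]
    return top + bottom
```

The code halves the problem until it is trivially small; on a real machine the subproblems start fitting in cache at whatever level n² drops below M, without the algorithm ever being told M.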
Lemma 2.5. If the matrix multiplication tensor's rank is bounded such that the RAM model running time is n ω +ε for any 0 < ε < 1 and Theorem 2. for any 0 < ε < 1.
Proof. The algorithm that uses the self-reduction implied by bounding the border rank produces the recurrence
by Theorem 2.4. By our bound on
and our base case
f (α) > 1. The algorithm implied by the self-reduction does not change depending on the size of the cache or the cache line. The running time will follow this form regardless of M and B. Thus, this algorithm is cache-oblivious.
RAM Reductions Imply I/O Upper Bounds
Reductions are generally between problems of the same running time in terms of their input size. There are also reductions showing that problems with faster running times are harder than problems with slower running times. For example, if zero triangle is solved in O(|E|^{1.5−ε}) time then 3-SUM is solved in O(n^{2−ε}) time. An open problem in fine-grained complexity is showing that a problem that runs slower is harder, in the sense of reductions, than a problem that runs faster. Some types of reductions in the inverse direction would imply a faster I/O algorithm for zero triangle and thus imply faster I/O algorithms for APSP and (min, +) matrix multiplication.
Notably 3-SUM saves a factor of M B in the I/O-model whereas the 0 triangle problem saves a factor of only √ M B. One type of reduction in the RAM model is of the form
Where the running times of the T 3SU M problems summed equal n 3−ε if 3-SUM is solvable in truly sub-quadratic time. In this case we will get an improvement over the zero triangle running time when M B = O(n ε ). Notably, if the extra work done is I/O-efficient then the zero triangle problem could be solved faster at a wider range of values of M and B. The reductions we have covered in this paper have had the extra work be efficient. The reductions are efficient in spite of the fact that these reductions were originally RAM model reductions which did not care about memory locality. There is, however, one kind of RAM reduction which does not imply speedups in the I/O-model. Following in the style of Patrascu's convolution 3-SUM reduction [Pǎt10] , we can have a reduction of the form
where g is an integer, c is an arbitrary constant and i ∈ [0, 3/2]. These reductions imply speedups when a polynomial time improvement is made for 3SUM, but does not immediately imply speedups if no polynomial time improvement is made for 3SUM. If the additional n 3 /g work is I/O-inefficient, this reduction might not imply speedups in the I/O model. If one is trying to show hardness for 3SUM from APSP, the approach that does not imply algorithmic improvements must have a large amount of I/O-inefficient work. We suggest the more fruitful reductions to look for have the form of Eq. 11.
Novel Reductions
In this section we cover reductions related to Wiener Index, Median, Single Source Shortest Paths, and s-t Shortest Paths. We first cover our super linear lower bounds, then cover the linear lower bounds.
Super Linear Lower Bounds
We present reductions in the I/O model which yield new lower bounds. We have as corollaries of these same reductions related lower bounds in the RAM model. Many of these reductions relate to the problem of finding the Wiener Index of the graph.
We show diameter reduces to Wiener Index, APSP reduces to Wiener Index, and we show 3 vs 4 radius reduces to median.

Proof. Create a graph G′ by adding two special sets of nodes X′ and T′ to G. Specifically, ∀x ∈ X add x′ to G′ and add an edge with weight 1 (or if G is an unweighted graph just an edge) to the node x in G. And, ∀t ∈ T add t′ to G′ and add an edge with weight 1 (or if G is an unweighted graph just an edge) to the node t in G. Now we will ask for the Wiener index of 4 graphs: G′, G + X′, G + T′ and G. Let WI(G) be the Wiener index of graph G. Note that the shortest path between x′ and t′ in G′ is δ(x, t) + 2 and the shortest path between x′_1, x′_2 ∈ X′ is δ(x_1, x_2) + 2, and these paths always use edges {(x′, x), (t, t′)} and {(x′_1, x_1), (x_2, x′_2)}, respectively, and otherwise exclusively edges in G. Thus, we have that the formula
, which is the desired value. We need to add O(n) nodes and edges to the original graph which costs O

Proof. Given an undirected graph, G = (V, E), replace all edges with two directed edges (that form a cycle between the two endpoints of the original edge) and then proceed as described below with the new directed graph.
We will generate a new graph G′ with sections G_1, G_2, ..., G_{k+1} and nodes s_1, ..., s_k in the following way. Given a directed graph G make k + 1 copies of the vertex set, call them
Now note that the distance between v_1 and u_{k+1} is at least k. If any s_i is ever used, then any path from v_1 to u_{k+1} is guaranteed to use at least k + 1 edges. Otherwise, all paths between nodes in the first layer and the (k + 1)-st layer require k edges. The shortest paths can't be longer than k + 1 because v_1 → s_1 → s_2 → ... → s_k → u_{k+1} is a path of length k + 1 that always exists between every pair of nodes in the first layer and the (k + 1)-st layer.
A path of length k exists in this new graph iff a path of length k exists in G from u to v. Therefore, the distance between u_1 and v_{k+1} equals max{min{δ(u, v), k + 1}, k}. Now we can use Lemma 3.2 to compute the sum Σ_{x∈X} Σ_{t∈T} δ_{G′}(x_1, t_{k+1}). As discussed, δ_{G′}(x_1, t_{k+1}) = max{min{δ(x, t), k + 1}, k}. Thus, we get the desired sum.

Proof. We use Lemma 3.3 to compute r_k = Σ_{x∈X} Σ_{t∈T} max{min{δ(x, t), k}, k − 1}. Let a_k be the number of pairs (x, t) where δ(x, t) ≥ k; then r_k − |X||T|(k − 1) = a_k, since every pair contributes k − 1 and each pair at distance at least k contributes one more. Note that a_k − a_{k+1} is the number of pairs at distance exactly k. So with two calls to the algorithm from Lemma 3.3 we can compute the number of pairs at distance exactly k. We can now return a_k − a_{k+1} and a_k and get both values.
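The counting argument in the last proof can be checked on small inputs. The following is a minimal Python sketch under stated assumptions: the graph is unweighted and given as an adjacency dictionary, all function names are hypothetical, and `clamped_sum` computes r_k by brute-force BFS purely as a stand-in for the algorithm of Lemma 3.3 (it makes no attempt to be I/O-efficient).

```python
from collections import deque

def bfs_distances(adj, source):
    """Unweighted shortest-path distances from source, by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def clamped_sum(adj, X, T, k):
    """r_k: sum over pairs (x, t) of max(min(dist(x, t), k), k - 1).

    Brute-force stand-in (an assumption) for the Lemma 3.3 algorithm.
    """
    total = 0
    for x in X:
        dist = bfs_distances(adj, x)
        for t in T:
            d = dist.get(t, float("inf"))  # unreachable pairs clamp to k
            total += max(min(d, k), k - 1)
    return total

def pairs_at_least(adj, X, T, k):
    """a_k = r_k - |X||T|(k - 1): pairs at distance >= k."""
    return clamped_sum(adj, X, T, k) - len(X) * len(T) * (k - 1)

def pairs_exactly(adj, X, T, k):
    """a_k - a_{k+1}: pairs at distance exactly k (two clamped sums)."""
    return pairs_at_least(adj, X, T, k) - pairs_at_least(adj, X, T, k + 1)

# Path graph 0 - 1 - 2 - 3 with X = {0} and T = {1, 2, 3}.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(pairs_at_least(path, [0], [1, 2, 3], 2))
print(pairs_exactly(path, [0], [1, 2, 3], 2))
```

Each call to `pairs_exactly` uses exactly two clamped sums, mirroring the two calls to the Lemma 3.3 algorithm in the proof.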
We now show that Wiener index, in sparse graphs, can efficiently return small diameters. Notably, this means that improvements to the sparse Wiener index algorithm will imply faster algorithms for the sparse diameter problem than currently exist.

Proof. Using Theorem 3.4 we can binary search to find the largest value d such that there is a node at distance d and there are no nodes at distance larger than d. If this value is above k then there will be nodes at distance k + 1, which is efficient to check.

Proof. Using Theorem 3.4 we can check every distance from 1 up to k and return the number of nodes at that distance.
Next, we prove that improvements to median finding in sparse graphs improve the radius algorithm, using a novel reduction. Notably, in the I/O model 3 vs 4 radius is slower than 2 vs 3 radius, whereas in the RAM model these two problems both run in n^2 time. The gap in the I/O model of a factor of M is what allows us to make these statements meaningful.

Proof. First we run the algorithm from Theorem A.1 to determine if the radius is ≤ 2. If the radius is ≥ 3 then we will produce G′ by doing the following. Running this algorithm takes O
time.
See Figure 2 for an image of the completed G′. We start by adding four copies of the vertex set, V_1, ..., V_4, to G′ and add an edge between v ∈ V_i and u ∈ V_{i+1} if v and u are connected in the original graph G. Additionally, add nodes x_1, ..., x_3 where x_i is connected to all nodes in V_i and all nodes in V_{i+1}.
We want to enforce that a node in V_1 is the median; we can do this by adding many nodes close to nodes in V_1 and far from other nodes. We then want to check if there is a node in V_1 that is at distance 3 from all nodes in V_4. The median will give us the node with the smallest total distance, so we will want to correct for how close nodes in V_1 are to nodes in V_2 and V_3.
First we add 10 nodes y_1, ..., y_10, each connected to every node in V_1. Next add S_1, ..., S_10, which are sets of 10n nodes; node y_i will connect to all nodes in S_i. There are fewer than 7n nodes in the rest of the graph. Nodes in V_1 are at distance 2 from all the nodes in S_i. Nodes in S_i are at distance ≥ 3 from all nodes except those in V_1, y_i and S_i. So nodes in V_1 are the only possible medians, as they are closer by at least 90n to all nodes in the various S_i.
The algorithm we gave for 2 vs 3 radius (Theorem A.1) can return the number of nodes at distance 0, 1, 2, ≥ 3 for each node. We run this algorithm on U = V_1 ∪ x_1 ∪ V_2 ∪ x_2 ∪ V_3 and keep track of all of these numbers for each node. We will then add extra structure W so that all nodes in V_1 will have the same total sum of distances to all nodes in G′ − V_4. Then, the median returned will be whichever node in V_1 has the minimum total distance to V_4. Now, we note that if the radius is ≤ 3 the distance from a node to V_4 will be 3n. Next, we describe how to build W.
The total distance to all nodes in U from any node in V_1 will be between 4n + 5 and 6n + 5. So we create a node set A of 2n nodes to which we then add z_1 and z_2. We connect z_1 to all of V_1 and z_2. We connect z_2 to A. So, every node in A is at distance ≤ 3 from V_1. We then add log(2n) nodes T = {t_0, t_1, ..., t_{log(n)}}. We connect node t_i to 2^i nodes which are non-overlapping with those of any other t_j. We add a final node, z_3, which connects to the nodes in V_1 and all nodes t_j. We can now connect a node in V_1 to t_j to make its total sum of distances 2^j smaller. We use this to equalize all the sums of distances. We finally need to add another log(2n) nodes T′ = {t′_0, ..., t′_{log(n)}} and a node z_4. We connect z_4 to all of V_1 and all of T′. We connect a node in V_1 to t′_j if it isn't connected to t_j. Now all nodes in V_1 have the same sum of distances to all nodes in G′ − V_4.
We now run median. If the node returned has distance 3n to the nodes in V_4 then the radius is 3. Otherwise, the radius is 4.
It has previously been shown that Wiener Index is equivalent to APSP in the RAM model. Here we show this also holds in the I/O-model.

Theorem 3.8. If Wiener Index is solvable in n^{3−ε} time in a dense graph then APSP is solvable in Õ(n^{3−ε} + n^2/B) time.
Proof. We will show that Wiener Index solves the negative triangle problem. We start by taking the tripartite negative triangle instance with weights between −W and W and calling the partitions A, B and C respectively.
We create a new graph G′ and add to it A, B, C, and A′, where A′ is a copy of A. We add edges between A and B, B and C, and C and A′ if they existed in the original graph. We put weights on those edges equal to their weight in the original graph plus 5W. Now if there is a node a ∈ A involved in a negative triangle then the distance from a to a′ in G′ will be < 15W.
We add a node x which is connected to every node in G′ with an edge of length 15W; this ensures all nodes in the graph have a path of length 30W to each other in the worst case. This guarantees that no two nodes are at infinite distance from each other, which would result in a useless response from the Wiener index.
Finally we add edges between A and A′. Specifically, between two copies of the same node, say a and a′, we add an edge of length 15W. Between two nodes that aren't copies, say a and v′, we add an edge of length 12W.
Note that the distance from a to a′ is < 15W only if there is a negative triangle through a. Any path through x that doesn't begin or end at x must have length at least 30W. A path through A then B then C then A′ represents a triangle. A path from a to a′ that goes through some v′ will have length at least 12W + 4W + 4W = 20W. So, any short path between a and a′ represents a negative triangle. If the smallest triangle through a in the original graph had total weight positive or zero, then the distance from a to a′ will be 15W, using the edge we added between them.
The shortest path from a ∈ A to v′ ∈ A′ when the nodes are not copies of each other is 12W. There is an edge between them of this length, so it can be no longer. Using x is less efficient, and the shortest path from A to B to C to A′ will have weight at least 4W on each of the three edges, and thus be at least 12W in distance. Now, if there are no negative triangles the total sum of weights between A and A′ will be 2n(12n + 3)W. If there is a negative triangle, then at least one of the pairs will have total length < 15W, causing the sum to be strictly less than 2n(12n + 3)W.
We use Lemma 3.2 to find the total sum of distances between A and A′. Making these copies and adding edges takes n^2/B time.
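The way a Wiener index oracle can be turned into a sum of distances between two node sets can be sketched in Python. This is a minimal sketch under stated assumptions, not the paper's algorithm: the graph is unweighted, undirected and connected, the Wiener index is computed by brute-force BFS as a stand-in for the oracle, the helper names are hypothetical, and the inclusion-exclusion over four Wiener index calls is our reading of the construction that attaches one new degree-one node to each x ∈ X and each t ∈ T.

```python
from collections import deque

def bfs_distances(adj, source):
    """Unweighted shortest-path distances from source, by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def wiener_index(adj):
    """Sum of distances over unordered pairs (graph assumed connected)."""
    total = 0
    for u in adj:
        dist = bfs_distances(adj, u)
        total += sum(dist[w] for w in adj)
    return total // 2  # every unordered pair was counted twice

def with_pendants(adj, anchors, tag):
    """Copy adj and hang one new degree-one node (tag, a) off each anchor a."""
    new_adj = {u: list(nbrs) for u, nbrs in adj.items()}
    for a in anchors:
        pendant = (tag, a)
        new_adj[pendant] = [a]
        new_adj[a].append(pendant)
    return new_adj

def sum_of_distances(adj, X, T):
    """Sum of dist(x, t) over x in X, t in T, using four Wiener index calls."""
    g_x = with_pendants(adj, X, "x'")
    g_t = with_pendants(adj, T, "t'")
    g_xt = with_pendants(g_x, T, "t'")
    # Inclusion-exclusion leaves only the pendant-to-pendant pairs (x', t'),
    # each of which is at distance dist(x, t) + 2.
    cross = (wiener_index(g_xt) - wiener_index(g_x)
             - wiener_index(g_t) + wiener_index(adj))
    return cross - 2 * len(X) * len(T)

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sum_of_distances(path, [0, 1], [2, 3]))
```

Degree-one nodes never create shortcuts, so distances among the original nodes are the same in all four graphs, which is what makes the cancellation work.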
Linear-Time Reductions
In fine-grained complexity, it often does not make sense to reduce linear-time problems to one another because problems often have a trivial lower bound of Ω(n) needed to read in the entire problem. However, in the I/O model, truly linear time, the time needed to read in the input, is Θ(n/B). Despite significant effort, many problems do not achieve this full factor of B in savings, and thus linear lower bounds of Ω(n) are actually interesting. We can use techniques from fine-grained complexity to try to understand some of this difficulty. In the remainder of this section, we cover reductions between linear-time graph problems whose best known algorithms take longer than O(|E|/B) time. This covers many of even the most basic problems, like the s-t shortest path problem, which asks for the distance between two specified nodes s and t in a graph G. The sparse s-t shortest paths problem has resisted improvement beyond O(|V|/√B) even in undirected unweighted graphs [MM02], and in directed graphs, the best known algorithms still run in time Ω(|V|) [CGG+95, ABD+07].
Notably, the undirected unweighted s-t shortest path problem is solved by Single Source Shortest Paths (SSSP) and Breadth First Search (BFS). Further note that in sparse directed graphs, where |E| = O(|V|), the best known algorithms for SSSP, BFS, and Depth First Search (DFS) take O(|V|) time. This is a cache miss for every constant number of operations, giving no speedup at all. SSSP, BFS, and DFS solve many other basic problems like graph connectivity.
By noting these reductions we want to show that improvements in one problem propagate to others. We also seek to explain why improvements are so difficult on these problems. Because improving one of these problems would improve many others, if any one problem requires new techniques to improve, the others must also need these new techniques. Furthermore, any lower bound proved for one problem will imply lower bounds for the other problems reduced to it. We hope that improvements will be made to algorithms or lower bounds and propagated accordingly.
We show reductions between the following three problems in weighted and unweighted as well as directed and undirected graphs.

Definition 3.9 (s-t-shortest-path(G, s, t)). Given a graph G and pointers to two vertices s and t, return the length of the shortest path between s and t.

Definition 3.10 (Girth-Containing-Edge(G, e)). Given a graph G and a pointer to an edge e, return the length of the shortest cycle in G which contains e.

Definition 3.11 (Girth-Containing-Vertex(G, v)). Given a graph G and a pointer to a vertex v, return the length of the shortest cycle in G which contains v.
We now begin showing that efficient reductions exist between these hard to solve linear problems.

Proof. We construct a modified graph G′ by taking the target edge with end vertices v_1 and v_2 and deleting it. We now run s-t-shortest-path(G′, v_1, v_2) and return this result plus 1. The deleted edge completes the cycle, and since the path found is the shortest path, it results in the smallest possible cycle. For a directed graph, ensure that the deleted edge pointed from t to s.

Proof. To obtain the shortest path between s and t we construct a new graph G′ which simply adds an edge e′ of weight 1 between s and t. If this edge already exists, delete it and add an edge of weight 1. If we are in the undirected case, save the deleted edge's weight as d. We then run Girth-Containing-Edge(G′, e′) and subtract 1. Again, the edge should be directed from t to s in the case of directed graphs. In the undirected case where we deleted an edge, compare the output shortest cycle length minus 1 to the distance d, and return the smaller value.
This requires a single call to Girth-Containing-Edge and a constant number of changes to the original input.

Theorem 3.14. Given an algorithm that solves (undirected/directed) Girth-Containing-Vertex in f(n, |E|, M, B) time, (undirected/directed) s-t-shortest-path can be solved in O(f(n, |E|, M, B) + O(1)) time.
Proof. We construct a modified graph G′ by adding a new vertex v′ and connecting it to the vertices s and t. We then call Girth-Containing-Vertex(G′, v′) and return the result minus 2. For the directed case we must direct the edges from t to v′ and from v′ to s.

Proof. We construct a modified graph G′ by taking the target edge with end vertices v_1 and v_2, deleting the edge, and then replacing it with a new vertex v′ which connects only to v_1 and v_2. This modification only requires a constant number of operations. Next run Girth-Containing-Vertex(G′, v′) and return its result minus 1. This reduction also works for directed graphs by directing the new edges appropriately.

Proof. We construct a modified graph G′ by splitting v into two vertices v_in and v_out, where v_in receives all of the edges from other vertices to v and v_out has edges to all of the vertices which v had edges to. We then add an additional edge e′ directed from v_in to v_out. We then call Girth-Containing-Edge(G′, e′) and return its result minus 1.
Let |A(v)| be the length of the adjacency list of the vertex v. These edits take time O(|A(v)|/B), which is upper bounded by O(n/B).
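The simplest of these reductions, solving Girth-Containing-Edge with one s-t shortest path query, can be sketched as follows. This is a minimal Python sketch that assumes an undirected, unweighted simple graph stored as an adjacency dictionary; the function names are hypothetical, and the BFS in `st_shortest_path` only stands in for whatever s-t shortest path algorithm is available.

```python
from collections import deque

def st_shortest_path(adj, s, t):
    """Number of edges on a shortest s-t path, or None if t is unreachable."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u]
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return None

def girth_containing_edge(adj, v1, v2):
    """Shortest cycle through the undirected edge {v1, v2}, or None."""
    # Delete the target edge (both directions of the adjacency lists).
    pruned = {u: [w for w in nbrs if {u, w} != {v1, v2}]
              for u, nbrs in adj.items()}
    d = st_shortest_path(pruned, v1, v2)
    # The deleted edge closes the shortest remaining v1-v2 path into a cycle.
    return None if d is None else d + 1

# A 4-cycle 0-1-2-3 with chord 0-2 and a pendant vertex 4 attached to 3.
graph = {0: [1, 3, 2], 1: [0, 2], 2: [1, 3, 0], 3: [2, 0, 4], 4: [3]}
print(girth_containing_edge(graph, 0, 1))
print(girth_containing_edge(graph, 3, 4))
```

Only the adjacency lists of the two endpoints change, which matches the constant number of edits claimed for this reduction.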
Theorem 3.17. Given an algorithm that solves directed s-t-shortest-path in f (n, |E|, M, B) time then directed Girth-Containing-Vertex is solvable in O(f (n, |E|, M, B) + n/B) time.
Proof. We construct a modified graph G′ by splitting v into two vertices v_in and v_out, where v_in receives all of the edges from other vertices to v and v_out has edges to all of the vertices which v had edges to. We now run s-t-shortest-path(G′, v_out, v_in) and return the result, since a path from v_out to v_in in G′ uses exactly the edges of a cycle through v in G.
When solving Girth-Containing-Vertex in the directed case, we know in which direction the path must follow the edges and can perform this decomposition. Unfortunately this no longer works in the undirected case and a more complex algorithm is needed, giving slightly weaker results.

Proof. When attempting to solve Girth-Containing-Vertex in the undirected case, if we wish to split the required vertex v we end up with the issue of not knowing how to partition the edges between the new nodes. However, we only need to ensure that the two edges used in the solution are assigned to opposite nodes. Conveniently, if v has degree d we can generate O(lg d) partitions of the edges such that every pair of edges appears on opposite sides in at least one partition. To do so, label each edge with a distinct number from 0 to d − 1. These labels can be expressed as numbers of s = ⌈lg(d)⌉ bits. We generate s partitions where the assignment of the edges in the i-th partition is given by the value of the i-th bit of the edge's number. Since each number is different, all pairs of them must differ in at least one bit, yielding the desired property.
To solve undirected Girth-Containing-Vertex we first find all the neighbors of v and number them as above. Now for each bit in this numbering we construct a new graph G_i which replaces v with a pair of vertices v_{i,0} and v_{i,1}. Additionally, v_{i,0} is connected to all the neighbors of v which had a 0 in the i-th bit of their number. Similarly, v_{i,1} is connected to all the neighbors of v which had a 1 in the i-th bit of their number. To solve Girth-Containing-Vertex with s-t-shortest-path, after constructing each G_i we call s-t-shortest-path(G_i, v_{i,0}, v_{i,1}) and store the answer. After constructing all of the augmented graphs and running the s-t-shortest-path algorithm, our girth is simply the minimum of all the shortest path lengths found. Constructing each augmented graph only requires interacting with each node and edge a constant number of times and can be done in sequential passes. It thus runs in n/B time. Since this is a sparse graph, the degree of v cannot be more than |E| and thus we will not need to construct more than O(lg(n)) graphs and make O(lg(n)) calls to the s-t-shortest-path algorithm.

Proof. The reduction from Girth-Containing-Vertex to Girth-Containing-Edge proceeds exactly as in the proof of Theorem 3.18 except that we add an extra edge e_i between v_{i,0} and v_{i,1} and call Girth-Containing-Edge(G_i, e_i) instead of s-t-shortest-path on the input.
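The bit-partition idea for the undirected case can be illustrated with a short Python sketch. It assumes an undirected, unweighted simple graph stored as an adjacency dictionary; the names `bfs_distance` and `girth_containing_vertex` and the tuple labels for the split vertices are hypothetical, and BFS stands in for the s-t-shortest-path algorithm.

```python
from collections import deque

def bfs_distance(adj, s, t):
    """Number of edges on a shortest s-t path, or None if unreachable."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u]
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return None

def girth_containing_vertex(adj, v):
    """Shortest cycle through v via ceil(lg d) s-t shortest path queries."""
    nbrs = list(adj[v])
    d = len(nbrs)
    if d < 2:
        return None  # a cycle through v needs two distinct edges at v
    bits = (d - 1).bit_length()  # bits needed for the labels 0 .. d-1
    best = None
    for i in range(bits):
        v0, v1 = ("split", i, 0), ("split", i, 1)
        # Build G_i: drop v, then attach each neighbor to v0 or v1
        # according to bit i of its label.
        g = {u: [w for w in ws if w != v] for u, ws in adj.items() if u != v}
        g[v0], g[v1] = [], []
        for label, u in enumerate(nbrs):
            side = v1 if (label >> i) & 1 else v0
            g[side].append(u)
            g[u].append(side)
        dist = bfs_distance(g, v0, v1)
        if dist is not None and (best is None or dist < best):
            best = dist
    return best

# Triangle 0-1-2, pendant 3 on vertex 0, and a 4-cycle 0-4-5-6.
graph = {0: [1, 2, 3, 4, 6], 1: [0, 2], 2: [0, 1], 3: [0],
         4: [0, 5], 5: [4, 6], 6: [5, 0]}
print(girth_containing_vertex(graph, 0))
print(girth_containing_vertex(graph, 4))
```

Because any two distinct labels differ in some bit, the two cycle edges at v land on opposite split vertices in at least one G_i, so the minimum over all G_i is the answer.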
Lower Bounds from Fine-Grained Reductions
The fundamental problems in the fine-grained complexity world are good starting points for assumptions in the I/O model because these problems are so well understood in the RAM model. Additionally, both APSP and 3-SUM have been studied in the I/O model [AMT04, PS14b, Pǎt10] . These reductions allow us to propagate believed lower bounds from one problem to others, as well as propagate any potential future algorithmic improvements.
Reductions to 3-SUM
We will show that 3-SUM is reducible to both convolution 3-SUM and 0 triangle in the I/O-model.
Proof. Following the proof of Pǎtraşcu [Pǎt10], we can hash each value into the range n/g in time n/B. We then sort the elements by their hash value in time n lg_{M/B}(n)/B. We scan through and put the elements in over-sized buckets in one memory location, and put the elements in buckets with fewer than 10g elements elsewhere, sorted by hash value, in time O(n/B).
The expected number of elements in over-sized buckets is n/g, so we then solve the 3-SUM problem on lists of length n, n and n/g in n^2/(gMB) time.
We then go through the small buckets of size < 10g and mark each element in a bucket by its order in the bucket (so each element is assigned a unique number from [1, 10g]). Now we re-sort the elements in small buckets by their order number in time n lg_{M/B}(n). Then we

Proof. We will use a reduction inspired by the reduction in Vassilevska-Williams and Williams [WW13]. We produce √n problems. Specifically, the problems will be labeled by i ∈ [1, √n] and we will produce a graph on L_i, R_i, S_i. We make the problems as follows (as is done in Vassilevska-Williams and Williams [WW13]).
Zero triangle will need the input adjacency list to be given to it. Given an adjacency matrix of size √n by √n indexed by k and j, let h_{√n}(k, j) = k√n + j. For each problem we will generate the adjacency list and lay it out in memory by labeling each element with its order in memory. It will take n/B time to scan through the convolution 3-SUM instance. Given the index i of the problem we can compute the k and j (note that values in list B will have multiple pairs k and j produced) for the corresponding 0 triangle instance and thus compute h_{√n}(k, j). We can scan through the values from the lists A, B and C and assign them values h and then sort the lists based on the values of h. This will take time O(n/B + n lg_{M/B}(n)/B) for each subproblem i, for a total time of
Theorem 4.4. If 0 triangle is solved in time O(n^{3−ε}/(MB)) or O(n^3/(M^{1+ε}B)) or O(n^3/(MB^{1+ε})) then 3-SUM is solved in O(n^{2−ε}/(MB)) or O(n^2/(M^{1+ε}B)) or O(n^2/(MB^{1+ε})) time,

/(MB)) or O(n^2/(M^{1+ε}B)) or O(n^2/(MB^{1+ε})) time implies 3-SUM is solved in O(n^{2−ε}/(MB)) or O(n^2/(M^{1+ε}B)) or O(n^2/(MB^{1+ε})) time.

10Mi + a_i to the list L. If A[i] + A[j] + A[i + j] = 0 then 10Mi + A[i] + 10Mj + A[j] − 10M(i + j) + A[i + j] = 0. If 10Mi + A[i] + 10Mj + A[j] − 10M(k) + A[k] =
APSP Reductions in the IO-Model
We show reductions between APSP, negative weight triangle finding, (min, +)-matrix multiplication, and all pairs triangle detection, as diagrammed in Figure 3. Another related version of APSP requires us to return all the shortest paths in addition to the distances. To represent this information efficiently, one is required to return an n by n matrix P where P[i][j] is the next node after i on the shortest path from i to j. The matrix P allows one to extract the shortest path between two points by following the path through the matrix P. This problem is also called APSP.

Definition 4.8 (Three-Layer-APSP(G)). Solve APSP on G where G is promised to be a bipartite graph which has partitions A, B, and C, such that there are no edges within A, B or C and no edges between A and C. This is shown visually in Figure 4.

Definition 4.9 (Negative-triangle-detection(G)). Given a graph G, return true if there is a negative triangle and false if there is no negative triangle. This problem is also called − detection.

(A, B)). This problem is a variant on matrix multiplication. Given an n by n matrix A and an n by n matrix B return an n by n matrix C such that
The motivation for showing I/O equivalences between these problems is twofold. First, just as in the RAM model, these reductions can provide a shared explanation for why some problems have seen no improvement in their I/O complexity for years. The set of reductions.
Proof. By folklore, we can solve APSP with lg(n) calls to (min, +). We take the adjacency matrix A and then use repeated squaring to produce A, A^2, A^{2^2}, ..., A^{2^{lg(n)}}. Then we simply multiply these lg(n) matrices together and the output will be the shortest path lengths between all pairs. To get both the minimum path lengths and the successor matrix after each multiplication, we will use both V and S output by (min, +). Say we are multiplying L_1 and L_2 and they have successor matrices S_1 and S_2, and the output of the (min, +) multiplication is (V, S). The output length matrix is L_out = V and the output successor matrix is S_out = S.
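The repeated-squaring idea can be sketched in Python. This is a minimal sketch under stated assumptions rather than the exact procedure above: it uses a naive cubic (min, +) product as a stand-in for the oracle, sets the diagonal to 0 so that squaring alone accumulates all shorter paths (a simplification of multiplying the separate powers together), omits the successor matrix, and assumes no negative cycles. The function names are hypothetical.

```python
import math

INF = math.inf

def min_plus(A, B):
    """Naive (min, +) product: C[i][j] = min over k of A[i][k] + B[k][j]."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def apsp_by_squaring(W):
    """All-pairs distances from a weight matrix (INF marks a missing edge)."""
    n = len(W)
    # A zero diagonal lets every product also keep the shorter paths.
    D = [[0 if i == j else W[i][j] for j in range(n)] for i in range(n)]
    # After s squarings D covers paths of up to 2^s edges; n - 1 suffice.
    for _ in range((n - 1).bit_length()):
        D = min_plus(D, D)
    return D

W = [[0, 3, INF, 7],
     [8, 0, 2, INF],
     [5, INF, 0, 1],
     [2, INF, INF, 0]]
for row in apsp_by_squaring(W):
    print(row)
```

The loop performs about lg(n) calls to the (min, +) routine, which is where the lg(n) factor in the reduction comes from.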
Each multiplication takes O(n^2/B) cache misses to read in and write out the matrices and f(n, M, B) for the multiplication itself. Note that a trivial lower bound on f(n, M, B) is Ω(n^2/B). So the total number of cache misses is O(lg(n) f(n, M, B)).

Theorem 4.12. If All Pairs Shortest Paths runs in time f(n, M, B), then negative weight triangle detection in a tripartite graph runs in O(f(n, M, B)) cache misses.
Proof. Let us call the whole vertex set V and the three groups of nodes I, J and K.
Now we create a new graph G′, where V′ = I ∪ J ∪ K ∪ I′, all edges (i, k) for i ∈ I and k ∈ K are removed, and the edge (i′, k) is added in their place. Additionally we add 7m to the weights of all edges (this is to force the shortest paths to not 'backtrack' and go through one set multiple times). This takes O(n^2/B) cache misses. Now we run APSP on G′. We look at the path lengths between pairs of nodes i and i′. If any of those path lengths is < 21m, then the corresponding original triangle was negative, so we return true. Otherwise we return false.
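The construction can be sketched in Python as follows. This is a minimal sketch under several assumptions that are ours: the instance is a complete tripartite graph given as three n by n weight matrices, m is taken to be the largest absolute edge weight (at least 1), the graph is treated as undirected, Floyd-Warshall stands in for the APSP algorithm, and all names are hypothetical.

```python
import math

def floyd_warshall(D):
    """Plain Floyd-Warshall, used here only as a stand-in APSP oracle."""
    n = len(D)
    dist = [row[:] for row in D]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def has_negative_triangle(w_ij, w_jk, w_ki):
    """w_ij[i][j], w_jk[j][k], w_ki[k][i] are the three weight matrices."""
    n = len(w_ij)
    m = max(abs(x) for mat in (w_ij, w_jk, w_ki) for row in mat for x in row)
    m = max(m, 1)      # assumed meaning of m: largest absolute weight
    shift = 7 * m      # added to every edge to discourage backtracking
    size = 4 * n       # layout: I = 0..n-1, J, K, then the copy I'
    D = [[math.inf] * size for _ in range(size)]
    for a in range(size):
        D[a][a] = 0
    for a in range(n):
        for b in range(n):
            # I-J edges, J-K edges, and K-I' edges replacing the K-I edges.
            D[a][n + b] = D[n + b][a] = w_ij[a][b] + shift
            D[n + a][2 * n + b] = D[2 * n + b][n + a] = w_jk[a][b] + shift
            D[2 * n + a][3 * n + b] = D[3 * n + b][2 * n + a] = w_ki[a][b] + shift
    dist = floyd_warshall(D)
    # A path i -> j -> k -> i' costs (triangle weight) + 21m.
    return any(dist[i][3 * n + i] < 3 * shift for i in range(n))

print(has_negative_triangle([[1, 2], [3, 4]], [[1, 1], [1, 1]], [[-5, 1], [1, 1]]))
```

Every edge costs at least 6m after the shift, so any path from i to i′ with five or more edges costs at least 30m and cannot fall below the 21m threshold.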
Setting up the graph takes O(n^2/B) cache misses. Running APSP takes f(4n/3, M, B) cache misses. Checking for short paths between i and i′ takes O(n) time. If n > B, then n = O(n^2/B). If n < B, then the entire computation fits in two cache lines and thus takes O(1) time to compute even if M = Θ(B). Once again n^2/B is a trivial lower bound on f(n, M, B). So the total number of cache misses is O(f(n, M, B)).

Proof. We will use the same reduction as [WW10] and analyze it in the I/O-model. Let the three layer APSP's layers be called I, J and K. We want to find for every pair (i, k), where i ∈ I and k ∈ K, the j such that the triangle (i, j, k) has minimum weight. We will discover this by doing lg(W) + 1 rounds where we start by re-assigning all w(i, k) = 0 and then binary search on each w(i, j) for the value where
Now, in each round we split each set I, J and K into n^{2/3} groups of size n^{1/3}. We can then once again keep two matrices: V for the minimum value so far and S for the j achieving that value.
We can call negative weight triangle detection repeatedly on all (n^{2/3})^3 possible choices of three subsets. This will take at most n^2 + (n^{2/3})^3 calls: one for each edge removed and one for each choice of subsets. This results in O(n^2 f(n^{1/3}, M, B)) cache misses. The total number of cache misses is
Corollary 4.14. If negative weight triangle detection in a tripartite graph runs in time f(n, M, B), then three layer APSP over weights in the range [−poly(n), poly(n)] runs in O(lg(n) n^2 f(n^{1/3}, M, B)) cache misses.
Proof. Simply apply Theorem 4.13 with a poly(n) weight.

Proof. Given an instance of (min, +) matrix multiplication, produce a graph G made up of three sets of size n: I, J and K. Edges will go from I to J and J to K. The length of the edge from i ∈ I to j ∈ J will be w(i, j) = A[i, j]. The length of the edge from j ∈ J to k ∈ K will be w(j, k) = B[j, k]. The length of the edge from k ∈ K to i ∈ I will be w(k, i) = 0.
Run "All pairs min triangle detection in a tripartite graph" on G and it produces a matrix S that lists the j that minimize the triangles. Return a matrix S of j and a matrix V where
This takes O(n^2/B + f(n, M, B)) cache misses.

Corollary 4.18. The following solve APSP faster.
1. If (min, +) matrix multiplication is solvable in f(n, M, B) time then APSP is solvable in O(lg(n) f(n, M, B)) time.
2. If negative triangle detection in a tripartite graph is solvable in
Proof. By using the reductions in Theorems 4.16 and 4.15 we get these values.

Proof. Following the reduction from Vassilevska-Williams and Williams [WW10], we will turn negative triangle on the graph G into lg(W) copies of the problem. We will create a tripartite instance of the − problem by making 3 copies of the vertex set, V′, V″, V‴, and setting e(v′, w″) = e(v″, w‴) = e(v‴, w′) = e(v, w) but e(v′, w′) = e(v″, w″) = e(v‴, w‴) = ∞. We then consider the lg(W) problems created by replacing edge weights with the highest i bits of that edge length. Creating each of these new problems takes n^2/B time, and there are lg(W) problems we need to write. So we take total time O(lg(W) f(n, M, B) + lg(W) n^2/B).
Proof. We can consider the tripartite version. There are three sets of vertices |A| = |B| = |C| = n.
Then, we can consider the g^3 subproblems A_i, B_j and C_k where i, j, k ∈ [1, g]. Every triangle is contained in some subproblem. We can fit a subproblem in memory if |A_i| = √M. This gives us I/Os
Orthogonal Vectors (OV)
Lemma 4.21. OV is solvable in O(n^2/(MB) + n/B) I/Os cache obliviously.
Proof. We will give a recursive algorithm, OV(A, B). The base case is when |A| = |B| = 1; then we simply take the dot product. If the dot product is zero then return TRUE, else return FALSE.
Given two lists A and B of size greater than one, divide each list in half. Call the halves of A A_1 and A_2. Call the halves of B B_1 and B_2. We then make four recursive calls: OV(A_1, B_1), OV(A_1, B_2), OV(A_2, B_1), and OV(A_2, B_2). If any return TRUE then return TRUE, else return FALSE.
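The recursion can be written down directly. The following minimal Python sketch uses index ranges instead of copying sublists, and also handles lists of unequal or odd length by letting a half be empty; those details and the function name are assumptions of this sketch, and Python lists of course do not model cache behavior.

```python
def has_orthogonal_pair(A, B, a_lo=0, a_hi=None, b_lo=0, b_hi=None):
    """Return True if some a in A[a_lo:a_hi] and b in B[b_lo:b_hi] have a.b = 0."""
    if a_hi is None:
        a_hi = len(A)
    if b_hi is None:
        b_hi = len(B)
    if a_hi == a_lo or b_hi == b_lo:
        return False  # an empty side contains no pair
    if a_hi - a_lo == 1 and b_hi - b_lo == 1:
        # Base case: a single pair, so just take the dot product.
        return sum(x * y for x, y in zip(A[a_lo], B[b_lo])) == 0
    a_mid = (a_lo + a_hi) // 2
    b_mid = (b_lo + b_hi) // 2
    # Four recursive calls, one per pair of halves.
    return any(
        has_orthogonal_pair(A, B, al, ah, bl, bh)
        for (al, ah) in ((a_lo, a_mid), (a_mid, a_hi))
        for (bl, bh) in ((b_lo, b_mid), (b_mid, b_hi))
    )

A = [(1, 0, 1), (0, 1, 1)]
B = [(1, 1, 0), (0, 1, 0)]
print(has_orthogonal_pair(A, B))
```

Nothing in the code refers to M or B: the subproblems simply shrink until they fit in whatever cache is present, which is the sense in which the recursion is cache-oblivious.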
The running time of this algorithm is given by T(n) = 4T(n/2) + n/B. We use the Master Theorem from Section 2.2 to find that the running time is O(n^2/(MB) + n/B). We note that the access pattern of this algorithm is independent of the size of the cache and the size of the cache line, making this algorithm cache-oblivious.

Proof. The edit distance reduction of Backurs and Indyk [BI15] takes each vector and makes a string that is not too much longer. The total time to produce the strings is Õ(n/B).
There are two stages of the reduction. The first stage reduces orthogonal vectors to a problem they define, Pattern. In this reduction the vectors from the orthogonal vectors problem are converted into a string that can be formed by reading the vectors bit by bit. This means the pattern can be produced in Õ(n/B) time.
The second stage of the reduction reduces Pattern to edit distance. This works by tripling the size of one of the patterns. This also takes Õ(n/B) time.
As a result, we can take the original orthogonal vectors problem and turn it into two strings that can be given as input to edit distance. Thus, if the edit distance problem is solvable in time f(n, M, B) then orthogonal vectors is solvable in time Õ(f(n, M, B) + n/B).

Proof. We will generate the same graph as in our sparse diameter reduction. We will use the reduction from Abboud, Vassilevska-Williams and Wang [AWW16]. We can in n/B time output an adjacency list for y_a and y_b. We read in as many vectors (or fractions of a vector) as we can and we output (a_i, d_j) and
. We then sort these vectors which takes nd lg_{M/B}(nd)/B time. This produces adjacency lists for all nodes.
I/O Model Complexity Classes
In this section we examine the I/O model from a complexity theoretic perspective. Section 5.1 provides some necessary background information. In Section 5.2 we define the classes PCACHE_{M,B} and CACHE_{M,B}(t(n)) describing the problems solvable in a polynomial number of cache misses and O(t(n)) cache misses respectively. We then demonstrate that PCACHE_{M,B} lies between P and PSPACE for reasonable choices of cache size. In Section 5.4 we provide simulations between the I/O model and both the RAM and Turing machine models. In Section 5.3 we prove the existence of a time hierarchy in CACHE_{M,B}(t(n)). The existence of a time hierarchy in the I/O model grounds the study of fine-grained complexity by showing that such increases in running time do provably allow more problems to be solved. The techniques to achieve many of the results in this section also follow the same theme of reductions, although the focus of the problems examined is quite different.
Hierarchy Preliminaries: Oracle Model
Oracles are used to prove several results, most notably in the time-hierarchy proof. The oracle model was introduced by Turing in 1938 [Tur39]. The definition we use here comes from Soare [Soa99], and similar definitions can be found in computational complexity textbooks.
In the Turing machine oracle model of computation we add a second tape and corresponding tape head. This oracle tape and its oracle tape head can do everything the original tape can: reading, writing and moving left and right. The oracle tape head has two additional states, ASK and RESPONSE. After writing to the oracle tape, the tape head can go into the ASK state. In the ASK state the oracle computation is done on the input written to the oracle tape and then the tape head is changed to the RESPONSE state. All of this is done in one computational step. If the oracle is the function language L : {0, 1}^n → {0, 1}^*, then the output written on the tape for the input i is L(i). This allows the Turing machine to make O(1) cost black box calls to the oracle language and get strings as output.
The notation A^B describes the computational class of languages decidable by an oracle Turing machine of A with a B oracle. The oracle language will be a language decidable in the function version of B. The oracle machine will then be resource limited as A is resource limited.
In this paper we will also talk about RAM machine oracles. This is a simple extension of the typical Turing machine oracle setup. The RAM machine will have two randomly accessible memories. One will be the standard RAM memory. The other will be the oracle memory: the RAM can read and write words to this memory and can additionally enter the ASK state. One time step after entering the ASK state the RAM will be returned to the RESPONSE state and the oracle memory will contain the oracle language output, L(i).
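The RAM oracle setup can be pictured with a small sketch. The class name `OracleRAM`, its method names, and the parity oracle are illustrative assumptions, not part of the paper. The sketch only aims to show that a query costs one time step regardless of how hard the oracle language is.

```python
class OracleRAM:
    """Toy RAM with a separate oracle memory; all names here are illustrative."""

    def __init__(self, oracle):
        self.oracle = oracle      # function: list of words -> list of words
        self.oracle_mem = []      # the oracle memory
        self.state = "RUN"
        self.steps = 0            # RAM time steps used so far

    def write_oracle(self, word):
        # Writing one word to the oracle memory costs one RAM step.
        self.oracle_mem.append(word)
        self.steps += 1

    def ask(self):
        # Entering ASK replaces the oracle memory with L(i) in a single step.
        self.state = "ASK"
        self.oracle_mem = list(self.oracle(self.oracle_mem))
        self.steps += 1
        self.state = "RESPONSE"

    def read_oracle(self, i):
        # Reading one word of the response costs one RAM step.
        self.steps += 1
        return self.oracle_mem[i]


ram = OracleRAM(lambda words: [sum(words) % 2])  # hypothetical parity oracle
for word in (1, 0, 1, 1):
    ram.write_oracle(word)
ram.ask()
print(ram.read_oracle(0), ram.steps)  # oracle answer and total steps used
```

Here the four writes, the single ASK, and the one read are each charged one step, independent of the work done inside the oracle.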
PCACHE_{M,B} and its relationship with P and PSPACE
First we define the class of problems solvable within t(n) cache misses, up to constant factors, for a given function t(n).
First, let us note that the CACHE class can simulate the RAM class. The IO-model is basically a RAM model with extra power.
Proof. With B = 1 and M = 3 we can simulate all the operations of a RAM machine. There are three things to simulate in O(1) cache misses:
• Write a constant to a given word in memory.
We can write a word to memory with 1 cache miss.
• Read a and write op(a) to a given word in memory.
We can read in a with one cache miss. We can compute op(a) with zero cache misses for the operations doable in one time step on the RAM model. Finally, we can write op(a) with one cache miss, for a total of two cache misses.
• Read a and b and write op(a, b) to a given word in memory.
We can read a and b with two cache misses. We can compute op(a, b) in zero cache misses for all operations that are doable in one time step on the RAM model. Finally, we can write op(a, b) with one cache miss, for a total of three cache misses.
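The three cases above can be sketched in code. This is a minimal illustration under the proof's assumptions (B = 1, M = 3). The class `TinyCache` and its method names are hypothetical and only count cache-line transfers.

```python
class TinyCache:
    """Cache machine with B = 1 word per line and M = 3 words of cache (sketch)."""

    def __init__(self, memory):
        self.memory = memory   # "main memory": a list of words
        self.misses = 0        # number of cache-line transfers so far

    def load(self, addr):
        self.misses += 1       # bring one word (one line, since B = 1) into cache
        return self.memory[addr]

    def store(self, addr, value):
        self.misses += 1       # write one word (one line) back to main memory
        self.memory[addr] = value

    def write_const(self, dst, c):
        self.store(dst, c)                 # 1 miss

    def unary(self, dst, src, op):
        a = self.load(src)                 # 1 miss
        self.store(dst, op(a))             # op is free in cache, then 1 miss

    def binary(self, dst, src1, src2, op):
        a = self.load(src1)                # 1 miss
        b = self.load(src2)                # 1 miss
        self.store(dst, op(a, b))          # op is free in cache, then 1 miss


m = TinyCache([0, 0, 0, 0])
m.write_const(0, 5)
m.unary(1, 0, lambda a: a + 1)
m.binary(2, 0, 1, lambda a, b: a * b)
print(m.memory, m.misses)
```

At most three words (a, b, and the result) are live in cache at any time, which is why M = 3 suffices in this sketch.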
CACHE_{M,B}(t(n)) defines M and B asymptotically and thus is a superset of CACHE with any constant M and B.
Now we introduce a complexity class MEM. Note that this class is very similar to SPACE.
Definition 5.4. We define the class MEM(s(n)) to be the set of problems solvable in SPACE(s(n)) when the input is of size O(s(n)).
Why MEM and not SPACE? We want to use the MEM class as an oracle which will model computation doable on a cache machine in one cache miss. When t(n) = Ω(n) then MEM(t(n)) = SPACE(t(n)); however, these classes differ when we have a small work space. A SPACE(o(n)) machine is given a read-only tape of size n and compute space o(n). This extra read-only tape gives the SPACE machine too much power when compared with the cache. Notably, we can scan through the entire input with one step with a SPACE(lg(n)) oracle. A cache would require n/B time to scan this input.
Proof. We can read the entire problem of size Mw into memory by bringing in Bw bits at a time, for a total of M/B cache misses. Once the problem is in memory, it can be solved entirely in cache with a cache of size O(Mw) bits, or O(M) words.
Now we prove that a RAM machine with oracle access to our MEM oracle can be simulated by a cache machine. We simulate the MEM machine and RAM machine together efficiently in cache.
Proof. We can use 3 words in the cache to simulate our RAM(t(n)) machine using Lemma 5.3. We can use the remaining M words of the cache to simulate the MEM(Mw) oracle. Each word the RAM machine writes to the oracle tape can be simulated in O(1) cache misses (pull from main memory and write to the simulated oracle tape). Each word the RAM machine reads from the oracle can be simulated with no cache misses, because both the oracle tape and the RAM simulation are in cache.
The class PCACHE_{M,B} is equivalent to a polynomial time algorithm with oracle access to a MEM oracle. Intuitively, in both cases we get to use a similarly powerful object (the cache or the MEM oracle) a polynomial number of times.
Proof. First, we consider the inclusion PCACHE_{M,B} ⊆ P^{MEM(Mw)}. The oracle tape will have the memory address of the next requested cache line, a space for the RAM machine to write the contents of the requested cache line, the state the RAM simulation ended in, and finally the Mw bits of the contents of the cache. The oracle will be queried once per simulated cache miss, and return the state of the cache at the next cache miss, as well as the requested cache line. The polynomial machine will write the requested cache line to the oracle tape and then run the oracle. Note that if the cache machine we simulate takes t(n) cache misses, then the polynomial machine (of the P^{MEM(Mw)} oracle machine) will need to take time O(Bw t(n)). However, Bw is O(poly(n)) and t(n) is O(poly(n)), so Bw t(n) is also O(poly(n)).
Second, we consider the inclusion PCACHE_{M,B} = PCACHE_{M+3,B} ⊇ P^{MEM(Mw)}. Note that PCACHE_{M+3,B} and PCACHE_{M,B} are equal because we consider the asymptotic size of the cache and cache line. The first 3 words of the cache will be used to simulate P (by simulating a RAM machine). The next M words will be used to simulate the MEM(Mw) oracle and its tape. We can simulate the MEM(Mw) oracle with our cache because we can run any RAM program that uses only Mw space on the cache in 1 time step.
Given
We then note that in many cases MEM and SPACE are equivalent.
Note that the definition of MEM(s(n)) is problems solvable in s(n) space with an input of size O(s(n)). SPACE machines are given an input in their working tape (and thus an input of size O(s(n))) when s(n) = Ω(n).
Finally, we note that P is a subset of PCACHE_{M,B}.
Corollary 5.9. P ⊆ PCACHE_{M,B}.
Proof. Since P ⊆ P^{MEM(Mw)}, Theorem 5.7 gives P ⊆ P^{MEM(Mw)} = PCACHE_{M,B}.
Lemma 5.10. If Mw = Θ(poly(n)), then PCACHE_{M,B} ⊆ PSPACE.
Proof. First, we consider the inclusion PCACHE_{M,B} ⊆ PSPACE. The PCACHE machine can only pull in O(wB poly(n)) bits from main memory, which is polynomial since wB ≤ M = O(poly(n)). Our PSPACE machine reserves two polynomial size sections of tape, one to simulate the cache and the other to store all of the values the PCACHE machine pulls from main memory. Thus PCACHE_{M,B} ⊆ PSPACE.
Lemma 5.11.
Proof. First, we show any language in PSPACE is in
Second, every language, L, in
The sum of two polynomials is a polynomial, so any language in ∪_{c=1}^{∞} PCACHE_{n^c,B} is contained in PSPACE.
CACHE_{M,B} Hierarchy
In this section we prove that a hierarchy exists in the IO-model. The separation in the CACHE hierarchy is B times the separation for the RAM hierarchy. We know that RAM machines given polynomially more time can solve more problems than those given polynomially less.
Theorem 5.12. For ε > 0,
Let s(n) be the space usage of the algorithm running on the RAM machine. Let α = (B + lg(s(n))/w)/B, which is the number of cache lines needed to represent both a cache line and its memory address. Note, in the case where one word is large enough to address all of the memory used by the algorithm (a standard assumption), α = 1 + 1/B ≤ 2. We now give a simulation of a CACHE machine by a RAM machine with MEM oracle.
Proof. At a high level we are going to treat the MEM oracle as the cache; the RAM machine simply plays the part of moving information from the main memory into the cache.
We reserve the first B words, "the input", of the MEM tape to be the location to write in a cache line to the cache simulation. We reserve the next lg(s(n))/w words to specify where this cache line came from in main memory.
We reserve the next lg(s(n))/w words, "the request", to specify which cache line the cache simulation is requesting from main memory at the end of each run.
Finally, the next B words, "the output", specify the contents of the cache line being kicked out of memory, and the following lg(s(n))/w words specify where this cache line came from in main memory.
When the cache simulation is run it takes the input and writes it and the lg(s(n))/w words of pointer information into the part of its Mwα sized tape where the output was previously written. Then the MEM(Mαw) oracle can compute the language which simulates the cache until its next cache miss.
Note this means we only need to copy words into memory when a cache miss occurs. The process of fetching the requested cache line, writing it to the input, and writing the output to our main memory takes O(B + lg(s(n))/w) time. Note this is O(Bα) time per cache miss, for a total of O(t(n)Bα) time.
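The accounting in this proof can be checked numerically with a short sketch. The function names `alpha` and `ram_time_bound` are hypothetical, and the constants hidden by the O-notation are ignored. The sketch only restates α = (B + lg(s(n))/w)/B and the per-miss cost of Bα word copies.

```python
from math import log2


def alpha(B, w, s):
    """Cache lines needed to hold one cache line plus its address (sketch)."""
    addr_words = log2(s) / w      # words needed to address s memory cells
    return (B + addr_words) / B


def ram_time_bound(t, B, w, s):
    """RAM steps (up to constants) to drive t cache misses through the oracle."""
    return t * B * alpha(B, w, s)


# Example: 64-bit words, memory addressable by a single word, B = 8 words per line.
print(alpha(8, 64, 2**64))                # 1 + 1/B
print(ram_time_bound(100, 8, 64, 2**64))  # t * (B + 1) word copies
```

When one word addresses all of memory, the overhead factor is 1 + 1/B, matching the remark that α ≤ 2 under the standard assumption.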
Plugging our simulation into the RAM hierarchy gives a separation result for the CACHE complexity classes.
Theorem 5.14. For all ε > 0
Proof. From Lemma 5.13,
From Theorem 5.12, for ε > 0
Using Corollary 5.6:
Under reasonable assumptions about the values of input and word sizes, we can construct a cleaner version of the above theorem.
Corollary 5.15. When s(n) = 2^{O(wB)}, in other words the memory used by the algorithm is referenceable by O(B) words,
Proof. Note α = 1 + 1/B = O(1), and thus is a constant with respect to the time and size of memory which are defined asymptotically. Thus this factor disappears.
TM Simulations for the RAM Model
Exploiting the cache line in the I/O-model is a long standing goal for many algorithms. Turing Machines have great locality and perform universal computation. Simulating Turing Machines in the I/O-model has the potential to give a universal transform which utilizes the cache line for improved speeds. Notably, improved simulations of RAM machines by multi-tape or multidimensional Turing Machines would imply savings factors of the cache line in running time.
Important simulations and separations
Here we give some known results about simulations of RAM, d-dimensional Turing, and c-tape Turing Machines by each other. First we define a RAM machine with oracle access.
Definition 5.16. RAM^O(t(n)) is a RAM machine with oracle access to the language O and allowed t(n) time steps to do its computation. There is a separate location in memory where we can write down input to the oracle and receive output from the oracle.
We first give the known relativized time hierarchy result for Turing Machines, which will provide the basis for a cache hierarchy of a different form. That the time hierarchy proof relativizes means that the relationship between the classes remains the same when an oracle, O, is introduced.
A RAM machine can be simulated by a Turing Machine with a quadratic slowdown and consideration for word sizes. This simulation also holds true with respect to any oracle, O.
Here we give a simulation of a RAM machine by a d-dimensional Turing Machine which also holds with respect to oracle access. The larger the dimension of the tape, the more efficient the simulation. 
c-tape Turing Machine Simulations
Lemma 5.20. Let DTIME^{(c)} be a multi-tape Turing machine with c tapes. Then
DTIME^{(c)}(t(n))^{MEM(Mw)} ⊆ CACHE_{M+2cB,B}(t(n)/(Bw)).
Proof. On the c normal tapes we can maintain 2 cache lines from each tape in cache. We start by keeping the Bw bits before each tape head and the Bw bits after each tape head in cache. If the tape head moves outside of this space, we keep the cache line closest to the head, kick out the cache line farther away, and finally bring in the Bw bits containing the tape head and the closest Bw bits currently uncovered (once again surrounding the head). Note that bringing in Bw bits takes one cache miss. We can lay out each tape in contiguous memory. We keep the entire O(Mw) sized simulation of the MEM(Mw) oracle in memory. Now we can simulate the Turing machine with no cache misses, until a tape head on one of the c tapes moves outside the area we are covering. For each tape head, the number of Turing Machine steps needed to cause a cache miss is at least Bw, in order to have the time to drag the tape head across the Bw bits of tape.
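The two-line window argument can be illustrated for a single tape. The sketch below is a simplified variant that assumes cache lines aligned to multiples of Bw bits rather than a window re-centered on the head, and the function name `count_tape_misses` is hypothetical. After the first eviction, the head must cross a full line of Bw cells before it can leave the cached window again.

```python
def count_tape_misses(moves, line_bits, start=0):
    """Count cache misses for one tape head with a two-line window (sketch).

    moves      -- iterable of head moves, each -1 (left) or +1 (right)
    line_bits  -- number of tape cells per cache line (Bw in the text)
    """
    head = start
    first = head // line_bits
    window = [first, first + 1]    # indices of the two adjacent cached lines
    misses = 2                     # loading the initial two lines
    for step in moves:
        head += step
        line = head // line_bits   # floor division also handles negative cells
        if line < window[0]:
            window = [line, window[0]]   # keep the near line, evict the far one
            misses += 1
        elif line > window[1]:
            window = [window[1], line]
            misses += 1
    return misses


print(count_tape_misses([1] * 16, 4))      # steady sweep to the right
print(count_tape_misses([-1, 1] * 10, 4))  # zigzag across one line boundary
```

The zigzag case is the reason two lines are kept per tape: a head oscillating across a line boundary causes only one extra miss instead of one miss per step.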
If a c-tape TM can simulate a RAM machine very efficiently then we can save factors of B (by getting memory locality).
Corollary 5.21. If RAM(t(n))^O ⊆ DTIME_{c-tape}(f_c(t(n)))^O for all oracles O, then RAM(t(n)) ⊆ CACHE_{M+2cB,B}(f_c(t(n))/B).
Proof. Combine the assumption with Lemma 5.20 to get RAM(t(n)) ⊆ DTIME_{c-tape}(f_c(t(n))) ⊆ CACHE_{M+2cB,B}(f_c(t(n))/B).
This also has implications for the hierarchy theorem.
Lemma 5.22. If RAM(t(n))^O ⊆ DTIME_{c-tape}(f_c(t(n)))^O for all oracles O, then CACHE_{M,B}(t(n)) ⊊ CACHE_{αM+2cB,B}(f_c(t(n)Bα) lg^2(f_c(t(n)Bα))/B).
Conclusion
In this paper we give a formal definition for a complexity class based on the I/O model of computation and show its relationship to other complexity classes. This gives us a bridge between these well studied fields. Our hierarchy separation gives further justification for the study of fine-grained complexity in the I/O model, and although we are able to transfer over some results, there is ample work to be done on this topic. Further, our simulations suggest that results in pure complexity theory could have implications for faster algorithms in the I/O model. From here we propose several specific problems for future work. We give a hierarchy; unfortunately, the separation requires an increase not only in running time but also in cache size and cache line size. The increases in the size of the cache and the cache line are only a constant sized blow up under normal assumptions (that the word size is large enough to index the memory). However, it would be very interesting to show these results without any increase. Furthermore, removing the factor of B from the hierarchy separation would be exciting. It would also be interesting to show a separation hierarchy based on cache size alone.
Many fine-grained reductions in the RAM model port directly to the caching model. However, this need not be the case. Finding reductions between problems in the caching model (especially non-trivial ones) would be very interesting, as would finding cases where the RAM and I/O reductions are very different. Additionally, reducing between problems in the I/O model may lead to algorithmic improvements.
We show a connection between Turing Machine simulations of RAM machines and memory locality in Section 5.4. Showing, through any method, that a certain factor of the cache line size, B, can always be saved in the I/O model would be very interesting.
A New Upper Bounds Proofs
We will create a self reduction. The only problem we have is how to make sure that full adjacency lists are grouped together, unless they are too big for cache and then split. We will do this with division rules, and then argue that the splits aren't too inefficient.
We begin by building a more general algorithm that counts the nodes at distance 0, 1 and 2. This will allow for computing 2vs3 Diameter and 2vs3 Radius efficiently (in Corollary A.3 and Corollary A.4). Then we run the algorithm we will call D. We feed it two copies of the adjacency list A to start, but we can feed it two different lists. D returns a tuple. The first value allows lower subproblems to propagate up that they found a large diameter. The second value is for message passing between levels of the program. So we will subdivide the problems, but some of our divisions will split the groups unevenly. This will limit how uneven our division can be. D(A, B) has four cases.
1. Base case: |A| or |B| has length 1. Then we simply scan through the other list for that one value. If they share a value return True; if not, return False. If A and B are each a full adjacency list then add one to the counter distTwo in A and add one to the counter distTwo in B.
2. If A is a subset of, or all of, one adjacency list and B contains one or more adjacency lists then we will use the indicator bits on list B. Intuitively, we are going to set these indicator bits to say if the subset of A we are looking at has overlapped with B.
We scan through the adjacency list and set T[i][2] = i.distTwo for all i. We add counts when a list is being considered and is about to be divided, so we never double count. We iterate to the bottom and thus every list will eventually be the entirety of the input A at some level of the recursion. This takes time O(|E|/B) because the lists are in the same order due to our previous sort. We now have all the counts.
Once a subproblem fits in memory (that is, |A| + |B| < M), whatever size that is, the subproblem is solved in the time to read in and write out all the data, so O(M/B).
We can sub-divide unevenly when we are splitting up many adjacency lists, because we choose to divide adjacency lists where they split. However, we can count the total number of subproblems that fit in memory. For adjacency lists, L, of length ≥ M/4 we will get ≤ 2L/M + 1 sublists. For adjacency lists, L, of length ≤ M/4 at most one will be alone. The other lists will be in sublists with ≥ M/(2L) other sublists. Thus, we will have at most twice as many sublists as we ought to, thus we will have at most four times as many subproblems of size ≤ M/2 as we should.
At every instance we divide into at least two new subproblems. Above we bounded the number of subproblems produced at O((|E|/M)^2). Thus the maximum number of layers in our self recurrence is O(lg(|E|)). We take scan time per layer. The total time for these scans is thus O(|E| lg(|E|)/B). So the total time for this algorithm will be O((|E|/M)^2 · M/B + |E| lg(|E|)/B + sort(|E|)) = O(|E|^2/(MB) + |E| lg(|E|)/B).
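The final simplification can be written out term by term. The sketch below assumes the standard I/O sorting bound sort(N) = O((N/B) lg_{M/B}(N/B)) and M ≥ 2B; neither assumption is stated in this appendix.

```latex
\begin{align*}
O\!\left(\left(\frac{|E|}{M}\right)^{2}\right)\cdot O\!\left(\frac{M}{B}\right)
  &= O\!\left(\frac{|E|^{2}}{MB}\right)
  && \text{(in-memory subproblems, each costing } O(M/B)\text{)} \\
\operatorname{sort}(|E|)
  = O\!\left(\frac{|E|}{B}\,\lg_{M/B}\frac{|E|}{B}\right)
  &= O\!\left(\frac{|E|\lg|E|}{B}\right)
  && \text{(assuming } M/B \ge 2\text{)} \\
O\!\left(\frac{|E|^{2}}{MB}\right) + O\!\left(\frac{|E|\lg|E|}{B}\right) + \operatorname{sort}(|E|)
  &= O\!\left(\frac{|E|^{2}}{MB} + \frac{|E|\lg|E|}{B}\right)
\end{align*}
```

Under these assumptions the sorting term is absorbed by the scan term, which is why it does not appear in the final bound.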
