Abstract. Compact and I/O-efficient data representations play an important role in efficient algorithm design, as memory bandwidth and latency can present a significant performance bottleneck, slowing the computation by orders of magnitude. While this problem is very well explored in e.g. uniform numerical data processing, structural data applications (e.g. on huge graphs) require different algorithm-dependent approaches. Separable graph classes (i.e. graph classes with balanced separators of size $O(n^c)$ with $c < 1$) include planar graphs, bounded genus graphs, and minor-free graphs. In this article we present two generalizations of the separator theorem, to partitions with small regions only on average and to weighted graphs. Then we propose an I/O-efficient succinct representation and memory layout for random walks in (weighted) separable graphs in the pointer machine model, including an efficient algorithm to compute them. Finally, we present a worst-case I/O-optimal tree layout algorithm for root-leaf path traversal, show an additive (+1)-approximation of the optimal compact layout and contrast this with an NP-completeness proof for finding an optimal compact layout.
Introduction
Modern computer memory consists of several memory layers that together constitute a memory hierarchy, with every level further from the CPU being larger and slower [2], usually by more than an order of magnitude, e.g. CPU registers, L1-L3 caches, main memory, disk drives etc. In order to simplify the model, commonly only two levels are considered at once, called main memory and cache of size M. There, the main memory access is block-oriented, assuming unit time for reading and writing of a block of size B, making random byte access very inefficient. While some I/O-efficient algorithms need to know the values of B and M (generally called cache-aware) [3], cache-oblivious algorithms [13] operate efficiently without this knowledge.
Computations that process medium to large volumes of data therefore call for space-efficient data representations (to utilize the memory capacity and bandwidth) and strongly benefit from optimized memory access patterns and layouts (to utilize the data in fast caches and read-ahead mechanisms). While this area is very well explored in e.g. numerical data processing and analysis (e.g. [24]), structural data applications (e.g. huge graphs) require different and application-dependent approaches. We describe representations that address these issues in separable graphs and trees.
Separable graphs satisfy the $n^c$-separator theorem for some $c < 1$, shown for planar graphs in 1979 by Lipton and Tarjan [29] (with $c = 1/2$), where every such graph on n vertices has a vertex subset of size $O(n^c)$ that is a 2/3-balanced separator (i.e. it separates the graph into two subgraphs, each having at most a 2/3-fraction of the vertices). These graphs not only include planar graphs [29] but also bounded genus graphs [17] and minor-free graph classes in general [22]. Small separators are also found in random graph models of small-world networks (e.g. geometric inhomogeneous random graphs by Bringmann et al. [7] have sublinear separators w.h.p. for all subgraphs of size $\Omega(\sqrt{\log n})$). Some graphs which come from real-world applications are also separable, such as road network graphs [33, 35]. Separable graph classes have linear information entropy (i.e. a separable class can contain only $2^{O(n)}$ graphs of size n) and have efficient representations using only O(1) bits per vertex on average [4], and therefore utilize the memory capacity and bandwidth very efficiently.
This paper is organized as follows: Sections 1.1 and 1.2 give an overview of the prior work and our contribution. Section 2 recalls the concepts and notation used. Section 3 contains our results on random walks in separable graphs. Section 4 generalizes the separator theorem. Section 5 discusses the layout of trees.
Related work
Turán [34] introduced a succinct representation of planar graphs, Blandford et al. [4] introduced compact representations for separable graphs and Blelloch and Farzan [5] presented a succinct representation of separable graphs. However, none of those representations is cache-efficient (or can be easily made so). Analogous representations for general graphs suffer similar drawbacks [12, 32].
Agarwal et al. [1] developed a representation of planar graphs allowing I/O-efficient path traversal, requiring $O(K/\log B)$ block accesses for an arbitrary path of length K. This has been extended to a succinct planar graph representation by Dillabaugh et al. [11] with the same result for arbitrary path traversal. It appears unlikely that the representation of [11] could be easily modified to match the I/O complexity O(K/B) of our random-walk algorithm due to their use of a global indexing structure.
Dillabaugh et al. [10] describe a succinct data structure for trees that uses O(K/B) I/O operations for leaf-to-root path traversal. For root-to-leaf traversal, they offer a similar but only compact structure.
Among other notable I/O-efficient algorithms, Maheshwari and Zeh [30] develop I/O-efficient algorithms for computing vertex separators, shortest paths and several other problems in planar and separable graphs. Jampala and Zeh [20] extend this to a cache-oblivious algorithm for planar shortest paths. While there are representations even more efficient than succinct (e.g. implicit representations, which use only O(1) bits more than the class information entropy, see Kannan et al. [21] for an implicit graph representation), these do not seem to admit I/O-efficient access.
Random walks on graphs are commonly used in Monte Carlo sampling methods, among others in Markov Chain Monte Carlo methods for inference on graphical models [14] , Markov decision process (MDP) inference and even in partial-information game theory algorithms [25] .
Our contribution
Random walks on separable graphs. We present a compact cache-oblivious representation of graphs satisfying the $n^c$ edge separator theorem. We also present a cache-oblivious representation of weighted graphs satisfying the weighted $n^c$ edge separator theorem, where the transition probabilities depend on the weights. The representations are I/O-efficient when performing random walks of any length on the graph, starting from a vertex selected according to the stationary distribution. In the weighted case, the transition probabilities at each step are proportional to the weights on the incident edges; in the unweighted compact representation, a neighbor is chosen uniformly at random.
Namely, if every vertex contains q bits of extra (user) information, the representation uses $O(n \log(q + 2)) + qn$ bits and a random path of length K (sampled w.r.t. edge weights) uses $O\left(K / \left(\frac{Bw}{1+q}\right)^{1-c}\right)$ I/O operations with high probability.
The graph representation is compact (as the structure entropy including the extra bits is $\Theta((q + 1)n)$). The amount of memory used for the representation of the graph is asymptotically strictly smaller than the memory used by the user data already for the common case of $q = \Theta(w)$, in which case only $O(K/B^{1-c})$ I/O operations are used. For $q = O(1)$, the representation uses O(n) bits.
In contrast with previous I/O-efficient results for planar graphs, our representation is only compact (and not succinct) but works for all separable graph classes, is cache-oblivious (in contrast to only cache-aware in prior work), and, most importantly, comes with a much better bound on the number of I/O operations for randomly sampled paths (of order $O(K/B^{1-c})$ rather than $O(K/\log B)$).
Fast tree path traversal is a ubiquitous requirement for tree-based structures used in external storage systems, database indexes and many other applications. With Theorem 9, we present a linear-time algorithm to compute a layout of the vertices in memory minimizing the worst-case number of I/O operations for leaf-to-root paths in general trees and root-to-leaf paths in trees with unit vertex size.
We show an additive (+1)-approximation of an optimal compact layout (i.e. one that fully uses a consecutive block of memory) and show that finding an optimal compact layout is NP-hard. The above layout optimality is well defined assuming unit vertex size, an assumption often made and satisfied in practice. Using techniques from Section 3 we can turn the layout into a compact representation using O(n) bits of memory, requiring at most $OPT_L$ I/O operations for leaf-to-root paths in general trees and root-to-leaf paths in trees of fixed degree, where $OPT_L$ is the I/O complexity of the optimal layout, i.e. the I/O-optimal layout with the vertices using any conventional vertex representation with $\Theta(w)$ bits for inter-vertex pointers. See Theorem 10.
Compared to previous results [10] , our representation is compact and we present the exact optimum over all layouts while they provide the asymptotic optimum O(K/B). However, this does not guarantee that our representation has lower I/O complexity, since our notion of optimality only considers different layouts with each vertex stored by a structure of unit size.
Separable graph theorems. We prove two natural generalizations of the separator theorem (Theorem 7) and show that their natural joint generalization does not hold by providing a counterexample (Theorem 8). The Recursive Separator Theorem involves graph partitions coming from recursive applications of the Separator Theorem. Let $r$ and $\bar{r}$ denote the maximum and average size of a region in the partition, respectively. We prove a stronger bound on the number of edges going between regions: O( 
Preliminaries
Throughout this paper, we use standard graph theory notation and terminology as in Bollobás [6]. We denote the subtree of T rooted in vertex v by $T_v$, the root of tree T by $r_T$ and the set of children of a vertex v by $\delta(v)$. All logarithms are binary unless noted otherwise.
We use standard notation and results for Markov chains as introduced in the book by Grinstead and Snell [19] (chapter 11) and mixing in Markov chains, as introduced in the chapter on mixing times in a book by Levin and Peres [27] .
Separators
Let S be a class of graphs closed under the subgraph relation. We say that S satisfies the vertex (edge) $f(n)$-separator theorem iff there exist constants $\alpha < 1$ and $\beta > 0$ such that any graph in S has a vertex (edge) cut of size at most $\beta f(n)$ that separates the graph into components of size at most $\alpha n$. We define a weighted version of the vertex (edge) separator theorem, which requires that there is a balanced vertex (edge) separator of total weight at most $\beta \frac{f(n)}{n} W$, where W is the sum of weights of all the edges. Note that these definitions make sense even for directed graphs. When we refer to the $f(n)$-separator theorem without stating explicitly whether it is the edge or the vertex version, we mean the $f(n)$ vertex separator theorem.
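To make the definition concrete, the following minimal sketch checks the vertex separator condition on a square grid, a standard planar example. The helper names (`grid_graph`, `component_sizes`) and the choice of the middle column as the separator are illustrative assumptions and do not come from the paper.

```python
from collections import deque

def grid_graph(side):
    """Adjacency lists of a side x side grid graph on vertices (x, y)."""
    adj = {(x, y): [] for x in range(side) for y in range(side)}
    for (x, y) in adj:
        for nx, ny in ((x + 1, y), (x, y + 1)):
            if (nx, ny) in adj:
                adj[(x, y)].append((nx, ny))
                adj[(nx, ny)].append((x, y))
    return adj

def component_sizes(adj, removed):
    """Sizes of the connected components left after deleting `removed`."""
    seen = set(removed)
    sizes = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        sizes.append(size)
    return sizes

side = 9
adj = grid_graph(side)
n = len(adj)                                   # n = side * side
separator = [(side // 2, y) for y in range(side)]  # middle column, sqrt(n) vertices
sizes = component_sizes(adj, separator)
is_balanced = max(sizes) <= 2 * n / 3          # the 2/3-balance condition
print(len(separator), sizes, is_balanced)
```

Here the separator has $\sqrt{n}$ vertices and both remaining components are well below the $2n/3$ threshold, matching the $n^{1/2}$-separator theorem for planar graphs.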
Many graphs that arise in real-world applications satisfy the $n^c$ vertex or edge separator theorem.
It has been extensively studied how to find balanced separators in graphs. In planar graphs, a separator of size $\sqrt{n}$ can be found in linear time [29]. Separators of the same size can be found in minor-closed families in time $O(n^{1+\epsilon})$ for any $\epsilon > 0$ [22]. A balanced separator of size $n^{1-1/d}$ can be found in a finite-element mesh in expected linear time [31]. Good heuristics are known for some graphs which arise in real-world applications, such as road networks [33]. A poly-logarithmic approximation which works on any graph class is known [26]. A poly-logarithmic approximation of the separators is sufficient to achieve almost the same bounds in our representation (differing by a factor at most poly-logarithmic in B).
We define a recursive separator partition to be a partition of the vertex set of a graph, obtained by the following recursive process. Given a graph G, we either set the whole V(G) to be one set of the partition, or we split G by a balanced separator into two subgraphs and recursively partition each of them. We call the sets in a recursive separator partition regions. If there is an algorithm that computes a balanced separator in time $O(f(n))$, there is an algorithm that computes a recursive separator partition with region size $\Theta(r)$ in time $O(f(n) \log n)$ for any r. A stronger version called r-division can be computed in linear time on planar graphs [18].
I/O complexity
For definitions related to I/O complexity, refer to Demaine [8]. We use the standard notation with B being the block size and M the cache size. Both B and M are counted in words. Each word has w bits and it is assumed that $w \in \Omega(\log n)$.
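As a small illustration of why the block-oriented cost model rewards locality, the sketch below counts block transfers for two access patterns over the same data. It is a toy model under assumptions made here: the cache holds `cache_blocks` blocks (so M = `cache_blocks` * B), replacement is LRU, and the function name `count_block_transfers` is hypothetical.

```python
from collections import OrderedDict

def count_block_transfers(addresses, block_size, cache_blocks):
    """Count block loads for an access sequence, assuming LRU replacement."""
    cache = OrderedDict()          # block id -> True, ordered by recency
    transfers = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: refresh recency
        else:
            transfers += 1                 # miss: one I/O operation
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict least recently used
    return transfers

n, B = 1024, 16
sequential = list(range(n))
# Visits every word exactly once, but jumps to a different block each time.
strided = [B * (i % (n // B)) + i // (n // B) for i in range(n)]

seq_cost = count_block_transfers(sequential, B, cache_blocks=4)
strided_cost = count_block_transfers(strided, B, cache_blocks=4)
print(seq_cost, strided_cost)
```

Both sequences touch the same n words, yet the sequential scan needs n/B transfers while the strided one needs a transfer for every access.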
Representation for Random Walks
In this section, we present our cache-oblivious representation of separable graphs optimized for random walks and related results. For other random walks and weighted graphs where the transition probabilities are proportional to the random walk stationary distribution, we can show a weaker result. Namely, we can no longer guarantee a compact representation.
Theorem 2. Let $M$ be any Markov chain of random walks on a graph G and assume $M$ has a unique stationary distribution $\pi$. Assume G satisfies the $n^c$ edge separator theorem with respect to the edge-traversal probabilities in $\pi$. Let $M'$ be a Markov chain of random walks on G with transition probabilities proportional to $M$, e.g. $\pi'(e) = \Theta(\pi(e))$. Then there is a layout of vertices of G into blocks with $\Theta(B)$ vertices each such that a random walk in $M'$ of length k crosses a memory block boundary in expectation $O(k/B^{1-c})$ times.
Note that this gives an efficient memory representation when $N_G(v)$ and the probabilities on incident edges can be represented by (or computed from) O(1) words, which is the case for bounded degree graphs with some chains $M$. We also note that such partially-implicit graph representations are present in the state graphs of some MCMC probabilistic graphical model inference algorithms.
Additionally, we present a result on the concentration of the number of I/O operations which applies to both Theorems 1 and 2.
Theorem 3. Let G be a fixed graph, $t_{mix}$ the mixing time of G and X the number of edges going between blocks crossed during the random walk. Then the probability that
for some value c and $m = t_{mix} \log(n^2/E(X_1))$, where the variable $X_i$ indicates if the walk crossed an edge between two different blocks in step i.
The following lemma is implicit in [4] , as the authors use the same layout to get compact representation of separable graphs and they use the following property.
Lemma 1 (Blandford et al. [4] ). If π in Theorem 2 gives the same traversal probability to all edges, the representation induces a vertex order l :
Proofs of Theorems 1 -3

Proof (Proof of Theorem 1).
Since the stationary distribution on an undirected graph assigns equal probability to every edge, we can apply Lemma 1 on G to obtain a vertex ordering $r : V \to \{1, \dots, n\}$ such that $\sum_{e=uv \in E(G)} \log|r(u) - r(v)| = O(n)$. We could therefore compactly store the edges as variable-width vertex order differences (offsets). However, it is not straightforward to find the memory location of a given vertex when a variable-width encoding is used. To avoid an external (and I/O-inefficient) index used in some other approaches, we replace the edge offset information with relative bit-offsets, directly pointing to the start of the target vertex, using Theorem 4 on the edge offsets. We expand the representation by inserting the q bits of extra information into every vertex, adjusting the pointers and thus widening each by $O(\log q)$ bits.
To prove the bound on I/O complexity, we use the same argument as in the proof of Theorem 2. An average of $O(1 + q)$ bits is used for the representation of a single vertex and, therefore, an average of $\Theta\left(\frac{Bw}{q+1}\right)$ vertices fit into one cache line. By Theorem 7, part i, the total probability on edges going between memory blocks is $O\left(1/\left(\frac{Bw}{q+1}\right)^{1-c}\right)$. Again, by linearity of expected value, this proves the claimed I/O complexity.
Compact representation as in Theorem 4 can be computed in the claimed bound, as is shown in Theorem 5.
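The quantity bounded in Lemma 1, the total number of bits needed to write every edge as a difference of vertex positions, can be explored numerically. The sketch below is only an illustration under assumptions made here: it uses a complete binary tree (where cutting a root edge is a balanced edge separator) instead of a general separable graph, and it compares a breadth-first order with an in-order traversal, which stores the two sides of every root cut contiguously. The function names are hypothetical.

```python
import math

def inorder_positions(n):
    """In-order position of each node of a heap-indexed tree on nodes 1..n."""
    pos, counter = {}, 0
    stack, node = [], 1
    while stack or node <= n:
        while node <= n:            # descend along left children
            stack.append(node)
            node = 2 * node
        node = stack.pop()
        pos[node] = counter
        counter += 1
        node = 2 * node + 1         # continue in the right subtree
    return pos

def offset_bits(n, pos):
    """Sum of log2 |pos(child) - pos(parent)| over all tree edges."""
    return sum(math.log2(abs(pos[c] - pos[c // 2])) for c in range(2, n + 1))

n = 2 ** 10 - 1                      # complete binary tree with 10 levels
bfs_cost = offset_bits(n, {v: v for v in range(1, n + 1)})
inorder_cost = offset_bits(n, inorder_positions(n))
print(round(bfs_cost), round(inorder_cost), n)
```

On this instance the in-order layout needs fewer than n bits in total, whereas the breadth-first order needs several bits per edge, in line with the $O(n)$ versus $\Theta(n \log n)$ behaviour one would expect.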
Proof (Proof of Theorem 2).
We use the following recursive layout. Let S be an edge separator with respect to the edge-traversal probabilities in $\pi$. Then S partitions G into two subgraphs X and Y. We recursively lay out X and Y and concatenate the layouts. Note that X and Y are stored in memory contiguously. At some level of recursion, we get a partition into subgraphs represented by between $\epsilon B$ and $B$ words for a constant $\epsilon > 0$. We call these subgraphs block regions. Since the average degree in graphs satisfying the $n^c$ edge separator theorem is O(1) [28], the average vertex representation size is also O(1) and the average number of vertices in a block region is, therefore, $\Theta(B)$. It follows from Theorem 7, part ii, that the total probability on edges going between block regions is $O(1/B^{1-c})$. By linearity of expectation, an $O(1/B^{1-c})$-fraction of steps in the random walk cross between block regions in expectation. Moreover, each of the block regions in the partition is stored in O(1) memory blocks, which proves the claimed bound on I/O complexity.
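The effect of the recursive layout can be checked exactly on a small example. The sketch below assumes a 32 x 32 grid with unit-size vertices, B = 64 vertices per block, and cuts each rectangle along its longer side (an illustrative stand-in for the separator from the theorem; the function names are hypothetical). Since the stationary distribution of the simple random walk is uniform on edges, the fraction of edges joining different blocks equals the expected fraction of steps that cross a block boundary.

```python
def bisection_order(x0, y0, w, h):
    """Recursive layout: cut the longer side, lay out both halves contiguously."""
    if w * h == 1:
        return [(x0, y0)]
    if w >= h:
        half = w // 2
        return (bisection_order(x0, y0, half, h)
                + bisection_order(x0 + half, y0, w - half, h))
    half = h // 2
    return (bisection_order(x0, y0, w, half)
            + bisection_order(x0, y0 + half, w, h - half))

def crossing_edges(order, side, B):
    """Number of grid edges whose endpoints lie in different memory blocks."""
    block = {cell: pos // B for pos, cell in enumerate(order)}
    count = 0
    for x in range(side):
        for y in range(side):
            if x + 1 < side and block[(x, y)] != block[(x + 1, y)]:
                count += 1
            if y + 1 < side and block[(x, y)] != block[(x, y + 1)]:
                count += 1
    return count

side, B = 32, 64
total_edges = 2 * side * (side - 1)
row_major = [(x, y) for x in range(side) for y in range(side)]
recursive = bisection_order(0, 0, side, side)

row_major_cross = crossing_edges(row_major, side, B)
recursive_cross = crossing_edges(recursive, side, B)
print(row_major_cross / total_edges, recursive_cross / total_edges)
```

With these parameters each block of the recursive layout is an 8 x 8 tile, so the crossing fraction scales like $1/\sqrt{B}$, while the row-major layout cuts an entire row of edges between every pair of consecutive blocks.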
Proof (Proof of Theorem 3).
Let X be the number of edges crossed during the random walk that go between blocks. We are assuming that there is at least one edge going between two blocks in the graph.
We choose $\delta' = \frac{3}{4}\delta$ (an arbitrary constant $c < 1$ would work). Note that m is a number of steps after which the probabilities on edges differ from those in the stationary distribution by at most $E(X_1)/n^2$, regardless of the distribution from which we started the random walk, since $t_{mix}(\epsilon) \le \lceil \log \epsilon^{-1} \rceil\, t_{mix}$ [27]. This means that the probability that an edge going between two blocks is crossed after m steps differs by at most a $\frac{1}{n}$-fraction from the probability in the stationary distribution. Let $X_i$ be the indicator random variable that is 1 iff the random walk crosses an edge going between blocks in step i. We consider the following sets of random variables $S_i = \{X_j \mid X_{j-m} : j \bmod m = i\}$ for $1 \le i \le m$ (not conditioning on variables with nonpositive indices). Note that the random variables in each of the sets $S_i$ are independent and (1 − 
By applying the Chernoff inequality, we get that the following bounds hold for all n ≥ n 0 for some n 0 for each i:
The probability that there exists i such that either
is by the union bound for some value of c at most the following:
Note that $\mu_i$ converges to $|S_i| E(X_1)$, which is the value around which we are showing concentration of $\sum_{X \in S_i} X$. The asymptotic bound on the probability follows.
Expanding relative offsets to relative bit-offsets
Having the edges of a graph encoded as relative offsets to the target vertex and having these numbers encoded by a variable-length encoding, we need a way to find the exact location of the encoded vertex. Others have used a global index for this purpose but this is generally not I/O-efficient.
Our approach encodes the relative offsets as slightly wider numbers that directly give the relative bit-index of the target. However, this is not straightforward as expanding just one relative offset to a relative bit-offset can make other bit-offsets (spanning over this value) larger and even requiring more space, potentially cascading the effect.
Note that one simple solution would be to widen every offset representation by $\Theta(\log\log N)$ bits, where N is the total number of bits required to encode all the n offsets, yielding an encoding with $N + n \cdot O(\log\log N)$ bits. Indeed, $\log n$ bits are sufficient to store each offset. Therefore, by expanding the offsets, they increase at most $\log n$ times. By adding $\log(2\log n)$ bits, we can encode an increase of offsets by a factor of up to $2\log n \ge \log n + \log(2\log n)$.
However, we propose a more efficient encoding with the following theorem. We interpret the numbers $a_i$ as relative pointers, the i-th number pointing to the location of the $(i + a_i)$-th value. In the proof, we use a dynamic-width gamma number encoding in the form $[(\mathrm{sign})\, B_0\, 0\, B_1\, 0\, B_2\, 0 \dots B_i\, 1]$, where the $(2i + 1)$-th bit encodes whether $B_i$ is the last bit encoded.
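A minimal sketch of such a self-delimiting encoding follows. Several details are assumptions made here rather than taken from the paper: bits are kept as a string of '0'/'1' characters, the magnitude is written most significant bit first, all of its bits (including the leading one) are stored, and zero is written as a single 0 bit. With these choices the width is 1 + 2 * (number of magnitude bits), slightly more than the $1 + 2\log|b|$ used in the text.

```python
def gamma_encode(value):
    """Encode an integer as: sign bit, then (data bit, stop flag) pairs."""
    sign = '1' if value < 0 else '0'
    bits = bin(abs(value))[2:]            # '0' for zero
    out = [sign]
    for k, bit in enumerate(bits):
        out.append(bit)
        out.append('1' if k == len(bits) - 1 else '0')  # 1 marks the last bit
    return ''.join(out)

def gamma_decode(stream, pos=0):
    """Decode one number starting at bit `pos`; return (value, next position)."""
    negative = stream[pos] == '1'
    pos += 1
    magnitude = 0
    while True:
        magnitude = 2 * magnitude + int(stream[pos])
        last = stream[pos + 1] == '1'
        pos += 2
        if last:
            break
    return (-magnitude if negative else magnitude), pos

stream = gamma_encode(5) + gamma_encode(-2)
first, nxt = gamma_decode(stream)
second, end = gamma_decode(stream, nxt)
print(stream, first, second)
```

Because every number carries its own stop flag, a decoder can start at any bit position where an encoded number begins, which is what a relative bit-offset provides.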
Theorem 4. Let $a_1, \dots, a_n$ be a sequence of numbers such that $-i \le a_i \le n - i$ and $\sum_{i=1}^{n} \log|a_i| = m$. Then there are n-element sequences $\{w_i\}$ (the encoded bit-widths) and $\{b_i\}$ (the bit-offsets) of numbers such that for all $1 \le i \le n$, $w_i \ge 2\log|b_i| + 1$ (i.e. $b_i$ can be gamma-encoded in $w_i$ bits), $P(i) + b_i = P(i + a_i)$ where $P(j) := \sum_{i=1}^{j-1} w_i$ (so $b_i$ is a relative bit-offset of the encoded position $i + a_i$) and
Proof. There are certainly some non-optimal valid choices for the $w_i$'s and $b_i$'s, and we can improve upon them iteratively by shrinking the $w_i$'s to fit the gamma-encoded $b_i$ with sign (i.e. $w_i = 1 + 2\log|b_i|$), which may, in turn, decrease some $b_i$'s. Being monotonic, this process certainly has a fixpoint $\{b_i\}_i$ and $\{w_i\}_i$, and we assume an arbitrary such fixpoint.
Let $C < 1$ and $D > 1$ be constants to be fixed below. Denote $v_i = \log|a_i|$ and $R_i = \{i, \dots, i + a_i - 1\}$ (resp. $\{i + a_i, \dots, i - 1\}$ when $a_i < 0$). Intuitively, when expanding offsets $a_x$ to bit-offsets $b_x$, it may happen that $R_x$ contains y with $w_y \gg a_x$, forcing $w_x \gg v_x$. We amortize such cases by distributing "extra bits" to such "smaller" offsets.
Let $x \prec y \iff y \in R_x \wedge v_x \le C \log w_y \wedge v_x > D$, let $x^\uparrow = \arg\max_{y \succ x} w_y$ (or undefined if there is no such y) and let $y^\downarrow = \{x \mid y = x^\uparrow\}$. Observe that
We also note that $y = x^\uparrow$ implies $w_x < w_y$, since $w_y \le w_x$ would imply $b_x \le |a_x| w_x$ and $w_x > 2^{v_x/C}$, leading to $w_x \le v_x + \log w_x$ and $2^{v_x/C} < w_x \le 2v_x$, which gives the desired contradiction for D large enough (depending only on C).
We will distribute the extra bits starting from the largest $w_i$'s. Every y uses $w_y$ bits for its encoding and distributes another $w_y$ bits to y
be the number of extra bits received from $x^\uparrow$ in this way. For every offset x we use $10v_x + 2D$ bits and the received bits $r_x$. Since the received bits are accounted for in other offsets, this uses
Therefore we only need to show that the number of bits thus available at x is sufficient, i.e. that $2w_x \le r_x + 10v_x + 2D$ (one $w_x$ to represent $b_x$, one to distribute to $x^\downarrow$). Now either there is $y = x^\uparrow$ and we have $b_x \le |a_x| w_y$, so $w_x \le 1 + 2v_x + 2\log w_y$, and noting that for large enough D only depending on C: $2\log w_y \le$ On the other hand, undefined $x^\uparrow$ implies that $\forall y \in R_x : w_y \le 2^{v_x/C}$. Therefore $b_x \le |a_x| 2^{v_x/C}$ and $w_x \le 1 + 2v_x + 2v_x/C = 1 + (2 + 2/C)v_x$. Now we may fix $C = 2/3$, obtaining $w_x \le 5v_x + D$ as required for $D \ge 1$. This finishes the proof for any fixpoint $\{b_i\}_i$ and $\{w_i\}_i$.
The algorithm from the beginning of the proof can be shown to run in polynomial time. We start with e.g. $w_i = w_0 = 1 + 4\log n$ and $b_i = \mathrm{sign}(a_i) \sum_{j \in R_i} w_j$. Then we iteratively update $w_i := 1 + 2\log|b_i|$ and recompute $b_i$ as above. Since every iteration takes $O(n^2)$ time and in every iteration at least one $w_i$ decreases, the total time is at most $O(n^3 \log n)$. In the following section, we show an algorithm that computes a representation with the same asymptotic bounds, running in time $O(n^{1+\epsilon})$ for any $\epsilon > 0$.
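The simple iterative procedure can be sketched as follows. This is an illustration under assumptions made here, not the paper's exact algorithm: indices are 0-based with target `i + a[i]` in `0..n`, all widths are updated simultaneously in each round, the width of an offset is taken as `1 + 2 * bit_length` (a sign bit plus one stop flag per stored bit), and the common starting width is found by a small search rather than by the formula $1 + 4\log n$.

```python
def gamma_width(b):
    """Bits used for offset b: sign bit plus (data bit, stop flag) pairs."""
    return 1 + 2 * max(1, abs(b).bit_length())

def bit_offsets(a):
    """Iterate widths w and bit-offsets b down to a fixpoint."""
    n = len(a)
    w0 = 3
    while gamma_width(n * w0) > w0:      # wide enough for any offset |b| <= n * w0
        w0 += 1
    w = [w0] * n
    while True:
        prefix = [0]                     # prefix[j] = bit position of entry j
        for wi in w:
            prefix.append(prefix[-1] + wi)
        b = [prefix[i + a[i]] - prefix[i] for i in range(n)]
        new_w = [gamma_width(bi) for bi in b]
        if new_w == w:                   # fixpoint: every width fits its offset
            return w, b
        w = new_w                        # widths only shrink, so this terminates

a = [1, 1, -2, 0]                        # relative pointers between entries
w, b = bit_offsets(a)
print(w, b)
```

Starting from widths that are valid for every possible offset, each round can only shrink the widths, which in turn can only shrink the bit-offsets, so the loop stops at a consistent assignment.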
Constructing the compact representation
In this section, we use the notation defined in Section 3.2, specifically $R_e$ and $b_e$. Recall that $R_e$ is the set of edges of G spanned by the edge e in the representation and $b_e$ is the relative offset of edge e in the (expanded) representation. Let G be the graph we want to represent. We assume that G satisfies the $n^c$ edge separator theorem.
We find a representation using $O(n \log\log n)$ bits, as mentioned above, by expanding all pointers and then modify it to make it compact. We define a directed graph H on the set E(G) with an arc going from v to u iff $v \in R_u$. Let us fix a recursive separator hierarchy of G. We call $l(e)$ the level of recursion on which the edge e is part of the separator. We define a graph $H_{\le k}$ to be the subgraph of H induced by vertices corresponding to edges of G which appear in the recursive separator hierarchy in a separator of a subgraph of size at most k.
The following lemma will be used to bound the running time of the algorithm:
Lemma 2. The maximum out-degree of $H_{\le n^{c'}}$ is $O(n^{c \cdot c'})$. For any fixed $c' > 0$, $|H \setminus H_{\le n^{c'}}| \in O(n^{1-\epsilon})$, where $\epsilon > 0$ is some constant depending only on $c$ and $c'$.
Proof. We first prove that the maximum out-degree of H is $O(n^c)$. There are $O(n^c)$ edges $e \in G$ with $l(e) = 1$ spanning any single vertex. The number of edges e spanning some vertex with $l(e) = k$ decreases exponentially with k, resulting in a geometric sequence summing to $O(n^c)$. The maximum out-degree of $H_{\le n^{c'}}$ is the same as that of the graph H corresponding to a subgraph of G of size at most $n^{c'}$. The maximum out-degree of $H_{\le n^{c'}}$ is, therefore, $O(n^{c \cdot c'})$. The number of vertices in $H \setminus H_{\le n^{c'}}$ is equal to the number of edges in G going between blocks of size $\Theta(n^{c'})$. This number is, by Theorem 7, $O(n/n^{c'(1-c)})$, which is $O(n^{1-\epsilon})$ for some $\epsilon > 0$.
Theorem 5. Given a separator hierarchy, the representation from Theorem 1 can be computed in time $O(n^{1+\epsilon})$ for any $\epsilon > 0$.
Proof. We first describe an algorithm running in time $O(n^{1+c} \log\log n)$, where c is the constant from the separator theorem, and then improve it.
Just as in the proof of Theorem 4, $b_v$ denotes the relative offset of edge v in the representation. We store a counter $c_v$ for each vertex $v \in H$ equal to the decrease of $b_v$ required to shrink its representation by at least one bit. That is, $c_v = b_v - \lfloor b_v \rfloor_{2^k} + 1$, where $\lfloor i \rfloor_{2^k}$ is i rounded down to the closest power of two. When we shrink the representation of the edge corresponding to a vertex $v \in H$, we have to update the counters $c_u$ for all u such that $vu \in E(H)$. Since the out-degree of H is $O(n^c)$, the updates take $O(n^c)$ time. We start with a representation with $O(n \log\log n)$ bits and at each step we shorten the representation by at least one bit. This gives the running time of $O(n^{1+c} \log\log n)$. To get the running time of $O(n^{1+\epsilon} \log\log n)$, we consider the graph $H_{\le n^{\epsilon}}$ for some sufficiently small $\epsilon$. Note that the maximum out-degree of $H_{\le n^{\epsilon}}$ is $O(n^{c\epsilon})$, which is at most $n^{\epsilon}$. Therefore, by using the same algorithm as above on the graph $H_{\le n^{\epsilon}}$ for $\epsilon$ sufficiently small, we get a running time of $O(n^{1+\epsilon} \log\log n)$ for any fixed $\epsilon > 0$. The representations of edges corresponding to vertices not in the graph $H_{\le n^{\epsilon}}$ are not shrunk.
Note that the presumptions of Theorem 4 are fulfilled by the edges corresponding to vertices in $H_{\le n^{\epsilon}}$, and the obtained representation of the graph $G' = (V(G), V(H_{\le n^{\epsilon}}))$ is therefore compact. The edges not in $H_{\le n^{\epsilon}}$ are then added, increasing some offsets. The representation of an offset of length at least $n^{\epsilon'}$ for $\epsilon' > 0$ is never increased asymptotically by inserting edges, since it already has $\Theta(\log n)$ bits. There are at most $O(n^{\epsilon'})$ edges of G shorter than $n^{\epsilon'}$ that span any single inserted edge. Lengthening of offsets shorter than $n^{\epsilon'}$, therefore, contributes at most $O(n^{1-\epsilon} n^{\epsilon'} \log\log n) \in o(n)$ for $\epsilon'$ sufficiently small. The inserted edges themselves have representations of total length $O(n^{1-\epsilon} \log n) \in o(n)$. Additional o(n) bits are used after the insertion of edges and the representation, therefore, remains compact.
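The counter used in the proof of Theorem 5 is easy to state in code. The sketch below assumes positive integer offsets and uses a hypothetical function name; it computes the smallest decrease of an offset that makes its binary representation one bit shorter, i.e. $b - \lfloor b \rfloor_{2^k} + 1$.

```python
def shrink_counter(b):
    """Decrease of the positive offset b needed to lose one binary digit."""
    power = 1 << (b.bit_length() - 1)   # b rounded down to a power of two
    return b - power + 1

b = 18                                   # binary 10010, five bits
c = shrink_counter(b)                    # 18 - 16 + 1
shorter = b - c                          # 15, binary 1111, four bits
print(c, shorter.bit_length())
```

In the algorithm, shrinking one edge by a bit decrements the counters of the edges spanning it; an edge whose counter reaches zero can itself be shortened, and its counter is then recomputed from the new offset.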
Separator hierarchy
In this section, we prove two generalizations of the separator hierarchy theorem. Our proof is based on the proof from [23] . Most importantly, we show that the recursive separator theorem also holds if we want the regions to have small size on average and not in the worst case. We also prove the theorem for weighted separator theorem with weights on edges. We show that the natural generalization of our two generalizations does not hold by presenting a counterexample.
Since the two theorems are very similar and their proofs only differ in one step, we present them as one theorem with two variants and show only one proof proving both variants. The difference lies in the reason why Inequality (1) holds. The following lemma and observation prove the inequality under some assumptions and will be used in the proof of the theorem.
Observation 6. Inequality (1) holds for $r_1 = r_2 = r$.
Lemma 3. Inequality (1) holds for $\gamma_w = \gamma_n$ and $r_1$, $r_2$ and $r$ satisfying the following.
Proof. Let $\gamma = \gamma_w = \gamma_n$. We simplify the inequality γ r
for $r_1$, $r_2$ and $r$ satisfying the equality (2). By substituting for r and rearranging the inequality, we get
We substitute $r_2 = \lambda r_1$. Note that this holds for $\lambda = 1$ and that we may assume $r_1 \le r_2$ by symmetry. Since the inequality holds for $\lambda = 1$, it is sufficient to show the inequality for $\lambda \ge 1$ with both sides differentiated with respect to $\lambda$. By differentiating both sides and simplifying the inequality, we get
which obviously holds, since λ ≥ 1 and γ > 0.
Now we proceed to prove the two generalizations of the recursive separator theorem. Note that in the following, r is the average or maximum region size, depending on whether the graph is weighted or not.
Theorem 7. Let G be a (possibly weighted) graph satisfying the $n^c$ separator theorem with respect to its weights and let P be its recursive balanced separator partition. Suppose that either (i) the graph is not weighted and r is the average size of a region in the partition P, or (ii) the graph is weighted and r is the maximum size of a region in the partition P.
Then the total weight of edges not contained inside a region of P is $O(W/r^{1-c})$, where W is the total weight (resp. number if unweighted) of all edges of G.
In this proof, let w(S) be the total weight of the edges in S with w(e) denoting the weight of the single edge e.
Proof. We use induction on the number of vertices to prove the following claim.
Claim. Let us have a recursive separator partition P of an n-vertex graph G of average region size r. Then $w(E(G) \setminus \bigcup_{p \in P} p) <$
Before the actual proof of this claim, let us define some notation. Let $c$, $\alpha$ and $\beta$ be the constants from the separator theorem (recall that the separator theorem ensures the existence of a partition of V(G) into two sets of size at least $\alpha |V(G)|$ with edges of total weight at most $\beta \frac{W}{n^{1-c}}$ going across). Let $B(W, n, r)$ be the maximum value of $w(E(G) \setminus \bigcup_{p \in P} p)$ over all n-vertex graphs of total weight W and all their recursive separator partitions with average region size r. We use $\gamma_n$ to denote a fraction of the number of vertices and $\gamma_w$ to denote a fraction of the total weight.
Proof (Proof of the claim).
We defer the proof of the base case until we fix the constant $c'$.
By the separator theorem, $B(W, n, r)$ satisfies the following recurrence.
$B(W, n, r) = 0$ for $n \le r$
$B(W, n, r) \le \beta \frac{W}{n^{1-c}} + \max$
where $r_1$, $r_2$ are the respective average region sizes in the two subgraphs. It, therefore, holds that r = 
where $\alpha' > 0$ is a constant depending only on $\alpha$, since $\gamma_n \in [\alpha, 1 - \alpha]$ for $\alpha > 0$. We can therefore set $c'$ such that
This completes the induction step.
For $c'$ large enough, the claimed bound in the base case is negative and it, therefore, holds.
We conclude this section by showing that the following natural generalization of Theorem 7 does not hold: Theorem 8. The following generalization does not hold: Let G be a weighted graph satisfying the n c separator theorem with respect to its weights and let P be its recursive separator partition. Let r be the average size of a region in the partition P . Then the total weight of edges not contained in a region of P is O(W/r 1−c ), where W is the total weight of all edges of G.
Proof. We show that there is a weighted graph satisfying the n c -separator theorem with respect to its weight and a recursive partition P of G with edges going between partition regions of P that have total weight Θ(W ), where W is the total weight of all edges, and with average region size of Θ(n/ log n).
Let G be an unweighted graph of bounded degree satisfying the n c -separator theorem. We set weights of all its edges to be 1, except for one arbitrary edge e with weight m − 1, where m is the number of edges of G. Note that w(e) = W/2. We denote this weighted graph by G w .
Let S be a separator in G from the separator theorem. We modify S in order to obtain a balanced separator S w in G w of weight O(W/n 1−c ). If e ∈ S, we set S w = S. Otherwise, we remove e from S and add all other edges incident to its endpoints. This gives us S w which is a separator and its weight differs from the weight of S only by an additive constant, since the graph G has bounded degree. It follows that G w satisfies the n c -separator theorem with respect to its weights. We consider a partition P constructed by the following process. Let S be a separator from the separator theorem on G w , partitioning V (G w ) into vertex sets A and B. If e ∈ S, we stop and set A and B as the regions of P . Otherwise, without loss of generality, e ∈ A. We set B as a region of P and recursively partition A.
At the end of this process, we get P with edges of total weight at least W/2 between regions (as e is not contained within any region). The partition P has $\Theta(\log n)$ regions, so the average region size is $\Theta(n/\log n)$.
Representation for Paths in Trees
In this section, we show a linear-time algorithm that computes a cache-optimal layout of a given tree. We assume that the vertices have unit size and that B is the number of vertices that fit into a memory block. The same assumption has been used previously by Gil and Itai [16]. This is a reasonable assumption for trees of fixed degree and for trees in which each vertex only has a pointer to its parent. It does not matter in which direction the paths are traversed and we may, therefore, assume that the paths are root-to-leaf.
We also show that it is NP-hard to find an optimal compact layout of a tree and give an algorithm that produces a compact layout with I/O complexity at most $OPT + 1$.
Definition 1 (Laid out tree). A laid out tree is an ordered triplet $T = (V, E, L)$, where $(V, E)$ is a rooted tree and $L : V \to \{0, 1, 2, \dots, |V|\}$ assigns to each vertex the memory block that it is in. We require that at most B vertices are assigned to any block. We treat block 0 specially as the block already in the cache.
We define $c_L(P) = |\{L(v) : v \in P\} \setminus \{0\}|$ to be the cost of a path P in a given layout L. We define $c(T, k)$, the worst-case I/O complexity given k free slots, as
\[
c(T, k) = \min_{L} \max_{P} c_L(P),
\]
where P ranges over all root-to-leaf paths and L over all layouts that assign at most k vertices to block 0. Since block 0 is assumed to be already in cache, accessing these vertices does not count towards the I/O complexity. We define $c(T)$, the worst-case I/O complexity of a tree T, to be $c(T, 0)$. This means that $c(T)$ is the maximum number of blocks on a root-to-leaf path in an optimal layout. We define a worst-case optimal layout of a tree T given k free memory slots as a layout attaining $c(T, k)$.
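The cost of a fixed layout can be evaluated directly from these definitions. The following is a minimal sketch, assuming the tree is given as a dictionary `children` mapping each vertex to a list of its children and the layout as a dictionary `layout` mapping each vertex to its block number; the function names are hypothetical and chosen only for this illustration.

```python
def path_cost(layout, path):
    """c_L(P): number of distinct blocks on the path, not counting block 0."""
    return len({layout[v] for v in path} - {0})


def root_to_leaf_paths(children, root):
    """Yield every root-to-leaf path as a list of vertices (iterative DFS)."""
    stack = [[root]]
    while stack:
        path = stack.pop()
        kids = children.get(path[-1], [])
        if not kids:
            yield path
        for u in kids:
            stack.append(path + [u])


def layout_cost(children, root, layout):
    """Worst-case cost of one given layout: max of c_L(P) over all root-to-leaf paths."""
    return max(path_cost(layout, p) for p in root_to_leaf_paths(children, root))


# Example tree: 0 is the root with children 1 and 2; vertex 1 has child 3.
children = {0: [1, 2], 1: [3], 2: [], 3: []}
no_free_slots = {0: 1, 1: 1, 2: 2, 3: 2}   # B = 2, nothing in block 0
two_free_slots = {0: 0, 1: 0, 2: 1, 3: 1}  # B = 2, two vertices in block 0
```

This evaluates a single layout only; $c(T, k)$ is the minimum of this quantity over all admissible layouts.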
We can observe that $c(T) \le 1 + \max_{u \in \delta(r_T)} c(T_u)$. It follows from the lemmas below that $c(T)$ depends only on the subtrees rooted in those children of $r_T$ that attain the maximum value of $c(T_u)$.
Proof. The function $c(T, k)$ is monotone (non-increasing) in k, since a layout given $k_1$ free slots is a valid layout given $k_2$ slots for $k_2 \ge k_1$. Moreover, $c(T, 0) = c(T, B) + 1$, since we can map the vertices in the root's block to block 0 instead. From this and the monotonicity, the lemma follows.
We define the deficit of a tree as $k(T) = \min\{k : c(T, k) < c(T, 0)\}$. Note that $k(T) \le B$. It follows from Lemma 4 that $c(T, k') = c(T, 0) = c(T, B) + 1$ for all $k' < k(T)$ and $c(T, k') = c(T, 0) - 1 = c(T, B)$ for $k(T) \le k' \le B$.
Lemma 5. For k ≥ 1, there is a worst-case optimal layout attaining $c(T, k)$ such that the root is in block 0.
Proof. Let L be a layout that does not assign block 0 to the root. If no vertex is mapped to block 0, we can move the root to block 0. Since block 0 does not count towards the I/O complexity, doing this can only improve the layout. Otherwise, let v be a vertex that is mapped to block 0. We construct a layout $L'$ such that …
Lemma 6. For k ≥ 1 and a tree $T_v$ rooted in v,
\[
c(T_v, k) = \min_{\{k_u\}} \max_{u \in \delta(v)} c(T_u, k_u),
\]
where the min is over all sequences $\{k_u\}$ such that $\sum_{u \in \delta(v)} k_u = k - 1$.
Proof. By Lemma 5, we may assume that an optimal layout attaining $c(T, k)$ for k ≥ 1 puts the root into block 0 and allocates the remaining k − 1 slots of block 0 to the root's subtrees, $k_u$ slots to the subtree $T_u$. On the other hand, from the values of $k_u$, we can construct a layout with cost $\max_{u \in \delta(v)} c(T_u, k_u)$.
Problem 1.
Input: Rooted tree T.
Output: Worst-case optimal memory layout of T.
Theorem 9. There is an algorithm which computes a worst-case optimal layout in time O(n). Moreover, this algorithm always outputs a convex layout.
Proof. We solve the problem using a recursive algorithm. For each vertex v, we compute $k(T_v)$ and $c(T_v)$. First, we define $d(T)$ and $c_{\max}(v)$. Using the values $k(T_u)$ and $c(T_u)$ calculated using the above recurrence, we reconstruct the worst-case optimal layout in a recursive manner. When laying out a subtree given k free slots, we check whether $k \ge d(T)$. If it is, we distribute the k − 1 empty slots (one is used for the root) in such a way that each subtree $T_v$ for $v \in M(r_T)$ gets at least $k(T_v)$ empty slots. Otherwise, we distribute them arbitrarily. We put the root of a subtree into a newly created block if the subtree gets 0 free slots. Otherwise, we put the root into the same block as its parent. It follows from the way we construct the solution that it is convex.
It follows from Lemmas 4 and 6 that $c(T, k) = c(T, 0) - 1$ if and only if k − 1 free slots can be allocated among the subtrees $T_u$, $u \in \delta(r_T)$, such that each subtree $T_u$ gets at least $k(T_u)$ of them. It can be easily proven by induction that the algorithm finds for each vertex the smallest number of free slots required to make the allocation possible and calculates the correct value of $c(T_v)$.
If the subtree sizes are computed beforehand, we spend $O(\deg(v))$ time in vertex v. By charging this time to the children, we show that the algorithm runs in linear time.
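The bottom-up computation of $c(T_v)$ and $k(T_v)$ can be sketched as follows. This is an illustration under stated assumptions, not necessarily the exact recurrence of the proof: it takes $c_{\max}(v)$ to be the largest value of $c(T_u)$ over the children u of v, takes `d` to be one slot for v plus the deficits of the children attaining $c_{\max}(v)$, and treats a leaf as having cost 1 and deficit 1. The function name `worst_case_io` and the `children` dictionary are hypothetical. The sketch computes the values only and does not reconstruct the layout.

```python
def worst_case_io(children, root, B):
    """Return dictionaries (c, k) with c[v] = c(T_v) and k[v] = k(T_v).

    Assumed recurrence (a reading consistent with Lemmas 4 to 6):
      leaf:           c = 1, k = 1
      inner vertex v: c_max = max of c[u] over children u,
                      d = 1 + sum of k[u] over children with c[u] == c_max,
                      if d <= B: c = c_max,     k = d
                      else:      c = c_max + 1, k = 1
    """
    # Preorder without recursion; reversing it visits children before parents.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    c, k = {}, {}
    for v in reversed(order):
        kids = children.get(v, [])
        if not kids:
            c[v], k[v] = 1, 1
            continue
        c_max = max(c[u] for u in kids)
        d = 1 + sum(k[u] for u in kids if c[u] == c_max)
        if d <= B:
            # v fits into the same block as all of its most expensive subtrees
            c[v], k[v] = c_max, d
        else:
            # v has to open a new block; one free slot already saves that block
            c[v], k[v] = c_max + 1, 1
    return c, k


# A path on 5 vertices and a star with 3 leaves.
path5 = {0: [1], 1: [2], 2: [3], 3: [4], 4: []}
star3 = {0: [1, 2, 3], 1: [], 2: [], 3: []}
```

Each vertex is handled in time proportional to its number of children, which matches the linear running time argued above.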
This algorithm can be easily modified to give a compact layout which ensures that the I/O complexity of walking on a root-to-leaf path is at most $c(T) + 1$. This is especially relevant since finding the worst-case optimal compact layout is NP-hard, as we show in Section 5.1. The algorithm can be modified to give a compact layout by changing the reconstruction phase such that we never give more than $|V(T_v)|$ free slots to the subtree of T rooted in v unless $k > |V(T)|$. Note that only the last block on a path can have unused slots. We can put the blocks which are not full consecutively in memory, ignoring the block boundaries. Any path goes through at most $c(T)$ blocks, out of which at most one is not aligned, which gives a total I/O complexity of $c(T) + 1$.
The following has been proven before in [9] and follows directly from Theorem 9.
Corollary 1. For any tree T, there is a convex partition of T which is worst-case optimal.
Proof. The corollary follows from Theorem 9, since the algorithm given in the proof is correct and always gives a convex solution.
Since the layout computed by the algorithm is always convex, we never reenter a block after leaving it. This means that c(T ) really is the worst-case I/O complexity.
Finally, we show how to construct a compact representation with similar properties. Note that we do not claim I/O optimality among all compact representations but only relative to the tree layout optimality as in Theorem 9.
Theorem 10. For a given tree T with q bits of extra data per vertex, there is a compact memory representation of T using O(nq) bits of memory requiring at most $OPT_L$ I/O operations for leaf-to-root paths in general trees and root-to-leaf paths in trees of bounded degree d. Here $OPT_L$ is the I/O complexity of the optimal layout from Theorem 9 when we set the vertex size to be $q + 2\log n$ for leaf-to-root paths, or $q + 2d\log n$ for root-to-leaf paths.
Proof. The theorem is an indirect corollary of Theorems 9 and 4. We set the vertex size as indicated in the theorem statement (depending on the desired direction of paths) and obtain an assignment of vertices to blocks by Theorem 9. We call the set of the blocks D. Note that for q = Ω(log n), this is already a compact representation.
For smaller q, we construct an auxiliary tree $T'$ on the blocks D representing their adjacency in T. We can assume that $T'$ is a tree due to the convexity of the blocks of D. We apply the separator decomposition to obtain an ordering R of $V(T')$ with a short offset edge representation (Lemma 1). Similarly, we can get an ordering for each block in D. We order the vertices of T according to R, ordering the vertices within blocks according to the orderings of the individual blocks. We obtain an ordering having an offset edge representation of total length $O(n \log q)$, as there are $O(n/B)$ edges going between blocks, with offset edge representations of total length $O(n \log B \log q / B)$, and edges within blocks, with offset edge representations of total length $O(n \log q)$.
We now apply Theorem 4 to the edge offsets, still split into memory blocks according to D, obtaining a bit-offset edge representation where the vertex representation of every block of D still fits within one memory block, as we have previously reserved $2\log n + \Theta(1)$ memory for every pointer and $w_i \le 1 + 2\log n$. We merge consecutive blocks whose vertices fit together into one block. This ensures that every block has at least B/2 vertices.
Hardness of worst-case optimal compact layouts
In this section, we prove that it is NP-hard to find a worst-case optimal compact layout (that is, the packing with minimum I/O complexity out of all compact layouts). We show this by reduction from the 3-partition problem, which is strongly NP-hard [15] (i.e. it is NP-hard even if all input numbers are written in unary).
Problem 2 (3-partition).
Input: Natural numbers $x_1, \dots, x_n$.
Output: Partition of $\{x_i\}_{i=1}^{n}$ into sets $Y_1, \dots, Y_{n/3}$ such that $\sum_{x \in Y_i} x = 3\left(\sum_{j=1}^{n} x_j\right)/n = S$ for each i.
Theorem 11. It is NP-hard to find a worst-case optimal compact layout of a given tree T .
Proof. We let B = S. We construct the following tree. It consists of a path $P = p_1 p_2 \cdots p_B$ of length B rooted in $p_1$. For each number $x_i$ from the 3-partition instance, we create a path of length $x_i$. We connect one of the end vertices of each of these paths to $p_B$.
Next, we prove the following claim: there is a compact layout of I/O complexity 2 if and only if the instance of 3-partition is a yes instance. We can easily get such a layout from a valid partition by putting the paths corresponding to the $x_i$'s in the same partition set into one memory block. For the other implication, we first prove that P is stored in one memory block. If it were not, we would visit at least two different memory blocks while traversing P, and there would be a root-to-leaf path that visits three memory blocks. If P is stored in one memory block, the I/O complexity of the tree is 2 if and only if the paths corresponding to the $x_i$'s can be distributed among the remaining blocks such that no path is stored in multiple memory blocks. There is such a distribution if and only if the instance of 3-partition is a yes instance.
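The tree used in the reduction can be built mechanically from a 3-partition instance. The sketch below assumes that the length of a path counts its vertices, so that the spine P has B vertices and the path for $x_i$ has $x_i$ vertices; the function name `build_reduction_tree` and the vertex numbering are hypothetical.

```python
def build_reduction_tree(xs):
    """Build the tree from the proof for the 3-partition instance xs.

    Returns (children, root, B), where children maps each vertex to the list
    of its children. Vertices 0 .. B-1 form the spine p_1 .. p_B.
    Assumption: a path of length x consists of x vertices.
    """
    n, total = len(xs), sum(xs)
    if n == 0 or n % 3 != 0 or (3 * total) % n != 0:
        raise ValueError("not a well-formed 3-partition instance")
    B = 3 * total // n  # block size B = S, the target sum of every part

    # Spine p_1 p_2 ... p_B rooted in p_1 (vertex 0).
    children = {i: [i + 1] for i in range(B - 1)}
    children[B - 1] = []

    # One path with x vertices per input number, attached below p_B.
    next_id = B
    for x in xs:
        prev = B - 1
        for _ in range(x):
            children[prev].append(next_id)
            children[next_id] = []
            prev = next_id
            next_id += 1
    return children, 0, B


children, root, B = build_reduction_tree([1, 2, 3, 2, 2, 2])
```

Under this assumption the tree has $S + \sum_i x_i = S(1 + n/3)$ vertices, so a compact layout fills exactly $n/3 + 1$ blocks completely.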
Further research
Finally, we propose several open problems and future research directions.
Experimental comparison of traditional graph layouts with the layouts presented in our work and layouts proposed in prior work could both direct and motivate further research in this area.
While we optimize the separable graph layout for random walks, it is conceivable that a minor modification would also match the worst-case performance of the previous results.
The worst-case performance of the algorithm for finding the bit-offsets in Section 3.2 is most likely not optimal, and we suspect that the practical performance would be much better.
For the sake of simplicity, both our and prior representations of trees assume fixed vertex size (e.g. implicitly in the results on layouts) or allow q = O(1) extra bits per vertex in the compact separable graph representation. This could be generalized for vertices of different sizes and unbounded degrees.
Bibliography
