Abstract-Modern microprocessors offer a rich memory hierarchy including various levels of cache and registers. Some of these memories (like main memory, L3 cache) are big but slow and shared among all cores. Others (registers, L1 cache) are fast and exclusively assigned to a single core but small. Only if the data accesses have a high locality, we can avoid excessive data transfers between the memory hierarchy. In this paper we consider fundamental algorithms like matrix multiplication, K-Means, Cholesky decomposition as well as the algorithm by Floyd and Warshall typically operating in two or three nested loops. We propose to traverse these loops whenever possible not in the canonical order but in an order defined by a space-filling curve. This traversal order dramatically improves data locality over a wide granularity allowing not only to efficiently support a cache of a single, known size (cache conscious) but also a hierarchy of various caches where the effective size available to our algorithms may even be unknown (cache oblivious). We propose a new space-filling curve called Fast Unrestricted (FUR) Hilbert with the following advantages: (1) we overcome the usual limitation to square-like grid sizes where the side-length is a power of 2 or 3. Instead, our approach allows arbitrary loop boundaries for all variables. (2) FUR-Hilbert is non-recursive with a guaranteed constant worst case time complexity per loop iteration (in contrast to O(log(gridsize)) for previous methods). (3) Our non-recursive approach makes the application of our cache-oblivious loops in any host algorithm as easy as conventional loops and facilitates automatic optimization by the compiler. (4) We demonstrate that crucial algorithms like Cholesky decomposition as well as the algorithm by Floyd and Warshall by can be efficiently supported. (5) Extensive experiments on runtime efficiency, cache usage and energy consumption demonstrate the profit of our approach. We believe that future compilers could translate nested loops into cache-oblivious loops either fully automatic or by a user-guided analysis of the data dependency.
CACHE-OBLIVIOUS ALGORITHMS
Countless algorithms from data analysis, basic math [1] , graph theory, etc. are formulated as two or three nested loops which process a larger collection of objects. Let us for instance consider the simple algorithm of matrix multiplication A := B · C determining the entries a i,j of A ∈ R n×m by the rule:
• C. Böhm, Ludwig-Maximilians-Universität München, Munich, Germany, boehm@ifi.lmu.de • M. Perdacher, C. Plant, University of Vienna, Vienna, Austria, {martin.perdacher, claudia.plant}@univie.ac.at; C. Plant, ds:UniVie
Since in C-like languages matrices are stored in a rowwise order, it is common practice to transpose C before computing the scalar product k b i,k · c T j,k to achieve a higher access locality: T are read into cache, and combined with the current row B i, * . Unless the complete matrix C T fits into cache, this cyclic access pattern leads to a failure of the cache mechanism: with strategies like LRU (Least Recently Used), every row of C T will be removed from cache before it can be re-used. As a consequence, we have a total of n transfers of the complete matrix C T from main memory to cache. We could make our algorithm cache-conscious [2] by an additional loop:
for I := 0 to n−1 stepsize s do for j := 0 to m−1 do for i := I to I+s−1 do a i,j := k b i,k ·c T j,k ; Provided that we have a single cache, large enough to store s rows of B and 1 row of C T , this strategy is dramatically better, because now we have to transfer C T from main memory to cache only n/s times while we still transfer matrix B once. Modern processors support a memory hierarchy involving 2-3 levels of cache (L1, L2, L3, ordered by decreasing speed and increasing size), as well as a set of registers which are even faster than L1 cache. The main memory is usually organized as a virtual memory. Apart from expensive swapping to hard disk or solid state disk (if the matrices B and C T do not fit entirely into the physical main memory) we have to consider a second locality issue: the translation of virtual into physical addresses is supported by a very small associative cache called translation look-aside buffer. Only for a small number of pages this translation is fully efficient. While we might be able to determine the pure hardware size of all these cache mechanisms for a given hardware configuration it is difficult to know (and subject to frequent changes) how much of the various caches is available for our matrices, and not occupied e.g. by other concurrent processes or the operating system. To efficiently support the complete hierarchy of memories of (effectively) unknown sizes, we need a different concept: a cache-oblivious algorithm [3] is not optimized for a single, known cache size. Its strategy supports a wide range of different cache sizes which can also be present simultaneously. The idea is to systematically interchange the increment of the variables i and j such that the locality of the accesses to both types of objects (i and j) is guaranteed. In Figure 1 we can recognize (a) the cyclic access pattern of nested loops, (b) the cache-oblivious access pattern of the Hilbert curve, (c) the histories of variable i and (d) j over time, and (e) the number of cache misses over varying cache size. We can see in Fig. 1(d) that the access pattern of the variable j yields much more locality for the Hilbert loops compared to the cyclic access pattern of the nested loops. The result (e) is a dramatically improved number of cache misses, particularly for realistic cache sizes like 5-20% of the main memory.
In this paper, we demonstrate how to parallelize various algorithms by assigning to each thread a contiguous part of the space-filling curve. We assume a modern multi-core processor where the shared memory is a centralized component and accesses to main memory are the main obstacle to achieve a high degree of parallelism. Improved locality of memory accesses lead to better utilization of the (in parts local) caches and increases the possible parallelism. Unlike existing work focusing on parallelism in a distributed environment (Hadoop, Spark, etc.), we focus on parallelism within a single multi-core computing nodes. Combining both levels of parallelism is an interesting topic for future work.
Contributions
We propose to replace nested loops enumerating pairs of (i, j) in canonical order by cache-oblivious loops following a space-filling curve. Well-known approaches like Hilbert-, Z-order-, or Peano-curve are too inefficient and limited for a loop iteration, as discussed in Section 2.1 and 2.2. Our approach called Fast Unrestricted (FUR) Hilbert Loop overcomes these problems: 1. We reduce the worst-case complexity per loop iteration from O(log n) to a constant complexity by making an algorithm based on a context-free grammar (Lindenmayer-System) non-recursive (cf. Section 3).
2.
We overcome the usual limitation of space-filling curves to grids of equal size lengths n × n where n is a power of two or three by a concept we call nano-programs. Nano-programs completely avoid any additional overhead (like discarding unnecessarily generated pairs) and instead even accelerate the loop generation, cf. Section 4. 3. We implement the FUR-Hilbert Loop as a preprocessor macro which makes it extremely convenient to be used as a building block in any host algorithm and facilitates compiler optimization, cf. Section 5. 4. We show that even algorithms, which operate only on the lower or upper tridiagonal like Cholesky decomposition profit a lot by our approach. Additionally we compare the canonical version of the all-pairs shortest path algorithm by Warshall with our Hilbert approach. 5. We present extensive experiments on runtime efficiency, cache usage and energy consumption. We demonstrate that our algorithms have a small cache usage footprint and are energy efficient. Our algorithms are highly competitive to state of the art techniques. The initial conference version of this paper introduced the FUR-Hilbert Loop [4] as a cache-oblivious approach for matrix multiplication and K-means clustering. In this paper, we extend this work in the way described by the final two contributions above and odd-size nanoprograms (cf. Section 4).
We demonstrate the superiority by applying FURHilbert Loops in four host algorithms: matrix multiplication, K-means clustering, Cholesky decomposition and the algorithm by Warhshall cf. Section 6. Motivated by the results of extensive experiments (cf. Section 7), we believe that future compilers could even integrate our concept and transform conventional loops into FURHilbert Loops based on an automatic or user-guided analysis of the data dependency.
WELL-KNOWN METHODS FOR HILBERT
To facilitate later a rigorous mathematical treatment we introduce the well-known approaches in this section following an automaton-theoretic point of view. Many iterative approaches to generate space-filling curves like [5] can be regarded as a deterministic finite automaton, and many recursive approaches like [6] as a context-free grammar.
Explicit Enumeration of Hilbert Values
The simplest way of generating a loop over the two variables i and j enumerating the pairs in Hilbert-order (or any other space-filling curve) is to iterate over all possible Hilbert values h and to apply the inverse Hilbert function H −1 (h). Let us for this section assume that both i and j iterate over the range 0 ≤ i, j < n where n is a power of two:
For matrix multiplication, we substitute our placeholder:
The inverse Hilbert function is depicted in Figure 2 as a deterministic finite automaton (DFA) of Mealy-type (i.e. the output, the bit-strings representing i and j are generated at the state transitions of the DFA). The bitstring representation of the Hilbert value h, divided into groups of 2 bits, is the input of the DFA. State 3 is used as starting state, to generate the Hilbert curve in a clockwise order. Alternatively, State 2 can also be used for start to generate an anti-clockwise curve. Appending a binary digit 1 to the output bit-string i (in symbols: i 1) represents the mathematical operation i := 2·i+1. With the example input h = 52 10 = 110100 2 , our DFA makes upon the first bit-pair 11 2 of the input string the first transition from State 3 to State 0, then for the second bit-pair 01 2 stays in State 0 and goes finally with bit-pair 00 to State 1. In these three state transitions it appends 1, 0, and 1 to i; 0, 1, and 1 to j to finally obtain (i, j) = (5, 3) in decimal system. Each of the log 2 n bitpairs of h is separately processed by the DFA. In contrast to this O(log n) overhead, a pair of two nested loops (cf. Section 1) has only a constant overhead per loop iteration (increment i and/or j). The overhead could be prohibitive for many applications. When the upper limits of the loops are different (0 ≤ i < n, 0 ≤ j < m) or do not correspond to a power of two then we have two options: either we generate a bigger Hilbert curve and suppress the processing of pairs (i, j) that are actually out of the range:
for h := 0 to 2 2· log 2 max{n,m} − 1 do (i, j) := H −1 (h); if i < n and j < m then process object pair (i, j);
or we generate a Hilbert curve of side length 2 log 2 min{n,m} and complement the missing values later in additional loops with smaller Hilbert curves [7] . Both options incur a high overhead and deteriorate the goal of cache-obliviousness.
Lindenmayer-Systems
For subsequent calls H −1 (h), H −1 (h + 1) it is likely that the bit-strings representing h and h + 1 have a long common prefix for which the DFA makes the same state transitions. Moreover, by the properties of the Hilbert curve it is guaranteed that the corresponding coordinates (i, j) generated by subsequent Hilbert values differ exactly by 1. An alternative way to define an iteration over all values of two variables in Hilbert order avoiding this unnecessary workload is the Lindenmayer system: Definition 1 (Lindenmayer-System for the Hilbert-Curve). Let A, B be the nonterminal symbols and ⊕, , , π the terminal symbols of a context-free grammar (CFG) involving the following production rules:
The terminal symbols represent the graphical operations:
turn 90 degrees to the left without moving, ⊕ turn 90 degrees to the right without moving, go forward one step in the current direction d, π process object pair (i, j), on a grid of size n × n where n is a power of two, oriented such that j is drawn from left to right and i from top down.
Direction:
The coding and semantics of d is as follows:
the next -step will do j := j − 1,
the next -step will do i := i − 1, d = 2 ⇔ look right: the next -step will do j := j + 1, d = 3 ⇔ look down: the next -step will do i := i + 1.
Axiom (Start Symbol):
We use either A with initialization d = 3 or B with initialization d = 2. Level of a rule expansion: The expansion of the axiom has level = log 2 n. If a nonterminal symbol appears on the right side of a production rule of level , its expansion has level − 1. The terminating productions A → π and B → π are applied exactly at level = 0.
While the Mealy-DFA of Section 2.1 generates only one (i, j)-pair, the Lindenmayer-CFG produces the whole n × n Hilbert curve when we start at level = log 2 n. Axiom A with d = 3 generates the values in a clockwise order starting from (i, j) = (0, 0) and ending at (n, 0), and axiom B with d = 2 in an anticlockwise order ending at (0, n).
According to the definition of d, the operations ⊕ and correspond to the cyclic increment/decrement of d:
To avoid expensive (pipeline-breaking) if-elseoperations, we use the following implementation of the forward-step:
The mod-operation preserves the sign. 
Analogously function B( ); the labels are not needed for the implementation but for the following analysis. Note the comments giving the corresponding terminal and nonterminal symbols from the context-free grammar. Although all the increase and decrease operations corresponding to the commands , ⊕, and have now constant complexity, the recursive implementation still has some drawbacks: Firstly, after the generation of 4 subsequent (i, j)-pairs, one incarnation of A( ) or B( ) is finished. We have to return to one of labelA 0..3 or labelB 0..3 , perform there the next actions and then start the next recursive calls. In summary, after every 4 k iterations we have to move up and down on the stack at least for k positions where 1 ≤ k ≤ log 2 n. This is still a logarithmic overhead per loop iteration in a worstcase-analysis (but since the worst case does not occur frequently it is indeed constant in the average-case when applying amortized analysis). Secondly, the problem of grids with different side lengths not corresponding to powers of two is still open in the recursive solution.
Thirdly, the recursive nature of the Lindenmayer system is an obstacle to the implementation and compileroptimization of the host algorithm (like matrix multiplication, etc.). The core of a host algorithm must be implemented twice in the terminating cases of the functions A( ) and B( ), which can communicate only via global variables. More importantly, the compiler has less options for optimization. In C-like languages, optimization is only done inside functions, not across function calls. Therefore, it is advisable to put some effort in making our Lindenmayer system non-recursive, as described in the following section.
NOVEL NON-RECURSIVE LINDENMAYER
The functions A( ) and B( ) are not straightforward to make iterative. In regular time intervals, we have to leave one or more recursive incarnations of A( ) or B( ), return to the middle of another A or B-function on the recursion stack, perform the next action of this incarnation then, and start again new recursive calls. This can be studied in Figure 3 , where we are in the middle of the generation of the Hilbert loop, at i = 5, j = 3, h = 52. At this point we have four active incarnations, one of production rule A, two of B, as well as the terminating rule A → π. The dark blue printed rule at = 3 = log 2 n generates the Hilbert curve of the whole (i, j)-grid (surrounded by a dark blue frame). We are currently in the last (lower left) sub-quadrant, marked by a light blue frame, and the corresponding position in production rule A is labelA 3 which is expanded in the next, light blue production rule. In a recursive implementation, we have labelA 3 on the stack. The production rule B( = 2) in turn is expanded at labelB 1 (next position on the recursion stack) corresponding to the green production rule (again B) and the green frame in the grid. When ending the green production rule, we have to return to labelB 1 of the light blue rule ( = 2), perform the corresponding action ( , between labelB 1 and labelB 2 ), and then make the next recursive call B( −1) at labelB 2 . The labels which are on the stack, labelA 0..3 , labelB 0..3 agree with the definitions in function A( ) and B( ). Where appropriate, we will also note e.g. labelX 0 , meaning: labelA 0 or labelB 0 .
Our idea to make the recursive algorithm iterative is to code the complete recursion stack in a single N 0 -variable (e.g. 64 bit, possibly a register of the CPU). We will demonstrate that we do not need a new variable for this purpose. Instead, the current Hilbert value h as well as the current direction code d contain all the information to derive both, the current recursion depth, as well as (for all recursive incarnations of the methods A and B) the current position where we have to return. We split the proof of this in two parts: Lemma 1 states that the number of each label on the recursion stack exactly corresponds to the Hilbert-value grouped in bit-pairs. This is already obvious from Figure 3 where the Hilbert value h = 52 = 11 01 00 2 = 310 4 noted in the 4-adic system is identical to the numbers on the stack: (labelA 3 , labelB 1 , labelB 0 ). Lemma 2 shows how to derive the information whether we are in A( ) or B( ).
Lemma 1 (Label-Number).
Consider an incarnation of the non-terminating rules:
The number pop( ) of processed object pairs (i, j) starting from the incarnation at level is pop( ) = 4 . (2) The incarnation overall increases h by forw( ) = 4 − 1. (3) For each Hilbert value h generated at labelX k , the value a := h/4 −1 mod 4 equals the label number k.
Proof: (1) At the bottom level = 0, the number of processed object pairs is pop( ) = 1. For ≥ 1, the number of expansions to π is multiplied by four for each level:
(2) At the bottom level, we have a number of forward steps forw( ) = 0. In each higher level, we have four times as many forward steps as in level − 1 plus additional three performed in the current grammar rule:
(3) At the beginning k = 0 of our rule, it is possible that rules of the same level have been processed before. If so, they must have been completely processed. According to (2) , each of these rules have increased h by forw( ) = 4 − 1, and together with the final from the parent rule, we have some multiple r · 4 , (∃r ∈ N 0 ). For k = 0 we have a = r · 4 /4 −1 mod 4 = (r · 4) mod 4 = 0. At position k = 1, compared to k = 0, we have increased h by 4 −1 , because we have applied one rule at level − 1 (cf. (2) again) and performed an additional -step. Thus, we have a = (r · 4 + 4 −1 )/4 −1 mod 4 = (4r + 1) mod 4 = 1. In the general case k ∈ {0, ..., 3} we do this k times:
Next we show how to decide according to h and d whether we are in grammar rule A or B, i.e. the letter of the labels.
Lemma 2 (Production Rule A or B).
(1) The direction code d is the same at the beginning and at the end of a production rule.
(2) At the beginning and end of the production rule A, the parity of the direction code d is always odd, at the beginning and end of B always even. Proof: (1) Proof by structural induction over the production rules: In the base clauses, A → π and B → π the direction code is not changed, thus (1) is trivially true. Induction step: Consider the production rule:
If the direction code is not changed in the expansion of the nonterminals on the right-hand side, then it is not changed in the application of A because the number of ⊕ is the same as the number of (two). The same is true for rule B.
(2) By structural induction: Base clauses: At the beginning of the first expansion of the axiom the statement is trivially true because according to Def. 1 we either start with A and initialize d = 3 (odd) or start with B and initialize d = 2 (even). Because of (1) this is also true at the end of the first (topmost) expansion of A or B. Induction step: Consider again
Before expanding B at labelA 0 we have applied . Note that both operators ⊕ and toggle the parity of d from even to odd and vice versa, because they add an odd number (1 or 3) to d (mod 4). Thus, if d is odd at the start of expansion A (induction hypothesis), it is even at the start of expansion B, odd again at A at labelA 1..2 , and even at B at labelA 3 . Analogously in the expansion of B: if d is even at the beginning (induction hypothesis), d is odd at every nonterminal A and even at every B. This is a complete analysis of all cases, and to summarize the induction step: if (2) is true for the nonterminal on the left side of a rule (at level ), it must be true for all nonterminals on the right side and thus for all left sides of the next level ( − 1).
(3) As a consequence of (2), we know the parity at the beginning of a rule expansion. Since the parity is switched at each ⊕ and , we can mark the parity of d for all positions:
We can see that the parity differs between A and B at every position. Therefore, the combination of position and parity decides the rule A or B in which we currently are. Now, Lemma 1 and 2 would enable us to exactly mimic the recursion with its stack. However, we would still have the logarithmic worst-case complexity. Instead, we further transform our algorithm into that in Figure 2 which truly performs a loop enumerating the Hilbert values h. In its body, we perform the operation π (process (i, j)) and, then decide where we would be in the recursive system and which action a must be executed. An action takes always place between two successive recursive calls and has the label of the latter (e.g. a = 2 is between labelX 1 and labelX 2 ), and corresponds always to a forward-step, potentially preceded or followed by a ⊕ or -step. E.g. in production rule A at a = 1 we perform ⊕ , and at a = 3 we perform ⊕ etc. As there is no forward-step before labelX 0 and after labelX 3 we only consider the three actions a ∈ {1, 2, 3}, but not 0. We will see later how to cope with the ⊕ and at the beginning and end of the rules.
To obtain the right action code, we first increase h.
Algorithm 2
The Non-recursive Lindenmayer Alg. 
If we are at the end of one or more production rules on the stack, then increasing h will change one or more 11 2 -bit-pairs at the end of h into 00 2 . (cf. Lemma 1). In this case, the first bit-pair (from right to left) = 00 2 defines the level and the action a to be performed. Therefore, we determine (after increasing h) the number of 00 2 -bit-pairs at the end in constant time by the following trick: If the bitwise and-operation is applied to h and its negative complement −h, then in the result all bits are 0 up to one exception: The last (least significant) 1 which has been set in h is still set (the result is the largest power of two which divides h with no rest). The binary logarithm 1 2 log 2 (h and bitw −h) corresponds to the number of zero-pairs at the end of h and equals − 1. We determine the binary logarithm by casting the result of the bitwise and to a double-precision floating point number and extracting the exponent, which is very efficient and works for 1 ≤ h ≤ 2 52 − 1, the greatest natural number that can be represented by doubleprecision floating point numbers (according to IEEE-754) at no loss of precision. The action code then is extracted from h using
Finally, we perform the action in a by modifying d, in-/decreasing i or j, and modifying d again. The two modifications of d subsume the different ⊕ andoperations which are defined between two labels. Let us first assume we are at level = 1, i.e. we do not have to consider additional ⊕, -operations from starting or ending recursive calls. As we know that even and odd direction codes can only be present at certain positions of A and B (e.g. at labelA 0 , d is even, at labelB 0 , d is odd, cf. Lemma 2), we can construct the following table for the result d new of the ⊕ or operation before and after the forward step : 
Overall we can express the action before and after by the bitwise logical operations shown in the algorithm in Figure 2 where (as in C-like languages usually) results of boolean operations (isOdd, "=", xor) are represented by 1 or 0 and can thus e.g. be multiplied with other values. The operations "... + 1" in Line 6 and "... − 1" in Line 7,8,11 can be omitted or are removed by the optimizer. They are included in the pseudo-code for equivalence with the functions A( ), B( ).
Lemma 3 (Complexity of LindenmayerNonRecursive).
The worst-case time complexity of our algorithm is constant per loop iteration.
Proof: The while-loop (lines 5-11) of our algorithm contains only elementary operations (+, and, mod, etc.). The number of these operations is constant (24).
NANO-PROGRAMS
We back up our nonrecursive implementation of cacheoblivious loops by a concept called nano-programs. Nano-programs are small pieces of pre-computed spacefilling curves for grid sizes r×s = 2×2, 2×3, 2×4, 3×4, or 4×4 (see Figure 4 ). Degenerate grids with size 1 × {1, 2, 3, 4} (single loop) and 0 × {0, 1, 2, 3, 4} (empty loop) are also possible but are used only if either i or j overall has these degenerate limits. Nano-programs serve a two-fold purpose: They further decrease the overhead compared to the method proposed in the previous section because the management of helping variables like the direction d is simplified for the pre-computed small grids. The overall number of basic operations (additions, multiplications etc.) per loop iteration is thus reduced from 24 to 9. The second and more important purpose of our nanoprograms is to enable grid-sizes which do not precisely correspond to a power of two, because as we show in the following, it is possible to tesselate a grid of any size n × m by a number of sub-grids each having size r × s = {2, 3, 4} × {2, 3, 4}. Let us start with the case that n and m are possibly different but in the same power of two: t := log 2 n = log 2 m . We can partition the n × m grid into a number 2 t−1 × 2 t−1 of sub-grids where the height r is always an integer close to the average sub-grid heightr = n/2 log 2 n −1 . The following equation shows that r is always between 2 and 4:
Sub-grids of height r = 4 can also occur: e.g. if n/2 t−1 = 3.5 then half of the sub-grids are of height 3 and 4. Analogously the width s: 2 ≤ s = m/2 t−1 < 4. The sub-grids all have size {2, 3, 4} × {2, 3, 4}.
A nano-program is a bit-sequence which can be stored in an integer variable P which may be assigned to a register by the compiler while working with it. The bitsequence contains codes similar to the direction-codes stored in the variable d (cf. Section 2.2). The nanoprograms read from right to left, and therefore, the first operation to be performed on the variables i and j can be extracted Figure 4 : we can see the graphical access patterns for 2 × 2...4 × 4 grids and the corresponding nano-programs. Note that these programs read from right to left, and they are here noted in a 4-ary system. The first movement of the 2×2 pattern (step down) is coded by the tailing digit 3 of the nanoprogram 123 4 = 27 10 . The pseudo-code is embedded in the algorithm LindenmayerNonRecursive() and all processing of nano-programs highlighted in Algorithm 3. The size (r × s) of each cell of the nano-program is determined in Line 3 such that the sub-grid heights r = 2, 3, and 4 are evenly distributed to sum up to n (analogously for the widths s). After a complete nanoprogram consisting of r · s − 1 steps has been processed, a final movement (lines 11-16) is performed to connect the previous sub-grid cell to the subsequent one. 
Odd-sized Cells of Nano-programs
As depicted in Figure 4 , all nano-programs for 2 × 2...4 × 4 grids exist. However, for the non-square sub-grids (2 × 3, 2 × 4, and 3 × 4) we actually need two versions: those beginning and ending at the longer side and those beginning and ending at the shorter side. For the 2 × 4 nano-programs, both versions exist: in addition to the nano-program 1232123 4 beginning and ending at the longer side, we can also define the 4 × 2 nano-program 1112333 4 (read from right to left cf. Fig. 4 ) beginning and ending at the shorter side. In contrast, a 2 × 3 sub-grid beginning and ending at the longer side would require a diagonal transition (being less local and requiring more than two bits). Although odd-sized nano-programs are necessary if we want to support a global grid size with one or both side lengths being odd we can completely avoid the non-existing nano-programs if we carefully plan where to place the 3 × {2, 3, 4} nano-programs. Our key idea is, at most places to allow only even-sized nanoprograms. Every grid where both side lengths are even can be completely tesselated with only even-sized nanoprograms. If one side length of the grid is even, and the other is odd, we can place all 3 × 2 and 3 × 4 nanoprograms at that side of the grid opposite to the starting and ending point, as depicted in Figure 5 . We have to control the traversal order of the grid (by specifying the initial direction d = 3 or d = 2) to make sure that the even-sized side length is opposite to the global starting and ending point ( Figure 5, middle) .
As depicted in Figure 5 , right side, the situation is more tricky if both side lengths are odd. There, we have to place a column of 3 × 2 nano-programs at the right side and a single 3 × 3 nano-program at either of the ends of this column (in Figure 5 , on the upper right corner). The Hilbert-subcurves at the upper and lower row are, unfortunately, not oriented such that the remaining 3 × 2 nano-programs can be positioned in a single row. The drawn layout with sub-curves oriented towards the middle is generated by bitwise logic operations.
Severely Asymmetric Grids
If log 2 n < log 2 m like in Figure 6 where log 2 n = 2 and log 2 m = 3 we put a number m = m/2 log 2 n of independent curves side by side, where the first has width m − log 2 n · (m − 1) and the remaining have width log 2 n . All the curves are anticlockwise (initial d = 2). With this setting it is guaranteed that the nanoprograms of size 3 × {2, 3, 4} can be placed exactly such that no diagonal transitions are needed. If log 2 m > log 2 n we analogously put n = n/2 log 2 m clockwise curves (initial d = 3) one above the other. Finally we note that starting with lower bounds for (i, j) different from (0, 0) is also straightforward possible (cf. Figure 6 where we start at (i, j) = (2, 0)) at no extra cost per loop iteration. Degenerate loops where one or both variables iterate over one value only or no values at all (empty loops) are also considered, but these extensions are left out in our pseudo-code for clarity and space restrictions.
OVERALL ARCHITECTURE
Having made our code to generate loop iterations completely non-recursive and having removed all restrictions on loop boundaries for i and j, we will from now on note the cache-oblivious loops following the FUR-Hilbert curve in our pseudo-codes as follows:
FurHilbertFor (i, j)∈{imin,..., imax−1}×{jmin,..., jmax−1} do process object pair (i, j); This is very analogously possible in the source-code for C, C++, etc. where we define preprocessor macros:
of which the first contains Alg. 3, Line 1-5, the part of the code before the placeholder process object pair (i, j), and the second contains Line 7-19, including all extensions described in Section 4. The C++-file can be downloaded from https://informatik.univie.ac.at/dm/downloads/.
The application of these preprocessor macros is almost as convenient as the application of standard loops. FURHilbert loops can be nested with other loops and with each other. The implementor of the host algorithm can choose freely the name and type of the iterator variables (for all types allowing to apply the operator "+") and parenthesize the loop body as usual with "{ " and " }". In this case editors automatically apply appropriate indentation. The following lines generate the FUR-Hilbert curve in Figure 6: int i, j; FurHilbertFor (i, j, 2, 7, 0, 13) { printf("%d %d\n", i, j); } FurHilbertEnd (i, j);
This architecture not only makes our host algorithm clear and well structured but it also has important performance benefits. The whole host algorithm can be implemented in a single method and can apply local variables for the management of our loops as well as for all information needed in the host algorithm inside and outside of the cache-oblivious loop. These local variables can be assigned to registers by the compiler or upon user request (e.g. by the keyword register). Various optimizations including extraction of loop invariants and loop unrolling can be made fully automatic by the compiler. These options are all unavailable in a recursive implementation.
CACHE-OBLIVIOUS LOOPS: APPLICATIONS
On top of FUR-Hilbert loops, we implemented a number of algorithms described here (matrix multiplication, Kmeans clustering, Cholesky decomposition, and FloydWarshall) as well as a number of further algorithms from data mining, linear algebra, database systems, and other fields. In these algorithms, we additionally parallelized the FUR-Hilbert loops with Open-MP, marking these parallelized loops with FurHilbertFor * , and the innermost loops with SIMD-parallelism using AVX2 (Advanced Vector Extensions) with comment "// SIMD."
Matrix Multiplication
The matrix multiplication is vastly used in software implementations and has numerous applications [8] . The basic algorithm from our introduction to multiply B ∈ R n×p and C ∈ R p×m (stored as C T , additionally each row aligned to cache lines) can be straightforward implemented using a FUR-Hilbert loop, because it has no general data dependencies. If p is too large (for a cache-size of 32K say p s := 1024), that at least a few of the rows of B and C T fit into L1 cache, it is necessary to decompose the matrices horizontally into groups of s (divisible by 4 to ensure alignment with cache lines) columns before applying the FUR-Hilbert loop:
K-means Clustering
K-means, the most popular clustering algorithm, is like the related EM and K-medoid methods implemented in a loop until convergence alternately performing assignment and update steps. The expensive assignment step determines for each point x i the distance to each cluster representative µ j (0≤j<k) and assigns it to that cluster ID having minimal distance. We must keep track of the winner distance and the corresponding cluster ID for each point. This can be facilitated in a SIMD-parallel way by backpacking [9] the cluster ID in the least significant bits of the distance, noted dist, cID :
Cholesky Decomposition
Having numerous applications in data mining and simulation, Cholesky decomposition is an algorithm that factorizes a symmetric, positive definite matrix A ∈ R n×n into a left triangular matrix L such that A = LL T . It determines the entries i,j as follows:
In the computation of element i,j we read the whole matrix row j, * . This data dependency must be considered when changing the loop order of (i, j) either by space-filling curves or by parallelism. We achieve this by decomposing the lower left triangle matrix L into square blocks of a side length β being a power of two. The largest block of side length β = 2
is placed in the lower left corner of L starting from row i min = β (cf. Figure 7(a) ; if the overall dimension n of the matrix is not a power of two, then the blocks at the bottom lines are suitably trimmed and become a non-square β × γ rectangle). Recursively, to each placed square of side length β we connect one with side length β 2 on the upper left corner and on the lower right corner until the triangle is completely tesselated when we reach at β=1 or at some other defined basic resolution (for SIMD parallelism using AVX 1 or 2 we use 4 × 4 squares as basic elements, which degenerate to triangles at the diagonal of the matrix). Inside each block it is guaranteed that the data dependency plays no role since always i min > j max . We can apply FUR-Hilbert loops as well as SIMD and MIMD parallelism inside such a block but have to make sure that the blocks are ordered sequentially top-down and within the same row from left to right. The recursive block-generation is made iterative in the following pseudo-code which also considers the case of a matrix dimension n being not equal to a power of two: 0,0 := a 0,0 ; for α := 1 to n − 1 do β := α and bitw −α; γ := min{α + β, n} − 1;
The Algorithms by Floyd and Warshall
The Algorithm by Warshall [10] finds connected components (the transitive closure of the boolean adjacency matrix A ∈ B n×n ) in a directed or undirected graph. The standard algorithm uses three nested loops, resulting in an O(n 3 ) algorithm:
for i := 0 to n − 1 do for j := 0 to n − 1 do if a j,i then for k := 0 to n − 1 do if a i,k then a j,k := true;
Although the data dependencies are actually similar as in our previous example, Cholesky decomposition, it is not possible to restrict the operations to one half of the matrix only. The upper and lower triangle can then be decomposed in larger blocks similar to Cholesky, which are subject to MIMD-parallelism and traversed in an order based on space-filling curves. The operation in the innermost loop can be transformed into a logical ORoperation, which can be executed by SIMD-parallel operations. This is particularly attractive since AVX allows us to perform up to 256 such operations simultaneously on a single core.
for α := 1 to n − 1 do β := α and bitw −α; γ := min{α + β, n} − 1; FurHilbertFor * (i, j) ∈ {α, ..., γ}×{α−β, ..., α−1} do if a j,i then ∀k: a j,k := a j,k ∨ a i,k ; // SIMD for * j := α to γ do if a j,α then ∀k: a j,k := a j,k ∨ a α,k ; // SIMD The overall traversal scheme of the two outer loops (i and j) is depicted in Figure 7 (b). The algorithm by Floyd operates on a weighted matrix A ∈ R n×n and uses the same algorithmic pattern like Warshall. The line: if a j,i then ∀k: a j,k := a j,k ∨ a i,k ; // SIMD is replaced by:
∀k: a j,k := min{a j,k , a j,i + a i,k } // SIMD.
EXPERIMENTAL EVALUATION
Experiments have been performed on Intel Xeon E5-2680v3 CPU (2.5GHz, 12 cores) with 256GB RAM and Debian GNU/Linux 8 (jessie) as operating system. Each core is associated with 64 KB of L1 and 256 KB of L2 cache. The last level cache (LLC) is the shared L3 cache with a size of 30 MB. All measurements are averaged over 20 runs using double precision arithmetics.
Matrix Multiplication
Our algorithm FUR-Hilbert, as discussed in Section 5, is implemented in C++ and compiled with gcc version 4.9.2. We compare our algorithm to the algorithm "TifaMMy" for matrix multiplication based on the Peano Curve introduced by Bader et al. [11] , [12] (source code has been obtained by the authors compiled with icc version 16.0.3). Furthermore we compare our algorithm to the specifically for Intel processors hardware-and hand-optimized Intel MKL library (https://software.intel.com/en-us/intel-mkl) version 11.3 (operation: dgemm). Popular frameworks like Apache Spark, Pythons NumPy and SciPy can be accelerated by Intel MKL. As a baseline, we also compare to the classical matrix multiplication coded in c++ transposing the input matrix for improved cache locality. The code has been auto-vectorized using the GNU C++ compiler. Figure 8 (left) displays the runtime in seconds varying the size of the input matrices from 1 000 to 14 000. For larger problem sizes, the highly optimized MKL library slightly outperforms FUR-Hilbert. It processes the largest matrix in 17.50 seconds. Our algorithm needs 19.15 seconds and is at least 30% faster than the approach by Bader et al. (Peano). Peano requires 13.15 seconds. Our approach is more than 10 times faster than autovectorization which needs 1.7 minutes. Figure 8 (right) displays th runtime varying the number of threads. Two matrices with 10 000 elements each are processed by all techniques. All methods profit a lot from multithreading. Auto-vectorization and Peano show almost linear speed-up albeit at a low level of overall performance. FUR-Hilbert shows similar speed-up characteristics as MKL-BLAS.
Matrix Multiplication on a Manycore-System
The experiments have been performed on Intel Xeon Phi 7210 (KNL), with a processor base frequency of 1.3 GHz and 64 cores. We are using the memory mode "Cache Mode" and the cluster mode "All2All". Each core is associated with a 32 KB L1 data cache and shares 1 MB of L2 cache together with another core on the same tile. In the "Cache Mode" the MCDRAM behave as a memory-side direct mapped cache in front of DDR4, which can be seen as a high bandwidth/high capacity L3 cache. Figure 9 (left) demonstrates the runtime spent for various problem sizes. Our algorithm outperforms MKL-BLAS for matrix sizes smaller than 9 000. For the largest problem size of 15 360 MKL-BLAS is around 14% faster and needs 5.87 seconds, whereas our approach needs 6.86 seconds. Nevertheless, our approach is 3.73 times faster than Peano ("TifaMMy"), which has a runtime of 24.96 seconds. Figure 9 (right) illustrates the runtime needed for different number of threads used. The matrix size is 7 680 and our approach takes only 0.918 seconds. This is 76% faster than MKL-BLAS, which needs 1.62 seconds, whereas "TifaMMy" needs 3.47 seconds and the auto-vectorized approach needs 15.14 seconds.
Cache hierachy on Matrix Multiplication
The access time to memory is one of the bottlenecks for CPU core performance. Todays hierarchical cache structure reduces latency and hence speeds up the CPU clock. Here, we examine the cache hit footprint of our algorithm for the matrix multiplication. We are using Intel Vtune Amplifier XE 2017 to explore the cache access pattern of all algorithms among the L1, L2, and L3 hierarchy and calculate the cache hit rate for the L1 cache as:
L1 HIT L1 HIT+L1 MISS . Furthermore, we use "perf" version 3.16.7-ckt20 https://perf.wiki.kernel.org/ and the event cache-misses:u to detect a cache miss among the whole cache hierarchy. The :u tail excludes the kernel space and measures only the user space of the algorithms. The cache hit rate for "perf" is calculated as: 1 − cache-misses:u cache-references:u . We use the maximum number of threads (12) for the variation in problem size and matrices of size 10 000 are processed for the variation of threads. Figure 10 illustrates the cache hit rate for each cache level respectively and the cache hit rate for the entire cache hierachy. For the L1 cache-hit-rate (top left) the hardware optimized MKL-BLAS is the most cacheefficient algorithm. Our algorithm performs well for small sizes and as well as Peano for larger sizes. All of the algorithms reached at least 97.5% of the L1 cache hit rate. For the L2 cache hit rate our algorithm performs best and remains within a lower bound of 94% whereas Peano and MKL-BLAS have lower bounds of 87% and 72%. The last level cache (LLC), depicted as level 3 cache at the bottom left in Figure 10 , is the slowest but largest in the cache hierarchy. Our algorithm shows good performance and uses most of the LLC for large matrix sizes and for the maximum number of threads, whereas MKL-BLAS drops down in cache hit rates. The Peano algorithm performs relatively equal to our algorithm, but shows performance decrease for large problem sizes. We also captured the access pattern of memory access which could be served by any of the cache levels in Figure 10 (bottom right). For the total cache hit rate our algorithm competes with MKL-BLAS for sizes between 2 000 and 12 000 but drops down at both tails for matrix sizes of 1 000 and 14 000. Our algorithm clearly outperforms the Peano algorithm in both, variation of matrix sizes and thread counts.
K-means
Here, we extended our K-means implementation [9] with the Hilbert curve. We use the same comparison methods as for matrix multiplication but exclude the Peano-curve based algorithm by Bader et al. [11] , [13] since this approach is not designed to support K-means and is outperformed by MKL-BLAS on the task of matrix multiplication. The MKL library also does not include K-means, however it can be used to speed up the distance calculations. The MKL-based technique computes the scalar products between objects and cluster centers by matrix multiplication. As the runtime of Kmeans strongly depends on the number of iterations, we compare it for a fixed number of 5 iterations. As a basic setting, we consider a 20-dimensional data set with 1 048 576 data objects and 24 000 clusters. A high number of clusters causes massive runtime but is practically relevant, e.g., for the coarse quantization step of the product quantization indexing technique [14] . Figure 11 (left) displays the runtime when varying the number of objects. Our technique processes 1 million objects in less than 8 seconds while auto-vectorization needs 2.88 minutes for 5 iterations of K-means. The next sub-figure varies the number of dimensions. The rightmost sub-figures display the speed-up which is similar to matrix multiplication.
We have also tested the hardware-optimized library DAAL (Intel Data Analytic Acceleration Library, https://software.intel.com/en-us/intel-daal), but DAAL does not compete in this setting. DAAL is a library of optimized algorithmic building blocks for data analysis and solves problems which are associated with "Big Data". This includs regression, classification or related problems as well as clustering problems like K-means. We have been using DAAL with the current version of Intel Parallel Studio XE (version 2017 update 3). Unfortunately DAAL cannot handle settings with a high number of clusters, where K > 2 050. Even for K = 1 000 or K = 2 000 it performs worse than our auto-vectorization approach and needs 49.42 and 97.74 seconds for completion. We had been running DAAL in the batch processing mode, with the same settings which have been used for our algorithms.
Cholesky Decomposition
All implementations of the Cholesky decomposition take a positive-definite matrix A and apply the decomposition of the form A = LL T in double precision. Our algorithm is implemented using C++ (compiled with gcc version 6.4.0 and OpenMp 4.5). We compare our algorihtm to the hardware optimized library LAPACK implemented in the Intel Math Kernel Library (MKL version 17.0 update 4, see https://software.intel.com/en-us/mkl/features/linearalgebra) and to the Parallel Linear Algebra Software for Multicore Architectures (PLASMA version 2.8.0) library https://bitbucket.org/icl/plasma. In contrast to LAPACK, which relies on BLAS level 2 calls, PLASMA uses BLAS level 3 calls, especially suitable for algorithms like Cholesky decomposition [15] . As a baseline we also compare our algorithm to a classical implementation of the Cholesky decomposition, which has been automatically vectorized using the gnu gcc compiler (Auto-vect.). Figure 12 (left) displays the runtime in seconds on different dimensions of the input matrix varying from 1 000 to 12 000. For the largest matrix of 12 000, the hardware optimized MKL-LAPACK algorithm shows the best performance with 2.04 seconds whereas our algorithm FUR-Hilbert needs 3.66 seconds. However, our algorithm is at least 20% faster than the PLASMA library, which needs 7.18 seconds and 17 times faster than the autovectorization taking 123.64 seconds to completion. Figure 12 (right) displays the runtime varying the number of threads. Each algorithm processes a matrix of size 5 000. All algorithms profit from multithreading. Here again, for the highest number of threads the MKL-LAPACK algorithm shows the best performance with 0.12 seconds. Our algorithm needs 0.39 seconds and for this setting we are 6 times faster than the PLASMA library, which needs 2.53 seconds and 10 times faster than auto-vectorization which needs 4.12 seconds.
Algorithm by Floyd and Warshall
For the experiments on the algorithm of Warshall (described in Section 6.4) we have generated a graph with 3 densely connected subsets of nodes (clusters). Two nodes within a cluster are connected with a probability of 1% and two nodes in separate clusters are not connected, so we have 3 connected components. In Figure 13 we compare our packed FUR-Hilbert approach to the canonical approach. For a number of 40 000 nodes our approach is 1.81 times faster than the canonical implementation, where our algorithms needs 6.18 seconds and the canonical implementation 11.19 seconds. The same speedup of 1.81 remains for the variation in the number of cores used, where our algorithm spends 5.33 seconds and the canonical implementation 9.62 seconds.
Energy Efficiency
Energy is an important resource in all kinds of computing systems and critical in case of cost and availability. We measure the energy consumption of our Intel Xeon E5 with the power meter "Power HiTester 3334", which supports power integration measurements. The power meter has an accuracy of ±0.1%. We are using the serial port (RS-232) and our own C API to communicate with the power meter, which allows us to measure the algorithms power consumption without any initialization or finalization overhead.
Our algorithm is the most energy efficient one for the matrix multiplication, as illustrated in Figure 14 (left). For varying matrix sizes our algorithm slightly outperforms the other algorithms with one exception, where the problem size is 10 000. For other problem sizes the gain over MKL-BLAS is between 2% and 11% and over Peano ("TifaMMy") between 16% and a factor of 160%. We are up to 5 times more energy efficient than auto-vectorization, but we left it out in Figure 14 for clarity of presentation. Figure 14 (middle) displays the energy efficiency of the cholesky decomposition. Here, MKL-LAPACK is clearly the most energy efficient algorithm. Nevertheless, our algorithm is between 2 and 5 times more energy efficient than PLASMA. We are approximately 7 times more energy efficient than auto-vectorization.
Our K-means algorithm outperforms MKL-BLAS in runtime and this is also true for the energy consumption. For the largest problem size of 1 048 576 our algorithm consumes 0.726 Wh, whereas our BLAS implementation uses 1.923 Wh. Furthermore the energy consumption of our auto-vectorization takes 13.882 Wh. BLAS consumes a factor of 2.64 more and the auto-vectorized algorithm consumes even a factor of 19.12 more than our algorithm.
RELATED WORK AND DISCUSSION

Cache-oblivious Algorithms
The notion of cache-obliviousness has been first introduced by Frigo et al. in 1999 [3] . A cache oblivious algorithm performs well on any type of multi-level memory hierarchy without knowing the structure and the parameters of the hierarchy, e.g. cache and memory size, transfer block size, and bandwidth.
Cache-oblivious algorithms [3] have attracted considerable attention as they are portable to almost all environments and architectures. Algorithms and data structures for basic tasks like sorting, searching, matrix multiplication as well as for specialized tasks like ray tracing [16] or homology search in bioinformatics [17] have been proposed. The two fundamental design patterns of cache-oblivious algorithms are localized memory access and divide-and-conquer. Space-filling curves (SFC) integrate both ideas. A SFC defines an 1D ordering of the points of an n-dimensional space such that each point is visited once. Probably the most widely used SFC is the Z-order due to its simple recursive scheme. The Hilbert and the Peano SFC provide better locality properties at the expense of more computational overhead. In order to theoretically formalize cache-obliviousness, Figo et al. introduced the assumption of an ideal cache which is characterized by ideal page replacement and full associativity. The authors prove that algorithms designed with this idealized setting in mind only degrade in performance by a constant factor in realistic settings such as LRU replacement strategy and set-associative caches. Most related to our work, Bader et al. proposed to use the Peano curve for matrix multiplication and LU-decomposition [11] , [13] . Their algorithms process the input matrices in a block-wise and recursive fashion where the Peano curve guides the processing order We introduce a Hilbert curve supporting arbitrary problem sizes which is of general interest also for other applications. We propose a highly efficient iterative algorithm to compute the Hilbert values on the fly. Most existing non-recursive approaches are based on lookup tables causing memory overhead, e.g., [5] . For determination of a single Hilbert value, most iterative techniques are linear in the resolution which corresponds to the number of iterations in the loop, thus causing substantial overhead. In contrast, our solution derecursivates the Lindenmayer-System [18] and introduces compact nanoprograms fitting into registers for the traversal of small patches. Our approach processes a single pair of indices i, j in constant time, iterates through the loop in linear time and therefore causes negligible overhead. Recursive approaches, e.g., [6] , [18] are associated with logarithmic worst-case time complexity for processing a single index pair and are not suitable to support comfortable loop programming, see also Section 1.
Optimized Techniques for Specific Tasks or Hardware
In [12] , [19] Bader et al. present variants of their algorithms for matrix operations. As in [11] , [13] , the general algorithmic scheme is recursive partitioning according to the Peano curve. To tailor the algorithm to the properties of specific hardware, small matrix blocks are processed with optimized assembler code. For every microarchitecture, comprehensive adjustments are required to reach performance competitive with optimized libraries like BLAS [1] , LAPACK [20] , DAAL or MKL. The library BLAS (Basic Linear Algebra Subprograms) provides basic linear algebra operations together with programming interfaces to C and Fortran. Specific implementations for various infrastructures are available, e.g. ACML for for AMD Opteron processors or CUBLAS for NVIDIA GPUs. The Math Kernel Library (MKL) contains highly vectorized math processing routines for Intel processors.
In contrast to our work, these implementations are very hardware-specific and mostly vendor-optimized. Moreover, they are designed to efficiently support specific linear algebra operations while we are aiming at supporting loop processing in general. Our experiments demonstrate that our cache-oblivious approach reaches a performance close to BLAS on the task of matrix multiplication and outperforms BLAS when applied to support K-means clustering.
As K-means probably is the most wide-spread clustering algorithm other highly optimized techniques for specific hardware have been proposed, e.g. for mobile devices [21] , GPUs [22] or computing clusters [23] . Comparison to such specialized techniques is out of the scope of this work as K-means clustering only serves an exemplary host algorithm.
Energy Efficiency on Data Movement
The energy cost of data movement from memory to registers has been identified as one of the limiting factors for the development of efficient and sustainable exascale systems. The cost, in terms of energy, is two orders of magnitude higher than the cost of computing a double-precision register-to-register floating point operation [24] .
In general a L3 cache miss is approximately three times more expensive than a L2 cache miss and a L2 cache miss is approximately three times more expensive than a L1 cache miss. In Figure 10 we have examined the cache footprint of the matrix multiplication, where our algorihm avoids expensive cache misses most effectively. For the matrix multiplication, we observe that our algorithm is slightly behind MKL-BLAS (cf. Figure 8 ) in cases of runtime, but slighthly ahead in cases of energyefficiency (cf. Figure 14) . We believe, that this is due to our efficient cache access pattern induced by our Hilbert curve. Figure 10 shows that our approach has a better cache hit rate for L2 and L3 caches.
CONCLUSION
We introduced Fast Unrestricted (FUR-) Hilbert, a new space-filling curve with the property that every arbitrary n×m rectangle can be filled by making only axis-parallel, single-step moves in a recursively bisected data space.
The original Hilbert-curve has the analogous property but is restricted to squares (n = m) where the side length n must be a power of two. In addition, our algorithm to generate the FUR-Hilbert curve is highly efficient because it is non-recursive and has a constant worst-case complexity per generated pair of coordinates in contrast to O(log n) for previous methods. These two advantages make it particularly attractive to replace pairs of nested loops in important host algorithms (matrix multiplication, K-Means clustering, Cholesky decomposition, Floyed-Warshall etc.) by our FUR-Hilbert curve which leads to cache-oblivious accesses of the corresponding objects. For future work, we plan to investigate if our cache-oblivious loops can accelerate algorithms in a distributed environment.
