Abstract-Dilated integers form an ordered group of the Cartesian indices into a d-dimensional array represented in the Morton order. Efficient implementations of its operations can be found elsewhere. This paper offers efficient casting (type)conversions to and from an ordinary integer representation. As the Morton order representation for 2D and 3D arrays attracts more users because of its excellent block locality, the efficiency of these conversions becomes important. They are essential for programmers who would use Cartesian indexing there. Two algorithms for each casting conversion are presented here, including to-and-from dilated integers for both d ¼ 2 and d ¼ 3. They fall into two families. One family uses newly compact table lookup, so the cache capacity is better preserved. The other generalizes better to all d, using processor-local arithmetic that is newly presented as abstract d-ary and ðd À 1Þ-ary recurrences. Test results for two and three dimensions generally favor the former.
INTRODUCTION
Dilated integers form an ordered group for the Cartesian indices into a d-dimensional array represented in Morton order [1] . Fig. 1 illustrates the Morton order and the use of dilated integers as its Cartesian indices. With an extra high-order bit, they also suffice for casts from Cartesian to/from Ahnentafel indices [2] . The amount of dilation, called d-dilation, varies with the dimension d of the represented array.
Morton order is used to index arrays in applications from computer vision and graphics, cartography and geographical information systems, volumetric analyses such as axial tomography (CAT scans), and the conventional matrix computations of scientific computing. An array in the Morton order can conveniently be decomposed as a 2 d -ary tree whose subtrees have contiguous addresses. Such a locality minimizes page and cache misses. Practically, d ¼ 2 for matrices and quadtrees and d ¼ 3 for 3D aggregates and octrees.
A comfortable programming paradigm to take advantage of such locality uses recursion on 2 d blocks of each array/operand to descend those trees [3] , [4] . When the base case is reached (well above 1 Â 1, typically a 32 Â 32 block), all of it resides in cache.
More importantly, its larger enveloping blocks fit into L2, L3, RAM, swapping disk, and so forth up the memory hierarchy to include the communication of blocks among both shared memory and distributed coprocessors. The programmer need not worry about which size lands where. This implicit mapping of memory has been called "cache oblivious" because she can ignore the details of block size, cache lines, and memory capacity [5] ; all of these are local minutiae anyway. Whichever blocks fit will reside in the L2 cache for a good while, with subblocks rolling into L1 repeatedly. Meanwhile, translation lookaside buffer (TLB) misses seem to disappear [6] . This style enhances the portability of code that no longer needs to be retuned to a specific memory architecture.
As these array indexings become better used for their high performance, particularly in the context of cache-oblivious matrix processing, their conversions to and from ordinary integer representations become important. They are essential manifestations, for instance, of the common abstractions from generic programming [7] . Like the I/O conversions of the IEEE floatingpoint numbers, however, their use really should be infrequent and should be displaced by efficient computation within that type. Nevertheless, unlike the decimal/float conversions [8] , [9] , knowledge of the details of these algorithms now seems a prerequisite for the acceptance of the type into common use, particularly in the face of lore identifying that conversion as an obstacle [5, p. 288 In order to hasten its promulgation, we offer eight algorithms, two conversions in each direction for d ¼ 2 and d ¼ 3, to suit specific architectures. In all cases, the conversion is fast, faster than a random access into an array, which is their common runtime context. On current processors, the table lookup is shown to be the faster for repeated castings. For sequential access, however, efficient implementations of the additive operators are directly available on both dilated integers and their cousins, masked integers [12] , [13] . Thus, C++ iterators can be even faster than inline conversions for Cartesian indexing into Morton-order matrices. Section 3 uses table lookup to solve these problems. Section 4 uses register-local operations. Section 4.4 uses an arbitrary d as d-ary and ðd À 1Þ-ary recurrences and Section 5 describes our timings and offers conclusions. New contributions are the halving of tables necessary for contractions in Section 3, new multiplication-based algorithms in Section 4.4, the generalized d-ary and ðd À 1Þ-ary recurrences in Section 4.4, and the demonstration that table lookup is faster on current RISC processors in Section 5.
DEFINITIONS
Informally, 2-dilating an integer interleaves one 0 bit between all of the meaningful bits in the binary representation of that integer. As a simple example, consider i ¼ As this block-recursive order extends across the plane, the dilation of integer 13 indexes the entire 14th row and the doubled dilation of integer 14 indexes the whole 15th column. The transposed Morton-Z order is also available [13] . Fig. 1 for an example of their use as row and column indices). 
, where i ' , j ' , and k ' are bits. Cartesian indices are limited here to w or fewer bits. Morton indices and 2-dilated integers have 2w bits. The restriction w ¼ 16 can be relaxed for 64-bit architectures, although, frankly, dense matrices do not often have orders exceeding the 65,535 that it provides. If necessary, these algorithms are easily generalized to larger w. Definition 2.1 [2] . The root of a d-dimensional array has Morton-order index 0. A subarray (block) at Morton-order index i is either an element (scalar) or is composed of 2 d subarrays, with indices 2
For d ¼ 2, the indexing of a 16 Â 16 matrix is illustrated in Fig. 1 . At each level of the recurrence, the four quadrants are labeled northwest, southwest, northeast, and southeast, respectively. This yields the order to meet the conventions of scientific computation, where matrices are tall rather than wide. Computer graphics, where images are wider than they are tall, uses the transposed Z order available by exchanging southwest and northeast just above and using { and | ! to index rows and columns.
Theorem 2.1 [1] . The Morton index into a matrix is
' and corresponds to the
The concatenated quaternary digits above index the nested quadrants of a matrix. The octal digits, below, index nested octants of a 3D array.
Theorem 2.2 [2]. The Morton index into a 3D array is
corresponds to the Cartesian index for row i ¼ P wÀ1
These theorems reflect the so-called "bit interleaving" [15] , [16, pp. 53-55], [1] , also described in [17] . The purpose of the following algorithms is to convert between the binary representations of i ¼ P wÀ1
, that is, the dilation of integers.
The following two sections define four functions each for casting ordinary integers to and from 2-dilated and 3-dilated integers by using table lookup and word width arithmetic. The former functions run in about w=8 steps (on tables of 2 8 entries), but the latter use roughly lg w steps. Even with w growing to 32, addressing a square matrix of order 4 Á 10 9 ; , these proportions are about the same. Section 4.4 offers two more functions for even wider dilations.
CONVERTING VIA TABLE LOOKUP
The following algorithms are conversions and inverse conversions (dilations and contractions) from integers to 2-dilations:
with inline instructions. The first casting conversions use table lookup. They require a vector of 256 entries, either short integers or bytes. Its aligned footprint in an L1 cache is very small and the inline code is short.
For the first pair of algorithms, assume that d ¼ 2, thus spreading bits apart by one, that the dilated integers are 32 bits wide, that ordinary integer indices are 16 bits wide, and that all integers are unsigned. Their generalization is straightforward for 64-bit indexing into large arrays. Generalizations to d ¼ 3 follow.
The table for 2-dilation of Algorithm 1 appears as Fig. 2 [18], [19] . It is composed of 256 short integers, occupying twice the necessary bits but avoiding final operations to mask out alternating bits from them. Those extra operations are illustrated in Algorithm 3 for 3-dilation.
Algorithm 1 Algorithm 2
To invert a dilation, the dilated integer is split and folded into a single pattern, byte by byte, which then indexes a table (Fig. 3) to recover its contraction. This folding halves the fetches from others' tables [18] , [19] . Algorithms 1-4 use bitwise inclusive-OR j to sum bit patterns. Alternatives are simple addition + and bitwise exclusive-OR^, because the patterns are bitwise disjoint.
Algorithms 3 and 4 use the precomputed vector of 256 bytes in Fig. 4 , instead of two: one of shorts and another of bytes. The algorithms are written for 3-dilations of 16-bit short integers to 32-bit integers and vice versa.
Algorithm 3 Algorithm 4
Just as Algorithm 2 halves the space of Algorithm 1 (Fig. 3 versus Fig. 2 ), the next theorem nearly halves their total space once again. A
CONVERSION WITHOUT TABLE LOOKUP
Conversion can also be done entirely within the processor without the cache impact of table lookup. A review of existing approaches follows, based on shifts and masking, which leads to new ones based on integer multiplication. This section concludes by considering the abstract case for all d. Discussed first are d-undilation algorithms for d ¼ 2; 3.
As before, consider the case where the dilated integer is contained in a w-bit (32-bit) word. The w limit constrains the number of significant digits in the undilated integer to s ¼ bw=dc.
As shown in several cases in the following, abstract multiplication generates high-order overflow bits that are ignored. Notation 4.1. Let s denote the number of significant dilatable bits in an undilated integer.
Shift-Or Algorithms
Merkey [20] presents straight-line codes, based on an algorithm by Stocco and Schrack [19] , which use repeated shift-OR operations and masks for conversion between normal and dilated integers. They can also be found without citation among sample codes in machine manuals [21, pp. 136-193, 184-185] . Instances of the shiftbased 2-undilation and 2-dilation algorithms for s ¼ 16 are shown in Fig. 5 . Each algorithm works in four rounds, where a round is comprised of three operations: a shift and two bitwise operations. Merkey presents a variant that performs one shift and three bitwise operations per round, but this works equally well with superscalar instructions. Another round is necessary for w ¼ 64 and s ¼ 32. Fig. 5 illustrates the whole family of algorithms. In undilate_2, for instance, each round coalesces pairs of groups from the previous round, that is, that algorithm performs "binary coalescing." Unnumbered in this figure because it is replaced in the following, it uses a mask to remove the redundant "middle muddle" that results from reflexive bit merges. It collapses the dilation in only dlg se rounds.
Undilation via Multiplication
With fast integer multiplication, the shift-ORs in a round of the undilation algorithm might be reduced to a single multiplication that implements them. When a constant factor has only a few bits set, its multiplication might be better replaced by inline shift-adds at compile time or (even if a variable) decoded by the processor to shift-adds at runtime. For example, multiplications by integers like 3 or 17 that have only 2 bits set in their binary representation can be implemented by a single shift-add. In writing such a multiplication, we intend it to allow such an implementation, if faster.
Indeed, we have seen compilers that enforce it (sometimes slowing the code). Faster multipliers or relatively slower barrel shifters might favor the multiply as coded.
The multiplication is safe here because the intervening/dilating 0 bits absorb any carries from the implied addition within the middle muddle, which is about to be masked off, anyway. Since any dilated integer has a sparse set of significant bit positions, it is safe to implement the shift-ORs of similar undilations by using multiplication. For instance, 2 and 3-undilation can be implemented as follows:
Algorithm 6
Algorithm 7
The moderately experienced programmer will recognize the constant factors in Algorithm 6 as successors of powers of powers of 2 (Fermat primes even) and, so, we express them as decimal numbers. In fact, and as derived in Section 4.4, in Round i, the factor is composed of 2 one bits, separated by In summary, Algorithm 6 requires nine operations and Algorithm 7 requires seven. For 64-bit dilated indices, an extra round is needed by both algorithms, raising the instruction counts to 11 and 9, respectively. See Table 1 for comparisons. 
Dilation
Algorithm 5 stands as our processor-local algorithm for 2-dilation. Multiplication alone cannot replace its shift-OR operations because it introduces carry bits whose effects survive the subsequent masking. For example, one is tempted to substitute t ¼ t Ã 257 & mask for its first statement t ¼ ðtjðt << 8ÞÞ & mask, but that fails, for example, at t ¼ 257, because of the carry that shifts into the 2 9 bit of the product. Indeed, we are not aware of any multiplication-based approach to 2-dilation that uses fewer instructions than the shift-based approach.
The multiplication-based approach, however, can be used to 3-dilate an integer via "binary splitting" with fewer instructions than a shift-based binary-splitting algorithm.
Algorithm 8
As before, the constant factors in Algorithm 8 follow a pattern that can be inferred from their hexadecimal presentation: one bits are separated by 2 5Ài À 1 zero bits at Round i. This algorithm delivers a binary splitting pattern of partial results: At the end of each round, there is a maximum group size which changes as 8, 4, 2, and then 1.
While Algorithm 8 needs as many rounds as a shift-based approach, each round requires just two instructions, multiply and mask, that might become the shift-based algorithms' three. Thus, dilating for d ¼ 3 and s ¼ 10 takes 8 instructions rather than 12 and, for 64-bit dilated integers, it would require only 10.
Although a "ternary split" could perform 3-dilation in fewer rounds, such an approach is inferior for both the multiplicationbased and shift-based approaches. Multiplication fails because of the presence of undesirable carries. As before, shift-based approaches would also have an excessive number of operations per round, more than nullifying any reduction in the number of rounds. However, as shown in the next section, multiplication implements ternary split well for 4-dilation and, in general, the ðd À 1Þ-ary split does well for wider d-dilations.
Even Wider Dilations
The case of arbitrary d > 1 and s > 1 is now addressed. To be specific, how can one derive the "magic" constants above and how is one convinced of the correctness of these algorithms? (Those constants are denoted b, c, y, z in the following, subscripted for dilation d and the round i.) In the arbitrary case, s is subject only to the condition that 1 < s bw=dc, where w denotes the natural word width of the processor. The general algorithms follow.
Algorithm 9
DILATEðt; s; dÞ (where d > 2) 1) r t; 2) for i ¼ 1; 2; . . . ; dlog dÀ1 se do r ðr Â b d;i Þ AND y d;i . 3) return r.
Algorithm 10
UNDILATEðt; s; dÞ 1) r t; 2) for i ¼ 1; 2; . . . It is easy to verify that the multiplication constant for the ði þ 1Þth round c d;iþ1 should be x d;ðdÀ1Þd i . This constant can be more than w bits long, but, by the properties of modular arithmetic, only the low-order w bits of any constant are necessary for the result to be correct. For instance, when d ¼ 3, s ¼ 10, and w ¼ 32, c 3;3 ¼ x 3;18 ¼ 2 36 þ 2 18 þ 1 ¼ 0x1000040001, a 37-bit value. However, the constant 0x40001 in Algorithm 7 is just the low-order 32 bits of c 3;3 .
Algorithm 9, d-dilation for d > 2 in t ¼ dlog dÀ1 se rounds, has invariants.
3. For all rounds, the rightmost bit of r stays in Position 0. 4. At the end of Round i for i ¼ 0; 1; . . . ; t, there are groups of consecutive significant digits of size ðd À 1Þ tÀi plus possibly one (the leftmost) group of smaller size if s is not a power of ðd À 1Þ. Each group is separated from the one before it by ðd À 1Þ tÀiþ1 blank bits.
Therefore, one round of d-dilation reduces the maximum size of a group of consecutive significant digits by a factor of ðd À 1Þ, which may be viewed as a ðd À 1Þ-way split. Fig. 6 illustrates why this Table 1 compares the operation counts for the eight principal algorithms. The trade-off between fetch and multiplication is apparent. Table 2 compares the runtimes measured in clock cycles for these algorithms, run on several machines. The indicated C compilers and versions were used only with option -O3. This paper presents several results that advance the use of dilated integers for indexing Morton-ordered matrices. First, it is an introduction to those structures and how simply they deliver block locality with conventional Cartesian indexing. Second, it reintroduces efficient processor-based algorithms that should be more widely known and establishes new time-efficient and spaceefficient table-based alternatives. Of course, either could be displaced by hardwired bit shufflers. Third, it generalizes the former algorithms with d-ary and ðd À 1Þ-ary recurrences for the general case.
The expansion of the memory hierarchy and the advent of chip multiprocessors create a need for arrays with highly local access within any cache-resident block. These algorithms allow dilated integers and methods for direct arithmetic on them [12] and help deliver that locality with ordinary Cartesian indexing on ordinary hardware. 
