We examine several matrix layouts based on space-filling curves that allow for a cache-oblivious adaptation of parallel TU decomposition for rectangular matrices over finite fields. The TU algorithm of [11] requires index conversion routines for which the cost to encode and decode the chosen curve is significant. Using a detailed analysis of the number of bit operations required for the encoding and decoding procedures, and filtering the cost of lookup tables that represent the recursive decomposition of the Hilbert curve, we show that the Morton-hybrid order incurs the least cost for index conversion routines that are required throughout the matrix decomposition as compared to the Hilbert, Peano, or Morton orders. The motivation lies in that cache efficient parallel adaptations for which the natural sequential evaluation order demonstrates lower cache miss rate result in overall faster performance on parallel machines with private or shared caches, on GPU's, or even cloud computing platforms. We report on preliminary experiments that demonstrate how the TURBO algorithm in Morton-hybrid layout attains orders of magnitude improvement in performance as the input matrices increase in size. For example, when N = 2 13 , the row major TURBO algorithm concludes within about 38.6 hours, whilst the Morton-hybrid algorithm with truncation size equal to 64 concludes within 10.6 hours.
overall faster performance on parallel machines with private or shared caches, on GPU's, or even cloud computing platforms. We report on preliminary experiments that demonstrate how the TURBO algorithm in Morton-hybrid layout attains orders of magnitude improvement in performance as the input matrices increase in size. For example, when N = 2 13 , the row major TURBO algorithm concludes within about 38.6 hours, whilst the Morton-hybrid algorithm with truncation size equal to 64 concludes within 10.6 hours.
Introduction
Exact triangulisation of matrices is crucial for a large range of problems in Computer Algebra and Algorithmic Number Theory, where a basis of the solution set of the associated linear system is required. A known algorithm in the field which resulted in several prominent adaptations is the TURBO algorithm of Dumas et al. [11] for exact LU decomposition. This algorithm recurses on rectangular and potentially singular matrices, which makes it possible to take advantage of cache effects. It improves on other expensive methods for handling singular matrices, which otherwise have to dynamically adjust the submatrices so that they become invertible. Particularly, TURBO significantly reduces the volume of communication on distributed architectures, and retains optimal work and linear span. TURBO can also compute the rank in an exact manner. As benchmarked against some of the most efficient exact elimination algorithms in the literature, TURBO incurs low synchronisation costs and reduces the communication cost featured by [13, 14] by a factor of one third when used with only one level of recursion on 4 processors. In TURBO, local TU factorisations are performed until the sub-matrices reach a given threshold, and so one can take advantage of cache effects. A cache friendly adaptation of the serial version of TURBO bears impact on all possible forms of parallel or distributed deployment of the algorithm. For one, nested parallel algorithms with low depth and for which the natural sequential execution has low cache complexity will also attain good cache complexity on parallel machines with private or shared caches [6] . Locality of reference on distributed systems is also being advocated by the Databricks group initiated by founders of Apache Spark. In their own terms, when profiling Spark user applications on distributed clusters, a large fraction of the CPU time was spent waiting for data to be fetched from main memory. Locality of reference is also of concern on GPUs. Although one does not have full control over optimising locality of reference on such machines, and despite that GPUs rely on thread-level parallelism to hide long latencies associated with memory access, the memory hierarchy remains critical for many applications. Finally, applications that are cache aware are also deemed to be more energy aware, as remarked by the Green computing community.
This preamble motivates our work on trying to improve the cache performance of the serial version of TURBO. It is well established that traditional row-major or column-major layouts of matrices in compilers lead to extremely poor temporal and spatial locality of matrix algorithms. Instead, several matrix layouts based on space filling-curves have yielded cache-oblivious adaptations of matrix algorithms such as matrix-matrix multiplication [5, 9] and matrix factorisation [4, 12, 21] . Those alternative layouts are recursive in nature and produce highly cache-efficient, cache-oblivious adaptations that scale with the size of the underlying matrices. The cache-oblivious model does not require knowledge of, and hence tuning the algorithm according to, the cache parameters. Cache-oblivious programs allow for resource usage not to be programmed explicitly, and for algorithms to be portable across varying architectures, as well as all levels of the memory hierarchy within one specific architecture.
Our contributions can be summarised as follows:
1. We investigate prospects for a cache oblivious adaptation of the TURBO algorithm by mapping four different matrix layouts against each other: the Hilbert order [10, 15] , the Peano order [4, 5] , the Morton order [16, 20] , and the Morton-hybrid order [2] . Whilst matrices on which we want to perform matrix-matrix multiplication or LU decomposition without pivoting can be serialized no matter what layout is used, the recursive TU decomposition considered in this work consistently requires permutation steps that require one to traverse the matrix in a row-wise or column-wise manner, thus eliciting index conversion from the Cartesian scheme to the recursive scheme and vice versa. In addition to the specific contributions summarised below, this survey component of our work 2. Our analysis of the four schemes addresses the cost of bit operations and accessing table lookups when applicable. Our findings show the following:
(a) The overhead for using the Peano layout will be compelling as index conversion invokes operations modulo 3.
(b) Whilst the Hilbert layout has been promising for improving memory performance of matrix algorithms in general, and despite that the operations for encoding and decoding in this layout can be performed using bit shifts and bit masks, we will still require m iterations for a 2 m × 2 m matrix for each single invocation of encoding or decoding.
(c) In contrast, we find that the conversions for the Morton and the Morton-hybrid layouts incur a constant number of operations assuming the matrix is of dimensions at most 2 α × 2 α , where α is the machine word-size. For the typical value α = 64, such matrix sizes are sufficiently large for many applications.
(d) Furthermore, despite that the Morton order can be encoded and decoded faster than the Morton-hybrid order, the factor of improvement is constant: ten less operations. In return, the Mortonhybrid layout allows the recursion to stop when the blocks being divided are of some prescribed size equal to T × T , thus decreasing the recursion overhead. These T × T blocks are stored in a row-major order, which allows for benefiting from compiler optimizations that have already been designed for this layout. The row-major ordering of the block at the base case also makes accessing the entries within the blocks at the base case of the inversion, multiplication, and decomposition steps of the algorithm faster and easier because no index conversion is required.
3. Unless otherwise stated and explicitly cited, the various encoding and decoding algorithms we present under these various layouts and the propositions/proofs associated with them, are novel.
4. The present manuscript is an indispensable precursor for our work in [1] , where we introduce the concepts of alignment of sub-matrices with respect to the cache lines and their containment within proper blocks under the Morton-hybrid layout, and describe the problems associated with the recursive subdivisions of TURBO under this scheme. Although the full details of the resulting algorithm are beyond the scope of this paper, we report on experiments that demonstrate how the TURBO algorithm in Morton-hybrid layout attains orders of magnitude improvement in performance as the input matrices increase in size. For example, when N = 2 13 , the row major TURBO algorithm concludes within about 38.6 hours, whilst the Morton-hybrid algorithm with truncation size equal to 64 concludes within 10.6 hours.
The TU Algorithm: A Summary
Consider a 2m × 2n matrix A with rank r over a field F, where A may be singular. The TURBO algorithm triangulates the matrix A in a succession of recursive steps, relaxing the condition for generating a strictly lower triangular matrix. All the recursive steps are independent and thus TURBO is inherently parallel. It further outputs two matrices T and U , such that A = T · U , where U is a 2m × 2n upper triangular matrix, and T is 2m × 2m, with some "T " patterns. In all of the following, let A (i,j) (a,b) denote the submatrix of A of dimensions a × b and starting at the entry of Cartesian index (i, j). Whenever the superscript (i, j) is omitted, a default value (0, 0) is assumed. First, begin by decomposing the matrix A of size 2m × 2n as follows:
The TURBO alogrithm now performs the following:
1. Recursive TU decomposition in each of SE, then SW, NE, and finally, NW 2. Virtual row and column permutations needed to re-order the blocks to yield a final, upper triangular matrix For brevity, we only elaborate on the first step in the TURBO algorithm. It gives a flavour of the various matrix opertions we will be addressing throughout the paper. The rest of the steps can be found in [11] .
Step 1: Recursive TU in N W
This step performs a recursive call to the TU decomposition algorithm in the N W quadrant of A to get U 1 upper triangular, G 1 , and L 1 , lower triangular, such that
Here, U 1 is r × r for some r ≥ 0, G 1 , and L 1 is m × m. L 1 is then used to update N E by:
One now aims to zero out the sub-matrix of SW that lies under U 1 . This is done by calculating
and then setting
The submatrix SE has to be updated accordingly:
At the end of this step, matrix A has been updated as follows:
The resulting matrix can be seen in Fig. 1 (a) taken from [11] . 
Comparison of Index Conversion Overhead
The amount of computation required for index conversion from each of the Hilbert, Peano, Morton, and Morton-hybrid layouts to the Cartesian order is used to rate each of these layouts as they apply to TURBO. The row and column permutations required at every level of the recursion make row and column traversals of the matrix crucial. Intensive conversion tasks are required for row and column traversals of the matrices. This makes the overhead of index conversion essential in the comparison of the different layouts available. Let Θ denote a subscript associated with one of the four layouts named above. Given a Cartesian index (i, j), encoding it in the order Θ corresponds to calculating its index z Θ in the resulting matrix layout under Θ.
Given an index z Θ of a matrix entry under order Θ, decoding z Θ corresponds to calculating the Cartesian index (i, j).
Conversion Terms
The following is a list of terminologies used in the remainder of this manuscript. 
6. The act of masking a value v: applying an AND operation on v and a mask to extract part of v.
7. The row index of an entry e in a given matrix is its row offset within the matrix. It is given by i from the Cartesian index (i, j) of e. The column index of the entry is its column offset, given by j.
8. Consider a matrix M decomposed into sub-blocks of size 2 k × 2 k . For any given sub-block, its row index is its row offset within M and its column index is its column offset within M .
9. The index of an entry in a given matrix laid out in order Θ is its offset within the linear array representing the matrix in the layout Θ. This was denoted earlier by z Θ 10. The index of a sub-block of size 2 k × 2 k in a given matrix laid out in order Θ is the offset of this sub-block within the linear array of objects representing the matrix in the layout Θ, where each object is a 2 k × 2 k sub-block.
11
. We refer to the index of an entry (or block) in the Θ order as the Θ index of this entry (or block). For example, when considering a matrix M laid out in the row-major order, Θ refers to the row-major order and the row-major index of an entry (or block) of M is the index of this entry (or block) in the row-major layout of M .
Encoding/Decoding in the Row-Major Order
The row-major (column-major) order is the default ordering used by compilers for two dimensional arrays. Computer memory is linear and consists of a list of consecutive addresses in memory. Compilers therefore store a two dimensional array by laying it out row by row. When the programmer uses the notation A[i][j] to access the element in position (i, j) of the matrix A, the compiler performs the index conversion behind the scenes. In the rest of this section, we consider the example matrix shown in Fig. 2 with n = 8 i.e. m = 3, and take Θ to denote the row-major order. 
Encoding in the Row-Major Order
Consider the Cartesian index (i, j) in a matrix laid out in a row-major fashion. To find z Θ (i, j, n), denoted by z Θ for simplicity, use the equation
For the example shown in Fig. 2 , to encode i = 4 and j = 6, using Eq. (2) results in z Θ = 4 × 8 + 6 = 32 + 6 = 38. Because the element of Cartesian index (i, j) lies within a 2 m × 2 m matrix, then i, j ∈ {0, 1, 2, ..., 2 m }, and so, to represent any of these values in the binary system, at most m bits are needed. Write
The integer operations to find z Θ given by Eq. (2) are equivalent to the following bit operations:
for n = 2 m . This can be seen as concatenating the bits of j to the bits of i to get:
Hence, the encoding can be done using bit shifting and bit masking operations on the binary representations of i and j, as shown in Alg. 1. For i = 4 and j = 6 represented as i = (100) 2 and j = (110) 2 , i << 3 = (100000) 2 and z Θ = (100000) 2 |(110) 2 = (100110) 2 = 38.
Algorithm 1: Encoding for the Row-Major Order Using Bit Operations
Equations Eq. (3) and Eq. (4) present the operations used for decoding an index in the row-major order using integer operations. These equations follow from Eq. (2) by which we obtain z Θ as:
In Eq. (3) and Eq. (4), we extract the i and j value of the index (i, j) respectively. Using these equations to decode the index 38 from Fig. 2 , compute i = 38 ÷ 8 = 4 and j = 38%8 = 6. Alg. 2 describes the corresponding bit operations. Recall that n = 2 m , so the extraction of i given by Eq. (3) can be done using the bit operations:
and the extraction of j given by Eq. (4) can be done using:
To decode z Θ = 38, one operates on the bit representation of z Θ = (100110) 2 . Extract i as
and j as
Algorithm 2: Decoding for the Row-Major Order Using Bit Operations
Computation Overhead
The encoding procedure given by Eq. (2) uses one integer multiplication and one integer addition and costs two integer operations. The decoding procedures to extract i and j in Eq. (3) and Eq. (4) use one division operation and one modulo operation also costing two integer operations in total. Equivalently, to encode a Cartesian index in the row-major order two bit operations are required. Decoding row-major indices also requires two bit operations.
Encoding/Decoding in the Hilbert-based layout
The Hilbert ordering is recursively generated by dividing a 2 m × 2 m matrix M into four quadrants following a certain pattern and recursively laying out the elements of these four quadrants [8] . Before deriving the costs associated with the conversion to and from this layout, We review the procedure to generate the Hilbert order of a 2 m × 2 m matrix M , i.e. to map the entries of M to entries in the one-dimensional array representing M . The primitive patterns U, D, C, and N are shown in In one variant of the Hilbert layout, the matrix M is assigned an initial pattern ρ M from the four primitive patterns U, D, C, and N and its four quadrants are laid out in an order depending on the pattern ρ M of M . To generate the Hilbert order of M , a recursive procedure is followed. We denote by Q k the sub-matrix we are laying onto memory for each recursive level and denote by ρ Q k the pattern of the sub-matrix Q k obtained from the set {U, D, C, N }. We start with Q 0 = M , assuming it has pattern ρ Q 0 = U . The generation proceeds as follows. At the k th step, each of the sub-matrices Q k is refined into four quadrants
These are then laid onto memory in the Hilbert order according to two rules: the N extP attern rule and the HilbertOrder rule. Let M ρ denote any matrix M of pattern ρ ∈ {U, D, C, N }. Let i ∈ {0, 1, 2, 3} denote the quadrants of M ρ where 0, 1, 2, and 3 refer to the N W , N E, SW , SE quadrants respectively. The N extP attern(M ρ ) rule for any sub-matrix M of pattern ρ M ∈ {U, D, C, N } identifies the pattern of each quadrant of M . The HilbertOrder(M ρ ) rule identifies the order of precedence in which these quadrants are mapped onto memory. The N extP attern rule is given by:
where ρ i denotes the pattern of the i th quadrant of M ρ . The HilbertOrder rule is given by:
where v i represents the order of precedence of quadrant i in the physical layout when generating the Hilbert order. The N extP attern rules for the four patterns as used in the generation of the Hilbert curve are given by the following:
The HilbertOrder rules for each pattern are given by the following:
HilbertOrder(M C The matrix to be laid out in the Hilbert order is given a pattern and refined according to the refinement rules for each pattern. The refinement is done by recursively dividing the matrix into four quadrants and storing them according to the pattern refinement rules. Let M ρ denote a Hilbert matrix of pattern ρ ∈ {U, D, C, N }. Each step of the refinement for M ρ is done by identifying the patterns of the quadrants of M ρ and the order in which these quadrants are accessed. This is done using the following structures for each matrix M ρ laid out in some pattern ρ:
where ρ i denotes the pattern of the i th quadrant of M ρ in the row-major layout, and
where ν i denotes the index in the Hilbert layout of the i th quadrant of M ρ in the row-major layout.
For example, the refinement rule for the U pattern is given by:
and
The refinement process is referred to as the generation of the Hilbert curve and is going to guide the encoding and decoding procedures. Fig. 6 shows an 8 × 8 matrix stored in the U -shaped Hilbert order on which the encoding and decoding procedures will be traced. In the rest of this section, Θ refers to the Hilbert order and M refers to a 2 m × 2 m matrix in the Hilbert order. To use the refinement rules, we translate them into two lookup tables: Table T P and Table T V . We index the lookup tables using ρ and v: ρ is the pattern of the matrix we are refining and v is the index, in the row-major order, of the quadrant Q k+1 e within Q k e . Table T To determine these tables, consider matrix M ρ of pattern ρ ∈ {U, D, C, N }. Let v ∈ {0, 1, 2, 3} refer to the row-major index of any quadrant of M , designating N W , N E, SW , and SE respectively. The entry
is determined as follows: ρ is the pattern of the v th quadrant of M ρ in the row-major order. Table T V presents a mapping between the row-major layout of the quadrants of a matrix of pattern ρ and the Hilbert layout of these quadrants. The entry
is the Hilbert index of the v th quadrant of M ρ in the row-major order. For example, recall that a matrix of the U Hilbert pattern is refined as follows:
Hence the entry T P (U, 1) is C and
in which the element e of Cartesian index (i, j) lies at each refinement step. Table T P guides the generation of z Θ by identifying the pattern of Q k+1 e . Table T V identifies two bits to be appended at each iteration to a binary index z k . These two bits represent the index, in the Hilbert layout, of Q k+1 e within Q k e . We justify that this index is identified using two bits as follows. Recall that Q k+1 e is one of the quadrants of Q k e . There are only four quadrants in a Table 1 : Encoding Pattern Look-Up Table T Table 2 : Encoding Bits Look-Up Table T V 0 1 2 3 U 00 11 01 10 C 10 11 01 00 D 00 01 11 10 N 10 01 11 00 matrix. Hence, the possible values for the index of any of these quadrantsregardless of the layout -are 0, 1, 2, and 3, which can be represented using at most two bits. Hence, the length of the binary representation of this index is at most two. Now that the lookup tables are ready, we describe the iterative encoding procedure for a given 2 m × 2 m matrix and a Cartesian index (i, j). Each iteration represents a refinement step within the generation of the Hilbert curve and, after m iterations, the sequence {z k+1 } k=0,1,...,m−1 , converges to z Θ . In each iteration k, the pattern
e is M and ρ 0 = U because the Hilbert pattern we assume for the initial matrix is U . The corresponding index in the Hilbert order is progressively calculated by finding some partial index z k in each iteration. The algorithm begins with z 0 = 0 and ends with the value for z m = z Θ . Write
Recall that there are m bits in each of i and j, because at most m bits are needed to represent a value between 0 and 2 m−1 in the binary system. In each iteration k of the encoding algorithm:
• we use v k and ρ k to index the lookup table T P and find the next pattern of Q k+1 e
• we use v k and ρ k to index the lookup table T V and find
This appends the two bits of the Hilbert index of Q k+1 e within Q k e to the partial index z k to get z k+1 .
As two bits are appended to z k at each iteration, and there are m iterations, then z Θ is formed of 2m bits.
To prove correctness, we propose the following:
proof Consider the entry e of Cartesian index (i, j). Write
To prove that v k is the row-major index of Q . This establishes the base case.
Induction
Step: We want to show that i (m−(k+1)) is the row index of Q
By induction, we know that the bit
Replacing i r as in Eq. (5) in Eq. (6) above, we obtain
Hence the i (m−(k+1)) = B . This establishes the induction step and concludes the proof. 
We illustrate this encoding procedure with an example in Appendix A.
Decoding in the Hilbert Order
In the decoding procedure, we are given the Hilbert index z Θ of an entry e and we wish to find (i, j), the Cartesian index of e. Recall that Q . To determine these tables, consider matrix M ρ of pattern ρ ∈ {U, D, C, N }. Recall that the N W , N E, SW , and SE quadrants are the 0 th , 1 st , 2 nd , 3 rd quadrants in the row-major order. Entries in Table T P represent patterns and are given by:
The pattern ρ refers to the pattern of the v th quadrant of M ρ in the Hilbert order. Table T V presents a mapping between the Hilbert layout of the quadrants of M ρ and the row-major layout of these quadrants. The entry
indicates that the v th quadrant of M ρ in the Hilbert order is the v th quadrant of M ρ in the row-major order. Note that T V is the inverse mapping of the table T V used for encoding: i.e. Table T 
For example, recall that a matrix M U in a U Hilbert pattern is refined as follows:
Hence, we have T P (U, 3) = C because the 3 rd quadrant of M U in the Hilbert order has pattern C. Also, T V (U, 3) = 1 = (01) 2 because the 3 rd quadrant of M U in the Hilbert order is the 1 st -i.e. the N E quadrant of M U in the rowmajor order. Tables T P and T V guide the generation of i and j by identifying the pattern of Q k e and the row-major index of Q k e within M respectively. Now that the lookup tables have been described, we present the decoding procedure. Recall from the encoding procedure that z Θ is formed of 2m bits. Write
The decoding reverses the encoding procedure, in which at each iteration one bit of each of i and j is used to generate two bits of z. Recall, from encoding, that, in each iteration k, these two bits of z Θ make up the Hilbert index of Q k e within Q k−1 e as follows:
Also, we identify the pattern ρ k of Q k e . We start with ρ 0 = U . In each iteration k, ρ k and v k are used to index the two lookup tables T P and T V . As mentioned earlier, table T P is used to find the pattern of the quadrant of the next iteration Q k+1 e , given by ρ k+1 = T P (ρ k , v k ). Table T V is used to find a two-bit value v k = (b i b j ) 2 which represents the row-major index of Q k+1 e within Q k e . Recall that this is a two-bit value because it represents the index of a quadrant. The values of i and j are found progressively: start with i 0 = 0 and j 0 = 0. In each iteration append b i ∈ {0, 1} and b j ∈ {0, 1} to i k and j k to get i k+1 and j k+1 respectively. In bit operations, this corresponds to
appends bit b i to i k to get i k+1 , and
which appends bit b j to j k to get j k+1 . This reverses the process of encoding. We iterate m times, after which we have i m = i and j m = j. We illustrate this decoding procedure with an example in Appendix B.
Computation Overhead
Each of encoding and decoding requires m iterations. In each iteration, the encoding operation uses six bit operations and two table look-ups and the decoding algorithm uses eight bit operations and two table look-ups. The first of each table look-up incurs a random cache miss. The tables are small enough to fit in internal memory. As row and column permutations swaps in TURBO take place consecutively in one batch, so do the conversion routines, each of which requires access to the look-up table. By the LRU cache policy, the tables are kept in internal memory until the permutations have concluded. Consequently, the overall I/O cost for table lookups is Θ(1) per one batch of permutations, which is dominated by the I/O cost of the matrix operations in each recursive step, and hence, can be discarded. Summarising, the overall run-time for each of encoding and decoding in the Hilbert order is Θ(m) bit operations. Algorithms for encoding and decoding in the Hilbert order which do not use look-up tables are described in full detail in [15] . In [10] , a new variant of the Hilbert curve is introduced for which the encoding runtime overhead is O (lg(max(i, j)) ). Yet, such conversion schemes are still beaten by the Morton and Morton-hybrid layouts as we will describe in Sec. 3.5 and Sec. 3.6 respectively. The Peano ordering of a matrix is based on the Peano space filling curve. This ordering results from recursive construction and assumes the matrix has dimension 3 m × 3 m [5] . Each dimension of M is divided into 3 equal parts as shown in Fig. 7 resulting in nine sub-matrices which are named {00, 01, 02, 10, 11, 12, 20, 21, 22}. The Peano order for each of the resulting sub-matrices is recursively generated according to a certain precedence depending on the pattern of the higher level sub-matrix. Any Peano sub- Figure 8 : Peano Generation Rules matrix has a pattern from the set of primitive patterns {P, Q, R, S} shown in Fig. 8 . The Peano order assumes the initial matrix M has pattern P . We denote by Q k the sub-matrix we are laying onto memory for each recursive level and denote by ρ Q k the pattern of the sub-matrix Q k obtained from the set {P, Q, R, S}. We start with Q 0 = M of pattern ρ Q 0 = P .
Encoding/Decoding in the Peano Layout
At the k th step, each of the sub-matrices Q k is refined into nine quadrants
These are then laid onto memory in the Peano order governed by two rules: the N extP attern rule and the P eanoOrder rule, similar to the rules governing the generation of the Hilbert order. Let M ρ denote any matrix M of pattern ρ ∈ {P, Q, R, S}. Let i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8} denote the sub-matrices of M ρ where each i refers to the 00, 01, 02, 10, 11, 12, 20, 21, and 22 submatrices respectively. The N extP attern(M ρ ) rule for any sub-matrix M of pattern ρ M ∈ {P, Q, R, S} identifies the pattern of each quadrant of M . The P eanoOrder(M ρ ) rule identifies the order of precedence in which these quadrants are mapped onto memory. These rules can also be seen from Fig.  8 : the precedence order of the sub-matrices is given by the direction of the arrow and the pattern is given by the letter inside the sub-matrix. Fig. 9 shows the steps for the generation of the Peano order for a 9 × 9 matrix M . In Fig. 10 , the entries of the matrix at the end of the generation of the Peano order show the indices of the elements of M within the physical one-dimensional array representing M in the Peano order.
The Peano-based ordering of matrices has the property that computations within matrix-matrix multiplication can be re-ordered so as to ensure no jumps in the address space [5] .
The Peano layout is similar to the Hilbert layout in that it is based on more than one pattern, and so the encoding and decoding procedures for the Peano layout are analogous to those for the Hilbert layout. Specifically, the indices are encoded (or decoded) progressively through iterations. In each iteration, a pattern identifier is used along with a set of lookup tables to identify auxiliary variables for the next iteration. The differences between • The operations used in each iteration are operations in base 3 -division by 3 and modulo 3.
• The resulting indices are in base 3 and need to be converted to base 2.
This results in m iterations with a constant number of operations in base 3 per iteration, which cannot be replaced by bit operations. Conversion between base 3 and base 2 values is also needed at the end of the iterative computations in order to get the final indices. In total, this makes encoding and decoding indices within the Peano layout significantly costly as opposed to the Hilbert, Morton, and Morton-hybrid orders. As such, we rule out using the Peano order within the TURBO algorithm. This is specifically so because the nature of the recursive TU decomposition algorithms requires row-wise and column-wise traversal of the matrix for pivoting and permutations -in contrast to matrix multiplication where any traversal that suits the layout may be used.
Encoding/Decoding in the Morton Layout
The Morton order is another alternative layout used for 2 m × 2 m matrices. To generate the Morton-order of a matrix, the latter is divided into four quadrants, which are laid out onto memory in the order northwest, northeast, southwest, then southeast. The Morton order differs from the Hilbert order in that it does not alternate between patterns, but rather uses the same pattern to generate the Morton order for each of the sub-matrices. In this section, we present the Z-shaped Morton order, in which the pattern used looks like the letter Z as can be seen in Fig. 11 . According to [16] , the corresponding Morton index z Θ is
Figure 13: Matrix in Morton Order which represents an inter-leaving of the bits of i and j. The reverse process by which we extract the bits of i and j from the bits of z Θ is referred to as de-leaving. In this section, we derive the costs for encoding and decoding the Morton order based on this inter-leaving of indices. For this, we revisit the notions of dilating and un-dilating integers from [16] . Dilating an integer
bits is the process of expanding k into an integer k = (0k b−1 ...0k 4 0k 3 0k 2 0k 1 0k 0 ) 2 of 2 × b bits by inserting a zero bit between every two bits of k. Un-dilating is the reverse process that takes k back to k. Before we determine the computational overhead for index conversion, we review here two algorithms -dilate and un-dilate -that perform the encoding and decoding between Morton order indices and Cartesian indices. Fig. 14 shows an 8 × 8 Morton ordered matrix on which the encoding and decoding procedures will be traced.
Encoding in the Morton Order
Algorithm 3: (unsigned int) dilate( unsigned short t) [20] 1 unsigned int r = t
All the present discussion follows from [16, 20] . For encoding the Morton order, z Θ must be found given (i, j). As shown in Eq. (7), the inter-leaving Then z Θ is given by
To verify Eq. (8) from [16] , perform
We illustrate this encoding procedure with an example in Appendix C.
Decoding in the Morton Order
To decode the Morton order, we need to extract i and j from z Θ . We write z Θ as in Eq. (7). Then, the bits of z Θ must be de-leaved to separate the bits of i, denoted i z Θ , from the bits of j, denoted j z Θ . To do this, we mask z Θ with 0xAAAAAAAA to get
and mask z Θ with 0x55555555 to get
Now we shift i z Θ one position to the right to get
To calculate i, we un-dilate i z Θ using the un-dilation algorithm given in Alg. 4. To calculate j, we un-dilate j z Θ using Alg. 4.
Algorithm 4: unsigned short un-dilate( unsigned int t) taken from [16, 20] 1 unsigned int r = t 2 t = (t | (t >> 1)) & 0x33333333
We illustrate this decoding procedure with an example in Appendix D.
Computation Overhead
Assume that the input matrix has dimensions 2 α ×2 α , where α is the machine word-size. This guarantees that each cartesian index fits in a machine word. For the typical value α = 64, such matrix sizes are very generous. Each of dilation and un-dilation performs twelve bit operations: four OR operations, four AND operations, and four bit-shift operations. The encoding procedure makes use of the dilation algorithm twice, followed by one bit-shift operation and one OR operation. This results in a total of twenty six bit operations for encoding an index in the Morton order. The decoding procedure makes use of two masking operations, one bit-shift operation, and two calls to the un-dilation algorithm: one for un-dilating i z and one for un-dilating j z . This requires a total of twenty seven bit operations for decoding a Morton index to a Cartesian index. 
Encoding/Decoding in the Morton-Hybrid Layout
The Morton-hybrid order is a variant of the Morton order used in [2] and [8] among others. In this variant of the Morton order, the recursive subdividing into blocks stops at a given truncation size T , resulting in a base case block of size T × T , for which the row-major order is used. The value of the truncation size can vary.
To generate the Morton-hybrid order, the procedure for generation of the Morton order is followed, using the Z pattern as shown in Fig. 11 , until the size of the sub-matrix being mapped onto memory is T × T . To map the entries of this block in the one-dimensional array representing the matrix, a row-major order is assumed. Fig. 16 shows the steps in the generation of the Z-shaped Morton-hybrid order of an 16 × 16 matrix for truncation size T = 4. The resulting map of the entries of the matrix in the corresponding one-dimensional array in memory can be seen in Fig. 17 . The following presentation of the Morton-hybrid index for T = 16 was taken from [2] , but we elaborate on it additionally in our Proposition 3.4 below and the associated proof. For the remainder of this section, Θ denotes the Morton-hybrid layout for T = 16. We assume the input matrix M has dimension 2 m × 2 m . In Sec. 3.5, we saw that the Morton index is determined using the inter-leaving of the bits of i and j, for a given Cartesian index (i, j). In Sec. 3.2 the row major index was composed of the bits of j concatenated to the bits of i. The Morton-hybrid order is a combination of both these orders.
Consider the Morton-hybrid matrix shown in Fig. 15 showing row-major sub-blocks of a Morton-hybrid matrix for T = 16. Each 16 × 16 row-major block has a row index and a column index as given in Sec. 3.1. For example, consider the row-major sub-block outlined in red in Fig. 15 . Its row index is 1 and its column index is 2. The Morton index of a sub-block in the Morton-hybrid order is then given by the interleaving of the block row and block column indices as per the Morton indexing scheme. Now, each entry e in a Morton-hybrid matrix can be indexed with the help of information regarding:
1. the T × T row-major sub-block in which it lies 2. its row index within this row-major sub-block 3. its column index within this row-major sub-block For each element e in a T × T sub-block, the possible values for the row and column indices of e within this block can be 0, 1, ..., T − 1. Thus, the offset values can be represented using β bits where T = 2 β . Let (i, j) denote the Cartesian index of e. We now have:
The β lower order bits of i represent the row index of the element within the T × T row-major sub-block and the (m − β) higher order bits of i represent the row index of this T × T sub-block.
proof Let M be a 2 m × 2 m matrix laid out in the Morton-hybrid order with truncation size T = 2 β . Let e be any entry in the T × T row-major sub-block S M of M , and let (i, j) denote the Cartesian index of e. First we show that the bit representations of the row index of the sub-block S M and the row index of the element e within S M require (m − β) and β bits respectively.
The matrix M has dimensions 2 m × 2 m and the sub-blocks at the base case have dimensions T × T , with T = 2 β . Thus there are exactly 2
rows of T × T blocks and the row index i M of a sub-block can be 0, 1,..., 2 (m−β) − 1. So these row indices require (m − β) bits to be represented. The T × T sub-block S M of M is in the row-major layout and contains T rows of entries. Let r denote the row of S M in which entry e lies. The row index i r of e within S M , which is the same as the row index of any entry within r, can be 0, 1,. . ., T − 1 and thus its bit representation requires β bits. Now we identify where to get these bits from. The row index i of the entry e within M is given by
As such, i can be represented by m bits: the higher order (m − β) bits are the bits of i M , making up the row index of S M within M , and the β lower order bits are the bits of i r , making up the row index of e within S M . This concludes the proof.
To illustrate the proposition, take for example the element (20, 6) from Fig. 15 . In this example, n = 2 m = 64, i.e. m = 6, and T = 2 β = 16, so β = 4. Consider i = 20 = (010100) 2 . The β = 4 lower order bits are (0100) 2 = 4, the element's row index within the row-major block, i r . The m − β = 2 higher order bits (01) 2 = 2 identify the row index of the row-major sub-block in which this element lies. The same applies to j = 6 = (000110) 2 . The β = 4 lower order bits (0110) 2 = 6 are the column index of the element within the row-major block. The m − β = 2 higher order bits, (00) 2 , correspond to the column index of the block. Now consider a matrix M laid out in the Morton-hybrid order and an entry e in M . Let z Θ denote the Morton-hybrid index of e and (i, j) denote its Cartesian index. Let S M denote the T × T row-major sub-block in which e lies. Recall the interleaving of the two coordinates i and j presented in Sec. 3.5 leading up to the Morton index of e. The T × T sub-blocks of M are laid out in the Morton order, so we use this procedure to inter-leave the (m − β) higher order bits of i and j to get the Morton index z M of S M . Then, we find the row-major index of e within S M : as S M is stored in row-major order, the index within S M of e is in the context of a row-major ordering scheme. To get the row-major index of e within S M , denoted by z r , we concatenate j r to i r , where i r and j r are the row and column indices of element e within S M respectively (according to Sec. 3.2) . Because the smallest sub-blocks of M are of size T × T , the Morton-hybrid index of the element e of Cartesian index (i, j) is then given by
Hence, it can be formed by concatenating the row-major offset of e within To encode a Cartesian index (i, j) in the Morton-hybrid order, we proceed as follows. The last β bits of i must be extracted. To do this, mask i with
to extract i r , the part of the binary representation of i representing the row offset of the element within the row-major block. Thus
The rest of i is denoted by i m . It represents the row index of the row-major block in which the element (i, j) lies. The value of i m is given by
We find respectively using Alg. 3. Then, z Θ is given by:
Note that this equation is in fact made of two parts: We perform OR operations on the above values to get
as in Eq. (3.6).
We illustrate this encoding procedure with an example in Appendix E.
Decoding in the Morton-Hybrid Order
When decoding an index in the Morton-hybrid order, we start with
We wish to find the Cartesian index (i, j) by isolating the bits of i and the bits of j. Recall that z Θ is in fact made of three parts as described in the introduction of this section:
1. the Morton index of the row-major block in which this element lies 2. the row offset of the element within this block 3. the column offset of the element within this block
Thus to determine i, the Morton index of the row-major block must be decoded to get a row index, and the row offset of the element within the block is concatenated to this to get i. The same applies for j. We extract the two parts for i and j from above as follows, in a reverse process to encoding. The values for i and j can then be found as:
and j = j m |j r (11)
We illustrate this decoding procedure with an example in Appendix F.
Computation Overhead
As above, assume the input matrix has dimensions 2 α × 2 α , where α is the machine word-size. The encoding algorithm for the Morton-hybrid layout performs the same twenty six bitwise operations as those for the encoding algorithm for the Morton layout for bit dilation, with an additional ten operations: four AND operations, four bit shift operations, and two OR operations. Thus encoding in the Morton-hybrid order costs thirty six bit operations. The decoding algorithm for the Morton-hybrid layout performs the same twenty seven operations used in the decoding procedure for the Morton order. In addition to those, seven bit shift operations, two AND operations, and two OR operations are needed, totalling to eleven additional bit operations. Thus decoding Morton-hybrid index to get the Cartesian index requires thirty eight bit operations.
Summary of findings and a brief note on empirical performance
Our findings so far can be summarised as follows. The overhead for using the Peano layout will be compelling as index conversion invokes operations modulo 3. Whilst the Hilbert layout has been promising for improving memory performance of matrix algorithms in general, and despite that the operations for encoding and decoding in this layout can be performed using bit shifts and bit masks, we will still require m iterations for a 2 m × 2 m matrix for each single invocation of encoding or decoding. In contrast, we find that the conversions for the Morton and the Morton-hybrid layouts incur a constant number of operations assuming the matrix is of dimensions at most 2 α × 2 α , where α is the machine word-size. For the typical value α = 64, such matrix sizes are sufficiently large for many applications.
The present manuscript is an indispensable precursor for our work in [1] , where we introduce the concepts of alignment of sub-matrices with respect to the cache lines and their containment within proper blocks under the Morton-hybrid layout, and describe the problems associated with the recursive subdivisions of TURBO under this scheme. Although the full details of the resulting algorithm are beyond the scope of this paper, we report briefly on experiments that demonstrate how the TURBO algorithm in Morton-hybrid layout attains orders of magnitude improvement in run-time performance as the input matrices increase in size. A more detailed cache analysis is reported in [1] .
We run the serial TURBO algorithm on given matrices stored in the row-major layout and use the index conversion techniques from Sec. 3.2 to complete the row and column permutations. We then run the algorithm on the same matrices stored in the Morton-hybrid order and use the conversion techniques from Sec. 3.6. We perform the experiments for a number of different values of T , the truncation size at which the Morton Ordering stops and a row-major layout begins. We run our experiments is a Pentium III with processor speed of 800 MHz. Its has 16 KB of L1 cache and 256 KB of L2 cache. It runs a linux operating system of version 2.6.12 with gcc compiler version 4.0.0.
Test Cases for Recursive TU Decomposition
To neutralise the effect of modular aritmetic over finite fields and to be able to exclusively account for the gains induced by the Morton-hybrid order, we generate random n × n matrices over the binary field. Direct linear algebra over finite fields is an important kernel for several integer factorisation and polynomial factorisation algorithms. We chose to test the Morton-hybrid TURBO algorithm on matrices generated from the Niderretier algorithm for factoring polynomials. These matrices arise from the problem of factoring a polynomial f over the binary field and are given by the equation N f −I where N f is the Niederreiter matrix corresponding to the polynomial f [17, 18, 19] and I is the identity matrix. If f is a polynomial of degree n, then the matrix N f is an n × n matrix. We vary the value of n from 256 to 8192 in the tests. We also vary the truncation size T of the Morton-hybrid layout from T = 16, 32, 64, 128, 256, in line with previous Morton-hybrid algorithms for matrix multiplication and Cholesky Factorisation in [3] .
Results and Analysis for Recursive TU Decomposition
In this section, we present the performance results for the various truncation sizes we experimented with. Below we summarise the actual run-times for both versions and given truncation sizes. We interpret our results as follows.
For small values of N , the row-major TURBO beats the Morton-hybrid one. Obviously, the overhead associated with the index conversions required by the Morton-hybrid version during each recursive step dominate the overall run-time for small values of N . For all possible truncation sizes, the crossover point is for N = 1024. For larger values of N we actually gain orders of magnitude reduction in overall run-time. For example, when N = 2 13 , the row major TURBO algorithm concludes within about 38.6 hours, whilst the Morton-hybrid algorithm with truncation size equal to 64 concludes within 10.6 hours. Now, for each given value of N , the best truncation size seems to be around T = 32 and T = 64. This is where roughly half of the recursive calls down to a trivial base case block have been dispensed with. For smaller truncation sizes, the loss in performance is due to the recursion overhead. For higher truncation sizes, the loss in performance is associated with poor cache performance. We finally note that not only is the best performance of the Morton-hybrid version is for T = 32 and 64. For those ranges, the rate 
Conclusion
In this paper we have reviewed four major space-filling curve representations as they apply to parallel TU decomposition over finite fields (TURBO). Whilst these representations have been traditionally employed to develop cache-oblivious matrix multiplication and factorisation algorithms, both serial and parallel, we find that they incur additional costs associated with encoding and decoding from the row major layout as needed for the row and column permutations within TURBO. Our detailed analysis of the bit operations required, and in some cases the number of table look-ups, shows that the Morton and Morton-hybrid order are the best candidates. Additionally, the Morton-hybrid order balances this cost with that of recursion overhead and thus is a better candidate than the purely Morton order. The present paper is an indispensable precursor for our work in [1] , where we introduce the concepts of alignment of sub-matrices with respect to the cache lines and their containment within proper blocks under the Morton-hybrid layout, and describe the problems associated with the recursive subdivisions of TURBO under this scheme. We develop the full details of a cache oblivious variant of TURBO that observes the alignment and containment of sub-matrices invariably across the recursive steps. The resulting algorithm is inherently nested-parallel, and has low span, for which the natural sequential evaluation order has lower cache miss rate. Our experiments show that the TURBO algorithm in the Morton-hybrid layout attains orders of magnitude improvement in performance as the input matrices increase in size. We get z Θ = 582.
F Decoding in the Morton-hybrid Order
We illustrate this procedure on the matrix in Fig. 18 . We start with z Θ = 582 and we aim to extract i and j. For this matrix, T = 16 and β = 4. Write
We follow the steps described above. First we extract i m and j m using the masks ((0xAAAAAAAA << β) << β) and ((0x55555555 << β) << β) respectively. We get i m = (001000000000) 2 and j m = (00000000000) 2 . This gives i = 20 and j = 6.
