ABSTRACT In-place linear transformations allow input to be overwritten with the output of the transformation. This paper presents a family of in-place linear transformations based on block lower/upper (LU) decompositions, of which the known transformation via LU decomposition is a special case, and shows that the proposed family includes a class of transformations that possesses the cache-oblivious property. Furthermore, the number of cache misses and approaches for non-square matrices is discussed. The simulation shows that the proposed in-place linear transformation is faster than the existing transformation via LU decompositions.
I. INTRODUCTION
A linear transformation is represented as
where s is a column vector with n symbols, and A m,n is an m × n matrix. If the linear transformation is implemented directly, the input and the output should be stored in different places. However, for some applications, such as the encoding/decoding of block codes, it is not necessary to keep the input data after obtaining the result. In-place algorithms are a family of algorithms in which the input is overwritten by the output during execution of the algorithm, with at most O(1) extra memory space. If the implementation of (1) possesses the in-place property, the result F(s) will be stored in the same position as the input s.
In-place versions of some specific linear transformations have been proposed. These include an in-place fast Fourier transform computation for real-valued signals [1] , a truncated version of which was considered by Arnold [2] . Further, a decomposition for in-place matrix transposition was proposed by Catanzaro et al. [3] and Sung et al. [4] proposed an in-place matrix transposition approach on graphics processing units (GPUs). In our paper, the general in-place linear transformation is considered. In this case, the transformation
The associate editor coordinating the review of this manuscript and approving it for publication was Walter Didimo.
matrix A m,n in (1) can be arbitrary, as long as A m,n meets certain conditions (see Section II-A for more information).
To our knowledge, Luby and Shokrollahi [5] first proposed the general in-place linear transformations, that is based on permutation lower-upper (PLU) decompositions. They also describe an application in the encoding and decoding of block codes. However, when the input data set is larger than the cache space, a large number of cache misses may occur during execution. To solve this problem, the Luby-Shokrollahi (LS) algorithm requires knowledge of the cache size and the length of cache block, and this limits its portability to various platforms.
Cache-oblivious algorithms are designed to take advantage of caches without requiring knowledge of the size of the cache and the length of the cache lines. To date, a number of cache-oblivious algorithms are proposed, such as fast Fourier transformations (FFTs) [6] , [7] , sorting algorithms [8] , matrix transpositions [9] and matrix multiplications [10] - [13] . These algorithms are flexible and portable, and inherently suitable for multilevel memory systems. In this paper, we focus on the design of in-place linear transformations possessing the cache-oblivious property. The contributions of this paper are as follows.
1) A family of in-place linear transformations is presented based on block lower-upper (LU) decompositions; the LS algorithm is a special case of this family. In-place linear transformations based on block lower diagonal upper (LDU) decompositions are also discussed.
2) We show that this family includes a class of transformations that retains the cache-oblivious property. 3) Applications of the proposed approach to linear block codes and neural networks are discussed. The remainder of this paper is organized as follows. Section II gives an overview. Section III presents the in-place linear transformation for the square transformation matrix. Section IV describes a simulation of the proposed approach. Section V discusses the in-place transformations for general matrices, and the application of this algorithm to linear block codes. Section VII concludes this work, and Appendix discusses the number of cache misses of algorithms.
II. PRELIMINARIES
This section introduces the definition of (block) LU decompositions, the definition of cache-oblivious algorithms, and the in-place algorithm proposed in [5] .
A. LU AND BLOCK LU DECOMPOSITIONS
This section gives the definitions of the LU and block LU decompositions [14] , [15] . Given an n × n matrix A, the LU and PLU decompositions are respectively written as A = LU and A = PLU , where L is a lower triangular matrix with ones on the diagonal, U is an upper triangular matrix, and P is a permutation matrix. For the sake of simplicity, this paper uses an LU decomposition; it is straightforward to generalize this to obtain versions with PLU decompositions.
The block LU decomposition is expressed as follows. An n × n matrix A is partitioned into 2 × 2 blocks,
where A 11 is an m × m square matrix and the sizes of A 12 , A 21 , and A 22 are determined accordingly. Assuming that A 11 is invertible, the block LU decomposition of A is given by
where I denotes an identity matrix and
B. CACHE-OBLIVIOUS ALGORITHM
This subsection introduces the cache-oblivious algorithm introduced by Frigo et al. [6] , based on the ideal-cache model. The ideal-cache model shown in Figure 1 is a simple abstraction of the multi-level memory hierarchy. It consists of an ideal data cache of M words and an arbitrarily large external memory partitioned into blocks of size B. The paper [6] showed that, in the ideal-cache model, the cacheoblivious algorithm can work well for any adjacent levels of the memory hierarchy. A cache-oblivious algorithm is an algorithm designed to take advantage of a CPU cache without knowing M or B. As an example, cache-oblivious [6] is based on the divide-and-conquer approach given in Algorithm 1. Note that [6] did not consider the in-place property, and thus Algorithm 1 is not an in-place algorithm.
C. LS ALGORITHM
Given an n × n matrix A and a column vector s with n symbols, the in-place version of the transformation A × s in the downward direction is denoted by A ↓ s. Details are given in Algorithm 2. If A is an upper triangular matrix, the result A ↓ s is equivalent to A × s. Similarly, the in-place version of the transformation A × s in the upward direction is denoted by A ↑ s. If A is a lower triangular matrix, the result A ↑ s is equivalent to A × s. 
tmp ← A[i, i]s[i]
3:
end for 6 :
The LS algorithm [5] is based on LU decompositions 1 of A n,n . Then the transformation is written as
where L is a lower triangular matrix with ones on the main diagonal, and U is an upper triangular matrix. However, the LS algorithm does not have temporal locality. Specifically, the loop (Lines 1-7) of Algorithm 2 sequentially accesses the elements of s in the order (
As most elements are accessed only once, Algorithm 2 lacks temporal locality, and thus it is unsuitable for cache hierarchies. Consequently, the LS algorithm is not cache-oblivious either. Appendix discusses the cache misses of the LS algorithm.
In the LS algorithm, the LU decomposition of A n,n is required. We assume that the LU decomposition has already been performed in the preprocessing stage and that we do not need to repeat it in the transform algorithm. This assumption is based on the fact that some applications have to convert many input vectors with the same transformation. For example, in data communication, the same error-correcting code is usually used throughout the transmission process. This means that many message vectors will be multiplied by the same generator matrix. In this case, the LU decomposition of A n,n can be computed once, and the result will be applied to the input vectors.
III. IN-PLACE TRANSFORMATION FOR SQUARE MATRICES
This section presents the in-place linear transformation (1) when m = n. Based on block LU decompositions, the first subsection gives the formulas for in-place transformations. In-place transformations based on block LDU decompositions are given in Section V-B. The second subsection presents the recursive algorithm possessing the cache-oblivious property. Note that the recursive version is not in-place, as the recursive calls require additional space in the stack. To solve this problem, the third subsection gives a loop version possessing both the cache-oblivious property and the in-place property.
A. A FAMILY OF IN-PLACE TRANSFORMATIONS
The proposed in-place transformation is based on the block LU decomposition of A n,n . Based on (3), the transformation is
where the sizes of s 1 and A 11 are denoted by k and k × k, respectively. Using (6), let
Then, the result
T is given by
It can be seen that there are six variables, (7) and (8) . To provide the in-place property, the values of s i and r i are stored in the input array s i during the executions, for i = 1, 2. This gives the following instructions:
Notably, (9) and (11) have the same variables on both sides, and thus the in-place transformation can be applied recursively. For (10) and (12), the in-place algorithm is presented later (see Algorithm 3 and Algorithm 5).
The family of proposed in-place transformations is obtained by letting k = 1, 2, . . . , n−1 in (6). It can be verified that the LS algorithm is a special case of this family when k = 1 or k = n − 1.
B. RECURSIVE VERSION
In this subsection, we present the recursive versions of the proposed algorithm with the cache-oblivious property. For the sake of simplicity, we assume that n is a power of two; one can generalize the algorithms to support arbitrary n. The recursive algorithm is obtained by letting k = n/2 in (6). The algorithm consists of two parts. The first part handles (10) and (12) . The details are given in Algorithm 3, which presents a cache-oblivious method to perform
Algorithm 3 is cache-oblivious, as the algorithm is recursive without containing any tuning parameters for the size of the cache. Specifically, the input vectors u and v are recursively divided into two smaller sub-vectors, which can always fit into a cache, regardless of its size.
return 4: end if 5: The input is divided into
where the sizes of A 11 , u 1 , and v 1 are n/2 × n/2, n/2, and n/2, respectively. / * Recursively calculate
For the second part, Algorithm 4 implements a method to compute
with the cache-oblivious property. In Algorithm 4, the matrix A should be preprocessed. Based on (3), the block LU decomposition of A yields four matrices, A 11 , A 12 , L 21 , and U 22 . Let
It can be seen that the size of B(A) is equivalent to the size of A, namely n×n. Algorithm 4 uses the matrix B(A), denoted by
Algorithm 4 is cache-oblivious, as the algorithm is recursive without containing any tuning parameters relating to the sizes of caches. Specifically, the input vector s is recursively divided into two smaller sub-vectors, which can always fit into a cache, regardless of its size. A more detailed discussion is presented in Appendix.
C. LOOP VERSION
Although Algorithm 3 and Algorithm 4 are cache-oblivious, neither of them is in-place, as the recursive calls require extra space in the stack. To avoid this overhead, this subsection gives loop versions of these two algorithms. By decomposing the recursive structure, it can be seen that Algorithm 3 sequentially accesses each element of A in an 
where ⊕ denotes the bit-wise exclusive-or operation. In ×86 architecture, Index( ) can be implemented in O(1) operations by the instruction PEXT given in bit manipulation instruction sets. Algorithm 5 presents a loop version of Algorithm 3. To possesses the spatial locality, we use the order shown 
end if 8: end for 9: return in (16) to store the element of A in a one-dimensional array. As Algorithm 3 is cache-oblivious, Algorithm 5 is also cacheoblivious. Furthermore, Algorithm 5 avoids the space overhead caused by recursive calls, and thus it is in-place.
As shown in Algorithm 4, the sub-matrices B 11 and B 22 are handled by Algorithm 4, and other sub-matrices B 12 and B 21 are handled by Algorithm 3 recursively. By de compositing the matrix recursively, the elements on the main diagonal of B are computed by Line 2 of Algorithm 4, and others are computed by Line 2 of Algorithm 3. Based on this observation, Algorithm 6 presents a loop version of Algorithm 4. In Algorithm 6, B was stored in the order (16) to possess the spatial locality. Notably, in Line 4 of Algorithm 6, the statement can be rewritten as
This allows us to remove the if-else statement (Line 3) in Algorithm 6, if the elements of B in the main diagonal are subtracted by one in the preprocessing stage. Algorithm 6 is also cache-oblivious and in-place.
IV. SIMULATION
To demonstrate the performance of the proposed approach, three algorithms, namely, the LS algorithm (in-place, but not cache-oblivious), Algorithm 1 (cache-oblivious, but not inplace) and Algorithm 6 (cache-oblivious and in-place), are implemented in the C programming language. Each program was compiled with GCC 8.1.0 and tested on Windows 10 with Intel i7-6700 (4 cores, 3.4 GHz) CPU and 8 GB of main memory. Each core of the CPU is equipped with a 32 KB L1 cache and a 256 KB L2 cache, and all 4 cores share an 8 MB L3 cache. The cache-line size is 64-byte. Each element is stored by the float data type, that requires four bytes to store a value.
The source data in Figure 2 comprised a 2 19 -element float array (2 MB) filled with random numbers. In the simulation, the source data were divided into non-overlapping blocks of size n, and each block was converted by the in-place transformation. As Algorithm 6 was designed only for the case where n is a power of two, the simulation considers n ∈ {2 i } 10 i=1 . As Algorithm 1 was designed for matrix multiplications, the simulation used m = n, p = 1. Figure 2 shows the throughputs of LS algorithm, Algorithm 1 and Algorithm 6. The L1, L2, and L3 CPU cache sizes are labeled Note that the figure does not include the pre-processing time to convert the transformation matrices.
When n ≥ 64, the size of matrix A exceeds the size of L1 cache, and thus the cache-oblivious algorithms perform better. In contrast, when n is small, the performance is dominated by other factors, such as pipeline optimizations and branch predictions. This causes that, when n ≤ 32, LS algorithm is faster than Algorithm 1. Further, Figure 2 shows that, the gap between Algorithm 1 (not in-place) and Algorithm 6 (in-place) becomes smaller, when n increases. This shows that, the performance gain due to the property in-place is not obvious for large n. In particular, note that the throughput constantly degrades in Figure 2 . The main reason is that the number of multiplications is proportional to the size of transform.
In Figure 2 , Algorithm 6 is always faster than LS algorithm and Algorithm 1. In particular, Algorithm 6 is around 1.5 times faster than LS algorithm in average. Figure 3 presents the space consumption of LS algorithm, Algorithm 1 and Algorithm 6. In Figure 3 , the size of transform n is 64. As Algorithm 1 is not in-place, Algorithm 1 requires more space than other two algorithms.
V. DISCUSSION A. LINEAR TRANSFORMATION FOR NON-SQUARE TRANSFORMATION MATRICES
Based on the proposed approach, this subsection discusses the in-place linear transformation (1) possessing cache-oblivious properties, for m = n.
When m > n, the result vector of the transformation is longer than the input vector. In this case, the input vector is overwritten by the first n symbols of the result vector. First, A is divided into two matrices
where the size of A is n × n, and the size of A is (m − n) × n. Then the transformation can be written as
The computation can be divided into two steps:
where s denotes the vector that stores the remaining m − n result symbols. (19) can be computed by Algorithm 5 with v = 0, and (20) can be computed by Algorithm 6. Note that the space complexity of this transformation is m + 1. When m < n, the result vector of the transformation is shorter than the input vector. In this case, the first m symbols of the input vector are overwritten by the result vector. First, A is divided into two matrices
where the size of A is m×m, and the size of A is m×(n−m). Then the transformation can be written as
Clearly, (22) can be computed by Algorithm 6, and (23) can be computed by Algorithm 5. Note that the space complexity of this transformation is n + 1.
B. IN-PLACE LINEAR TRANSFORM BASED ON BLOCK LDU DECOMPOSITION
This subsection presents the in-place linear transforms based on block LDU decomposition [14] , [15] . Given an n × n matrix A, defined as in (2), the block LDU decomposition is defined as
where
For LDU decomposition, the transformation is written as
This gives the following instructions:
The corresponding algorithms are omitted, as it is not difficult to obtain them based on the new formulas (27), (28), (29), and (30). In particular, one can see that (29) and (30) are equivalent to (11) and (12), respectively. However, the formulas in Section III-A perform the self-recursion in the first step (9) , whereas the new formulas perform it in the second step (28).
VI. APPLICATION A. LINEAR BLOCK CODE
A linear block code is a family of error-correcting codes that encode data in blocks, such that the sum of any two codewords is also a codeword. Many linear block codes, such as Reed-Solomon codes [16] , Hamming codes, and Reed-Muller codes, have been used in practical applications.
Generally, the encoding and decoding of a linear block code can be described by a matrix-vector product. In certain applications, the encoding and decoding of block codes are implemented over general-purpose CPUs, and thus the proposed approaches can be applied to this situation. For example, Intel Intelligent Storage Acceleration Library (ISA-L) [17] and Jerasure [18] gave software implementations of erasure codes for storage systems over x86 processors. As shown in Figure 2 , the proposed algorithm can reduce the encoding/decoding time of linear block codes by decreasing the number of cache misses. In addition, the proposed algorithm can decrease the space overhead than other cache-oblivious algorithms like Algorithm 1.
The encoding of a block code can be described as
where G is an n×k generator matrix, x denotes the k-element message vector, and y denotes the n-element codeword vector. As (31) meets the condition m > n in Section V-A, it can be computed by the proposed algorithm to overwrite the message vector x with the codeword y. This can reduce the VOLUME 7, 2019 number of CPU registers required for the encoding of block codes.
If the first k columns of G form an identity matrix, the first k codeword symbols are the message symbols, and the code is called systematic. In this case, we only need to calculate the parity symbols. The following describes the in-place transformation of r = n − k parity symbols using at most max{k, r} + 1 variables. When k > r, the r parity symbols overwrite the first r symbols of x, meeting the condition m < n in Section V-A. When k < r, the first k parity symbols overwrite x, meeting the condition m > n in Section V-A.
The algorithm for erasure decoding consists of two steps. The first step is to compute an r-element syndrome vector s. In the second step, the r erased symbols are computed by
where y is an r-element vector storing the erasures in y, and H is an r × r matrix determined by the locations of erasures. The first step is the same as that of the encoding algorithm.
In the second step, as s can be discarded after obtaining y , (32) can be computed by Algorithm 4.
B. NEURAL NETWORK
Deep neural networks (DNNs) have been widely used in many areas, including computer vision [19] , speech recognition [20] , and natural language processing [21] . Cache-oblivious and in-place properties could be an important feature for neural networks, as the large weights of DNNs are considered a critical problem on resource-constrained hardwares, such as mobile devices. The space requirement of DNN models, in particular, has been considered a critical problem in recent years [22] . This subsection discusses a potential application of the proposed method for DNN models in the testing stage. Generally, the process of testing a deep learning model can be written as a product of a sequence of matrices. The fully connected layer is the most storage-intensive layer in the DNN architecture. A fully connected layer can be described as
where W ∈ R m×n is the weight matrix between this fully connected layer (with m neurons) and the previous layer (with n neurons), θ is the bias vector, and ψ(·) is the activation function. Then, (33) can be written as y = ψ(z + θ ), where
By applying the proposed method to (34), the input x is overwritten by the output z. Thus, the fully connected layer can be computed in an in-place manner.
VII. CONCLUSION
In this paper, we proposed a family of in-place linear transformations based on block LU decompositions, and showed that a subset of this family of transformations possesses the cache-oblivious property. Loop versions of the proposed algorithms were given to avoid the extra space requirements arising from recursive calls. Further, the number of cache misses was discussed, and approaches for non-square matrices were given. Possible applications to linear block codes and neural networks were also discussed. The simulation shows that the proposed transformation algorithm is around 1.5 times faster than LS algorithm and requires less memory space than Algorithm 1.
APPENDIX NUMBER OF CACHE MISSES
This section discusses the number of cache misses of the LS algorithm and the proposed approach for an ideal-cache model, as shown in Figure 1 . Assuming that the coefficients related to the transformation matrix A are always in the cache, we only consider the cache misses on the input vector. This assumption is valid when the transformation reads data from a stream. In this case, the coefficients related to the transformation are usually in the cache, and the input data should be loaded into the cache before performing the calculations. For the LS algorithm in Section II-C, we consider the computation (5), where both matrices L and U are stored in row-major order. In this case, each element of the result requires O(1+n/B) memory transfers. Thus, the LS algorithm incurs O(n + n 2 /B) cache misses.
In the proposed approach, the number of cache misses in computing (13) The number of cache misses in computing (14) is denoted by T (n), where n is the size of s. In the proposed approach, the transformation of size n is converted into four sub-transformations of size n/2, in which (9) and (11) are computed recursively, and (10) and (12) are computed via (13) . Thus, the recurrence is given by
and the solution is T (n) = O(n 2 /(MB)).
As Algorithm 1 is a cache-oblivious matrix multiplication algorithm. When m = n, p = 1, the output of Algorithm 1 is equivalent to the output of Algorithm 3. In this case, they have the same asymptotically cache complexity. However, Algorithm 1 is not in-place, and the total space overhead is at least n 2 + 2n, where n is the size of v.
