E cient transposition of Out-of-core matrices has been widely studied. These e orts have focused o n r educing the number of I/O operations. However, in the state-of-the-art architectures, memory-memory data transfer time and index computation time are also signi cant components of the overall time. In this paper, we propose an algorithm that considers the index computation time and the I/O time and reduces the overall execution time.
Introduction
Matrices are typically stored in row-major order. To access all the elements in a column of a matrix, a straightforward approach will access the disk M times, where M is the number of rows of the matrix. One approach t o a void this is to transpose the matrix and store the transposed matrix on the disk. The problem is of constructing a speci c permutation of a given sequence when only a part of the sequence can be kept in the main storage and operated on at the same time, while minimizing the number of I/O operations 13]. Hence, matrix transpose is a key primitive i n a w i d e v ariety o f s c i e n ti c computations. Matrix transpose is also a fundamental operation in adaptive signal processing 3, 9, 16, 2 1 , 2 2 ]. In such applications, the typical data size that is stored on the disk is of the order of TeraBytes. To store such data, high-performance computing platforms employ RAID 5] disk systems.
Two models have been widely used in the literature to abstract the behavior of disk systems: the Parallel Disk Model (PDM) 28] and the Linear Model (LM) 19]. The PDM is well suited to model I/O systems such as the RAID 5] . In PDM, the data access cost is represented as dm=(DB)e T b , where m is the data size, D is the number of disks, and T b is time to transfer a block o f d a t a ( B) b e t ween memory and disk. In the Linear Model, the cost is represented as T s + m , where T s is the startup time, m is the data size, and is the data transfer time per unit data.
In this paper, we propose an e cient algorithm for transposing large-scale matrices (out-of-core matrix transpose). A matrix of size N N initially resides on the disk. The matrix is to be transposed and stored in another array. The size of the available main memory, M, is smaller than the matrix size. Without loss of generality, w e assume that the matrices are stored in row-major order. Several researchers have studied the out-of-core matrix transpose problem. A straightforward algorithm performs matrix transpose using O(N 3=2 ) I/O operations when M = O(N). Eklundh 13] proposed an algorithm that has O(N logN) I/O complexity assuming B = N. Ari et al. 2] modi ed the algorithm in 13] to reduce the number of I/O operations at the expense of increased number of stages (passes). Floyd 14] has derived upper and lower bounds on the number of I/O operations when M = 2 B. Aggarwal et al. 1] h a ve s h o wn a lower bound on the number of I/O operations for the general case using PDM. Vitter et al. 28] proposed PDM and also provided an asymptotic lower bound for several algorithms including permutation, of which matrix transpose is a special case. Cormen et al. 8 ] s h o wed asymptotically equal lower and upper bounds for the number of I/O operation for bit-matrix-multiply/complement (BMMC) permutations. Kaushik et al. 19] reduced the number of I/O operations by 25% compared with the algorithm in 13] b y combining two read operations.
All these e orts focus on reducing the number of I/O operations only. H o wever, the main costs in the state-of-the-art architectures consist of not only the time for I/O but also the memory-memory data transfer time and the index computation time.
Our out-of-core matrix transpose algorithm reduces the total execution time by reducing both the number of I/O operations and the index computation time. The reduction in the number of I/O operations is achieved by using e cient data read and write schedules and by balancing the number of read and write operations. We analyze the complexity of our algorithm using the well-known Parallel Disk Model (PDM) and the Linear Model (LM).
To eliminate the index computation cost, our algorithm partitions the available memory into two bu ers (read and write bu ers). The expensive in-processor permutation is replaced by data collect operations. The write operations and collect operations are scheduled e ciently to reduce the overall time. The size of each bu er is determined by t h e a vailable memory size and the factorization of N. By using these techniques, the index computation is replaced by inexpensive do-loops (see Section 5.2) .
We implemented the algorithm on a SGI R12000, a Sun Enterprise based on UltraSPARC-III and a Pentium III based platform at the University of Southern California. The experiments were carried out for available main memory sizes ranging from 16 MB to 64 MB and data sizes ranging from 128 MB to 8 GB. The results show that our algorithm reduces the overall execution time by u p t o 5 0 % .
The organization of the remainder of this paper is as follows. In Section 2, two w ell-known disk models are brie y described. In Section 3, previous algorithms for large scale matrix transposition are discussed. Overview of our algorithm is presented in Section 4. Our algorithm is described in detail in Section 5. Experimental results as well as comparisons with previous algorithms are presented in Section 6. Section 7 discusses a further extension of our algorithm and Section 8 concludes the paper.
Disk Models
State-of-the-art disk systems employ sophisticated hardware and perform several optimizations to reduce the I/O time. For example, many of these systems employ a disk bu er, a library bu er, and a controller, and perform access reordering. Each of the above system features need several parameters to describe its behavior and such a model will be too complex to be useful.
Two models of disk systems that capture the key characteristics of such systems have been widely used in the literature. One of them is the Parallel Disk Model (PDM) 28]. It models the low-level behavior of disk systems using few parameters: block s i z e ( B) which is the size of data transferred between disk and memory in one I/O operation, number of disks (D), the number of processors (P ), the size of memory (M), and the amount of data transferred (m). In this paper, P is assumed to be one. The total time for data transfer between disk and memory can be represented as dm=(DB)e T b , w h e r e T b is the time to transfer a block o f d a t a b e t ween memory and disk.
In another model 19], two costs are considered: startup time and data transfer time. The startup time is a xed time for setting up the data transfer between memory and disk. The rest of the cost is proportional to the amount of data transferred. Thus, the access cost can be represented as T s + m , w h e r e T s is the startup time, m is the data size, and is the time to transfer unit data. Typically, T s is in the order of msec, and is tens of nsec/byte. 
Previous Algorithms
In this section, for the sake of completeness, two w ell-known algorithms are brie y described. These two algorithms provide the best performance over many other algorithms. The algorithm in 1] has been designed using the PDM and the algorithm in 19] has been designed using the LM. In Section 4, our algorithm is compared with these algorithms.
Matrix Transpose
In the matrix transpose problem, an input matrix of size N N initially resides on the disk, where N = Q t;1 s=0 r s , for some t > 0, where r s is a positive i n teger. If N is a prime number, we can add dummy r o ws to make N to be non-prime. The input matrix is to be transposed and stored in another array. M, the size of the available memory is smaller than the input matrix size. Throughout this paper, to illustrate the key ideas, we use square matrices. However, the algorithms can be easily extended to rectangular matrices as well, using the technique in 6, 19] . Also, for the sake of simplicity, throughout the paper, we assume that the result of all arithmetic operations are integers.
Aggarwal's Algorithm
Aggarwal et al. 1] showed a lower bound on the number of I/O operations to perform matrix transpose. In this algorithm, as many blocks as can t in the available memory are read into memory. Then, the data are permuted and written onto the disk. The number of I/O operations for this approach i s s h o wn in Table 3 B . The asymptotic complexity o f the algorithm is N 2 lg(N 2 ), when M and B are constants. Note that the asymptotic complexity of the algorithm is analoguous to N lg N bound for sorting on the RAM model 7, 26] , where N is the number of data elements.
In this algorithm, r s is restricted to be M=B 0 s < t . This is because if r s > M = B , a b l o c k Each stage consists of N 2 =M steps. In each step, M=N rows are read into memory and permutation of the data is performed in the memory. In stage s, 0 s < t , the data are written back o n to the disk in r s write operations. Thus, the number of read (write) operations in each s t e p i s 1 ( r s ). The total number of I/O operations using the LM and the PDM are shown in Table 3 (See Page 18).
Although the number of I/O operations and the time to transfer data between memory and disk are considered, the total number of read and write operations are not optimized. Also, the index computation time is not minimized. The pseudo-code for Kaushik et al.'s algorithm is given in Figure 2. 
Overview of our Algorithm
We present a n o verview of our approach in this section. Section 5 provides the details of our approach a n d analysis using PDM and LM.
One of the key features of our algorithm is the reduction in the total number of I/O operations, which i s achieved by means of e cient read and write schedules and corresponding data permutation in the memory. For example, in 19], there are three I/O operations (one read operation and two write operations) in each step when M = 2 N and B = N. Our algorithm requires only a single write operation in each step compared with two write operations as in the case of the algorithms in 1, 19] . The concept of a step is explained in detail in Section 5. Since our algorithm consists of the same number of steps as in the previous algorithms, there is a considerable reduction in the total number of write operations. Step 1
Step 2
Step 3 Another technique used in our algorithm is the balancing of the numbers of read and write operations. In balancing the numbers of read and write operations, the key idea is that the total number of I/O operations can be reduced by reducing the number of write operations at the expense of an increased number of read operations. For example, when r s = 32, in each step, the number of read (write) operations in 19] i s 1 (32). In our algorithm, we increase the number of read operations to 9 in order to reduce the number of write operations to 9. This results in a 45% reduction in the total number of I/O operations. Note that a straightforward method to balance the numbers of read and write operations reduces the total number of I/O operations by only one (see Section 5) . The data that are written onto the disk in z s write operations in the previous algorithm is written in one write operation in our algorithm. Thus, there is a reduction in the number of write operations by a factor of z s , where z s is an integer 2. In a subsequent r e a d o p e r a t i o n (that is in the next stage) the data is read in z s read operations. By choosing an optimal value of z s , the total number of I/O operations is reduced.
In the previous algorithms 1, 19] , the entire available memory is used for reading data from disk. Even though this approach maximizes the memory utilization, it results in excessive index computation cost (index computation refers to computing the source or destination addresses of each datum during data permutation in the memory). To eliminate the index computation cost, the available memory is partitioned into two di erent-sized bu ers (read and write bu ers). Instead of performing a permutation before every write operation, only the data needed for each write operation is moved into the write bu er. This is denoted as a collect operation. The stride of the data access in the collect operation is constant. Thus, it can be performed using inexpensive do-loops.
If the same schedule as in the previous algorithms is used (collect operations followed by write operations), then the size of the write bu er must be M=2. However, in our algorithm, the utilization of the write bu er is increased using our schedule which results in a smaller write bu er. In our schedule, a write operation follows each collect operation. Since the read bu er size is less than the available memory size, the number of I/O operations is increased slightly. H o wever, as shown in Section 6, the total execution time is reduced due to reduction in the index computation time.
Details of the Algorithm
Additional details of our algorithm as well as the analysis are presented in this section. Section 5.1 describes our technique to reduce the number of I/O operations. The overall algorithm using the read and write schedules is described rst. Later, the read and write schedules are explained for the four di erent cases. In Section 5.2, our method to reduce the index computation time is explained. The algorithm consists of t stages. In each step, the data are rst read into memory using read schedule RS(s u) (Line 3). The read schedule (and the write schedule later) speci es data to be read (written) in each step during a stage. The data that are read into memory are permuted based on the column number in the input matrix (Line 4). The data is then written onto the disk using the write schedule W S (s u) (Line 5).
The RS, W S , and the permutation of data in memory are explained in detail for each of the following four cases. Case 1 and Case 2 pertain to the scenarios where as much data as the memory size can be read from the disk or written onto the disk in one I/O operation (i.e., B M). As discussed in Case 1, our analysis shows that e cient data arrangement reduces the number of I/O operations by approximately a factor of (r s + 1 ) =r s compared with the previous algorithms. Note that reducing the index computation time (discussed in Section 5.2) further improves the performance in all the cases. In the following, s, 0 s < t , refers to a stage and u, 0 u < N 2 =M, refers to a step.
De ne R s , 0 s < t , as follows. Case 1: (B M and r s < 8) The key idea in all the cases is to arrange the data on the disk using RS and W Sschedules. The RS(s u) i s a s e t c o n taining the row indices of the data to be read from the disk and W S (s u) is a set containing the row indices of the data to be written onto the disk, during Step u in Stage s.
RS and W Sfor Case 1 are speci ed by the algorithms in Figure 5 and Figure 7 respectively. Examples of RS and W Sare shown in Figure 6 and Figure 8 respectively.
If contiguous rows are read or written in a step, it can be done in one I/O operation when LM is used, since in LM, any amount o f c o n tiguous data are read or written in one I/O operation. This is true in Case 1 and Case 2. For example, in Figure 6 , in Step 0 of Stage 0, row 0 and row 1, which are contiguous, are read in one read operation. But in Step 2 of Stage 2, row 0 a n d r o w 3 are not contiguous. Thus, they are read separately, i.e., using two read operations.
The data in the memory are permuted as in Figure 9 . X(i) denotes the data at the address i 0 i < M , in the memory. Note that in Figure 9 , some details are omitted for simplicity. F or example, the permutations during Step 0 of Stage 1 need additional computations to identify the destination addresses. However, such additional computations are required for only a small number of permutations. Since the details are not critical for understanding the main idea of the algorithm and can be easily derived, they are omitted for simplicity.
The algorithm is illustrated in Figure 10 . The number in each small square denotes a data element. Notice that the data are in row-major order in the initial matrix in Stage 0 and in column-major order in the right-hand-side matrix in Stage 2. The arrows indicate read and write operations. As discussed above, if two r o ws are adjacent, the data are read or written in one I/O operation.
The number of I/O operations is approximately N 2 M P t;1 s=0 r s (See Theorem 1).
A comparison of the numbers of I/O operations per step in the algorithm in 19] and our algorithm is shown in Table 1 . In Case 1, Linear Model is used because M amount of data on the disk is loaded into memory in one I/O operation. Consequently, Aggarwal's algorithm cannot be used as its basic unit of data transfer is equal to or larger than the size of memory in this particular case. However, Linear Model still works well in this case.
Several balancing the numbers of read and write operations. In Kaushik et al.'s algorithm, the di erence between the numbers of read and write operations is large. In each step of Stage s, the number of read operations i s 1 a n d t h e n umber of write operations is r s . In our algorithm, we d e v elop a technique that reduces the number of write operations at the expense of an increased number of read operations. Since the absolute descriptions are similar to those in case 1, we focus on the key ideas here. Details are given in the Appendix.
Note that a straightforward method reduces the number of write operations to r s ;y s 0 s t;2, where y s is the number of the new read operations. Then, the total number of I/O operations is (r s ;y s )+y s = r s .
The total number of I/O operations is reduced by one. In our algorithm, we decrease the numb e r o f w r i t e operations to approximately r s =z s , where z s > 0 i s a n i n teger. The optimal value of z s is chosen later.
De ne z ;1 = z t;1 = 1. The algorithm for this case is shown in Figure 14 Step 0
Step 1
Step 3 algorithms, each superblock is written in one disk write operation. In our algorithm, z s superblocks are written on the disk in one write operation. The writing schedule, W S , i s s h o wn in Figure 17 (See Appendix). Thus, the number of write operations is reduced by a factor of z s . W riting z s superblocks in one I/O operation causes the data to be scattered as described below. Note that during the next stage (Stage (s + 1 ) ) , r s superblocks are read in a step. Among the r s superblocks that are written in a step during Stage s, only one of them is read in a step during Stage (s + 1). Thus, r s superblocks read in a step during Stage (s + 1) consist of superblocks each o f w h i c h is one of the superblocks written in di erent steps during Stage s.
If one superblock is written onto the disk in a write operation as in the previous algorithms, then a superblock can be written to any location on the disk. This enables the superblocks that are read in a step during the next stage to be read in one I/O operation. However, since z s superblocks are written in one write operation, it is impossible to write the z s superblocks in a way t o a l l o w reading of these superblocks in one read operation during the next stage i.e., the superblocks are \scattered".
The scattered writing causes two b l o c ks that are read in a step during the next stage to be separated by at least z s ; 1 b l o c ks. In each read operation during the next stage, in order to read the superblocks that are separated by at least z s ; 1 superblocks, we need to perform z s read operations. The read schedule, RS, is shown in Figure 15 (See Appendix). In addition to these, if z s 6 = 2, one read and one write operations are performed (Line 11 and Line 15). Some examples of RS and W Sare shown in Figure 16 and Figure 18 respectively (in the Appendix). The number of rows in W Sis greater than M=N since there are z s read operations in a step. Note that, in this example, the number of I/O operations is not reduced since r 0 = r 1 = 4 < 8. The number of I/O operations is reduced when r s 8, as shown in Table 2 .
The optimal value of z s is obtained as follows. De ne g(s) as follows. The total number of I/O operations performed by the algorithm in 19] a n d b y our algorithm is compared in Table 2 . The algorithm in 1] is not compared here since it is not applicable to this case.
The data in the memory are permuted as shown in Figure 19 . The total number of I/O operations for case 2 (M B, and r s 8) is given in the following theorem.
Theorem 2 In the Linear Model, the total number of I/O operations in our algorithm is Since the number of steps is N 2 =M, the total number of I/O operations is N 2 M ( P t;2 s=0 (2 p r s + g(s)) + 2).
2
Note that the number of I/O operations can be further reduced by using the memory more e ciently than what is shown in Theorem 2. However, the technique for more e cient memory utilization makes the algorithm description and analysis complex. Thus, only the basic idea is brie y described here. In the algorithm in 19], if z s;1 is not 1, the amount of data read in each read operation is M=2. But the amount of data that can be read in the rst read operation is M. In the next read operation, the amount of data read is M ; M=z s;1 and so on. Since the data read in one read operation is always larger than or equal to M=2, this method decreases the number of read operations compared with the result stated in Theorem 2. A comparison of the algorithms with respect to the number of I/O operations is shown in Table 3 .
Reducing Index Computation Time
In the previous algorithms, the available memory is fully utilized to reduce the number of I/O operations. In other words, in a read operation, as much data as the size of the available memory are read from the disk. After reading the data, permuting the data within the memory requires destination location of each data element to be computed. This results in a large index computation time.
To reduce the total execution time, we eliminate the expensive index computation time by using the algorithm shown in Figure 11 . In our algorithm, we partition the memory into two d i e r e n t-sized bu ers: one of size M r , which is used as a read bu er and the other of size M w , which is used as a write bu er, such that M = M r + M w . The read bu er is partitioned into r s superblocks. The read bu er is used for reading data from disk. After reading the data, there are r s =z s sets of collect and write operations (see Section 4.2.2). In each collect operation, data in z s superblocks are collected into the write bu er and this operation is repeated r s =z s times (Line 5). The sizes of the write and read bu ers are determined to be Mz s =(r s + z s ) and Mr s =(r s + z s ), respectively. Figure 12 shows an example of the collect operation. In this example, N=8, M=24, r s = 2 , z s = 1 . Thus, M r = 24*2/3 = 16, and M w = 24/3 = 8. In the gure, the rst two steps during a stage are shown to illustrate the collect operation. Initially, t wo r o ws of data are loaded into the memory from the disk. Note that even though the memory can hold three rows, only two r o ws are loaded since M r =16. The remaining space (M w = 8) is utilized by the write bu er. After loading data into the memory, the data for the rst write operation (0, 2, 4, ..., 14) are collected in the write bu er. Then, the data in the write bu er is written onto the disk. Since the write bu er is empty n o w, the data for the second write operation (1, 3, 5 , ..., 15) are collected into the write bu er, which is written onto the disk after collection. This is repeated until all the data in the memory are written onto the disk. In Step 1, the data (16, 17, 18,. .....31) are loaded into the memory and the above operations are repeated.
In a collect operation, the data in z s superblocks are collected using do-loops since the data access stride Move ( iz s + j) th superblock to write bu er 7
Write data in write bu er to disk In each do-loop, the required computations are simple additions. Note that, in the previous algorithms 1, 1 9 ] , the computations to permute the data consist of both index computation and addition. In our algorithm, since only addition operations are used to collect data to the write bu er, the index computation is eliminated. The collected data in the write bu er is written onto the disk in a write operation (Line 6). Even though the number of I/O operations increases by a factor of M=M r , the total execution time reduces signi cantly as shown by our experiments.
Experimental Results
We implemented the algorithms on a SGI (R12000, 300 MHz), a Sun Enterprise system (UltraSPARC-III, 750 MHz) and a PC platform (Pentium III, 733 MHz) at the University of Southern California. For the sake of comparison, Kaushik et al.'s algorithm described in Section 3.3 was also implemented. The results are shown in Figure 13 and Table 5 through Table 7 .
In our experiments, we observed that Aggarwal et al.'s algorithm, described in Section 3.2, has the same total execution time as Kaushik et al.'s algorithm. Even though the two algorithms perform the needed permutation using di erent methods and the permuted data are di erent, the permutation times are the same. If in Stage s, 0 s < t , the block size is smaller than M=r s , then both algorithms require the same total execution time, where t is the number of stages. The total execution time is di erent for the two algorithms when the amount of data transferred in one I/O operation is smaller than B. The amount o f the data transferred in one I/O operation in our experiments ranges from 128 KBytes to 8 MBytes and the typical size of B in state-of-the-art platforms is 4 KBytes. Thus, the performance of the two algorithms is the same in our experiments. Therefore, in Figure 13 and Tables 5 through 7 , the execution time reported under the heading \Previous" refers to both these algorithms. The amount of main memory allocated to the data was varied from 16 MBytes to 64 MBytes and the data size was varied from 128 MBytes to 8 GBytes. For each parameter (memory and data size) setting, the algorithms were executed 5 times and the maximum, a verage, and minimum values were calculated. The reported times are wall-clock times and the unit is second. The speedup of our algorithm over the previous algorithms was calculated for each parameter setting. The results of our experiments are shown in Figure  13 . Additional results are shown in the Tables 5, 6 , and 7. The results show that our algorithm reduces the execution time by up to 50%.
To understand the e ect of the system parameters on the speedup, we de ne the following terms. Let T d denote disk I/O time , T m denote the memory-memory data transfer time, T c denote the index computation time, and T a denote the additional disk I/O time required by our algorithm compared with the previous algorithms. Also, we de ne R c and R a ratios as R c = T c =(T d + T m ) a n d R a = T a =(T d + T m ). Note that R a is usually much smaller than one as the sum of the disk I/O and memory-memory data transfer times is larger than the additional disk I/O time incurred due to the partitioning of memory into two bu ers. Then, the speedup of the new algorithm compared with the previous algorithm is Speedup = Td+Tm+Tc Td+Tm+Ta = 1+Rc 1+Ra 1 + R c The equation shows that the speedup is a function of the two ratios, R a and R c . Note that the parameters of the disk system, memory, and the CPU are all accomodated in the two simple ratios. And also accurate measurement of these parameters is di cult to perform. For example, if the disk access time reduces, then R c increases, and if the processor speed increases, then R c decreases. The va l u e s o f t h e R c for the systems on which experiments were performed have been measured to be 1.64 (SGI), 1.29 (SUN), and 0.66 (Pentium).
The speedup on the SGI is the highest and that on the Pentium is the lowest among the three machines as the R c of the SGI is the largest and that of the Pentium is the smallest. Current operating systems do not support le sizes larger than 2 GB. To perform experiments for the le sizes as large as 8 GB, we partitioned the le into four 2 GB les. Then, based on the address, appropriate les are accessed during the execution of the algorithm. By using this technique, we w ere able to perform experiments of data sizes larger than 2 GB for the machines considered in this paper. However, there exist slight v ariation in the obtained results. For example, the execution time obtained for 8 GB le sizes is much smaller than four times the execution time obtained for 2 GB le sizes, for some operating systems. We believe the di erence is due to the optimization of the disk system by the operating system, for multiple le accesses.
Further Extensions
Our matrix transpose algorithm can be easily extended to a parallel disk version by using the same method as in 1]. The algorithm in 1] uses a RAID-type disk system that has D disks. The disks provide D parallel accesses. By allocating N=D columns to each disk and accessing D disks simultaneously, the disk data transfer time is reduced by a factor of D.
To reduce the number of I/O operations, the algorithm in Figure 4 can be used to employ the two techniques proposed in this paper: e cient read and write schedules and balancing the numb e r o f I / O operations.
To reduce the index computation time, the algorithm in Figure 11 is used: the available memory is partitioned into two bu ers, the permutation is replaced by collect operations, and the collect operations and write operations are scheduled to maximize the utilization of the available memory.
Conclusion
In this paper, we presented an e cient algorithm for large-scale matrix transposition. Contrary to the previous works that have focused only on reducing the number of I/O operations, we identi ed the major costs in the state-of-the-art computing platforms in performing transpose, and the overall cost was reduced. The generality of the main ideas lend themselves to be applicable to other algorithms having recursive structures that operate on large data sets.
Appendix
This section contains pseudo code for the matrix transposition and read and write schedules for cases 2 and 3 of our algorithm. 
