Abstract-Matrix Transposition is an important linear algebra procedure that has deep impact in various computational science and engineering applications. Several factors hinder the expected performance of large matrix transpose on Graphic Processing Units (GPUs). The degradation in performance involves the memory access pattern such as coalesced access in the global memory and bank conflict in the shared memory of streaming multiprocessors within the GPU. In this paper, two matrix transpose algorithms are proposed to alleviate the aforementioned issues of ensuring coalesced access and conflict free bank access. The proposed algorithms have comparable execution times with the NVIDIA SDK bank conflict -free matrix transpose implementation. The main advantage of proposed algorithms is that they eliminate bank conflicts while allocating shared memory exactly equal to the tile size (T x T) of the problem space. However, to the best of our knowledge an extra space of Tx(T+1) needs to be allocated in the published research. We have also applied the proposed transpose algorithm to recursive gaussian implementation of NVIDIA SDK and achieved about 6% improvement in performance.
I. INTRODUCTION
GPU (Graphics Processor Unit) has been applied extensively in various scientific and engineering domains. Due to its high parallelism and efficiency, GPU was utilized to obtain high performance in many of linear algebra problems involving matrices. The matrix is a fundamental concept of linear algebra since they denote a linear relationship [1] .
In Matrix Transpose (MT) the utilization of GPU memory system is essential to ensure a good application performance which is affected by the data access patterns and the shared memory bank conflicts [2] . NVIDIA proposed an approach based on adding dump columns (padding) or rows to solve the issues of matrix divisibility or bank conflicts.
The impact of MT to improve the performance of matrix multiplication is noticeable in [1] , [3] . However, MT has also been applied for utilizing the bottleneck in transferring data between host and device memory in clusters [4] , [5] and in other architectures such as distributed memory architecture [6] . Moreover, it has been applied in enhancing the performance of deep packet inspection (a famous problem in the domain of network security) by avoiding non-coalesced memory access [7] . Volkov and Demmel [8] studied the impact of MT algorithms in matrix factorization, which is one important algorithm for solving system of linear equations.
In a GPU device, blocks of data are assigned to the processors. Furthermore, shared and global memory has different access speed and latency, so it simulates a distributed memory environment in a single machine. However, CUDA programming model simplifies the solutions to the aforementioned issues and the performance of matrix transpose is noticeably high comparing to the parallel MT algorithms for distributed memory concurrent computers [6] .
In this paper, we have designed two MT algorithms for GPU. Both algorithms ensure that all global memory (GM) accesses are coalescent and all shared memory (ShM) accesses are bank conflict free. The first algorithm copies a tile of the input matrix from GM into ShM, transposes it locally, and then writes it back into GM. The main step in this algorithm is the swapping of elements of the rows and columns in the tile. However, accessing successive elements of a column by successive threads in a half-warp when the tile size is multiple of 16 causes bank conflict. Our algorithm avoids the bank conflict by mapping the block threads to the tile elements diagonally. Even though the first algorithm reported good execution time, we found that it could be further improved by reducing shared memory accesses and distribute the work evenly between the threads within a warp. In the second algorithm, the transpose is performed as simple as reading a tile into the shared memory from the input matrix in GM and writing it back to the output matrix in GM. Since our goal is to ensure that all accesses are coalescent and bank conflict free, this goal is achieved by carrying out different mappings of threads within a block and the elements in a tile in read and write operations. This paper is organized as follows: Section II provides a brief background on GPU and CUDA. Section III presents some of the previous implementations of MT algorithms. In Section IV, an in-depth illustration and explanation of proposed algorithms has been laid out. The results are shown and discussed in Section V while we conclude our paper in Section VI.
II. GPU AND CUDA
General Purpose Graphic Processor Unit (GPGPU) gains higher attention and recognition these days as a suitable 978-1-4799-5604-3/14/$31.00 copyright 2014 IEEEplatform for engineering and science applications. However, power consumption was a major issue [9] . GPU on the other hand is capable of executing more GFLOPS than normal CPU. However, the only restriction in earlier GPU version lies in its lack of support for IEEE FP Standards [9] . GPGPU costs are rapidly decreasing with increasing speed and wider memory. Though the challenges involved in programming GPU, they still provide a conductive platform for variety of applications from linear algebra to 3D gaming. CUDA T M is "a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU)" [10] . It demands and requires a deep level of understanding of the organization of GPU memory and the execution context. CUDA requires a specific compiler to extract parallel code from sequential code and execute each of them to the corresponding device, a hybrid programming model. That is, sequential code will be executed on the CPU while the parallelized segment of the code will be interpreted on the GPU device.
III. PREVIOUS WORK
Ruetsch et al [11] discussed different aspects on how to improve the MT performance based on coalesced access to all GM, which are performed by a half-warp of threads and guarantees that all threads in a half warp access shared memory locations associated with different banks.
The first condition can be achieved by accessing the input and output matrices in the GM row-wise. The standard MT algorithm is to read input matrix row-wise and writes the elements to the output matrix column-wise. However, the writes to the output matrix cannot be coalesced, since the halfwarp threads access non-contiguous locations in the global memory.
To avoid non-coalesced accesses in writing to, a TxT matrix called tile, is allocated, where T is a multiple of 16. Each thread block reads a tile of TxT elements from the input matrix and stores them into the tile in the shared memory. The threads of a block read the elements of input matrix row-wise and write them to the shared memory matrix row-wise too. Then these threads read from the shared memory column-wise and write to the output matrix row-wise. Since shared memory locations are not necessarily be accessed continuously, the coalescent access is guaranteed. However, synchronization is necessary between the threads in write and read operations to the shared memory because the element that was written by one thread in the shared memory might be read by another thread.
The above approach suffers from bank conflict problem in accessing the shared memory. When the tile size is multiple of 16, the locations of the same columns are assigned to the same shared memory banks, and therefore, all read operations from the shared memory executed by a half-warp thread involve bank conflict.
To avoid the bank conflicts, the tile of size Tx(T+1) is allocated. The extra column is added only to assign a location read by a half-warp thread to different banks in the shared memory and therefore eliminates the bank conflicts in reading from the tile as shown in Figure 1 . We note that the bank conflict could be avoided without wasting shared memory by mapping the half-warp threads to different columns in the tile in expense of simple calculations. Since the matrix transpose is a memory-bound application, the computation overheads of these calculations could be hidden and does not significantly affect the execution time of the kernel.
IV. PROPOSED MT ALGORITHMS
In this section we explain our MT algorithms.
A. The First Algorithm: Selective Copying Algorithm 1 shows the pseudo code of our first implementation for matrix transpose. There are three main steps: reading from the input matrix in the global memory to the tile in the shared memory (Lines 1-3, 7), transposing the tile (Lines 9-16) and writing the tile to the proper location in the output matrix in the global memory (Lines 4-6, 18).
The first step is the same step performed by the Nvidia algorithm: the threads of the block read the elements from the input matrix in the global memory row-wise and write them to the tile in the shared memory row-wise (Line 7). As discussed earlier, these accesses are coalescent and bank conflict free.
In the second step, the elements of the tile are transposed inside the tile. To ensure that all accesses to the tile in order to transpose the elements are bank conflict free, the block threads are mapped to the tile elements diagonally (Lines 9, 10). The mapping between the thread (x, y) and the element (x', y') that should be accessed by this thread within the tile of dimension T by using the following address transformation: x' = (x + y) mod T and y' = x.
The transpose of the tile elements is based on swapping the elements in the upper half of the tile with the elements in the lower half. The elements lies in the diagonal are not changed (Line 11). The swapping between upper half and lower half elements are involved by the threads that are mapped to the elements of the upper half (active threads). Other threads don't perform any work in this step (idle threads).
When all elements are transposed, the last step is to write the tile back to the output matrix in the global memory (Line 18). This step is achieved by mapping threads (Lines 4-6) to the tile elements row-wise and allowing half-warp threads to read the elements row-wise from the tile and write them rowwise to the output matrix. Since both matrices are accessed row-wise, all accesses are coalescent and bank conflict free. Figure 2 shows the threads and elements mapping at each step of the first algorithm. The first algorithm eliminates bank conflict while allocating ShM space exactly equal to the tile size of TxT, while NVIDIA algorithm allocates Tx(T+1) space for a tile. However, the implementation has two drawbacks:
1) It performs many accesses to the shared memory in order to transpose the tile in the shared memory. Though all accesses are bank conflict free, accessing shared memory is counted as a computation instruction and takes execution time similar to that of computation instruction [12] . The evaluation shows that when the thread block size is relatively small, this computation might not be hidden by the data transfer, which affects the overall execution time of the kernel. 2) All accesses mentioned above are performed by less than a half of the block threads, while most of the threads stay idle. This configuration has two side effects. The first is the active threads take more time to complete the work. The second is the effect of branching, since the branch instruction divides the warp threads into two groups that should be scheduled independently [12] .
To overcome these drawbacks, we designed the second algorithm as shown in Listing 2. This algorithm achieves the advantage of the first algorithm while all drawbacks are avoided. The algorithm performs two main steps: reading from the input matrix in the global memory to the tile in the shared memory (Lines 1-3, 7-9) and writing the tile elements to the proper location in the output matrix in the global memory (Lines 4-6, 11-13).
In the first step, the threads of the block read the elements from the input matrix in the global memory row-wise and write them to the tile in the shared memory diagonally (Lines 7-9). The mapping between the thread (x, y) and the element (x', y') that should be written by this thread within the tile of dimension T is the same mapping used in the first algorithm.
Since all read operations from the global memory are row-wise, the memory accesses are coalesced. And by using diagonal mapping in writing to the shared memory, we ensure that all elements accessed by half-warp threads are in different bank. Therefore, all shared memory accesses are bank conflict free.
Before writing to the transposed tile to the output matrix, we should note that by writing the tile elements diagonally, the elements stored in each row of the tile are exactly the elements of the column that are read from the input matrix, starting from the diagonal element. Therefore, to complete the transpose we need to map the threads to the tile elements properly (Lines 11, 12), read the elements from the tile row wise and write them to the output matrix row-wise. The address transformation between the thread (x, y) and the element (x', y') is defined by x' = (x + y) mod T and y' = y.
Since reads and writes operations are row-wise, they are coalesced and bank conflict free. Figure 3 shows the threads and elements mapping at each step of the second algorithm.
Algorithm 2 Parallel Write ourTranspose2(odata, idata, width, height)
Parameters: odata = global memory array to store the results idata = global memory array to store the input width = width of the matrices height = height of the matrices Constants and Keywords: TILE DIM = Dimension of the tile to load into shared memory. The allocated tile will be TILE DIM x TILE DIM in size blockIdx.x = x index of current block blockIdx.y = y index of current block threadIdx.x = x index of current thread within a block threadIdx.y = y index of current thread within a block
Algorithm:
1: xIndex = blockIdx.x * TILE DIM + threadIdx.x 2: yIndex = blockIdx.y * TILE DIM + threadIdx.y 3: index in = xIndex + yIndex * width 4: xIndex = blockIdx.y * TILE DIM + threadIdx.x 5: yIndex = blockIdx.x * TILE DIM + threadIdx.x 6: index out = xIndex + yIndex * height 7: x bar = (threadIdx.x + threadIdx.y) % TILE DIM 8: y bar = threadIdx. The proposed approach to access elements in ShM diagonally in MT algorithms 1 and 2 is general. We have defined a C macro as follows to be used in other applications to avoid shared memory bank conflicts.
#define tile(y,x) tile[x % TILE DIM][(x+y) % TILE DIM]
The programmer just needs to use the above macro to access shared memory variable instead of direct accessing as array.
V. EXPERIMENTAL RESULTS
To evaluate the performance of our algorithms, we have implemented them, in addition to the NVIDIA algorithm, as CUDA kernels. We have used the execution time elapsed by the kernels as the performance parameter. Table I shows the main features of the GPU on which the implementations were run. The three implementations that we used for evaluation comparisons are:
We run these implementations several times with the following configurations:
1) Matrices of 3072 x 3072 floats that is the maximum multiple of 32 and 48 that could be allocated. 2) We have used three tile sizes: 16x16, 32x32 and 48x48.
The last is the maximum tile size that can be allocated.
3) The implementations have been run with all possible thread block sizes shown in Table II . 4) Number of iterations = 100. The results show that the three implementations almost have the same time to carry out MT of size 3072x3072, except for some configurations. But our implementations (Selective Copying and Parallel Write) uses less shared memory per block than the NVIDIA implementation as shown in Table  III and gives more reduction in terms of percentage in case of small tile size. When the thread block size is one-dimensional block (16x1, 32x1, 48x1), the three implementations give the maximum value for the execution time. This is because in one-dimensional block, each thread transposes more memory elements than the threads in other configurations. As an example, in the tile of 48x48 and the block of 48x1, each thread transposes 48 elements in the input matrix, while in the block of 48x16 each thread transposes 3 elements only.
The results also show that when the thread block size is one-dimensional block (16x1, 32x1, 48x1), Selective Copying gives the largest value among the three implementations. This is because in this implementation the active threads in the block execute many memory accesses in order to transpose the tile in the shared memory. When the thread block size is small, these accesses cannot be hidden by global memory accesses, and therefore their overheads appear in the execution time.
We note also in Figure 5 that for the block size of 32x32, the execution time is high and Selective Copying is the highest. This is due to the fact that when the block size of 32x32 has 1024 threads (32x32=1024) and this number with the other configurations in the implementations is beyond the occupancy of the GPU as reported by NVIDIA CUDA GPU Occupancy Calculator [13] . When the kernel has low occupancy, the performance is always reduced [14] . Padding column in tile as in NVIDIA implementation to avoid bank conflict applies only on the multiple of 32 tile sizes. Our proposed approach to access shared memory tile diagonally avoids bank conflict even with non-multiple of 32 tile sizes. We have analyzed both the kernels using NVIDIA visual profiler event for shared bank conflicts. Table IV shows the profiler results for bank conflicts with different tile sizes.
We have also applied Parallel Write algorithm in the implementation of recursive gaussian filter provided in NVIDIA SDK. The code sample implements a Gaussian blur using Deriche's recursive method [15] . It processes columns of the image parallel. While, to make the coalesced read for the row pass, it transpose the image and then transpose it back again afterwards. So, it requires to perform transpose twice in each step. The results in figures 9 and 10 shows upto 6% improvement in performance with Parallel Write algorithm over the NVIDIA SDK Transpose. 
VI. CONCLUSIONS
In this work we have studied different implementations of matrix transpose that illustrate how to achieve efficient use of GPU memories and data management. We proposed two implementations for matrix transpose that perform efficient memory accesses to global and shared memory by ensuring all accesses to the global memory are coalesced and all accesses to the shared memory are bank conflict free.
The main advantage of our algorithms is that they eliminate bank conflicts while allocating exactly the tile size memory space. However, in the literature [11] they allocate Tx(T+1) space for TxT tile. Also, padding column in shared memory tile to avoid bank conflicts applies only on the multiple of 32 tile sizes only. While our approach can also be applied on non-multiple of 32 tile sizes.
We have also applied the proposed transpose algorithm to recursive gaussian implementation of NVIDIA SDK and achieved about 6% improvement in performance.
