Abstract-The main contribution of this paper is to introduce two parallel memory machines, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). Unlike well-studied theoretical parallel computing models such as the PRAM, these parallel memory machines are practical and capture the essential features of the memory access of NVIDIA GPUs. As a first step in the development of algorithmic techniques on the DMM and the UMM, we first evaluate the computing time of the contiguous access and the stride access to the memory on these models. We then go on to present parallel algorithms that transpose a two dimensional array on these models. Finally, we show that, for any given permutation, the data in an array can be moved along that permutation both on the DMM and on the UMM. Since the computing time of our permutation algorithms on the DMM and the UMM matches the sum of the lower bounds obtained from the memory bandwidth limitation and the latency overhead, they are optimal from the theoretical point of view.
I. INTRODUCTION

A. Background
The research on parallel algorithms has a long history of more than 40 years. Sequential algorithms have been developed mostly on the RAM (Random Access Machine) [1]. In contrast, since there are a variety of ways to connect processors and memories, many parallel computing models have been presented and many parallel algorithmic techniques have been developed for them. The most well-studied parallel computing model is the PRAM (Parallel Random Access Machine) [2], [3], [4], which consists of processors and a shared memory. Each processor of the PRAM can access any address of the shared memory in a time unit. The PRAM is a good parallel computing model in the sense that the parallelism of each problem can be revealed by the performance of parallel algorithms on the PRAM. However, since the PRAM requires a shared memory that can be accessed by all processors at the same time, it is not feasible.
The GPU (Graphics Processing Unit) is a specialized circuit designed to accelerate computation for building and manipulating images [5], [6], [7], [8]. The latest GPUs are designed for general-purpose computing and can perform computation in applications traditionally handled by the CPU. Hence, GPUs have recently attracted the attention of many application developers [5], [9]. NVIDIA provides a parallel computing architecture called CUDA (Compute Unified Device Architecture) [10], the computing engine for NVIDIA GPUs. CUDA gives developers access to the virtual instruction set and the memory of the parallel computational elements of NVIDIA GPUs. In many cases, GPUs are more efficient than multicore processors [11], since they have hundreds of processor cores.
CUDA uses two types of memories of NVIDIA GPUs: the global memory and the shared memory [10]. The global memory is implemented as an off-chip DRAM and has a large capacity, say, 1.5-4 Gbytes, but its access latency is very long. The shared memory is an extremely fast on-chip memory with a much smaller capacity, say, 16-64 Kbytes. The efficient usage of the global memory and the shared memory is a key for CUDA developers to accelerate applications using GPUs. In particular, we need to consider the coalescing of the global memory access and the bank conflicts of the shared memory access [6], [11], [12]. To maximize the bandwidth between the GPU and the DRAM chips, consecutive addresses of the global memory must be accessed at the same time. Thus, threads should perform coalesced accesses when they access the global memory.
The address space of the shared memory is mapped onto several physical memory banks. If two or more processor cores access the same memory bank at the same time, the access requests are processed sequentially. Hence, to maximize the memory access performance, processor cores should access distinct memory banks to avoid bank conflicts.
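As a concrete illustration, the following Python sketch (our own simplified model of the serialization rule, not NVIDIA's exact hardware behavior) counts how many sequential steps one warp access needs on a banked memory:

```python
# A minimal sketch: the cost of one warp access to a banked memory is the
# maximum number of requests that fall into the same bank, because
# conflicting requests are serialized.
from collections import Counter

def warp_access_cycles(addresses, num_banks=32):
    """Return the serialization factor of one warp's memory access."""
    return max(Counter(addr % num_banks for addr in addresses).values())

# 32 threads reading 32 consecutive words hit every bank once: 1 step.
print(warp_access_cycles(range(32)))              # 1 (conflict-free)
# 32 threads reading with stride 32 all hit bank 0: 32 serial steps.
print(warp_access_cycles(range(0, 32 * 32, 32)))  # 32 (fully serialized)
```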
B. Our Contribution: Introduction to the Discrete Memory Machine and the Unified Memory Machine
The first contribution of this paper is to introduce simple parallel memory machine models that capture the essential features of the coalescing of the global memory access and the bank conflicts of the shared memory access. More specifically, we present two models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM), which reflect the essential features of the shared memory and the global memory of NVIDIA GPUs.
The architectures of the DMM and the UMM are illustrated in Figure 1. In both architectures, the processing elements (PEs) are connected to the memory banks (MBs) through the memory management unit (MMU). A single address space of the memory is mapped to the MBs in an interleaved way such that the data of address $i$ is stored in the $(i \bmod w)$-th bank, where $w$ is the number of MBs. The main difference between the two architectures is the connection of the address line between the MMU and the MBs, which can transfer an address value. In the DMM, a separate address line connects the MMU to each MB, while in the UMM a single address line from the MMU is connected to all the MBs. Hence, in the UMM, the same address value is broadcast to every MB, and the same address of the MBs is accessed in each time unit. On the other hand, different addresses of the MBs can be accessed in the DMM. Since the memory access of the UMM is more restricted than that of the DMM, the UMM is less powerful than the DMM.

The performance of algorithms on the PRAM is usually evaluated using two parameters: the size $n$ of the problem and the number $p$ of processors. For example, it is well known that the sum of $n$ numbers can be computed in $O(n/p + \log p)$ time on the PRAM [2]. We will use two additional parameters, the width $w$ and the latency $l$ of the memory, when we evaluate the performance of algorithms on the DMM and on the UMM. The width $w$ is the number of memory banks, and the latency $l$ is the number of time units needed to complete a memory access. Hence, the performance of algorithms on the DMM and the UMM is evaluated as a function of $n$ (the size of a problem), $p$ (the number of processors), $w$ (the width of the memory), and $l$ (the latency of the memory). In typical NVIDIA GPUs, the width $w$ of the global and the shared memory is 16 or 32, and the latency $l$ of the global memory is 400-800 clock cycles.
We also introduce the bandwidth-limited PRAM (BPRAM). In the BPRAM of width $w$, any $w$ of the $p$ processors can access the memory in a time unit. Clearly, the BPRAM is less powerful than the PRAM and more powerful than the DMM and the UMM. Unlike the DMM and the UMM, the BPRAM has no restriction on the addresses of the memory access, and its memory access latency is 1. We use the BPRAM to evaluate the quality of algorithms on the DMM and the UMM: if the computing time of an algorithm for some problem on the DMM or the UMM is almost the same as that on the BPRAM, we can say that the algorithm on the DMM or the UMM is close to optimal.
Please note that the DMM and the UMM are theoretical models of parallel computation that capture the essential features of the global memory and the shared memory of NVIDIA GPUs. NVIDIA GPUs have other features, such as the hierarchical grid/block/thread architecture and the cache of the global memory. However, if these aspects were incorporated into our theoretical models, the models would become complicated and would need more parameters, and the development of algorithms on such complicated models may involve too many non-essential and tedious optimizations. Thus, we have introduced two simple parallel models, the DMM and the UMM, which focus on the memory access to the global memory and the shared memory of NVIDIA GPUs.
In [13], the authors have presented a GPU memory model and a cache-efficient FFT. However, their model focuses on the cache mechanism and ignores the coalescing and the bank conflict. Also, in [14], acceleration techniques for GPUs have been discussed. Although the limited bandwidth of the global memory is taken into account there, the details of the memory architecture are not considered. As far as we know, this paper is the first work that introduces simple theoretical parallel computing models for GPUs. We believe that the development of algorithms on these models is useful for investigating algorithmic techniques for GPUs.
C. Our Contribution: Fundamental Algorithms on the DMM and the UMM
The second contribution of this paper is to evaluate the performance of two memory access methods, the contiguous access and the stride access, on the DMM and the UMM. The reader should refer to Figure 2, which illustrates these two access methods. We will show that the contiguous access to an array of size $n$ can be done in $O(n/w + nl/p)$ time units on the DMM and the UMM, while both the contiguous access and the stride access can be done in $O(n/w)$ time units on the BPRAM. Thus, the contiguous access on the DMM and the UMM is optimal when $p \ge wl$, because $n/w \ge nl/p$ in this case. Further, we will show that the stride access can be much slower on both models: in the worst case, the $w$ memory cells accessed by a warp all fall into the same bank on the DMM, or into $w$ distinct address groups on the UMM, so every warp access is serialized.

The third contribution of this paper is to show algorithms for transposing a two dimensional array and for performing an arbitrary permutation of an array on the DMM and the UMM. We first show that the DMM and the UMM can transpose a two dimensional array of size $\sqrt{n} \times \sqrt{n}$ in $O(n/w + nl/p)$ time units.
We generalize this result to permute the data in an array. Suppose that a permutation of an array of size $n$ is given. The goal is to move the data stored in the array along the given permutation. Quite surprisingly, for any given permutation of an array of size $n$, the data can be moved in $O(n/w + nl/p)$ time units both on the DMM and on the UMM.

From the results of the second and the third contributions, we have the following important observation. The factor $n/w$ in the computing time comes from the bandwidth limitation of the memory: with memory bandwidth $w$, it takes at least $n/w$ time units to access all the data in an array of size $n$. The factor $nl/p$ comes from the latency overhead: because of the memory access latency $l$, a processor cannot send a new access request for $l$ time units, so each processor accesses the memory at most once in $l$ time units, and the $p$ processors send at most $p/l$ access requests per time unit. Hence, $nl/p$ time units are necessary to access all the elements of an array of size $n$. Further, to hide the latency overhead factor $nl/p$ behind the bandwidth limitation factor $n/w$, the number $p$ of processors must be at least $wl$. We can confirm this fact from a different point of view. We can think of the memory access requests as being stored in a pipeline buffer of size $l$ for each memory bank. Since we have $w$ memory banks, we have $wl$ pipeline registers for memory access requests in total. Since each processor can have at most one outstanding request in the $wl$ pipeline registers, $p \ge wl$ must be satisfied to fill the pipeline registers with memory access requests.

This paper is organized as follows. We first define the DMM and the UMM in Section II. In Section III, we evaluate the performance of the DMM and the UMM for the contiguous access and the stride access to the memory. Section IV presents algorithms that transpose a two dimensional array on the DMM and the UMM. In Section V, we show that any permutation of an array can be done efficiently on the DMM. Finally, Section VI presents a permutation algorithm on the UMM.
II. PARALLEL MEMORY MACHINES: DMM AND UMM
Let us start with defining the PRAM (Parallel Random Access Machine), the most popular shared memory parallel machine model. The PRAM consists of $p$ processors and a shared memory. The shared memory is an array of memory cells, each of which can store a word of data. Each processor can select a memory cell in the array independently and perform a read/write operation on it in a time unit. Please see [2] for the details of the PRAM.
We next introduce a memory-bandwidth-limited PRAM. We assume that at most $w$ memory cells can be read/written in a time unit. If more than $w$ memory cells are accessed, the operations are automatically serialized. More specifically, if $p$ memory cells are accessed, $w$ read/write operations are performed in each time unit, and it takes $\lceil p/w \rceil$ time units to complete the $p$ read/write operations. We call such a PRAM the Bandwidth-limited PRAM (BPRAM).
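As a minimal sketch of this cost model (our own illustration; we round the quotient up, assuming a partially filled last step still costs one time unit):

```python
import math

# BPRAM of width w: if k memory cells are accessed in one parallel step,
# they are served w at a time, so the step takes ceil(k / w) time units.
def bpram_time(k, w):
    return math.ceil(k / w)

# p = 100 processors each accessing one cell on a BPRAM of width w = 16.
print(bpram_time(100, 16))  # 7 time units
```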
We next introduce the Discrete Memory Machine (DMM) of width $w$ and latency $l$. Let $m[i]$ ($i \ge 0$) denote the memory cell of address $i$ in the memory, and let $B(j) = \{m[j], m[j+w], m[j+2w], \ldots\}$ ($0 \le j \le w-1$) denote the $j$-th memory bank of the memory.
Clearly, a memory cell $m[i]$ is in the $(i \bmod w)$-th memory bank. We assume that memory cells in different banks can be accessed in the same time unit, but no two memory cells in the same bank can be accessed in the same time unit. Also, we assume that $l$ time units are necessary to complete an access request, and that continuous requests are processed in a pipeline fashion through the memory management unit. Thus, it takes $k + l - 1$ time units to complete $k$ continuous access requests to a particular bank. We assume that the $p$ processors are partitioned into $p/w$ groups of $w$ processors called warps. More specifically, the $p$ processors $P(0), P(1), \ldots, P(p-1)$ are partitioned into the warps $W(0), W(1), \ldots, W(p/w - 1)$ such that $W(q) = \{P(qw), P(qw+1), \ldots, P((q+1)w - 1)\}$ ($0 \le q \le p/w - 1$).
Warps are activated for memory access in turn, and the $w$ processors in a warp try to access the memory at the same time. In other words, $W(0), W(1), \ldots, W(p/w - 1)$ are activated in a round-robin manner if at least one processor in a warp requests memory access. If no processor in a warp needs memory access, the warp is not activated and is skipped. When $W(q)$ is activated, the $w$ processors in $W(q)$ send memory access requests, one request per processor, to the memory. We also assume that a processor cannot send a new memory access request until its previous memory access request is completed. Hence, if a processor sends a memory access request, it must wait at least $l$ time units before sending the next one.
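The cost model defined above can be simulated as follows (a sketch under the stated assumptions; it only charges issue slots plus one pipeline drain, so it ignores the stalls that occur when $p < wl$ and too few warps are active):

```python
from collections import Counter

# DMM cost sketch: a warp whose w requests go to distinct banks occupies
# one pipeline slot; k-fold bank conflicts occupy k slots; the pipeline
# adds l - 1 time units at the end (k serial requests to one bank finish
# in k + l - 1 time units).
def dmm_time(warp_requests, w, l):
    """warp_requests: one list of addresses per warp access,
    one address per processor of the warp."""
    slots = sum(max(Counter(a % w for a in req).values()) for req in warp_requests)
    return slots + l - 1

# Two warps of w = 4 processors with latency l = 3.
print(dmm_time([[0, 1, 2, 3], [4, 5, 6, 7]], w=4, l=3))  # 2 + 3 - 1 = 4
print(dmm_time([[0, 4, 8, 12]], w=4, l=3))               # 4 + 3 - 1 = 6
```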
Let us evaluate the time for memory access using Figure 4 on the DMM for $p = 8$, $w = 4$, and $l = 3$: the requests sent by a warp are processed in the same time unit if they go to distinct banks, requests to the same bank are serialized, and the pipeline adds $l - 1$ time units at the end.

We next introduce the Unified Memory Machine (UMM) of width $w$ and latency $l$. The single address space of the memory is partitioned into address groups $A(j) = \{m[jw], m[jw+1], \ldots, m[(j+1)w - 1]\}$ ($j \ge 0$) of $w$ addresses each. We assume that memory cells in the same address group are processed at the same time. However, if the accessed cells are in different groups, one time unit is necessary for each of the groups. Also, similarly to the DMM, the $p$ processors are partitioned into warps, and the warps access the memory in turn.
Again, let us evaluate the time for memory access using Figure 4, this time on the UMM for $p = 8$, $w = 4$, and $l = 3$.
To complete the memory access requests by $W(0)$, the accesses to address group $A(0)$ are performed first, for $m[0]$ and $m[1]$. After that, the access to $A(1)$ is performed, and then the access to $A(2)$ is performed for $m[10]$. Next, the memory access requests by $W(1)$ are processed: first, the two accesses to $A(2)$ are requested, and then the two accesses to $A(3)$. Since the five group accesses are pipelined with latency $l = 3$, all of these memory requests are processed in $5 + l - 1 = 7$ time units on the UMM.
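The corresponding sketch for the UMM charges one pipeline slot per distinct address group touched by a warp; the addresses below are chosen to match the group pattern of the example above (two cells of $A(0)$, one of $A(1)$, and one of $A(2)$ for $W(0)$; two cells each of $A(2)$ and $A(3)$ for $W(1)$):

```python
# UMM cost sketch: one pipeline slot per distinct address group
# A(j) = {jw, ..., (j+1)w - 1} accessed by each warp.
def umm_time(warp_requests, w, l):
    slots = sum(len({a // w for a in req}) for req in warp_requests)
    return slots + l - 1

# W(0) touches groups 0, 1, 2 and W(1) touches groups 2, 3, so the
# requests occupy 3 + 2 = 5 slots: (3 + 2) + 3 - 1 = 7 time units.
print(umm_time([[0, 1, 5, 10], [8, 9, 13, 14]], w=4, l=3))  # 7
```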
III. SEQUENTIAL MEMORY ACCESS OPERATIONS
We begin with simple operations to evaluate the potential of the DMM and the UMM. Let $p$ and $w$ be the number of processors and the width of the machines. We assume that an array $m$ of size $n$ is arranged in the memory, and let $m[i]$ ($0 \le i \le n-1$) denote the $i$-th word of the memory. We assume that $n \ge p \ge w$. We consider two access operations such that each of the $p$ processors reads $s = n/p$ memory cells out of the $n$ memory cells. Again, the reader should refer to Figure 2. In the contiguous access, the first $p$ memory cells are accessed by the $p$ processors, then the second $p$ memory cells, and so on; in this way, all $n$ memory cells are accessed in turn. In the stride access, the memory is partitioned into $p$ groups of $n/p$ consecutive memory cells each; a processor is assigned to each group and accesses the memory cells of its group sequentially. The contiguous access and the stride access are written as follows.
[Contiguous Access]
for $t \gets 0$ to $n/p - 1$ do
  for $i \gets 0$ to $p - 1$ do in parallel: $P(i)$ accesses $m[pt + i]$

[Stride Access]
for $t \gets 0$ to $n/p - 1$ do
  for $i \gets 0$ to $p - 1$ do in parallel: $P(i)$ accesses $m[i \cdot (n/p) + t]$
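Using the dmm_time and umm_time sketches from Section II, the two access patterns can be compared directly (again, only issue slots plus the pipeline drain are charged):

```python
# Warp request streams for the two access patterns
# (s = n/p cells per processor, warps of w processors each).
def contiguous_warps(n, p, w):
    return [[p * t + q * w + i for i in range(w)]
            for t in range(n // p) for q in range(p // w)]

def stride_warps(n, p, w):
    s = n // p
    return [[(q * w + i) * s + t for i in range(w)]
            for t in range(s) for q in range(p // w)]

n, p, w, l = 64, 8, 4, 3
print(dmm_time(contiguous_warps(n, p, w), w, l))  # 18 = n/w + l - 1
print(umm_time(contiguous_warps(n, p, w), w, l))  # 18
# Here s = 8 is a multiple of w, so each warp hits one bank on the DMM
# and w distinct groups on the UMM: every warp access is serialized.
print(dmm_time(stride_warps(n, p, w), w, l))      # 66 = n + l - 1
print(umm_time(stride_warps(n, p, w), w, l))      # 66
```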
In the contiguous access, the $w$ processors in each warp access memory cells in different memory banks on the DMM and in a single address group on the UMM. Hence, the memory access by a warp occupies one pipeline stage and is completed in $l$ time units, and the memory accesses by the warps are processed at a rate of one warp per time unit. Therefore, the contiguous access to an array of size $n$ can be done in $O(n/w + nl/p)$ time units on both models. In the stride access, on the other hand, the cells accessed by a warp may fall into the same bank on the DMM, or into $w$ distinct address groups on the UMM, in which case every warp access is serialized. Table I summarizes the running time for the contiguous access and the stride access.

IV. TRANSPOSE OF A TWO DIMENSIONAL ARRAY

Let us start with a straightforward transpose algorithm using the contiguous access and the stride access. The following algorithm transposes a two dimensional array $a$ of size $\sqrt{n} \times \sqrt{n}$.
We first show an efficient transpose algorithm on the DMM. The key idea is to access the array in a diagonal fashion. We use a two dimensional array $b$ of size $\sqrt{n} \times \sqrt{n}$ as a work space.
[Transpose by the diagonal access on the DMM]
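A minimal Python sketch of the diagonal scheme follows (our own reconstruction from the description; the exact index expressions are assumptions). In step $j$, processor $i$ handles column $(i + j) \bmod \sqrt{n}$, so the cells touched by a warp always lie in $w$ distinct banks, both when reading $a$ and when writing $b$:

```python
# Diagonal transpose sketch: assumes the row length r is a multiple of
# the width w, so cells on one diagonal map to w distinct banks.
def transpose_diagonal(a):
    r = len(a)                          # r = sqrt(n)
    b = [[None] * r for _ in range(r)]  # work space
    for j in range(r):                  # one diagonal per step
        for i in range(r):              # done in parallel, w cells per warp
            v = (i + j) % r
            b[v][i] = a[i][v]           # read bank (i+j) mod w, write bank i mod w
    return b

a = [[4 * i + j for j in range(4)] for i in range(4)]
assert transpose_diagonal(a) == [[a[i][j] for i in range(4)] for j in range(4)]
```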
The reader should refer to the corresponding figure for an illustration of the diagonal access. Since the $w$ cells read by a warp lie on a diagonal, they are in $w$ distinct banks, and the same holds for the cells written to the work space; hence the transpose can be done in $O(n/w + nl/p)$ time units on the DMM.

Next, we will show that the transpose of a two dimensional array can also be done in $O(n/w + nl/p)$ time units on the UMM if every processor has a local memory that can store $w$ words. As a preliminary step, we will show that the UMM can transpose a two dimensional array of size $w \times w$ in $O(w + l)$ time units using $w$ processors, each having a local memory of size $w$. Let $l_i$ denote the local memory of $P(i)$; each of $l_i[0], l_i[1], \ldots, l_i[w-1]$ can store a word of data.
[Transpose by the rotating technique on the UMM]
Let $a(i, j)$ denote the value stored in $a[i][j]$ initially. The reader should refer to Figure 5 for an illustration of how these values are transposed.
Let us confirm that the algorithm above correctly transposes the two dimensional array $a$; in other words, we will show that, when the algorithm terminates, $a[i][j]$ stores $a(j, i)$. It should be clear that, after the reading phase, the value stored in $l_i[t]$ is $a(t, (t+i) \bmod w)$. In the writing phase, the values are written back row by row so that $a[i][j]$ receives $a(j, i)$, and each step again accesses a single address group. Hence we have Lemma 3: the transpose of a two dimensional array of size $w \times w$ can be done in $O(w + w^2 l/p)$ time units on the UMM with $p$ processors, each having a local memory of $w$ words.
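The rotating technique can be sketched in Python as follows (our own reconstruction; the index expressions are assumptions chosen to satisfy the invariant $l_i[t] = a(t, (t+i) \bmod w)$ above). Every step of either phase touches exactly one row of $a$, that is, one address group, so each step costs one slot per warp on the UMM:

```python
# Rotating transpose sketch for a w x w array on the UMM.
def transpose_rotating(a):
    w = len(a)
    local = [[None] * w for _ in range(w)]   # local[i] = local memory of P(i)
    for t in range(w):                       # read phase: row t is one group
        for i in range(w):                   # in parallel
            local[i][t] = a[t][(t + i) % w]  # local[i][t] = a(t, (t+i) mod w)
    for t in range(w):                       # write phase: row t is one group
        for i in range(w):                   # in parallel
            j = (t - i) % w                  # chosen so that (j + i) mod w == t
            a[t][j] = local[i][j]            # hence a[t][j] = a(j, t)
    return a

a = [[4 * i + j for j in range(4)] for i in range(4)]
orig = [row[:] for row in a]
assert transpose_rotating(a) == [[orig[i][j] for i in range(4)] for j in range(4)]
```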
V. PERMUTATION OF AN ARRAY ON THE DMM
In Section IV, we have shown that the transpose of a two dimensional array can be done in $O(n/w + nl/p)$ time units on the DMM and the UMM. The main purpose of this section is to show algorithms that perform any permutation of an array. Since a transpose is a particular permutation, the results of this section are a generalization of those presented in Section IV.
(Figure 6: Transposing on the UMM.)

Let $a$ be a one dimensional array of size $n$, and let $P$ be a permutation of $(0, 1, \ldots, n-1)$. The goal of the permutation of an array is to move the word of data stored in $a[i]$ to $a[P(i)]$ for every $i$ ($0 \le i \le n-1$). Note that the permutation is given offline. We will show that, for any given permutation $P$, the permutation of an array can be done efficiently on the DMM and the UMM.
Let us start with a permutation algorithm on the PRAM and the BPRAM. Suppose we need to perform a permutation of an array $a$ of size $n$ along a given permutation $P$. We use an array $b$ of size $n$ as a work space.
[Straightforward permutation algorithm]
for $i \gets 0$ to $n - 1$ do in parallel: $b[P(i)] \gets a[i]$
It should be clear that the above algorithm runs in $O(n/p)$ time on the PRAM and in $O(n/w)$ time on the BPRAM. The straightforward algorithm also works correctly on the DMM and the UMM, but it can take much more time: in the worst case it takes $O(n)$ time, namely when all the writing operations to $b[P(i)]$ fall into the same bank on the DMM or into different address groups on the UMM. We will show that any permutation of an array of size $n$ can be done in $O(n/w + nl/p)$ time units on the DMM and the UMM.
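For reference, a sketch of the straightforward algorithm (our own Python rendering):

```python
# Straightforward permutation: processor i copies a[i] to b[P[i]].  On
# the DMM/UMM these writes may all collide in one bank or scatter over
# distinct address groups, which is exactly the O(n) worst case above.
def permute_straightforward(a, P):
    b = [None] * len(a)
    for i in range(len(a)):  # done in parallel by the p processors
        b[P[i]] = a[i]
    return b

print(permute_straightforward(list("abcd"), [2, 0, 3, 1]))  # ['b', 'd', 'a', 'c']
```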
If we can schedule the reading/writing operations of the permutation such that the $w$ processors in a warp always read from distinct banks and write to distinct banks on the DMM, the permutation can be done efficiently. For such a schedule, we use an important graph theoretic result [15], [16]:
Theorem 4 (König):
A regular bipartite graph of degree $d$ is $d$-edge-colorable. Figure 7 illustrates an example of a regular bipartite graph of degree 4 painted with 4 colors: each edge is assigned one of the 4 colors such that no two edges sharing a node have the same color. The reader should refer to [15], [16] for the proof of Theorem 4. Using such an edge coloring, we can perform a bank conflict-free permutation: we regard each word $i$ ($0 \le i \le n-1$) as an edge of a bipartite graph connecting its source bank $i \bmod w$ and its destination bank $P(i) \bmod w$. This bipartite graph is regular of degree $n/w$, so its edges can be painted with $n/w$ colors, and the $w$ words of each color are read from $w$ distinct banks and written to $w$ distinct banks. We use an array $b$ of size $n$ as a work space. The details are spelled out as follows.
[Permutation algorithm on the DMM]
The first for-loop copies all the words of $a$ to $b$, and the second for-loop copies all the words of $b$ back to $a$ according to $P$. Thus, the word of data stored in $a[i]$ ($0 \le i \le n-1$) is moved to $a[P(i)]$, and the permutation is performed correctly.
We will show that this permutation algorithm terminates in $O(n/w + nl/p)$ time units. It should be clear that, in the first for-loop, the $w$ words of $a$ read by the $w$ processors of a warp are in different banks. Similarly, the $w$ words of $b$ written by a warp are in different banks. Thus, the operation by a warp is bank conflict-free and can be done in $O(l)$ time, and the $p$ processors perform it in $O(p/w + l)$ time. Hence, the first for-loop can be done in $\frac{n}{p} \cdot O(p/w + l) = O(n/w + nl/p)$ time.
From Lemma 5, the second for-loop is also bank conflict-free. Thus we have Theorem 6: any permutation of an array of size $n$ can be done in $O(n/w + nl/p)$ time units on the DMM.
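To make the schedule concrete, the following Python sketch (our own construction following Theorem 4, not code from the paper) regards each word $i$ as an edge from source bank $i \bmod w$ to destination bank $P(i) \bmod w$, colors the resulting $(n/w)$-regular bipartite multigraph by repeatedly peeling off a perfect matching (Kuhn's augmenting paths), and checks that every color class is bank conflict-free:

```python
def edge_color(edges, w, d):
    """Color a d-regular bipartite multigraph on w + w nodes with d colors
    by removing one perfect matching per color (Koenig's theorem)."""
    color, remaining = [None] * len(edges), set(range(len(edges)))
    for c in range(d):
        match_v = [None] * w                 # right node -> matched edge index
        def augment(u, seen):
            for e in list(remaining):
                if edges[e][0] == u and edges[e][1] not in seen:
                    v = edges[e][1]
                    seen.add(v)
                    if match_v[v] is None or augment(edges[match_v[v]][0], seen):
                        match_v[v] = e
                        return True
            return False
        for u in range(w):                   # regularity keeps the matching perfect
            augment(u, set())
        for e in match_v:
            color[e] = c
            remaining.discard(e)
    return color

def dmm_permutation_rounds(P, w):
    n, d = len(P), len(P) // w
    color = edge_color([(i % w, P[i] % w) for i in range(n)], w, d)
    rounds = [[i for i in range(n) if color[i] == c] for c in range(d)]
    for r in rounds:                         # reads and writes are both conflict-free
        assert len({i % w for i in r}) == len({P[i] % w for i in r}) == w
    return rounds

# n = 8 words, w = 4 banks: two conflict-free rounds of four words each.
print(dmm_permutation_rounds([4, 5, 6, 7, 0, 1, 2, 3], 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```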
VI. PERMUTATION OF AN ARRAY ON THE UMM
The main purpose of this section is to show a permutation algorithm on the UMM. Our permutation algorithm uses the transpose algorithm on the UMM presented in Section IV.
We start with a small array of size $w^2$, which we regard as a two dimensional array $a$ of size $w \times w$. For a permutation $P$ on $a$, let $G = (U, V, E)$ be a bipartite graph such that $U$ and $V$ are sets of $w$ nodes corresponding to the rows of $a$, and $E$ is a set of $w^2$ edges such that an edge connecting node $i$ in $U$ and node $j$ in $V$ corresponds to a word of data moved from the $i$-th row to the $j$-th row of $a$ by the permutation $P$. For example, if a word of data in the 1st row is moved to the 4th row by the permutation, an edge is drawn from node 1 in $U$ to node 4 in $V$. Clearly, $G$ is a regular bipartite graph of degree $w$. From Theorem 4, this bipartite graph can be painted with $w$ colors such that no two edges painted by the same color share a node.
Suppose that, for a given permutation $P$ on a two dimensional array $a$ of size $w \times w$, we have painted the edges of $E$ with $w$ colors in this way.
Let $a(i, j)$ denote the word of data stored in $a[i][j]$ initially. We can think of each $a(i, j)$ as being assigned one of the $w$ colors, namely the color of its edge. It should be clear that the $w$ words of data in each row are painted with $w$ distinct colors, and that the $w$ words destined for the same row are also painted with $w$ distinct colors. The key idea of the permutation is to move the words of data according to the painted colors. The details of the permutation routing on the UMM are spelled out as follows.
[Permutation on the UMM]
Step 1: Move the words of data such that the word with color $k$ is moved to the $k$-th column of the same row, by a row-wise permutation.

Step 2: Transpose the two dimensional array $a$.

Step 3: Move the words of data such that a word destined for the $i$-th row is stored in the $i$-th column of the same row, by a row-wise permutation.

Step 4: Transpose the two dimensional array $a$.

Step 5: Move the words of data such that a word destined for the $j$-th column is stored in the $j$-th column, by a row-wise permutation.
Let us see the data movement of the above permutation algorithm for $w = 4$ using Figure 8 and confirm the correctness of the algorithm. We assume that $a[i][j]$ initially stores a word of data labeled $(x, y)$ if the data stored in $a[i][j]$ should be moved to $a[x][y]$ by the permutation, as illustrated in the figure. In the figure, the destinations are painted with four colors such that the four destinations in each row are painted with different colors, and the four destinations painted with a particular color have the four distinct row destinations $(0, *), (1, *), (2, *)$, and $(3, *)$.
After Step 1, the four destinations in the $k$-th column are painted with the $k$-th color, and thus they have four distinct row destinations. After Step 2, each row has four destinations with distinct row destinations. Hence, after Step 3, the four words to be moved to the $i$-th row are in the $i$-th column. By transposing in Step 4, every word is in its destination row. Finally, the permutation is completed by the row-wise permutation in Step 5. Clearly, Steps 1, 3, and 5, which involve row-wise permutations, can be done in $O(w + w^2 l/p)$ time units on the UMM, because each row is a single address group. From Lemma 3, the transposes performed in Steps 2 and 4 can also be done in $O(w + w^2 l/p)$ time on the UMM with each processor having a local memory of $w$ words. Thus, we have Lemma 7: any permutation of an array of size $w^2$ can be done in $O(w + w^2 l/p)$ time on the UMM with each processor having a local memory of $w$ words.
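The five steps can be traced with a small Python sketch (our own realization on lists, reusing the edge_color routine from Section V; the bookkeeping with (origin, value) pairs and the helper names are ours). The word at cell $(i, j)$ must move to cell $(\lfloor P(iw+j)/w \rfloor, P(iw+j) \bmod w)$:

```python
import random

def umm_permute(a, P, w):
    dest = lambda word: divmod(P[word[0]], w)    # word = (flat origin, value)
    grid = [[(i * w + j, a[i * w + j]) for j in range(w)] for i in range(w)]
    # Color the w-regular row-to-row bipartite graph (one edge per word).
    edges = [(i, dest(grid[i][j])[0]) for i in range(w) for j in range(w)]
    color = edge_color(edges, w, w)
    col = {grid[i][j][0]: color[i * w + j] for i in range(w) for j in range(w)}
    # Row-wise permutation: put the word with key value c into column c.
    place = lambda row, key: [next(x for x in row if key(x) == c) for c in range(w)]
    transpose = lambda g: [list(r) for r in zip(*g)]
    grid = [place(row, lambda x: col[x[0]]) for row in grid]    # Step 1: by color
    grid = transpose(grid)                                      # Step 2
    grid = [place(row, lambda x: dest(x)[0]) for row in grid]   # Step 3: by dest row
    grid = transpose(grid)                                      # Step 4
    grid = [place(row, lambda x: dest(x)[1]) for row in grid]   # Step 5: by dest column
    return [value for row in grid for (_, value) in row]

w = 4
P = random.sample(range(w * w), w * w)           # a random permutation
out = umm_permute(list(range(w * w)), P, w)
assert all(out[P[i]] == i for i in range(w * w))
```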
We go on to show a permutation algorithm for a larger array. Suppose we need to perform a permutation of an array of size $w^4$. We can consider the array as a two dimensional array of size $w^2 \times w^2$. By Theorem 4, the corresponding bipartite graph of degree $w^2$ can be painted with $w^2$ colors, and each of the three row-wise permutations of the five-step algorithm is a permutation of an array of size $w^2$, which can be performed by Lemma 7. In this way, we can confirm that any permutation can be done in $O(w^3 + w^4 l/p)$ time on the UMM using $p$ processors.
Repeating the same technique, we can obtain a permutation algorithm for an array of size $n = w^{2^m}$. In other words, the permutation of an array of size $w^{2^m}$ can be done by recursively executing the row-wise permutation three times and the transpose twice on arrays of size $w^{2^{m-1}}$. If the size $n$ of the array satisfies $n \le w^{O(1)}$, that is, $m = O(1)$, then the depth of the recursion is constant. Thus, we have Theorem 8: any permutation of an array of size $n$ can be done in $O(n/w + nl/p)$ time on the UMM with each processor having a local memory of $w$ words, provided that $n \le w^{O(1)}$.
VII. CONCLUSION
In this paper, we have introduced two parallel memory machines, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). We first evaluated the computing time of the contiguous access and the stride access to the memory on the DMM and the UMM. We then presented algorithms that transpose a two dimensional array on the DMM and the UMM. Finally, we have shown that any permutation of an array of size $n$ can be done in $O(n/w + nl/p)$ time units on the DMM and the UMM with width $w$ and latency $l$. Since the computing time involves only the bandwidth limitation factor $n/w$ and the latency overhead factor $nl/p$, the permutation algorithms are optimal.
Although the DMM and the UMM are simple, they capture the characteristics of the shared memory and the global memory of NVIDIA GPUs. Thus, these two parallel computing models are promising for developing algorithmic techniques for NVIDIA GPUs. As future work, we plan to implement on the DMM and on the UMM various parallel algorithms developed so far for the PRAM. Also, NVIDIA GPUs have a small shared memory and a large global memory. Thus, it is also interesting to consider a hybrid memory machine in which the processors are connected to a small memory of the DMM type and a large memory of the UMM type.
