Abstract-The study of data-parallel domain re-organization and thread-mapping techniques are relevant topics as they can increase the efficiency of GPU computations on spatial discrete domains with non-box-shaped geometry. In this work we study the potential benefits of applying a succinct data re-organization of a tetrahedral data-parallel domain of size O(n 3 ) combined with an efficient block-space GPU map of the
I. INTRODUCTION
GPU computing has proven to be both a practical tool and a well established research area for computer science [7] , [9] , [11] , mostly because there are several parallel computing issues that get magnified when handling thousands of processors in parallel and become critical for achieving high efficiency. One of these problems is related to how memory can be accessed in parallel as it is no longer possible to assume an exclusive data path for each processor, but instead one must assume that one memory access can provide several consecutive words of data to a group of threads. A more recent problematic is related to the challenge of given a data-parallel problem, find a thread map that uses a space of computation that is close to the optimal one, where optimal is defined as the space of computation that achieves maximum parallelism with all threads doing useful work. This problem typically arises when handling fine grained data-parallel problems defined on a complex spatial domain, i.e., one that is different from the boxshape for the corresponding spatial dimensions of the problem.
Since the introduction of general purpose GPU programming tools such as CUDA [10] and OpenCL [4] , the problems recently described have gained mayor importance as they indeed appear in the programming model and have an impact on the performance of the GPU. Although these problems exist in all complex spatial domains, in this work we are particularly interested in studying the potential benefits on orthogonal tetrahedron domains (see Figure 1) as they can be found in nearest neighbors simulation problems such as the spin lattice Monte Carlo simulation [8] extended to tetrahedral periodic domains as well as in all-with-all-with-all scenarios such as triplet-interaction n-body problems [5] . In order to design more efficient solutions for orthogonal tetrahedron domains, it is necessary to address the following issues: 1) A typical linear memory organization of data elements in a 3D discrete tetrahedron in a major depthrow order, i.e., z → y → x, leads to a non-linear pattern of distances between nearest neighbors. This aspect produces a negative impact on performance as coalesced memory accesses become less frequent. 2) There is a stage in the GPU computing pipeline where the space of computation is mapped to the problem domain. This map can be defined as a function f (x) :
.., x k ) of the grid to a unique q-dimensional point of the problem domain. When the problem domain is simple in shape, e.g., a box, the canonical GPU map f (x) = x becomes the optimal and simplest one. For the case of 3D discrete triangular domains, up to O(n 3 ) threads may become unnecessary with this approach as shown in Figure 2 , therefore it is of interest to find a more efficient approach that can map threads according to a 3D triangular distribution.
In this work we analyze the potential performance benefits 978-1-5090-1633-4/16/$31.00 c 2016 IEEE of applying a re-organization of the problem domain combined with a block-space map that preserves thread locality. The approach taken for the data re-organization is based on the idea of a succinct blocked structure, where the diagonal plane is filled with extra data to align with the blocks. For the mapping, the approach is based on a linear enumeration of thread-blocks in the tetrahedron, for which a third order equation can be solved to obtain a unique correspondence in three-dimensional space.
The structure of the paper is as follows: Section II presents the main background regarding parallel GPU computation and its programming model. Related work is described in Section III and in Section IV we present the analysis of the potential benefits. Section V discusses the main results.
II. BASIC CONCEPTS
In the CUDA GPU programming model there is a hierarchy of three constructs 1 that are defined for the execution of a highly parallel algorithm; (1) thread, (2) block and (3) grid (See Figure 3) . Threads are the smallest elements and they are in charge of executing the instructions of a GPU kernel. A block is an intermediate structure that contains a set of threads organized in an Euclidean space. Blocks provide fast shared memory access as well as synchronization for all of its threads. The grid is the largest construct of the GPU and it is in charge of keeping all blocks spatially organized. These three constructs play an important role when mapping the execution resources to the problem domain as well as for the memory accesses.
When launching a GPU computation, a definition of a computation space must be passed as parameter to the kernel function call. The relevant information is (1) the number of threads in each dimension of a block and (2) the number 1 OpenCL chooses different names for these constructs; (1) work-element, (2) work-group and (3) work-space, respectively. of blocks in each dimension. These two sets of parameters together form the computation space. Typically, the number of threads in a block is a constant value |B| = ρ x ×ρ y ×ρ z where ρ x , ρ y , ρ z are arbitrarily chosen. An important aspect for any block is to allow warps to behave in an optimal way regarding execution and memory accesses.
A warp is the smallest group of threads that is executed in parallel using a lock-step mode. In the present, a warp has a size of ω = 32 threads. If threads inside a warp branch into different instruction paths, then parallelism suffers a performance penalty as the branches are executed sequentially for that warp. For the case of memory accesses, a warp can access ω words of 1, 2, 4, 8, or 16 bytes in one transaction if the warp's data is consecutive and aligned to 32, 64, 128 or 256 bytes, respectively. Such scenario is typically known as coalesced memory and it is a desired feature as memory bandwidth can then reach its peak performance. On the other hand, if a warp's memory access has strides, misalignments, offsets or complex access patterns, then multiple memory transactions are issued, producing a penalty on memory bandwidth. In triangular domains, data suffers from repeated misalignments, leading to a strong non-coalesced memory scenario.
Having the block definition ready, one usually builds a computation space of the form |C| = n x /ρ x × n y /ρ y × n z /ρ z for the problem domain of linear sizes n x , n y , n z . This way of building the computation space is well known and widely used as it uses a minimal cost linear map f (x) = x which is optimal for many problems that use vectors and matrices. This approach is referred as the bounding-box map and while it is efficient for box-shaped domains, it is not for triangular structures since an important part of the volume of computation space becomes unnecessary and must be discarded at runtime via kernel code, producing a performance penalty. Relevant works on these two problems are described in the next section.
III. RELATED WORK
Wu et. al. proved that the problem of data re-organization for parallel computing is NP-Complete in its general form [13] , nonetheless the authors describe efficient approaches based on data replication, padding and sharing and indicate which ones fit better for certain problem categories. Chen et. al. propose a general optimization technique for data-parallel problems with indirect memory accesses [1] , by viewing the problem as a sparse matrix computation. Yavors'kii and Weigel identified that tiled computations, such as the ones found in spin systems, are greatly improved by a blocked data re-organization [14] .
Regarding thread mapping techniques, Jung et. al. [3] proposed packed data structures for representing triangular and symmetric matrices with applications to LU and Cholesky decomposition [2] . The strategy is based on building a rectangular box strategy for accessing and storing a triangular matrix (upper or lower). Ries et. al. contributed with a parallel GPU method for the triangular matrix inversion [12] . The authors identify that the space of computation indeed can be improved by using a recursive partition of the grid, based on a divide and conquer strategy. Navarro and Hitschfeld studied the benefits of block-space thread mapping for 2D triangular domains [6] and found approximately 20% of improvement for 2D triangular problems such as collision detection and the Euclidean distance Matrix.
The study of discrete 3D triangular structures is an interesting category of problem with a variety of applications, from spin lattice simulation, special triplet collision and 3D Euclidean distance matrices. For such applications, a data reorganization technique and block-space thread mapping can provide a substantial performance improvement.
IV. ANALYSIS OF A BLOCK-SPACE APPROACH
We consider the orthogonal 3D tetrahedron that is defined by n triangular structures stacked and aligned at the main corner (see Figure 1) , where the i-th triangle contains T 2D i = i(i + 1)/2 elements. Therefore the total number of elements for the full structure is
which corresponds to the tetrahedral number defined as
The following analysis combines a simple yet effective succinct data re-organization scheme (sub-section IV-A) and a map of the form g(λ) : N → N 3 (sub-section IV-B), both working in block-space.
A. Succinct block re-organization scheme
Let M be the GPU memory space with an alignment of ϕ bytes and ω the size of a warp of threads (typically ω = 32). An analysis on the access patterns of warps is sufficient to model, in great part, the efficiency of GPU memory accesses on the tetrahedron structure. A warp may access memory with an offset of Δ bytes from the alignment point and a stride of s bytes. High memory bandwidth is achieved when the total bytes of the ω accesses are consecutive (s = 0), have no offset (Δ = 0) and match the ϕ bytes of the alignment in M . When such conditions are satisfied, the ω accesses cost only one memory transaction.
The problem with a linear organization of the tetrahedron in M is that it produces a non-linear pattern for the distances among data elements. Given an alignment of ϕ = 2 k bytes, with k ∈ N, and the linear memory for a triangular structure T 2D n , as shown in Figure 4 . The number of aligned rows can be expressed as
where the number of aligned words becomes the sum of the first, last and some internal rows that follow a pattern of multiples of ϕ. The whole summation can be expressed and simplified into
which leads to a power-law relation between the size of the layer and the alignment. The number of aligned warps is obtained by dividing the number of words by ω
where the fraction of aligned warps in a T 2D n structure becomes
For an alignment of ϕ = 2 k=7 = 128 bytes, which is a common case where each one of the 32 threads of a warp accesses a single float (4 bytes) with s = 0, the total percentage of aligned warps would be no greater than F ϕ=128 = 0.0078125 ≈ 0.78%. Considering that the tetrahedron corresponds to stacked layers of size 1, 2, . . . , n where all of them present the same behavior, if not more complex, it is expected that the average number of transactions per warp would be at least
where τ is a cost defined for an unaligned memory access. In the most advantageous scenario, the cost would be at least one extra transaction, i.e., τ = 2.
Note: the rest of the paper will work with a block of dimensions ρ = ρ x = ρ y = ρ z in order to simplify notation as the results are not limited to this assumption.
A succinct block-based re-organization of data becomes an attractive optimization to be considered as it provides full alignment with data (see Figure 5 right). At a coarse level, the structure can be represented by blocks of data linearly organized in M . At a fine-grained level, each block has constant size of Θ(ρ 3 ) with a local linear organization and with at least ρ x = ϕ to match the alignment. For the elements of the diagonal plane, blocks are padded and filled with dummy data to preserve memory alignment for the rest of the structure. With this modification, the new size of the diagonal blocks becomes O(n 2 ρ 3 ) with ρ ∈ O(1), making a final succinct blocked structure of asymptotic size O(n 3 ) + o(n 3 ) with all of its warps having full alignment, i.e., F = 1 which leads to C (n, ω, τ ) = 1. Finally, the potential improvement factor for the succinct block re-organization is
Considering that in practice F can be as low as below 1% for the tetrahedron case, it is highly convenient to re-organize the data of a tetrahedron into a succinct block-based one, specially considering that the potential improvement could be at least double memory performance since τ ≥ 2. Although some extra dummy data and work is introduced, such region grows as
, making the approach convenient as n and ϕ grow.
B. Block-based thread map
The use of a box-shaped grid to map threads on a 3D domain is a standard approach used in GPU computing and the natural one provided by the computing model since it is effective and efficient for many data-parallel problems. However the strategy presents some inefficiencies when dealing with non box-shaped domains such as the tetrahedron, since the number of unnecessary threads is in the order of O(n 3 ). A more efficient approach can be formulated by considering how block indices can map onto the tetrahedron. More precisely, it is possible to use a map of the form g(λ) : N → N 3 that uses a reduced number of blocks compared to the bounding-box approach, and maps them directly to the tetrahedral structure without loss of parallelism (see Figure 6 ). The approach takes advantage of the fact that when using a linear enumeration of blocks on the tetrahedron, the linear index λ of the first element of a 2D triangular layer corresponds to the tetrahedral number T z of its z-coordinate. Based on this fact, the rest of the data elements that reside in the same layer must follow the following
Combining the expression with the tetrahedral number definition leads to the expression
Therefore, given the linear location λ of a block, its z coordinate in the tetrahedron is obtained by solving the equation
and extracting the integer part of the root
Once the value z = v is computed, the two-dimensional λ linear coordinate is computed as
where T z = z(z + 1)(z + 2)/6 is the tetrahedral number for the recently computed z value. With λ already computed, the x and y values of the block are computed using the triangular map proposed by Navarro and Hitschfeld [6] for 2D triangular domains. Combining all three sub-results, map g(λ) becomes
where T 2D y is the triangular number for y.
The block linear size is defined as ρ = ϕ to match the data re-organization scheme of the last section. Additionally the blocks in computation space can be organized on a cubic grid of side 3 
√
T n in order to balance the number of elements on each dimension, producing n 2 ρ 3 unnecessary threads which correspond to the succinct data.
The potential improvement factor of the block-space map with respect to the box strategy can be expressed as 
where α is the cost of computing the block coordinate using the box approach, while γ is the cost mapping blocks in the tetrahedral map. In the infinite limit of n, the potential improvement becomes
and tells that in theory the tetrahedral map could be up to 6× more efficient, however such improvement can only be possible if γ ∼ α.
V. DISCUSSION
The optimization techniques analyzed in this work can offer significant improvements that are worth considering for future GPU computations on special spin systems, cellular automata, n-body problems and any other problem for which a tetrahedral domain is required. In theory, it is possible to extract up to 2× more performance from a simple succinct data re-organization and be up to 6× more efficient in terms of computation space; all this when compared to the bounding-box approach. It is important to clarify that the effective improvement observed in practice will strongly depend on how much overhead is necessarily introduced when re-organizing data as well as how fast can the cubic and square root computations be done in practice, i.e., how close is γ to α. As a future work, it will be interesting to consider technical optimizations for such challenges in the GPU architecture and hopefully approach to this potential performance benefits.
