This work studies the problem of GPU thread mapping for a Sierpiński gasket fractal embedded in a discrete Euclidean space of n × n. A block-space map λ : Z 2 E → Z 2 F is proposed, from Euclidean parallel space E to embedded fractal space F, that maps in O(log 2 log 2 (n)) time and uses no more than O(n H ) threads with H ≈ 1.58... being the Hausdorff dimension, making it parallel space efficient. When compared to a bounding-box map, λ(ω) offers a sub-exponential improvement in parallel space and a monotonically increasing speedup once n > n0. Experimental performance tests show that in practice λ(ω) can produce performance improvement at any block-size once n > n0 = 2 8 , reaching approximately 10× of speedup for n = 2 16 under typical block configurations.
I. INTRODUCTION
Fractals can be described as self-similar structures [14] where a similar 1 geometrical pattern is found at all scales. Several natural phenomena produce fractal patterns that obey a self-similar structure [13] . Phenomena such as plant and tree growth [21] , [22] , terrain formation [15] , [23] , molecular dynamics [26] , snowflake crystallization [9] , blood vessels [5] , morphological features of living organisms [27] , among many others, display a fractal design where self-similarity is a critical feature for modeling its geometrical structure.
One well known fractal is the Sierpiński Gasket, or Sierpiński Triangle, described by Waclaw Sierpiński in 1915. Despite being over than a century old, the Sierpiński gasket remains a relevant subject as it is present in different fields such as the construction of antennas [2] , [24] , cellular automata [20] , [28] , fractal molecular assembly [10] , DNA selforganization [26] , self-assembly theory [4] , [12] and phase transitions on fractal spin lattices [6] , [7] , [16] , among others, because of its special properties. In some of these applications, such as in fractal Cellular Automata and Monte Carlo simulations on spin models, there is a stage where the fractal becomes a virtual representation and computation is applied to it. In these cases, the Sierpiński gasket can be discretized and embedded into the containing n × n Euclidean domain. Is this form, the fractal is defined as a self-similar structure where level r + 1 is composed of three repetitions of level r at 1 2 the scale as shown in Figure 1 . Applications that involve graphical representations or dataparallel simulations with nearest neighbors interactions may benefit if the storage of the structure preserves spatial locality in memory space, i.e, the memory locations (x ± 1, y ± 1) define a neighborhood in the actual fractal as well. One way to achieve this is to embed the fractal in a Euclidean space of n × n as shown in Figure 2 . For applications that can operate in an embedded fractal domain, computations will usually consist of accessing all the elements of the fractal and eventually perform arithmetic/logic operations that involve the data-element itself and possibly its nearest neighbors. Eventually, when the fractal is large enough to the point of containing hundreds of thousands of elements, a sequential computation can take an excessive amount of time for the practical requirements of the field. In these situations GPU computing becomes an attractive tool for accelerating the task [19] .
For every GPU computation there is a stage where threads are mapped from parallel to data space. A map, defined as f :
GPU parallel spaces are defined as orthotopes 2 Π k ∈ P k in k = 1, 2, 3 dimensions. A known way of mapping threads is to use the bounding-box (BB) approach, that builds an orthotope Π k sufficiently large to cover the whole data space and threads are mapped using the identity f (ω) = ω. Such map is highly convenient and efficient for the class of problems where data space is also defined by an orthotope; such as vectors, tables, matrices and box-shaped volumes. But for an embedded fractal such as the Sierpiński gasket, this approach is no longer efficient in terms of parallel space as many threads fall outside the domain, introducing a performance penalty to the execution time (see Figure 3 ).
Fig. 3:
In the bounding-box approach the threads that fall outside the fractal have to be discarded at run time.
Two research questions arise from this GPU efficiency problem on the embedded Sierpiński gasket; The first question: Is there any parallel-space efficient function, namely λ(ω), that can use asymptotically the same number of threads as data elements in the fractal and map blocks 3 properly onto the embedded Sierpiński gasket? (see Figure 4 ).
It is important to note that Question 1 asks for a blockspace map and not a thread-space one. The change from thread-space to block-space allows coalesced memory to be preserved throughout the entire domain as thread organization is not compromised inside a block.
The second question relates to performance: Will the parallel-space improvement translate into a significant GPU performance improvement? The present work focuses on these two questions and provides positive answers for both of them. A dedicated analysis is devoted to show that an alternating unrolling strategy allows to define a parallel-space efficient λ(ω) that only requires O(n H ) threads, with H ≈ 1.58... being the Hausdorff dimension of the Sierpiński fractal. It addition, it is shown that by taking advantage of intra-block parallelism, λ(ω) becomes computable in O(log 2 log 2 (n)) time which is fast enough to produce monotonically increasing speedup once n > n 0 with n 0 being a constant threshold value.
The rest of the manuscript presents related work (Section II), a formal definition and analysis of λ(ω) (Section III) and Section (IV) presents performance results. A discussion and comments on future work is found in Section V.
II. RELATED WORK
The following related works can be classified into two categories; (1) studies on efficient GPU mapping for triangular domains and (2) general studies on the structure of the discrete Sierpiński fractal.
One of the first works that explored the possibilities of improving the GPU mapping and stage was the research of Jung et. al. [11] whom proposed packed data structures for representing triangular and symmetric matrices with applications to LU and Cholesky decomposition [8] . The strategy is based on building a rectangular box strategy for accessing and storing a triangular matrix (upper or lower). Data structures become practically half the size with respect to classical methods based on the full matrix. The strategy was originally intended to modify the data space (i.e., the matrix), however one can apply the concept analogously to the parallel space.
Ries et. al. contributed with a parallel GPU method for the triangular matrix inversion [25] . The authors identified that the parallel space indeed can be improved by using a recursive partition of the grid 4 , based on a divide and conquer strategy. The mapping approach takes O(log 2 (n)) with n being the linear size of the matrix. Q. Avril et. al. proposed a GPU mapping function for collision detection based on the properties of the upper-triangular map [1] . The map is a thread-space function u(x) → (a, b), where x is the linear index of a thread t x and the pair (a, b) is a unique two-dimensional coordinate in the upper triangular matrix. Since the map works in thread space, the map is accurate only in the range n ∈ [0, 3000] of linear problem size.
Navarro, Hitschfeld and Bustos have proposed a blockspace map function for 2-simplices 5 and 3-simplices [18] , [17] , based on the solution of an m order equation that is formulated from the linear enumeration of the discrete elements. The authors report performance improvement for 2-simplices, and for the 3-simplex case, the mapping technique is extended to the discrete orthogonal tetrahedron, where the parallel space usage can be 6× more efficient. However the authors clarify that it is difficult to translate such space improvement into performance improvement, as the map requires the computation of several square and cubic roots that introduce a significant amount of overhead to the process. From the point of view of datareorganization, a succinct blocked approach can be combined along with the block-space thread map, producing additional performance benefits with a sacrifice of o(n 3 ) extra memory.
Exploring the benefits of efficient GPU mapping onto the embedded Sierpiński gasket becomes an interesting topic of research since its geometry is no longer Euclidean as in the related works, but instead it is embedded in an Euclidean one. Finding a proper efficient λ(ω) would produce an asymptotic improvement in parallel space and a potential performance improvement that could eventually be exploited.
III. ANALYSIS AND FORMULATION OF λ(ω)
This Section first analyzes two important properties of the discrete embedded Sierpiński gasket, which are helpful in the formulation of the map λ(ω).
A. Analysis of the Space and Packing of F k,s n
The discrete Sierpiński gasket belongs to a category of discrete fractals where its structure can be built by replicating k instances of itself at a scale s and placed with an arbitrary spatial organization. The notation F k,s n is introduced to denote such fractal, where n ∈ N is the linear size, k ∈ N the replication factor and 0 < s < 1 ∈ R the scaling factor in terms of reduction. Since fractals have a recursive self-similar structure, their volume V(F k,s n ) may be expressed as
with V(F k,s 1 ) = 1 being the limit condition of the recursion. Given that the replication factor k is fixed, and n scales by factors of s, the volume may be simplified into
where r = log 1/s (n) is defined as the scale level.
The Sierpiński gasket is a particular case where k = 3, s = 1/2 and r = log 2 (n). Successive steps of the fractal produce the pattern early depicted in Figure 2 . The following 5 A k-simplex is the generalization of the notion of a triangle to kdimensions. A 2-simplex corresponds to the triangle while a 3-simplex corresponds to a tetrahedron. two lemmas are introduced to support the formulation of a map λ(ω).
Lemma 1:
The space occupied by a discrete embedded Sierpiński gasket is in correspondence with the Hausdorff dimension of the infinite Sierpiński gasket.
Proof: The space occupied by a Sierpiński gasket of linear size n is V(F 3, 1 2 n ) = 3 r . Given that r = log 2 (n) and 3 log2(x) = 2 log 2 (3) log 2 (x) , the space expression can be rearranged into
where the exponent H = log 2 (3) ≈ 1.5849... is the Hausdorff dimension of the original infinite Sierpiński gasket.
Proof: Proof by induction on r:
• Induction step: It is assumed that the orthotope for r = k is quasi-regular. If k is even, the packing for k + 1 will triple the horizontal dimension of the 2orthotope. If k is odd, the packing for k + 1 will triple the vertical dimension of the 2-orthotope. Since even and odd must alternate, the dimensions of the packed 2-orthotope for k + 1 can only be 3 · 3
, which is regular or quasi-regular, respectively.
The packing process of Lemma (2) is illustrated in Figure  5 , where the packing steps are in correspondence with the scale levels shown in Figure 5 . 
B. Changing from Thread-space to Block-space
An important aspect to consider is at which level the parallel space will be mapped. Two approaches are possible;
(1) thread-space mapping and (2) block-space mapping. For the first approach, λ(ω) defines ω as a unique thread location in parallel space. For the second approach, λ(ω) defines ω as a block coordinate in which several threads are contained. The block-space approach has three important advantages over thread-space mapping. First, in block-space, the fractal becomes a simplified version of the original, requiring less elements to be mapped. Second, since the fractal is a simplified version of itself, it is possible to work on higher sizes of n before the CUDA grid maximum dimensions are reached. Third, the block-space approach allows the possibility for threads inside a block to preserve locality, which is essential for coalesced memory accesses on data.
The block-space map λ(ω) is now introduced, where ω = (ω x , ω y ) denotes a two-dimensional coordinate of a block of constant size |B| = ρ x × ρ y threads. The change from thread-space to block-space means that blocks are mapped to a simplified version of the fractal of linear size n b = n/b, as shown in Figure 6 . It is important to clarify that the extra green regions visible in Figure 6 do not necessarily mean unused threads. The stage of local thread mapping is detailed later in this Section, where three approaches can be used.
The formulation of λ(ω) continues in block-space with ρ x = ρ y to simplify the analysis, with n b the new simplified linear size of the fractal with the origin (0, 0) located at the topleft corner for both the parallel and embedded fractal spaces, and with y increasing downwards.
C. Formulation of λ(ω)
The function λ : Z 2 E → Z 2 F is introduced as a mapping of block coordinates ω in parallel-space onto block coordinates in the embedded space. The intuition behind is an unrolling process applied in parallel to each ω ∈ Π 2 through all the scale levels. At each level, different x, y offsets are accumulated to form the final (λ x (ω), λ y (ω)) coordinate in the fractal.
Theorem 1: There exists a block-space parallel-space efficient λ(ω) that can map blocks in O(log 2 log 2 (n b )) time using |B| = θ( log 2 (n b ) log 2 log 2 (n b ) ) threads per block.
Proof: By construction: let r b = 2 n b be the block-space scale level of the fractal, Π 2 the 2-orthotope of 3
blocks that maps onto the discrete Sierpiński gasket F 3, 1 2 n b , with each block having ρ x × ρ y threads. By Lemma (1), Π 2 is parallel-space efficient in block-space.
A helper index function β μ (ω) is defined as β μ (ω) = ω x ((μ + 1) mod 2) + ω y (μ mod 2) 3 μ 2 −1 mod 3 (4) to produce an index in the range β μ (ω) = 0, 1, 2 that identifies, within scale level μ ∈ [0..r b ], which of the three regions of the fractal does block ω belongs to. For even μ, β μ (ω) acts on ω x . For odd μ, it acts on ω y . Regions are sorted as 0 for top, 1 for middle and 2 for right (see Figure 1 for visual reference).
Having the β μ (ω) index, the weight functions
compute the offset weights, 0 or 1, for each of the x and y directions at scale level μ. The value of the offset corresponds to 2 μ−1 , which is the linear size of each region at scale level μ. For a given μ, the combination of the weight functions with the offset produce partial coordinates of the form
that contribute to the final mapped coordinate. The summation of all partial coordinates produces the map
which can be computed in O(log 2 log 2 (n)) time (i.e., n b ∈ θ(n)) using a parallel reduction with the threads contained in the ω block. Finally, by Brent's Theorem [3] , |B| = θ( log 2 (n) log 2 log 2 (n) ) threads are sufficient for a block of threads to reduce efficiently in parallel.
Theorem 2: λ(ω) requires asymptotically less work than the bounding-box approach.
Proof: The asymptotic work improvement factor of λ(ω) with respect to the bounding-box approach is the quotient of the costs of mapping all blocks using their corresponding Π 2 structures with the consideration n b ∈ θ(n)
where Π 2 BB and Π 2 λ(ω) are the parallel-spaces for the bounding-box and λ(ω) approaches, respectively. The parallelspace of Π 2 BB corresponds to the Euclidean box of n b × n b blocks, and the parallel-space of Π 2 λ(ω) is O(n H ) by Lemma (1). Applying the limit n → ∞ lim n→∞ S λ(ω) = lim n→∞ ∂ ∂n (n 2−H ) ∂ ∂n (log 2 log 2 (n))
The importance of Theorem (2) is that it guarantees the existence of a n > n 0 where λ(ω) will start becoming each time faster than the bounding box approach. A theoretical curve is presented in Figure 7 . From the plot, one can note that space improvement is clear in the log-log scale. For the time improvement, the improvement decreases until n 0 ∼ 7, which is the point where it becomes a monotonically increasing function. In practice, constants could have an effect that would push n 0 further to the right.
D. Intra-Block Mapping
Once λ(ω) maps a block ω, all the ρ x × ρ y threads contained have a shared reference coordinate that is available to use for computing their individual location in the fractal. This phase of organizing the threads within a block is referred here as Intra-Block Mapping, and this subsection describes three possible approaches to accomplish this.
1) Further Unrolling:
In this approach threads inside their mapped block use the same λ(ω) map but applied to each thread. By Theorem (1), the Intra-block map is parallel-space efficient and the mapping time becomes O(log 2 log 2 (|B|)) ∈ O(1) as ρ x , ρ y are constant and do not grow with n. Table: The second approach is to use a shared lookup table of ρ x × ρ y = O(1) offset coordinates, available to any thread in any block. Mapping each thread would cost O(1) memory accesses and the extra memory introduced by the shared table is O(ρ x × ρ y ) ∈ O (1) .
2) Shared Lookup

3) Bounding Sub-boxes:
The third approach consists of using small bounding-boxes in each block. This approach introduces a constant number of extra threads in each block, but allows each thread to be mapped just with f (x) = x which costs O (1) . In order to know if it is in the fractal or not, each thread evaluates if t x &(n − 1 − t y ) == 0 is true or not, respectively, with & being the bitwise AND operator.
Regardless of which Intra-block mapping approach is chosen, the final mapping time will not surpass the O(log 2 log 2 (n)) time, as the blocks have a constant size of threads. Nevertheless, it is worth considering that Further Unrolling introduces a constant cost in mapping time. The Shared Lookup Table approach introduces a constant cost in memory and finally the Bounding Sub-boxes introduce a constant in the number of extra threads. Choosing one or another can depend on the specific application, i.e., to avoid competing with the application in the use of memory bandwidth or arithmetic operations.
IV. IMPLEMENTATION AND PERFORMANCE RESULTS
A CUDA implementation of λ(ω) is put under test to obtain the actual speedup for different values of n. The implementation uses the Bounding Sub-boxes approach to arrange threads inside each block, as it is the simplest in terms of implementation time. The parallel reduction per-block is performed using the shuffle instruction of CUDA, which allows efficient register-level communication among threads within the same warp 6 .
The performance test consists of measuring the average time taken to write a constant value on all the elements of a Sierpiński gasket of scale level r, which is embedded in a M n×n matrix filled with zeros. The configuration space is explored in the ranges r = 0..16 and ρ = 1, 2, 4, 8, 16, 32 for the scale level and block-size respectively, in order to find the optimal setting that provides the best performance for both the bounding-box and λ(ω) approaches. The average performance measures are taken by averaging 100 sub-averages, each one being an average time of 10 consecutive synchronized kernel calls. The standard error for each mean was below 1%. The hardware for performance test is listed in Table I . 
Device
Model GPU Titan-X Pascal, GP102, 3584 cores, 12GB CPU Intel i7-6950X 10-core Broadwell RAM 128GB DDR4 2400MHz Figure 8 presents the speedup of λ(ω) over the boundingbox approach, as well as the running times for the two mapping techniques. For values of n < 2 8 , one can note that only some curves offer speedup. Once n > 2 8 , the speedup begins to increase for all block-size configurations, reaching the higher values at n = 2 16 = 65536, which was the highest problem size that fit in the GPU memory. An important aspect to note from the speedup curve is that for the largest possible block size, |B| = ρ × ρ = 32 × 32, the λ(ω) map runs the test approximately 6× faster than the bounding-box approach. Furthermore, as blocks become smaller in ρ, the improvement increases dramatically, reaching up to 55× of speedup.
The plot of the running times provides further insights on what configuration is the best suited for each mapping technique. By looking at the running times of the small block configurations, one can note that regardless of their high speedup, their running times are the slowest ones. For the boundingbox approach (BB) the best performance is obtained when the block-size is maximum. For λ(ω) the best performance is found when using a block of |B| = 16 × 16 threads. If the curves of the best configuration for each implementation are considered, i.e., the ones with marked points on Figure 8 , right, then the speedup provided by λ(ω) increasing further more reaching almost an order of magnitude. The other running time curves are still useful to visualize that as blocks become smaller, the value of n 0 where λ(ω) is more convenient moves closer to the origin, and vice versa, however the GPU with its current organization and architecture is not fully utilized by such small block configurations, thus leading to an inferior performance. Therefore, in practice large blocks should be utilized and by Theorem (2), beyond n = 2 16 the speedup would keep increasing in favor of λ(ω).
V. DISCUSSION
The results obtained in this work have shown that avoiding unnecessary threads with a special mapping function λ(ω) can lead to significant performance gains when processing an embedded Sierpiński gasket, even when the mapping cost perthread increases in comparison to a bounding-box approach. The analysis and formulation of λ(ω) has provided three important results; (1) There exists a correspondence between a quasi-regular 2-orthotope of discrete elements and the elements of the embedded Sierpiński gasket fractal. (2) Such correspondence from parallel to data space can be computed in O(log 2 log 2 (n)) time using a block-space map that is based on efficient parallel reductions. (3) The total work of mapping the 2-orthotope with λ(ω) is asymptotically smaller than the work generated by the bounding box approach, leading to a monotonically increasing speedup that is guaranteed to occur once a n 0 value is reached.
The experimental performance results confirm the theory as once the fractal reaches the linear size n 0 = 2 8 , the speedup begins to increase monotonically for all block-sizes. Using the maximum block size of 32 × 32 threads, λ(ω) reached up to 6× of speedup. The maximum speedup obtained was approximately 55× with blocks of 1 thread, however such block configuration is not practical for the GPU architecture. Still, having measured the running times with different block configurations helped in understanding that reducing the block size only brings the n 0 value closer to the origin and increasing the block-size pushes it forward. Eventually, all block configurations can reach up to 55× of speedup and beyond if the fractal is large enough.
The performance tests were based on a small amount of work per thread. Although there exist several computational patterns that produce this scenario, such as cellular automata or spin lattice simulations, it is still interesting to discuss how is the speedup affected as the work per thread increases?. As more work in given to the threads of a kernel, the cost of the mapping stage will be less significant than the cost of the computation stage. Therefore, any speedup to the mapping stage will be diminished by the rest of the computations in a similar to Amdahl's law fraction of a program can be accelerated. Efficient mapping techniques is more convenient to use when the cost of the mapping stage is comparable or greater than the cost of the computation stage.
The GPU map found for the embedded Sierpiński gasket may be adapted to work for other types of embedded fractals that follow a similar building scheme applying modifications to the helper, weight and offset functions. In order to obtain speedup, it is crucial to check if the overall work will be asymptotically smaller than the bounding-box approach.
Two important questions may be formulated from the results obtained in this work. The first one is: Can there exist a general λ(ω) that maps a family of embedded fractals who share the same building principle?, and the second question: can there exist a λ(ω) that maps in O(1) time using no more than O(1) extra memory?. Future research in these directions can provide important insights on the limits of efficient GPU computing for embedded fractal domains.
