Reduction operations are extensively employed in many computational problems when a finite set of numeric elements are combined into a single value using for this a combining function. A parallel reduction, in turn, is the operation concurrently performed when multiple execution units are available. The present work depicts a GPU-based parallel approach for it, which employs techniques like loop unrolling, persistent threads and algebraic expressions to avoid thread divergence, able to surpass the methods currently in use. Experiments conducted to evaluate the approach show that the strategy performs efficiently on both AMD and NVidia's hardware platforms, as well as using OpenCL and CUDA, making it portable.
I. INTRODUCTION
Reduction is a basic step in many algorithms, including classical ones such as Stream Compaction [1] , Golden Section and Fibonacci Methods [2] and Count and Radix Sort [3] . A reduction operation can be formally defined as follows [4] : given a set X of n elements, X = {x 0 , x 1 , ..., x n−1 }, compute x 0 ⊗ x 1 ⊗ ... ⊗ x n−1 , with ⊗ (also known as a combining function) being an associative and commutative operator. Examples of ⊗ when X is composed of numbers are the operators +, ×, ∧, ∨, ⊕, ∩, ∪, max and min. Algorithm 1 illustrates a reduction process through the summation of a set of values.
Algorithm 1: Summation(X)
Input: Set X = {x 1 , x 2 , . . . , x n } of numeric elements Output: The sum of all elements accumulator ← 0 for i ← 1 to n do accumulator ← accumulator + x i return accumulator A parallel version of a reduction consists of computing the ⊗ operator concurrently for the set of elements using multiple execution units. At a first glance of Algorithm 1, it may seem that the reduction operator sum (+) is inherently sequential since the variable accumulator depends on values computed in previous steps. However, it is possible to parallelize that operator because of its associative property. In fact, the associativity of ⊗, sometimes combined with the commutative property, allows to divide the calculations hierarchically into subproblems that can be solved in parallel and then having their partial computations combined to produce the final result. The order in which the partial reductions are combined does not affect the final result 1 .
Since the arrival of programmable GPUs, some strategies to accelerate the reduction operation on such devices appeared. Two famous ones are described by Mark Harris [8] and Bryan Catanzaro [9] . Most recently, Justin Luitjens [10] presented some improvements to the strategies described in [8] . Unfortunately, the approaches adopted by [8] and [10] , although very efficient, are limited to hardware and software provided by NVidia. On the other hand, the proposal of [9] is based on the open standard OpenCL [11] , adopted by a myriad of manufacturers, making it portable. Nevertheless, the code presented in [9] has some weaknesses, as it does not utilize some strategies that could significantly improve its performance.
In the present paper, we describe a new parallel implementation for reduction operators that provides both portability and very good performances. It adequately combines strategies that were used in the approaches mentioned above.
The remainder of this paper is organized as follows: Section II briefly describes the existing techniques for parallel reduction in GPUs; Section III explains our approach; Section IV details the experiments performed for testing the approach; Finally, Section V draws our conclusions and points to future investigations in this area. 1 Although, mathematically, this is true for numbers in any set, in computational terms this is more complicated. For instance, that property holds for the set of integers, but the same does not happen for the floating point numbers due to the inherent imprecision that arises when combining (adding, multiplying, etc.) numbers with different exponents, which leads to the absorption of the lower bits during the combine operation. As an example, mathematically the result of (1.5 + 4 50 − 4 50 ) is always the same, no matter the order the terms are added, whereas the floating point computation can result in 0 or 1.5, depending on the sequence in which the operations are performed [5]- [7] .
II. PARALLEL REDUCTION IN GPUS
As explained before, the basic idea for parallelizing a reduction is to "split" the problem into smaller pieces and to solve them in parallel. The GPU hardware, however, has features which impose some restrictions that must be considered in order to maximize the speedup of parallel code [12] .
The approaches of [8] and [9] to deal with reductions in GPUs operate in a very similar way, using a tree-based structure.
One of the aspects to be observed is the number of elements to be reduced. If this amount is sufficiently small and can be stored in the local memory of each SM (Symmetric Multiprocessor), then the reduction becomes simple. In [9] , Catanzaro presents some strategies for this case and conducts performance comparisons between them. After describing how reductions can be efficiently performed in small sets, the author shifts the focus to the discussion of cases in which a large volume of data must be handled. Three strategies are presented and a winner, called "Two-Stage Parallel Reduction", is elected. Harris [8] deals only with reduction of large datasets.
Our approach is mainly based on a proposal from [9] . Therefore, a more detailed description of it is presented. First, however, we explain the strategies employed by [8] since some ideas for speeding up the computation come from that work. We also comment about the research of [10] , because it is another way of dealing with the reduction problem.
A. Mark Harris' Work
The work presented by Harris [8] focuses on techniques for performing reductions on large data volumes. The author shows, through successive versions of the same algorithm, how bad decisions or an incorrect way of mapping the problem to the target platform can negatively impact the application performance.
Harris performed experiments using a G80 GPU. That video card has a 384-bit memory interface, with a 900 MHz DDR memory, which leads to a theoretic 384 * 1800/8 = 86.4 GB/s of memory bandwidth. All tests were conducted using a vector with 2 22 (4M) integer values.
Problems like shared memory bank conflict, lack of communication between thread blocks (making it impossible for a kernel to reduce a large array at once) and highly divergent warps are addressed. Starting with a naive version, step by step improvements are described. Harris tested strategies such as: (1) Replacing conditional commands by new instructions that avoid divergent branching; (2) Keeping all work-items active, doing work as much as possible; (3) Unrolling loops until maximum efficiency is reached; (4) Using CUDA C++ template parameters on device and host functions for the specification of the workgroup size, so that only the necessary amount of work-items are launched; and (5) Allowing each work-item to individually compute as many reduction operations sequentially as possible in order to reduce synchronization between threads, among other benefits. As a result of applying all these optimizations, the final version ran in 0.268 ms (30.04x times faster than its initial version) and reached a memory bandwidth of 62.671 GB/s. In our work, we use all strategies proposed by Harris except the fourth one.
B. Justin Luitjens' Work
In [10] , Luitjens shows how a feature of the NVidia's Kepler (and newer) GPU architecture, the shuffle (SHFL) instruction, can be used to make reductions even faster when compared to the strategies presented in [8] .
Usually, work-items inside the same SM use the local memory when they need to communicate. This involves a three-step process: writing the data to the local memory, perform a synchronization barrier and then reading the data back from local memory. The Kepler and newer architectures implement the shuffle instruction, which enables a work-item to directly read private data from another work-item in the same wave-front. According to the author, there are four main advantages in using this instruction:
• It ultimately allows work-items inside a wave-front to collectively exchange or broadcast data; • It replaces the three-step process by a single instruction, effectively increasing the bandwidth and decreasing the latency; • It does not use the local memory at all; • A sync barrier is implicit in the instruction and, hence, a synchronization step inside a workgroup is not necessary. Using this instruction, several versions of the reduction process were proposed, implemented and compared. However, although Luitjens states that the adopted strategies lead to faster reductions than those described by [8] , no comparative studies between the two approaches were conducted.
C. Bryan Catanzaro's Work
Now, we describe Catanzaro's two-stage parallel reduction approach for large datasets, as presented in [9] .
The technique is based on dividing the data set in p pieces (or "chunks"), where p is large enough to keep all GPU cores busy. It is also necessary to limit the number of work-items to the maximum amount that the GPU can handle in total without having to switch between them (from now on, that maximum will be called GS -or global size). Each chunk is then processed by a work-group. Since the sum operation has the properties of associativity and commutativity, each work-item can perform its own reduction in parallel. The work-item takes, as the starting point, its global identifier and accumulates its partial sum in a private variable. It reads the data from a vector stored in the GPU's global memory, intercalating the access to the values with the other work-items in a sequential order. Basically, work-item i, i = 0, 1, GS − 1, starts reading the position i and skips GS positions at every step.
After having completed a pass through the dataset, the work-items in each workgroup write the result of their own reduction in a scrap vector located at local/shared memory which, in turn, will also be reduced in parallel. At the end of the process, each working group will have its own scrap containing, in its position 0, the result of the reduction so far. This partial result is then copied to another vector, this time stored in the GPU global memory, which size must be equal to |SM |. The first stage is then complete. Its source code, for a operator less (<), extracted from [9] , is presented in Listing 1 with small adjustments. The second stage is simpler. Since now there is a vector with |SM | elements in the global memory -with the result of a partial sum in each position -just the first |SM | workitems of the first SM copy their corresponding value to an array allocated in the local memory. Then, the work-items perform a new parallel sum of the elements in the vector. After that, the position 0 of the local memory will have the final result. This is copied back to global memory and the reduction finishes.
The overall approach of Catanzaro was the underlying architecture of the present work. It was, however, improved with the extensive usage of the advanced techniques described in the next section, in order to further explore parallelism.
D. Loop Unrolling
Loop Unrolling (also known as Loop Unwinding and Loop Unfolding) is an optimization technique -performed by the compiler or manually by the programmer -applicable to certain kinds of loops in order to reduce (or even prevent) the occurrence of execution branches and to minimize the cost of instructions for controlling the loop [13] - [15] . In general, it optimizes the program's execution speed at the expense of increasing the size of the generated code (spacetime tradeoff ). It is easily applicable to loops where the number of executions is previously known, like routines of vector manipulation where the number of elements is fixed.
Basically the technique consists of reusing the sequence of instructions within the loop, so as to include more than one an iteration of the code every time the loop is repeated, reducing the amount of these repetitions.
The reuse is done by manually replicating the code inside the loop a certain amount of times or through the "#pragma unroll n" 2 positioned immediately before the beginning of the loop. The number of times the loop is unrolled is called Unrolling Factor and, with the pragma directive, it is given by the parameter "n".
It is worth noting that with the pragma directive we leave the decisions of how the loop should be unrolled to the compiler, which may lead to a not so optimized code. In the experiments performed as part of this work, the best results were always achieved using manual loop unrolling.
Unrolling, when applicable, offers several advantages over non-unrolled code. Besides the decrease in the number of iterations, an increase occurs in the amount of work done each time through the loop. This also opens ways for the exploration of parallelism by the compiler in machines with multiple execution units, since each instruction within the loop can be handled by an independent thread.
However, these are only the most easily perceivable benefits. Fog [13] lists several others, as well as some observations about when this technique should be avoided. Such factors (advantages and disadvantages) must be considered by the programmer when deciding to use loop unrolling or not.
E. Persistent Threads
All the current programmable GPUs follow the "Single Instruction Multiple Thread" (SIMT) and "Single Program Multiple Data" (SPMD) paradigms, hiding the details of the underlying hardware where the code runs in an attempt to ease the development task.
Some authors [16] argue that the usage of these paradigms greatly limits the actions of the programmer, because the execution flow is entirely under the control of the video card's scheduler. They claim that, while such abstractions reduce the effort for new developers in the GPGPU field, they also create obstacles for experienced programmers, who normally face problems for which workload is inherently irregular, therefore making it more difficult to efficiently parallelize when compared to problems whose parallelism is more regular.
They argue that this uncovers a serious drawback of the current programming styles, which is not able to ensure order, location and timing. It also does not allow the software developer to regulate these three aspects without completely avoiding the GPU scheduler. Thus, to overcome these limitations, developers have been using a programming style called Persistent Threads ("PT"), which low level of abstraction allows performance gains by directly controlling the scheduling of work-groups.
Basically, what the PT style changes is the lifetime of a work-item [17] , by letting it to run longer and giving it much more work to do than in the traditional "non-PT" style [18] . This is done by circumscribing the kernel logic (or part of it) in a loop, which remains running while there are items to be processed.
Briefly, from the point of view of the developer, all workitems are active while the kernel is running. As a direct consequence of PT, a kernel should be triggered using only the amount of work-items that can be executed concurrently by each Streaming Multiprocessor. All these actions will prevent constant return of control to the host and the cost of new kernel invocations [17] .
Gupta et al [16] acknowledge, however, that the PT technique is not a panacea, and its use should be carefully evaluated. In particular, the technique fits well when the amount of memory accesses is limited (i.e., few reading/writing to global memory and a large volume of computation) and the problem being solved has not many initial input elements or the growth in the number of elements in the input set is fairly limited. Beyond these conditions, the traditional non-PT style tends to outperform the PT style.
F. Thread Divergence
Current GPUs are able to deliver massive computational power at a reasonably low cost. However, due to the way they are constructed, some obstacles must be overcome for the effective use of such power. One of the main and hardest obstacles to avoid is the presence of conditional statements [19] potentially leading to branches in the execution flow of the various work-items [20] .
By default, GPUs try to run all the work-items inside the wave-fronts in the SIMD model. However, if the code being executed has conditional statements that lead to divergences in program flow, the divergent work-items will be stalled and its execution will only happen after the non-stalled workitems have completed their runs, which ultimately compromises the desired speedup. This phenomenon is called Thread Divergence [21] , [22] .
Some strategies have been proposed in order to minimize or even eliminate the effects of such phenomena. Among them, we cite [19] - [24] .
The strategy adopted in the current paper is detailed at the end of Section III.
III. THE NEW APPROACH
The improvements proposed in our work focus on Steps 1 and 3 of the first stage of the reduction presented in Section II-C. The improvements employ the same strategies described by Harris [8] to increase the performance of of Catanzaro's original approach [9] but with appropriately chosen interventions.
In step 1 of the original implementation, the vector in global memory containing the data to be reduced is entirely traversed by the work-items, each one performing its own reduction.
This step already uses the "Persistent-Thread" strategy, but its performance can be improved by adopting loop unrolling (Section II-D). As it can be seen, instead of doing the unroll when the data is in local memory, as proposed by [8] , our improvement performs the unroll in the global memory.
The code presented in Listing 2 shows the modified loop (now using a sum operator), assuming an unrolling factor (F) equals to 4, iGlobalID as the work-item global identifier and iLength as the number of elements to be reduced. Special attention must be given to how the data is brought from global (aVector) to private memory (accumulator), through the use of algebraic expressions that prevent reading from invalid memory locations, thus avoiding the usage of "ifs" and potential divergences in the execution flow. The expression i n < iLength expands to integers 1 or 0 whether it is, respectively, true or false. In the first case (i n < iLength) * (aV ector[i n ]) is interpreted as (1) * (aV ector[i n ]), adding the value stored in location i n to the partial sum (accumulator). In the second case, the expression is interpreted as (0) * (aV ector[0] ), ensuring that -regardless of the data stored in the first position of the vector -the value 0 is added to accumulator, keeping the partial sum correctness.
At the beginning of Step 3, the resulting values of the previous sums are stored in local memory. Then, each SM performs its own local reduction with its work-items.
In the solutions presented by [8] and [9] , in this step, all work-items are kept synchronized through the use of barriers. However, with minor conceptual changes, it is possible to completely eliminate the overhead caused by the barriers, not only in the last 6 iterations of the loop, as proposed by [8] .
Our strategy is to use algebraic expressions to keep all the work-items in the same execution step, maintaining its desired behavior and algorithm correctness, as can be seen in Listing 3. Here, the conditional statement is completely eliminated but yet the function returns the right result of the comparison. Note that the two boolean operations ((a < b) and (a >= b)) are mutually exclusive, being interpreted internally by the compiler as 0 (false) or 1 (true). So, assuming that a is smaller than b, the result of the algebraic operation is (1) * a + (0) * b which, ultimately, will return only the value of a.
The same strategy can be applied to lines 19 to 25 of Listing 1, that represent the third step of the first stage. The new code is shown in Listing 4, where iLocalSize stores the number of active local work-items and iLI represents the work-item's local identifier.
Here, in each iteration of the loop, iP os is divided by 2 (iPos >>= 1) and bF lag is expanded to either 1 or 0, thus reducing by half the number of work-items doing a useful job. If, for the current work-item, the expression iLI < iP os becomes true, then the expression in the last line will be interpreted as scratch[iLI]+ = (1) * (scratch[iLI + (1) * iP os]), ensuring that the value stored in position iLI +iP os will be added to the value in position iLI. On the other hand, if the expression becomes false, it will be interpreted as scratch[iLI]+ = (0) * (scratch[iLI +(0) * iP os]), ensuring that the value in position iLI will not be considered. Since all work-items are always in the same step of computationdoing exactly the same job (useful or not), independently of being in the same wavefront -sync barriers are unnecessary. 
IV. COMPUTATIONAL EXPERIMENTS
In this section we evaluate the new approach against the proposals of Harris and Catanzaro 3 . Their original codes were publicly available and, therefore, used in the current study. Table I and Figures 1 and 2 depict the speedup gains achieved against the code presented in [9] , where F = 1 is the running time of the original code.
The algorithms were coded in the C++ language and compiled using a GNU compiler (g++ version 4.8.2 with parameters "-O3 -mcmodel=medium -m64 -g -W -Wall"). The AMD version used OpenCL 1.2 with the Software Development Kit 2.9.1. All experiments were performed on a computer with an AMD FX-9590 Black Edition Octa Core CPU, with clock ranging from 4.7GHz to 5.0GHz, 32GB of RAM, running Ubuntu 16.04 64-bits operating system. The computer had a Radeon SAPPHIRE R9 290X Tri-X OC GPU video card, with 4GB of memory. The architecture of such a video card provides 2816 stream processing units and an enhanced engine clock of up to 1040Mhz. Its memory is clocked at 1300MHz (5.2GHz effectively). At this speed, the theoretical memory bandwidth is close to 332.8GB/sec.
All tests were run on two vectors, one of integers and one of single precision floating points, containing 5533214 elements. There were no measurable differences, regarding the execution times, between the vector types.
The values listed in Table I were obtained with the OpenCL profiler CodeXL, version 2.0.12400.0, and are the averages of five consecutive executions for each F.
As can be seen, these results show that the version of the algorithm with F = 8 reached a speedup very close to 2.8x, when compared with the proposal of [9] . It may also be noted that such speedup stabilizes around this value (F = 16 provided just over 1.5% gain when compared to F = 8). The same code was implemented in CUDA and tests were performed against the Kernel 7 presented in [8] . The GPU used was a Tesla C2075 with 6GB of memory. Its architecture provides 448 CUDA cores, a GPU clock of 575MHz and a shader clock of 1150Mhz. The clock of the GPU memory was 750MHz (3.0GHz effective).
The experiments employed the two aforementioned vectors. Several values of the unrolling factor (F ) were tried in order to find the optimal value for such a video board. It was determined that up to F = 6 the performance gains were substantial and, with F ≥ 8, the gains were very discrete. According to this, all experiments were conducted using F = 8, including the ones presented by [8] . Ta- ). The execution times of the two approaches are very similar and we assume them to be equivalent, particularly considering that the new code is simpler to implement and works for CUDA and OpenCL. V. CONCLUSION Reduction operations are widely employed in many computational problems. This paper showed how such operations can be performed in a parallel fashion using graphics processing units and detailed the main approaches for them nowadays.
All parallel reduction techniques currently in use suffer from some basic issues. Several only reach their peak performance by employing proprietary strategies and/or technologies. That ends up limiting their use to the platform for which they were designed. Others, though generic, do not adopt certain procedures that could increase their performance without loss of generality.
The strategy presented here combines the best of both worlds: It is generic enough to be used with both CUDA and OpenCL and can run on hardware of the two major GPU manufacturers with minimal changes, just being adapted to the particularities of each platform. The implemented code, besides simpler, offered a performance equivalent to the best strategy described by [8] .
It is worth mentioning that the techniques here discussed are of general use, being the reduction only one of its possible applications.
