Processing Units (GPUs) by executing a set of well-known dataparallel primitives. Those primitives are usually invoked from the host many times, so their throughput has a great impact on the performance of the overall system. Thus, the study of novel algorithmic strategies to optimize their implementation on current devices is an interesting topic to the GPU community. In this paper we focus on optimizing the reduction primitive, which merely reduces a data sequence into a single value using a binary associative operator. Although tree-based and sequential-based algorithms have been already implemented on GPUs, a comparison of both algorithm performance had not been carried out yet. Thus, our first contribution is to present an experimental study of state-of-the-art reduction algorithms on CUDA. Next we introduce two algorithmic optimizations that are integrated into the fastest solution (a sequential-based algorithm), improving its throughput even more. Finally, we replicate this methodology to the segmented version of the primitive, which applies when the input is composed of several independent segments. In this case, it is not clear which algorithm exhibits the best performance, since throughput deeply depends on the distribution of segments along the input. According to our results, tree-based algorithms run faster for small segments, while sequential methods are better for medium and large ones.
INTRODUCTION
Nowadays Graphics Processing Units (GPUs) are used to accelerate a wide range of general-purpose applications by exploiting their high-performance many-core processors. GPUs usually contribute to the overall computation in two different ways: carrying out some specific tasks through user-designed kernels or executing some data-parallel primitives provided by a growing number of libraries (e.g. CUDPP, CLPP, GPULib, Thrust…). While programmers assume control of the throughput of their kernels, primitives are integrated as black boxes whose performance relies on the library. On the other hand, hardware improvements incorporate more functional units in each release, along with larger shared memories and sophisticated cache hierarchies. For those reasons, the study of algorithmic strategies to optimize the implementation of these primitives, and the adjustment to the features of the upcoming GPU architectures, are ever-interesting topics to the GeneralPurpose GPU community [8] .
In this paper, we focus on optimizing the classic reduction primitive, paying attention to its two versions. Its unsegmented form takes a binary associative operator  (e.g. , , and
) and an array of data [a 0 , a 1 , ..., aN -1 ] as inputs, and it returns as output one value (a 0  a 1  ...  aN -1 ). In its segmented version, the input array is divided into segments of consecutive data, and the output is the individual reduction of each segment. Thus, the output size is the number of segments included in the given input. Observe that the segmented primitive could be easily implemented in terms of the unsegmented primitive, by extracting each segment and reducing it, isolated from the global input, with the unsegmented primitive. Nevertheless, this should be done many times (one for each segment), and some of the segments may be too small to justify further kernel executions. In consequence, this approach would not take the most of GPUs. On the contrary, the segmented solutions we illustrate will simultaneously perform separate parallel reductions on the segments of the input. For this reason unsegmented and segmented reductions are introduced as independent primitives along the paper.
The two reduction versions are useful building blocks for solving a wide variety of problems on GPU. For example and using CUDA, the unsegmented version has been successfully applied to solve the Single-Source Shortest-Path problem [12] and to build the Minimum Spanning Tree [16] , while the segmented version to construct kd-trees on GPU for ray tracing [17] and to accelerate sparse-matrix multiplication [3] .
Concerning the properties the operator  must fulfill, only associativity is essential. Actually, the correctness of all the algorithms described along this paper deeply relies on it. Most of the presented algorithms also make use of identity; so, 1  will denote an identity element for  from now on. Although the operator could be commutative as well, as it happens to the examples above, we will not suppose it in this paper, i.e. they are considered to be "non-commutative". The advantage of using commutativity, along with associativity, is that any pair of elements could be reduced regardless of their location on the input. However, these pairs must belong to the same segment in segmented reduction, which makes it difficult to exploit commutativity in this case. Hence, the exploitation of commutativity seems to come into conflict with the segment arrangement.
One of the main aims of this paper is to compare two kinds of reduction approaches: recursive (tree-based) versus sequential algorithms. In fact, our first contribution is to experimentally confront the state-of-the-art algorithms of both types. Obviously, we have replicated the experimental study for the two versions of the primitive. As a second contribution, we propose two algorithmic optimizations that can be integrated into any of the previous algorithms, regardless their nature. We have tested them into the fastest solution, resulting in significant speed-up for unsegmented reduction. However, they did not lead to succeed in the segmented case since the resulting solutions are bounded by the shared memory size.
II. RELATED WORK
The kernels presented by Harris [10] are the most popular CUDA implementations for the unsegmented reduction primitive. They are actually included as project examples in every CUDA SDK release. His document introduces seven kernels from a didactic perspective, in such a way that each kernel improves the performance of the previous one. Nevertheless, many of them require the operator to be commutative.
The two segmented reduction algorithms we have found are located at the previous references [17, 3] , where the primitive is also applied to solve a specific problem. In both cases, those proposals are based on the works of Blelloch for the scan primitive below mentioned.
The reduction primitive is actually a part of the (inclusive) scan primitive, which has been studied more widely because of its great applicability [5] . Thus, we must describe in this section some of the progress made on the scan primitive. Scan also takes an array [a 0 , a 1 , ..., aN -1 ] and a binary associative operator  as inputs, but it returns an array containing the reduction of all the prefixes [a 0 , (a 0  a 1 ), ..., (a 0  a 1  ...  aN -1 )]. Scan also accepts a segmented version. In this case, the output is the unsegmented scan of each segment.
The classic parallel algorithms for scan were studied by Blelloch [4, 5] . His formulations were based on recursive equations whose application described a full binary tree. For this reason, they are called tree-based algorithms. Later on, with the advent of GPUs, these formulations were adapted to the novel programming model. Thus, tabulation techniques were applied to replace the recursive nature of tree-based algorithms with an iterative processing. Horn [11] was the first developing GPU-based implementations of the scan primitive. The complexity O(NlogN) of its formulation was improved by Gress et al. [9] and Sengupta et al. [13] to a linear algorithm. The latter presents a work-efficient step-efficient implementation on CUDA that was adapted to the segmented case a year later [14] . In order to improve performance of previous algorithm, the authors exploit shared memory usage. However, the implementations involve bank conflicts, and the kernels may not scale well with shared memory size.
Subsequently, Dotsenko et al. [7] presented work-efficient sequential algorithms for the unsegmented and segmented scan. They decompose the input into blocks that are arranged as matrices in shared memory. Each matrix is sequentially reduced by rows and partial results are stored in a smaller array, which is processed later on. The overhead of previous tree-based formulations concerning synchronization barriers is reduced since each thread reduces a row. Immediately, Sengupta et al. [15] improved their tree-based algorithms incorporating an intra-warp operation: each warp individually performs a scan over 32 elements. Next, one warp carries out another intra-warp execution over the previous results to generate the reduction of the whole block. The advantages of this operation are that many synchronization barriers become unnecessary. However that paper does not include a comparison with the work by Dotsenko et al, thus, the most recent and fastest implementations of both trends -the intrawarp scan [15] for the tree-based trend and the sequential scan [7] for the sequential family-have not been experimentally compared yet.
III. UNSEGMENTED REDUCTION
Along the paper, we use the term block to denote two different concepts. A CUDA-block is a block of threads, while a data-block is a chunk of consecutive data that is individually reduced by a CUDA-block. As we will see later, a CUDAblock can reduce one or several data-blocks. We use and to denote the sizes of a CUDA-block and a data-block, respectively (see Glossary).
A data-block is loaded from global memory to shared memory by the corresponding CUDA-block, before being reduced. This is done exploiting coalesced readings. When the input size N is not a multiple of D, we append a virtual padding to the last data-block, which is filled with values 1  . This is done through an if-statement during the loading stage. Thus, all the algorithms of this paper work on / datablocks.
In order to reveal the main differences among the algorithms, we focus on how a data-block is reduced into a single result, and on how this value is handled afterwards. Thus, we will suppose that the array s_data holds a data-block in shared memory for subsequent reduction.
A. State-of-the-art Reductions
A common characteristic of these algorithms is that each CUDA-block exactly reduces a data-block. The result is then written back into global memory by a thread of the CUDAblock. Hence, the grid size is , which also corresponds to the output size. In consequence, the output must be reduced again with another kernel launch, and so on, until a single result is left. This is usually known as a multi-pass approach. 
1) Tree-based reductions:
Supposing that s_data lays on the leaves of a full binary tree, tree-based algorithms reduce each pair of siblings using , in a bottom-up manner. Hence, they require to be a power of two. The underlying recursive equations are:
Algorithm 1 is similar to the kernel#2 by Harris [10] . At each iteration, a thread reduces two elements at line 14 (with indices ai and bi), and stores the result in ai, which corresponds to a left-storing approach. Storing into bi would be also possible (right-storing). The first iteration requires a thread for each pair of elements, thus 2 and must be a power of two as well. So the algorithm executes the loop 2 times. Also observe that in each iteration, the number of active threads is halved (line 7), reaching a single active thread in the last iteration.
In order to avoid bank conflicts in shared memory, Harris replaces this interleaved addressing access with a sequential addressing pattern in his kernel#3. This new approach requires  to be commutative, so this technique has not been considered in this paper. On the contrary, we overcome bank conflicts by including the usual padding of one element every nBanks elements, where nBanks is the number of banks in the shared memory of the device. Thus, the total padding is 2B/nBanks elements. This is why we update the indices at lines 11 and 12. Incorporating this offset does not penalize occupancy on current devices, and the overhead due to the index arithmetic is negligible.
Tree-based algorithms are quite fast, but they suffer from too much synchronization. Notice that a barrier must be located between iterations (line 6). However, the number of elements that a thread loads from global memory can be tuned to achieve a certain speed-up. Notice that * then holds. We do not consider such improvements as algorithmic optimizations, but code optimizations, so we have not paid attention to them in this paper. Observe that =2 in Algorithm 1. On the contrary, =8 in the scan implementation of CUDPP 1.1.1 [6] .
2) Intra-warp tree-based reductions: Sengupta et al. [15] improve the previous tree-based algorithm by working at the warp level. In order to explain this techique, let us extend the notation of CUDA-block and data-block to the case of warps. Briefly, a CUDA-warp is composed of 32 adjacent threads inside a CUDA-block, which run implicitely synchronized in SIMD fashion, while a data-warp is a chunk of 32 consecutive data inside a data-block. Sengupta et al. avoid bank conflicts and many synchronization barriers, using an intra-warp routine in which each data-warp is scanned by a CUDA-warp. The device function tb_reduceWarp of Algorithm 2 adapts this tecnique to the reduction primitive. During its execution, each CUDA-warp is responsible for reducing one data-warp, and sending the result to parameter target. The five iterations that are enough to reduce a data-warp have been unrolled, as the authors do. Notice that no explicit synchronization barriers are requiered due to the way CUDA-warps run in CUDA.
The kernel TB4_Warp_reduction of Algorithm 2 repeatedly invokes the previous function to reduce the whole data-block. 3) Sequential reductions: Dotsenko et al. [7] propose an algorithm for the scan primitive that is based on a matrix representation of s_data, as Fig. 1 shows. and respectively denote the height and width of this matrix, so . Algorithm 3 adapts their MatrixScan to the reduction primitive. Observe that one thread is responsible for sequentially reducing one row of elements, thus only threads are required to reduce the whole data-block (line 5). Nevertheless, all the threads inside the CUDA-block cooperated at the beginning to load the data, including these threads.
Since reducing a row involves no synchronizations, the performance of the algorithm could be improved by maximizing . Nevertheless, must be small enough to fit in shared memory, thus and are forced to be rather small as well. In addition, should be a multiple of the CUDA-warp size in order to avoid divergent warps. To sum up, Dotsenko et al. finally assign and to be the warp size ( 32). A padding of one element is then added at the end of each row to avoid bank conflicts. Moreover, the values that are obtained after reducing the rows can be reduced using one intra-warp tree-based reduction (line 16). Also notice that only one warp continues beyond line 5 since 32, so no explicit synchronization is needed at line 15.
B. Algorithmic Optimizations
We present two algorithmic optimizations that can be integrated into any of the solutions described above.
1) Persistent blocks:
We can force each CUDA-block to reduce multiple consecutive data-blocks, instead of a single one. Thus, the output size decreases since the number of CUDA-blocks is smaller than the number of data-blocks. Moreover, the number of multipasses falls as decreases. The reductions of the data-blocks assigned to a CUDA-block are accumulated using a shared variable. Specifically, one of its threads accumulates the result of a data-block into the reduction of the previous data-blocks.
In order to distribute all the data-blocks among all the CUDA-blocks, we simply incorporate two new parameters into the kernel to indicate the quotient q and the remainder r of the division / . Then, a CUDA-block must reduce q+1 datablocks if blockIdx.x<r, or just q data-blocks otherwise.
In order to improve performance, the grid size must be carefully chosen according to the requirements of the kernel. Given a block size , the maximum number of resident CUDA-blocks per multiprocessor (MBPM is computed according to the CUDA Occupancy Calculator. Then is fixed to NUM_MULTIPROCESSORS*MBPM. The underlying idea is to fill each multiprocessor with the maximum number of CUDA-blocks that can reside together on it. Thus, no CUDAwarp will wait for being allocated on a multiprocessor. We use the adjective persistent to denote these CUDA-blocks since they will be residing on the device until the whole input is processed.
Using persistent blocks results in a small grid size, since the number of multiprocessors is limited by the device, and MBPM cannot exceed 8 in any of the current CUDA compute capabilities. Thus, one kernel execution is almost enough to reduce the whole input. Indeed, the results after the first launch do not even complete a data-block because . In consequence, we remove the usual recursion controlled by the host, and furthermore the cost of storing partial results into global memory between recursive calls.
The idea of persistent blocks has been already applied to other topics. For example, it was recently used to accelerate the traversal step of GPU-implemented ray tracers by Aila and Lane [1] , under the term persistent thread. In fact, Harris [10] already proposed a reduction algorithm, the so-called cascading kernel#7, whose CUDA-blocks reduce many datablocks, but during the load stage, rather than during the reduction stage, and using a commutative operator. Nevertheless, these authors do not explain how the grid size ( ) is chosen in their respective papers.
2) Producer-consumer scheme: We add another optimization onto the persistent technique in order to help the schedulers to hide memory latency a little more. The idea is to classify the set of CUDA-warps into two groups: consumers and producers, of respective sizes C and P. At each iteration, consumer warps sequentially reduce the data-block loaded in the previous iteration, while producer warps load a new datablock. Thus, consumers can reduce at the same time producers load new data.
Algorithm 4 shows the code fragment that implements the producer-consumer scheme. Two data-blocks are held in shared memory, which are accessed through two pointers, s_load and s_comp. These pointers are swapped at the beginning of each iteration (line 10). Then, consumer warps reduce the data-block pointed by s_comp (line 13), while producer warps load the data-block that is located at index first in global memory into the buffer pointed by s_load (line 15). Observe that a synchronization barrier is required (line 16) to prevent warps from overwriting the other buffer. Also notice that the first/last data-block is loaded/reduced before/after the loop. The number of iterations is controlled by variable d, which holds the number of data-blocks assigned to this CUDA-block. The result of reducing a data-block is accumulated into the shared variable s_current_red inside the reduceChunk routine. Fig. 2 graphically exposes the advantages of this technique, using the sequential matrix-based solution presented in Algorithm 3 as the underlying reduction method. Remember that each data-block is arranged as a matrix of size 1024, where 32 and 32 denote its height and width, respectively. The figure depicts how warps can be dispatched if the optimization is incorporated (on the left), and if is not (on the right). In the first case, the producer-consumer scheme has a configuration of 8 producers (warps from W 1 to W 8 ) versus 1 consumers (warp W 0 ). Observe that a single consumer warp is enough, since exactly 32 rows must be reduced. In addition, each producer is responsible for loading / 128 elements, which is done in 128/32 4 coalesced readings. Whenever a producer has requested data from global memory, it must wait until those data are available. This waiting time can be used to request more data by another producer, or to reduce the previous data-block by the consumer.
On the right, the figure shows the execution of the persistent sequential matrix-based algorithm without the optimization. In this case we have 8 warps in a CUDA-block 256 processing 1024 elements as before. Notice that warp W 0 starts reducing only when all the data-block are available in shared memory. Thus, each iteration is slightly longer than one using the producer-consumer scheme.
Although the producer-consumer paradigm is a well-known technique, its implementation on GPU is a recent issue. Besides our paper, Bauer et al. apply DMA techniques to solve in GPU other problems [2] . Our scheme can be compared to their "manual double-buffering" technique.
IV. SEGMENTED REDUCTION
In segmented reduction (s-reduction in the sequel), the input is divided into segments. We demarcate them by using another array of size N, called owner, such that owner[i] holds the index of the segment of element i. Hence, owner is sorted in non-decreasing order. Notice that the output of the s-reduction is an array whose size is the number of segments. In the sequel, the variable g_output will denote such array, which is located on global memory.
Next we adapt the algorithms presented so far to the segmented case. Again we focus on the reduction step, since it exposes the differences among them. Thus, we suppose that data and owners already hold in shared memory, specifically in s_data and s_owner.
A. State-of-the-art Segmented Reductions
Zhou et al.
[17] propose a tree-based algorithm by adapting (1) to the segmented case. The underlying recurrences are: Expressions and respectively denote the left and right children of an inner node , and is the owner of leaf . Zhou et al. include an extra operation that must be applied when the siblings have different owners (label †). In this case, the sreduction of the right child must be accumulated into the global solution. Specifically, each thread stores in ai the reduction of two elements, only if their owners are the same. Otherwise, only the left one is propagated, while the other is used to update g_output. Hence, the prefix of the processed chunk is propagated in a bottom-up manner, which agrees with this left-storing approach. Fig. 3 shows an example of a tree- The partial result of the first segment inside this data-block ends at s_data [0] and s_owner[0]. The kernel is responsible for s-reducing a data-block, thus a multi-pass approach is required again to completely s-reduce a larger input. This is common to all the algorithms we present in this subsection.
Warps can also be exploited to improve Zhou et al.'s algorithm by avoiding some synchronizations. Basically, we must replace lines 7, 9, 11, 13 and 15 of Algorithm 2 with a proper call to the following device routine, in order to obtain tb_s_reduceWarp: Notice that prefixes are propagated again. Replacing in TB4_Warp_reduction any call to tb_reduceWarp with a call to tb_s_reduceWarp results in the intra-warp tree-based kernel for s-reduction.
Concerning sequential matrix-based reduction, we must replace lines 7-16 of Algorithm 3 with the following code fragment to cover the segmented case: Data layout is left-to-right, top-to-bottom, hence the suffix of the processed chunk inside the row is propagated now (line 12), and the tree-based s-reduction at lines 24-25 must be rightstoring. The partial result of the last segment is sent to s_data[0] and s_owner[0].
B. Algorithmic Optimizations
The techniques we proposed for the unsegmented case can be integrated into the previous state-of-the-art segmented algorithms. Specifically, we incorporate persistent blocks, and the producer-consumer scheme afterward, into the sequential matrix-based algorithm. This latter technique requires a considerable amount of shared memory, since two data-blocks for the elements and another two data-blocks for the owners are needed at the same time for a CUDA-block. Thus, occupancy decreases on current devices, which turns into a not competitive performance as we will see.
V. EMPIRICAL RESULTS AND DISCUSSION
We have used a NVIDIA GTX 480 (capability 2.0 -Fermi, 480 cores, 1536MB of GDDR5 global memory, configured as 48KB of shared memory per multiprocessor and 16KB of L1 cache), with driver 270.61, and the CUDA Toolkit, SDK and Compute Visual Profiler 4.0. This paper focuses on empirically comparing algorithms by testing their straight implementations. Thus, we have not tuned the code as much as possible. In the experiments described below, the operator is the minimum function on float data (4 bytes). The timing information was obtained by reducing ten times a random input, and by taking their performance on average. The input size is N m*2 20 , where m ranges from 16 to 31. We also tested other operators, e.g. + on float data, obtaining similar runtimes that are not included in the paper.
Concerning global memory accesses, Fermi architecture incorporates an on-chip cache hierarchy which is fairly configurable. Specifically, accesses can be cached in both L1 and L2, which is the default setting, or in L2 only. Since our implementations report similar runtimes under both configurations, the results we present below correspond to the default mode (L1 and L2).
A. Unsegmented Reduction Results
We have tested the five algorithms previously presented:
 TB2: the tree-based solution of Algorithm 1, with 2 and 2 .  TB4-warp: the intra-warp tree-based solution included in Algorithm 2, with 4 and 4 . Figure 3 . Example of a tree-based s-reduction. The reduction of the first segment has not been written to the output buffer yet. Notice that we have only incorporated the algorithmic optimizations into the sequential matrix-based reduction. The reason is that Matrix exhibits the best performance for unsegmented reduction.
The features of the five implementations are presented in Table I on the left. It includes MBPM for persistent solutions because it determines the corresponding grid size. Runtimes and effective bandwidths are shown in Fig. 4 . Diffwarps exhibits the best throughput, with a bandwidth near to 104 GB/s. If and denote the number of bytes read and written by the algorithm, and is the runtime in seconds, effective bandwidth has been calculated as 10 ⁄ ⁄ . elements. These columns are especially relevant since they exhibit the behavior of each kernel execution individually. Notice that non-persistent solutions (TB2, TB4_Warp and Matrix) require three launches to completely reduce the input, while persistent ones (P_Matrix and Diffwarps) only need two. According to the Profiler reports, we include the following runtime information: grid size, global memory read throughput and average number of warps that are active on a multiprocessor per cycle aw/ac, which is calculated as (active warps)/(active cycles). Notice that bandwidth and aw/ac decrease as the reduction process advances. This is because the input for the first launch is larger than the input for consecutive launches. Hence, the GPU has less workload and becomes less efficient for subsequent launches, which explains why persistent solutions run faster.
B. Segmented Reduction Results
We have tested the corresponding five segmented algorithms. Their features are shown in Table II . Observe that the theoretical occupancy of most algorithms is smaller than those of their unsegmented counterparts. In the case of Diffwarps, it is so small (38%) that it is not competitive. Thus, we have added a variant which sequentially s-reduces a smaller matrix (H=32, W=16). Let Diffwarps16 denote such solution.
Notice that we then recover the 94% occupancy of the unsegmented version.
The way segments are distributed has a profound impact on the performance of all the algorithms, since the accesses to global memory, which are required when the left and right owners are different, called irregular accesses in the sequel, are irregularly spread along execution. Thus, we have tested three scenarios: (a) a single huge segment covering the whole data, (b) segments of random size, ranging from 10 to 50, and (c) many small segments of size 3. Fig. 5 shows the runtimes we have obtained for the three cases. Concerning state-of-the-art reductions, the figure shows two surprising facts: (1) TB2 runs faster than TB4-warp in the three scenarios, and (2) tree-based solutions exhibit a better performance w.r.t. sequential ones as the segment size gets smaller. We will explain the reasons we find below. With regards to the algorithmic optimizations, P_Matrix and Matrix show similar performance in the three cases, and Diffwarps and Diffwarps16 are not competitive in general, although the latter exhibits higher throughput.
Let us focus on scenario (a), that is, only one segment appears. Since irregular accesses do not take place, the results should be similar to those obtained for unsegmented reduction. Two facts remain: matrix-based beat tree-based methods, and P_Matrix improves Matrix; but two new issues come up. On the one hand, Diffwarps16 is not among the fastest solutions in segmented reduction, while Diffwarps was the fastest in the unsegmented case. Each data-block requires loading the same amount of bytes in Diffwarps16 (segmented) as in Diffwarps (unsegmented), since the matrix is halved (W=16) but owners are also loaded. On the contrary, the data size that is processed in a data-block is halved in Diffwarps16 (W=16). Hence, read bandwidth gets halved, which finally results in a suboptimal performance. The Profiler endorses such claim since global memory read throughput falls from 104.17 GB/s in Diffwarps (unsegmented) to 53.67 GB/s in Diffwarps16 (segmented), for the first launch on 30*2 20 elements.
On the other hand, the relation between TB2 and TB4-warp gets reversed from unsegmented to segmented reduction. TB4-warp has the advantage of requiring less explicit synchronization barriers, but it presents more intra-warp divergences because many threads inside a CUDA-warp are stalled inside function tb_reduceWarp. These divergences cause more harm to the segmented version of TB4-warp than to its unsegmented counterpart, because the divergent code is heavier in the segmented case due to s_reducePair_toTheLeft. With regard to the other scenarios, tree-based solutions are gaining positions as the segment size gets smaller. They overtake Diffwarps and Diffwarps16 in case (b), and exhibit the best performance in case (c). The reason we find is that the probability of getting coalesced accesses, concerning irregular accesses, increases for tree-based reductions when segments are small. To explain it let us focus on segments of size 3, i.e. case (c). We have simulated the first launch of TB2, TB4-warp and Matrix to analyze the ratio of read data transfers to requested accesses. Transfers have been counted according to the way Fermi serves memory in chunks of 32 adjacent floats (128 bytes). [5] [6] [7] [8] [9] . Notice that TB2 exhibits the lowest ratios, especially in the first levels, which indicates that irregular accesses are quite coalesced. On the contrary, the ratio for Matrix is always 1, that is, each request access is served with an independent transfer, which penalizes its performance.
VI. CONCLUSION AND FUTURE WORK
Sequential approaches have a better performance than treebased ones for unsegmented reduction. With regards to the segmented case, performance depends on the distribution of segments. According to our results for regularly-spread segments, tree-based methods exhibit a higher bandwidth for small sizes, whereas sequential ones run faster for medium and large ones.
The two optimizations we have presented result in a speedup for the unsegmented problem. Concerning the segmented case, performance is improved by using persistent blocks over segments of large size, while it remains the same for medium/small segments. Diffwarps is very shared-memory demanding, which makes its occupancy decrease. Thus, it is not competitive for segmented reduction on nowadays graphics hardware, although this could change for future devices since the current tendency has been to increase shared memory size.
The optimizations only have been integrated into the sequential matrix-based algorithm, and we plan to test them onto the tree-based methods. Finally, the reduction is a part of the scan primitive and the optimizations presented in this paper could be embedded in the scan algorithms to improve their performance. 
