It has been shown that wide Single Instruction Multiple Data architectures (wide-SIMDs) can achieve high energy efficiency, especially in domains such as image and vision processing. In these and various other application domains, reduction is a frequently encountered operation, where multiple input elements need to be combined into a single element by an associative operation, e.g. addition or multiplication. Many applications require reduction, such as partial histogram merging, matrix multiplication and min/max-finding. Wide-SIMDs contain a large number of processing elements (PEs), which in general are connected by a minimal form of interconnect for scalability reasons. To efficiently support reduction operations on wide-SIMDs with such a minimal interconnect, we introduce two novel reduction algorithms which do not rely on complex communication networks or any dedicated hardware. The proposed approaches are compared with both dedicated hardware and other software solutions in terms of performance, area, and energy consumption. A practical case study demonstrates that the proposed software approach offers much better generality and flexibility at no additional hardware cost. Compared to a dedicated hardware adder tree, the proposed software approach saves 6.8% area with a performance penalty of only 6.5%.
INTRODUCTION
Reduction is a higher-order function which combines an array of input elements through the use of an associative operation, constructing a single return value. Examples of reduction are calculating the sum of the elements of a vector, finding the maximum or minimum element in a list, and logic operations such as and, or and xor over a vector. Reduction is encountered so frequently that many programming languages, such as C++, Python, Perl and Ruby, have built-in support, although often under different names including accumulate, fold, aggregate, compress and inject.
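As a minimal illustration of reduction as a fold, the sketch below uses Python's standard `functools.reduce` with three associative operations. The input values are arbitrary and chosen only for this example.

```python
from functools import reduce
import operator

# Arbitrary example vector, not taken from any benchmark.
values = [3, 1, 4, 1, 5, 9, 2, 6]

total = reduce(operator.add, values)    # sum of all elements
largest = reduce(max, values)           # maximum element
parity = reduce(operator.xor, values)   # bitwise xor over the vector
```

Each call combines all elements into a single value. Only the combining operation changes.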
Reduction is also often encountered in the video, image and signal processing domains, which are the target domains of wide-SIMDs. Amongst others, reduction is required for kernels such as Partial Histogram Merging, Convolution, Sum of Absolute Differences, Row Projection, Min/Max finding and Matrix Multiplication. Given that reduction is such an important part of the target domains, it is imperative to support reduction in an efficient manner.
One of the main difficulties of wide-SIMDs is the interconnect between the PEs. PEs need to be able to communicate in order to synchronize or exchange data. Since there are hundreds of PEs, any form of complex interconnect soon hits a scalability wall. Therefore, wide-SIMDs typically only have a very limited form of interconnect, which puts constraints on the amount and type of communication between the PEs. This complicates the exploitation of the data level parallelism (DLP) present in reduction.
In this paper two novel reduction algorithms optimized for wide-SIMDs with minimal interconnect are proposed. These algorithms do not rely on any additional hardware and require only local communication with short wires, making this approach extremely scalable. Additionally, this software approach is completely flexible in the type of combining operation. To demonstrate the effectiveness of the proposed algorithms, we compare an implementation on a wide-SIMD with limited connectivity with both a straightforward mapping and a solution with dedicated hardware. Furthermore, a case study shows that for a practical case dedicated hardware is only 6.1% faster, while consuming 6.8% more area.
The remaining parts of this paper are organized as follows. First, the experimental setup, including the target platform and data layout, is discussed in Section 2. Next, a straightforward and two novel reduction algorithms are presented in Section 3. The novel reduction algorithms are analysed and compared with the reference approaches in Section 4, including the results of a practical case study. Finally, related work and conclusions can be found in Sections 5 and 6 respectively.
EXPERIMENTAL SETUP
This section describes the target platform used to benchmark the novel reduction algorithms. Additionally, the data layout on this platform and a dedicated hardware approach are described.
Target Architecture
The architecture used to benchmark the novel reduction algorithms is a wide-SIMD with limited interconnect. In particular this SIMD has an array with $N_{PE}$ RISC-like processing elements to exploit data level parallelism, and in parallel to that a Control Processor (CP). The CP handles the program flow and the PE Array runs in lock-step with the CP. A high level overview of the architecture is shown in Figure 1.
Neighbourhood Network
In order to communicate data, the architecture has a neighbourhood network, which is one of the most minimal types of interconnect possible. In this network, all PEs are connected in a circular fashion. A PE can access one of its neighbouring PEs' operands as a means of communication. The neighbourhood network is illustrated in Figure 2. The CP can be a part of the loop or not, depending on the configuration of the first and last PE. It is also possible to 'break' the loop and let the boundary PEs read a predefined value. This configuration can be changed at runtime.
All the wires are local and there is no complex network control involved, which greatly enhances scalability. This scalability comes at the price of degraded performance for long distance communication. The key concept here is that when a PE needs to exchange data with a PE not directly adjacent to it, that data will have to pass through all PEs in between. Every hop in this chain takes one cycle, hence long distance communication is slow and inefficient. Therefore the challenge of this network is to map algorithms in such a way that communication is kept local as much as possible.
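To make the cost of long distance communication concrete, the following minimal sketch counts hops on a closed ring. It assumes one cycle per hop, that the CP is not part of the loop, and that data may travel in either direction. The function name `hops` is hypothetical and used for illustration only.

```python
def hops(src: int, dst: int, n_pe: int) -> int:
    """Cycles to move one value from PE `src` to PE `dst` on a closed ring.

    Assumption: one hop per cycle and the shorter direction is taken.
    """
    d = abs(src - dst) % n_pe
    return min(d, n_pe - d)
```

For a ring of 128 PEs under these assumptions, a neighbour is one cycle away, while the PE on the opposite side of the ring is 64 cycles away.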
Processing Elements
The Processing Elements are RISC-like architectures with four pipeline stages. An instruction can either perform a memory or an arithmetic operation. The memory operations operate on a private data memory (DMEM) with addressing that is independent of the rest of the PE Array. Furthermore, each instruction can be predicated to differentiate the execution between the PEs.
Data Layout
The goal of the reduction techniques is to combine the elements of a vector which is distributed over the data memories of the PE Array. In particular we assume $N_{Vect}$ vectors of size $V_{size}$ elements are stored in the $N_{PE}$ data memories of the target SIMD. The $N_{Vect}$ reduced outputs have to end up in the CP. In terms of data layout in the PE Array two cases can be distinguished:
Case 1: If the vector size is smaller than or equal to the number of PEs, each vector has at most one element in the DMEM of each PE. The vectors are assumed to be stored in rows, and in case $V_{size} < N_{PE}$ the last PEs in the array are assumed to hold no elements and can be left out of consideration. In Figure 3 the position of four vectors in the DMEM of the target architecture is illustrated.
Case 2: If the vector has more elements than there are PEs, a wrap-around is required. Therefore the DMEM of a PE will contain at least one element of the vector and possibly more. It is relatively easy to convert this case to case 1, by letting each PE locally reduce all elements associated with the same vector in its private DMEM. This leads to the same layout as in case 1, where each PE has one element per vector. The conversion from case 2 to case 1 is illustrated in Figure 4.
[Figure 4: (a) Initial situation. (b) After column reduction.]

The conversion from case 2 to case 1 is a simple procedure, since there is no communication required between PEs. Given that a PE contains a maximum of $\lceil V_{size}/N_{PE} \rceil$ elements of a single vector, converting case 2 to case 1 would take $\lceil V_{size}/N_{PE} \rceil$ loads, $\lceil V_{size}/N_{PE} \rceil - 1$ combine operations and 1 store operation. This gives a total of $2 \times \lceil V_{size}/N_{PE} \rceil$ operations per vector.
All the algorithms and techniques discussed hereafter assume a data layout as shown in Figure 3. To compensate for the conversion from a layout such as in Figure 4a, an additional $2 \times \lceil V_{size}/N_{PE} \rceil \times N_{Vect}$ cycles should be added to all running times given in this work.
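The conversion from case 2 to case 1 and its operation count can be sketched as follows. This is a functional model only. It assumes that element j of a vector is stored in the DMEM of PE (j mod number of PEs), and the names `local_reduce`, `partials` and `ops_per_pe` are hypothetical.

```python
import math
import operator
from functools import reduce

def local_reduce(vector, n_pe, combine=operator.add):
    """Reduce a wrapped-around vector locally in each PE.

    Returns (partials, ops_per_pe): one partial result per PE that holds
    data, and the worst-case operation count per PE for this vector.
    """
    # Assumed layout: element j lives in the DMEM of PE (j mod n_pe).
    columns = [vector[pe::n_pe] for pe in range(n_pe)]
    partials = [reduce(combine, col) for col in columns if col]
    k = math.ceil(len(vector) / n_pe)
    ops_per_pe = k + (k - 1) + 1   # k loads, k - 1 combines, 1 store
    return partials, ops_per_pe
```

For a vector of 10 elements on 4 PEs, each PE holds at most 3 elements, so the model reports 6 operations per PE, matching twice the rounded-up ratio of vector size to PE count.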
Dedicated Reduction Hardware
To benchmark the novel reduction algorithms, they are compared with dedicated reduction hardware. Although dedicated hardware is not as scalable as a software approach, and fixes the supported combine operation at design time, it has been used in the past in wide-SIMDs as will be discussed in Section 5. Therefore it is important to compare the novel algorithms with such an approach.
For this work it was chosen to focus on summation as the combining operation, as it is one of the most common types of reduction and can be found in many kernels. Therefore a basic adder tree is added to the target architecture as the dedicated reduction hardware.
The basic adder tree is fully pipelined and can start a new computation every cycle. It is as wide as the PE Array and contains $\lceil \log_2 N_{PE} \rceil$ stages. The adder tree inputs and output are memory mapped. The PEs can input elements and the sum of those elements can be accessed by the CP.
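The stage count of such a tree can be checked with a small functional model. The sketch below is an illustration only, not the hardware design: it assumes that a level with an odd number of values is padded with zero, it does not model pipelining, and `adder_tree` is a hypothetical name.

```python
def adder_tree(inputs):
    """Sum `inputs` with a binary adder tree; return (total, stages)."""
    level = list(inputs)
    stages = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)   # pad odd levels with the additive identity
        # One tree stage: add neighbouring pairs in parallel.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        stages += 1
    return level[0], stages
```

With 128 inputs the model needs 7 stages, and with 100 inputs it also needs 7, which agrees with the rounded-up base-2 logarithm of the input width.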
SOFTWARE APPROACHES
This section contains three software approaches to map reduction to the target architecture. Straightforward reduction is an attempt to exploit the DLP within a single reduction operation, and is intended as a reference for the novel algorithms. The pipelined reduction and diagonal access reduction are the two novel algorithms that map reduction efficiently to the target architecture using no dedicated hardware extensions or complicated interconnect.
Straightforward Reduction
In typical cases the DLP in a reduction operation is exploited by performing the operations in a tree-like fashion, i.e. all operations in one layer of a binary reduction tree are executed in parallel. The mapping of such a tree to the PE Array is illustrated in Figure 5. As can be seen in Figure 5, directly mapping such a reduction tree onto the target architecture results in a mismatch with the neighbourhood network. Per cycle, data can only be transferred either one PE to the left or to the right. The red arrows in Figure 5 require communication over more than one PE, resulting in additional cycles to perform the communication. Per layer of the tree, the branches become longer and the overhead increases. The number of operations for layer i, consisting of combine plus communication operations, is given in formula 1.
The number of layers in a reduction tree for vectors of size $V_{size}$ is given in formula 2.

$$\text{layers} = \lceil \log_2 V_{size} \rceil \qquad (2)$$
Combining formulas 1 and 2, the number of required operations can be calculated, as is shown in inequality 3.
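As a sketch of how such a bound can arise, consider the following assumptions, which are chosen for illustration: layer $i$ (counting from 1) combines partial results that sit $2^{i-1}$ PEs apart, and reading the neighbour's operand and combining it costs a single operation, so that bridging a distance $d$ costs $d$ operations. The symbols $\mathrm{ops}(i)$ and $L$ are introduced here only for this example.

```latex
\begin{align*}
\mathrm{ops}(i) &= 2^{\,i-1}, \qquad i = 1, \dots, L, \qquad L = \lceil \log_2 V_{size} \rceil \\
\sum_{i=1}^{L} \mathrm{ops}(i) &= 2^{L} - 1 \;\ge\; V_{size} - 1
\end{align*}
```

Under these assumptions the tree mapping needs at least as many operations as the $V_{size} - 1$ combinations of a purely sequential reduction, with equality when $V_{size}$ is a power of two.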
From inequality 3 it can be concluded that instead of mapping the reduction tree to the SIMD, it would be just as fast, or even faster, to implement a sequential type of algorithm that simply performs the $V_{size} - 1$ combinations required to reduce one vector. This can be accomplished by shifting the elements to the CP and combining them one by one as they arrive, in parallel with the shifting. The pseudo code for this straightforward method is given in Algorithm 1. In the pseudo code, right(x) is used to indicate that element x is being read from the right neighbouring PE.
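The shift-and-combine idea can be modelled functionally as below. This is a sketch, not a transcription of Algorithm 1: it assumes one element per PE, that PE0 is the PE adjacent to the CP, and that the CP reads PE0 directly. The name `straightforward_reduce` is hypothetical.

```python
import operator

def straightforward_reduce(row, combine=operator.add):
    """Reduce one vector by shifting it towards the CP.

    Returns (result, combine_count).
    """
    pes = list(row)    # pes[0] models PE0, the PE next to the CP
    acc = pes[0]       # the CP starts with the element held by PE0
    combines = 0
    for _ in range(len(pes) - 1):
        # Every PE takes the value of its right neighbour: x = right(x).
        pes = pes[1:] + [None]
        # The CP combines the element that has just arrived in PE0.
        acc = combine(acc, pes[0])
        combines += 1
    return acc, combines
```

For a four-element vector the model performs three combine operations, one fewer than the vector size, and the PE Array contributes nothing but data movement.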
Pipelined Reduction
Since the DLP within a single vector cannot be exploited profitably with a neighbourhood network, as shown in the previous section, the parallelism has to be found elsewhere. In this section the novel pipelined reduction and diagonal access reduction algorithms are introduced, which exploit inter-vector parallelism in contrast to intra-vector parallelism. Using inter-vector parallelism, the communication pattern is transformed such that only local transactions are required.
The pseudo code for the pipelined reduction algorithm is given in Algorithm 2. Each PE operates on data from a different input vector. After a PE has performed a combine operation, the result is passed to the next PE. This PE will then load the element from its DMEM that corresponds to the vector of the received data, and repeat the process. For clarity a visualisation is given in Figure 6. In this pipelined reduction algorithm, three phases can be recognized:
Filling the pipeline:
In this phase not all PEs are active. It takes $N_{PE}$ steps before PE0 receives its first element. This phase corresponds with Figure 6a to 6c.
Full pipeline:
If $N_{Vect} \ge V_{size}$, then there will be a point where all the PEs are active. In this phase $V_{size}$ PEs will perform a useful combine operation per step in the algorithm. See Figure 6d.
Emptying the pipeline:
Once the last PE in the array (here PE3) has processed the last vector, it can be disabled. From this point onward the remaining PEs will finish one by one until PE0 completes. This corresponds with Figure 6e to 6f.
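The behaviour described above can be modelled with a short lock-step simulation. This is a functional sketch under stated assumptions, not a transcription of Algorithm 2: one element per PE per vector, the last PE starts each vector, partial results move one PE towards PE0 per step, and instruction-level overhead is ignored. The name `pipelined_reduce` is hypothetical.

```python
import operator

def pipelined_reduce(vectors, combine=operator.add):
    """vectors[v][p] is element p of vector v, held in the DMEM of PE p.

    Returns (results, steps), with results in vector order.
    """
    n_vect, n_pe = len(vectors), len(vectors[0])
    acc = [None] * n_pe              # one partial-result register per PE
    results = []
    steps = n_vect + n_pe - 1        # fill, steady state and drain together
    for t in range(steps):
        prev = acc[:]                # lock-step: PEs read last step's values
        for p in range(n_pe):
            v = t - (n_pe - 1 - p)   # vector handled by PE p in this step
            if not 0 <= v < n_vect:
                continue             # PE is predicated off (fill or drain)
            if p == n_pe - 1:
                acc[p] = vectors[v][p]   # last PE starts a new partial result
            else:
                # Combine own element with the right neighbour's partial.
                acc[p] = combine(vectors[v][p], prev[p + 1])
        if t >= n_pe - 1:
            results.append(acc[0])   # PE0 holds a finished reduction
    return results, steps
```

In this model three vectors of four elements finish in six steps, and once the pipeline is full one reduced vector leaves PE0 per step.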
Diagonal Access Reduction
If $N_{Vect} < V_{size}$, the pipelined reduction algorithm never enters the most efficient phase (phase 2). Therefore, if $N_{Vect}$ is much smaller than $V_{size}$ it is better to take a different approach. By accessing the elements in a diagonal pattern from the start and using wrap-around, efficient reduction is possible for all situations where $N_{Vect} \le V_{size}$. The pseudo code for the diagonal access reduction algorithm is given in Algorithm 3. A visualization is given in Figure 7.
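One possible reading of the diagonal pattern with wrap-around is sketched below. It is an assumption-laden model, not a transcription of Algorithm 3: the diagonal index `(p + t + 1) % n_pe`, the closed ring, and the choice to let the result of vector v end in PE v are all choices made for this illustration. Because the wrap-around changes the order in which elements are combined, the sketch also assumes a commutative operation. Moving the results from the PEs to the CP is not modelled, and `diagonal_reduce` is a hypothetical name.

```python
import operator

def diagonal_reduce(vectors, combine=operator.add):
    """vectors[v][p] is element p of vector v, held in the DMEM of PE p.

    Returns (results, steps); after the last step the result of vector v
    sits in PE v.
    """
    n_vect, n_pe = len(vectors), len(vectors[0])
    assert n_vect <= n_pe, "sketch covers N_Vect <= V_size only"
    acc = [None] * n_pe
    for t in range(n_pe):
        prev = acc[:]                        # lock-step register snapshot
        for p in range(n_pe):
            v = (p + t + 1) % n_pe           # diagonal index with wrap-around
            if v >= n_vect:
                continue                     # no vector on this diagonal
            own = vectors[v][p]
            if t == 0:
                acc[p] = own                 # start the partial result
            else:
                # Read the right neighbour through the ring and combine.
                acc[p] = combine(prev[(p + 1) % n_pe], own)
    return [acc[v] for v in range(n_vect)], n_pe
```

In this model every PE that has a vector on its diagonal is active from the first step, so two vectors of four elements are reduced in four steps without a fill phase.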
ANALYSIS AND EVALUATION
In this section the two novel reduction methods and the reference methods are analysed and evaluated in terms of running time, chip area and energy consumption.
First, running times of the various approaches are obtained by using a cycle-accurate simulator which is verified against RTL code. The measured running times are plotted as continuous lines in Figure 8. For this figure, the vector size ($V_{size}$) is fixed at 128 elements. Besides the measured values, formulas for the running time are derived from the assembly code of implementations of these algorithms. With these formulas it is possible to accurately approximate the [...] with $a = \mathrm{nextPowerOfTwo}(\lceil \log_2 N_{PE} \rceil)$.
Since the adder tree requires exactly the same number of cycles for $64 < V_{size} \le 128$, the line for the adder tree shown in Figure 8 would also hold for $V_{size} = 65$. The software approaches would however need to do less work and would finish faster for $V_{size} = 65$. To illustrate this, a purple line is added for the pipelined algorithm for $V_{size} = 65$. This line can be compared directly to the line of the adder tree, indicating how much the performance difference can vary if $V_{size}$ is between two consecutive powers of two. As can be seen in Figure 8, the pipelined and diagonal access algorithms provide a very large speed-up compared to the straightforward method for more than a couple of vectors. The pipelined approach has a high initial cost and is slow when $N_{Vect}$ is small. This effect can be mitigated by using the diagonal access algorithm in this region.
The interesting part, however, is that the running time of the adder tree grows at about the same rate as that of the pipelined reduction algorithm. In fact, as can be derived from the running time formulas, in the current implementation the pipelined reduction algorithm grows at about 2.38 cycles per vector while the adder tree grows at a rate of 2.75. At some point the software reduction would thus actually be faster than the dedicated adder tree.
This effect, though, depends on the specific target architecture. Both algorithms are in the same order of complexity and are theoretically able to grow at a rate of one cycle per vector. In the current target architecture however, one cycle is required to load the vector from memory, one to do the actual reduction, and the rest is control overhead shared over a number of vectors. An extension of the target architecture with zero overhead loop support and dual issue PEs would enable a growth of only one cycle per vector for both the dedicated hardware and the pipelined reduction method.

Table 1 shows the area overhead and energy results for a fixed input size. These numbers are obtained by synthesizing the SIMD for 400 MHz with a 40 nm TSMC library. Post synthesis simulation is used to obtain the power and energy results. As can be seen in the [...] less energy, but these numbers exclude memories. If the PE data memories are chosen to be 16 bit wide, 1 KB large and also built in 40 nm technology, the CACTI memory tool [6] estimates an access energy of 0.7564 pJ. For the tested configurations this would result in an additional energy of 9758 pJ, making the energy difference between dedicated hardware and the novel algorithms negligible.
Case Study -Fast Focus on Structures
To evaluate the effectiveness of the novel reduction algorithms in a practical application, the Fast Focus on Structures (FFoS) application [2] is mapped to the target platform.
In the FFoS algorithm the centres of OLEDs have to be detected from an image. In order to perform this task, reduction is required in two parts of the algorithm: once to merge partial histograms and convert them to a Cumulative Histogram (CH) and Cumulative Intensive Area (CIA), and once to obtain the sum of the rows of the input image (Row Projection). In this projection peaks have to be detected.

The cycle counts for the various parts of the application with both the novel software techniques and a dedicated adder tree are given in Table 2. As is shown in the table, for this practical example, the software reduction technique is even faster for the CH/CIA calculation. This is due to the flexibility of the software approach. Where the adder tree always gives its result directly to the CP, the software approach is able to do some post processing on PE0 in parallel with the CP. This takes some load off the CP, reducing the running time.
For row projection, the detection of peaks in this projection on the CP takes so much time that the reduction operations on the PEs are completely hidden by parallel operation of CP and PE Array. It is only the initial start-up cost that makes the software reduction technique slower here. For larger input images, the software approach has the potential to outperform the dedicated hardware approach.
Overall the FFoS application with dedicated hardware is only 6.1% faster than with the software reduction techniques. Yet the software techniques offer flexibility in the combine operator, scalability and no additional cost in chip area.
RELATED WORK
Reduction is encountered frequently in the target domains of wide-SIMDs and multiple solutions to support reduction have been proposed in the past. The most common approach is to implement dedicated hardware to support a fixed type of reduction. For example, S. Seo et al. [5] suggest a dedicated adder tree as an extension to AnySP [7] in order to support the H.264 video codec efficiently. Other SIMDs optimized for video processing that include dedicated hardware are SIMD-2D [3] and the work by Dong-Xiao Li et al. [4].
In the SLiM-II [1] a dedicated interconnect is used to support reduction. Essentially the red lines in Figure 5 are implemented as direct, one cycle latency, connections between PEs. To perform one reduction operation with this network, $\lceil \log_2 V_{size} \rceil$ communication steps are required. This approach is flexible in type of operation, but a single reduction operation takes $O(\lceil \log_2 V_{size} \rceil)$ operations, as consecutive operations cannot be pipelined. Furthermore, implementing the red lines as connections would result in PE0 having $\lceil \log_2 V_{size} \rceil$ additional connections, which the instruction set must support selecting from.
It is clear that efficient reduction support for wide-SIMDs is a relevant topic for many applications. The proposed solutions in the related works all use additional hardware to support reduction, causing them to either lose generality, or end up with an inherently slower and more complex design. The novel reduction algorithms introduced in this paper avoid the downsides of dedicated hardware and offer an interesting trade-off between pure performance, flexibility, scalability and chip area.
CONCLUSIONS
In this paper we have introduced two novel reduction algorithms optimized for wide-SIMDs with highly scalable, low-power interconnects that provide only minimal connectivity. It has been shown that the algorithms are much more effective than a straightforward approach and can even compete with dedicated hardware solutions. The added flexibility of the algorithms can in practical cases give an edge over hardware solutions. Since there is no additional hardware involved and only short local wires for communication are required, these software approaches are cheaper in area and can scale virtually without limit. As almost all types of interconnect provide the required connectivity, these algorithms could be mapped to existing processors that lack hardware support, and for future designs this should be a reason to reconsider adding hardware support at all.
For future work, it would be interesting to investigate whether a slightly more complex network with a few long connections could help to minimize the start-up cost of the software approaches, for example a network that allows blocks of eight PEs to be skipped with a single hop.
