Aggressive pipelining allows FPGAs to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit pipelining and reduce the efficiency and speed of an FPGA implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280MHz on a Xilinx Spartan-3 (XC3S-5000-4), the maximum frequency supported by the component's DCM.
Introduction
FPGAs have high computational density (e.g. they offer a large number of bit operations per unit spacetime) when they can be run at high throughput (e.g. [1] ). To achieve this high density, we must aggressively pipeline designs exploiting the large number of registers in FPGA architectures. In the extreme, we pipeline designs so that only a single LookUp- Table  ( LUT) delay and local interconnect is in the latency path between registers (e.g. [2] ). Pipelined at this level, conventional FPGAs should be able to run with clock rates in the hundreds of megahertz.
For acyclic designs (feed forward dataflow), it is always possible to perform this pipelining. It may be necessary to pipeline the interconnect (e.g. [3, 4] ), but the transformation can be performed and automated.
However, when a design has a cycle which has a large latency but only a few registers in the path, we cannot immediately pipeline to this limit. No legal retiming [5] will allow us to reduce the ratio between the total cycle logic delay (e.g. number of LUTs in the path) and the total registers in the cycle. This often prevents us from pipelining the design all the way down to the single LUT plus local interconnect level and consequently prevents us from operating at peak throughput to use the device efficiently. We can use the device efficiently by interleaving parallel problems in C-slow fashion (e.g. [5, 6] ), but the throughput delivered to a single data stream is limited. In a spatial pipeline of streaming operators, the throughput of the slowest operator will serve as a bottleneck, forcing all operators to run at the slower throughput, preventing us from achieving high computational density.
Saturated accumulation (Section 2.1) is a common signal processing operation with a cyclic dependence which prevents aggressive pipelining. As such, it can serve as the rate limiter in streaming applications (Section 2.2). While non-saturated accumulation is amenable to associative transformations (e.g. delayed addition [7] or block associative reduce trees (Section 2.4)), the non-associativity of the basic saturated addition operation prevents these direct transformations.
In this paper we show how to transform saturated accumulation into an associative operation (Section 3). Once transformed, we use a parallelprefix computation to avoid the apparent cyclic dependencies in the original operation (Section 2.5). As a concrete demonstration of this technique, we show how to accelerate a 16-bit accumulation on a Xilinx Spartan-3 (X3CS-5000-4) [8] from a cycle time of 11.3ns to a cycle time below 3.57ns (Section 5). The techniques introduced here are general and allow us to pipeline saturated accumulations to any throughput which the device can support.
Background

Saturated Accumulation
Efficient implementations of arithmetic on real computing devices with finite hardware must deal with the fact that integer addition is not closed over any non-trivial finite subset of the integers. Some computer arithmetic systems deal with this by using addition modulo a power of two (e.g. addition input (x i ) 0 50 100 100
11
-2 modulo sum 0 50 150 250 5 3 (mod 256) satsum (y i ) 0 50 150 250 255 253 (maxval=256) Table 1 . Accumulation Example modulo 2 32 is provided by most microprocessors). However, for many applications, modulo addition has bad effects, creating aliasing between large numbers which overflow to small numbers and small numbers. Consequently, one is driven to use a large modulus (a large number of bits) in an attempt to avoid this aliasing problem.
An alternative to using wide datapaths to avoid aliasing is to define saturating arithmetic. Instead of wrapping the arithmetic result in modulo fashion, the arithmetic sets bounds and clips sums which go out of bounds to the bounding values. That is, we define a saturated addition as:
SA(a,b,minval,maxval) { tmp=a+b; // tmp can hold sum // without wrapping if (tmp>maxval) return(maxval); elseif (tmp<minval) return(minval); else return(tmp) } Since large sums cannot wrap to small values when the precision limit is reached, this admits economical implementations which use modest precision for many signal processing applications.
A saturated accumulator takes a stream of input values x i and produces a stream of output values y i : Table 1 gives an example showing the difference between modulo and saturated accumulation.
Example: ADPCM
The decoder in the Adaptive Differential Pulse-Compression Modulation (ADPCM) application in the mediabench benchmark suite [9] provides a concrete example where saturated accumulation is the bottleneck limiting application throughput. Figure 1 shows the dataflow path for the ADPCM decoder. The only cycles which exist in the dataflow path are the two saturated accumulators. Note that we can accommodate pipeline delays at the beginning of the datapath, at the end of the datapath, and even in the middle between the two saturated accumulators (annotated in Figure 1 ) without changing the semantics of the decoder operation. As with any pipelining operation, such pipelining will change the number of cycles of latency between the input (delta) and the output (valpred).
Previous attempts to accelerate the mediabench applications for spatial (hardware or FPGA) implementation have achieved only modest acceleration on ADPCM (e.g. [10] ). This has led people to characterize ADPCM as a serial application. With the new transformations introduced here, we show how we can parallelize this application.
Associativity
Both infinite precision integer addition and modulo addition are associative. That is:
However, saturated addition is not associative. For example, consider: 250+100-11 infinite precision arithmetic: (250+100)-11 = 350-11 = 339 250+(100-11) = 250+89 = 339 modulo 256 arithmetic:
(250+100)-11 = 94-11 = 83 250+(100-11) = 250+89 = 83 saturated addition (max=255):
(250+100)-11 = 255-11 = 244 250+(100-11) = 250+89 = 255 Consequently, we have more freedom in implementing infinite precision or modulo addition than we do when implementing saturating addition.
Associative Reduce
When associativity holds, we can exploit the associative property to reshape the computation to allow pipelining. Consider a modulo-addition accumulator:
Unrolling the accumulation sum, we can write:
Exploiting associativity we can rewrite this as:
Whereas the original sum had a series delay of 3 adders, the re-associated sum has a series delay of 2 adders. In general, we can unroll this accumulation N − 1 times and reduce the computation depth from N − 1 to log 2 (N ) adders. With this reassociation, the delay of the addition tree grows as log(N ) while the number of clock sample cycles grows as N . The unrolled cycle allows us to add registers to the cycle faster (N ) than we add delays (log(N )). Consequently, we can select N sufficiently large to allow arbitrary retiming of the accumulation.
Parallel-Prefix Tree
In Section 2.4, we noted we could compute the fi- 
Figure 2. 16-input Parallel-Prefix Tree
adders. With only a constant factor more hardware, we can actually compute all N intermediate outputs: [11] ). We do this by computing and combining partial sums of the form S[s, t] which represents the sum: x s + x s+1 + . . . x t . When we build the associative reduce tree, at each level k, we are combining S[(2j) 2 k , (2j + 1) 2 k −1] and S[(2j + 1) 2 k , 2 (j + 1) 2 k −1] to compute S[(2j) 2 k , 2 (j + 1) 2 k −1] (See Figure 2 ). Consequently, we eventually compute prefix spans from 0 to 2 k -1 (the j = 0 case), but do not eventually compute the other prefixes. The observation to make is that we can combine the S[0,
we add a second (reverse) tree to compute these intermediate prefixes. At each tree level where we have a compose unit in the forward, associative reduce tree, we add (at most) one more, matching, compose unit in this reverse tree. The reverse, or prefix, tree is no larger than the reduce tree; consequently, the entire parallel-prefix tree is at most twice the size of the associative reduce tree. Figure 2 shows a width 16 parallel-prefix tree for saturated accumulation. For a more tutorial development of parallel-prefix computations see [11, 12] .
Prior Work
Balzola et al. attacked the problem of saturating accumulation at the bit level [13] . They observed they could reduce the logic in the critical cycle by computing partial sums for the possible saturation cases and using a fast, bit-level multiplexing network to rapidly select and compose the correct final sums. They were able to reduce the cycle so it only contained a single carry-propagate adder and some bit-level multiplexing. For custom designs, this minimal cycle may be sufficiently small to provide the desired throughput. In contrast, our solution makes the saturating operations associative. Our solution may be more important for FPGA designs where the designer has less freedom to implement a fast adder and must pay for programmable interconnect delays for the bit-level control.
Associative Reformulation of Saturated Accumulation
Unrolling the computation we need to perform for saturated additions, we get a chain of saturated additions (SA), such as:
We can express SA (Section 2.1) as a function using max and min:
The saturated accumulation is repeated application of this function. We seek to express this function in such a way that repeated application is function composition. This allows us to exploit the associativity of function composition [14] so we can compute saturated accumulation using a parallel-prefix tree (Section 2.5)
Technically, function composition does not apply directly to the formula for SA shown in Equation 5
because that formula is a function of four inputs (having just one output, y). Fortunately, only the dependence on y is critical at each SA-application step; the other inputs are not critical, because it is easy to guarantee that they are available in time, regardless of our algorithm. To understand repeated application of the SA function, therefore, we express SA in an alternate form in which y is a function of a single input and the other "inputs" (x, minval, and maxval) are function parameters:
We define SA[i] as the ith application of this function, which has x = x[i], m = minval, and M = maxval:
This definition allows us to view the computation as function composition. For example: To reduce the critical latency implied by Equation 8, we first combine successive nonoverlapping adjacent pairs of operations (just as we did with ordinary addition in Equation 4). For example:
To make this practical, we need an efficient way to compute each adjacent pair of operations in one step: We can visualize this fact graphically as shown in Figure 3 . Any input below minval or above maxval into the second SA will be clipped to the constant minval or maxval. Input clipping on the first SA coupled with the add offset on the second can prevent the composition from producing outputs all the way to minval or maxval (See Figure 3) . So, the extremes will certainly remain flat just like the original SA. Between these extremes, both SAs produce linear shifts of the input. Their cascade is, therefore, also a linear shift of the input so results in a slope one region. Consequently, SA[i−1, i] has the same form as SA[i] (Equation 6). As we observed, the composition, SA[i−1, i], does not necessarily have m = minval and M = maxval. However, if we allow arbitrary values for the parameters m and M , then the form shown in Equation 6 is closed under composition. This allows us to regroup the computation to reduce the number of levels in the computation. 
Composition Formula
We have just proved that the form SA [x,m,M ] is closed under composition. However, to build hardware that composes these functions, we need an actual formula for the [x, m, M ] tuple describing the composition of any two SA functions SA [x1,m1, M 1] and SA [x2,m2, M 2] .
Each SA is a sequence of three steps: TRanslation by x, followed by Clipping at the Bottom m, followed by Clipping at the Top M . We write these three primitive steps as tr x , cb m , and ct M , respectively:
As shown in Figure 4 , a composition of two SAs written in the form of Equation 10 leads to a new SA written in the same form. The calculation is the following sequence of commutation and merging of the "tr"s, "cb"s, and "ct"s:
I. Commutation of translation and clipping.
Clipping at M 1 (or m1) and then translating by x2 is the same as first translating by x2 and then clipping at M 1 + x2 (or m1 + x2).
II. Commutation of upper and lower clipping.
This is seen by case analysis: first suppose m2 ≤ M 1+x2. Then both sides of the equation are the piecewise linear function
On the other hand, if m2 > M1 + x2, then both sides are the constant function m2.
III. Merging of successive upper clipping. This is associativity of min.
Alternately, this can also be computed directly from the composed function.
Applying the Composition Formula
At the first level of the computation, m = minval and M = maxval. However, after each adjacent pair of saturating additions (SA[i−1], SA[i]) has been replaced by a single saturating addition (SA[i−1, i]), the remaining computation no longer has constant m and M . In general, therefore, a saturating accumulation specification includes a different minval and maxval for each input. We denote these values by minval[i] and maxval [i] .
The SA to be performed on input number i is then:
Composing two such functions and inlining, we get:
minval[i]), maxval[i])
We can transform this into:
This is the same thing as Figure 4, Compose Compose
maxval[i])
This gives us:
This allows us to compute SA[i, j](y) as shown in Figure 5 . One can note this is a very similar strategy to the combination of "propagates" and "generates" in carry-lookahead addition (e.g. [12] ).
Wordsize of Intermediate Values
The preceding correctness arguments rely on the assumption that intermediate values (i.e. all values ever computed by the Compose function) are mathematical integers; i.e., they never overflow. For a computation of depth k, at most 2 k numbers are ever added, so intermediate values can be represented in W +k bits if the inputs are represented in W bits. While this gives us an asymptotically tight result, we can actually do all computation with W +2 bits (2's complement representation) regardless of k.
First, notice that maxval is always between minval[i] and maxval[i]. The same is not true about minval , until we make a slight modification to Equation 16; we redefine minval as follows: , so none of these quantities ever requires more than W bits to represent. If we use (W +2)-bit datapaths for computing x , x can overflow in the tree, as the "x"s are never clipped. We argue that this does not matter. We can show that whenever x overflows, its value is ignored, because a constant function is represented (i.e. minval = maxval ). Furthermore, we need not keep track of when an overflow has occured, since if minval = maxval, then minval = maxval at all subsequent levels of the computation, as this property is maintained by Equations 17 and 19.
Putting it Together
Knowing how to compute SA[i, i−1] from the parameters for SA[i] and SA[i−1], we can unroll the computation to match the delay through the saturated addition and create a suitable parallel-prefix computation (similar to Sections 2.4 and 2.5). From the previous section, we know the core computation for the composer is, itself, saturated addition (Eqs. 15, 17, and 19). Using the saturated adder shown in Figure 6 , we build the composer as shown in Figure 7 . 
Implementation
We implemented the parallel-prefix saturated accumulator in VHDL to demonstrate functionality and get performance and area estimates. We used Modelsim 5.8 to verify the functionality of the design and Synplify Pro 7.7 and Xilinx ISE 6.1.02i to map our design onto the target device. We did not provide any area constraints and let the tools automatically place and route the design using just the timing constraints. We chose a Spartan-3 XC3S-5000-4 as our target device. The DCMs on the Spartan-3 (speed grade -4 part) support a maximum frequency of 280 Mhz (3.57ns cycle), so we picked this maximum supported frequency as our performance target.
Design Details
The parallel-prefix saturating accumulator consists of a parallel-prefix computation tree sandwiched between a serializer and deserializer as shown in Figure 8 . Consequently, we decompose the design into two clock domains. The higher frequency clock domain pushes data into the slower frequency domain of the parallel-prefix tree. The parallel-prefix tree runs at a proportionally slower rate to accomodate the saturating adders shown in Figures 6 and 7 . Minimizing the delays in the tree requires us to compute each compose in two pipeline stages. Finally, we clock the result of the prefix computation into the higher frequency clock domain in parallel then serially shift out the data at the higher clock frequency.
It is worthwhile to note that the delay through the composers is actually irrelevant to the correct operation of the saturated accumulation. The composition tree adds a uniform number of clock cycle delays between the x[i] shift register and the final saturated accumulator. It does not add to the saturated accumulation feedback latency which the unrolling must cover. This is why we can safely pipeline compose stages in the parallel-prefix tree. Area We express the area required by this design as a function of N (loop unroll factor) and W (bitwidth). Intuitively, we can quickly see that the area required for the prefix tree is roughly 5 2 3 N times the area of a single saturated adder. The initial reduce tree has roughly N compose units, as does the final prefix tree. Each compose unit has two W -bit saturated adders and one (W +2)-bit regular adder, and each adder requires roughly W/2 slices. Together, this gives us ≈ 2 × (2 × 3 + 1) NW/2 slices. Finally, we add a row of saturated adders to compute the final output to get a total of 17 2 NW slices. Compared to the base saturated adder which takes 3 2 W slices, this is a factor of 17N 3 = 5 2 3 N . Pipelining levels in the parallel-prefix tree roughly costs us 2 × 3 × N registers per level times the 2 log 2 (N ) levels for a total of 12N log 2 (N )W registers. The pair of registers for a pipe stage can fit in a single SRL16, so this should add no more than 3N log 2 (N )W slices.
This approximation does not count the overhead of the control logic in the serializer and deserializer since it is small compared to the registers. For ripple carry adders, N = O(W ) and this says area will scale as O W 2 log (W ) . If we use efficient, logdepth adders, N = O (log(W )) and area scales as O (W log (W ) log (log (W ))).
If the size of the tree is N and the frequency of the basic unpipelined saturating accumulator is f , then the system can run at a frequency f × N . By increasing the size of the parallel-prefix tree, we can make the design run arbitrarily fast, up to the maximum attainable speed of the device. In Table 2 we show the value of N (i.e. the size of the prefix tree) required to achieve a 3ns cycle target. We target this tighter cycle time (compared to the 3.57ns DCM limit) to reserve some headroom going into place and route for the larger designs.
Results Table 3 shows the clock period achieved by all the designs for N = 4 after place and route. We beat the required 3.57ns performance limit for all the cases we considered. In Table 3 we show the actual area in SLICEs required to perform the mapping for different bitwidths W . A 16-bit saturating accumulator requires 1065 SLICEs which constitutes around 2% of the XC3S-5000. We also show that an area overhead of less than 25× is required to achieve this Table 3 . Accumulator Comparison speedup over an unpipelined simple saturating accumulator; for N = 4, 5 2 3 N ≈ 23, so this is consistent with our intuitive prediction aboves.
Summary
Saturated accumulation has a loop dependency that, naively, limits single-stream throughput and our ability to fully exploit the computational capacity of modern FPGAs. We show that this loop dependence is actually avoidable by reformulating the saturated addition as the composition of a series of functions. We further show that this particular function composition is, asymptotically, no more complex than the original saturated addition operation. Function composition is associative, so this reformulation allows us to build a parallel-prefix tree in order to compute the saturated accumulation over several loop iterations in parallel. Consequently, we can unroll the saturated accumulation loop to cover the delay through the saturated adder. As a result, we show how to compute saturated accumulation at any data rate supported by an FPGA.
