Pipelining Saturated Accumulation by Papadantonakis, Karl et al.
Pipelining Saturated Accumulation
Karl Papadantonakis, Nachiket Kapre, Student Member, IEEE,
Stephanie Chan, and André DeHon, Member, IEEE
Abstract—Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve
high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit
parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a
cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative
operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device.
This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280 MHz on a Xilinx Spartan-3
(XC3S-5000-4) FPGA, the maximum frequency supported by the component’s DCM.
Index Terms—High-speed arithmetic, pipeline and parallel arithmetic and logic structures, saturated arithmetic, accumulation, parallel
prefix.
1 INTRODUCTION
OVER the last few decades, a large fraction of the clock-rate increases in microprocessors has come from
increased pipelining (e.g., [1]) to the point where modern
processors run with about 10 gate delays (e.g., fanout-four
inverter (FO4) delays) per clock cycle. ASICs, ASIC-based
DSPs, and FPGAs have traditionally not been pipelined as
heavily, but their clock rates could also be increased by
heavy pipelining (e.g., [2] and [3]). For acyclic designs
(feedforward data flow), it is always possible to pipeline
designs down to just a few gate delays (or Lookup-Table
evaluations for FPGAs). It may be necessary to pipeline the
interconnect (e.g., [3] and [4]), but the transformation can be
performed and automated.
However, when a design has a cycle with a large latency
but only a few registers in the path, we cannot immediately
pipeline to this limit. No legal retiming [5] will allow us to
reduce the ratio between the total cycle logic delay (e.g.,
number of gates in the path) and the total registers in the
cycle. This often prevents us from pipelining the design all
the way down to the gate plus local interconnect level and,
consequently, prevents us from operating at peak throughput
to use the device efficiently. This phenomenon also
impacts processors; even though the processor is heavily
pipelined, loop-carried data dependencies implied by the
cycle prevent the processor from issuing instructions for the
single instruction stream at the full clock rate. We can use
these devices efficiently by interleaving parallel problems in
C-slow (e.g., [5] and [6]) or multithreaded (e.g., [7] and [8])
fashion, but the throughput delivered to a single data
stream is limited. In a spatial pipeline of streaming
operators, the throughput of the slowest operator is the
bottleneck, forcing all operators to run at the slower
throughput, preventing us from achieving high efficiency.
Saturated accumulation (Section 2.1) is a common signal
processing operation with a cyclic dependency which
prevents aggressive pipelining. As such, it can become
the rate limiter in streaming applications (e.g., Sections 2.2
and 2.3). While nonsaturated accumulation is amenable to
associative transformations (e.g., delayed addition [9] or
block associative reduce trees (Section 2.5)), the nonassociativity
of the basic saturated addition operation prevents
these direct transformations.
In this paper, we show how to transform saturated
accumulation into an associative operation (Section 3).
Once transformed, we use a parallel-prefix computation to
avoid the apparent cyclic dependencies in the original
operation (Section 2.7). As a concrete demonstration of this
technique, we show how to accelerate a 16-bit accumulation
on a Xilinx Spartan-3 (XC3S-5000-4) FPGA [10] from
a cycle time of 11.3 ns to a cycle time below 3.57 ns
(Section 5). The techniques introduced here are general
and allow us to pipeline saturated accumulations to any
throughput which the device can support. The parallel-
prefix techniques further allow our designs to take in
multiple inputs per cycle and produce multiple outputs
per cycle (Section 6.1). As a result, we can design our
saturated accumulation to match any throughput which
the device’s I/O can support.
The techniques presented here were motivated by the
high latency of programmable interconnect in FPGAs, and
the results were first reported at an FPGA conference [11].
Nonetheless, the techniques are general and apply to any
technology, including ASICs, which can benefit from
microarchitectural transforms which enable greater pipelining
[2], and superscalar and VLIW processors, which benefit from
transformations which increase instruction-level parallelism.
For this journal version, we have included detailed
proofs and more tutorial descriptions which could not be
included in the shorter conference version, expanded the
prior work comparisons, illustrated how to exceed the one
result per cycle bound, and included discussion on
generalization of these techniques beyond saturated accumulation.
IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 2, FEBRUARY 2009
• K. Papadantonakis is with Myricom, Inc., 325 N. Santa Anita Ave.,
Arcadia, CA 91006. E-mail: karl.papadantonakis@gmail.com.
• N. Kapre is with the Department of Electrical and Systems Engineering,
200 S. 33rd St., Philadelphia, PA 19104, and also with the California
Institute of Technology, Pasadena, CA 91125.
E-mail: nachiket@ieee.org.
• S. Chan is with Numerica Corp., 4850 Hahns Peak Dr., Suite 200,
Loveland, CO 80538. E-mail: stephanie.t.chan@gmail.com.
• A. DeHon is with the Department of Electrical and Systems Engineering,
University of Pennsylvania, 200 S. 33rd St., Philadelphia, PA 19104.
E-mail: andre@seas.upenn.edu.
Manuscript received 22 July 2007; revised 31 Dec. 2007; accepted 25 June
2008; published online 16 July 2008.
Recommended for acceptance by P. Kornerup, P. Montuschi, J.-M. Muller,
and E. Schwarz.
For information on obtaining reprints of this article, please send e-mail to:
tc@computer.org, and reference IEEECS Log Number TCSI-2007-07-0368.
Digital Object Identifier no. 10.1109/TC.2008.110.
0018-9340/09/$25.00 © 2009 IEEE Published by the IEEE Computer Society
2 BACKGROUND
2.1 Saturated Accumulation
Efficient implementations of arithmetic on real computing
devices with finite hardware must deal with the fact that
integer addition is not closed over any nontrivial finite
subset of the integers. Some computer arithmetic systems
deal with this by using addition modulo a power of two
(e.g., addition modulo $2^{32}$ is provided by most
microprocessors). However, for many applications, modulo addition
has bad effects: large sums that overflow wrap around to small
values and become indistinguishable from genuinely small sums.
Consequently, one is driven to use a large modulus (a large
number of bits) in an attempt to avoid this aliasing problem.
An alternative to using wide data paths to avoid aliasing
is to define saturating arithmetic. Instead of wrapping the
arithmetic result in modulo fashion, the arithmetic sets
bounds and clips sums which go out of bounds to the
bounding values. That is, we define a saturated addition as
int SA(int a, int b, int minval, int maxval) {
    int tmp = a + b;  // tmp must be wide enough to hold the sum
                      // without wrapping
    if (tmp > maxval)
        return maxval;
    else if (tmp < minval)
        return minval;
    else
        return tmp;
}
Since large sums cannot wrap to small values when the
precision limit is reached, this admits economical imple-
mentations which use modest precision for many signal
processing applications.
A saturated accumulator takes a stream of input
values $x_i$ and produces a stream of output values $y_i$:
$$y_i = \mathrm{SA}(y_{i-1}, x_i, \mathit{minval}, \mathit{maxval}). \tag{1}$$
Table 1 gives an example showing the difference between
modulo and saturated accumulation.
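The contrast between the two accumulation styles can be sketched in a few lines of Python. This is only an illustration: the 8-bit unsigned range, the initial accumulator value of 0, and the input values are assumptions chosen for the sketch, not the contents of Table 1.

```python
def saturating_add(a, b, minval, maxval):
    """Add a and b exactly, then clip the sum to [minval, maxval]."""
    return min(max(a + b, minval), maxval)

def accumulate_modulo(xs, bits=8):
    """Running sum that wraps modulo 2**bits (unsigned)."""
    y, out = 0, []
    for x in xs:
        y = (y + x) % (1 << bits)
        out.append(y)
    return out

def accumulate_saturated(xs, minval=0, maxval=255):
    """Running sum as in (1): each step clips instead of wrapping."""
    y, out = 0, []
    for x in xs:
        y = saturating_add(y, x, minval, maxval)
        out.append(y)
    return out

inputs = [200, 100, -11]             # illustrative values only
print(accumulate_modulo(inputs))     # [200, 44, 33]
print(accumulate_saturated(inputs))  # [200, 255, 244]
```

The modulo accumulator aliases the overflowing sum 300 to 44, while the saturated accumulator pins it at the upper bound.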
2.2 Example: ADPCM
The decoder in the Adaptive Differential Pulse-Code
Modulation (ADPCM) application in the mediabench
benchmark suite [12] provides a concrete example where
saturated accumulation is the bottleneck limiting application
throughput. Fig. 1 shows the data-flow path for the ADPCM
decoder. The only cycles which exist in the data-flow path are
the two saturated accumulators. Note that we can accommodate
pipeline delays at the beginning of the data path, at
the end of the data path, and even in the middle between the
two saturated accumulators (annotated in Fig. 1) without
changing the semantics of the decoder operation. As with any
pipelining operation, such pipelining will change the number
of cycles of latency between the input (delta) and the output
(valpred).
Previous attempts to accelerate the mediabench applica-
tions for spatial (hardware or FPGA) implementation have
achieved only modest acceleration on ADPCM (e.g., [13]).
This has led people to characterize ADPCM as a serial
application. With the new transformations introduced here,
we show how we can parallelize this application.
If we had multiple, independent ADPCM streams to
decode, we could C-slow (e.g., [5] and [6]) the design and
run C interleaved streams through a highly pipelined data
path. The techniques introduced here address the cases
where we either want to accelerate a single stream or where
it is advantageous to avoid the additional latency, complexity,
or state storage required in order to interleave streams.
2.3 Example: Telecommunication Standards
Many telecommunication standards (e.g., ETSI/3GPP
enhanced full rate and adaptive multirate speech processing,
ITU G.723.1, ITU G.729) provide specifications or
reference implementations based on limited-precision
saturated arithmetic. For new implementations of the
standard to be credible, it is advantageous, and often
necessary, for the implementations to provide bit-exact
results which match the standard. The technique we
demonstrate here allows parallelism and pipelining in
the saturated accumulations while remaining bit-exact
with the serial reference specification.
TABLE 1
Accumulation Example
Fig. 1. Data flow for ADPCM decode.
2.4 Associativity
Both infinite precision integer addition and modulo addition
are associative. That is, $(A + B) + C = A + (B + C)$.
However, saturated addition is not associative. For example,
consider $250 + 100 - 11$:
infinite precision arithmetic:
$(250 + 100) - 11 = 350 - 11 = 339$,
$250 + (100 - 11) = 250 + 89 = 339$;
modulo 256 arithmetic:
$(250 + 100) - 11 = 94 - 11 = 83$,
$250 + (100 - 11) = 250 + 89 = 83$;
saturated addition ($\mathit{max} = 255$):
$(250 + 100) - 11 = 255 - 11 = 244$,
$250 + (100 - 11) = 250 + 89 = 255$.
Consequently, we have more freedom in implementing
infinite precision or modulo addition than we do when
implementing saturating addition.
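The three cases above can be checked mechanically. The following Python sketch assumes an unsigned 8-bit range (0 to 255) for the saturating case; the helper names are illustrative.

```python
def mod_add(a, b, modulus=256):
    """Addition modulo `modulus`."""
    return (a + b) % modulus

def sat_add(a, b, lo=0, hi=255):
    """Addition clipped to [lo, hi]."""
    return min(max(a + b, lo), hi)

a, b, c = 250, 100, -11

# Modulo addition: both groupings agree.
left_mod = mod_add(mod_add(a, b), c)    # (94) - 11 = 83
right_mod = mod_add(a, mod_add(b, c))   # 250 + 89 = 339, which wraps to 83

# Saturated addition: the groupings disagree.
left_sat = sat_add(sat_add(a, b), c)    # 255 - 11 = 244
right_sat = sat_add(a, sat_add(b, c))   # 250 + 89 clips to 255

print(left_mod, right_mod, left_sat, right_sat)  # 83 83 244 255
```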
2.5 Associative Reduce
When associativity holds, we can exploit the associative
property to reshape the computation to allow pipelining.
Consider a modulo-addition accumulator
$$y_i = y_{i-1} + x_i. \tag{2}$$
Unrolling the accumulation sum, we can write
$$y_i = ((y_{i-3} + x_{i-2}) + x_{i-1}) + x_i. \tag{3}$$
Exploiting associativity, we can rewrite this as
$$y_i = ((y_{i-3} + x_{i-2}) + (x_{i-1} + x_i)). \tag{4}$$
Whereas the original sum had a series delay of three adders,
the reassociated sum has a series delay of two adders (see
Fig. 2). In general, we can unroll this accumulation $N - 1$
times and reduce the computation depth from $N - 1$ to
$\log_2(N)$ adders.
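A small Python sketch of the reassociation: it reduces a list pairwise, level by level, and reports how many adder levels were needed. The function name `tree_reduce` and the modulo-256 operator are assumptions made for the illustration; the input list must be non-empty.

```python
def tree_reduce(values, op):
    """Reduce `values` with a balanced tree; return (result, depth).

    Operand order is preserved, so only associativity of `op` is needed.
    """
    level = list(values)
    depth = 0
    while len(level) > 1:
        nxt = [op(level[k], level[k + 1]) for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth

add_mod256 = lambda a, b: (a + b) % 256

# Four operands as in (4): y[i-3], x[i-2], x[i-1], x[i].
print(tree_reduce([10, 250, 7, 3], add_mod256))   # (14, 2)
# Eight operands need three levels instead of seven serial additions.
print(tree_reduce(list(range(8)), add_mod256))    # (28, 3)
```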
2.6 Asymmetric Associative Reduction and Partial
Unrolling
Associativity actually allows us to take things a step further.
Instead of building balanced reduce trees, we can build
unbalanced trees that allow us to reduce the delay on some
inputs more than others (see Fig. 3a). In particular, this
allows us to minimize the delay on the feedback cycle.
Consequently, the delay in the cyclic path can be a single
operation delay rather than the $O(\log(N))$ delay for a
balanced tree. In many cases, it is sufficient to unroll the
loop only N additions in order to accommodate the delay of
the single operator in the feedback. As shown in Fig. 3b, the
associative reduction preceding the feedback path can now
be pipelined to match the achievable clock rate of the final
feedback cycle.
2.7 Parallel-Prefix Tree
In Section 2.5, we noted we could compute the final sum
of $N$ values in $O(\log(N))$ time using $O(N)$ adders. With
only a constant factor more hardware, we can actually
compute all $N$ intermediate outputs: $y_i, y_{i-1}, \ldots, y_{i-(N-1)}$
(e.g., [14] and [15]).
We do this by computing and combining partial sums of
the form $S[s, t]$, which represents the sum $x_s + x_{s+1} + \cdots + x_t$.
When we build the associative reduce tree, at each level $k$,
we are combining $S[(2j)2^k, (2j+1)2^k - 1]$ and
$S[(2j+1)2^k, 2(j+1)2^k - 1]$ to compute $S[(2j)2^k, 2(j+1)2^k - 1]$ (see
Fig. 4). Consequently, we eventually compute prefix spans
from 0 to $2^k - 1$ (the $j = 0$ case), but do not eventually
compute the other prefixes. The observation to make is
that we can combine the $S[0, 2^{k_0} - 1]$ prefixes with the
$S[2^{k_0}, 2^{k_0} + 2^{k_1} - 1]$ spans ($k_1 < k_0$) to compute the
intermediate results. To compute the full prefix sequence
$(S[0, 1], S[0, 2], \ldots, S[0, N - 1])$, we add a second (reverse)
tree to compute these intermediate prefixes. At each
tree level where we have a compose unit in the
forward, associative reduce tree, we add (at most) one
more, matching, compose unit in this reverse tree. The
reverse, or prefix, tree is no larger than the reduce tree;
consequently, the entire parallel-prefix tree is at most twice
the size of the associative reduce tree. Fig. 4 shows a width
16 parallel-prefix tree for associative accumulation. For a
more tutorial development of parallel-prefix computations,
see [14] and [16].
Fig. 2. Using associativity to reduce serial adder delay.
Fig. 3. Asymmetric associativity to reduce delay around feedback cycle.
(a) Fast path for feedback. (b) Pipelining inputs to match feedback cycle.
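The forward (reduce) tree followed by a reverse (prefix) tree can be sketched in software as follows. This is a generic up-sweep/down-sweep scan written for the illustration; it assumes a power-of-two input length and is not claimed to match the exact wiring of Fig. 4. Within each level the operator applications are independent, which is what a hardware tree exploits.

```python
def prefix_scan(xs, op):
    """Inclusive prefix scan using a reduce tree plus a reverse tree.

    Assumes len(xs) is a power of two. Operand order is preserved, so
    `op` only needs to be associative, not commutative.
    """
    n = len(xs)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    a = list(xs)
    # Forward tree: a[i] accumulates the span of width 2*d ending at i.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] = op(a[i - d], a[i])
        d *= 2
    # Reverse tree: extend completed prefixes by the shorter spans.
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            a[i] = op(a[i - d], a[i])
        d //= 2
    return a

print(prefix_scan([1, 2, 3, 4, 5, 6, 7, 8], lambda p, q: p + q))
# [1, 3, 6, 10, 15, 21, 28, 36]
```

For N inputs the forward tree uses N - 1 operators and the reverse tree uses fewer than that, consistent with the "at most twice the size" bound stated above.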
We can also build asymmetric parallel-prefix trees to
minimize the delay on the critical feedback cycle as
described in Section 2.6. In Fig. 4, the $y[-1]$ input allows
the $y[15]$ term to feed back to the final adder stage with a
single adder delay of latency in the case where we have
partially unrolled a longer accumulation stream so we can
process 16 inputs in a single adder-delay cycle time.
2.8 Delayed Addition
For associative operations, we can use a redundant
representation for the accumulation sum and exploit
delayed addition [9] to achieve full-adder-bit-level pipelining.
This will likely result in a more compact implementation
than the example in the previous section. However, the
associative reduce tree will be more directly applicable to
our solution with the transformations introduced in the
next section (Section 3).
2.9 Prior Work
de Dinechin et al. attacked the problem of saturating
accumulation at the DSP instruction level [17]. They show
how to get a factor of two speedup by cutting the sequence
of saturated additions in half and processing the two halves
in parallel. Similar to the technique presented here, their
algorithm computes revised maximum and minimum
values on the second half of the sequence so they can
correctly compose and saturate the second half sum with
the saturated sum of the first half. They do not show how to
recurse their decomposition or generally describe how to
achieve greater parallelism. Further, their algorithm only
produces the final result $y_{N-1}$, where $N$ is their saturated
accumulation block size, and not the intermediate results
$y_0, y_1, \ldots, y_{N-2}$.
Balzola et al. show how to achieve bit-exact saturated
accumulation by implementing an N-input saturated adder
[18], [19]. TheirN-input adder structure is similar in spirit to
the unrolled associative additions in Sections 2.5 and 2.6 in
that they unroll by a factor of N to accumulate N values in a
delay slightly greater than one carry-propagate adder delay.
Since the saturated addition is not associative, they
independently compute additions for all possible saturations
in the prefix; on the critical feedback path, they only
need to select the appropriate inputs rather than perform a
complete addition for all but the final addition. As a result,
their area grows as $O(N^2)$, and the delay of their unrolled
$N$-input addition is one adder delay plus $O(N)$ multiplexer
(mux) delays. Asymptotically, the design provides only a
constant speedup (i.e., from one adder delay per input to
one mux delay per input). Their design also only produces
the final result of the $N$-input accumulation, $y_{N-1}$, and not
the intermediate results $y_0, y_1, \ldots, y_{N-2}$.
In contrast, we show how to make saturated accumulation
associative (Section 3), enabling the use of efficient
parallel-prefix techniques (Section 2.7). Parallel prefix
allows us to achieve arbitrary speedups and to produce
all the intermediate results $(y_0, y_1, \ldots, y_{N-2})$ in the
accumulation. Further, the parallel-prefix technique allows us to keep
the area linear ($O(N)$) in the unrolling factor, $N$. Latency
from input ($x_i$) to output ($y_i$) is $O(\log(N))$. After presenting
our sample implementation results in Section 5.4, we
provide a quantitative comparison to the speedups and
area overheads reported for the Balzola implementation.
Fig. 4. Sixteen-input parallel-prefix tree.
3 ASSOCIATIVE REFORMULATION OF SATURATED
ACCUMULATION
3.1 Saturated Addition as a Transformation
Function
Unrolling the computation we need to perform for
saturated additions, we get a chain of saturated additions
(SA) as shown in Fig. 5a. We can express SA (Section 2.1) as
a function using max and min:
$$\mathrm{SA}(y, x, \mathit{minval}, \mathit{maxval}) = \min(\max((y + x), \mathit{minval}), \mathit{maxval}). \tag{5}$$
The saturated accumulation is repeated application of this
function. We seek to express this function in such a way that
repeated application is function composition. This allows us
to exploit the associativity of function composition [20] so
we can compute saturated accumulation using a parallel-
prefix tree (Section 2.7).
Technically, function composition does not apply
directly to the formula for SA shown in (5) because that
formula is a function of four inputs (having just one
output, $y$). Fortunately, only the dependence on $y$ is critical
at each SA-application step; the other inputs are not
critical, because it is easy to guarantee that they are
available in time, regardless of our algorithm. To understand
repeated application of the SA function, therefore,
we express SA in an alternate form in which $y$ is a function
of a single input and the other "inputs" ($x$, minval, and
maxval) are function parameters:
$$\mathrm{SA}[x, m, M](y) \overset{\mathrm{def}}{=} \mathrm{SA}(y, x, m, M). \tag{6}$$
We define $\mathrm{SA}[i]$ as the $i$th application of this function,
which has $x = x[i]$, $m = \mathit{minval}$, and $M = \mathit{maxval}$:
$$\mathrm{SA}[i] \overset{\mathrm{def}}{=} \mathrm{SA}[x[i], \mathit{minval}, \mathit{maxval}]. \tag{7}$$
This definition allows us to view the computation as
function composition. For example
$$y[i] = \mathrm{SA}[i] \circ \mathrm{SA}[i-1] \circ \mathrm{SA}[i-2] \circ \mathrm{SA}[i-3]\,(y[i-4]) \tag{8}$$
(see Fig. 5b).
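The single-input view in (6) through (8) maps directly onto closures. The sketch below uses illustrative 8-bit signed bounds and input values; `make_sa` and `compose` are names introduced here, not from the paper. It shows that regrouping the composition leaves the result unchanged, since function composition is associative.

```python
from functools import reduce

def make_sa(x, m, M):
    """Return the single-input function SA[x, m, M] of (6)."""
    return lambda y: min(max(y + x, m), M)

def compose(f, g):
    """Return f o g, i.e. the function y -> f(g(y))."""
    return lambda y: f(g(y))

MINVAL, MAXVAL = -128, 127          # assumed bounds for the illustration
xs = [100, 60, -30, 90]             # x[i-3], x[i-2], x[i-1], x[i]
sas = [make_sa(x, MINVAL, MAXVAL) for x in xs]

# (8): SA[i] o SA[i-1] o SA[i-2] o SA[i-3], built serially.
chain = reduce(lambda acc, f: compose(f, acc), sas)

# The same composition regrouped into two adjacent pairs.
regrouped = compose(compose(sas[3], sas[2]), compose(sas[1], sas[0]))

print(chain(0), regrouped(0))       # 127 127
```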
3.2 Composing the SA Functions
To reduce the critical latency implied by (8), we first
combine successive nonoverlapping adjacent pairs of
operations (just as we did with ordinary addition in (4)).
For example,
$$y[i] = ((\mathrm{SA}[i] \circ \mathrm{SA}[i-1]) \circ (\mathrm{SA}[i-2] \circ \mathrm{SA}[i-3]))\,(y[i-4]).$$
To make this practical, we need an efficient way to
compute each adjacent pair of operations in one step:
$$\mathrm{SA}[i-1, i] \overset{\mathrm{def}}{=} \mathrm{SA}[i] \circ \mathrm{SA}[i-1]. \tag{9}$$
This composition is shown in Fig. 5c.
Viewed (temporarily) as a function of real numbers, $\mathrm{SA}[i]$
is a continuous, piecewise linear function, because it is a
composition of "min," "max," and "+," each of which is
continuous and piecewise linear (with respect to each of
its inputs). It is a well-known fact that any composition of
continuous, piecewise linear functions is itself continuous
and piecewise linear (we demonstrate this for our particular
case below). We can easily visualize the continuity and
piecewise linearity of $\mathrm{SA}[i]$ (see Fig. 6).
Let us now try to understand the mathematical form of
the function $\mathrm{SA}[i-1, i]$. As the base functions $\mathrm{SA}[i-1]$
and $\mathrm{SA}[i]$ are continuous and piecewise linear, their
composition (i.e., $\mathrm{SA}[i-1, i]$) must also be continuous
and piecewise linear. The key thing we need to understand
is: how many segments does $\mathrm{SA}[i-1, i]$ have? Since
$\mathrm{SA}[i-1]$ and $\mathrm{SA}[i]$ each have just one bounded segment of
slope one, we argue that their composition must also
have just one bounded segment of slope 1 and have the
form of (6).
Fig. 5. Saturated addition sequence viewed as function composition.
(a) Unrolled chain of four saturated additions. (b) Viewing each
saturated addition as a transformation function from one input to one
output. (c) Composing the single-input, single-output functions for a pair
of connected, saturated additions to define a new single-input, single-
output transformation function.
Fig. 6. Operation performed by one saturated addition.
We can visualize this fact graphically as shown in Fig. 7.
Any input below minval or above maxval (Fig. 7b) into
the second SA will be clipped to the constant minval or
maxval. Input clipping on the first SA coupled with the
add offset on the second can prevent the composition from
producing outputs all the way to minval or maxval
(Fig. 7a). So, the extremes will certainly remain flat just like
the original SA. Between these extremes, both SAs produce
linear shifts of the input. Their cascade is, therefore, also a
linear shift of the input and results in a slope one region
(Fig. 7c). Consequently, $\mathrm{SA}[i-1, i]$ has the same form as
$\mathrm{SA}[i]$ (6). As we observed, the composition $\mathrm{SA}[i-1, i]$ does
not necessarily have $m = \mathit{minval}$ and $M = \mathit{maxval}$. However,
if we allow arbitrary values for the parameters $m$ and
$M$, then the form shown in (6) is closed under composition.
This allows us to regroup the computation to reduce the
number of levels in the computation.
3.3 Composition Formula
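We have just proved that the form $\mathrm{SA}[x, m, M]$ is closed under
composition. However, to build hardware that composes
these functions, we need an actual formula for the $[x, m, M]$
tuple describing the composition of any two SA functions
$\mathrm{SA}[x_1, m_1, M_1]$ and $\mathrm{SA}[x_2, m_2, M_2]$.
Each SA is a sequence of three steps: TRanslation by $x$,
followed by Clipping at the Bottom $m$, followed by
Clipping at the Top $M$. We write these three primitive
steps as $\mathrm{tr}_x$, $\mathrm{cb}_m$, and $\mathrm{ct}_M$, respectively:
$$\begin{aligned}
\mathrm{tr}_x(y) &\overset{\mathrm{def}}{=} y + x, \\
\mathrm{cb}_m(y) &\overset{\mathrm{def}}{=} \max(y, m), \\
\mathrm{ct}_M(y) &\overset{\mathrm{def}}{=} \min(y, M), \\
\mathrm{SA}[x, m, M] &= \mathrm{ct}_M \circ \mathrm{cb}_m \circ \mathrm{tr}_x.
\end{aligned} \tag{10}$$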
As shown in Fig. 8, a composition of two SAs written in
the form of (10) leads to a new SA written in the same form.
Fig. 7. Saturated add composition. (a) Clip low. (b) Clip high. (c) Linear region. (d) Composite function. N.b. axes are rotated for the $\mathrm{SA}[i]$ transform so
that we can align the $y[i-1]$ output from the $\mathrm{SA}[i-1]$ transform with the $y[i-1]$ input to the $\mathrm{SA}[i]$ transform.
Fig. 8. Operator composition for chained saturated additions.
The calculation is the following sequence of commutation
and merging of the "tr"s, "cb"s, and "ct"s:
1. Commutation of translation and clipping. Clipping
at $M_1$ (or $m_1$) and then translating by $x_2$ is the same
as first translating by $x_2$ and then clipping at
$M_1 + x_2$ (or $m_1 + x_2$).
2. Commutation of upper and lower clipping:
$$\mathrm{cb}_{m_2} \circ \mathrm{ct}_{M_1 + x_2} = \mathrm{ct}_{\max(M_1 + x_2,\, m_2)} \circ \mathrm{cb}_{m_2}.$$
This is seen by case analysis: first, suppose
$m_2 \le M_1 + x_2$. Then, both sides of the equation are
the piecewise linear function
$$\begin{cases}
M_1 + x_2, & y \ge M_1 + x_2, \\
m_2, & y \le m_2, \\
y, & \text{otherwise}.
\end{cases} \tag{11}$$
On the other hand, if $m_2 > M_1 + x_2$, then both sides
are the constant function $m_2$.
3. Merging of successive upper clipping. This is
associativity of min.
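The three rewriting rules above can be checked by brute force over a small integer range. In this Python sketch, `tr`, `cb`, and `ct` mirror the primitives of (10); the checker `rules_hold` and its range are assumptions of the illustration, and a passing check over a finite range is evidence, not a proof.

```python
from itertools import product

def tr(x):
    """Translation by x."""
    return lambda y: y + x

def cb(m):
    """Clip at the bottom m."""
    return lambda y: max(y, m)

def ct(M):
    """Clip at the top M."""
    return lambda y: min(y, M)

def rules_hold(lo=-6, hi=6):
    """Check rules 1-3 for every y, x2, a, b in [lo, hi]."""
    vals = range(lo, hi + 1)
    for y, x2, a, b in product(vals, repeat=4):
        # Rule 1: clip then translate == translate then clip at shifted bound.
        if tr(x2)(ct(a)(y)) != ct(a + x2)(tr(x2)(y)):
            return False
        if tr(x2)(cb(a)(y)) != cb(a + x2)(tr(x2)(y)):
            return False
        # Rule 2: a plays the role of M1 + x2, b the role of m2.
        if cb(b)(ct(a)(y)) != ct(max(a, b))(cb(b)(y)):
            return False
        # Rule 3: two successive upper clips merge into one.
        if ct(b)(ct(a)(y)) != ct(min(a, b))(y):
            return False
    return True

print(rules_hold())  # True
```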
3.4 Applying the Composition Formula
At the first level of the computation, $m = \mathit{minval}$ and
$M = \mathit{maxval}$. However, after each adjacent pair of saturating
additions $(\mathrm{SA}[i-1], \mathrm{SA}[i])$ has been replaced by a single
saturating addition $(\mathrm{SA}[i-1, i])$, the remaining computation
no longer has constant $m$ and $M$. In general, therefore, a
saturating accumulation specification includes a different
minval and maxval for each input. We denote these
values by $\mathit{minval}[i]$ and $\mathit{maxval}[i]$.
The SA to be performed on input number $i$ is then
$$\mathrm{SA}[i](y) = \min(\max((y + x[i]), \mathit{minval}[i]), \mathit{maxval}[i]). \tag{12}$$
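Composing two such functions and inlining, we get
$$\begin{aligned}
\mathrm{SA}[i-1, i](y) &= \mathrm{SA}[i](\mathrm{SA}[i-1](y)) \\
&= \min(\max((\min(\max((y + x[i-1]), \mathit{minval}[i-1]), \\
&\qquad \mathit{maxval}[i-1]) + x[i]), \\
&\qquad \mathit{minval}[i]), \mathit{maxval}[i]).
\end{aligned} \tag{13}$$
We can transform this into
$$\begin{aligned}
\mathrm{SA}[i-1, i](y) = \min(&\max((y + x[i-1] + x[i]), \\
&\qquad \max((\mathit{minval}[i-1] + x[i]), \mathit{minval}[i])), \\
&\min(\max((\mathit{maxval}[i-1] + x[i]), \mathit{minval}[i]), \mathit{maxval}[i])).
\end{aligned} \tag{14}$$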
This is the same computation as Fig. 8, when we let
$M_2 = \mathit{maxval}[i]$, $m_2 = \mathit{minval}[i]$, $M_1 = \mathit{maxval}[i-1]$, and
$m_1 = \mathit{minval}[i-1]$.
Now we define Compose as the six-input, three-output
function which computes a description of $\mathrm{SA}[i-1, i]$ given
descriptions of $\mathrm{SA}[i-1]$ and $\mathrm{SA}[i]$:
$$x' = x[i-1] + x[i], \tag{15}$$
$$\mathit{minval}' = \max((\mathit{minval}[i-1] + x[i]), \mathit{minval}[i]), \tag{16}$$
$$\mathit{maxval}' = \min(\max((\mathit{maxval}[i-1] + x[i]), \mathit{minval}[i]), \mathit{maxval}[i]). \tag{17}$$
This gives us
$$\mathrm{SA}[i-1, i](y) = \min(\max((y + x'), \mathit{minval}'), \mathit{maxval}'). \tag{18}$$
Note that this is exactly the same form as (5), with the
primed variables replacing the original input variables. This
allows us to compute $\mathrm{SA}[i, j](y)$ as shown in Fig. 9. One can
note that this is a very similar strategy to the combination of
"propagates" and "generates" in carry-look-ahead addition
(e.g., [15], [16], and [21]).
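As a software sketch of Compose, each SA function can be represented by its `(x, minval, maxval)` tuple and combined with (15) through (17). The tuple layout, the helper names, and the sample inputs are assumptions of this illustration. The prefixes are formed serially here with `itertools.accumulate`; because Compose is associative, the same prefixes could equally be produced by a parallel-prefix tree.

```python
from itertools import accumulate

def compose(f1, f2):
    """Describe 'apply f1, then f2' as one (x, minval, maxval) tuple."""
    x1, m1, M1 = f1
    x2, m2, M2 = f2
    return (x1 + x2,                      # (15)
            max(m1 + x2, m2),             # (16)
            min(max(M1 + x2, m2), M2))    # (17)

def apply_sa(f, y):
    """Evaluate the SA function described by f at y, as in (18)."""
    x, m, M = f
    return min(max(y + x, m), M)

def serial_accumulate(xs, y0, minval, maxval):
    """Reference: one saturated addition per input, in order."""
    out, y = [], y0
    for x in xs:
        y = apply_sa((x, minval, maxval), y)
        out.append(y)
    return out

def prefix_accumulate(xs, y0, minval, maxval):
    """Compose prefixes first, then apply each one to y0."""
    leaves = [(x, minval, maxval) for x in xs]
    return [apply_sa(f, y0) for f in accumulate(leaves, compose)]

xs = [100, 60, -30, 90]
print(serial_accumulate(xs, 0, -128, 127))  # [100, 127, 97, 127]
print(prefix_accumulate(xs, 0, -128, 127))  # [100, 127, 97, 127]
```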
3.5 Wordsize of Intermediate Values
The preceding correctness arguments rely on the assumption
that intermediate values (i.e., all values ever computed
by the Compose function) are mathematical integers; i.e.,
they never overflow. For a computation of depth $k$, at most
$2^k$ numbers are ever added, so intermediate values can be
represented in $W + k$ bits if the inputs are represented in
$W$ bits. While this gives us an asymptotically tight result,
we can do better from a practical point of view; we can
actually do all computation with $W + 2$ bits (2's complement
representation) regardless of $k$.
Fig. 9. Composition of $\mathrm{SA}[(i-3), i]$.
First, notice that $\mathit{maxval}'$ is always between $\mathit{minval}[i]$ and
$\mathit{maxval}[i]$. The same is not true about $\mathit{minval}'$, until we make
a slight modification to (16); we redefine $\mathit{minval}'$ as follows:
$$\mathit{minval}' = \min(\max((\mathit{minval}[i-1] + x[i]), \mathit{minval}[i]), \mathit{maxval}[i]). \tag{19}$$
This change does not affect the result because it only causes
a decrease in $\mathit{minval}'$ when it is greater than $\mathit{maxval}'$. While
it is more work to do the extra operation, it is only a
constant increase, and this extra work is done anyway if the
hardware for $\mathit{maxval}'$ is reused for $\mathit{minval}'$ (see Section 4).
With this change, the interval $[\mathit{minval}', \mathit{maxval}']$ is contained
in the interval $[\mathit{minval}[i], \mathit{maxval}[i]]$, so none of these
maximum or minimum values ever requires more than
$W$ bits to represent.
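A short Python sketch can illustrate why the extra clip in (19) is harmless. The two functions below differ only in how they compute the lower bound; the tuple layout and the example operands (8-bit signed bounds, with the second operand itself a composition of two inputs of 255) are assumptions chosen so that (16) actually exceeds maxval[i].

```python
def compose_16(f1, f2):
    """Compose using the unclipped lower bound of (16)."""
    x1, m1, M1 = f1
    x2, m2, M2 = f2
    return (x1 + x2, max(m1 + x2, m2), min(max(M1 + x2, m2), M2))

def compose_19(f1, f2):
    """Compose using the clipped lower bound of (19)."""
    x1, m1, M1 = f1
    x2, m2, M2 = f2
    return (x1 + x2, min(max(m1 + x2, m2), M2), min(max(M1 + x2, m2), M2))

def apply_sa(f, y):
    x, m, M = f
    return min(max(y + x, m), M)

leaf = (255, -128, 127)
pair = compose_19(leaf, leaf)      # (510, 127, 127): a constant function
f1 = (0, -128, 127)

wide = compose_16(f1, pair)        # lower bound 382 does not fit in 8 bits
narrow = compose_19(f1, pair)      # lower bound clipped to 127

print(wide, narrow)                # (510, 382, 127) (510, 127, 127)
same = all(apply_sa(wide, y) == apply_sa(narrow, y) for y in range(-1000, 1001))
print(same)                        # True
```

Both tuples describe the same constant function; only the representation of the lower bound changes.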
3.6 Wordsize of Intermediate x0
In this section, we show that we need only use a $(W + 2)$-bit
data path to compute $x'$ (15). Whenever $x'$ overflows a
$(W + 2)$-bit data path, its value is ignored, because a
constant function is represented (i.e., $\mathit{minval}' = \mathit{maxval}'$).
To bound all $x'$ that occur for nonconstant functions, we
make one observation and one assumption:
1. (observation) There is one $(\mathit{minval}, \mathit{maxval})$ for all $i$
such that
$$\mathit{minval}[i] \ge \mathit{minval} \quad \text{and} \quad \mathit{maxval}[i] \le \mathit{maxval}. \tag{20}$$
This was demonstrated at the end of the previous
section (Section 3.5).
2. (assumption) For all original $x[i]$ (i.e., the inputs), we
have
$$|x[i]| \le \Delta \overset{\mathrm{def}}{=} \mathit{maxval} - \mathit{minval}.$$
This is always true for the inputs when
$$\mathit{minval} \le x[i] \le \mathit{maxval}.$$
We use the broader interval $2\Delta$ to deal with
intermediate values of $x'$.
We now show, for any $x[i-k, i]$ in the multilevel
computation, if $|x[i-k, i]| > 2\Delta$, then $\mathit{minval}[i-k, i] =
\mathit{maxval}[i-k, i]$.
For a contradiction, assume that some $S \overset{\mathrm{def}}{=} \mathrm{SA}[i-k, i]$ is
not a constant function when $|x_S| > 2\Delta$. Consider points $y$
and $y'$ such that $S(y) \ne S(y')$.
From the form of $S$, we know that it only takes on
values in the interval $[\mathit{minval}_S, \mathit{maxval}_S]$. If $S(y)$ or $S(y')$
are endpoints of this nonempty interval, we can interpolate
(extending to real numbers) and find new $y$, $y'$, so
that, without loss of generality, $y$ and $y'$ are both in the
region of the domain of $S$, where $S$ has slope 1.
Interpolation is a technicality only needed to handle the
case where $\mathit{minval}_S + 1 = \mathit{maxval}_S$, such that there are not
two, distinct integer values for $y$ and $y'$ which are in the
slope 1 region.
Since $S$ locally has slope 1 around $y$ (and $y'$), the
clipping feature in $S$ must not be active around $y$. This
means that $y$ (and $y'$) are in the interval
$[\mathit{minval}_S - x_S, \mathit{maxval}_S - x_S]$, which is contained in the interval
$[\mathit{minval} - x_S, \mathit{maxval} - x_S]$ (observation 1).
Since $|x_S| > 2\Delta$, we deduce that $y$ and $y'$ are outside of
the interval $[\mathit{minval} - \Delta, \mathit{maxval} + \Delta]$ since
$$\mathit{maxval}_S - 2\Delta \le \mathit{maxval} - 2\Delta = \mathit{minval} - \Delta$$
or
$$\mathit{minval}_S - (-2\Delta) \ge \mathit{minval} - (-2\Delta) = \mathit{maxval} + \Delta.$$
By interpolation, we can always choose distinct $y$ and $y'$ so
that they do not straddle this interval. Now, consider what
happens when the first input in the sequence $x_{i-k} \ldots x_i$ is
applied to such a value. Using Assumption 2, we see that
$y + x[i-k]$ and $y' + x[i-k]$ are to one side of the interval $[\mathit{minval}, \mathit{maxval}]$.
Therefore, $\mathrm{SA}[i-k]$ must take $y$ and $y'$ to the same value,
and therefore, $\mathrm{SA}[i-k, i]$ also has this property: i.e.,
$S(y) = S(y')$, a contradiction.
How many bits do we need to represent intermediate $x'$?
If we assume the accumulator is a $W$-bit signed 2's
complement value, then
$$\begin{aligned}
\mathit{maxval} &\le 2^{(W-1)} - 1, \\
\mathit{minval} &\ge -2^{(W-1)}, \\
\Delta &\le \left(2^{(W-1)} - 1\right) - \left(-2^{(W-1)}\right) = 2^W - 1.
\end{aligned}$$
We care about an $x'$ only if $|x'| \le 2\Delta < 2^{W+1} - 1$. Hence, we
can simply add the "x's" in $(W + 2)$-bit 2's complement
arithmetic (at all levels of the computation), and if there is
an overflow then we do not care about the result.
The $2\Delta$ and $(W + 2)$-bit bounds are tight: the computation
can really have representations of nonconstant functions that
use all $W + 2$ bits. For example, suppose $W = 8$, with
$\mathit{minval} = -128$ and $\mathit{maxval} = 127$. Suppose $x_0 = x_1 = -254$.
The function $\mathrm{SA}[0, 1]$ is not constant, as $\mathrm{SA}[0, 1](380) = -128$
while $\mathrm{SA}[0, 1](381) = -127$, yet $x[0, 1] = -508$ requires 10 bits
to represent. One might observe that in this case the
function is in fact constant because the accumulator never
starts at those values. However, this does not imply that
$\mathit{minval} = \mathit{maxval}$, and while we could add extra hardware
to make this the case, it would not be worth adding this
hardware just in order to save one bit. Finally, restricting
the inputs to a smaller bound than $\Delta$ is helpful only in small
trees, as increments up to $\Delta$ can be achieved through a
number of small increments.
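The word-size argument can be exercised in software by deliberately wrapping $x'$ to $W + 2$ bits and comparing against a serial saturated accumulation. The sketch below is an experiment under stated assumptions, not a proof: it uses $W = 8$, inputs drawn uniformly from $[-\Delta, \Delta]$, a fixed random seed, and helper names (`wrap`, `compose_narrow`, `check`) introduced only for this illustration.

```python
import random

def wrap(v, bits):
    """Keep the low `bits` bits of v, read as a two's-complement value."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v >= 1 << (bits - 1) else v

def compose_narrow(f1, f2, bits):
    """Compose per (15), (19), (17), with x' wrapped to `bits` bits."""
    x1, m1, M1 = f1
    x2, m2, M2 = f2
    return (wrap(x1 + x2, bits),
            min(max(m1 + x2, m2), M2),
            min(max(M1 + x2, m2), M2))

def apply_sa(f, y):
    x, m, M = f
    return min(max(y + x, m), M)

W = 8
MINVAL, MAXVAL = -(1 << (W - 1)), (1 << (W - 1)) - 1
DELTA = MAXVAL - MINVAL

def check(trials=200, length=16, seed=0):
    """Compare wrapped-prefix results with serial saturated accumulation."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-DELTA, DELTA) for _ in range(length)]
        y0 = rng.randint(MINVAL, MAXVAL)
        y, f = y0, None
        for x in xs:
            leaf = (x, MINVAL, MAXVAL)
            y = apply_sa(leaf, y)                       # serial reference
            f = leaf if f is None else compose_narrow(f, leaf, W + 2)
            if apply_sa(f, y0) != y:
                return False
    return True

leaf = (-254, MINVAL, MAXVAL)
two = compose_narrow(leaf, leaf, W + 2)     # nonconstant; x' = -508 still fits
three = compose_narrow(two, leaf, W + 2)    # x' wraps, but the function is constant
print(two, three, check())
# (-508, -128, -127) (262, -128, -128) True
```

The tightness example from the text appears as `two`: the composed function is not constant, and its $x'$ of $-508$ needs all 10 bits. In `three`, the wrapped $x'$ is meaningless, but the bounds coincide, so it is never consulted.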
4 PUTTING IT TOGETHER
Knowing how to compute SA½i 1; i from the parameters
for SA½i 1 and SA½i, we can unroll the computation to
match the delay through the saturated addition and create a
suitable, asymmetric parallel-prefix computation (similar to
Sections 2.5 through 2.7). From the previous section, we
know the core computation for the composer is, itself, an
unsaturated addition (15) and two saturated additions ((17)
and (19)). Using the base saturated adder shown in Fig. 10,
we build the composer as shown in Fig. 11.
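As a behavioral illustration of the composer, the sketch below represents each saturated addition as a triple and combines two of them with one plain addition and two saturated additions. It is a minimal sketch under stated assumptions: the triple layout `(x, minval, maxval)` and the names `clamp`, `apply_sa`, and `compose` are hypothetical, and the formulas are obtained by pushing the second addition through the first clamp, which is assumed to match the form of (15), (17), and (19).

```python
def clamp(v, lo, hi):
    """Saturate v into [lo, hi]."""
    return min(max(v, lo), hi)

def apply_sa(f, y):
    """Evaluate the saturated addition f = (x, minval, maxval) at y."""
    x, lo, hi = f
    return clamp(y + x, lo, hi)

def compose(first, second):
    """Triple for the function y -> second(first(y))."""
    x1, lo1, hi1 = first
    x2, lo2, hi2 = second
    return (x1 + x2,                    # unsaturated addition
            clamp(lo1 + x2, lo2, hi2),  # saturated addition on the lower bound
            clamp(hi1 + x2, lo2, hi2))  # saturated addition on the upper bound

f = compose((100, -128, 127), (-50, -128, 127))
print(f)                 # (50, -128, 77)
print(apply_sa(f, 100))  # 77, same as adding 100 (saturating at 127) and then -50
```

The composed triple has the same shape as its inputs, which is what lets the same unit be reused at every level of the tree.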
5 IMPLEMENTATION
5.1 Experiment
We implemented the parallel-prefix saturated accumulator
in VHDL and targeted a Xilinx Spartan-3 XC3S-5000-4
FPGA to demonstrate functionality and obtain performance
and area estimates. We used Modelsim 5.8 to verify the
functionality of the design and Synplify Pro 7.7 and Xilinx
ISE 6.1.02i to map our design onto the target device. We did
not provide any area constraints and let the tools
automatically place and route the design using just the
timing constraints. The DCMs on the Spartan-3 (speed
grade 4 part) support a maximum frequency of 280 MHz
(3.57-ns cycle), so we picked this maximum supported
frequency as our performance target. We report area in
Spartan-3 slices; each Spartan-3 slice contains two 4-input
Lookup Tables with fast carry logic such that each slice can
serve as two full adder bits.
5.2 Design Details
The parallel-prefix saturating accumulator consists of a
parallel-prefix computation tree with an asymmetric feed-
back input (cf. Section 2.6) sandwiched between a serial-
izer and deserializer as shown in Fig. 12. Consequently,
we decompose the design into two clock domains. The
higher frequency clock domain pushes data into the lower
frequency domain of the parallel-prefix tree. The parallel-
prefix tree runs at a proportionally slower rate to
accommodate the saturating adders shown in Figs. 10
and 11. Minimizing the delays in the tree requires us to
compute each compose in two pipeline stages. Finally, we
clock the result of the prefix computation into the higher
frequency clock domain in parallel then serially shift out
the data at the higher clock frequency.
As introduced in Section 2.6, the delay through the
composers is actually irrelevant to the correct operation of
the saturated accumulation. The composition tree adds a
uniform number of clock cycle delays between the $x[i]$ shift
register and the final saturated accumulator. It does not add
to the saturated accumulation feedback latency which the
unrolling must cover. This is why we can safely pipeline
compose stages in the parallel-prefix tree.
Data is transferred into the slower domain by serializing
it in the faster domain and allowing the slower frequency
domain to “capture” the signal synchronously on its clock
edge. This encapsulated dual frequency clocking scheme
allows the rest of the system to have a consistent interface
with this design.
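The dataflow described in this section can be modeled in software. The following Python sketch is a behavioral model only, not the VHDL design: the function names are hypothetical, the block size `n` stands in for the unroll factor, and pipelining and clock-domain crossing are abstracted away. It shows that the only value fed back between blocks is the last output, so the feed-forward prefix composition can take as long as it needs.

```python
from itertools import accumulate

def clamp(v, lo, hi):
    """Saturate v into [lo, hi]."""
    return min(max(v, lo), hi)

def compose(first, second):
    """Triple (x, lo, hi) for applying `first` and then `second`."""
    x1, lo1, hi1 = first
    x2, lo2, hi2 = second
    return (x1 + x2, clamp(lo1 + x2, lo2, hi2), clamp(hi1 + x2, lo2, hi2))

def serial_sat_accumulate(xs, minval, maxval, y0=0):
    """Reference model: one saturated addition per input, fully serial."""
    out, y = [], y0
    for x in xs:
        y = clamp(y + x, minval, maxval)
        out.append(y)
    return out

def blocked_sat_accumulate(xs, minval, maxval, n=4, y0=0):
    """Handle n inputs per slow-clock step; feedback crosses block boundaries only."""
    out, y = [], y0
    for start in range(0, len(xs), n):
        block = [(x, minval, maxval) for x in xs[start:start + n]]
        # Feed-forward part: prefixes SA[0], SA[0,1], ..., independent of y.
        prefixes = list(accumulate(block, compose))
        # Final row of saturated adders applies every prefix to the fed-back value.
        ys = [clamp(y + x, lo, hi) for (x, lo, hi) in prefixes]
        out.extend(ys)
        y = ys[-1]  # single feedback value per block
    return out

xs = [100, 100, -50, -300, 20, 5, 127, -1]
print(blocked_sat_accumulate(xs, -128, 127) == serial_sat_accumulate(xs, -128, 127))
```

Here `itertools.accumulate` computes the prefixes serially for clarity; in hardware the same prefixes would come from the tree.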
5.3 Area
We express the area required by this design as a function of $N$
(loop unroll factor) and $W$ (bitwidth). The area required for
the prefix tree is roughly $5\tfrac{2}{3}N$ times the area of a single
saturated adder. The initial reduce tree has roughly $N$
compose units, as does the final prefix tree. Each compose
unit has two $W$-bit saturated adders and one $(W+2)$-bit
regular adder. As noted, a Spartan-3 slice can support two
full-adder bits, so each adder requires roughly $W/2$ slices.
Similarly, the $W$-bit maximum and minimum muxes also
each require $W/2$ slices. Together, this gives us
$\approx 2 \times (2 \times 3 + 1) N W/2$ slices. Finally, we add a row of saturated adders
to compute the final output to get a total of $\tfrac{17}{2} N W$ slices.
Compared to the base saturated adder (i.e., Fig. 10) which
takes $\tfrac{3}{2} W$ slices, this is a factor of
$\tfrac{17N}{3} = 5\tfrac{2}{3}N$.
Pipelining levels in the parallel-prefix tree roughly costs
us $2 \times 3N$ registers (each $W$ bits wide) per level times the $2 \log_2(N)$ levels for
a total of $12 N \log_2(N) W$ registers. The pair of registers for a
pipe stage can fit in half a slice (i.e., SRL16 configuration), so
this should add no more than $6 N \log_2(N) W$ slices:
$$A(N, W) \approx 6 N \log_2(N) W + \frac{17}{2} N W. \qquad (21)$$
216 IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 2, FEBRUARY 2009
Fig. 10. Saturated adder.
Fig. 11. Composition unit for two saturated additions.
Fig. 12. $N = 4$ parallel-prefix saturating accumulator.
This approximation does not count the overhead of the
control logic in the serializer and deserializer since it is
small compared to the registers.
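To get a feel for the magnitudes in (21), the estimate can be evaluated directly. This is a sketch, not a synthesis result: the function names are hypothetical, and the formula is only the rough model derived above, which excludes control logic.

```python
import math

def area_slices(n, w):
    """Approximate Spartan-3 slice count from (21): pipeline registers plus logic."""
    register_slices = 6 * n * math.log2(n) * w
    logic_slices = 17 / 2 * n * w
    return register_slices + logic_slices

def logic_overhead_factor(n):
    """Prefix-tree logic relative to one base saturated adder (3W/2 slices)."""
    return 17 * n / 3

print(area_slices(4, 16))                  # 1312.0
print(round(logic_overhead_factor(4), 2))  # 22.67
```

Because the register term is stated as an upper bound, a mapped design may well come in below this figure.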
To pipeline down to the gate or Lookup Table level, we
must unroll to cover the delay through the base saturated
adder (Fig. 10). This delay is one $W$-bit adder delay plus a
small constant number of gate delays for output multiplexing.
If we use ripple carry adders, then we need an unroll
factor, $N$, which is $O(W)$. Substituting $N = O(W)$ into (21),
we get an area of $O(W^2 \log(W))$. If, instead, we use an efficient,
log-depth adder, we substitute $N = \log(W)$ into (21), and
we see that area scales as
$$A_{\mathrm{fullpipesatadd}}(W) = O\left(W \log(W) \log(\log(W))\right). \qquad (22)$$
If the size of the tree is $N$ and the frequency of the basic
unpipelined saturating accumulator is $f$, then the system
can run at a frequency $f \times N$. By increasing the size of the
parallel-prefix tree, we can make the design run arbitrarily
fast, up to the maximum attainable clock rate of the device.
As Section 6.1 notes, we can continue to exploit parallelism
to run even faster if the application context provides and
consumes more than one input and output per cycle. In
Table 2, we show the value of N (i.e., the size of the prefix
tree) required to achieve a 3-ns cycle target. We target this
tighter cycle time (compared to the 3.57-ns DCM limit) to
reserve some headroom going into place and route for the
larger designs. We observe that a value of $N = 4$ is adequate
to make the design run as fast as the device can support for
adder widths up to 32 bits.
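Inverting the $f \times N$ relation gives a quick first estimate of the unroll factor for a given cycle target. The sketch below is a back-of-the-envelope aid only: the function name is hypothetical, the 11-ns base period is a made-up illustration rather than a value from Table 2, and the model ignores register overhead and any preference for power-of-two tree sizes.

```python
import math

def min_unroll_factor(base_period_ns, target_period_ns):
    """Smallest N such that N fast cycles cover one base saturated-adder delay."""
    return math.ceil(base_period_ns / target_period_ns)

# Hypothetical example: an unpipelined saturated adder with an 11-ns period
# and the 3-ns cycle target used for synthesis.
print(min_unroll_factor(11.0, 3.0))  # 4
```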
5.4 Results
Table 3 shows the clock period achieved by all the
designs for $N = 4$ after place and route. We beat the
required 3.57-ns performance limit for all the cases we
considered. Since we only constrained the synthesis tools
to optimize for a 3-ns cycle, variations in cycle time
around 3 ns arise from imperfectly estimated physical
routing delays. For $N = 4$, the latency from $x[i]$ to $y[i]$ is
38 fast clock cycles (see Fig. 12) or roughly 136 ns at the
3.57-ns clock period.
In Table 3, we show the actual area in terms of the
Spartan-3 slices required to perform the mapping for
different bitwidths $W$. A 16-bit saturating accumulator
requires 1,065 slices, which constitutes around 2 percent of
the XC3S-5000. We also show that an area overhead of less
than $25\times$ is required to achieve this speedup over the
unpipelined base saturating accumulator (Fig. 10); for
$N = 4$, $5\tfrac{2}{3}N \approx 23$, so this is consistent with our intuitive
prediction above.
Balzola’s 5-input saturated adder [19] is equivalent to
our $N = 4$ unrolling in that both can take in five inputs in a
cycle. They compare their accelerated designs to a “serial”
case that actually contains a simple combinational cascade
of four saturated adders. Their fastest design uses 5.7 times
the area of the four saturated adder cascade or $4 \times 5.7 \approx 23$
times the area of the base saturated adder. With this
design, they achieve a speedup of 3.5 times the “serial”
case. As a result, their area overhead and throughput
enhancement are quite similar to ours at $W = 16$ and
$W = 32$ (see Table 3). The Balzola design has a smaller
increase in the latency from $x[i]$ to $y[i]$ than our design.
Note the following:
1. Our design spends roughly half of its area producing
intermediate $y[i]$ outputs that the Balzola design
does not produce; if we were to omit these
intermediate outputs to functionally match the
Balzola design, our area overhead would be half
the reported size.
2. Our design has better asymptotic scaling, both in
area ($O(N \log(N))$ versus $O(N^2)$) and achievable
delay (arbitrary versus constant factor speedup).
Since these designs are already reaching parity in
area overhead at this small $N$, this suggests our
design will be smaller for $N > 4$.
A factor of 25 in area is a large cost to pay. However, the
base saturated adder is tiny and usually only a small
fraction of the area in a spatial design (e.g., Fig. 1) or in a
DSP (i.e., memories often take up much more space than all
the arithmetic processing logic combined, and the dedicated
multipliers are much larger than the adders). Since the
saturated addition is only a small fraction of the design
area, the $25\times$ area expansion of this one unit may only
increase the area of the overall design by a modest amount.
When the saturated addition is the single bottleneck that
prevents the entire system from running at high throughput,
it will often make sense to pay this cost.
6 GENERALITY AND OPEN QUESTIONS
6.1 Beyond One Result per Clock
The clock period on the device is limited by the minimum
overhead time on registers (setup, hold, clock jitter) and a
minimum amount of logic between registers. For example,
the design in Fig. 12 has one mux between registers on the
fast clock domain. In CMOS, 6-8 Fanout-4 inverter delays is
considered a common lower bound on the clock period
(e.g., [22]).
Nonetheless, since the core saturated addition operations
can now be performed in parallel, we can achieve
throughputs that exceed the clock-cycle bound if it is
possible to bring in and produce multiple values in parallel.
Fig. 13 shows the generalization of Fig. 12, where $N = 8$ and
the design consumes/produces two values per cycle on the
fast clock. As with the Fig. 12 design, the slow clock has a
period four times the fast clock.
TABLE 2. Minimum Size of Prefix Tree Required to Achieve 280 MHz
TABLE 3. Accumulator Comparison
6.2 Beyond Accumulation
The techniques used here are actually quite general.
Functional composition is associative, so we can always
unroll the loop, associate each loop stage with its inputs,
and perform a parallel-prefix reduction on the loop
instances. This composition is applicable even if there
are a series of different operations in each loop body or
even different operations between loop instances. For this
to be useful, however, the composed function of multiple
loop instances must have shallower depth than the
original, serial path through the set of loop instances.
Also, it must be inexpensive to compute the composed
function; in this case, computing the arguments to the
composed function was, asymptotically, the same com-
plexity as computing the function (cf. (15), (17), and (19)
to (5) and (6)).
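The role of associativity can be illustrated with a log-depth scan over composed functions. The sketch below is not the reduce/prefix tree of Fig. 12; it deliberately swaps in a simpler Hillis-Steele style scan (cf. [14]) to show the idea, and the names `compose` and `parallel_prefix` as well as the triple representation `(x, minval, maxval)` are assumptions made for the example.

```python
def clamp(v, lo, hi):
    """Saturate v into [lo, hi]."""
    return min(max(v, lo), hi)

def compose(first, second):
    """Triple (x, lo, hi) for applying `first` and then `second`."""
    x1, lo1, hi1 = first
    x2, lo2, hi2 = second
    return (x1 + x2, clamp(lo1 + x2, lo2, hi2), clamp(hi1 + x2, lo2, hi2))

def parallel_prefix(items, op):
    """Inclusive scan in ceil(log2(n)) rounds; op must be associative, not commutative."""
    vals = list(items)
    step, rounds = 1, 0
    while step < len(vals):
        # All positions update independently within a round, so a round is one
        # level of parallel hardware. The earlier partial result goes on the left.
        vals = [vals[i] if i < step else op(vals[i - step], vals[i])
                for i in range(len(vals))]
        step *= 2
        rounds += 1
    return vals, rounds

MINVAL, MAXVAL = -128, 127
xs = [100, 100, -50, -300, 20, 5, 127, -1]
prefixes, rounds = parallel_prefix([(x, MINVAL, MAXVAL) for x in xs], compose)
ys = [clamp(0 + x, lo, hi) for (x, lo, hi) in prefixes]  # apply each prefix to y = 0
print(rounds)  # 3
print(ys)      # [100, 127, 77, -128, -108, -103, 24, 23]
```

The scan never needs to know what the composed functions do; it relies only on `compose` being associative and cheap, which are exactly the two conditions discussed above.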
In formulating the associativity of saturated accumulation,
we worked with the composition of the functions max,
min, and addition where one input to each function was
early bound—i.e., bound outside of the loop. As a result of
the early-bound inputs, the computation was a single chain
of dependent computations, and we showed how to use
parallel prefix to compute the outputs of the chain with low
latency. Multiplication with one early-bound input can be
added to this group. More generally, we can compute
efficient functional compositions on any, potentially hetero-
geneous, chain composed from this extended function
group.
We can also perform this prefix optimization on this
function group even if intermediate results are forwarded
to multiple functions. With multiple use of intermediates,
we do not strictly have a single chain but rather a tree. In
these cases, we can still extract the chain producing each
output and perform a parallel prefix on each chain. This
may require that we duplicate the chain prefixes which feed
into multiple tree branches.
An important open question for future research is to
generally characterize the class of functions that have this
kind of lightweight composition. That is, more generally,
for a given composition of functions:
• How expensive is it to compute the composed
function?
• How much shorter is the path through the composed
function than the sum of the paths through the
original functions?
These associative transformations can be powerful options
for exploiting area-time tradeoffs. High-level design
automation tools can exploit them for optimizing perfor-
mance. With a sufficiently broad set, it may be possible to
integrate these into a superscalar processor design, allowing
the processor to issue a set of dependent instructions and
reduce them associatively.
7 SUMMARY
Saturated accumulation has a loop dependency that,
naively, limits single-stream throughput and our ability to
fully exploit the computational capacity of modern inte-
grated circuits, particularly as clock rate scaling slows and
future performance improvements depend more on ex-
ploiting the increased area capacity to improve throughput.
We show that this loop dependency is actually avoidable by
reformulating the saturated addition as the composition of a
series of functions. We further show that this particular
function composition is, asymptotically, no more complex
than the original saturated addition operation. Function
composition is associative, so this reformulation allows us
to build a parallel-prefix tree in order to compute the
saturated accumulation over several loop iterations in
parallel. Consequently, we can unroll the saturated accu-
mulation loop to cover the delay through the saturated
adder. As a result, we show how to compute saturated
accumulation at any data rate supported by the device’s
clocking and I/O.
ACKNOWLEDGMENTS
This research was funded in part by the US National Science
Foundation under Grant CCR-0205471. Stephanie Chan was
supported by the Marcella Bonsall SURF Fellowship. Karl
Papadantonakis was supported by a Moore Fellowship. Scott
Weber and Eylon Caspi developed early FPGA implementa-
tions of ADPCM which helped identify this challenge.
Michael Wrighton provided VHDL coding and CAD tool
usage tips.
Fig. 13. $N = 8$ parallel-prefix saturating accumulation with two inputs
and two outputs per cycle ($f_{\mathrm{fast}} = 4 f_{\mathrm{slow}}$).
REFERENCES
[1] V. Agarwal, M.S. Hrishikesh, S.W. Keckler, and D. Burger, “Clock
Rate versus IPC: The End of the Road for Conventional
Microarchitectures,” Proc. 27th Int’l Symp. Computer Architecture
(ISCA ’00), pp. 248-259, 2000.
[2] D. Chinnery and K. Keutzer, Closing the Gap between ASIC &
Custom: Tools and Techniques for High-Performance ASIC Design.
Kluwer Academic Publishers, 2002.
[3] W. Tsu, K. Macy, A. Joshi, R. Huang, N. Walker, T. Tung,
O. Rowhani, V. George, J. Wawrzynek, and A. DeHon, “HSRA:
High-Speed, Hierarchical Synchronous Reconfigurable Array,”
Proc. Int’l Symp. Field-Programmable Gate Arrays (FPGA ’99),
pp. 125-134, Feb. 1999.
[4] D.P. Singh and S.D. Brown, “The Case for Registered Routing
Switches in Field Programmable Gate Arrays,” Proc. Int’l Symp.
Field-Programmable Gate Arrays (FPGA ’01), pp. 161-169, Feb. 2001.
[5] C. Leiserson, F. Rose, and J. Saxe, “Optimizing Synchronous
Circuitry by Retiming,” Proc. Third Caltech Conf. VLSI, Mar. 1983.
[6] N. Weaver, Y. Markovskiy, Y. Patel, and J. Wawrzynek, “Post-
Placement C-Slow Retiming for the Xilinx Virtex FPGA,” Proc.
Int’l Symp. Field-Programmable Gate Arrays (FPGA ’03), pp. 185-194,
2003.
[7] B. Smith, “Architecture and Applications of the HEP Multi-
processor Computer System,” Proc. Fourth Symp. Real-Time Signal
Processing, pp. 241-248, 1981.
[8] D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, and
R.L. Stamm, “Exploiting Choice: Instruction Fetch and Issue on an
Implementable Simultaneous Multithreading Processor,” Proc.
23rd Int’l Symp. Computer Architecture (ISCA ’96), pp. 191-202, 1996.
[9] Z. Luo and M. Martonosi, “Accelerating Pipelined Integer and
Floating-Point Accumulations in Configurable Hardware with
Delayed Addition Techniques,” IEEE Trans. Computers, vol. 49,
no. 3, pp. 208-218, Mar. 2000.
[10] Xilinx Spartan-3 FPGA Family Data Sheet, Xilinx, Inc., DS099,
http://direct.xilinx.com/bvdocs/publications/ds099.pdf, Dec. 2004.
[11] K. Papadantonakis, N. Kapre, S. Chan, and A. DeHon, “Pipelining
Saturated Accumulation,” Proc. IEEE Int’l Conf. Field-Programmable
Technology (FPT ’05), pp. 19-26, Dec. 2005.
[12] C. Lee, M. Potkonjak, and W.H. Mangione-Smith, “MediaBench: A
Tool for Evaluating and Synthesizing Multimedia and Commu-
nications Systems,” Proc. 30th Ann. Int’l Symp. Microarchitecture
(MICRO ’97), pp. 330-335, 1997.
[13] R. Barua, W. Lee, S. Amarasinghe, and A. Agarwal, “Maps: A
Compiler-Managed Memory System for Raw Machines,” Proc.
26th Int’l Symp. Computer Architecture (ISCA ’99), pp. 4-15, 1999.
[14] W.D. Hillis and G.L. Steele, “Data Parallel Algorithms,” Comm.
ACM, vol. 29, no. 12, pp. 1170-1183, Dec. 1986.
[15] R.P. Brent and H.T. Kung, “A Regular Layout for Parallel
Adders,” IEEE Trans. Computers, vol. 31, no. 3, pp. 260-264,
Mar. 1982.
[16] F.T. Leighton, Introduction to Parallel Algorithms and Architectures:
Arrays, Trees, Hypercubes. Morgan Kaufmann, 1992.
[17] B.D. de Dinechin, C. Monat, and F. Rastello, “Parallel Execution of
the Saturated Reductions,” Proc. IEEE Workshop Signal Processing
Systems (SiPS ’01), pp. 373-384, 2001.
[18] M. Schulte, P. Balzola, J. Ruan, and J. Glossner, “Parallel
Saturating Multioperand Adders,” Proc. Int’l Conference on
Compilers, Architecture, and Synthesis for Embedded Systems
(CASES ’00), pp. 172-179, 2000.
[19] P.I. Balzola, M.J. Schulte, J. Ruan, J. Glossner, and E. Hokenek,
“Design Alternatives for Parallel Saturating Multioperand Ad-
ders,” Proc. Int’l Conf. Computer Design (ICCD ’01), pp. 172-177,
Sept. 2001.
[20] J.H. Hubbard and B.B.H. Hubbard, Vector Calculus, Linear Algebra,
and Differential Forms: A Unified Approach. Prentice Hall, 1999.
[21] S. Winograd, “On the Time Required to Perform Addition,”
J. ACM, vol. 12, no. 2, pp. 277-285, Apr. 1965.
[22] M. Hrishikesh, N.P. Jouppi, K.I. Farkas, D. Burger, S.W. Keckler,
and P. Shivakumar, “The Optimal Logic Depth Per Pipeline Stage
Is 6 to 8 FO4 Inverter Delays,” Proc. 29th Int’l Symp. Computer
Architecture (ISCA ’02), pp. 14-24, 2002.
Karl Papadantonakis received the BA degree
in mathematics from Cornell University in 1996
and the SM and PhD degrees from the
California Institute of Technology in 2002 and
2006, respectively. Since 2006, he has been
designing and integrating high-performance
VLSI components for the 10-gigabit Ethernet
switch and network interface products at
Myricom, Inc., Arcadia, California. His research
interests include providing general solutions to
common problems that have multiple constraints such as performance,
power, and cost.
Nachiket Kapre received the BE degree in
electronics and telecommunication engineering
from the Government College of Engineering,
Pune, India, in 2002 and the SM degrees in
electrical engineering and computer science
from the California Institute of Technology,
Pasadena, in 2005 and 2006, respectively.
Since 2006, he has been a visiting graduate
student at the University of Pennsylvania. He is
currently working toward the PhD degree at the
California Institute of Technology. His research interests include finding
ways to effectively solve large problems on spatial computing
architectures. He is a student member of the IEEE and the IEEE
Computer Society.
Stephanie Chan received the BS degree in
applied and computational mathematics from
the California Institute of Technology in 2006.
Since 2006, she has been a research scientist at
the Numerica Corp., Fort Collins, Colorado,
where she designs algorithms for tracking
systems. Her research interests include exploit-
ing mathematical patterns to solve real-world
problems.
André DeHon received the SB, SM, and PhD
degrees in electrical engineering and computer
science from the Massachusetts Institute of
Technology in 1990, 1993, and 1996, respec-
tively. From 1996 to 1999, he co-ran the BRASS
group in the Department of Computer Science,
University of California, Berkeley. From 1999 to
2006, he was an assistant professor of computer
science at the California Institute of Technology.
Since 2006, he has been an associate professor
in the Department of Electrical and Systems Engineering, University of
Pennsylvania, Philadelphia. His research interests include how to
physically implement computations from substrates, including VLSI
and molecular electronics, up through architecture, CAD, and program-
ming models. He places special emphasis on spatial programmable
architectures (e.g., FPGAs) and interconnect design and optimization.
He is a member of the IEEE and the IEEE Computer Society.
