Abstract. The main contribution of this work is to propose a number of broadcast-efficient VLSI architectures for computing the sum and the prefix sums of a $w^k$-bit, $k \ge 2$, binary sequence using, as basic building blocks, linear arrays of at most $w^2$ shift switches. An immediate consequence of this feature is that in our designs broadcasts are limited to buses of length at most $w^2$, making them eminently practical. Using our design, the sum of a $w^k$-bit binary sequence can be obtained in the time of $2k-2$ broadcasts, using $2w^{k-2} + O(w^{k-3})$ blocks, while the corresponding prefix sums can be computed in $3k-4$ broadcasts using $(k+2)w^{k-2} + O(kw^{k-3})$ blocks.
Introduction
Recent advances in VLSI have made it possible to implement algorithm-structured chips as building blocks for high-performance computing systems. Since computing binary prefix sums (BPS) is a fundamental computing problem [1, 4], it makes sense to endow general-purpose computer systems with a special-purpose parallel BPS device, invoked whenever its services are needed. Recently, Blelloch [1] argued convincingly that scan operations, which boil down to parallel prefix computation, should be considered primitive parallel operations and, whenever possible, implemented in hardware. In fact, scans have been implemented in microcode in the Thinking Machines CM-5 [10].
Given an $n$-bit binary sequence $A = a_1, a_2, \ldots, a_n$, it is customary to refer to $p_j = a_1 + a_2 + \cdots + a_j$ as the $j$-th prefix sum. The BPS problem is to compute all the prefix sums $p_1, p_2, \ldots, p_n$ of $A$. In this article, we address the problem of designing efficient and scalable hardware-algorithms for the BPS problem. Our central theme is the recognition that the delay incurred by a signal propagating along a medium is, at best, linear in the distance traversed. Thus, our main design criterion is to keep buses as short as possible. In this context, the main contribution of this work is to show that we can use short buses in conjunction with the shift switching technique introduced recently by Lin and Olariu [5] to design a scalable hardware-algorithm for BPS. As a byproduct, we also obtain a scalable hardware-algorithm for a parallel counter, that is, for computing the sum of a binary sequence (BS).
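For reference, the BS and BPS problems themselves can be stated in a few lines of sequential software. The sketch below is only a specification of the expected outputs, not of the hardware-algorithm; the function names `binary_prefix_sums` and `binary_sum` are hypothetical.

```python
from itertools import accumulate


def binary_prefix_sums(bits):
    """Return [p_1, ..., p_n], where p_j = a_1 + a_2 + ... + a_j."""
    if any(b not in (0, 1) for b in bits):
        raise ValueError("input must be a binary sequence")
    return list(accumulate(bits))


def binary_sum(bits):
    """Return a_1 + a_2 + ... + a_n, i.e. the last prefix sum (0 if empty)."""
    prefix = binary_prefix_sums(bits)
    return prefix[-1] if prefix else 0


example = [1, 0, 1, 1, 0, 1]
print(binary_prefix_sums(example))  # [1, 1, 2, 3, 3, 4]
print(binary_sum(example))          # 4
```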
A number of algorithms for solving the BS and BPS problems on the reconfigurable mesh have been presented in the literature. For example, Olariu et al. [8] showed that an instance of size $m$ of the BPS problem can be solved in $O(\frac{\log n}{\log m})$ time on a reconfigurable mesh with $2n \times (m+1)$, $(1 \le m \le n)$, processors. Later, Nakano [6, 7] presented $O(\frac{\log n}{\sqrt{m \log m}})$ time algorithms for the BS and BPS problems on a reconfigurable mesh with $n \times m$ processors. These algorithms exploit the intrinsic power of reconfiguration to achieve fast processing speed. Unfortunately, it has been argued recently that the reconfigurable mesh is unrealistic, as the broadcasts involved occur on buses spanning an arbitrary number of switches.
Our main contribution is to develop scalable hardware-algorithms for BS and BPS computation under the realistic assumption that blocks of size no larger than $w \times w^2$ are available. We will evaluate the performance of the hardware-algorithms in terms of the number of broadcasts executed on the blocks of size $w \times w^2$. Our designs compute the sum and the prefix sums of $w^k$ bits in $2k-2$ and $3k-4$ broadcasts, respectively, for any $k \ge 2$. Using pipelining, one can devise a hardware-algorithm to compute the sum of a $kw^k$-bit binary sequence in $3k + \lceil \log_w k \rceil - 3$ broadcasts using blocks. Due to the stringent page limitation, we omit the description of the pipelining-scaled architecture.
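To make these broadcast counts concrete, the following sketch evaluates the three formulas quoted above for given $w$ and $k$. It is plain arithmetic, not a model of the circuits; the function names are hypothetical, and $\lceil \log_w k \rceil$ is computed with integers to avoid floating-point rounding.

```python
def sum_broadcasts(k):
    """Broadcasts to sum w**k bits: 2k - 2, for k >= 2."""
    if k < 2:
        raise ValueError("the bound is stated for k >= 2")
    return 2 * k - 2


def prefix_sum_broadcasts(k):
    """Broadcasts to compute all prefix sums of w**k bits: 3k - 4, for k >= 2."""
    if k < 2:
        raise ValueError("the bound is stated for k >= 2")
    return 3 * k - 4


def pipelined_sum_broadcasts(w, k):
    """Broadcasts to sum k * w**k bits with pipelining: 3k + ceil(log_w k) - 3."""
    if w < 2 or k < 2:
        raise ValueError("expected w >= 2 and k >= 2")
    # ceil(log_w k) is the smallest exponent e with w**e >= k.
    exponent, power = 0, 1
    while power < k:
        power *= w
        exponent += 1
    return 3 * k + exponent - 3


for k in (2, 3, 5):
    print(k, sum_broadcasts(k), prefix_sum_broadcasts(k),
          pipelined_sum_broadcasts(4, k))
```

Note that the first two counts depend on $k$ only, while $w$ enters the pipelined bound through the logarithm.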
Linear arrays of shift switches
An integer $z$ will be represented positionally as $x_n x_{n-1} \cdots x_2 x_1$ in various ways as described below. Specifically, for every $i$, $(1 \le i \le n)$, Binary:
Unary: 5 mod w; Unary base $w$: Exactly like the base $w$ representation except that $x_i$ is represented in unary form. The task of converting back and forth between these representations is straightforward and can be readily implemented in VLSI [2, 3, 9]. Fix a positive integer $w$, typically a small power of 2. At the heart of our designs lies a simple device that we call the $S(w)$. Referring to Figure 1 (1)
We implement the $S(w)$ using the shift switch concept proposed in [5]. As illustrated in Figure 1 (2) and that
In the remainder of this work, when it comes to performing the sum or the prefix sums of a binary sequence $a_1, a_2, \ldots, a_m$, we will supply this sequence as the bit input of a suitable linear array $S(w, m)$ and will insist that $b = 0$, or, equivalently, (4) and, similarly,
Computing sums on short buses
Our design relies on a new block $U(w, w^k)$, $(k \ge 1)$, whose input is a $w^k$-bit binary sequence $a_1, a_2, \ldots, a_{w^k}$ and whose output is the unary base $w$ representation $A_{k+1} A_k \cdots A_1$ of the sum $a_1 + a_2 + \cdots + a_{w^k}$. We note that $A_{k+1}$ is 1 only if $a_1 = a_2 = \cdots = a_{w^k} = 1$. Recall that for every $i$, $(1 \le i \le k)$, $0 \le A_i \le w-1$.
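As a plain software illustration of the output format of $U(w, w^k)$, the sketch below computes the base-$w$ digits $A_{k+1} A_k \cdots A_1$ of the sum of $w^k$ bits. The function name `sum_digits_base_w` is hypothetical, and each digit is kept as an ordinary integer rather than in the unary encoding used by the hardware.

```python
def sum_digits_base_w(bits, w, k):
    """Return [A_1, ..., A_{k+1}] (least significant digit first) of sum(bits)."""
    if w < 2 or len(bits) != w ** k:
        raise ValueError("expected exactly w**k input bits with w >= 2")
    total = sum(bits)
    digits = []
    for _ in range(k + 1):
        digits.append(total % w)  # next base-w digit
        total //= w
    return digits


# With w = 4 and k = 2 there are 16 input bits.
print(sum_digits_base_w([1] * 16, 4, 2))         # [0, 0, 1]: A_3 = 1, all bits set
print(sum_digits_base_w([1] * 15 + [0], 4, 2))   # [3, 3, 0]: 15 = 3*4 + 3
```

The two calls show the remark above in action: the top digit $A_{k+1}$ becomes 1 only when every input bit is 1, and all lower digits stay within $0, \ldots, w-1$.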
The block $U(w, w^k)$ is defined recursively as follows. $U(w, w^1)$ is just a $T(w, w)$.
$U(w, w^2)$ is implemented using blocks $T(w, w)$ and $T(w, w^2)$: The $w^2$ inputs are supplied to $T(w, w^2)$. $T(w, w^2)$ outputs the LSD (Least Significant Digit) of the sum in unary base $w$ representation, and the MSD (Most Significant Digit) of the sum in distributed representation. $T(w, w)$ is used to convert the distributed representation into the unary base $w$ representation of the MSD of the sum.
Let $k \ge 3$ and assume that the block $U(w, w^{k-1})$ has already been constructed. We now detail the construction of the block $U(w, w^k)$ using $w$ blocks $U(w, w^{k-1})$. The reader is referred to Figure 3 for an illustration of this construction for $w = 4$ and $k = 5$. Similarly, $A_{k+1} = D_k$. The equations above confirm that the design in Figure 3 computes the unary base $w$ representation of the sum $a_1 + a_2 + \cdots + a_{w^k}$. Specifically, the rightmost $T(w, w^2)$ computes $A_1$. It receives the unary representation of 0 (i.e. consist of 1 bit each and since $D_{k-1}$ involves $w$ bits, a block $T(w, 2w)$ is used to compute $A_k$. In this figure, the circles indicate circuitry that converts from unary to filled unary representation. At this time it is appropriate to estimate the number of broadcasts involved in the above computation.
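Setting the broadcast count aside, the recursive construction of $U(w, w^k)$ from $w$ blocks $U(w, w^{k-1})$ can be mimicked in software. The sketch below is only an analogue under stated assumptions: digits are ordinary integers rather than unary or distributed encodings, the blocks $T(\cdot,\cdot)$ are replaced by integer addition with carry, and the name `recursive_sum_digits` is hypothetical.

```python
def recursive_sum_digits(bits, w):
    """Base-w digits [A_1, ..., A_{k+1}] (LSD first) of sum(bits), len(bits) == w**k."""
    if w < 2:
        raise ValueError("w must be at least 2")
    n = len(bits)
    if n == w:
        # Base case (k = 1): count the ones directly.
        ones = sum(bits)
        return [ones % w, ones // w]
    if n < w or n % w != 0:
        raise ValueError("input length must be a power of w")
    chunk = n // w
    # Each of the w sub-blocks handles w**(k-1) consecutive inputs.
    children = [recursive_sum_digits(bits[j * chunk:(j + 1) * chunk], w)
                for j in range(w)]
    digits, carry = [], 0
    # Add the children's digits column by column, propagating the carry.
    for column in zip(*children):
        total = sum(column) + carry
        digits.append(total % w)
        carry = total // w
    digits.append(carry)  # the new most significant digit A_{k+1}
    return digits


bits = [1 if i % 3 == 0 else 0 for i in range(27)]  # w = 3, k = 3, nine ones
print(recursive_sum_digits(bits, 3))  # [0, 0, 1, 0], i.e. 9 in base 3
```

Each column adds $w$ digits of value at most $w-1$ plus a carry of at most $w-1$, so the outgoing carry never exceeds $w-1$ and every digit stays in range.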
Computing prefix sums on short buses
The main goal of this section is to propose an efficient hardware-algorithm for BPS computation on short buses.
Observe that each block $U(w, w^k)$ has, essentially, the structure of a $w$-ary tree as illustrated in Figure 4. The tree has nodes at $k-1$ levels, from 2 to $k$. For an arbitrary node $v$ at level $h$ in this $w$-ary tree, let $a_{s+1}, a_{s+2}, \ldots, a_{s+w^h}$ be the input offered at the leaves of the subtree rooted at $v$ and refer to Hence, by computing $p(v), t(v_1), t(v_2), \ldots, t(v_{w-1})$, we obtain the prefix sum of each of the children of $v$. This computation proceeds from the root down to the leaves. Once this is done, by computing the local prefix sums at the leaves, we can get the resulting prefix sums of the input sequence. We now show how this idea can be implemented efficiently. We begin by using a block $U(w, w^k)$ to compute the unary base $w$ representation of $t(v)$ for every node $v$ in the $w$-ary tree. For this purpose, let $P_{k+1} P_k P_{k-1} \cdots$ are always 0 and will be ignored. Our design will have, therefore, to solve the following problem:
Input: The prefix sum $p(v) = P_k P_{k-1} \cdots P_1$ of $v$ and the sum $t(v_j) = A$ Note that we do not have to compute $p(v_1) = P$
It is easy to confirm that, once this task is completed, the same task can be performed for the children $v_1, v_2, \ldots, v_w$ in order to compute the prefix sums of the grandchildren of $v$. By continuing in the same fashion until the leaves are reached, we get the final prefix sums. Figure 6 illustrates this task for $w = 4$, $h = 5$, and $k = 6$.
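The top-down pass described above can be sketched in software as follows. This is a minimal sequential analogue, not the circuit: it takes $p(v)$ to be the sum of the inputs to the left of the subtree rooted at $v$ (an assumption, chosen because it makes $p(v_1) = p(v)$), it recomputes each $t(v_j)$ with a plain sum instead of reusing the output of $U(w, w^k)$, and the name `tree_prefix_sums` is hypothetical.

```python
from itertools import accumulate


def tree_prefix_sums(bits, w):
    """Prefix sums of bits, len(bits) == w**k with k >= 2, via a w-ary tree."""
    if w < 2:
        raise ValueError("w must be at least 2")
    n = len(bits)
    size = w * w  # lowest tree level (level 2) covers w**2 inputs
    while size < n:
        size *= w
    if size != n:
        raise ValueError("input length must be w**k with k >= 2")

    def descend(lo, span, p):
        # p is the assumed p(v): the sum of all inputs to the left of index lo.
        if span == w * w:
            # Level-2 node: local prefix sums, offset by p.
            return [p + s for s in accumulate(bits[lo:lo + span])]
        child = span // w
        out, p_child = [], p  # p(v_1) = p(v), so nothing to compute for v_1
        for j in range(w):
            start = lo + j * child
            out.extend(descend(start, child, p_child))
            if j < w - 1:
                # p(v_{j+1}) = p(v_j) + t(v_j); t(v_w) is never needed.
                p_child += sum(bits[start:start + child])
        return out

    return descend(0, n, 0)


print(tree_prefix_sums([1, 0, 1, 1, 0, 1, 1, 1], 2))  # [1, 1, 2, 3, 3, 4, 5, 6]
```

Only $t(v_1), \ldots, t(v_{w-1})$ are consumed at each node, matching the observation that the sum of the last child is not required to obtain the children's prefix sums.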
Fig. 7. Illustrating the details of the prefix sums design of Figure 6
Let us now estimate the number of broadcasts used by our design. Each child $v_j$ of the root $r$ of the $w$-ary tree corresponding to the $U(w, w^k)$ outputs its unary Thus, all the prefix sums can be computed in $3k-4$ broadcasts. Next, we estimate the number of blocks used. Since each node uses $k$ blocks, and the tree has $w^{k-2} + w^{k-3} +$
