Abstract
Introduction
A great deal of interest exists in the design of configurable digital systems [Vill98]. Configurability is desirable for:
Most implementations of configurable digital systems rely on off-the-shelf FPGA technology, in view of the ready availability of such hardware components along with associated design tools [Mang97], [Vill98], [Ratb99]. However, the bit-level computational capabilities of FPGAs make the synthesis of arithmetic-intensive functions quite complex. Even though some of the performance lost in fitting the computation to a general-purpose structure is regained by using clever algorithms or customized word widths [Klot96], [Lm00], [Shir95], [Tiou99], performance does remain far below that of custom hardware.
Case studies indicate that even small, short-word digital filters, for example, use up hundreds of logic blocks in an FPGA [Vill98]; such designs utilize neither the full flexible functionality of the logic blocks nor the extensive interconnect resources that link these blocks. Most often, the FPGA resources are configured into regular arrays of full-adders. The flip side of this coin is that systems with "chunky" function units (ALUs, multipliers, etc.) are more rigid, and it is hard to decide what functionalities to include in the chunky units [Mang97]. At the extreme, logic blocks might be replaced by "processors"; but then, even more speed is sacrificed to gain the added flexibility.
In this paper, we propose and evaluate particular types of cells that can be used in FPGA-like structures to facilitate the synthesis of arithmetic-intensive computations. Our thesis is that such structures are limited by their I/O pins and intercell connectivity. The cells themselves can be made quite complex, even with present-day technology (transistors are cheap; pins and wires are expensive). Such complex cells need not lead to performance degradation or excessive power dissipation when only very simple functionalities are required, provided that the unused sections of a cell do not form part of the critical signal path and are not clocked in idle mode.
Our scheme involves digit-serial arithmetic in radix 2, using either two's-complement encoding or redundant representation with the digit set {-1, 0, 1}, or in radix 4. Extension to higher radices is possible but not treated here.
B. The principles of operation and detailed design of the ASMM shown in Fig. 1
With a two-bit encoding of BSD digits, whereby digit values -1, 0, and 1 are encoded as (1, 0), (0, 0), and (0, 1), respectively, conversion from binary to BSD becomes trivial: a bit b is encoded as (0, b), with the sign bit of a 2's-complement number encoded as (b, 0).
Conversion from BSD to binary, though more complicated, is also straightforward and can be done on the fly [Erce87].
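The two conversions just described can be illustrated with a minimal Python sketch. It assumes most-significant-digit-first operation and models the two conversion registers as unbounded Python integers rather than fixed-width hardware registers; the function names and the register names `q` and `qm` are hypothetical and do not come from the paper. The on-the-fly update rule used here is the standard radix-2 one, in which `qm` always holds `q - 1`, so that a digit of -1 never requires borrow propagation.

```python
def twos_complement_to_bsd(bits):
    """Encode an MSB-first two's-complement bit list as BSD digit pairs.

    The sign bit b carries negative weight, so it becomes (b, 0);
    every other bit b becomes (0, b).
    """
    sign, rest = bits[0], bits[1:]
    return [(sign, 0)] + [(0, b) for b in rest]


def bsd_value(digits):
    """Reference value of an MSD-first BSD number; pair (n, p) is digit p - n."""
    value = 0
    for n, p in digits:
        value = 2 * value + (p - n)
    return value


def bsd_to_binary_on_the_fly(digits):
    """Convert MSD-first BSD digits using two registers and transfers only.

    Invariant: qm == q - 1. Each step appends one bit to a copy of q or qm,
    so no carry or borrow ever ripples through the stored result.
    """
    q, qm = 0, -1
    for n, p in digits:
        d = p - n
        if d == 1:
            q, qm = 2 * q + 1, 2 * q       # append 1 to Q; QM gets (Q, 0)
        elif d == 0:
            q, qm = 2 * q, 2 * qm + 1      # append 0 to Q; QM gets (QM, 1)
        else:
            q, qm = 2 * qm + 1, 2 * qm     # Q gets (QM, 1); QM gets (QM, 0)
    return q


word = [1, 0, 1, 1]                        # two's-complement pattern for -5
digits = twos_complement_to_bsd(word)
print(digits, bsd_to_binary_on_the_fly(digits))
```

In hardware, each branch corresponds to a shift of one register combined with a controlled copy from the other, which is why only two registers are needed.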
Because some form of I/O or boundary cell must be provided in any case, the conversion hardware, consisting of two registers and controlled transfers between them, does not lead to undue complexity. With reference to 
Data-Driven Control Scheme
The cells described in Section 2 must be programmable in two distinct but complementary ways. First, the cell function must be dynamically alterable. Possible functions include additive multiplication (AM) with prestored coefficient(s) w (and U), AM with variable coefficient(s) as input(s), and simple addition. Simple multiplication is the same as AM with the additive input y set to 0. Note that simple addition cannot be said to be equivalent to AM with one multiplicative input set to 0, because the latencies are different for the two operations (see Table I).
Many arithmetic-intensive computations of interest in signal processing can be formulated as dependence graphs with AM nodes, augmented with simple nodes that basically transfer or redirect (change the flow direction of) the input data [KwaiW]. Before showing how our additive multiply cells can be modified to handle such operations, let us point to the second aspect of programmability: accommodating varying word widths. In practice, word width changes are much less frequent than coefficient value adjustments, so this type of adaptation is best handled at configuration set-up (compile) time.
Figure 3 depicts the computation of $y_k = \sum_{i+j=k} w_i x_j$, subject to $0 \le i, j \le N-1$. Each of the black nodes in Fig. 3 represents an AM operation. Also shown is the projection of the dependence graph onto a three-node linear array with its two data streams x and y. Practical use of this linear array for performing the convolution algorithm requires that four operations be performed. Figure 4 shows an augmented dependence graph in which the three auxiliary operations (besides inner-product steps) are also shown. At the left end of Fig. 4, the weighting coefficients are delivered through the input links of data stream y, flowing to the shaded nodes; the store operation is simply a change in the direction of data flow, from diagonal to horizontal, for the coefficients.
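To make the role of the AM nodes concrete, the following Python sketch evaluates the convolution as a sequence of additive-multiply steps, one per black node of the dependence graph. It is a word-level functional model only: it ignores digit-serial timing, the projection onto a linear array, and the auxiliary store and forward operations. The function names `am` and `convolve_with_am` are hypothetical.

```python
def am(y, w, x):
    """One additive-multiply (inner-product) step: y + w * x."""
    return y + w * x


def convolve_with_am(w, x):
    """Compute y_k = sum of w_i * x_j over all i + j = k, with 0 <= i, j <= N - 1.

    Each (i, j) pair corresponds to one AM node; the partial result for
    output index k = i + j is passed through the node and updated.
    """
    n = len(w)
    assert len(x) == n, "this sketch assumes equal-length operands"
    y = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            y[i + j] = am(y[i + j], w[i], x[j])
    return y


print(convolve_with_am([1, 2, 3], [4, 5, 6]))
```

Note that the first and last outputs each involve a single AM step, while the middle output involves N of them, which matches the varying number of nodes along each line of constant k in the dependence graph.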
Besides storing the coefficients, we need to deliver data elements to nodes that operate on them first. For example, computing y0 involves only one inner-product step that occurs within a node at the bottom of the dependence graph. The white nodes in the middle of Fig. 4 forward data elements to nodes where the first computation is to occur.
Fig. 4. Dependence graph for convolution algorithm with data loading and result drainage also included.
Note that in Fig. 4, nodes of the same type are vertically aligned, so that a two-bit tag attached to the elements of data stream x suffices to instruct the nodes as to which of the four functions (actually two functions, and one control bit, after merging compatible nodes) they should perform.
The foregoing is an example of a data-driven control methodology that we have successfully applied to the design of several application-specific digital systems (see, e.g., [Kwai96], [Kwai97], [Parh99]).
For the arithmetic arrays under discussion here, we take the simple case where the cell only performs the multiply-add operation and use a single control tag bit for each input data digit. This "FD" tag signifies that an input is the first incoming digit of an operand (LSD or MSD, depending on the computation mode). Operand length is stored in the cell control logic at configuration set-up time, so timing of the last digit is automatically deduced.
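The way a stored operand length lets a cell infer the last digit from the FD tag alone can be sketched as a small counter. This is an illustrative model under stated assumptions, not the paper's control logic: the class name `OperandTracker` and its method are hypothetical, and digits arriving before any FD tag are simply treated as idle.

```python
class OperandTracker:
    """Deduce the last digit of each operand from FD tags and a preset length."""

    def __init__(self, length):
        self.length = length   # operand length in digits, fixed at set-up time
        self.count = 0         # digits seen in the current operand; 0 means idle

    def accept(self, fd):
        """Consume one digit slot; return True if it is the operand's last digit."""
        if fd:
            self.count = 1     # FD tag restarts the count at the first digit
        elif self.count > 0:
            self.count += 1
        else:
            return False       # idle slot outside any operand
        if self.count == self.length:
            self.count = 0     # operand complete; wait for the next FD tag
            return True
        return False


tracker = OperandTracker(length=4)
fd_stream = [1, 0, 0, 0, 1, 0, 0, 0]
print([tracker.accept(bool(fd)) for fd in fd_stream])
```

Because only the first digit is tagged, a single tag bit per digit stream is enough, and changing the word width only requires changing the stored length.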
Cell operation is determined by the coincidence of FD tags.
Basically, the FD tag of the additive input y triggers the computation, while the FD tags of other operands specify change of value for those operands. For example, the AMM2+ cell will perform the functions shown in Table II depending on the FD tags of w and x. Each time a new value of w or x is provided, it is stored in the cell; so, loading of coefficients does not require additional control.
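The coincidence rule can be illustrated with a word-level Python sketch. It is not a reproduction of Table II or of the AMM2+ cell: it abstracts each operand to a whole word accompanied by its FD tag, and the class name `AdditiveMultiplyCell` and the `step` signature are hypothetical. The only behavior modeled is that an FD tag on w or x replaces the stored value, and an FD tag on y triggers the computation y + w * x with whatever values are then stored.

```python
class AdditiveMultiplyCell:
    """Word-level model of a cell whose operation is selected by FD tags."""

    def __init__(self, w=0, x=0):
        self.w = w             # stored multiplicative operand (coefficient)
        self.x = x             # stored multiplicative operand (data)

    def step(self, y=0, fd_y=False, w=None, fd_w=False, x=None, fd_x=False):
        """Apply one set of inputs; return y + w * x if y's FD tag is set."""
        if fd_w:
            self.w = w         # a tagged w is a new value and is stored
        if fd_x:
            self.x = x         # likewise for x
        if fd_y:
            return y + self.w * self.x
        return None            # no y operand, so no computation is triggered


cell = AdditiveMultiplyCell()
cell.step(w=3, fd_w=True)                      # load a coefficient only
print(cell.step(y=10, fd_y=True, x=4, fd_x=True))
print(cell.step(y=1, fd_y=True))               # reuse the stored w and x
```

Loading a coefficient is therefore just another input pattern, which is the sense in which no additional control is needed.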
Comparisons and Tradeoffs
As a simple example, we consider the implementation of the convolution algorithm (Fig. 3) with N = 4, using the cell types of Table I. We assume a word width of 16 bits, leading to k = 8 for the radix-4 implementation and k = 16 for the three radix-2 options. In all but the ADMM2 case, seven cells will be cascaded vertically (see Fig. 5). In all but the radix-4 case, which requires no horizontal cascading, two cells are connected horizontally (Fig. 5). However, where cells are used to delay y, a single cell will suffice, even though the control circuit of that cell acts as if it is performing a 16-bit AM. The external I/O pin count includes FD control tag bits.
Note that in Fig. 3, equitemporal lines extend diagonally in the SW-NE direction; thus, the y values must be delayed in their downward vertical movement. The cell count includes both those used for computation and the ones that merely delay results to achieve proper timing (Fig. 5). Note that the clock rate is assumed to be the same for all four designs. This is not unreasonable, given the fairly small differences in gate levels within cells, per 
