Abstract: Distributed arithmetic (DA) has been widely used to implement inner product computations with a fixed input. Conventional ROM-based DA suffers from large ROM requirements. A new DA algorithm is proposed that expands the fixed input into bit level, instead of the variable input as in ROM-based DA. Thus the new DA algorithm can take advantage of shared partial sum-of-products and sparse nonzero bits in the fixed input to reduce the number of computations. Unlike ROM-based DA, which stores the precomputed results, the new DA algorithm uses a predefined structure to compute results. When applied to a 1-D eight-point DCT system, the new DA algorithm needs only 30% of the hardware area and is faster than ROM-based DA. To illustrate the efficiency of the proposed algorithm, a 2-D IDCT chip was implemented using 0.8 µm SPDM CMOS technology. The chip, with a size of 4575 µm × 5525 µm, can deliver a processing rate of 50 Mpixels per second.
Introduction
Computation of inner products dominates the computation cost in many digital signal processing (DSP) applications. Though inner-product designs using multipliers and accumulators (MAC) are fast, the associated cost is intolerable when long inner-product computations are considered. Instead of using MAC, distributed arithmetic (DA) [1] uses a ROM that stores the precomputed partial sums of inner products. This computational efficiency makes DA popular in various DSP applications in which one of the multiplication operands is fixed, including filters, convolution and video processing applications like the discrete cosine transform (DCT) and inverse DCT (IDCT) [2, 3]. DCT and IDCT [4] have been selected as an important approach in many video codec standards [5] to reduce the spatial redundancies induced by the correlation of signals. Due to the inherently high computational complexity, high-speed and low-cost designs are essential in many real-time video applications. To achieve these goals, most VLSI implementations of 2-D DCT/IDCT use row-column decomposition to convert a 2-D transform into consecutive 1-D ones and adopt DA to compute the 1-D DCT/IDCT.
The DA technique distributes arithmetic operations rather than lumping them as multipliers do. Conventional DA [1], called ROM-based DA, decomposes the variable input of the inner product into bit level to generate precomputed data. ROM-based DA uses a ROM table to store the precomputed data, which makes it regular and efficient in silicon area in VLSI implementation. However, ROM-based DA suffers from large ROM requirements.
We present a new DA algorithm [7], called adder-based DA, for solving this problem. This algorithm, in contrast to conventional DA, decomposes the other operand of the inner product into bit level, distributes the multiplication operation, and shares the common summation terms. Adder-based DA exploits the distribution of binary value patterns and may maximise the hardware sharing possibility in the implementation. Therefore adder-based DA requires less hardware area and a shorter computation cycle time than ROM-based DA. Due to its inherent sharing property, the proposed adder-based DA technique is very suitable for multiple inner-product calculations like DCT and IDCT [7]. Since adder-based DA shares common summation terms between computations, it can be regarded as one of the class of common subexpression algorithms [7-10]. Unlike previous word-level sharing algorithms, the bit-level DA formulation provides implementation benefits for efficient bit-serial hardware designs.
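For reference, conventional ROM-based DA can be sketched as below. This is a minimal illustration rather than the design discussed in the paper: it assumes unsigned integer inputs instead of fractions, and the names `build_rom` and `rom_da_inner_product` are hypothetical.

```python
def build_rom(coeffs):
    """Precompute, for every address, the sum of the A_i whose address bit is 1."""
    length = len(coeffs)
    return [sum(a for i, a in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << length)]


def rom_da_inner_product(coeffs, xs, n_bits):
    """Inner product of fixed coeffs with unsigned n_bits-wide inputs,
    consuming one bit of every input per cycle."""
    rom = build_rom(coeffs)
    acc = 0
    for k in range(n_bits):                  # cycle k handles bit k of every input
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> k) & 1) << i      # bit k of x_i drives address line i
        acc += rom[addr] << k                # shift and accumulate the ROM word
    return acc


coeffs = [3, 5, 6, 7]        # hypothetical fixed coefficients
xs = [9, 2, 15, 4]           # hypothetical 4-bit inputs
result = rom_da_inner_product(coeffs, xs, n_bits=4)
rom_size = len(build_rom(coeffs))
```

The table has one entry per combination of input bits, so it holds 2^L words for an inner product of length L, which is where the large ROM requirement comes from.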
2
An inner product of length $L$ is defined as
$$Y = \sum_{i=1}^{L} A_i X_i \qquad (1)$$
where $A_i$ is a fixed coefficient, and $X_i$ is a variable input. To keep the equations simple, $A_i$ and $X_i$ are expressed as unsigned fractions (two's complement form can also be used) as follows:
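The bit-level expansions and the two resulting DA forms can be sketched as below. This is a reconstruction under assumptions, not a quotation: it assumes $M$-bit coefficients and $N$-bit inputs, and the index names $j$, $k$, $t$ and the symbols $S_j$ and $S_{j,t}$ are chosen to match the surrounding text.

```latex
\begin{align*}
A_i &= \sum_{j=1}^{M} A_{i,j}\,2^{-j}, \qquad A_{i,j}\in\{0,1\},\\
X_i &= \sum_{k=1}^{N} X_{i,k}\,2^{-k}, \qquad X_{i,k}\in\{0,1\},\\
\text{ROM-based DA:}\quad
Y &= \sum_{k=1}^{N}\Bigl(\sum_{i=1}^{L} A_i X_{i,k}\Bigr)2^{-k},\\
\text{adder-based DA, bit-parallel:}\quad
Y &= \sum_{j=1}^{M}\Bigl(\sum_{i=1}^{L} A_{i,j} X_i\Bigr)2^{-j}
   = \sum_{j=1}^{M} S_j\,2^{-j},\\
\text{adder-based DA, bit-serial:}\quad
Y &= \sum_{t=1}^{N}\Bigl(\sum_{j=1}^{M} S_{j,t}\,2^{-j}\Bigr)2^{-t},
\qquad S_{j,t}=\sum_{i=1}^{L} A_{i,j}X_{i,t}.
\end{align*}
```

The ROM-based form expands the variable input $X_i$ and looks up the bracketed sum; the adder-based forms expand the fixed coefficient $A_i$, so each bracketed sum only involves the inputs whose coefficient bit is 1.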
Thus, one can shift and then accumulate the term $S_{j,t}2^{-j}$ at each cycle $t$ to obtain the inner product. Since $S_j$ is computed using adders, the proposed DA algorithm is called adder-based DA. There are two representations of adder-based DA, for two implementation styles: the bit-parallel form of eqn. 5 and the bit-serial form of eqn. 7. Fig. 3 shows the signal flow graph (SFG) of the adder-based DA design and the corresponding implementation. All j-planes are combined and collapsed into a summation network, in which only unique and nonzero nodes have to be computed.
Fig. 4 shows an example to illustrate how adder-based DA works. The adder-based DA algorithm first decomposes the fixed coefficients into bit level. After rearranging these terms, one finds that additions are needed only at the nonzero bits of the fixed coefficients. This is called the zero-one pattern property. In addition, a summation term involving $X_1$ can be shared between the bit weights $2^1$ and $2^0$. This is called the common term sharing property. Hardware area is saved by exploiting these two properties.
Fig. 5 shows the adder-based DA architecture realising the bit-serial form of eqn. 7. The architecture consists of an input buffer (parallel-to-serial converter), a summation network, and a shift-add part. The input buffer takes the inputs in parallel and outputs the bits of each input serially. The parallel-to-serial converter can be omitted in the architecture realising the bit-parallel form of eqn. 5. The summation network is a tree structure that connects the required inputs and generates their summation terms. The summation network of the example is shown in Fig. 6. The shift-add part shifts and accumulates the output of the summation network to generate the final inner-product results.
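The zero-one pattern and common term sharing properties can be sketched in software as below. This is a minimal illustration of the bit-parallel form, assuming unsigned integer coefficients and inputs; the coefficient values and the names `summation_plan` and `adder_da_inner_product` are hypothetical and are not taken from Fig. 4.

```python
def summation_plan(coeffs, m_bits):
    """For each bit weight j, list the inputs whose coefficient has bit j set."""
    plan = {}
    for j in range(m_bits):
        idx = tuple(i for i, a in enumerate(coeffs) if (a >> j) & 1)
        if idx:                          # zero-one pattern: skip all-zero bit planes
            plan[j] = idx
    return plan


def adder_da_inner_product(coeffs, xs, m_bits):
    """Return the inner product and the number of distinct summation terms."""
    plan = summation_plan(coeffs, m_bits)
    terms = {}                           # common term sharing: each sum computed once
    acc = 0
    for j, idx in plan.items():
        if idx not in terms:
            terms[idx] = sum(xs[i] for i in idx)
        acc += terms[idx] << j           # shift-add part
    return acc, len(terms)


coeffs = [0b011, 0b011, 0b100]           # hypothetical fixed coefficients
xs = [5, 7, 2]                           # hypothetical inputs
y, distinct_terms = adder_da_inner_product(coeffs, xs, m_bits=3)
```

Here the sum of the first two inputs is needed at bit weights 2^0 and 2^1 but is computed only once, so three nonzero bit planes require only two distinct summation terms.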
Architecture design

Precision analysis
Given the error constraint, one can derive the required word length of coefficient $A_i$. If the coefficient $A_i^*$ has infinite precision, $A_i$ has finite word length $M$ and $|A_i^* - A_i| \le 2^{-M}$, the maximum error of the adder-based DA design is
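One way to bound this error is sketched below. It assumes that every coefficient error is at most $\varepsilon > 0$ in magnitude and that every input is an unsigned fraction with $0 \le X_i < 1$; the symbols $Y^*$ (the infinite-precision result) and $\varepsilon$ are introduced here for illustration only.

```latex
|Y^* - Y|
  = \Bigl|\sum_{i=1}^{L} (A_i^* - A_i)\,X_i\Bigr|
  \le \sum_{i=1}^{L} |A_i^* - A_i|\,|X_i|
  < L\,\varepsilon
```

Under these assumptions the bound grows linearly with the inner-product length $L$, so each additional coefficient bit halves the worst-case error.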
3 Comparisons with relevant approaches
Comparisons of DA algorithms
To illustrate the area and speed advantages, we used a 1-D eight-point DCT as an example; the results are shown in Table 1. We compare only the ROM tables and the summation networks, since the two DA methods differ mainly in this part. All the data are estimated using a 0.8 µm SPDM CMOS standard-cell library. The summation network requires only 10% of the transistor count and 30% of the hardware area of the ROM tables in the ROM-based DA design. The gate area in the summation network is 0.378 mm², which is 47.25% of the summation network area. As Table 1 shows, the adder-based DA design consumes less hardware area and is faster than the ROM-based DA design.
[Table 1: ROM-based DA versus adder-based DA for the 1-D eight-point DCT. Surviving fragment: 2.4 ns, network adders.]
Utilising different function units, the ROM and the summation network, results in different requirements for the word length and the number of accumulation cycles in the shift-adders. From eqn. 4, the maximum value of the ROM data in the ROM-based DA design is $\max\left[\sum_{i=1}^{L} A_i X_{i,k}\right] = \sum_{i=1}^{L} |A_i|$. So the output of the ROM should be at least $\lfloor \log_2(\sum_{i=1}^{L} |A_i|) \rfloor + 1$ bits wide, and the accumulator should have the same width. In the adder-based DA design the accumulator is $M$ bits wide because the summation network generates an $M$-bit output. Hence the adder-based DA design needs a shorter word length in the shift-adders that generate the final inner-product results.
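The ROM word-length expression can be checked with a short sketch. It assumes the coefficients have been scaled to integers, and the function name `rom_word_length` is hypothetical; a sign bit, if needed, would be extra.

```python
def rom_word_length(coeffs):
    """Bits needed for the largest ROM entry: floor(log2(sum |A_i|)) + 1.

    For a positive integer n, n.bit_length() equals floor(log2(n)) + 1,
    which avoids floating-point rounding in the logarithm.
    """
    total = sum(abs(a) for a in coeffs)
    if total == 0:
        raise ValueError("at least one coefficient must be nonzero")
    return total.bit_length()


example_width = rom_word_length([3, 5, 6, 7])   # sum of magnitudes is 21
```

With these hypothetical coefficients the largest ROM entry is 21, which needs five bits, while each coefficient on its own fits in three.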
The number of accumulation cycles in the ROM-based DA design is smaller than that in the adder-based DA design. The ROM-based DA design needs $N$ accumulation cycles. However, the accumulators in the adder-based DA design need additional cycles to add the carry from the summation network. The additional number of cycles is $\left(\lfloor \log_2 \max\left|\sum_{i=1}^{L} A_{i,j} X_i\right| \rfloor + 1\right) - N \le \log_2 M$, which depends on the application specifications. Considering the 2-D IDCT [13] design example, the input is 12 bits wide and the output will not exceed 12 bits. In this situation no additional cycles are needed.
Comparison with other subexpression sharing approaches
The proposed DA-based algorithm can be regarded as one of the classes of subexpression sharing techniques, which can be classified according to how the shared common term is generated, as illustrated in Fig. 7. It shows the computation of the equation Y = a*x1 + b*x2 + c*x3 + d*x4, where a = 0010101b, b = 0101010b, c = 1110101b, and d = 1100101b. For each sharing type, the inputs to the common term should be available simultaneously for computation. The approaches in [8, 9] generate the shared terms in the word-serial bit-parallel input direction (horizontal circles in Fig. 7). The shared-term generation in [10] was extended to the skewed direction (diagonal circles in Fig. 7). Our proposed approach shares the common terms in the word-parallel bit-serial direction (vertical gray circle in Fig. 7), which is very suitable for hardware implementation when exploiting parallel processing features. DA-based approaches can separate the low transition probability MSBs from the high transition probability LSBs, which is beneficial to power management.
For the implementation of the adder-based DA design, eqn. 5 or eqn. 7 can be utilised. Eqn. 5 is suitable for software implementation while eqn. 7 is suitable for VLSI implementation due to the bit-serial nature of DA. Software implementation of eqn. 5 is suitable for simple programmable processors without multipliers, since the multiple multiplications and additions are reduced to just a few shift and addition operations. Further reduction of the operation count can be achieved by combining adder-based DA with fast algorithms. Software implementation of eqn. 5 also allows adder-based DA to be applied to adaptive designs, since the coefficient bits can be examined dynamically.
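The vertical (word-parallel) sharing direction can be sketched for the four coefficients of the Fig. 7 example. This is an illustrative software rendering assuming unsigned integer inputs; the input values and the names `column_patterns` and `evaluate` are hypothetical.

```python
COEFFS = {"x1": 0b0010101, "x2": 0b0101010, "x3": 0b1110101, "x4": 0b1100101}


def column_patterns(coeffs, width=7):
    """For each bit weight, list the inputs whose coefficient has a 1 there."""
    return {j: tuple(name for name, c in coeffs.items() if (c >> j) & 1)
            for j in range(width)}


def evaluate(coeffs, values, width=7):
    """Compute Y by summing each distinct column pattern once, then shifting."""
    shared = {}
    y = 0
    for j, names in column_patterns(coeffs, width).items():
        if names not in shared:
            shared[names] = sum(values[n] for n in names)
        y += shared[names] << j
    return y, shared


values = {"x1": 3, "x2": 1, "x3": 4, "x4": 2}    # hypothetical inputs
y, shared = evaluate(COEFFS, values)
direct = sum(COEFFS[n] * values[n] for n in COEFFS)
```

For these coefficients the seven bit columns contain only five distinct patterns, so the sums for bit weights 2^0 and 2^2 (and likewise 2^1 and 2^3) are computed once and reused.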
Due to the inherent sharing property, adder-based DA is very suitable for calculations of multiple inner products. Adder-based DA can share the common computations among multiple inner products and combine them into a summation network. In contrast, the ROM-based DA design has to store separate ROM tables for each set of coefficients. The coefficients of multiple inner products can be either multiple dimensional coefficients like DCT or different sets of coefficients like DCT and DST. To calculate multiple inner products, one uses a summation network to generate the desired subexpressions. If the coefficient sets are different, like DCT and IDCT, a shuffle network is used to select the terms required. Fig. 8 shows the architecture for multiple inner product calculations. The drawback of such a design is the extra shuffle network area and delay.
The optimisation problem of the proposed adder-based design is how to find the common terms among the nonzero subexpressions so as to reduce the summation network area. This is analogous to the logic minimisation problem that extracts the common terms, i.e. an NP-complete problem.
Fortunately, in some real systems like DCT and IDCT, the search space is small enough that one can find the optimal solution by exhaustive search. For more complicated cases, logic minimisation tools can be used to find the solution. Algorithms and optimisation techniques developed for previous subexpression techniques [8-10] can be modified to find efficient common subexpressions by considering the type of adder-based DA subexpressions.
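For a small problem, candidate common terms can simply be enumerated. The sketch below counts how often each pair of inputs appears together in a bit column and picks the most frequent pair. It is a single greedy step shown for illustration, not the exhaustive search used in the paper; the column data and the names `pair_counts` and `best_shared_pair` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(columns):
    """Count, for every pair of inputs, the bit columns that contain both."""
    counts = Counter()
    for col in columns:
        for pair in combinations(sorted(col), 2):
            counts[pair] += 1
    return counts


def best_shared_pair(columns):
    """Return the pair that occurs in the most columns, with its count.

    Raises ValueError if no column has at least two inputs.
    """
    counts = pair_counts(columns)
    if not counts:
        raise ValueError("no column contains two or more inputs")
    return max(counts.items(), key=lambda kv: (kv[1], kv[0]))


# Hypothetical bit columns: each tuple lists the inputs added at one bit weight.
columns = [
    ("x1", "x3", "x4"), ("x2",), ("x1", "x3", "x4"), ("x2",),
    ("x1", "x3"), ("x2", "x3", "x4"), ("x3", "x4"),
]
pair, occurrences = best_shared_pair(columns)
```

Building the most frequent pair once and reusing it wherever it occurs is the kind of saving the summation network aims for; a full optimiser would repeat the step or search all combinations.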
In addition to direct optimisation, data partition provides a tradeoff between performance and area as the design specification changes. Data partition can be either data-independent or data-dependent. The data-independent method directly partitions the vector size L and the input word length N, and has been used in previous work on ROM-based DA designs [6]. Data-dependent partition exploits the bit patterns of the coefficients. The rule for this type of partition is to group the coefficients with similar zero-one patterns into one partition. Fig. 9 shows an example on DCT coefficients. Depending on the bit patterns, different realisation methods can achieve lower hardware cost.
Fig. 10 shows the block diagram of the 1-D eight-point IDCT. The target throughput rate is one pixel per cycle, that is, the design has to complete a 1-D eight-point IDCT computation in eight cycles. To attain this throughput, the summation networks are designed to perform a two-bit addition per cycle, since the word length of the input is 12 bits. The common terms of the summation networks are found by exhaustive search. The IDCT coefficients are first scaled by √2 to reduce the number of nonzero bits. The scaling factor is easily recovered by a shift, since a row-wise IDCT followed by a column-wise IDCT makes the overall scale factor two. The serial adder in the summation network is composed of two full adders and a D flip-flop (DFF) with reset. The two networks contain 22 full adders, 11 DFFs and 30 output latches. The total gate count of the network is 481, of which 40% is needed for output latches. The final shift-adders accumulate two 16-bit words per cycle by using a carry save adder and a BLC adder [12].
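The effect of the √2 scaling can be checked numerically. The sketch below assumes the orthonormal eight-point IDCT definition, which may differ from the normalisation used in the chip, and uses floating point rather than fixed-point hardware arithmetic; the names `idct_1d` and `idct_2d` are hypothetical.

```python
import math

N = 8


def idct_1d(coefs, scale=1.0):
    """Orthonormal eight-point IDCT with every coefficient multiplied by scale."""
    out = []
    for n in range(N):
        s = 0.0
        for k in range(N):
            c = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
            s += scale * c * coefs[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
        out.append(s)
    return out


def idct_2d(block, scale=1.0):
    """Row-column decomposition: row-wise 1-D IDCT, then column-wise 1-D IDCT."""
    rows = [idct_1d(r, scale) for r in block]
    cols = [idct_1d([rows[r][c] for r in range(N)], scale) for c in range(N)]
    return [[cols[c][r] for c in range(N)] for r in range(N)]


block = [[(r * N + c) % 5 - 2 for c in range(N)] for r in range(N)]  # test data
plain = idct_2d(block)
scaled = idct_2d(block, scale=math.sqrt(2))
max_diff = max(abs(scaled[r][c] / 2 - plain[r][c])
               for r in range(N) for c in range(N))
dc_coefficient = math.sqrt(2) * math.sqrt(1.0 / N)
```

Each pass contributes a factor of √2, so the two passes give a factor of two that a one-bit right shift removes. Under the normalisation assumed here, the scaled DC coefficient becomes exactly 0.5, a single nonzero bit.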
Fig. 11 Microphotograph of 2-D IDCT chip
Fig. 11 shows the microphotograph of the chip fabricated using 0.8 µm SPDM CMOS technology. The chip size is 4575 µm × 5525 µm and it achieves a 50 MHz working frequency. The precision of the chip meets the requirement of the IEEE standard [13], as listed in Table 2.
Conclusion
We have presented a new DA algorithm called adder-based DA and its application to a 2-D IDCT processor. This algorithm decomposes the fixed coefficients into bit level instead of decomposing the variable input into bit level. Thus, one can exploit the constant and numerical characteristics of the fixed input to share hardware and save cost. This effectiveness makes adder-based DA a superior design choice over ROM-based DA in current DA applications. For a 1-D DCT design, adder-based DA needs only 30% of the ROM area required by the ROM-based DA approach. A 2-D IDCT chip was designed and implemented based on the proposed adder-based DA approach to illustrate its efficiency.
Acknowledgment
This work is supported by National Science Council, R.O.C., under the grant NSC-86-2221-E-009-014.
