ABSTRACT
INTRODUCTION
Finite-impulse-response (FIR) filters are basic processing elements in applications such as video signal processing and audio signal processing. The automatic synthesis of an optimal integrated circuit to handle diverse filtering speed requirements is among the most challenging objectives in CAD research today. To achieve this objective, two problems must be resolved. The first one concerns the selection of architecture types. Various architecture types, such as systolic arrays [1], transversal structures [2], and stored-product structures [3], have been proposed for FIR filters of various lengths. Given these various architectures, a suitable optimization criterion is needed to help in the selection of an appropriate architecture for a specified filter length. One conventional optimization criterion is the cost-speed product, such as $A \cdot T$ or $A \cdot T^2$ complexity, where $A$ and $T$ are the area and computing time, respectively. Such a criterion gives a general and theoretical analysis of a prototype of a VLSI architecture, but is not practical or precise enough for use in designing application-specific integrated circuits (ASICs). A more practical criterion, which is adopted in this paper, is to select a VLSI architecture with minimum hardware cost under a certain speed specification. However, an accurate estimation of the hardware and speed cost of VLSI architectures must be given to support the use of this selection criterion. In fact, the hardware and speed cost vary with the technologies provided by different foundries, such as routing technology and available implementations of functional units. For example, it is hard to estimate whether an architecture with higher routing complexity and fewer arithmetic units is more or less costly than one with lower complexity and more arithmetic units. Hence, it is very difficult to give a fair evaluation of the cost of various types of architectures.
Once a certain type of architecture is selected, the second problem concerns the speed-specific configurations. For example, to reduce hardware costs, fewer multipliers are used in an architecture designed for low speed requirements than in one designed for high speed requirements. Hence, an automatic design tool should be able to synthesize different architecture configurations to suit diverse speed requirements. This may entail designing a large number of architecture configurations for an application and then selecting a suitable one among them. However, the synthesis and selection procedure involved in this approach is difficult to carry out.
In this paper, the problems of selecting an appropriate architecture type and of speed-specific configuration are solved in the following ways. First, we show that the multipliers in FIR filters can be efficiently replaced by ROMs; hence, the resulting architecture styles, which we call memory-based architectures (MBAs), can be arbitrarily selected without sacrificing general speed and hardware performance. To prove this claim, we show that a function with the properties of linearity and time-invariant input can be efficiently implemented by ROMs. Also, it is illustrated that the FIR filter can be formulated to have these two properties and hence MBAs are feasible.
Second, we show that FIR filters can be implemented flexibly using different numbers of memory modules. The hardware and speed cost of each module are lower than those of a multiplier. To achieve the flexibility, we show that a highly efficient memory replacement sometimes causes different contents of ROM modules and hence the number of ROM modules is proportional to filter length or wordlength. This paper is intended to show that both highly efficient memory replacement and identical ROM contents can be obtained by a proper algorithm formulation and hence a flexible number of memory modules can be used to realize an FIR filter.
Third, a parameterized MBA is presented. In addition to utilizing the formulation results described above, the design of this architecture takes into account the computing parallelism, memory partitioning and pipelining. The resulting architecture is characterized by three design parameters: D, the number of bits to be processed in parallel, K, the memory partition factor, and P, the number of pipeline stages in one processing unit. Different MBA configurations can be obtained by substituting different values for these parameters.
Fourth, hardware-speed evaluation formulae for the parameterized MBA are established for a given speed and hardware cost model of the required elements in an MBA. These elements include memory modules, adders, and shift registers. Since all MBAs have the same types of elements and similar configurations, the selection of an optimized MBA will not be affected much by factors such as routing and placement, and different MBAs can be evaluated using the same hardware-speed evaluation formulae.
Finally, based on the hardware-speed evaluation formulae and the parameterized architecture, an optimized MBA configuration can be easily designed by using computers to search for the best values of the design parameters. Thus, speed-specified configurations can be designed.
The technical content of this paper is organized as follows: In Section 2, efficient memory replacement based on algorithm properties is described. These properties include the time-invariant input and functional linearity. With the results of Section 2, computational units in FIR filters can be efficiently replaced by a flexible number of memory modules. In Section 3, architecture issues involved in implementing the formulation results of Section 2 are considered. These issues include memory partition, adder selection, and pipelining. A parameterized MBA is presented by taking these architecture issues and formulation results into consideration. In Section 4, the evaluation formulae for the speed and hardware cost of the parameterized MBA are established. These formulae can be adapted to different technologies by giving different technology parameters for the components used in the MBA. Therefore, with a given set of technology parameters, different configurations of the proposed parameterized architecture can be evaluated using the same formulae and the optimal configuration which meets the speed requirement can be obtained. A design example and discussion are also given in this section. In Section 5, concluding remarks are given and other applications of the technical content of this paper are suggested.
ALGORITHM FORMULATION
In this section, the formulation of algorithms for designing MBAs is systematically investigated. Two general conditions that lead to an efficient replacement of a combinational circuit by ROMs are addressed in the first subsection. In the second subsection, two aspects of algorithms that will be used in evaluating algorithm formulations for FIR filters are presented. In the last subsection, various algorithm formulae for FIR filters are derived and analyzed using these conditions and aspects.
Two Conditions for Memory Reduction
The basic memory replacement model is illustrated in Figure 1, which shows how a combinational circuit with function $y = f(\vec{z})$ can be replaced by a ROM with the addressing lines as the bit lines of $\vec{z}$ and a ROM size of $2^{N \cdot L_z}$ words. In most cases, the ROM size will be too large to be implemented. The ROM size can be reduced, however, whenever one or both of two conditions hold. The first condition is that some inputs of the combinational circuit are time-invariant. To replace a combinational circuit with $H$ input lines, a ROM size of $2^H$ words is needed. If, however, the signals in some of the $H$ lines, say $G$ lines, are time-invariant, the ROM size can be reduced to $2^{H-G}$ words.
For example, a multiplier with two operands can be replaced by one ROM with a size of $2^{2 \cdot L}$ words, as shown in Figure 2a, where $L$ is the wordlength of the operands. Such a replacement will induce a ROM area that is larger than that of the original multiplier. However, if one of the operands is time-invariant, as shown in Figure 2b, the ROM size can be reduced to $2^L$ words.
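As a minimal sketch of the time-invariant case in Figure 2b, the following Python fragment tabulates the products of a fixed operand with every possible $L$-bit input, so that a table lookup replaces the multiplication. The function name and the coefficient value are illustrative assumptions and are not taken from the design described in this paper.

```python
def build_constant_multiplier_rom(a, L):
    """Tabulate a * z for every unsigned L-bit operand z (2**L words)."""
    return [a * z for z in range(2 ** L)]


L = 8    # operand wordlength
a = 37   # hypothetical time-invariant operand
rom = build_constant_multiplier_rom(a, L)

rom_words = len(rom)   # 2**L words, versus 2**(2*L) when both operands vary
product = rom[200]     # a table lookup stands in for the product a * 200
```

The table is only valid for the single operand it was built for, which is the restriction discussed later under the space-time-commutative property.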
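The other condition for reducing the ROM size is that the function $f(\cdot)$ is linear. In this case, if we represent the input signal as a linear combination of $\vec{z}_i$, i.e., $\vec{z} = \sum_i c_i \cdot \vec{z}_i$, then it follows from the properties of linear functions that

$$f(\vec{z}) = f\Big(\sum_i c_i \cdot \vec{z}_i\Big) = \sum_i c_i \cdot f(\vec{z}_i). \qquad (1)$$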
Different schemes for decomposing $\vec{z}$ will lead to different implementations. For example, we can decompose $\vec{z}$ into the summation of two sub-vectors:

$$\vec{z} = \vec{z}_1 + \vec{z}_2. \qquad (2)$$

To illustrate the scheme, the multiplier in Figure 2 is again taken as an example. In Figure 2c, if we decompose $\vec{z}$ as in (2), the multiplier can be replaced by two memory modules and one adder. The total ROM size is $2 \cdot 2^{L/2}$ words with some overhead in adders. For a value of $L$ of eight, the hardware-speed cost for the two ROMs and the adder has been shown to be more efficient than that of most multiplier architectures [4]. In general, if the original ROM size is $2^L$ words and the ROM module is uniformly partitioned into $q$ submodules, then the total size will be reduced to $q \cdot 2^{L/q}$ words, with some overhead of adders to sum the partial results from each submodule. Therefore, the memory partition scheme can reduce the memory size exponentially.
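The partition scheme can be sketched in Python as follows, under the assumption that $q$ divides $L$ and that each sub-ROM is addressed by one contiguous field of the operand. The names and the coefficient value are hypothetical; the weight of each field is folded into its table so that the partial results only need to be added.

```python
def build_partitioned_roms(a, L, q):
    """Build q sub-ROMs, each addressed by L // q bits of the operand."""
    assert L % q == 0, "this sketch assumes q divides L"
    bits = L // q
    # Sub-ROM i stores a * (v << (i * bits)), i.e. f applied to field i alone.
    return [[a * (v << (i * bits)) for v in range(2 ** bits)] for i in range(q)]


def multiply_by_lookup(roms, z, L):
    """Sum the partial results read from each sub-ROM (linearity of f)."""
    bits = L // len(roms)
    mask = (1 << bits) - 1
    return sum(rom[(z >> (i * bits)) & mask] for i, rom in enumerate(roms))


L, q, a = 8, 2, 37   # a is a hypothetical time-invariant operand
roms = build_partitioned_roms(a, L, q)
total_words = sum(len(rom) for rom in roms)   # q * 2**(L/q) instead of 2**L
```

With $L = 8$ and $q = 2$ the two tables hold 16 words each, at the price of one extra addition per lookup.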
Two Aspects for Evaluating Algorithms
In this subsection, output wordlength and the space-time-commutative property are introduced for use in evaluating the performance of algorithm formulae in the succeeding discussions. Consider an FIR filter with tap weights $a_i$ for an input sequence $z_j$. This filter can be represented as follows:

$$y_k = \sum_{i=0}^{N-1} a_i \cdot z_{k-i}, \qquad (3)$$

where $k$ is the time index starting from zero. In this equation, the number of multiplications required to obtain each output $y_k$ is equal to the tap length $N$. A direct form implementation of the FIR filter is shown in Figure 3. In most applications, the tap weights are time-invariant. Therefore, the dotted block shown in Figure 3 can be replaced by one ROM with a size of $(2^{N \cdot L_z}) \cdot L_y$, where $L_z$ and $L_y$ are the wordlengths of operands $z_k$ and $y_k$, respectively. If full [...] partitioned into $q$ submodules, and the total ROM size can be reduced from $(2^{N \cdot L_z}) \cdot L_y$ to $(q \cdot 2^{\frac{N}{q} \cdot L_z}) \cdot L_y$. In the reduction process, the output wordlength $L_y$, or the required precision of each ROM, is assumed to be a constant. In general, the output wordlength of each submodule can also be reduced through the partitioning. However, the reduction in output wordlength depends on a suitable decomposition of input data $\{z_i\}$. The algorithm formulae described in the next subsection can factor the FIR equation to reduce not only the exponent of $(2^{N \cdot L_z}) \cdot L_y$ but also the $L_y$ of the factorized ROMs. This is one reason why we introduce these algorithm formulae in addition to the memory partition scheme.

Another property that may be used to evaluate the algorithm formulae is the space-time-commutative property. The space-time-commutative property is the property that a reduction (or an increase) of a factor of hardware results in an increase (or a reduction) of approximately the same factor in computation time. If an algorithm formula has this property, based on a particular type of hardware realization, then the hardware architecture for that algorithm can be flexibly modified to meet different speed requirements. For example, suppose that an algorithm mostly consists of multiplications. If multipliers are used to realize these multiplications, then the number of multipliers required in the architecture varies with different speed requirements. If the required speed is low, fewer multipliers can be used, thereby reducing the hardware cost. If, on the other hand, the required speed is high, a larger number of multipliers should be used. Since one objective of this paper is to design a memory-based architecture that is easily tuned to various speed requirements, this property is essential. The attainment of this property depends on a cautious examination of algorithm formulae and hardware units. The example discussed above shows that a realization based on multipliers possesses the space-time-commutative property. However, ROM tables are not as flexible as multipliers are. For example, the ROM in Figure 2a can be used to realize multiplications with any operand values while that in Figure 2b can only realize multiplications with the operand $a$. Hence, if the ROM in Figure 2b is adopted as a hardware unit, algorithms have to be formulated to consist of multiplications with only an operand $a$ to attain the space-time-commutative property. This example implies that a highly efficient memory replacement would seriously restrict the range of feasible algorithm formulae. In the next subsection, an algorithm formulation that yields both highly efficient memory replacement and the space-time-commutative property is presented.

If (3) is represented as an inner-product, it follows that

$$y_k = f(\vec{z}_k) = \vec{a}^T \vec{z}_k, \qquad (4)$$

$$\vec{a} = [a_0, a_1, \ldots, a_{N-1}]^T, \qquad \vec{z}_k = [z_k, z_{k-1}, \ldots, z_{k-N+1}]^T. \qquad (5)$$
The superscript, $T$, denotes the transpose of a vector. Since the function is linear, the input vector can be decomposed to obtain various formulae. One simple approach to doing this is the Stored-Product approach [3], in which $\vec{z}_k$ is decomposed at the word level, as follows:
If the decomposing scheme in (6) is substituted into (4), the output can be obtained from the linear combination of the $N$ separated terms, i.e.,
Each term in (7) [...] shows that the multiplication on each tap is replaced by one ROM module. However, the main drawback of this architecture is that it is not space-time-commutative and hence the number of ROM tables must be equal to the tap length $N$. To consider the space-time-commutative property, (7) is reformulated as
From (8), it can be seen that the contents of each ROM module are coded based on $a_i z_{k-i}$ for $i = 0, 1, \ldots, N-1$. Since different ROM tables are dedicated to different tap values $a_i$, these ROM modules are not space-time-commutative. Consequently, hardware costs cannot be reduced through the space-time-commutative property when the tap length is long or the required speed is low.
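The word-level scheme can be illustrated with a short Python sketch in which each tap owns a dedicated table. The tap values, the window contents, and the function names are assumptions chosen for illustration; the point is only that the $N$ tables have different contents and so cannot stand in for one another.

```python
def build_stored_product_roms(taps, Lz):
    """One ROM per tap: ROM i stores a_i * z for every unsigned Lz-bit z."""
    return [[a_i * z for z in range(2 ** Lz)] for a_i in taps]


def fir_output(roms, window):
    """window[i] holds z_{k-i}; one lookup per tap, then a direct summation."""
    return sum(rom[z] for rom, z in zip(roms, window))


taps = [3, -1, 4, 2]         # hypothetical tap weights a_0 .. a_3
Lz = 8                       # input wordlength
roms = build_stored_product_roms(taps, Lz)

window = [200, 17, 255, 0]   # z_k, z_{k-1}, z_{k-2}, z_{k-3}
y = fir_output(roms, window)
tables_differ = roms[0] != roms[1]   # each table is tied to its own tap
```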
Another approach to decomposing $\vec{z}$, one that uses the concepts of "Distributed Arithmetic" [5, 6], is to represent the operand $z$ at the bit level, as follows:

$$z = \sum_{j=0}^{L_z-1} z^j \cdot 2^j, \qquad (9)$$
where $z^j$ denotes the $j$th bit of data $z$. Here, $z$ is represented in an unsigned binary form. The extension for two's complement is discussed in the Appendix. We now represent (9) as a linear combination of bit-level vectors,

$$\vec{z}_k = \sum_{j=0}^{L_z-1} \vec{z}_k^{\,j} \cdot 2^j, \qquad (10)$$

where $\vec{z}_k^{\,j}$ denotes the vector that is composed of the $j$th bit of each entry of $\vec{z}_k$. Substituting (10) into (4) yields

$$y_k = \sum_{j=0}^{L_z-1} 2^j \cdot \vec{a}^T \vec{z}_k^{\,j}. \qquad (11)$$

[...] $(L_a + \log_2 N)$. The functions of the ROM modules in (11) are all identical and equal to $f(\vec{z}_k^{\,j}) = \vec{a}^T \vec{z}_k^{\,j}$. In other words, any ROM module can take the place of any other module, which implies that the formula in (11) is space-time-commutative. Figure 6 shows an architecture that realizes (11) by only one ROM time-serially. If Figure 6 is compared with Figure 5, four primary features can be found. First, the $L_z$ ROM modules are replaced by a single module. Second, since the input of each ROM module like that in Figure 5 consists of the same significant bits of input sequence $\{z_i\}$ and the input is assumed to proceed data-serially and bit-serially, each black block on the input line of Figure 6 should consist of $L_z$ delay elements to ensure functional correctness. Third, the weighted adder in Figure 5 is now the shift-and-accumulation adder. Fourth, the computation speed of Figure 6 is $L_z$ times slower than that of Figure 5.

[Figure caption: bit-level decomposition: the DA approach]

The architectures in Figure 5 and Figure 6 are two extreme cases, with the maximal and minimal number of memory modules, respectively. In general, the number of ROM modules for different hardware-speed requirements is flexible. If we let the structure in Figure 6 represent a slice of a memory module, then if $D$ slices of ROM modules are used, we can set $j$ to be equal to $\frac{L_z}{D} \cdot j' + j''$, where $j' = 0, 1, \ldots, D-1$ and $j'' = 0, 1, \ldots, \frac{L_z}{D} - 1$. Substituting this factorization into (11) yields the following equation:

$$y_k = \sum_{j'=0}^{D-1} 2^{\frac{L_z}{D} \cdot j'} \left\{ \sum_{j''=0}^{\frac{L_z}{D}-1} 2^{j''} \left[ \vec{a}^T \vec{z}_k^{\,\frac{L_z}{D} \cdot j' + j''} \right] \right\}. \qquad (12)$$
[Figure caption: Bit-serial architecture results from ...]
An architecture that realizes (12) is illustrated in Figure 7. In this figure, each ROM module is used to realize one term $\vec{a}^T \vec{z}_k^{\,j}$; the modules are identical to those in Figure 5 and Figure 6. The terms in the bracket are realized time-serially by one slice, which consists of one ROM module and accumulator, as marked in Figure 7. The $D$ terms in the braces are realized by $D$ slices in parallel.
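The bit-level formulation can be sketched in Python as a single ROM table shared by $D$ slices. This is an illustrative model only, assuming unsigned inputs and that $D$ divides $L_z$; the tap values and function names are hypothetical. Setting $D = 1$ corresponds to the bit-serial case of Figure 6 and $D = L_z$ to the fully parallel case of Figure 5.

```python
def build_da_rom(taps):
    """ROM word at address addr is the sum of taps whose address bit is 1."""
    n = len(taps)
    return [sum(taps[i] for i in range(n) if (addr >> i) & 1)
            for addr in range(2 ** n)]


def bit_slice_address(window, j):
    """Collect bit j of every entry of the input window into one address."""
    addr = 0
    for i, z in enumerate(window):
        addr |= ((z >> j) & 1) << i
    return addr


def da_fir_output(rom, window, Lz, D):
    """Evaluate one output with D slices, each working Lz // D cycles."""
    assert Lz % D == 0, "this sketch assumes D divides Lz"
    cycles = Lz // D
    accumulators = [0] * D
    for jpp in range(cycles):        # j'': serial steps inside a slice
        for jp in range(D):          # j': slices that run in parallel in hardware
            j = cycles * jp + jpp
            accumulators[jp] += rom[bit_slice_address(window, j)] << jpp
    # Weighted summation of the D slice results.
    return sum(acc << (cycles * jp) for jp, acc in enumerate(accumulators))


taps = [3, -1, 4, 2]          # hypothetical tap weights
window = [200, 17, 255, 0]    # z_k, z_{k-1}, z_{k-2}, z_{k-3}, unsigned 8-bit
rom = build_da_rom(taps)
outputs = {D: da_fir_output(rom, window, 8, D) for D in (1, 2, 4, 8)}
```

Every slice reads the same table, so changing $D$ trades the number of slices against the number of cycles per output without changing the ROM contents.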
[Figure caption: Digit-serial architecture]
[...] we shall adopt it to design a parameterized MBA.
ARCHITECTURE DESIGN
In this section, we shall use the results of the last section as a basis for considering three issues involved in the architecture design of MBAs. In the first of the three subsections that follow, a way of further reducing the memory size through a memory partition scheme is presented. In the second subsection, the implementation of additions and the arrangement of the addition sequence are addressed, so that MBAs with low hardware cost and high pipelinability can be designed. In the final subsection, our results concerning algorithm formulation and architecture design are integrated to develop a parameterized MBA that can be easily tuned to various hardware-speed requirements.

In the discussions above, the number of submodules $K$ is assumed to be a factor of filter length $N$ such that $\frac{N}{K}$ is an integer. In general, $K$ is not necessarily a factor of $N$. In cases where it is not, the ROM sizes of the submodules will not be equal. However, since a larger ROM size will result in a longer table lookup time and the ROM size is exponentially proportional to the number of input addressing lines, the memory size of the submodules should be as uniform as possible. If we take $N = 18$ and $K = 5$ as an example, to achieve higher speed and lower cost, the $N$ lines should be partitioned into five sets with line numbers of $(3, 3, 4, 4, 4)$ instead of $(3, 3, 3, 3, 6)$ or $(3, 3, 3, 4, 5)$. In other words, there [...]

[...] Figure 9. Such an arrangement would reduce the hardware cost in two ways. First, the wordlength of the hardware implementing the direct summation and weighted summation can be reduced, because the output wordlength increases with the number of accumulations. Second, the number of accumulators is reduced from $D$ to one. In Figure 9, aside from the implementation of the ROMs, the implementation of the adders is most critical in determining hardware cost and speed. In this paper, carry-save adders (CSAs) [7] are adopted as the basic addition unit.
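The uniform partitioning rule for the case where $K$ does not divide $N$ can be sketched as follows. The function names are illustrative assumptions; the example values are the $N = 18$, $K = 5$ case discussed above, and the word counts ignore the output wordlength.

```python
def partition_address_lines(N, K):
    """Split N addressing lines into K sets whose sizes differ by at most one."""
    base, extra = divmod(N, K)
    return [base] * (K - extra) + [base + 1] * extra


def total_rom_words(line_counts):
    """Total words over all submodules: each needs 2**n words for n lines."""
    return sum(2 ** n for n in line_counts)


uniform = partition_address_lines(18, 5)
words_uniform = total_rom_words(uniform)
words_skewed_a = total_rom_words([3, 3, 3, 3, 6])
words_skewed_b = total_rom_words([3, 3, 3, 4, 5])
```

Because the ROM size grows exponentially with the number of lines, the even split gives the smallest total and also the smallest largest submodule, which bounds the lookup time.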
It is known that managing the carry propagation in an adder dominates the hardware and speed cost of the adder. There are basically three types of adders [7]: carry-ripple adders, carry-look-ahead adders and carry-save adders. Carry-ripple adders propagate the carry stage-by-stage and hence consume much time to complete an addition. On the other hand, carry-look-ahead adders implement the carry propagation through extra combinational circuits and hence achieve high speed at a high hardware cost. Instead of giving one addition result, as carry-ripple adders and carry-look-ahead adders do, CSAs [7, 8] avoid the carry propagation by giving two partial results, a method that makes them efficient in both hardware and speed cost. For the addition of two operands, CSAs cannot be applied, because the two partial results have to be summed together to get the final result. Since the additions in Figure 9 have multiple operands, they can be implemented by a series of CSAs, and only one adder in the final stage is needed to add the two partial results of the CSAs for the final result, as shown in Figure 10. Also, since one slice in this figure is activated every $\frac{L_z}{D}$ cycles to produce one output datum, the speed requirement of this final addition is not very critical when $L_z$ is greater than $D$. Therefore, low-speed and low-cost adders like carry-ripple adders may be used in this case, as illustrated in Figure 10. In Figure 10, accumulations are also implemented by two CSAs. Thus, the clock cycle time is limited by the delay time of the two CSAs within this accumulation, which is independent of the wordlength of the operands.
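The carry-save principle can be illustrated with a small bit-level model in Python, assuming non-negative integer operands. Each CSA stage reduces three operands to a sum word and a carry word without propagating carries, and a single carry-propagate addition is performed at the end. The function names are illustrative.

```python
def carry_save_add(x, y, z):
    """One CSA stage: three operands in, (sum word, carry word) out."""
    partial_sum = x ^ y ^ z                       # bitwise sum without carries
    carry = ((x & y) | (x & z) | (y & z)) << 1    # carries, shifted to their weight
    return partial_sum, carry


def sum_operands(operands):
    """Accumulate many operands through CSA stages, then one final addition."""
    s, c = 0, 0
    for op in operands:
        s, c = carry_save_add(s, c, op)
    return s + c   # the only carry-propagate addition


total = sum_operands([13, 7, 250, 99])
```

Keeping the running result as two words is what makes the per-stage delay independent of the operand wordlength.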
In Figure 10, the sequence in each slice is arranged to take pipelining into consideration. It can be checked from [...] According to the delay transfer rules in [9], the same number of delay elements can be moved from all the inbound edges to the outbound edges of a cut line without modifying the system's behavior. Consider the two data flow arrangements in Figure 11. The cut line applied on Figure 11(a) leads to the insertion of delay elements for all the edges that cross the line. Thus, one extra delay is introduced. In Figure 11(b), the data flow directions for the tapped input, $z$, and the accumulation are arranged contrariwise, so that the delay on the inbound edge (tapped $z$) is transferred to the outbound edge (accumulation). Therefore, no extra delay time is introduced to obtain pipelining. Thus, the contra data flow is more efficient and so it is adopted in Figure 10.
Since the wordlength of the input data is 1 bit while that of the accumulation data is $2 \cdot L_y$, an extra $(2 \cdot L_y - 1)$ delay elements will be introduced for each cut-set. Also, the weighted summation sequence should be pipelined to meet the desired sampling period $T_s$. Since there are $2 \cdot D$ stages of CSAs, the CSAs should be pipelined into $L$ stages to meet the following constraint:

[Figure 11: Dataflow arrangement and cut-set]

Thus, the weighted summation is pipelined into [...] stages. In Figure 10, for generality and to obtain flexibility in pipelining, the addition sequence is evenly cut into $P$ stages. A higher value of $P$ leads to more pipeline stages and shorter cycle time, but more delay elements are needed.
The pipelining factor, P, will serve as a design parameter for our parameterized MBA.
IEEE Transactions on Consumer Electronics, Vol. 39, No. 3, AUGUST 1993
The Parameterized MBA
On the basis of the foregoing discussion, we now propose the parameterized MBA shown in Figure 10. This parameterized MBA is composed of $D$ slices, each of which processes one bit of input data $z$ in one clock cycle. Thus, $D$ bits are processed at a time and two partial results are obtained from each slice in each cycle. These partial results are produced from the $K$ partitioned ROM submodules and summed through $(K - 2)$ stages of CSAs. The $(K - 2)$ stages of CSAs are pipelined into $P$ stages, as depicted in the shaded rectangle in Figure 10. These partial results are weighted-summed through a series of CSAs as depicted at the bottom of Figure 10. Since there are two partial results produced in each slice, two stages of CSAs are used for each slice in the series. As for the accumulating section, two stages of CSAs are applied to perform the accumulation efficiently. Finally, the final carry-ripple addition on the lower left of Figure 10, as discussed above, is activated every $\frac{L_z}{D}$ clock cycles.
To sum up, the MBA is characterized by three parameters: the digit size, D; the number of partitioned memory submodules, K ; and the number of pipeline stages, P. Various configurations of MBAs can be obtained by substituting different values into the three parameters.
HARDWARE-SPEED EVALUATION
In this section, the hardware-speed cost of the MBA introduced above is formulated based on the three parameters and the hardware-speed cost of the basic cells. Through the formulae presented here, the hardware and speed cost of the parameterized MBA can be numerically described in a way that takes into account the architecture structures and implementation technology. Also, a CAD tool is developed in Section 4.3 to search for the values of the parameters so that the hardware cost is minimized for a particular speed specification.
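The search over the design parameters can be sketched as an exhaustive enumeration in Python. This is only an illustration of the search idea: the evaluation formulae of this paper are not reproduced here, so `toy_cost` and `toy_period` are hypothetical placeholder models with made-up coefficients, and only the scaling of the sample period by $L_z / D$ follows the text.

```python
from itertools import product


def search_configuration(cost_fn, period_fn, d_values, k_values, p_values,
                         target_period):
    """Return (cost, (D, K, P)) of the cheapest configuration meeting the speed."""
    best = None
    for D, K, P in product(d_values, k_values, p_values):
        if period_fn(D, K, P) > target_period:
            continue   # too slow for the specification
        cost = cost_fn(D, K, P)
        if best is None or cost < best[0]:
            best = (cost, (D, K, P))
    return best


N, LZ = 16, 8   # hypothetical filter length and input wordlength


def toy_cost(D, K, P):
    """Placeholder cost model: ROM words, adders and delay elements."""
    lines = -(-N // K)   # ceiling of N / K addressing lines per submodule
    return 0.1 * D * K * 2 ** lines + 10 * D * K + 5 * D * P


def toy_period(D, K, P):
    """Placeholder speed model: (Lz / D) cycles times a made-up cycle time."""
    lines = -(-N // K)
    cycle_time = 1.0 * lines + 2.0 * K / P
    return (LZ / D) * cycle_time


best = search_configuration(toy_cost, toy_period,
                            [1, 2, 4, 8], range(1, 9), range(1, 5),
                            target_period=40.0)
```

In practice the two model functions would be replaced by formulae built from the cell library parameters of the target technology.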
The three basic elements in the parameterized MBA are the ROM, full adders, and delay elements. The delay time and hardware cost of the three elements can be found from the cell library that will be used to implement the MBAs. Here, they are represented as the parameters tabulated in Table 1. The terms $T_{mem}(n, W_O)$ and $C_{mem}(n, W_O)$ in this table mean that the time and cost of the memory are determined by the number of addressing lines and the output wordlength. These terms will be used to develop the hardware-speed formulae in this section.
Cost Evaluation
To evaluate the cost of the parameterized MBA, we shall discuss the costs of the three basic components individually. We begin with the cost of the memory modules. In [...] Thus, the total cost of the delay elements becomes
The total hardware cost of the parameterized MBA is formulated as follows:
Speed Evaluation
Since each data sample is processed in parallel within $D$ slices and in serial within each slice, the sample period can be estimated as the time for the serial processing, i.e., $\frac{L_z}{D}$ [...]

[Figure 12: ... different desired sample periods]

[...] the logic gates. To inspect how the memory cost factor $\alpha$ in Table 1 affects the design parameters, the optimal values are plotted with respect to the cost factor in Figure 13. The memory cost factor (ranging from 0.01 to 1 in the figure) is defined as the ratio of the hardware cost of one memory cell to that of a two-input NAND gate. As shown in the figure, the memory cost factor mostly affects the number of partitions and pipeline stages. A higher memory cost factor results in finer partitions of the memory and more pipeline stages. These curves are drawn under the same specification as that in Table 3 [...]

[...] of the design space and schemes for efficient memory replacement, algorithm formulation, architecture design, and evaluation method. Various schemes and design considerations were integrated to produce a parameterized MBA that can easily be tuned to various hardware-speed requirements. This parameterized MBA is characterized by three design parameters. Differently configured MBAs result from specifying different values for these parameters. Also, hardware-speed evaluation formulae were established based on the required elements in MBAs. These elements include ROMs, adders and shift registers. These formulae and a cell library of a target technology can be used to design an optimally configured MBA by searching for the best values of the design parameters with the aid of a computer. The research results can also be extended to IIR and multidimensional filters by decomposing multidimensional filtering into the summation of multiple inner products. Moreover, as we have shown in [13, 14, 15], transformations like the Discrete Fourier Transform, Discrete Cosine Transform, and Discrete Sine Transform can be formulated in a convolution form.
Since the formulation of convolution is the same as that of FIR filters, the speed-specified design method presented in this paper can also be applied to realize such transformations.
[Figure caption: Optimized parameter values for ...]
APPENDIX: TWO'S COMPLEMENT CONSIDERATIONS
In (11), $\vec{z}$ is represented in an unsigned binary form. That is, if $L_z$ is eight, the dynamic range of $z_k$ is from 0 to $2^8 - 1 = 255$. However, in many applications, two's complement is [...] Therefore, each entry of $\vec{z} + 2^{L_z-1} \cdot \vec{1}$ is an unsigned binary number, such that the value of $f(\vec{z} + 2^{L_z-1} \cdot \vec{1})$ can be evaluated as discussed in this paper. As for the second term of (18), since it is a constant, it can be evaluated in advance. This process for dealing with two's complement with linear systems that have only unsigned computations is depicted in Figure 14.
If we investigate further, we find that the offsetting of [...]
Therefore, the first term in (19) corresponds to only the inversion of the MSB of $z_j$.
Another overhead in (18) is the subtraction of $f(\vec{1}) \cdot 2^{L_z-1}$. However, since this term is a constant for all $z$'s, it can be preloaded in the accumulator of our architecture instead of being subtracted by an extra subtractor.
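The two's complement handling can be checked with a short Python sketch, assuming the linear function $f(\vec{z}) = \vec{a}^T \vec{z}$ and hypothetical tap and input values. Each input is offset by $2^{L_z-1}$ by inverting its MSB, and the constant $f(\vec{1}) \cdot 2^{L_z-1}$ is preloaded into the accumulator with a negative sign. The plain multiplication inside the loop stands in for the unsigned memory-based evaluation.

```python
def to_offset_binary(z, Lz):
    """Map a two's complement value to z + 2**(Lz-1) by inverting the MSB."""
    pattern = z & ((1 << Lz) - 1)       # Lz-bit two's complement bit pattern
    return pattern ^ (1 << (Lz - 1))    # MSB inversion


def fir_twos_complement(taps, window, Lz):
    """Evaluate a^T z for signed inputs using only unsigned operands."""
    acc = -(1 << (Lz - 1)) * sum(taps)  # preloaded constant -f(1) * 2**(Lz-1)
    for a_i, z in zip(taps, window):
        acc += a_i * to_offset_binary(z, Lz)   # unsigned evaluation
    return acc


taps = [3, -1, 4, 2]            # hypothetical tap weights
window = [-128, 17, -1, 127]    # signed 8-bit inputs
y = fir_twos_complement(taps, window, 8)
```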
