Abstract-In conventional memory-based multiplication, the multiplier is replaced by a read-only memory (ROM). Since the memory size increases exponentially with the input length, this paper proposes a modified hardware-efficient approach for memoryless-based multiplication. The very large scale integration (VLSI) measure indicates that the proposed approach involves less hardware complexity than the existing one. The proposed approach is then applied to the finite impulse response (FIR) filter. It is observed that the proposed memoryless-based multiplication can be decomposed into a number of small units. Thus, we present the design optimization of one-dimensional (1-D) and two-dimensional (2-D) fully systolic arrays for area-delay-efficient implementation of FIR filters using the proposed memoryless-based multiplier. For FIR filters of different orders, the systolic designs are synthesized by Synopsys Design Compiler along with the FIR filter using the existing lookup table (LUT)-based multiplier. The key metric, namely the area-delay product, is estimated for different filter orders. Analysis of the results indicates that the proposed 2-D structure involves significantly less area-time complexity than the FIR filter using the existing LUT-based multiplier. Besides, the 2-D systolic array offers a fixed cycle period for each processing element (PE), and is therefore well suited to filter implementation with large filter order.
I. INTRODUCTION
Finite impulse response (FIR) digital filters are common components in many digital signal processing (DSP) systems [1], [2]. Since the complexity of implementation grows with the filter order, the design of hardware-efficient, high-throughput FIR filters has become much more demanding. In conventional designs, however, the multipliers occupy a large portion of the chip area, and the delay of the structure is large because of the time required for multiplication.
Multiplierless memory-based techniques [3]-[10] have been widely used in many applications in recent years because of their high-throughput processing and cost-effective structures. Generally, there are two basic memory-based techniques for multiplication [3]. One is the direct read-only memory (ROM)-based implementation of multiplications, while the other is based on distributed arithmetic (DA) for inner-product computation. For the direct ROM-based technique, a new approach to LUT implementation for memory-based multiplication was recently proposed [11].
Though the approach proposed in [11] is efficient in implementation, it can be improved further. For example, the architecture contains an address encoder and a control circuit [11, Fig. 2], which may increase the chip area. Therefore, if the memory-based technique can be improved, a more hardware-efficient structure may be obtained. In this paper, we present a new memoryless-based technique to replace the conventional direct ROM for hardware-efficient implementation.
In FIR filtering, one of the convolving sequences consists of the input samples while the other consists of the fixed filter coefficients. This property makes the FIR filter amenable to memory-based multiplication. Such a realization yields faster output than multiplier-based designs because the pre-computed results are stored in memory units, from which they can be read out and accumulated to obtain the result. The conventional memory-based structure for the FIR filter was first introduced in a 1993 paper by Lee et al. [12]. However, it is not suitable for implementing the FIR filter in systolic hardware, since the products available from the memory elements are summed together by a network of adders. Systolic designs, in contrast, offer efficient area-time implementation, supported by features such as modularity and regularity of the structure [2]. They also have the potential for low-latency implementation, since all the processing elements (PEs) in the systolic array are fully pipelined. In this paper, we present the design optimization of one- and two-dimensional fully systolic structures for FIR filter implementation using the proposed memoryless-based multiplication.
The rest of the paper is organized as follows. The new memoryless-based technique is described in Section II. Section III derives the proposed structures for FIR filter implementation. The hardware and time complexities are compared in Section IV. The conclusion is presented in Section V.
II. MEMORYLESS-BASED TECHNIQUE
In this section, we briefly outline the conventional and the existing direct-ROM-based techniques and discuss the design of new memoryless-based cells to be used in the FIR filter.
A. The Conventional and The Existing Direct-ROM-Based Techniques
According to the approach suggested in [12], each of the multiplication nodes of a structure performs multiplications with a fixed coefficient and can therefore be replaced by a ROM which stores the products for all possible input values, as shown in Fig. 1. However, the memory size increases exponentially with the input length.
A new approach to LUT implementation for memory-based multiplication was recently proposed in [11], where the memory size is reduced to half at the cost of some increase in combinational circuit complexity.
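The exponential growth of the conventional ROM can be illustrated with a small behavioural sketch. It assumes unsigned input words, and the names `build_rom` and `rom_multiply` as well as the coefficient value are hypothetical, chosen only for illustration; it is not the circuit of [12].

```python
def build_rom(coefficient, input_bits):
    """Precompute coefficient * x for every possible unsigned input word x."""
    return [coefficient * x for x in range(1 << input_bits)]


def rom_multiply(rom, x):
    """Multiplication reduces to one table read addressed by the input word."""
    return rom[x]


if __name__ == "__main__":
    M = 13  # hypothetical fixed coefficient
    for bits in (2, 4, 8):
        rom = build_rom(M, bits)
        # The table holds 2**bits words, so it doubles with each extra input bit.
        print(bits, len(rom))
```

Each additional input bit doubles the number of stored words, which is the cost that motivates the alternatives discussed in this section.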
B. Proposed Design
In the conventional memory-based technique, the ROM stores the products for all possible input values. Here, we extend this idea further to obtain a memoryless-based implementation.
The principle of the proposed memoryless-based implementation of multiplication is shown in Figs. 2, 3 and 4. Assuming that the coefficient is M and the input word length is 2, the values stored in the ROM are 0, M, 2M and 3M, as shown in Fig. 2. Since 3M can be replaced by 2M + M, the ROM can be replaced by two 2×1 MUXs and an adder, where the fixed coefficients of the MUXs are 2M and M, respectively. For an input word length of 3, there are 8 possible values stored in the conventional ROM. Likewise, 3M, 5M, 6M and 7M can each be expressed as a sum of M, 2M and 4M, as shown in Fig. 3. Thus, the ROM can be replaced by three MUXs and two adders, where the fixed coefficients of the MUXs are M, 2M and 4M, respectively. For a large input length, the traditional memory can be replaced by a number of MUXs and adders, where the fixed coefficient of the nth MUX is 2^(n-1)M, for 1≤n≤L, and L is the input word length. Each MUX consists of W bit-level MUX cells working in parallel, where W is the word length of the coefficients. This memoryless technique has two major features. First, it suits any input word length. Second, the whole structure can be decomposed into a number of small units, which can be extended further to obtain a high-throughput structure for FIR filter implementation.
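As a minimal behavioural sketch of this principle, the function that follows models one 2×1 MUX per input bit, each selecting either 0 or a hard-wired multiple of M, followed by adders. It assumes unsigned input words; the function name and the sequential accumulation order are illustrative assumptions and do not describe the bit-level circuit of the figures.

```python
def memoryless_multiply(coefficient, x, input_bits):
    """Compute coefficient * x using one 2x1 MUX per input bit plus adders."""
    total = 0
    for n in range(1, input_bits + 1):
        bit = (x >> (n - 1)) & 1           # nth input bit drives the nth MUX
        fixed = coefficient << (n - 1)     # hard-wired constant 2**(n-1) * M
        mux_out = fixed if bit else 0      # 2x1 MUX: select 0 or the constant
        total += mux_out                   # adder stage
    return total


if __name__ == "__main__":
    M = 13  # hypothetical fixed coefficient
    # For a 3-bit input, 5M is formed as 4M + M and 7M as 4M + 2M + M.
    print(memoryless_multiply(M, 5, 3), memoryless_multiply(M, 7, 3))
```

No table is stored: the only constants are the L shifted copies of M, so the hardware grows linearly with the input word length rather than exponentially.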
III. PROPOSED STRUCTURE
In Section II, the proposed memoryless-based technique was introduced. In this section, we derive the proposed structures for the FIR filter from the dependence graph (DG).
The N-tap FIR filter can be given by equation (1):
$$ y(n) = \sum_{k=0}^{N-1} f(n-k)\,x(k) \tag{1} $$
where f(n-k), for k = 0, 1, …, N-1, are the filter coefficients and x(k), for k = 0, 1, …, N-1, are the inputs of the filter.
Following equation (1), we may derive the structure for the FIR filter as shown in Fig. 4, for n = 0, 1, …, N-1.
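For reference, a direct evaluation of the filter output can be sketched as follows. The sketch assumes that equation (1) is the usual convolution of the coefficient sequence with the input sequence and uses the equivalent causal indexing in which tap k multiplies the sample delayed by k; the function name `fir_direct` is hypothetical.

```python
def fir_direct(coeffs, samples):
    """Direct FIR evaluation: out[n] = sum over k of coeffs[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:                 # samples before time 0 are taken as zero
                acc += c * samples[n - k]
        out.append(acc)
    return out


if __name__ == "__main__":
    print(fir_direct([3, 5, 7], [1, 2, 3, 4]))
```

Such a reference model is useful mainly as a check against which pipelined or bit-level structures can be compared.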
A. Proposed 1-D Systolic Array for FIR Filter
According to the approach suggested in [12], each of the multiplication nodes of the transposed structure performs multiplications with a fixed coefficient and can therefore be replaced by a ROM which stores the products for all possible input values, as shown in Fig. 1. Each node uses the input bits as the address for the proposed memoryless-based multiplier (MM) and reads the content at the location specified by that address. The value read from the MM is then added to the input available from its left, and the sum is passed to the node on its right. The only difference between node A and node B is that node A has no input from its left. The linear array consisting of N PEs, which is derived from the DG, is shown in Fig. 6.
The input sequence {x(n)} is fed to a serial-in parallel-out input register, whose content is serially right-shifted by one position and transferred in parallel to the PEs in every cycle. In every cycle period, all bits of an input sample are fed to the PE at the same time. Besides, the input to each PE is staggered by one cycle period with respect to the preceding PE to meet the causality requirement. There are two types of PEs in the structure, PE(1) and PE(2), whose functions are described in Fig. 6(b) and (c), respectively. PE(1) consists of an MM. During a cycle period, PE(1) reads the content of its MM specified by the input bits and transfers the value read as output to its right. PE(2) consists of an MM and an adder. During a cycle period, each PE(2) reads the content of its MM, adds the value read to the input from its left, and transfers the sum as output to its right. The structure yields its first output N cycles after the first input is fed to the first PE, while successive outputs are available in every cycle.
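The data flow of the 1-D array can be sketched with a cycle-level simulation. This is one possible reading of the description, not the authors' design: it assumes unsigned samples, one output register per PE, and a total input skew of two cycles per PE (one from the shift register and one from the staggering), which is what makes the first output appear at the Nth cycle. The names `mm` and `systolic_fir_1d` are hypothetical.

```python
def mm(coefficient, x, input_bits):
    """Memoryless multiplier: sum the shifted constants selected by the input bits."""
    return sum(coefficient << b for b in range(input_bits) if (x >> b) & 1)


def systolic_fir_1d(coeffs, samples, input_bits):
    """Simulate N pipelined PEs; PE j adds its MM output to the sum from its left."""
    n_taps = len(coeffs)
    regs = [0] * n_taps                      # output register of each PE

    def x_at(t):
        return samples[t] if 0 <= t < len(samples) else 0

    outputs = []
    for t in range(len(samples) + n_taps - 1):
        new_regs = [0] * n_taps
        for j in range(n_taps):
            left = regs[j - 1] if j > 0 else 0      # PE(1) has no left input
            # Assumed skew: PE j sees the sample that entered 2*j cycles earlier.
            new_regs[j] = left + mm(coeffs[j], x_at(t - 2 * j), input_bits)
        regs = new_regs
        if t >= n_taps - 1:                  # first valid output at the Nth cycle
            outputs.append(regs[-1])
    return outputs


if __name__ == "__main__":
    print(systolic_fir_1d([3, 5, 7], [1, 2, 3, 4], 3))
```

After the initial N cycles the last PE delivers one output per cycle, which matches the latency and throughput stated for the 1-D structure.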
B. Proposed 2-D Systolic Design for FIR Filter
In the 1-D systolic structure, the time required by the MM is large because the MM contains a number of adders, especially when the input word length is large. Therefore, let us first consider the architecture of the MM contained in each PE. As shown in Fig. 6, for an input word length of 4, the duration of each cycle is T_C = T_M + 3T_A, where T_M and T_A are, respectively, the times required to perform a MUX-access operation and an addition. If we rearrange the MUXs in parallel, as shown in Fig. 7, the duration of each cycle becomes T_C = T_M + T_A. The resulting DG for the 2-D FIR filter is shown in Fig. 8. It consists of two parts. Part-I consists of L rows, where each row consists of N-1 instances of node B and one node A. The functions of node A and node B are depicted in Fig. 8(b) and (c), respectively. The bit x(n)_l, which is the lth bit of the input x(n), is fed to node A on the (l+1)th row and (n+1)th column. Node A performs the function of a MUX, using the input bit as the "address" to "read" the content out. As shown in Fig. 8(c), node B also uses the input bit as the "address" to "read" the content out; the value "read" from the MUX is then added to the input from its left, and the sum is passed to the node on its right. Part-II consists of L-1 instances of node C. Node C performs addition, and therefore part-II can be implemented by a pipelined adder-tree. For high-throughput computation of the FIR filter, each node A and node B of the DG of Fig. 8 can be replaced by a PE to obtain the 2-D systolic array shown in Fig. 9. Similarly, the whole structure can be decomposed into two parts. The first part consists of L rows, where each row consists of N-1 PE(2)s and one PE(1).
The input samples are fed to a bit-parallel word-serial converter which generates L bit streams of the input sequence, where each bit stream contains the corresponding bits of all the input words. The output of the bit-parallel word-serial converter is fed to the right-shift registers, as shown in Fig. 9(a). The functions of the PEs are shown in Fig. 9(b) and (c), respectively. During every cycle period, PE(2) "reads" the content of its MUX, adds it to the input from its left, and passes the sum to the PE on its right, while PE(1) only "reads" the content of its MUX and transfers it to the PE on its right. As in the 1-D array, the input to each PE is staggered by one cycle period with respect to the preceding PE to meet the causality requirement. The second part consists of a pipelined adder-tree. The 2-D structure yields its first output (log_2 L + N) cycles after the first input is fed to the first PE, while successive outputs become available in every cycle.
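The bit-level decomposition behind the 2-D array can be sketched behaviourally (not cycle-accurately). The sketch assumes unsigned samples and that the MUX constants in row l are the tap coefficients scaled by 2^l, in line with the fixed MUX coefficients of Section II; the function names are hypothetical.

```python
def bit_planes(samples, input_bits):
    """Split the word-serial input into input_bits bit streams, LSB first."""
    return [[(x >> l) & 1 for x in samples] for l in range(input_bits)]


def row_fir(coeffs, bits, weight_shift):
    """One row: each PE is a MUX (select 0 or scaled tap) followed by an adder."""
    out = []
    for n in range(len(bits)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += (c << weight_shift) if bits[n - k] else 0
        out.append(acc)
    return out


def adder_tree(values):
    """Sum the row outputs pairwise, as a (pipelined) adder tree would."""
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])        # odd element passes to the next level
        values = paired
    return values[0]


def fir_2d(coeffs, samples, input_bits):
    """Combine the L row outputs for every time index with the adder tree."""
    planes = bit_planes(samples, input_bits)
    rows = [row_fir(coeffs, plane, l) for l, plane in enumerate(planes)]
    return [adder_tree([row[n] for row in rows]) for n in range(len(samples))]


if __name__ == "__main__":
    print(fir_2d([3, 5, 7], [1, 2, 3, 4], 3))
```

Because every PE in a row contains only one MUX and one adder, the critical path per PE does not depend on the input word length; the word-length dependence moves into the depth of the adder tree.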
IV. COMPARISON AND DISCUSSION

A. Hardware and Time Complexity
In Section II, we proposed the modified memoryless design for multiplication. In Section III, we applied the proposed memoryless design to the FIR filter and obtained the design optimization of 1-D and 2-D systolic structures for FIR filter implementation. In this subsection, the hardware and time complexities of the proposed 1-D and 2-D systolic arrays for the FIR filter are presented.
The 1-D systolic array consists of N PEs, where each PE consists of an adder and an MM, except for the first one, which contains only an MM. The duration of each cycle is T_M + (log_2 L + 1)T_A, where T_M and T_A are the times required to perform a MUX-access operation and an addition in the PE, respectively. The structure yields its first output N cycles after the first input is fed to the first PE, so its latency is N cycles, and successive outputs are available in every cycle. The proposed 2-D structure has NL PEs arranged in L rows. Each PE of this structure consists of an adder and a MUX, except for the first one in each row, which contains only a MUX. The duration of each cycle of the 2-D systolic array is T_M + T_A. Because of the pipelined adder-tree, the latency of the 2-D structure is N + log_2 L cycles, which is slightly higher than that of the 1-D structure. However, the 2-D structure has a shorter cycle period, which means that its throughput is significantly higher than that of the 1-D systolic array. The number of MUXs, number of adders, duration of cycle, latency in cycles and throughput per cycle of the proposed 1-D and 2-D structures are listed in Table 1. The numbers of adders and MUXs in the proposed 1-D structure are the same as those in the 2-D structure, while the cycle duration of the 2-D structure is shorter. Therefore, the area-delay product of the 2-D structure is less than that of the 1-D structure.
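The expressions above can be collected into a small calculator. The cycle and latency formulas are taken from the text; the MUX and adder counts are derived under the assumption that an MM with L MUXs uses L-1 adders and that the adder tree uses L-1 adders, which is consistent with the stated equality of counts but is not copied from Table 1. The function name and the timing values are hypothetical.

```python
import math


def complexity(n_taps, input_bits, t_mux, t_add):
    """Return (one_d, two_d) dictionaries of counts, cycle time and latency."""
    log_l = math.log2(input_bits)
    one_d = {
        "muxes": n_taps * input_bits,
        # Assumed: L - 1 adders inside each MM, plus one adder per PE(2).
        "adders": n_taps * (input_bits - 1) + (n_taps - 1),
        "cycle": t_mux + (log_l + 1) * t_add,
        "latency_cycles": n_taps,
    }
    two_d = {
        "muxes": n_taps * input_bits,
        # One adder per PE(2) in each of the L rows, plus L - 1 in the adder tree.
        "adders": input_bits * (n_taps - 1) + (input_bits - 1),
        "cycle": t_mux + t_add,
        "latency_cycles": n_taps + log_l,
    }
    return one_d, two_d


if __name__ == "__main__":
    # Hypothetical timings in nanoseconds, for illustration only.
    one_d, two_d = complexity(n_taps=16, input_bits=8, t_mux=1.0, t_add=2.0)
    print(one_d)
    print(two_d)
```

With equal resource counts, the comparison between the two arrays reduces to the cycle period, which is where the 2-D arrangement gains.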
B. Comparison With the Existing Design
In [11], the author proposed a new approach to LUT implementation and accumulation for memory-based multiplication. Here, we apply this approach to the FIR filter. Since the approach of [11] does not have the decomposition feature of the proposed memoryless approach, we apply it only in the 1-D structure; it is not suitable for the 2-D structure. For a fair comparison, we faithfully re-implemented the approach of [11], with the input word length fixed at 8 and 16, respectively. We use the Synopsys DesignWare 0.18-μm TSMC library for 16-bit data width [13] to determine the area of adders of various sizes. We also obtained the area of the 2×1 MUXs from the TSMC 0.18-μm process 1.8-V SAGE-X standard cell library data-book [14]. The area-delay complexity of the proposed structures for L = 8 and L = 16 is plotted together with that of the FIR filter using the existing approach of [11] in Figs. 10 and 11, where the area and delay are measured in square micrometers and nanoseconds, respectively.
It is clear that the proposed 2-D structure has lower area-delay complexity than the other structures, for both L = 8 and L = 16. This is likely because the cycle duration of the 2-D structure remains the same, while in the other structures the cycle duration grows linearly with the input word length. In addition, the area requirement of the proposed memoryless design is less than that of the existing one. Together, these factors make the proposed 2-D structure more efficient in area-delay complexity. Because of its fixed cycle duration, the 2-D structure is well suited to FIR filter implementation with large filter order. The idea of bit-level memoryless design in the 2-D structure can also be extended to a number of other applications, such as the discrete cosine transform (DCT) and the discrete Fourier transform (DFT).
Figure 10. Comparison of area-delay product for L = 8, where [11] denotes the existing design.
V. CONCLUSION
A modified hardware-efficient approach for memoryless-based multiplication is proposed. The proposed approach has lower hardware complexity than the existing memory-based design for multiplication. The proposed approach is then applied to the FIR filter. Because the proposed approach can be decomposed into a number of small units, the design optimization of 1-D and 2-D systolic structures is proposed. The 2-D systolic structure is found to involve less area-delay complexity than the 1-D structure using the existing memory-based multiplier. Besides, unlike in the 1-D systolic array, the cycle duration of the 2-D structure does not grow linearly with the word length of the input samples. Thus, it can be readily used as an IP core in a number of environments, especially for high-order filters. Further work may address more efficient designs for multiplication.
