Abstract - This paper proposes a new real-time 2-D convolver filter chip that does not use any parallel multiplier. The proposed chip uses only one special shift-and-accumulation block instead of nine multipliers. Hence, the chip size is reduced by more than 70% compared with commercial 2-D convolver chips. Moreover, the proposed chip does not require the row buffers that commercial chips employ to control the input data sequence. We implemented the filter chip using the 0.8 µm SOG cell library (KG60K). The filter chip consists of only 3,893 gates, operates at 125 MHz, and can meet the real-time image processing requirement, i.e., the CCIR601 standard.
The amount of real-time image data is massive, and thus commercial image filter chips have many multipliers to meet the real-time requirement [1], that is, the CCIR601 standard (MPEG-2): 720 x 480 pixels per frame and 30 frames per second (10.4 Mpixels/second) [2]. Since multipliers occupy a large VLSI area [1, 3, 4], several convolution filter architectures based on the powers-of-two coefficients algorithm have been proposed. For dedicated applications, convolution filter chips with hardwired shifters are particularly efficient. For programmable convolution filter chips, it is difficult to improve both the data rate and the hardware cost efficiency. Several architectures have been introduced to optimize the data rate performance and the hardware cost. However, these architectures perform shift-and-add operations in all N taps and require (N-1) adders to sum the results of the N shift-and-accumulation blocks. This paper proposes a new VLSI architecture for a 2-D convolution filter chip that uses only one shift-and-accumulation block.
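The real-time requirement quoted above is simple arithmetic, and it can be checked against the 125 MHz clock stated in the abstract. The sketch below is only a back-of-the-envelope check: the function names are hypothetical, and the figure of 8 clock cycles per output pixel is an assumption taken from the bit-serial operation described later in the paper.

```python
def required_pixel_rate(width, height, frames_per_second):
    """Pixels per second that must be filtered for a given video format."""
    return width * height * frames_per_second


def achievable_pixel_rate(clock_hz, cycles_per_pixel):
    """Pixels per second if one output pixel is finished every cycles_per_pixel clocks."""
    return clock_hz / cycles_per_pixel


# 720 x 480 pixels per frame at 30 frames per second, as stated in the text.
required = required_pixel_rate(720, 480, 30)

# Assumption: 125 MHz clock and 8 clock cycles per output pixel.
achievable = achievable_pixel_rate(125e6, 8)

print(f"required:   {required / 1e6:.1f} Mpixels/second")
print(f"achievable: {achievable / 1e6:.1f} Mpixels/second")
```

Under these assumptions the achievable rate exceeds the required rate, which is consistent with the claim that the chip meets the real-time requirement.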
The remainder of the paper is organized as follows.
Section II introduces the modified algorithm used for the computation unit in the proposed filter chip. Section III presents the architecture details of the filter. In Section IV, we compare its performance with that of other filter chips. Finally, Section V contains concluding remarks.
II. THE MODIFIED CONVOLUTION ALGORITHM FOR THE FILTER CHIP
This section describes the modified algebraic algorithm used in the proposed filter chip. Assume that the mask size of the convolution filter is 3 x 3, the mask coefficients are H(i,j) for i, j = 0, 1, 2, the input sequence is F(x,y), and the output sequence is G(x,y). Then the relationship between the input and output sequences is represented by equation (1):
$$G(x,y) = \sum_{i=0}^{2}\sum_{j=0}^{2} H(i,j)\,F(x+i-1,\,y+j-1) \qquad (1)$$
where x and y indicate a pixel position in the image data.
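As a reference point for the direct form, a 3 x 3 filter can be sketched in software as nine multiply-accumulates per output pixel. This is a minimal sketch, not the chip's data path: the function name `convolve3x3` is hypothetical, and the centred window and the zero padding at the image borders are assumptions that the text does not specify.

```python
def convolve3x3(image, mask):
    """Direct 3 x 3 filtering: nine multiply-accumulates per output pixel.

    Assumptions (not from the source): the window is centred on (x, y)
    and pixels outside the image are treated as zero.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            acc = 0
            for i in range(3):
                for j in range(3):
                    xi, yj = x + i - 1, y + j - 1
                    # Skip taps that fall outside the image (zero padding).
                    if 0 <= xi < rows and 0 <= yj < cols:
                        acc += mask[i][j] * image[xi][yj]
            out[x][y] = acc
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box_mask = [[1, 1, 1],
            [1, 1, 1],
            [1, 1, 1]]
result = convolve3x3(image, box_mask)
```

Each inner-loop product corresponds to one multiplier in a direct hardware implementation, and the running sum corresponds to the adder tree.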
This work was supported in part by the IDEC (IC Design Education Center) and in part by the KOSEF under the grant No. 951-0915-124-2.
The direct implementation of equation (1) requires nine 8-bit x 8-bit multipliers and a 16-bit tree adder (eight 16-bit adders), as in the HSP48901 and HSP48908. The multipliers produce nine products and the 16-bit tree adder sums them. In general, the area of an 8-bit x 8-bit multiplier is about 8 times larger than that of an 8-bit adder. Since nine multipliers require a large VLSI area, the convolution algorithm is modified to avoid multipliers and save area. If the coefficients are 8-bit binary numbers, then we can convert equation (1) to equation (2) [8-9]. The result inside the braces in equation (2) is 16-bit, but the result inside the braces in equation (3) is 8-bit. Therefore, the modified equation (3) ...
Hence, the RUs make 72 partial products (9 partial products per clock cycle). The CU performs a summation of nine partial products in each clock cycle and eight shift-and-accumulations of the summation results in 8 clock cycles. The proposed architecture is reconfigurable, so it can be used both as a 2-dimensional filter with a 3 x 3 mask and as a 1-dimensional filter with 9 taps. Hence, the architecture can be used for a filter, a convolver, a correlator, etc.
Each RU consists of an 8-bit data register, an 8-bit coefficient register, and eight logical AND gates. These gates multiply the 8-bit data by a 1-bit coefficient. The value in the data register is transferred to the next RU after 8 clock cycles. The coefficient bits are rotated left at each clock cycle. When the MSB of the coefficient register is 1, the output of the AND gates is the pixel data itself. Similarly, when the MSB of the coefficient register is 0, the output of the AND gates is all zeros, that is, a zero partial product. Hence, each RU produces eight 8-bit partial products of a pixel and a coefficient sequentially in 8 clock cycles without using any parallel multiplier.
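The reordering that removes the multipliers can be written out explicitly. The following is a hedged reconstruction of the idea from the surrounding description, not a copy of the paper's equations (2) and (3). The notation is an assumption: $F_{ij}$ is shorthand for the input pixel that multiplies $H(i,j)$ in equation (1), and $h_k(i,j) \in \{0,1\}$ is bit $k$ of the unsigned 8-bit coefficient $H(i,j)$.

$$H(i,j) = \sum_{k=0}^{7} h_k(i,j)\,2^{k}$$

Substituting this into the 3 x 3 sum expands each product bit by bit, so each braced term is a full 16-bit product:

$$G(x,y) = \sum_{i=0}^{2}\sum_{j=0}^{2} \Big\{ \sum_{k=0}^{7} h_k(i,j)\,2^{k}\,F_{ij} \Big\}$$

Exchanging the order of summation moves the weighting by $2^{k}$ outside the sum over the mask, so each term inside the braces is either an 8-bit pixel or zero:

$$G(x,y) = \sum_{k=0}^{7} 2^{k} \Big\{ \sum_{i=0}^{2}\sum_{j=0}^{2} h_k(i,j)\,F_{ij} \Big\}$$

In this form the inner sum needs only AND gates and narrow adders, and the outer sum needs a single shift-and-accumulate step per coefficient bit.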
The CU consists of an 8-bit tree adder and an SA. The 8-bit tree adder, composed of eight group CLAs (carry look-ahead adders), is pipelined and sums the nine partial products from the nine RUs. The SA adds the 8-bit value from the tree adder to the accumulator value shifted left by 1 bit. In 8 clock cycles, the SA performs 8 accumulations. Since the CU performs the shift-and-accumulation after summing the nine partial products, the operand size of the adders in the tree adder is reduced from 16-bit to 8-bit and the number of SAs is reduced from 9 to 1 compared with previous architectures [1, 8, 9]. Hence, we can dramatically reduce the VLSI area.
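The interplay of the nine RUs, the tree adder, and the single SA can be illustrated with a small behavioural model. This is a sketch under stated assumptions rather than the chip's actual design: the function names are hypothetical, coefficients and pixels are treated as unsigned 8-bit values, and Python integers stand in for the hardware registers, so pipelining and bit widths are not modelled.

```python
def ru_partial_product(pixel, coeff, cycle):
    """One RU in one clock cycle: AND the pixel with the coefficient's current MSB.

    The coefficient register rotates left every cycle, so in cycle k
    (k = 0..7) the MSB position holds original coefficient bit 7 - k.
    """
    bit = (coeff >> (7 - cycle)) & 1
    return pixel if bit else 0


def cu_shift_accumulate(pixels, coeffs):
    """CU model: per cycle, sum nine partial products, then shift-and-accumulate."""
    assert len(pixels) == len(coeffs) == 9
    acc = 0
    for cycle in range(8):
        # Tree adder: sum of the nine partial products from the nine RUs.
        tree_sum = sum(ru_partial_product(p, c, cycle)
                       for p, c in zip(pixels, coeffs))
        # SA: add the tree-adder output to the accumulator shifted left by 1 bit.
        acc = (acc << 1) + tree_sum
    return acc


pixels = [10, 20, 30, 40, 50, 60, 70, 80, 90]
coeffs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
bit_serial = cu_shift_accumulate(pixels, coeffs)
direct = sum(p * c for p, c in zip(pixels, coeffs))
```

After 8 cycles the accumulator holds the same value as nine parallel multiplications followed by a sum, while the model uses only AND-style selection, additions, and 1-bit shifts.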
In 2-D convolution filtering, the previous row and the next row are required to calculate the current row. Hence, the HSP48908 [1] has an on-chip row buffer that can store two rows (2 x 1024 pixels), and the HSP48901 [1] requires an off-chip row buffer. However, the on-chip or off-chip row buffer occupies a large VLSI area. To eliminate the row buffer, the proposed filter introduces a new scheme called the ICU (Input Control Unit), which is a simple Finite State Machine (FSM) that feeds the input sequence correctly.
IV. IMPLEMENTATION AND PERFORMANCE EVALUATION
The proposed architecture has been simulated using VHDL models, and logic synthesis was performed using the Synopsys™ Design Analyzer. We used the Samsung™ SOG cell library (KG60K) and fully verified the design through functional and timing simulations. The implemented filter chip consists of 3,893 gates and has 47 signal pins and 17 power pins. The proposed filter chip can easily be expanded for larger filters. This scalability has been verified with the gate-level model. In addition, the proposed architecture can significantly reduce the gate count compared with the other architectures [1, 8, 9]. Table 1 shows comparisons of the HSP48901 [1], the existing architecture [9], and the proposed chip. The HSP48901 has nine multipliers and its total gate count is 13,594 [1]. The existing architectures [8, 9] have nine SAs and their estimated total gate count is about 7,500. Hence, the proposed architecture requires only about 30% of the gate count of the HSP48901 and about 50% of the gate count of the existing architecture [9].
V. CONCLUSIONS
This paper has proposed a new VLSI architecture for a convolution filter and presented its chip design and actual implementation. We have modified the convolution algorithm and derived an architecture that has only one shift-and-accumulation block. The architecture can reduce the gate count by more than 70% compared with commercial filter chips that have multipliers. In addition, compared with the previous architectures [5-7], the proposed filter architecture can reduce the size of the tree adder from 16-bit to 8-bit and the number of SAs from 9 to 1. The implemented filter chip can operate at 125 MHz and can process real-time image data. In addition, this filter may have good scalability for larger filters and higher-speed applications.
