Decoding PSNR curves of sequence 5 with two bit-number allocation strategies.

We can also see that the selection of the parameters used in Section II is very effective.
V. CONCLUSIONS
The bit-number allocation strategy of TM5 may degrade coding quality and cause buffer overflow at scene changes; therefore, a new algorithm for MPEG-2 target bit-number allocation is proposed in this paper. The influence of scene changes on coding quality is analyzed, and an approach is presented that improves coding performance at scene changes and avoids a sharp quality drop in scene-change frames. Experimental results indicate that the objective coding quality of the frames affected by a scene change is significantly improved, with less influence on other frames.
I. INTRODUCTION
Many fast block-matching motion estimation (BMME) algorithms have been developed in the past years [1]-[8] for low bit-rate video applications. These fast algorithms have the following characteristics: 1) a major design objective is to minimize the number of search positions or checking points (CP's) in a given search range in order to speed up the computation; 2) the data dependency among pixels for the CP's in the search range is not considered; 3) hardware implementation issues are seldom considered when designing these fast algorithms. In this paper, we propose a new fast BMME algorithm, called the checking-vector (CV)-based algorithm, which is designed with very large scale integration (VLSI) implementation in mind. A checking vector in this paper is defined as a vector consisting of several adjacent checking points. Compared to the fastest existing algorithms, the CV-based algorithm has the following characteristics: 1) CV's instead of CP's are used for hardware execution; 2) the data dependency among pixels within a CV is fully utilized, and all the CP's in a CV are processed in parallel by hardware; 3) hardware implementation cost is considered when designing the BMME algorithm.
II. DESIGN MOTIVATION
Among the existing fast BMME algorithms, the three-step search (TSS) algorithm has been very popular for low bit-rate video applications [3] due to its simplicity. Several VLSI designs have been presented for TSS and/or the new TSS (NTSS) [9]-[13]. Some major characteristics of these architectures for TSS are listed in Table I, where the PE (processing element) computes the mean absolute difference (MAD), the processing speed is defined as the number of CP's searched in a search window per second, the I/O rate is defined as the data access time per pixel from a search range for the 352 × 288 picture size at 30 frames per second, and the latency is defined as the number of data access cycles needed to process one CP in a search window. The equations for calculating the processing speed and data access time are given in (1) and (2), respectively

Processing speed = ...

where $M_p$ is the total number of pixels accessed for all block-matchings in a search window per current block, $M_{frame}$ is the number of current blocks in each frame, and $f_{rate}$ is the frame rate.
The architectures in [9] and [10] have the lowest gate counts. When using the 16-pel parallel data access mode, i.e., one column of pixels is input in one clock cycle, they can achieve very high processing speed. The I/O bandwidths for [9] and [10] are very high, since no data dependency is exploited when inputting pixels from the search range. The architectures in [11] and [12] use a serial data access scheme. These two architectures exploit data dependency to reduce the I/O bandwidth; however, the required hardware cost is high. The architecture in [13] has the highest processing speed when using the 16-pel parallel data access mode. Since it has a higher hardware cost, it is not a good choice for the TSS/NTSS algorithm for low bit-rate video applications. In many cases, however, the TSS/NTSS algorithms cannot deliver good algorithmic performance, so new fast algorithms that use more CP's and produce better algorithmic performance are needed. To implement these kinds of algorithms, the one-dimensional (1-D) array in [9] and the tree structure in [10] are not effective, since they require higher I/O bandwidth. Among the architectures listed in Table I, the one in [13] is the best candidate for implementation, since it has the highest processing speed as well as a lower I/O bandwidth. It can provide better design tradeoffs among algorithmic performance, processing speed, silicon area, and I/O bandwidth.
The architecture in [13] was originally proposed for a slightly modified NTSS. This modification results in a simple data flow while maintaining almost the same algorithmic performance as the original NTSS algorithm in [2]. The architecture in [13] includes three 1-D arrays and two programmable-delay-unit (PDU) arrays. The function of the 1-D array is the same as that in [9]. The PDU is employed to provide a variable number of delay units for the different search steps in TSS or NTSS. The hardware cost of the PDU can be greatly reduced if each PDU produces only one delay unit (e.g., for the checking pattern in the last step of TSS or NTSS). In this way, the three 1-D arrays and two PDU arrays can be folded to form one 1-D array with a much lower hardware cost than the unfolded version. For fast BMME algorithms, this suggests a new kind of design methodology, which is elaborated in the next section.
III. THE NEW FAST BMME ALGORITHM
NTSS uses the center-biased checking-point pattern and the halfway-stop technique to reduce the computation cost. Using this concept, we develop a checking-vector (CV)-based fast BMME algorithm in which the CV's are placed in some specific search positions. We call this BMME algorithm the checking-vector-based four-step search (CV4SS) algorithm. The search pattern of CV4SS is shown in Fig. 1(a). Each CV contains three CP's. Nine adjacent CP's are employed around the center, and the halfway-stop technique is used for stationary or quasi-stationary blocks as in NTSS. If the minimum point among these nine CP's occurs at the center, the search procedure stops. We call this search step Step 0. The nine CP's in Step 0 are grouped into three CV's as shown in Fig. 1(a). If the minimum point is not at the center, the search procedure continues with the nine uniformly distributed CV's shown in Fig. 1(a) (in fact, the number of new CV's is eight since the one at the center is already computed). This is called Step 1. If the minimum point occurs at one of the CP's in a CV, the search procedure continues with the second step around this minimum point [see Fig. 1(a)]. In this step, six CV's are used, with every two CV's in the same row sharing one overlapped CP. Finally, three CV's are used for Step 3 as shown in Fig. 1(a).
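As an illustration of Step 0 and the halfway-stop test, the following is a minimal sketch rather than the authors' implementation. The function names `mad` and `step0` are hypothetical, and the sketch assumes that each CV is a horizontal run of three adjacent CP's, so that the three CV's of Step 0 are the rows at vertical offsets -1, 0, and +1. The exact CV layout is defined by Fig. 1(a), which is not reproduced here. The block is assumed to lie at least one pixel inside the frame.

```python
def mad(cur, ref, bx, by, dx, dy, n=16):
    """Mean absolute difference between the n x n current block at (bx, by)
    and the reference block displaced by (dx, dy). Frames are lists of rows."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total / (n * n)


def step0(cur, ref, bx, by, n=16):
    """Evaluate the nine CP's around the center, grouped as three CV's.

    Assumption: one CV per row (dy = -1, 0, +1), three CP's per CV.
    Returns (best_mv, best_mad, stop); stop is True when the minimum
    is at the center, i.e., the halfway-stop condition holds.
    """
    best_mv = (0, 0)
    best_mad = mad(cur, ref, bx, by, 0, 0, n)
    for dy in (-1, 0, 1):          # one CV per row (assumed orientation)
        for dx in (-1, 0, 1):      # the three CP's inside this CV
            if (dx, dy) == (0, 0):
                continue           # center already evaluated
            m = mad(cur, ref, bx, by, dx, dy, n)
            if m < best_mad:       # strict test: the center wins ties
                best_mv, best_mad = (dx, dy), m
    return best_mv, best_mad, best_mv == (0, 0)
```

In hardware the three CP's of a CV would be evaluated in parallel; the inner loop here only models which positions belong to one CV.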
This algorithm has four search steps within a ±8 H × ±7 V search range. Each search step in CV4SS covers the CP's of the corresponding search step in TSS/NTSS. Therefore, TSS/NTSS can be considered a special case of CV4SS. For a 16 × 16 block size, Fig. 1(b) and (c) shows the performance comparison with different algorithms in terms of MAD, where the search range of the CV4SS algorithm is restricted to ±7 H × ±7 V for a fair comparison with NTSS (i.e., we do not consider the CP's outside the ±7 H × ±7 V search range). The proposed CV4SS algorithm has better algorithmic performance than NTSS, especially in some specific frames with large motions. This means that the CV4SS algorithm is more robust than NTSS.
IV. THE PROPOSED VLSI ARCHITECTURE
Fig. 2(a) shows the proposed architecture. As mentioned in Section II, it is basically a 1-D systolic array and is the folded version of the architecture in [13]. For each CV with three CP's and a 4 × 4 block size, the proposed architecture contains four PE's, with each PE working in three pipelined stages. The MAD is employed as the matching criterion for the hardware design.
At each data access cycle $T$, one skewed column of pixels $Y_i$ from the search range and $X_i$ from the current block are input to the PE array simultaneously. Each data access cycle $T$ contains three pipelined cycles $t$ ($T = 3t$), since each PE works in three pipelined stages.
The PE structure is shown in Fig. 2(b). The shift registers Rx0, Rx1, and Rx2 are used to store current block data for reuse. These three ... Fig. 3. The registers Rp0, Rp1, and Rp2 are pipeline registers. The registers Rsum0, Rsum1, and Rsum2 are shift registers that store the partial sums from the upstream PE for each CP in a CV.
The checking-vector control switch (CVCS) in Fig. 2(a) is used to fetch the MAD of each CP in a CV for comparison. The search-step control switch (SSCS) is closed when a search step is finished. The minimum MAD of a given step is sent to the global controller and address generator to produce a new address for the next search step.
V. PERFORMANCE ANALYSIS
The proposed architecture is limited to low bit-rate and MPEG-1 video applications. We first consider the hardware cost (silicon area), then analyze the I/O rate and processing speed.
1) Silicon Area:
Due to the regularity of the 1-D systolic array, the silicon area is largely occupied by the various components, not by the interconnections among these components and PE's. To make a fair comparison among the architectures in Table I, we evaluate the hardware cost of the proposed architecture by the number of equivalent gates. From Fig. 2, it can be seen that there are two kinds of basic components. One computes the absolute difference and summation ($|X - Y| + a$); the other is the register. From [12], the number of gates for $|X - Y| + a$ is 276, and for the 1-bit register is 8. The width of registers Rx0, Rx1, Rx2, Rp0, Rp1, and Rp2 is 8 b. The maximum width of registers Rsum0, Rsum1, and Rsum2 is 12 b. Based on this consideration, and together with the equivalent gates for the switches, accumulator, comparator, and global controller, we can estimate the total number of gates in the proposed architecture to be approximately 17.5k.
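The per-PE part of this estimate can be tallied in a few lines. This is a rough sketch under assumptions that the text does not state: one $|X - Y| + a$ unit per PE and 16 PE's (one per row of a 16 × 16 block). The function names are hypothetical. Only the gate counts (276 and 8) and the register widths come from the text above.

```python
GATES_ABS_DIFF_ADD = 276  # |X - Y| + a unit, figure quoted from [12]
GATES_PER_REG_BIT = 8     # 1-bit register, figure quoted from [12]


def pe_gates(stages=3, data_bits=8, sum_bits=12, abs_units=1):
    """Equivalent gates for one PE.

    Registers per stage: one Rx and one Rp (data_bits wide) and one
    Rsum (sum_bits wide). abs_units = 1 is an assumption.
    """
    reg_bits = stages * (2 * data_bits + sum_bits)
    return abs_units * GATES_ABS_DIFF_ADD + reg_bits * GATES_PER_REG_BIT


def array_gates(num_pe=16, **kwargs):
    """Equivalent gates for the PE array alone (num_pe = 16 is an assumption)."""
    return num_pe * pe_gates(**kwargs)


print(pe_gates())     # 948 gates per PE under these assumptions
print(array_gates())  # 15168 gates for 16 PE's
```

Under these assumptions the PE array accounts for about 15.2k gates, which would leave roughly 2.3k of the quoted 17.5k for the switches, accumulator, comparator, and global controller. That split is an inference from the sketch, not a figure given in the text.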
2) I/O Rate: For the proposed CV4SS algorithm, the number of CV's for all four search steps is 3 + 8 + 6 + 3 = 20. For each CV, the number of pixels needed from the search range is 18 × 16 = 288. ... of the proposed architecture is almost three times higher than that of the 1-D array in [9].
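The pixel counts above can be turned into a per-pixel access time. The sketch below is hedged: it reads the data access time as the reciprocal of the pixel rate $M_p \cdot M_{frame} \cdot f_{rate}$, which is an assumption because equation (2) is not legible in this copy. It also assumes a 16 × 16 block size and a 352 × 288 picture at 30 frames per second, and the function names are hypothetical.

```python
def pixels_per_block(num_cv=20, pixels_per_cv=18 * 16):
    """M_p: search-range pixels fetched per current block (20 CV's x 288 pixels)."""
    return num_cv * pixels_per_cv


def access_time_ns(m_p, width=352, height=288, block=16, frate=30):
    """Per-pixel data access time in ns, assuming time = 1 / (M_p * M_frame * f_rate)."""
    m_frame = (width // block) * (height // block)  # current blocks per frame
    return 1e9 / (m_p * m_frame * frate)


m_p = pixels_per_block()
print(m_p)                            # 5760 pixels per current block
print(round(access_time_ns(m_p), 1))  # per-pixel access time in ns
```

With these inputs there are 22 × 18 = 396 current blocks per frame, so the search-range stream amounts to 5760 × 396 × 30 pixels per second.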
Suppose two Y pixels (and two X pixels) are input to the serial/parallel converters within 20 ns. Then inputting a total of 16 pixels (one column of a block) takes 160 ns, so the data access cycle $T$ should be 160 ns. Since $T = 3t$, the pipelined clock cycle should be 53.3 ns. This pipelined cycle can be implemented for the proposed PE with low hardware cost (e.g., using the simplest carry-ripple adder to compute the MAD). In fact, we can easily obtain a 10 ns processing time for each pipelined stage of the proposed PE using today's VLSI technology. This means that the I/O rate can be increased by about a factor of five. A simple way to increase the I/O bandwidth is to use more I/O ports. With this consideration, we can use more CV's in a large search range to achieve better algorithmic performance for MPEG-1 applications. The processing speed can be further increased by using CV's with more CP's. It should be noted that the tree architecture in [10] and the 1-D array in [9] can be used for any BMME algorithm, since the CP's are processed without exploiting any data dependency. Hence, the required I/O bandwidth is very high, almost three times higher than that of the proposed architecture for the CV4SS algorithm. In particular, when the motion vector search covers a large search range with more CP's or CV's, the two architectures in [9] and [10] can hardly meet the requirements of real-time video applications. In this case, more circuits have to be used for on-chip memory in order to reduce the required I/O bandwidth. Consequently, the proposed architecture has better VLSI design tradeoffs than the 1-D array in [9] and the tree structure in [10].
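The timing arithmetic in this paragraph can be checked with a short sketch. The function names are hypothetical. The busy-time estimate at the end is a rough budget that ignores pipeline fill, comparison, and address-generation overhead, and it assumes 18 column accesses per CV (288 pixels at 16 pixels per access) for a 352 × 288 picture at 30 frames per second.

```python
def cycle_times(pixels_per_column=16, pixels_per_transfer=2,
                transfer_ns=20.0, stages=3):
    """Return (T, t): data access cycle and pipelined clock cycle, in ns."""
    T = pixels_per_column / pixels_per_transfer * transfer_ns
    return T, T / stages


def busy_fraction(T_ns, num_cv=20, columns_per_cv=18,
                  blocks_per_frame=396, frate=30):
    """Fraction of each second spent streaming search data (rough estimate)."""
    ns_per_block = num_cv * columns_per_cv * T_ns
    return ns_per_block * blocks_per_frame * frate / 1e9


T, t = cycle_times()
print(T, round(t, 1))              # 160.0 ns and about 53.3 ns
print(round(t / 10.0, 1))          # headroom if a stage takes 10 ns
print(round(busy_fraction(T), 3))  # share of real time used at T = 160 ns
```

The ratio of the 53.3 ns pipelined cycle to a 10 ns stage is a little over five, which is where the factor-of-five headroom quoted above comes from.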
In general, for a CV containing $K$ CP's and an $N \times N$ block size, the proposed architecture has $N$ PE's, with each PE working in $K$ pipelined stages. The required I/O bandwidth for the proposed architecture is almost $K$ times lower than that of the architectures in [9] and [10].
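The "almost $K$ times" figure can be made explicit with a short calculation. It is a sketch under one assumption: the $K$ CP's of a CV are adjacent along one axis, so their $N \times N$ candidate blocks together span $(N + K - 1) \times N$ pixels, which generalizes the 18 × 16 = 288 count given for $K = 3$, $N = 16$. Without data reuse, each CP needs its own $N^2$ pixels.

```latex
\frac{\text{pixels per CV without reuse}}{\text{pixels per CV with reuse}}
  = \frac{K N^{2}}{(N + K - 1)\,N}
  = \frac{K N}{N + K - 1},
\qquad
\left.\frac{K N}{N + K - 1}\right|_{K = 3,\; N = 16}
  = \frac{48}{18} \approx 2.67 .
```

The ratio approaches $K$ when $N \gg K$, and for the CV4SS parameters it gives about 2.67, in line with the "almost three times" stated in the I/O rate discussion.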
So far, we have discussed the VLSI architecture and the performance analysis based on the utilization of data dependency within each CV. In fact, the data dependency between the CV's can also be exploited to achieve a higher throughput for higher quality video applications by using some extra circuits.
VI. CONCLUSION
In this paper, we have introduced a CV-based BMME algorithm designed with VLSI implementation in mind. Compared to widely used fast BMME algorithms such as TSS and NTSS, the proposed BMME algorithm possesses better algorithmic performance. Although the proposed algorithm has higher computational complexity than TSS/NTSS for software execution, it can be implemented cost-effectively with the proposed VLSI architecture using state-of-the-art VLSI technology. Furthermore, the proposed BMME algorithm and VLSI architecture can be easily extended with different CV pattern designs for different video applications.
I. INTRODUCTION
Global motion caused by camera zooming, panning, and rotation is quite common in video sequences. It has been shown that in video compression, global motion can be modeled using a few parameters [1]-[3]. Global motion compensation can significantly reduce the residual of motion compensation and the entropy of local motion vector fields. The main difficulty in estimating global motion parameters lies in the presence of independently moving objects, which introduce a bias into the estimated parameters. The algorithms presented in [2]-[5] use the least-squares approximation (linear regression) to extract global motion parameters from the motion vector fields generated by a local motion estimation algorithm, such as block matching. To reduce the disturbance of moving objects, a recursive procedure is used to gradually remove the motion vectors of moving objects from the least-squares approximation by thresholding. However, thresholding cannot eliminate the influence of moving objects if the moving objects are relatively large. Moreover, these algorithms are computationally expensive.
This letter presents a new algorithm for global motion parameter estimation, which also operates on motion vector fields obtained by a local motion estimation algorithm. However, this algorithm exploits global motion information not only from stationary objects and the image background, but also from independently moving objects.
II. THE ALGORITHM
The global motion caused by camera zooming, panning, and rotation can be modeled by [1], [5]
where $(x, y)$ is the position of a pixel in the previous frame, $(u_g, v_g)$ is the associated motion vector caused by the global motion over one frame period, $a_1$ is the global motion parameter related to camera zooming, $a_2$ represents camera rotation, and $a_3$ and $a_4$ are the panning parameters. From (1), we have the following relations:
Manuscript received March 10, 1997; revised June 26, 1997. This paper was recommended by Associate Editor Y. Wang.
