Accumulator Size Minimization for a Fast Cumulant-Based Motion Estimator
Jaime S. Cardoso, Student Member, IEEE, and Luís Corte-Real, Member, IEEE
Abstract-The implementation of a fast dedicated processor for block-matching motion estimation based on a cumulant matching criterion implies the optimization of all of its components. Special care should be taken with the multiply-accumulate unit, which is the core of many digital signal processing systems. Its optimization may therefore be of utmost importance, especially if a significant number of such units are present in the platform. In this paper, the size of one such unit is minimized for a specific application, although the results are relevant to other scenarios.
Index Terms-Cumulants, higher order statistics, multiply-accumulate unit, optimization.
I. INTRODUCTION
Motion estimation is usually the most time-consuming task in video processing systems, especially when a complex technique such as cumulant-based block matching is used. For real-time applications, this critical operation must therefore be implemented with special care.
One way to achieve the high performance required in such situations is to use custom computing machines specifically tuned for the application.
We developed a dedicated processor and implemented it in a field-programmable gate array (FPGA) based computing platform [1]. This system was built as an expansion slot card for the PC bus. The custom processor executes the most computationally intensive task of motion estimation: the block-matching algorithm using a cumulant matching criterion.
To optimize the mapping of the dedicated processor onto the FPGA platform, special care was taken in the optimization of all of its elements. This paper presents the optimization effort carried out in the implementation of one such element, the multiply-accumulate (MAC) unit. The results can be adapted to other applications.
A. Cumulant-Based Motion Estimation
The implemented algorithm, based on third-order cumulants (see [2]), is the block-matching algorithm described in [3]. The estimated displacement vector , belongs to the search region , assumed to contain the largest possible horizontal and vertical displacement and minimizes where is equal to the maximum value of and is equal to the minimum of -now , and are 1-D variables. The maximum used index of variation was 3 for and , as used in [3].
3) Estimate the third-order moment of the block as the average of the moments of each line.
By software simulation on a group of test sequences, we have verified that these simplifications have a minor impact on the estimated motion vector, while reducing the computational load by a factor of up to fifty.
B. Custom Processor
1) Reconfigurable Platform:
The reconfigurable platform is composed of two main boards (see Fig. 1). The RVC (Reconfigurable Vector Coprocessor) has five XILINX FPGAs (X0, X1, X2, X3, and X4), a main memory of 1 M × 32 bit, and four separate 256 K × 8 bit SRAMs (see [4]).
A second board hosts up to four processing nodes (PPK, after Polygon Positioning Kernel, the first application for which this system was used). Each node has an XC4085 XLA-08 XILINX FPGA and two 32 K × 16 bit SRAMs. The PPK array communicates with the RVC via a 36-bit data bus and two additional handshake lines.
2) Dedicated Processor Architecture: Besides the interface with the host and with the main memory, mapped in X0, the custom processor is composed of four independently addressable RAMs, working as cache memories, that hold the current frame and the average of each segment of 16 consecutive pels; the averages are computed in X3 and X4 while the frame is transferred from the main memory to the cache RAMs. The computation core is a multiply/accumulate unit composed of one pair of 2 cascaded multipliers and one accumulator, allowing two block rows to be processed simultaneously (Fig. 2); it is replicated in the four PPK nodes. The multiplier is a combinatorial integer multiplier, generated by an application developed for that purpose.
For a full description of the implemented architecture, see [1] .
3) Accumulator Specification: The input to each multiplier is the difference between an element of a 16-element vector and the average of that vector. Each element of the original vector is an unsigned 8-bit integer.
A simplified view of the accumulator core is sketched in Fig. 3. The cascaded multipliers were collapsed into a single three-input multiplier. The second pair of multipliers was discarded, because its only purpose is to double the work done per clock cycle. Finally, the input to the multipliers was generalized to come from three different sources instead of two, with one of them fed delayed to one of the multipliers. Over sixteen clock cycles, all the elements of the vectors are processed sequentially: the average is subtracted from each element, and the resulting differences are multiplied and accumulated.
To maximize the number of computing units per FPGA, we needed to know the maximum value that could be accumulated in this MAC unit. Note that we need not only the maximum of the final accumulated value but also the maximum of the partial sums, because one of these may be larger than the final accumulated sum.
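The remark about partial sums can be illustrated with a small software model of the accumulation. This is only a sketch: the function name `mac_partial_sums`, the use of exact rational averages (the hardware may well work with integer averages), and the example vectors are assumptions made for illustration and are not taken from the implemented design.

```python
from fractions import Fraction


def mac_partial_sums(x, y, z):
    """Running sums of (x_i - mean(x)) * (y_i - mean(y)) * (z_i - mean(z)).

    The averages are taken over the whole vectors, as in the accumulator
    specification; exact fractions are an assumption of this sketch.
    """
    n = len(x)
    mx, my, mz = (Fraction(sum(v), n) for v in (x, y, z))
    acc = Fraction(0)
    partial = []
    for xi, yi, zi in zip(x, y, z):
        acc += (xi - mx) * (yi - my) * (zi - mz)
        partial.append(acc)
    return partial


# Hypothetical input: three identical vectors with four large elements first.
x = [255] * 4 + [0] * 12
y = list(x)
z = list(x)

partial = mac_partial_sums(x, y, z)
peak = max(partial)      # largest value the accumulator ever holds
final = partial[-1]      # value after all sixteen cycles
print(float(peak), float(final))
```

For these vectors the running sum peaks after the fourth term and then decreases, so the largest accumulated value is an intermediate one rather than the final result.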
A crude approach to the problem would lead to an accumulator size of 31 bits.
4) Paper Organization: The remainder of this paper is organized as follows. Section II presents a definition of the problem to be solved. Section III addresses the problem, presenting each step of the mathematical demonstration. Finally, conclusions are drawn in Section IV.
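The 31-bit crude estimate quoted in the accumulator specification is not derived in the text. One plausible way to arrive at it, sketched below, is plain bit-width bookkeeping: each difference between an unsigned 8-bit element and an average is assumed to need 9 bits in two's complement, the product of three such values 27 bits, and the sum of 16 products 4 more bits. The function name and its parameters are hypothetical.

```python
def crude_accumulator_bits(input_bits=9, factors=3, terms=16):
    """Worst-case width that ignores any dependence between the inputs.

    input_bits: assumed width of one signed difference (element minus average)
    factors:    number of values multiplied together
    terms:      number of products accumulated
    """
    product_bits = input_bits * factors
    # ceil(log2(terms)) extra bits are enough to add `terms` values.
    growth_bits = (terms - 1).bit_length()
    return product_bits + growth_bits


print(crude_accumulator_bits())
```

This bookkeeping treats the three inputs as independent full-range values, which is why it overestimates the width actually needed.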
II. PROBLEM DEFINITION
We will focus on the maximization problem and later derive the minimum of the function under study.
So we see that changing to or to will increase the value of the function, depending on the sign of . In conclusion, it is always possible to move to a solution with a greater value of if we have . 1 The same applies to and .
•
This time we also have to consider the change in the sum due to the change in the term . In the new solution with the change due to that term will be and the total variation is given by
In the solution with the same change takes the value and the total variation is given by Therefore, the changes are always of opposite signs, implying that it is always possible to get a solution with a greater value of , taking or . is not better but is also a maximum.
Conclusion: At least one optimal solution is among those in which every element of a given vector is equal to 0 or 255. (The same applies to the other two vectors.)
Claim 3.2: In an optimal solution the terms in the sum are all positive.
Proof: We are now interested only in solutions whose elements are all equal to 0 or 255, that is, in binary variables. Hence, we shall consider elements rescaled to take the values 0 or 1, and at the end we simply multiply the maximum value attained with these new variables by $255^3$ to obtain the maximum of the initial problem.
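Notice that an element equal to 0 gives a nonpositive difference from the average of its vector, and an element equal to 1 gives a nonnegative one (the same holds for all three vectors). So, of the eight possible distinct combinations of values for a triple of corresponding elements, namely 000, 001, 010, 011, 100, 101, 110, and 111, only half of them, 001, 010, 100, and 111, give positive terms in the sum, while the others contribute negative terms.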
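The sign argument can be checked mechanically. The sketch below assumes that each average lies strictly between 0 and 1, so that a 0 element deviates negatively and a 1 element positively; the helper name `term_sign` is hypothetical.

```python
from itertools import product


def term_sign(bits):
    """Sign of the product of deviations for one triple of binary elements.

    Assumes every average is strictly between 0 and 1, so a 0 lies below
    its average (negative factor) and a 1 lies above it (positive factor).
    """
    sign = 1
    for b in bits:
        sign *= 1 if b else -1
    return sign


patterns = list(product((0, 1), repeat=3))
positive = ["".join(map(str, p)) for p in patterns if term_sign(p) > 0]
negative = ["".join(map(str, p)) for p in patterns if term_sign(p) < 0]
print(positive, negative)
```

A term is positive exactly when it contains an even number of zeros, which gives the four patterns named in the text.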
Is it worth looking for solutions with negative terms in the sum? No, because a partial sum without those negative terms would have a greater value. Is it worth looking for solutions with positive terms among the terms outside the sum? No, because a partial sum that included those positive terms (that is, a longer partial sum) would have a greater value.
In summary, it is enough, for each partial sum, to look for an optimal solution among those with only positive terms in the sum, corresponding to terms equal to 001, 010, 100, or 111, and only negative terms outside the sum, corresponding to 110, 101, 011, and 000.
It is important to stress that, with these constraints, we are not guaranteed to find the maximal value of each partial sum. However, we are sure to find the optimal solution, that is, the maximum over the maxima of all partial sums.
Proof: If we complement the elements of a vector, e.g., we will get a new solution with and . Hence, . We conclude that complementing a vector changes only the sign of the sum; if we complement two vectors, the value of the sum does not change. Therefore, we can move from one optimal solution to another by complementing two vectors. This property will be used later.
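The complement property is easy to confirm numerically. In the sketch below, complementing a vector is taken to mean replacing each element e by 255 - e, which also turns the average m into 255 - m and so negates every deviation. The function names, the random test vectors, and the partial-sum length of 10 are arbitrary choices for illustration.

```python
from fractions import Fraction
import random


def third_order_sum(x, y, z, k=None):
    """Sum of the first k products of deviations; averages use all elements."""
    n = len(x)
    k = n if k is None else k
    mx, my, mz = (Fraction(sum(v), n) for v in (x, y, z))
    return sum((x[i] - mx) * (y[i] - my) * (z[i] - mz) for i in range(k))


def complement(v, top=255):
    """Replace each element e by top - e."""
    return [top - e for e in v]


rng = random.Random(0)  # fixed seed so the check is repeatable
x, y, z = ([rng.randrange(256) for _ in range(16)] for _ in range(3))

s = third_order_sum(x, y, z, k=10)
one_flipped = third_order_sum(complement(x), y, z, k=10)
two_flipped = third_order_sum(complement(x), complement(y), z, k=10)
print(one_flipped == -s, two_flipped == s)
```

The same identity holds for every partial-sum length, which is what allows the minimum to be obtained from the maximum later on.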
Claim 3.4: In an optimal solution the terms outside the sum must all be equal.
Proof: It is enough to verify that if we have out of the sum some and some , we can improve the solution by changing to 1 or to 0
and We see that all terms outside the sum must be equal (the same applies to and ). So the terms outside the sum must all be one and the same of 011, 101, 110, and 000. But we can go a little further. As complementing two vectors does not change the value of the sum, we can make all terms outside the sum equal to 000.
To clarify: we will look for an optimal solution among those with positive terms in the sum (001, 010, 100, and 111) and 000 outside it.
Let be the number of times that the terms corresponding to 001, 010, 100, and 111 appear in the sum, respectively.
Claim 3.5: .
Proof: With the constraints imposed on our search, in the solutions we are restricted to, we have , and . The function to maximize can be written as with the previous expression of simplifies to Suppose, without loss of generality, that , and . We intend to prove that we must have in an optimal solution. To do so, suppose that we have an optimal solution with . Let us make in a new solution , and . The change in is of , which is positive because . Conclusion: in an optimal solution the difference between and must be less than 2, implying that and . So and is equivalent to . For a given we just have to change from 0 to to find among those solutions the optimal one (setting and and are uniquely determined and so and ). So we must evaluate the function 152 times; in the original problem we had different solutions!
A. Results
Evaluating the function at the 152 identified candidate solutions, we conclude that it attains its maximal value of 33 162 750. That value is representable with 26 bits in two's complement.
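The reduced search can be reproduced with a short enumeration. The sketch below is a reconstruction under stated assumptions, not the authors' program: the original symbols did not survive, so the count names (`n001`, `n010`, `n100`, `n111`), the helper names, and the exact form of the objective are inferred from the claims above (binary elements, 000 outside the sum, averages over all 16 elements, the three counts of 001, 010, and 100 differing by at most 1).

```python
from fractions import Fraction

N = 16             # vector length
SCALE = 255 ** 3   # converts the binary (0/1) result back to 0/255 elements


def balanced_split(total):
    """Split total into three counts that differ by at most 1."""
    q, r = divmod(total, 3)
    return [q + (i < r) for i in range(3)]


def binary_objective(n001, n010, n100, n111):
    """Partial sum when the accumulated terms are the given patterns.

    Patterns are read as (x, y, z); all terms outside the sum are 000.
    """
    mx = Fraction(n100 + n111, N)
    my = Fraction(n010 + n111, N)
    mz = Fraction(n001 + n111, N)
    return (n001 * (0 - mx) * (0 - my) * (1 - mz)
            + n010 * (0 - mx) * (1 - my) * (0 - mz)
            + n100 * (1 - mx) * (0 - my) * (0 - mz)
            + n111 * (1 - mx) * (1 - my) * (1 - mz))


candidates = []
for k in range(1, N + 1):            # length of the partial sum
    for n111 in range(k + 1):        # the other three counts then follow
        n001, n010, n100 = balanced_split(k - n111)
        value = binary_objective(n001, n010, n100, n111) * SCALE
        candidates.append((value, (n001, n010, n100, n111)))

best_value, best_counts = max(candidates)
bits = int(best_value).bit_length() + 1   # magnitude bits plus a sign bit
print(len(candidates), best_value, best_counts, bits)
```

Under these assumptions, taking each of the four patterns four times, with all sixteen terms accumulated, gives exactly 2 × 255³ = 33 162 750, which needs 25 magnitude bits plus a sign bit.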
B. Minimum of the Problem
Of course, we also have to address the minimization problem, to know whether the minimum is also representable with 26 bits in two's complement. We could follow a path similar to the one followed in the maximum calculation. However, that is not necessary. Bearing in mind that complementing a vector changes only the sign of the sum, the minimum must be minus the maximum. So it is also representable with 26 bits.
IV. CONCLUSION
The performance of a system is often limited by one key task. That is especially true in video processing systems, where motion estimation is usually the most time-consuming task. This implies that, for real-time applications, this critical operation should be implemented with special care.
We have shown how an effective optimization of a key component in a custom motion estimation processor can bring important gains in comparison with an equivalent, nonoptimized implementation. The optimal size for the accumulator in a multiply-accumulate unit was derived mathematically for this specific application. These results can be extended or adapted to similar applications.
