This paper presents the design and implementation on FPGA devices of an algorithm for computing the similarity between neighbor photograms in a video sequence using luminance information.
INTRODUCTION
The vast possibilities that Reconfigurable Logic offers for massively concurrent computation make it well suited to implementing complex algorithms. This feature has broadened the scope of digital design, as shown by several papers proposing dedicated circuits implemented on FPGAs [1], [2], [3]. In almost all cases, the topology and the performance obtained depend on the specific application. Thus, each case should be analyzed individually.
This paper aims to exploit concurrency and parallelism to implement video temporal segmentation on Reconfigurable Logic devices, with emphasis on the advantages and limitations of FPGA technology and its development tools.
978-1-4244-3846-4/09/$25.00 ©2009 IEEE

Jose Ignacio Benavides Benitez
Escuela Politecnica Superior Universidad de Cordoba Menendez Pidal s/n, Cordoba, Espana E-mail: ellbebej@uco.es
Nicolas Guil Mata
Dpto. Arquitectura de Computadoras Universidad de Malaga C. Teatinos, Malaga Espana E-mail: nico@ac.uma.es
We chose to implement a function, widely studied in [6], that measures the similarity between two photograms based on their luminance distributions.
After fragmenting the algorithm, we propose specific solutions for each part, making it clear that obtaining the optimum hardware solution is an iterative process which profits from all features of FPGA devices [4] .
In Section 2, we briefly present the design philosophy of dedicated hardware, showing several alternatives using CLBs (distributed) or functional blocks (concentrated). In addition, we describe an interface with external memory, a mandatory complement when dealing with huge volumes of data.
Section 3 offers an introduction to video temporal segmentation through the artificial synthesis of visual human perception and presents the algorithm to calculate similarity that inspires this paper. Section 4 describes in detail the implementation and optimization of the proposed algorithm, emphasizing the design and evaluation criteria.
Finally, the conclusion section presents quantitative results comparing the computing time of the proposed hardware solution with that of pure software running on a PC.
DESIGN WITH DEDICATED CIRCUITS
Optimization criteria to achieve maximum performance are no longer homogeneous when dealing with different parts of the same design, especially when the designer has the freedom to implement configurations on dedicated circuits. The use of distributed logic, together with dedicated functional blocks and data supply, are critical aspects of the design to be taken into account.
Distributed logic and Dedicated blocks
The main drawback of using a hardware reconfigurable device is the loss of performance imposed by the reconfiguration circuit itself [5] . For this reason, device manufacturers often include dedicated blocks capable of doing frequent and specialized tasks. Among them, we can find RAM blocks, multipliers, timers and communication devices whose performance is far superior to that of equivalent functions implemented using only general logic (CLBs).
Another important issue is the time delay introduced by routing. This justifies the inclusion on the FPGA of routes of different purpose and quality (clocks, near, long, etc.).
This work will show the many possible alternatives for combining all these elements in a specific design. The adoption of each of them will depend on the specific implementation carried out.
Data supply
Video processing is a task characterized by a very high demand for data. For this reason, it must be borne in mind that an FPGA offers only limited on-chip storage capacity. Thus, an efficient interface with external memory is required, capable of supplying and receiving data at the necessary rate. In this case, a Xilinx Spartan-3 development board is used, which provides a 32-bit-wide static external memory bank with a 10 ns access time. This bus width allows us to read and process four bytes at a time.
Reading and modifying data stored in the Spartan BlockRAM requires two clock cycles per operation. This memory is synchronous and can operate at clock frequencies of up to 200 MHz, so the two cycles take 10 ns, which matches the 10 ns access time of the external memory.
APPLICATION: TEMPORAL VIDEO SEGMENTATION
An important application of this work is the temporal segmentation of a video for the purpose of identifying sequences, or certain scenes [10] . The technique is based on giving a quantitative value to the human perception of similarity between frames. In this theory, it is assumed that the properties of a certain stimulus, in this case an image, can be represented as a vector in a space of characteristics and, as a consequence, the similarity between two images is reduced to the appropriate measurement of a distance in a metric psychological space [6] .
One of the main features of an image is its luminance histogram, defined as the frequency of occurrence of the pixel's luminance values in each photogram.
The similarity between two photograms is inversely related to the distance between the vectors representing their characteristics. In the case of a luminance histogram, it can be defined through the following normalized equation [6]: (1)
Where:
H_i[b] is the histogram at level "b" from photogram "i".
W_i[b] is the windowed histogram at level "b" from photogram "i".
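Since the body of equation (1) is not reproduced here, the following Python sketch only illustrates one form consistent with the later description of the stages: a cross term between the previous histogram and the current windowed histogram, normalized by the square roots of the two per-photogram terms. The exact expression is an assumption, as are the function and argument names.

```python
import math

def similarity(h_prev, w_prev, h_curr, w_curr):
    """Normalized windowed correlation between two photograms (assumed form).

    h_* are histograms H[b]; w_* are the matching windowed histograms W[b].
    """
    # Cross term: previous histogram against the current windowed histogram.
    cross = sum(h * w for h, w in zip(h_prev, w_curr))
    # Per-photogram terms; the one for the current photogram can be kept
    # and reused when the next photogram arrives.
    auto_prev = sum(h * w for h, w in zip(h_prev, w_prev))
    auto_curr = sum(h * w for h, w in zip(h_curr, w_curr))
    # Square roots first, then their product, mirroring the stage order.
    denom = math.sqrt(auto_prev) * math.sqrt(auto_curr)
    return cross / denom if denom else 0.0
```

With identical inputs for both photograms the function returns a value of approximately 1, which is the behavior expected of a normalized similarity measure.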
ALGORITHM SEGMENTATION
In order to implement and optimize expression (1), we divide the procedure into five sequential stages. The multipliers needed for stage 3 were not inferred from VHDL, because the Spartan-3 device has dedicated eighteen-bit multiplier blocks with much better performance than that attainable using distributed logic.
Calculating the Histogram
The real bottleneck occurs at the stage responsible for obtaining the histogram: its interface with external memory is what limits the overall circuit performance. Therefore, it is very useful to reduce the amount of data to be processed.
It can be proved that the former calculation is equivalent to calculating the similarity from the DC coefficients of an image compressed in the MPEG format [7]. Using these coefficients reduces the size of the image by a 64:1 ratio.
Considering the results obtained in [8], where it has been proved that a circuit for obtaining the histogram using BlockRAMs as accumulators performs better than one using distributed logic, we propose the hardware shown in Figure 1 for the first stage.
Notice that the dual-port feature of the Spartan-3 BlockRAM allows us to extract the individual histogram values in pairs. The Port B input has been forced to "0" in order to clear the accumulators in the same clock cycle. Several registers were inserted between the output adders and at the address input of the BlockRAMs in order to pipeline the stage, so as to optimize the interface with external memory and double the internal clock frequency.
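As a software illustration of the accumulator idea, the sketch below builds a 64-level histogram from 32-bit words that each carry four 8-bit values, as read over the external memory bus. The byte order within a word and the mapping from an 8-bit value to a level (dropping the two least significant bits) are assumptions, and the function name is hypothetical.

```python
LEVELS = 64  # number of histogram levels used in this work

def histogram_from_words(words, levels=LEVELS):
    """Accumulate a histogram from 32-bit words holding four 8-bit values each."""
    hist = [0] * levels
    # Assumed quantization: keep the top bits of each 8-bit value
    # (a shift of 2 for 64 levels).
    shift = 8 - (levels.bit_length() - 1)
    for word in words:
        for byte_index in range(4):
            value = (word >> (8 * byte_index)) & 0xFF
            hist[value >> shift] += 1  # read-modify-write of one accumulator
    return hist
```

Each increment corresponds to one read-modify-write on an accumulator, which is the operation the BlockRAM-based circuit performs in hardware.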
Calculating the windowed terms Wi[b]
Fig. 2 Concurrent windowing definition
In order to reduce the influence of slight variations in image luminance, each term of the histogram has been replaced with the sum of itself and its adjacent neighbors. Figure 2 shows how the E_i windowed values are generated, as well as the convenience of using concurrent calculations, since most H_i values are used in more than one E_i calculation. Excluding the two boundary values, each H_i value contributes to E_{i-1}, E_i and E_{i+1}.
Figure 3 shows how the algorithm symmetry allows the same sum term to be used twice in calculating the windowed value.
These two new blocks provide four 32-bit-wide output ports, capable of supplying the eight necessary histogram values simultaneously. Figure 4 shows the first version of this stage. The folded buses that appear at the Port B inputs of both blocks repeat some histogram values in particular positions of the memory bank, thus solving the boundary problem mentioned in the above calculation.
The output summing stages were also pipelined to keep the maximum clock frequency of the whole stage near 200 MHz, while the three output buses allow six windowed values to be extracted concurrently in each clock cycle.
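The windowing and the reuse of a shared sum term can be sketched in Python as follows. Each pairwise sum is computed once and feeds two neighboring windowed values. Repeating the edge value at each boundary is an assumption about how the boundary case is resolved, and the function name is hypothetical.

```python
def windowed_terms(hist):
    """Return E[b] = H[b-1] + H[b] + H[b+1], computing each shared pair sum once."""
    n = len(hist)
    # Assumed boundary rule: repeat the first and last histogram values.
    padded = [hist[0]] + list(hist) + [hist[-1]]
    out = [0] * n
    for k in range(0, n + 1, 2):
        shared = padded[k] + padded[k + 1]       # one adder, result used twice
        if k >= 1:
            out[k - 1] = padded[k - 1] + shared  # window centered on hist[k-1]
        if k < n:
            out[k] = shared + padded[k + 2]      # window centered on hist[k]
    return out
```

For the input [1, 2, 3, 4] the sketch yields [4, 6, 9, 11]; the middle pair sum 2 + 3 is used for both the second and the third output.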
Coupling considerations
In order to prevent the delay introduced by the next calculation from piling up, BlockRAMs 4 and 5 will retain data belonging to a photogram, while the following one is being processed in a different memory page. The BlockRAM dual port feature allows for decoupling and behaves like another pipeline stage. The only difference in this case is that the retention time unit is the whole photogram duration and not a single clock cycle.
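The page-alternation scheme described above behaves like a double (ping-pong) buffer. The following Python sketch models that behavior only; the class and method names are hypothetical and it does not reflect the actual BlockRAM port wiring.

```python
class PingPongHistogram:
    """Two histogram pages: one being filled, one held stable for later stages."""

    def __init__(self, levels=64):
        self.pages = [[0] * levels, [0] * levels]
        self.write_page = 0

    def accumulate(self, level):
        """Count one value in the page currently being written."""
        self.pages[self.write_page][level] += 1

    def end_of_frame(self):
        """Switch pages and return the completed histogram."""
        finished = self.pages[self.write_page]
        self.write_page ^= 1
        # Start the next photogram on a cleared page.
        self.pages[self.write_page] = [0] * len(finished)
        return finished
```

The histogram returned by `end_of_frame` stays unchanged while the next photogram is accumulated, which is what allows the later stages to run for a whole photogram period without stalling the input stage.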
Fig. 3 Grouping the windowed sum terms
The concurrence of these eight values is achieved by using two storage BlockRAMs.
Bearing in mind that 64 is the number of established histogram levels, each one represented by a 16-bit integer, the proposed structure can process photograms containing up to 65,536 pixels. Taking as a reference a photogram of only 1600 pixels, Table 1 shows the calculation time expressed in normalized clock cycles.
Storing and windowing are performed on the previous histogram while the input stage builds the following one. After the windowing stage we have placed six multipliers followed by a three-level adder matrix, responsible for obtaining the correlation. The whole block has been pipelined using six registered stages, keeping the clock frequency near 140 MHz.
Two of the partial sums needed to calculate the similarity coefficient as defined by equation (1) are stored in the output. Partial terms of the denominator will be reused in the following photogram calculation; hence they are stored to avoid repeating the calculation (see Figure 5 ).
Fig. 5 Circuit for calculating the windowed correlation (includes pipelined windowed calculation). Figure labels: output of windowed sum; E (32 bits); P0 (128 bits); P1 (176 bits); P2 (192 bits); P3 (192 bits); P4 (96 bits); P5 (64 bits); P6 (64 bits); S (64 bits).
Note that both summing terms use the windowed term of the same photogram, W_i[b]; while one is multiplied by the current histogram H_i[b], the other is multiplied by the previous one, H_{i-1}[b]. The term W_i[b] is retained for two clock cycles in the registers (shaded in Figure 5) placed at the input of the multipliers, ensuring that in each clock cycle the right histogram is paired with its associated sum term. Starting from the histograms of two photograms stored in BlockRAM, the stage takes only 28 clock cycles to complete two windowed sums.
The key to this high performance lies in the concurrency of operations, the organization of the data and the pipeline structure.
It is also easy to interface this block with its neighbors due to the fact that the input and output buses are only 32 and 64 bits wide, respectively. In contrast, the internal bus width parallelism reaches up to 192 bits in the multipliers layer.
Square root calculation
The purpose of this stage is to obtain the square root of the denominator in equation (1). The operation was implemented with shifts and additions, based on the algorithm described in [9]. The circuit in Fig. 6 is the hardware implementation of this operation, starting from a slight modification at the output of the circuit shown in Figure 5.
It takes 16 clock cycles to process a 32-bit input value. The sum term S[h(n)·w(n)] is available at the previous correlation block one clock cycle before the sum term S[h(n-1)·w(n)] is completed, which can be used to start the square root calculation early.
It should also be highlighted that some terms in the numerator of equation (1) belonging to histogram n will be used in the same calculation for the next photogram. For that reason, a register has been provided to retain them.
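As an illustration of why a 32-bit input needs 16 iterations, the sketch below uses the well-known digit-by-digit integer square root, which produces one result bit per iteration using only shifts, additions and subtractions. It is not necessarily the exact algorithm of [9]; the function name is hypothetical.

```python
def isqrt32(value):
    """Integer square root of a 32-bit unsigned value, one result bit per step."""
    if not 0 <= value < 1 << 32:
        raise ValueError("value must fit in 32 unsigned bits")
    root = 0
    remainder = value
    bit = 1 << 30  # highest power of four representable in 32 bits
    for _ in range(16):  # 16 iterations for a 32-bit input
        trial = root + bit
        if remainder >= trial:
            remainder -= trial
            root = (root >> 1) + bit
        else:
            root >>= 1
        bit >>= 2
    return root
```

Mapping one loop iteration to one clock cycle gives the 16-cycle figure quoted above for a sequential implementation.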
Product and division
Following the sequential stage responsible for calculating the square root, a multiplying block provides the product of the denominator in equation (1) . In this way, numerator and denominator are ready and available at the output of the circuit shown in Figure 6 to make the final division and obtain the normalized similarity coefficient between two photograms.
The division between two unsigned integers, employing a shift-and-subtract algorithm, is done by a circuit similar to the one shown in Fig. 6, as described in [9]. The full process takes 32 clock cycles for the 32-bit solution required.
Table 2 Summary of number of clock cycles taken per operation
The stages described above require synchronization signals in order to properly control the flow of data. Therefore, it is necessary to provide a control circuit to generate the appropriate timing signals. Figure 7 shows the diagram of the Finite State Machine responsible for controlling the whole system. It was implemented in VHDL and placed and mapped together with the rest of the modules at the top level of the hierarchy tree.
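The shift-and-subtract division mentioned above can be illustrated with the classic restoring division scheme, which resolves one quotient bit per iteration and therefore needs 32 iterations for 32-bit operands. This is a generic sketch under that assumption, not a transcription of the circuit from [9]; the function name is hypothetical.

```python
def divide_u32(dividend, divisor):
    """Unsigned 32-bit restoring division; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    quotient = 0
    remainder = 0
    for i in range(31, -1, -1):  # 32 iterations, most significant bit first
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:
            remainder -= divisor  # subtract when it fits
            quotient |= 1
    return quotient, remainder
```

For example, `divide_u32(100, 7)` returns `(14, 2)`.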
Preliminary results after synthesis show that the maximum clock frequency drops to 110 MHz, 22% lower than the value reported when the individual modules are compiled independently. This is mainly due to the strong influence of routing on propagation delay time.
Because of the intensive use of pipelined architecture, the system exhibits a finite latency, after which a new value is output every 800 clock cycles (about 7.3 µs at a 110 MHz clock frequency).
Data flow has been organized in such a way that the first stage calculates the histograms on alternate RAM pages, taking 800 clock cycles to process each 1600-pixel photogram.
Each time the processing of a photogram is over, the rest of the calculation starts, which takes 109 clock cycles to complete. Except for the first stage (see Figure 1) , the rest of the stages stay idle 86% of the time.
Clearly, the bottleneck is determined by the width of the bus that connects the external memory to the system described above. Widening this bus from 32 to 128 bits would make it possible to process 16 pixels at a time, lowering the time needed to complete the calculation to 200 clock cycles.
Even in this case, the usage factor of the remaining stages is never higher than 55% of the total time. Moreover, this time does not add to the total, because these stages work in parallel with the histogram stage.
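The cycle counts and usage factors quoted above can be checked with a short calculation. The sketch assumes 8-bit pixels and two clock cycles per memory access (the BlockRAM read-modify-write cost stated earlier); the function names are hypothetical.

```python
def histogram_cycles(pixels, bus_bits, cycles_per_access=2, bits_per_pixel=8):
    """Clock cycles needed to feed one photogram through the histogram stage."""
    pixels_per_access = bus_bits // bits_per_pixel
    accesses = -(-pixels // pixels_per_access)  # ceiling division
    return accesses * cycles_per_access

def downstream_usage(histogram_cycle_count, downstream_cycles=109):
    """Fraction of the photogram period during which the later stages are busy."""
    return downstream_cycles / histogram_cycle_count

narrow = histogram_cycles(1600, 32)   # 800 cycles with the 32-bit bus
wide = histogram_cycles(1600, 128)    # 200 cycles with a 128-bit bus
busy_narrow = downstream_usage(narrow)  # 0.13625, i.e. idle about 86% of the time
busy_wide = downstream_usage(wide)      # 0.545, i.e. just under 55% busy
```

Under these assumptions the numbers reproduce the 800-cycle and 200-cycle figures as well as the 86% idle time and the 55% upper bound on usage.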
CONCLUSIONS
The entire design was simulated and implemented using the Xilinx ISE 8.2i software package and tested on a Digilent "SPARTAN-3 Starter Board" development board fitted with an XC3S1000 FT256 chip (speed grade 4) and 1 Mbyte of SRAM (256 K × 32) with a 10 ns access time. For the software-hardware comparison, four compressed videos on different subjects were used: a drama, a basketball game and two news programs, with an average length of 25,000 frames.
The algorithm was implemented using fixed point and selecting the necessary resolution for each stage to avoid overflow.
A close look at the summary shown in Table 2, containing the number of clock cycles taken by each individual stage, suggests that it could be possible to reduce the parallelism of the nearly idle stages, which consume a great deal of resources. One example is the circuit in Figure 5, which employs such a scarce and frequently used resource as the multiplier.
The Xilinx ISE synthesis tool gave the following report concerning resources occupied and available at chip level.
As regards the data reported in Table 3, it should be made clear that, owing to a peculiarity of Xilinx FPGAs, the use of a multiplier block disables the use of its associated BlockRAM and vice versa. In applications like the one in this paper, where there is no direct link between the number of BlockRAMs and multipliers used, the usage factor must therefore be calculated considering them as available pairs: in this case, 14 multiplier-BlockRAM pairs out of 24 available (58% busy), which is almost twice the value (29%) reported by the Xilinx synthesis tool.
Table 3 Summary of resources usage
The interface with the external memory was set up using an Agilent 64622D oscilloscope.
Table 4 Hard-Soft solution comparison
In order to evaluate the performance of the hardware solution implemented on an FPGA (Spartan-3, 120 MHz) for the windowing and sum-of-products calculation developed in this work (the windowed correlation entry of Table 2), we compared it with the equivalent software implementation presented in [6] (Pentium IV PC, 2 GHz).
This comparison is very illustrative because the FPGA is a reconfigurable device penalized by the overhead of its configuration circuits, which slow down the clock rate and increase silicon area compared with a custom VLSI device. However, the calculation speed obtained with the FPGA exceeds that obtained using MMX instructions (SSE multimedia extension) by a factor of 1.75:1 as regards nominal delay and by a factor of 25:1 as regards clock cycles. We attribute this improvement to the optimization of data flow and to concurrency.
