Abstract-This paper describes the analysis of the Mumford and Shah functional from the implementation point of view. Our goal is to show results in terms of complexity for real-time applications, such as motion estimation based on segmentation techniques, of the Mumford and Shah functional. Moreover, the sensitivity to finite precision representation is addressed, a fast VLSI architecture is described, and results obtained for its complete implementation on a 0.13 m standard cells technology are presented.
INTRODUCTION
RECENTLY, the use of the Mumford and Shah functional [1] has been investigated in different applications [2] , [3] , [4] . In [4] , a novel interpretation of the optic flow has led to an extension of the Mumford and Shah functional to motion segmentation. This approach, named Motion Competition, is intended to join motion estimation and segmentation to derive a variational approach for the segmentation of the image domain into regions of homogeneous motion.
In terms of implementation several iterative algorithms can be found in the literature, e.g., [3] , [5] , [6] , [7] , [8] , [9] , and [10] . However, the computational complexity behind the Mumford and Shah model has not been stressed yet. The multigrid approach is an interesting technique to speed-up iterative methods for solving elliptic problems: The basic idea is to find first a solution to the problem, solving it on a coarse mesh and then to refine the original problem on a fine mesh. This approach has been successfully employed for the Mumford and Shah functional in [7] proving noteworthy speed-up in terms of iterations reduction. Its main drawback is the amount of memory required to store coarse solutions that will be employed during the solution of finer grids. Moreover, to move from a coarser grid to a finer one and viceversa, restriction and expansion operators are required.
Other solutions are based on the Steepest Descent (e.g., [3] and [9] ) to reduce the number of iterations. Besides, the Preconditioned Conjugate Gradient (PCG) method is a very effective technique for solving sparse linear equations systems as Ax ¼ b. It is based on preconditioning the so-called A-orthogonality that improves the converge to the solution employing A-orthogonal search directions d ðiÞ (d T ðiÞ Ad ðjÞ ¼ 0). Its use to speed-up the Mumford and Shah functional implementation has been addressed [10] , showing interesting results. The main drawback of the PCG method is its sensitivity to finite precision representation. In fact, working with high precision floating point tools (e.g., Matlab), this phenomenon is very limited. On the contrary, we are interested in fixed point implementations that could tackle PCG method effectiveness.
A simple numerical iterative solution based on the nonlinear Gauss-Seidel method can be employed [5] to reduce the number of iterations: The computational complexity is still proportional to the number of iterations k and to the image size r Â c, where r is the number of rows and c is the number of columns. Running a C model on a 3.06 GHz Intel Xeon with 2GB of RAM it can be observed that for 255 gray-level images, 270-280 ms are required on a QCIF (176 Â 144 pixels) frame and 1120-1130 ms on a CIF (352 Â 288 pixels) frame. The power consumption [11] with a supply voltage of 1.5 V is estimated in 60-70W that means more than 30 J per sample per iteration: Even with a modern processor, the computational complexity to perform real-time segmentation is too inadequate for a general purpose CPU.
The contribution of this work is twofold: First, to focus on the computational complexity of solving the Mumford and Shah functional, giving results on execution time and on finite precision representation effects. Second, to propose a fast hardware implementation suitable for new motion estimation techniques, as Motion Competition, with high frame rate video sequences. The proposed architecture key points are: 1) a fully parallel data-path with two hardware dividers and 2) an improved diagonal scanning order to effectively feed the data-path pipeline. The complete VLSI flow performed for the proposed architecture includes the generation of actual post place and route figures in terms of complexity, area, clock frequency, and power consumption.
THEORETICAL FRAMEWORK
The Mumford and Shah approach is based on a mathematical model that considers the segmentation problem as a partition of the image domain & R 2 in open subsets i . Given an image, whose intensity function is defined as g : ! R, the Mumford and Shah functional [1] aims to find a smooth approximation u of g in each subdomain i :
where K is the set of i boundaries and k K k denotes K set length, is a parameter related to the scale [9] , is a parameter related to the contrast [9] , x ¼ ðx 1 ; x 2 Þ, dx is the Lebesgue measure in the plane, and
In [12] , Ambrosio and Tortorelli demonstrated that the functional described in (1) can be approximated, in the À-convergence sense, by
with Èð"; zÞ
4" , where the function z yields an approximate description of the set of curves K and
Thus, the solution obtained minimizing E " ðu; zÞ tends to the solution obtained minimizing Eðu; KÞ as " ! 0. As (2) is an elliptic problem its minimizers satisfy the Euler-Lagrange differential equation both for u and z. So, we need to: 1) develop u and z Euler-Lagrange equations and 2) discretize g, u, and z on a uniform grid of N Â N nodes with mesh size h-thus, they become arrays g i;j ; u i;j ; z i;j . Together, they correspond to a nonlinear system of equations for the 2N 2 unknown values u i;j ; z i;j . Employing an iterative algorithm as the nonlinear Gauss-Seidel method [5] , we obtain 
From (3), it can be observed that noteworthy computational complexity is required to compute u i;j and z i;j . For a complete formal derivation of (3), refer to [5] .
PARAMETERS, ITERATIONS, AND FINITE PRECISION EFFECTS

Parameters Range Choice
Obtaining optimal values for , , and " is not straightforward; however, different approaches can be found in the literature [3] , [9] , [13] and this selection task is not a goal of this work. Nevertheless, selecting , , and " as powers of 2 would lead to strongly reduce the computational complexity [14] . In this paper, we will discuss the ranges suggested in [14] , namely, " 2 ½1=32; 1=2, 2 ½1=256; 1=2, and 2 ½1; 64 with the image values in the range ½0; 1. Fig. 1 shows the results obtained varying as a power of 2 in ½1; 64 with h ¼ 1, " ¼ 1=16, and ¼ 1=256 on the image "foreman:" a wide spectrum of images from smoother to sharper are covered. Fig. 2 shows the mean error obtained varying " and as powers of 2 in the ranges ½1=32; 1=2 and ½1=256; 1=2, respectively, with ¼ 8 for the same image: It can be noticed that the error is rather limited. Moreover, a similar behavior has been observed for several other images and parameters choices; thus, the performed analysis shows that the choice of , , and " as powers of 2 results in acceptable approximations of the Mumford and Shah functional. The solution of (2) with the nonlinear Gauss-Seidel method implies certain initial conditions (i.e., the values at iteration 0), that can be chosen, according to [5] , as u ð0Þ i;j ¼ g i;j and z ð0Þ i;j ¼ 1. Moreover, a proper number of iterations is required to obtain good results. From the direct implementation of (3) on some QCIF and CIF images, with h ¼ 1, ¼ 8, ¼ 1=256, and " ¼ 1=16, we can observe that, after approximately 20 iterations, the increase of accuracy on u and z (from the kth to the k þ 1th iteration) tends to saturate in terms of Mean Square Error (MSE) and Peak Signal to Noise Ratio (PSNR). In Fig. 3 , this behavior is shown for different images ("Foreman," "Mobile," "Tempete," and "Paris") together with the mean value (dashed curves). The solid curves, representing the PSNR obtained from the mean MSE, better show the saturation phenomenon.
In the following, we will fix the Mumford and Shah functional parameters as h ¼ 1, ¼ 8, ¼ 1=256, " ¼ 1=16, and the maximum number of iterations to 20: With this set of values, we will concentrate on the Mumford and Shah functional (3) sensitivity to finite precision effects.
Finite Precision Effects
Since g i;j , u i;j , and z i;j are in the range ½0; 1, and ! 1, the following inequalities hold true for the numerator (Nu) and the denominator (Du) of u i;j in (3): 0 Nu 4 þ 1= and 1= Du 4 þ 1=, so that, 3 bits are needed for the integer part of Nu and Du. Analogously, we have 1=" for the integer part of Dz are needed. To reduce this number, we rewrite (3) (z i;j ) with h ¼ 1 as
Now, we have: 1= Nz 16" 2 = þ 1= and
With the range considered for , , and ", Nz and Dz need at least 3 and 9 bits, respectively, for the integer part. Although different choices of parameters could also be of interest, in the rest of the paper, analysis and synthesis results will be limited to the case Fig. 4 , we can conclude that a hardware dedicated implementation, employing m ! 16 would be able to achieve very good performance.
IMPLEMENTATION ISSUES
In [14] , software implementations on modern microprocessors of the Mumford and Shah functional are addressed, showing that high frame rate sequences cannot be processed in real-time; in fact, to perform 20 iterations on a QCIF frame 4 s and 12.5 s on the highperformance Texas Instruments TMS320C6711 and on the StrongARM SA-1100, respectively. Thus, in a system where real-time processing is needed, as the video scenario, a dedicated VLSI implementation would be required. From the analysis performed in Section 3, it can be observed that it is difficult to fix the values of the Mumford and Shah functional parameters, the number of iterations, and the number of bits m devoted to the representation of the fractional part of the data. To obtain a highly reusable architecture, a parametric VHDL model has been developed, focusing on high performance in terms of number of sustained frame, while granting high quality of the resulting u and z. In the literature, the implementation of the Gauss-Seidel method for the solution of equations systems has already been addressed. However, solutions proposed in the literature can not be straightforwardly adapted to the case of the Mumford and Shah functional. In fact, many works describe the implementation of the GaussSeidel method with particular emphasis on serial and shared memory parallel computing [15] , [16] or on systolic solutions [17] . As pointed out in [18] , there is a strong data dependency between the pixels produced with the Gauss-Seidel algorithm. In the following, we will refer to the neighbors of the considered pixel as north, south, east, and west: x iÀ1;j ¼ x N , x iþ1;j ¼ x S , x i;jþ1 ¼ x E , and x i;jÀ1 ¼ x W , where x can be either u or z.
The proposed architecture depicted in Fig. 5a is composed of 1. three buffers to store, respectively, the original image g, the approximation u, and the edges z, 2. two address generation units devoted to manage the correct scanning order to elaborate the samples, 3. a complex data-path to implement (3) and (4), and 4. a simple memory interface unit to load all the data required to perform the operations described by (3) and (4).
Address Unit
As it can be observed from (3) and (4), u i;j (and z i;j ) depends on north, south, east, and west values. Thus, scanning the image by rows and columns (or vice-versa by columns and rows) to evaluate the new value of u i;jþ1 the computation of u i;j and z i;j must be completed. In fact, for u i;jþ1 , the pixels u i;j and z i;j act as u W and z W , respectively. Instead of scanning the image row-wise and columnwise (or vice-versa), a "diagonal" scanning order can be employed (Fig. 6a) . A high latency for generating the first values is required, while, for the following pixels, the latency decreases as the pipeline is filled. Moreover, the algorithm iterative nature can be exploited to pipeline also on the iterations [18] . This improved diagonal scanning order is shown in Fig. 6b , where i and j represent rows and columns directions and k represents the iteration step; the number associated to each pixel represents the order used to process the different values. This scanning order leads to the generation of the addresses depicted in Fig. 7 on the left; they can be generated resorting to some counters and small control logic, as depicted in Fig. 7 on the right, where start À cnt and stop À cnt are up-counters to generate rows (i) and columns (j) current index limits. start À vcnt and stop À vcnt are down-counters to generate new index limits taking into account both the diagonal order and the algorithm iterations. value À cnt are an up-counter (i) for row indices generation and a down-counter (j) for columns indices generation. 
Data-Path
The proposed architecture implements (3) and (4), respectively, to generate both u i;j and z i;j . As it can be inferred from (3) and (4), the architecture bottleneck is the division. In the case of the proposed architecture, the division algorithm needs to be unrolled and pipelined in order to obtain a parallel divider able to sustain high throughput. Moreover, since u i;j and z i;j are in ½0; 1, the numerator will never be greater than the denominator; thus, we do not need to introduce operands normalization, as in most SRT dividers implementations [19] . As depicted in Fig. 5b , to perform these operations, the data-path can process at the same time u N , u S , u W , u E , z N , z S , z W , z E , and g i;j . As described by (3) and (4) z N , z S , z E , and z W can be employed to evaluate the square values required byũ u i;j and u i;j denominator; moreover, they are needed to calculateẑ z i;j . Employing four multipliers, the four square values can be calculated at the same time (topleft in Fig. 5b) ; meanwhile, the four z values are added together to obtainẑ z i;j (top-right in Fig. 5b ). During the same clock cycle, u N , u S , u E , and u W are employed to evaluate the first part of the discrete gradient ru j j 2 i;j and are stored into four registers. Since the discrete gradient is squared and multiplied by four, it can be computed by simply adding together the squared differences; in fact, the following expression holds true 4 ru j j
The discrete gradient calculation has been split into three parts: 1) subtrac-
EW , and 3) addition u y þ u x , where each part is implemented in a separate clock cycle (top-center in Fig. 5b) . While the discrete gradient second part is evaluated, the squared z values are added together to obtain u i;j denominator and multiplied by the proper u input (u N , u S , u E , or u W ) to compute the partial values required bỹ u u i;j . Moreover, in the same clock cycle,ẑ z i;j is shifted by 4" 2 = (see Fig. 5b in the middle) .
During the following clock cycle, the evaluation ofũ u i;j is completed and g i;j (shifted by ) is added to generate u i;j numerator. In the same clock cycle, the computation of u i;j denominator is completed adding to z squared values the term 1=. Moreover, the z i;j numerator and denominator are calculated: The former adding to 4" 2 =ẑ z the term 1=, the latter shifting the discrete gradient by "= and adding 1= þ 16" 2 =. Finally, two dividers are employed to compute in parallel u i;j and z i;j starting from their numerators and denominators (bottom part of Fig. 5b) .
From the results shown in Fig. 4 , it is worth noticing that probably more than 12 bits will be employed for the representation of data fractional part. Since u i;j and z i;j are in ½0; 1, the data-path has been implemented (synthesized, placed and routed) varying the data width representation (m) from 14 to 24 bits. Besides considering the ranges detailed in Section 3 for , , and ", internal data need up to 9 bits for the integer part representation. Results detailed in Fig. 8 show how the frequency achievable by the datapath changes varying the number of bits to represent the output results. From the results reported in Fig. 8 , it can be observed that to make the data-path able to process a new sample at each clock cycle, the address generation unit needs to be connected to a fast memory interface block.
Memory Interface Block
It is reasonable to suppose that u, z, and g values are stored (at least partially) in three buffers. The memory interface block is a simple finite state machine that starting from the matrix row address (i) and the matrix column address (j) generated by the address generation unit is able to load u N , u S , u E , u W , z N , z S , z E , z W , and g i;j from the buffers and to feed the data-path with these values. Exploiting the lower frequency achievable by the data-path, the memory interface block behaves as a sort of serial to parallel component. In Fig. 9a , a block scheme of the memory interface block is shown. The finite state machine (FSM) is devoted to receive the row and column indices, i and j, calculated by the address generation unit. Starting from these values, the FSM selects to add 1, 0, or -1 to it in order to implement i þ 1, i, or i À 1 (analogously with j). The result obtained on the column index J ought to be multiplied by the number of columns (#COLS) and added with the result obtained on the row index i; when an address is correctly generated it is validated by the FSM. Given an address on the proper address bus (g abus, u abus, and z abus), the data buffers (g, u, and z) latch the data required by the memory interface block on the data bus (g dbus, u dbus, and z dbus); finally, the FSM loads each value into its register (bottomright in Fig. 9a ). The memory interface block has been synthesized, placed, and routed on a 0.13 m standard cell technology for QCIF and CIF frames varying m. Since the critical path (for this implementation) is in the address generation (J Â #COLS þ i), the maximum achievable clock frequency depends only on the frame size and not on the data width. In fact, the maximum achievable frequency is about 215 MHz and 190 MHz for QCIF and CIF frames, respectively. Experimental results show that the amount of gates required increases as the data width grows (see Fig. 9b ). Finally, since the data-path produces two data per clock cycle, namely, u i;j and z i;j , they can be sent to the proper buffer resorting to a further address generator a multiplier and an adder.
PROPOSED ARCHITECTURE PERFORMANCE
In the following, we will focus on the case of m ¼ 16 and one bit per clock cycle and we will discuss the global architecture performance. Simulations performed on the proposed architecture VHDL description show that the data-path needs 20 clock cycles to evaluate a new couple of results (u and z). Due to the improved diagonal scanning, the data-path latency decreases as P Lat0 i¼1 i. In fact, after the 20th column is processed, the simulation shows that the architecture produces a couple of new values (u i;j , z i;j ) at each clock cycle. Considering post place and route frequency and that the memory interface should run faster than the data-path to fetch the required data, the proposed architecture can sustain more than 25 QCIF frames per second and about 15 CIF frames per second. In Fig. 10c , the performance achievable with the proposed architecture are summarized and compared with the runtime of the fixed point C model [20] described in [14] and of a competitive software implementation [6] within the Megawave package (Megawave 2.31a) [21] running on a Linux Mandrake 10-Pentium IV at 1.6 GHz with 512 MB of RAM. Post place and route of the whole architecture shows that 4:72mm 2 and 4:73mm 2 are required for QCIF and CIF frames, respectively (see Figs. 10a and 10b) .
Thanks to the parametric VHDL description, the two implementations have been achieved simply changing some parameters in the VHDL source and reperforming the design flow. In Figs. 10a and 10b, it can be observed that, since the QCIF design is slightly smaller than the CIF one, the QCIF chip is also slightly less dense than the CIF one. As far as the power consumption is concerned, post place and route analysis shows that about 337 mW are required for the QCIF architecture and 395 mW for the CIF architecture.
CONCLUSIONS
In this paper, the Mumford and Shah functional has been investigated from the implementation point of view. A detailed study of finite precision representation effect has been carried out, a fast VLSI architecture has been described, and results obtained through the implementation on a 0.13 m standard cells technology have been presented. 
