Abstract Image segmentation is one of the most important tasks in the image processing, and mean shift algorithm is often used for color image segmentation because of its high quality. The computational cost of the mean shift algorithm, however, is high, and it is difficult to realize its real time processing on microprocessors, though many techniques for reducing the cost have been researched. In this paper, we describe an FPGA system for the image segmentation based on the mean shift algorithm. In the image segmentation based on the mean shift algorithm, the image is once over-segmented, and then the small regions are merged considering the similarity between the over-segmented regions to obtain better segmentation. In our system, the mean shift filter is accelerated using a cache memory which can access to all pixels in a w s 9 w s pixel window at arbitrary position. This cache memory allows us to process w s 9 w s pixels in parallel every clock cycle. The region merging is also accelerated by not strictly managing the list structures used for the merging. This loose management causes the redundant and out-of-date data into the list structures, but it makes the pointer dereferences unnecessary, and the overhead by those data can be hidden by pipeline processing. The performance for 768 9 512 pixel images is fast enough for real-time applications.
Introduction
Image segmentation is the process of partitioning an image into multiple regions of interest, and it is generally the first task in many automated image understanding applications. Many algorithms for the image segmentation have been proposed to date, such as [1] [2] [3] [4] . Several hardware systems have been proposed to accelerate those algorithms, such as [5] [6] [7] [8] .
Mean shift algorithm is a procedure which is often used for color image segmentation because of its high quality. In the segmentation based on mean shift algorithm, first, the gradient of each pixel in a given image is calculated using a window centered by the pixel, and then, the pixel is moved along the gradient. This procedure is repeated until no movement will happen, and pixels which reach to the same bottom form a region. In general, many small regions are generated in this step (called over-segmentation). Then, the small regions are merged considering the similarity between the regions to obtain better segmentation (if we try to reduce the number of the regions by only the mean-shift, the results are far from what we expected in most cases). This over-segmentation and merging method is also used in other segmentation techniques such as watershed, K-mean clustering and so on.
The computational cost of a naive implementation of a mean shift algorithm is very high (O(X 9 Y 9 w s 9 w s 9 N r ), where X 9 Y is the image size, w s 9 w s is the window size and N r is the average number of the repetitions). Therefore, many techniques for reducing the computational complexity have been researched (for example, [9] [10] [11] ). However, it is still difficult to realize real-time processing on microprocessors. In Ref. [22] , it was reported that it takes 7 s for processing a 481 9 321 pixel image with Intel Core2 Quad 3 GHz. Therefore, it still takes a few seconds for processing a 768 9 512 pixel image even with the latest CPU. Nevertheless, only few papers about the acceleration of the mean shift algorithm by FPGA and GPU have reported to date because of the irregular memory access sequences required for tracing the movement of pixels along their gradients. In Refs. [12, 13] , the mean shift algorithm was used for object tracking, but the algorithm was applied only to the regions of interest. As for the region merging, to the best of our knowledge, no hardware systems have been proposed (in Ref. [14] , a method for the labeling was proposed) probably because of the inherent sequential and complex data management (this means that we can not expect high performance gain). However, this merging step should also be executed on the same platform to achieve higher performance in total.
In this paper, we describe FPGA implementation of both steps, over-segmentation using a mean shift algorithm and the region merging. The main difficulties in the segmentation are (1) how to read w s 9 w s pixels at arbitrary position in the image in parallel for tracing the movement of the pixels, and (2) how to fulfill the pipeline circuit (the pipeline depth for calculating the gradient is deep, and we need to interleave the calculation of several pixels), and the difficulties in the region merging are (3) the dynamic management of the lists which hold the contiguous regions (those regions are gradually changed as the merging progresses), and (4) the sorting of the pairs of two contiguous regions which are used to choose the two regions to be merged. It is also very important how to define the method to choose a pair to be merged to obtain high quality segmentation. However, the method should be defined according to the requirement for the system (high quality, high processing speed, etc.), and a very simple method is used in our paper, though it does not always give high quality segmentation.
For (1) and (2), we have designed a special cache memory which can access to w s 9 w s pixels at arbitrary position in parallel. Typically, w s is 15-31. Thus, this cache memory reads out 225-961 pixels in different positions every clock cycle. In our implementation, the given image is scanned from top to bottom, and the pixels on L lines are cached into this cache memory. According to our experiments, most images can be processed during one scan. When a pixel moves out of the cached region toward the already processed area, another scan from bottom to top is executed, and the scans are repeated until all pixels stop. As for (3) and (4), we have chosen not to manage the data in the list structures strictly. By this loose management, redundant and out-of-date data may continue to stay in the data structures and increase the amount of the computation. However, this relaxation makes it possible to manage those data in block without pointer dereferences (all data in the block can be accessed continuously), and the increase of the computation time by those data can be hidden by pipeline processing. This paper is organized as follows. The mean shift algorithms are introduced in Sect. 2, and its FPGA implementation is described in Sect. 3. In Sect. 4, an algorithm for the region merging is introduced, and its implementation is described in Sect. 5. Experimental results are shown in Sect. 6, and the conclusions are given in Sect. 7.
Mean shift algorithm
Mean shift analysis is a novel and powerful clustering approach originally reported in Ref. [15] . In spite of its excellent performance, it had been nearly forgotten until Ref. [16] extended it and introduced it to the image analysis community. In recent years, comprehensive analysis and successful applications of the mean shift occurred in the fields of tracking, image segmentation, information fusion, edge detection, clustering and classification, and video processing [9] .
Original mean shift algorithm
First, we briefly review the mean shift algorithm introduced in Ref. [17] mathematically. Given n data points
, the kernel density estimator with kernel function K(x) and a symmetric fixed bandwidth h can be written aŝ 
and is called the mean shift vector. The expression (2) shows that, at location x, the mean shift vector computed with kernel G is proportional to the normalized density gradient estimate obtained with kernel K. The mean shift vector thus always points toward the direction of maximum increase in the density. As the result, the mean shift iteration
is a hill climbing process to the nearest maximum of f h;K ðxÞ:
Mean shift algorithm implemented in our system
An improved mean shift algorithm for the joint spatialrange domain is proposed in Ref. [17] . We implemented this algorithm in our system, because of its high performance and simplicity. Let r c be r(x, y) (r(x, y) is a pixel in a given image), x c = x, y c = y and w s = 2w ? 1. Then, r n , dx n and dy n are calculated as follows. In the equations below, LUV color space is used, and h r is a constant which gives the threshold for choosing pixels close to r c .
and the steps above are repeated from the beginning. In the steps above, first, the number of pixels (m) which are similar to r c is counted. Then, the average of those pixels (r n ) and the movement of the coordinate (dx n and dy n ) are calculated. If the movement is very small, it is considered that the pixel converges to (x c , y c ), and otherwise, the movement is repeatedly calculated by assigning r n , x c ? dx n , y c ? dy n to r c , x c , y c , respectively. By applying this procedure, pixels in the image are moved along to their gradients, and converged to one of the bottoms of the gradients. Figure 1 shows an example how the pixels are moved. In Fig. 1 , pixel (A) and (B) are moved to the same bottom (C), and are considered to belong the same region.
The computational cost of this algorithm is O(X 9 Y 9 w s 9 w s 9 N r ) (N r is the average number of the repetitions).
Implementation method of mean shift algorithm
The computation of the mean shift algorithm described in Sect. 2.2 can be summarized as follows.
1. ðx c ; y c Þ ðx; yÞ and r c rðx; yÞ: 2. For each pixel at (x c , y c ), read out w s 9 w s pixels centered by (x c , y c ) in parallel, and count the number of pixels similar to r c . 3. At the same time, calculate (a) the sum of LUV of those pixels, and (b) the sum of their coordinates and calculate the average of them (r n and (dx n , dy n )).
4. Check the convergence, and repeat the steps above if not converged by assigning r n and (x c ? dx n , y c ? y n ) to r c and (x c , y c ).
For processing these steps efficiently, we need a data mapping method which enables the parallel access to w s 9 w s pixels centered by any (x, y). Figure 2 shows a data mapping method used in our system. In Fig. 2 , w s = 4 and all pixels in the image are mapped on a memory array which consists of w s 9 w s memory banks. Pixels labeled 'k' (k = '0' -'f') in the image are stored in the memory banks labeled 'k' ('0' -'f'), and w s 9 w s pixels (each being labeled one of '0' -'f') can be accessed in parallel. In this mapping method, the first w s 9 w s pixels on the upper-left corner of the image are placed on the same plain (addr = 0) of the memory array, the next w s 9 w s pixels are on the next plain (addr = 1) and so on. Suppose that we are going to read w s 9 w s pixels from (x, y) (the gray box in the image) (here, w s 9 w s pixels whose upper left corner is (x, y) are used to simplify the explanation). This box lies down on four plains (addr = 5, 6, 9, 10). To each of the w s 9 w s memory banks, different addresses are given as follows. 
Data mapping method
x offset and y offset are constants given to each column and row in advance. Then, the pixels in the gray box are read out from the memory banks using the different addresses. r n , dx n and dy n are calculated on the mean shift unit. In this computation, the positions of the pixels that appear on the memory array are different from those in the gray box in the image. In order to calculate dx n and dy n correctly, we need to give the true dx and dy to the mean shift unit. For this purpose, two kinds of tables are used (dx and dy table, and x and y table). There are only w s patterns of dx and dy (the range of dx and dy is [0,w s -1]). Therefore, the patterns are stored in dx and dy table (their depth is w s ), and these tables are accessed via x and y table whose depth is X and Y.
In an actual implementation, w s is an odd number (w s = 2w ? 1), and the coordinate of the center pixel is given. In this case, x -w and y -w are given to the circuit instead of x and y, and the range of dx and dy is changed to [ -w, ?w]. Then, dx = -w -1 and dy = -w -1 are used to mask the data on the memory array. With this data mapping method, we can apply the mean shift to any pixel in the image with a simple circuit.
Actual data mapping on FPGA
The size of on-chip memory banks is, however, not so large in practice, and we cannot store the pixels of whole image. Therefore, pixels of only L lines can be stored in the memory array (Fig. 3) . When the depth of each memory banks of the memory array is D, L becomes D 9 w s 9 w s / X. In this case, the processing order of the pixels in the image becomes very important. In our implementation, the image is scanned twice (or once if all pixels converged during the first scan) as shown in Fig. 4 . In Fig. 4a , the pixels are scanned from top to bottom. When the pixels of first L ? 1 lines are read into FPGA, the mean shift filter is applied to the pixels on y = 0. Then, the pixels on the next lines are read from the image, and the filter is applied to the pixels on y = 1. This sequence is repeated, and pixels of up to L lines are cached on the FPGA. The mean shift filter continues to be applied on the pixel on lth line in the L lines (Fig. 4a) .
Three cases happen during the computation:
1. the pixel converges to the bottom in the L lines, 2. the pixel moves upward and goes out of the L lines, 3. the pixel moves downward and goes out of the L lines.
When the second case happens, the pixel (r n , (x c ? dx n , y c ? dy n ) and its original coordinate) is put in queue-u, and the processing of the pixel is suspended. After scanning the image to the bottom, the scan is restarted from the bottom. When the target line of this scan reaches to y c ? dy c , the processing of the pixel is resumed. This is because the pixel will continue to move upward with high probability. When the third case happens, the pixel is put in queue-d, and the processing is suspended. The processing of the pixel is resumed when the target line reaches to y c ? dy n ? L/2. This is because the pixel will continue to move downward, and if we resume the processing of the pixel immediately, the pixel will move out of the L lines again.
System architecture
Another problem of this implementation is the long delay caused by the long feedback in the mean shift unit. In the mean shift filter, we need to sum up values of w s 9 w s The computation method pixels to calculate r n , dx n and dy n , and the depth of the mean shift unit becomes deeper than log (w s 9 w s ). Figure 5 shows a block diagram of our circuit. Suppose that pixels of L ? 1 lines have been buffered in the memory array. Then, the pixel at (0, 0) is processed first by the mean shift unit. w s 9 w s pixels centered by (0, 0) are given to the ABS&SUM unit array (pixels out of the image are masked in the ABS&SUM unit array), and r n , dx n and dy n for the pixel at (0, 0) are calculated. In this case, r c [namely, the pixel at (0, 0)] is given from the memory array to the mean shift unit. It takes d [d [ log (w s 9 w s )] clock cycles to calculate them. In order to hide this long latency, the computation of the next pixel [the pixel at (0, 1)] is started immediately. Then, pixels to (0, d -1) are processed continuously. At clock cycle d, r n , dx n and dy n for the pixel at (0, 0) come out of the average unit, and they are given to the ABS&SUM unit array and the memory array [to read out w s 9 w s pixels centered by (0 ? dx n , 0 ? dy n )] to trace the movement of the pixel at (0, 0). At the next clock cycle, r n , dx n and dy n for the pixel at (0, 1) come out of the average unit, and is processed in the same way. In this way, d pixels are continuously processed in the mean shift unit. When one of them converges, or moves out of the L lines (it is put in one of the queues), the pixel at (0, d) is newly processed. By repeating this sequence, the mean shift unit can always be filled by d pixels. The pixels in the queues are processed before starting the calculation of the pixels on the target line.
A method to reduce the memory size
The problem of the implementation described in the previous section is the large number of the on-chip memory banks used for the memory array. When we consider to implement the circuit on Xilinx FPGAs, 15 9 15 block RAMs are necessary when w s = 15, and 961 when w s = 31.
Block RAMs in Xilinx FPGAs can be configured as 512 9 36b. Therefore, using two block RAMs, we can store pixels of three columns (the data width of a pixel is 24 b). In this case, however, three pixels on the two banks have to be read as one set. As shown in Fig. 6 , w s is expanded to a multiple of three (18 when w s = 15, and 36 when w s = 31). Figure 7 shows how to access w s 9 w s pixels with this implementation method. Suppose that we are going to read 15 9 15 pixels centered by (x, y). These pixels lie on four memory plains of the memory array (addr = 1, 2, 5, 6), and three pixels in the array have to be read as one set. Therefore, 18 9 15 pixels are read out from the memory as shown in Fig. 7 , and three columns are masked using dx table (in Fig. 7 , X shows the mask, and other values show dx). In order to give the addresses to the memory array, we need to divide x by 18 and y by 15. In order to avoid these divide operations, the addresses are stored in the x and y tables as well as in the dx and dy tables. Figure 8 shows the actual data mapping on the memory array. The memory array consists of 6 9 15 block RAMs when w s = 15. The first six column of each memory plain are stored in the first 2 9 15 block RAMs (the first three columns are stored in the first half of the block RAMs, and the next three columns (gray parts in Fig. 8 ) are stored in the second half), and these six columns are read out in parallel using the dual read of block RAMs. Other columns are also grouped by six columns and stored in the block RAMs in the same way. With this implementation, we cannot update the data in the memory array, while the pixels on the target line are processed accessing the memory array. In order to reduce the time for updating data in the memory array, shallow buffers are provided, and the pixels of the next line are downloaded to these buffers, while the pixels on the target line are processed. Then, the mean shift unit is stopped (it is still filled by d pixels while it is stopped), and the data in the buffers are moved to the memory array in parallel. The time for updating the memory array is very small (only X (90 9 1.5) clock cycles per line when w s = 15).
Region merging
In the image segmentation based on mean-shift algorithm, the image is once divided into many small regions (oversegmentation), and then, the small regions are gradually merged to obtain better segmentation. Many region merging algorithms have been proposed [18] [19] [20] . The merging algorithms can be summarized as follows.
1. For each region labeled 'u', make a list l c ('u') which holds the regions contiguous to 'u'. 2. List up all pairs of regions ('u', 'v') which are contiguous to each other. 3. Calculate the distance between the two regions in the pairs using a given function f. 4. Choose the pair which gives the minimum distance, and merge the two regions. 5. Update l c of the merged regions. 6. Repeat step 2 to 5 while the minimum distance is smaller than a given threshold.
In most software algorithms, a global function is defined, and the distances between two regions are calculated considering the global balance among the pairs. In some cases, this distance calculation may become the bottle-neck of the merging, and not be able to implement on hardware systems easily. In this paper, to give a general framework, we focus on the other steps than the distance calculation, (2) and we consider a simple function f which calculates the distances using only local information.
Our approach of region merging
In our implementation described in Sect. 3, the outputs by the mean shift unit are color images in which the number of colors is reduced from the original image. Therefore, first, we need to give a unique label to the pixels which are contiguous to each other and have same color. Our approach consists of the four steps below.
1. The image is scanned, and the unique labels are given to the regions (a region consists of the contiguous pixels of the same color). The regions which consist of only one pixel (one-pixel region) are put aside and are processed in the last step. 2. The image is scanned again, and for each region 'u', the list which holds their neighbor regions l c ('u') is generated. At the same time, the distances between the two contiguous regions are calculated, and the pairs of the two regions are sorted according to their distances. 3. Two regions are repeatedly merged according to their distances. l c ('u') is updated when 'u' and 'v' ('u' \ 'v') are merged. 4. The one-pixel regions are merged to their larger neighbor regions.
The one-pixel regions are processed separately because the number of the regions is too many to manage them using the table in one FPGA, and the final results are not significantly affected by those small regions.
The first scan
First, we need to give a unique label to each region. In our approach, the image is scanned from (0, 0) to (Y -1, X -1). In this scanning, the color of the pixel at (y, x) (I(y, x)) is compared with the color of its four neighbors as shown in Fig. 9a . If the color of I(y, x) is different from its all four neighbors, a new region label is given to I(y, x). If the same as one of its neighbors, its region label is copied to I(y, x) [in Fig. 9a , if I(y, x) has the same color as the pixel 2, the region label of the pixel 2 is copied to I(y, x)]. Then, the region label of I(y, x) is output to the off-chip memory bank. During this first scan, the regions which consists of only one pixels are also detected, and their addresses (y, x) are stored in the queue in the off-chip memory bank. According to our experiments, the number of those onepixel regions is 1-13 % of the total pixels in the image when w s = 15 or 31 for all tested benchmark images. In our current implementation, these regions are left unprocessed in the queue, and after other regions are merged, the one-pixel regions are merged to them, to improve the performance of the system. It may seem that 1-13 % of the image is not negligible, and those one-pixel regions should be merged from the beginning. However, those regions usually exist among larger regions (namely, around the edges of the objects in the image) and do not have the significant effect on the final result. Suppose that there is a u-shaped region in the image as shown in Fig. 9b . Then, the region label 'm' is given to the pixel 'a', and 'n' is given to 'b', because their color is not same as their four neighbors, though 'a' and 'b' are parts of the same region. When comparing the color of I(y, x) with its four neighbors, the color of the pixel at '1' and '3', and '4' and '3' (see Fig. 9a ) are also compared, and if they are equal, they are recognized to belong to the same region. In Fig. 9b , when comparing the color of I(y, x), it is found that the label 'm' and 'n' were given to the same region. The region table shown in Fig. 10 is used to record the equality among the region labels. This table has seven fields, and is used to store other informations about the regions.
weight & color The weight field shows the number of the pixels which belong to the region. When a new region label is assigned to a pixel, the weight is set to one, and the color of the pixel is copied to the color field.
index When we notice that two labels ('m' [ 'n') were given to the same region, a pointer is set in the index field of 'm' to record that 'm' and 'n' are in the same region as shown in Fig. 10 . The weight of 'm' is added to the weight of 'n'. addr and len The addr field is used to hold the address of a block which stores the labels of the regions which are contiguous to this region. In the first scan, the len field is used to estimate the maximum number of the regions which are contiguous to the region [namely, the length of l c ('u')]. Then, in the second scan, it shows how many labels have been stored in the block pointed by addr.
tag The tag is used to remove the redundancy in l c . When the first image scan is finished, the index in the region table is scanned from the top, and the pointers are With this dereference, every index directly points to the true region label.
During the image scan, if the label given to I(y, x) is different from the labels of its four neighbors, the len of the label is counted up to estimate the length of l c ('u'). Suppose that the labels of its four neighbors are 'a', 'a', 'b' and 'c', and the label of I(y, x) is 'c', the len is counted up by 2 ('c' is contiguous to 'a' and 'b'). However, if we repeat this counting up on every pixel, the len is counted up many times by the same pairs of the labels. For example, if I(y, x) and I(y, x ? 1) belong to the same region labeled 'c', and they are contiguous to 'a', then, the len of 'c' is counted up twice. In order to prevent this redundant counting up strictly, we need to manage l c ('u') strictly using list structures. However, this strict management of l c is not easy on hardware systems. Instead, we have used the unit shown in Fig. 11 to prevent the redundant counting up as much as possible. In Fig. 11 , suppose that the label of I(y, x) was set to 'c', because its color is the same as I(y, x -1). Then, I(y, x) is contiguous to 'a' and 'b', and the len of 'c' has to be counted up by 2. However, I(y, x -1) is also contiguous to 'a' and 'b', and the len of 'c' has already been counted up for 'a' and 'b'. In Fig. 11 , in order to prevent this redundant counting up, four buffers and the comparator unit are used. Here, we denote a pair of the two contiguous regions as ('a', 'b' ). When the len of 'c' is counted up by ('c', 'a') at (y -1, x) , it is recorded at x -1 of one of the four buffers. When ('c', 'a') is given from the upper-left pixel, it is written into buf0. In the same way, buf1,2,3 are used for the upper, upper-right and left pixels. These pairs are kept in the buffers until they are overwritten. The pairs in the four buffers are read out onto the register array (its size is 5 9 4, and the pairs which were detected at x -2 to x ? 2 are held on it), and are compared with the four new pairs in the comparator unit in parallel. In Fig. 11 , the four pairs ('c', 'a'), ('c', 'b'), ('c', b') and ('c', 'c') are compared with the 20 pairs on the shift registers [('c', 'c') is discarded immediately]. If a new pair is equal to one of the 20 pairs on the registers, the new pair is discarded. The four new pairs are also compared with each other, and the redundant pairs are discarded. According to our experiments, the average length of l c is 7-10, and about 20 % longer than when strictly managed. This is short enough for our purpose. When two regions ('a' \ 'b') are merged, the len of 'b' is added to that of 'a' as well as the weight. Note that the length of l c is evaluated in this phase, but l c itself is not constructed.
The second scan
In the first scan, the region label of (y, x) is stored in the off-chip memory bank. In the second scan, those labels are read back, and dereferenced using the region table to obtain the true labels. Then, the unit shown in Fig. 11 is used again to detect which region is contiguous to which region. Suppose that 1. the pair of two contiguous regions ('c', 'a') is detected, 2. this is the first pair for 'c', and the third for 'a', and 3. b_tail is an address to the other off-chip memory bank, which is initialized to init_addr [ 0.
Then, the following operations are applied to 'c' and 'a' (Fig. 12 shows what happens by the operations).
1. Because this is the first pair for 'c' (this can be checked if its addr is zero or not), its addr is set to b_tail, and b_tail is incremented by its len?d (d is a margin for the estimated length of l c , because in some cases, it Fig. 11 The label pair comparator cannot be estimated correctly in the first scan). By allocating a block of len?d words in advance, we can store all contiguous regions to 'c' in this block. Then, 'a' is stored in the off-chip memory bank using its addr as the address of the off-chip memory bank, and its len is reset to one. 2. As for 'a', 'c' is stored in the off-chip memory bank using its addr?len as the address, and len is incremented. 3. At the same time, the distance d between 'c' and 'a' is calculated, and if the distance is smaller than a given threshold MAX_DIST, the pair ('c' and 'a') is sorted in the queue pointed by d. The queue are managed using their heads and tails in the dist.table.
As described in the previous subsection, the pairs of the contiguous regions are not managed strictly. Therefore, some same pairs may be stored in the queue of the same distance.
Region merging
In this step, first, the dist.table is scanned from d min (the current minimum distance), and a pair with the minimum distance is read out from its queue. Suppose that the pair is ('c', 'a') and 'a' \ 'c'. Then, the following operations are applied to 'a' and 'c' (Fig. 13 show what happens by the operations).
1. The weight of 'c' (w c ) is updated to w a ' = w a ? w c 2. The color of 'a' is updated to c a ' = (c a 9 w a ? cc 9 w c )/(w a ? w c ) 3. The weight of 'c' is changed to zero, which means that this region is discarded (out-of-date). 4. Each region label in l c ('a') and l c ('c') is read out sequentially from the off-chip memory bank, and written into the temporal buffer, (a) if the region label is not 'a', not 'c' and not outof-date (its weight is not zero) and (b) if the tag of the region is zero.
Otherwise, the region label is discarded. When the region label is copied into the buffer, its tag is set to one. The tag is used to prevent copying the same region more than once.
5. When the region label 'u' is copied into the buffer, the distance d between 'a' (its color has been updated) and 'u' is recalculated. In this case, the pair ('a', 'u') may already be in one of the queues according to the distance which was calculated using the previous color of 'a'. Normally, the pair ('a','u') should be removed from the queue, and put in queue [d] which corresponds to the new distance d. However, in our approach, the old ('a','u') is not removed, and the approach, the old ('a','u') is not removed, and the new ('a','u') is just put into the queue [d] . This means that two ('a','u') exist in two different queues (or in the same queue if the color of 'a' is not changed by this merging). 6. The len of 'a' is set to the number of the regions which are copied into the temporal buffer (8 in Fig. 13 ). 7. A new block whose size is the len of 'a' is newly allocated in the off-chip memory bank, and the addr of 'a' points the block. Then, the region labels in the temporal buffer is written back into the block. During this copying, the tag of the regions is reset to zero. 8. Then, the dist.table is scanned from d min again, and a pair with the minimum distance is read out from one of the queues. As described above, the out-of-data pairs are not removed from the queues in our approach. Suppose that the pair ('a', 'b') is given from queue [d] .
In order to verify that the pair is valid one, (a) the weight of 'a' and 'b' is checked if they are zero or not, and (b) the distance between 'a' and 'b' is recalculated, and checked if it becomes d.
If the pair is not valid, it is discarded, and the next candidate is read out from the queues. In our implementation, several candidates are read out, and they are checked if they are valid or not on the pipelined unit. Then, the merging procedure is repeated from the step 1.
The two blocks which were used to hold l c ('a') and l c ('c') become unnecessary when they are merged, but they are not garbage-collected in our approach. According to our experiments, their total size is less than 512K words, and can be easily stored in an off-chip memory bank. The merging procedures above are repeated while pairs whose distance is less than MAX_DIST exist.
Region merging of the small regions
After merging regions whose size is greater than one, the small regions which consist of only one pixel are merged to their neighbor larger regions. First, the address of a small region is read out from the buffer in the off-chip memory bank. Then, its color is read out from the image data in another off-chip memory bank. At the same time, the region label of its eight neighbors is read out from the region label array (label '0' is given to the small regions during the first scan) sequentially, and the distance to it is calculated on the pipelined unit. The colors of the neighbors are given from the region table. Then, the closest neighbor is chosen, and the small label is merged to the neighbor.
Experimental results
We have implemented the circuits for the mean shift algorithm and the region merging on Xilinx XC4VLX160 on RC2000-4 FPGA board. In this implementation, the data widths of the registers are wide enough (up to 18b), and no overflow nor truncation happens for summing up the pixel values. The divide operation (1/m in the equations in Sect. 2) is executed using a multiplier by multiplying 1/m which has been calculated in advance and stored in a table (18b width) on the FPGA. These data widths are wide enough, and the accuracy loss can be ignored. The circuit for the mean shift algorithm uses 23.3 KLUTs and 96 block RAMs when w s = 15 (90 as the memory array). When w s = 31, the circuit size becomes almost fourfold, and the number of block RAMs required for the memory array becomes 372. The circuit for the region merging uses about 9.5K LUTs. 92, 183 and 274 block RAMs are used when the size of the region table (N rt ) is 16K 9 1, 16K 9 2,and 16K 9 3 (one block RAM is used for dist.table). Because of the limitation of the number of block RAMs, the combination of the circuits which can be implemented on one XC4VLX160 is limited. The possible combinations are w s = 15 and N rt B 32K. The performance of the filter circuit for w s = 31 was evaluated using a software simulator, and that of the merging circuit for N rt = 48K was evaluated separately from the filter circuit. The operational frequency of the two circuits is 138.8 and 166.6 MHz, respectively, and 138.8 MHz was used for the performance evaluation. Table 1 shows the performance of the circuits for four bench mark images [21]: 'average' and 'max' show how long each pixel is moved by the mean shift filter until it stops, and N r shows how many times the mean shift is applied to each pixel in average (the performance is almost proportional to N r ). 'In' and 'out' show the number of the regions before and after the region merging, and 'execution time' shows the ratio spent in the four steps described in Sect. 5. 'fps mf ' and 'fps rm ' show the processing speed of the mean shift filter and the region merging, respectively, and 'fps t ' shows the total performance when the two circuits are applied sequentially to the images.
In all tested cases, all pixels converge during the first scan of the mean shift filter. Table 2 shows the number of the lines (L) which can be stored in the memory array. L becomes smaller as X becomes larger, because the number of the block RAMs is kept constant. When X = 768 and w s = 15, L is 75 (l is 5), and this is larger than the maximum move along the y axis (40). When w s = 31, the maximum becomes 80, but L = 372 in this case. 'fps mf ' becomes almost half by enlarging w s , because N r becomes larger, but the number of the regions generated by the filter ('in') becomes smaller, and 'fps rm ' becomes faster. As the result, 'fps t ' is fast enough for real-time applications when the image size is not larger than 768 9 512. As far as we know, no hardware systems which apply the mean-shift filter on all pixels in the image, nor no systems for region merging have been proposed as described in Sect. 1. Therefore, we cannot compare the performance of our system with other works, but our system is the first realtime segmentation system based on the mean-shift filter. Figure 14 shows the original, mean-shifted, and regionmerged images (w s = 31) for two benchmarks, peppers and tulips.
We need an FPGA with 400 block RAMs (the block RAMs can be shared between the two circuits, though in this experiment, we did not share them for the design simplicity) for achieving all combination of w s and N rt on one FPGA. Recent FPGAs have more than 1,000 block RAMs, and are large enough.
Conclusions
In this paper, we have described an image segmentation method on FPGA using the mean shift algorithm. In the image segmentation based on the mean shift algorithm, the image is once over-segmented, and then the small regions are merged. By caching L lines of the given image, and by processing the pixels in them in proper order, we can apply the mean shift filter to all pixels in the image efficiently. The region merging is an inherently sequential process, and the acceleration by hardware systems is not easy. We have shown that we can achieve real-time processing by relaxing the data management and enabling the pipeline processing of the data. Our implementation requires a number of block RAMs (about 400), but recent FPGAs support more number of block RAMs, and our implementation is feasible for those FPGAs.
In our current implementation, the distance between two regions is calculated using only local relation of the two regions. We need to calculate the distance by considering the global relation of the regions to obtain better segmentation. That is our main future work. 
