I. INTRODUCTION
The Euclidean distance map of a binary image is widely used in image processing, morphology, pattern recognition and artificial intelligence. Euclidean distance is accurate and has a direct physical meaning. Compared with other commonly used distance definitions, however, the Euclidean distance transform (EDT) has a relatively high computational complexity.
Many methods have been proposed to reduce the computational cost of the EDT. When the accuracy requirement is not high, the city-block and chessboard distances are commonly used as approximations of the Euclidean distance. For these two distances, a raster-scanning algorithm works well and its time complexity is $O(N^2)$ [1]. Other algorithms improve and optimize the EDT directly. For an $N \times N$ binary image, the raster-scanning algorithm cannot obtain exact Euclidean distance values for all pixels [2]. The order-communication algorithm obtains the Euclidean distance for the whole image, but some pixels are calculated many times, so its time complexity can reach $O(N^3)$ in the worst case [3]. The divide-and-conquer algorithm computes the distance transform column by column, with a time complexity of $O(N^2 \log N)$ [4]. The algorithm based on the construction of the Voronoi diagram reduces the time complexity to $O(N^2)$ [5]. The independent-scanning algorithm is suitable for the Euclidean, city-block and chessboard distances. It first performs column scanning to acquire the distances of the column and row, and then performs row scanning to obtain the Euclidean distance. Its time complexity is also $O(N^2)$ [6]. For the algorithms in [4], [5] and [6], the time complexity is of the same order of magnitude as the input binary image, so they are called linear Euclidean distance transform algorithms.
These algorithms meet the needs of most applications, but in highly real-time image processing fields such as pattern recognition and artificial intelligence, the shorter the computing time, the better. Such requirements have prompted parallel EDT algorithms. One parallel algorithm is based on the hypercube computer and the grid interconnection tree network, and its time complexity reaches $O(N)$ [7]. An algorithm based on a two-dimensional torus-connected processor array also brings the time complexity to $O(N)$ [8]. However, these two algorithms require many independent processors, so the circuit is large, which works against system integration. To reduce circuit size and achieve a real hardware implementation, some research has been based on VLSI. One design proposes a hardware architecture in which the processing is divided into row processing and column processing; each is performed by $N$ processors and the time complexity is $O(N^2)$ [9]. Another proposes an $N \times N$ processing unit array and reduces the computing time to $(5N-4)$ clocks [10].
This paper is organized as follows. In Section II, the independent-scanning linear EDT algorithm is briefly reviewed. The improved hardware algorithm is described in Section III. In Section IV, the improvement of the hardware system architecture is detailed. Finally, some design details and the FPGA-based verification are shown in Section V. In this paper, we avoid the use of multipliers and improve the system architecture so that the computing time for an $N \times N$ binary image is $(2N+1)$ clocks.
II. INDEPENDENT-SCANNING LINEAR EDT ALGORITHM
Euclidean distance is an important index for measuring the positional relationship between the pixels of a binary image. It is widely used in contour extraction, skeleton extraction and other image processing tasks.
A. EDT applied to binary images
The EDT has a high time complexity. For the order-communication algorithm, the computing time increases nonlinearly as the image size increases, and the real-time performance is poor. Linear EDT algorithms were therefore proposed. One is the independent-scanning algorithm proposed by Hirata; the other is the algorithm based on the construction of the Voronoi diagram. The time complexity of both is $O(N^2)$.
B. Hardware algorithm selection and analysis
A hardware algorithm needs to take advantage of parallel computing and avoid complicated calculations that are impractical in hardware. Compared with the algorithm based on the construction of the Voronoi diagram, the independent-scanning algorithm can be separated into relatively independent stages with few complex calculations. The hardware algorithm is therefore based on the independent-scanning algorithm.
The basic idea of the independent-scanning algorithm is to divide the EDT into two relatively independent stages. In the first stage, $T_1$
In the second stage, $T_2$ adds the squared row distances to the corresponding squared column distances to obtain the Euclidean distance expressions. Drawing these expressions as parabolic functions of the row coordinate, the lower envelope of the functions gives the Euclidean distance for each pixel of the row. The two stages have a sequential, hierarchical relationship with no direct correlation between their data, which is useful for realizing a hardware pipeline architecture.
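The two stages can be illustrated with a minimal software sketch. It assumes that pixels of value 0 are the reference pixels, that the "reasonable maximum" standing in for an undefined distance is the sum of the image dimensions, and that the second stage compares all discretized parabolas by brute force instead of building the lower envelope, so it does not run in linear time. The function names are hypothetical.

```python
def column_distances(row, big):
    """Stage T1 sketch: distance along one row to the nearest 0-pixel."""
    n = len(row)
    dist = [big] * n
    d = big
    for j in range(n):                      # forward scan
        d = 0 if row[j] == 0 else min(d + 1, big)
        dist[j] = d
    d = big
    for j in range(n - 1, -1, -1):          # reverse scan
        d = 0 if row[j] == 0 else min(d + 1, big)
        dist[j] = min(dist[j], d)
    return dist


def edt_squared(image):
    """Stage T2 sketch: squared EDT by comparing discretized parabolas."""
    rows, cols = len(image), len(image[0])
    big = rows + cols                       # assumed stand-in for 'undefined'
    b = [column_distances(r, big) for r in image]
    out = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        for i in range(rows):
            # Parabola of row k evaluated at row coordinate i.
            out[i][j] = min(b[k][j] ** 2 + (i - k) ** 2 for k in range(rows))
    return out


if __name__ == "__main__":
    frame = [[1, 1, 0, 1],
             [1, 0, 1, 0],
             [1, 0, 1, 1],
             [1, 1, 1, 1]]
    for line in edt_squared(frame):
        print(line)
```

Because the first stage only looks along rows and the second only along columns, each stage can be handed to a separate block of a pipeline.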
The column distance can be calculated through the 
III. THE IMPROVEMENT FOR HARDWARE
The independent-scanning algorithm has the advantage that its independent stages can be computed in parallel. However, it has certain limitations for hardware realization, so the algorithm needs to be adapted to the characteristics of hardware design. Like the independent-scanning algorithm, the improved hardware algorithm is divided into two stages, $T_1$ and $T_2$.
A. The improvement on $T_1$
In the independent-scanning algorithm, the column distance is the smaller of the forward scanning result and the reverse scanning result. The calculation process is shown in Fig. 1, where 'U' means that the column distance is undefined in the scan. In actual processing, 'U' can be replaced by a reasonable maximum. The data width of the column distance increases with the one-dimensional size of the image, which consumes a lot of system storage resources. Furthermore, the method consists of forward and reverse scanning and has a complicated calculation process, which degrades the real-time performance.
To reduce the time complexity, the bidirectional scans can be replaced by a unidirectional scan. Its queue structure is shown in Fig. 2. In this structure, each transfer cell has the construction shown in Fig. 3, where each rectangle represents a register. Registers $MB_1$ and $MB_2$ are 1-bit registers used for storing the pixel value. Register DIFF is a 2-bit register used for storing the column distance increment. In the processing of row data, the column distance increment of the data before the first pixel of value 0 is set to $-1$, and the column distance increment of the data after the last pixel of value 0 is set to $+1$. For the pixels between two pixels of value 0, shown in the dashed box of Fig. 1, the first half obtain their column distances from the forward scan and the others from the reverse scan. We can therefore distinguish the latter half of these pixels and modify the corresponding column distance increments by means of $MB_1$ and $MB_2$. The unidirectional scan for an $N$-pixel row is shown in
After modification, $MB_1$ and $MB_2$ of all transfer cells are set to 0. The column distance increments are then transferred to the increment processing cell to obtain the column distances.
With the improved scanning method, calculating the column distance is changed to calculating the column distance increment relative to the adjacent pixel. The column distance increment can be $+1$, $0$ or $-1$, so only two bits of memory are needed to store each increment. This reduces the cost in system storage resources.
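The storage argument can be checked with a small software sketch. It is not the register-level scheme of Fig. 2 and Fig. 3. It only illustrates that, for a row containing at least one pixel of value 0, adjacent column distances differ by at most 1, so the distances can be recovered by accumulating increments. The function names and the choice of the row length as the stand-in for 'U' are assumptions made for illustration.

```python
def column_distances(row):
    """Reference column distances from a forward and a reverse scan."""
    n = len(row)
    big = n                                  # assumed stand-in for 'U'
    dist = [big] * n
    d = big
    for j in range(n):
        d = 0 if row[j] == 0 else min(d + 1, big)
        dist[j] = d
    d = big
    for j in range(n - 1, -1, -1):
        d = 0 if row[j] == 0 else min(d + 1, big)
        dist[j] = min(dist[j], d)
    return dist


def encode_increments(dist):
    """Keep the first distance plus the step between adjacent pixels."""
    return dist[0], [b - a for a, b in zip(dist, dist[1:])]


def decode_increments(first, increments):
    """Accumulate the increments to recover the column distances."""
    dist = [first]
    for inc in increments:
        dist.append(dist[-1] + inc)
    return dist


if __name__ == "__main__":
    row = [1, 1, 0, 1, 1, 1, 0, 1]
    first, incs = encode_increments(column_distances(row))
    print(first, incs)
    print(decode_increments(first, incs))
```

Each increment takes one of three values and therefore fits in two bits, whereas a full column distance needs a width that grows with the row length.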
B. The improvement on $T_2$
The computation of $T_2$ performs column scanning on $B$. In the $T_2$ stage, the Euclidean distances are described as parabolic functions of the row coordinate, and the lower envelope of these functions gives the Euclidean distance for each pixel of the row. To simplify the calculation of the intersections, the Hirata algorithm eliminates the squared row terms from the original expressions and computes the intersections of the resulting straight lines. In this way, the nonlinear operation is converted to a linear one. Then, on the lower envelope formed by the intersections and parabolas, the squared Euclidean distance values can be calculated from the corresponding row coordinates. Because the binary image is discrete, all the parabolas can be discretized at the row coordinates. Comparing the values of every discrete parabolic function at a given row coordinate, the minimum is the squared Euclidean distance.
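The elimination of the squared row term can be written out explicitly. The symbols below are assumptions introduced for illustration: $i$ is the row coordinate at which the parabolas are evaluated, $k$ and $l$ are the rows of two candidate parabolas in the same column, and $b_k$ and $b_l$ are their column distances.

```latex
\begin{aligned}
f_k(i) &= b_k^2 + (i-k)^2, \qquad f_l(i) = b_l^2 + (i-l)^2, \\
f_k(i) = f_l(i) \;&\Longrightarrow\; b_k^2 + k^2 - 2ki = b_l^2 + l^2 - 2li, \\
i &= \frac{(b_l^2 + l^2) - (b_k^2 + k^2)}{2\,(l-k)}, \qquad k \neq l.
\end{aligned}
```

The $i^2$ term appears on both sides and cancels, so the intersection of two parabolas is the intersection of the two straight lines $g_k(i) = b_k^2 + k^2 - 2ki$ and $g_l(i) = b_l^2 + l^2 - 2li$.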
Considering the parallel processing ability of the circuit, the features of the hardware algorithm and the hardware architecture, the improvement of $T_2$ is described as follows:
So the minimum squared distance of the corresponding pixel is the squared Euclidean distance of that pixel. By comparing level by level, the output is the squared Euclidean distance of the pixel.
The hardware architecture of $T_2$ based on the above two ideas is shown in Fig. 5. The arrows in the figure indicate the data flow of the comparison results. The value in the rectangle on each arrow connection is the accumulated squared row distance for the next level. Since the output of $T_2$ is the squared distance, the increment processing cell in $T_1$ needs to be adjusted to output the squared column distance for hardware simplification. The binary image data are independent, and the row distances and column distances are calculated separately. Taking advantage of these two characteristics, if we use $p$ processing units for parallel computation, the time complexity can be reduced to
IV. HARDWARE SYSTEM ARCHITECTURE DESIGN
Based on the improved independent-scanning algorithm and the hardware structure design, a complete hardware calculation system can be designed. We can simply connect $T_1$ and $T_2$ directly to form the complete hardware system shown in Fig. 6. In this architecture, a frame of image pixels has to be read row by row for the EDT. So in the hardware system shown in Fig. 6, $T_1$ and $T_2$ work in a time-shared manner, which is inefficient. To improve the efficiency of the hardware system, the architecture is adjusted. In the new architecture shown in Fig. 7, $T_1$ consists of two data queues. One is the transmission queue of column distance increments; the other is the storage queue of column distance increments. The storage queue is refreshed under the control of the frame synchronization signal, Flag. When a new frame comes into the system, the system issues a Flag signal. Then, in $T_1$, the column distance increments are transferred from the transmission queue to the storage queue, and the transmission queue is ready for the new frame. At the same time, $T_2$ and the storage queue can work on the previous frame. Together these form a pipeline architecture. The adjusted architecture of $T_1$ is shown in the dashed box of Fig. 7.
Figure 7. The optimized architecture of computing system.
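The hand-over between the two queues can be modelled in software as a simple double buffer. This is only a behavioural sketch under assumed names (`IncrementQueues`, `push_row`, `flag`); it says nothing about the register-level design in Fig. 7.

```python
class IncrementQueues:
    """Behavioural model of the transmission and storage queues in T1."""

    def __init__(self):
        self.transmission = []   # filled by T1 for the current frame
        self.storage = []        # read by T2 for the previous frame

    def push_row(self, increments):
        """T1 writes the column distance increments of one row."""
        self.transmission.append(list(increments))

    def flag(self):
        """Frame synchronization: hand the finished frame over to T2
        and free the transmission queue for the new frame."""
        self.storage = self.transmission
        self.transmission = []


if __name__ == "__main__":
    q = IncrementQueues()
    q.push_row([-1, -1, 1])      # rows of frame A
    q.push_row([1, 0, -1])
    q.flag()                     # frame B arrives
    q.push_row([1, 1, 1])        # T1 works on frame B ...
    print(q.storage)             # ... while T2 still sees frame A
    print(q.transmission)
```

Keeping two queues lets the two stages work on different frames at the same time, which is what removes the time-shared operation of the directly connected system.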
V. SOME DESIGN DETAILS AND ANALYSIS
Some design details are listed as follows, based on an $8 \times 8$ hardware system.
(1) The layout of the pipeline architecture
As shown in Fig. 2, Fig. 5 and Fig. 6, for a
increment processing cell initialize. At the same time, Flag serves as the sign of the increment transfer procedure and signals the stage switch. To finish processing the previous frame before Flag arrives, the clock frequency of the second stage must be twice that of the first stage.
(3) Replace the multiplier by the adder and shift register
As analyzed in Section III-A, the input of the increment processing cell is the column distance increment and the output is the squared column distance, so a multiplier is needed in the increment processing cell. A multiplier not only degrades the real-time performance but also occupies many logic cells. For optimal performance, the multiplier can be replaced by an adder and a shift register. Suppose the column distance and the squared value of the current pixel are $m$
is $N$. When there is more than one consecutive pixel of value 0 in the row, the column distance in the increment processing cell is 0 for the first pixel of value 0. Because the register can only store unsigned numbers, the succeeding pixels of value 0 produce consecutive $-1$ increments, and underflow occurs. When the row pixels are all 1-pixels, all the increments are $+1$, and the column distance goes beyond the reasonable range. These two conditions cause unexpected errors, so some checking and correction has to be done before the output. An extra 1-bit sign register is added as the high bit of the column distance, so that underflow can be detected. Once underflow happens, the sign bit register is reset and the column distance is set to 0. Set
to be the threshold value. When the column distance is beyond the reasonable range, the output is set above the threshold value.
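The replacement of the multiplier in item (3) can be illustrated with the identity $(m \pm 1)^2 = m^2 \pm 2m + 1$, where $2m$ is a one-bit left shift. The sketch below is a software model under that assumption, with a hypothetical function name. The underflow and threshold handling described above is left out.

```python
def step_square(m, sq, inc):
    """Update (m, m*m) for an increment in {-1, 0, +1}
    using only a shift and additions, with no multiplication."""
    if inc == 1:
        return m + 1, sq + (m << 1) + 1    # (m+1)^2 = m^2 + 2m + 1
    if inc == -1:
        return m - 1, sq - (m << 1) + 1    # (m-1)^2 = m^2 - 2m + 1
    return m, sq                           # increment 0: nothing changes


if __name__ == "__main__":
    m, sq = 3, 9
    for inc in [-1, -1, -1, 1, 1, 0]:
        m, sq = step_square(m, sq, inc)
        print(m, sq)
```

Since the increment processing cell receives one increment per pixel, the squared column distance can be kept up to date with one shift and at most two additions per step.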
We built an $8 \times 8$ hardware system on the EP2C5Q208C8 development board and completed the verification. To display the timing diagram clearly, the timing simulation waveform of a $4 \times 4$ hardware system is shown in Fig. 8. In Fig. 8, clk1 is the clock of the first stage and clk2 is the clock of the second stage. The system is active on the rising edge. The signals in1, in2, in3 and in4 are the inputs of the row pixels (1101, 1010, 1011, 1111). The flag signal is the frame synchronization signal. The signals out1, out2, out3 and out4 are the outputs of the squared Euclidean distances. At the first clk1 clock, flag is set to 1 and all cells in the system are initialized. Then flag is set to 0 and the column distance increments are obtained after 4 clk1 clocks. When the 6th clk1 clock arrives, the second frame comes, so flag is set to 1 and the first stage is initialized for the second frame. At the same time, the column distance increments are transferred to the storage queue of column distance increments and the second stage begins to calculate for the first frame.
To evaluate the relation between the hardware resource occupancy rate and the image scale, we built six hardware systems on the EP2C5Q208C8 development board. The one-dimensional size of the systems ranges from 4 to 10. The relation between the resource occupancy rate and the image scale is shown in Fig. 9. We observe that the resource occupancy rate is not strictly linear in the image scale. As the image scale grows, not only does the number of processing cells grow, but the bit width of the processing cells and other memory also increases gradually. This leads to the nonlinear relation between the resource occupancy rate and the image scale.
Figure 9. The relation between resources occupancy rate and image scale.
VI. CONCLUSIONS
A hardware algorithm for the Euclidean distance transform has been proposed in this paper. The algorithm consists of the row scanning $T_1$ and the column scanning $T_2$. By introducing the signal $BC_0$, the algorithm performs a unidirectional scan instead of a bidirectional scan, which shortens the computing time. In addition, the multiplier has been replaced by an adder and a shift register in $T_2$. With the adjusted system architecture, the hardware system works in a pipelined way and computes the Euclidean distance map of an $N \times N$ binary image in $(2N+1)$ clocks, a considerable improvement in computing speed.
