Abstract-The Hough transform is a widely used shape-based algorithm for object detection and localization [6]; this technique can be generalized to parametric curves such as circles. For real-time execution and embedded integration, several optimizations are necessary because of its large memory and computational requirements. This paper presents an efficient real-time pipelined architecture, with an FPGA implementation, of our Hough transform for multi-circle detection. The computation of center candidates is improved, and a three-stage pipeline architecture is designed to reduce the processing latency and the interval between images. The architecture has been integrated into a Xilinx Zynq-7000 XC7Z020 containing an Artix-7 FPGA. The global system uses 78.5 BRAMs, 153 DSP slices and 21638 LUTs, and supports a maximum clock frequency of 128.89 MHz. We validate our architecture using a 125 MHz clock frequency and obtain a latency of 33.214 ms and an interval between two images of 16.607 ms for a 1920x1080 pixel image. According to our results, our architecture offers a throughput more than 4 times better than the fastest state-of-the-art architecture.
I. INTRODUCTION
Most colorectal cancers (95% [4]) begin as a growth on the inner lining of the colon or rectum called a polyp.
To reduce the incidence of colorectal cancer, the authors of [14], [17] proposed a new paradigm of Wireless Capsule Endoscopy [7] that can automatically recognize polyps in situ.
They have proposed a specific processing chain embedded in a System on Chip integrated inside a capsule.
In this chain they use the Hough transform to detect circles in HD images as candidate regions that may contain a polyp. The Hough transform has been widely used for object localization since 1962 [6] and can be generalized to parametric curves such as circles [2]. Once regions are selected, they use a learning algorithm to decide whether there is a polyp in the region.
In this article, we study only the Hough transform part, to determine whether it can be embedded in real time. This is a first step towards determining whether it can be embedded in the next generation of capsules, which will acquire HD images. We focused on it because we measured its execution on an embedded processor, an ARM Cortex-A9 running at 667 MHz, using the OpenCV library, and found that it takes 998.260 ms for an image of size 1920x1080. This execution time is too high: it is 25 times higher than the goal, which is to process images at the video rate of an endoscope that acquires a 1920x1080 pixel image every 40 ms. This result rules out an optimized embedded software implementation and requires a custom digital hardware implementation on an FPGA.
II. STATE OF THE ART
A review of the state of the art was carried out to analyse implementations of the Hough transform on FPGAs under timing constraints. Here we present the main findings; surveys of Hough transform methods can be found in [2] and [13]. We can note the following facts:
1) all the FPGA implementations [1], [3], [19], [16], [8], [11] use embedded internal memories, or BRAMs (Block RAMs);
2) all the works consider only the acceleration of the voting process of the Hough transform.
In this process the goal is to accumulate intersection points in the Hough parameter space, corresponding to the possible circles. A vote is then carried out to find the local maxima, which are considered to be real circles. This process is memory- and time-consuming, and two different approaches have emerged. The first uses the original Hough transform algorithm, where the parametric equations of a circle are used to find the center and the radius of the circles [1], [3], [16], [8], [11]. The second uses a modified version of the Hough transform called the one-dimensional (1D) Hough transform algorithm [19]. We describe below the works related to both approaches.
A. Original Hough transform Implementations
For the original Hough transform implementation, the authors of [3] demonstrate that using an external memory for the voting procedure is limited by the data transfer bandwidth. Their implementation uses an efficient internal memory structure for the voting process, in which the size of the Hough space is reduced. Computations are distributed over mathematical units; each unit has access to its own memory module, with a double-buffer technique to avoid external memory, which allows multiple parallel read/write operations at the same time.
In [1], the authors use a CORDIC algorithm to implement the Hough transform. A particularity of this work is that it detects only one circle, owing to the target application, iris detection. The same approach is used in [11], where an FPGA-based hardware accelerator for iris localization is introduced.
978-1-5386-8237-1/18/$31.00 © 2018 IEEE
In [16] , authors adopt the scanline-based ball detection algorithm for the edge detection stage and edge-flag algorithm for the voting process.
In [8] , authors propose a Hough transform algorithm combined with a graph clustering algorithm for FPGA-based multi-circle detection.
B. 1D Hough transform algorithm Implementation
Goneid et al. introduced in 1997 a modified version of the Hough transform dedicated to multi-circle detection [5]. This method, called 1D Hough transform multi-circle detection, can successfully extract non-overlapping circles and ellipses in binary images, even in the presence of random noise. It is easy to implement since each of the object's parameters is accumulated in its own one-dimensional parameter space. Zhou et al. [19] ... steps 15-24 of algorithm 1, the most voted distance r becomes the radius of the center candidate (i, j); 3) if the value V_r(i, j, r) is greater than f * 4√2 r, then ||(x, y) − (i, j)|| = r becomes a circle (steps 20-22 of algorithm 1). f can be adapted as the sensitivity threshold to detect circles. In [19], the authors propose an FPGA-based architecture implementing algorithm 1. They choose 9-bit integer words for the data in histograms V_x and V_y and 17-bit integer words for histogram V_r. They use multiple BRAMs to implement these histograms with 100 voting modules in parallel. The number of x-coordinates or y-coordinates of center candidates is set to 10; therefore, 100 center candidates are constructed. Finally, the radius is coded as a 13-bit integer.
C. Analysis of the state of the art
The performance of these implementations is shown in table I. As can be seen, no FPGA work uses image sizes as large as 1920x1080. In table I we have extrapolated, for each referenced work, the latency for a 1920x1080 image using a simple size factor. None of these state-of-the-art methods reaches a latency below 40 ms for this image size. It is therefore necessary to design a new digital hardware architecture to accelerate the Hough transform computation.
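The extrapolation mentioned above can be illustrated with a short sketch. It assumes that the "simple size factor" is the ratio of pixel counts, which the text does not state explicitly; the function name and the example values are hypothetical and are not taken from table I.

```python
def extrapolate_latency(latency_ms, width, height, target=(1920, 1080)):
    """Scale a reported latency to the target image size.

    ASSUMPTION: latency grows linearly with the number of pixels.
    """
    factor = (target[0] * target[1]) / (width * height)
    return latency_ms * factor


# Hypothetical example: a design reporting 10 ms on a 640x480 image.
scaled = extrapolate_latency(10.0, 640, 480)
meets_budget = scaled < 40.0  # 40 ms is the frame period targeted in this paper
```

Under this assumption, a 640x480 result is multiplied by 6.75 when scaled to 1920x1080.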
Algorithm 1
1: for each row i from 1 to H do
2:   for each edge I(i, j) with j from 1 to W − 1 do
3:     for each edge I(i, k) with k from j + 1 to W do
4:       ...
5:     end for
6:   end for
7: end for
8: for each column i from 1 to W do
9:   for each edge I(j, i) with j from 1 to H − 1 do
10:     for each edge I(k, i) with k from j + 1 to H do
11:       ...
12:     end for
13:   end for
14: end for
15: for each local maximum V_x(i) in V_x do
16:   for each local maximum V_y(j) in V_y do
17:     for all edge I(k, m) in image do
18:       ...
19:     end for
20:     if V_r(i, j, r) > f * 4√2 r then
21:       ||(x, y) − (i, j)|| = r is a circle
22:     end if
23:   end for
24: end for
Polyps can appear as protrusions and can be detected using the local curvature of the edge image, searching for circular or elliptical shapes [9], [12]. Based on analyses of polyp images from colon examinations [15], it was observed that polyps do not always have a regular circular or elliptical shape; this depends on the noise level, quality and resolution of the image. In addition, in [19] the key technique for accelerating the Goneid algorithm is an efficient usage of DSP slices and block RAMs. Based on these observations, we have chosen to investigate the Goneid algorithm in order to produce an optimized version that can compute the Hough transform in less than 40 ms for a 1920x1080 image.
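To make the flow of algorithm 1 concrete, the following is a minimal software sketch, not the implementation of [5] or [19]. It assumes that each pair of edge pixels on a row (or column) votes for its midpoint, which is not spelled out in the text here, and it replaces the local-maximum search by simply taking the n_c most voted bins. The names detect_circles, vote_midpoints and ring are illustrative.

```python
import math
from collections import Counter


def vote_midpoints(lines, size):
    """One vote per pair of edge pixels lying on the same row or column.

    ASSUMPTION: the vote is cast at the midpoint of the pair; pairs whose
    midpoint is not an integer coordinate are skipped.
    """
    hist = [0] * size
    for coords in lines.values():
        coords = sorted(coords)
        for a in range(len(coords) - 1):
            for b in range(a + 1, len(coords)):
                total = coords[a] + coords[b]
                if total % 2 == 0:
                    hist[total // 2] += 1
    return hist


def detect_circles(edges, width, height, f=0.5, n_c=1):
    """edges: set of (x, y) edge pixels. Returns a list of (cx, cy, r)."""
    rows, cols = {}, {}
    for x, y in edges:
        rows.setdefault(y, []).append(x)
        cols.setdefault(x, []).append(y)
    v_x = vote_midpoints(rows, width)    # x-coordinate histogram
    v_y = vote_midpoints(cols, height)   # y-coordinate histogram
    # Simplification: the n_c most voted bins stand in for the local maxima.
    cand_x = sorted(range(width), key=v_x.__getitem__, reverse=True)[:n_c]
    cand_y = sorted(range(height), key=v_y.__getitem__, reverse=True)[:n_c]
    circles = []
    for cx in cand_x:
        for cy in cand_y:
            # Radius histogram: rounded distance from the candidate to each edge.
            v_r = Counter(round(math.hypot(x - cx, y - cy)) for x, y in edges)
            r, votes = v_r.most_common(1)[0]
            if r > 0 and votes > f * 4 * math.sqrt(2) * r:
                circles.append((cx, cy, r))
    return circles


def ring(cx, cy, r, width, height):
    """Synthetic test contour: pixels whose rounded distance to the center is r."""
    return {(x, y) for x in range(width) for y in range(height)
            if round(math.hypot(x - cx, y - cy)) == r}
```

On a synthetic ring of radius 10 centered at (20, 15), the two 1D histograms peak at x = 20 and y = 15, and the radius histogram then confirms r = 10 against the f * 4√2 r threshold.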
III. PROPOSED METHOD
Our FPGA-based architecture for a real-time implementation of the 1D Hough transform multi-circle detection focuses on accelerating the construction of histograms V_x and V_y, steps 1-7 and 8-14 of algorithm 1, respectively. Our solution significantly reduces memory usage and execution latency. First, we propose an equivalent algorithm, algorithm 2, which gives the same results, that is, the same x-coordinate and y-coordinate histograms as steps 1-7 and 8-14 of algorithm 1.
Algorithm 2 is obtained by first rewriting steps 1-7 of algorithm 1 as shown in algorithm 3, taking into account all the points of the image instead of only the contour points.
Second, we change the order of the for-loops and obtain a new formulation of steps 1-7, shown in algorithm 4.
Finally, we can rewrite the i-loop as a sum, and rewrite steps 1-7 as proposed in steps 1-5 of algorithm 2.
Algorithm 2
1: for each column j from 1 to W − 1 do
2:   for each column k from j + 1 to W do
3:     ...
4:   end for
...
9:   end for
10: end for
11: for each local maximum V_x(i) in V_x do
12:   for each local maximum V_y(j) in V_y do
13:     for all edge I(k, m) in image do
...
Algorithm 3
1: for each row i from 1 to H do
2:   for each point I(i, j) with j from 1 to W − 1 do
3:     for each point I(i, k) with k from j + 1 to W do
4:       ...
5:     end for
6:   end for
7: end for
Algorithm 4
1: for each column j from 1 to W − 1 do
2:   for each column k from j + 1 to W do
3:     for each row i from 1 to H do
4:       ...
5:     end for
6:   end for
7: end for
The advantage of our histogram construction process, shown in algorithm 2, is that we obtain the accumulation value for one x-coordinate every two j-loops; that is, we obtain an x-coordinate accumulation value every two columns read and a y-coordinate accumulation value every two rows read. This has a significant impact on computation latency and resource consumption, as explained in section IV-A.
We have validated our algorithm on images of closed contours, as shown in figure 1. In this image, f corresponds to the sensitivity threshold visible in step 16 of our algorithm 2. A green circle indicates that we have localized a closed contour. In the next section we describe the digital hardware architecture that implements our algorithm.
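The equivalence between the pair-by-pair voting of algorithm 1 and the column-wise sum of algorithm 2 can be checked with a small sketch. It assumes midpoint voting restricted to pairs with an integer midpoint, so that k advances two columns at a time; this is an interpretation and is not stated in this form in the text. The function names are illustrative.

```python
def vx_pairwise(img):
    """Row by row, every pair of edge pixels votes (order of algorithm 1)."""
    h, w = len(img), len(img[0])
    v = [0] * w
    for i in range(h):
        for j in range(w - 1):
            if not img[i][j]:
                continue
            for k in range(j + 1, w):
                if img[i][k] and (j + k) % 2 == 0:
                    v[(j + k) // 2] += 1
    return v


def vx_column_sums(img):
    """Column pair by column pair, the i-loop becomes a sum of products."""
    h, w = len(img), len(img[0])
    v = [0] * w
    for j in range(w - 1):
        for k in range(j + 2, w, 2):  # only pairs with an integer midpoint
            v[(j + k) // 2] += sum(img[i][j] * img[i][k] for i in range(h))
    return v


img = [
    [1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 0],
]
```

Both functions return the same histogram for any binary image. In the second form, each column pair contributes a single multiply-accumulate over the rows, which is the operation that maps naturally onto a DSP slice.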
IV. CHT ARCHITECTURE
In figure 2, we introduce our architecture implementing algorithm 2. The overall architecture is composed of five modules, described below:
• x-coordinate and y-coordinate computation modules: these modules compute N_c x-coordinates and N_c y-coordinates. They produce N_c^2 center candidates from the combination of each x-coordinate and y-coordinate;
• radius computation: this module builds, for each center candidate, a histogram of the Euclidean distance between this center candidate and each edge point. Once the histogram is built, this module assigns the most accumulated Euclidean distance as the radius. This module selects as real circles the center candidates and radii whose accumulation value is > f * 4√2 r;
• registers module: this module registers the N_c x-coordinates and the N_c y-coordinates of center candidates and the N_c^2 circles in parallel.
In the next sections we describe each module of our architecture.
A. x-coordinate and y-coordinate computation module
D_m/2 DSP48 slices are used to add the votes between the input column and every two columns. The total votes are accumulated in the blue registers to be added every H cycles (when the input column is completely read). With this architecture, we compute an x-coordinate histogram value every column read.
In the second stage, the vote values are filtered in order to find the local maxima. A sliding window of size F, which corresponds to the minimum Euclidean distance between two center candidates, is used to compare a vote value with its neighbors. For circles separated by a Euclidean distance of less than F pixels in the image, only the circle with more edge points will be detected.
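The local-maximum filtering can be sketched in software as follows. This is an illustration only: it assumes the window extends F bins on each side of the tested value and that ties are resolved in favour of the leftmost bin, neither of which is specified in the text.

```python
def local_maxima(hist, window):
    """Indices whose vote is non-zero and maximal within +/- window bins.

    ASSUMPTIONS: symmetric window; on ties the leftmost bin wins.
    """
    peaks = []
    n = len(hist)
    for i, votes in enumerate(hist):
        if votes == 0:
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        if votes == max(hist[lo:hi]) and hist.index(votes, lo, hi) == i:
            peaks.append(i)
    return peaks


votes = [0, 3, 5, 4, 0, 0, 2, 7, 7, 1]
peaks = local_maxima(votes, window=2)
```

Here the bins at indices 2 and 7 survive; the second bin holding 7 votes is suppressed because an equal value lies inside its window.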
Finally, in the third stage, all the local maximum histogram values are stored in registers R_i and C_i according to the following rules:
... else if R_i > C_{i+1} and R_i < C_i then the register C_i = R_i, else C_i = C_i. Hence, the larger values are gradually transferred to the right side through the registers C_i. This process is executed until all the columns of the image have been read. Finally, the largest local maximum histogram values are stored in C_i, and their respective coordinates correspond to the N_c x-coordinates of the circle candidates. Using a similar architecture, the N_c y-coordinates are calculated.
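Keeping only the N_c largest local maxima while values stream in can be pictured as insertion into a short sorted chain. The sketch below is a software analogue under that assumption; it does not reproduce the register update rules of the architecture, and the names keep_largest and stream are illustrative.

```python
def keep_largest(stream, n_c):
    """Keep the n_c entries with the most votes, in descending order.

    stream: iterable of (votes, coordinate) pairs arriving one at a time.
    ASSUMPTION: on equal votes, the entry that arrived first is kept ahead.
    """
    regs = []
    for votes, coord in stream:
        pos = len(regs)
        # Walk towards the head while the new value beats the stored one.
        while pos > 0 and regs[pos - 1][0] < votes:
            pos -= 1
        regs.insert(pos, (votes, coord))
        del regs[n_c:]  # the smallest entry falls off the end of the chain
    return regs


stream = [(3, 10), (9, 40), (1, 55), (7, 70), (9, 90)]
best = keep_largest(stream, n_c=3)
```

Each incoming value needs at most n_c comparisons, which is why a chain of n_c comparators and registers is enough and no full sort of the histogram is required.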
For an image of size W x H, ... BRAMs of 36 Kbits and D_m DSP48 slices are necessary to calculate the x- and y-coordinates in W * H + 2 + ...
The x- and y-coordinates are then combined to obtain N_c^2 center candidates.
B. Radius computation module
In figure 4, we present the fully pipelined module proposed for the radius computation. In the first stage of this module, a Euclidean-distance histogram is built in parallel for each center candidate. This stage computes the Euclidean distance between one center candidate and all edge points. Each addition and subtraction is performed in 2 cycles, and each multiplication in 3 cycles. Each operation is executed using one DSP slice. We use the Xilinx CORDIC IP core [18] to compute the square root for the Euclidean distance as a 16-bit integer in 8 cycles. To vote for the Euclidean distance we use a memory, with the architecture illustrated in figure 5. In this architecture each memory is implemented in one 18 Kbit BRAM, which enables a simultaneous read and write in one cycle and avoids collisions.
Once all edge points of the image have been read, in the second stage, for each center candidate, all the values of the memory are read in order to find the most voted Euclidean distance in N 2 cycles. This Euclidean distance r becomes the radius of this center candidate. We compare its accumulation value to a threshold of 4√2 r [10] to determine whether it corresponds to a true circle. It is possible to modify this threshold in order to make the verification more sensitive.
Each radius that corresponds to a true circle and the corresponding center become inputs to the shift registers.
In total, 4N_c + N_c^2 DSP slices and ...
To validate our architecture, we prototyped it on a SoC-based system, the Digilent ZedBoard Zynq-7000 ARM/FPGA XC7Z020 SoC development board. The Zynq is not the final platform, as it cannot be integrated inside a capsule; we plan to integrate the design in a new Xilinx Artix-7, compatible with a capsule, that was not available at the time of the experiments. Our first goal was to validate our Hough transform IP and measure its execution time on a real SoC not too far from the final device.
In figure 6 , we illustrate the integration of our architecture, the Circle Hough Transform (CHT) IP in this SoC.
We have realised a pipeline of three operations: first, writing an image into the DRAM memory and reading the computed circles; second, center computation; and third, radius computation. The pipeline execution in the global system is shown in figure 7.
As we can see, we use two addresses in DRAM (@1, @2) in order to read and write an image at the same time. The CHT IP uses one HP master AXI port to read images, computing center candidates and calculating the radius in parallel.
To store the image in a block memory we use BRAMs of 36 Kb that can shift ...
As we can see in table II, our Hough transform can work at a maximum frequency of 149.16 MHz alone and 128.885 MHz in the global system. This is due to the distance between the furthest DSP slice and the AXI AMBA interconnect. Our digital architecture can process a 1920x1080 pixel image in less than 40 ms, as expected, and gives a throughput of 62 fps.
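The timing figures quoted in this paper can be related to each other with a few lines of arithmetic. The sketch below uses only values stated in the text; the reading that the 62 fps figure corresponds to the 128.885 MHz maximum clock rather than the 125 MHz validation clock is our interpretation, not a statement of the source.

```python
W, H = 1920, 1080
F_VALID_HZ = 125e6      # validation clock
F_MAX_HZ = 128.885e6    # maximum clock of the global system
INTERVAL_MS = 16.607    # measured interval between two images
LATENCY_MS = 33.214     # measured latency

# One pixel per cycle gives a lower bound on the interval at 125 MHz.
pixel_time_ms = W * H / F_VALID_HZ * 1e3

# Cycle count implied by the measured interval at the validation clock.
cycles_per_image = INTERVAL_MS * 1e-3 * F_VALID_HZ

fps_at_valid_clock = 1e3 / INTERVAL_MS
fps_at_max_clock = F_MAX_HZ / cycles_per_image  # interpretation, see lead-in

# Latency expressed in pipeline intervals.
intervals_of_latency = LATENCY_MS / INTERVAL_MS
```

The W * H pixel cycles alone account for about 16.589 ms of the 16.607 ms interval, and the latency is exactly two intervals, which is consistent with the center and radius computations running as two consecutive pipeline stages. The same cycle count gives about 60.2 images per second at 125 MHz and about 62.1 at 128.885 MHz.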
With table III, we can compare our Hough transform architecture with the state of the art in table I. We calculate the processing time, BRAM usage and DSP usage of our architecture for the same image sizes in order to make a fair comparison. As can be seen, our architecture performs best and offers a throughput 4 times better than the fastest state-of-the-art architecture.
VI. CONCLUSIONS
In this paper we proposed an efficient real-time pipelined Hough transform architecture for multi-circle detection, intended to help localise polyps in gastrointestinal tract images. In our architecture, an efficient method is implemented to significantly reduce internal memory usage and execution time. Our architecture supports a maximum clock frequency of 149.16 MHz alone and 128.885 MHz in a global system, and detects up to 25 circles with a maximum circle diameter of 108 pixels. Our design has been validated on a Xilinx Zynq-7000 XC7Z020 using 78.5 BRAMs, 153 DSP slices and 21638 LUTs with a 125 MHz clock. It obtains a latency of 33.214 ms and an interval between two images of 16.607 ms for a 1920x1080 pixel image. This architecture can process 62 images per second, and it offers a throughput 4 times better than the fastest state-of-the-art architecture.
