Abstract-A Field Programmable Gate Array (FPGA) based embedded vision system capable of recognizing objects in real time is presented in this paper. The proposed system architecture consists of multiple Intellectual Properties (IPs), which are used as a set of complex instructions by an integrated 32-bit Microblaze CPU. Each IP is tailored specifically to meet the needs of the application while consuming the minimum FPGA logic resources. By integrating both hardware and software on a single FPGA chip, the system achieves real-time processing of full VGA video at 32 frames per second (fps). In addition, this work proposes a new method called Dual Connected Component Labeling (DCCL) that is suitable for FPGA implementation.
I. INTRODUCTION
Embedded vision systems are used in a wide variety of applications, from industry to commerce and from civilian to military, and demand for them is therefore increasing. In this paper, in order to increase nut assembly accuracy and reduce rework in an automotive factory, a smart camera is designed to help monitor the process of fastening and unfastening nuts on an engine automatically. This application scenario is illustrated in Figure 1.
Typically, a Micro Controller Unit (MCU) or Digital Signal Processor (DSP) is used as the system controller and algorithm processor. A Field Programmable Gate Array (FPGA), however, is often used only as glue logic, because FPGA development poses many challenges and requires designers to cope simultaneously with both high-level design (algorithm, system architecture, etc.) and low-level design (logic circuit, memory management, time domain, etc.) [1].
FPGAs have some unique features [2], [3] that make them stand out from other processors: true hardware parallelism gives FPGA technology higher data throughput than an MCU or DSP; reconfigurability makes it far more flexible than a custom-designed ASIC; and abundant logic and I/O resources make an FPGA a well-suited platform for developing a System on Chip (SOC).
Much previous work has implemented image recognition algorithms on an FPGA. However, many of these efforts focus only on simple algorithms that can be completed in one pass and do not require external memory, for instance neighborhood operations such as the median filter, Sobel, Prewitt, Laplacian, Gaussian, Canny [5], [6] and the Harris corner detector, as well as stereo vision algorithms [7]. Some researchers also tailor their algorithms specifically to avoid off-chip memory, for example [12]. However, many image processing algorithms are iterative by nature and cannot work without off-chip memory. For example, Connected Components Labeling, a basic image processing algorithm that is also a neighborhood operator, cannot be completed in one pass. In [4], a single-pass connected components algorithm is presented, but it is difficult to use in real applications because it consumes too much on-chip memory. Optical flow is another example that requires external memory; a solution is described in [8], which unfortunately can only handle QVGA-size (320x240) video.
In fact, it is better to introduce off-chip memory when implementing complicated algorithms on an FPGA, both to reduce cost and to keep the system flexible. The solution described in [11] is a negative example: it requires a bigger FPGA when dealing with higher-resolution images, which makes it unaffordable and costs it its expandability.
In recent years, CPUs integrated into FPGAs have enhanced the processing capability of FPGA technology, making the FPGA a promising platform for developing an SOC. In [9], the authors describe an FPGA-based people detection system that adopts a 32-bit Microblaze soft processor. However, the computation is shared between the hardware logic on the FPGA and the embedded soft CPU, which severely degrades system performance: a speed of only 2.5 frames per second (fps) is reached, which is difficult to use in real-time applications. In [10], another FPGA-based vision system adopting an integrated CPU is described. In that work, all algorithm operations are performed in the hardware logic of the FPGA, while the soft CPU is used only to sequence the general operation. However, the system only processes VGA (640x480) video at 10 fps.
Based on these considerations, this work presents an expandable FPGA-based vision system integrating both hardware and software on a single FPGA chip. Multiple IPs are called as a set of complex instructions by the embedded Microblaze CPU to perform the whole algorithm. By virtue of the parallel hardware architecture, real-time processing of full VGA video at 32 fps is achieved. In addition, this work proposes a new method called Dual Connected Component Labeling (DCCL) that is suitable for FPGA implementation.
The rest of this paper is organized as follows. Section II states the problem of blob face identification. Section III describes the proposed system architecture. Section IV addresses the proposed algorithm and the design of the IPs. The experimental results are described in Section V. Finally, the conclusion is given in Section VI.
II. PROBLEM STATEMENT
The goal of this research is to recognize a certain set of blob faces on a cubic target shown in Figure 2. Each face contains a unique blob pattern that looks like nested squares. A black square located in the very center of the face is called the heart block; it may contain up to 4 white dots and is surrounded by 12 small black dots. The 12 black dots are in turn enclosed by another black square. Among the 12 small black dots, the largest one is defined as the origin of the face. The relative position between the origin and the white dots inside the heart block determines the ID of the face according to the following equation:
where W_x = 1 (x = 1, 2, 3 or 4) only if a white dot appears at position P_x (x = 1, 2, 3 or 4). P_x is determined by the relative position between the heart block and the origin in a clockwise direction, as shown in Figure 3.
Hence fully parallel operation can be achieved as long as there is enough memory bandwidth.
The IP named Blob Recognition is responsible for face ID recognition. It is connected to two VFBC ports so that it can read and write the off-chip memory simultaneously. Its detailed architecture is described in the following section.
IV. RECOGNITION ALGORITHM AND IP DESIGN
The blob recognition algorithm can be roughly divided into three major steps: image downscale, candidate location and blob recognition. The first step downscales the input VGA image to QQVGA (160x120). The second step finds the possible locations of the target. The last step identifies the ID of each candidate. A detailed flow chart is shown in Figure 6.
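The first step can be illustrated with a minimal software sketch. It assumes the downscale is a plain decimation by a factor of 4 in each direction (the experiment section describes this step as a data decimation operation); the function name `decimate` and the synthetic test frame are illustrative only and do not come from the hardware design.

```python
def decimate(image, factor=4):
    """Keep every `factor`-th pixel in both directions (no filtering)."""
    return [row[::factor] for row in image[::factor]]


# Synthetic 640x480 frame; the pixel value simply encodes its position.
vga = [[(x + y) % 256 for x in range(640)] for y in range(480)]
qqvga = decimate(vga)

print(len(qqvga[0]), len(qqvga))  # prints "160 120"
```

Because decimation only selects pixels, it needs no arithmetic at all, which is consistent with the negligible delay reported for this step.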
A. Smooth and Binarization
After the downscale operation, the resultant QQVGA image is blurred with a 5x5 Gaussian smoothing operator and then binarized by an adaptive threshold operator that adopts an 11x11 average filter and segments the image according to the following equation:
where (0,0) is the center of the mask, and BW(0,0) is the value assigned to the center pixel of the mask; Int(0,0) represents the intensity of the central pixel, and Int_avg is the average intensity of the neighbourhood; Delta is a constant parameter determining the threshold. Figure 7 shows an example of image blur and segmentation. Note that the heart block of the blob face is segmented from its background, which is the key to successfully selecting a candidate.
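The adaptive threshold can be sketched in software as follows. This is only an illustration under stated assumptions: a pixel is marked black (1) when Int(0,0) is below Int_avg minus Delta, the averaging window is clipped at the image border, and the preceding Gaussian smoothing is omitted. The polarity, the border handling, the default `delta` value and the function name are assumptions rather than details of the actual IP.

```python
def adaptive_threshold(img, win=11, delta=10):
    """Return a binary image: 1 where a pixel is darker than its local mean minus delta."""
    h, w = len(img), len(img[0])
    r = win // 2
    # Integral image with a zero border: ii[y][x] = sum of img[0:y][0:x].
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            total = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
            count = (y1 - y0) * (x1 - x0)
            # Compare Int * count with (Int_avg - delta) * count to avoid a division.
            if img[y][x] * count < total - delta * count:
                out[y][x] = 1
    return out


# Bright 20x20 background (200) with a dark 4x4 square (50) in the middle.
img = [[200] * 20 for _ in range(20)]
for y in range(8, 12):
    for x in range(8, 12):
        img[y][x] = 50
out = adaptive_threshold(img)
print(sum(map(sum, out)))  # prints 16: only the dark square is marked
```

Scaling both sides of the comparison by the window pixel count removes the division from the inner loop, a trick that is also common in hardware implementations.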
B. Dual Connected Component Labeling
To search for possible candidates, a new method named Dual Connected Component Labeling (DCCL) is proposed to group and label each component in a binarized image. The dimensions of each component, namely its maximum and minimum x coordinates, its maximum and minimum y coordinates, and its center, are then measured for candidate selection.
Like the conventional Connected Component Labeling (CCL) method [14], DCCL adopts a 2x3 mask and an equivalent table to scan an image and group its pixels into components based on pixel connectivity. But unlike CCL, which deals with only one type of pixel at a time, DCCL changes its data structure and adopts a second equivalent table, named the BW EQ table, that stores the connectivity information between black and white components. DCCL is therefore able to handle both black and white pixels simultaneously, which is convenient for the subsequent processing steps. Figure 8 displays its architecture. Two Dual Port RAMs (DPRAMs) are used to form the 2x3 mask operation in Figure 8. DPRAM is used instead of a FIFO because it is more flexible in handling images of different sizes. Here, each pixel in DPRAM has N-bit data, and the N-th bit indicates the type of pixel: 1 is black, and 0 is white. The remaining N-1 bits represent the label value. Through the 2x3 mask operation, each pixel is assigned a triplet comprising a label, an equivalent label indicating connectivity with components of the same type, and a list of bw eq labels indicating connectivity with components of the other type. The output of DCCL is further processed by the Parameter Extraction circuit to complete the Candidates Search and Store function, whose diagram is shown in Figure 9.
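The idea of labeling both pixel types in one scan can be illustrated with a short software sketch. It is not the hardware architecture of Figure 8: the N-bit pixel packing and the DPRAM line buffers are not modelled, a union-find forest stands in for the equivalent table, and a set of label pairs stands in for the BW EQ table. The function name `dual_ccl` and the choice of the four previously scanned neighbours as the 2x3 window are assumptions made for illustration.

```python
def dual_ccl(bw):
    """Label black (1) and white (0) components in a single raster scan.

    Returns (labels, bw_adjacent): labels[y][x] is the resolved component id and
    bw_adjacent is a set of (id_a, id_b) pairs of touching components of opposite colour.
    """
    h, w = len(bw), len(bw[0])
    parent = []        # same-colour equivalences, kept as a union-find forest
    bw_pairs = set()   # provisional black/white adjacency between labels

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    labels = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            same, other = [], []
            # Previously scanned neighbours: left, upper-left, upper, upper-right.
            for dy, dx in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                ny, nx = y + dy, x + dx
                if ny >= 0 and 0 <= nx < w:
                    target = same if bw[ny][nx] == bw[y][x] else other
                    target.append(labels[ny][nx])
            if same:
                roots = {find(s) for s in same}
                lab = min(roots)
                for r in roots:          # merge all equivalent labels
                    parent[r] = lab
            else:
                lab = len(parent)        # start a new component
                parent.append(lab)
            labels[y][x] = lab
            bw_pairs.update((lab, o) for o in other)

    resolved = [[find(l) for l in row] for row in labels]
    bw_adjacent = {tuple(sorted((find(a), find(b)))) for a, b in bw_pairs}
    return resolved, bw_adjacent


# 7x7 test pattern: white background, black 5x5 block, one white dot inside it.
face = [[0] * 7 for _ in range(7)]
for y in range(1, 6):
    for x in range(1, 6):
        face[y][x] = 1
face[3][3] = 0
labels, bw_adjacent = dual_ccl(face)
print(sorted(bw_adjacent))  # prints [(0, 1), (1, 2)]
```

In this toy pattern the black block (id 1) is recorded as touching both the background (id 0) and the enclosed white dot (id 2), which is the kind of black/white relationship that the later ID recognition steps rely on.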
C. Parameter Extraction
The Parameter Extraction block records only the maximum, minimum and central coordinates of each labeled component, together with its corresponding equivalent label, so that memory resources can be saved. The maximum and minimum coordinates in the x and y directions roughly indicate the border of a component, and are updated only by comparing the coordinates of each incoming labeled pixel with the current component border. This operation therefore needs only a comparator.
The calculation of the central coordinate involves addition and division according to the following equation:
$$x_c = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad y_c = \frac{1}{N}\sum_{i=1}^{N} y_i$$
where (x_c, y_c) are the coordinates of the center of a labeled component; N is the number of pixels in the component; and (x_i, y_i) are the coordinates of the i-th pixel of the component. In order to save FPGA resources, only one divider is used, shared between the x and y central coordinate computations.
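A minimal software sketch of the parameter extraction is given below. It assumes that the center is the mean of the pixel coordinates, which is consistent with the definitions of N and (x_i, y_i) above, and that the division is an integer division. The dictionary layout and the function name `extract_parameters` are illustrative, not part of the actual circuit.

```python
def extract_parameters(labels):
    """Accumulate bounding box, pixel count and coordinate sums per label in one pass."""
    stats = {}
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            s = stats.setdefault(lab, {"min_x": x, "max_x": x, "min_y": y,
                                       "max_y": y, "n": 0, "sum_x": 0, "sum_y": 0})
            # Border update: comparisons only.
            s["min_x"] = min(s["min_x"], x)
            s["max_x"] = max(s["max_x"], x)
            s["min_y"] = min(s["min_y"], y)
            s["max_y"] = max(s["max_y"], y)
            # Center update: additions only; the division is deferred to the end.
            s["n"] += 1
            s["sum_x"] += x
            s["sum_y"] += y
    for s in stats.values():
        # The same division routine serves x and y (assumed integer division).
        s["center"] = (s["sum_x"] // s["n"], s["sum_y"] // s["n"])
    return stats


# Label map with one 3x2 component (label 1) on a background (label 0).
label_map = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
stats = extract_parameters(label_map)
print(stats[1]["center"], stats[1]["n"])  # prints "(2, 1) 6"
```

Deferring the division until a component is complete means that each incoming pixel costs only comparisons and additions, so a single divider can be shared between the x and y results.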
D. Normalization
Image normalization facilitates ID identification. In this work the normalized size is 96x96. Only a certain set of candidate sizes is accepted for normalization, and only image downscale is needed since the minimum candidate size is 96x96.
The conventional bilinear interpolation method is adopted for the downscale operation: the incoming image data pass through a filter whose output is a weighted average of the pixels in the nearest 2x2 neighborhood. The parameters of the bilinear interpolation are selected carefully to reduce FPGA resources. Only the following 4 sets of parameters are used, (1,0) 
As a result, the whole normalization operation is simplified to involve only addition and shift operations, and no multiplier is used.
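The following sketch shows how a bilinear blend can be computed with shifts and additions only. The four parameter sets are not fully listed in the text above, so the quarter-step weights used here, (4-k)/4 and k/4 for k = 0..3, are a hypothetical choice made purely for illustration, as are the function names and the rounding.

```python
def times(x, m):
    """Multiply x by a small constant m (0..4) using shifts and adds only."""
    return ((x << 2) if m & 4 else 0) + ((x << 1) if m & 2 else 0) + (x if m & 1 else 0)


def blend(a, b, k):
    """Return round((a*(4-k) + b*k) / 4) for k in 0..3 without a multiplier."""
    return (times(a, 4 - k) + times(b, k) + 2) >> 2


def bilinear_quarter(p00, p01, p10, p11, kx, ky):
    """Weighted average of a 2x2 neighbourhood with quarter-step weights."""
    top = blend(p00, p01, kx)
    bottom = blend(p10, p11, kx)
    return blend(top, bottom, ky)


print(blend(100, 200, 1))                           # prints 125
print(bilinear_quarter(100, 200, 100, 200, 2, 0))   # prints 150
```

Restricting the weights to multiples of a power-of-two fraction is what allows every product to be replaced by at most two shifted additions followed by a final right shift.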
E. Recognition of ID
An intuitive ID recognition algorithm inspired by the features of the blob face is proposed. The whole flow chart is illustrated in Figure 10.
1) One fact makes it easy to find the heart block: the center of the heart block is also the center of the cropped candidate image, since the candidate image is cropped based on the dimensions of the heart block.
2) Two characteristics facilitate searching for the white frame: a) the center of the white frame must be very close to the center of the heart block; b) the white frame completely encloses the heart block. The purpose of obtaining the white frame is to find the 12 black dots surrounding the heart block.
3) The 12 black dots surrounding the heart block can easily be screened out by using the white frame and the BW EQ table. The face origin can then be selected simply by comparing the dimensions of the black dots.
4) With the aid of the BW EQ table and the measured geometric dimensions of each labeled component, it is easy to find all the white dots inside the heart block.
5) The face ID is calculated from the relative position of the face origin and the white dots inside the heart block. In this work, a fine division of the blob face is used, as illustrated in Figure 11. It can be seen that the origin and the white dots are each assigned to one of 8 zones according to their geometric relationship.
By comparing the coordinates and dimensions of the origin, the white dots and the heart block, it is easy to fit the origin and the white dots into specific location zones and then to calculate the face ID.
V. EXPERIMENT AND DISCUSSION
The proposed solution is entirely integrated into an AVNET Xilinx XC5VLX110 Evaluation Kit, shown in Figure 12. An Omnivision OV10121 camera is manually wired to the FPGA board to capture the video stream.
The processing time of the Blob Recognition IP can be estimated by adding up the processing times of its three steps. The first step, image downscale, is just a data decimation operation and adds only one clock cycle of delay, which can be ignored here. The second step, candidate location, consumes about 4.1 milliseconds. The processing time of the third step, ID recognition, is difficult to predict, since it depends on the number of candidates as well as the size of each candidate image: the more candidates or the larger the candidate image, the longer the processing time. However, the number of candidates and the candidate size constrain each other. If the candidate size is very large, there can only be a small number of candidates; otherwise, many candidates may exist. In addition, the processing time of the normalization circuit varies from 98 microseconds to 2.3 milliseconds as the candidate size changes from 96x96 to 480x480. The remaining processing time of the third step is 450 microseconds if the candidate is a true blob face; otherwise the computation time decreases accordingly.
Taking as an example a common case with 7 medium-sized candidates, the third step may finish within 10 milliseconds. Hence, the total processing time for a common case is at most 15 milliseconds, which means that the Blob Recognition IP can process a VGA video stream at 66.7 fps. However, owing to the bandwidth of USB 2.0 and the performance of the camera, as well as the possible computation delay in complicated environments, the system is designed to process VGA images at 32 fps.
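The throughput figures above can be checked with a few lines of arithmetic. The sketch uses only the numbers quoted in the text; the variable names are illustrative.

```python
candidate_location_ms = 4.1   # step 2, as quoted in the text
id_recognition_ms = 10.0      # step 3 bound for the 7-candidate common case
total_budget_ms = 15.0        # total bound quoted in the text

# The two measurable steps fit inside the quoted 15 ms bound.
assert candidate_location_ms + id_recognition_ms <= total_budget_ms

max_fps = 1000.0 / total_budget_ms    # frames per second at 15 ms per frame
frame_period_ms_at_32fps = 1000.0 / 32

print(round(max_fps, 1), frame_period_ms_at_32fps)  # prints "66.7 31.25"
```

At the design rate of 32 fps each frame has 31.25 ms available, roughly twice the 15 ms bound, which leaves headroom for the camera, the USB link and more complicated scenes.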
Experiments demonstrated that, at 32 fps, the system can reliably process a VGA video stream and recognize each blob face at distances from 0.2 m to 0.7 m over a wide range of lighting conditions, as long as the target is clearly visible in the video image. Figure 13 shows some experiments in which the detected blob is highlighted. The 9 sub-images in Figure 13 represent 9 different test cases, arranged to vary four variables: the angle of the target, the position of the target in the image (close to the boundary or at the center), the background objects, and the illumination conditions. The system keeps working despite the large differences in these variables, which demonstrates its robustness.
The consumed FPGA resources are listed in Table I. It can be seen that only half of the FPGA resources are occupied, which leaves room for integrating more complicated algorithms or for adding further interfaces to the FPGA, such as an Ethernet interface.
VI. CONCLUSION AND FUTURE WORK
The main contribution of this paper is an expandable FPGA-based embedded vision system that can process VGA video and identify the blob faces at 32 fps. In addition, unlike many other FPGA platforms that introduce processors for algorithm implementation, this system performs all the processing in hardware logic. The architecture is also designed so that each IP block can access external memory, allowing the parallel processing capability of the FPGA to be fully exploited. Furthermore, each IP is tailored carefully for reuse and for saving FPGA logic resources. In short, the highly parallel architecture and the compact logic circuit size are the two highlights of this FPGA system. However, it would be worthwhile in the future to add more external memories to enhance the parallel processing capability, because in this work the limited memory bandwidth affects system performance: there is only one external DDR2 memory.
Furthermore, it is possible to extend the current algorithm to detect the orientation and the 3-D location of the blob target within a given space by using one or more cameras. This will involve 3-D object identification and tracking, as well as the associated camera calibration. Based on the current FPGA system, the period of further development can be shortened, since only the Blob Recognition IP needs to be updated for a new application.
