ABSTRACT 3D shape information is an important cue in image processing and computer vision. Unlike traditional multi-input depth from defocus (DFD) techniques, the monocular DFD (MDFD) algorithm proposed by Hu and Haan can reconstruct a 3D shape from a single monocular defocus image with low computational complexity. In this paper, we present a real-time MDFD system implemented on an FPGA device. To reduce the FPGA design cost, Vivado High-Level Synthesis (VHLS) is used to design the MDFD system. The system architecture, built on FIFO based convolution, is first written in C/C++ code, which VHLS then converts to the FPGA design. The PIPELINE, LOOP_MERGE, and ARRAY_PARTITION directives are then used to optimize the latency and interval of the proposed system. The performance and resource utilization of the whole system are evaluated by processing 640×480-pixel defocus images of real scenes. The system can process about 22 images per second at a 20 MHz working frequency while keeping 93.29% depth accuracy on the 3D objects test, which makes it a real-time, state-of-the-art MDFD system compared with other recent works.
I. INTRODUCTION
Extracting 3D information from images is an important part of computer vision systems. In increasingly mature artificial intelligence systems, 2D information about the scene can no longer meet the needs of researchers, especially in fields that require three-dimensional information, such as robotic arm control [1], SLAM navigation [2] and others [3].
In order to rebuild the 3D shape from 2D images, Pentland first proposed the depth from defocus algorithm in the 1980s [4], which infers depth information by measuring the degree of defocus in the image. He reported that his method provided 64×64 3D maps at 8 frames per second (fps). Building on Pentland's research, a variety of image 3D information extraction techniques have been produced in the past 20 years, including the S-transform of Subbarao and Surya [5], the rational filters of Watanabe and Nayar [6], and the statistical methods of Rajagopalan and Chaudhuri [7]. These methods require two different defocus images acquired from a single viewpoint, which increases the computational complexity and makes them almost impossible to implement in a real-time image processing system. Hu and Haan [8] proposed the monocular DFD (MDFD) technique. This technique uses a point spread function to blur a single defocus image twice to obtain two different blurred images. The 3D depth information is obtained by calculating the difference ratio between the two defocus images. Because it needs only a single defocus image as input, this technique reduces the amount of data the system has to process and lays a solid theoretical foundation for real-time 3D information extraction. Zhuo et al. [9] used Laplacian interpolation to refine the sparse depth map into an accurate depth map. This method also requires only one defocus image, but at a high computational cost.
The associate editor coordinating the review of this manuscript and approving it for publication was Naveed Akhtar.
Other 3D shape recovery implementations that use more than one defocus image include that of Ghita and Whelan [10], who extended Watanabe's method for an artificial robot and obtained 256×256 depth maps at 10 fps. Nayar et al. [11] implemented a customized sensor, the Datacube MV200, to detect 3D information at 30 fps. Favaro [12] used shift variant schemes to calculate a 640×480 depth map on Matlab, Mac2, computing at 2.4 GHz, in 10 min. Ben-Ari and Raveh [13] implemented Favaro's scheme on a GPU and achieved a 640×480 depth map at 5-15 fps. A coded aperture model was proposed by Zhou et al. [14]. It was tested on Matlab and provided 1024×768 3D maps in 15 s.
Field Programmable Gate Arrays (FPGAs) [15] have demonstrated high performance and energy efficiency in a variety of applications compared to other computing platforms [16], [17]. Joseph-Raj and Staunton [18] implemented a video-rate DFD system on an FPGA, which calculates 3D shape maps from a pair of 400×400-pixel defocus images at 76 fps. However, traditional FPGA designs are usually implemented in low-level hardware languages such as Verilog and VHDL. This development method is time-consuming and inefficient, which hinders the adoption of FPGA applications. Xilinx's recent advances in Vivado High-Level Synthesis (VHLS) [19] allow developers to program hardware logic in high-level languages such as C/C++ and OpenCL, which significantly improves FPGA development efficiency. Recently, VHLS has been widely used in hardware acceleration for complex tasks [20]-[22]. This paper systematically demonstrates the design of a real-time MDFD system on an FPGA through Vivado HLS. The remainder of this paper is organized as follows. The monocular depth from defocus algorithm is discussed in Section II, and the FIFO based convolution FPGA design is detailed in Section III. In Section IV, we demonstrate the FPGA development process with the VHLS tool. System optimization and performance are discussed in Section V. The 3D reconstruction experiment on real defocus images is shown in Section VI. The limitations of the proposed system are discussed in Section VII. Section VIII briefly concludes the development process and the real-time MDFD system, and discusses the future development and implementation of the proposed MDFD system.
In this paper, our contributions are as follows:
1) We achieve a real-time monocular 3D reconstruction system running at 22 fps on an FPGA platform while keeping 93.29% depth accuracy on the real 3D objects test.
2) We show the development efficiency of the VHLS tool and the convenience of the optimization directives it provides.
3) We show that the 3D reconstruction adapts well to real scenes, which demonstrates the engineering value of the proposed system.
II. MONOCULAR DEPTH FROM DEFOCUS
This paper implements the MDFD technique proposed by Hu and Haan [8] to reconstruct the 3D shape from a defocus image. In computer vision, image signals can be viewed as multidimensional matrices. In this paper, we only consider a gray-scale defocus image as the input signal, that is, a two-dimensional matrix. The MDFD algorithm achieves 3D shape recovery by inferring the degree of defocus of each region in the input image. Figure 1 presents an example gray-scale defocus image of a cup. In Figure 1, the part of the image on the focal plane is clear and sharp, while the part off the focal plane is blurred, which indicates the difference in the degree of defocus. According to [8], in order to infer the degree of defocus, we apply Gaussian blur preprocessing to the input image with two Gaussian convolution kernels that have different standard deviations. The Gaussian blur of the original defocus image I(x, y) is a convolution, which can be expressed by:
$$I_{Gaussian1}(x, y) = I(x, y) * G(x, y, \sigma_1)$$
$$I_{Gaussian2}(x, y) = I(x, y) * G(x, y, \sigma_2)$$
where I_Gaussian1(x, y) and I_Gaussian2(x, y) are the images after Gaussian blur preprocessing, and G(x, y, σ1) and G(x, y, σ2) are Gaussian blur functions with different standard deviations σ1 and σ2. By computing the pixel-wise difference ratio between I(x, y), I_Gaussian1(x, y) and I_Gaussian2(x, y), the rough 3D map can be obtained. The difference ratio equation is given below:
where I_3D(x, y) is the rough 3D map of the input defocus image, which contains considerable noise and instability. In [8], Hu and Haan implemented a patch-wise max filter to post-process the rough 3D map. To further reduce the computational complexity and meet the requirements of a real-time 3D reconstruction system, we instead convolve I_3D(x, y) with an 11×11 average kernel to remove the noise and obtain the more stable 3D shape map I_3Davrg(x, y).
FIGURE 2. The MDFD algorithm flow.
VOLUME 7, 2019
Then we can estimate the blur degree σ of each region using equation (6), proposed in [8], where σ(x, y) is the defocus map, which can be regarded as a clean 3D map of the input defocus image. According to Pentland [4], MDFD can estimate the actual depth if all the camera parameters are known:
In this formula, D is the distance from the lens to the region of interest, v0 is the distance between the lens and the focal plane, F is the focal length, f is the aperture number of the lens and σ is the degree of defocus. In practical applications, the constant k has to be determined for the system. Equation (7) gives the relation between the estimated defocus degree and the estimated actual distance from the lens to the objects. The MDFD algorithm flow is shown in Figure 2.
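The Gaussian re-blur step described in this section can be illustrated with a small sketch. It is not the authors' implementation: it works on a single 1D row across a step edge rather than a 2D image, the helper names `gaussian_kernel_1d` and `blur_row` are hypothetical, and the values of `SIGMA_1` and `SIGMA_2` are illustrative only, since the standard deviations used in the paper are not stated here.

```python
import math


def gaussian_kernel_1d(size, sigma):
    """Return a normalized 1D Gaussian kernel with `size` taps."""
    half = size // 2
    raw = [math.exp(-(x * x) / (2.0 * sigma * sigma))
           for x in range(-half, half + 1)]
    total = sum(raw)
    return [v / total for v in raw]


def blur_row(row, kernel):
    """Convolve one image row with the kernel, treating outside pixels as zero."""
    half = len(kernel) // 2
    out = []
    for c in range(len(row)):
        acc = 0.0
        for n in range(-half, half + 1):
            cc = c - n
            if 0 <= cc < len(row):
                acc += kernel[n + half] * row[cc]
        out.append(acc)
    return out


# Hypothetical input: one row crossing a sharp dark-to-bright edge.
row = [0.0] * 4 + [255.0] * 4

# Illustrative standard deviations (assumptions, not values from the paper).
SIGMA_1, SIGMA_2 = 1.0, 1.5
kernel_1 = gaussian_kernel_1d(3, SIGMA_1)
kernel_2 = gaussian_kernel_1d(3, SIGMA_2)

blur_1 = blur_row(row, kernel_1)
blur_2 = blur_row(row, kernel_2)

print([round(v, 1) for v in row])
print([round(v, 1) for v in blur_1])
print([round(v, 1) for v in blur_2])
```

Next to the edge, the larger standard deviation spreads more of the bright side into the dark side, so the two re-blurred rows differ from the input and from each other. That difference is the quantity the difference ratio works with.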
III. FIFO BASED CONVOLUTION FPGA DESIGN
From Figure 2, the MDFD algorithm has four main steps: Gaussian blur, difference ratio, average filter and blur estimation. Difference ratio and blur estimation are pixel-wise operations that only need addition, subtraction and division to combine the input signals pixel by pixel. Here we focus on the remaining two steps, Gaussian blur and average filter, both of which are essentially convolutions. The mathematical representation of 2D convolution is given by (8):
$$y[i, j] = \sum_{m} \sum_{n} h[m, n] \, x[i - m, j - n] \tag{8}$$
where x represents the input image, which is convolved with the kernel h to produce the output image matrix y. Here, m and n are the offsets of the image matrix with respect to the kernel matrix. If the size of the kernel is 3×3, m and n range from -1 to 1. The expansion of (8) results in (9):
From (9), 9 multiplications have to be computed for each pixel. To keep the output image the same size as the input image, zero-padding is applied before convolution. A pictorial representation of zero-padding is shown in Figure 3(a), while the zero-padding for the whole image is shown in Figure 3(b). In our example, zero-padding is applied to the pixels that lie on the first or last two rows and columns.
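As a rough illustration of the zero-padded 3×3 convolution described above, the following sketch pads a tiny image with a one-pixel border of zeros, convolves it, and counts the multiplications. The function names, the 3×3 test image and the integer kernel are assumptions chosen for readability, not data from the paper.

```python
def zero_pad(image, pad):
    """Surround the image with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


def convolve_3x3(image, kernel):
    """Convolve with a 3x3 kernel; the output has the same size as the input."""
    rows, cols = len(image), len(image[0])
    padded = zero_pad(image, 1)
    out = [[0] * cols for _ in range(rows)]
    multiplications = 0
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for m in (-1, 0, 1):
                for n in (-1, 0, 1):
                    # Padded coordinates are shifted by +1 relative to the image.
                    acc += kernel[m + 1][n + 1] * padded[i - m + 1][j - n + 1]
                    multiplications += 1
            out[i][j] = acc
    return out, multiplications


# Hypothetical 3x3 image and an unnormalized Gaussian-like kernel.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]

result, count = convolve_3x3(image, kernel)
print(result)
print(count)  # 9 multiplications for each of the 9 output pixels
```

The output keeps the 3×3 size of the input because the border pixels read zeros from the padding instead of falling outside the image.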
Unlike convolution on a GPU [23] or CPU [24], when designing a hardware circuit on the FPGA, developers cannot reuse previous data directly, since the FPGA does not have large-scale memory. This means that the whole FPGA system needs to be designed as an assembly line. In recent years, many FPGA designs have been proposed to reduce the convolution time. A real-time image acquisition system proposed in [25] made use of the discrete linear convolution of two finite-length sequences (N×N) to reduce the convolution processing time. Moreover, FIFO (first in, first out) based convolution designs are widely used in other hardware acceleration systems [26]-[28]. In this paper, we implement a FIFO based convolution design with fixed depth, inspired by [29].
FIFO [30] is a special memory that strictly follows the first-in, first-out rule. With a 640×480 8-bit image as the input, the FIFO based 3×3 Gaussian convolution design on the FPGA is shown in Figure 4. First, the zero-padded input image is converted into an AXI stream. The arrows in the figure indicate the flow direction of the data (i.e., the pixel values). In every clock cycle, the data flows from the current logic unit to the next one. The cells in the convolved pixel area represent 9 D flip-flops, and the data stored in each D flip-flop is kept for only one clock cycle. Because a 3×3 Gaussian convolution kernel is used in this paper, the convolution design needs 3 FIFO memories with a depth of 640-3=637 each, which means that the first pixel value entering a FIFO is not ejected from it until 637 clock cycles have passed.
Image convolution is completed by first multiplying each pixel value in the convolved region by the corresponding kernel value, then adding all the products together as the output for the central pixel of the convolved region. The Gaussian blur hardware design uses a 32-bit unsigned floating-point data stream to store the convolution results to ensure data accuracy. To reduce system delay, the two Gaussian blur processes are designed to run in parallel. The average filter on the FPGA uses the same FIFO based hardware design; the only difference is that it convolves the input image with an 11×11 average kernel, which means that it needs 11 FIFOs with a depth of 640-11=629 each.
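The streaming behavior of the FIFO based design can be mimicked in software. The sketch below is a behavioral model under stated assumptions, not the hardware in Figure 4: pixels arrive one per "clock" in raster order, each window row is held in 3 registers, and each row of registers feeds a FIFO whose depth is the row width minus 3, so a pixel reappears in the next window row exactly one image row later. Only two FIFOs are modeled because the output of a third is not needed to form the window. The names `stream_convolve_3x3` and `direct_convolve_3x3`, the 4×5 test image and the kernel are hypothetical.

```python
from collections import deque


def pad(image):
    """Add a one-pixel border of zeros around the image."""
    width = len(image[0]) + 2
    return [[0] * width] + [[0] + list(r) + [0] for r in image] + [[0] * width]


def stream_convolve_3x3(image, kernel):
    """3x3 convolution driven by a pixel stream, window registers and FIFOs."""
    padded = pad(image)
    height, width = len(padded), len(padded[0])
    depth = width - 3  # FIFO depth: one row minus the 3 window registers
    regs = [deque([0] * 3, maxlen=3) for _ in range(3)]
    fifos = [deque([0] * depth, maxlen=depth) for _ in range(2)]
    out = [[0] * (width - 2) for _ in range(height - 2)]
    for r in range(height):
        for c in range(width):
            # Sample the value leaving each stage before anything shifts.
            leaving = [regs[0][-1], fifos[0][-1], regs[1][-1], fifos[1][-1]]
            regs[2].appendleft(leaving[3])
            fifos[1].appendleft(leaving[2])
            regs[1].appendleft(leaving[1])
            fifos[0].appendleft(leaving[0])
            regs[0].appendleft(padded[r][c])
            if r >= 2 and c >= 2:
                # regs[a][b] holds padded[r - a][c - b], so the window is
                # already flipped and an element-wise product is a convolution.
                out[r - 2][c - 2] = sum(kernel[a][b] * regs[a][b]
                                        for a in range(3) for b in range(3))
    return out


def direct_convolve_3x3(image, kernel):
    """Reference result computed with random access to the padded image."""
    padded = pad(image)
    return [[sum(kernel[m + 1][n + 1] * padded[i + 1 - m][j + 1 - n]
                 for m in (-1, 0, 1) for n in (-1, 0, 1))
             for j in range(len(image[0]))]
            for i in range(len(image))]


# Hypothetical 4x5 image and an asymmetric kernel to expose orientation errors.
image = [[1, 2, 3, 4, 5],
         [6, 7, 8, 9, 10],
         [11, 12, 13, 14, 15],
         [16, 17, 18, 19, 20]]
kernel = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]

streamed = stream_convolve_3x3(image, kernel)
direct = direct_convolve_3x3(image, kernel)
print(streamed == direct)
```

The streaming version never indexes the image at random: it only sees the current pixel and whatever is stored in the registers and FIFOs, which is the property that lets the hardware avoid a frame buffer.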
The FIFO based convolution design does not need external memory, which saves the data transfer time between the FPGA and RAM. Moreover, it takes up limited FPGA resources, as detailed in Section V. In the following section, we illustrate how to use VHLS to develop FPGA designs efficiently.
IV. FPGA DESIGN OF MDFD USING VHLS
VHLS [19] is FPGA development software from Xilinx that allows developers to program Xilinx FPGA devices in C, C++ and SystemC. Users do not need to write RTL in a hardware description language (HDL); instead, Vivado High-Level Synthesis automatically converts the high-level language code into HDL and generates the IP core directly. By using VHLS for FPGA programming, users can skip the cumbersome low-level hardware language, which can significantly improve FPGA development efficiency. The workflow for FPGA development using Vivado HLS is shown in Figure 5.
V. OPTIMIZATION AND PERFORMANCE
VHLS provides 31 directives [19] that help developers optimize the hardware design and generate diverse architectures with different latency and resource utilization. Depending on the requirements, we can quickly reconfigure the hardware design with the optimization directives and generate the RTL model with great flexibility. To achieve a real-time processing system, the directives have to be used to optimize the processing rate.
We first apply the PIPELINE directive to improve the performance of the MDFD system. The PIPELINE directive is explained in detail in Figure 6. Without the PIPELINE directive, the operations of the system are executed in strict sequence. Executing a task in a pipelined way means that the next operation can begin before the current operation is completed, which reduces the initiation interval and allows concurrent operations in a loop or a function. The LOOP_MERGE and ARRAY_PARTITION directives are also used in the proposed design. The ARRAY_PARTITION directive divides large arrays into multiple smaller arrays or into individual registers, which improves access to data and removes block RAM bottlenecks. The LOOP_MERGE directive merges consecutive loops, which can reduce the overall latency and interval, increase resource sharing and improve logic optimization. Table 1 and Chart 1 show the resource utilization and performance of the MDFD design on the FPGA before and after optimization. With the PIPELINE directive, the latency and interval decrease by 40% and 59% respectively, while the resource utilization remains almost unchanged.
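The effect of pipelining on loop latency can be estimated with a simple cycle-count model. The sketch below is a generic approximation, not the tool's scheduler and not a reproduction of the figures in Table 1: it assumes every loop iteration takes `depth` cycles and that a pipelined loop starts a new iteration every `ii` cycles. The function names and the depth of 3 cycles are hypothetical.

```python
def sequential_latency(iterations, depth):
    """Cycles when each iteration must finish before the next one starts."""
    return iterations * depth


def pipelined_latency(iterations, depth, ii=1):
    """Cycles when a new iteration starts every `ii` cycles (initiation interval)."""
    return depth + (iterations - 1) * ii


# One iteration per pixel of a 640x480 image.
PIXELS = 640 * 480
# Hypothetical iteration depth, e.g. read, compute, write.
DEPTH = 3

before = sequential_latency(PIXELS, DEPTH)
after = pipelined_latency(PIXELS, DEPTH, ii=1)

print(before)
print(after)
print(round(100.0 * (before - after) / before, 1))
```

In this idealized model the saving approaches (depth - ii) / depth for long loops. Real designs fall short of that when memory ports or data dependencies force a larger initiation interval, which is one reason ARRAY_PARTITION is often applied together with PIPELINE.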
The target FPGA device, Xilinx XC7K420TIFFV901, is selected in the HLS design module and the clock period is set to 50 ns. Multiplying the latency by the clock period shows that the system takes 0.046 s to process a 640×480 defocus image at the operating frequency of 20 MHz, which implies that the system can process about 22 images per second. Table 2 shows the resource utilization of the proposed system on the target FPGA device; all resources are within acceptable limits. A comparison between the proposed MDFD system and previous real-time methods is presented in Table 3.
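The frame-rate arithmetic in this paragraph can be checked in a few lines. The clock period and the per-frame time are taken from the text; the latency in cycles is back-calculated from the rounded 0.046 s and is therefore only approximate.

```python
CLOCK_PERIOD_S = 50e-9   # clock period from the text: 50 ns
FRAME_TIME_S = 0.046     # time to process one 640x480 image, from the text

clock_mhz = 1.0 / CLOCK_PERIOD_S / 1e6
latency_cycles = FRAME_TIME_S / CLOCK_PERIOD_S   # approximate, from rounded time
frames_per_second = 1.0 / FRAME_TIME_S

print(round(clock_mhz))
print(round(latency_cycles))
print(round(frames_per_second, 2))
```

A frame time of 0.046 s corresponds to slightly under 22 frames per second, which is why the text reports "about 22".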
Following common practice, we regard a system running at more than 20 fps as a video-rate real-time system. According to Table 3, only [11], [18] and the proposed system satisfy this standard. However, only our system can reconstruct the 3D shape from a single monocular defocus image, which makes it a state-of-the-art real-time MDFD system. We also implemented the MDFD method [8] on our workstation with an Intel Core i5 4210H CPU. As shown in Table 3, MDFD runs 33 times faster on the FPGA than on the traditional CPU platform, which shows the performance advantage of the FPGA over the CPU.
VI. TEST ON THE REAL DEFOCUS IMAGE
A. 3D OBJECTS ON THE FOCAL PLANE
We used a PULNIX TM-765 monochrome camera with a 50 mm manual lens and a 6.5 mm external aperture diameter to obtain defocus images of different real objects (such as sponges, buttons and wooden boxes) [31]. The objects were placed at a distance of 744 mm, where the focal plane was located. The aperture number of the lens is given by (10):
$$f = \frac{F}{d} \tag{10}$$
where F is the focal length and d is the aperture diameter. In this paper, the aperture number of the lens is f = 50/6.5 ≈ 7.692.
B. 3D SHAPE RECONSTRUCTION
As the experimental results in Figure 7 show, the proposed system can reconstruct the 3D shape map from only a monocular defocus image, and it adapts well to real scene objects. Since the FPGA strictly follows the first-in, first-out principle, the output image is mirrored horizontally; that is, the pixels that enter the FPGA first are processed first and output first. The 2D power spectral density (PSD) plots are shown in Figure 7(d). In the proposed system, all the defocus images whose 3D shape can be completely reconstructed have a single-peak-value PSD, which indicates a large variation of pixel intensity values within a short spatial span.
FIGURE 8. (a)-(c) The regions selected to determine the k value are in the red boxes. The test regions are in the black boxes, each labeled with its name.
C. DETERMINE K TO PREDICT THE ACTUAL DEPTH
According to (7), k must be determined to connect the degree of defocus to the actual depth. To determine a proper k value for the proposed system, we randomly select 3 regions on the focal plane from the blur estimation results of the sponge, button and wooden box, as shown in the red boxes of Figure 8. The mean blur values of these 3 matrices are 0.592, 0.598 and 0.596 respectively. Through (7), the actual depth can be calculated from k, the blur degree and the camera parameters. All the objects were placed 744 mm away from the lens, as mentioned in part A. The root mean square error (RMSE) between the predicted and actual depth was computed, and the RMSE curve was plotted against k in Figure 9. We set k to 140.5, where the RMSE curve reaches its minimum. We then manually selected 9 other matrices (3 matrices per object, see the black boxes in Figure 8) next to the previously selected matrices to evaluate the depth accuracy with the determined k value. The evaluation results on the 3D objects are shown in Table 4. Because the selected test matrices did not all have the same size, the weighted mean was used in the evaluation. Overall, the proposed MDFD system achieved 93.29% depth accuracy on the real 3D objects test.
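The search for k can be sketched as a simple sweep that minimizes the RMSE. Equation (7) is not reproduced in this text, so the sketch assumes it takes Pentland's form D = F·v0 / (v0 − F − σ·k·f); this form, the reading of v0 as the 744 mm distance to the focal plane, and the sweep range and step are all assumptions. The blur values, focal length, aperture diameter and object distance are taken from the text.

```python
import math

FOCAL_LENGTH_MM = 50.0
APERTURE_DIAMETER_MM = 6.5
V0_MM = 744.0             # distance between lens and focal plane (assumed reading)
ACTUAL_DEPTH_MM = 744.0   # objects were placed on the focal plane
MEAN_BLUR = [0.592, 0.598, 0.596]   # sponge, button, wooden box

F_NUMBER = FOCAL_LENGTH_MM / APERTURE_DIAMETER_MM


def depth_mm(sigma, k):
    """Assumed form of (7): depth from blur degree sigma and constant k."""
    return (FOCAL_LENGTH_MM * V0_MM
            / (V0_MM - FOCAL_LENGTH_MM - sigma * k * F_NUMBER))


def rmse(k):
    """Root mean square error between predicted and actual depth."""
    errors = [depth_mm(sigma, k) - ACTUAL_DEPTH_MM for sigma in MEAN_BLUR]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical sweep: k from 130.0 to 150.0 in steps of 0.5.
candidates = [130.0 + 0.5 * i for i in range(41)]
best_k = min(candidates, key=rmse)

print(round(F_NUMBER, 3))
print(best_k)
```

With these inputs and this step size, the sweep selects k = 140.5, which matches the value reported above and suggests that the assumed form of (7) is at least consistent with the experiment.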
VII. LIMITATION AND DISCUSSION

A. SINGLE-PEAK-VALUE PSD REQUIREMENT
In Section VI, the proposed MDFD system reconstructs the 3D shape from a single defocus image taken by the PULNIX TM-765 monochrome camera. However, not all defocus images are ideal inputs for the proposed system. Figure 10 shows an incomplete 3D reconstruction result for a defocus image of a cup with a multi-peak-value PSD. A defocus image with a multi-peak-value PSD represents low pixel intensity and results in a disconnected 3D shape, as shown in Figure 10(b). Zhuo [9] performed matting Laplacian interpolation to deal with a sparse defocus image, which consumed large-scale computing resources and required 15 times the processing time of Hu and Haan's method [8], as shown in Table 3, so it cannot be used in a video-rate real-time processing system at this point.
B. HIGH RESOURCE UTILIZATION
VHLS is an algorithm-level FPGA development tool, which means that developers do not need to focus on the low-level circuit design. In this way, we can design a complex FPGA system in a short period of time, but at the cost of high resource utilization. The system proposed in [18] was developed in traditional VHDL using Xilinx ISE 10.1. It needs a pair of defocus images as the input and its processing is 1.8 times faster than ours. The flip-flop and LUT usage of our proposed system is 39369 and 207345 higher than that of the system in [18], respectively, which may affect the system stability and increase the energy consumption.
VIII. CONCLUSION
This paper details how to design MDFD on an FPGA to reconstruct the 3D shape from a single monocular defocus image. The FIFO based MDFD design is implemented on the Xilinx XC7K420TIFFV901 FPGA on-chip platform by converting C/C++ code to a hardware logic circuit through VHLS. The FPGA design is further optimized with 3 different optimization directives to reduce the latency and interval. The proposed real-time MDFD system can process 22 defocus images of 640×480 size per second while keeping the resource utilization within acceptable limits on the target FPGA device. After determining the k value that connects the blur degree to the actual depth, our system achieves 93.29% depth accuracy on the real 3D objects test. Compared with other related 3D shape detection systems, this MDFD system reaches the state-of-the-art level in terms of processing time, depth accuracy and low input data volume. Designing the FPGA system with VHLS greatly increases the development efficiency and accelerates the path from software code to hardware platform. The design process of the MDFD system illustrates the effectiveness of VHLS as a powerful development tool for FPGA implementation.
In the future, we need to improve our work as follows: 1) develop a more efficient and accurate algorithm to reconstruct 3D shape from a single defocus image without strict limitations; 2) further explore the use of optimization directives provided by VHLS to accelerate the current proposed system; 3) combine VHLS and traditional VHDL to lower the resource utilization and enhance the system stability; 4) implement the proposed MDFD system in robotics or other related fields.
