ABSTRACT 2-D-to-3-D conversion is one way to make full use of existing 2-D content to produce 3-D content. Real-time 2-D-to-3-D conversion is required for 3-D consumer electronic devices, which demands fast processing, especially for high-definition videos. In this paper, we propose a reconfigurable VLSI architecture for real-time 2-D-to-3-D conversion. Two different depth-retrieval methods are implemented in this architecture so that the best method, or a combination of methods, can be chosen. The proposed architecture also supports different resolutions (4K, 1080p, and 720p), and the original view can be configured as either the left view or the right view. To overcome the "Memory Wall" problem, we propose a data reuse method that reduces memory traffic so that the overall performance is sufficient for real-time conversion. Experimental results show that the implemented 2-D-to-3-D architecture achieves state-of-the-art throughput (4K@30f/s).
I. INTRODUCTION
Three-dimensional video is becoming increasingly popular and is attracting a growing number of viewers who want to experience vivid stereoscopic visual effects. Many common consumer electronic devices, such as 3D-TVs, laptops and tablets, can now play 3D video. However, it is costly and time-consuming to produce high-quality 3D content directly. To alleviate the shortage of 3D material, converting 2D to 3D is one way to make full use of existing 2D material. The difference between 2D content and 3D content is that 3D content contains depth information, which represents the relative distances between objects. The generation of comfortable and natural depth is important for 2D-to-3D conversion. There are two steps in 2D-to-3D conversion. First, a depth map with the depth values of all pixels is extracted from a 2D picture. Second, depth-image-based rendering (DIBR) is applied to generate new views from the original view and the depth map. Semiautomatic depth map generation [1] - [4] is very time-consuming. Therefore, algorithms for automatic depth map generation have been proposed to reduce the time overhead [5] - [12] .
Different depth-retrieval methods may be effective in different scenarios, so a system needs to support several depth-retrieval methods in order to select the best method or an optimal combination of methods [10] . On the other hand, real-time 2D-to-3D conversion is required for 3D consumer electronic devices, which demands fast processing, especially for high-definition videos. The "Memory Wall" [13] is an obstacle to real-time conversion; it arises from the performance gap between computation speed and memory access speed. A few VLSI architectures have been proposed to accelerate real-time 2D-to-3D conversion or view synthesis [14] - [17] , but these architectures do not consider the depth-retrieval part of 2D-to-3D conversion. A CPU+GPU platform is proposed to achieve 1080p@30fps 2D-to-3D conversion in [18] . After optimization of the 2D-to-3D conversion algorithm, 1080p@36fps is achieved on an Intel Core i7 processor [19] .
In this paper, we propose a reconfigurable VLSI architecture for real-time 2D-to-3D conversion. The proposed architecture is a complete (including both depth generation and DIBR) 2D-to-3D VLSI architecture for high-definition videos. The real-time performance of the 2D-to-3D architecture can reach 1080p@120fps or 4K@30fps. The proposed architecture supports two different depth-retrieval methods, an edge-based method and an edge-color-based method, so that the best method or the best combination of methods can be chosen [10] . The proposed architecture also supports different input/output video resolutions, such as 4K, 1080p and 720p. The input original view can be output as the left view or the right view according to the habits of the viewers. The hardware modules are reused by different configurations to reduce the hardware overhead. To overcome the "Memory Wall" problem, we propose different levels of data reuse that reduce memory traffic so that the overall performance is sufficient for real-time 2D-to-3D conversion. The proposed data reuse method is implemented in the on-chip memory design of the 2D-to-3D architecture. The contributions of this paper are as follows.
(1) A real-time complete 2D-to-3D VLSI architecture is proposed and implemented.
(2) The proposed architecture is reconfigurable to support different depth-retrieval methods and different resolutions.
(3) The proposed data reuse method is useful for reducing memory traffic not only for 2D-to-3D conversion but also for other similar applications.
The rest of the paper is organized as follows. The 2D-to-3D algorithms supported by the proposed architecture are introduced in Section II. The proposed 2D-to-3D VLSI architecture is presented in Section III. Experiment results are shown and analyzed in Section IV. Section V is the conclusion.
II. 2D-TO-3D ALGORITHMS
In this section, we present the two 2D-to-3D algorithms supported by the proposed architecture. The two algorithms are both divided into two parts, depth-retrieval and DIBR. For the two 2D-to-3D algorithms, the depth-retrieval parts are different while the DIBR parts are the same.
A. TWO METHODS TO RETRIEVE DEPTH MAP
Two depth-retrieval methods, an edge-based method and an edge-color-based method, are implemented in the proposed architecture to generate the initial depth map. The two methods are presented in this subsection.
1) EDGE-BASED METHOD
The edge-based method uses the edge information in the image as the depth cue. Sobel operator is used for edge detection. The Sobel edge detection computes the edge direction of target pixel and checks whether the target pixel is on an edge.
The workflow of Sobel edge detection is described in Fig. 1 . The computation for edge detection uses the first two templates (matrices) in Fig. 2 to compute the gradients of the target pixel in the X direction (Gx) and the Y direction (Gy), respectively. This step needs the eight reference pixels around the target pixel, which are multiplied by the templates. For example, Gx is computed as in (1), where f(x,y) is the pixel in a frame at coordinates x and y. The Y direction gradient (Gy) and the threshold gradient (Gth) are computed in the same way as Gx, using the other two templates in Fig. 2. The three templates share the same input pixels. The final gradient G is computed according to (2) and compared with the threshold (Gth) to determine whether the pixel is on an edge. The Sobel operation outputs the binary values 1 (on the edge) and 0 (not on the edge). The binary values 1 and 0 are then transformed to 255 and 0, respectively, and an initial depth map is generated for the input image [20] .
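To make the edge-based method concrete, the following is a minimal software sketch in Python. It is not the paper's hardware design. The kernels are the standard Sobel templates, which are an assumption because Fig. 2 is not reproduced in the text; the combined gradient is taken as |Gx| + |Gy|, which is only one common form of (2); and `threshold` is a hypothetical constant parameter, whereas the paper derives Gth from a third template.

```python
import numpy as np

# Standard Sobel templates (assumed; the templates of Fig. 2 may differ in sign or layout).
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
SOBEL_Y = np.array([[-1, -2, -1],
                    [0, 0, 0],
                    [1, 2, 1]])


def sobel_edge_depth(luma, threshold):
    """Return an initial depth map: 255 on edge pixels, 0 elsewhere.

    `threshold` is a hypothetical constant standing in for Gth.
    Border pixels have no full 3x3 neighbourhood and are left at 0.
    """
    luma = np.asarray(luma, dtype=np.int32)
    h, w = luma.shape
    depth = np.zeros((h, w), dtype=np.uint8)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = luma[y - 1:y + 2, x - 1:x + 2]  # target pixel plus 8 neighbours
            gx = int(np.sum(window * SOBEL_X))
            gy = int(np.sum(window * SOBEL_Y))
            g = abs(gx) + abs(gy)  # assumed form of the combined gradient G
            if g > threshold:
                depth[y, x] = 255  # binary 1 mapped to 255
    return depth


if __name__ == "__main__":
    image = np.zeros((5, 6), dtype=np.int32)
    image[:, 3:] = 100  # vertical step edge between columns 2 and 3
    print(sobel_edge_depth(image, threshold=200))
```

In this example only the two columns next to the step edge are marked as edges in the interior rows.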
2) EDGE-COLOR-BASED METHOD
The edge-color-based method considers two depth cues, edge and color [18] . There are three steps to generate the initial depth map. First, we use Sobel edge detection to judge whether a pixel belongs to an edge, which is the same as the edge-based method. The outputs of this step are 1 (on the edge) and 0 (not on the edge), which are sent to the next step.
Second, using the edge information, a global depth for each row is obtained from the cumulative horizontal edge histogram, according to (3) . Depth_global is the global depth for a row, which means that all the pixels in one row have the same global depth. Edge_count is the number of edge pixels for the current row. Edge_sum is the total number of edge pixels in the previous frame. The previous frame is used because, in our hardware implementation, the number of edge pixels in the current frame is not known until the end of the frame, and the number of edge pixels is usually very close between two adjacent frames. Therefore, Edge_sum needs to be updated at the end of each frame.
Finally, the combination of the Y, Cb and Cr color channels is used to refine the global depth into a local depth according to (4) , where f1, f2 and f3 are linear functions. The local depths of all the pixels in the current frame compose the initial depth map.
Depth-image-based rendering (DIBR) is used to generate the new view from the original 2D video sequence and the extracted depth information (depth map). DIBR includes four parts (Fig. 3): pre-processing of the initial depth map, converting depth to disparity, pixel shifting and hole filling [21] , [22] . First, as a pre-processing step, a Gaussian filter is used to smooth the initial depth map and the final depth value for each pixel is generated. The Gaussian filter employs a 5 × 5 template (Fig. 4) , which needs 24 reference pixels around the target pixel. Second, the smoothed depth value is converted to the disparity between the position of the pixel in the original view and that in the new view, according to (5) . D represents the viewing distance from the display. t_c is the interpupillary distance, that is, the separation between the human eyes. Z is the normalized depth value for each pixel.
Third, the pixels of the new view are generated according to the disparity and the original view. However, there may be some holes between pixels of the new view because some of the pixels fall in the same position after pixel shifting. Therefore, as the last step, hole filling is applied to fill color into any holes that remain.
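The last two DIBR steps can be illustrated for a single image row with the short Python sketch below. It is a simplified model under stated assumptions, not the paper's implementation: the per-pixel disparity is passed in directly as an integer (the conversion in (5) is not reproduced), and the hole-filling rule (copy the nearest valid pixel on the left, or on the right for leading holes) is a hypothetical choice, since the paper only states that neighbouring pixels are used.

```python
HOLE = None  # marker for positions that received no pixel after shifting


def shift_row(row, disparity):
    """Move each pixel horizontally by its integer disparity.

    Positions that no pixel lands on stay marked as HOLE. When two pixels
    land on the same position, the later one overwrites the earlier one.
    """
    out = [HOLE] * len(row)
    for x, (pixel, d) in enumerate(zip(row, disparity)):
        nx = x + d
        if 0 <= nx < len(row):
            out[nx] = pixel
    return out


def fill_holes(row):
    """Fill holes with the nearest valid pixel to the left, else to the right."""
    out = list(row)
    last = HOLE
    for x in range(len(out)):  # left-to-right pass
        if out[x] is HOLE:
            out[x] = last
        else:
            last = out[x]
    nxt = HOLE
    for x in range(len(out) - 1, -1, -1):  # right-to-left pass for leading holes
        if out[x] is HOLE:
            out[x] = nxt
        else:
            nxt = out[x]
    return out


if __name__ == "__main__":
    shifted = shift_row([10, 20, 30, 40, 50], [0, 0, 1, 1, 1])
    print(shifted)
    print(fill_holes(shifted))
```

Here the pixel at position 2 moves to position 3, which leaves a hole at position 2 that is then filled from its left neighbour.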
III. PROPOSED ARCHITECTURE FOR 2D-TO-3D CONVERSION
In this section, we first present the overall architecture ( Fig. 5 ) for real-time 2D-to-3D conversion, which supports the choice of a best method or optimal combination of different methods. Then we explain each module of the proposed architecture in detail. 
A. OVERALL ARCHITECTURE
The 2D-to-3D architecture adopts a pixel-level pipeline. The 2D sequence is inputted pixel by pixel and outputted unchanged as the original view. The new view is generated after the processing of two modules, depth map generation and DIBR. The input buffer is designed as a line buffer to store several lines of the original image in order to be accessed by different modules of the proposed 2D-to-3D architecture at the same time. The configuration module can be configured to select a best depth generation algorithm for the depth map generation module, which supports two different depth map generation methods. The configuration module can also be configured to support different resolutions, such as 4K, 1080p or 720p. The depth map generation module is used to generate an initial depth map. After that, DIBR module processes the initial depth map and generates a new view. Depth map buffer stores several lines of the initial depth map, which will be pre-processed by Gaussian filter in the DIBR module.
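As a rough check on what a pixel-level pipeline must sustain, the following Python sketch computes the pixel rate of the two throughput figures quoted in the introduction. It assumes that 4K means 3840×2160 (the paper does not state the exact frame size), that the pipeline accepts one pixel per clock cycle, and that blanking intervals are ignored; under these assumptions the pixel rate equals the minimum clock frequency.

```python
def pixel_rate(width, height, fps):
    """Pixels that must be processed per second for a given video mode."""
    return width * height * fps


# Assumed frame sizes; 4K is taken as 3840x2160 for illustration.
MODES = {
    "4K@30fps": (3840, 2160, 30),
    "1080p@120fps": (1920, 1080, 120),
}

if __name__ == "__main__":
    for name, (w, h, fps) in MODES.items():
        rate = pixel_rate(w, h, fps)
        # With one pixel per cycle (assumed), this is also the minimum clock in MHz.
        print(f"{name}: {rate / 1e6:.3f} Mpixel/s")
```

Both modes come to the same pixel rate, about 248.8 Mpixel/s, which is consistent with the paper quoting them as alternative operating points of the same design.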
B. RECONFIGURABLE DEPTH MAP GENERATION
The depth map generation module supports two depth-retrieval methods, an edge-based method and an edge-color-based method (Fig. 6) . The Sobel module detects whether a pixel belongs to an edge, and is shared by the two depth-retrieval methods. The global depth module and local depth module are only used for the edge-color-based method to generate the global depth and the local depth respectively. The value conversion module is used by the edge-based method to convert the values 1 and 0 to 255 and 0 respectively to get an initial depth. The initial depth values generated by the best depth-retrieval method can be selected through the configuration module (Algorithm selection) according to the content of the image, and are then stored in the depth map buffer.
1) SOBEL MODULE
The Sobel module (Fig. 7) checks whether the target pixel is on the edge. It receives three pixels each cycle from the input buffer. The three pixels are from adjacent three lines in the input buffer. There is a register array for storing the input pixels because nine pixels (3×3) are needed each cycle for the computing units of Sobel module.
The fx module and the fy module compute the gradients of the target pixel in the x direction and the y direction respectively, using the first two templates in Fig. 2 . The threshold module computes the threshold with the third template in Fig. 2 and reads the same nine pixels (3×3) from the register array as the fx and fy modules. The fx module, the fy module and the threshold module all use multiple multipliers to perform the multiplications in parallel (Fig. 8) . After that, an adder tree performs the additions and the gradients (Gx, Gy and Gth) are generated. This fully parallel design maximizes the processing speed of Sobel edge detection.
Final gradient module computes the combined gradient G of x direction and y direction, according to (2) . After that, we can check whether the target pixel is on an edge by comparing G with Gth. 
2) GLOBAL DEPTH MODULE
The architecture of the global depth module is shown in Fig. 9 . It generates the global depth for each row. In this design, the Row/Col Count module issues control commands that mark the end of a line or a frame, according to input signals from the Sobel module (Edge_ready) and the configuration module (Width/Height of the frame). The edge count of the current row is accumulated in the Edge_row module. The Edge_frame module keeps both the accumulated edge count of the current frame and the total edge count of the previous frame. The global depth of a row is computed by the Depth_comp module when the Row/Col Count module detects the end of this row, using the edge count of the current row in Edge_row and the edge count of the previous frame in Edge_frame according to (3) . Edge_sum should ideally be the edge count of the current frame, but the previous frame's count is used to enable the pipeline structure. Because the two counts can differ, the computed value may exceed the valid depth range, so overflow must be avoided in the Depth_comp module.
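The bookkeeping of the global depth module can be mimicked in software as sketched below in Python. This is an illustrative model only. Equation (3) is not reproduced in the text, so the depth formula used here (255 times the cumulative edge count of the current frame divided by the previous frame's total, saturated at 255) is an assumption, as are the class name `GlobalDepth` and the parameter `initial_edge_sum`. The sketch mainly shows why saturation is needed when the previous frame's total is used as the divisor.

```python
class GlobalDepth:
    """Software model of the Edge_row / Edge_frame / Depth_comp bookkeeping."""

    def __init__(self, width, height, initial_edge_sum=1):
        self.width = width
        self.height = height
        self.edge_row = 0                       # edges in the current row
        self.edge_frame = 0                     # edges accumulated in the current frame
        self.edge_sum_prev = initial_edge_sum   # total edges of the previous frame
        self.col = 0
        self.row = 0

    def push(self, is_edge):
        """Feed one edge bit; return the row's global depth at end of row, else None."""
        self.edge_row += int(is_edge)
        self.col += 1
        if self.col < self.width:
            return None
        # End of row: update the cumulative count and compute the depth.
        self.edge_frame += self.edge_row
        raw = 255 * self.edge_frame // max(1, self.edge_sum_prev)  # assumed form of (3)
        depth = min(255, raw)  # saturate: current frame may have more edges than the previous one
        self.edge_row = 0
        self.col = 0
        self.row += 1
        if self.row == self.height:
            # End of frame: the current total becomes the reference for the next frame.
            self.edge_sum_prev = self.edge_frame
            self.edge_frame = 0
            self.row = 0
        return depth


if __name__ == "__main__":
    gd = GlobalDepth(width=2, height=2, initial_edge_sum=4)
    frame1 = [1, 1, 1, 0]
    frame2 = [1, 1, 1, 1]
    print([d for d in map(gd.push, frame1 + frame2) if d is not None])
```

In the second frame of the example the edge count exceeds the previous frame's total, so the unsaturated value would leave the 8-bit range; the `min` call models the overflow protection.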
3) LOCAL DEPTH MODULE
When the global depth of a row is inputted to the local depth module, as shown in Fig. 10 , local depths of all pixels in this row are computed one by one according to (4) . A depth register is used to store the global depth of the current row. 
C. DIBR ARCHITECTURE
The DIBR architecture includes four modules (Fig. 11): Gaussian Filter, Disparity, Shift and Hole Fill. As a pre-processing module, the Gaussian Filter smooths the initial depth map from the depth map buffer and generates the final depth value for each pixel. The final depth value is converted to the disparity by the disparity module. The pixels of the new view are generated by the shift module using the disparity and the original view. The hole fill module fills color into the holes generated by the shift module.
1) GAUSSIAN FILTER
The Gaussian filter reads the depth values from the depth map buffer and employs a 5 × 5 template (Fig. 4) to smooth the depth map. A highly parallel architecture is implemented to ensure real-time processing (Fig. 8) . The design of the Gaussian filter module is very similar to that of fx in the Sobel module except that the scale of the Gaussian filter is larger. First, a multiplier array with 25 multipliers performs the 25 multiplications in parallel. Then, an adder tree sums the 25 products with minimal latency.
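The multiplier-array-plus-adder-tree structure can be modelled in Python as follows. This is a behavioural sketch, not the paper's RTL. The 5×5 binomial kernel (weights summing to 256) is an assumption standing in for the template of Fig. 4, and the function names are hypothetical. The sketch also reports how many adder levels a pairwise tree needs, which indicates the latency of the reduction.

```python
def adder_tree(values):
    """Sum a non-empty list by pairwise reduction; return (total, number_of_levels)."""
    level = list(values)
    levels = 0
    while len(level) > 1:
        # Add neighbouring pairs; an unpaired last element passes through unchanged.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], levels


# Assumed 5x5 binomial template; its weights sum to 256.
BINOMIAL = [1, 4, 6, 4, 1]
KERNEL = [[a * b for b in BINOMIAL] for a in BINOMIAL]


def gaussian_pixel(window):
    """Smooth one 5x5 window: 25 parallel products, then an adder tree, then divide by 256."""
    products = [window[r][c] * KERNEL[r][c] for r in range(5) for c in range(5)]
    total, levels = adder_tree(products)
    return total >> 8, levels


if __name__ == "__main__":
    flat = [[100] * 5 for _ in range(5)]
    print(gaussian_pixel(flat))
```

With 25 inputs the pairwise tree has five levels (25, 13, 7, 4, 2, 1), compared with 24 sequential additions in a simple accumulator.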
2) DISPARITY, SHIFT AND HOLE FILL
The disparity module receives the smoothed depth values from the Gaussian filter one by one (Fig. 12) . Each depth value is first normalized and then transformed to the disparity according to (5) . The Shift module first transforms the disparity to an address in the register array with a decoder. Then the pixels of the original view from the input buffer are stored into the register array according to the address. The L/R signal from the Configurations module selects whether the conversion is from left view to right view or the reverse. The Hole fill module fills color into the holes that are generated by pixel shifting. Each pixel from the Shift module is first checked to determine whether it is a hole. If it is a hole, the pixels near it are used to fill it. There is also a register array to store the line of pixels for hole filling. The output data from the hole fill module are the final data of the new view.
D. MEMORY TRAFFIC REDUCTION FOR REAL-TIME CONVERSION
We propose three levels of data reuse to reduce the memory traffic of the real-time 2D-to-3D architecture. Exploiting data reuse can improve the performance of memory access and help overcome the "Memory Wall" to ensure real-time processing. The data reuse method is implemented in the buffer design for the Sobel module and the Gaussian module to reduce both the off-chip and on-chip memory traffic in the proposed architecture. Sobel and Gaussian filters are widely used in many applications, so the data reuse method proposed in this paper can also be used in other applications. Some basic concepts are explained in Table 1 . One frame or depth map is divided into many blocks (BK) with the block size N×N. A block strip (BS) represents a row 
1) DIFFERENT LEVELS OF DATA REUSE
For Sobel edge detection or Gaussian filtering, a block-based processing mode is used in a frame or in a depth map. This processing mode contains at least two levels of data reuse, block (BK) level and block strip (BS) level. For the BK level data reuse (Fig. 13), an (N−1)×N buffer is needed for reusable data between two neighboring blocks. For the BS level data reuse (Fig. 15), a W×(N−1) buffer is used for storing reusable data between two neighboring block strips. With no data reuse (Table 2) , the memory traffic is N×N×W×H pixels per frame because W×H blocks of size N×N need to be loaded. In this case, no buffer is needed. The memory traffic is reduced to N×W×H and W×H for the two data reuse levels, BK and BS, respectively. However, a more efficient data reuse level usually requires a larger buffer: (N−1)×N for BK and (N−1)×W for BS. There is another data reuse level between BK and BS, named BK+ (Fig. 14) . In each step of BK+, N+n pixels are loaded into the buffer and n+1 target lines or block strips are processed at the same time. The memory traffic of BK+ is (N+n)×W×H/(n+1) with a buffer size of (N−1)×(N+n) (Table 2). The different data reuse levels give a good trade-off between memory traffic and buffer size. According to Table 2 , we can choose a proper data reuse level for a given application scenario. For example, we can choose Level BS if performance is important and sufficient buffer space is available.
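The trade-off described above can be tabulated directly from the formulas in the text. The Python sketch below is a small calculator, with the function name `reuse_costs` chosen for illustration; it returns memory traffic in pixels per frame and buffer size in pixels for each reuse level. Exact fractions are used for BK+ because (N+n)×W×H is not always divisible by n+1.

```python
from fractions import Fraction


def reuse_costs(N, W, H, n=1):
    """Return {level: (traffic in pixels per frame, buffer size in pixels)}.

    N is the block size, W and H the frame width and height, and n the
    number of extra lines loaded per step in Level BK+.
    """
    return {
        "none": (N * N * W * H, 0),
        "BK": (N * W * H, (N - 1) * N),
        "BK+": (Fraction((N + n) * W * H, n + 1), (N - 1) * (N + n)),
        "BS": (W * H, (N - 1) * W),
    }


if __name__ == "__main__":
    # Sobel (N = 3) on a 1920x1080 frame, with n = 1 chosen for illustration.
    for level, (traffic, buf) in reuse_costs(3, 1920, 1080, n=1).items():
        print(f"{level:>4}: traffic = {float(traffic):>12.0f} pixels, buffer = {buf} pixels")
```

Setting n = 0 makes the BK+ expressions collapse to those of Level BK, which is a quick consistency check on the formulas.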
2) IMPLEMENTATION OF DATA REUSE FOR SOBEL MODULE
According to the proposed data reuse methods, we design the input buffer to implement the BS level data reuse (Fig. 16) and a register array to implement the BK level data reuse (Fig. 17) so that a two-level data reuse method is implemented for the Sobel edge detection module.
The input buffer contains four LineStorages (Fig. 16) , each of which can store a line of pixels and works as a shifting buffer. A shifting buffer means that the pixels in the buffer are shifted from the left to the right in the buffer. The first LineStorage receives pixels one by one from the input. The other LineStorages receive pixels one by one from their neighboring LineStorages. The pixels from the top two LineStorages are used by DIBR module and local depth module respectively. The length of LineStorage is designed to be 4096 pixels to support both the maximum resolution of 4K and other resolutions less than 4K.
For Sobel edge detection, the block size N equals 3. Two LineStorages are used for Sobel to store two lines of pixels. The input pixel (Pixel_in) is directly used as the third pixel line of Sobel edge detection. In this way, Level BS data reuse is realized in the input buffer for Sobel edge detection. The number of pixels loaded from off-chip is W×H when using the input buffer (SRAM) for the Sobel module, while the number of pixels from off-chip is 9×W×H with no data reuse ( Table 3 ). The input buffer size is 2×W to implement the Level BS data reuse for Sobel edge detection.
For Sobel edge detection, a shifting register array is also designed for implementing Level BK data reuse (Fig. 17) . Six (2×3) registers form a two-column register array. In each cycle, three pixels are received from the input buffer and shifted into the register array. Nine pixels are outputted to the computing logic of Sobel module at each cycle. In this way, the memory traffic between the input buffer and the register array is reduced to 3×W×H, which is 1/3 of the case with no data reuse.
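The combination of line buffers (Level BS) and a two-column register array (Level BK) can be imitated in software as in the Python sketch below. It is a behavioural model under assumptions, not the paper's circuit: the generator name `stream_windows` is hypothetical, each line storage is modelled as a fixed-length FIFO, and border pixels simply produce no window. Every input pixel is read exactly once, and each step delivers a new column of three pixels that is combined with the two stored columns to form the 3×3 window.

```python
from collections import deque


def stream_windows(frame):
    """Yield (y, x, window) for each interior pixel while reading each input pixel once."""
    h, w = len(frame), len(frame[0])
    line1 = deque([0] * w, maxlen=w)  # line storage holding the previous line
    line2 = deque([0] * w, maxlen=w)  # line storage holding the line before that
    cols = deque(maxlen=2)            # two stored columns: the 2x3 register array
    for y in range(h):
        cols.clear()                  # a window never spans two image rows
        for x in range(w):
            pixel_in = frame[y][x]    # the only read of this pixel from the source
            mid = line1[0]            # pixel at (y - 1, x)
            top = line2[0]            # pixel at (y - 2, x)
            column = (top, mid, pixel_in)
            if y >= 2 and len(cols) == 2:
                left, centre = cols
                window = [[left[r], centre[r], column[r]] for r in range(3)]
                yield y - 1, x - 1, window  # window centred on (y - 1, x - 1)
            cols.append(column)       # shift the new column into the register array
            line2.append(mid)         # the second storage receives from the first
            line1.append(pixel_in)    # the first storage receives the input pixel


if __name__ == "__main__":
    frame = [[10 * y + x for x in range(5)] for y in range(4)]
    for y, x, win in stream_windows(frame):
        print((y, x), win)
```

For a 4×5 test frame this produces the six interior 3×3 windows, each identical to what direct indexing of the frame would give.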
3) DEPTH BUFFER IMPLEMENTATION FOR GAUSSIAN FILTER
The depth map in our design is produced by the depth map generation module on chip instead of storing the depth map in the off-chip memory. A depth map buffer is designed to store the depth map on chip (Fig. 18) . The depth buffer avoids storing the depth data to the off-chip memory and then loading them to the on-chip memory so it can reduce a lot of off-chip memory traffic and can improve the overall performance of 2D-to-3D. The structure of the depth map buffer is similar to that of the input buffer. The difference is that the 8-bit depth data are stored in the depth map buffer instead of 24-bit pixels. Because the window size of Gaussian filter is 5×5 in our design, four DepthStorages are used in the depth map buffer. Each DepthStorage contains a line of depth data.
In our implementation, Level BK data reuse is also employed. A 4×5 register array is designed to reduce on-chip memory traffic from the depth map buffer to the Gaussian filter module, similar to the shifting register array for the Sobel module. The memory traffic and buffer size of different data reuse levels for the Gaussian filter are listed in Table 4 with N=5 and n=2. If there were no depth buffer on chip (no data reuse), the off-chip memory traffic would be at least 25×W×H bytes, because the depth data would need to be loaded from off-chip memory. In our implementation, all depth data are stored and reused on chip and no depth data need to be loaded from off-chip memory. Level BK data reuse is used to reduce the on-chip memory traffic between the depth map buffer and the 4×5 register array for the Gaussian filter. If Level BK+ or Level BS is used on chip, the memory traffic is reduced and the buffer size is increased, compared with Level BK.
IV. EXPERIMENTAL RESULTS
In this section, we first give the results (red-cyan images) of the two 2D-to-3D conversion algorithms which are supported by the proposed architecture. Then we describe the FPGA platform for implementing the 2D-to-3D architecture. At last, we give the synthesis results and comparison with other architectures and other platforms like CPU and GPU. 
A. 2D-TO-3D RESULTS
The red-cyan images generated by the two 2D-to-3D methods supported by the proposed architecture are shown in Fig. 19 and Fig. 20 respectively. We can see that the edge-color-based method usually achieves better results than the edge-based method. However, the focus of this paper is not on algorithm improvement but on architecture innovation. The proposed architecture enables the selection of the best method or a combination of different methods.
B. FPGA IMPLEMENTATION OF THE PROPOSED ARCHITECTURE
We have implemented the proposed 2D-to-3D architecture on a Xilinx Spartan-6 FPGA platform (Fig. 21) . The 2D sequence is inputted from the personal computer (PC) through the HDMI interface (HDMI TX/RX) to the evaluation board. The 2D-to-3D module is on the FPGA chip with two HDMI interfaces. The output of the 2D-to-3D module is displayed on a 3D-TV through an HDMI interface. The HDMI chips and the FPGA chip are configured by the I2C bus, which is connected to the PC through a USB interface. Fig. 22 gives the photo of our FPGA platform.
FIGURE 21. The 2D-to-3D platform with a Xilinx Spartan-6 FPGA chip, which is used to implement and verify the proposed 2D-to-3D architecture.
C. ASIC IMPLEMENTATION OF THE PROPOSED ARCHITECTURE
To the best of our knowledge, no complete 2D-to-3D VLSI architecture has been reported in the literature. Therefore, we compare the implementation of our design with a VSRS design [17] and a DIBR design [23] (Table 5) .
We implemented the proposed 2D-to-3D architecture in synthesizable Verilog HDL. Synthesis results using TSMC 65GP technology are given in Table 5 . The gate count (809.0K) and area (1.76mm×1.79mm) show that our design requires modest hardware resources. Compared with the other two implementations, our implementation achieves state-of-the-art performance, 4K@30f/s. Fig. 23 gives the layout of the chip. In the proposed design, we employ module reuse between different configurations to reduce the chip area. For example, the two depth-retrieval methods share the Sobel module, the input buffer, the depth map buffer and the DIBR module.
2D-to-3D algorithms have also been implemented on other platforms, such as CPU [19] , CPU+GPU [18] , and FPGA [14] . Table 6 gives the performance comparison between different platforms. In [19] , the 2D-to-3D application is implemented on a 4-core laptop system, with one Intel Core i7-2820QM processor running at 2.3 GHz. Each core is equipped with a 32KB L1 data cache, a 32KB L1 instruction cache and a 256KB L2 cache. The four cores share an 8MB L3 unified cache. In [18] , the system is implemented on a notebook computer. The CPU of the notebook is a 1.60 GHz quad-core CPU with 6M cache, featuring simultaneous multithreading. The notebook also has a 1.375 GHz GPU with 7 stream processors. Each stream processor consists of 16 cores. In the FPGA implementation [14] , DIBR is implemented on a Xilinx Virtex-5 FPGA platform. We find that the proposed design achieves higher throughput than the other three platforms.
V. CONCLUSION
In this paper, we propose a reconfigurable VLSI architecture for real-time 2D-to-3D conversion. The proposed architecture can support two depth-retrieval methods and different resolutions. We implement a novel data reuse method to reduce the memory traffic for real-time conversion. Experiment results show that our design achieves state-of-the-art performance, which is higher than the performance of implementations on various platforms (FPGA, ASIC, CPU and GPU). 
