This paper proposes a System-on-Programmable-Chip (SoPC) architecture that implements a stereo matching algorithm based on the sum of absolute differences (SAD) in an FPGA chip and provides 1396×1110 disparity maps at 30 fps. The hardware implementation comprises a 32-bit Nios II microprocessor, memory interfaces and a stereo matching circuit module. The stereo matching core is modeled with the Matlab-based DSP Builder. The system can process stereo image pairs of many different sizes through a configuration interface. The maximum horizontal resolution of the stereo images is 2048.
I. INTRODUCTION
Stereo vision has been one of the most active research topics in computer vision and is widely used in application areas including intelligent robots, automated guided vehicles, human-computer interfaces, 3D scanners and so on [1] [2] [3] [4] [5]. Stereo matching algorithms play an important role in stereo vision. They can be classified into local and global methods of correspondence. Local methods match a window centered at a pixel of interest in one image with a similar window in the other image by searching along epipolar lines. The performance of a local stereo matching algorithm depends to a large extent on the similarity metric selected. Typical similarity metrics are cross-correlation (CC), the sum of absolute differences (SAD), the sum of squared differences (SSD), etc. SSD and SAD find correspondences by minimizing the sum of squared differences or the sum of absolute differences, respectively, in W×W windows. The computational complexity for an N×N image pair, a W×W window and D disparity levels is $O(N^2 W^2 D)$. It can be reduced to $O(N^2 D)$ by optimization techniques [6]. Even so, stereo vision remains limited in real-time applications by its computational expense, and many researchers have proposed FPGA implementations of stereo vision algorithms in the literature.
The circuit in [7] is a miniature stereo vision machine (MSVM-III) with three cameras for generating high-resolution dense disparity maps at video rate. The machine, running at 60 MHz, can process more than 30 fps of dense disparity maps with 640×480 pixels in a 64-pixel disparity search range. The paper [8] proposes an architecture that solves the matching problem on 8-bit 512×512 stereo images by using the SAD as the similarity metric. When realized on a Xilinx Virtex4 XC4VLX15 device, the circuit computes a 512×512 disparity map with a maximum disparity of 255 in 39 ms. The paper [9] describes the implementation of a stereo depth measurement algorithm in hardware on Field-Programmable Gate Arrays (FPGAs). This system generates 8-bit sub-pixel disparities on 256×360 pixel images at a video rate of 30 fps. Gardel et al. propose a hardware implementation of a dense recovery of stereo vision 3D measurements in [10]. With an FPGA clock of 100 MHz, dense stereo maps of more than 30,000 depth points can be obtained at up to 50 frames per second from 2 Mpix images. The paper [11] presents a binary fully adaptable window for incorporation in a stereo matching System-on-Chip (SoC) architecture. The design in [12] is a real-time stereo vision SoC architecture for a depth-field generation processor as required in 3D TV applications. It realizes real-time stereo matching at a frame rate of 56 Hz with a resolution of 800×600 and a disparity of 80 without the need for external memories. However, these designs rarely produce disparity maps above 720p resolution at real-time speed.
In this paper, we propose an SoPC architecture that implements a SAD-based stereo matching algorithm and can process HD-level stereo images in real time. The stereo matching process, including cost calculation and cost aggregation, is modeled with DSP Builder and parallelized within a pipelined architecture. Based on efficient hardware-oriented optimizations, our design achieves 30 frames per second when matching 1396×1110 high-definition stereo images at a working frequency of 60 MHz.
II. DSP BUILDER DESIGN FLOW
Digital signal processing (DSP) system design in programmable logic devices (PLDs) requires both high-level algorithm and hardware description language (HDL) development tools.
The Altera DSP Builder integrates these tools by combining the algorithm development, simulation, and verification capabilities of The MathWorks MATLAB and Simulink system-level design tools with VHDL and Verilog HDL design flows, including the Altera Quartus II software.
We can combine existing MATLAB functions and Simulink blocks with Altera DSP Builder blocks and Altera intellectual property (IP) MegaCore functions to link system-level design and implementation with DSP algorithm development. In this way, DSP Builder allows system, algorithm, and hardware designers to share a common development platform.
The DSP Builder Signal Compiler block reads Simulink Model Files (.mdl) that contain other DSP Builder blocks and MegaCore functions. Signal Compiler then generates the VHDL files and Tcl scripts for synthesis, hardware implementation, and simulation. 
III. SOPC ARCHITECTURE FOR SAD MATCHING ALGORITHM
The SoPC architecture proposed herein is divided into the following main modules, as shown in Fig. 2:
(1) Microprocessor system: It consists of a 32-bit Nios II processor core, a set of on-chip peripherals, on-chip memory, and interfaces to off-chip memory.
(2) SAD Stereo Matching Unit (SSMU): This unit computes the sum of absolute differences as the similarity metric and selects the disparity among 64 candidate 5×5 windows. It has three Avalon-MM interfaces. One is a slave interface for communicating with the Nios II CPU. The other two are read and write master interfaces. The read master reads the raw stereo image data from the off-chip PSRAM, which acts as the frame buffer of the system. The write master writes the final disparities to the DDRII.
IV. HARDWARE IMPLEMENTATION OF THE SSMU
The SAD equation used for 5×5 windows with a maximum disparity of 64 can be seen in equation (1):
$$\mathrm{SAD}(disp) = \sum_{(i,j)\,\in\,5\times5\ \text{window}} \left| P_R(i, j) - P_L(i, j + disp) \right| \qquad (1)$$
where disp is the disparity value ranging from 0 to 63, $P_R(i, j)$ serves as the reference pixel in the right image and $P_L(i, j + disp)$ as the currently analyzed candidate pixel in the left image.
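As a reading aid, equation (1) can be mirrored by a small software reference model. The sketch below is only illustrative: the function name `sad_5x5`, the list-of-rows image layout, and the choice of centering the 5×5 window on the pixel of interest are assumptions, not details taken from the hardware design.

```python
def sad_5x5(right, left, row, col, disp, half=2):
    """Sum of absolute differences between a 5x5 window in the right image
    and the window shifted by `disp` columns in the left image.

    Assumption: the window is centered on (row, col); images are lists of rows.
    """
    total = 0
    for i in range(row - half, row + half + 1):
        for j in range(col - half, col + half + 1):
            # P_R(i, j) is the reference pixel, P_L(i, j + disp) the candidate.
            total += abs(right[i][j] - left[i][j + disp])
    return total


if __name__ == "__main__":
    # Synthetic pair: the left image is the right image shifted by 3 columns.
    right = [[(i * 31 + j * 17) % 256 for j in range(16)] for i in range(8)]
    left = [[right[i][j - 3] if j >= 3 else 0 for j in range(16)] for i in range(8)]
    print(sad_5x5(right, left, row=3, col=5, disp=3))
```

With this synthetic pair the SAD vanishes at the true shift of 3 columns and is positive at a disparity of 0.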
The layout of the SSMU is shown in Fig. 4. The first stage of the SSMU is a custom DMA engine, which transfers all raw image data from the PSRAM to the dual-clock FIFOs. In each transfer, the DMA engine issues 8 word-sized pipelined Avalon-MM reads to fetch 32 bytes, alternating between the right and left images, which takes about 27 clock cycles. At a working frequency of 100 MHz, the DMA engine can therefore deliver about 113 MB per second (32 bytes every 27 cycles, counting 1 MB as 2^20 bytes). This is enough for the 2×1396×1110 at 30 fps requirement. The two dual-clock FIFOs (DCFIFOs) that follow the DMA engine have two functions. One is temporary storage of the raw pixel data. The other is to separate the design into two regions that run at different working frequencies: the write side of the FIFOs runs at 100 MHz and the read side at 60 MHz.
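The bandwidth argument above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text, plus one assumption that is not stated explicitly: each pixel occupies one byte. The constant and function names are illustrative.

```python
# Figures quoted in the text.
BYTES_PER_BURST = 32         # 8 word-sized reads per DMA transfer
CLOCKS_PER_BURST = 27        # approximate cost of one transfer
DMA_CLOCK_HZ = 100_000_000   # write-side working frequency
WIDTH, HEIGHT, FPS = 1396, 1110, 30

# Assumption (not stated explicitly in the text): 8-bit pixels, one byte each.
BYTES_PER_PIXEL = 1


def dma_supply_bytes_per_s():
    """Sustained read bandwidth the DMA engine can deliver."""
    return BYTES_PER_BURST * DMA_CLOCK_HZ / CLOCKS_PER_BURST


def stereo_demand_bytes_per_s():
    """Raw data rate needed for a left/right image pair at the target frame rate."""
    return 2 * WIDTH * HEIGHT * BYTES_PER_PIXEL * FPS


if __name__ == "__main__":
    supply = dma_supply_bytes_per_s()
    demand = stereo_demand_bytes_per_s()
    print(f"supply: {supply / 2**20:.1f} MiB/s")
    print(f"demand: {demand / 2**20:.1f} MiB/s")
    print("sufficient:", supply > demand)
```

Under these assumptions the supply is roughly 118.5 million bytes per second (about 113 when expressed in units of 2^20 bytes), against a demand of 92,973,600 bytes per second.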
Next to the read side of the DCFIFO is a dual-port RAM (DPRAM) array made up of 5 DPRAMs, each with 2048 bytes of memory. They serve as line buffers for the cost calculation and correlation processing element (CCCPE). Every DPRAM has its own data bus connected to the CCCPE, so each DPRAM array can output 5 bytes of image data to the CCCPE in every clock cycle.
The CCCPE is the most complex circuit in the SSMU. As shown in Fig. 5, the CCCPE contains 64 SAD computers and 2 shift-tap devices. The 2 shift-tap devices have 25 and 340 shift registers respectively, which store the pixel data used in computing the SADs. One SAD computer sums the 25 absolute differences produced by two 5×5 matching windows in one clock period. The CCCPE module can therefore produce, in a single clock, the 64 SADs between a template window in the right census image and 64 candidate windows in the left census image. Fig. 6 is the block diagram of the SAD computer. The SAD computer consists of 25 absolute difference calculators (AD) and a parallel adder with 25 inputs. The parallel adder is pipelined in 4 stages to improve fmax.
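One way to read the register counts above is that the 25-register device holds the 5×5 template window and the 340-register device holds a strip of 5 rows by 68 columns, since 64 candidate windows of width 5 at consecutive offsets span 64 + 5 − 1 = 68 columns. That layout is an inference made for this sketch rather than something the text states, and the function below is a behavioral model only; it says nothing about timing or pipelining.

```python
WINDOW = 5
DISPARITIES = 64
# Assumption: the 340 registers form a 5 x 68 strip (64 + 5 - 1 columns).
STRIP_WIDTH = DISPARITIES + WINDOW - 1


def sad_bank(right_window, left_strip):
    """Return the 64 SADs of one 5x5 template against 64 candidate windows.

    right_window: 5 rows of 5 pixels (template from the right image).
    left_strip:   5 rows of 68 pixels (candidates from the left image).
    """
    assert len(right_window) == WINDOW
    assert all(len(r) == WINDOW for r in right_window)
    assert len(left_strip) == WINDOW
    assert all(len(r) == STRIP_WIDTH for r in left_strip)

    sads = []
    for disp in range(DISPARITIES):      # one SAD computer per disparity
        total = 0
        for i in range(WINDOW):
            for j in range(WINDOW):      # 25 absolute differences
                total += abs(right_window[i][j] - left_strip[i][j + disp])
        sads.append(total)
    return sads


if __name__ == "__main__":
    strip = [[(c * c + i) % 251 for c in range(STRIP_WIDTH)] for i in range(WINDOW)]
    template = [row[10:15] for row in strip]   # copy of the window at offset 10
    print(sad_bank(template, strip)[10])
```

In hardware the 64 iterations of the outer loop run side by side, one per SAD computer. Under the assumed layout the register total is 25 + 340 = 365.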
The 64 SADs produced by the CCCPE are sent to the disparity segregator (DS). The DS finds the minimum of the 64 SADs using parallel comparators and outputs its index as the disparity, which takes one clock period. Fig. 7 is the block diagram of the disparity segregator. The final disparities are pushed into the DCFIFO connected to the DS module. The function of this DCFIFO is similar to that of the DCFIFOs described above. A custom DMA engine attached to the read port of the DCFIFO writes the final disparities to the disparity table stored in the off-chip DDRII SDRAM. To make full use of the bandwidth, the DMA engine packs four 8-bit disparities into a 32-bit word before transferring it to the off-chip DDRII SDRAM.
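The packing step performed by the write DMA engine can be illustrated as follows. The byte order is an assumption made for this sketch (first disparity in the least significant byte); the text does not specify it, and the helper names `pack_disparities` and `unpack_word` are hypothetical.

```python
def pack_disparities(disparities):
    """Pack four 8-bit disparities into one 32-bit word.

    Assumption: the first disparity goes into the least significant byte.
    """
    assert len(disparities) == 4
    word = 0
    for k, value in enumerate(disparities):
        assert 0 <= value <= 0xFF
        word |= value << (8 * k)
    return word


def unpack_word(word):
    """Inverse of pack_disparities, e.g. for reading the disparity table back."""
    return [(word >> (8 * k)) & 0xFF for k in range(4)]


if __name__ == "__main__":
    word = pack_disparities([1, 2, 3, 63])
    print(hex(word))           # 0x3f030201
    print(unpack_word(word))   # [1, 2, 3, 63]
```

Packing reduces the number of write transactions to one per four pixels, so each write carries a full 32-bit word.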
V. DSP BUILDER MODELS OF THE SSMU

A. Model of the Absolute Difference (AD) Computer
As shown in Fig. 8, the absolute difference computer is composed of one if statement, two multiplexers and one adder with two 8-bit inputs. It outputs the absolute difference of its two inputs.
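A behavioral sketch of this structure is given below. It assumes that the if statement compares the two inputs, that the two multiplexers route the larger and the smaller value, and that the adder is used as a subtractor; these wiring details are assumptions, since the text lists only the block types.

```python
def absolute_difference(a, b):
    """Behavioral model of the AD computer for two 8-bit unsigned inputs."""
    assert 0 <= a <= 255 and 0 <= b <= 255
    a_ge_b = a >= b                    # "if" block: comparison bit
    minuend = a if a_ge_b else b       # multiplexer 1: larger operand
    subtrahend = b if a_ge_b else a    # multiplexer 2: smaller operand
    return minuend - subtrahend        # adder used as a subtractor


if __name__ == "__main__":
    print(absolute_difference(200, 13))
    print(absolute_difference(13, 200))
```

Ordering the operands before subtracting keeps the result non-negative, so no sign handling is needed after the adder.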
B. Model of the SAD Computer
The model of the SAD computer, as shown in Fig. 9, has 25 AD computers and a parallel adder with 25 9-bit input ports. The parallel adder is fed the 25 AD values from the 25 AD computers and outputs a 15-bit SAD value after a delay of 4 clock cycles.
C. Model of the 64 SAD Computers Array
The 64 SAD computers array consists of 64 SAD computers, 365 input ports and 64 output ports. In this model, the SAD computers are placed as subsystems, as shown in Fig. 10. The complete diagram of this model is too large to reproduce in the paper.
D. Model of the Basic Comparison Element
The basic comparison element (BCE) compares two input SAD values and produces a comparison bit that serves as the select signal for a multiplexer. The multiplexer passes the smaller of the two SADs on to the next stage.
E. The Top Level Model of the Disparity Segregator
The disparity segregator is made of 63 BCEs. The two blue blocks in Fig. 12 are subsystem modules containing 31 BCEs each. The DS module selects the minimum SAD from the 64 input values.
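The count of 63 BCEs matches a binary comparison tree over 64 inputs (32 + 16 + 8 + 4 + 2 + 1). The sketch below models such a tree. Carrying the index alongside each SAD and preferring the left input on ties are assumptions of this sketch; the text states only that the DS outputs the index of the minimum SAD.

```python
def bce(a, b):
    """Basic comparison element on (sad, index) pairs: pass the smaller SAD.

    Assumption: on a tie the left input wins, which favors the lower disparity.
    """
    return a if a[0] <= b[0] else b


def disparity_segregator(sads):
    """Reduce 64 SADs to the index of the minimum with a binary tree of BCEs.

    Returns (disparity, number_of_bces_used).
    """
    assert len(sads) == 64
    stage = [(s, i) for i, s in enumerate(sads)]
    bce_count = 0
    while len(stage) > 1:
        next_stage = []
        for k in range(0, len(stage), 2):
            next_stage.append(bce(stage[k], stage[k + 1]))
            bce_count += 1
        stage = next_stage
    return stage[0][1], bce_count


if __name__ == "__main__":
    sads = [(i * 37 + 11) % 101 + 5 for i in range(64)]
    sads[42] = 0
    print(disparity_segregator(sads))   # (42, 63)
```

Each half of the inputs needs 31 BCEs to reduce 32 values to one, and a final BCE compares the two halves, which agrees with the two 31-BCE subsystems described above.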
VI. RESULTS AND DISCUSSION
The proposed stereo matching circuit has been realized on the Altera Cyclone III Development Board as shown in Fig. 13 . Table I shows the resources required from the FPGA device in order to implement the designs presented in this paper. The report is produced by the Quartus II v12.0SP2 edition. 
where $d_C(x, y)$ is the disparity map produced by the proposed hardware.
The ground truth image has a disparity range from 0 to 59, whereas the disparity range of our design is 0 to 63. A disparity error tolerance $\delta_d$ of 1 to 4 is therefore used. The measures are computed over the whole disparity map, excluding the image borders, where part of the image is totally occluded. Several tests have been performed. The resulting disparity maps are shown in Fig. 14, and the bad pixel rates are listed in Table II. Disparities are all encoded with a scale factor of 4, giving gray levels from 0 to 252. In Table III, the proposed system is compared with existing approaches in terms of speed.
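For readers who want to reproduce this kind of measurement, the sketch below computes a bad pixel rate in the commonly used form: the percentage of pixels whose computed disparity differs from the ground truth by more than the tolerance. That this is exactly the measure used here is an assumption, and the `border` parameter is a hypothetical stand-in for the excluded image borders, whose width the text does not give.

```python
def bad_pixel_rate(computed, truth, tolerance, border=0):
    """Percentage of pixels with |computed - truth| > tolerance.

    computed, truth: disparity maps as lists of rows, in the same units.
    border: number of pixels skipped on every side (assumed parameter).
    """
    rows, cols = len(truth), len(truth[0])
    bad = 0
    total = 0
    for y in range(border, rows - border):
        for x in range(border, cols - border):
            total += 1
            if abs(computed[y][x] - truth[y][x]) > tolerance:
                bad += 1
    return 100.0 * bad / total


if __name__ == "__main__":
    truth = [[10] * 6 for _ in range(6)]
    computed = [row[:] for row in truth]
    computed[2][2] = 15    # one wrong pixel inside the evaluated region
    computed[0][0] = 50    # one wrong pixel on the border
    print(bad_pixel_rate(computed, truth, tolerance=1, border=1))   # 6.25
```

If the maps are stored as gray levels with a scale factor of 4, both maps should be divided by 4 (or the tolerance multiplied by 4) before the comparison.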
VII. CONCLUSIONS/OUTLOOK
An efficient hardware implementation of a real-time stereo matching algorithm for computing disparity maps on an FPGA has been proposed. It takes full advantage of the convenience of IP reuse in the SoPC architecture and of Matlab-based modeling tools. The frame rate enables real-time performance at a resolution of 1396×1110. The results of our system are promising. The system has so far been run on static images supplied by C code on the Nios II processor. We plan to incorporate live stereo video streams and to combine the algorithm with a pre-processing stage and an LR-check stage to make it more suitable for robot auto-navigation and visual servo applications.
