Abstract-This paper presents a fast and area-efficient implementation of a real-time stereo vision algorithm for spatial depth mapping. The design combines two well-known area-based approaches to stereo matching and includes an occlusion detection method. Hardware efficiency is achieved by storing only partial images on-chip, avoiding full-sized frame buffers. A low-latency dataflow-oriented structure makes it possible to process 256 × 192 pixel input streams at a rate in excess of 50 frames per second, amounting to more than 54 million pixel × disparity measurements per second (PDS) (for a 25-pixel disparity range), or roughly 18 GOPS. The design has been integrated in a 0.25 µm standard CMOS technology and occupies an area of less than 3 mm².
[...] [3], [4], [5] and autonomous vehicles [6] as well as industrial automated production [7]. Current stereo vision algorithms are demanding with respect to calculation time and require high data throughput. FPGA-based approaches have been presented by [8], [9]; a multi-component solution including DSPs can be found in [10]. These implementations are made up of several processing devices, yielding costly and power-intensive hardware. A low-power solution has been presented in [11]: the "Small Vision System" operates at 8 frames per second (fps) on 160 × 120 images while consuming 600 mW of power.
In this paper a hardware-efficient architecture of a stereo vision module for fast dynamic applications is presented and a complexity analysis is provided. The design simultaneously applies two stereo matching methods and combines them. As the algorithm works on partial images there is no need to store entire frames. The [...]

In contrast, area-based algorithms lead to dense depth maps that include more uncertainties. Correlation is determined by pattern matching, thereby neglecting the actual image content. This class of methods is especially suited for hardware integration since it allows for a homogeneous, content-independent dataflow. However, it is more vulnerable to some of the inherent problems mentioned later.
A simple but efficient area-based technique is block searching. The left image is scanned horizontally for a block of pixels taken from the right stereo image, starting at the same image coordinates (displacement 0) and continuing over the entire search range up to the maximum displacement. The best matching block displacement is determined by a correlation function and is taken as the local disparity value.
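The block search described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware implementation: the function names (`ssd`, `get_block`, `block_search`), the use of the top-left corner as block anchor, and the assumption that the matching block in the left image lies at a non-negative horizontal offset are all choices made for this example. SSD is used as the correlation function.

```python
def ssd(block_a, block_b):
    """Sum of squared differences between two equally sized pixel blocks."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(image, row, col, size):
    """Return the size x size block whose top-left pixel is (row, col)."""
    return [r[col:col + size] for r in image[row:row + size]]


def block_search(right, left, row, col, size, max_disp):
    """Scan the left image for the right-image block anchored at (row, col).

    Displacements 0..max_disp are tried in order (assumed search direction:
    towards larger column indices). Returns (best displacement, its SSD cost).
    """
    ref = get_block(right, row, col, size)
    width = len(left[0])
    best_disp, best_cost = 0, None
    for d in range(max_disp + 1):
        if col + d + size > width:      # candidate block would leave the image
            break
        cost = ssd(ref, get_block(left, row, col + d, size))
        if best_cost is None or cost < best_cost:
            best_disp, best_cost = d, cost
    return best_disp, best_cost


# Synthetic test pair: the right image is the left image shifted by 3 pixels.
left = [[(r * 31 + c * c * 17) % 256 for c in range(16)] for r in range(5)]
right = [[left[r][min(c + 3, 15)] for c in range(16)] for r in range(5)]
print(block_search(right, left, row=1, col=2, size=3, max_disp=6))  # (3, 0)
```

Because the cost comparison is strict, the smallest displacement wins when several candidates tie, which is one possible tie-breaking rule among others.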
Commonly used correlation functions include SAD (sum of absolute differences), SSD (sum of squared differences), non-parametric transforms (Census, Rank transform) [1] and NCC (normalized cross correlation). An evaluation of these functions within our setup showed that the SSD approach produced higher quality results than the SAD method. The NCC function is not very efficient with respect to hardware [2]. The same holds for the Rank transform. The Census transform, on the other hand, is attractive for integration as it only requires addition operations.
B. Inherent Problems of Stereo Vision
As a scene is viewed from two different viewpoints, there are regions that are visible to only one camera. This is due to foreground objects hiding objects in the background. This effect is called occlusion and leads to uncorrelated information in the image pair.

[...] Table I). The Census function first performs a non-parametric transform [1] on pixel blocks (see Fig. 1): the basic operation works on square pixel blocks of odd size Wc. The center pixel is characterized by an intensity comparison with each of the surrounding pixels. If the neighbouring pixel is brighter, a "1" is set, otherwise a "0". These (Wc² - 1) bits form a bias-independent signature. In the example of Fig. 1, eight [...]
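The Census signature described above can be illustrated with a short Python sketch. The function names and the row-by-row bit ordering are assumptions of this example. Comparing two signatures by their Hamming distance is the usual choice in the literature, but it is also an assumption here, since the matching step is not spelled out in the text above.

```python
def census_signature(image, row, col, size):
    """Census signature of the size x size block centred on (row, col).

    Each neighbour contributes one bit: 1 if it is brighter than the centre
    pixel, 0 otherwise. The centre itself is skipped, giving size*size - 1
    bits. Bits are packed row by row, most significant bit first (assumed
    ordering).
    """
    half = size // 2
    centre = image[row][col]
    bits = 0
    for r in range(row - half, row + half + 1):
        for c in range(col - half, col + half + 1):
            if r == row and c == col:
                continue
            bits = (bits << 1) | (1 if image[r][c] > centre else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


block = [[10, 60, 20],
         [70, 50, 30],
         [90, 40, 80]]
sig = census_signature(block, 1, 1, 3)
print(format(sig, "08b"))  # 01010101

# Adding a constant brightness offset leaves the signature unchanged.
brighter = [[p + 100 for p in r] for r in block]
print(census_signature(brighter, 1, 1, 3) == sig)  # True
```

The second check shows what "bias-independent" means in practice: only the ordering of intensities relative to the centre pixel enters the signature, so a uniform brightness difference between the two cameras does not affect it.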
Occlusion effects are effectively circumvented by an LR-RL consistency check [12], in which block matching is performed twice, from right to left (RL) and from left to right (LR). Occluded areas yield uncorrelated results, so a superposition of the RL and LR results makes it possible to drop inconsistent areas of the scene. Furthermore, most erroneous matches from homogeneous image areas are eliminated as well. As both correlation functions are applied RL and LR, a selection function picks results based on a priority scheme. Despite the occlusion detection mechanism, there is always a certain possibility that unwanted results slip through, so the output image is post-filtered using a median function.
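As a rough illustration of the consistency check and the median post-filter, the following Python sketch works on a single scanline. It is a textbook-style model under stated assumptions, not the selection and merge logic of the design: the function names, the sign convention (a right-image pixel at column x with disparity d matches the left-image pixel at column x + d), the `tolerance` parameter, the use of `None` for dropped pixels, and the one-dimensional 3-pixel median window are all hypothetical choices.

```python
from statistics import median_low


def consistency_check(disp_rl, disp_lr, tolerance=0):
    """Keep an RL disparity only if the LR search agrees with it.

    disp_rl[x]: disparity found for right-image column x.
    disp_lr[x]: disparity found for left-image column x.
    Inconsistent or out-of-range pixels are marked as None.
    """
    checked = []
    for x, d in enumerate(disp_rl):
        x_left = x + d                  # where the match lies in the left image
        consistent = (0 <= x_left < len(disp_lr)
                      and abs(disp_lr[x_left] - d) <= tolerance)
        checked.append(d if consistent else None)
    return checked


def median_filter_row(row, radius=1):
    """Median over a (2*radius + 1)-pixel window, ignoring invalid pixels.

    Pixels that are already invalid (None) stay invalid.
    """
    out = []
    for x, centre in enumerate(row):
        if centre is None:
            out.append(None)
            continue
        window = [v for v in row[max(0, x - radius):x + radius + 1]
                  if v is not None]
        out.append(median_low(window))
    return out


disp_rl = [2, 2, 2, 5, 2, 2, 0, 0]
disp_lr = [0, 0, 2, 2, 2, 2, 2, 2]
print(consistency_check(disp_rl, disp_lr))
# [2, 2, 2, None, 2, 2, None, None]

print(median_filter_row([3, 3, 9, 3, 3]))  # [3, 3, 3, 3, 3]
```

In the first call the outlier disparity 5 points outside the image and the two trailing zeros disagree with the LR result, so all three are dropped. The second call shows the median removing an isolated outlier that passed the check.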
III. ARCHITECTURE
A. Dataflow
Six functional blocks make up the design as depicted in Fig. 2. The input buffer stores several image lines and maintains two shift register banks on which block searching is performed. It is implemented using a RAM and organized as an extended ring buffer that provides the succeeding modules with data. Two purely combinational modules calculate the SSD and Census correlations (Fig. 3, C and D) and pass their results on to the displacement module. There, data is collected and perspectively mapped to the virtual viewpoint of the output map. Four intermediate disparity maps are thus created: an LR and an RL map for each correspondence function. These maps are merged into one single map by the merge module (Fig. 3, E). Several parameters configure the merge function according to the application needs. The output filter eliminates remaining erroneous pixels and smooths the image by means of a median filter (Fig. 3, F).

The design has been implemented and fabricated using a 0.25 µm 5-metal process and occupies a total core area of less than 3 mm². The fabricated samples were verified to be functional at a clock frequency of 75 MHz and thus achieve a frame rate of more than 50 fps. The necessary on-chip RAM amounts to a mere 1.35 KBytes.

[Figure caption: Median-filtered output image.]
V. CONCLUSION
A hardware-efficient stereo vision ASIC has been presented. The approach of combining three well-known methods (SSD, Census transformation and occlusion detection) in stereo vision has been proven to deliver high-quality results with low hardware complexity. The presented architecture is scalable to accommodate various resolutions and frame rates. Furthermore, the continuous dataflow does not rely on large on-chip RAM blocks to store entire frames, resulting in an area-efficient design. The implementation works at more than 50 fps, which enables real-time applications in very dynamic environments.
