This paper presents a quantitative evaluation of a set of approaches for increasing the accuracy of an area-based stereo matching method. It targets real-time FPGA systems that prioritize low resource usage and maximum improvement per cost unit, to enable concurrent processing. The approaches are applied to a resource-optimized correspondence implementation, and the individual and cumulative costs and improvements are assessed. A combination of the implemented approaches performs close to other area-matching implementations, but at substantially lower resource usage. Additionally, the limitation in image size associated with standard methods is removed. As fully pipelined, complete on-chip solutions, all improvements are highly suitable for real-time stereo-vision systems.
INTRODUCTION
The extraction of depth data through the localization of the same point in two images is not trivial. Stereo matching of an entire scene 30 times per second (real-time) is computationally demanding and requires high-performing hardware. Hardware implementations range from regular computers to specialized hardware such as GPUs and FPGAs. Lazaros et al. (Lazaros et al., 2008) give a thorough presentation of various implementations.
FPGAs, often referred to as reconfigurable parallel hardware, are utilized in mobile applications using vision, as they outperform other approaches in terms of speed, size, and power requirements. The major obstacle is the limited resources, which restrict the algorithms that can be implemented. In general, approaches for stereo matching are divided into global and local methods, with the latter having long been the preferred approach for real-time stereo matching due to ease of implementation and speed (Lazaros et al., 2008).
A complete vision system residing in an FPGA requires several processing components just for preprocessing the image, such as image rectification, motion compensation, and depth estimation. Additionally, higher-level tasks, such as tracking, object recognition, or navigation, should also be encompassed in the FPGA.
In this paper we examine the impact of heavily reducing the resource usage of a stereo matching approach. The goal is to achieve high throughput at minimal system cost. This work is part of the Two Camera-project at Mälardalen University. The aim is to construct a compact, vision-based autonomous system encompassing both sequential and parallel processing units. Previous work in the project includes the construction of the FPGA-based stereo platform (Lidholm et al., 2008) and a resource-optimized implementation of a basic stereo matching algorithm. The code composing the components in this project will be made available as open source to promote FPGA-based image processing on our publicly available vision system.
Matching is evaluated using stereo images with ground truth, as shown in figure 1, and the online tool provided by the vision department at Middlebury University (Evaluation, 2011).
BACKGROUND
For all approaches, we assume rectified and parallel images with a unified baseline, in order to reduce the correspondence problem to a 1-dimensional search (Scharstein and Szeliski, 2002).
In our previous work, we use SAD (Sum of Absolute Differences) for a resource-optimized correspondence implementation for real-time systems. The support window is reduced to a single row, which produces a disparity map with preserved salient details but with increased noise, as can be expected when compared to the standard 2D implementation. The noise is primarily located in low-texture areas with a low signal-to-noise ratio. In fact, the approach outperforms the 2D version around discontinuities due to reduced foreground fattening. The major advantages of the 1D approach are the substantial reduction in resource usage and the removal of the need for complete scan-line retention. The question is: by how much can the matching quality be improved, and at what cost?
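The single-row SAD matching described above can be illustrated in software. The following is a minimal sketch under stated assumptions, not the FPGA implementation: the function name `sad_1d_disparity`, the window width of 9, the disparity range, and the border penalty of 255 are all chosen here for illustration, and the left image is assumed to be the reference.

```python
import numpy as np

def sad_1d_disparity(left, right, max_disp=16, window=9):
    """Winner-takes-all SAD matching with a 1 x window support region.

    Assumes rectified grayscale images of equal shape, with the left image
    as reference, so that left[y, x] corresponds to right[y, x - d].
    """
    left = np.asarray(left, dtype=np.int32)
    right = np.asarray(right, dtype=np.int32)
    height, width = left.shape
    kernel = np.ones(window, dtype=np.int32)
    best_cost = np.full((height, width), np.iinfo(np.int32).max, dtype=np.int64)
    disparity = np.zeros((height, width), dtype=np.int32)
    for d in range(max_disp):
        # Absolute differences for candidate disparity d. Columns without a
        # counterpart in the right image get a fixed penalty (an assumption).
        diff = np.full((height, width), 255, dtype=np.int32)
        diff[:, d:] = np.abs(left[:, d:] - right[:, :width - d])
        # Sum along the row only: this is the single-row support window.
        cost = np.apply_along_axis(
            lambda r: np.convolve(r, kernel, mode="same"), 1, diff)
        better = cost < best_cost
        best_cost[better] = cost[better]
        disparity[better] = d
    return disparity
```

Because the cost is aggregated along one row only, no earlier scan lines need to be buffered, which is the property the 1D approach exploits.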
RELATED WORK
An FPGA consists of different elements that can be configured in a multitude of ways. Resource utilization is normally expressed in slices and LUTs (Look-Up Tables, which realize Boolean operations). The 1D implementation from our previous work produced the disparity maps in this paper using 1.2K slices when implemented in a Spartan-3 FPGA. This is just above 4% of the available slices in the chip.
Several other stereo matching approaches with low resource usage exist, such as the one proposed by Arias-Estrada et al. (Arias-Estrada and Xicotencatl, 2001). The utilization is only 4.2K slices on a Virtex-II, but the quality of the disparity map is only fair. The implementation is capable of 71 fps with images of 320x240 pixels. Lee et al. (SUNGHWAN et al., 2005) present an implementation using fewer than 10K slices. The resulting disparity map is of moderate quality, with extensive blurring of edges and noise.
IMPROVEMENTS
Noise in stereo matching is evident as false matches. False matches arise from the fundamentals of matching two images taken from different viewpoints, because of projective distortion (Kanade and Okutomi, 1994). Fusing two views together will leave areas where depth estimation is impossible, as they are occluded in one of the views. This affects all area-based approaches, but is even more evident for smaller support windows, as they have a lower signal-to-noise ratio. Post-processing of the estimated disparity map is usually adopted to remove false matches, and established methods include the left-right consistency check (LRC) (Fua, 2004), propagation, and median filtering.
Consistency Check
The left-right consistency check ensures that only disparity values with mutual correspondence are accepted as matches, as detailed by Fusiello et al. (Fusiello et al., 1997). Our implementation practically doubles the resource usage for the matching process (1.2K vs 2.4K slices), but it requires no external memory and causes no reduction in system performance. The effect of the consistency check can be observed in figure 2. The images are shown with and without the consistency check, and both are subsequently median filtered to minimize the empty regions. The consistency check identifies almost all of the occluded areas. However, it also removes pixels that are not occluded but still differ due to poor correlation data. Noteworthy is the deterioration of the lamp arm, partly due to the check but also due to the filter. The removal of data from the disparity map reduces the quality, and it is evident that the median filter (here a 7x1) does not fill the empty areas. For this to happen, we need to propagate.
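The mutual-correspondence test can be sketched as follows. This is an illustrative software version, assuming that both a left-referenced and a right-referenced disparity map are available; the `INVALID` marker, the function name, and the `tolerance` parameter are hypothetical and not taken from the implementation described here.

```python
import numpy as np

INVALID = -1  # hypothetical marker for rejected or missing disparities

def left_right_check(disp_left, disp_right, tolerance=0):
    """Keep a left disparity only if the right map points back to it.

    A pixel (y, x) with left disparity d matches column x - d in the right
    image; it is accepted when disp_right[y, x - d] agrees within tolerance.
    """
    disp_left = np.asarray(disp_left, dtype=np.int64)
    disp_right = np.asarray(disp_right, dtype=np.int64)
    height, width = disp_left.shape
    xs = np.broadcast_to(np.arange(width), (height, width))
    rows = np.broadcast_to(np.arange(height)[:, None], (height, width))
    target = xs - disp_left  # matched column in the right image
    inside = (disp_left >= 0) & (target >= 0) & (target < width)
    back = disp_right[rows, np.clip(target, 0, width - 1)]
    consistent = inside & (np.abs(disp_left - back) <= tolerance)
    return np.where(consistent, disp_left, INVALID)
```

For example, `left_right_check([[0, 1, 1, 2]], [[1, 1, 0, 0]])` keeps only the two middle pixels, since only their matches in the right map point back with the same disparity.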
Propagation
With propagation, the underlying data is important. A logical assumption is that as much noise as possible should be removed before performing propagation, to avoid propagating false matches. As can be seen in figure 3, there is a difference between performing median filtering before or after the propagation. Propagation directly after the consistency check, followed by a median filter, produces the disparity map of the highest accuracy. However, some areas deteriorate, such as the lamp arm, when compared to an image without the consistency check.
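The propagation rule itself is not specified in this text. As a minimal sketch of one streaming-friendly possibility, assumed here purely for illustration, each rejected pixel can be filled with the most recent valid disparity on the same scan line; the function name and the `invalid` marker are hypothetical.

```python
def propagate_row(row, invalid=-1):
    """Fill invalid disparities with the last valid value seen on the row.

    Leading invalid pixels stay invalid because no valid value has been
    seen yet. This left-to-right rule is an assumption for illustration.
    """
    filled = []
    last_valid = invalid
    for value in row:
        if value != invalid:
            last_valid = value
        filled.append(last_valid)
    return filled
```

With this rule, `propagate_row([-1, 3, -1, -1, 5, -1])` returns `[-1, 3, 3, 3, 5, 5]`. The sketch also shows why the quality of the underlying data matters: a single false match that survives the consistency check is copied into every empty pixel that follows it.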
Filtering
Median filtering is a well-known approach for removing sporadic noise and is frequently used in post-processing to improve disparity maps (Muhlmann et al., 2002). Realizing a median filter is a search-and-rank problem, and large filters are difficult to implement for real-time operation (Vega-Rodríguez et al., 2007). We have implemented the median filter as a classic systolic array, according to (Vega-Rodríguez et al., 2007), for sorting 9 elements. This translates to a 9x1 median filter for 1D and a 3x3 filter for 2D. The improvements with a median filter are quite significant for the 1D approach, but less so for the 2D approach, as can be seen in figure 4 and in table 1. The filter removes noise, and the 2D implementation is already noise reduced by design. The noise in the 1D approach evidently fits the characteristics of a median filter. Noteworthy is the fact that the 1D approach outperforms the standard 2D approach in regions of discontinuities, due to the lack of vertical summing. This is the case already with the basic 1D approach, and the added filter improves it further. The cost of the filter is very low, only 247 slices, an increase of about 20% relative to the 1.2K slices of the basic matcher. In conclusion, median filtering closes the gap between the 1D and 2D implementations.
Table 2 shows the improvement for the stereo matching component with the implemented approaches, both individually and combined. The improvements are evaluated with the Middlebury stereo evaluation tool (Evaluation, 2011), which shows the error percentage in the disparity image. Three different parameters are presented: non-occluded pixels, which are visible in both images; pixels at or around discontinuities in the image; and all image pixels. From the matching scores in table 2, it can be noted that the order in which the approaches are applied is important when combining them.
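The 9-element median used in this section can be approximated in software with an odd-even transposition sort, a sorting scheme that maps naturally onto a linear systolic array of compare-and-swap cells. The sketch below is an assumption-laden illustration and may differ from the cited systolic design; the function names and the edge-replication border handling are chosen here.

```python
def odd_even_transposition_sort(values):
    """Sort by alternating compare-and-swap phases on neighbouring cells.

    Each phase only touches disjoint neighbour pairs, so in hardware all
    swaps of a phase can run in parallel. n phases sort n elements.
    """
    cells = list(values)
    n = len(cells)
    for phase in range(n):
        for i in range(phase % 2, n - 1, 2):
            if cells[i] > cells[i + 1]:
                cells[i], cells[i + 1] = cells[i + 1], cells[i]
    return cells

def median9(window):
    """Median of exactly 9 values: the middle element after sorting."""
    if len(window) != 9:
        raise ValueError("median9 expects exactly 9 values")
    return odd_even_transposition_sort(window)[4]

def median_filter_row(row):
    """9x1 median filter over one scan line, replicating the edge pixels."""
    row = list(row)
    padded = [row[0]] * 4 + row + [row[-1]] * 4
    return [median9(padded[i:i + 9]) for i in range(len(row))]
```

The same `median9` function serves a 3x3 filter when it is fed the nine pixels of a 3x3 neighbourhood instead of nine consecutive pixels on one row.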
RESULT SUMMARY
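The error percentages reported in table 2 follow the bad-pixel measure of the Middlebury evaluation: the share of pixels in a region whose disparity deviates from ground truth by more than a threshold. A minimal sketch is given below, assuming a threshold of 1.0 and externally supplied region masks (non-occluded, discontinuity, or all pixels); the function name and parameters are illustrative and are not the evaluation tool's interface.

```python
import numpy as np

def bad_pixel_percentage(disparity, ground_truth, mask, threshold=1.0):
    """Percentage of masked pixels with absolute disparity error > threshold.

    The mask selects the evaluated region. The threshold value is an
    assumption, not a setting stated in this paper.
    """
    disparity = np.asarray(disparity, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    evaluated = int(mask.sum())
    if evaluated == 0:
        raise ValueError("mask selects no pixels")
    bad = np.abs(disparity - ground_truth) > threshold
    return 100.0 * int((bad & mask).sum()) / evaluated
```

Calling the function three times with the three region masks yields the three columns of the evaluation for one disparity map.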

CONCLUSIONS
Utilizing an inexpensive median filter effectively closes the gap between the 1D and the 2D approaches. From a cost/performance perspective, using only a median filter with the 1D approach is the best option. However, there is only so much a 1D median filter can do with noisy data. For further improvement, noise reduction is a must. A function that removes, or never admits, false matches in the initial disparity map through confidence assessment could yield a substantial improvement when combined with a competent propagation method. Implementing a small confidence measure would be a good continuation of this work.
It is further evident that it is possible to achieve acceptable disparity maps without extensive memory usage and without a limitation on image size. Megapixel images will not affect the throughput or the resource utilization of the suggested approach, as image data is only stored in shift registers, without the need for multi-scanline retention. Furthermore, the 1D implementation is resource reduced and can be fitted to practically any FPGA. It has been implemented with a maximum disparity range of 64 for images of 1024x1024 pixels.
The implementations run at 125 MHz, the system clock of the FPGA board (Lidholm et al., 2008). As the implementations are fully pipelined, the frame rate depends on the speed of the cameras and the size of the frame. Theoretically, the system is capable of processing over 100 frames per second for megapixel images.
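The theoretical frame rate can be checked with a short calculation. The sketch assumes that the pipeline accepts one pixel per clock cycle and ignores blanking intervals and camera limits; neither assumption is stated explicitly in the text, and the function name is illustrative.

```python
def max_frames_per_second(clock_hz, width, height, pixels_per_clock=1):
    """Upper bound on frame rate for a fully pipelined pixel stream.

    Assumes pixels_per_clock pixels enter the pipeline every cycle and that
    no cycles are lost between lines or frames.
    """
    return clock_hz * pixels_per_clock / (width * height)

# 125 MHz clock, 1024x1024 image: 125e6 / 1048576, roughly 119 frames per second.
fps = max_frames_per_second(125e6, 1024, 1024)
```

Under these assumptions the bound is about 119 frames per second, which is consistent with the figure of over 100 frames per second for megapixel images.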
