Abstract-In this work, we present a real-time implementation of a stereo algorithm on field-programmable gate array (FPGA). The approach is a phase-based model that allows computation with sub-pixel accuracy. The algorithm uses a robust multi-scale and multi-orientation method that optimizes the estimation extraction with respect to the local image structure support. With respect to the state of the art, our work increases the on-chip power of computation compared to previous approaches in order to obtain a good accuracy of results with a large disparity range. In addition, our approach is specially suited for unconstrained environments applications thanks to the robustness of the phase information, capable of dealing with severe illumination changes and with small affine deformation between the image pair. This work also includes the rectification images circuitry in order to exploit the epipolar constraints on the chip. The dedicated circuit can rectify and process images of VGA resolution at a frame rate of 57 fps. The implementation uses a fine pipelined method (also with superscalar units) and multiple user defined parameters that lead to a high working frequency and a good adaptability to different scenarios. In the paper, we present different results and we compare them with state of the art approaches.
I. INTRODUCTION
S TEREOSCOPY, after many years of research, is still an open issue aiming at new challenges in terms of accuracy and performance. Current dense stereo techniques are mainly divided into two categories: local approaches and global approaches. The phase-based algorithm used in this work is a local approach. Our method benefits of phase information and works very stably and robustly especially in unconstrained real sequences where other methods fail. The advantage of phase-based approaches has been pointed out by different authors especially due to their robustness to luminance variations and camera imbalance problems. Furthermore, it leads to a The authors are with the Department of Computer Architecture and Technology, University of Granada, E-18071 Granada, Spain (e-mail: mtomasi@atc. ugr.es; mvanegas@atc.ugr.es; fbarranco@atc.ugr.es; jdiaz@atc.ugr.es; eduardo@atc.ugr.es).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2011.2172007
better behavior against affine transformations (for instance, due to different camera perspectives) [1] , [2] . Currently, one important goal is the disparity computation on real time with a stable and robust technique able to extract this feature in multiple unconstrained scenarios. Real-time computation leads to its applicability on very diverse scenarios (robotic platforms, driver assistance, etc.). This work wants to develop a real time system for a driving scenario as proposed in the EU project DRIVSCO [3] . For this reason, we need a good choice for the system implementation capable of working on very difficult illumination scenarios as well as dealing with multiple image artifacts (as the ones caused by rain, foggy days, motion blurred images, etc.). With the increasing computational power of machines, software approaches have improved their computational performance. In previous works such as [4] , authors achieved a real time stereo system using parallel computation. Current alternatives are GPU implementations [5] that can provide a fast and accurate result. Unfortunately, all these systems require considerable power consumption and their applicability on portable embedded systems is questionable. A good alternative towards on-chip implementation is the use of field-programmable gate-array (FPGA)-based approaches capable, if required, to be converted to an ASIC solution. As FPGA devices progressed both in terms of resources and performance, the latest FPGAs have come to provide "platform" solutions that are easily customizable for system connectivity, DSP, and/or data processing applications. As platform solutions are becoming more and more important, leading FPGA vendors are coming up with easy-to-use design development tools. Designers achieve a great improvement in terms of power consumption [6] , speed [7] , and performance [8] . Carefully designed systems do not lose accuracy as compared to a software implementation and at the same time, feature a high data rate and portability. Further advantages are the low power consumption and reduced size of the whole system. These important advantages allow a large use in robotic platforms [9] or in an automotive scenario. Previous works proof the validity of FPGAs in stereo computation with very high frame rate [10] , [11] but with restricted accuracy compared to other software approaches. Our purpose is to maintain a good accuracy due to the use of a phase-based method and robustness against image artifacts and illumination problems and at the same time, a high data throughput. In [12] , a similar implementation of an algorithm in a mono-scale version is described. The problem with this system is the limited disparity range and its strong relationship with the filter size. Our work improves previous works with a multi-scale and multi-orientation approach that allows a very large disparity range and fits properly the image structure thanks to the different oriented filters employed. Our system consists in a very complex design that is justified in order to reach a robust solution. This approach is a novel hardware system for calculating disparity using a phase-based algorithm with the latest design techniques. The design strategy is based on the construction of deep pipelined data paths composed by heterogeneous computing units and a combination of high level abstraction descriptions as well as RT level ones, in order to keep performance and at the same time keep short the design times of very algorithmic system descriptions. This paper is structured as follows. In Section II, we describe the phase-based algorithm and its multi-scale implementation. In the following parts, we present the hardware/software system (see Section III) and the experimental results (see Section IV). Finally, we conclude in Section V and present the future work.
II. PHASE-BASED ALGORITHM
Depth perception derives from the differences in the positions of corresponding points in the stereo image pair projected on the retinas. In a first approximation, the positions of corresponding points are related by a 1-D horizontal shift, being the disparity related to the direction of the epipolar lines. The use of local, band-pass, linear filters has become the focus of considerable research on early biological and computational visual processing [13] . It is readily accepted that the amplitude and zero-crossing of such filter output provide a useful source of information for texture analysis and measurement of binocular disparity, image orientation, and optical flow [14] - [16] . Considerable promises have been exhibited by techniques for binocular (stereo) disparity measurement in which disparity is expressed in terms of phase differences in the output of local, band-pass filters applied to the left and right image [17] . One important advantage of phase-based approaches is that disparity estimates are obtained with sub-pixel accuracy, without requiring an explicit sub-pixel signal reconstruction or sub-pixel feature detection and localization. The measurements may be used either directly or iteratively as predictions or further, more accurate, estimates. Another major advantage of phase-based approaches is the stability of band-pass phase behavior with respect to image deformations that typically exist between left and right views. In particular, phase information is stable under small scale perturbations and contrast variations; phase is more stable than amplitude as stated in [1] and [2] . This is important if the image-point disparity is to be used as an approximation to scene-point disparity. Despite the general stability of phase information, there are regions which are prone to errors near phase singularities where disparity estimates should not be trusted [18] . Simple constraints and thresholds can be used to detect these regions and discard incorrect or non reliable measurements as described on [17] .
A. Mathematical Formulation
We have focused on the phased-based computing model proposed by Solari and Sabatini [19] and on its multi-scale extension proposed in [16] . We will focus on this section on the mono-scale approach and in the next section, we will present the multi-scale extension. The basic idea of this algorithm is that the disparity can be estimated in terms of phase differences in the spectral components of the binocular image pair. Spatially localized phase measurements can be obtained by filtering operations with quadrature-pair Gabor filters. Phase difference is then computed from (2) without explicit calculation of left and right phase as described in [19] (1) (2) where we note with and the left and right local image phases, and correspond to the values of left and right image pixels after convolving with the even part of the quadrature filter, and to the odd quadrature filter outputs, atan2 stands for the principal part of the argument (i.e., the argument belongs to ) and finally, is the average instantaneous frequency of the bandpass signal, which can be approximated by , the quadrature filter peak frequency. An eight orientations filter with a band pass frequency of one octave is used (1/3 of the filter peak frequency). The approximation is valid except near singularities where abrupt frequency changes occur as a function of spatial position. Basic steps of mono-scale computation can be summarized as follows:
1) even (C) and odd (S) filtering with quadrature filter pairs of left and right images; 2) disparity computation using (2) at each orientation and threshold operation assuming ; 3) choice of final disparity estimation between different orientations: median value.
B. Multi-Scale Extension
Due to phase periodicity, phase-based techniques can only detect shifts that do not exceed half the filter wavelength [1] . To extend this range, a coarse-to-fine control strategy can be used [17] . An efficient solution involves the use of an image pyramid, in which the image resolution is halved at each level [20] . Specifically, a coarse-to-fine Gaussian pyramid [21] is constructed, where each layer is separated by an octave scale. Accordingly, the image is increasingly blurred with a Gaussian kernel and sub-sampled to build the image pyramid. At each pyramid level , the sub-sampling operation reduces the image resolution to a half in height and width with respect to the previous level , reducing also the disparity range presented at this scale and helping the filters to properly tune their response. By applying the original filters to each level of the pyramid, the detectable range of shifts is doubled each time. The control strategy starts at the lowest resolution and uses stereo estimation obtained with a mono-scale method to warp the images at the next higher resolution so that the estimated disparity is removed [22] . The residual disparity is then within the range of the filters applied at that level. The stereo algorithm which we use is particularly suitable for this warping strategy since it uses strictly local information. In our implementation, only disparity values that can be reliably computed at the highest resolution are retained. In other words, if the refinement made at the highest resolution to a lower resolution estimate (that was reliable at that lower resolution) is unreliable, the disparity value is discarded and not included in the density counts of the next section. The phase differences computed at this level and the disparity for this level are estimated as explained in Section II-A. Disparity values are then transformed (multiplied by two) to the next scale and the filter outputs at that level are warped to compensate for the effects of these phase differences. All disparity is recalculated with new information and is propagated to the next level; the procedure is repeated until the original resolution is obtained [23] . Moreover, we have included a multi-oriented analysis of the image that computes the disparity values with filters tuned at seven different orientations (vertically oriented filters cannot be used because they do not have projection on the horizontal axis where the displacement takes place). All operations can be summarized as follows.
1) Selection of the number of scales according to the image resolution and expected disparity range presented in the scene. The smallest scale is recommendable to be, at least, 1.5 times bigger than the filter size. A pyramid representation is produced for the left and right images. 2) Processing iteration for each scale: a) Computation of disparity values using (2) at each image orientation. Selection of disparity value from the seven available estimates using a median filter. b) Merge with previous disparity values (not necessary in the first step iteration). c) A simple 2-D median filter is used to regularize the disparity values. d) Up-sampling of merged disparity: values are multiplied by two to take into account current image resolution. Bilinear interpolation of values during up-sampling operations (this step is not required for the last scale). e) Warping pyramid images with expanded disparity to cancel already computed values reducing disparity range (this step is not operated for the last scale). It is worthwhile to mention that the warping operation uses a backward scheme with sub-pixel accuracy (computed using bilinear interpolation). Therefore, a smart memory access scheme will be necessary for the algorithm implementation in order to avoid performance penalties associated to external accesses. This multi-scale extension allows increasing the disparity range compatible with our approach. Moreover, the multi-oriented scheme allows to properly tuning (optimizing) the phase estimations according to the image structure presented at each pixel and, therefore, to increasing the system accuracy.
III. HARDWARE ARCHITECTURE
The whole design has been implemented for a Xilinx Virtex 4 xc4vfx100 FPGA [24] . The architecture has been described with a high level hardware description language (Handel C [25] ) which optimizes work at an algorithmic description but has proven to be competitive to lower level abstraction levels [26] . Critical stages as the memory controller unit (MCU on Fig. 1 ) or PCI-Express communication have been described in VHDL in order to optimize the performance of these critical elements. As described in Section II and in Fig. 1 , the system can be decomposed in two different parts: pyramid and processing (disparity, merge, median filter, expansion, and warping). The design strategy consists of a fine pipelined circuit which takes full advantage of the high parallelism of FPGA; nevertheless, some parts need a sequential execution due to the iterative multi-scale extension. This sequential processing can be avoided with a core replication (for processing coarse scales, from 1 to ) and with the use of double buffer techniques. It implies a major effort in terms of hardware use and generates high latencies. This approach would require two stereo matching circuits running on parallel, for instance with frames of instants and , to pipeline the processing. The output frame is delayed one frame more; for many applications, this increase in the latency (response delay) is not affordable (for instance, in obstacle avoidance of fast driving car applications): it can be avoided using high frame rate cameras. In this way, acquisition time is negligible with respect to pipeline latency. Therefore, further pipelining of the circuitry should be properly studied for each application to verify if resource consumption and latency values are appropriate.
Note that Fig. 1 also includes information about the different operands bit-widths and memory map used for the proposed architecture. Motivation of the different design choices as well as discussion of other alternatives are presented on next sections. The data-flow and connections for the adopted architecture are similar to those of a previous publication by authors [27] for an optical flow processing. For interested readers, this allows a better understanding of sharing capabilities with modules of optical flow design. Nevertheless, although the bank filter is the same, the optical flow characteristics (2-D temporal-processing) are quite different if compared with stereo (1-D spatial processing). As a consequence, the rectification module is only used in the stereo engine, the disparity extraction is computed without explicit phase estimation, confidence measure needs to be redefined and the displacement compensation module uses a completely different circuitry. Therefore, although they are conceptually very similar, in order to optimize the circuitry and achieve the maximum performance, the architecture and specifically these modules required to be completely redesigned.
A. Memory Controller Unit
Generally, physical interfaces such as memory access and external communication represent the bottleneck for hardware architecture due to the inherent sequential operation at the interfaces. Therefore, high clock frequencies and an optimized circuit description are required. As described in Fig. 1 every processing stage has to interact directly with off-chip RAM memory (for temporal results storage) reducing speed due to physical constraints of this communication. For this reason, a specific memory controller unit has been defined in VHDL in order to abstract the complex memory access in a shared memory scheme and to deliver high-level ports to connect process-stages to physical memory [28] . Bearing in mind multiple accesses on shared memory, we implement an architecture with multiple abstracted access ports (AAPs) for accessing to external memory chip. Clock domain of circuit that manages AAPs is faster than processing. This allows the scheduling of more accesses during a clock cycle of the slower clock domain. The same MCU architecture has been used in previous works [27] , where authors use five AAPs while the proposed work uses only four: two writer channels and two readers. Further information about the MCU architecture can be found in [28] . Note that another important advantage of the MCU is that each memory bank can be considered as an n-port memory. This simplifies the design at a higher level of abstraction (for instance, using the Handel-C language) allowing the avoidance of problems due to the inherent sequential properties of storage elements. It facilitates a faster algorithm codification into specific hardware data-paths.
B. Memory Mapping
Parallel accesses to RAM are allowed with a multiple bank strategy and multiple clock domains (MCU works 2 times faster than processing). Memory mapping is generated from FPGA at initialization time and depends on the image size (user defined parameter). Memory space is organized as indicated in Fig. 1 , DB bank is only used for input and output images, while rectified images, pyramid, partial disparities, and rectification lookup tables (LUTs) are stored in SB. Thus, the operation sequence described in this section writes at time t the correspondent output image in DB. A third additional memory bank is used as stereo buffer (SB) just to obtain parallel access to memory. Some speed up for the processing system can be operated in a specific board with at least four memory banks. It is possible to run in parallel rectification/pyramid with disparity calculation if a second double buffer for SB is adopted. Section IV-B includes details about the computation speed (frame rates) of different circuits.
C. Fixed Point Study
Though floating point processing units can also be synthesized [29] , they require a larger amount of resources and this motivates the utilization of fixed-point arithmetic. For all our operations, we use a fixed point notation so that the system precision is strongly dependent from the bit precision of the fractional part. A proper bit-width study is made to cope with these limitations. The bit precision chosen reduces hardware utilization and leads to a good trade-off for system accuracy. To prevent error propagation, we study the bit-width at the first processing stage and go on studying the bit depth in the next stages maintaining a suitable bit precision for the previous ones. Simulation of fixed point quantization is done using MATLAB libraries and a software model that reproduces hardware simplifications. In each processing stage, we maintain the minimum integer part that avoids variable overflow and we study only the fractional part to increase accuracy. After this study, the bit precision chosen reduces hardware utilization and obtain a good trade-off for system accuracy. In Fig. 2 , critical stages beginning from Gabor convolution kernel bit-width and ending with disparity bit-width are plotted: squared marks indicate the final choice for the fractional part and Table I reports both the integer and fractional part for each processing stage. Errors for the different fractional bits choice are expressed in terms of mean absolute error (MAE) defined as (3) where is the ground truth for the stereo sequence and is the computed disparity with fixed-point notation. The full study is out of the scope of the present contribution. For interested readers in the bit-depth working methodology, please see [30] .
D. Undistortion and Rectification
System provides an image undistorsion and rectification unit that consists in a 2-D warping of input images with a LUT of values previously calculated (offline) for the camera system configuration. These LUTs are codified using fixed point notation (8 bits for integer part and 4 for the fractional one) and written by the host in external memory banks during a previous "LUT loading" stage. After this stage, for each frame and pyramid construction, the circuit reads images from the DB and stores them after the rectification in the SB (see memory mapping in Fig. 1 ). Rectified pixels lens-distorsion free come from a bilinear interpolation in a neighborhood of 4 pixels. In the worst case, this operation needs four different sequential accesses to memory, this occurs each four consecutive memory accesses on image data.
In average, some pixel values can be reused and this gives an average 2.5 (4 out of 10) memory accesses for rectification of each image pixel.
E. Image Pyramid
After undistorsion and rectification, the pyramid is built by a smoothing and down-sampling circuit. Left and right circuits are replicated and work in parallel. Each pyramid scale is obtained sequentially one after the other (mainly, due to limitations of the sequential access to the external memory). Undistorsion-rectification and first image reduction are executed in parallel, input and output images are directly read/stored into an external RAM memory. A low pass Gaussian filter of 5 taps is applied before 
Thus, convolution with the input image is separated into and operation in order to benefit the FPGA parallelism. Five different image lines are stored in an embedded multi-port BlockRAM which is used like a FIFO. After the pre-processing, we send to output (external SRAM) a pixel every four clock cycles: three pixel are discarded (sub-sampling).
F. Scales Processing
Stage two (right part of Fig. 1 ) starts after the first one and is sequentially repeated for every scale. Disparity, merge, expansion, warping, and median filter circuits work all in parallel.
For the smallest scale, the merge circuit is omitted and the disparity block works directly on the pyramid output. Disparity calculation, as described in Section II-A, is divided in three main steps and benefits from a fine pipelined design. It uses two (left and right) Gabor filters of 11 taps, 7 atan2 cores with CORDIC core [24] , [31] for calculating (2) and a simple median circuit for choosing final disparity between seven different orientation-based estimations. On Fig. 3 , we describe these three different steps with specification of pipeline stages. In the bottom right of Fig. 3 we provide details of the convolution pipeline stages. The same pipelined architecture is also adopted for subsequent modules. The data stream enters in the pipeline and pixels are processed/transferred ahead along the pipeline structure every clock cycle. The filtering part also takes another 10 (filter size minus one) image line cycles to fill temporary FIFO convolution, which this time will be added as latency to the total disparity computation. Although the adopted algorithm is a phase-based approach, this mono-scale core, that represents the main module for our processing, is different if compared with phase-based optical flow processing (previously described in [27] ): stereo processing is based on phase differences while optical flow is based on phase-consistency. The only common part for this module is the filtering stage that generates the harmonic representation [16] of the image and can be considered the basis for further processing. The following merging circuit is simply the result of adding old and new disparity values provided respectively from a FIFO and from the disparity circuit (see Fig. 1 ). In this case, non valid values are propagated from lower resolutions: if unreliable at a lower scale, values are not considered at higher resolutions (the reliability measure is not suited for this). Low resolution valid estimates are retained even if they are never refined. A regularization after merging output is applied to drop a lot of erroneous values and get a more homogeneous output; a median filter is operated on a 3 3 window and discards non valid numbers depending on a user defined threshold. This regularization uses a tree-balanced sorting and adds a further latency to store the first two image rows.
Note that, in order to synchronize all the simultaneous processing data-paths, a careful latency analysis has been done to determine the intermediate FIFO memories capacity and when they are required.
The rest of blocks in the processing stages interact with memory trough the MCU. In detail, the memory data flow is displayed in Fig. 1 and can be summarized as follows.
1) Expand circuit, read old partial disparity and up-sample it with a bilinear interpolation (new values are multiplied by 2 to adapt disparity values to the new scale). 2) Warping circuit, read pyramid images and shift those using expanded disparity as LUT. 3) Median filter stores the partial/final result. All circuits work at a data rate of one pixel per clock cycle.
G. Warping Circuit
As commented before, the warping operation allows to reduce the disparity range. It makes possible the tuning of the Gabor filters response to the image input pattern. In order to get a further reduction of memory accesses, the warping architecture has also been optimized with the use of embedded memory (on-chip memory resources). The nature of stereo algorithms is to compute distances between positions of corresponding points only along the direction of the epipolar lines (rectified pixels in direction) and therefore, we do not need a random memory access of the whole image at every cycle as in [27] , but only a partial access to some specific lines; thus, we use a multi-port embedded RAM as cache to store two input lines and obtain in this way a fast access to pixel and disparity (LUT) values: two different accesses to RAM for the bilinear interpolation. Values are continuously refreshed as in a circular FIFO buffer. A double buffer technique is used to operate multiple accesses and optimize data throughput. We have a line of latency necessary to store the first values. Processing takes seven clock cycles more of latency for the search of the new pixel and the bilinear interpolation. Due to the fine grain pipeline, the circuit can process a pixel every clock cycle at a maximum frequency of 55 MHz. On  Fig. 4 , we present the basic architecture of the warping block for the stereo case. MPRAM memories are embedded in the Virtex 4 and they do not use any logic resources. Parallel multipliers are also implemented in embedded DSPs. The other operations are simply shifts and bit selection to separate integer from fractional part in the LUT values; finally we have an adder which sums the interpolated values to obtain the final average value (a further shift operation).
H. Implementation and Hardware Utilization
According to the bit-width choice and architecture described in Section III, we synthesize different circuit parts to esti- Fig. 4 . Warping architecture. The circuit processes a pixel every clock cycle as input data is previously stored in a multi-port RAM.
mate the hardware consumption for each functional block. On Table II , we report hardware utilization for an xc4vfx100 Virtex 4 chip. This table provides detailed information about maximum frequency according to the critical path of the system. The largest path is represented by the mono-scale disparity module and it takes 23.6 ns in the worst case. The maximum frequency of the final circuit is slightly slower than this module due to the additional logic added for the communication interfaces.
In order to reduce hardware utilization, it is possible to operate in different ways. A lossless choice is adopting sharing strategies. On Fig. 3 , we can see how some blocks (corresponding to different orientation or to the input streams of left/right images) are repeated in the architecture; the idea is to share a block by multiplexing input data in time. Sharing the same circuit for two different processes can save up to 50% of hardware resources for this processing unit, but it restricts the final system data throughput: one data for every two clock cycles. Possible targets for sharing resourses are the pre-processing filters and cross phase blocks; in the first case, the same kind of filter is repeated for the left image and right image for filtering, the most expensive operation can be shared for the image pair. In Fig. 3 , it is also evident the repetition in the second step of mono scale block disparity: seven different atan2 cores are used for computing phase (one for each orientation). In this case, we can use only four cores and share them for all seven processes. In general, the adopted sharing strategy implies some communication cost in order to synchronize input or output data for each shared module. Depending on the amount of shared resources, this cost can be significant; for this reason, a previous study of sharing benefits has to be evaluated before the hardware implementation.
Different sharing schemes are possible, depending on whether we use external memory or on-chip memory; the former results in a considerable reduction in performance because of the external memory accesses, while with the latter, we are limited by the on-chip memory resources available and it requires a much lower level design. In the first mechanism, the access to external data (off-chip memory) implies a much more significant degradation of performance. This is due to: external memory latencies, reduced bandwidth, power dissipation, and arbitration requirements. Thus, due to the large amount of on-chip memory resources available in current FPGA devices, the implementation allows the use of blocks which are specially dedicated to each processing datapath efficiently.
IV. SYSTEM RESULTS
The final circuit is optimized and set up to run on a XircaV4 board produced by Seven Solutions [32] . Although in this work the platform is used as co-processing board, the great advantage is the possibility of using it in a stand-alone mode for robotic issues and for driving scenarios. We have implemented a hardware/software platform to work as an interface between the external world (sequences captured from on-line cameras) and the FPGA platform. It provides the input images and shows the results of the disparity processing on-line. The whole system consists of a co-processing FPGA board and a host computer connected through the PCI Express interface able to transfer up to 130 MB/s. The communication using the PCI Express interface is performed with a simple handshaking protocol. The double buffer scheme optimizes the system speed and regularizes the read and write operations on external SDRAM banks. For each couple of frames, the application and the FPGA board wait for each other to commute the memory bank. The application writes the input image in the first memory segment (reserved for the input images in the memory map, Fig. 1 ) or reads the previous results. FPGA manages the bank memory map, allocates a bank for its processing and commutes the other bank to the application. The software for image acquisition and results displaying is available as Open Source at http://code.google.com/p/openrtvision/. Data acquisition is performed through image buffers and parallelized with openMP. The provided software can be further improved in terms of speed and synchronization between left and right image, but this is not the main aim of this work.
A. System Accuracy
In order to evaluate our system, we have used some benchmark images from [33] and from real driving scenarios. This allows the comparison of the results with other approaches and with ground truth if it is available. Unfortunately, for road sequences, there is not ground truth, and so, results can be evaluated only qualitatively. We have used the previously described software interface to process stored sequences in different approaches. For error estimation, we use MAE, SAE, density, and percentage of error higher than one as described in [34] . We have processed different image pairs downloaded from the Middlebury database [33] ; processed disparities are compared with ground truth for a multi-resolution computation which changes depending on image resolution: quantitative results are provided in Table III and qualitative results are shown in Fig. 5 . Table VI compares our system results with other approaches in the literature. Since our architecture provides semi-dense results, it is not possible to evaluate directly the platform output in the Middlebury web page [33] but we compare the results Table IV with semi-dense algorithms as previously described by other authors in [35] . The Middlebury Stereo Ranking primarily contains non-real-time algorithms, a few real-time algorithms and a minority of embedded real-time systems. Among the latter, the proposed real-time system represents a relevant contribution for the State-of-The-Art achieving 4.5 billion point disparities per second. In terms of accuracy, it ranks approximately at the middle among the non-real-time algorithms and achieves a high ranking position among real-time implementations. Although our phase-based approach does not rank among the first ones in the Middlebury evaluation, the robustness of the algorithm against illumination and contrast variation makes it very competitive in a real world environment, and it should be pointed out that these artifacts are not taken into account in most standard benchmark data sets. For instance, the Tsukuba dataset is not designed for outdoor applications but we include it because it is widely used in the literature for comparison purposes. The reader should bear in mind that the goal of the proposed approach is far beyond this kind of scenes and the main contribution is in terms of robustness and performance. Unfortunately, a general benchmark for outdoor applications is not available at the moment. A more careful benchmark would be desirable as pointed out in [36] . The model robustness and stability for real world unconstrained scenarios (see Fig. 6 ) is of crucial importance for outdoor-scenario applications. In fact, the robustness of the phase signal allows the same accuracy under different illuminations. The same sequence has been processed with different conditions available from [33] ; moreover, left and right frames have been chosen with a different illumination. Results maintain a good accuracy level also in very complicated conditions where conditions change from left to right image (third row of Fig. 6 ). Quantitative results for the first three rows (Monopoly sequence) are reported in Table III : E0-E0 (first row of Fig. 6 ), E2-E2 (second row), and E0-E1 (third row). Fourth and fifth rows of Fig. 6 report results for driving scenarios. In these cases, the information in the bottom part of the image (own car dashboard) is misleading and not reliable and it can be filtered (these values can be easily filtered in a previous configuration stage, converting them into NaN). In order to avoid errors in the image borders (due to the filter convolution), the frame mark is replicated before the processing and successively discarded in the final output. In general, we can say that the disadvantages of the phase-based approach are in terms of computational effort (higher than gradient-based, census, or SAD) and accuracy loss in borders. Concerning accuracy, it has some accuracy loss in object edges: borders are not clearly defined. As a result, this approach is not appropriate for object structure estimation but, on the other hand, the well-known advantage of this approach makes it specifically well suited for applications that work on outdoor unconstrained scenarios and only require approximate disparity values (as many robotics and obstacle avoidance approaches).
B. Performance Analysis
Our hardware implementation is optimized for the Xirca platform [32] but FPGA external interface with memory and PCI interface can be easily adapted for other kind of board and system maintaining the power of processing cores. On the other hand, if we have fewer resources on the target platform, the solution is dropping frequency and sharing resources as described in Section III. Some speed-up, as described in Section III-B, is also possible with a major effort in terms of external memory usage. Table V shows a comparison between different architectures in terms of external memory usage, latency and frame rate: we are assuming that camera frame rate is not binding and in latency, we provide the time required for the processing of the first complete image. The last row of Table V contains the theoretical results for a fully pipelined approach where two banks are used for the communication with the host, one for the image rectification, one for the pyramid and one for the coarse scales (the finest scale will be stored in the available DB bank). Note that this version has not been implemented since the used platform permits up to four external memory banks.
On Table VI , we provide the final frame rates for some usual image resolution and for the faster system architecture described: using four memory banks and without any sharing (third row of Table V) . On the other hand, if the target platform includes larger hardware resources and more memory banks available, the design can take full advantage of a multi-core implementation. It is important to take into account that the system is a real-time processing module which is faster than most existing systems, see for instance [40] and [42] . It is important to remark that, except for the mono-scale phase-based version presented in [12] which presents a poor accuracy, our approach overcomes the performance of all the other available contributions. The image rectification stage together with the sequential coarse-to-fine nature of the algorithm limits the frame rate performance, but as we saw in the previous section, provides reliable results especially in non controlled environments and no image preprocessing is required. The mono-scale phase-based implementation such as [12] achieves a higher frame rate but is not capable of solving high range disparities and many times produces over-smoothed results with reduced spatial localization capabilities. Note that our approach achieves a higher Point Disparity per Second (PDS) defined as: (5) This measure is commonly used in the literature as performance metric (see for instance [10] , [12] , [40] , [43] , and [44] ) and evaluates the throughput and the disparity range at the same time. It is the multiplication of the frame rate per the image resolution per the disparity range. In literature, we can also find a multi-resolution architecture [45] that simply computes disparity for different scales in parallel but does not propagate information among them and as a consequence, they reduce the hardware resource requirements. It is worthwhile indicating that the warping operations have a high complexity mainly due to the non-deterministic data access scheme (this is why they are rarely implemented in the literature) but they are justified for the accuracy improvements. In Table VI , we compare our best implementation with some works in the state of the art. Obviously, this solution is the most expensive configuration and can be replaced with a more economic one based on hardware sharing depending on the final application and the platform constraints. We achieve a very high processing speed (17.6 Megapixels per second) which is the third fastest approach in the table. But the most important comparative result is that we obtain PDS which is the second best comparative mark (more than 35% better than the third best approach) of the table and takes into account the large disparity range achievable by our multi-scale approach. Our disparity range is overcome only by [46] , which implements a binocular spherical stereo which is also useful for driving scenarios, but the final PDS obtained in this case is (lower than ours). Finally note that if the undistorsion and rectification stage is not used (it is the case in most of the literature solutions that assumes rectified inputs images with an unrealistic assumption), the processing speed achieves up to 35.7 MPPS (approximately a x2 gain factor).
V. CONCLUSION
This paper describes a stereo system of high complexity and performance implemented on a reconfigurable device. We have improved existing FPGA approaches with a higher accuracy and a larger disparity range thanks to an optimized data bit-width utilization and a coarse-to-fine multi-scale processing scheme. The final system can process video sequences at a high frame rate (for instance SVGA resolution at a frame rate of 36 fps), achieving real time for large resolution images. A comparison with the literature asserts that our approach is among the fastest approaches in terms of pixels per second and the second best approach in terms of PDS (i.e., taking into account the disparity range covered). Furthermore, it is an on-chip approach which is very well ranked among other approaches addressed with diverse technologies.
The described stereo system is robust against illumination variations between the two cameras and local contrast differences since it is based on phase and uses multi-orientation estimations to better optimize accuracy for different local contrast structures. This feature is obtained by the multi-orientation phase-based stereo model. The outstanding computing power of the system in terms of Megapixels per second has been achieved by adopting a fine grained pipelined processing data-path with superscalar units at several stages. This represents a global circuit with more than 2000 processing elements working in parallel. This is an approximative number of elementary operations computed in each clock cycle: it is estimated multiplying the pipelined stages by the level of super-scalar parallelism at each stage. Such a complex design has been carried out by adopting a modular design strategy.
We have validated the stereo processing engine in the framework of a co-processor solution that uses a software interface as frame-grabber and real-time visualization tool. The same board can be used to implement a stand-alone system and work on embedded applications with reduced occupation in space and low power consumption compared to other approaches as high performance processors, GPUs, etc. This makes the presented processing engine of specific interest for a wide variety of application fields, which include robotic, driving scenarios, mobile video surveillance, etc.
As future work, we will deal with the stand-alone version of the system, developing an Ethernet-based frame grabber with a direct connection of cameras on this interface. This structure will be adapted for a car system and integrated with feedback signals from/to the vehicle and the driver.
