Abstract-Obtaining highly accurate depth from stereo images in real time has many applications across computer vision and robotics, but in some contexts, upper bounds on power consumption constrain the feasible hardware to embedded platforms such as FPGAs. Whilst various stereo algorithms have been deployed on these platforms, usually cut down to better match the embedded architecture, certain key parts of the more advanced algorithms, e.g. those that rely on unpredictable access to memory or are highly iterative in nature, are difficult to deploy efficiently on FPGAs, and thus the depth quality that can be achieved is limited. In this paper, we leverage an FPGA-CPU chip to propose a novel, sophisticated stereo approach that combines the best features of SGM and ELAS-based methods to compute highly accurate dense depth in real time. Our approach achieves an 8.7% error rate on the challenging KITTI 2015 dataset at over 50 FPS, with a power consumption of only 5 W.
Index Terms-Heterogeneous, FPGA, real-time, stereo, depth
Obtaining information about the 3D structure of a scene is important for many computer vision and robotics applications, e.g. 3D scene reconstruction [1]-[3], camera relocalisation [4]-[6], navigation and obstacle avoidance [7]. Often, this information will be obtained in the form of a depth image, and various options for acquiring such images exist. Passive approaches, which rely only on one or more image sensors, are popular due to their low cost, low weight and size, lack of active/moving components, ability to work at longer ranges, deployability in a wider range of operating environments and lack of interference. Among them, binocular stereo relies on a pair of synchronised cameras to acquire the same scene from two different points of view. Given the two frames, a dense and reliable depth map can be computed by finding correspondences between the pixels in the two images [8]. State-of-the-art algorithms for this problem usually rely on costly global image optimisations or on massive convolutional neural networks that involve significant computational costs, making them hard to deploy on resource-limited systems such as embedded devices [9].
Two popular solutions offering a good trade-off between speed and accuracy are Semi-Global Matching (SGM) [10] and ELAS [11]. SGM computes initial matching hypotheses by comparing patches around pixels in the left and right images, then approximates a costly image-wide smoothness constraint with the sum of several directional minimizations over the disparity range. By contrast, ELAS first identifies a set of sparse but reliable correspondences to provide a coarse approximation of the scene geometry, then uses them to define slanted plane priors that guide the final dense matching stage. We propose a novel stereo pipeline that efficiently combines the predictions of these two algorithms, achieving high accuracy and overcoming some of the limitations of each algorithm. First, we use multiple passes of a fast SGM variant [12], left-right consistency checking and decimation to obtain a sparse but reliable set of correspondences. Then, we use these as the support points for ELAS to obtain disparity priors from slanted planes. Finally, we incorporate these disparity priors into a final SGM-based optimization (again based on [12]) to achieve dense predictions with high accuracy.
Our pipeline targets not only accuracy, but also speed, aiming for real-time execution (30 fps) on an embedded platform. Recent works have deployed SGM successfully in real time both on multi-core CPUs [13] and GPUs [14] , [15] , but in real-world scenarios, power constraints often force us to rely on low-power devices like FPGAs. The development of reliable stereo pipelines for FPGAs is an active research field [9] , [16] - [22] , with recent works proposing FPGA-friendly variants of SGM [15] , [23] - [27] or ELAS [28] . However, FPGA implementations of stereo algorithms usually perform some kind of approximation to deal with the limited resources available and to traverse the pixels in raster order.
We show how some of the intrinsic limitations of a pure FPGA-based implementation can be mitigated by appropriately leveraging a new-generation hybrid system on a chip (SoC), e.g. the Xilinx ZCU104, which combines both an ARM processor and an FPGA, with shared direct memory access, into a single chip. Recently, several works have explored the deployment of stereo methods on such platforms: both [26] and [19] use the CPU mainly for handling communication and controlling peripherals, while [28] actively leverages the CPU to execute iterative steps that would be infeasible on an FPGA (e.g. Delaunay triangulation). Similar to [28], we propose to actively use the elaboration capability of the built-in CPU to handle I/O and to execute part of the ELAS pipeline, while deploying all the other elaboration blocks on the FPGA. We show how our pipeline outperforms previously published works by achieving an 8.7% error rate on the challenging KITTI 2015 dataset [29], [30], while still operating with real-time performance and low power consumption.

Fig. 1: Overview of our approach. First, we use Fast R³SGM (see §I-A1) to compute disparity images for the input stereo pair (in raster and reverse-raster order). We then flip the right result and perform a left-right consistency check to obtain an accurate but sparse disparity map for the left input image (see §I-A2). Next, as ELAS [11] does, we perform support checking (see §I-B1) to remove points whose disparities appear abnormal relative to neighbouring pixels: this yields a sparser support point image that contains only points with confident disparities. This support point image is subsequently used in multiple ways: (i) it is further sparsified via a redundancy check, producing sparse anchors that are then used to generate plane disparity priors through a triangulation and interpolation process (see §I-B2); (ii) it is split into a grid where, for each grid cell, a binary vector representing the set of viable disparities is computed (see §I-B3). Finally, the support point image is combined with the outputs of (i) and (ii) in a disparity optimization that combines R³SGM and ELAS to produce a dense disparity image (see §I-C). We then median filter this image for robustness to produce the final result.
I. METHOD
Our overall pipeline is shown in Figure 1 . It consists of several different components which we describe in the subsections that follow. The system leverages both parts of the FPGA-CPU hybrid SoC to achieve optimal results. Tasks that are very data-intensive, but which access that data in a predictable manner, are run on dedicated FPGA accelerators to benefit from their parallel processing capability. In addition, they can take advantage of the FPGA accelerators' internal ability to pipeline data so that multiple inputs are processed together in staggered fashion. Tasks that are very dynamic and unpredictable, which often involve many unforeseen or random accesses to external memory, are run on the CPU, since they benefit both from the significantly faster clock frequency of the CPU and its ability to access memory in constant time (CPU memory accesses can be sped up via appropriate use of the cache). To minimize the amount of FPGA resources used by our method, as well as allow the deployment of the design on a real platform, we reuse some accelerators whilst buffering intermediate results in RAM. We will detail which blocks are reused in our final design in the rest of this section.
A. Sparse Disparity Computation

1) Fast R³SGM: Initially, we use a modified version of R³SGM [12] (a memory-efficient adaptation of classic SGM [10] to FPGAs), which we call Fast R³SGM, to compute disparity images for input stereo pairs. The original version of R³SGM aggregated contributions to the disparity of each pixel along four different scanlines: three above the pixel, and one to the left. However, as mentioned in [12], using the left scanline severely limits the overall throughput of the system (one disparity value is output every three clock cycles) due to a blocking dependency between immediately successive pixels. To avoid this, we modify the approach to use only the scanlines above the pixel, allowing us to output one disparity per clock cycle. The mild loss in accuracy this causes is more than compensated for by the improvements yielded by the rest of our pipeline.

Fig. 2: Comparing the implicit biases that exist in raster and reverse-raster passes of R³SGM. The region of influence used in the original version of Fast R³SGM contains only pixels above the pixel of interest (a), and as such the disparity value computed for that pixel is unaffected by the pixels below it. The opposite holds true when performing Fast R³SGM in reverse-raster order (b).
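To make the scanline aggregation concrete, the following is a minimal sketch of one step of the classic SGM path-cost recurrence from [10], applied along a single scanline. It is not the FPGA implementation of Fast R³SGM: the function name `sgm_step` and the penalty values `p1` and `p2` are illustrative assumptions, and the summation of the resulting path costs over the three scanlines above each pixel is left out.

```python
def sgm_step(prev, cost, p1=10, p2=120):
    """One step of the classic SGM recurrence along one scanline.

    prev: aggregated path costs (one per disparity) of the previous pixel
          on the scanline, e.g. the pixel directly above.
    cost: raw matching costs (one per disparity) of the current pixel.
    p1, p2: small/large smoothness penalties (illustrative values only).
    """
    best_prev = min(prev)
    out = []
    for d, c in enumerate(cost):
        # Same disparity: no penalty. Any other disparity: large penalty.
        candidates = [prev[d], best_prev + p2]
        # Disparity changes of exactly one level incur the small penalty.
        if d > 0:
            candidates.append(prev[d - 1] + p1)
        if d + 1 < len(prev):
            candidates.append(prev[d + 1] + p1)
        # Subtracting best_prev keeps the aggregated values bounded.
        out.append(c + min(candidates) - best_prev)
    return out
```

Because each step depends only on the aggregated costs of a pixel in the previous row, restricting the scanlines to those above the pixel removes the dependency between immediately successive pixels that the text describes.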
We process each input pair twice: once in raster order, and once in reverse-raster order, yielding two disparity images overall. The advantage of this is that, as illustrated in Figure 2, the raster and reverse-raster passes of R³SGM will base the disparity for each pixel on the disparities of pixels in different regions of influence. By comparing the results output by both these separate passes, we can identify pixels for which the difference in bias caused a significant change in the resulting output value. These can then be removed so that only pixels that maintain the same value, regardless of bias, are retained. Through this type of consistency check, the confidence and accuracy of the results can be improved.
In our implementation, we deploy a single instance of the Fast R³SGM block, together with the associated median filtering and L/R consistency checking blocks. We first feed the blocks with the raster-order stereo pair, then with the reverse-raster-order pair, storing the disparities resulting from each pair back into RAM between the computations. Example images resulting from this two-pass process are depicted in Figure 3. For further architectural details of the internal structure of the main blocks, we refer the interested reader to [12].
2) Consolidating Consistency Checking: Each pass of Fast R³SGM outputs a disparity map that has been checked for consistency using the first input as the reference image [12]. The raster pass outputs a disparity map for the left input image; the reverse-raster pass outputs one for the (reversed) right input image. Due to the streaming nature of the disparity computation, however, the results suffer from a raster or reverse-raster scan bias, i.e. the disparity value of any given pixel is encouraged to be similar to those computed before it. To reconcile the inconsistencies between these two disparity maps, we perform a further left-right consistency check, which yields an accurate but sparse disparity map for the left input image as its result (see Figure 1). The memory access pattern of such a process is problematic, however, as the first pixels in the left disparity map need to be checked against the last pixels in the right disparity map. To overcome this problem, we first reverse the latter image on the CPU (since this is an inherently sequential process, it benefits from the higher clock rate provided by the ARM core), then perform a standard left-right consistency check (on the programmable logic).
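The two steps just described (reversing the right disparity map, then checking it against the left one) can be sketched as follows. This is an illustrative software version only, assuming integer disparities stored as nested lists; the `INVALID` marker, the function names and the tolerance `tol` are hypothetical and not taken from the implementation.

```python
INVALID = -1  # hypothetical marker for pixels without a valid disparity


def reverse_raster(image):
    """Undo reverse-raster ordering: reverse the rows and each row's pixels."""
    return [row[::-1] for row in image[::-1]]


def left_right_check(disp_left, disp_right, tol=1):
    """Keep only left disparities that agree with the right map within tol."""
    out = []
    for y, row in enumerate(disp_left):
        out_row = []
        for x, d in enumerate(row):
            keep = False
            if d != INVALID:
                xr = x - d  # column of the matching pixel in the right image
                if 0 <= xr < len(row):
                    dr = disp_right[y][xr]
                    keep = dr != INVALID and abs(d - dr) <= tol
            out_row.append(d if keep else INVALID)
        out.append(out_row)
    return out
```

A possible usage would be `left_right_check(left_map, reverse_raster(right_map_reversed))`, where pixels that fail the check are invalidated, yielding the sparse map used in the following stages.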
B. Generation of Priors
Using the sparse disparity map output by the consolidating consistency check, we adapt the ELAS method described in [11] to generate priors that can be fed into a combined disparity optimization process (see §I-C) to produce a more accurate and dense final result. The prior generation process begins by taking the disparity map produced by §I-A2 as input and producing a support point image (see §I-B1) containing sparse but confident disparities. The support points are then fed to two more blocks before being used by the final disparity optimization process: (i) a redundancy checking and disparity prior generation block, which first computes a sparse anchor points image and then triangulates such anchors to generate disparity priors for all pixels in the image (see §I-B2); and (ii) a grid vector extraction block that divides the support points image into a grid and then determines the set of possible disparities for each cell (see §I-B3).
1) Support Checking: To produce the support point image (Figure 4), we filter the sparse disparity map to remove any pixels whose disparities are not sufficiently supported by the pixels in their immediate neighbourhoods (in practice, a square window centred on each pixel). For a pixel to be considered "supported", its neighbourhood must contain at least a predefined number of other pixels with very similar disparity values (e.g. at least 10 pixels within a 5×5 window whose disparities differ by less than 5 from that of the centre pixel). The disparities of all other pixels are marked as invalid. The resulting support point image will evidently be sparser than the original disparity map, since we have kept only those pixels about whose disparities we can be reasonably confident.
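The support criterion can be illustrated with a short sketch that uses the example values quoted above (a 5×5 window, at least 10 supporting pixels, and a disparity difference below 5). The `INVALID` marker, the function name and the handling of image borders (windows are simply clipped) are assumptions made for illustration.

```python
INVALID = -1  # hypothetical marker for pixels without a valid disparity


def support_check(disp, radius=2, min_support=10, max_diff=5):
    """Invalidate pixels that lack enough similar neighbours in their window."""
    h, w = len(disp), len(disp[0])
    out = [[INVALID] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = disp[y][x]
            if d == INVALID:
                continue
            support = 0
            # Square window centred on (x, y), clipped at the image borders.
            for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    if (ny, nx) == (y, x):
                        continue
                    n = disp[ny][nx]
                    if n != INVALID and abs(n - d) < max_diff:
                        support += 1
            if support >= min_support:
                out[y][x] = d
    return out
```

With clipped windows, pixels near the image corners see fewer neighbours and are therefore more likely to be rejected; a different border policy would change this behaviour.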
2) Redundancy Checking and Disparity Prior Generation: To produce the anchor image (see Figure 5) , we further sparsify the support point image from §I-B1 by processing it in raster order and invalidating any pixel whose disparity has already been seen within a window behind and above it. Unlike [28] , which for each pixel (x, y) used a window of where K was set to 5, which only encompassed points in the same row or same column as the pixel being processed, here we use a larger window of
This has the effect of creating a sparser anchor image than that used in [28] , significantly speeding up the subsequent Delaunay triangulation process. Whilst this inevitably reduces the granularity of the generated triangles, its impact on the quality of the subsequent depth priors is minor, as shown in [28] . As the Delaunay triangulation process was additionally shown to be a key bottleneck, the advantage of reducing the number of points to triangulate (thus reducing CPU processing time) outweighs the marginal benefit in accuracy obtained with more fine-grained triangles.
Finally, to produce the disparity priors, we first move the anchor points image back to RAM, then perform a Delaunay triangulation of those points, and finally compute the disparity of each non-anchor point located within one of the Delaunay triangles by interpolating the disparities of the triangle's vertices (depicted in Figure 6 ). The entirety of this process is performed by the CPU, since the triangulation and interpolation algorithms are inherently non-sequential in their memory access patterns, and can benefit from both the availability of memory caches and the higher speed of the ARM core.
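The interpolation step can be sketched for a single triangle using barycentric coordinates. The sketch assumes that the Delaunay triangulation has already been computed elsewhere, and that each vertex is given as a hypothetical `(x, y, disparity)` tuple; it illustrates linear interpolation within a triangle rather than the CPU implementation itself.

```python
def interpolate_in_triangle(p, a, b, c):
    """Linearly interpolate the disparity at p = (x, y) inside triangle abc.

    a, b, c are (x, y, disparity) anchor points. Returns None when p lies
    outside the triangle or the triangle is degenerate.
    """
    (ax, ay, ad), (bx, by, bd), (cx, cy, cd) = a, b, c
    px, py = p
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        return None  # the three anchors are collinear
    # Barycentric weights of p with respect to a, b and c.
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    wc = 1.0 - wa - wb
    if min(wa, wb, wc) < 0:
        return None  # p is outside the triangle
    return wa * ad + wb * bd + wc * cd
```

Since the weights vary linearly with position, the interpolated disparities within each triangle lie on the plane through its three anchors, which is what makes the result a slanted plane prior.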
3) Grid Vector Extraction: The final input to the combined disparity optimization we describe in §I-C is a set of binary grid vectors used to determine which disparities are suitable for each part of the image. To produce such vectors, we first divide the support point image into a regular grid (with cells of size 50 × 50 in our implementation). Then, for each cell, we find the valid disparity values within it, and store both those and their neighbouring disparities (±1) into a binary grid vector for that cell. See [28] for more details.

Fig. 6: The plane priors produced by constructing a Delaunay triangulation based on the sparse anchor points in Figure 5, and then linearly interpolating the disparities within each triangle.

Fig. 7: The final dense disparity image for our running example, produced using the combined disparity optimisation described in Section I-C.
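A minimal sketch of the grid vector extraction follows, assuming integer disparities in a nested list. The 50 × 50 cell size comes from the text, whereas the `INVALID` marker, the function name and the disparity range `max_disp` are hypothetical values chosen for illustration.

```python
INVALID = -1  # hypothetical marker for pixels without a valid disparity


def grid_vectors(support, cell=50, max_disp=128):
    """Compute one binary vector of viable disparities per grid cell."""
    h, w = len(support), len(support[0])
    rows = (h + cell - 1) // cell  # number of cell rows (rounded up)
    cols = (w + cell - 1) // cell  # number of cell columns (rounded up)
    vectors = [[[False] * max_disp for _ in range(cols)] for _ in range(rows)]
    for y in range(h):
        for x in range(w):
            d = support[y][x]
            if d == INVALID:
                continue
            vec = vectors[y // cell][x // cell]
            # Mark the support disparity and its immediate neighbours (+-1).
            for n in (d - 1, d, d + 1):
                if 0 <= n < max_disp:
                    vec[n] = True
    return vectors
```

Each vector then acts as a per-cell mask over the disparity range in the combined optimization.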
C. Combined Disparity Optimization
Finally, we perform a combined disparity optimization that takes into account not only the original pair of input images, but also the plane priors, grid vectors and support points. Essentially, we perform Fast R³SGM, as in §I-A1 (once again reusing the corresponding FPGA block), after first modifying the cost vectors of the pixels to take the various priors we have available into account.
The disparities of the support points are fixed throughout and not recomputed. Every cost vector element for a support point (each of which corresponds to a specific disparity) is set to a large arbitrary value, except for the element corresponding to the disparity of the support point, which is set to zero instead. Through the Fast R³SGM smoothing process, pixels near the support point will then naturally be encouraged to adopt disparities similar to that of the support point itself, with the influence of this effect attenuating with distance. To take the disparity prior for each pixel into account, we decrease those elements of its cost vector that correspond to disparities close to the prior (more specifically, we superimpose a negative Gaussian over the cost vector, centred on the prior, and decrease the relevant elements within a certain radius accordingly). To make use of the grid vectors, we set all elements of the cost vectors for the pixels within each grid cell that do not appear in the grid vector for that cell to an arbitrarily large value, thus strongly encouraging them not to be selected. As with the effects of the support points, these cost vector modifications are similarly propagated by the Fast R³SGM smoothing process. At the end of this process, we perform a final median filter on the Fast R³SGM result to further mitigate impulsive noise, ultimately yielding a dense, accurate disparity map as demonstrated in Figure 7.
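The three cost vector modifications described above can be sketched for a single pixel as follows. All numeric values (`LARGE`, `sigma`, `radius`, `weight`) and the function name are illustrative assumptions, as is the order in which the Gaussian prior and the grid mask are applied; the text does not specify them.

```python
import math

LARGE = 1e6  # hypothetical "arbitrarily large" cost


def apply_priors(cost, support_disp=None, prior=None, allowed=None,
                 sigma=2.0, radius=4, weight=10.0):
    """Return a modified copy of one pixel's cost vector.

    support_disp: disparity of the pixel if it is a support point.
    prior: plane prior disparity for the pixel, if available.
    allowed: binary grid vector of the cell containing the pixel.
    """
    n = len(cost)
    if support_disp is not None:
        # Support points are fixed: zero cost at their disparity only.
        return [0.0 if d == support_disp else LARGE for d in range(n)]
    out = list(cost)
    if prior is not None:
        centre = int(round(prior))
        # Superimpose a negative Gaussian centred on the prior.
        for d in range(max(0, centre - radius), min(n, centre + radius + 1)):
            out[d] -= weight * math.exp(-((d - prior) ** 2) / (2 * sigma ** 2))
    if allowed is not None:
        # Disparities absent from the grid vector are strongly discouraged.
        out = [c if allowed[d] else LARGE for d, c in enumerate(out)]
    return out
```

The modified vectors would then be passed to the smoothing stage in place of the raw matching costs, so that the priors propagate to neighbouring pixels.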
II. RESULTS
TABLE I: ... [29], [30]. As in the official evaluation protocol, we report the percentage of accurate disparities (using a threshold of < 3 disparity values or 5%, whichever is greater) after an interpolation step (meant to assign a disparity value to all pixels in the image), on respectively the subsets of background, foreground and all pixels. We additionally report the density of valid disparity values. As can be seen, with the exception of R³SGM [12], all methods provide almost dense disparity images, therefore the extra interpolation step mandated by the benchmark is not strictly required to obtain usable disparity images. Finally, for each method, we report the typical time required to process a stereo pair, as well as the approximate power consumption of the platform used. Whilst all approaches can process images in real time, only the FPGA-based methods (ours and [12]) can do so in a power-efficient manner, with ours providing ≈ 12% additional accuracy and much higher density w.r.t. [12], at the expense of slightly higher power usage and processing time.

We developed the FPGA accelerators using the Vivado High-Level Synthesis (HLS) tool, as this approach was quicker, and allowed for greater flexibility and reusability of the components. We deployed the system on a Xilinx ZCU104 board, and all of the power consumption results that we present for our method were estimated by the Xilinx Vivado tool. Although the values provided by the tool are only approximations, they still provide an accurate sense of the power requirements. We quantitatively evaluate the disparities produced by our approach on the standard KITTI 2015 stereo benchmark [29], [30]. In Table I, we report the average percentages of pixel disparities estimated correctly for background, foreground and all pixels, respectively.
We also report average runtimes and power consumptions for both our method and alternative methods that achieve real-time processing speeds on the images used in the benchmark (which have a resolution of 1242 × 375). Whilst the proposed method results in slightly less accurate disparities than the DeepCostAggr [31] and CSCT-SGM-MF [15] methods, it is worth pointing out that both [15] and [31] rely on powerful GPUs to achieve real-time processing speed, whereas our approach does so in a much more power-efficient manner, relying only on a hybrid FPGA-CPU board. We also compare favourably to R³SGM [12], the underlying method on which we base our approach for the estimation of the initial disparities (see §I-A1), providing more accurate and denser results at a similar speed and with similar power consumption. We similarly outperform the FPGA variant of ELAS [28], achieving a lower error rate at a much higher speed, and with similarly low power consumption.
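The accuracy criterion used in this evaluation can be sketched as follows, reading the threshold as stated in the text: a disparity counts as accurate if its error is below 3 disparity values or 5% of the ground truth, whichever is greater. The function name and the use of `None` for pixels without ground truth are assumptions; the official benchmark code is not reproduced here.

```python
def error_rate(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of evaluated pixels whose disparity error is too large.

    pred, gt: flat sequences of predicted and ground-truth disparities;
    gt entries of None denote pixels without ground truth (skipped).
    """
    bad = total = 0
    for p, g in zip(pred, gt):
        if g is None:
            continue
        total += 1
        # A pixel is erroneous if its error reaches the larger threshold.
        if abs(p - g) >= max(abs_thresh, rel_thresh * g):
            bad += 1
    return 100.0 * bad / total if total else 0.0
```

Under this reading, the percentage of accurate disparities is simply 100 minus the returned value, so an 8.7% error rate corresponds to 91.3% of the evaluated pixels being accurate.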
In Table II, we detail the hardware resources used by our approach when deployed on our Xilinx ZCU104 board. We break down the amount of logic resources used in the FPGA chip, as well as the power consumption of both the programmable logic and the ARM core. We also report the amount of resources and power used by the methods from which we draw inspiration [12], [28]. Notably, despite making full use of many of the logic resources available on the FPGA, our power consumption remains very low. More specifically, breaking down the resource utilization of the programmable logic amongst the different accelerators, the largest share is taken by the Fast R³SGM block which, alone, consumes about 65% of the FPGA power. The next most resource-heavy blocks are the ones which perform the median filtering of the disparities, which require approximately 30% of the power. The remaining blocks have much smaller resource requirements, which altogether account for the remaining 5% of the power.
III. CONCLUSIONS
In this paper, we have presented a novel approach to computing depth from stereo images on a hybrid FPGA-CPU chip. Our approach uses an adapted version of ELAS [11] to refine the initial sparse disparity map produced by a fast variant of R
