Figure 1: Our bilateral solver produces smooth, edge-aware flow fields. Given an input pair of images (a), a low-resolution flow is estimated (b), upsampled to a noisy high-resolution flow (c), and processed with the bilateral solver (d) to produce an edge-aware smoothed flow (e). Our algorithm for bilateral solving is better-suited for hardware acceleration and results in speedups of up to 50× over prior work [2, 3].
INTRODUCTION
Virtual reality (VR) devices are becoming widely available, from camera rigs for video capture [Anderson et al. 2016; Facebook 2017] to headsets for immersive viewing [Google 2017; Oculus 2017; Samsung 2017]. Real-time rendering of 3D-360° video can enable a wide range of VR applications, from live sports and concert broadcasting to telepresence. But while VR video capture and viewing are growing more popular, no system is yet capable of producing 3D-360° VR video in real time.
The Google Jump camera rig [Anderson et al. 2016] is one example of a commodity VR video capture device, using 16 cameras to capture high-resolution (4K-1080p) overlapping video streams of a 360° scene. The collected video is used to estimate edge-aware flow fields, which are composited into a stereoscopic 3D-360° video. A key portion of this processing pipeline, flow estimation, is constructed around the bilateral solver, a fast and edge-aware algorithm that combines simple bilateral filters with domain-specific optimization problems [Barron and Poole 2016]. This flow estimation algorithm, highlighted in Figure 1, consumes the majority of the processing time, despite using one of the fastest existing algorithms across thousands of cores [Anderson et al. 2016].
In this paper, we introduce a new algorithm for this flow estimation problem, the hardware-friendly bilateral solver (HFBS). HFBS achieves significant speedups over the bilateral solver with little accuracy loss. While Barron and Poole's bilateral solver [2016] is challenging to parallelize on modern hardware, our "hardware-friendly" algorithm can be easily parallelized on GPUs and field-programmable gate arrays (FPGAs). To demonstrate, we design a scalable FPGA-based hardware accelerator for HFBS, employing a specialized memory layout and reduced-precision fixed-point computation to achieve real-time results. Compared to the original bilateral solver, HFBS is 4× faster on a CPU, 32× faster on a GPU, and 50× faster on an FPGA. We evaluate the accuracy of HFBS on the depth superresolution task and show that our algorithm is faster than every more accurate algorithm, and more accurate than every faster algorithm.
This paper makes two contributions: an algorithm for hardware-friendly bilateral solving, and a fixed-function FPGA accelerator implementing HFBS. To achieve fast performance while maintaining accuracy, we take a hardware-software codesign approach where both the algorithm and the hardware substrate are developed in tandem. Our algorithm modifies the original bilateral solver to ensure memory access is predictable and therefore fast, and performs optimization using preconditioned gradient descent with momentum to reduce global communication and enable parallel execution. Our hardware accelerator explores fixed-point arithmetic and a bilateral-grid-specialized memory layout to process large bilateral grids scalably in real time. Many of these performance optimizations are codependent, so we evaluate the performance of the algorithm and hardware together to illustrate our results. Our algorithm and accelerator design make it more practical to generate real-time VR video from camera rigs, either locally at the capture device, or in the cloud to accelerate large-scale video processing.
BACKGROUND
Before formalizing our hardware-friendly bilateral solver, we provide an overview of bilateral solving and its role in VR video. We also describe related work in software and hardware acceleration for bilateral solving.
Bilateral Filtering and the Bilateral Grid
We base our design for fast and accurate VR video on a state-of-the-art bilateral-space optimization algorithm, the bilateral solver [Barron and Poole 2016]. The bilateral solver is general-purpose and scalable, and can be applied to many vision applications: optical flow, stereo, depth superresolution, image colorization, and semantic segmentation. The bilateral solver can be used as part of an edge-aware optical flow algorithm for VR video, and scales efficiently to high resolutions [Anderson et al. 2016]. This optical flow algorithm generates a correspondence map from a pair of images by computing a rough flow vector for every pixel (Figure 1b-c), and then refining that flow field until a cost function is minimized. To compute this edge-aware per-pixel flow field, the bilateral solver resamples a coarse flow field into bilateral space (Figure 1d), and then solves an optimization problem in bilateral space to infer the smoothest possible flow field that is similar to the input coarse flow field. In bilateral space, simple local filters are equivalent to costly, global, edge-aware filters in pixel space; consequently, flow refinement in bilateral space is much faster than its pixel-space equivalent. We perform optimization in a three-dimensional bilateral grid data structure [Chen et al. 2007]. Figure 2 illustrates a simplified version of bilateral space and its use in our problem. We begin with the noisy flow field of Figure 2a-i, where color corresponds to some flow value. If we attempt to denoise this flow field by applying a simple smoothing kernel, the result exhibits undesirable blurring at color edges. In Figure 2a-ii, for instance, the green region is successfully denoised, but the blue and red regions (which likely belong to different objects) blend around the edges, producing incorrect flow values there.
To smooth this flow field while maintaining sharp edges, we map the problem to bilateral space. First, we construct a bilateral grid for the original image, where a pixel in the image at location (x, y) with luminance l corresponds to a grid block at location (x, y, l) (Figure 2b-i, b-ii). In the 3D bilateral space, the lighter pixels are separated from neighboring darker pixels. We then map the flow value of each pixel (Figure 2b-iii) to its corresponding grid location (Figure 2b-iv). When we smooth this noisy flow in bilateral space, the blue and red areas are no longer neighbors and do not affect each other's values. Finally, we map the smoothed 3D flow back to the 2D representation. The resulting bilateral-smooth output in Figure 2a-vi retains sharp edges.
Virtual reality video with the bilateral solver. We tailored our algorithm for a VR pipeline (similar to that of Anderson et al. [2016]), which takes 16 camera streams as input and processes them with the bilateral solver to construct 3D-360° video. There are many other ways to capture and render VR video, but each method presents unique challenges. Light fields [Levoy and Hanrahan 1996], a type of image that conveys information about the flow of light in a scene, are the most general and immersive solution to VR imagery. Proposed light-field-based systems, however, require an enormous amount of input and output data, and rendering on the client side is compute-intensive. While impressive results for light field images have been demonstrated in VR [Huang et al. 2015], video is a greater challenge and still impractical. Other solutions for immersive video viewing, such as free-viewpoint video [Carranza et al. 2003] and concentric mosaics [Shum et al. 2005], are also challenging to process and display using standard video formats, resulting in systems that are not yet stable enough to motivate hardware support. In contrast, the Jump VR video system is designed to be practical to compute, edit, and stream. Processing the 16-camera video array requires computing optical flow between images from each adjacent pair of cameras and then interpolating the images to produce the omnidirectional stereo projection [Peleg et al. 2001]. The output is a pair of equirectangular spherical video streams, one stream for the left eye and one stream for the right.
[Figure 2 panels: Step 1: Grid construction. Step 2: Mapping flow to grid. Step 3: 3D smoothing the grid. Step 4: Mapping back to 2D.]
Figure 2: Smoothing a noisy flow field in (a) regular 2D space and (b) bilateral space. Regular smoothing produces undesirable artifacts at edges as the flow values blur together. The bilateral grid allows edge-aware smoothing and produces a correct denoised output.
Anderson et al. [2016] employ a bilateral-space solver for optical flow to efficiently produce high-quality edge-aware flow results that are well-suited to omnidirectional stereo image interpolation. They observe that the majority of rendering time is spent running the bilateral solver.
Hardware Acceleration for Bilateral Grids
This work is the first, to our knowledge, to accelerate the bilateral solver on GPUs or with custom hardware, and it builds on related work in hardware-efficient algorithms and accelerators for bilateral filtering and the bilateral grid. The bilateral grid itself was originally proposed as a solution for fast, parallelizable bilateral filtering on GPUs [Chen et al. 2007]. Toward more hardware-efficient execution on GPUs, Yang [2014] proposed a hierarchical bilateral filter technique, but their approach has much higher error than our algorithm. Most similar to our accelerator is that of Rithe et al. [2013], who designed a low-power reconfigurable processor for bilateral filtering. Their design differs from ours by implementing splatting and slicing in hardware, but it can only perform streaming bilateral filters and does not support the repeated filtering iterations necessary for bilateral-space optimization.
HARDWARE-FRIENDLY BILATERAL SOLVING
In this section, we formulate a bilateral solver that maintains speed, scalability, and accuracy, while also being parallelizable. We first describe the original bilateral solver of Barron and Poole [2016] and motivate the requirements for a hardware-friendly bilateral solver. We then provide a detailed formulation of our algorithm and its advantages.
Bilateral-Space Optimization
The original bilateral solver (OBS) consists of an objective and an optimization technique [Barron and Poole 2016]. The input to the solver is a reference RGB image, a target image that contains noisy observed quantities we wish to improve, and a confidence image. The goal is to recover an "output" vector x, which will resemble the input target where the confidence is large, while being smooth and tightly aligned to edges in the reference image. To achieve this, Barron and Poole construct an optimization problem of the following form:

minimize_x   (λ/2) Σ_{i,j} Ŵ_{i,j} (x_i − x_j)² + Σ_i c_i (x_i − t_i)²    (1)
The first term of the loss encourages that, for all pixel pairs i and j, the difference between their flow values x_i and x_j is minimized if they are neighboring pixels in bilateral space. The second term of Eq. 1 encourages each pixel's value x_i to be close to the target input t_i if that pixel's confidence c_i is high. The affinity matrix Ŵ is a bistochastized (all rows and columns sum to 1) version of a bilateral affinity matrix W. Each element W_{i,j} of the bilateral affinity matrix describes the affinity between pixels i and j in the reference image in the YUV colorspace:

W_{i,j} = exp( −((p^x_i − p^x_j)² + (p^y_i − p^y_j)²) / (2σ_x²) − (p^l_i − p^l_j)² / (2σ_l²) − ((p^u_i − p^u_j)² + (p^v_i − p^v_j)²) / (2σ_u²) )    (2)
where p_i is a pixel in the reference image with location (p^x_i, p^y_i) and color (p^l_i, p^u_i, p^v_i). The σ_x, σ_l, and σ_u parameters control the support of the spatial, luminance (luma), and chrominance (chroma) components of the filter. Bistochastization normalizes this affinity matrix while maintaining symmetry [Barron et al. 2015].
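To make the affinity concrete, here is a minimal NumPy sketch of Eq. 2 as reconstructed above; the function name, pixel-tuple layout, and default bandwidth values are our own illustrative choices, not the paper's code.

```python
import numpy as np

def bilateral_affinity(p_i, p_j, sigma_x=8.0, sigma_l=4.0, sigma_u=4.0):
    """Affinity between two pixels per Eq. 2 (our reconstruction).

    Each pixel is a tuple (x, y, l, u, v): spatial position, luma,
    and two chroma values. sigma_x, sigma_l, sigma_u control the
    spatial, luma, and chroma support; the defaults are illustrative.
    """
    d_xy = (p_i[0] - p_j[0])**2 + (p_i[1] - p_j[1])**2   # spatial distance
    d_l = (p_i[2] - p_j[2])**2                            # luma distance
    d_uv = (p_i[3] - p_j[3])**2 + (p_i[4] - p_j[4])**2    # chroma distance
    return np.exp(-d_xy / (2 * sigma_x**2)
                  - d_l / (2 * sigma_l**2)
                  - d_uv / (2 * sigma_u**2))
```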
Bilateral operations (e.g., filtering) can be sped up by treating the filter as a "splat/blur/slice" procedure in the bilateral grid. The splat/blur/slice filtering approach corresponds to a compact factorization of W:

W = S^T B S    (3)

where S and S^T are splatting and slicing, and B is a [1 2 1] blur kernel. As in Barron and Poole [2016], S defines a per-pixel mapping from a pixel to a coarse bin in the bilateral grid, where that mapping is a function of the x and y coordinates, luma l, and chroma u and v of that pixel. Multiplying by S is a data-dependent histogramming operation, and multiplying by S^T is a data-dependent interpolation.
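The splat and slice operators can be understood as a histogram-style scatter and a gather. Below is a hedged NumPy sketch for a grayscale 3D grid (the dense grid HFBS uses); the names and the (x, y, luma) grid-shape convention are our assumptions, and a full solver would also splat confidences and a mass channel.

```python
import numpy as np

def splat(image, values, sigma_x, sigma_l, grid_shape):
    """S: data-dependent histogramming of per-pixel values into a
    dense 3D (x, y, luma) bilateral grid. A sketch, not the paper's code."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gx = (xs // sigma_x).ravel()                  # spatial bin in x
    gy = (ys // sigma_x).ravel()                  # spatial bin in y
    gl = (image // sigma_l).astype(int).ravel()   # luma bin
    grid = np.zeros(grid_shape)
    np.add.at(grid, (gx, gy, gl), values.ravel())
    return grid

def slice_hard(grid, image, sigma_x, sigma_l):
    """S^T (hard version): gather each pixel's value from its grid cell."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gx = (xs // sigma_x).ravel()
    gy = (ys // sigma_x).ravel()
    gl = (image // sigma_l).astype(int).ravel()
    return grid[gx, gy, gl].reshape(h, w)
```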
The bilateral-space optimization formulation of Barron et al. [2015] performs bistochastization by calculating two vectors m and n that satisfy the following:

Ŵ = S^T diag(n/m) B diag(n/m) S,   where m = S 1 and n ∘ (B n) = m    (4)
where Ŵ is a bistochastic version of the matrix W. The vectors m and n describe a normalizing transformation required by the solver. Barron and Poole [2016] also perform a variable substitution, transforming the high-dimensional pixel-space optimization problem into one over the lower-dimensional bilateral-space vertices:

x = S^T y    (5)
where y is a small vector of values for each bilateral grid vertex, and x is the large vector of values for each pixel. Equations 3 and 5 allow us to reformulate the pixel-space loss function of Eq. 1 into bilateral space in a quadratic form:

minimize_y   (1/2) y^T A y − b^T y + c    (6)
A = λ (diag(m) − diag(n) B diag(n)) + diag(S c),   b = S (c ∘ t),   c = (1/2) (c ∘ t)^T t
where y is the solution to the problem in bilateral space, m and n are defined by Eq. 4, and t and c are the per-pixel initial solutions and confidences (Eq. 1). The Hadamard (element-wise) product is denoted by ∘. The optimization problem of Eq. 1 is intractably slow to solve naively, but the bilateral-space formulation allows feasible and fast execution. Minimizing Eq. 6 is equivalent to solving a sparse linear system, A y = b, and we can produce a pixel-space solution x̂ by slicing out the solution of the linear system:

x̂ = S^T (A^{−1} b)    (7)
In summary, OBS takes an input image vector and a confidence image, and constructs a simplified bilateral grid from the reference image. With that, it produces the A matrix and b vector of Eq. 6, solves the linear system of Eq. 7, and obtains an output image.
Algorithmic Modifications
Though computationally efficient, OBS as presented has a number of properties that make it difficult to implement in hardware, or even to achieve real-time operation on modern CPU or GPU systems. Vectorizing and parallelizing CPU or GPU processing on the sparse 5D bilateral grid Ŵ encounters memory access patterns that are too irregular to achieve large performance benefits from parallelization. Moreover, the use of second-order global optimization limits the level of parallelism we can extract from the algorithm. We modify OBS to construct a hardware-friendly bilateral solver that addresses these specific challenges: color and sparse memory indexing, and second-order global optimization. Our modifications also allow for an alternative, more efficient initialization and reduced quantization artifacts, which we discuss after formulating our algorithm.
Color and sparse memory indexing. The bilateral solver of Barron and Poole [2016] was designed around a hard bilateral grid or a permutohedral lattice [Adams et al. 2010], meaning that optimization takes place in a "sparse" five-dimensional bilateral space (where the five dimensions are position in x and y, pixel luma, and two pixel chroma values). The resulting 5D grid has an image-dependent "sparsity" that is challenging to exploit in parallel algorithms. Moreover, the connectivity structure of the graph used in the bilateral solver varies as a function of the input, leading to expensive and unpredictable memory access patterns. Attempting to resolve this simply by converting the sparse grid into a "dense" representation of the 5D space requires a prohibitive amount of memory. Instead, HFBS ignores the color of the input image and uses a "dense" 3D bilateral grid [Chen et al. 2007], which makes memory indexing predictable and enables further optimizations. Ignoring color this way induces a small decrease in accuracy, as we will demonstrate.
Second-order global optimization. The numerical optimization in OBS was performed using the preconditioned conjugate gradient method with a Jacobi or Jacobi-like hierarchical preconditioner. Conjugate gradient methods use a global optimization step: at each iteration, updating each variable of the optimization vector requires reasoning about the gradient at all other variables. Such global communication requirements make parallel hardware implementation difficult, as we want to be able to individually update and optimize any variable in our state space via local communication with the "neighboring" variables in our bilateral grid. To avoid global communication, HFBS performs optimization using gradient descent with momentum (i.e., the "Heavy Ball" algorithm), which can be shown to have asymptotic performance similar to that of conjugate gradient [Polyak 1964]. This converts an irregular number of global matrix operations into a regular, but larger, number of local updates that are much easier to execute in parallel.
The Heavy Ball algorithm does not naturally accommodate a preconditioner, so we reformulate our optimization problem with a transformation that indirectly applies a Jacobi preconditioner during optimization. We find that HFBS converges slightly more slowly than the preconditioned conjugate gradient solver of Barron and Poole [2016] and therefore requires roughly twice as many steps for convergence. However, since each step is significantly faster to compute (roughly 4× faster on a CPU and much faster on a GPU or FPGA), we see an overall increase in performance.
Algorithm Formulation
We now formalize the details of HFBS and how it relates to the original bilateral solver. Both OBS and HFBS minimize an optimization problem of the form of Eq. 1. Here, t_i is the low-resolution flow shown in Figure 1b. We derive the confidence image for these low-resolution flow fields by computing normalized sum-of-squared differences. The resulting confidence is larger for areas that are near each other and match well.
We obtain the weight Ŵ_{i,j}, which determines the bilateral-space distance between two pixels i and j, from a bistochastized version of the matrix W, whose elements are calculated via the following:

W_{i,j} = exp( −((p^x_i − p^x_j)² + (p^y_i − p^y_j)²) / (2σ_x²) − (p^l_i − p^l_j)² / (2σ_l²) )    (8)
where each pixel p_i has a spatial position (p^x_i, p^y_i) and luminance p^l_i. While OBS includes color information in Ŵ_{i,j} (Eq. 2), HFBS only considers luminance. The bistochastization step in Barron and Poole [2016] requires 10-20 iterations to achieve low error. To reduce the fixed cost of this step, we use a faster, approximate bistochastization to initialize the bilateral solver. Unlike OBS, which fully bistochastizes W into Ŵ, we construct an approximately bistochastized Ŵ (equivalent to one iteration of bistochastization) that still satisfies the requirements of the bilateral solver (divisions are element-wise):

m_0 = S 1,   n = √((m_0 + ϵ) / (B m_0 + ϵ)),   m_1 = n ∘ (B n)    (9)
In OBS, bistochastization is run to convergence, which produces an n that satisfies m_0 = n ∘ (B n). Partial bistochastization instead treats this equality as an assignment, constructing m_1 to explicitly obey this constraint (Eq. 9). This produces nearly indistinguishable output while being faster and easier to compute. Our normalization also differs from OBS by the use of ϵ ≈ 0.00001 in the construction of n. Adding ϵ to the numerator prevents divide-by-zero later and ensures that empty grid cells do not propagate information during optimization. Adding it to the denominator prevents the addition of ϵ in the numerator from biasing the solution towards 0. Note that the partial bistochastization step of HFBS is not iterative and does not require convergence, and thus is significantly faster than the bistochastization step of OBS.
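A minimal sketch of this partial bistochastization, assuming the Eq. 9 reconstruction above (the exact placement of ϵ is our reading of the text):

```python
import numpy as np

def partial_bistochastize(m0, blur, eps=1e-5):
    """One-shot (non-iterative) bistochastization per our reading of
    Eq. 9. `m0` is S*1 (per-vertex splatted mass) and `blur` applies
    the kernel B to a grid. eps in the numerator avoids later
    divide-by-zero and keeps empty cells inert; eps in the
    denominator avoids biasing the solution toward zero."""
    n = np.sqrt((m0 + eps) / (blur(m0) + eps))
    m1 = n * blur(n)   # treat the fixed-point condition as an assignment
    return n, m1
```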
As described earlier, the expensive per-pixel optimization of Eq. 1 can be reformulated into a much more tractable optimization problem inside a bilateral grid. For convenience, we define By (the product of some grid y with the blur B) as a scaling of and "diffusion" of y:

B y = 6 y + D(y)    (10)
where D is a diffusion operator (which we interchangeably refer to as a matrix and a function) that replaces each element in y with the sum of its neighbors. Because our 3D bilateral grid is dense in memory, this diffusion process is a simple stencil operation. We now perform a variable substitution, as in Eq. 5. For us, this simply requires dividing by the square root of the diagonal of the A matrix (element-wise):

ŷ = ẑ / √(diag(A))    (11)
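Since the grid is dense, D(·) is a plain six-neighbor stencil. A minimal NumPy sketch (treating values beyond the grid boundary as zero is our assumption):

```python
import numpy as np

def diffuse(y):
    """D(y): replace each vertex with the sum of its six face
    neighbors on a dense 3D grid (zeros beyond the boundary)."""
    d = np.zeros_like(y)
    d[1:, :, :] += y[:-1, :, :]   # neighbor at -x
    d[:-1, :, :] += y[1:, :, :]   # neighbor at +x
    d[:, 1:, :] += y[:, :-1, :]   # neighbor at -y
    d[:, :-1, :] += y[:, 1:, :]   # neighbor at +y
    d[:, :, 1:] += y[:, :, :-1]   # neighbor at -luma
    d[:, :, :-1] += y[:, :, 1:]   # neighbor at +luma
    return d
```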
where ẑ is the solution to the substituted problem.
With our variable substitution in place, we can reformulate Eq. 6:

minimize_z   (1/2) z^T Ã z − b̃^T z + c,   where Ã = diag(A)^{−1/2} A diag(A)^{−1/2} and b̃ = diag(A)^{−1/2} b    (12)
Here c is the same as in Eq. 6. Note that the diagonal of Ã is 1, so optimizing this problem without a preconditioner is equivalent to optimizing Eq. 6 with a Jacobi preconditioner. Minimizing this problem requires solving a linear system, undoing our preconditioning variable substitution, and then slicing out a solution:

x̂ = S^T ( ẑ / √(diag(A)) ),   where ẑ = argmin_z (1/2) z^T Ã z − b̃^T z    (13)
We solve this problem using the "Heavy Ball" method (gradient descent with momentum). The problem is fully described by the diffusion operator D(·) and the bilateral grids b̃ and q, where q = √λ · n / √(diag(A)) collects the per-vertex normalization so that Ã z = z − q ∘ D(q ∘ z). Algorithm 1 shows pseudocode describing how optimization is performed.

Algorithm 1 Bilateral-Space Heavy Ball Method
Input: problem description {D(·), b̃, q}, initial state z_init, step size α = 1, momentum β = 0.9, number of iterations n = 256.
Output: state after n iterations, z
1: z ← z_init
2: h ← 0
3: for i = 1 to n do
4:   g ← z − q ∘ D(q ∘ z) − b̃
5:   h ← βh + g
6:   z ← z − αh
7: end for

It can be shown that if the momentum and step size hyperparameters are set correctly, this heavy ball method has the same asymptotic performance as conjugate gradient [Polyak 1964]. Because preconditioning has been absorbed into the problem, performance approaches that of preconditioned conjugate gradient. Since the diffusion operator D(·) is a local stencil, the gradient update to g and the optimization updates to h and z can be performed efficiently (i.e., vectorized, parallelized, etc.).
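For concreteness, here is a minimal NumPy sketch of Algorithm 1; `diffuse` is a stencil like the one sketched earlier, and the gradient line follows our reconstruction of line 4, so treat this as an illustration rather than the paper's exact code.

```python
import numpy as np

def heavy_ball_solve(diffuse, b_tilde, q, z_init,
                     alpha=1.0, beta=0.9, n_iters=256):
    """Bilateral-space Heavy Ball method (Algorithm 1 as reconstructed).
    Each update touches only a vertex and its neighbors through the
    local stencil `diffuse`, so the loop body parallelizes without
    global communication."""
    z = z_init.copy()
    h = np.zeros_like(z)
    for _ in range(n_iters):
        g = z - q * diffuse(q * z) - b_tilde  # gradient of Eq. 12 (our reading)
        h = beta * h + g                      # momentum accumulation
        z = z - alpha * h                     # state update
    return z
```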
Better Initialization to Reduce Optimization Iterations. Our objective function is convex and thus invariant to the initialization z_init, but a better initialization may allow us to converge in fewer iterations. We can achieve this with a simple weighted blur in our bilateral grid.
We compute the initial state as a ratio of blurred grids,

z_init = blur(b̂, σ_b) / blur(ĉ, σ_b)    (14)

where b̂ = S(c ∘ t) and ĉ = S c are the splatted confidence-weighted target and the splatted confidence, and blur(a, σ_b) is a large-support 3D Laplacian blur of a with a scale of σ_b:

blur(a, σ_b) = a ∗ exp( −(|t_x| + |t_y| + |t_z|) / σ_b )    (15)

where t_x, t_y, and t_z are 3D grid coordinates. This can be efficiently implemented as three separable infinite impulse response (IIR) filters (i.e., exponential smoothing, forward and backward) in the three dimensions of the grid. The intuition behind this initialization is that the solution should be close to b̂ where the confidence is large, and smooth where the confidence is small. We found that this initialization can be implemented efficiently on a CPU and roughly halves the number of required iterations.
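One dimension of that separable filter can be sketched as a forward and a backward exponential-smoothing pass whose combined impulse response falls off as exp(−|t|/σ_b); this is a generic IIR sketch, not the paper's implementation.

```python
import numpy as np

def exp_smooth_1d(a, sigma_b):
    """Forward + backward exponential smoothing along the last axis,
    one of the three separable IIR passes. The combined response is
    proportional to exp(-|t| / sigma_b)."""
    w = np.exp(-1.0 / sigma_b)
    fwd = a.astype(float).copy()
    for t in range(1, fwd.shape[-1]):           # forward (causal) pass
        fwd[..., t] += w * fwd[..., t - 1]
    bwd = a.astype(float).copy()
    for t in range(bwd.shape[-1] - 2, -1, -1):  # backward (anti-causal) pass
        bwd[..., t] += w * bwd[..., t + 1]
    return fwd + bwd - a                        # count the center sample once
```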
Reduced Quantization Artifacts. In OBS, slicing can introduce "blocky" quantization artifacts in the output [Barron and Poole 2016]. Removing this quantization requires post-processing, adding to the computational cost of running the bilateral solver. HFBS, however, uses a dense and low-dimensional grayscale bilateral grid, which allows us to efficiently slice out of our bilateral grid using trilinear interpolation. As shown in Chen et al. [2007], this produces smooth results without post-processing. The trilinear interpolation can be done through a weighted slice, where S_tri is analogous to S but with trilinear weights instead of hard "one-hot" assignments:

x̂ = S_tri^T (m_0 ∘ ŷ) / (S_tri^T m_0)    (16)
By performing a weighted slice according to the per-vertex grid occupancy m_0, this process produces artifact-free results in comparison to results from OBS, even if trilinear interpolation is not used in the splatting step. This "soft" slicing is only slightly more expensive than its "hard" equivalent, though both forms can be implemented very efficiently by virtue of being simple gather operations in a dense bilateral grid.
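A hedged sketch of the occupancy-weighted trilinear slice at one pixel's fractional grid coordinate (boundary clamping is omitted for brevity, and the names are ours):

```python
def slice_trilinear(grid, occupancy, gx, gy, gl):
    """'Soft' slice at fractional coordinates (gx, gy, gl): a
    trilinearly weighted gather of occupancy * value, normalized by
    the gathered occupancy m0 (our reading of the weighted slice)."""
    x0, y0, l0 = int(gx), int(gy), int(gl)
    num, den = 0.0, 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dl in (0, 1):
                # Trilinear weight of this corner, in [0, 1].
                wgt = ((1 - abs(gx - (x0 + dx)))
                       * (1 - abs(gy - (y0 + dy)))
                       * (1 - abs(gl - (l0 + dl))))
                m = occupancy[x0 + dx, y0 + dy, l0 + dl]
                num += wgt * m * grid[x0 + dx, y0 + dy, l0 + dl]
                den += wgt * m
    return num / max(den, 1e-10)
```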
HARDWARE ARCHITECTURE
The formulation of HFBS allows for fast bilateral solving on high-performance CPUs or GPUs, but the resulting power consumption may prove prohibitively costly for a full system. FPGA platforms, on the other hand, can deliver fast performance with better power efficiency. This makes them a more suitable target for a system that requires multiple high-performance processors in a single chassis to process 16 camera outputs simultaneously. To demonstrate power and performance efficiency on FPGAs, we co-designed our hardware implementation with the HFBS algorithm. In addition to the algorithmic optimizations, we apply hardware-specific techniques such as customized variable bitwidths and bilateral-space memory partitioning to enable better performance. We first discuss the hardware system at a high level, then our specific design exploration for bitwidth precision and bilateral grid memory layout. Finally, we describe the hardware-software interface of our design and how we integrate the accelerator into an application.
Microarchitectural Design
We focus on executing the inner loop of Algorithm 1 with custom hardware, and maintaining the higher-level control flow in software. In this scheme, a software application splats the optimization problem defined in Section 3 onto a bilateral grid and transfers it to the accelerator for iterative solving. Figure 3 shows a high-level overview of how application functionality is distributed across the system. The figure also illustrates details of our design's microarchitecture. The CPU constructs the bilateral grid based on the input reference image and the initial low-resolution solution provided by prior steps. The transferred data includes b̃, q, and the initial solution z_init shown in Algorithm 1. During transfer, the memory controller of Figure 3 interleaves the data corresponding to each bilateral grid vertex, and partitions the data into memory banks. After the data transfer is complete, a pool of parallel workers iteratively solves the optimization problem by running the loop of Algorithm 1. After some number of iterations (we chose 256 iterations for our experiments to ensure convergence), the CPU reads back the final solution and slices it into a 2D result.
Each worker (shown in Figure 4) performs the inner loop of Algorithm 1 on one grid cell. It computes its result by streaming in the data from the neighboring cells (required for the "blur" operation), as well as the normalization factors required for the optimization process. The workers compute their local stencil operations synchronously, interfacing primarily with an assigned memory bank and occasionally with neighboring memory banks to access grid blocks that may be stored across banks. Because each worker executes in lockstep, there are no memory collisions when accessing data in other banks. Figure 3 demonstrates how multiplexers, managed by the main controller, shepherd access to neighboring banks. As we scale the number of workers, we find that parallelism introduces only a 1% reduction in speedup relative to perfect linear scaling. This near-linear scaling can be attributed entirely to the use of the "Heavy Ball" algorithm in HFBS, which allows our design to use only local-neighbor communication rather than global synchronization after each iteration.

Figure 5: MSE of fixed-point implementations at varied fractional widths, for different bitwidths. MSE is reported relative to 32-bit floating-point. We chose a configuration with 31 bits of fractional precision to reduce the chance of overflow in the integer portion.
Fixed-point Conversion
To improve resource utilization, we converted the algorithm from floating-point to fixed-point number representation. We first implemented our workers using single-precision floating-point, like our CPU and GPU implementations. We found that the large number of digital signal processing units (DSPs) required for a single floating-point multiplier prohibitively limited the number of workers we could employ, and consequently, the amount of parallelism. Converting FPGA designs from floating-point to fixed-point number representation resolves this by reducing the resource requirements of hardware multipliers. Using the cheaper fixed-point multiplier, however, required us to evaluate three competing tradeoffs: (1) the bitwidth of our fixed-point numbers, (2) the precision at a given bitwidth for the integer and fractional portions of the number, and (3) the convergence of the solver. If fewer than 12 bits were used for the integer portion, the bilateral grid data would quickly populate with overflow values. If fewer than 24 bits were used for the fractional component, the bilateral solver would not converge, because grid vertices would not have enough precision to capture the change in a value after blurring. These constraints prevented us from using 32-bit fixed-point numbers, as highlighted by the high mean squared error (MSE) shown in Figure 5 across integer-fraction-ratio configurations. We set a maximum error threshold of ∼0.00001, because any errors exceeding that precision eliminate the positive benefits of using an ϵ-value to reduce zero-propagation. Using 64-bit fixed-point numbers resulted in very low MSE but, as seen in Table 1, required 16 DSPs per worker, limiting the number of parallel workers we could deploy with these configurations.
As a compromise, we evaluated a 47-bit number representation that is more accurate than 32-bit fixed-point while using 75% fewer DSPs than 64-bit fixed-point. To maintain some of the precision of 64-bit numbers during non-multiplier arithmetic, we chose a 64-bit fixed-point representation with 15 bits of integer precision and 48 bits of fractional precision, and cast it to and from 47 bits for multiply operations only. Before multiplying two 64-bit numbers, we round off the bottom bits of each number, resulting in the 1-bit sign, 15-bit integer, 31-bit fraction number highlighted in Figure 5. We zero-extend the resulting 47-bit output back to a 64-bit number for the rest of the computation. This fixed-point configuration has an MSE of 3.17 × 10⁻⁷ compared to the floating-point implementation, resulting in negligible accuracy loss at the solver output, and the solver converges in the same number of iterations.
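The cast-round-multiply-extend sequence can be sketched with Python integers standing in for the hardware registers. The stated widths (a 48-bit storage fraction narrowed to a 31-bit multiplier fraction) imply a 17-bit round-off, which we use here; the round-to-nearest detail is our assumption.

```python
FRAC_FULL = 48  # fractional bits in the Q15.48 storage format
FRAC_MUL = 31   # fractional bits in the 1.15.31 multiplier format

def to_fixed(x):
    """Encode a float into the Q15.48 storage format."""
    return int(round(x * (1 << FRAC_FULL)))

def fixed_mul(a, b):
    """Multiply two Q15.48 values through the narrowed 47-bit multiplier."""
    shift = FRAC_FULL - FRAC_MUL              # bits rounded off per operand
    a47 = (a + (1 << (shift - 1))) >> shift   # round to the 1.15.31 format
    b47 = (b + (1 << (shift - 1))) >> shift
    prod = a47 * b47                          # product carries 2*31 frac bits
    return prod >> (2 * FRAC_MUL - FRAC_FULL)  # renormalize to Q15.48
```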
On-Chip Bilateral Grid Memory Layout
To take advantage of block RAM distribution on the FPGA, we partitioned the bilateral grid into chunks along different dimensions, and dedicated grid workers to each partition. For large, finely divided grids with many vertices (the largest grids we consider have up to 5 million vertices), we could achieve full resource utilization simply by partitioning the grid along one dimension and allocating a single worker to process each memory bank. For coarser grids, we partitioned the memory along multiple dimensions.
Our method for laying out data in memory consists of storing all the data needed for a grid vertex in a single packet, and writing the packets sequentially in memory. Rather than storing multiple bilateral grid data structures separately and repeatedly indexing into each of them to process a single vertex, we interleave the data structures so that all the information for processing a grid vertex can be accessed as a single packet. When a worker is assigned a grid vertex to process, it can fetch most of the data required for its computation from a single partition, including neighboring vertex data for some dimensions. For large grids, where we partition along only one dimension, the data for two of the three dimensions is stored in the same memory bank, and the worker only has to communicate across banks for the two neighbors in other partitions. For smaller grids, where we partition along multiple dimensions to improve parallelism, workers may need to fetch more of their neighbors from neighboring partitions. All inter-bank communication is handled via the main controller of Figure 3.
To aid in fetching grid vertex data for a worker's vertex or its neighbors, we abstracted this memory layout into a simple addressing method: we dedicate ⌈log₂(k)⌉ bits of address space for each grid dimension of size k, and use the last three bits to index into the packet for a grid vertex. For instance, with a bilateral grid of shape (247, 166, 16) partitioned along the first dimension only, a worker assigned the address 0b 00001010 10100001 0100 001 would map the first dimension's value to memory bank 10, and use the second and third dimensions to fetch the second item in the packet for grid vertex (10, 161, 4). Indexing into a neighboring vertex in any dimension means incrementing or decrementing that dimension's tag; the main controller detects when a worker requests an address in a neighboring bank and multiplexes the request appropriately. This discrete mapping of grid dimensions to address spaces results in simple logic for memory addressing, but at the cost of wasted memory space. Each grid vertex packet contains five items but requires the memory space of eight. The same is true at the grid partition level, since the number of grid vertices along a dimension is a function of the image resolution and the σ_x or σ_l, and does not often fit nicely into power-of-two partitions.
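This addressing scheme is easy to sketch; the following reproduces the worked example from the text (the helper name is ours).

```python
import math

def pack_address(coords, shape, item):
    """Pack grid coordinates and a packet item index into a flat
    address: ceil(log2(k)) bits per dimension of size k, plus three
    low bits selecting an item within the vertex packet."""
    addr = 0
    for c, k in zip(coords, shape):
        addr = (addr << math.ceil(math.log2(k))) | c
    return (addr << 3) | item

# The worker address from the text: vertex (10, 161, 4), second item
# (index 1), in a (247, 166, 16) grid partitioned on dimension 0.
addr = pack_address((10, 161, 4), (247, 166, 16), 1)
assert addr == 0b00001010_10100001_0100_001
```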
FPGA Implementation
We implemented a maximal design in Verilog and wrapped the accelerator in an AXI4-Stream compliant interface for portable deployment across Xilinx FPGAs. We can fit a maximum of 1,710 workers on a Xilinx Virtex UltraScale+ device. We detail the resource utilization of our maximal design in Table 2.

Table 2: Resource utilization of our maximal design.

Model                 Logic   RAM   DSP    Clock (MHz)
Virtex UltraScale+    44%     99%   100%   250
To invoke the bilateral solver accelerator in an application, the application sends a bytestream of bilateral grid data over the FPGA's PCIe-to-AXI DMA interface. The FPGA's driver can be invoked with standard Unix I/O system calls like read() and write(), and can thus be integrated with software applications written in any high-level language.
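Invoking the accelerator from software might look like the following sketch; the device path and framing are hypothetical, as the paper specifies only that standard read()/write() calls are used.

```python
import os

# Hypothetical sketch: stream a packed bilateral grid to the
# accelerator through its DMA character device and read back the
# solved grid. The device node below is an assumption.
DEVICE = "/dev/xdma0"  # assumed device node, not from the paper

def solve_on_fpga(grid_bytes: bytes) -> bytes:
    fd = os.open(DEVICE, os.O_RDWR)
    try:
        os.write(fd, grid_bytes)             # transfer b~, q, z_init
        return os.read(fd, len(grid_bytes))  # read back the solved state
    finally:
        os.close(fd)
```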
At configuration time, we fix the number of grid workers, the memory size, and the partitioning based on a chosen set of parameters for image resolution and bandwidths in the luma and spatial dimensions. These parameters essentially define the maximum memory size and number of partitions, which can be interpreted as the upper bound on grid sizes that can be run under a given configuration.
The bilateral grid dimensions and number of iterations are software-defined at program runtime. This level of flexibility in our design allows applications to process images of varied resolutions at varied grid bandwidths, but can result in wasted resources if the grid being processed is much smaller than the accelerator's configured grid size.
EXPERIMENTAL RESULTS
We designed HFBS with the goal of improving bilateral solver performance by parallelizing on modern hardware, while maintaining comparable visual accuracy. In this section, we evaluate the performance of our algorithm and hardware, including runtime comparisons, power consumption measurements, and accuracy evaluation.
Methodology
We compare our algorithm across CPU, GPU, and FPGA implementations. The CPU is a Xeon E5-2620 with six cores, and the GPU is an NVIDIA GTX 1080 Ti. Both platforms execute optimized implementations of the kernels written and tuned with Halide [Ragan-Kelley et al. 2013]. We prototyped our FPGA design on a Xilinx Kintex-7 connected to a host CPU over PCIe to evaluate host-device memory traffic, and synthesized and simulated a target evaluation design for the Xilinx Virtex UltraScale+ VU9P to evaluate full-resolution frame processing. In this section we report results only from the Virtex UltraScale+ design.
To benchmark our algorithm, we execute the bilateral solver on flow fields and confidences generated from the ten training images in the Middlebury stereo dataset [Scharstein and Szeliski 2002], and evaluate runtime and accuracy. We compare runtimes for our algorithm on CPU, GPU, and FPGA against the bilateral solver of Barron and Poole [2016] on CPU as the baseline. For the CPU and GPU implementations, we report the average runtime over 8 trials; the FPGA runtime is deterministic and did not vary across trials. We characterize power consumption for the CPU and GPU by measuring utilization and scaling from the reported device power. For the FPGA, we report estimated power consumption from Xilinx Vivado's power report. The size of the bilateral grid data ranges from 4 KB to 1.8 GB, depending on the σ_x used to construct the grid. All results use σ_l = 16. We use 256 iterations of optimization in all cases, more than enough to guarantee convergence for all algorithms and implementations. Note that our performance comparison is not at iso-quality, as our algorithm has slightly more error but qualitatively similar results, which we discuss further in Section 5.3. All computation is single-precision floating-point, except for bilateral solving on the FPGA, which is conducted with 64-bit fixed-point numbers. We observe transfer throughput for the FPGA over a single PCIe channel to range between 9.6 and 11.3 Gbps, which is in keeping with reported estimates. Since both the GPU and FPGA communicate with the host over PCIe and we assume frames can be pipelined, we omit the transfer time between the host processor and the device from reported runtimes.

Table 4: Results on the depth superresolution task of Ferstl et al. [2013].

Algorithm                                      Error (MSE)   Runtime (sec)
Chan et al. [2008]                             3.89          3.02
Min et al. [2014]                              3.78          0.383
Domain Transform [Gastal and Oliveira 2011]    3.60          0.021
Ma et al. [2013]                               3.53          18
Zhang et al. [2014]                            3.51          1.346
Guided Filter (Matlab) [He et al. 2010]        3.51          0.434
Fast Guided Filter [He and Sun 2015]           3.45          0.225
Yang [2015]                                    3.44          0.304
Farbman et al. [2008]                          3.24          6.11
JBU [Adams et al. 2010; Kopf et al. 2007]      3.19          1.98
Barron and Poole [2016]                        2.75          0.234
Our Model                                      3.27          0.047 ± 0.002

Runtime Results

Figure 6 plots the runtime results of the bilateral solver implementations on all our benchmarks, as a function of the spatial bandwidth. Figure 6a shows the runtime for the optimization portion of the solver, and Figure 6b shows the runtime for the complete bilateral solver including pre-processing. As we increase the spatial bandwidth (σ_x of Eq. 8), the overall grid size decreases, and runtimes shorten for all implementations. We find that our algorithm outperforms the baseline on all platforms at all spatial bandwidths. For optimization alone, the CPU and FPGA results scale with the grid size, while the GPU results scale until the grid is too small to fully utilize its resources. Because splatting and slicing are not accelerated on the FPGA, the runtime for the entire bilateral solver does not scale as well at large grid sizes. Table 3 highlights the runtime results specifically for the VR video use case, where σ_x = 12 as in Anderson et al. [2016], as well as the power consumption of each hardware configuration. Our algorithm outperforms the CPU baseline on all platforms evaluated, and our FPGA accelerator is significantly faster than the baseline while also reducing power consumption. Note that the CPU-only HFBS runtime reported in Table 3 is still far from the real-time requirement of 30 frames per second. The GPU and FPGA implementations come very close to real time at σ_x = 12, but still fall short. By selecting σ_x = 32 and sacrificing some accuracy, both the FPGA and GPU implementations meet the real-time requirement.
We also observe that HFBS significantly reduces pre-processing time. This is mainly due to the elimination of the Jacobi preconditioner. The switch to a dense 3D bilateral grid also improves the available parallelism in the splat and slice routines.
Depth Superresolution
Because our proposed model is an approximation to the bilateral solver, we should expect some drop in the quality of our output relative to that of Barron and Poole [2016]. To quantify this drop in accuracy, we evaluate on the depth superresolution task of Ferstl et al. [2013], which was the primary evaluation used in Barron and Poole [2016]. We evaluate using the same experimental setup and the same hyperparameters as Barron and Poole [2016] (σ_x = 8, σ_l = 4), and report MSE with respect to ground truth from the Middlebury stereo dataset [Scharstein and Szeliski 2002].
As can be seen in Table 4, our model produces slightly higher error than that of Barron and Poole [2016], but with a significantly lower runtime (here we report runtime on an NVIDIA 1080 Ti). This increase in error is due to the fact that our model ignores color in the input image, and so has difficulty distinguishing between pixels with different chroma but similar luma. The images in this task are unusually colorful and "cartoonish" by virtue of belonging to a constructed vision task, so this increase in error represents an upper bound on the increased error we expect to see in natural scenes. Even with this increase in error, we see that HFBS is significantly faster than all more-accurate techniques, and significantly more accurate than all faster techniques. We present qualitative results for this task in Figure 7. As discussed in Section 3.2, HFBS requires double the iterations to achieve the same accuracy level as OBS, but each iteration is cheap enough that it still performs significantly faster overall. We can see that our output depths are qualitatively very similar to those of Barron and Poole [2016], as expected.

Figure 7: Qualitative results on the depth superresolution task of Ferstl et al. [2013]. HFBS produces similar output to Barron and Poole [2016] and is significantly faster.
DISCUSSION
There are a number of optimizations our hardware design could integrate for improved performance. Nevertheless, we observe that our design can be practically deployed either at the camera node or in the cloud to enable real-time VR video rendering.
Accelerator optimizations. There are many opportunities to further optimize our hardware design. For instance, our design only accelerates the iterative optimization portion of HFBS. Integrating the splat and slice operations into our accelerator, as in Rithe et al. [2013], would reduce transfer costs from GB-scale bilateral grids to MB-scale images and further reduce runtimes. Also, many vertices of the dense bilateral grid begin as zeros and do not need to be processed; intelligently skipping these zero-valued grid vertices could reduce wasted computation and potentially improve runtime. Similarly, the wasted memory space from our addressing scheme could be mitigated with additional control logic, which may allow us to maximize the bilateral grid size.
System specifications for real-time VR video processing platforms. While our design can execute bilateral solving under real-time constraints, the bilateral solver is just one step in the Jump VR video rendering pipeline. Moreover, the design we present processes the flow field from one camera pair, while the VR video capture system we target processes 16 flow fields from a 16-camera rig. We outline the specifications and cost of a system that could process the full 16-camera input to produce virtual reality video in real time in Table 5. The monetary cost of deploying such a many-FPGA system in either configuration is high, but the power consumption of our FPGA-based system, with 16 high-end fully utilized FPGAs, is approximately that of two GPUs. Such power savings can be critical for mobile camera rigs. At the data center level, power constraints are less stringent, but deploying custom hardware for high-bandwidth tasks can still reduce power consumption and operating costs.
CONCLUSIONS
The hardware-friendly bilateral solver enables scalable, real-time processing of VR video on modern hardware. We explore a hardware-software codesign approach to construct an algorithm that is both faster and more accurate than prior work, optimizing the algorithm details and hardware implementation together. In particular, HFBS uses a bilateral-space Heavy Ball algorithm and a dense 3D bilateral grid that allows fast and predictable memory accesses. We also design an FPGA accelerator for HFBS using reduced-precision fixed-point numbers and a customized memory layout. Our CPU, GPU, and FPGA implementations of HFBS are 4×, 32×, and 50× faster than the original bilateral solver, respectively. We observe that our FPGA accelerator is more energy-efficient than comparable CPU and GPU implementations, and can be practically deployed either at the camera node or in the cloud to enable real-time VR video rendering.
