360° panoramic video provides an immersive Virtual Reality experience. However, rendering 360° videos consumes excessive energy on client devices. An FPGA is an ideal offloading target to improve energy efficiency. However, a naive implementation of the processing algorithm would lead to an excessive memory footprint that offsets the energy benefit. In this paper, we propose an algorithm-architecture co-designed system that dramatically reduces the on-chip memory requirement of VR video processing to enable FPGA offloading. Evaluation shows that our system achieves significant energy reduction with no loss of performance compared to today's off-the-shelf VR video rendering system.
Fueled by next-generation cellular technologies such as millimeter wave that promise orders of magnitude higher bandwidth and lower latency, users will soon be able to create, share, and watch 360° videos just like any other media. Despite these opportunities, rendering 360° content is power-hungry. Previous work has shown that rendering a 720p 360° video at 30 frames per second (FPS) consumes over 4 W of power [18, 19], exceeding the Thermal Design Point (TDP) of typical mobile devices [14]. The reason is that today's rendering software translates 360° video rendering to a texture mapping problem that gets offloaded to the GPU [9]. Although using the GPU in off-the-shelf Systems-on-a-chip (SoCs) accelerates the adoption of 360° content, GPUs are power-hungry.
We expect that next-generation mobile SoCs will soon integrate 360° content-specific Intellectual Property (IP) blocks to improve rendering energy efficiency. To facilitate this trend, we propose a new 360° video rendering accelerator. We choose to base the accelerator design on an FPGA, which not only allows us to exploit the fine-grained, pixel-level parallelism that exists in 360° content rendering algorithms, but also retains the flexibility to accommodate future developments in the rendering algorithms.
The key challenge of accelerating 360° rendering is the rendering algorithm's large memory footprint and irregular data access pattern. The footprint exceeds the on-chip memory of a mobile SoC, and the irregular accesses are not amenable to conventional memory optimizations such as line-buffering and prefetching. As a result, frequent and random DRAM accesses would have to be made, offsetting the energy benefit of hardware acceleration.
This paper proposes an algorithm-architecture co-designed system for efficient 360° content rendering. Our key observation is that the irregular memory accesses in today's rendering algorithm are fundamentally caused by the algorithm's data-flow, which leads to arbitrary indexing of the input frame pixels. To tame the memory inefficiencies, we propose a new rendering algorithm that, by design, enforces a different data-flow that guarantees a streaming memory access pattern. As a result, the rendering computation becomes a stream of stencil operations, each operating on a fixed-size window of pixels in raster order.
The new rendering algorithm uniquely enables us to design a simple, yet efficient, hardware accelerator. The accelerator architecture exploits the pixel-level parallelism by pipelining the rendering of different pixels, and hides the memory transfer latency behind rendering computations. We judiciously apply a series of energy-oriented optimizations, including trading off the pipeline depth for the overall latency and tuning pixel data representations.
We implement our algorithm-architecture co-designed system on the Xilinx Zynq UltraScale+ ZCU104 FPGA development board [7]. Compared against an off-the-shelf baseline running on the Nvidia Jetson TX2 development board utilizing its mobile Pascal GPU [2], our system achieves 55% energy savings at the same frame rate.
In summary, this paper makes the following contributions:
• We provide a detailed analysis of the memory access patterns of today's 360° content rendering algorithm, and demonstrate the inefficiencies of conventional memory optimizations such as line-buffering (§ 3).
• We propose a new 360° content rendering algorithm that improves the data locality of 360° content rendering, and thus enables efficient hardware acceleration (§ 4.1).
• We propose an accelerator architecture that is co-designed with the new algorithm to maximize its efficiency (§ 4.2).
• We prototype the co-designed system on an embedded FPGA and demonstrate significant energy savings (§ 5).
The rest of the paper is organized as follows. We first provide the background and related work of 360° content rendering (§ 2). We then analyze today's rendering algorithm (§ 3), focusing on its irregular memory accesses. We then introduce our new rendering algorithm and the co-designed hardware architecture (§ 4). We evaluate our system (§ 5), followed by the conclusion (§ 6).
Background
This section briefly introduces the necessary background and terminology used throughout this paper. We refer interested readers to El-Ganainy and Hefeeda [12] for a comprehensive survey.
Panoramic Content 360° video is a form of Virtual Reality that has recently seen wide adoption in many areas such as news, movies, sports, and the medical industry [21]. Fig. 1 shows an end-to-end 360° content delivery pipeline. It mainly consists of a capturing (creation) phase and a playback phase. 360° videos are typically created using special capture devices (e.g., an omnidirectional camera [8]) that capture every direction of the scene; the captured views are later stitched together to form panoramic frames that present a 360° view of the scene to users. After creation, 360° videos are streamed to the client VR device for playback, which is the focus of this paper.
Rendering Algorithm Once on the VR device, the playback software renders different regions of the frame according to the user's head movement. In the context of 360° video, only rotational motion is captured, not translational motion. The head motion can be characterized by the polar and the azimuthal angles in a spherical coordinate system. The size of the displayed region depends on the field-of-view (FOV) of a particular device, which characterizes the vertical and horizontal angles of the viewing area.
Algo. 1 (excerpt):
    ...;
    // projection
    <u, v> = P(x, y, z);
    // filtering
    if u and v are integer coordinates then ...
Conventional video frames, once decoded, can be directly rendered on the display. For 360° videos, however, the client rendering software converts an input panoramic frame (decoded from the video streamed from the cloud) to a frame that contains only the user's viewing area, based on the user's viewing angle and the device's FOV. Prior work shows that the rendering algorithm contributes about 40% of the device power consumption [19].
Hardware Architecture Today's off-the-shelf mobile SoCs directly support rendering 360° videos. In particular, the video codec first decodes 360° videos into a set of panoramic planar frames; the Graphics Processing Unit (GPU) is then used to execute the rendering algorithm that converts the panoramic planar frames to FOV frames that are then sent to the display processor [9].
The reason that the GPU is tasked with the rendering algorithm is that the latter can be viewed as a texture mapping problem [15], where the planar panoramic frame is treated as a texture map that is mapped to a particular region on a sphere. The spherical region's size is the same as the device's FOV and its location is determined by the user's viewing angle. Modern GPUs can effectively execute texture mapping through the specialized Texture Mapping Unit (TMU) and the texture cache [13]. The TMU accelerates the computation of texture mapping, and the texture cache captures the irregular data access pattern to the texture map, i.e., the input panoramic frame in the case of 360° video rendering.
Rendering Algorithm Analysis
This section first presents the algorithm used in today's 360° rendering software and discusses its computation pattern, which is suitable for hardware acceleration (§ 3.1). We then focus on the irregular memory access patterns of the algorithm (§ 3.2), from which we motivate the need for a new algorithm-architecture co-designed strategy.
Algorithm and Its Computation Patterns
The goal of the rendering algorithm is to generate an output frame I_out from the input panoramic frame I_in. Algo. 1 shows the pseudocode. Specifically, the rendering algorithm calculates each output frame pixel (<i, j>) by mapping it to a pixel in the input frame (<u, v>), effectively sampling the input frame. The mapping is done by raster scanning all the points in the output frame and iteratively applying two operations, rotation (R) and projection (P), on each <i, j> point to obtain its corresponding <u, v> coordinates. R and P are matrix multiplications and Cartesian-spherical conversions that support perspective rotation and projection [20]. The renderer then uses the <u, v> coordinates to look up the input frame, and returns the corresponding input pixel as I_out<i, j>. If <u, v> are not integer coordinates, the renderer applies a so-called filtering function (F), such as nearest neighbor or bilinear filtering [15], to return a "best approximation" of the pixel value at I_out<i, j>.
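The data-flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes an equirectangular input frame, a pinhole output view, a head orientation given as hypothetical `yaw` and `pitch` angles in radians, and nearest-neighbor filtering. The function names and the exact forms of R and P are assumptions made for the example.

```python
import math

def rotate(x, y, z, yaw, pitch):
    """R: tilt a view-space ray by pitch (about x), then turn it by yaw (about y)."""
    y, z = (y * math.cos(pitch) + z * math.sin(pitch),
            -y * math.sin(pitch) + z * math.cos(pitch))
    return (x * math.cos(yaw) + z * math.sin(yaw),
            y,
            -x * math.sin(yaw) + z * math.cos(yaw))

def project(x, y, z, h_in, w_in):
    """P: Cartesian ray -> spherical angles -> equirectangular (row u, column v)."""
    lat = math.atan2(y, math.hypot(x, z))   # +pi/2 at the North Pole
    lon = math.atan2(x, z)                  # 0 straight ahead
    u = (0.5 - lat / math.pi) * (h_in - 1)
    v = (lon / (2 * math.pi) + 0.5) * (w_in - 1)
    return u, v

def render_forward(frame_in, h_out, w_out, yaw, pitch, fov):
    """Raster-scan the OUTPUT frame and sample the input frame (a gather)."""
    h_in, w_in = len(frame_in), len(frame_in[0])
    f = (w_out / 2) / math.tan(fov / 2)     # assumed pinhole focal length, in pixels
    cy, cx = (h_out - 1) / 2, (w_out - 1) / 2
    out = [[None] * w_out for _ in range(h_out)]
    for i in range(h_out):
        for j in range(w_out):
            u, v = project(*rotate(j - cx, cy - i, f, yaw, pitch), h_in, w_in)
            # F: nearest-neighbor filtering, clamped to the frame
            ui = min(h_in - 1, max(0, round(u)))
            vi = min(w_in - 1, max(0, round(v)))
            out[i][j] = frame_in[ui][vi]    # arbitrary (u, v): an irregular read
    return out
```

Note that the write to `out[i][j]` follows raster order, while the read from `frame_in[ui][vi]` lands wherever R and P send it.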
The rendering algorithm is highly parallel. In particular, the rendering of each output pixel is completely independent of all the others. Under a particular head orientation, an output pixel's value I_out(i, j) depends only on its coordinates <i, j>. In addition, the computations involved in R, P, and F are mostly affine transformations that are suitable for efficient hardware implementations. Overall, the computation pattern is ideal for hardware acceleration.
Memory Access Patterns
In stark contrast to the computation pattern, the memory access pattern of the rendering algorithm is far from ideal for an efficient accelerator design, especially on FPGAs.
Large Footprint The rendering algorithm accesses the memory in the filtering step, which uses the <u, v> coordinates generated from the projection step to index into the input frame (I_in), and sequentially writes to the output frame (I_out) in raster order. The output frame is small in size; its accesses are sequential, and thus could be buffered on-chip and efficiently streamed to the DRAM in the end [22]. However, the input frame cannot be fully captured by a typical on-chip memory. For instance, a 1080p and a 4K frame would require over 5.9 MB and 23.7 MB of memory, respectively.
Irregularity The pixel access pattern of the input frame is non-sequential, which severely hurts the efficiency of hardware acceleration. Fig. 2a shows a rendering example where the input frame on the left is rendered to the output frame on the right. In this example, the FOV size is 110° × 110°, and the head orientation is (45°, 90°). The black pixels at the top of the input frame refer to all the pixels that are accessed by the rendering algorithm. To better illustrate the memory access pattern, Fig. 3a plots the input pixels that are accessed as the rendering algorithm iterates over three output frame rows, which are shown in Fig. 3b. Each <x, y> marker in the figures indicates that the pixel at position <x, y> is referenced.
Footnote 1: All figures here are down-sampled to reduce their sizes.
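The input-frame sizes quoted under Large Footprint can be reproduced with a short calculation. The sketch below assumes 3 bytes per pixel (24-bit color), frame sizes of 1920 × 1080 and 3840 × 2160, and 1 MB = 2^20 bytes; none of these are stated in the text, but they are the values that match the quoted figures.

```python
MIB = 1024 * 1024  # assumed meaning of "MB" in the text

def frame_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed frame, assuming 24-bit color."""
    return width * height * bytes_per_pixel

for name, (w, h) in {"1080p": (1920, 1080), "4K": (3840, 2160)}.items():
    print(f"{name}: {frame_bytes(w, h) / MIB:.2f} MiB")
```

Under these assumptions, both sizes are well above the 1.38 MB of on-chip BRAM on the evaluation board described in § 5.1.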
Although the output frame pixels are accessed sequentially in a streaming fashion, the input frame pixels are referenced irregularly, as is evident in the three access traces in Fig. 3a. Irregular memory accesses are known to hurt efficiency for two reasons. First, random DRAM accesses consume much more energy than sequential DRAM accesses [6, 17]. Second, irregular memory accesses require explicit control logic to manually coordinate the traffic between DRAM and on-chip memory [10]. This is particularly an issue for FPGAs, which implement control-flow logic rather inefficiently. Ideally, FPGAs prefer streaming data accesses, which exhibit strong locality and can be effectively captured by memory optimizations such as line-buffering [16]. The non-raster access order of the input pixels indicates that a line-buffer would be inefficient.
To quantify the ineffectiveness of using a line-buffer, Fig. 4 shows how the line-buffer efficiency (left y-axis) and hit rate (right y-axis) vary with the line-buffer capacity when rendering the frame in Fig. 2a. The efficiency is defined as the percentage of pixels brought into the line-buffer that are actually referenced; the miss rate (the complement of the hit rate) is defined as the fraction of memory references that are not found in the line-buffer and thereby cause a pipeline stall; the line-buffer capacity is characterized by the number of lines of the input frame that the line-buffer can hold. The rendering algorithm requires about 512 lines, which roughly equates to almost 4 MB of line-buffer capacity, to reach a 100% hit rate. Even with such a large line-buffer, over 50% of the fetched pixels would never be referenced, leading to significant bandwidth waste and energy inefficiency.
Variation The irregular memory access pattern varies both spatially and temporally, making static optimizations ineffective. On one hand, the rendering algorithm exhibits different input access patterns when iterating over different output rows, as shown in Fig. 3a, exhibiting spatial variance. On the other hand, although the input access pattern is deterministic given a particular head orientation, the access pattern changes across different head orientations as users move, exhibiting temporal variance. Fig. 2b illustrates the memory accesses for a different head orientation at (45°, 90°), which has a different pattern from that of (45°, 45°) shown in Fig. 2a. Since the head orientation is not known until runtime, pre-computing and memoizing the access streams for all possible head orientations would lead to prohibitive memory overhead.
Algorithm-Hardware Co-Design
We propose a new 360° content rendering algorithm, which streamlines the memory accesses while retaining the pixel-level parallelism, enabling practical FPGA acceleration. We first describe the algorithm (§ 4.1), and then describe the co-designed hardware architecture and the implementation details (§ 4.2).
Algorithm
Overview Fundamentally, the root cause of the irregular memory accesses in the original rendering algorithm is inherent in the algorithm's data-flow. In particular, the rendering algorithm calculates each output frame pixel by mapping it to a pixel in the input frame. Since the input pixel indexing is arbitrary, memory accesses to the input frame are irregular. Our idea is to invert the rendering algorithm such that it scans the input frame in raster order, and maps each input pixel to one pixel in the output frame. In this way, the input frame is accessed in a streaming fashion, enabling efficient line-buffer optimizations. The trade-off is that the output frame is now accessed in an arbitrary order. However, this is an acceptable trade-off because the output frame is small in size and could be buffered on-chip before streaming out.
Inverting the original algorithm is possible because the rotation function (R) and the projection function (P) are one-to-one mappings and thus are naturally invertible. The filtering function (F) is not invertible because it is not one-to-one. Consider simple nearest-neighbor filtering: multiple input points could be mapped to the same output point that is the nearest neighbor to both input points. However, since filtering is inherently an approximation, we could approximate the filtering step without loss of visual quality, as we will demonstrate later.

Algo. 2:
    // Input resolution: H_i × W_i
    /* iterate over all output boundary coordinates */
    for <i, j> coordinates on the I_out boundary do
        <u, v> = P(R(i, j, α, β, θ, λ));
        Add <u, v> to B;    // B is the input boundary set
    end
    /* iterate over all input pixels */
    for u = 0; u < H_i; u = u + 1 do
        for v = 0; v < W_i; v = v + 1 do
            if <u, v> within the boundary B then
                <i, j> = R^-1(P^-1(u, v), α, β, θ, λ);
                I_out(i, j) = I_in(u, v);
            end
        end
    end
    /* apply filtering to all output pixels */
    foreach <i, j> in I_out do
        F(<i, j>);
    end
Reduce Redundancies Naively inverting the rendering algorithm, however, introduces lots of redundant computation. This is because only a small fraction of the pixels in the input frames is actually referenced during the rendering process. For instance, only 17.1% and 16.5% of the input pixels are referenced in Fig. 2a and Fig. 2b , respectively. In other words, the vast majority of the input pixels will not be needed to generate any output frame pixels, and therefore inversely mapping them would waste computation.
To reduce the redundant computations, our idea is to determine the boundary of the input region that contains the pixels that are needed for rendering. In this way, we are able to apply the inverse mapping only to the input pixels that are within the boundary. Input boundary calculation can be easily achieved by applying the original rendering algorithm to the output boundary coordinates. Since boundary pixels are only a very small portion of the entire frame, boundary calculation has low overhead as we show later.
Algo. 2 shows the pseudocode of the new algorithm. It first applies the original rotation and projection functions R and P to generate a boundary set B. It then iterates over all the input pixels, and applies the inverse functions R^-1 and P^-1 to pixels that are within the boundary delineated by B. In the end, it applies a filtering step to the entire output image. Note that this filtering function is not, and need not be, the same as the original filtering function F or its inverse F^-1, due to the approximate nature of filtering.
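The three loops of Algo. 2 can be sketched in Python as follows. This is an illustrative sketch under several assumptions that are not in the text: an equirectangular input, a pinhole output view, hypothetical `yaw`/`pitch`/`fov` parameters in radians, a rectangular stand-in for the "within the boundary B" test, and a simple hole-filling pass standing in for the final filtering step.

```python
import math

def rotate(x, y, z, yaw, pitch):             # R (assumed convention)
    y, z = (y * math.cos(pitch) + z * math.sin(pitch),
            -y * math.sin(pitch) + z * math.cos(pitch))
    return (x * math.cos(yaw) + z * math.sin(yaw), y,
            -x * math.sin(yaw) + z * math.cos(yaw))

def unrotate(x, y, z, yaw, pitch):           # R^-1: undo yaw, then pitch
    x, z = (x * math.cos(yaw) - z * math.sin(yaw),
            x * math.sin(yaw) + z * math.cos(yaw))
    return (x, y * math.cos(pitch) - z * math.sin(pitch),
            y * math.sin(pitch) + z * math.cos(pitch))

def project(x, y, z, h_in, w_in):            # P: ray -> (row u, column v)
    lat, lon = math.atan2(y, math.hypot(x, z)), math.atan2(x, z)
    return ((0.5 - lat / math.pi) * (h_in - 1),
            (lon / (2 * math.pi) + 0.5) * (w_in - 1))

def unproject(u, v, h_in, w_in):             # P^-1: (u, v) -> unit ray
    lat = (0.5 - u / (h_in - 1)) * math.pi
    lon = (v / (w_in - 1) - 0.5) * 2 * math.pi
    return (math.cos(lat) * math.sin(lon), math.sin(lat),
            math.cos(lat) * math.cos(lon))

def render_inverse(frame_in, h_out, w_out, yaw, pitch, fov):
    h_in, w_in = len(frame_in), len(frame_in[0])
    f = (w_out / 2) / math.tan(fov / 2)      # assumed pinhole focal length
    cy, cx = (h_out - 1) / 2, (w_out - 1) / 2
    # Loop 1: forward-map the output border to obtain the boundary set B.
    border = [(i, j) for i in range(h_out) for j in range(w_out)
              if i in (0, h_out - 1) or j in (0, w_out - 1)]
    B = [project(*rotate(j - cx, cy - i, f, yaw, pitch), h_in, w_in)
         for i, j in border]
    u_lo, u_hi = math.floor(min(u for u, _ in B)), math.ceil(max(u for u, _ in B))
    v_lo, v_hi = math.floor(min(v for _, v in B)), math.ceil(max(v for _, v in B))
    out = [[None] * w_out for _ in range(h_out)]
    # Loop 2: raster-scan the INPUT frame and scatter into the output buffer.
    for u in range(h_in):
        for v in range(w_in):
            if not (u_lo <= u <= u_hi and v_lo <= v <= v_hi):
                continue                     # boundary test failed: skip pixel
            x, y, z = unrotate(*unproject(u, v, h_in, w_in), yaw, pitch)
            if z <= 0:
                continue                     # ray points behind the viewer
            i, j = round(cy - f * y / z), round(cx + f * x / z)
            if 0 <= i < h_out and 0 <= j < w_out:
                out[i][j] = frame_in[u][v]
    # Loop 3: stand-in filter that fills unwritten pixels from their row.
    for row in out:
        last = next((p for p in row if p is not None), 0)
        for j, p in enumerate(row):
            if p is None:
                row[j] = last
            else:
                last = p
    return out
```

The rectangular test used here is only valid when the view contains neither pole and does not cross the ±180° seam of the input frame; a complete implementation would need to handle those cases.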
Output Quality The output of our new rendering algorithm is not pixel-accurate compared to the original algorithm because the filtering function is non-invertible. We verify that the difference is acceptable, both objectively and subjectively.
Objectively, we use two metrics to quantify the difference between the outputs generated by our rendering algorithm and the original algorithm: the Peak Signal to Noise Ratio (PSNR) and the Normalized Root Mean Square Error (NRMSE). The PSNRs across three representative viewing angles, (0°, 0°), (45°, 45°), and (45°, 90°), are 40.4, 57.1, and 42.6, respectively, and the NRMSE is below 0.01, confirming the high precision of the new algorithm. Subjectively, we also conduct a user study and find that the difference is visually indistinguishable.
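For reference, the two metrics can be computed as in the sketch below. It assumes 8-bit pixel values (peak 255) and normalizes the RMSE by the value range of the reference image; the text does not say which NRMSE normalization was used, so that choice is an assumption.

```python
import math

def mse(ref, test):
    """Mean squared error between two equally sized pixel sequences."""
    return sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)

def psnr(ref, test, peak=255.0):
    """Peak Signal to Noise Ratio in dB; infinite for identical images."""
    err = mse(ref, test)
    return math.inf if err == 0 else 10 * math.log10(peak ** 2 / err)

def nrmse(ref, test):
    """RMSE normalized by the reference value range (assumed convention)."""
    return math.sqrt(mse(ref, test)) / (max(ref) - min(ref))

reference = [0, 255, 128, 64]   # hypothetical flattened pixel values
rendered = [1, 254, 128, 64]
print(psnr(reference, rendered), nrmse(reference, rendered))
```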
Architecture Co-design
We co-design the hardware architecture to maximize the efficiency of the proposed rendering algorithm. Fig. 5a shows the overall execution model. The boundary calculation is serialized with the rest of the processing because it provides the boundary set for testing input pixels. The input frame is streamed from the DRAM, which is overlapped with pixel rendering. We exploit the pixel-level parallelism of the new algorithm by pipelining the processing of different pixels. The pipeline has an initiation interval of one. That is, a new pixel starts execution every cycle. During pixel rendering, the output frame is buffered on-chip and is streamed out in the end.
The architecture block diagram is illustrated in Fig. 5b. The boundary calculation and the rendering module both use a set of multiply-accumulate (MAC) units and trigonometric function hardware to support the perspective rotation, Cartesian-spherical conversion, and filtering operations. To support the streaming I/O, we use the simple AXI4-Stream interconnect design, which has an efficient IP implementation on FPGA [22].
Optimizations We apply a series of optimizations to improve the performance and reduce resource utilization. First, the boundary test is on the critical path and thus impacts the overall performance. Testing precisely whether a pixel is within the boundary requires storing all the boundary pixels and performing many comparisons. To reduce the cost of the boundary test, we approximate the boundary by its smallest bounding box (a rectangle), which requires storing only four parameters and performing only four comparisons per boundary test. The trade-off is that the rendering algorithm now has to process more pixels that are not in the boundary. We find that this is a desirable trade-off because the benefit of reducing the per-pixel latency outweighs the overhead of pipelining more pixels.
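A minimal sketch of the bounding-box approximation is shown below. The function names and the sample boundary are hypothetical; the point is only that the box is described by four stored values and tested with four comparisons, at the cost of admitting some pixels that lie outside the true boundary.

```python
def bounding_box(boundary):
    """Smallest axis-aligned rectangle around the boundary set B: four parameters."""
    us = [u for u, _ in boundary]
    vs = [v for _, v in boundary]
    return min(us), max(us), min(vs), max(vs)

def inside(u, v, box):
    """Boundary test with exactly four comparisons."""
    u_min, u_max, v_min, v_max = box
    return u_min <= u and u <= u_max and v_min <= v and v <= v_max

# Hypothetical diamond-shaped boundary in input coordinates.
B = [(2, 5), (5, 2), (8, 5), (5, 8)]
box = bounding_box(B)
print(box, inside(5, 5, box), inside(2, 2, box), inside(1, 5, box))
```

In this example the corner pixel (2, 2) passes the test although it lies outside the diamond, which illustrates the extra pixels the pipeline has to process.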
In addition, we choose to use a fixed-point representation for computation to improve the resource utilization and speed. We empirically find that a 28-bit representation with 14 bits for the integer part is sufficient to guarantee negligible loss of visual quality.
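The 28-bit format leaves 14 bits for the fraction. The sketch below emulates such a format with Python integers; it assumes a signed two's-complement encoding with the sign bit counted among the 14 integer bits and round-to-nearest conversion, neither of which is specified in the text.

```python
FRAC_BITS = 14                      # 28 total bits - 14 integer bits
TOTAL_BITS = 28
SCALE = 1 << FRAC_BITS
FX_MIN = -(1 << (TOTAL_BITS - 1))   # assumed signed two's-complement range
FX_MAX = (1 << (TOTAL_BITS - 1)) - 1

def to_fixed(x):
    """Quantize a real number to the fixed-point grid (round to nearest)."""
    q = round(x * SCALE)
    if not FX_MIN <= q <= FX_MAX:
        raise OverflowError(f"{x} does not fit in {TOTAL_BITS} bits")
    return q

def from_fixed(q):
    """Convert a fixed-point integer back to a float."""
    return q / SCALE

def fx_mul(a, b):
    """Multiply two fixed-point values and rescale the product."""
    return (a * b) >> FRAC_BITS

print(from_fixed(fx_mul(to_fixed(1.5), to_fixed(2.25))))
print(abs(from_fixed(to_fixed(3.14159)) - 3.14159))
```

With 14 fractional bits, the conversion error of a single value is at most 2^-15, which gives a sense of why such a format can be adequate for addressing pixels in frames a few thousand pixels wide.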
Evaluation Results
This section first introduces the experimental setup (§ 5.1). We then evaluate on a set of micro-benchmarks using different resolutions and head orientations to understand the efficiency of the co-designed system (§ 5.2). Finally, we present the evaluation results on a 360° dataset using real user head orientations (§ 5.3).
[Figure 5. (a) The execution model. "BC" stands for "Boundary Calculation", i.e., the first loop in Algo. 2; "Pipelined Rendering" is the second and third loop in Algo. 2; "BT" stands for "Boundary Test", i.e., the test condition of the second loop in Algo. 2. Execution times are not to scale. Panel labels: BC, Streaming Input Frame, Pipelined Rendering (skipped if BT fails). (b) The architecture block diagram. Block labels: Input FIFO, Rendering, Output Buffer, AXI4-Stream Interconnect.]
Experimental Setup
We implement our architecture on the Xilinx Zynq UltraScale+ MPSoC ZCU104 development board [7], which is specifically designed for embedded visual applications such as Augmented Reality and Virtual Reality. Its programmable logic has 1.38 MB of on-chip BRAM. We synthesize, place, and route the design using the Xilinx Vivado tool chain, and obtain the post-layout power consumption. The design is clocked at 100 MHz, which meets the 30 FPS real-time target for all resolutions. The ZCU104 development board uses a 2 GB, 64-bit wide DDR4 memory system [4]. We estimate the DRAM power using the Micron DDR4 power calculator [1, 5] based on the application's memory access traces.
Baselines We compare against two baselines. First, we compare against a baseline that implements the original rendering algorithm (Algo. 1) on the mobile Pascal GPU available on the Nvidia Jetson TX2 development board [2]. TX2 is used in many off-the-shelf VR devices such as the ones from Magic Leap [3]. This baseline is representative of how 360° video rendering is performed in today's VR devices, as discussed in § 2. GPU power is measured using TX2's built-in TI INA 3221 voltage monitor IC, from which we retrieve the power consumption through the I2C interface.
Second, we compare against an FPGA baseline that implements the original rendering algorithm on the ZCU104 FPGA. Comparing against this baseline normalizes the effect of FPGA acceleration and thus highlights the benefits of algorithm-architecture co-design.
Microbenchmark Results
We evaluate four different resolutions, 480p, 720p, 1080p, and 2K, to represent different rendering requirements. Since different head orientations affect how many input pixels are processed, we evaluate three different head orientations, (0°, 0°), (45°, 45°), and (45°, 90°), which mimic users watching the front (around the Equator), middle, and top (around the North Pole) regions of a video.
Energy Savings Our co-designed system achieves a significant amount of energy savings compared to the GPU baseline. Fig. 6a shows the energy savings per frame across different resolutions under the three viewing angles. Under the 480p and 2K resolutions and the front viewing angle, our system saves close to 75% and 55% of the energy compared to the GPU baseline, respectively. On average, under the 2K resolution, our system consumes about 1.4 W of power, of which about 65% is dynamic power.
Our co-designed system also outperforms the FPGA baseline with the original rendering algorithm in most cases, as shown in Fig. 6b. The only exception is under the middle viewing angle. This is inherent to our new rendering algorithm, which processes more pixels when the viewing angle is near the (45°, 45°) region of the sphere. Recall from § 4.2 that we first calculate a bounding box and then process all the pixels that are enclosed by the box. The bounding box contains more pixels when the viewing angle is near the middle than near the Equator and the Poles.
Latency Breakdown We find that each frame's execution time is consistently dominated by pixel rendering. We break down the frame latency into three main phases: boundary calculation, pixel rendering (which includes the input streaming that overlaps with it), and output streaming. Regardless of the resolution, the pixel rendering time contributes over 90% of the total frame latency. The boundary calculation has negligible execution time (< 0.2%), indicating that although it is on the critical path of frame latency, it is far from being a bottleneck.
Resource Utilization Finally, we show that our proposed system has low resource utilization. Fig. 7a shows the BRAM utilization across different resolutions. The BRAM usage increases from 0.1 MB at 480p resolution to 0.84 MB at 2K resolution, but is still well under the budget of the mobile-grade UltraScale+ FPGA. Fig. 7b shows the utilization of other FPGA resources including DSP, FF, and LUT. Their utilization does not change across resolutions because the underlying data-path is exactly the same for different resolutions. Although boundary calculation contributes little to the frame latency, it occupies a significant amount of FPGA resources. This is a typical resource-performance trade-off that accelerators make.
Real User Trace Evaluation
We evaluate on a recently released 360° dataset [11], which includes the per-frame head orientations of 59 real users watching six different YouTube 360° videos. Fig. 8a shows the average energy savings of our system over the FPGA and the GPU baselines. The error bars indicate one standard deviation across all the users. On average, we achieve 26.4% and 40.0% energy savings over the GPU and the FPGA baselines in all five benchmarked videos across all users. We find that when watching 360° videos users tend to focus on the scenes in front of them, under which circumstances our system significantly outperforms the baselines, as quantified before using microbenchmarks (Fig. 6). Fig. 8b shows the cumulative distribution function of the absolute vertical angles of all users across all videos. Each <x, y> point in Fig. 8b reads as: users' vertical viewing angles are between −y° and y° for x% of the time. While the vertical viewing angle theoretically spans −90° to 90°, over 80% of the time users focus on regions that are between −30° and 30° vertically. Users rarely look at the 45° vertical angle, in which case our rendering algorithm introduces redundant pixel processing. In essence, our co-designed system optimizes for the common case to achieve significant overall energy savings.
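The reading of Fig. 8b described above corresponds to the following computation over a head-orientation trace. The sample angles are made up for illustration and are not taken from the dataset.

```python
def fraction_within(angles_deg, y_deg):
    """Fraction of samples whose vertical angle lies in [-y_deg, +y_deg]."""
    return sum(abs(a) <= y_deg for a in angles_deg) / len(angles_deg)

# Hypothetical per-frame vertical viewing angles, in degrees.
trace = [-40, -10, 0, 5, 20, 35, 60, -25, 15, 10]
print(fraction_within(trace, 30))
```

Evaluating this function over a range of `y_deg` values yields one point of the cumulative distribution per value.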
Conclusions
We expect that a significant amount of video traffic in the near future will be 360° panoramic videos. Mobile system designers will soon face the challenge of guaranteeing desirable user experience while rendering 360° content under severe energy constraints. This paper takes a promising first step in energy-efficient 360° content rendering through a specialized accelerator design. We demonstrate that the key is to tame the irregular memory accesses by co-designing the rendering algorithm with the architecture.
