With its strong computation capability, a NUMA-based multi-GPU system is a promising candidate to provide sustainable and scalable performance for Virtual Reality (VR) applications and deliver an excellent user experience. However, the entire multi-GPU system is viewed as a single GPU under the single programming model, which largely ignores the data locality among VR rendering tasks during workload distribution and leads to a large number of remote memory accesses among GPU modules (GPMs). The limited inter-GPM link bandwidth (e.g., 64GB/s for NVLink) becomes the major obstacle when executing VR applications on the multi-GPU system. By conducting comprehensive characterizations of different kinds of parallel rendering frameworks, we observe that distributing each rendering object along with its required data per GPM can reduce the inter-GPM memory accesses. However, this object-level rendering still faces two major challenges in a NUMA-based multi-GPU system: (1) the large data locality between the left and right views of the same object and the data sharing among different objects, and (2) the unbalanced workloads induced by the software-level distribution and composition mechanisms.
INTRODUCTION
With the vast improvements in graphics technology, Virtual Reality (VR) is becoming a potentially popular product for major high-tech companies such as Facebook [30], Google [16] and NVIDIA [29]. Different from normal PC or mobile graphics applications, VR promises a fully immersive experience by displaying images directly in front of users' eyes. Due to the dramatic change in experience that VR brings to users, the global VR market is expected to grow exponentially and generate $30 billion in annual revenue by 2022 [24, 34].
Despite the growing market penetration, achieving true immersion for VR applications still faces severe performance challenges [20]. First, the displayed image must have a high pixel density as well as a broad field of view, which requires a high display resolution. Meanwhile, the high-resolution image must be delivered at an extremely short latency so that users can preserve the continuous illusion of reality. However, state-of-the-art graphics hardware, the Graphics Processing Units (GPUs) in particular, cannot meet these strict performance requirements [20]. Historically, GPUs have gained performance by integrating more transistors and scaling up the chip size, but these optimizations on a single-GPU system can barely satisfy VR users because of the limited performance boost [33]. A multi-GPU system with much stronger computation capability is a promising candidate to provide sustainable and scalable performance for VR applications [21, 33].
In recent years, the major GPU vendors have combined multiple small GPU modules (i.e., GPMs) to build a future multi-GPU system under a single programming model that provides scalable computing resources. They employ high-speed inter-GPU links such as NVLink [28] and AMD Crossfire [3] to achieve fast data transfer among GPMs. The memory system and address mapping in this multi-GPU system are designed as a Non-Uniform Memory Access (NUMA) architecture to achieve 4x storage capacity over a single-GPU system. The NUMA-based multi-GPU system employs a shared memory space to avoid data duplication and synchronization overheads across the distributed memories [5]. In this study, we target the future multi-GPU system because it serves VR applications more energy-efficiently than a distributed multi-GPU system that employs separate memory spaces, and it is becoming a good candidate for future mobile VR applications. Since the entire system is viewed as a single GPU under the single programming model, the VR rendering workloads are sequentially launched and distributed to different GPMs without specific scheduling. Applying this naive single programming model greatly hurts the data locality among rendering workloads and incurs a large number of inter-GPM memory accesses, which significantly constrains the performance of the multi-GPU system for VR applications because of the bandwidth asymmetry between the local DRAM and the inter-GPM links. There have been many studies [5, 21, 25, 43] that improve the performance of NUMA-based multi-GPU systems by minimizing remote accesses. However, these solutions are still based on the single programming model and do not consider the special data redundancy in VR rendering; hence, they cannot efficiently resolve the performance bottleneck for VR applications.
To reduce the inter-GPM memory accesses, a straightforward method is to employ parallel rendering frameworks [7, 13, 14, 19] that split the rendering tasks into multiple parallel sub-tasks under a specific software policy before assigning them to the multi-GPU system. Since these frameworks were originally designed for distributed multi-GPU systems, a knowledge gap still exists on how to leverage the parallel rendering programming model to efficiently execute VR applications on a NUMA-based multi-GPU system. To bridge this gap, we first investigate three different parallel rendering frameworks (i.e., frame-level, tile-level and object-level). By conducting comprehensive experiments on our VR-featured simulator, we find that the object-level rendering framework, which distributes each rendering object along with its required data per GPM, can convert some remote accesses into local memory accesses. However, this object-level rendering still faces two major challenges in a NUMA-based multi-GPU system: (1) a large number of inter-GPM memory accesses, because it fails to capture the data locality between the left and right views of the same object as well as the data sharing among different objects; and (2) serious workload imbalance among GPMs due to the inefficient software-level distribution and composition mechanisms.
To overcome these challenges, we propose an object-oriented VR rendering framework (OO-VR) that reduces the inter-GPM memory traffic by exploiting the data locality among objects. Our OO-VR framework applies software and hardware co-optimizations to provide a NUMA-friendly solution for VR multi-view rendering. First, we propose an object-oriented VR programming model that provides a simple software interface for VR applications to exploit the data sharing between the left and right views of the same object. The proposed programming model also automatically groups objects into batches based on their data sharing levels. Then, to overcome the limitations of software-level solutions for workload distribution and composition, we design an object-aware runtime batch distribution engine at the hardware level to balance the rendering workloads among GPMs. We predict the execution time of each batch so that we can pre-allocate the data required by each batch to the local memory and hide the long data copy latency. We further design a distributed composition unit at the hardware level to fully utilize the render output units across all GPMs for the best pixel throughput. To summarize, the paper makes the following contributions:
• We investigate the performance of future NUMA-based multi-GPU systems for VR applications, and find that the inter-GPM memory accesses are the major performance bottleneck.
• We conduct comprehensive characterizations of major parallel rendering frameworks, and observe that the data locality among rendering objects can help to significantly reduce the inter-GPM memory accesses, but the state-of-the-art frameworks and multi-GPU systems fail to capture this feature.
• We propose a software and hardware co-designed Object-Oriented VR (OO-VR) rendering framework that leverages this data locality to convert remote inter-GPM memory accesses into local memory accesses.
• We further build a VR-featured simulator to evaluate our proposed design by rendering VR-enabled real-world games at different resolutions. The results show that OO-VR achieves a 1.58x performance improvement and a 76% inter-GPM memory traffic reduction over the state-of-the-art multi-GPU system. Being NUMA friendly by nature, OO-VR exhibits strong performance scalability and can potentially benefit future larger multi-GPU scenarios with ever-increasing bandwidth asymmetry between local and remote memory.
BACKGROUND AND MOTIVATION
Multi-View Rendering in Virtual Reality
In contrast to traditional graphics applications, state-of-the-art VR applications employ a Head-Mounted Display (HMD), or VR helmet, to present visuals directly to users' eyes. To display 3D objects in VR, a pair of frames (i.e., stereoscopic frames) is generated for the left and right eyes by projecting the scene onto two 2D image planes. This process is referred to as stereo rendering in computer graphics. Figure 1 shows an example of such a VR projection. The green and yellow boxes represent the rendering processes for the left and right views, respectively, creating two display images for the HMD. Stereo rendering requires two concurrent rendering processes for the two eyes' views, which doubles the workload for the VR pipeline. Because some objects in the scene (e.g., the robot in Figure 1) are shared by both eyes, mainstream graphics engines such as NVIDIA and Unity employ simultaneous multi-projection (SMP) to generate the left and right frames simultaneously through a single rendering process [8, 9, 27, 35]. This significantly reduces workload redundancy and achieves a substantial performance gain. Building on the conventional three-step rendering process (i.e., Geometry Process, Rasterization and Fragment Process) defined by modern graphics application programming interfaces (APIs) [1, 2], VR rendering inserts a multi-projection process after the geometry process and prior to rasterization, as shown in Figure 2 (a). Thus, when SMP is enabled, the VR rendering process is composed of four steps, detailed in Figure 2 (b). VR rendering begins by reading the application-issued vertices from GPU memory. During the geometry process (1), the vertex shader calculates the 3D coordinates of the vertices and assembles them into primitives (i.e., triangles in Figure 2 (a)-(1)).
After that, the generated triangles pass through the geometry-related shaders, which perform clipping, face culling and tessellation to generate extra triangles and remove non-visible triangles. Then, the SMP step (2) generates multiple projections of a single geometry stream. In other words, the GPU executes the geometry process only once but produces two positions for each triangle (Figure 2 (a)-(2)). These triangles are then streamed into the rasterization stage (3) to generate fragments (Figure 2 (a)-(3)), each of which is equivalent to a pixel in a 2D image. Finally, the fragment process (4) generates pixels by calculating the corresponding fragment attributes to determine their colors and textures (Figure 2 (a)-(4)). The output pixels are written into the frame buffer in GPU memory for display.
SMP Featured GPU Architectures
Traditionally, GPUs are designed as special-purpose graphics processors for performing modern rendering tasks. Figure 2 (c) shows an SMP-supported GPU architecture that models the recent NVIDIA Pascal GPUs [27]. It consists of several programmable streaming multiprocessors (SMs) (1) and some fixed-function units such as the GigaThread Engine (2), Raster Engine (3), Polymorph Engine (PME) (4), and Render Output Units (ROPs) (5). Each SM (1) is composed of a unified texture/L1 cache (TX/L1 $), several texture units (TXU) and hundreds of shader cores that execute a variety of graphics shaders (e.g., the functions in both the geometry and fragment processes). The GigaThread Engine (2) distributes the rendering workloads among PMEs if there are adequate computing resources. The Raster Engine (3) is a hardware accelerator for the rasterization process. Each PME (4) conducts input assembly, vertex fetching, and attribute setup. To support multi-view rendering, the NVIDIA Pascal architecture integrates an SMP engine into each PME. The SMP engine is capable of processing geometry for two different viewports, which are the projection centers for the left and right views. In other words, it duplicates the geometry process from the left view to the right view by changing the projection center instead of executing the geometry process twice. Finally, the Render Output Units (ROPs) (5) perform anti-aliasing, pixel compression and color output.
Although recent generations of GPUs have shown the capability to deliver good gaming experiences and have gradually evolved to support SMP, it is still difficult for them to satisfy the extremely high demands on rendering throughput from immersive VR applications. The human vision system has both a wide field of view (FoV) and incredibly high resolution when perceiving the surrounding world; the requirement for enabling an immersive VR experience is much more stringent than that for PC gaming. Table 1 lists the major differences between PC gaming and stereo VR [20].
As it demonstrates, stereo VR requires GPU to deliver 116 (58.32×2) Mpixels within 5 ms. Missing the rendering deadline will cause frame drop which significantly damages VR quality. Although the VR vendors today employ frame re-projection technologies such as Asynchronous Time Warp (ATW) [15, 36] to artificially fill in dropped frames, they cannot fundamentally solve the problem of rendering deadline missing due to little consideration on users' perception and interaction. Thus, improving the overall rendering efficiency is still the highest design priority for modern VR-oriented GPUs [6] .
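As a quick sanity check on the figures quoted above, the following sketch converts the stereo VR pixel budget into a sustained pixel rate. It takes only the 58.32 Mpixels per view and the 5 ms deadline from the text; the function name and the reading of the deadline as a per-frame-pair budget are assumptions made for illustration.

```python
def required_pixel_rate(mpixels_per_view: float, views: int, deadline_ms: float) -> float:
    """Return the sustained pixel rate (in Gpixels/s) needed to meet the deadline.

    Assumption: all views must be fully rendered within one deadline window.
    """
    total_pixels = mpixels_per_view * views * 1e6  # Mpixels -> pixels
    seconds = deadline_ms / 1e3                    # ms -> s
    return total_pixels / seconds / 1e9            # pixels/s -> Gpixels/s


rate = required_pixel_rate(mpixels_per_view=58.32, views=2, deadline_ms=5.0)
print(f"Required rate: {rate:.2f} Gpixels/s")
```

Under these assumptions the stereo pair corresponds to roughly 23 Gpixels/s of sustained output, which gives a sense of why missing the deadline is likely on a single GPU.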
NUMA-Based Multi-GPU System and Its Performance Bottleneck
In recent years, major GPU vendors such as NVIDIA have proposed to integrate multiple easy-to-manufacture GPU chips at the package level (i.e., multi-chip design) [5] or at the system level [25, 43] using high-bandwidth interconnection technologies such as Ground-Referenced Signaling (GRS) [32] or NVLink [28], in order to address the future chip density saturation issue. Figure 3 shows an overview of the multi-GPU architecture, which consists of four GPU modules (i.e., GPMs). In terms of compute capability, each GPM is configured to resemble the latest NVIDIA GPU architecture (e.g., Figure 2 (c)). Inside each GPM, SMs are connected via an XBAR to the GPM's local memory hierarchy, which includes a local memory-side L2 cache and off-chip DRAM. In the overall multi-chip design (MCM-GPU), the XBARs are interconnected through high-speed links such as NVLink to support communication among different GPMs. This multi-GPU system generally acts as a large single GPU; its memory system and address mapping are designed as a Non-Uniform Memory Access (NUMA) architecture. This design also reduces the programming complexity for GPU developers (e.g., a unified programming model similar to CUDA). Future Multi-GPU System Bottleneck for VR Workloads. As previous works [5, 25] have indicated, data movement among GPMs will become the major obstacle to continued performance scaling in these future NUMA-based multi-GPU systems. This situation is further exacerbated when executing VR applications because of the large amount of data shared among GPMs. Due to the view redundancy in VR applications, the left and right views may include the same object (e.g., the rabbit in Figure 3), which requires the same texture data.
However, to effectively utilize the computing resources of all the GPMs in such multi-GPM platforms, the rendering tasks for the left and right views are distributed to different groups or islands of GPMs in a more balanced fashion; each view is then further broken into smaller pieces and distributed to the individual GPMs of that group. This naive strategy can greatly hurt data locality in the SMP model. For example, if the basic texture data used to describe the rabbit in Figure 3 is stored in the local memory of GPM_0, the other GPMs need to issue remote memory accesses to acquire this data. Due to the asymmetrical bandwidth between the local DRAM (e.g., 1TB/s) and the inter-GPM NVLink (e.g., 64GB/s), remote memory accesses will likely become one of the major performance bottlenecks in such multi-GPU system designs. More sophisticated parallel rendering frameworks such as OpenGL Multipipe SDK [7], Chromium [19] and Equalizer [13, 14] are designed for distributed environments with separate memory spaces, where the data must be duplicated in each memory; this duplication would greatly limit the storage capacity of our NUMA-based multi-GPU systems. Thus, employing them on our architecture requires further investigation and characterization, which we present in Section 4. Figure 4 presents the performance of a 4-GPM multi-GPU system as the bandwidth of the inter-GPM links is decreased from 1TB/s to 32GB/s (refer to Section 3 for the experimental methodology). We observe that the rendering performance is significantly limited by the bandwidth. On average, applying 128GB/s, 64GB/s and 32GB/s inter-GPM bandwidth results in 22%, 42% and 65% performance degradation, respectively, compared to the baseline 1TB/s bandwidth. Although improving the inter-GPM bandwidth is a straightforward method to tackle the problem, it has proven difficult to achieve due to the additional silicon cost and power overhead [5].
This motivates us to provide software-hardware co-design strategies to enable "true" immersive VR experience for future users via significantly reducing the inter-GPM traffic and alleviating the performance bottleneck of executing VR workloads on future multi-GPU platforms. We believe this is the first attempt to co-design at system architecture level for eventually realizing future planet-scale VR.
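To build intuition for why the bandwidth asymmetry between local DRAM and the inter-GPM link matters, the sketch below computes an effective memory bandwidth for a given fraction of remote accesses. This is a deliberately simple toy model, not the simulator used in this study: it assumes that local and remote transfers are serialized and that no caching or overlap hides remote latency. The function name is hypothetical; only the 1TB/s and 64GB/s example values come from the text.

```python
def effective_bandwidth(remote_fraction: float,
                        local_gbps: float = 1000.0,
                        remote_gbps: float = 64.0) -> float:
    """Effective bandwidth (GB/s) when a fraction of bytes crosses the inter-GPM link.

    Toy assumption: time per byte is the weighted sum of local and remote
    time per byte (i.e., a weighted harmonic mean of the two bandwidths).
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be in [0, 1]")
    time_per_gb = (1.0 - remote_fraction) / local_gbps + remote_fraction / remote_gbps
    return 1.0 / time_per_gb


for fraction in (0.0, 0.1, 0.25, 0.5, 1.0):
    print(f"remote={fraction:4.0%}  effective={effective_bandwidth(fraction):7.1f} GB/s")
```

Even in this simplified model, a modest share of remote traffic pulls the effective bandwidth far below the local DRAM figure, which is consistent with the sensitivity to link bandwidth reported in Figure 4.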
EXPERIMENTAL METHODOLOGY
We investigate the performance impact of the multi-GPU system for virtual reality by extending ATTILA-sim [10], a cycle-level rasterization-based GPU simulator which covers a wide spectrum of graphics features on modern GPUs. The model of ATTILA-sim is built upon boxes (modules of the functional pipeline) and signals (simulating the interconnect between different components). Because the current ATTILA-sim models an AMD TeraScale2 architecture [17], it is difficult to configure it with the same number of SMs as NVIDIA Pascal-like architectures [27]. To fairly evaluate the design impact, we accordingly scale down other parameters such as the number of ROPs and the L2 cache. Similar strategies have been used to study modern graphics architectures in previous works [40, 41, 42]. The GPM memory system consists of a two-level cache hierarchy and a local DRAM. The L1 cache is private to each SM while the L2 cache is shared by all the SMs. Table 2 shows the simulation parameters applied in our baseline multi-GPU system.
In order to support multi-view VR rendering, we implement the SMP engine in ATTILA-sim based on the state-of-the-art SMP technology [8, 9], which re-projects the triangles in the left view to the right view using an updated viewport. Figure 5 shows a rendering example of Half-Life 2 after enabling SMP in ATTILA-sim. Our SMP engine first gathers the X coordinate range of the display frame, which spans from -W to +W, where W is a coordinate offset parameter. Then, it duplicates each triangle generated by the geometry process. After that, the SMP engine shifts the viewport of the rendering object by half of W, left or right depending on the eye. The SMP engine can also re-project the triangle based on user-defined viewports for the left and right views. Finally, we modify the triangle clipping to prevent spill-over into the opposite eye. We validated the implementation of the SMP engine in ATTILA-sim by comparing the triangle count, fragment count and performance improvement with those obtained from executing VR benchmarks (e.g., Sponza and San Miguel in NVIDIA VRWorks [29]) on a state-of-the-art GPU (NVIDIA GTX 1080 Ti). Specifically, we observe that the added SMP rendering in ATTILA-sim provides a 27% speedup over sequential rendering of the two views. We also model the inter-GPU interconnect as high-bandwidth point-to-point NVLinks. We assume each GPM has 6 ports and each pair of ports is used to connect two GPMs, so that the communication between two GPMs is not interfered with by other GPMs. Based on the sensitivity study shown in Figure 4, we configure the inter-GPM link bandwidth as 64GB/s (one direction). Following the common GPU design, each ROP in our simulation outputs 4 pixels per cycle to the framebuffer.
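The duplicate-and-shift steps of the SMP engine described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ATTILA-sim implementation: triangles are reduced to 2D screen-space vertices, the single-view input is assumed to be centered at x = 0, and the names `smp_duplicate` and `spills_over` are hypothetical. Real clipping would cut the triangle at the eye boundary; here we only detect triangles that would need it.

```python
from typing import List, Tuple

Vertex = Tuple[float, float]
Triangle = List[Vertex]


def smp_duplicate(triangle: Triangle, w: float) -> Tuple[Triangle, Triangle]:
    """Duplicate one triangle into a left-eye copy and a right-eye copy.

    The display frame spans x in [-w, +w]; each copy is shifted by w / 2
    toward its own eye, as described in the text.
    """
    half = w / 2.0
    left = [(x - half, y) for x, y in triangle]
    right = [(x + half, y) for x, y in triangle]
    return left, right


def spills_over(triangle: Triangle, eye: str) -> bool:
    """Return True if any vertex lies in the opposite eye's half of the frame."""
    if eye == "left":
        return any(x > 0.0 for x, _ in triangle)
    if eye == "right":
        return any(x < 0.0 for x, _ in triangle)
    raise ValueError("eye must be 'left' or 'right'")


tri = [(-0.2, 0.0), (0.2, 0.0), (0.0, 0.3)]
left_tri, right_tri = smp_duplicate(tri, w=1.0)
print(left_tri, spills_over(left_tri, "left"))
print(right_tri, spills_over(right_tri, "right"))
```

The geometry is processed once and only the horizontal offset differs between the two copies, which is the source of the workload saving that SMP provides.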
To further alleviate the remote memory access latency on the NUMA-based baseline architecture, we employ the state-of-the-art First-Touch (FT) page placement policy and remote cache scheme [5] to create a fair baseline evaluation environment. Table 3 lists the set of graphics benchmarks employed to evaluate our design. This set includes five well-known 3D games, covering different rendering libraries and 3D engines. We also list the original rendering resolution and the number of draw commands for these benchmarks. Two benchmarks (Doom3 and Half-Life 2) from the table are rendered with a range of resolutions (1600×1200, 1280×1024, 640×480), while for the other games we adopt the 1280×1024 resolution if it is available and supported by the simulator. In order to feature these PC games as VR applications, we modify the ATTILA Common Driver Layer (ACDL) to enable multi-view rendering. In our experiments, we let all the workloads run to completion to generate accurate frames on the simulator and gather the average frame latency for each game.
Figure 6: Three types of parallel rendering schemes for parallel VR applied on future NUMA-based multi-GPU systems.
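For readers unfamiliar with first-touch page placement, the sketch below illustrates the general idea: a page is homed in the local memory of whichever GPM accesses it first, and later accesses from other GPMs are remote. This is a simplified illustration, not the policy implementation from [5]; the class name, the 4 KB page size and the counters are assumptions.

```python
class FirstTouchMemory:
    """Toy model of first-touch page placement across GPMs."""

    def __init__(self, page_size: int = 4096):  # page size is an assumption
        self.page_size = page_size
        self.owner = {}   # page number -> id of the GPM that touched it first
        self.local = 0    # accesses served by the requester's own DRAM
        self.remote = 0   # accesses that must cross an inter-GPM link

    def access(self, gpm: int, address: int) -> bool:
        """Record an access and return True if it is local to the requesting GPM."""
        page = address // self.page_size
        home = self.owner.setdefault(page, gpm)  # first toucher becomes the home
        is_local = home == gpm
        if is_local:
            self.local += 1
        else:
            self.remote += 1
        return is_local


mem = FirstTouchMemory()
mem.access(gpm=0, address=0x0000)   # GPM 0 touches page 0 first: local
mem.access(gpm=1, address=0x0100)   # same page, different GPM: remote
mem.access(gpm=1, address=0x1000)   # GPM 1 touches page 1 first: local
print(mem.local, mem.remote)
```

The example shows why first-touch alone cannot help when two GPMs need the same texture page: only one of them can be the home.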
CHARACTERIZING PARALLEL RENDERING SCHEMES ON FUTURE NUMA-BASED MULTI-GPU SYSTEMS
To reduce the NUMA-induced bottlenecks, a straightforward method is to employ parallel rendering schemes in VR applications to distribute a domain of graphics workloads across the targeted computing resources. While some parallel rendering frameworks such as Equalizer [13, 14] and OpenGL Multipipe SDK [7] have been used in many cluster-based PC games, NUMA-based multi-GPU systems face different challenges when performing parallel rendering. In this section, we perform a detailed analysis of three state-of-the-art parallel rendering schemes (frame-level, tile-level and object-level parallel rendering) for VR applications running on such future NUMA-based multi-GPU architectures, to further understand the design challenges.
Alternate Frame Rendering (AFR)
Alternate Frame Rendering (AFR), also known as frame-level parallel rendering, executes one rendering process on each GPU in a multi-GPU environment. As Figure 6a demonstrates, AFR distributes a sequence of rendering frames along with the required data across different GPMs. AFR is often considered to be a better fit for distributed memory environments since the separate memory spaces make the concurrent rendering of different frames easier to implement [14]. To separate our NUMA memory system into unique memory spaces, we leverage software-level segmented memory allocation to reserve distributed memory segments for each frame. We also employ a simple task scheduler to map the rendering workloads of a frame to a specific GPM. The benefit of this design is that it eliminates the inter-GPM communication. Figure 7 shows how the AFR scheme affects the performance and the single-frame latency. The results are normalized to the baseline NUMA-based multi-GPU setup (with 64GB/s NVLink), where the entire system is viewed as a single GPU under the programming model and rendering workloads are directly launched to this system without specific parallel rendering scheduling. On average, AFR improves the performance (i.e., overall frame rate) by 1.67x compared to the baseline setup. AFR not only eliminates the performance degradation caused by low-bandwidth inter-GPU links, but also increases the rendering throughput by leveraging the SMP feature of the GPM. However, Figure 7 (right) also suggests that AFR increases the single-frame latency by 59%, as a frame is processed by only one GPU. This increased single-frame latency may cause significant motion anomalies, including judder, lagging and sickness in VR systems [42, 44], because it strongly affects whether the display on the VR headset stays in sync with the actual motion.
Additionally, we observe that AFR increases the memory bandwidth and capacity requirements near-linearly because of the memory space pre-allocated for each frame. This decreases the maximum usable system memory capacity, which directly limits the rendering resolution, texture detail and perceived quality of different VR applications.
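The throughput-versus-latency trade-off of AFR can be illustrated with a small sketch. It is an idealized model under stated assumptions (every frame takes the same time on one GPM, and there is no inter-GPM traffic), so it does not reproduce the measured 1.67x and 59% figures; the function names are hypothetical.

```python
from typing import List, Tuple


def afr_schedule(num_frames: int, num_gpms: int) -> List[int]:
    """Assign frame i to GPM (i mod num_gpms), i.e., alternate frames across GPMs."""
    return [frame % num_gpms for frame in range(num_frames)]


def afr_metrics(frame_time_ms: float, num_gpms: int) -> Tuple[float, float]:
    """Return (throughput in frames/s, single-frame latency in ms) for ideal AFR.

    Assumption: each frame is rendered entirely by one GPM in frame_time_ms,
    so throughput scales with the GPM count while latency does not shrink.
    """
    throughput_fps = num_gpms * 1000.0 / frame_time_ms
    latency_ms = frame_time_ms
    return throughput_fps, latency_ms


print(afr_schedule(num_frames=6, num_gpms=4))
print(afr_metrics(frame_time_ms=20.0, num_gpms=4))
```

Adding GPMs raises the frame rate in this model but leaves each frame's latency tied to a single GPM, which is the property that makes AFR problematic for VR.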
Tile-Level Split Frame Rendering (SFR)
In contrast to AFR, split frame rendering (SFR) aims to reduce the single-frame latency by splitting a single frame into smaller rendering workloads, with each GPM responsible for only one group of workloads. Figure 6b shows a tile-level SFR which splits the rendering frame into several pixel tiles in screen space and distributes these sets of pixels across different GPMs. This basic method is widely used in cluster-based PC gaming because it requires very little software effort [37]. To employ tile-level SFR, we simply leverage the sort-first algorithm to define the tile-window size before the rendering tasks are processed in the GPMs. Although this design can effectively reduce single-frame latency, its vertical pixel striping [37] distributes the left and right views to different GPMs, ignoring the redundancy between the two views. Thus, to enable SMP under tile-level SFR, an alternative is to employ a horizontal culling method, shown in Figure 6c. It groups the left and right views into one large pixel tile so that the rendering workloads in the left view can be re-projected into the right view via the SMP engine to reduce redundant geometry processing and improve data sharing.
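The difference between the vertical and horizontal tile splits can be made concrete with a small sketch. It assumes a side-by-side stereo frame (left view in the left half, right view in the right half) divided into equal strips, one per GPM; this layout, the function name `tile_owner` and the example resolution are assumptions for illustration only.

```python
def tile_owner(x: int, y: int, frame_w: int, frame_h: int,
               num_gpms: int, mode: str) -> int:
    """Return the GPM that owns pixel (x, y) under an equal-strip split.

    mode="vertical":   the frame is cut into side-by-side columns.
    mode="horizontal": the frame is cut into stacked rows spanning both views.
    """
    if mode == "vertical":
        return min(x * num_gpms // frame_w, num_gpms - 1)
    if mode == "horizontal":
        return min(y * num_gpms // frame_h, num_gpms - 1)
    raise ValueError("mode must be 'vertical' or 'horizontal'")


VIEW_W, FRAME_H, GPMS = 1280, 1024, 4
FRAME_W = 2 * VIEW_W  # assumed side-by-side stereo layout

left_pixel = (100, 50)             # a pixel in the left view
right_pixel = (100 + VIEW_W, 50)   # the matching pixel in the right view

for mode in ("vertical", "horizontal"):
    owners = [tile_owner(px, py, FRAME_W, FRAME_H, GPMS, mode)
              for px, py in (left_pixel, right_pixel)]
    print(mode, owners)
```

Under the vertical split the two matching pixels land on different GPMs, so the views of one object are rendered apart; under the horizontal split they share a GPM, which is what lets the SMP engine re-project the left view into the right.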
Object-Level Split Frame Rendering
Distributing objects among processing units represents a specific type of split frame rendering (SFR). Figure 6d shows an example of object-level SFR, which is often referred to as sort-last rendering [13]. In contrast to the traditional vertical and horizontal tile-level SFR, the distribution under object-level SFR begins after the GPU starts the rendering process. During object-level SFR, a root node is selected (e.g., GPM0 in this example) to distribute the rendering objects to the other working units (e.g., GPM1, GPM2 and GPM3). Once a worker completes its assigned object, the output color in its local DRAM is sent to the root node to composite the final frame. In this study, we first profile the entire rendering process to get the total number of rendering objects, and then issue them to different GPMs in a round-robin fashion. Note that only one object is executed in each GPM at a time for better data locality. Although this object distribution can also occur during the rendering process (e.g., between rasterization and fragment processing [21]), doing so typically requires additional inter-GPM synchronization, which may increase inter-GPM traffic and degrade performance. Thus, we only distribute the objects at the beginning of the rendering pipeline in our experiments.
Figure 10: The best-to-worst performance ratio among GPMs in object-level SFR across different workloads.
Figures 8 and 9 illustrate the performance (i.e., the overall frame rate) impact and the inter-GPM memory traffic for different SFR scenarios. The results are normalized to the baseline setup. We have the following observations:
(i) The tile-level SFR schemes only slightly improve the rendering performance over the baseline case, e.g., on average 28% and 3% for Tile-level (V) and Tile-level (H), respectively. Although processing a small set of pixels via tile-level SFR can improve the data locality within one GPU, the tile-level SFR schemes increase the inter-GPM memory traffic by an average of 50% for vertical culling (V) and 44% for horizontal culling (H) because objects overlap across the tiles. While horizontal culling (H) fails to capture the data sharing for large objects (e.g., the bridge on the right side of Figure 6c), vertical culling (V) ignores the redundancy between the left and right views. Because the GPMs do not render the left and right views simultaneously under SMP-based VR rendering, the large texture data have to be moved frequently across the GPMs.
(ii) The object-level SFR outperforms the tile-level SFR schemes and achieves an average of 60%, 32% and 57% performance improvement over the baseline, tile-level (V) and tile-level (H), respectively. The speedups come mainly from the inter-GPM traffic reduction, as indicated by Figure 9. By placing the data required by the rendered objects in the local DRAM, object-level SFR reduces the inter-GPM traffic by approximately 40% compared to the baseline. However, the state-of-the-art object-level SFR cannot fully address the NUMA-induced performance bottlenecks for VR execution, because it still executes the objects from the left and right views separately. In other words, it ignores the multi-view redundancy in VR applications, which limits its rendering efficiency.
(iii) Additionally, we observe that the object-level SFR is challenged by poor load balance and high composition overhead. Figure 10 shows the ratio between the best and the worst performance among different GPMs under the round-robin object scheduling policy. Since each object has a variety of graphical properties (e.g., the total number of triangles, the level of detail, the viewport window size, etc.), the processing time is typically different for each object. If one GPM is assigned more complex objects than the others, it will take more time to complete its rendering tasks. Since the overall performance of the multi-GPU system is determined by the slowest GPM, poor load balance significantly degrades the overall performance. Meanwhile, the high composition overhead (i.e., assembling all the rendering outputs from different GPMs into a frame) also contributes to the unbalanced execution time. As mentioned previously, only the root node is employed to distribute and composite rendering tasks in the current object-level SFR. In this case, extra workloads are issued to the root node while the ROP units of the other GPMs cannot be fully utilized during the color output stage, causing poor composition scalability [7]. Therefore, we aim to propose software-hardware support to efficiently handle these challenges facing the state-of-the-art object-level SFR in a NUMA-based multi-GPU environment.
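Observation (iii) can be illustrated with a short sketch of round-robin object distribution and the resulting imbalance. The per-object costs below are made-up numbers chosen to show the effect, not measurements from our benchmarks, and the ratio is defined here as the slowest GPM's total time divided by the fastest GPM's total time, which is an assumption about how such a ratio could be computed.

```python
from typing import List


def round_robin(costs: List[float], num_gpms: int) -> List[float]:
    """Issue object i to GPM (i mod num_gpms) and return each GPM's total time."""
    loads = [0.0] * num_gpms
    for index, cost in enumerate(costs):
        loads[index % num_gpms] += cost
    return loads


def imbalance_ratio(loads: List[float]) -> float:
    """Ratio of the slowest GPM's time to the fastest GPM's time (>= 1.0)."""
    return max(loads) / min(loads)


# Hypothetical per-object rendering costs: two heavy objects, six light ones.
object_costs = [8.0, 1.0, 1.0, 1.0, 8.0, 1.0, 1.0, 1.0]

loads = round_robin(object_costs, num_gpms=4)
print("per-GPM time:", loads)
print("frame time (slowest GPM):", max(loads))
print("imbalance ratio:", imbalance_ratio(loads))
```

Because round-robin ignores object complexity, both heavy objects in this example land on the same GPM, and the frame time is set by that GPM while the other three sit idle.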
OBJECT-ORIENTED VR RENDERING FRAMEWORK
In order to address the performance issues of the object-level SFR applied on future NUMA-based multi-GPU systems, we propose the object-oriented VR rendering framework (OO-VR). The basic design diagram is shown in Figure 11 . It consists of several novel components. First, we propose an object-oriented VR programming model at the software layer to support multi-view rendering for the object-level SFR. It also provides an interface to connect the VR applications to the underlying multi-GPU platform. Second, we propose an object-aware runtime distribution engine at the hardware layer to balance the rendering workloads across GPMs. In OO-VR, this distribution engine predicts the rendering time for each object before it is distributed. It replaces the master-slave structure among GPMs so that the actual distribution is only determined by the rendering process. Finally, we design a distributed hardware composition unit to utilize all the ROPs of the GPMs to assemble and update the final frame output from the framebuffer. Due to the NUMA feature, the framebuffer is distributed across all the GPMs instead of only one DRAM partition, so that it can provide 4x output bandwidth of the baseline scenario. We detail each of these components as follows.
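To illustrate how a rendering-time prediction could drive workload distribution, the sketch below assigns each object to the GPM with the smallest accumulated predicted time. This is a generic greedy least-loaded heuristic used as a stand-in, not the hardware distribution engine of OO-VR, whose mechanism is described later in the paper; the function name and the predicted times are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def distribute_by_predicted_time(
    predicted: List[float], num_gpms: int
) -> Tuple[Dict[int, List[int]], List[float]]:
    """Greedily assign each object to the currently least-loaded GPM.

    Returns (assignment, loads): the object ids per GPM and each GPM's
    total predicted time. Ties are broken by the lower GPM id.
    """
    heap = [(0.0, gpm) for gpm in range(num_gpms)]  # (accumulated time, GPM id)
    heapq.heapify(heap)
    assignment: Dict[int, List[int]] = {gpm: [] for gpm in range(num_gpms)}
    loads = [0.0] * num_gpms
    for object_id, time in enumerate(predicted):
        load, gpm = heapq.heappop(heap)          # least-loaded GPM so far
        assignment[gpm].append(object_id)
        loads[gpm] = load + time
        heapq.heappush(heap, (loads[gpm], gpm))
    return assignment, loads


# Hypothetical predicted rendering times for eight objects.
predicted_times = [8.0, 1.0, 1.0, 1.0, 8.0, 1.0, 1.0, 1.0]
assignment, loads = distribute_by_predicted_time(predicted_times, num_gpms=4)
print(assignment)
print("per-GPM predicted time:", loads, "-> frame time:", max(loads))
```

With these example times, a static round-robin assignment would place both heavy objects on one GPM for a total of 16 time units, whereas the prediction-aware assignment separates them. The quality of any such scheme depends on how accurate the predictions are.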
Object-Oriented VR Programming Model
The Object-Oriented VR Programming Model extends the conventional object-level SFR introduced in Section 4 and uses a software structure similar to today's Equalizer [13, 14] and OpenGL Multipipe SDK (MPK) [7]. Figure 12 uses a simplified workflow diagram to explain our programming model. In this study, we propose two major components that drive our OO-VR programming model: Object-Oriented Application (OO_Application), which drives the VR multi-view rendering for each object, and Object-Oriented Middleware (OO_Middleware), which reorders objects and regroups those that share similar features into a large batch that acts as the smallest scheduling unit on the NUMA-based multi-GPU system. The OO_Application provides a software interface (dark blue box) for developers to merge the left and right views of the same object into a single rendering task. The OO_Application is designed by extending the conventional object-level SFR. For each object, we replace the original viewport, which is set during the rendering initialization, with two new parameters, viewportL and viewportR, each of which points to one view of the object. To enable rendering multiple views at the same time, we apply the built-in OpenGL extension GL_OVR_multiview2 to set two viewport IDs for a single object. After that, each SMP engine integrated in a GPM automatically renders the left and right views to their own positions using the same texture data. We also design an auto-model that extends the conventional object-level SFR to enable multi-view rendering by generating two fixed viewports for each object via shifting the original viewport along the X coordinate. In this case, only one rendering process needs to be set up for each object.
Figure 11: Our proposed object-oriented VR rendering framework (OO-VR).
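The auto-model's viewport derivation can be illustrated with a minimal Python sketch. The `Viewport` type, the `stereo_viewports` helper and the symmetric shift of `x_shift` pixels are all assumptions made for illustration; the text only states that two fixed viewports are produced by shifting the original viewport along the X coordinate.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Viewport:
    """Hypothetical viewport rectangle in screen-space pixels."""
    x: int
    y: int
    width: int
    height: int


def stereo_viewports(original: Viewport, x_shift: int) -> Tuple[Viewport, Viewport]:
    """Derive (viewportL, viewportR) from one viewport by shifting along X.

    Assumption: the two views are shifted symmetrically by x_shift pixels;
    the shift policy is not specified in the text.
    """
    viewport_l = Viewport(original.x - x_shift, original.y,
                          original.width, original.height)
    viewport_r = Viewport(original.x + x_shift, original.y,
                          original.width, original.height)
    return viewport_l, viewport_r
```

Both derived viewports keep the original size and Y position, so a single rendering setup per object is enough to describe the two views.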
In contrast to the single-path stereo rendering enabled in modern VR SDKs [13, 29], our OO_Application does not decompose the left and right views during the rendering initialization, so it still follows the execution model of the object-level SFR.
OO_Middleware is the software-hardware bridge to connect the OO_Application and multi-GPU system. It is automatically executed during the application initialization stage to issue a group of the objects to the rendering queue of the multi-GPU system. In the conventional object-level SFR, the objects are scheduled in a master-slave fashion following the programmer-defined order. However, different objects that may share some common texture data are not rendered on the same GPM. As Figure 12 illustrates, both "pillar1" and "pillar2" share the common "stone" texture. If they are rendered on different GPMs, the "stone" texture may need to be reallocated, increasing remote GPM access. In OO-VR, we leverage OO_Middleware to group objects based on their texture sharing level (TSL) to exploit the data locality across different objects.
To implement this, OO_Middleware first picks an object from the head of the queue as the root. It then finds the next object that is independent of the root and computes the TSL between the two using Equation (1).
In Equation (1), t is the texture data shared between the two objects, and P_r(t) and P_n(t) represent the percentages of t among all the required textures for the root and the target object, respectively. TSL represents how much texture data will be shared if we group the target object with the root. If TSL is greater than 0.5, we group them together as a batch, and this batch then becomes the new root, which consists of all textures from the previous iteration and the target object. After this grouping, the OO_Middleware removes the target object from the queue and continues to search for the next object until the total number of triangles within the batch exceeds 4096, or all the objects in the queue have been examined. This triangle limit prevents load imbalance from an inflated batch.
After this step, the batch is marked as ready and issued to a GPM in the system for rendering. Finally, the OO_Middleware repeats this grouping process for all the objects in the frame until there is no object left in the queue. Note that objects that depend on any of the objects in a batch are merged directly into that batch, and the triangle limit is increased accordingly so that they can follow the programmer-defined rendering order.
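The grouping loop described above can be sketched in Python as follows. This is an illustrative sketch, not the OO_Middleware implementation: `RenderObject`, `Batch`, `tsl` and `group_batches` are hypothetical names, textures are modeled as a name-to-size map, and dependency handling is omitted. Since Equation (1) is not reproduced here, `tsl` uses a stand-in definition (the sum of P_r(t) * P_n(t) over shared textures) that should be read as an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RenderObject:
    name: str
    triangles: int
    textures: Dict[str, int]  # texture name -> size (hypothetical model)


@dataclass
class Batch:
    objects: List[str]
    triangles: int
    textures: Dict[str, int]


def tsl(root_tex: Dict[str, int], cand_tex: Dict[str, int]) -> float:
    """Stand-in texture sharing level (assumption, not Equation (1) itself).

    P(t) is the share of texture t among all textures an object needs.
    """
    shared = root_tex.keys() & cand_tex.keys()
    root_total = sum(root_tex.values())
    cand_total = sum(cand_tex.values())
    if not shared or root_total == 0 or cand_total == 0:
        return 0.0
    return sum((root_tex[t] / root_total) * (cand_tex[t] / cand_total)
               for t in shared)


def group_batches(queue: List[RenderObject],
                  threshold: float = 0.5,
                  max_triangles: int = 4096) -> List[Batch]:
    """Group objects into batches by texture sharing, as described in the text."""
    pending = list(queue)
    batches: List[Batch] = []
    while pending:
        root = pending.pop(0)  # the head of the queue becomes the root
        batch = Batch([root.name], root.triangles, dict(root.textures))
        i = 0
        # Stop once the batch exceeds the triangle limit or the queue is scanned.
        while i < len(pending) and batch.triangles <= max_triangles:
            cand = pending[i]
            if tsl(batch.textures, cand.textures) > threshold:
                # The merged batch becomes the new root.
                batch.objects.append(cand.name)
                batch.triangles += cand.triangles
                batch.textures = {**batch.textures, **cand.textures}
                pending.pop(i)
            else:
                i += 1
        batches.append(batch)  # batch is ready to be issued to a GPM
    return batches
```

With the "pillar1"/"pillar2" example from Figure 12, two objects that both use only the "stone" texture end up in the same batch, while an object with unrelated textures forms its own batch.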
Object-Aware Runtime Distribution Engine
After receiving the batches from the OO_Middleware, the multi-GPU system needs to distribute them across different GPMs for multi-view rendering. For workload balancing, we propose an object-aware runtime distribution engine at the hardware layer instead of the software-based, master-slave distribution method used in the conventional object-level SFR. Compared to the software-level solution, which needs to split the tasks before assigning them to the multi-GPU system, the hardware engine provides efficient runtime workload distribution by collecting rendering information. Figure 13 illustrates the architecture design of the proposed distribution engine. The new hardware architecture is implemented as a micro-controller for the multi-GPU system, which is responsible for predicting the rendering time for each batch, allocating an object to the earliest available GPM, and pre-allocating memory data using the Pre-allocation Units (PAs) in each GPM.
Recall from the discussion of our evaluation baseline in Section 3 that we employ the first-touch memory mapping policy (FT) [5] to allocate the required data in the local DRAM. Although FT can help reduce inter-GPM memory access, it can also cause performance degradation if the required data is not ready during the rendering process. As a result, we pre-allocate data before objects are distributed across GPMs. In this case, OO-VR needs to be aware of the runtime information of each GPM to determine which GPM is likely to become idle the earliest.
In order to obtain this information, we need to predict approximately how long the current batch will take to complete. Equation (2) shows a basic way to estimate the rendering time of the current task X, introduced by [39]:
In Equation (2), g_x and c_x are the geometry and texture properties of the object X, HW is the hardware information, and ST is the current rendering step (i.e., geometry process, multi-view projection, rasterization or fragment process) of the object X. While a complex equation can increase the estimation accuracy, it also requires more comprehensive data and increases hardware design complexity and computing overhead. Because the objective of our prediction model is to identify the earliest available GPM rather than to accurately predict each batch's rendering time, we propose a simple linear memorization-based model to estimate the rendering time as Equation (3):
In Equation (3), #triangle_x, #tv_x and #pixel_x represent the triangle count, the number of transformed vertices and the number of rendered pixels of the current batch, respectively. c_0, c_1 and c_2 represent the triangle, vertex and pixel rates of the GPM.
After building this estimation model, we split the prediction process into two phases: total rendering time estimation and elapsed rendering time prediction. We set up two counters to record the total rendering time and the elapsed rendering time for each GPM. First, we leverage #triangle_x (which can be directly acquired from the OO_Application) to predict the total rendering time. During rendering, the distribution engine tracks #tv_x and #pixel_x from the GPMs to calculate the elapsed rendering time. If #tv_x or #pixel_x increases by 1, the elapsed counter increases by c_1 or c_2, respectively. Finally, by comparing the distance between the two counters of each GPM, we can predict which GPM will become available first.
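The two-counter scheme can be sketched in Python as follows. This is a behavioral illustration only, not the hardware design: `GpmCounters` and `DistributionEngine` are hypothetical names, and the sketch assumes that the predicted total time grows by c_0 per triangle, mirroring how the text says the elapsed counter grows by c_1 per transformed vertex and c_2 per rendered pixel.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpmCounters:
    total: float = 0.0    # predicted total rendering time of assigned batches
    elapsed: float = 0.0  # elapsed rendering time observed so far

    def remaining(self) -> float:
        """Distance between the two counters, clamped at zero."""
        return max(self.total - self.elapsed, 0.0)


class DistributionEngine:
    """Hypothetical model of the rendering time predictor."""

    def __init__(self, num_gpms: int, c0: float, c1: float, c2: float) -> None:
        self.c0, self.c1, self.c2 = c0, c1, c2
        self.gpms: List[GpmCounters] = [GpmCounters() for _ in range(num_gpms)]

    def earliest_available(self) -> int:
        """Pick the GPM whose counters are closest (ties go to the lowest ID)."""
        return min(range(len(self.gpms)), key=lambda i: self.gpms[i].remaining())

    def dispatch(self, triangles: int) -> int:
        """Assign a batch and raise the predicted total time (assumed c0 per triangle)."""
        gpm = self.earliest_available()
        self.gpms[gpm].total += self.c0 * triangles
        return gpm

    def on_progress(self, gpm: int, new_vertices: int = 0, new_pixels: int = 0) -> None:
        """Advance the elapsed counter by c1 per vertex and c2 per pixel."""
        self.gpms[gpm].elapsed += self.c1 * new_vertices + self.c2 * new_pixels
```

A GPM that has made more progress relative to its predicted total is chosen first, which matches the stated goal of finding the earliest available GPM rather than predicting exact rendering times.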
At the beginning of the rendering, the distribution engine uses the first 8 batches to initialize c_0, c_1 and c_2. The first 8 batches are distributed across GPMs under the Round-Robin object scheduling policy, and the baseline FT memory mapping scheme is applied to allocate the rendering data. After the GPMs complete this round of 8 batches, the total rendering time is sent back to the distribution engine to calculate c_0, c_1 and c_2. Then, starting from the 9th batch, the rendering time predictor is enabled to find the earliest available GPM. After that, the PA Unit pre-allocates the required data to the selected GPM, and the rendering time predictor updates the predicted total rendering time by increasing the triangle counts. Note that we limit the maximum size of the batch queue to 4 objects to reduce the memory space requirement. Multiple batches could be distributed onto one GPM at the same time. In this case, a PA Unit sequentially fetches the data based on the order of the batch ID.
We further observe that even though the distribution engine can effectively balance the rendering tasks, some large objects may still become the performance bottleneck once all the other batches have been completed. To fully utilize the computing and memory resources of the idle GPMs, we employ a simple fine-grained task mapping mechanism that fairly distributes the remaining processing units (e.g., triangles in the geometry process and fragments in the fragment process) to idle GPMs based on their IDs. Meanwhile, the PA units duplicate the required data to the corresponding unused DRAM to eliminate inter-GPM accesses for these left-over fine-grained tasks.
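One possible reading of this fine-grained mapping is a round-robin assignment over the idle GPM IDs, sketched below in Python. The function name `map_leftover_units` and the round-robin policy are assumptions; the text only says that the remaining units are distributed fairly to idle GPMs based on their IDs.

```python
from typing import Dict, List


def map_leftover_units(num_units: int, idle_gpm_ids: List[int]) -> Dict[int, List[int]]:
    """Assign left-over unit indices (triangles or fragments) to idle GPMs.

    Assumption: units are dealt round-robin in ascending GPM ID order.
    """
    if not idle_gpm_ids:
        raise ValueError("no idle GPM available")
    order = sorted(idle_gpm_ids)
    mapping: Dict[int, List[int]] = {gpm: [] for gpm in order}
    for unit in range(num_units):
        mapping[order[unit % len(order)]].append(unit)
    return mapping
```

Under this policy the per-GPM unit counts differ by at most one, which is one simple way to satisfy the fairness requirement.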
Distributed Hardware Composition Unit
In the conventional object-level SFR, the entire FrameBuffer (FB) is mapped in the DRAM of the master node, and all the rendering outputs are then transmitted to the master node for the final composition. Although the color outputs can be executed asynchronously with the shader process, the small number of ROPs in a single GPM limits the pixel rate, which impacts the overall rendering performance. Since the NUMA-based multi-GPU system can be considered as a large single GPU, we distribute the composition tasks across all the GPMs, which is currently not supported due to the lack of relevant communication mechanisms and hardware.
As shown in Figure 14, we first split the entire FB into 4 partitions using the screen-space coordinates of the final frame. Here we employ the same memory mapping policy as the vertical Tile-level SFR (V). Based on this, we propose the distributed hardware composition unit (DHC) to determine which part of the FrameBuffer stores which color outputs of the final frame. This design is based on the observation that the color outputs of the final frames incur only a small number of memory accesses compared to the main rendering phase, so the small amount of remote communication for this phase will not become a performance bottleneck for NUMA-based multi-GPU systems. This is also why the vertical culling shown in Figure 14 can perform well as the last stage of VR rendering (i.e., after the object-aware runtime distribution for the main rendering phase), since the inter-GPM bandwidth can be effectively utilized by the distributed hardware composition.
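The framebuffer partitioning can be illustrated with a short Python sketch that maps a pixel column to the GPM owning its partition. The helper name `fb_owner` and the choice of equal-width vertical strips split along X are assumptions; the text only states that the FB is split into 4 partitions by screen-space coordinates, following the vertical Tile-level SFR mapping.

```python
def fb_owner(x: int, frame_width: int, num_gpms: int = 4) -> int:
    """Return the ID of the GPM whose DRAM holds pixel column x.

    Assumption: the frame is cut into num_gpms equal-width vertical strips.
    """
    if not 0 <= x < frame_width:
        raise ValueError("x is outside the frame")
    strip_width = -(-frame_width // num_gpms)  # ceiling division
    return x // strip_width
```

For a 1280-pixel-wide frame on 4 GPMs, each GPM would own a 320-pixel-wide strip, so color outputs are written by the ROPs of the GPM that owns the destination strip.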
Overhead Analysis
The major hardware component added to the existing multi-GPU system is the object-aware runtime distribution engine, which consists of a rendering time predictor, GPM counters and a batch queue. For the baseline multi-GPU architecture that we modeled for this work (Table 2), we allocate 64 bits for each counter to store the predicted rendering time and 16 bits for each batch ID. Additionally, to predict the total and elapsed rendering time, twelve 32-bit registers are used to track the triangle counts, the number of transformed vertices and the number of rendered pixels for the current batches. In total, we only require 960 bits of storage and several small logic units. We use McPAT [22] to evaluate the area and energy overhead of the added storage and logic units for the distribution engine. The area overhead is 0.59 mm^2 under 24nm technology, which is 0.18% of the area of a modern GPU (e.g., GTX1080). The power overhead is 0.3W, which is 0.16% of the TDP of the GTX1080.
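The 960-bit total can be reproduced with the following back-of-the-envelope Python calculation. The breakdown is an assumption consistent with the rest of the design: 4 GPMs with two counters each (total and elapsed), and a batch queue of 4 entries.

```python
# Assumed configuration: 4 GPMs, two 64-bit counters per GPM,
# a 4-entry batch queue with 16-bit IDs, and twelve 32-bit registers.
NUM_GPMS = 4
COUNTERS_PER_GPM = 2
COUNTER_BITS = 64
BATCH_QUEUE_ENTRIES = 4
BATCH_ID_BITS = 16
NUM_REGISTERS = 12
REGISTER_BITS = 32

counter_storage = NUM_GPMS * COUNTERS_PER_GPM * COUNTER_BITS  # 512 bits
queue_storage = BATCH_QUEUE_ENTRIES * BATCH_ID_BITS           # 64 bits
register_storage = NUM_REGISTERS * REGISTER_BITS              # 384 bits

total_storage_bits = counter_storage + queue_storage + register_storage
print(total_storage_bits)  # 960
```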
EVALUATION
We model the object-oriented VR rendering framework (OO-VR) by extending ATTILA-sim [10]. To get the object graphical properties (e.g., viewports, number of triangles and texture data), we profile the rendering traces from our real-game benchmarks as shown in Table 3. Then in ATTILA-sim, we implement the OO-VR programming model in its GPUDriver, the object distribution engine in its command processor, and the distributed hardware composition during the color writing procedure. To evaluate the effectiveness of our proposed OO-VR design, we compare it with several design scenarios: (i) Baseline: the baseline multi-GPU system with single programming model (Section 2); (ii) 1TB/s-BW: the baseline system with 1 TB/s inter-GPU link bandwidth; (iii) Object-level: the Object-level SFR which distributes objects among GPMs (Section 4); (iv) Frame-level: the AFR which renders an entire frame within each GPM; and (v) OO_APP: the proposed object-oriented programming model (Section 3). We provide results and detailed analysis of our proposed design on performance, inter-GPU memory traffic, sensitivity to inter-GPM link bandwidth and performance scalability over the number of GPMs.
Effectiveness On Performance
Fig. 15 shows the performance results with respect to single frame latency under the five design scenarios. We gather the entire rendering cycles from the beginning to the end of each frame and normalize the performance speedup to the baseline case. We show the performance speedup for a single frame because it is critical to avoid motion sickness for VR. From the figure, we have several observations.
First, without hardware modifications, OO_APP improves performance by about 99%, 39% and 28% on average compared to the Baseline, Object-level SFR and 1TB/s-BW, respectively. It combines the two views of the same object and enables multi-view rendering to share the texture data. In addition, by grouping objects into large batches, it further increases the data locality within one GPM to reduce the inter-GPM memory traffic. However, it still suffers from serious workload imbalance. For instance, object-level SFR slightly outperforms OO_APP when executing DM3-1280 and DM3-1600. This is because some batches within these two benchmarks require much longer rendering time than other batches, and the software scheduling policy alone in OO_APP cannot balance the execution time across GPMs without runtime information. Second, we observe that on average, OO-VR outperforms Baseline, Object-level SFR and OO_APP by 1.58x, 99% and 59%, respectively. With the software and hardware co-design, OO-VR distributes batches based on the predicted rendering time and provides better workload balance than OO_APP. It also increases the pixel rate by fully utilizing the ROPs of all GPMs.
We also observe that OO-VR could achieve similar performance to Frame-level parallelism, which is considered to provide ideal performance on overall rendering cycles for all frames (as shown in Fig. 7 (left)). However, in terms of single frame latency, Frame-level parallelism suffers a 40% slowdown while OO-VR significantly improves the performance.
Effectiveness On Inter-GPU Memory Traffic
Reducing inter-GPM memory traffic is another important criterion to justify the effectiveness of OO-VR. Fig. 16 shows the impact of OO-VR on inter-GPM memory traffic. Both Baseline and 1TB/s-BW have the same inter-GPM memory traffic, and Frame-level processes each frame in one GPM and has near-zero inter-GPM traffic. Moreover, since the memory traffic reduction is mainly caused by our software-level design, the inter-GPM traffic is the same under OO_APP and OO-VR. Therefore, Fig. 16 only shows the results for Baseline, Object-level and OO-VR, and we mainly investigate these three techniques in the following subsections. From the figure, we observe that OO-VR can save 76% and 36% of inter-GPM memory accesses compared to the Baseline and Object-level SFR, respectively. This is because OO-VR allocates the required rendering data to the local DRAM of GPMs. The majority of the inter-GPM memory accesses are contributed by the distributed hardware composition, command transmission and the Z-test during the fragment process. We observe that the delay caused by these inter-GPM memory accesses can be fully hidden by executing thousands of threads simultaneously in numerous shader cores. In addition, the data transfer via the inter-GPM links also leads to higher power dissipation (e.g., 10 pJ/bit for board or 250 pJ/bit for nodes based on different integration technologies [5]). By reducing inter-GPM memory traffic, OO-VR also achieves significant energy and cost savings.
Sensitivity To Inter-GPM Link Bandwidth
Inter-GPU link bandwidth is one of the most important factors in multi-GPU systems. Previous works [5, 25] have shown that increasing the bandwidth of the inter-processor link is difficult and requires high fabric cost. To understand how inter-GPM link bandwidth impacts the design choice, we examine the performance gain of OO-VR under a variety of link bandwidths. Fig. 17 shows the speedup under different link bandwidths when applying Baseline, Object-level SFR and our proposed OO-VR. In this figure, we normalize the performance to the Baseline with a 64GB/s inter-GPM link. We observe that the inter-GPU link bandwidth highly affects the Baseline and Object-level SFR design scenarios. This is because these two designs cannot capture the data locality within the GPM to minimize the inter-GPU memory accesses during rendering. The large amount of shared data across GPMs significantly stalls the rendering performance. In contrast, OO-VR fairly distributes the rendering workloads to different GPMs and converts numerous remote data accesses to local ones. By doing this, it fully utilizes the high-speed local memory bandwidth and is insensitive to the bandwidth of the inter-GPM link, even though the inter-GPM memory accesses are not entirely eliminated. As the local memory bandwidth scales in future GPU designs (e.g., High-Bandwidth Memory (HBM) [11]), the performance of future multi-GPU systems is more likely to be constrained by inter-GPU memory accesses. In this case, we expect that OO-VR can benefit future multi-GPU systems by reducing inter-GPM memory traffic.
Scalability of OO-VR
Figure 16: Normalized inter-GPM memory traffic under different design scenarios.
average over the single GPU processing. On the other hand, OO-VR provides scalable performance improvement by distributing independent rendering tasks to each GPM. Hence, with 4 and 8 GPMs, it achieves 3.64x and 6.27x speedup over the single GPU processing, respectively.
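To put the reported scaling numbers in perspective, the following Python snippet derives the parallel efficiency (speedup divided by the number of GPMs) from the 3.64x and 6.27x figures. The efficiency metric is an illustrative addition and is not reported in the evaluation itself.

```python
# Speedups over single-GPU processing reported for OO-VR.
reported_speedup = {4: 3.64, 8: 6.27}

# Parallel efficiency: fraction of ideal linear scaling that is achieved.
efficiency = {gpms: speedup / gpms for gpms, speedup in reported_speedup.items()}

for gpms, eff in sorted(efficiency.items()):
    print(f"{gpms} GPMs: {eff:.1%} of linear scaling")
```

This works out to roughly 91% of linear scaling with 4 GPMs and about 78% with 8 GPMs.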
RELATED WORK
Architecture Approach For NUMA Based Multi-GPU System. There have been many works [5, 21, 25] improving the performance of NUMA based multi-GPU systems. Some of them [5, 25, 43] introduce architectural optimizations to reduce the inter-GPM memory traffic for GPGPU applications, while Kim et al. [21] redistribute primitives to each GPM to improve performance scalability. However, none of them discusses the data sharing feature of VR applications. Our approach exploits the data locality during VR rendering to reduce the inter-GPM memory traffic and achieves scalable performance for multi-GPU systems.
Parallel Rendering. Currently, PC clusters are broadly used to render highly interactive graphics applications. To drive the cluster, software-level parallel rendering frameworks such as OpenGL Multipipe SDK [7], Chromium [19] and Equalizer [13, 14] have been developed. They provide application programming interfaces (APIs) to develop parallel graphics applications for a wide range of platforms. These works tend to split the rendering tasks during application development under different configurations. In our work, we propose a software and hardware co-designed object-oriented VR rendering framework for parallel rendering in NUMA based multi-GPU systems.
Performance Improvement For Rendering. In order to balance the rendering workloads among multiple GPUs, some studies [12, 18, 26] propose software-level solutions that employ the CPU to predict the execution time before rendering and adaptively determine the workload size. However, such software-level methods require a long time to acquire the hardware runtime information from GPUs, which causes performance overhead. Our hardware-level scheduler can quickly collect the runtime information and conduct real-time object distribution to balance the workloads. There are also some works [23, 38] designing NUMA-aware algorithms for fast image composition. Instead of implementing a composition kernel in software, we resort to a hardware-level solution that leverages hardware components to distribute the composition tasks across the multi-GPU system to enhance the pixel throughput. Meanwhile, many architecture approaches [4, 40, 41] have been proposed to reduce the memory traffic during rendering. Our work focuses on multi-view VR rendering in multi-GPU systems, which is orthogonal to these architecture technologies.
CONCLUSIONS
In modern NUMA-based multi-GPU systems, the low bandwidth of inter-GPM links significantly limits the performance due to the intensive remote data accesses during multi-view VR rendering. In this paper, we propose the object-oriented VR rendering framework (OO-VR), which converts remote inter-GPM memory accesses to local memory accesses by exploiting the data locality among objects. First, we characterize the impact of several parallel rendering frameworks on performance and memory traffic in NUMA-based multi-GPU systems. We observe high data sharing among some rendering objects, but the state-of-the-art rendering frameworks and multi-GPU systems cannot capture this feature to reduce the inter-GPM traffic. Then, we propose an object-oriented VR programming model that combines the two views of the same object and groups objects into large batches based on their texture sharing levels. Finally, we design an object-aware runtime batch distribution engine and a distributed hardware composition unit to balance the workloads among GPMs and further improve the performance of VR rendering. We evaluate the proposed design using a VR-featured simulator. The results show that OO-VR improves the overall performance by 1.58x on average and reduces inter-GPM memory traffic by 76% over the baseline case. In addition, our sensitivity study shows that OO-VR can potentially benefit future larger multi-GPU systems with ever-increasing asymmetry between local and remote memory bandwidth.
