Abstract Sort-last parallel rendering is widely used. Recent GPU developments mean that a PC equipped with multiple GPUs is a viable alternative to a high-cost supercomputer: the Fermi architecture of a single GPU supports uniform virtual addressing, providing a foundation for non-uniform memory access (NUMA) on multi-GPU platforms. Such hardware changes require the user to reconsider the parallel rendering algorithms. In this paper, we thoroughly investigate the NUMA-aware image compositing problem, which is the key final stage in sort-last parallel rendering. Based on a proven radix-k strategy, we find one optimal compositing algorithm, which takes advantage of NUMA architecture on the multi-GPU platform. We quantitatively analyze different image compositing modes for practical image compositing, taking into account peer-to-peer communication costs between GPUs. Our experiments on various datasets show that our image compositing method is very fast, an image of a
Introduction
Parallel rendering is an important technique for visualizing complex scenes in computer graphics, scientific visualization, CAD, and virtual reality. Parallel rendering distributes data to different processors, then sorts and composites locally rendered data to produce the output image. According to when the sort is performed, parallel rendering can take one of three main approaches: sort-first, sort-middle, or sort-last [10] . Unlike sort-first and sort-middle approaches, sort-last parallel rendering has the distinct advantage of high scalability and good load-balancing. Task division for parallel geometry processing and rasterization is also simple, which makes it a prime candidate to extend visualization software to high-performance parallel rendering. However, it requires the intermediate images from processing nodes to be composited to create the final image (i.e. image compositing) [2, 14, 15, 18] . For an image of no more than a few tens of megapixels, this is still a very time-consuming task even for a supercomputer. For example, compositing a 64 megapixel result at 32 K cores takes over 80 ms (much longer than usable for real-time applications) using the IBM Blue Gene/p Intrepid machine at Argonne National Laboratory or the Cray XT5 Jaguar at Oak Ridge [6] .
Multi-GPU uses two or more display adapters in the same PC to speed up graphical applications via parallel processing and rendering. Multi-GPU technology has come a long way in the last few years. Single GPU (such as the NVIDIA Fig. 1 Real-time rendering of time-varying data. Each 4-megapixel frame image is rendered on a 4-GPU PC in about 10 ms. Every colored object is rendered on different GPUs, and finally composited by four GPUs in parallel GeForce 400 series) with Fermi architecture support uniform virtual addressing [16] , which is the foundation of non-uniform memory access (NUMA) [17] architecture on multi-processor platforms. NVIDIA's CUDA 4.0 [13] contains a number of features that simplify the use of multiple GPUs within a workstation or computational node. The use of NUMA architecture [17] helps to alleviate the shared memory bus bottleneck on multi-processor platforms. To take advantage of these advances, we consider real-time NUMA-aware image compositing using such a multi-GPU platform.
The main challenge in NUMA-aware image compositing is how to control the transfer of image data, as the communication infrastructure provided by the PCIe bus is still a limit to system performance [17] . Implementing effective NUMA-aware image compositing is a non-trivial task for a number of technical reasons.
-In an ideal scenario, as the total number of processors increases in a system, the compositing throughput should scale proportionally, but this is difficult to achieve directly using CUDA. -Various approaches to image compositing have been proposed, such as direct-send [4, 12] , binary swap [7, 19] , and recent radix-k [6, 11, 14] . It is not obvious which is most appropriate for a multi-GPU platform. -Times taken for reading and writing operations between peer-to-peer GPUs are different, and this must also be taken into account.
In this paper, we discuss NUMA-aware image compositing for a multi-GPU platform, and solve these technical issues in a systematic way. Our contributions include:
-An optimal NUMA-aware image compositing strategy is proposed based on the proven radix-k approach, which takes advantage of the NUMA architecture on multi-GPU platforms. -We qualitatively analyze different image compositing modes, taking account of peer-to-peer communication costs between GPUs.
Experiments on various datasets demonstrate that the suggested image compositing approach is very fast. An image with a few tens of megapixels image can be rendered in a multi-GPU platform about 10 ms, providing a basis for realtime rendering of time-varying data (see Fig. 1 ).
Related work

Image compositing
In this review, we focus on sort-last parallel rendering, as this is our chosen approach. In sort-last parallel rendering [10] , object data are partitioned among M processors. The objects are rendered locally, and the resulting images are depth-sorted and composited in a final step. Stompel et al. [18] surveyed approaches to image compositing for this step, while Cavin et al. [2] analyzed the relative theoretical performances of these methods. These overviews show image compositing algorithms can be broadly divided into one of three categories: direct-send [4, 12] , binary swap [7, 19] , and hybrid approaches. Direct send compositing divides this final image gathering task into M tiles to avoid exchanging full-size images. Each tile belongs to and is composited by one processor; the composited tiles are eventually assembled together to form the final image. As expected, the complexity of this algorithm is O(M 2 ). The main advantage of direct send is its flexibility, since it can accommodate any number of processing nodes. It is very easy to implement and allows computation (i.e. pixel processing) to overlap with communications. However, it involves multiple processors sending messages to the same processor at the same time in an unpredictable and non-deterministic communication pattern.
The binary swap algorithm [7] is based on a binary tree compositing strategy which keeps every node busy in all stages of the process. It takes log 2 M stages to complete, where M should be a power of two to fully exploit parallelism. Binary swap uses fewer messages than direct send: O(M log 2 M) messages in total, assuming minimal overlap between rounds.
To overcome the disadvantages of binary swap and direct send, many improved methods have been presented in recent years. Yu et al. proposed [19] a 2-3 swap method for parallel volume rendering which combined the advantages of both direct send and binary swap. It avoids all-to-all communication and considers binary swap as a special case of a 2-3 swap algorithm.
Radix-k compositing [6, 11, 14] was later introduced as a configurable parallel image compositing method. The radices are a set of configurable parameters represented as a vector k = [k 1 , k 2 , . . . , k r ], where M = k i , and r denotes the number of communication and compositing rounds. During each round i, the M processing nodes are divided into M/k i groups of k i participants, which communicate only within their group in a direct send fashion. When all kvalues are equal to 2, this is equivalent to the binary swap algorithm. When there is a single k-value equal to M, there is only a single round and this is equivalent to the direct send algorithm. Radix-k compositing has been shown to perform better than the binary swap and 2-3 swap algorithms. However, the biggest obstacle to applying the radixk algorithm is the lack of a clear strategy for determining the vector k which provides optimal performance, since the choice of radices depends on network topology and hardware. Attempts have been made [6, 14] to find the best radices for particular hardware platforms by executing a series of benchmarks, rather than giving an analysis. Here, we consider the optimal radix-k approach from a theoretical point by considering the properties of multi-GPU platform.
Multi-GPU systems
To simplify programming of rendering on multi-GPU systems, Moerschell and Owens [9] presented a consistent, distributed, shared memory system. Recent parallel rendering researching mainly focuses on advances in multi-GPU clusters, such as ones based on InfiniBand fat-trees [1] . GPUs attached to different I/O ports cannot access each other on a peer-to-peer (P2P) basis, so this does not provide a true NUMA architecture. NVIDIA developed CUDA 4.0 [13] to support multiple GPU parallel rendering using NUMA architecture on a PC. However, taking full advantage of it requires careful analysis and thought.
Stephane et al. [8] implemented sort-last volume visualization on a multi-GPU system, and analyzed bottlenecks in and advantages of multi-GPU systems. Spafford et al. [17] quantitatively analyzed the benefits of NUMA architecture on a multi-GPU platform, and provided guidance on programming strategies to maximize performance. The NUMA architecture was initially analyzed by [5] for hybrid multi-GPU clusters to optimize asynchronous parallelization of rendering stages. In our case, we specifically consider the problem of image compositing on the NUMA multi-GPU platform. Image compositing must communicate image data using reading and writing operations. Writing speed is not equal to reading during P2P GPU access, and bidirectional throughput is usually less than twice that of unidirectional transfers: the PCIe bus between GPUs is not an ideal full duplex connection.
Some concepts relevant to NUMA architecture are introduced here for further usage: -Local GPU: the GPU to which the compositing result image belongs; -Neighboring GPU: a GPU which can be accessed by P2P
by the local GPU; -Remote GPU: a GPU which cannot be accessed by P2P
by the local GPU; -GPU-pair: two GPUs connected by PCIe switches and/or the I/O hub.
NUMA-aware compositing algorithm overview
We study how the radix-k algorithm can be used for greatest compositing speed. The keys to improving radix-k performance are: avoiding contention and reducing communication cost. As video memory throughput may be 20 or more times greater than PCIe throughput, we assume we can ignore the cost of local video memory accesses (due to computation) in many cases, and focus on quantifying the effect of P2P accesses (needed for communication) on image compositing.
PCIe link contention always occurs due to the complex communication between GPUs in NUMA architecture. As Fig. 3 shows, GPUs are interconnected by PCIe switches and the I/O hub, and data streams scatter or gather at these interconnected nodes. The PCIe bus is a duplex connection, and can communicate concurrently in both directions. However, when more than one data stream path overlaps in the same direction, PCIe link contention will happen, as illustrated by examples in Fig. 3 . Quantitatively approaching the link contention is critical to analyzing the communication cost.
Multi-GPU NUMA-aware image compositing
Intuitively, it sounds obvious that the bottom-up binary swap is the optimal strategy as the GPU system is organized in a binary tree, and the amount of data transferred by PCIe switches will be minimal using binary swap. However, the transferred data size is not the only factor which impact on the performance of compositing, the utilization of the PCIe switches matters too. The main disadvantage of binary swap in multi-GPU system is that there always exist some idle PCIe switches (or IOH) at each compositing stage. For example, only the bandwidth of the lowest level PCIe switches are full utilized at the first stage of binary swap, and all the others will be idle at this compositing round. Compared with binary swap, direct send method can keep the whole PCIe switches busy enough, although the data transferred is much large than binary swap. Qualitatively speaking, binary swap and direct send are two extreme cases of these two factors, and like a performance spectrum and without quantitative analysis, it is difficult to determine which extreme or intermediate status is the best.
The various k-vectors of radix-k exactly represent each status of the performance spectrum, thus, the best compositing strategy will be determined if we deduce the optimal k-vector. In the following section, we first find the optimal k-vector for the radix-k algorithm on NUMA architecture (Sect. 4.1), then we address how to build practical GPU-pair compositing modes by considering the reading and writing communication costs (see Sect. 4.2).
Optimal compositing on the multi-GPU platform
Enlightened by the approach of mathematical analysis of collective communication [3] , we define the image compositing cost function as follows:
where k is the k-vector in the radix-k algorithm, N is the number of pixels in each partial image which participates in composition, and B is a bandwidth vector related to the grouping approaches as explained later. Given a compositing strategy specified by a vector k, function (1) returns the minimal compositing time using this strategy. For example, f ( [8] , N, [L 3 ]) is the minimal compositing time using the direct send algorithm for eight 
and M is the number of GPUs, analysis of the radix-k procedure [14] shows that the following equation holds:
Here, we additionally define k 0 = 1 to unify the expression of the cost function. In Eq. (2), each item of the sum corresponds to the cost of one compositing round in the algorithm, and the total compositing time is the sum of these times. For NUMA architecture structured in a binary tree, we claim that the optimal radix-k compositing strategy is the binary swap method. We prove this claim using the above formulation in the following. 
Proof As all GPUs are linked in a binary tree, a PCIe switch (or the I/O hub) must be the lowest common ancestor (nearest the root) for the current k i GPUs. We call this ancestor A k i and suppose its pixel bandwidth is B k i as defined previously.
is just the cost of direct send for k i GPUs; each GPU is in charge of compositing N/k i pixels. As an intrinsic property of direct send, the number of pixels passing through A k i unidirectionally is (k i /2) 2 ·N/k i , so the following inequality holds:
In the worst case, each compositing step of a single GPU encounters link contention. This gives an upper bound to the direct send cost as follows (here we let r = log 2 k i − 1):
Next we find bounds on
Equation ( 
Using Eq. (5) and B k i−1 ≥ B k i (since B k maybe the bandwidth of IOH), the upper bound of f ([
A simple function G(k i , N) can be constructed as follows:
After substituting inequalities (4) and (8), we can conclude that G(k i , N) ≥ 0 when k i ≥ 4, proving Theorem 1.
In practice, it is extremely unlikely that equality is reached in inequalities (4) and (5), so we may assume
, we can find the minimal compositing time by recursively using Theorem 1 to decompose the vector k for any element k i ≥ 4. Eventually, all elements of k must be two, and this is exactly the k-value for binary swap. One more key point to note is that the grouping order of binary swap must be bottom-up, that is, each GPU should firstly swap pixels with its nearest partner. This grouping order will minimize the number of pixels crossing PCIe switches, and maximize the performance of binary swap. In summary, we arrive at the important conclusion that binary swap is the optimal image compositing method for a NUMA multi-GPU platform.
GPU-pair image compositing modes
Although we have found the theoretically optimal compositing strategy, we still need to consider many factors (e.g. PCIe communications) to arrive at the best modes in practical terms. In this section we consider various GPU-pair compositing modes, and quantitatively consider how NUMA architecture affects image compositing.
For a GPU-pair GPU 0 and GPU 1 , GPU 0 communicate with GPU 1 through one or more PCIe switches (or the I/O hub), and they can read or write each other's GPU memory. By removing obviously low performance cases, we arrive at three different compositing approaches as shown in Fig. 4 . On the left of Fig. 4 , each GPU in a pair mutually reads the color (C) and depth (D) images from its neighbor to local memory, and composites partial images locally: we call this case the mutual read compositing (MRC) approach. In the middle case, each GPU in a pair mutually writes color and depth images from local memory to their neighbors, then images are composited on the local GPU: we call this mutual write compositing (MWC). The right case is more complicated and involves hybrid inter-GPU communication. Firstly, each GPU writes depth information to the neighboring GPU's memory. Secondly, both GPUs read color images The compositing process involves many reading and writing operations between GPUs, and it is not straightforward to determine which approach is best. Performance variations occur between P2P reading and writing, and these directly affect compositing performance. Suppose the time for a GPU to read one pixel from another is α, and the time for writing is β. To describe the imbalance between reading and writing, we define the coefficient ε to be:
For a given multi-GPU system, ε is a constant parameter determined by the PCIe bus and system chipset. In addition to differences in reading and writing, another factor to consider is that PCIe is not an ideal full duplex bus. As noted by the manufacturer, and observed in practice, the bidirectional bandwidth of PCIe is less than twice its unidirectional bandwidth; further investigation also shows that there are performance variations between duplex P2P reading and writing. We define two further ratios as below to relate simplex and duplex P2P accesses:
where T DR (or T DW ) is the time taken by a GPU to read (or write) image data from its neighboring GPU in bidirectional (duplex) operation, and T SR (or T SW ) is the time for reading (or writing) n data items in unidirectional (simplex) operation. k α and k β thus describe the decrease in PCIe bandwidth when using duplex, relative to simplex. If duplex throughput is exactly twice that of simplex, then k α = 1 and k β = 1, but k α and k β are usually greater than 1, since duplex throughput is less than twice of simplex.
To quantitatively express the cost of these three compositing approaches, we show what happens over time in each in Fig. 5 . Assume that N 0 and N 1 (N 0 ≥ N 1 ) are the active pixel numbers of images I 0 and I 1 in GPU 0 and GPU 1 , where an active pixel is involved in the compositing process. I 0 and I 1 are separately composed of I 00 and I 01 , and I 10 and I 11 , respectively, where I xy referring subimage y on GPU x . We count the total number of active pixels in I 0 and I 1 as N for the final image. We can equally distribute the active pixels of I 0 (or I 1 ) into I 00 (or I 10 ) and I 01 (or I 11 ). Then the cost time of the three approaches is
All three cost expressions are the sum of two terms. The first term in Eqs. (13) and (14) represents the cost of bidi-rectional reading or writing. The second term is the time for unidirectionally transferring
pixels. Equation (15) is a hybrid of Eqs. (13) and (14) as it mixes reading and writing operations.
A key point concerns which pixels are active. In MRC, we have to read full resolution partial images since there is no information about active pixels. MWC and MWRC differ because the local GPU knows which pixels are active and can therefore avoid writing blank pixels. This is why the term N appears in Eq. (13) but not in Eqs. (14) and (15) . On the other hand, if we have already read depth images, only active pixels need be transferred when reading color images.
We have discussed the case of Z-buffer composition above. But most parallel volume rendering composites images by a sorting processes, which does not need a Z-buffer. If we directly composite images without depths, the compositing time expressions become
We can ignore the mutual write-read method in this case because of the absence of depth information. Fewer pixels are transferred between GPUs. Since k α , k β , α and β are all hardware platform dependent, we can draw the conclusion that T MRC , T MWC , T MWRC , T MRC−NZ , T MWC−NZ are all determined by the input active pixels in a given system. Specifying two input images, we can evaluate the above compositing time easily, and decide which composition approach should be used.
Experimental results
During the experiments, we mainly considered two issues. Firstly, we measured the parameters of our multi-GPU platforms since they are crucial in analyzing compositing performance. Secondly, the performance of different compositing modes was experimentally determined, for various kvectors, to verify our analysis and expectations.
Test environment
We used two different platforms to test our NUMA-aware image compositing studies. One was a PC and the other was a server node, equipped with four and eight GPUs, respectively in configurations shown in Table 1 . For simplicity, we call the 4 GPU system UD9 and the 8 GPU platform ESC4000. Although the ESC4000 was equipped with eight PCIe slots, their bandwidth was only X8 as all PCIe slots were filled. However, this degradation of bandwidth did not effect the scalability performance testing. The GPU architectures of the two platforms are shown in Fig. 6 . 
Determining parameters for a GPU-pair
To achieve the differences in P2P reading and writing, we used batched reading and writing to obtain mean performance for each GPU-pair. On each platform, we used different size images as test data, and computed mean access bandwidth for simplex reading (SR), simplex writing (SW), duplex reading (DR) and duplex writing (DW). As mentioned above, there are two types of bandwidth to be determined: one is the bandwidth of paths which only include PCIe switches, the other is for paths including the I/O hub. For example, the path between GPU 0 and GPU 1 on both UD9 and ESC4000 does not include the I/O hub, while the GPU-pair GPU (0,2) on UD9 and the pair GPU (0,4) on ESC4000 include the I/O hub. Thus, we separately list their P2P bandwidth performance in Table 2 . The values for, ε, k α , k β are constant for various applications, and can help to determine how to achieve optimal performance. For example, on the UD9 platform, ε of GPU (0,1) is about 1.18, i.e. the throughput of simplex P2P reading is 1.18 times higher than writing. k α = 1.48 for GPU (0, 2) indicates that the performance of reading from GPU 0 to GPU 2 is a factor 1.48 times lower in the duplex cases than in the simplex case. On the ESC4000 system, all three parameters for both GPU-pairs are closer to 1.0 than for UD9: the NUMA architecture has less impact here. k β for both systems is close to 1.0 for any GPU-pair, meaning that there is little effect when switching from simplex writing to duplex.
Image compositing performance tests
We tested the image compositing performance on UD9 and ESC4000 by using triangle meshes with over 10 8 triangles (see Figs. 1 and 7) . The final image size was set to 4 megapixels (2 K × 2 K), providing a rendering speed about 5 ms for the three image compositing approaches discussed. Such speed allows real-time rendering of time-varying mesh sequences as in Fig. 1 . We selected UD9 as our experimental platform for further study of how the NUMA architecture impacts image compositing (page limits preclude reporting on ESC4000 too). We chose UD9 as it had greater performance variances than ESC4000 so illustrates our analysis more clearly. Given fixed k α , k β , ε (see Table 3 ) and image resolution, the compositing time is determined by the number of active pixels in each of the two input images. Thus our test used images with vary numbers of active pixels to examine the validity of our approach proposed in Sect. 4.2.
To illustrate the variations in the three compositing approaches, we implemented binary swap in different ways as described earlier on UD9. The compositing times using four GPUs on UD9 are shown in Fig. 8 . All three implementations of binary swap have better performance than direct send. When the partial image of each GPU is sparse, as in viewport-1, the time taken by MRC is almost twice that of MWRC and MWC. This is because the costs of MWRC and MWC only depend on active pixels, while MRC depends on the full resolution of the input image, not just active pixels.
To obtain different active pixels for each GPU, we used two different camera views of the Kungfu Panda model. Each object mesh rendered by one GPU, and we set them as Monkey to GPU 0 , Shifu to GPU 1 , Tiger to GPU 2 , and Panda to GPU 3 , the background gray pixels could be treated as the blank. We refer to the first scene as viewport-1, and the second as viewport-2. In viewport-1 (or viewport-2), the ratio of active pixels for GPU 0 , GPU 1 , GPU 2 , and GPU 3 were, respectively, 10 % (or 15 %), 8 % (or 13 %), 17 % (or 21 %), and 19 % (or 33 %).
With a greater proportion of active pixels in partial images (going from viewport-1 to viewport-2), the compositing times for MWRC and MWC obviously increases, as shown in Fig. 8 ; the growth rates for MWRC and MWC are higher than for direct send and MRC. This accords with the prediction of our cost approach. Another detail we observed is that MWRC takes less time than MWC for viewport-1, but longer than MWC for viewport-2. This occurs because k α is much larger than k β on UD9, and the partial images in viewport-1 are sparser than in viewport-2. The slow duplex reading operation increases the costs of MWRC when there are more active pixels to composite.
Image compositing scalability tests
To test the scalability of our compositing approach, we chose ESC4000 to render a large volume dataset by ray casting (see Fig. 9 ). The volume data was the electromagnetic field generated by a particle in cell (PIC) simulation, with dimensions 720 × 720 × 960. Each grid point is a single-precision floating-point number, so the total size of this dataset is about 1.8 GB. We used the MWC mode in this test, as it only depends on active pixels.
Before ray casting the volume data, we divided the dataset using a KD-tree strategy, and statically distributed sub data blocks onto GPUs. The number of KD-tree leaf nodes was equal to the number of GPUs. Unlike polygon mesh rendering, we composited partial images by sorting the ray casting results, and used OVER/UNDER operators to blend each pixel using the GPU. Thus, the communica- There are four different compositing strategies for eight GPUs, and each strategy exactly corresponds to one kvector in the radix-k algorithm. Figure 10 illustrates all four k-vectors used in this test and their compositing times. The k-vector [2,2,2] corresponds to binary swap, while the kvector [8] corresponds to direct send. The whole composition is divided into the same number of rounds as the length of the k-vector. For example, the length of [2,2,2] is three, so there are three rounds in binary swap, and so there are three synchronization points in the composition procedure.
As shown in Fig. 10 , binary swap is much faster than direct send, and the other two methods ( [4, 2] and [4, 2] ) lie in between. Approach [4, 2] is better than [2, 4] because there are fewer pixels passing through the I/O hub. These experimental results confirm our analysis, and we can say with confidence that binary swap is the optimal compositing method for multi-GPU systems.
To measure the overall performance of parallel rendering, we did a scalability test on ESC4000. As binary swap needs a power-of-two threads, only cases with two, four and eight GPUs were measured, as shown in Fig. 11 . The overall rendering time has two parts: ray casting time and binary swap time. The test volume data was nearly 2 GB, and cannot be entirely loaded into a single GPU memory. Thus, we took the case of two threads as the baseline, and calculated speedup relative to this case. With two threads, it only took a short time to composite the two partial images. With increasing numbers of threads, the ray casting time became shorter, while the compositing time increased gradually.
If we ignore the cost of composition, the scalability of ray casting is high (indicated by the purple line in Fig. 11 ). The frame rate with eight GPUs rendering is about 3.3 times the rate for two GPUs. However, considering the compositing cost, the overall speedup is lower (indicated by the red line in Fig. 11 ). The cost of composition becomes the bottleneck with increasing numbers of rendering threads. As the PCIe bandwidth of ESC4000 is X8, this narrow throughput hampered the benefits of parallelism.
Conclusions and future work
We have considered image compositing on NUMA architecture multi-GPU platform. In this case, we have proved that binary swap is the best NUMA-aware image compositing strategy among all radix-k methods. To make best use of P2P communication, three GPU-pair compositing modes were studied, and we quantitatively evaluated the mathematical relationship between compositing time and the number of active pixels. Two hardware platforms were used to confirm the performance differences of P2P reading and writing, and all test results on various data supported our analysis.
This research offers some avenues for further exploration. For example, the NUMA-aware cost approach could also provide a theoretical basis for further optimizing compositing algorithms, potentially leading to more efficient compositing strategies.
