We accelerate a vertical plane radio wave propagation prediction application on GPUs. The application has abundant parallelism, but it exhibits uncoalesced memory accesses, divergent thread execution and contention for GPU register and memory resources -factors that make it challenging to accelerate on a GPU. We describe three acceleration strategies and optimizations to improve performance for each strategy. We evaluate the performance of the three strategies and their associated optimizations on GTX275 and GTX480 GPUs. We show that it is possible to obtain kernel speedups of up to 76× and overall application speedups of up to 48× over the sequential CPU implementation. However, the dynamic nature of data accesses in the algorithm makes the best strategy/optimizations dependent on input data and target GPU.
Introduction
General Purpose Graphics Processing Units (GPUs) are used to accelerate many applications in science and engineering. However, to achieve good performance, programmers must be aware of low-level architectural details of the GPU and restructure their code to ensure that memory accesses are coalesced, fast local memory is used, threads have no divergent execution and that resource usage is balanced for optimal performance [13] . For "GPU-friendly" applications that use simple data structures and exhibit "regular" behavior [16] , this restructuring is relatively well understood, application-independent and in some cases automated [3, 18, 19, 21] . However, such restructuring remains challenging and application-specific for large "real-world" applications that use complex data structures and have less regular behavior.
We explore the acceleration of one such application that implements an algorithm for predicting the propagation of radio waves. The algorithm is used to predict the strength of radio signals in an area around a transmitter, given the terrain/obstacles between the transmitter and each point in the area. The algorithm represents a critical component in commercial tools used to plan and optimize radio access networks, such as 4G cellular and Digital Broadcasting networks [5] . It is computationally intensive and is executed repeatedly to determine the best placement of a transmitter in a given area. Thus, it can benefit from acceleration on GPUs.
The prediction algorithm is hardly GPU-friendly. While there is abundant parallelism among the calculations of the received signal strength at points in the area of interest, these calculations exhibit considerable control-flow and dynamic memory access patterns that are both highly dependent on the distance and the terrain/obstacles between a point and the transmitter. This dependency results in highly irregular memory accesses and in divergent thread execution. This, in turn, makes it hard to achieve good acceleration on GPUs. Further, the calculations have a large working set, dictating careful balance of parallelism against the use of GPU resources, including registers and all levels of GPU memory.
We describe three strategies for accelerating the prediction algorithm on GPUs. For each of these strategies, we describe a number of code restructuring techniques, or optimizations, that address performance-degrading issues such as lack of memory coalescing, divergent execution and resource limitation. However, the optimizations do introduce overhead that may degrade their expected benefit. In the case of the prediction algorithm, the degree to which these optimizations can benefit performance and the impact of their associated overheads are dependent on both the input terrain data and on the target GPU. Such dependence makes it difficult to reason about the effectiveness of individual optimizations and makes it unclear as to which acceleration strategy is best and which set of optimizations is most beneficial to performance.
We experimentally evaluate the performance of the three strategies and their optimizations on two GPUs: GTX275 and GTX480 using 13 inputs. Our evaluation shows that: (1) in spite of its irregular nature, the prediction algorithm can benefit from GPU acceleration -we achieve a kernel speedup of up to 76× and an overall speedup of up to 48×, both with respect to a resonably-optimized sequential code and (2) the tradeoffs between the benefits and overheads of the optimizations are subtle and in some cases non-intuitive. This paper is organized as follows. Section 2 provides background on radio wave propagation and on NVIDIA GPUs. Section 3 describes the vertical plane algorithm and characterizes its sequential execution. Section 4 presents the three GPU acceleration strategies and optimizations. Section 5 reports and analyzes the performance results. Section 6 discusses related work and Section 7 concludes.
Background

Radio Wave Propagation Prediction
Radio wave propagation prediction has been an active research topic for several decades [15] . The approaches used for prediction can be empirical or physical [1, 9] . Empirical approaches use sample measurement of signal strength for prediction. They are computationally efficient, but with limited accuracy. Physical (also referred to as "deterministic") approaches rely on physical models of radio wave propagation. They take as input a digitized map of the environment to determine the signal loss between the transmitter and receiver(s) due to the impact of the terrain on the radio waves. Some physical approaches only take into account the obstacles on the direct transmitter-receiver path [10] , while other more complex methods build on the "direct path" approaches to also consider signal reflection and diffraction [11, 22] . All propagation prediction methods in rural or urban areas are based on an advanced modelling of the physical phenomena occurring in a vertical plane between the transmitter and the receivers. Consequently optimization of this part of the algorithm is paramount to reach operational use of such simulation tools. The algorithm we present in this paper takes a physical approach that only considers direct paths.
NVIDIA GPU
An NVIDIA GPU consists of multiple streaming multiprocessors (SMs), each of which contains a number of multiplyadd units, or cores, that execute instructions in a SingleInstruction Multiple-Thread (SIMT) fashion. The GPU supports a C-extended programming model, called Compute Unified Device Architecture (CUDA), that allows both host (CPU) code and device (GPU) code, or kernels, to be written in a single source program. A kernel is a regular C function that is executed by each GPU thread. Launching a kernel for GPU execution involves calling the kernel function (in the host code) along with a specification of the space of GPU threads that execute it, called a grid. A grid contains multiple thread blocks, each of which contains multiple threads. Each thread block is scheduled to run on an SM. It is split into groups of 32 threads called warps. Threads in the same warp are scheduled to run on the SM cores together.
All threads share the off-chip global memory; each thread block has a private on-chip shared memory, shared by all threads in the thread block. Each thread has private registers and local memory, which stores spilled registers. In addition, a Fermi GPU like GTX480 has a hardware-managed L1 cache in each SM [14] . The global and local memory takes hundreds of cycles to access while the shared memory is at close-to-register speed.
The first-order performance consideration on a GPU is usually how to reduce or hide the long access latencies to the global memory [13] , which can be achieved in three ways. First, the number of global memory transactions is greatly reduced if the threads in a warp (or warp threads) access the memory in a coalesced fashion so that accesses (made by warp threads) fall in a contiguous 128-byte segment whose base address is aligned to 128 bytes. Second, shared memory can be utilized to cache global memory accesses in the presence of data reuse. Shared memory does not suffer from uncoalesced accesses, but its throughput decreases in the presence of bank conflicts. Last, an SM can switch to execute other warps after issuing a blocking memory access for the current warp. Therefore, keeping a sufficient number of warps active on an SM allows global memory access latencies to be hidden behind useful work from other warps.
Another important GPU performance consideration is the execution inefficiency when warp threads take different paths [13] . Due to the SIMT nature of SM cores, warp threads must all execute the same instruction at any given time. In the presence of a data-dependent branch that causes different threads to follow different paths (known as divergence), the warp serially executes each path, disabling threads that do not take that path. Therefore, the SM cores can be under-utilized during divergent execution. Not only such divergence comes from if statements, it can also occur when each thread executes a varying number of iterations of a loop, in which case all warp threads must wait for the one that takes the maximum number of iterations. Figure 1 gives a high-level pseudo-code of the prediction algorithm. It takes as input the map data, which defines the terrain and the land usage in the area of interest, as well as the location and properties of the transmitter. The algorithm divides the map into a grid of receiver cells and calculates received power at every cell location on the map. It outputs a matrix, where each element stores the received power at the centre of the corresponding cell.
Sequential Algorithm
1. Load input maps into memory. 2. for each edge point E of the map do 3. Extract an RP from the transmitter to E.
4.
for each receiver point R on the RP do 5. Construct a VP from the transmitter to R, by truncating the RP. 6. Compute the power P at receiver R.
7.
Write P to the output matrix. 8 . 9 . end for Figure 1 . Vertical plane prediction algorithm.
end for
A key data structure for computing the received power is the Vertical Profile (VP), which is used to detect the obstacles between the transmitter and the receiver a;ong a vertical plane. It is extracted from the map data, and contains several arrays that store information about successive grid points from the transmitter to the receiver. For each point, this information includes the distance from the transmitter, the altitude, the land usage and the height above the ground of the point.
The algorithm groups the received power calculation for grid points that are along a radial, a direct path from the transmitter to an extreme receiver point on the edge of the map. This is because the VPs of these grid points can be obtained from a single Radial Profile (RP) with only small (and acceptable) approximation. The RP is no more than a VP from the transmitter to the edge point. For each radial, an RP is constructed first (line 3 of Figure 1) ; the VPs for other points on the radial are then constructed by simple truncation of the RP and slight adjustments of the last profile point (line 5), thus avoiding repeated map accesses. Figure 2 . First, the VP goes through several correction steps, e.g., to take into consideration the curvature of the Earth's surface, the imprecision of map data near the transmitter and receiver locations, and other environment-specific factors. One of these steps introduces extra points to the VP. Thus, a separate VP data structure must be constructed for each receiver cell instead of accessing the RP directly.
Second, the power attenuation due to object obstruction along the path (specified by the VP) is computed by modeling major obstacles (e.g., high buildings, hills, forest canopy) with one or more knife-edges -points that have the highest engagement with the first Fresnel ellipsoid of profile segments [7] . The main obstacle leads to a primary knifeedge. Other obstacles on its left and right lead to secondary knife-edges. The primary (first-order) knife-edge is determined with respect to the entire profile, while higher-order knife-edges are with respect to profile segments bounded by lower-order knife-edges. A knife-edge does not always exist in a profile segment. When this happens, the algorithm skips the search for higher-order knife-edges in the segment.
Finally, power attenuation at the receiver is computed by combining the attenuating effect contributed by each knifeedge using a formula adapted from [7] . This computed attenuation is then corrected based on the environment (e.g., urban vs. rural) and land shape (e.g., a hill). In the rest of the paper, we refer to all the computation associated with a VP (lines 5-7 of Figure 1) as a VP task. Figure 3a shows the breakdown of execution time into three phases: RP extraction (line 3 of Figure 1 ), VP construction (line 5) and received power computation (lines 6 and 7). The last phase is further divided into two: VP correction and the remaining knife-edge and attenuation computation (labelled "power computation"). We use two input configurations with the same terrain map but with different transmitter locations: at the center of the map and near one edge of the map. We make two observations from the figure. First, power computation is the most time-consuming phase (∼67% of the total time), and VP construction plus correction are non-negligible (∼20% of the total time). These phases are ideal targets of accelerations. Second, the program behavior is greatly affected by the input: a shift in transmitter location can change execution time by over 30%.
Runtime characteristics
Next we examine the execution time of VPs (lines 5-7 of Figure 1) . Figure 3b and 3c show the distribution of VP execution time. We observe that the execution time varies greatly across VPs and the variance is more prominent when the transmitter is off the center.
There are many causes for the high variance of VP execution time. One major factor is the VP length (i.e., the number of profile points) because the computation involves quite a few loops that iterate through VP points (Figure 2 ). Even for VPs of the same length, the processing, for example, of two third-order knife-edges (around a second-order knifeedge) iterates through only a segment of the profile, whose length depends on the location of the first-order knife-edge, which is in turn terrain-dependent. Such variance in execution time can result in thread divergence and load imbalance issues on the GPU, and must be taken in consideration.
GPU Acceleration
We elect to migrate to the GPU the entire algorithm except for the construction of RPs from map data (line 3 of Figure 1 ). The construction process takes only 6-7% of the total sequential execution time. It also makes irregularly strided accesses to map data because each RP is along a radial direction and the map data often consists of multiple structures, each representing only a portion of the terrain. This is expected to result in poor GPU performance. Thus, the program is re-structured to extract all RPs before processing any of them. Further, the two loops that iterate through the RPs and then the VPs (lines 2 and 4 of Figure 1 ) are collapsed into a single loop for better load-balancing.
The first acceleration strategy is to distribute the iterations of this collapsed loop among GPU threads, which means that each VP task is executed by a single thread. Since the execution path taken by each thread is highly dependent on the corresponding VP's length and data, this strategy suffers from divergence. Further, this strategy limits the use of shared memory on the GPU. This is because each thread needs a buffer to hold an entire VP, which can have thousands of elements. With a sufficient number of active threads per SM, the space needed by the VP buffers exceeds that of the SM's shared memory. Thus, the VP buffers are allocated in the slow global memory. We name this strategy gpu coarse and is described Section 4.1.
There is also fine-grained parallelism within a VP task, e.g., during knife-edge identification. Therefore, in the second acceleration strategy, called gpu fine, each VP task is performed in parallel by all threads in a thread block. The main advantage of this approach is that it eliminates divergence due to variation of VP lengths, because all threads in a warp always work on the same VP. However, the parallel loops in a VP task are interleaved with sequential computation, which leads to under-utilization of SM cores when executed by multiple threads. A side-effect of exploiting the fine-grained parallelism is the reduction of effective VP memory requirement for each thread, from one VP buffer per thread (in gpu coarse) to one VP buffer per thread block (in gpu fine). This opens the possibility of constructing and processing VPs directly in shared memory. We name this third strategy gpu fine smem. While it has the potential of speeding up memory accesses during knife-edge computation, it may limit the number of active thread blocks per SM due to the scarce shared memory resources.
We also parallelize the extraction of RPs on the multicore CPU, by distributing the task for individual RPs among four host threads. We also speedup the transfer of RPs from the host memory to the global memory, by packing the RPs in a single contiguous buffer.
The remainder of this section details the three GPU acceleration strategies, and the optimizations used to improve their performance and speed up host-device data transfer.
gpu coarse
In this strategy, a single thread is used to compute the power for a receiver point. Each thread needs two private buffers to store the VP and the knife-edge information respectively, both of which are large and must reside in global memory. Instead of launching as many threads as the number of receiver points, just enough of them are launched to achieve the maximum level of parallelism at each SM (under the resource constraints) and each thread processes multiple receiver points. In this way, the amount of global memory dedicated to threadprivate buffers is minimized. Since each thread process multiple VPs, potentially of different lengths, all VP buffers are large enough to hold the longest VP(s) across all receiver points. The global memory also holds all RP data that is transferred from the host memory.
Since the execution path and memory access behavior in received power computation are highly dependent on the VP's length and data, low memory throughput due to uncoalesced accesses and low ALU utilization due to divergent execution are two performance concerns. In the rest of the section, we describe three optimizations that address these concerns.
The first optimization targets the layout of VP buffers and ensures that VP accesses in received power computation (i.e., lines 5-7 of Figure 1 are as coalesced as possible. In the default layout, the VP buffer for each thread occupies a contiguous region. This layout leads to very poor memory coalescing because warp threads tend to access concurrently the same point location in their private VP buffers. The layout of the VP buffers is changed so as the ith profile point of the buffers for consecutive warp threads are placed consecutively. The new layout meets the strictest coalescing requirements as long as the base address of the entire VP buffer region is aligned to 128 bytes, which is guaranteed by cudaMalloc. This layout optimization is expected to consistently bring performance benefit across all inputs. We refer to it as Opt. 1.
It is important to note that this layout does not coalesce all VP accesses during knife-edge identification. While each warp thread always traverses its VP starting from the first point when locating the primary knife-edge, the start of such traversal for a higher-order knife-edge depends on the location of a lower-order knife-edge, which can be different across warp threads. Such "misalignment" can render all VP accesses for this knife-edge uncoalesced.
The second optimization also targets memory coalescing, but for VP and RP accesses in the phase of VP construction (line 5 of Figure 1 ). To construct a VP, a thread sequentially copies a portion of the RP, which resides contiguously in global memory, to the VP buffer. Given the optimized layout of VP buffers described above, the writes to VP buffers are coalesced because warp threads populate the same point index of their buffers at the same time. However, the reads from RP buffer(s) are not coalesced because different warp threads may read from different RPs, depending on how VP tasks (as defined at the end of Section 3) are distributed among the threads. However, by observing that the construction of a single VP can be done in parallel, we make warp threads cooperate to construct the 32 VPs they need one at a time. Since they now read from consecutive indices of the same RP, the RP reads are coalesced as long as each RP's base address is aligned to 128 bytes. Unfortunately, writes to VP buffers become uncoalesced because warp threads write to consecutive indices of the same VP, which are now strided in memory.
To resolve this issue, we redirect these uncoalesced VP writes to a shared memory buffer, which does not suffer from memory uncoalescing [19] . Instead of constructing a VP entirely before moving on to the next, warp threads load 32 consecutive elements from each of the 32 RPs, and store them in a column of a 32×32 shared memory buffer. The RP reads remain coalesced, and shared memory writes will achieve maximum throughput, as long as there are no bank conflicts. In our case, we pad the shared memory buffer to 32×33 to avoid such conflicts. Once the shared memory buffer is populated, warp threads dump the data to the VP buffers, where each thread writes to its own VP buffer, forming coalesced writes.
The above optimization improves memory access coalescing during VP construction at the expense of extra data transfer to and from the shared memory. Its performance benefit depends on the extent to which memory accesses are uncoalesced, which in turn is dependent on input data. We refer to this optimization as Opt. 2. Since it assumes the optimized layout of VP buffers, it can only be applied with Opt. 1.
The third optimization aims to reduce divergence across VP tasks, each of which is dominated loops whose tripcounts depend on the VP length. By default, tasks for consecutive VPs (of non-decreasing length) that belong to the same RP are assigned to consecutive threads. This distribution scheme causes divergence as the VPs processed by warp threads often have different lengths. Thus, we sort all VP tasks by their VP's length and distribute them cyclically among the threads. The sorting is performed on the host. While the new distribution scheme minimizes the length difference among VPs processed by warp threads, it breaks the "RP boundary" during distribution and may increase the number of different RPs accessed by warp threads during VP construction, resulting in poorer coalescing of RP reads (if Opt. 2 is not applied). Thus, for this optimization to be effective, the benefit of divergence reduction must outweigh the uncoalescing penalty, both of which depend on the input terrain. We refer to this optimization as Opt. 3 .
Length variation among concurrently processed VPs is not the only source of divergence, despite being a major one. Another source lies in the dependency among different orders of knife-edges. That is, if a knife-edge is not found, there is no need to identify higher-order knife-edges around it. While such a "branch" encloses a large body of code, its condition evaluation highly depends the terrain and can easily be different across warp threads. Since this condition cannot be pre-computed, we do not reduce such divergence.
gpu fine
The gpu fine strategy uses an entire thread block (as opposed to a single thread) to execute a VP task (as defined at the end of Section 3). This is possible because most VP traversals in VP construction and correction are parallelizable and the core loop in knife-edge identification (Figure 2 ) is essentially a parallel reduction. The iterations of these loops are simply distributed among the threads cyclically. In this way, warp threads always access consecutive points of the same VP and go through the same knife-edge dependencies convergently. This VP access pattern demands that each VP buffer occupies as contiguous memory region, so we use the default layout used in gpu coarse.
The total amount of memory needed is reduced because all threads in a thread block share a single VP buffer and a single knife-edge data structure. While the former remains in global memory, it is now possible to allocate the latter in shared memory without limiting parallelism.
Two optimizations are applied to gpu fine. First, the VP buffer layout does not meet the most strict coalescing requirement during the identification of higher-order knife-edges because these knife-edges are with respect to a segment of the profile whose base address may not be aligned to 128 bytes. Such misalignment means each warp memory access takes two transactions and effectively causes double-fetching of every group of 32 consecutive words in the segment. Since the two fetches are made in two consecutive loop iterations, it is unlikely for the L1 cache in the Fermi GPU to retain the data for such a long period. To address this inefficiency, we apply an optimization illustrated in Figure 4 . Instead of distributing profile points (among the threads) from the beginning of the profile segment, we start the distribution from a 128-bytealigned "virtual" position before the first point of the segment. In this way, we achieve full memory coalescing at the expense of having some threads of the first warp stay idle in their first iteration. While this optimization may increase the number of parallel loop iterations by one, it improves memory coalescing, the extent of which is input-dependent. We refer to it as Opt. 1. The second optimization of gpu fine addresses the problem that a VP task consists of a non-negligible amount of sequential computation interleaved between parallel computation. Such computation leads to under-utilization of GPU resources as it is done either redundantly by every thread in the thread block or by a single thread with synchronization at the end. A significant portion of such computation lies in attenuation computation, which works on at most 7 knifeedges and thus can feed at most 7 threads in a thread block. To increase execution efficiency, we use a single thread to perform the attenuation computation, as in gpu coarse. After identifying the knife-edges of a VP, the thread block dumps the knife-edge data structure to global memory (from shared memory) in parallel. In global memory, the thread block has a knife-edge buffer that is large enough to store N knife-edge data structures, where N is the thread block size (i.e., number of threads in a thread block). After processing N VPs, the thread block performs the attenuation correction phase for all these VPs, by assigning one knife-edge data structure to each thread.
While the above optimization improves execution unit utilization during power attenuation computation, it introduces the overhead of global memory accesses. Since both the benefit and overhead depend on the number of knifeedges identified in a VP and thus the input terrain, the performance impact of the optimization can only be determined empirically. We refer to this optimization as Opt. 2.
Storing knife-edge structures in shared memory, as opposed to in global memory, is certainly an alternative, but this lowers the potential gain by parallel execution because a 16KB of shared memory can hold at most 8 knife-edge structures per thread block. We do not pursue the implementation of this alternative.
gpu fine smem
Since there are multiple VP traversals in the phase of VP correction and knife-edge identification (Figure 2 ), it could be beneficial to bring the per-thread-block VP buffer to shared memory. The reason these buffers are left in global memory in gpu fine is that the small size of shared memory may not be able to fit even a single VP buffer, or at best severely limits the number of concurrent thread blocks running on an SM. However, we observe that the VPs have a wide range of lengths, and it is certainly possible to fit shorter VPs in shared memory without sacrificing the level of parallelism. This is the motivation of gpu fine smem.
In this strategy, we design a variant of the gpu fine kernel in which the VP buffer resides in shared memory only (i.e., there is no global memory counterpart). On the host, the list of VPs is divided into multiple bins, based on the VP lengths. A separate instance of the kernel is launched for each bin of VPs. For each kernel instance, the amount of shared memory used per thread block is determined (at runtime) based on the longest VP in the bin, and the kernel's launch configuration is set to achieve the highest level of parallelism under the shared memory constraint. Kernel instances for short-VP bins run at a higher level of parallelism than those for long-VP bins. For those VPs that are too long to be fit into the shared memory, the gpu fine kernel is used.
A critical aspect of this strategy is how to divide the VP lengths into bins. This is done backwards. First, a list of all possible kernel launch configurations is generated based on the register constraint (e.g., a limit of 32K registers per SM on GTX480). For each configuration, the maximum VP length it can handle is then determined based on the shared memory constraint. These maximum VP lengths (in ascending order) form the bin boundaries.
However, in this process, not all launch configurations are desirable to start from. Consider two configurations 8×32 and 6 × 64 (# of thread blocks per SM × thread block size). The second exhibits a higher level of parallelism than the first due to more threads per SM. It also allows for a longer VP buffer due to less number of thread blocks per SM. Hence the first configuration is considered "inferior" and should not be used.
Thus, to eliminate inferior configurations, we introduce a filter, that removes any configuration: 1) that compared to which there is another configuration that can handle longer VPs with a higher level of parallelism, or 2) whose level of parallelism is so low compared to that of the gpu fine kernel that it is not worthwhile to use shared memory.
However, the level of parallelism is not the sole indicator of performance. For example, a 6 × 64 configuration may not perform better than an 8 × 32 one because a larger thread block incurs a higher penalty during the execution of sequential portions of a VP processing task, and may introduce more divergence due to uneven distribution of loop iterations among threads. Thus, we also use a heuristic to estimate performance from the level of parallelism, by assigning a decreasing "score" to each additional warp added to a thread block. Clearly, there are many ways to design this heuristic and we describe one in the evaluation section.
Register usage reduction
It is important to minimize per-thread register usage for gpu coarse and gpu fine. This is the resource bottleneck that limits their levels of parallelism, which is critical to hide global memory access latency in these versions. Ensuring a high level of parallelism is more important in gpu fine due to the presence of fine-grained synchronization points between sub-phases of a VP task.
We apply two methods to reduce register usage: 1) passing hints to the compiler by setting launch bounds for each kernel, and 2) manually spilling variables with a long live range to shared memory, especially those that are invariant within the thread block, e.g., the VP length. On GTX275, we simply aim to reduce register usage until spilling occurs. However, the tuning is more complex on GTX480; the L1 cache significantly ameliorates the penalty of accessing spilled registers. Thus, increasing the level of parallelism with some register spilling may actually improve performance.
Host-device data transfer
In all three GPU acceleration strategies described above, a list of RPs is extracted from the map on the host and transferred to the global memory. In our initial implementation, each RP is allocated separately in global memory and the data transfer occurs right after the RP is constructed, thus requiring only a single RP buffer in the host memory. However, since the number of RPs is typically in the thousands, this scheme greatly under-utilizes the PCIe bandwidth. As an optimization, we construct (on the host) a single buffer that contains all RPs and perform a single transfer to the global memory. Since the RPs are of varying lengths that are only known at runtime, the buffer starts with a header that contains offsets pointing to the beginning of individual RPs. Further, we explicitly insert padding between RPs to ensure that the start of each RP is alighted to 128 bytes for coalesced accesses in VP construction.
Evaluation
We evaluate the performance of the three GPU acceleration strategies on two GPUs: an NVIDIA Geforce GTX480 (with 1.5GB of device memory) and an NVIDIA Geforce GTX275 (with 1.0GB of device memory). Both GPUs are hosted in a system with an 3.2GHz Intel Core i7-960 and 6GB of main memory and that runs Ubuntu 10.04 64-bit with CUDA 4.0 (device driver 270.41.19).
We use 13 input terrain configurations, characterized in Table 1 . For each input, the table shows the baseline execution time when the program is run single-threaded on the Core i7. The execution time is divided into three components; "kernel" represents the time of the code targeted to the GPU, including VP construction, VP correction and power computation.
For GPU performance, we report kernel speedup with respect to the equivalent "kernel" component of sequential execution shown in Table 1 , compare the three GPU strategies, and finally present the overall application speedup. The absolute GPU execution times can be obtained using these speedup numbers in combination with Table 1 . Figure 5 shows the kernel speedup of gpu coarse on both GPUs, under all possible combinations of the three optimizations described in Section 4.1: 1) VP access coalescing, 2) RP access coalescing, and 3) sorting VP tasks by VP length before distribution to GPU threads. We refer to these optimizations as Opt. 1, 2 and 3 respectively. Further, each variant of gpu coarse shown in Figure 5 is named as "K" followed by the set of optimizations applied. For example, "K12" refers to the version where Opt. 1 and 2 are applied to the base version (named "Unoptimized"). In all these experiments, GTX480 is configured with 16KB of L1 cache. The impact of the cache size is explored later. The launch configuration used in these experiments is described in Section 5.5. We make several observations from the figure. First, the unoptimized version accelerates the kernel, but only by a small amount. Opt. 1 brings a significant speedup on top of the unoptimized version (2.8× -7.8× on GTX480 and 2.4× -20.8× on GTX275). This is because the optimization reduces the number of warp memory transactions for VP accesses during received power computation from potentially 32 to 1. The speedup is lower on GTX480 because of the L1 cache that ameliorates the penalty of uncoalesced accesses.
gpu coarse
The second observation is on the individual impacts of Opt. 2 and 3. K12 performs worse than K1 for almost all inputs on both GPUs. Thus the benefit of coalescing RP reads during VP construction does not overcome the overhead of extra shared memory loads and stores introduced by Opt. 2. This is because the default distribution scheme of VP tasks often assigns VPs that belong to the same RP to warp threads, leading to coalesced RP reads. Similarly, for all inputs on both GPUs, K3 performs worse than the unoptimized version, and K13 performs worse than K1. Thus that the amount of divergence reduction is not sufficient to compensate for the penalty of uncoalesced RP reads introduced by Opt. 3. Our third observation concerns the interaction between Opt. 2 and 3. Although applying Opt. 2 degrades the performance of K1, it improves the performance of K13 by a large margin. Thus the benefit of Opt. 2 is realized only when Opt. 3 is applied. This is because the optimized distribution scheme, that sorts VP tasks by VP length, increases the number of different RPs read by warp threads, thus the degree of read uncoalescing and the benefit of Opt. 2. Similarly, although applying Opt. 3 degrades the performance of K1, it improves the performance of K12 for most inputs. This means that Opt. 3 is beneficial only when Opt. 2 is applied to reduce the uncoalescing penalty introduced by Opt. 3. Thus, Opt. 2 and 3 should be applied together.
Our fourth observation is that Opt. 3 brings noticeable benefit to K12 only for inputs 1a-2, which contain relatively short VPs. This surprising result leads us to analytically assess the impact of divergence reduction brought by Opt. 3. We model performance assuming that the execution time of 32 VPs concurrently processed by warp threads is proportional to the length of the longest VP among them. The modelled speedup of Opt. 3 confirms that the benefit of reducing divergence is large (up to 35%) for inputs with short VPs but is much less (¡ 5%) for those with long VPs. This is because the majority of execution time is consumed in processing long VPs, where the length variation among consecutive VPs in an RP (under the default VP task distribution scheme) is less significant.
Combined with the second observation, it is not hard to see why K123 performs better than K1 only for inputs 1a-2 on GTX275. Under the default VP task distribution scheme, the degree of RP read uncoalescing during VP construction and divergence due to VP length variation is not severe enough to overcome the overhead of Opt. 2 and 3.
Last, we observe that GTX480 does not perform better than GTX275: 6% faster for K1 and 8% slower for K123, on average, despite being theoretically almost twice as powerful. This is mainly due to the use of software implementations of some math operations (e.g., division and log) to ensure numeric compatibility with the sequential code. These routines for GTX480, which are fully IEEE-754 compliant, are more expensive than those for GTX275 (which are still not compliant). Indeed, should the fast math intrinsics be used on both GPUs (ignoring numeric compatibility), GTX480 is about 32% faster than GTX275. We also examine the impact of L1 cache on GTX480. Figure 6a shows the speedups of K13 with 0KB, 16KB and 48KB L1 cache and compares them against K123, which uses shared memory explicitly. For all inputs except 4a-b and 4d-e, a 48KB L1 out-performs a 16KB L1 by 50% on average. Surprisingly, not using L1 at all also out-performs a 16KB L1 by 29% on average. This is because memory accesses cached in L2 only are serviced with 32-byte transactions, while those that are also cached in L1 are serviced with 128-byte transactions [14] . Given the high degree of RP read uncoalescing in K13, shorter transactions improve performance by reducing over-fetch. In contrast, the performance for inputs 4a-b and 4d-e improves incrementally as the L1 size grows from 0KB to 48KB, and quickly out-performs the shared memory version. Further, the performance increase is more modest than that for other inputs. This suggests that the penalty of RP read uncoalescing is low in these inputs, which is confirmed by the measurement of average number of different RPs accessed by warp threads during VP construction, shown in Figure 6b .
For version K1, a 16KB L1 out-performs 0KB L1 by 9% on average, while a 48KB L1 does not bring any additional benefit over a 16KB L1. This is expected as the degree of RP read uncoalescing in K1 is low. Figure 7 shows the kernel speedup of gpu fine on both GPUs, with the two optimizations described in Section 4.2 applied incrementally. They are: 1) VP access alignment during higher-order knife-edge identification, and 2) per-thread execution of power attenuation computation. We refer to them as Opt. 1 and 2 respectively. The gpu fine variants shown in Figure 7 follow the same naming convention as in Section 5.1. GTX480 is configured with 48KB of L1 cache. The launch configuration used to obtain these results is described in Section 5.5. We make two observations from the figure. First, Opt. 2 brings an average speedup of 1.1× on GTX275, while it degrades performance by 3% on GTX480. Thus the benefit of aligned VP accesses introduced by this optimization does not outweigh the overhead of an additional loop iteration by a large margin. In the presence of L1 cache in GTX480, the benefit is further reduced. Indeed, the second fetch of a 128-byte line due to misaligned VP accesses is very likely to be from the cache, given that the memory requests made by a whole thread block are made to a single VP.
gpu fine
Our second observation is that Opt. 3 improves the benefit of Opt. 2 by an average of 5% and 10% on GTX480 and GTX275 respectively. This low speedup means that the overhead of dumping each knife-edge array structure to global memory reduces the benefit of higher execution efficiency during attenuation computation. Figure 8 shows the kernel speedup of gpu fine smem on both GPUs, with and without the launch configuration filter. The heuristic that estimates performance from the parallelism level is configured so that the first warp in a thread block is assigned a score of 10 and each additional warp in the block gets a decreasing score in steps of 2. Thus the scheme does not reward performance beyond 5 warps. Further, this fiter eliminates configuration whose performance score is less than 2/3 of that of gpu fine. Figure 8 shows that the filter does noticeably improve the performance of the kernel. The extent of this improvement depends on the input, in particular, its VP lengths. For inputs with short VPs (1a-1c, 2) the performance impact of the filter is limited. However, for those with longer VPs (1d, [3] [4] [5] , the performance improvement gained by using the filter becomes higher. It turns out that longer VPs cause some instances of the gpu fine smem kernel to lose parallelism to the extent that the filter replaces them with instances of gpu fine 1 . This suggests that gpu fine smem performs worse than gpu fine. This observation is further confirmed by the higher impact of the filter on GTX275 (compared to GTX480) for inputs 1d and 3-5; the smaller shared memory on GTX275 causes more instances of gpu fine smem to be replaced with gpu fine. Indeed for inputs with very long VPs (4d-f), the VPs are too long to fit in the shared memory and gpu fine smem instances are replaced by gpu fine, making the filter less effective than on medium inputs like 4a-b.
gpu fine smem
A comparison of Figures 7 and 8 shows that the performance of gpu fine smem is consistently worse than that of gpu fine and that the use of the filter simply improves the performance of gpu fine smem making it approach that of gpu fine. This is consistent with our observation above since the use of the filer simply replaces instances of gpu fine smem with instances of gpu fine.
The surprisingly poor performance of gpu fine smem can be attributed to two reasons. First, each iteration of the core loop during knife-edge identification has a high computation-to-data ratio. On GTX480, it has around 50 arithmetic instructions (partially due to the use of software division routines) that depend on 3 global memory loads. With 16 warps per SM, the arithmetic instructions can hide the majority of access latency of the loads. Hence the benefit of putting VPs in shared memory is significantly reduced. Second, shared memory becomes a resource bottleneck that limits the level of parallelism in gpu fine smem, which remains crucial to the performance of phases that make frequent global memory accesses (e.g., VP construction). Further, the loss of parallelism is most severe when processing long VPs, which takes the majority of kernel execution time. Therefore, this parallelism-induced performance loss can easily outweigh the small improvement of memory performance.
Register usage tuning
For all variants of gpu coarse, the reduction of register usage through the specification of launch bounds allows the level of parallelism to increase from 256 threads per SM to 320 threads per SM on GTX275. As a result, the performance of version K1 is improved by 5% (on average) for inputs 4a-c. Applying launch bounds degrades performance on GTX480 because there are already too many (16 for K1) spilled registers before its use. This optimization is not included in the performance numbers shown in Section 5.1, because the increase of parallelism level leads to more VP buffers, which can no longer fit in the global memory for some inputs.
For all variants of gpu fine, the reduction of register usage through the specification of launch bounds allows the level of parallelism to increase from 8 blocks × 64 threads/block per SM (denoted 8×64) to 8×96 on GTX480, and from 8×32 to 8×64 on GTX275. As a result, the performance of the unoptimized version is improved by 12% and 54% (on average) respectively. The lower gain on GTX480 can be attributed to 16 registers spilled to local memory. Nonetheless, it is still surprising to see such a gain under this much spilling. This shows the benefit of the L1 cache. It is actually more beneficial to increase the number of thread blocks per SM instead of the thread block size, because a larger thread block implies higher penalty when executing the sequential portion of a VP task. However, hardware constraints limit us to 8 thread blocks per SM. This optimization is included in the performance numbers shown in Section 5.2.
The above launch configuration for gpu fine on GTX480 was obtained by an empirical search through candidate configurations, shown in Figure 9 . The legend format is:
<# tblks per sm> x <tblk size> (<# regs> + <# spilled regs>)
We observe that the number of registers spilled (up to 16) incurs almost no performance penalty, as the three 8×96 configurations perform roughly the same. In contrast, the performance is mainly affected by the level of parallelism; it gets better as more thread blocks of size 128 are launched despite an increasing number of spilled registers. Figure 9 . Launch configuration tuning of gpu fine on GTX480. Figure 10 compares the performance of gpu coarse (with only Opt. 1 applied), gpu fine (with all optimizations applied) and gpu fine smem (with the filter applied), using gpu coarse on GTX480 as a reference. Figure 10 . Performance comparison of three GPU acceleration strategies.
Comparative kernel performance
gpu fine is faster than gpu coarse only for inputs 4b, c, f on GTX275. This is attributed to the fact that memory uncoalescing and thread divergence in gpu coarse are not severe for the inputs we use, and the efficiency loss in gpu fine due to sequential portion(s) of the VP processing algorithm can easily outweigh the slight reduction in divergent behavior. However, it must be emphasized that this trade-off is highly dependent on the input terrain. gpu fine smem performs no better gpu fine, despite the help of a filter that replaces sharedmemory variants of gpu fine with the original gpu fine, as was described in Section 5.3. Figure 11 . Execution time breakdown of gpu coarse. Figure 11 shows a percentage breakdown of the total execution time of gpu coarse with only Opt. 1 applied on GTX480. After acceleration, the kernel component takes less than half of the total time for all inputs and sometimes even below 10%. Host-device data transfer and RP extraction consume most of the remaining time. Thus, we apply the optimization that packs the RP list into a contiguous buffer, and parallelize the RP extraction phase on the Core i7, as described in Section 4. Figure 12 shows the impact of the two optimizations on the overall speedup of gpu coarse (with Opt. 1), when the kernel runs on GTX480. 
Overall performance
Multicore acceleration
We also accelerate the VP computations on a multicore. We start from the sequential code and only parallelize the outer loop, mainly because the iteration count of the outer loop is sufficiently high and to ensure coarse-grain paralellism. We use Pthreads [12] and distribute the iterations of the outer loop across the threads in a round-robin fashion to achieve good load balance with little run-time overhead. We use the 4-core Intel i7-960 system that hosts the GPUs and run only one thread on each core. The speedup results for the multicore platform are shown in Figure 13 . We vary the number of threads from 1 to 4. The figure shows that for all inputs, with the exception of the input 2, almost ideal speedups are observed. For the input 2, kernel execution time is very short ( Figure 13 . Speedup achieved on Intel Core i7-960 platform.
Related Work
Song et al. [17] speedup the execution of the (physical) Longley-Rice model on GPUs [10] . They also examine only direct paths between the transmitter and receivers and also use a multiple knife-edge method to calculate diffraction loss. However, they use the Epstein-Peterson method [8] , which is faster but less accurate than the Deygout method we use [6] . To our best knowledge, our work is the first to present a parallel implementation of Deygout's method. They experiment only with using a single thread to calculate the path loss of one output point. In contrast, we investigate fine-grain parallelism and use multiple threads to calculate knife-edges of a single point, which is beneficial for some inputs. Further, we achieve higher overall application performance. Ray-tracing [11] and finite-difference time-domain methods [22] give more accurate radio wave propagation prediction, but they are very computationally intensive, and until recently, have been limited to small-scale areas. In recent years, these techniques have been mapped to GPUs, using CUDA [2, 20] and traditional graphics API [4] .
Conclusion and Future Work
We explore GPU acceleration of a vertical plane algorithm for radio wave propagation. The algorithm has abundant parallelism, but it exhibits uncoalesced memory accesses, divergent thread execution and contention for GPU register and memory resources; factors that limit performance. We describe three acceleration strategies and optimizations to improve their performance. The benefits and overheads of the strategies and their optimizations depend on the input data and the target GPU, leaving it unclear which strategy and which set of optimizations result in better performance. Thus, we experimentally evaluate the performance of the three strategies in an application that implements the algorithm on GTX275 and GTX480 GPUs. We show that it is possible to obtain kernel speedups of up to 76× and overall application speedups of up to 48× over a highly-optimized CPU implementation. However, the best acceleration strategy depends on the input and the target GPU.
There are several directions for future work. An autotuning framework can be explored to determine the best strategy for given input data and GPU. It is also possible to further reduce memory uncoalescing and divergent execution in some parts of the algorithm that we did not explore. Finally, it is interesting to examine the acceleration of other algorithms for radio wave propagation prediction, particularly those based on ray tracing.
