Performance Impact of Data Layout on the GPU-accelerated IDW
  Interpolation by Mei, Gang & Tian, Hong
arXiv:1402.4986v1 [cs.DC] 20 Feb 2014
Performance Impact of Data Layout on the GPU-accelerated
IDW Interpolation
Gang Mei · Hong Tian
Abstract This paper evaluates the performance impact of different data layouts on GPU-accelerated IDW interpolation. First, we redesign and improve our previous GPU implementation, which exploited CUDA Dynamic Parallelism (CDP). Then, we implement three GPU versions, i.e., the naïve version, the tiled version, and the improved CDP version, based on five layouts: the Structure of Arrays (SoA), the Array of Structures (AoS), the Array of aligned Structures (AoaS), the Structure of Arrays of aligned Structures (SoAoS), and the Hybrid layout. Experimental results show that the layouts AoS and AoaS achieve better performance than the layout SoA for both the naïve version and the tiled version, while the layout SoA is the best choice for the improved CDP version. We also observe that the two combined data layouts (the SoAoS and the Hybrid) bring no notable performance gains over the three basic layouts. We recommend the layout AoaS for practical applications, since the tiled version is the fastest of the three GPU versions, especially on single precision.
Keywords GPU · Data Layout · IDW Interpolation ·
CUDA Dynamic Parallelism
G. Mei
Institute of Earth and Environmental Science, University of Freiburg,
Albertstr.23B, D-79104, Freiburg im Breisgau, Germany
Tel.: +49 761 203 6477
Fax: +49 761 203 6496
E-mail: gangmeiphd@gmail.com
H. Tian
Faculty of Engineering, China University of Geosciences (Wuhan),
No.388 Lumo Road, 430074, Wuhan, China
E-mail: htian2011@hotmail.com
1 Introduction
Data layout is the form in which data are organized and accessed in memory when operating on multi-valued data such as sets of 3D points. Selecting an appropriate data layout is a crucial issue in the development of GPU-accelerated applications. The performance of the same GPU application may differ drastically depending on the data layout used; see the example of sorting structures demonstrated with Thrust [3].
Typically, there are two major choices of data layout: the Array-of-Structures (AoS) and the Structure-of-Arrays (SoA) [4]; see Figure 1. Organizing data in the AoS layout leads to coalescing issues because the data are interleaved. In contrast, organizing data according to the SoA layout can generally make full use of the memory bandwidth because there is no data interleaving. In addition, global memory accesses based upon the SoA layout are coalesced when consecutive threads access consecutive elements. These two layouts are probably the most basic and simplest memory access patterns. More complex data layouts such as AoSoA [1] and SoAoS [18] can be formed by combining the basic layouts AoS and SoA.
As noted above, memory access patterns are critical for the performance of GPU-accelerated applications. However, it is not always obvious which data layout will achieve better performance for a specific application. For example, to evaluate the performance of the SoA and AoS layouts, Govender et al. [6] ran a simulation of 2 million particles using their discrete element simulation framework and found that AoS is three times slower than SoA, while an opposite argument was presented in [5]: in the library framework OP2, Giles et al. [5] preferred the AoS layout for storing mesh data to obtain better memory access performance. In practice, a common solution is to implement a specific application using each of the two layouts separately and then compare the performance.
The Inverse Distance Weighting (IDW) interpolation algorithm, originally proposed by Shepard [17], is one of the most commonly used spatial interpolation methods in Geosciences. Typically, implementing spatial interpolation with conventional sequential programming is computationally expensive for large data sets. To improve the computational efficiency, several efforts have been made to develop efficient implementations of the IDW interpolation in various massively parallel computing environments on multi-core CPU [2,7,11] and/or GPU platforms [8,10,12,22].
In our previous work [13], we presented two GPU implementations of the standard IDW interpolation algorithm: the tiled version, which took advantage of shared memory, and the CDP version, which was implemented by exploiting CUDA Dynamic Parallelism (CDP). We found that the tiled version achieved the highest speedups over the CPU version. However, the CDP version was 4.8x to 6.0x slower than the naïve GPU version. Those experimental tests were performed only on single precision.
In this paper, we focus on evaluating the performance impact of different data layouts when implementing the IDW interpolation on the GPU. We first redesign the CDP version to avoid the use of the atomic operation atomicAdd(), and then test three GPU implementations, i.e., the naïve GPU version presented in [12], the tiled version described in [13], and the improved CDP version introduced in this paper, on both single precision and double precision. In our previous work [13], the GPU implementations were developed according to the data layout SoA. To evaluate the impact of other data layouts such as AoS, we also implement these GPU versions based upon the AoS layout and other combined data layouts such as SoAoS [18], and then test their performance on single and double precision.
In summary, we make the following contributions in this paper:
(1) Redesign the CDP version originally presented in [13] to improve its efficiency.
(2) Implement several groups of the three GPU versions based upon several different data layouts on both single and double precision.
(3) Evaluate the performance of the groups of GPU implementations developed according to the different layouts on both single and double precision.
This paper is organized as follows. Section 2 gives a brief introduction to the IDW interpolation and two basic data layouts, the SoA and AoS. Section 3 concentrates on the GPU implementations that are performed by using five different data layouts. Section 4 presents some experimental tests that are performed on both single and double precision, and discusses the experimental results. Finally, Section 5 draws some conclusions.
2 Background
2.1 IDW Interpolation
The IDW algorithm is one of the most commonly used spatial interpolation methods in Geosciences. It calculates the interpolated values of unknown points (prediction points) as a weighted average of the values of known points (data points). The name of this type of method comes from the weighted average applied: the weights are calculated from the inverse of the distance to each known point. The different forms of IDW interpolation differ in how they calculate the weights.
A general form of predicting an interpolated value Z at a given point x based on samples $z_i = Z(x_i)$ for i = 1, 2, . . . , n using IDW is an interpolating function:
$$
Z(x) = \frac{\sum_{i=1}^{n} \omega_i(x)\, z_i}{\sum_{j=1}^{n} \omega_j(x)}, \qquad \omega_i(x) = \frac{1}{d(x, x_i)^{p}} \tag{1}
$$
The above equation is a simple IDW weighting function, as defined by Shepard [17], where x denotes a prediction location, x_i is a data point, d is the distance from the known data point x_i to the unknown prediction point x, n is the total number of data points used in the interpolation, and p is an arbitrary positive real number called the power parameter (typically, p = 2).
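Equation (1) can be sketched as a sequential reference routine. The following C++ is a minimal illustration, not the authors' code: the names `Sample` and `idw_predict` are hypothetical, and the sketch assumes that distances are measured in the x-y plane with z as the value being interpolated.

```cpp
#include <cmath>
#include <vector>

// Hypothetical sample type: (x, y) is the location, z the known value there.
struct Sample {
    double x, y, z;
};

// Evaluate Equation (1) at the prediction point (px, py).
// Assumes `data` is non-empty; p is the power parameter (p = 2 by default).
double idw_predict(const std::vector<Sample>& data, double px, double py,
                   double p = 2.0) {
    double sum_w = 0.0;   // denominator: sum of the weights
    double sum_wz = 0.0;  // numerator: sum of the weighted values
    for (const Sample& s : data) {
        const double d = std::hypot(px - s.x, py - s.y);
        if (d == 0.0) {
            return s.z;  // the prediction point coincides with a data point
        }
        const double w = 1.0 / std::pow(d, p);
        sum_w += w;
        sum_wz += w * s.z;
    }
    return sum_wz / sum_w;
}
```

With two data points at (0, 0) and (2, 0) carrying the values 1 and 3, the midpoint (1, 0) is equidistant from both, so the routine returns their plain average.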
2.2 Data Layout
In GPU computing, an optimal pattern of accessing data can significantly improve overall performance by minimizing the number of memory transactions on the off-chip global memory. Thus, one of the key design issues for generating efficient GPU code is selecting a proper data layout when operating on multi-valued data such as sets of points or pixels. In general, there are two major choices of data layout: the Array-of-Structures (AoS) and the Structure-of-Arrays (SoA); see Figure 1.
struct Pt {
    float x;
    float y;
    float z;
};
struct Pt myPts[N];
(a) AoS
struct Pt {
    float x[N];
    float y[N];
    float z[N];
};
struct Pt myPts;
(b) SoA
Fig. 1 Data layouts: Array-of-Structures (AoS) and Structure-of-Arrays (SoA)
Organizing data in the AoS layout leads to coalescing issues because the data are interleaved. Multi-dimensional and multi-valued data containers lead to strided memory accesses in the one-dimensional address space, and cause exactly this problem. For example, performing an operation on the set of 3D points illustrated in Figure 1(a) that requires only the variable x wastes about 66 percent of the bandwidth, as well as L2 cache memory, because only one of the three interleaved floats in each structure is used.
In contrast, organizing data according to the SoA layout can typically make full use of the memory bandwidth since there is no data interleaving; see Figure 1(b). Furthermore, global memory accesses are coalesced when consecutive threads read consecutive elements of an array, which is the natural access pattern with this layout, so higher global memory performance can usually be achieved.
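The bandwidth estimate for the AoS case follows from simple arithmetic: each structure occupies 12 bytes, of which only the 4 bytes of x are used. The C++ sketch below reproduces that estimate under a simplified model in which whole structures are fetched. The names `PtAoS`, `PtsSoA`, and `wasted_fraction` and the size `N` are hypothetical, and actual hardware behavior also depends on transaction and cache-line sizes.

```cpp
#include <cstddef>

constexpr std::size_t N = 1024;  // hypothetical number of points

// AoS: x, y, z of one point are adjacent, so consecutive x values lie 12 bytes apart.
struct PtAoS {
    float x, y, z;
};

// SoA: all x values are contiguous, so consecutive x values lie 4 bytes apart.
struct PtsSoA {
    float x[N];
    float y[N];
    float z[N];
};

// Fraction of fetched bytes that go unused when only `used_bytes` out of
// every `fetched_bytes` are needed by the computation.
constexpr double wasted_fraction(std::size_t fetched_bytes, std::size_t used_bytes) {
    return 1.0 - static_cast<double>(used_bytes) / static_cast<double>(fetched_bytes);
}

// Reading only x: AoS drags y and z along, while SoA touches x alone.
constexpr double kWasteAoS = wasted_fraction(sizeof(PtAoS), sizeof(float));  // 2/3
constexpr double kWasteSoA = wasted_fraction(sizeof(float), sizeof(float));  // 0
```

Under this model, two thirds of the bytes fetched from an AoS array are discarded when only x is needed, which matches the figure of about 66 percent.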
The SoA data layout is beneficial in many cases. Farber [4] suggested that, from a GPU performance perspective, it is preferable to use the SoA layout. This argument was demonstrated by the example of sorting SoA and AoS structures with Thrust: a 5-times speedup was reported for an SoA data structure over an AoS data structure [3].
Similarly, in order to gauge the effective performance of
the two representations, i.e., the SoA and AoS layouts, on
the GPU, Govender, et al. [6] ran a simulation of 2 million
particles using their discrete element simulation framework
BLAZE-DEM, and found that AoS is three times slower
than SoA.
However, an opposite argument was presented in [5]. In OP2, a library framework for the solution of unstructured mesh applications, Giles et al. [5] and Mudalige et al. [15] preferred the AoS layout for storing mesh data to obtain better memory access performance.
The above-mentioned applications indicate that memory access patterns (e.g., AoS and SoA) are critical for performance, especially on parallel architectures such as GPUs. However, it is not always obvious which data layout will achieve better performance in a particular application. The selection of a proper data layout for a specific application depends on its underlying algorithm. In general, the usual language syntax and standard container types lead naturally to the AoS layout, while SIMD units much prefer the SoA format [20].
In order to improve the efficiency of accessing memories
on the GPU, many studies have been performed to transform
different types of layouts to others, e.g., from AoS to SoA, or
vice versa [14,19,20,21]. Furthermore, the major choices of
AoS and SoA can be further refined to form hybrid formats,
e.g., arrays of structures of arrays [1] or structures of arrays
of structures [18].
In this work, we evaluate the performance impact of the above two basic data layouts and of other layouts derived from them. For each data layout, we develop one group of GPU implementations comprising the three versions, and then compare the groups with one another.
3 GPU Implementations
3.1 The SoA Group of Implementations
In our previous work [13], we introduced three GPU implementations of the standard IDW interpolation, i.e., the naïve version, the tiled version, and the CDP version. These GPU implementations were developed entirely according to the data layout SoA. In [13], we also tested these three versions using several sets of data on single precision.
In this section, we first describe an improved version of the CDP implementation. The CDP version presented in [13], which is referred to as the original CDP version in this section, has two levels of nested parallelism: (1) level 1: the interpolated values of all prediction points can be calculated in parallel; (2) level 2: for each prediction point, the distances to all data points can be calculated in parallel. The parent kernel performs the first level of parallelism, while the child kernel realizes the second level.
In the original CDP version, each child kernel is responsible for calculating the distances from all data points to one prediction point. More specifically, each thread within a child grid first calculates (1) the distance from one data point to the prediction point, (2) the corresponding weight, and (3) the weighted value (see Equation (1)); then the weights and weighted values calculated within the same thread block are accumulated locally using a parallel reduction [9]; finally, the weights and weighted values obtained within different blocks are accumulated using the atomic operation atomicAdd().
Atomic operations such as atomicAdd() cannot be performed on double precision on the GPU used in this work. To enable the CDP version to be executed on double precision, we redesign and improve this GPU implementation to avoid the use of atomicAdd(). The basic idea behind this improvement is as follows.
In the improved CDP version, we no longer allocate n threads within a child grid (where n is the number of data points), but allocate only one thread block with 1024 threads. Within this single thread block, each thread is responsible for calculating the distances from several data points, rather than only one data point, to a prediction point. For example, assuming there are 3000 data points, all the distances from each prediction point to those 3000 data points must be calculated. Each thread is then responsible for calculating at most three distances, i.e., (3000 + 1024 − 1)/1024 = 3 in integer arithmetic. These distances and the corresponding weights are accumulated locally within each thread; when all threads within the single block have finished calculating their distances, the accumulation of all weights and weighted values is achieved by performing a parallel reduction [9] within the thread block. Thus, the operation atomicAdd() is not needed for accumulating weights and weighted values that have been calculated within different blocks of threads.
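The single-block scheme can be emulated sequentially to make the two phases explicit. The C++ sketch below is a CPU illustration under stated assumptions, not the CUDA kernel itself: the function names are hypothetical, the data points are stored in SoA form, p = 2, distances are taken in the x-y plane with z as the interpolated value, and the prediction point is assumed not to coincide with any data point.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kThreads = 1024;  // one block of 1024 threads, as in the text

// Upper bound on the data points handled by one thread (integer ceiling).
constexpr std::size_t points_per_thread(std::size_t n) {
    return (n + kThreads - 1) / kThreads;
}

// CPU emulation of one child block interpolating the prediction point (px, py).
double idw_one_block(const std::vector<double>& dx, const std::vector<double>& dy,
                     const std::vector<double>& dz, double px, double py) {
    const std::size_t n = dx.size();
    std::vector<double> w(kThreads, 0.0);   // per-thread sum of weights
    std::vector<double> wz(kThreads, 0.0);  // per-thread sum of weighted values

    // Phase 1: "thread" t visits data points t, t + 1024, t + 2048, ...
    // and accumulates its partial sums locally.
    for (std::size_t t = 0; t < kThreads; ++t) {
        for (std::size_t i = t; i < n; i += kThreads) {
            const double ex = px - dx[i];
            const double ey = py - dy[i];
            const double weight = 1.0 / (ex * ex + ey * ey);  // 1 / d^2
            w[t] += weight;
            wz[t] += weight * dz[i];
        }
    }

    // Phase 2: tree reduction within the block; no atomic operation is needed
    // because all partial sums belong to the same block.
    for (std::size_t stride = kThreads / 2; stride > 0; stride /= 2) {
        for (std::size_t t = 0; t < stride; ++t) {
            w[t] += w[t + stride];
            wz[t] += wz[t + stride];
        }
    }
    return wz[0] / w[0];
}
```

For 3000 data points, `points_per_thread(3000)` evaluates to 3; in this emulation the first 952 threads visit three points each and the remaining 72 visit two.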
We test the performance of the improved CDP version using five sets of data. In each set of test data, the numbers of data points and prediction points are identical. We create five groups of sizes, i.e., 10K, 50K, 100K, 500K, and 1000K (1K = 1024), and perform five tests by setting the numbers of both the data points and the prediction points to these sizes.
The performance of the original and the improved CDP versions is illustrated in Figure 2. These experimental tests show that the improved CDP version achieves speedups of 2.9x and 1.5x over the original CDP version when the power parameter p is set to 2 and 3.0, respectively. Notably, for the original CDP version, the performance in the two cases where p is set to 2 and 3.0 is almost the same; thus, in Figure 2(a), the two lines representing the execution time of the old version almost overlap.
In this paper, the naïve version presented in [12], the tiled version developed in [13], and the improved CDP version described above are implemented according to different data layouts for benchmark tests on both single precision and double precision.
[Figure 2: (a) Execution time of the new and old versions: time (s, logarithmic axis from 0.01 to 10000) of the old and new CDP versions for p = 2 and p = 3.0; (b) Speedups of the new version over the old version (axis from 0.0 to 3.5) for p = 2 and p = 3.0. Horizontal axis in both panels: data size, 10K to 1000K (1K = 1024).]
Fig. 2 Performance comparison of the original (old) and the improved (new) CDP versions
3.2 The AoS Group of Implementations
Implementing the three GPU versions according to the layout AoS is quite straightforward and can be realized by simply modifying the SoA group of the three GPU implementations. First, two arrays of structures are allocated to store the coordinates of all data points and prediction points; then the references to the points' coordinates in the SoA version of the GPU implementations are replaced by references to the arrays represented in the AoS format.
Note that, in this group of implementations, the structure for representing 3D points is misaligned; in other words, the structure is not forced to be aligned using the specifier __align__(); see Figure 1(a).
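The change from the SoA group to the AoS group amounts to how the coordinates of a point are indexed. A minimal host-side C++ sketch, with the hypothetical names `PtAoS` and `to_aos` and assuming three equally sized coordinate arrays as the SoA input:

```cpp
#include <cstddef>
#include <vector>

// Unaligned 3D point, matching Figure 1(a): no alignment specifier is applied.
struct PtAoS {
    float x, y, z;
};

// Pack three separate coordinate arrays (SoA) into one array of structures (AoS).
// Assumes xs, ys, and zs have the same length.
std::vector<PtAoS> to_aos(const std::vector<float>& xs, const std::vector<float>& ys,
                          const std::vector<float>& zs) {
    std::vector<PtAoS> pts(xs.size());
    for (std::size_t i = 0; i < xs.size(); ++i) {
        pts[i].x = xs[i];  // the SoA access xs[i] becomes the AoS access pts[i].x
        pts[i].y = ys[i];
        pts[i].z = zs[i];
    }
    return pts;
}
```

Because the structure carries no alignment specifier, each element occupies 12 bytes on common platforms, so successive elements do not start on 16-byte boundaries.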
3.3 The AoaS Group of Implementations
In the AoS group of GPU implementations described above, the data structure for representing 3D points is not forced to be aligned. Operations using the misaligned structure may require many more memory transactions when accessing global memory, and thus decrease the overall performance [16].
To benefit from aligned memory accesses, we simply add the specifier __align__(16) to the data structures; see Figure 3. Notably, on single precision, a hidden 32-bit (i.e., 128 − 3 × 32 = 32) padding element is implicitly inserted into the structure Pt to meet the 128-bit size requirement for alignment; on double precision, the hidden padding element is 64 bits, and an access to this structure needs two 128-bit read or write instructions (i.e., 64 × 3 + 64 = 2 × 128).
struct __align__(16) Pt
{
    float x, y, z;
    /* plus hidden 32-bit padding element */
};
struct Pt myPts[N];
(a) Single precision
struct __align__(16) Pt
{
    double x, y, z;
    /* plus hidden 64-bit padding element */
};
struct Pt myPts[N];
(b) Double precision
Fig. 3 Data layout: Array of aligned Structures (AoaS)
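The padding arithmetic for the aligned structures can be checked at compile time. The sketch below uses the standard C++ specifier `alignas(16)` as a host-side stand-in for the CUDA specifier `__align__(16)`; that substitution and the type names are assumptions of this example, and the sizes hold on common platforms where float is 4 bytes and double is 8 bytes.

```cpp
#include <cstddef>

struct PtF { float x, y, z; };                      // plain AoS element, 12 bytes
struct alignas(16) PtFAligned { float x, y, z; };   // AoaS element, padded to 16 bytes

struct PtD { double x, y, z; };                     // plain AoS element, 24 bytes
struct alignas(16) PtDAligned { double x, y, z; };  // AoaS element, padded to 32 bytes

// Size in bits of the hidden padding added by the alignment requirement.
template <typename Aligned, typename Plain>
constexpr std::size_t hidden_padding_bits() {
    return 8 * (sizeof(Aligned) - sizeof(Plain));
}

static_assert(hidden_padding_bits<PtFAligned, PtF>() == 32,
              "single precision: 128 - 3 * 32 = 32 bits of padding");
static_assert(hidden_padding_bits<PtDAligned, PtD>() == 64,
              "double precision: 2 * 128 - 3 * 64 = 64 bits of padding");
static_assert(sizeof(PtDAligned) == 2 * 16,
              "a double-precision point spans two 128-bit words");
```

If any of these assumptions about type sizes did not hold on a given compiler, the translation unit would fail to compile rather than silently produce a different layout.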
Another notable issue in exploring the AoaS layout is the use of built-in data types. CUDA provides various built-in data types; see Figure 4 for three examples. The size requirement for alignment is automatically fulfilled for some built-in data types such as float2, float4, or double2.
We also use the built-in types float4 and double4 to develop a built-in group of GPU implementations. This built-in group is easily implemented by replacing the structure Pt with float4 or double4. The only difference between the AoaS data types illustrated in Figure 3 and the built-in types shown in Figure 4 is that in the AoaS data types a hidden padding element is implicitly added, while in float4 or double4 the component w is explicitly defined and serves as the padding element.
We test the built-in group of GPU implementations and compare its performance with that of the AoaS group. We find that there are no remarkable performance gains. Hence, we do not adopt these built-in data types to form combined data types, but choose the user-defined data types to create hybrid types; see Figure 5 and Figure 6.
struct __device_builtin__ __builtin_align__(16) float4 {
    float x, y, z, w;
};
struct __device_builtin__ __builtin_align__(16) double2 {
    double x, y;
};
struct __device_builtin__ __builtin_align__(16) double4 {
    double x, y, z, w;
};
Fig. 4 Several built-in data types in CUDA. (The data type double4 is aligned into two 16-byte words)
3.4 The SoAoS Group of Implementations
When operating on structures residing in global memory, typically there are two major optimization strategies [18]:
• Accessing consecutive elements to guarantee coalesced reads.
• Aligning data structures to allow for fewer reads.
These two strategies can generally improve performance in the use of global memory. To benefit from both methods, a combined data layout, the Structure of Arrays of aligned Structures (SoAoS), is proposed in [18]. Organizing aligned structures that do not exceed the alignment boundary in multiple arrays reduces the overall number of issued reads by using 64-bit or 128-bit memory accesses, while guaranteeing that the memory accesses of the threads within the same warp (or half-warp on some devices) are coalesced; see Figure 5. Note that the component p in the structure Ptb is just an explicit padding element that is never used in the calculation.
The data structures illustrated in Figure 5 are particularly designed for double precision. In this paper, we only implement the three GPU implementations of the IDW interpolation on double precision according to the SoAoS layout and related data structures.
struct __align__(16) Pta
{
    double x, y;
};
struct __align__(16) Ptb
{
    double z, p;
};
struct Pt
{
    Pta xy[N];
    Ptb zp[N];
};
struct Pt myPts;
Fig. 5 Data layout: Structure of Arrays of aligned Structures (SoAoS)
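A host-side C++ sketch of how the SoAoS structures might be accessed is given below. It is an illustration only: `alignas(16)` stands in for `__align__(16)`, the helper names and the small array size are hypothetical, and the sketch assumes that the distance uses x and y while z carries the value being interpolated.

```cpp
#include <cstddef>

constexpr std::size_t N = 4;  // hypothetical size, kept small for illustration

struct alignas(16) Pta { double x, y; };  // fills exactly one 128-bit word
struct alignas(16) Ptb { double z, p; };  // p is explicit padding only

struct PtSoAoS {
    Pta xy[N];
    Ptb zp[N];
};

static_assert(sizeof(Pta) == 16 && sizeof(Ptb) == 16,
              "each element fills one 128-bit word");

// Squared planar distance from data point i to (px, py): touches xy[i] only.
double squared_distance(const PtSoAoS& pts, std::size_t i, double px, double py) {
    const Pta a = pts.xy[i];  // a single 16-byte element
    const double ex = px - a.x;
    const double ey = py - a.y;
    return ex * ex + ey * ey;
}

// Value at data point i: touches zp[i] only; the padding p is never read.
double value_at(const PtSoAoS& pts, std::size_t i) {
    return pts.zp[i].z;
}
```

Each point thus costs 32 bytes of storage, of which 8 bytes are padding.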
3.5 The Hybrid Group of Implementations
The SoAoS layout described above is a combination of the layouts SoA and AoS. In this paper, specifically for the IDW interpolation, we also introduce another combined data layout, which is a combination of the AoS and the Array of Values (AoV); see Figure 6. A major difference between this hybrid layout and the SoAoS layout is the use of an AoV format array (i.e., double z[N] in Figure 6) to replace an AoS format array (i.e., Ptb zp[N] in Figure 5). Another difference is that in this hybrid layout there is no explicit or implicit padding element.
Similar to the layout SoAoS, the hybrid layout is only applicable on double precision. Thus, we also implement the three GPU implementations only on double precision according to this hybrid layout and the related data structures illustrated in Figure 6.
struct __align__(16) Pta
{
    double x, y;
};
struct Pt
{
    Pta xy[N];
    double z[N];
};
struct Pt myPts;
Fig. 6 The hybrid data layout by combining AoS and AoV
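The storage difference between the hybrid layout and the SoAoS layout can be made concrete with a short host-side C++ comparison. As before, `alignas(16)` stands in for `__align__(16)`, and the type names, helper names, and the even array size are assumptions of this sketch.

```cpp
#include <cstddef>

constexpr std::size_t N = 4;  // hypothetical size; even, so no tail padding arises

struct alignas(16) Pta { double x, y; };
struct alignas(16) Ptb { double z, p; };  // SoAoS partner with explicit padding p

// Hybrid layout: aligned (x, y) pairs plus a plain array of values.
struct PtHybrid {
    Pta xy[N];
    double z[N];
};

// SoAoS layout, repeated here for comparison.
struct PtSoAoS {
    Pta xy[N];
    Ptb zp[N];
};

constexpr std::size_t bytes_per_point_hybrid() { return sizeof(PtHybrid) / N; }
constexpr std::size_t bytes_per_point_soaos() { return sizeof(PtSoAoS) / N; }

static_assert(bytes_per_point_hybrid() == 24, "16 bytes for (x, y) + 8 bytes for z");
static_assert(bytes_per_point_soaos() == 32, "16 bytes for (x, y) + 16 bytes for (z, p)");
```

The hybrid layout therefore stores 24 bytes per point instead of 32, at the cost of reading z through 64-bit rather than 128-bit elements.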
4 Results and Discussion
4.1 Results
The GPU implementations are evaluated using the NVIDIA GeForce GT640 (GDDR5) graphics card and CUDA 5.5. Note that the GeForce GT640 card with GDDR5 memory has Compute Capability 3.5, while the variant with DDR3 memory only has Compute Capability 2.1. For each set of testing data, we run all GPU implementations on both single precision and double precision.
For the CPU implementations, we directly adopt our previous results, which were obtained on single precision. These results have been presented in [13]; in this paper, they are used as the baseline. The performance of all GPU implementations is benchmarked by comparison with the baseline results.
As described in [13], for each GPU implementation, we tested two different forms that have different values of the power parameter p. In the first form, the power p (see Equation (1)) is set to the integer value 2, while it is set to 3.0 in the second form. In this paper, we only consider the first form (i.e., p = 2).
The input of the IDW interpolation is the coordinates of data points and prediction points. The performance of the CPU and GPU implementations may differ due to different sizes of input data [8,10]. However, this work focuses on evaluating the performance impact of different data layouts; thus, we only consider the special situation where the numbers of prediction points and data points are identical.
We create five groups of sizes, i.e., 10K, 50K, 100K,
500K, and 1000K (1K=1024). And five tests are performed
by setting the numbers of both the data points and prediction
points as the above listed five groups of sizes.
4.1.1 Single Precision
On single precision, we implement the three GPU implementations of the IDW interpolation using three types of data layouts: the SoA, the AoS, and the AoaS. The benchmark results (i.e., speedups relative to the baseline CPU results) of the naïve version, the tiled version, and the CDP version are shown in Figure 7.
According to the results of these three experimental tests, for both the naïve and tiled implementations, the layout AoaS achieves the best performance and the layout SoA obtains the worst results; see Figures 7(a) and 7(b). However, for the CDP version, the layout SoA achieves the best performance, the layout AoaS is second best, and the AoS layout leads to the worst results; see Figure 7(c).
[Figure 7: speedup versus data size (10K to 1000K, 1K = 1024) for the layouts SoA, AoS, and AoaS. (a) Speedups by the Naïve version (axis from 0 to 100); (b) Speedups by the Tiled version (axis from 100 to 130); (c) Speedups by the CDP version (axis from 0 to 30).]
Fig. 7 Performance of GPU implementations on single precision
4.1.2 Double Precision
On double precision, we implement two groups of the three GPU implementations using two additional types of combined data layouts, the SoAoS and the Hybrid; we also implement the GPU implementations using the layouts SoA, AoS, and AoaS. The experimental results in this case are presented in Figure 8.
For the naïve version, the speedups of the GPU implementation according to the layout SoA are the lowest, while the other four layouts achieve almost the same performance, with only slightly varying speedups; see Figure 8(a).
For the tiled version, all five data layouts obtain nearly the same performance, with only slight differences among the speedups; see Figure 8(b).
For the CDP version, the layout AoS leads to the worst performance, and the second worst results are generated by the layout AoaS. The other three layouts, i.e., the SoA, the SoAoS, and the Hybrid, obtain almost the same performance; see Figure 8(c).
[Figure 8: speedup versus data size (10K to 1000K, 1K = 1024) for the layouts SoA, AoS, AoaS, SoAoS, and Hybrid. (a) Speedups by the Naïve version (axis from 11.0 to 12.0); (b) Speedups by the Tiled version (axis from 13.0 to 13.5); (c) Speedups by the CDP version (axis from 0 to 12).]
Fig. 8 Performance of GPU implementations on double precision
4.2 Discussion
Recently, GPU-computing programming models such as CUDA have been widely used to speed up various scientific applications. However, fully utilizing the specific features of the underlying GPU architecture is still challenging. One of the most important responsibilities of a programmer is to maximize performance by optimizing the use of the memory hierarchy in GPU computing.
The data layout in memory is a critical issue in developing efficient GPU code. Several efforts have been carried out to analyze [5,18,20] or transform [14,19,21] different types of data layouts.
Building on our previous work [13], in this paper we focus on evaluating the performance impact of different data layouts on the IDW interpolation. We develop several sets of GPU implementations of the standard IDW interpolation according to a set of data layouts, and then test their performance on both single precision and double precision.
On single precision, we implement three groups of the GPU implementations using three types of data layouts: the SoA, the AoS, and the AoaS. We find that the AoaS layout always obtains better performance than the AoS layout. This positive impact on performance is due to minimizing the number of memory transactions by aligning the data structures.
We also observe that the performance gain from using the AoaS instead of the AoS is much more significant for the naïve version and the CDP version than for the tiled version; see Figure 7.
This result is perhaps due to the effective optimization of global memory usage by minimizing the number of memory transactions. In both the naïve version and the CDP version, each thread needs to read the coordinates of all data points; in other words, the coordinates of all data points must be read n times, where n is the number of prediction points. In contrast, in the tiled version the coordinates of all data points only need to be read (n / threadsPerBlock) times owing to the optimization strategy “tiling”. Thus, there are many more global memory accesses in the naïve and the CDP versions than in the tiled version, and reducing the number of transactions per access naturally has a more significant effect when the number of global memory accesses is larger.
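The difference in global memory traffic can be put into numbers. The C++ sketch below counts how many times the full set of data-point coordinates is read from global memory, following the n versus n / threadsPerBlock estimate given in the text. The function names are hypothetical, and the block size of 256 is an assumed value for illustration, not one reported in the paper.

```cpp
#include <cstddef>

// Naive and CDP versions: every prediction point triggers one full pass
// over the data points in global memory.
constexpr std::size_t passes_untiled(std::size_t n_prediction) {
    return n_prediction;
}

// Tiled version: the threads of a block share one pass through shared memory,
// so there is one pass per block (rounded up).
constexpr std::size_t passes_tiled(std::size_t n_prediction,
                                   std::size_t threads_per_block) {
    return (n_prediction + threads_per_block - 1) / threads_per_block;
}

constexpr std::size_t kN = 1000 * 1024;        // the 1000K test size (1K = 1024)
constexpr std::size_t kThreadsPerBlock = 256;  // hypothetical block size

static_assert(passes_untiled(kN) == 1024000, "one pass per prediction point");
static_assert(passes_tiled(kN, kThreadsPerBlock) == 4000, "one pass per block");
```

Under these assumptions the tiled version makes 4000 passes instead of 1,024,000, so any saving per memory transaction is multiplied far fewer times than in the other two versions.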
The above two layouts (AoS and AoaS) achieve higher speedups than the layout SoA for both the naïve and tiled implementations, but lower speedups than the SoA for the CDP implementation. We cannot explain this behavior; it is perhaps related to the nested parallelism used when programming with CUDA Dynamic Parallelism.
Another notable issue in exploring the AoaS layout is the use of built-in data types. CUDA provides various built-in data types, for which the size requirement for alignment is automatically fulfilled. Compared to the user-defined data types in AoaS format (see Figure 3), we have found that their built-in counterparts provided by CUDA do not achieve notable advantages. In addition, the user-defined AoaS data types are suggested for convenience in programming.
On double precision, we also observe some results that are the same as those on single precision:
1. For the naïve version, both the layouts AoS and AoaS are better than the SoA.
As explained above, this positive result is due to aligning the data structures to allow for fewer reads or writes.
2. For the tiled version, the three layouts SoA, AoS, and AoaS achieve almost the same performance.
This result is due to the fact that the accesses to global memory have been optimized using the strategy “tiling”, so the impact of different data layouts on accessing global memory is not significant.
3. For the CDP version, the layout SoA still obtains the best results when compared to the layouts AoS and AoaS.
We cannot give a well-founded explanation for this behavior. We guess that the coalesced access to global memory in nested parallelism (CUDA Dynamic Parallelism) has a very positive performance impact.
Furthermore, we find several additional results on dou-
ble precision.
1. For the naı¨ve version, all the data layouts except the SoA
achieve nearly the same speedups. Noticeably, among
these four layouts, i.e., the AoS, the AoaS, the SoAoS,
and the Hybrid, the best one is the AoS, in which the
alignment is not used.
This result is perhaps due to two reasons: the first is that
the aligning for data structures on double precision is not
as effective as that on single precision (see Figure 7(a)
and Figure 8(a)); the second potential cause is that there
is probably a performance penalty when aligning data
structures on double precision. However, the advantage
of the AoS layout is not obvious.
2. For all three versions, the performance differences
between the SoAoS layout and the Hybrid layout are
quite small. This illustrates that the use of the AoS or
the AoV in a combined layout on double precision does
not have a large impact on performance.
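One possible reading of why alignment helps less on double precision can be illustrated by counting loads per element. The sketch below is an illustration under stated assumptions, not a measurement: it assumes a point with three coordinates, a 16-byte alignment, and that a single load instruction moves at most 16 bytes and no more than the alignment of the type. The names Packed, Aligned16, loadsPerElement, and paddingBytes are hypothetical.

```cpp
#include <cstddef>

// Hypothetical three-coordinate point, templated on the scalar type.
template <typename T> struct Packed { T x, y, z; };
template <typename T> struct alignas(16) Aligned16 { T x, y, z; };

// Loads needed to fetch one element, assuming each load moves
// min(alignof(S), maxLoadBytes) bytes. sizeof(S) is always a multiple of
// alignof(S), so the division is exact for power-of-two load widths.
template <typename S>
constexpr std::size_t loadsPerElement(std::size_t maxLoadBytes = 16) {
    return sizeof(S) / (alignof(S) < maxLoadBytes ? alignof(S) : maxLoadBytes);
}

// Extra bytes per element introduced by the 16-byte alignment.
template <typename T>
constexpr std::size_t paddingBytes() {
    return sizeof(Aligned16<T>) - sizeof(Packed<T>);
}
```

Under these assumptions, alignment reduces the loads per point from three to one for float but only from three to two for double, while the padding per point grows from 4 to 8 bytes.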
Considering the overall performance on single and double
precision, we recommend the following: for both the naïve
version and the tiled version, the best choice is the data layout
AoaS, while the layout SoA is the best one for the CDP
version. From the perspective of GPU performance in practical
applications, we suggest the layout AoaS as the only option,
since the tiled version is the fastest among the
three versions of GPU implementations.
In this paper, all the experimental tests are performed
and evaluated on a single GPU. In some related work [7,11],
efficient implementations of the IDW interpolation were
developed on platforms with multiple GPUs or on clusters.
To benefit from multiple GPUs or clusters, the optimal data
layout needs to be carefully analyzed and selected.
Future work should therefore include the implementation
of the IDW interpolation and the performance evaluation
of different data layouts in multi-GPU or cluster
environments.
5 Conclusion
We have redesigned and improved the CDP version of the
GPU implementation of the standard IDW interpolation
algorithm by exploiting the feature CUDA Dynamic
Parallelism. We have demonstrated that the improved CDP
version achieves speedups of 2.9x and 1.5x over the original CDP
version when the power parameter p is set to 2 and 3.0,
respectively. Furthermore, in order to evaluate the performance
impact of different data layouts, we have implemented the
naïve version, the tiled version, and the improved CDP
version based upon three basic layouts (SoA, AoS, and AoaS)
and two combined layouts. We have observed that: (1) for
both the naïve version and the tiled version, the layouts AoS
and AoaS achieve better performance than the layout SoA;
(2) for the improved CDP version, the layout SoA is the best
choice among the three basic layouts; (3) for the two
combined data layouts, there are no notable performance gains
when compared to the three basic layouts. We recommend
that, in practical applications, the layout AoaS is the best
choice since the tiled version is the fastest among the
three versions of GPU implementations, especially on
single precision.
Conflict of Interests The authors declare that there is no
conflict of interests regarding the publication of this article.
Acknowledgements The authors are grateful to the anonymous ref-
eree for helpful comments that improve this paper.
References
1. J. Abel, K. Balasubramanian, M. Bargeron, T. Craver, and
M. Phlipot, “Applications tuning for Streaming SIMD Extensions,”
Intel Technology Journal Q, vol. 2, pp. 1–12, 1999.
2. M. P. Armstrong and R. J. Marciano, “Massively parallel strate-
gies for local spatial interpolation,” Computers and Geosciences,
vol. 23, no. 8, pp. 859–867, 1997.
3. N. Bell and J. Hoberock, Thrust: Productivity-Oriented Library
for CUDA. Morgan Kaufmann, 2011, ch. 26, pp. 359–371.
4. R. Farber, CUDA Application Design and Development. Morgan
Kaufmann, 2011.
5. M. B. Giles, G. R. Mudalige, B. Spencer, C. Bertolli, and I. Reguly,
“Designing OP2 for GPU architectures,” Journal of Parallel and
Distributed Computing, vol. 73, no. 11, pp. 1451–1460, 2013.
6. N. Govender, D. N. Wilke, S. Kok, and R. Els, “Development of
a convex polyhedral discrete element simulation framework for
NVIDIA Kepler based GPUs,” Journal of Computational and Applied
Mathematics, 2013.
7. X. Guan and H. Wu, “Leveraging the power of multi-core platforms
for large-scale geospatial data processing: Exemplified by
generating DEM from massive LiDAR point clouds,” Computers and
Geosciences, vol. 36, no. 10, pp. 1276–1282, 2010.
8. F. Hanzer, “Spatial interpolation of scattered geoscientific data,”
2012.
9. M. Harris, “Optimizing parallel reduction in CUDA,” 2007.
10. K. Henneböhl, M. Appel, and E. Pebesma, “Spatial interpolation in
massively parallel computing environments,” 2011.
11. F. Huang, D. Liu, X. Tan, J. Wang, Y. Chen, and B. He,
“Explorations of the implementation of a parallel IDW interpolation
algorithm in a Linux cluster-based parallel GIS,” Computers and
Geosciences, vol. 37, no. 4, pp. 426–434, 2011.
12. L. Huraj, V. Sildi, and J. Sili, “Comparison of design and
performance of snow cover computing on GPUs and multi-core
processors,” WSEAS Transactions on Information Science and
Applications, vol. 7, no. 10, pp. 1284–1294, 2010.
13. G. Mei, “Evaluating the power of GPU-acceleration for IDW
interpolation algorithm,” The Scientific World Journal, 2014.
14. P. Mistry, D. Schaa, B. Jang, D. Kaeli, A. Dvornik, and D. Meglan,
Data Structures and Transformations for Physically Based Sim-
ulation on a GPU, ser. Lecture Notes in Computer Science.
Springer Berlin Heidelberg, 2011, vol. 6449, ch. 17, pp. 162–171.
15. G. R. Mudalige, M. B. Giles, J. Thiyagalingam, I. Z. Reguly,
C. Bertolli, P. H. J. Kelly, and A. E. Trefethen, “Design and initial
performance of a high-level unstructured mesh framework on het-
erogeneous parallel systems,” Parallel Computing, vol. 39, no. 11,
pp. 669–692, 2013.
16. NVIDIA, “CUDA C programming guide v5.5,” 2013.
17. D. Shepard, “A two-dimensional interpolation function for
irregularly-spaced data,” pp. 517–524, 1968.
18. J. Siegel, J. Ributzka, and X. Li, “CUDA memory optimizations for
large data-structures in the Gravit simulator,” pp. 174–181, 22-25
Sept. 2009.
19. R. Strzodka, Abstraction for AoS and SoA layout in C++. Morgan
Kaufmann, 2011, pp. 429–441.
20. R. Strzodka, “Data layout optimization for multi-valued containers in
OpenCL,” Journal of Parallel and Distributed Computing, vol. 72,
no. 9, pp. 1073–1082, 2012.
21. I.-J. Sung, G. D. Liu, and W.-M. W. Hwu, “DL: A data layout
transformation system for heterogeneous computing,” pp. 1–11, 2012.
22. Y. Xia, L. Kuang, and X. Li, “Accelerating geospatial analysis on
GPUs using CUDA,” Journal of Zhejiang University SCIENCE C,
vol. 12, no. 12, pp. 990–999, 2011.
