Abstract-The Finite Element Method (FEM) is a numerical technique widely used in finding approximate solutions for many scientific and engineering problems. The Data Assembly (DA) stage in FEM can take up to 50% of the total FEM execution time. Accelerating DA with Graphics Processing Units (GPUs) presents challenges due to DA's mixed compute-intensive and memory-intensive workloads. This paper uses a representative finite element mini-application to explore DA acceleration on CPU+GPU platforms. Implementations based on different thread, kernel and task design approaches are developed and compared. Their performance and energy consumption are measured on four CPU+GPU and two CPU only platforms. The results show that (i) the performance and energy for different implementations on the same platform can vary significantly but the performance and energy trends are the same, and (ii) there exist performance and energy tradeoffs across some platforms if the best implementation is chosen for each of the platforms.
I. INTRODUCTION
This paper studies the acceleration of the data assembly stage (referred to as DA) in the Finite Element Method (FEM) on heterogeneous platforms containing both CPUs and GPUs. FEM is a numerical technique widely used in finding approximate solutions for a large number of important scientific and engineering problems, such as simulation of fluid dynamics and particle transport. FEM is roughly decomposed into two stages: (i) DA and (ii) solving a sparse linear system of equations (simply referred to as solver). Depending on the sizes of the particular problems, DA execution can take up to 50% of FEM's total execution time [10] .
FEM often needs to deal with problems at extremely large scales, and effective acceleration techniques for FEM are highly sought after in the scientific computing community. Due to their efficient single instruction multiple data (SIMD) architectures, GPUs are gaining popularity for accelerating FEM (e.g., [15] , [16] , [17] ). While much work has been done on GPU acceleration for various solvers (e.g., [11] , [12] , [13] ), only limited research has been conducted on GPU acceleration for DA. Accelerating DA is a challenging task because DA execution involves a mixture of compute-intensive and memory-intensive workloads. Komatitsch et al. [16] presented a DA method for GPU based on element coloring. However, memory usage is not adequately considered. Cecka et al. [5] investigated GPU implementations of DA for unstructured grids. They focus on finding a proper match between the problem (i.e., mesh) size and the type of (i.e., global or shared) memory in order to achieve the best performance improvement. Markall et al. [14] studied several DA designs on both GPU and multi-core architectures, and compared existing algorithms of DA on CPU and GPU architectures.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
As energy consumption has become a first-class design consideration, it is important to study the energy impact of performance-oriented design strategies. When mapping FEM applications to GPUs, application developers typically focus on improving the FEM's overall performance without paying attention to energy consumption. Though such performance-oriented GPU programming strategies tend to also reduce energy, some of them may actually lead to higher energy consumption, as GPUs typically have high instantaneous power (up to 360W [1]) with very limited power management technology [19] .
Some existing work examined the energy impact of using GPUs as accelerators. For example, Rofouei et al. [7] explored the power-performance break-even point for using GPUs to accelerate applications, which could help choose appropriate platforms for both high performance and low energy cost. Huang et al. [8] used GEM, an application of computing biomolecular electrostatic potential, to evaluate the performance, energy consumption and energy efficiency of CPU+GPU heterogeneous platforms. Their results show that running applications on CPU+GPU platforms could reduce energy consumption compared to a homogeneous CPU platform. Recently, Yuki et al. [19] employed Gdev, an open-source runtime resource management system for GPUs, to examine the energy impact of GPUs and their causal relation with CPUs in the CPU+GPU heterogeneous platforms. This work reveals that the CPU is a weak factor for the total system energy reduction. However, these papers consider neither the situation where the CPU and GPU work simultaneously nor the energy impact of using different GPU design strategies.
To the best of our knowledge, no prior work has systematically studied the acceleration of DA on CPU+GPU heterogeneous platforms and the corresponding energy efficiency. This paper intends to fill this gap.
We explore different GPU programming and partitioning approaches to accelerating DA on CPU+GPU platforms, and study their performance and energy behavior. Since there are many FEM-based applications and some can be extremely large in size, we use miniFE [2] , a proxy FEM application, as our target application. MiniFE solves the steady-state 3D heat Poisson equation and can be used to predict the performance trend of Charon [20] , a code with over 700K lines used in simulating semiconductor devices and magnetohydrodynamics. Using miniFE instead of DA benchmarks allows us to investigate a number of different design strategies for trading off performance and energy in DA implementations.
In this paper, we propose and implement two versions of DA using GPUs and two versions of DA using both CPUs and GPUs. Through these implementations, we show how thread design (i.e., selecting the basic parallel execution unit within a kernel), kernel design (i.e., determining the data communication between kernels), and task design (i.e., partitioning work between CPU and GPU) impact performance and energy tradeoffs. We use direct performance and power measurement to gather the relevant data. To further understand the influence of hardware platforms, we have studied different combinations of a low-power CPU (Intel Atom 330), a high-performance CPU (Intel i7 2600K) and two GPU cards (NVIDIA GeForce GTX570 and GTX670, with different GPU architectures).
Our results indicate that different design approaches on different platforms have significantly different performance and energy behavior. Good thread design leads to higher speedups for both compute- and memory-intensive workloads, and it offers even greater energy benefits for memory-intensive workloads. Kernel designs can make use of good thread designs but they are also sensitive to data communication. Our best kernel design displays trends similar to those of the thread designs. Task designs can benefit from both kernel and thread designs. Our task designs reveal that there exist performance and energy tradeoffs across some platforms if the best implementation is chosen for each of the platforms. These findings provide valuable guidelines to application developers working on FEM acceleration about which combination of design strategies and platforms offers the best performance and energy tradeoff.
The rest of the paper is organized as follows. Section II provides a brief background of the DA stage in miniFE and the GPU architectures with their programming model. The different GPU implementations are proposed in Section III. The energy measurement system and experiment configurations are discussed in Section IV. Section V discusses the performance and energy results for the DA implementations. Section VI concludes the paper.
II. BACKGROUND
We use the DA stage in miniFE from the Mantevo project [2] as the representative DA. As typical scientific and engineering applications are large and complex, the Mantevo project was initiated by Sandia National Labs to create several mini-applications that possess essential performance characteristics of key scientific and engineering applications. Among these programs, miniFE replicates the unstructured grid finite-element generation, data assembly and problem solving in a configurable 3D space, and has been shown to predict the performance trend of a real scientific application [20] .
The simulated physical domain in miniFE is a 3D space with a configurable size in each dimension. This domain is decomposed into hexahedron elements by the recursive coordinate bisection (RCB) technique which also maps the input object as well as the nodes of each element to a 3D coordinate system [2] . MiniFE has two main stages: (1) DA and (2) Conjugate-Gradient (CG) based solver. DA generates a discrete linear system of equations based on the input objects. This system of equations is then solved by the CG solver. The CG method is an iterative method for solving linear system of equations, and has been well studied on GPUs [11] .
The original serial DA implementation in miniFE [2] is used as the starting point of our study. The workflow of the DA implementation is illustrated in Figure 1 . DA outputs a large global matrix and a vector forming the discrete linear system of equations. The global matrix is sparse and symmetric, and is stored in the format of compressed sparse row (CSR) which uses row starting indices and column indices to index solely non-zero (NZ) data items of the global matrix. The original implementation of DA in miniFE has two separate functions: (i) stiffness matrix computation (STC) and (ii) stiffness matrix assembly (SMA). STC computes and generates a local stiffness matrix for each element generated by the RCB technique to define the geometric and material properties of that element. The size of the local matrix is 8 × 8 as each element in miniFE is a hexahedron with eight nodes. A data item in the local matrix describes the physical relationship between the two nodes represented by the row and column indices. STC is mainly composed of add and multiply operations resulting in performance which correlates closely to processor speed, as it is more compute-intensive. On the other hand, SMA, mainly composed of add and memory operations, assembles all the local matrices into the global matrix by accumulating data items in local matrices to the corresponding positions in the global matrix. Hence, SMA is more memory-intensive.
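To make the CSR layout concrete, the following Python sketch stores a small symmetric matrix using row starting indices and column indices, keeping only the NZ data items. The example matrix and the helper names `to_csr` and `csr_get` are illustrative assumptions and are not taken from miniFE.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) into CSR arrays."""
    row_start, col_idx, values = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                col_idx.append(j)   # column index of this NZ item
                values.append(v)    # the NZ item itself
        row_start.append(len(values))  # where the next row begins
    return row_start, col_idx, values


def csr_get(csr, i, j):
    """Look up entry (i, j); entries that are not stored are zero."""
    row_start, col_idx, values = csr
    for k in range(row_start[i], row_start[i + 1]):
        if col_idx[k] == j:
            return values[k]
    return 0.0


# Hypothetical 4 x 4 sparse symmetric matrix used only for illustration.
dense = [
    [4.0, -1.0, 0.0, 0.0],
    [-1.0, 4.0, -1.0, 0.0],
    [0.0, -1.0, 4.0, -1.0],
    [0.0, 0.0, -1.0, 4.0],
]
csr = to_csr(dense)
```

Only 10 of the 16 entries are stored here; for the much sparser global matrices produced by DA the saving is correspondingly larger.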
GPU architectures belong to the SIMD class of processors, and usually consist of 100s of streaming processor (SP) cores partitioned into 10s of streaming multiprocessors (SMs). Their memory hierarchy generally includes register files and (configurable) shared memory/cache local to each SM, and global off-chip memory shared by all SMs. In this work, we use NVIDIA's Fermi [3] and Kepler [4] as the target GPU architectures. The Kepler GPU architecture doubles the amount of registers and L2 cache in Fermi and introduces a read-only cache. Although the raw computing performance delivered by Kepler is not much higher than that of Fermi, its performance per watt could be 3X higher than Fermi's [4] .
We use Compute Unified Device Architecture (CUDA) [6] as the programming model in this work. CUDA can be viewed as an extension of the C language facilitating the use of GPUs for computation. The extension includes language-level constructs for defining kernels, execution threads, thread blocks and grids. A CUDA kernel is a program function to be executed on the physical GPU. A thread block contains multiple threads executing the same kernel, and one thread is mapped to one SP core that can access its private register files and shared memory. The actual construction of threads and kernels for an application greatly impacts the available parallelism and memory bandwidth usage, and hence leads to drastically different performance/energy values.
III. GPU ACCELERATION OF DA
In this section, we present different thread, kernel and task designs of the DA stage in miniFE on GPU and CPU+GPU platforms. Figure 2 illustrates the main ideas of different design strategies. The different implementations are developed with the initial aim of improving DA's performance. The impact of each proposed strategy on energy is then analyzed based on measured data and will be discussed in Section V.
A. Thread Design for STC and SMA
A thread design offering high parallelism and memory friendly accessing patterns is desirable for improving both performance and energy efficiency of GPU implementations. The original STC and SMA functions of the legacy serial version in DA process each hexahedron element independently. As a baseline implementation, we have followed the element-wise approach in the serial version and simply assigned one thread to one element for the BaseSTC and BaseSMA kernels (Figure 2 (a)) running on GPU. To improve the baseline GPU implementations, we have devised three alternative thread designs for the STC and SMA functions, i.e., SmemSTC, RowSMA and GnzSMA (Figure 2(a) ). Below we elaborate on these designs.
Consider first the STC function. Based on the discussion in Section II, a thread in BaseSTC includes a loop with eight iterations to compute the eight 8 × 8 local matrix contributors and accumulate them to one 8 × 8 local matrix associated with its assigned element. Therefore to increase parallelism, we could break the original single BaseSTC thread into 8 threads, each thread handling one iteration of the loop (i.e., operating on one 8 × 8 local matrix) in the BaseSTC thread. However, executing these 8 threads in parallel causes data races as all these threads will eventually accumulate their matrix contributors to the same local matrix residing in the global memory. Though data races can be handled by atomic operations, these can degrade performance significantly [5] .
We have developed an alternative thread assignment, SmemSTC, which eliminates data races by employing shared memory to store necessary intermediate results. SmemSTC still uses eight threads for the eight-iteration loop but each thread handles the same row of the eight local 8 × 8 matrices (hence different threads handle different rows of the matrices). That is, each thread computes the same row of the 8 local matrix contributors and accumulates them before writing the results to the local matrix. SmemSTC removes the data races caused by assigning one thread to one iteration for the loop in BaseSTC and uses the shared memory for sharing intermediate data in one iteration, which leads to performance increase of about 35%.
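The row-wise work assignment in SmemSTC can be illustrated with a short sequential Python sketch. It only models which data each worker touches, not the CUDA kernel or its shared memory usage; the contributor values are integer stand-ins and the function names are hypothetical.

```python
def contributors(seed):
    """Eight stand-in 8 x 8 contributor matrices (not real FEM values)."""
    return [[[seed + q + 8 * r + c for c in range(8)] for r in range(8)]
            for q in range(8)]


def base_stc(contribs):
    """BaseSTC-like: a single worker loops over all eight contributors."""
    local = [[0] * 8 for _ in range(8)]
    for m in contribs:
        for r in range(8):
            for c in range(8):
                local[r][c] += m[r][c]
    return local


def row_worker(contribs, r):
    """SmemSTC-like worker r: accumulates row r of every contributor in a
    private buffer, so it never shares an output item with another worker."""
    row = [0] * 8
    for m in contribs:
        for c in range(8):
            row[c] += m[r][c]
    return row


def smem_stc(contribs):
    """Eight workers, one per row; each writes its own row exactly once."""
    return [row_worker(contribs, r) for r in range(8)]
```

Because worker r is the only writer of row r, the eight workers could run concurrently without atomic operations, whereas splitting the work by loop iteration would make all eight workers update the same 64 items.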
Improving the performance of BaseSMA presents different challenges. Unlike the STC function which is dominated by computation, the SMA function is memory-bound. Each thread in BaseSMA reads an 8 × 8 local matrix and the associated coordinates from the global memory, and then accumulates the 64 data items to the global matrix in the global memory. The performance bottleneck for the SMA's GPU implementation lies in the exploitation of memory bandwidth.
To improve the performance of BaseSMA, we reorganize the memory accesses required in the SMA function. Specifically, one thread is responsible for accumulating the 8 data items in one row of the 8 × 8 local matrix to the global matrix. A similar idea has been proposed in [5] . We refer to this design as RowSMA. Since the 8 threads of each local matrix synchronously access 8 data items in a row of the local matrix, these accesses have close global memory addresses and may be coalesced into a single memory request [6] . Though RowSMA improves memory coalescing, it may experience data races because different local matrices may have their data items in the same position in the global matrix and these data items may be accumulated by concurrent threads.
To avoid data races in RowSMA, we developed another thread design, referred to as GnzSMA. The key idea is to assemble the local matrices by the non-zero (NZ) data items in the global matrix instead of by the data items in local matrices. This approach is similar to the concept outlined in [5] . One design decision to make is how many NZ data items should be assigned to each thread. We choose to assign the NZ data items in one row of the global matrix to one thread. This is because one row of the global matrix has at most 27 NZ data items and two neighboring rows have their NZ data items stored closely in the global memory, which increases the GPU performance through having more coalesced memory requests. Each thread retrieves data items of the local matrices belonging to its assigned NZ data items and accumulates them before writing to the global matrix. Consequently, there are no threads competing to update NZ data items in the global matrix.
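The difference between element-wise assembly (as in BaseSMA and RowSMA) and assembly by rows of the global matrix (as in GnzSMA) can be sketched sequentially in Python. The tiny mesh, the node numbering, and the integer stand-in local matrices below are assumptions chosen for illustration; the sketch models the access pattern only, not the CUDA implementation.

```python
from collections import defaultdict

NX = 3  # hypothetical mesh of NX x 1 x 1 hexahedral elements


def element_nodes(e):
    """Global node ids of the eight corners of element e."""
    return [(e + dx) + (NX + 1) * (dy + 2 * dz)
            for dz in (0, 1) for dy in (0, 1) for dx in (0, 1)]


def local_matrix(e):
    """Stand-in 8 x 8 local matrix with integer entries."""
    return [[(e + 1) * (8 if a == b else -1) for b in range(8)]
            for a in range(8)]


def assemble_scatter():
    """Element-wise: every element adds its 64 items to the global matrix.
    If elements ran concurrently, those sharing a node would update the
    same global item, which is the data race described in the text."""
    G = defaultdict(int)
    for e in range(NX):
        nodes, K = element_nodes(e), local_matrix(e)
        for a in range(8):
            for b in range(8):
                G[(nodes[a], nodes[b])] += K[a][b]
    return dict(G)


def assemble_gather():
    """Row-wise: one worker per global row gathers every contribution to
    that row, so no two workers ever write the same global item."""
    touching = defaultdict(list)  # node id -> [(element, local row index)]
    for e in range(NX):
        for a, n in enumerate(element_nodes(e)):
            touching[n].append((e, a))
    G = {}
    for row in sorted(touching):        # each iteration models one thread
        acc = defaultdict(int)          # private accumulator for this row
        for e, a in touching[row]:
            nodes, K = element_nodes(e), local_matrix(e)
            for b in range(8):
                acc[nodes[b]] += K[a][b]
        for col, v in acc.items():      # single write per NZ item
            G[(row, col)] = v
    return G
```

Both functions produce the same global matrix; the gather version trades extra reads of the local matrices for the absence of competing writes.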
B. Kernel Design
Kernel design determines the memory space where data communication takes place between data dependent GPU kernels for accomplishing a complete task. A similar idea has been studied in [21] . For DA acceleration, we consider two different kernel designs: (1) GmemDA and (2) SmemDA (Figure 2(b) ). GmemDA uses separate kernels for STC and SMA and uses the global memory for data communication between the two kernels. SmemDA, on the other hand, fuses the STC and SMA kernels into one single kernel through the use of shared memory for data communication.
The implementation of GmemDA can be straightforwardly obtained by combining any STC and SMA kernels discussed in the previous subsection. It is important to note that data communication between kernels can only be achieved through the global memory. For the STC and SMA kernels, a large set of local matrices must be communicated. Therefore, although GmemDA easily benefits from different thread design alternatives, data communication between the two kernels may become a performance bottleneck.
We propose an alternative kernel design, SmemDA, which fuses two separate STC and SMA kernels into a single kernel. Since the shared memory is active within a kernel, it can be used effectively for data communication between STC and SMA in the same kernel. However, care must be taken in choosing the proper thread design as it impacts the amount of data to be stored in the shared memory whose size is quite limited. In our case, we cannot use GnzSMA since GnzSMA accesses a wide range of memory locations which are often larger than the shared memory space allocated to a thread block. Therefore, SmemDA consists of SmemSTC for STC and RowSMA for SMA.
C. Task design
Thread design and kernel design both aim at improving the DA performance on GPUs. Since the GPU is often used as an accelerator in a CPU+GPU platform, to achieve the maximum performance speedup, collaboration between CPU and GPU is desirable and should be carefully orchestrated. We refer to coordinating the workload execution between CPU and GPU (i.e., task assignment and scheduling on CPU and GPU) as task design. We consider two general task design strategies: workload parallelization and code partitioning. In workload parallelization, the CPU and GPU each carry out the entire function execution but on different subsets of the input data. In code partitioning, the CPU and GPU execute different tasks and these tasks together accomplish the complete functionality. This idea has been used in [18] for mapping the JPEG2000 streaming decoder onto CPU+GPU heterogeneous platforms. Since the CPU and GPU tasks may need to communicate with each other during execution, data transfer between the host memory and the GPU global memory is necessary in code partitioning.
Applying workload parallelization to DA in miniFE is relatively straightforward. It basically requires (i) finding the fastest implementation of DA on the GPU as well as the CPU, and (ii) dividing the input data into two sets, one for CPU and one for GPU, such that the two processors complete the entire workload at the same time. We have constructed a version of workload parallelization implementation for DA and refer to it as ParaDA (Figure 2(c) ).
The CPU and GPU DA implementations in ParaDA are selected based on extensive experimental results. Specifically, ParaDA uses an OpenMP implementation of DA, OmpDA, with the maximum number of physical threads on the CPU side, and SmemDA on the GPU side. At the beginning of ParaDA's execution, one part of the input data is uploaded to the GPU from the host memory. The GPU results are downloaded to the host memory when the GPU finishes its execution. In our implementation, the data transfer is overlapped with CPU execution.
To divide the input data, recall that the input to DA is a 3D domain containing hexahedron elements with the number of elements in each dimension being configurable. The workload division can be done by cutting the input 3D domain into two parts along any one (say horizontal) direction and distributing the two parts to the CPU and GPU processors. The boundary face must be redundantly loaded to both the CPU and GPU, since processing each element involves all its neighboring elements. In order to minimize the total execution time, we obtain an initial input data partitioning ratio based on the performance of OmpDA and SmemDA, and then tune the ratio so as to achieve balanced CPU and GPU execution.
Fig. 3. Scheduling of STC and SMA on the GPU and CPU in PartitionDA. The blue and green rounded rectangles denote the STC kernel on the GPU and SMA functions on the CPU, respectively. The arrows represent data transfer operations. Once the STC kernel finishes computing a set of stiffness matrices, these matrices are transferred to the host memory. When this data transfer is done, a new set of input data is transferred to the GPU's global memory, and the STC kernel and SMA function start concurrently. This operation repeats until all the sub-domains are processed.
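One simple way to obtain an initial partitioning ratio from standalone timings is sketched below in Python. It assumes that execution time scales linearly with the amount of data and ignores data transfer, which is why the text follows it with a tuning step; the function names and the timings in the usage example are hypothetical.

```python
def gpu_share(t_cpu, t_gpu):
    """Fraction of the workload to give to the GPU so that both devices
    finish together, given the time each needs for the whole workload.
    Balance condition: share * t_gpu == (1 - share) * t_cpu."""
    return t_cpu / (t_cpu + t_gpu)


def split_planes(n_planes, share):
    """Cut the domain along one direction: return (cpu_planes, gpu_planes)
    as whole element planes."""
    gpu_planes = round(n_planes * share)
    gpu_planes = min(max(gpu_planes, 0), n_planes)
    return n_planes - gpu_planes, gpu_planes


# Hypothetical timings: CPU needs 3.0 s and GPU 1.0 s for the full domain.
share = gpu_share(3.0, 1.0)
cpu_planes, gpu_planes = split_planes(100, share)
```

With these made-up timings the GPU receives 75 of 100 element planes; measured runs would then be used to adjust the cut until the CPU and GPU finish at about the same time.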
In addition to using workload parallelization to divide the input data for balanced CPU and GPU workload distribution, we also employ code partitioning to exploit the individual strengths of the CPU and GPU for improving the DA performance. Effective code partitioning depends on not only the task assignment (i.e., matching the task characteristics with the processor) but also task scheduling (i.e., overlapping or pipelining CPU and GPU operations as much as possible). Our code partitioning approach for DA, PartitionDA (Figure 2(c) ), considers both of these aspects.
In PartitionDA, we use SmemSTC for STC on the GPU and OmpSMA for SMA on the CPU. This partition is based on the observation that (i) STC is a compute-intensive function for which GPU can offer high throughput, and (ii) SMA is memory-intensive and can be handled relatively well by CPU. To support pipelining, we partition the input 3D domain of DA into smaller sub-domains and make CPU and GPU iteratively execute their own tasks on these sub-domains. By making the CPU and GPU work on different sub-domains, their execution can be overlapped.
The pipelining scheme in PartitionDA is depicted in Figure 3 . Since transferring a few large data sets is more efficient than transferring many small data sets with the same total size, we have determined that a good operating point is to let each STC kernel and SMA function process an input sub-domain of size $25^3$. This size also keeps both the CPU and GPU busy for most of the execution duration.
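The benefit of overlapping the two stages can be estimated with an idealized two-stage pipeline model, sketched below in Python. The model assumes uniform per-sub-domain stage times and folds data transfer into the GPU stage; both simplifications, as well as the function names, are assumptions made for illustration.

```python
def num_subdomains(n, sub):
    """Number of cubic sub-domains of edge `sub` covering an n^3 domain."""
    per_dim = -(-n // sub)  # ceiling division
    return per_dim ** 3


def pipelined_time(count, t_gpu_stage, t_cpu_stage):
    """Total time when the CPU processes sub-domain k while the GPU
    already works on sub-domain k + 1."""
    if count == 0:
        return 0.0
    slowest = max(t_gpu_stage, t_cpu_stage)
    return t_gpu_stage + (count - 1) * slowest + t_cpu_stage


def sequential_time(count, t_gpu_stage, t_cpu_stage):
    """Total time without any overlap between the two stages."""
    return count * (t_gpu_stage + t_cpu_stage)
```

For a $100^3$ problem split into $25^3$ sub-domains, `num_subdomains(100, 25)` gives 64 pipeline steps; with equal stage times the pipelined schedule approaches half of the sequential time.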
IV. EXPERIMENTAL METHODOLOGY
To explore the landscape of performance/energy tradeoffs for DA, we consider six different hardware platforms covering two CPU-only platforms (LoPwr and HiPerf) and four CPU+GPU platforms (LoPwr+Fermi, LoPwr+Kepler, HiPerf+Fermi and HiPerf+Kepler). Table I shows the configuration details. All the platforms use the same 8GB DDR3 memory and CORSAIR CX600 power supply unit, and run the same CentOS 6.3 64-bit operating system. The GCC 4.3 compiler is used to produce optimized host-side binaries at the -O3 optimization level. The NVCC compiler of CUDA 5.0 is used to generate the kernel binaries on the GPUs.
In this work, we use CPU and GPU card energy to explore the energy behavior of different DA implementations. The reason for using CPU energy instead of system energy is that the energy of running the OS dominates the system energy and the GPU card could be viewed as an add-on processor for compute only. For collecting energy data, we have built a power measurement system. Figure 4 illustrates the system schematic using the CPU+GPU platform as an example, where the CPU motherboard connects with the GPU card through the PCI-Express interface.
The Core i7 motherboard has a dedicated 12V DC input for CPU power supply and a voltage regulator module (VRM) for dynamically converting the 12V DC voltage to the actual processor operating voltage. Although the 12V supply includes the energy overhead of the VRM, such inclusion is reasonable when deriving the GPU card energy because the GPU card energy includes the energy loss of the onboard VRM. Since there is no dedicated power input for the Atom processor, we insert a wire (the V core arrow in Figure 4 ) between the VRM and the CPU in order to deliver the V core DC input to Atom. To ensure a fair comparison, we add an additional 20% VRM energy loss (similar to that used in other papers, e.g., [9] ) to the measured Atom energy. The GPU power supply is roughly decomposed into a 12V auxiliary rail (the AUX arrow in Figure 4) on top of the GPU card and the power lane from the PCI-Express interface which can be further classified into the 3.3V and 12V lanes.
We use measured DC currents to derive energy usage. Four FLUKE 80i-110s clamps are employed to continuously capture the current values on different power lanes. A PCI-Express riser card is inserted between the PCI-Express interface and the GPU card to separate the power pins of the GPU card. An NI USB 6126 data acquisition system is used to collect the readings of the clamps and deliver the data to a computer for record keeping. The sampling rate is 10,000 samples per second.
To obtain energy data, we use the measured current values together with the specified supply voltages, leveraging the fact that the observed supply voltages fluctuate by at most 5% of the voltage specifications. The energy usage on a power lane is thus calculated by
$$E = \sum_{i} V \cdot I_i \cdot \Delta t,$$
where $V$ is the specified supply voltage of the lane, $I_i$ is the current measured in the $i$-th sampling period, and $\Delta t = 0.0001$ seconds (the inverse of the sampling rate). We use this equation to compute the energy of both CPU and GPU during the execution, but we use the total energy of CPU and GPU (if equipped with GPU) as the energy consumption in the following sections. For performance data, we obtain the execution time of a program by using timers inserted directly into the program.
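The energy computation from sampled currents can be sketched in a few lines of Python. The sketch assumes a fixed nominal voltage per power lane (for example 12V or 3.3V) and one current sample per sampling period; the function names and the sample data in the test values are hypothetical.

```python
def rail_energy(voltage, currents, dt=1e-4):
    """Energy in joules on one power lane: sum of V * I_i * dt over all
    sampling periods, using the nominal lane voltage."""
    return sum(voltage * i * dt for i in currents)


def total_energy(rails, dt=1e-4):
    """Sum the energy over several lanes.
    `rails` is a list of (nominal_voltage, current_samples) pairs."""
    return sum(rail_energy(v, samples, dt) for v, samples in rails)
```

As a sanity check, a constant 2 A on a 12V lane for one second (10,000 samples) corresponds to about 24 J, and adding a 3.3V lane at 1 A for the same duration adds about 3.3 J.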
V. EVALUATION
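We have conducted a number of experiments to evaluate the performance and energy consumption of the implementations discussed in Section III running on the hardware platforms discussed in Section IV. Besides the GPU and CPU+GPU implementations, we have also evaluated the legacy Serial and multi-threaded Omp (where Omp represents the usage of OpenMP with maximum physical CPU threads launched) implementations of STC, SMA and DA running entirely on the host CPU. The performance (resp., energy) data are captured by the speedup (resp., energy saving) of an implementation over the Serial implementation on the HiPerf platform, where
$$\text{speedup} = \frac{T_{\text{Serial on HiPerf}}}{T_{\text{implementation}}}, \qquad \text{energy saving} = \frac{E_{\text{Serial on HiPerf}}}{E_{\text{implementation}}},$$
with $T$ and $E$ denoting the measured execution time and energy, respectively. A value above 1X therefore means that the implementation is faster (resp., consumes less energy) than the Serial implementation on HiPerf.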
We have studied two input problem sizes, $50^3$ and $100^3$, using single-precision floating point variables, and noted that most observed trends are consistent for both problem sizes. In the rest of this section, we omit discussion of the impact of the problem size unless it makes a significant difference. Below, we present the performance and energy data for the different implementations.
We first investigate how different GPU thread designs impact performance and energy. Figure 5 (a) summarizes the performance results of three STC implementations (OmpSTC, BaseSTC and SmemSTC) and four SMA implementations (OmpSMA, BaseSMA, RowSMA and GnzSMA) on four hardware platforms (LoPwr, HiPerf, HiPerf+Fermi and HiPerf+Kepler). As shown in the figure, regardless of the GPU processor used, the baseline GPU implementation of STC (BaseSTC) achieves over 2X speedup but is not as fast as OmpSTC. However, due to the memory-intensive nature of SMA, the baseline GPU implementation of SMA, BaseSMA, is actually slower than the serial implementation, SerialSMA, on HiPerf. Based on thread design strategies that expose more parallelism, SmemSTC, RowSMA, and GnzSMA achieve large performance improvements. Specifically, RowSMA (resp., SmemSTC) outperforms BaseSMA (resp., BaseSTC) by about 120% and 160% (resp., 55.1% and 33.8%) on Fermi and Kepler, respectively. (Note that the percentages are obtained using the corresponding baseline implementations on each platform.) Better memory coalescing and the use of shared memory contribute to these improvements. GnzSMA avoids data races by using NZ data items to assemble data items in the global matrix, and further improves on the performance of RowSMA by about 97% and 67% on Fermi and Kepler, respectively.
Comparing across the hardware platforms, we note that RowSMA has a higher speedup while SmemSTC and GnzSMA have lower speedups on the Kepler platform. This is because Kepler provides more efficient handling of atomic operations, which occur more frequently in RowSMA, while our Fermi card provides more raw compute performance, which impacts the performance of SmemSTC and GnzSMA. SmemSTC on Fermi outperforms OmpSTC on HiPerf while SmemSTC on Kepler has similar performance to OmpSTC on HiPerf. GnzSMA on Fermi and Kepler both outperform OmpSMA on HiPerf. In conclusion, judicious thread designs can improve performance for both compute- and memory-intensive kernels on GPUs. Careful assignment of threads to avoid data races is important for improving the performance of memory-intensive kernels on GPUs.
We now examine the energy impact of the above thread designs. The energy savings as defined earlier for STC and SMA implementations are summarized in Figure 5(b) . The results show that energy savings of STC and SMA implementations on the same platforms are generally correlated with their performance, i.e., higher speedups lead to higher energy savings. One exception is OmpSMA on LoPwr for the problem size of $100^3$, which is 86.7% slower but 81.7% more energy efficient than OmpSMA on HiPerf. This is largely due to the fact that dynamic voltage/frequency scaling in LoPwr is more effective at saving energy. One may also note that SmemSTC improves the energy saving of BaseSTC by about 39% and 38% on Fermi and Kepler, respectively. For SMA, GnzSMA is about 365% and 378% more energy efficient than BaseSMA on Fermi and Kepler, respectively. Such significant energy efficiency improvement indicates that thread design is a strong factor for saving energy when using GPUs for memory-intensive kernels.
When comparing across the platforms, we see that all the GPU implementations of STC and SMA (BaseSTC, SmemSTC, BaseSMA, RowSMA and GnzSMA) run faster on Fermi than Kepler, but have lower energy saving on Fermi than Kepler. The reason is that Kepler is much more energy efficient than Fermi. If we consider the energy impact of using GPU over using CPU alone, we can see that OmpSTC and OmpSMA on HiPerf are more energy efficient than all the GPU implementations of STC and SMA, respectively. This trend can be attributed to the high instantaneous GPU power and the inclusion of CPU idle energy.
Next, we discuss how different GPU kernel designs impact performance and energy. Recall that we have devised two kernel designs, GmemDA which relies on the global memory for communication between the STC and SMA kernels (i.e., SmemSTC and GnzSMA), and SmemDA which leverages the shared memory to fuse the STC and SMA kernels (i.e., SmemSTC and RowSMA) into a single kernel. The performance and energy data of the two kernel designs are shown in Figure 6 (a) and (b). SmemDA removes global memory accesses between STC and SMA, and hence leads to better performance speedup (Figure 6(a) ) and higher energy saving (Figure 6(b) ) than GmemDA, even though its SMA kernel (RowSMA) is slower and consumes more energy than GnzSMA used by GmemDA. (See Figure 5 for RowSMA and GnzSMA data.) Compared with the CPU platforms, SmemDA outperforms OmpDA on HiPerf by about 125% and 52% but with about 18% and 20% more energy consumption for HiPerf+Fermi and HiPerf+Kepler, respectively.
We now examine how different task designs impact performance and energy. Specifically, we run the ParaDA and PartitionDA implementations on four different CPU+GPU platforms, LoPwr+Fermi, LoPwr+Kepler, HiPerf+Fermi and HiPerf+Kepler. For ParaDA, we have explored different workload divisions and settled on allocating 78%, 72%, 98% and 95% of the workload to the GPU for the HiPerf+Fermi, HiPerf+Kepler, LoPwr+Fermi and LoPwr+Kepler platforms, respectively. As a basis of comparison, we also run SmemDA on these four platforms, as well as OmpDA on the LoPwr and HiPerf platforms.
The performance and energy data from these different runs are shown in Figure 7 (a) and (b). It can be observed that ParaDA consistently delivers the highest performance on every one of the four platforms. Several factors contribute to this result: (i) OmpDA and SmemDA used in ParaDA are the fastest implementations on the CPU and GPU, respectively, hence ParaDA leads to the most efficient usage of each device; (ii) ParaDA has a balanced CPU and GPU workload distribution so that both CPU and GPU capabilities are fully exploited. PartitionDA, on the other hand, always runs slower than ParaDA on the same platforms. This performance gap between PartitionDA and ParaDA comes from the limited bandwidth between the host memory and the global memory and the significant data communication between STC and SMA. However, PartitionDA does show performance improvement over OmpDA.
Regarding the impact of task designs on energy, Figure 7 (b) shows that the CPU+GPU implementations' higher speedups also lead to lower energy. For example, for the problem size of $100^3$ running on HiPerf+Fermi, ParaDA achieves a speedup of 8X and energy saving of about 1.2X while PartitionDA achieves a speedup of about 5.5X and energy saving of about 0.65X (i.e., it consumes more energy than the Serial implementation on HiPerf).
A tradeoff between performance and energy is clearly visible between the implementations on a CPU-only platform (OmpDA) and the implementations on a CPU+GPU platform. For example, ParaDA achieves a speedup of 8.1X and energy saving of about 1.2X on HiPerf+Fermi while OmpDA achieves a speedup of about 2.9X and energy saving of about 1.35X for the problem size of $100^3$. Therefore, ParaDA could be the preferred solution in this case when the large performance gain outweighs the modest loss in energy saving.
VI. SUMMARY
This paper aims to understand how GPU and CPU+GPU implementations impact the performance and energy of accelerating DA in FEM. We developed a number of DA implementations employing different thread, kernel and task design approaches. Detailed performance and energy data were gathered on real hardware platforms. The results show that the performance and energy for different implementations on the same platform can vary significantly but the two metrics of speedup and energy saving follow the same trend. However, across different platforms, there may exist interesting performance and energy tradeoffs if the best implementation is chosen for each of the platforms.
