Architecture--Performance Interrelationship Analysis In Single/Multiple Cpu/Gpu Computing  Systems: Application To Composite Process Flow Modeling by Haney, Richard Harrison
North Carolina Agricultural and Technical State University 
Aggie Digital Collections and Scholarship 
Dissertations Electronic Theses and Dissertations 
2013 
Architecture--Performance Interrelationship Analysis In Single/
Multiple Cpu/Gpu Computing Systems: Application To Composite 
Process Flow Modeling 
Richard Harrison Haney 
North Carolina Agricultural and Technical State University 
Follow this and additional works at: https://digital.library.ncat.edu/dissertations 
 Part of the Computational Engineering Commons, and the Computer Engineering Commons 
Recommended Citation 
Haney, Richard Harrison, "Architecture--Performance Interrelationship Analysis In Single/Multiple Cpu/Gpu 
Computing Systems: Application To Composite Process Flow Modeling" (2013). Dissertations. 116. 
https://digital.library.ncat.edu/dissertations/116 
This Dissertation is brought to you for free and open access by the Electronic Theses and Dissertations at Aggie 
Digital Collections and Scholarship. It has been accepted for inclusion in Dissertations by an authorized 
administrator of Aggie Digital Collections and Scholarship. For more information, please contact iyanna@ncat.edu. 
Architecture–Performance Interrelationship Analysis in Single/Multiple CPU/GPU Computing 
Systems: Application to Composite Process Flow Modeling 
Richard Harrison Haney 




A dissertation submitted to the graduate faculty 
in partial fulfillment of the requirements for the degree of 
DOCTOR OF PHILOSOPHY 
Department: Computational Science and Engineering 
Major: Computational Science and Engineering 
Major Professor: Dr. Ram Mohan 




School of Graduate Studies 
North Carolina Agricultural and Technical State University 
This is to certify that the Doctoral Dissertation of 
 
Richard Harrison Haney 
 
has met the dissertation requirements of 
North Carolina Agricultural and Technical State University 
 






Dr. Ram Mohan 
Major Professor 
 
Dr. K.M. Flurchick 
Committee Member 
 
Dr. Ajit Kelkar 
Committee Member 
 
Dr. Sanjiv Sarin 
Dean, The Graduate School 
 
Dr. Marwan Bikdash 
Department Chair 
 
Dr. Dukka K.C.  
Committee Member 
 
























© Copyright by 





 Richard Harrison Haney was born in 1969 in West Liberty, Kentucky.  He received his 
Bachelor of Science in Computer Science from Western Carolina University in 2002 and his 
Master of Science in Computational Science and Engineering from North Carolina Agriculture 
and Technical State University in 2006.  He is a candidate for the Ph.D. in Computational 




I would like to dedicate this to my lovely wife, Lydia, who has been so supportive of all 
my endeavors.  I would also like to include our daughter Anna in this dedication for all the times 




 I would like to acknowledge the help and guidance of Dr. Ram Mohan.  Dr. Mohan is one 
of the most knowledgeable people in the area of physics based modeling and simulations, high 
performance computing and computational finite element methods. I would not have been able to 
complete this dissertation without his guidance, assistance and support. I would also like to 
acknowledge the help of Dr. Yu Liang for some of the tricky Computer Science issues. I thank 
and appreciate the support and guidance of my committee members Dr. Kelkar, Dr. Bikdash, and 
Dr. Dukka K.C. The financial support in part from Office of Naval Research, Clarkson 
Aerospace Corporation via a prime award from Air Force Research Laboratory through 
grants/contracts (Dr. Mohan, PI) is highly appreciated. Prior composite process flow modeling 
single and multiple processor code developments referenced in this work are acknowledged. The 
support from the Computational Science and Engineering program and the Department of 
Nanoengineering at the Joint School of Nanoscience and Nanoengineering are acknowledged. I 
thank the computer access to the GPU systems at University of North Carolina, Chapel Hill 
(Bass); Ohio Supercomputer Center (Oakley) through Dr. Liang, Central State University, and 
Hermes in the campus of North Carolina A&T State University. 
vi 
 
Table of Contents 
List of Figures ................................................................................................................................. x 
List of Figures ................................................................................................................................. x 
List of Tables ................................................................................................................................ xv 
Abstract ........................................................................................................................................... 2 
CHAPTER 1 Introduction............................................................................................................... 3 
1.1 Background and History........................................................................................................ 3 
1.1.1 CPU-GPU hardware parallelism. ...................................................................................... 4 
1.1.2 CPU-GPU software parallelism. ....................................................................................... 7 
1.2 Focus and Objective ............................................................................................................ 10 
CHAPTER 2 ................................................................................................................................. 12 
2.1 Physical Description ............................................................................................................ 15 
2.1.1 Resin Mass Conservation. ............................................................................................... 16 
2.1.2 LCM Solution Strategy ................................................................................................... 17 
2.2 Key Computationally Intensive Functions .......................................................................... 17 
2.2.1 The Iterative Solver. ........................................................................................................ 18 
2.3 Data-Structures/Layouts ...................................................................................................... 18 
2.4 Hardware Discussions of Computing Architectures Used .................................................. 20 
2.4.1 CPU Hardware Architectures .......................................................................................... 21 
2.4.2 GPU Hardware Architectures. ......................................................................................... 22 
2.4.3 Hardware Design Summary ............................................................................................ 23 
vii 
 
CHAPTER 3 Computational Potential of GPU – Sparse Matrix-Vector Multiplication ............. 25 
3.1 Mapping Sparse Matrix-Vector Multiplication to GPU ...................................................... 25 
3.1.1 The CUDA Software/Architecture API. ......................................................................... 26 
3.1.1.1 CUDA API thread hierarchy .................................................................................. 26 
3.1.1.2 CUDA API memory hierarchy .............................................................................. 28 
3.1.1.3 CUDA API basic syntax ........................................................................................ 30 
3.1.2 Algorithmic Strategies for SpMV Mapping. ................................................................... 30 
3.2 GPU Initial SpMV Performance Results ............................................................................ 31 
3.2.1 System A. ........................................................................................................................ 32 
3.2.2 System B. ......................................................................................................................... 32 
3.3 Software Data-Structures/Layouts Factors.......................................................................... 33 
3.4 Architectural Hardware Factors .......................................................................................... 36 
3.5 Interdependence of Software and Hardware Factors .......................................................... 38 
CHAPTER 4 Full Candidate Application – Single CPU/GPU Computing System ..................... 42 
4.1 Mapping Full Candidate Application to GPU ..................................................................... 42 
4.1.1 Key Computationally Intensive Kernels. ........................................................................ 43 
4.1.2 GPU Code Developments. .............................................................................................. 44 
4.1.2.1 CUDA Kernels of PCG solver ............................................................................... 44 
4.2 Validation of Full Candidate Application on Single CPU/GPU ......................................... 48 
4.3 Initial Full Candidate Application Performance on Single CPU/GPU ............................... 52 
viii 
 
4.3.1 System A. ........................................................................................................................ 53 
4.3.2 System B. ......................................................................................................................... 54 
4.3.3 Initial performance analysis. ........................................................................................... 55 
4.4 Software Data-Structures/Layouts Factors.......................................................................... 55 
4.5 Hardware Architectural Factors .......................................................................................... 59 
4.6 Computational Complexity Analysis .................................................................................. 61 
4.6.1 Single call to PCG solver. ............................................................................................... 61 
4.6.2 Full Solution Cost – Single GPU. ................................................................................... 66 
4.6.3 Contribution of Hardware Factors. .................................................................................. 71 
4.6.3.1 Single PCG Call. .................................................................................................... 71 
4.6.3.2 Full solution – single CPU/GPU. ........................................................................... 73 
4.6.4 Contribution of Software Factors. ................................................................................... 75 
4.6.4.1 Single PCG Call. .................................................................................................... 75 
4.6.4.2 Full solution – single CPU/GPU. ........................................................................... 76 
4.7 Performance and Relation to Software and Hardware Factors ........................................... 78 
CHAPTER 5 Full Candidate Application – Multiple CPU/GPU Computing System ................. 84 
5.1 Mapping Full Candidate Application to GPU ..................................................................... 85 
5.1.1 Key Computationally Intensive Kernels. ........................................................................ 87 
5.1.2 GPU Code Developments. .............................................................................................. 87 
5.2 Validation of Full Candidate Application on Multiple CPU/GPU ..................................... 88 
ix 
 
5.3 Initial Full Candidate Application Performance on Multiple CPU/GPU ............................ 91 
5.3.1 System A. ........................................................................................................................ 92 
5.3.2 System B. ......................................................................................................................... 96 
5.3.3 Initial performance analysis. ........................................................................................... 99 
5.4 Software Data-Structures/Layout Factors ......................................................................... 101 
5.5 Hardware Architectural Factors ........................................................................................ 109 
5.6 Computational Complexity Analysis ................................................................................ 110 
5.6.1 Relationship of MPI-GPU and CPU-GPU communication. ......................................... 111 
5.6.2 Comparison of performance modeling. ......................................................................... 119 
5.6.3 Contribution of Hardware Factors. ................................................................................ 122 
5.6.4 Contribution of Software Factors. ................................................................................. 126 
5.7 Performance and Relation to Software and Hardware Factors ......................................... 131 
CHAPTER 6 Summary and Future Directions ........................................................................... 133 
References ................................................................................................................................... 137 
Appendix A ................................................................................................................................. 153 
Appendix B ................................................................................................................................. 162 
Appendix C ................................................................................................................................. 166 




List of Figures 
Figure 1. The general models followed by the GPU (left) and CPU (right). .................................. 4 
Figure 2. Generic graphics rendering pipeline. ............................................................................... 5 
Figure 3. Rasterizing a simple triangle. .......................................................................................... 6 
Figure 4. An example 4-by-4 matrix utilized by OpenGL.............................................................. 8 
Figure 5. Vertex point P is shifted left and down with translation values. ..................................... 9 
Figure 6. Schematic representation of the interrelationship of computational performance. ....... 10 
Figure 7. Unstructured mesh geometric configuration used by candidate application. ................ 13 
Figure 8. Unstructured mesh geometric configuration of 10FT model. ....................................... 14 
Figure 9. Sparse matrix non-zero entry distribution for mesh MA. ............................................. 14 
Figure 10. Sparse matrix non-zero entry distribution for mesh MB............................................. 15 
Figure 11. Matrix sparsity pattern of mesh 10FT. ........................................................................ 15 
Figure 12. The CSR format. .......................................................................................................... 20 
Figure 13. The BCSR2x2 format. ................................................................................................. 20 
Figure 14. Diagram of CUDA GPU thread grouping. .................................................................. 27 
Figure 15. CUDA hardware to logical construct mapping. .......................................................... 27 
Figure 16. CUDA hardware with logical Grid overlay. ................................................................ 28 
Figure 17. CUDA GPU mapping of the physical and logical memory structure. ........................ 29 
Figure 18. Example defining CUDA Kernel function signature and execution space. ................ 30 
Figure 19. System A performance as threads per block increase (CSR and BCSR2x2). ............. 35 
Figure 20. System B performance as threads per block increase (CSR and BCSR2x2). ............. 36 
Figure 21. Block diagram of memory bank conflict for generic CUDA device........................... 40 
Figure 22. System A and System B Speedup factors for execution of SpMV operation ............ 41 
xi 
 
Figure 23. The CNC operational flow from input to output. ........................................................ 43 
Figure 24. Presented candidate application diagramming the placement of CNC Package. ........ 45 
Figure 25. IEEE conversion from base-10 to base-2 with normalization. .................................... 47 
Figure 26.  2D circular plate validation model (not to scale). ...................................................... 49 
Figure 27. Validation of single CPU/GPU for flow-front progression (System A). .................... 51 
Figure 28. Validation of single CPU/GPU for inlet injection pressure (System A)..................... 51 
Figure 29. Validation of single CPU/GPU for flow-front progression (System B). .................... 52 
Figure 30. Validation of single CPU/GPU for inlet injection pressure (System B). .................... 52 
Figure 31. Full candidate solution performance (System A). ...................................................... 58 
Figure 32. Full candidate solution performance (System B). ....................................................... 58 
Figure 33. Full candidate solution KFLOPs performance (System A). ....................................... 59 
Figure 34. Full candidate solution KFLOPs performance (System B). ....................................... 59 
Figure 35. Performance modeling of single call to PCG solver (System A). .............................. 64 
Figure 36. Performance modeling of single call to PCG solver (System B). ............................... 64 
Figure 37. Normalized error for single call PCG modeled performance (System A). ................. 65 
Figure 38. Normalized error for single call PCG modeled performance (System B). ................. 65 
Figure 39. Performance modeling single CPU/GPU full solution (System A). ........................... 68 
Figure 40. Performance modeling single CPU/GPU full solution (System B). ........................... 69 
Figure 41.  Error single CPU/GPU full solution modeled performance (System A). .................. 69 
Figure 42. Error single CPU/GPU full solution modeled performance (System B). ................... 70 
Figure 43. Performance model for single PCG solver on System A (SMP adjusted). ................. 72 
Figure 44. Performance model for single PCG solver on System B (SMP adjusted). ................. 73 
Figure 45. Performance model for full candidate solution on System A (SMP adjusted). .......... 74 
xii 
 
Figure 46. Performance model for full candidate solution on System B (SMP adjusted). .......... 75 
Figure 47. Performance model for single PCG solver on System A (Threads adjusted). ............ 76 
Figure 48. Performance model for single PCG solver on System B (Threads adjusted). ............ 77 
Figure 49. Performance model for full candidate solution on System A (SMP adjusted). .......... 77 
Figure 50. Performance model for full candidate solution on System B (SMP adjusted). .......... 78 
Figure 51.  2D circular plate validation model (not to scale). ...................................................... 89 
Figure 52. Validation of multiple CPU/GPU for flow-front progression (System A). ................ 90 
Figure 53. Validation of multiple CPU/GPU for inlet injection pressure (System A). ................ 90 
Figure 54. Validation of multiple CPU/GPU for flow-front progression (System B). ................ 91 
Figure 55. Validation of multiple CPU/GPU for inlet injection pressure (System B). ................ 91 
Figure 56.  Multiple CPU/GPU computing system - mesh MA (System A). .............................. 95 
Figure 57. Multiple CPU/GPU computing system - mesh MB (System A). ............................... 95 
Figure 58. Multiple CPU/GPU computing system - mesh 10FT (System A).............................. 96 
Figure 59. Multiple CPU/GPU computing system - mesh MA (System B). ............................... 98 
Figure 60. Multiple CPU/GPU computing system - mesh MB (System B). ............................... 99 
Figure 61. Multiple CPU/GPU computing system - mesh 10FT (System B). ............................. 99 
Figure 62. Multiple CPU/GPU performance BCSR2x2 compression (System A). ................... 105 
Figure 63.  Multiple CPU/GPU performance BCSR2x2 compression (System B). .................. 106 
Figure 64. Multiple CPU/GPU performance mixed compression (System A). ......................... 106 
Figure 65. Multiple CPU/GPU performance mixed compression (System B). ......................... 107 
Figure 66. Multiple CPU/GPU performance mixed compression - 10FT (System A). ............. 108 
Figure 67. Multiple CPU/GPU performance mixed compression - 10FT (System B). ............. 108 
Figure 68. Deviation from ideal mapped with regression multiple CPU/GPU mesh MA. ........ 114 
xiii 
 
Figure 69. Deviation from ideal mapped with regression multiple CPU/GPU mesh MB. ........ 114 
Figure 70. Non-zeros held locally (mesh MA) and factors off with multiple CPU/GPU. ......... 115 
Figure 71. Non-zeros held locally (mesh MB) and factors with multiple CPU/GPU. ............... 115 
Figure 72. Asymptotic behavior of MPI and CPU-GPU communication (MA, System A). ..... 116 
Figure 73. Asymptotic behavior of MPI and CPU-GPU communication (MB, System A). ..... 116 
Figure 74. Asymptotic behavior of MPI and CPU-GPU communication (MA, System B). ..... 117 
Figure 75. Asymptotic behavior of MPI and CPU-GPU communication (MB, System B). ..... 117 
Figure 76. Asymptotic behavior of MPI and CPU-GPU communication (10FT, System A). .. 118 
Figure 77. Asymptotic behavior of MPI and CPU-GPU communication (10FT, System B). ... 118 
Figure 78.  Multiple CPU/GPU theoretical performance with input MA (System A). .............. 119 
Figure 79. Multiple CPU/GPU theoretical performance with input MB (System A). ............... 120 
Figure 80. Multiple CPU/GPU theoretical performance with input 10FT (System A). ............ 120 
Figure 81. Multiple CPU/GPU theoretical performance with input MA (System B). ............... 121 
Figure 82. Multiple CPU/GPU theoretical performance with input MB (System B). ............... 121 
Figure 83. Multiple CPU/GPU theoretical performance with input 10FT (System B). ............ 122 
Figure 84. Multiple CPU/GPU performance with MA - hardware factor (System A). ............. 124 
Figure 85.  Multiple CPU/GPU performance with MB - hardware factor (System A). ............ 124 
Figure 86. Multiple CPU/GPU performance with 10FT - hardware factor (System A). ........... 125 
Figure 87.  Multiple CPU/GPU performance with MA - hardware factor (System B). ............ 125 
Figure 88. Multiple CPU/GPU performance with MB - hardware factor (System B)............... 126 
Figure 89. Multiple CPU/GPU performance with 10FT - hardware factor (System B). ........... 126 
Figure 90.  Multiple CPU/GPU performance with MA - software factor (System A). ............. 128 
Figure 91. Multiple CPU/GPU performance with MB - software factor (System A). .............. 128 
xiv 
 
Figure 92. Multiple CPU/GPU performance with 10FT - software factor (System A). ............ 129 
Figure 93. Multiple CPU/GPU performance with MA - software factor (System B). .............. 129 
Figure 94. Multiple CPU/GPU performance with MB - software factor (System B). ............... 130 
Figure 95. Multiple CPU/GPU performance with 10FT - software factor (System B). ............ 130 
Figure 96. Time filled for unstructured mesh MA CPU-Only (System A) ................................ 169 
Figure 97. Time filled for unstructured mesh MA single CPU/GPU (System A) ..................... 170 
Figure 98. Time filled single CPU/GPU with circular plate (System A) ................................... 171 
Figure 99. Time filled multiple CPU/GPU with circular plate (System A)................................ 171 
Figure 100. Time filled single CPU/GPU with circular plate (System B).................................. 172 
Figure 101. Time filled multiple CPU/GPU with circular plate (System B) .............................. 172 
Figure 102. Input mesh model 10FT multiple partition time-filled comparison (System A) .... 173 
Figure 103. Input mesh model 10FT multiple partition time-filled comparison (System B) .... 174 
xv 
 
List of Tables 
Table 1  System A – CPU hardware architecture ......................................................................... 21 
Table 2  System B – CPU hardware architecture ......................................................................... 22 
Table 3  System A – GPU hardware architecture ......................................................................... 23 
Table 4  System B – GPU hardware architecture ......................................................................... 23 
Table 5  Time comparison for SpMV on System A (CSR compression) .................................... 32 
Table 6  Time comparison for SpMV on System B (CSR compression) ..................................... 32 
Table 7  SpMV time comparison per thread occupancy and data format (System A) ................. 35 
Table 8  SpMV time comparison per thread occupancy and data format (System B) ................. 36 
Table 9  Full solution performance with single CPU/GPU and CSR compression (System A) .. 54 
Table 10  Full solution KFLOPs with single CPU/GPU and CSR compression (System A) ...... 54 
Table 11  Full solution performance with single CPU/GPU and CSR compression (System B) 54 
Table 12  Full solution KFLOPs with single CPU/GPU and CSR compression (System B) ...... 55 
Table 13  GPU device limits for both systems ............................................................................. 63 
Table 14  Performance modeling of single calls PCG (System A) .............................................. 65 
Table 15  Performance modeling of single calls PCG (System B) .............................................. 66 
Table 16  Performance modeling single CPU/GPU full solution (System A) ............................. 70 
Table 17  Performance modeling single CPU/GPU full solution (System B) ............................. 70 
Table 18  Categorized software and hardware effects (System A) .............................................. 81 
Table 19  Categorized software and hardware effects (System B) .............................................. 81 
Table 20  Hardware factors and theoretical performance (System A) ........................................ 82 
Table 21  Software factors and theoretical performance (System A) .......................................... 82 
Table 22  Hardware factors and theoretical performance (System B) ......................................... 82 
xvi 
 
Table 23  Software factors and theoretical performance (System B) ........................................... 83 
Table 24  Multiple CPU/GPU performance in milliseconds with mesh MA (System A) ........... 94 
Table 25  Multiple CPU/GPU performance in milliseconds with mesh MB (System A) ........... 94 
Table 26  Multiple CPU/GPU performance in milliseconds with mesh 10FT (System A) ......... 94 
Table 27  Multiple CPU/GPU performance in milliseconds with mesh MA (System B) ........... 97 
Table 28  Multiple CPU/GPU performance in milliseconds with mesh MB (System B) ............ 97 
Table 29  Multiple CPU/GPU performance in milliseconds with mesh 10FT (System B) ......... 98 
Table 30  Multiple CPU/GPU performance in seconds (System A) .......................................... 102 
Table 31  Multiple CPU/GPU performance of 10FT model in seconds (System A) ................. 103 
Table 32  Multiple CPU/GPU performance in seconds (System B) .......................................... 103 
Table 33  Multiple CPU/GPU performance of 10FT model in seconds (System B) ................. 103 
Table 34  Multiple CPU/GPU performance in seconds – formats (System A) .......................... 104 




Current developments in computing have shown the advantage of using one or more Graphic 
Processing Units (GPU) to boost the performance of many computationally intensive 
applications but there are still limits to these GPU-enhanced systems. The major factors that 
contribute to the limitations of GPU(s) for High Performance Computing (HPC) can be 
categorized as hardware and software oriented in nature.  Understanding how these factors affect 
performance is essential to develop efficient and robust applications codes that employ one or 
more GPU devices as powerful co-processors for HPC computational modeling. 
The present work analyzes and understands the intrinsic interrelationship of both hardware and 
software categories on computational performance for single and multiple GPU-enhanced 
systems using a computationally intensive application that is representative of a large portion of 
challenges confronting modern HPC.  The representative application uses unstructured finite 
element computations for transient composite resin infusion process flow modeling as the 
computational core, characteristics and results of which reflect many other HPC applications via 
the sparse matrix system used for the solution of linear system of equations. This work describes 
these various software and hardware factors and how they interact to affect performance of 
computationally intensive applications enabling more efficient development and porting of High 
Performance Computing applications that includes current, legacy, and future  large scale 





 Recent years have seen the slowing of computational intensity offered by standard 
Central Processing Unit (CPU)-based systems, while scientific/engineering applications have 
inexorably grown in the need for computational power [1-3].  As the CPU approached maximum 
power sustainability around 2003, growing from 2/1 cmwatt  to 2/20 cmwatts [3, 4], the 
Graphics Processing Unit (GPU) was increasing its computational intensity while maintaining 
efficient power management [5, 6] and lowering cost to meet the high demand placed upon it 
from a thriving game industry [1, 2, 7].  The differences in CPU and GPU computational power 
can be traced back to the designs upon which the two are predicated – i.e., Instruction-Stream 
Based (ISB) and Data-Stream Based (DSB) models. 
1.1 Background and History 
 The ISB design of the CPU, whereby a single stream of instructions and data are fed to 
the device, limited any optimization of arithmetic operations since the input stream could contain 
any number of potentially complex instructions.  Therefore, the CPU accomplished the 
mitigation of latency by defining elaborate memory caches where processes could be switched 
out when needed, such as with an I/O interrupt, constraining larger numbers of transistors to the 
Memory Management Units (MMU) and logic device arrays [2, 8, 9] for complex operations 
such as speculative branching [10-12]. In comparison, the DSB design of the GPU, with 
instructions and data fed to the device as separate streams, could optimize for arithmetic 
operations as the instruction stream is committed prior to any data input.  Therefore, the GPU 
accomplished mitigation of latency by pushing as many processes as possible through the device 
at any given time, conscripting large numbers of transistors for floating-point operations and 
4 
 
assumed large arrays of uninterrupted data streams via a wide data bus [2, 8].  The DSB and ISB 
paradigms define the framework upon which the present computational architectures and 
software designs for the CPU and GPU have evolved (see Figure 1). 
 1.1.1 CPU-GPU hardware parallelism. The CPU computing architecture, as per the 
ISB model, allowed for the maintenance of increasingly complex processes [9, 13], the execution 
of which is physically and logically defined by the concept of pipelines.  CPU pipelines are 
constructs which consist of stages of processing elements executed in a series – each output of a 
stage is the input to the next [9, 13].  This single pipeline evolved to the more efficient multiple 
pipelines [9, 13], eventually leading to super-scalar systems [14] and vector processing 
machines [15] in an effort to maximize hardware oriented computational power for the CPU-
only computing architecture. 
 
Figure 1. The general models followed by the GPU (left) and CPU (right). 
 Both the vector processing machines [14, 16, 17] and super-scalar systems [9, 14] 
increase the computational power of CPU-only computing architectures using a low-level 
hardware defined parallelism but take different routes to achieve this goal.  The super-scalar 
system defines hardware parallelism via execution of multiple operations per clock cycle 
augmenting the system with large numbers of registers [9, 14], whereas the vector processing 
machine implements common operations across multiple data elements [14, 16, 17].  Vector 
processing machines, such as the CRAY [18] and the NEC SX-8 [19], obtain magnitudes of 
5 
 
speedup over scalar systems by pushing vector elements onto a special register known as a pipe 
and then execute operations across these elements simultaneously [14, 15].  These CPU 
architectural designs evolved to coerce a parallelism that is intrinsically and naturally present in 
the GPU device architecture. 
 Initially computer graphics were defined as simple vector devices executing as a separate 
process using Direct Memory Access (DMA) to bypass frequent interrupts to the CPU [9, 13],  
but as the GPU matured into a dedicated device it was free to optimize for throughput, as per the 
DSB model. The larger concentration of transistors in the floating-point operations coupled with 
a wide data bus produced the computational intensity for which the GPU computing architecture 
is widely reputed [20-23] – large numbers of data input is executed simultaneously by 
preconfigured floating-point operations [2, 8, 24].  The execution of floating-point operations 
across the set of data input is physically and logically defined by the construct of a graphics 
pipeline (see Figure 2 and Figure 3). 
 




Figure 3. Rasterizing a simple triangle. 
 The graphics pipeline construct is a feed-forward system [24] designed to perform 
operations on four fundamental entities for final rendering by the display device – vertices, 
primitives, fragments, and pixels [2].  Initially these fundamental entities were defined wholly 
within a fixed-pipeline [1, 2] but evolved to become a mix of fixed and programmable sections 
represented by an interleaved series of  fixed functions and  programmable stages across which 
the data entities flow [1, 2].  Vertex generation is a fixed function and is the first step in the 
graphics pipeline, generating a series of vertices using lists of descriptors from the graphics 
application [2].  The output vertices from the vertex generation function are passed to the 
programmable stage of vertex processing, generating sets of vertex records independently from 
each vertex – projecting the original 3-coordinate system to a 2-coordinate system [2].  The input 
vertices are grouped next into an ordered stream of primitives by the fixed function, primitive 
generation and then passed to the programmable stage of primitive processing – potentially 
merging multiple primitives for rendering [2].  The modified primitives are passed to the 
fragment generation fixed function, producing fragment records that are interpolated from 
samples of the input and passed to the programmable stage, fragment processing [2, 24, 25].  
The fragment processing programmable stage simulates the interaction of light and surface with 
the input fragment records – textures are defined at this stage as 1-D, 2-D, or 3-D arrays [2, 24, 
7 
 
25].  The final fixed function, pixel operations are next and calculate output pixels for rendering 
using the input fragment’s screen position [1, 2, 7, 24, 25]. 
 The software paradigms have also developed conjointly with the architectures for both 
the CPU and GPU devices – the two modes evolving as emphasis on higher performance and 
ease of development have matured. 
 1.1.2 CPU-GPU software parallelism. The CPU-only software paradigms have evolved 
along two general modes to increase computational ability – threaded and message passing [15, 
22, 26-28].  The former defines a methodology that concurrently runs independent threads of 
execution within a single address space [26] and the latter uses message constructs to 
communicate and pass data with different processors [26, 28].  These CPU-only software 
paradigms denote parallelization as a means of boosting application performance – a common 
practice in HPC applications that has historically been efficacious [26, 28-30].   
 The actual implementations of the threaded and message passing systems is defined by 
several standards, libraries and/or specialized languages that include Unified Parallel C (UPC), 
OpenMP, PThreads, Parallel Virtual Machine (PVM), and Message Passing Interface (MPI) 
among others [26, 28, 29].  Flynn’s Taxonomy categorizes the threaded paradigm into the Single 
Instruction Multiple Data (SIMD) and the message passing paradigm into the Single Program 
Multiple Data (SPMD) as the threaded mode requires a lockstep synchronization in a shared 
memory context and the message passing mode can operate in a distributed memory context with 
varying degrees of autonomy [26, 28]. 
 The GPU software paradigms, unlike the CPU counterparts that explicitly sought to boost 
performance via parallelism, evolved out of a need to render specialized visuals for graphics 
heavy applications e.g. high demand games [1].  Once the graphics pipeline became flexible, 
8 
 
allowing programming of the vector and fragment processors, an efficient set of software 
constructs for rendering advanced 3-D visuals was needed and shader languages and libraries 
were developed, so called because the graphics programs generally were written to shade 
fragments of a given rendered object [24, 31], and included the first portable library - Silicon 
Graphics ubiquitous OpenGL [1, 7, 32].  OpenGL was a watershed moment in graphics 
programming as now applications no longer had to be written specifically for a given 
architecture and/or operating system, rendering and manipulating primitives using sets of matrix 
operations that included transformation, translation, rotation, and scaling [33, 34]. 
 The OpenGL graphics library was written as an extension of the C language and is built 
using a basic 4-element vector wzyx ,,,  such that if 1w  the vector defines a position in space 
and if 0w  it is a defined direction [24, 34, 35]. These basic 4-element vectors describe a 
homogeneous coordinate system with the w element allowing translation and rotation, building 
into a series of 44x matrices that can be modified simultaneously with a single formula [1, 2, 24, 







 elements are the translation components – i.e. M03, M13, and M23 in the figure. 
 
Figure 4. An example 4-by-4 matrix utilized by OpenGL.  
 Figure 5 shows a generic 2-D translation of a simple rectangle that occurs after applying 
translation elements to each of the vertices – done simultaneously by the execution of a 




Figure 5. Vertex point P is shifted left and down with translation values. 
 The OpenGL library defines the matrices as column-major order rather than row-major 
but this essentially makes no difference as pre-multiplying with row matrices is the same as post-
multiplying with column matrices [33, 34].  
 The advent of OpenGL provided a boost to graphics programming itself, accessing the 
growing computational power afforded by the GPU for non-rendering purposes was still 
hindered by the tedious and difficult mappings required by the library. The release of Nvidia’s 
Compute Unified Device Architecture (CUDA) in 2007 marked the beginnings of General 
Purpose GPU (GPGPU) computing as it is known today [7, 25, 36].  The CUDA API uses C 
language bindings to access underlying system calls to the GPU processors and embraces the 
familiar concept of threading. CUDA utilizes a GCC-like compiler, NVCC, to compile the GPU-
bound code to low-level Parallel Thread eXecution (PTX) virtual machine and Instruction Set 
Architecture [37, 38].  PTX provides a machine independent architecture for CUDA compilers to 
target and allows for portability across multiple GPU generations [37, 38]. 
 Computational modeling and simulations in many fields require both hardware and 
software compatibility and influence the resultant computational performance [20, 23, 25, 39-
41].  The factors that influence the computational performance can be categorized as software 
and hardware oriented, each category can be further demarcated as computational algorithms 
10 
 
and data-structures/layouts under software; and architectural designs for single and multiple 
CPU/GPU under the hardware category. 
1.2 Focus and Objective 
 The focus and objective of this dissertation is to analyze and understand the intrinsic 
interrelationship between the computing hardware architecture and software variables on the 
performance of the single and multiple CPU/GPU computing systems shown schematically in 
Figure 6.  These analysis and discussions are based on computationally intensive, unstructured 
finite element computations for transient composite process flow modeling application.  The 
discussions are organized into single CPU/GPU and multiple CPU/GPU systems. The reminder 
of the dissertation is presented and organized into chapters as outlined next.  
 
Figure 6. Schematic representation of the interrelationship of computational performance. 
 Chapter 2 defines the computational problem analyzed and provides key computationally 
intensive kernels, associated algorithms, and software data-structures/layouts, as well as the 
hardware descriptions.  Chapter 3 establishes the computational potential of the GPU via the 
analysis and discussion of a key component of many computationally intensive applications – the 
sparse matrix-vector multiplication.  The performance metrics and results as well as the software 
data-structures and architectural factors that influence the performance are presented.  Chapter 4 
11 
 
focuses on the full-solution for the candidate application defining key computationally intensive 
kernels and associated developments such as data-structures/layouts for the single CPU/GPU 
computing system.  The computational performance as it relates to the architecture and software 
relationship is examined and discussed in this chapter.  Chapter 4 will also discuss the hardware 
architectural factors and problem size and how these work to influence computational 
performance. All these factors are considered in the development of an empirical computational 
complexity relationship that will then be correlated to the results and parameters to understand 
how these factors influence the performance.  Chapter 5 is essentially the same as chapter 4 
excepting for the use of a multiple CPU/GPU computing systems – defining the relevance of 
these systems for multiple processor designs.  Finally, chapter 6 is a summary of the results and 
analysis as well as proposed future directions for CPU/GPU and hybrid computing systems.  
This last chapter will establish the interrelationship between software-hardware variables on 
computational performance and how these findings can be applied to many computationally 





Computational Problem – Candidate Application 
 This chapter describes the computationally intensive physical problem employed as a 
candidate application for analysis and discussion of computational performance factors – Resin 
Transfer Molding (RTM) process and the transient process resin infusion flow modeling [30, 
42, 43].  This candidate application presented employs unstructured finite element computations, 
assembling sets of locally defined stiffness matrices into a single global stiffness matrix [44], 
that is sparse, symmetric and positive-definitive [44, 45].  The global stiffness matrix composed 
of finite element computations, is solved at each time-step during resin infusion flow analysis 
using the iterative Preconditioned Conjugate Gradient method [46] rather than a direct solver 
such as LU-Decomposition which is computationally prohibitive for most non-trivial problems 
[44, 45]. 
 The computational model configurations for computational performance analysis built 
upon finite element meshes employed by the computationally intensive candidate application 
increase in the problem size based on the 3-noded triangular element and node count but are 
consistent in geometry and parameters applied – i.e., both mesh models have the same physical 
descriptions of resin infusion flow modeling excepting size. These meshes, ordered by increasing 
element/node, are shown below – Figure 7 is representative of both mesh configuration 
employed in the resin flow infusion modeling analysis for the candidate application.  
Unstructured Mesh Number of Nodes Number of Elements 
MA 26,936 53,178 




 The finite element models defined by meshes MA and MB are consistent in geometry 
and physical problem parameter input therefore for exhaustive purposes another, less regular, 
mesh is analyzed for performance behavior and is denoted as 10FT – shown below is the number 
of nodes/elements for this model mesh configuration. The 10FT unstructured mesh falls roughly 
into the category of medium-sized computational problem with respect to the other two meshes 
studied in this research and is composed of 3-noded triangular elements; however, it is more 
complex in structure. The 10FT unstructured mesh model is shown in Figure 8. 
Unstructured Mesh Number of Nodes Number of Elements 
10FT 29,171 58,187 
 
 Figure 9, Figure 10, and Figure 11 show plots of the non-zero element sparsity pattern of 
matrices that result from the input meshes MA, MB, and 10FT respectively.  Clearly meshes 
MA and MB are more regular than the matrix from mesh 10FT, which is though diagonally 
dominant and symmetric, has non-zero elements dispersed more evenly across the entire matrix. 
 All algorithms referenced in this chapter are presented in Appendix C of this dissertation 
for convenience and better formatting of text. 
 




Figure 8. Unstructured mesh geometric configuration of 10FT model. 
 




Figure 10. Sparse matrix non-zero entry distribution for mesh MB. 
 
Figure 11. Matrix sparsity pattern of mesh 10FT. 
2.1 Physical Description   
 The Resin Transfer Molding (RTM) process flow modeling methodology studied is 
based on the work published in [30, 43, 47] and is presented briefly next. 
16 
 
 2.1.1 Resin Mass Conservation. Following the discussions in [30, 43, 47], the resin flow 
through the fiber preform contained within the mold cavity is represented by the transient mass 
conservation equation.  The physical mass conservation equation (formed by coupling the mass 
conservation equation with the momentum equation via Darcian velocity field) is given by 
equation (2.1) with K the permeability tensor,  the resin viscosity, P the pressure field, and 
 the state variable representing the infused state of the resin – further details are available in 
[30, 43, 47]. 
















     (2.1) 
The value of the state variable   is 0, where the resin has not infused the fiber preform, and 1, 
where the resin has completely infused the fiber preform in any given region of the Eulerian 
mold continuity domain   used in the Finite Element Method (FEM) computations. 
 As discussed in [43]  the application of Galerkin weighted residual formulation and 
approximating for the pressure P  and fill factor  , with appropriate elements and associated 
shape functions, yields a semi-discrete system of equations given by equation (2.2) with C  the 
lumped mass matrix representing pore volume, K  the stiffness matrix, q  the load vector, and   
the time derivative term. 
 qPKC       (2.2) 
The transient semi-discrete equation is then solved by introducing the finite difference 
approximation given by equation (2.3). 







      (2.3) 
17 
 
 2.1.2 LCM Solution Strategy The semi-discrete equation (2.2) can be reduced to the 
equation (2.4) as discussed in [43], with C  taken to be the lumped mass matrix.  






     (2.4) 
 The above form of the discretized equation is solved by the LCM process flow simulation 
algorithm at each time-step.  Equation (2.4) defines the implicit form of the process flow 
modeling in LCM detailed in [43].  The generalized algorithm for the finite element 
approximation of the process flow modeling for each time step is summarized in Algorithm 2.1 - 
the interested reader is referred to  [30, 43, 47] for further details on the LCM process and its 
conversion to the Finite Element Method formulation for the computational simulation. 
2.2 Key Computationally Intensive Functions 
 Examination of Algorithm 2.1 reveals the most computationally expensive function of 
the candidate application – the solution to the system of linear equations given in matrix form as 
bxA   embodied on line 9 as      mimjmij gPK ˆ .  The solution to systems of linear equations 
can be accomplished via direct methods or iterative methods - each has advantages and 
disadvantages. 
 Direct solution methods such as LU-Decomposition provide a solution in a single direct 
solution step for the linear equation system without a need of initial guess solution vector but 
require high computational costs whereas iterative methods start with an initial guess that 
iteratively converges to a solution vector but have significantly lower costs [45].  The candidate 
application presented executes this solver within a nested loop so any incurred costs from a 
single call to the solver will be exacerbated by the product of  LK   with K the number of 
iterations for mass-convergence and L the number iterations for all nodes to be determined as 
filled – in addition to the iterations for the convergence of the iterative solution . The 
18 
 
Preconditioned Conjugate Gradient (PCG) iterative solver was selected as a balance of 
computational cost and accuracy for the solution of linear system of equations. 
 2.2.1 The Iterative Solver. The PCG iterative solver chosen for the candidate application 
provides good balance of computational cost and solution accuracy, composed of a set of matrix-
vector operations [44, 46].  The majority of the computational cost of the PCG iterative solver is 
well documented as the Sparse Matrix-Vector (SpMV) multiplication [17, 46, 48-51] – this 
operation is shown on lines 6 and 9 of Algorithm 2.2.  The SpMV operation has high 
computational cost due to three performance issues – poor locality, poor instruction mix, and a 
high memory overhead [45, 52].  Poor locality of SpMV results from indirect and often irregular 
memory accesses, poor instruction mix is derived through the execution of three memory loads 
for every two floating-point operations, and the high memory overhead is due to the largest 
portion of the matrix being zero and thus useless to the computation but held in memory 
regardless.  The high memory overhead of this candidate application, and indeed any application 
that builds large sets of sparse matrices, is defined as a memory-bound problem and requires data 
compression data-structures/layouts to execute [30, 44, 48].  
 There are different compression formats that are used to execute memory-bound 
problems and a discussion of these follows. 
2.3 Data-Structures/Layouts 
 One of the most common form of data compression for sparse matrix data structures used 
in engineering/scientific application codes for High Performance Computing (HPC)  is the 
Compressed Sparse Row (CSR) structure which is partly due to its ease of programming [52-
54]; however this is not the only model to follow and much research has been done to explore 
this area as the format used can have significant impact on the resulting performance [44, 49, 
19 
 
52].  The CSR data compression format consists of three arrays – non-zero data elements, 
column indices, and row pointers, and operates by iterating over all the rows of the compressed 
sparse matrix (see Figure 12).  Each row is represented as the length between each non-zero 
element held at the row pointer index currently being iterated over, which is then passed as an 
input to the column indices array resulting in the original  columnrow,  position of the element 
from the original sparse matrix – now held as an array of non-zero elements [52]. 
 Algorithm 2.3 is the expression of the sparse matrix-vector multiplication using the CSR 
data compression format [48, 52].  The utilization of the CSR compression format lowers the 
memory overhead for the sparse matrix-vector multiplication but does nothing for the other two 
components that effect performance, e.g. locality and instruction ratio - line 9 of Algorithm 2.3 
embodies an operation that is deleterious to any locality and does not improve the instruction 
ratio. 
 The Block Compressed Sparse Row with 2x2 (BCSR2x2) sub-blocks has been shown 
to improve the often disappointing performance of the CSR by increasing locality via reduced 
number of memory loads [21, 52, 55], however this improvement is not guaranteed.  BCSR2x2 
operates on sets of sub-blocks of dimension 2x2 rather than single elements, but locality is only 
improved if the original sparse matrix is composed of dense sets of sub-blocks; otherwise 
memory overhead is increased with no corresponding performance boost (see Figure 13).  
Determination of these dense sub-blocks of the sparse matrix is not known until runtime, making 




Figure 12. The CSR format. 
 
Figure 13. The BCSR2x2 format. 
2.4 Hardware Discussions of Computing Architectures Used 
 The GPU, CPU code developments from the present work for the candidate application is 
executed within the context of two separate computing machine architectures generally 
categorized as CPU-based, GPU-based, and/or some combination of the two.  Biomedical 
Analysis and Simulation Supercomputer (BASS) located at Chapel Hill, North Carolina is 
categorized as System A for the remainder of this dissertation, and system called as OAKLEY 
located at Ohio Supercomputing Center in Columbus, Ohio is categorized as System B for the 
remainder of this dissertation.  The System A CPU/GPU computing system is composed of 
AMD Dual-Core Opteron CPUs and Nvidia Quadro FX 5600 GPUs each with a 2.8 GHz and 1.5 
GHz clock frequency respectively. The System B CPU/GPU computing system is composed of 
21 
 
Intel Xeon X5650 6-Core CPUs and Nvidia Tesla M2070 GPUs each with a 2.66 GHz and 1.15 
GHz clock frequency respectively. The details of the hardware designs for both computing 
systems are discussed next – categorized separately by CPU and GPU. 
 2.4.1 CPU Hardware Architectures The CPU hardware designs for System A and 
System B follow the same concepts, but System B is more advanced (see Table 1 and Table 2). 
These architectures both have multiple levels of cache memory, 2 levels for System A and 3 for 
System B, and different capacities [56, 57].  System B has greater cache memory capacity at all 
levels, and while both computing systems have the same memory I/O width, System B has faster 
memory devices – double that of the memory devices for System A [56, 57]. 
 Both System A and System B have embraced the concept of multiple core architectures, 
but as with the memory structures System B is the more advanced of the two.  A processing core 
can be viewed as a separate CPU executing in the same address space enabling the computing 
architecture to theoretically increase computational ability [9, 13], each core operates at a global 
clock frequency for the device – System B executes 6 processing cores whereas System A has 2 
[56, 57].  System A executes these cores as a higher clock frequency but System B has three 
times the number of cores – a faster design for the hardware level. 
Table 1  
System A – CPU hardware architecture 
Processor Clock 2.8 GHz 
Processing Cores 2 
L1 Cache 256 KBs 
L2 Cache 2 MBs 




Table 2  
System B – CPU hardware architecture 
Processor Clock 2.66 GHz 
Processing Cores 6 
L1 Cache 384 KBs 
L2 Cache 1.50 MBs 
L3 Cache 12 MBs 
Memory I/O 64-bit DDR3 SDRAM 
 
 2.4.2 GPU Hardware Architectures. The GPU hardware designs for System A and 
System B follow the same concepts, but System B is more advanced (see Table 3 and Table 4).  
These architectures both have processing cores in the hundreds to accommodate the massive 
computational power involved in optimizing throughput, however System B has more than three 
times the number offered by System A [58, 59].  System A has a faster clock frequency for each 
processing core but has a significantly lower number of these cores so the aggregate power of 
System B is greater regardless [58-61].  Each of these cores can be viewed as a separate 
processor, but unlike the CPU, the GPU will execute the same instruction for all input until a 
reconfiguration occurs for the next large set of input  [31, 35] hence less emphasis on memory 
management via cache hierarchies. 
 Both System A and System B have wide memory I/O at 384-bits to sustain the high 
number of processes required to maximize throughput as a latency mitigation strategy prescribed 
by the DSB paradigm which the GPU follows. The memory device employed by System B uses 
the same input width but is faster than that of System A [58, 59, 61]. 
 System B is part of the CUDA Compute Architecture 2.0, a metric used by Nvidia to 
categorize the device versions at both the software and hardware levels, whereas System A is 1.0 
– a significant difference [37, 62].  The higher Compute Architecture of System B allows more 
23 
 
options, e.g. Nvidia set of libraries, and System B was the first GPU device to fully embrace 
general programming on the GPU (GPGPU) as it contained no video outlet, had double-precision 
abilities, and provided Error-Correcting Code (ECC) [25, 37, 62].  The GPU device architecture 
for System B is superior to that of System A. 
Table 3  
System A – GPU hardware architecture  
Processor Clock 1.35 GHz 
Processing Cores 128 
Memory I/O 384-bit GDDR3 
Register Count 8,192 
Shared Memory Banks 16 
CUDA Compute Architecture 1.0 
 
Table 4  
System B – GPU hardware architecture 
Processor Clock 1.15 GHz 
Processing Cores 448 
Memory I/O 384-bit GDDR5 
Register Count 32,768 
Shared Memory Banks 32 
CUDA Compute Architecture 2.0 
 
 2.4.3 Hardware Design Summary The hardware designs for both systems presented 
reflects the general concept of the DSB and ISB paradigm for the CPU and GPU respectively 
resulting in different latency mitigation policies.  Both System A and System B utilize cache 
memory hierarchies to mitigate latency for the CPU designs and both have processing core 
counts in the hundreds to optimize throughput for the GPU designs, as per DSB paradigm.  The 
memory I/O bus width is higher for the GPU hardware than for the CPU hardware, while they 
are equal in size to one another – System B has a faster memory device in both cases.  System B 
24 
 
is a more advanced hardware design for both the CPU and GPU devices in context of processing 
and memory.  The next chapter will discuss the computational potential offered by the GPU via 
performance of the highest cost operation of the candidate application, as well as many HPC 





Computational Potential of GPU – Sparse Matrix-Vector Multiplication 
 This chapter focuses on the computational potential offered by the GPU for 
computationally intensive applications such as the candidate application presented.  The 
candidate application has highest computational cost at the point of the solution to system of 
linear equations that are presented in matrix form as bxA  . This sparse matrix system is solved 
iteratively using PCG in deference to the computationally prohibitive costs of using direct solver 
methodology – a common practice in HPC modeling applications [20, 30, 48, 49, 63]. 
 The PCG solver is composed of matrix operations that should map well to the GPU 
device given it was created to execute massive numbers of matrix operations in tandem [1, 2, 7, 
24].  The Spare Matrix Vector Multiplication (SpMV) operation embodies the highest 
computational cost of the PCG solver, up to 90% of the total cost [25, 29, 48, 49, 63] - the 
minimization of this operational cost will provide a significant performance boost to the 
presented candidate application as well as many other HPC applications.  However, mapping the 
SpMV operation to the GPU involves software factors, including software API CUDA, that are 
intimately related to underlying hardware architectures provided during the computational 
modeling application execution. 
3.1 Mapping Sparse Matrix-Vector Multiplication to GPU 
 The SpMV operation is the highest cost component of the presented candidate 
application, as well as  many HPC applications that employ systems of sparse matrices and the 
potential boost provided by utilizing the GPU as a co-processor has generated a lot of interest 
[21, 48, 49, 53, 54, 64]. The SpMV derives its high operational cost not from floating-point 
26 
 
operations, as its instruction mix is poor [48]; rather the inefficient memory accesses are 
culpable. Properly mapping such a high cost and computationally weak operation to a 
computationally powerful device like the GPU for optimal performance involves an 
understanding of the software as well as the underlying hardware.  The CUDA API is the 
software API employed by the presented candidate application and is representative of this 
relationship. This is discussed in the context of the actual mapping of the SpMV algorithm to the 
CPU/GPU computing system next.  
 3.1.1 The CUDA Software/Architecture API. The Nvidia CUDA Software API for 
general GPU programming is both a software and hardware system that uses higher level C 
language extensions to call lower-level OpenGL/DirectX libraries – accessing the GPU device 
[25, 37, 62].  CUDA maintains the C language concept of threads [62], but unlike the C 
language CUDA exposes the memory hierarchy  to the developer [25, 37] which is a necessity as 
the GPU has no real virtual memory system and remains as flat as possible for the optimization 
of throughput. Despite the added complexity of an exposed memory hierarchy, CUDA freed 
developers from the necessity of translating general code to and from data-structures that the 
GPU understood i.e. column-major matrix operations, complete with model-view and matrix-
stacks so necessary for lower-level graphics programming [31, 33-35].  
 CUDA GPU threads, memory, basic API syntax, and associated device hardware 
architecture are discussed in the following sub-sections. 
 3.1.1.1 CUDA API thread hierarchy The GPU thread as defined by the CUDA 
architecture is different from the Light Weight Process (LWP) familiar to Operating System 
design [9, 37, 62]. The GPU has zero-cost context switching of threads because this is executed 
at the hardware level, commonly known as hardware multi-threading, whereas the Operating 
27 
 
System paradigm of threads is one controlled by software either in user or kernel space naturally 
incurring overhead. Figure 14 from Nvidia [62] illustrates the hierarchical structure of threads. 
 
Figure 14. Diagram of CUDA GPU thread grouping.  
 The CUDA paradigm provides a direct mapping from logical constructs, such as the 
thread, to hardware architecture designs as shown in Figure 15.  The large number of transistors 
dedicated to floating-point operations is grouped together as sets of Streaming Processors 
(SPs), defining the thread construct [62].  These SPs are grouped in sets of 8 to form sets of 
Streaming Multiprocessors (SMPs) – the domain of the Block [62].  The largest logical 
construct, the Grid, is embodied by the groupings of SMPs [62] as shown in Figure 16. 
 




Figure 16. CUDA hardware with logical Grid overlay. 
 CUDA executes operations via 24-stage graphics pipelines each fully completing in 4 
memory clock cycles using a single SMP defining the logical unit of execution as batches of 32 
threads called a warp - 3284   [37].  The CUDA paradigm increases parallel granularity, 
naturally extending from the single thread to the Block which is composed of sets of threads, to 
the Grid which is composed of sets of Blocks [37]. 
 All threads in a Block are assigned to a single SMP, abstracted from the programmer, 
although multiple Blocks can be assigned to a single SMP [65]. The abstraction of the 
Block/thread/SMP construct is the dominate strategy to produce scalability of CUDA to different 
generations of GPU devices – the programmer need not know the exact number of SMPs to 
develop GPU-bound code as the hardware will schedule as needed [25, 65]. 
 3.1.1.2 CUDA API memory hierarchy CUDA defines varying layers of memory that 
reflect both the underlying hardware architecture and logical threads [62]. Figure 17 from 
Nvidia, shows the overview of logical memory constructs to the underlying physical hardware – 
29 
 
clearly the GPU is not equivalent to the CPU in memory complexity but does provide some level 
of layering [1, 2, 62]. 
 
Figure 17. CUDA GPU mapping of the physical and logical memory structure. 
 The lowest level of memory in the defined hierarchy is the register file, composed of the 
set of registers for the SMP – each thread has mutually exclusive access to an on-chip register 
and local memory in read/write mode [2].  As with CPU hardware designs, the register is the 
fastest of the two on-chip structures with local memory costing approximately 20 to 50 clock 
cycles [1, 2, 62, 66].  The next highest level of memory for the CUDA design is the shared 
memory structure.  Shared memory is bound to a given Block and each thread has read/write 
access implying the need for synchronization to avoid race conditions [1, 2, 9, 62, 66] – the next 
set of memory levels are visible to all Blocks defined in the application.  
 Constant and texture memories are both read-only with regards to the threads in any 
given Block but texture can yield some level of locality as the CPU has write access to this 
30 
 
structure [2, 24].  Global memory, sometimes called device memory, has the highest capacity 
and clock cycle cost running as high as 600 to 800 cycles per call [1, 2, 62, 66] – global memory 
is the only area where the CPU and GPU can communicate using the Peripheral Component 
Interconnect Express (PCIe) bus.  The PCIe bus is a well-known point of bottleneck in many 
CPU/GPU computing HPC applications employing sparse matrix systems [48, 53, 64, 67, 68]. 
 3.1.1.3 CUDA API basic syntax The CUDA API is an extension of the C language 
invoking its own GCC –like compiler, e.g. NVCC, to compile high-level GPU Kernels to PTX 
machine independent code which is executed at runtime [38, 62] – CUDA recognizes GPU-
bound code via keywords.  These keywords are prepended to the C-like function signatures  [25, 
62] defined as the GPU-bound Kernel and prior to calling must have memory set aside for the 
execution of each on the GPU device – as a  ThreadsBlocks,  structure.  Figure 18 
illustrates an example usage of the GPU Kernel declaration and memory space allocation for a 
generic function. 
 
Figure 18. Example defining CUDA Kernel function signature and execution space. 
 3.1.2 Algorithmic Strategies for SpMV Mapping. Mapping the SpMV operation to the 
GPU via the CUDA Software API presents the immediate challenge of how to distribute the 
matrix to the set of warps to be executed.  A straightforward approach would apply a single 
thread per row, chunking the domain into sets of 32 – this is not the best approach as 
documented in [69].  As with  [69],  the SpMV operation is mapped to the GPU device using the 
31 
 
one warp per row concept to obtain better utilization of the device resources – this is discussed in 
detail after the initial performance results. 
 Details of the code used in this chapter to gather the performance results are presented in 
Appendix A and includes both the CSR and BCRS2x2 data compression formats. 
3.2 GPU Initial SpMV Performance Results 
 The results of SpMV on the CPU/GPU for both machine system architectures, System A 
and System B, using the CSR data format are presented in this section with the goal of exposing 
the performance effects of software, hardware, and algorithmic factors using a consistent model 
in differing computing environments.  These results are gathered using randomly filled sparse 
matrices with 50% sparseness, increasing in total matrix sizes from 1K to 4K.   
 Critical to understanding the observed results is the establishment of metrics to define 
performance benchmarking. Computational performance benchmarking for the GPU and CPU 
developments and resin flow infusion modeling for the remainder of this dissertation was 
accomplished as follows. 
 Normalized FLOPS:  The raw count of floating-point operations is modified by clock 
frequency of the device being measured to address the variance in processor speeds for 
the GPU and CPU architectures. Equation (3.1) illustrates the normalization process 
described, denoted as normFLOPS , with cntFLOP the raw count of floating-point 





       (3.1) 
 KFLOPS:  The approximate thousands of floating-point operations per second. As the 
GPU and CPU architecture vary in their processor speed, the KFLOPS is normalized by 
the clock cycle of the device being measured. 
32 
 
 Speedup factor:  The ratio of CPU execution time to GPU execution time whereby the 
larger the value, the more optimal the performance obtained through the GPU. 
System A is the first architecture examined followed by System B. 
 3.2.1 System A. The CPU/GPU computing system execution of the SpMV operation was 
compared against the CPU-only version.  The CPU-only environment is much slower than that 
of the CPU/GPU environment in every case as shown in Table 5.  The CPU/GPU computing 
system increases in performance at an almost exponential rate, accelerating at the 2K matrix – 
this is consistent with previous findings of GPU performance on larger input models [1, 21, 22, 
70]. 
Table 5  
Time comparison for SpMV on System A (CSR compression) 
Matrix Rows CPU Time (ms) GPU Time (ms)  Speedup Factor 
1024 9.479572 0.167552 56.5769 
2048 35.138212 0.069408 506.2559 
4096 148.453443 0.06912 2147.764 
 
 3.2.2 System B. The CPU/GPU computing system execution of the SpMV operation was 
compared against the CPU-only version. The CPU-only environment is much slower than that of 
the CPU/GPU environment in every case as shown in Table 6.  The CPU/GPU computing 
system increases in performance at an almost exponential rate, accelerating at the 2K matrix – 
this is consistent with previous findings of GPU performance on larger input models [1, 21, 22, 
70]. 
Table 6  
Time comparison for SpMV on System B (CSR compression) 
Matrix Rows CPU Time(ms) GPU Time(ms) Speedup Factor 





2048 8.417524 0.055424 151.8751 
4096 30.3342 0.047872 633.6522 
 
3.3 Software Data-Structures/Layouts Factors 
 The previous section of this chapter establishes an obvious benefit in performance when 
using the CPU/GPU computing system over CPU-only in both System A and System B 
computing environments. However it is necessary to understand how different software data-
structures/layouts can affect performance in CPU/GPU systems in order to optimize for 
computationally intensive applications.  The first software factor to be analyzed is one that is 
commonly touted in the GPGPU community – thread occupancy [37, 49, 71, 72].  
 The thread occupancy of a CUDA enabled GPU device is defined as a ratio of active 
warps to the maximum number of warps supported by the Compute Architecture [37] – System 
A which is Compute Architecture 1.0, and System B which is Compute Architecture 2.0 support 
24 and 48 warps per SMP respectively [58-61].  The importance of thread occupancy can mean 
the difference of as much as 20-times performance boost [37, 73] .  However, arbitrarily 
assigning the largest number of warps per block possible is the wrong approach.   
 There exists a finite set of registers that are allocated for each of the thread blocks, and if 
each block requires many registers as defined by threads, the aggregate number of active blocks 
possible is reduced and correspondingly the occupancy is reduced and performance suffers [25, 
37]. For example, System A defines 192,8 32-bit registers for each SMP and can execute at most 
768 threads meaning that at most   10...6666.10
768
8192
 registers can be used per thread to 
achieve 100% occupancy.  The negative effects on performance can be further compounded by 
34 
 
register spilling to device memory, increasing memory cycle counts hundreds of times [37, 62].  
Both CPU/GPU computing systems architectures were determined to obtain maximum thread 
occupancy at 256 threads per block, a multiple of the warp size – providing the optimal access to 
local registers and avoiding costly code spills, allowing the hardware to properly coalesce 
memory addresses [62, 69]. 
 Another software factor that can affect performance of a CPU/GPU computing system is 
the data compression format used – understanding this is vital to optimizing memory-bound 
applications such as the presented candidate application [2, 52, 74, 75].  As noted in the 
introduction, the SpMV lends itself to several performance challenges – key to the data 
compression format is locality.  The CSR data compression format has poor locality due to 
frequent address indirections and BCSR2x2 can mitigate this by lowering the number of memory 
loads per floating-point operation – simply by maintaining a 2x2 sub-block set rather than a 
single element [2, 52, 74, 75].  However, the benefits of using BCSR2x2 rely heavily on the 
existence of dense 2x2 sub-blocks in the original sparse matrix. 
 The software factors of thread occupancy and data compression format were combined 
and executed on a randomly generated 4K sparse matrix defined with a 50% sparseness for both 
System A and System B. Table 7 and Table 8 shows the performance of these software factors 
for System A and System B respectively - Figure 19 and Figure 20 illustrate graphically the 
same results. Both System A and System B display increased performance as the number of 
threads per block grows evidence of better utilization of GPU computational resources. 
However, dramatic performance increases generated by growing thread occupancy occur at 32 
for System A and 64 for System B shown by Figure 19 and Figure 20 respectively – due to 
clock cycle execution which is explained in greater depth in section 3.5 of this chapter. The 
35 
 
effect of changing data compression format from CSR to BCSR2x2 is greater for System A (see 
Figure 19) than for System B (see Figure 20) as the later defines an on-chip cache relegating 
locality mitigation to a lower impact factor for performance. These results clearly illustrate that 
software factors can have an impact on the CPU/GPU computing system’s performance – the 
associated hardware factors are discussed next. 
Table 7 
 SpMV time comparison per thread occupancy and data format (System A) 
Threads Per Block GPU Time (ms) - CSR GPU Time (ms) - BCSR2x2 
16 43.0042 15.7161 
32 0.1456 0.122944 
64 0.068736 0.125344 
128 0.080096 0.1272 
256 0.06912 0.179392 
 
 




Table 8  
SpMV time comparison per thread occupancy and data format (System B) 
Threads Per Block GPU Time (ms) - CSR GPU Time (ms) - BCSR2x2 
16 18.6252 15.6434 
32 32.1774 29.8709 
64 0.057152 0.06192 
128 0.06928 0.063936 
256 0.047872 0.062208 
 
 
Figure 20. System B performance as threads per block increase (CSR and BCSR2x2). 
3.4 Architectural Hardware Factors 
 The CPU/GPU computing systems execute within differing environmental context for 
System A and System B and are important components in the resulting performance of SpMV.  
System B has a more advanced CPU and GPU architecture than System A, a 6-core CPU and 
Fermi GPU design versus a 2-Core CPU and Quadro GPU design for System B and System A 
respectively [56-61, 76].  In and of itself, this difference is irrelevant however when comparing 
System A to System B as these architectural hardware variations must be factored into the result. 
37 
 
 The architectural hardware design of System B defines a GPU device that provides extra 
hardware for context switching compared to the corresponding GPU device of System A.  The 
increased switching hardware of the GPU device of System B is the manifestation of the double-
pumped graphics pipeline described by the Fermi architecture [61]. Important to the sheer 
computational abilities are the number of processing cores of System B with 448 as compared to 
System A at 128 [61, 76] – the GPU device on System B has greater than 3-times the power of 
System A by this metric. 
 The architectural hardware design of System B defines a more advance memory structure 
than the corresponding structure of System A and this is reflected consistently at every memory 
device [58, 60, 61].  System B and System A both have 384-bit wide memory I/O but System B 
has the faster GDDR5 memory versus System A with GDDR3 memory.  The GDDR5 memory 
of the GPU device of System B operates at twice that of GDDR3 – therefore throughput of data 
will be maximal for System B [58, 59].  The hardware design of the GPU device for System B 
has 4-times the number of registers than the corresponding GPU device on System A at 32,768 
to 8,192 for System B and System A respectively [58, 59] – these extra registers will provide 
more capacity for threads of a given warp as a register is thread-bound in nature [25, 66, 69].  
Finally, the shared memory of the GPU device of System B is 3-times greater than that of 
System A at 49,152 to 16,384 bytes for System B and System A respectively – providing larger 
cache-like structure for System B [61, 76].   
 These hardware, software, and algorithmic factors when analyzed individually are 
important but it is within the context of the aggregate that the real importance is revealed.  This 
interdependence of factors is discussed next. 
38 
 
3.5 Interdependence of Software and Hardware Factors 
 The previous sections of this chapter have established the importance of software, 
hardware, and algorithmic factors on resulting performance; however these factors are defined as 
interdependent.  These factors work both independently and with one another to produce the 
observed performance results for SpMV in this chapter. 
 Executing one warp per row rather than one thread per row, the dominant algorithmic 
factor with regards to mapping the SpMV operation to the CPU/GPU computing system provides 
a fuller utilization of the GPU device [37, 48].  The fuller utilization of the GPU is directly 
impacted by the hardware as the entire warp is now performing useful work and memory 
addresses are likely coalesced [37, 48].  The algorithmic factor is also impacted by the software 
factor of increasing thread occupancy during SpMV as memory is set aside in units likely to 
increase the use of more threads per warp.   
 The software factor described by thread occupancy and data compression formats effect 
performance by increasing memory address coalescing and increasing locality – essentially 
altering the number of memory loads to the corresponding floating-point operations for SpMV.  
However, software factors are tied to the hardware with thread occupancy in the same way that 
the one warp per row is affected, and the impact of the data compression format change was less 
pronounced for System B than for System A. 
 The hardware factors of both computing systems effects the performance of SpMV in 
two major ways – overall execution speed and impact of data locality.  The GPU device of 
System B uses a double-pumped logical graphics pipeline that is expressed in hardware as extra 
context switch chip; so twice the data input per clock cycle is massaged by the larger number of 
processing cores of System B over that of System A. This performance difference is seen from 
39 
 
the data plots defined as Figure 19 and Figure 20 where the later executes at approximately half 
the time as the former.  
 As stated previously in section 3.3 of this chapter, the mitigation of locality issues 
derived via the employment of BCSR2x2 data compression has a lower effect on performance 
for System B than for System A due to the presence of a hardware-level cache on System B that 
does not exist on System A [59-61]. An interesting artifact of the hardware, software, and 
algorithmic interplay can be seen when increasing the number of threads per block for System A 
and System B as shown in Figure 19 and Figure 20 respectively.  The general performance in 
both cases is similar but shifted to the right for System B, e.g. address coalescing appears 
markedly improved at 32 threads per row for System A and 64 for System B – this differential is 
likely an artifact of memory device I/O. 
 CUDA defines the unit of execution to be a warp which is a collection of 32 threads 
working simultaneously - coalescing memory addresses within this grouping [37, 62].  The 
memory device I/O employed by System B can execute twice for a single clock cycle [59-61]  
e.g. 32-bits per every 2-cycles means 16-bits for a single-clock cycle - each floating-point 
operation requires at least 32-bits as per the data-type; so the 32-bits metric can be extended as 
32-threads. System A employees a memory device I/O with single clock cycle execution, 
effectively creating a 16-to-32 comparison, thus System B will coalesce at double that of System 
A, i.e. 64 threads versus 32 threads.  Related to the concept of memory address coalescing is 
shared memory banks. 
 Shared memory is a software managed cache-like structure, heavily banked to align with 
the Single Instruction Multiple Data (SIMD) lane width of the processing core – as with 
address coalescing, proximity of these banks to thread accesses is important [25, 37].  These 
40 
 
banks, sometimes called segments, execute optimally with address interleaving such that given 
float pointer fp in bank B  and % being defined as the modulus operator, 1fp  points to the 
address at  and   16%1B  and   32%1B for System A and System B respectively - each bank 
holding a 4-byte access per cycle [37, 77].  Critical to performance using shared memory is the 
avoidance of bank conflicts which can present any time data access is not sequential.  A bank 
conflict occurs when more than one memory access is made to the same bank in the same clock 
cycle – successive 32-bit words are shared among 16 banks for System A and 32 banks for 
System B [54, 61, 62, 69, 74, 77].  CUDA handles a bank conflict by serializing each of the 
contending threads, for example: given N memory accesses and N unique shared memory banks 
bandwidth is increased by a factor of N with no conflicts but is decreased by 
K
1
 for each 
K thread that requires serialization [37, 62, 77].  Figure 21 illustrates a bank conflict on a generic 
CUDA GPU device. 
 
Figure 21. Block diagram of memory bank conflict for generic CUDA device 
41 
 
 Regardless of the individual factors discussed, both CPU/GPU computing systems 
analyzed expose performance boosts for SpMV – as shown in Figure 22. 
 
Figure 22. System A and System B Speedup factors for execution of SpMV operation 
 This chapter has illustrated that execution of SpMV in the CPU/GPU computing system, 
the highest cost of the PCG iterative solver, displays impressive improvement over the CPU-only 
version.  The factors of software, hardware, and algorithm have demonstrated inter-dependency 
in regards to the resulting performance of SpMV – how these factors affect the full candidate 




Full Candidate Application – Single CPU/GPU Computing System 
 This chapter focuses on the full solution to the candidate composite process flow 
modeling application within the context of a single CPU/GPU computing system for both 
System A and System B.  The full solution of the computationally intensive candidate 
application is mapped to the CPU/GPU computing system and validated against an analytical 
solution for all computing systems involved.  The validated application is then executed using 
System A and System B and the resulting performance is analyzed to determine how the 
hardware and software factors work together to impact the resulting application computational 
performance.  During the mapping, key computationally intensive kernels are presented and 
associated GPU developments explored. 
 This chapter will ascertain how the hardware architectures of System A and System B 
work together with the software factors to denote the application performance – key in this 
discussion is the calculation of a computational complexity analysis.  The computational 
complexity analysis is actualized as a performance modeling equation that can be used to project 
how different problem, software, and hardware parameters will affect performance.  
Understanding these variations in factors/parameters is essential as new computing architectures 
arrive to get optimal performance for HPC applications in many legacy and new computational 
modeling codes / code developments. 
4.1 Mapping Full Candidate Application to GPU   
 The mapping of the full candidate application to the single CPU/GPU environment is a 
natural extension from the previous chapter, detailing the mapping of the SpMV operation, in the 
single CPU/GPU environment for both System A and System B.  Both of these mappings are 
43 
 
done within the single shared address space CPU/GPU computing architectures and the SpMV is 
the largest component of the PCG iterative solver [48-50] and hence representative of the full 
solution itself.  The presented candidate application employees the Concurrent Number 
Cruncher (CNC) GPU solvers package by Luc Buatois, et al [50] which uses a custom SpMV 
operation with CSR and BCSR2x2 data compression formats as well as Nvidia’s CUBLAS 
library [78] for SAXPY and DOT-PRODUCT calls [50] (see Figure 23).  Nvidia has recently 
released a library for sparse matrix operations known as CUSPARSE with restrictions to CUDA 
Compute Architecture of at least Version 1.1 [79].  System A falls into the CUDA Compute 
Architecture of 1.0, therefore in an effort to apply consistency across CPU/GPU computing 
systems the CUSPARSE library was not used in this study. 
 
Figure 23. The CNC operational flow from input to output. 
 The key computationally intensive Kernels presented in the CNC software as well as the 
process of mapping the presented candidate application to function in a CPU/GPU computing 
system are discussed next. 
 4.1.1 Key Computationally Intensive Kernels. The most computationally intensive 
kernels of the full candidate solution are those that relate to the iterative PCG solver, as it is 
44 
 
called  LK   times with K the number of iterations for mass-convergence and L the number 
iterations for all nodes to be determined as filled (see Algorithm 2.1), as well as the number of 
CG iterations for each solution call to the linear equation solver.  The SpMV is the largest cost of 
the PCG iterative solver and is ported to the local GPU device, as with the SAXPY and DOT-
PRODUCT kernels, and called within the nested loop described in Algorithm 2.1, but the 
consideration of GPU code developments, software, and hardware factors is critical for optimal 
performance results. 
 4.1.2 GPU Code Developments. The PCG iterative solver defined in the candidate 
application is mapped and ported to GPU via the CNC code as shown in Figure 24.  The CNC 
code is third-party software created using the C/C++ programming language that embeds the 
CUDA Kernels within its design – the library header for the CNC package is simply placed as 
part of the preprocessor “include” calls and compiled with the proper CUDA library links using 
the NVCC compiler and an executable is created with separate GPU-bound code to leverage the 
local GPU device [37, 38, 50].  The CNC package solves the system of linear equations of the 
matrix form bxA   receiving input matrix data from the calling C/C++ class file(s), executing 
the GPU device code, and retrieving the resulting solution vector for the CPU-bound code 
portion of the application [50].  The CNC package defines the PCG iterative solver, as with all 
PCG solvers [46], using multiple calls to the SpMV, SAXPY, and DOT-PRODUCT operations – 
the SAXPY and DOT-PRODUCT portions of the solver are examined next. 
 4.1.2.1 CUDA Kernels of PCG solver The SAXPY and DOT-PRODUCT operations are 
defined in the Nvidia CUBLAS library [78] which is the CUDA version of the Basic Linear 
Algebra Subprograms (BLAS).  BLAS is categorized by levels 1, 2, and 3 with level-1 consisting 
of vector operations, level-2 consisting of matrix-vector operations, and level-3 consisting of 
45 
 
matrix-matrix operations [78, 80, 81].  CUBLAS follows the BLAS paradigm, defining the 
SAXPY and DOT-PRODUCT as level-1 category operations [78]. 
 
Figure 24. Presented candidate application diagramming the placement of CNC Package. 
 There are several steps that must be followed for any CPU/GPU computing system to use 
a CUDA library package which can be generalized as initialization, allocation, execution and 
finally retrieval of results [37, 62]. Initialization is the first step and is potentially heavy due in 
no small part to the sheer size of the library itself. The CUBLAS library is quite large, 
containing the Shader Assembly (SASS) and PTX code for every Kernel defined in the library 
with PTX as much as 75% of the library size [37, 38, 62].  The SASS is the binary version of the 
library and PTX the intermediate to allow for differing GPU device generations and is loaded 
using a Just In Time (JIT) compiler construct – the driver has to locate and read the SASS 
binary for the particular GPU device and load it to the machine’s board.  Once the library is 
loaded, the CPU issues a command for the GPU device to allocate memory for the number of 
elements and the data-type that is to be used by the Kernel. The data is then passed to the GPU 
from the CPU via pointers as direct contact between CPU and GPU is not possible [24, 37, 62]. 
The kernel is executed from input data and passed back to the CPU host for use by the system – 
46 
 
the GPU memory is not explicitly removed until another call to the allocation is made, the GPU 
is a state-machine by design [8, 31, 34, 35]. 
 The CUBLAS scalar DOT-PRODUCT function executes a single memory access for 
every arithmetic operation yielding an upper bound on performance that becomes the ratio of 
CPU to GPU memory sizes, typically ranging around 5 to 10 [37, 78, 82].  The cost of access to 
the data residing in CPU space erodes any performance boost. This is why the full 
implementation of the candidate application keeps data located in GPU space [50] – instead of 
multiple global accesses during the iterative phases of the solver, single calls are made at the 
beginning and end of the solver [50].  The computational cost of the scalar DOT-PRODUCT is 
negligible relative to the SpMV but its role in the PCG algorithm is invaluable as these 
computations are used to define the solution convergence of the linear equation system [46, 65]. 
 The scalar DOT-PRODUCT is primary in the computation of residuals [46], applied in 
Algorithm 2.2 at lines 11 and 12.  The closer the residuals are to zero, the more likely a solution 
has been located – however this assumption could become tenuous if the device employed is not 
fully compliant with IEEE-754 floating-point representation standards [83].  The potential for 
non-compliant floating-point operations using GPU devices is a valid concern [84] as GPU 
manufactures have never openly held to this compliance and this was never a concern [83, 85, 
86]. The typical high-res games redraw frames at up to 60-times a second [2, 24] so any visual 
artifacts produced by a slightly-off floating-point value would not be noticed even by the most 
perceptive human player. Once the GPU moved into the exacting analytics required of the 
engineering/scientific community, the formerly lax adherence to numerical accuracy standards 
needed to be tightened.  The algebraic representation of the scalar DOT-PRODUCT operation is 
shown in equation (4.1) with A and B  vectors each of length n . 
47 
 








2211     (4.1)  
 Nvidia has long maintained that its GPU devices held fast to the IEEE standards but 
admitted that the more accurate double-precision representation was not supported [86], at least 
until the advent of the Tesla Fermi architectural designs [60, 61].  These architectures are not 
only double-precision compliant they contain Error Correcting Code (ECC) adaptations to avoid 
propagation of small numerical inaccuracies [59].  
 Perhaps counter-intuitively, the real instantiation of floating-point inaccuracies involving 
the scalar DOT-PRODUCT is the addition portions of the operation rather than the 
multiplication [13, 45, 87].  The reason is the potential for alignment error when normalizing the 
result to be 127-biased [13, 87]. Figure 25 shows an example conversion from a base-10 
floating-point number to the corresponding base-2 representation with applied 127-bias. 
 
Figure 25. IEEE conversion from base-10 to base-2 with normalization. 
 Unfortunately, even full compliance of IEEE standards for floating-point representations 
is no guarantee of full accuracy in every case [45, 87]. 
 Paraphrasing work by Anthony M. Castaldo [87], given two numerals with exponents 
21,ee  such that 21 ee  the number of bits to exceed the scale of the returned value is 21 ee   this 
48 
 
is the number of zeros that the smaller numeral will be padded to on the left to properly align.  
Therefore, if the value of smaller operant is just one greater than the bits of the mantissa it adds 
nothing to the resulting addition.  Supposing the 32-bit format, if the aforementioned exponents 
differ by 24 or more, the answer will be the larger of the two – the other operant is completely 
ignored in the result.  The operation still ascribes to the IEEE-754 standard but is completely 
erroneous with regards to small numerical variations in the long term; particularly sensitive to 
these perturbations are large clustered machines where the error can magnify as it progresses 
through the system, known as a soft error [84] – hence contributing to the importance of ECC 
adaptations. 
 The CUBLAS SAXPY operation executes a scalar multiplication and addition with 
vectors as shown in equation (4.2) with y and x vectors and  the scalar.  The CUBLAS version 
of this level-1 BLAS operation is generalized to allow for incremental steps in both x  and y  
directions, i.e. an array stride [78, 88].  These extra levels of indexing could cause a slight drop 
in performance but is a trade-off to the generality provided by the library.  The SpMV operation 
is the same used in the previous chapter – complete with the same determination of optimal 
thread occupancy. 
     yxy       (4.2) 
 The full candidate application is validated against an analytically derived solution for a 
simple 2D unstructured mesh model consisting of radial center injection in a thin circular plate 
mold geometry using the single CPU/GPU computing system next. 
4.2 Validation of Full Candidate Application on Single CPU/GPU   
 The correctness of the full candidate application for the single CPU/GPU computing 
systems is ensured via the examination of flow-front progression and injection port pressures of 
49 
 
numerical solutions for CPU and GPU against the correspondingly analytical results.  The model 
used for validation is simple 2D circular plate mold geometry with a center, radial injection and 
is compared with the resulting analytical equation. 
 The simple circular plate model has a radius of 10 cm and an inner radius of 0.15 cm for 
the radial injection port as shown in Figure 26.  The inner radius, 0R , is subjected to a constant 
radial flow rate Q .  The thickness of the cavity is H , the injection inlet pressure is P, and is a 
function of transient resin infusion time, resin viscosity is  , the permeability of the fiber 
preform is K , and the porosity of fiber compaction is .  The flow front radial location at any 
time t  is given by [89]: 















      (4.3) 
The corresponding expression for injection pressure, which varies with time, is given by [89]: 




























     (4.4) 
 
Figure 26.  2D circular plate validation model (not to scale). 






Q  , permeability
2080.44 cmeK  , a viscosity PaS02.0 , a porosity of 805.0 , 
a time step 5.0t , and an element thickness cmH 742.0 .  The circular plate model involved 
a computational mesh of 1,344 nodes and 2,560 3-noded triangular elements.  Figure 27 and 
Figure 29 display the flow-front progression showing the computed and analytical variation of 
the radial flow front location with respect to time for System A and System B respectively and 
clearly define accuracy with the analytical solution.  Figure 28 and Figure 30 display the 
corresponding transient inlet injection pressure for System A and System B respectively and 
clearly define accuracy of the computational solution. 
 The flow-front and inlet injection pressure values are accurate for this circular plate radial 
injection geometry. Other complex flow modeling geometries also showed equivalent 
comparison of the flow front progression and the predicted fill time between CPU/GPU based 
computational solutions for the same geometry and problem parameters employed. 
Computational performance modeling performance of the full candidate solution for the single 
CPU/GPU computing systems is presented next. The initial performance evaluations were 




Figure 27. Validation of single CPU/GPU for flow-front progression (System A). 
 




Figure 29. Validation of single CPU/GPU for flow-front progression (System B). 
 
Figure 30. Validation of single CPU/GPU for inlet injection pressure (System B). 
4.3 Initial Full Candidate Application Performance on Single CPU/GPU   
 The results of full candidate application on the single CPU/GPU for both computing 
system machines, System A and System B, with unstructured meshes MA, MB and 10FT as 
53 
 
input model data and the CSR data compression format are presented in this section with the goal 
of exposing the performance effects of software, hardware, and algorithmic factors using a 
consistent model in differing computing environments.  Clearly, the most salient metric of 
performance measures for scientific and engineering applications where analysis time is 
concerned is the direct measure of computational solution time.  However, due to the inherent 
processor differences in terms of clock speeds, benchmarking different hardware is difficult to 
quantify.  Total computational time employed also depends on the cost of arithmetic operations 
in different architectures that can vary, and detailed information in commodity processors used in 
this work are proprietary [66, 72, 88].  A comparison of the Floating-Point Operations rate 
(FLOPs) may also be misleading in diversified architectures with different clock frequencies – 
therefore FLOPs are normalized by the clock rate as defined in section 3.2 of chapter 3.  
Normalizing FLOPs helps to avoid idiosyncrasies of individual hardware, to provide arithmetic 
power comparisons.  Execution time depends on processor speeds and normalized FLOPs allow 
for a more definitive comparison over wall-clock time or FLOPs. 
 The single CPU/GPU computing system defined as System A is examined first followed 
by System B. 
 4.3.1 System A. The single CPU/GPU computing system execution of the full candidate 
application was compared against the CPU-only version.  Table 9 shows that the GPU 
outperforms the CPU-only system for process flow modeling analysis employing input mesh 
MA, MB, and 10FT when examined with regards to total solution wall-clock time.  Both flow 
modeling analysis obtained same flow progression contours and predicted infusion time for the 
same physical problem parameters employed in all cases. The superior arithmetic power of the 
GPU for this single CPU/GPU computing system is clearly visible in Table 10, presenting an 
54 
 
advantage of more than 9-times the number of KFLOPs, which have been normalized as per 
section 3.2 of chapter 3, in the GPU compared to the CPU. 
Table 9  
Full solution performance with single CPU/GPU and CSR compression (System A) 
Unstructured Mesh CPU Time (secs.) GPU Time (secs.) Speedup Factor 
MA 6,176.46 418.44 14.76 
MB 81,929.7 4,219.61 19.42 
10FT 6770.31 285.17 23.74 
 
Table 10 
 Full solution KFLOPs with single CPU/GPU and CSR compression (System A) 
Unstructured Mesh KFLOPs (CPU) KFLOPs (GPU) 
MA 142.26 1,169.56 
MB 133.39 1,292.44 
10FT 135.92 1,186.73 
 
 4.3.2 System B. The single CPU/GPU computing system execution of the full candidate 
application was compared against the CPU-only version.  Table 11 shows that the GPU 
outperforms the CPU-only system for input mesh MA, MB, and 10FT when examined with 
regards to total solution wall-clock time. The superior arithmetic power of the GPU for this 
single CPU/GPU computing system is clearly visible in Table 12, presenting an advantage of 
almost 3-times the number of KFLOPs, normalized as per section 3.2 of chapter 3, for the GPU 
compared to the CPU. 
Table 11 
 Full solution performance with single CPU/GPU and CSR compression (System B) 
Unstructured Mesh CPU Time (secs.) GPU Time (secs.) Speedup Factor 
MA 424.93 168.57 2.52 





10FT 627.83 105.09 5.97 
 
Table 12 
 Full solution KFLOPs with single CPU/GPU and CSR compression (System B) 
Unstructured Mesh KFLOPs (CPU) KFLOPs (GPU) 
MA 2,299.38 3,213.32 
MB 1,891.28 5,046.01 
10FT 2,142.75 4,743.81 
 
 4.3.3 Initial performance analysis. The execution of the unstructured mesh input files, 
MA, MB, and 10FT for both computing systems exposes some common performance behaviors 
– most notably the correlation to problem size and performance.  System A and System B 
produce better performance using total solution time and normalized FLOPs as metrics when the 
problem size increased.  This performance boost for increasing problem size is reflected in the 
previous chapter’s analysis of the SpMV for both CPU/GPU computing systems as well as 
though out the published literature regarding GPU performance [2, 25, 67, 90].  However, there 
are some differences with the magnitude of the performance boost found. System B has a lower 
speedup factor and KFLOPs change than does System A.  Likely this is due not to any defect of 
System B but rather the more advanced CPU used – the dual-core CPU of System A is so 
lacking in relation to the GPU that the speedup factor must be greater. 
4.4 Software Data-Structures/Layouts Factors  
 This previous section was an initial performance analysis for three unstructured mesh 
model inputs via the single CPU/GPU computing systems defined as System A and System B 
and produced some good performance boosts, especially for System A.  However, the initial 
56 
 
software variables need to be examined to identify potential factors that can hinder performance 
of the presented candidate application.  The first software factor to be examined is data 
compression format. 
 The initial performance was executed using CSR, a data compression format with a noted 
proclivity for poor computational intensive performance [53, 54] – this has been shown to 
improve with BCSR2x2 due to increased locality [52-54].  The presented candidate application 
generates systems of equations based on 3-noded triangular elements each with 1-degree of 
freedom resulting in dense sets of sub-blocks in the sparse matrix [44, 45] – a potential boon for 
locality using BCSR2x2. 
 The presented candidate application generates dense sub-blocks of non-zero elements via 
the methodology of the Finite Element Method (FEM) - initial collections of   33x  local 
element matrices result from 1-degree of freedom applied to the input 3-noded triangular 
elements [44, 45].  These local element matrices are coalesced into a global element matrix that 
maintains symmetry from these 33x  sub-matrices [44, 45] – the 22 sub-blocks utilized by the 
BCSR2x2 data compression will be subsumed by the dense sub-matrices of  the global element 
matrix.  The application of data compression formats to improve locality is closely associated to 
memory address coalescing and thread occupancy as all of these seek to maximize throughput by 
grouping as many threads as possible, however the CNC GPU solver requires multiple Kernels 
which forces a schism to these groupings.   
 CUDA places implicit barriers between dependent Kernel invocations [37], e.g. Kernels 
that rely on one or more currently executing Kernels must wait for system synchronization to 
occur before continuing [62, 65].  This new software factor subsists with the PCG iterative solver 
because it is composed of not just the SpMV Kernel but a number of CUBLAS calls executed in 
57 
 
sequence.  System B, as part of the CUDA Compute Architecture 2.0, can simultaneously apply 
Kernels using the CUDA stream [60, 61] but this is not a luxury offered by System A [62, 76].  
Therefore, to maintain as much consistency of variables between systems, these stream 
constructs were not employed in the present work. The role of the data compression format is 
discussed next. 
 Figure 31 and Figure 32 show the execution time for System A and System B 
respectively revealing a distinct difference in the performance benefit of changing from CSR to 
BCSR2x2 for each of these CPU/GPU computing systems.  Figure 31 shows the performance 
boost of increased locality for the presented candidate application expressed as the BCSR2x2 
data format for input meshes MA and MB with negligible effect for the less regular 10FT mesh 
configuration whereas Figure 32 describes a deleterious result for meshes MA and MB – again 
10FT is negligibly affected.  This difference in performance is surprising from a pure locality 
approach as the dense sub-blocks created by the 3-noded triangular elements are the same for 
both System A and System B – the arithmetic power leveraged by the different CPU/GPU 
computing systems via the normalized FLOPs is shown in Figure 33 and Figure 34 for System A 
and System B respectively. 
 System A shows a much higher rate of floating-point operations when going from CSR 
to BCSR2x2 than System B for the corresponding change (see Figure 33 and Figure 34) – 
excepting the 10FT model mesh configuration which illustrates negligible difference in both 
computing environments.  System B manifests a negligible increase in normalized FLOPs for the 
BCSR2x2 format and a decrease in performance – given the relative regularity of the input mesh 
configuration; this implies an initially poorer locality for System A. The lower impact of the 
locality change for System B is likely due to the implementation of an actual cache as per the 
58 
 
Nvidia Fermi architecture [60, 61].  The negligible change for the 10FT input mesh supports a 
correlation not just to the size of the problem domain but the expressed regularity of the matrix.  
The hardware factors on performance for both System A and System B are discussed next. 
 
Figure 31. Full candidate solution performance (System A). 
 




Figure 33. Full candidate solution KFLOPs performance (System A). 
 
Figure 34. Full candidate solution KFLOPs performance (System B). 
4.5 Hardware Architectural Factors  
 The observed performance of the CPU/GPU computing systems using the unstructured 
mesh configuration defined by MA, MB, and 10FT is affected not just by the software factors 
60 
 
discussed in the previous section but hardware factors as well.  The presented candidate 
application resulted in a performance boost for both System A and System B when using the 
CSR format albeit to a smaller degree for System B – excepting the negligible results observed 
for the less regular input mesh 10FT in both computing environments.  However, when the 
BCSR2x2 format was employed the performance of System B dropped while System A 
increased – fairly regular geometries, consistent models and software factors were used leaving 
the underlying hardware architectural factors as culpable.  
 System B has a cache defined at the hardware level [60, 61] whereas System A does not 
[58, 76] – this has great impact on solutions involving sparse matrices such as those applied via 
the  presented candidate application [54].  A hardware-level cache provides the ability to stash a 
process during execution and quickly retrieve it when needed rather than calling global memory 
read/writes at every instant.  The more advanced architecture of System B does not negate the 
importance of judicious application of software factors, but it does alleviate it somewhat.  The 
lowered importance of locality on System B, expressed using BCSR2x2, adds all the 
computational overhead of extra loops to iterate over defined sub-blocks but none of the 
corresponding utilization of greater arithmetic units and hence none of the predicted performance 
seen in System A. 
 The software and hardware factors encountered during the execution of computationally 
intensive applications using CPU/GPU computing systems are typically difficult to quantify and 
as such defining concrete metrics for performance on these system is difficult.  The formulation 





4.6 Computational Complexity Analysis  
 Computational complexity has historically been quantified using asymptotic analysis to 
understand how the design scales [91], but this methodology relies on axioms that do not exist 
for GPU computing such as constant operations yielding negligible costs to overall performance 
[92].  Adjusting the number of threads per block can allow CUDA to coalesce memory addresses 
resulting in as much as 20-times reduction in execution time [72] – small changes can have big 
impacts on the algorithm [66].  Therefore quantifying algorithmic behavior is relegated to the 
minutia of performance modeling as standard parallel modeling techniques fail to take into 
account the importance of limited and high access costs to the exposed memory hierarchy of the 
modern GPU [16, 72]. 
 Currently there is no standard for performance analysis for use with GPGPU computing, 
however options have been published [16, 66, 67, 72, 93] and this study follows many of the 
concepts put forth in these works.  The code that is ported to the GPU(s) for this research is 
broken up into its major components, analyzed, and relevant parts re-assembled to form a 
general performance model. The single CPU/GPU computing system is studied first and expands 
to the multiple CPU/GPU computing system in the next chapter. 
 The following sub-section will derive a mathematical model for a single call to the PCG 
iterative solver rather than the whole solution and is detailed in Appendix A.  This approach 
yields a base equation from which the different behaviors for different parameters can be 
determined. The third party CNC software was used for the results being analyzed which defines 
1616  thread blocks [50] – this is the assumed constant for all derived equations. 
 4.6.1 Single call to PCG solver. Computing the average number of non-zero elements 
for a given sparse matrix that is assumed to be square with 1616  thread blocks, NZR  with memG , 
62 
 
M are the number of bytes in global memory and the total number of rows respectively – using 
bytes as the metric defines the numerical values of 4 given that there are 4 bytes for every 
floating-point data type expressed in equation (4.5). 
















,1      (4.5) 
Computing the number of blocks, BN , with tN , wN ,  and smpN  the number of threads per warp, 
the number of warps per block, and the total number of symmetric multiprocessors for M rows 
and average non-zero entries per row as NZR  columns can be defined by equation (4.6). 









     (4.6)  
Computing the cycle cost based on the defined cost of sparse matrix-vector computational costs 
for the serial CPU at  NN rows 2  [46, 91] and using a global memory cycle count of 500 
based on the average for the actual data range of 400 – 600  is TC , with B  the number of threads 
per block expressed in equation (4.7). 











2      (4.7) 
Computing the total cost for execution of preconditioned conjugate-gradient solver on GPU, 
pcgC  with K , R  and D the total number of iterations for convergence, memory clock frequency, 
and number of cycles per pipeline can be defined by equation (4.8). 











   (4.8) 
Equation (4.8) is not exact but does reflect many of the same performance modeling equations in 
published works [66, 72]. 
63 
 
 Figure 35 and Figure 36 depict a comparison of actual to estimated performance time for 
System A and System B respectively using unstructured mesh configuration MA, MB, and 
10FT expressed with the CSR data compression format.  The results are not exact, as the thread 
dispatch policy for both GPU devices is non-public information [72, 93] and is interleaved for 
optimal throughput [2, 37, 55]. In addition, synchronization costs between dependent Kernels of 
the PCG iterative solver are not published. The issue is further compounded with an assumed not 
large but with an unknown dispatch policy.  The approximate nature of the results provided by 






 with tA  and tE  defining the actual and estimated times respectively. 
 Figure 37 and Figure 38 show the normalized error of the single PCG call for System A 
and System B respectively and indicate values of less than 50% which is a manageable amount 
of error for the given input data – the data results are shown in Table 14 for System A and Table 
15 for System B. The next section extends from the single call to the PCG iterative solver and 
mathematically models the full candidate solution. 
Table 13 
 GPU device limits for both systems 
System A: Compute Architecture 1.0 System B: Compute Architecture 2.0 
Core Clock Rate = 1.35 GHz Core Clock Rate = 1.15 GHz 
Total Warps = 24 Total Warps = 48 
Total SMP = 16 Total SMP = 14 
Shared Memory per Block = 1024 Shared Memory per Block = 1024 
Registers per Block = 512 Registers per Block = 256 





Figure 35. Performance modeling of single call to PCG solver (System A). 
 




Figure 37. Normalized error for single call PCG modeled performance (System A). 
 
Figure 38. Normalized error for single call PCG modeled performance (System B). 
Table 14 
 Performance modeling of single calls PCG (System A) 
Unstructured Mesh Actual Time (ms) Estimated Time (ms) 





MB 2.45664 3.545509018 
10FT 0.938336 1.008680919 
 
Table 15 
 Performance modeling of single calls PCG (System B) 
Unstructured Mesh Actual Time (ms) Estimated Time (ms) 
MA 0.470848 0.442022833 






 4.6.2 Full Solution Cost – Single GPU. The final modeling performance equation for a 
single CPU/GPU computing system is derived from the estimated single call to the PCG 
computed from equation (4.8).  The size and granularity of the local GPU device registers are 
significant to the number of active threads for utilization of computational resources and is 
defined by equation (4.9) with allocRg  the register allocation unit size and granularW  the warp 






A        (4.9) 
Equation (4.9) is a major component of a defined conditional function that models the 
serialization of threads within the warp construct shown as equation (4.9.1) with ATBN  the 









































pen       (4.9.1) 
Equation (4.9.1) is passed to the Gaussian distribution adapted to model the performance 












     (4.9.2) 
Computing the total cost for execution of solution [93] with multiple calls to the GPU-enhanced 
PCG, gpuT  with K
 the total number of iterations for full solution convergence – equation (4.9.3) 
defines the final performance modeling equation for a single CPU/GPU computing system. 
  growthpcggpu RCKT         (4.9.3) 
 Figure 39 and Figure 40 depict the results of the actual to estimated execution times for 
the solutions of unstructured mesh configurations MA, MB, and 10FT for single CPU/GPU 
computing systems executing in System A and System B environments respectively.  The 
formed finite element matrices for the input data meshes being solved via the presented 
candidate application are expressed using the CSR data compression formats in all cases.  
 The observed results of performance modeling for both CPU/GPU computing systems 
show a close equivalence of actual to estimated solution times.  Both System A and System B 
showed dramatic decreased difference of estimated and actual time as the size of the input data 
increased   – an inverse relationship that corroborate the premise that the underutilized and non-
coalesced memory accesses can be expensive [62, 66, 72, 93]. The more intense the floating-
point operations, the more fully utilized the CPU/GPU computing system resources are and the 
68 
 
better chance of address coalescing. The increased number of input data elements being solved 
mitigates the impact of extraneous variables such as threaded time-sharing policies and equation 
(4.9.3) becomes the dominating predictor for the single CPU/GPU computing system for the full 
solution of the candidate application.  Figure 41 and Figure 42 shows dramatic decrease in 
normalized error between the actual and complexity analysis predicted results for both System A 
and System B respectively support this precept - Table 16 and Table 17 are the actual results. 
 




Figure 40. Performance modeling single CPU/GPU full solution (System B). 
 




Figure 42. Error single CPU/GPU full solution modeled performance (System B). 
 
Table 16 
 Performance modeling single CPU/GPU full solution (System A) 
Unstructured Mesh Actual Time (secs) Estimated Time (secs) 
MA 204.69 157.30 
 
MB 1,804.05 1,760.76 
 




 Performance modeling single CPU/GPU full solution (System B) 
Unstructured Mesh Actual Time (secs) Estimated Time (secs) 
MA 146.64 75.25 
 
MB 974.90 843.17 
 





 4.6.3 Contribution of Hardware Factors. This section establishes a relationship to 
hardware factors and the resulting application performance via the derived equation (4.9.3), 
adjusting hardware variables and then projecting against the actual performance of the 
application.  The resulting differentials are analyzed and the impact of the adjusted parameter(s) 
on performance of the CPU/GPU computing system is theorized. 
 4.6.3.1 Single PCG Call. The first performance modeling that is adjusted for hardware 
factor(s) changes is the single call to the PCG solver, since this is used to build the final full-
solution model (see equations (4.8) and (4.9.3)).  The number of SMP chips was adjusted to be 
greater than and less than the current number defined in both single CPU/GPU computing 
systems for a single call to the PCG iterative solver to determine the theoretical effect the SMP 
counts have on the resulting performance. 
 The behavior expressed by both System A and System B computing systems was similar, 
excepting of course the faster nature of System B [58-61, 76].  Figure 43 shows that reducing the 
number of SMPs to 4 for System A resulted in an increase in execution time, exacerbated by the 
larger input of unstructured mesh MB. This same reduction in SMPs on System B shown by 
Figure 44 resulted in a similar increased execution time and, as with System A; the larger model 
configuration MB had a greater impact.  The observed performance change for both System A 
and System B as a result of this adjusted hardware factor is as expected given the larger model 
mesh MB presents more elements for a relatively low number of SMPs – SMPs are the physical 
equivalent of the logical block [37, 62] so lower counts manifest less computational power that 
will be further debilitated applied in larger problem space provided by input mesh MB.  The 
converse is observed by increasing the number of SMPs to 64 for both System A and System B, 
72 
 
as before this is no source of consternation – more computational power will naturally be 
leveraged by a larger environment in which to be expressed. 
 The hardware factor of SMPs at the level of a single call to the PCG iterative solver 
extends naturally to the performance of the full candidate solution.  The full candidate solution 
subsumes the single PCG call and works in tandem - result expressed as a magnification, a 
constant, or abrogation to resulting performance.  The full candidate solution hardware factors 
for the single CPU/GPU computing system are discussed in the next sub-section. 
 




Figure 44. Performance model for single PCG solver on System B (SMP adjusted). 
 4.6.3.2 Full solution – single CPU/GPU. Adjusting the hardware factor(s) for the full 
candidate solution and the resulting impact(s) on the performance is modeled in this sub-section. 
As with the previous sub-section the number of SMPs is adjusted and the resulting theoretical 
performance is examined to deduct the relative impact of this factor on behavior for both single 
CPU/GPU computing environments for System A and System B.  Equation (4.9.2) defines the 
most direct access of the number of SMPs to the full candidate solution as a rate of 
growth/decay. 
 The full solution performance model given by (4.9.3) applies the Gaussian distribution to 
the overall performance of the system with the determination of serialized threads within a warp 
defined as the allocated register unit size for the given GPU device - the ratio of SMPs directly 
impact the value of this allocated unit size as fewer SMPs coerces lower unit sizes and less 
computational ability and vice versa. SMPs are given this position in the performance 
complexity model as each will provide a factor to the initial single PCG solver call, modeled by 
74 
 
equation (4.8).  Examining the performance results in Figure 45 and Figure 46 reveal a 
comparable behavior to that observed for the single PCG call discussed previously but to a much 
less magnitude. 
 The single CPU/GPU computing system environments defined by System A and System 
B express a performance boost for an increased number of SMPs and a higher execution time for 
decreased numbers of SMPs.  However, the full candidate solution is less affected by this 
hardware factor change as was with the single PCG call with input mesh configuration MA 
yielding negligible results for System A and System B; the input mesh 10FT mesh configuration 
now reflects the performance comparable to mesh configuration MB. 
 




Figure 46. Performance model for full candidate solution on System B (SMP adjusted). 
 4.6.4 Contribution of Software Factors. This section establishes a relationship to 
software factors and the resulting application performance via the derived equation (4.9.3), 
adjusting hardware variables and then projecting against the actual performance of the 
application.  The resulting differentials are analyzed and the impact of the adjusted parameter(s) 
on performance of the CPU/GPU computing system is theorized. 
 4.6.4.1 Single PCG Call. The number of threads per block is an obvious software factor 
to adjust as this is can be related to memory address coalescing and can be viewed as another, 
less direct application of thread occupancy.  The theoretical performance results of both System 
A and System B corroborates documented performance regarding number of threads per block 
[48, 49, 65, 66].  Lowering the number of threads per block to 128 from the optimal 256 
increases theoretical execution time for the single PCG solver and increasing to 768 boosts 
performance, modeling the effect of hardware coalescing of memory addresses [62, 77].  This 
76 
 
software factor is manifest for both System A and System B computing environments as shown 
in Figure 47 and Figure 48 for System A and System B respectively. 
 The influence of the given software factors is clearly evident at the single PCG solver call 
level, the next sub-section discusses the full candidate solution. 
 4.6.4.2 Full solution – single CPU/GPU. Adjusting software factor(s) for the full 
candidate solution and the resulting impact(s) on performance is modeled in this sub-section – 
the number of threads per block is altered.  The effect of this software factor is similar to the 
observed results for the single PCG solver call - smaller numbers of threads express lower 
performance whereas greater numbers yield a performance boost. 
 The observed hardware and software factors work together with the defined 
computational algorithm to effect performance – this interplay of factors is discussed in the next 
section. 
 




Figure 48. Performance model for single PCG solver on System B (Threads adjusted). 
  
 




Figure 50. Performance model for full candidate solution on System B (SMP adjusted). 
4.7 Performance and Relation to Software and Hardware Factors   
 The resulting performance of the single CPU/GPU computing system is directly tied to 
the interplay of software and hardware factors of the environments in which they executed and 
are related in this sub-section. Hardware factors such as the addition of SMPs and increasing 
memory provide the context in which software factors can express further optimization and 
software factors such as increasing thread occupancy and data compression formats that can 
increase locality allow device architectures to apply low-level features like hardware multi-
threading for maximal memory throughput. 
 Equation (4.9.3) accommodates a mathematical model to map theoretical performance 
changes as hardware and software variables are altered, allowing the developer to better 
understand the effects different software and hardware constructs for HPC computational 
modeling applications using CPU/GPU computing systems.  The development of predictive 
79 
 
models for computing follows three general approaches – analytical, profile, and simulation 
based categories [92].   
 Simulation-based performance prediction uses an application that has modeled the 
objective architecture in exacting detail and generates results from dynamic and random inputs – 
accurate but computationally costly [92].  Profile-based performance prediction uses two stages 
to develop the model; instrumentation is utilized to generate statistical information on a given 
program run and analysis is used on these statistics to create an estimation of performance for a 
given architecture [92].  The analysis-based performance prediction model derives a 
mathematical equation that can estimate program behavior for specific architectures and 
algorithms – this is the model utilized in this chapter and follows precepts established by the 
works of [66, 72, 92, 93]. 
 All of the following discussions on the effects of hardware and software factors and any 
resulting interplay are derived from altering the variables of equation (4.9.3) and illustrate 
theoretical performance as consequence. The software factor Threads-Per-Block are denoted as 
TPB and the theoretical change exhorted by different parameters is given as ETime in the 
following discussion.  
 Table 20 categorizes the hardware factors with System A as the computing environment 
contrasting the actual solution execution time against the effective time generated by the 
alteration of input variables and is shown in milliseconds.  Table 21 categorizes the software 
factors with System A as the computing environment contrasting the actual solution execution 




 Table 22 categorizes the hardware factors with System B as the computing environment 
contrasting the actual solution execution time against the effective time generated by the 
alteration of input variables and is shown in milliseconds. Table 23 categorizes the software 
factors with System B as the computing environment contrasting the actual solution execution 
time against the effective time generated by the alteration of input variables and is shown in 
milliseconds.  
 The degree of impact that the proper implementation of software and hardware factors 
present is shown in Table 18 and Table 19 for System A and System B respectively.  The 
resulting factors of change are calculated as non-dimensional quantifier such that differences 
from the original solution times are applied against the solution times that result when the 
parameter/factors are expressed – the closer to zero the less effect the factor has on system 
performance and the greater the magnitude beyond one the more negative the effect on 
performance. The larger negative impact on performance is manifest as the lowering of the 
number of SMP, the converse is also true and is expected given that the increasing of SMP 
invariably increases the number of graphics cores from which to utilize in a given problem 
domain.   
 Interestingly the relative effect of increasing the number of TPB has the same 
consequence regardless of the computing system utilized.  The relative equivalence of software 
factors regardless of computing environments is to be expected as the algorithm should operate 
independently if optimized – of course, this is not conclusive as the more dispersed the non-zero 
elements, the greater the irregularity of memory addressing, memory bank conflicts and the 
resultant performance degradation.  This is evidenced by Table 18 as the lowered number of 
SMPs exposes the weakness of locality – higher relative latency due to lower hardware ability to 
81 
 
hide it. Regularity of the sparse structure can play a significant impact to the potential 
performance benefit in CPU/GPU computing environments. 
 However, this is not conclusive as the analytical performance model derived is, as all 
prediction models of this category, based on conservative measures of a given architecture and 
problem domain - extrapolating too far beyond the initial derivation can create computational 
artifacts in the results. 
Table 18 












MA SMP 4 2.427 TPB 128 0.414 
MB SMP 4 0.352 TPB 128 0.414 
10FT SMP 4 2.272 TPB 128 0.414 
MA SMP 64 0.750 TPB 512 0.000 
MB SMP 64 0.750 TPB 512 0.000 
10FT SMP 64 0.750 TPB 512 0.000 
 
Table 19 












MA SMP 4 2.50 TPB 128 0.414 
MB SMP 4 2.50 TPB 128 0.414 
10FT SMP 4 2.50 TPB 128 0.414 
MA SMP 64 0.780 TPB 512 0.000 
MB SMP 64 0.780 TPB 512 0.000 





 Hardware factors and theoretical performance (System A) 
Input Variable Value Time (ms.) ETime (ms.) 
MA SMP 4 204,689.00 537,490.96 
MB SMP 4 1,804,050.00 2,370,321.50 
10FT SMP 4 140,546.00 444,203.64 
MA SMP 64 204,689.00 39,324.91 
MB SMP 64 1,804,050.00 440,189.68 
10FT SMP 64 140,546.00 33,945.25 
 
Table 21 
 Software factors and theoretical performance (System A) 
Input Variable Value Time (ms.) ETime (ms.) 
MA TPB 128 204,689.00 222,455.30 
MB TPB 128 1,804,050.00 2,490,088.86 
10FT TPB 128 140,546.00 192,023.31 
MA TPB 512 204,689.00 204,689.00 
MB TPB 512 1,804,050.00 1,804,050.00 
10FT TPB 512 140,546.00 140,546.00 
 
Table 22 
 Hardware factors and theoretical performance (System B) 
Input Variable Value Time (ms.) ETime (ms.) 
MA SMP 4 146,635.00 263,356.17 
MB SMP 4 974,897.00 2,951,083.33 
10FT SMP 4 89,228.10 263,514.70 
MA SMP 64 146,635.00 16,459.76 
MB SMP 64 974,897.00 184,442.71 





 Software factors and theoretical performance (System B) 
Input Variable Value Time (ms.) ETime (ms.) 
MA TPB 128 146,635.00 106,411.96 
MB TPB 128 974,897.00 1192417.74 
10FT TPB 128 89,228.10 106,476.02 
MA TPB 512 146,635.00 146,635.00 
MB TPB 512 974,897.00 974,897.00 
10FT TPB 512 89,228.10 89,228.10 
 
 The performance results of the single CPU/GPU computing systems have been shown to 
be consistent for presented input unstructured mesh model configurations with varying sizes for 
both System A and System B computing systems. This performance can be dramatically 
affected by sometimes slight aberrations of input – from this single CPU/GPU paradigm the 




Full Candidate Application – Multiple CPU/GPU Computing System 
 This chapter focuses on the full solution to the candidate application within the context of 
a multiple CPU/GPU computing system for both System A and System B.  The full solution of 
the computationally intensive candidate application is mapped to the CPU/GPU computing 
system, distributed across independent nodes, and the resulting performance is analyzed to 
determine how the hardware and software factors work together to impact the resulting 
application performance.  During the mapping, key computationally intensive kernels are 
presented and associated GPU developments explored. 
 This chapter will ascertain how the hardware architectures of System A and System B 
work together with the software factors to denote the application performance – key in this 
discussion is the calculation of a computational complexity analysis for multiple CPU/GPU 
computing systems which is a natural extension from the single version presented in the previous 
chapter.  The computational complexity analysis is actualized as a performance modeling 
equation that can be used to project how different problem, software, and hardware parameters 
will affect performance.  Specific computational behavior of multiple CPU/GPU computing 
systems is the exposure of the important cost of intra-nodal and local host communication to the 
performance of a computationally intensive application. 
 Understanding these variations in factors/parameters is essential as new computing 
architectures arrive to get optimal performance for HPC computational modeling legacy and new 





5.1 Mapping Full Candidate Application to GPU   
 The mapping of the full candidate application to the multiple CPU/GPU environment 
extends from the previous chapter, detailing the mapping of the single CPU/GPU computing 
system, and is developed and demonstrated for both System A and System B environments.  The 
previous chapter applied the CNC solver set in the context of a shared memory address 
environment and this chapter grows the environment to include multiple systems each with its 
own CPU/GPU architecture providing intra-node communication via MPI standard [28, 94].  The 
ability of CPU/GPU computing systems to scale with multiple architectures is critical as HPC 
applications have long embraced the increased performance provided by parallelizing large-scale 
problems with domain decomposition techniques via MPI and much research in the GPGPU 
computing community is targeting this objective [36, 69, 95-97].  The single CPU/GPU 
computing system presented in the previous chapter passed the computationally intensive 
solution to system of linear equations in matrix form bxA  to a local GPU device where it can 
be most beneficial - extending this paradigm to the multiple systems encompasses an extra level 
of intra-nodal communication, e.g. MPI. 
 MPI is a standard for message passing systems with different implementations such as 
MVPICH [26, 28] and historically dominates HPC modeling as an effective methodology for 
application performance boosting [26, 28].  Implementations of the MPI standard define a tool 
for connecting multiple machines and/or discrete CPUs as a logical whole in order to solve 
problems that are computationally prohibitive in a single machine context [26, 28].  This 
methodology is similar to the Parallel Virtual Machine (PVM), the predecessor of MPI [26, 
28], but the differences are important. 
86 
 
 PVM and MPI take different approaches as to the defining and utilization of distributed 
topologies.  MPI allows the user to easily create virtual topologies [98] that must be explicitly set 
in PVM [99, 100].  The abstraction of topologies with MPI is one of the reasons for its 
popularity; software developers do not have to focus on different architectural environments 
when creating an application in MPI.  The use of virtual topologies has another benefit – many 
MPI implementations optimize the definition of the system based on the physical nodes in the 
current cluster.  The MPI implementation simply alters the identifications of the various 
processors contained in the defined communicator to reflect the optimal distances of the actual 
machines contained in the system [98, 100].  PVM will allow for the communication, not only 
between heterogeneous architectures but also between different languages – e.g., a C-Code 
program can interface to a FORTRAN-Code program and vice-versa.  PVM will probe for 
differences in architecture to allocate native resources as needed.  MPI, whose chief design is 
around both performance and portability, assumes a consistent connection via a defined world 
communicator [98, 100].  PVM is designed for operation within a heterogeneous set of 
architectures, while MPI can do this also, it is not explicitly defined within the standard itself  
[98, 100]. 
 MPI was chosen as the standard for intra-node communication for the presented 
candidate composite process flow modeling application as this represents a significant body of 
research in GPGPU computing [101-104] and porting legacy code to utilize CPU/GPU systems 
requires a robust and portable solution that encompasses this paradigm.  The key 
computationally intensive Kernels encompassed by the multiple CPU/GPU computing systems 
are discussed next. 
87 
 
 5.1.1 Key Computationally Intensive Kernels. The key computationally intensive 
Kernels for the multiple CPU/GPU computing systems introduced in this chapter are evolved 
from two distinct sources.  The first source is a direct result of the utilization of sets of  single 
CPU/GPU computing systems from which the multiple CPU/GPU computing system is formed 
as each individual CPU/GPU machine involved brings with it the local computationally intensive 
Kernels that are effected by a separate machine architecture.  The second source is directly 
related to the communication vehicle for the multiple CPU/GPU computing systems, MPI and 
any communicational overhead it brings to the total system is magnified by the significant GPU 
and CPU communications via the local PCIe bus  – a noted bottleneck in single CPU/GPU 
systems now magnified by the number of distinct nodes in the communication world of MPI [69, 
97]. 
 5.1.2 GPU Code Developments. This sub-section establishes GPU code developments 
such as API tools/libraries and the data-structures/layouts for the definition of the multiple 
CPU/GPU computing system.  Nvidia’s CUDA API has remained ahead in the GPGPU 
computing community [1, 2, 24] and this continues as interest grows in CPU/GPU computing 
clusters with the provision of the Unified Virtual Address (UVA) space [37, 62, 105]. 
 UVA allows CUDA to map GPU device buffers into a global virtual address space and 
then queries the system to determine if a desired address is in GPU or CPU space.  UVA then 
signals CUDA to execute a PCIe communication or local memory call, depending on the 
physical location of the virtual address – theoretically allowing for direct MPI buffer transfers.  
However, CUDA UVA was not utilized with the presented candidate application as the construct 
did not come to fruition until compute architecture 2.0 and this leaves out System A [58]. So in 
the interest of consistency was not pursued in this dissertation.   
88 
 
 Given that the individual GPU devices in the multiple CPU/GPU computing systems 
have no direct communication to outside nodes a double-copying is inferred [105].  The 
individual CPU/GPU computing system executed upon a local sub-domain of the global problem 
space, as per domain decomposition [26, 28], passing the computationally intensive system of 
linear equations defined as a sparse matrix system to the local GPU.  The GPU executes the 
system using the PCG iterative solver, as with the single CPU/GPU system, and returns the result 
back across the PCIe bus to the CPU where it is then stored in the defined MPI communication 
buffers to be shared with other nodes in the system – unavoidably increasing the latency as there 
must now be explicit staging of memory buffers for collective and point-to-point calls [94] and 
GPU to CPU to MPI and back the same path at each iteration of the algorithm. 
 The full candidate application is validated against an analytically derived solution for a 
simple 2D radial injection circular plate mold geometry model using the multiple CPU/GPU 
computing system next. 
5.2 Validation of Full Candidate Application on Multiple CPU/GPU   
 The correctness of the full candidate application for the multiple CPU/GPU computing 
systems is ensured via the examination of flow-front progression and injection port pressures of 
numerical solutions for CPU and GPU against the correspondingly analytical results.  As before, 
the model used for validation is a simple 2D circular plate with the resulting analytical equation. 
 The simple model being used in this chapter is a radial flow in a circular plate with a 
radius of 10 cm and an inner radius of 0.15 cm shown in Figure 56.  The inner radius, 0R , is 
subjected to a constant flow rate Q .  The thickness of the cavity is H , the pressure is P, resin 
viscosity is  , the permeability of the fiber preform is K , and the porosity of fiber compaction 
is .  The flow front radius at any time t  is given by [89]: 
89 
 















      (5.1) 
The corresponding expression for injection pressure, which varies with time, is given by [89]: 




























     (5.2) 
 
Figure 51.  2D circular plate validation model (not to scale). 




Q  , permeability
2080.44 cmeK  , a viscosity PaS02.0 , a porosity of 805.0 , 
a time step 5.0t sec, and an element thickness cmH 742.0 .  The circular plate model 
involved a computational mesh of 1,344 nodes and 2,560 3-noded triangular elements. Figure 52 
and Figure 54 display the flow-front progression for System A and System B respectively and 
clearly define accuracy with the analytical value. Figure 53 and Figure 55 display the inlet 




 The flow-front and inlet injection pressure values are accurate so the full candidate 
solution for the multiple CPU/GPU computing systems is presented next with a focus on initial 
performance evaluation using the multiple CPU/GPU computing systems. 
 
Figure 52. Validation of multiple CPU/GPU for flow-front progression (System A). 
 




Figure 54. Validation of multiple CPU/GPU for flow-front progression (System B). 
 
Figure 55. Validation of multiple CPU/GPU for inlet injection pressure (System B). 
5.3 Initial Full Candidate Application Performance on Multiple CPU/GPU   
 Much of the establishment of key computationally intensive Kernels for the multiple 
CPU/GPU computing systems is a reflection of the single CPU/GPU computing system from the 
92 
 
previous chapter.  The single system provides the building blocks of the multiple systems leaving 
the effective performance of the multiple systems to the parlance of latency mitigation as a 
product of standard MPI communication overhead [26, 28] and noted CPU/GPU costs [106-109].  
 5.3.1 System A. The full candidate solution for multiple CPU/GPU computing systems 
was applied using System A and is discussed in this sub-section. The initial problem domain was 
partitioned into various sub-domains each to be executed on a single CPU/GPU computing 
system.  The multiple sub-domain results are then compared against the CPU-only, and the 
CPU/MPI solution times. 
 Table 24 shows the total solution times, in milliseconds, for increasing numbers of 
partitions using as input unstructured mesh MA expressed using CSR data format compression 
with System A computing architecture.  The observed total solution times in all cases are higher 
for the multiple CPU/GPU computing system over the single CPU-only and CPU plus MPI 
which illustrates the significant cost potential of the extra layer of latency produced by intra-node 
communication within the execution of the global solution domain.  The slight decrease at 2-
partitions for the GPU plus MPI model is due to the cache effect of increased local memory to 
allow more of the problem to be immediately available for solving.  
 Table 25 shows the total solution times, in milliseconds, for increasing numbers of 
partitions using as input unstructured mesh configuration MB expressed using CSR data format 
compression with System A computing architecture.   The observed total solution times for GPU 
plus MPI are more promising as the increased computational loads on the individual nodes using 
the larger sized input mesh are large enough to overcome the latency of intra-node as well as the 
PCIe bottleneck – until 16 sub-domains are created and computational loads on individual nodes 
in the system become too poor to overcome increasing latency for this fixed problem size. Figure 
93 
 
56 and Figure 57 visually depict the observed performance using System A as the computing 
environment and different sized mesh configurations MA and MB respectively. 
 The more structured meshes MA and MB depict distinct performance differences when 
employing MPI with the local GPU. The smaller model MA illustrates no performance benefit 
for MPI with GPU over MPI without GPU as the smaller model does not have enough 
computational intensity at the local GPU level to overcome the cost of intra-nodal 
communication generated by MPI [70, 105]. The larger model MB reveals a slight performance 
boost when using MPI and the local GPU for the global count of partitions that remain below 16 
when the intra-nodal communication cost once again become the dominating factor of 
performance. The more unstructured mesh model 10FT shows a negligible difference of MPI 
with or without utilizing the local GPU due to the more evenly distributed non-zero elements, 
each computing node in the system is given nearly equal divisions of work and the so throughput 
latency is better handled, thus hiding the intra-nodal communication cost of MPI better. 
 Table 26 shows the total solution times, in milliseconds, for increasing numbers of 
partitions using as input unstructured mesh configuration 10FT defined earlier expressed using 
CSR data format compression with System A computing architecture.  The observed total 
solution times for GPU plus MPI are higher than the corresponding CPU plus MPI – the 
performance is closer than the results shown by the mesh MA as the numbers of non-zero 
elements is greater for 10FT but the intra-nodal communication derived from the coarser-grained 
parallelism of MPI communication creates higher levels of latency that must be mitigated [69, 
105, 109]. 
 The next sub-section will discuss the results of the multiple CPU/GPU architecture 




 Multiple CPU/GPU performance in milliseconds with mesh MA (System A) 
Data Compression Partitions CPU + MPI (ms.) GPU + MPI (ms.) 
CSR 1 2,749,450.00 418,440.00 
CSR 2 1,666,960.00 1,604,200.00 
CSR 4 825,989.00 1,357,890.00 
CSR 16 260,669.00 1,461,590.00 
 
Table 25 
 Multiple CPU/GPU performance in milliseconds with mesh MB (System A) 
Data Compression Partitions CPU + MPI (ms.) GPU + MPI (ms.) 
CSR 1 77,163,900.00 4,219,610.00 
CSR 2 20,169,400.00 16,985,600.00 
CSR 4 9,633,960.00 8,570,300.00 
CSR 16 2,495,580.00 23,956,100.00 
 
Table 26 
 Multiple CPU/GPU performance in milliseconds with mesh 10FT (System A) 
Data Compression Partitions CPU + MPI (ms.) GPU + MPI (ms.) 
CSR 1 6,649,810.00 140,546.00 
CSR 2 1,206,160.00 1,246,120.00 
CSR 4 621,867.00 688,750.00 





Figure 56.  Multiple CPU/GPU computing system - mesh MA (System A). 
 




Figure 58. Multiple CPU/GPU computing system - mesh 10FT (System A). 
 5.3.2 System B. The full candidate solution for multiple CPU/GPU computing systems is 
executed in the System B computing environment and is discussed in this sub-section. The initial 
global problem domain is partitioned into various sub-domains and passed among various 
discrete processing nodes to be executed in the manner of a single CPU/GPU computing system.  
The multiple sub-domain results are then compared against the CPU-only, and the CPU/MPI 
solution times. 
 Table 27 shows the total solution times, in milliseconds, for increasing numbers of 
partitions using as input unstructured mesh configuration MA expressed using CSR data format 
compression with System B computing architecture.  The observed total solution times in all 
cases involving sub-domain partitions reveal that the GPU plus MPI construct outperforms the 
CPU/MPI model.  However this positive benefit of combining GPU and MPI is not observed in 
the larger input mesh configuration MB as can be seen in Table 28.  This observed performance 
degradation for the larger element mesh is a stark contrast from the behavior manifested in 
97 
 
System A, where a larger mesh resulted in better performance as the computationally intensive 
operations increased. 
 Table 29 shows the total solution times, in milliseconds, for increasing numbers of 
partitions using as input unstructured mesh configuration 10FT expressed using CSR data format 
compression with System B computing architecture.  The observed solution times for the 10FT 
model, while executing on the more advanced CPU/GPU computing System B never manages to 
overcome the latency incurred by the intra-nodal communication emergent from the use of 
multiple MPI communication calls – evidence of a faster and more efficient hardware that 
minimizes costs at the local host and conversely exposing the MPI communication costs. 
 Figure 59 and Figure 60 are visual depictions of the observed results of the multiple 
CPU/GPU architecture System B listed in Table 27 and Table 28, and Figure 61 illustrates the 
data given in Table 29.  The next sub-section will examine, analyze and discuss the observed 
initial performance results for multiple CPU/GPU systems represented by computing System B 
and System A. 
Table 27 
 Multiple CPU/GPU performance in milliseconds with mesh MA (System B)  
Data Compression Partitions CPU + MPI (ms.) GPU + MPI (ms.) 
CSR 1 420,980.00 168,570.00 
CSR 2 767,364.00 220,462.00 
CSR 4 448,665.00 98,034.20 
CSR 16 307,081.00 44,392.40 
 
Table 28 
 Multiple CPU/GPU performance in milliseconds with mesh MB (System B)  
Data Compression Partitions CPU + MPI (ms.) GPU + MPI (ms.) 






CSR 2 1,675,850.00 2,995,870.00 
CSR 4 872,019.00 2,172,520.00 
CSR 16 318,895.00 1,490,310.00 
 
Table 29 
 Multiple CPU/GPU performance in milliseconds with mesh 10FT (System B) 
Data Compression Partitions CPU + MPI (ms.) GPU + MPI (ms.) 
CSR 1 616,770.00 89,228.10 
CSR 2 124,995.00 249,532.00 
CSR 4 73,081.40 145,577.00 
CSR 8 41,426.60 118,746.00 
 
 




Figure 60. Multiple CPU/GPU computing system - mesh MB (System B). 
 
Figure 61. Multiple CPU/GPU computing system - mesh 10FT (System B). 
 5.3.3 Initial performance analysis. The execution of the unstructured mesh input files, 
MA, MB and 10FT for both computing systems exposes some interesting performance 
behaviors, notably a divergence of performance for the different computing environments.  The 
100 
 
observed results for both  System A and System B lower in total solution time as the number of 
sub-domains increases – excepting at 16 partitions, as System A then starts to ratchet up in 
solution times for both input meshes MA and MB.  The input mesh 10FT displays almost no 
difference between CPU plus MPI and GPU plus MPI – the impact of intra-nodal 
communication is lessened for this model within the System A computing system environment. 
 System B performs better using GPU plus MPI over CPU/MPI for the smaller input mesh 
MA but this is the opposite of that observed with the larger mesh MB. System B has both a 
larger set of registers and shared memory than System A and therefore able to hold larger 
amounts of data to increase throughput allowing better utilization and conversely exposing 
higher combined latencies of local CPU-GPU and intra-nodal communications.  The result is the 
counter-intuitive effect of a less computationally intensive problem performing better than the 
larger and more computationally costly input mesh MB – an artifact of increased memory 
complexity and no direct connection to the MPI library calls. 
 The counter-intuitive effect of better hardware creating reverse performance for larger 
problem domains, i.e. less computationally intensive models can perform better than more 
computationally intensive models using a device that is optimal for systems requiring higher 
numbers of floating point operations can be seen in Figure 61. Figure 61 shows a nearly constant 
difference from the GPU plus MPI and CPU plus MPI which is likely due to the more efficient 
execution of the GPU device – exposing a larger amount of intra-nodal communication cost for 
the global system. 
 The software factors that influence the performance of multiple CPU/GPU computing 




5.4 Software Data-Structures/Layout Factors  
 The previous section was an initial performance analysis for unstructured mesh inputs via 
the multiple CPU/GPU computing systems defined as System A and System B and produced 
mixed results – System A displayed a performance boost for the larger input mesh MB but not 
MA and System B displayed the converse.  However neither approached the same level of 
performance observed by the single CPU/GPU systems of the previous chapter and neither 
illustrated definitive performance increase for input mesh 10FT – although System A yields a 
closer result.  The software variables involved in the observed results of the initial full solution 
with multiple CPU/GPU computing systems are examined to identify potential factors that can 
hinder performance of the presented candidate application.  The first software factor to be 
examined is intrinsic to memory-bound problems such as the presented candidate composite 
process flow modeling finite element based application – data compression format.  
 In the interest of brevity, the reader is referred to chapter 4 for a more detailed reasoning 
for the execution of the BCSR2x2 format over CSR which was defined for the initial 
performance results observed.  The same system parameters that exist at the local single 
CPU/GPU computing systems are valid for the multiple CPU/GPU structure presented in this 
chapter, and the potential increase in locality via the utilization of the BCSR2x2 data 
compression format [52-54] is discussed next. 
 Figure 62 and Figure 63 show the multiple CPU/GPU performance with System A and 
System B respectively using only BCSR2x2 data compression format for input meshes MA and 
MB.  The homogeneous comparisons of the BCSR2x2 show that System A gains no positive 
performance benefit using the smaller input mesh MA but does for the corresponding larger 
mesh MB once 16 partitions is reached whereas the CSR format showed that at 16 partitions the 
102 
 
performance for this input mesh dropped.  This difference in behavior corroborates the precept 
that locality defined at the software data layout can effect behavior of HPC applications. 
 Figure 66 and Figure 67 show the performance of the input mesh 10FT for System A and 
System B respectively comparing the CSR and BCSR2x2 compression formats.  The less regular 
mesh defined by the 10FT model displays a consistent benefit when expressed using the 
BCSR2x2 compression format over the CSR format.  These observed performance results are 
consistent with the input mesh MA, which contains a similar number of non-zeros but in a much 
less complex geometry. 
 System B displays similar performance behavior as System A when the data 
compression layout is altered to BCSR2x2 but to a larger magnitude.  System A showed 
performance benefit at 16 partitions for the input unstructured mesh MB whereas System B 
illustrates this same benefit at 4 partitions.  And while System A has no discernible advantage of 
BCSR2x2 for the input mesh MA, System B does show a lower total solution cost, albeit not 
very impressive. Table 30 and Table 32 are the observed results utilizing the BCSR2x2 
compression format for System A and System B respectively. 
 The observed results are taken within the context of the BCSR2x2 data compression 
formats only, with the base line defined as the cost of execution for the global solution using 
BCSR2x2 – i.e. the full solution cost using the BCSR2x2 compression format using a CPU/GPU 
computing system at a single processor level, with no domain decomposition applied.  The 
effects of spatial locality are applied in a mixed compression environment next. 
Table 30 
 Multiple CPU/GPU performance in seconds (System A) 
Data Compression Partitions Mesh MA Mesh MB 





BCSR2x2 2 1,512.94 14,223.50 
BCSR2x2 4 1,015.11 7,429.34 
BCSR2x2 16 715.41 3,234.68 
 
Table 31 
 Multiple CPU/GPU performance of 10FT model in seconds (System A) 
Data Compression Partitions Mesh 10FT 
BCSR2x2 1 140.69 
BCSR2x2 2 1,020.35 
BCSR2x2 4 660.38 
BCSR2x2 8 551.57 
 
Table 32 
 Multiple CPU/GPU performance in seconds (System B) 
Data Compression Partitions Mesh MA Mesh MB 
BCSR2x2 1 217.76 1,931.31 
BCSR2x2 2 233.34 2,151.81 
BCSR2x2 4 168.96 1,202.46 
BCSR2x2 16 129.29 549.83 
 
Table 33 
 Multiple CPU/GPU performance of 10FT model in seconds (System B) 
Data Compression Partitions Mesh 10FT 
BCSR2x2 1 89.07 
BCSR2x2 2 169.63 
BCSR2x2 4 133.66 




 Figure 64 and Figure 65 show the comparison of the multiple CPU/GPU computing 
systems using different data formats of CSR and BCSR2x2 with System A and System B 
respectively.  System A shows a positive benefit of using the BCSR2x2 format over the CSR 
format but only for a limited number of sub-domains for the input unstructured mesh MB and 
even less for the smaller input mesh MA.  This observation illustrates that increasing spatial 
locality for the System A architecture can have absolute benefit as BCSR2x2 will improve on 
the CSR format and not just illustrate an ever increasing computational benefit as compared to a 
single compression format in all cases.  System B does not follow the same pattern as System A. 
 The multiple CPU/GPU computing system defined by System B does not show any 
positive benefit for the use of the BCSR2x2 data compression format when direct comparisons to 
CSR are made – excepting a slight improvement for 2-partitions using the MA input likely due 
to cache effects.  These observations are further elaborated in the next section on hardware 
factors as the immunity to the increased locality of System B when using BCSR2x2 is a 
consequence of this. Table 34 and Table 35 show the observed results for the comparison of CSR 
and BCSR2x2 data compression formats for System A and System B respectively.  The less 
regular input mesh 10FT shows a more consistent behavior for both computing environments. 
Table 34 




Mesh MA (BCSR2x2) Mesh MB (CSR) 
Mesh MB 
(BCSR2x2) 
1 418.44 418.44 4,219.61 4,219.61 
2 2,009.79 1,512.94 15,815.20 14,223.50 
4 1,011.24 1,015.11 7,651.25 7,429.34 









Mesh MA (BCSR2x2) Mesh MB (CSR) 
Mesh MB 
(BCSR2x2) 
1 168.57 217.76 1,197.35 1,931.31 
2 335.92 233.34 2,138.79 2,151.81 
4 157.90 168.96 1,038.69 1,202.46 
16 124.02 129.29 495.33 549.83 
 
 




Figure 63.  Multiple CPU/GPU performance BCSR2x2 compression (System B). 
 




Figure 65. Multiple CPU/GPU performance mixed compression (System B). 
 Figure 66 and Figure 67 show the performance results for System A and System B 
respectively for the input mesh configuration 10FT.  The computing environments defined by 
System A and System B illustrate general equivalence of performance benefit for increased 
locality exposed by the use of BCSR2x2 – unlike the input meshes MA and MB.  The more 
regular input meshes MA and MB have lower irregular memory access patterns than 10FT, 
exposing hardware differences to a greater degree – improving locality for the algorithm has 




Figure 66. Multiple CPU/GPU performance mixed compression - 10FT (System A). 
 





5.5 Hardware Architectural Factors  
 The observed performance of the CPU/GPU computing systems using the unstructured 
mesh input defined by MA, MB, and 10FT is affected not just by the software factors discussed 
in the previous section but hardware factors as well.  The presented candidate application 
performed with mixed results using the multiple CPU/GPU computing system paradigm, and 
switching to different data compression formats continued these amalgamated observations – 
providing enhanced results for System A but less so for System B.  The lower sensitivity to the 
adjustment of data compression format of System B given the equivalence of partition counts 
and input meshes implies an underlying hardware factor. 
 The architectural design of System B is defined as CUDA compute architecture 2.0 and 
System A is defined as CUDA compute architecture 1.0 – significant architectural differences 
for these systems exist.  CUDA’s thread concept is register-bound and with System B 
embodying  over 32,000 on-chip registers compared to System A with a little over 8,000 
provides the ability of more resources for execution threads to remain viable – more importantly 
is the existence of an actual cache structure for System B that is absent from System A. 
 The inclusion of the System B cache and higher memory device I/O allow for higher 
throughput and a finer granularity than that provided by System A.  This finer granularity and 
faster memory I/O for System B creates less sensitivity to the locality alterations provided via 
the BCSR2x2 data compression format, as the multiple CPU/GPU computing system has less 
problems with data locality than does System A.  Therefore optimizing the performance of the 
presented candidate application has different requirements for the different architectures that 
need to be understood. 
110 
 
 Figure 64 and Figure 65 exemplify the importance of a proper coordination of software 
and hardware factors for optimizing HPC applications.  Figure 64 shows that System A is 
positively impacted with the application of BCSR2x2 due to a lack of hardware-level cache 
whereas utilizing BCSR2x2 to increase locality for System B is both unnecessary and potentially 
deleterious to performance as shown in Figure 65 using input meshes MA and MB. Increasing 
the number of elements in a single clock cycle with the implementation of BCSR2x2 using the 
multiple CPU/GPU computing system defined by System B is likely over-utilizing the on-chip 
hardware resources as competition increases. 
 The next section discusses the observed full solution performance using an augmented 
version of equation (4.9.3) from chapter 4 such that the performance of multiple CPU/GPU 
computing systems is endorsed. 
5.6 Computational Complexity Analysis  
 This section establishes the mapping of the observed performance and the derived 
complexity analysis for the multiple CPU/GPU computing system, detailed in Appendix A.  The 
theoretical performance estimates for System A are discussed first followed by those for System 
B where all results are generated under the assumption of CSR data compression format. 
 The complexity analysis model for the multiple CPU/GPU systems is a natural extension 
from the previous chapter’s derivation of the single CPU/GPU systems model – the results from 
the single analysis model are incorporated as a critical component of the multiple analysis model.  
However, the introduction of MPI as a communication amongst various sub-domains presents an 
added level of communication abstraction given that the GPU cannot communicate directly with 
the CPU it can neither communicate with the MPI library calls that can contain a significant 
amount of overhead [105].  Building from equation (4.9.3) in chapter 4 and using value found 
111 
 
for gpuT , equation (5.3) with sdP  the number of sub-domains, ATBN the number of active threads 
per block, and C the cost of combined combination of intra-node communication – assuming that 



















_      (5.3) 
 The determination of the performance modeling equation for multiple CPU/GPU systems 
is more involved than the single CPU/GPU system model - the individual architectures involved 
can present obfuscated operational costs which accentuate the PCIe bottleneck of the single 
system.  Equation (5.3) depicts a change in computational cost when the number of sub-domains 
reaches 16 as CUDA waits until a half-warp is instantiated before issuing a memory transaction 
[37, 93] – this condition is meant to emulate this behavior across discrete systems.   
 The communication variable C from equation (5.3) is affected not only by the size of the 
problem domain but also individual architectures and MPI implementations involved and it is the 
inter-play of MPI and local CPU-GPU host communications that is generally deleterious to 
multiple CPU/GPU computing systems [105, 106, 110]. The understanding of how these 
communication factors interact with the determined modeling equation is discussed in the next 
sub-section.  
 5.6.1 Relationship of MPI-GPU and CPU-GPU communication. The Multiple 
CPU/GPU computing systems do not yield optimal parallel performance as expected when given 
P  processors and a sparse matrix with zN  total non-zero elements and R  local GPU registers 
i.e. does not result in 
P
1
 benefit [28].  The reason is that as the initial system of zN  elements is 
broken down into smaller sets that are held at the local GPU device yields more latency to hide 
112 
 
and correspondingly less computational intensity to utilize GPU resources.  This behavior was 
observed to be consistent across System A and System B for both input meshes when adjusted 
for local GPU register counts and associated zN  elements. 
 The expected behavior of a given computationally intensive application can be seen as 
related to the percentage of total zN  elements held locally at the GPU device and the number of 
partitions distributed across the global system. The percentage of total zN elements held locally 
is given by equation (5.4) and was used as the independent variable to map the observed multiple 















R          (5.4) 
 The ratio of the optimal parallel performance and actual performance for a given value of 
R defines the performance deviation from ideal due to the local CPU-GPU host and MPI 
communication inter-play.  These deviations were mapped using regression such that localNZ  is 
the ratio of the total number of non-zero elements, zN , from the global problem domain held by 
the local GPU device and pX̂ is the value of the deviation computed as the ratio of ideal 
parallelism 
P
Ts  and the actual execution time for the given number of partitions P and execution 
time for the serial version sT  represented as the solid BLACK lines in Figure 68, Figure 69, 
Figure 70 and Figure 71. These are shown for System A as Figure 68 and Figure 69 for input 
meshes MA and MB respectively - Figure 70 and Figure 71 illustrate these same factors for 
input meshes MA and MB using System B. 
113 
 
 The equations revealed by regression, shown for convenience in Figure 68, Figure 69, 
Figure 70, and Figure 71 as the dashed RED lines, vary with (5.4) as input but can be easily 
replaced by the number of processors P  as the proportion of zN  elements held by a local GPU 
device is directly related – this same precept holds for input mesh 10FT. The equations derived 
via regression have a Pearson Product-Moment correlation coefficient of 1 in all cases, the 
coefficient dependency is one-to-one [111] – i.e., exact match with the deviation from ideal 
represented as pX̂  in the figures. The approximated equations derived with regression from pX̂  
are degree 3 for all models and essentially isolate the overhead of CPU-GPU local host and intra-
nodal communication costs. These approximated polynomial equations are employed as an 
asymptotic measurement. Therefore, by the definition of asymptotic behavior [91], the 
relationship of MPI and local CPU-GPU communication effects on multiple CPU/GPU 
computing systems can be shown as  (5.5). 
 3R         (5.5) 
 The equations derived via regression are specific to the observed performance for a given 
input mesh and architecture, but extending the number of partitions i.e. increasing independent 
variable against equation  (5.5) and applying some constant K  define the cost of intra-nodal and 
local CPU-GPU host communication will not be greater than cubic. 
 The asymptotic equation  (5.5) is compared to the equations that were derived using 
regression for both System A and System B. Figure 72 and Figure 73 show the asymptotic 
behavior of the input meshes MA and MB for System A respectively and Figure 74 Figure 75 
show input meshes MA and MB for System B. The results are shown in Figure 76 and Figure 77 
for input mesh 10FT using computing System A and System B respectively. These figures show 
that regardless of the number of partitions/sub-domains the theoretical cost of local CPU-GPU 
114 
 
and intra-nodal communication, represented as the solid line, indeed stay below cubic, 
represented as the dashed line, for all models and both computing system environments. 
 
Figure 68. Deviation from ideal mapped with regression multiple CPU/GPU mesh MA. 
 




Figure 70. Non-zeros held locally (mesh MA) and factors off with multiple CPU/GPU. 
 




Figure 72. Asymptotic behavior of MPI and CPU-GPU communication (MA, System A). 
 




Figure 74. Asymptotic behavior of MPI and CPU-GPU communication (MA, System B). 
 




Figure 76. Asymptotic behavior of MPI and CPU-GPU communication (10FT, System A). 
 
Figure 77. Asymptotic behavior of MPI and CPU-GPU communication (10FT, System B). 
 The next sub-section examines the theoretical performance results using the model given 
by equation (5.3) and altering hardware factors. 
119 
 
 5.6.2 Comparison of performance modeling. The Multiple CPU/GPU computing 
system performance predictive model is compared to the actual time for each sub-domains for 
both System A and System B for input meshes MA and MB with the same input parameters. 
Figure 78, Figure 79 and Figure 80 show a strong correlation to the modeled performance given 
by equation (5.3) and the actual full solution execution time for the multiple CPU/GPU system 
defined by System A.  Figure 81, Figure 82 and Figure 83 show a strong correlation to the 
modeled performance equation (5.3) and the actual full solution execution time for the multiple 
CPU/GPU system by System B. 
 




Figure 79. Multiple CPU/GPU theoretical performance with input MB (System A). 
 




Figure 81. Multiple CPU/GPU theoretical performance with input MA (System B). 
 




Figure 83. Multiple CPU/GPU theoretical performance with input 10FT (System B). 
 5.6.3 Contribution of Hardware Factors. This section establishes a relationship to 
hardware factors and the resulting application performance via the derived equation (5.3), 
adjusting hardware variables and then projecting against the actual performance of the 
application.  The resulting differentials are analyzed and the impact of the adjusted parameter(s) 
on performance of the CPU/GPU computing system is theorized.  The number of SMPs for each 
of the defined computing systems is adjusted while the rest of the model is held as constant to 
isolate the specific hardware. 
 Figure 84 and Figure 85 show that as the number of SMPs drops the corresponding 
theoretical performance decreases for System A for both input mesh MA and MB respectively – 
due to the lower computational power of the individual processing elements.  Increasing the 
number of SMPs for System A has the opposite effect on theoretical performance for both input 
mesh MA and MB, directly related to the greater computational power that is leveraged at this 
alteration. These theoretical performance results are consistent for Figure 84 and Figure 85 with 
123 
 
the decrease of SMPs, shown as the dashed RED lines, producing greater effect when compared 
to the corresponding increase of SMPs shown as the dotted BLUE lines.  
 Figure 86 shows the less structured input mesh 10FT and depicts a theoretical behavior 
across increasing partitions/sub-domains as roughly reflective of that for the more structured 
input meshes MA and MB for System A shown in Figure 84 and Figure 85. Increasing the 
number of SMPs will lower the total execution time and decreasing the number of SMPs will 
raise the total execution time. However, the degree of change is significantly higher for the 10FT 
mesh using System A – likely due to the more distributed nature of the mesh, generating a 
correspondingly less regular sparse matrix and coercing more indirection in the data compression 
format. The increased indirection of the sparse matrix-vector multiplication for the 10FT model 
combined with lower SMPs means lower process throughput hindered by higher levels of 
irregular memory access patterns, significantly deteriorating latency hiding. 
 Figure 87 and Figure 88 show that System B has a similar theoretical behavior to that 
produced by System A; however the theoretical performance is much less pronounced.  
Theoretical performance drops for both increasing and decreasing the number of SMPs at 4 
partitions as the computational intensity becomes less salient and the communication costs for 
intra-nodal communication overtake the final results. 
 The next sub-section discusses the software factors on theoretical performance for 




Figure 84. Multiple CPU/GPU performance with MA - hardware factor (System A). 
 




Figure 86. Multiple CPU/GPU performance with 10FT - hardware factor (System A). 
 




Figure 88. Multiple CPU/GPU performance with MB - hardware factor (System B).  
 
Figure 89. Multiple CPU/GPU performance with 10FT - hardware factor (System B). 
 5.6.4 Contribution of Software Factors. This section establishes a relationship to 
software factors and the resulting application performance via the derived equation (5.3), 
127 
 
adjusting hardware variables and then projecting against the actual performance of the 
application.  The resulting differentials are analyzed and the impact of the adjusted parameter(s) 
on performance of the CPU/GPU computing system is theorized. 
 Thread occupancy is a common practice for increasing the performance of GPU-based 
systems in the GPGPU computing community and this paradigm is followed to achieve 
theoretical performance boost, altering the number of Threads-Per-Block (TPB).  Increasing the 
TPB value in equation (4.9.3) is carried through to equation (5.3) and allows higher probability 
of coalesced memory accesses and utilizes more floating-point operational units – e.g. improved 
theoretical performance.  Once again, System A displays the clearest benefits for both input 
meshes as shown in Figure 90 and Figure 91 - Figure 92 clearly illustrates the best performance 
at 256 threads per block. 
 Lower the number of TPB to less than optimal for System A, defined as 256, provides 
less opportunity for address coalescing as well as lower throughput whereas increasing the TPB 
has the converse theoretical effect. Figure 93,  Figure 94 and Figure 95 for System B display 
similar effects on theoretical performance observed on System A excepting the sudden “dip” 
encountered at 4 partitions for both input meshes. 
 The observed theoretical “dip” at 4 partitions for System B is an artifact of equation (4.7) 
with the cost of the denominator B .  System B with input mesh MB shows a lowering of the 
effects of both higher and lower TPB as the number of partitions increase due to the growing 
influence of intra-nodal communication latency as well as increasing calls to the local CPU via 




Figure 90.  Multiple CPU/GPU performance with MA - software factor (System A). 
 




Figure 92. Multiple CPU/GPU performance with 10FT - software factor (System A). 
 




Figure 94. Multiple CPU/GPU performance with MB - software factor (System B). 
 
Figure 95. Multiple CPU/GPU performance with 10FT - software factor (System B). 
 The next section relates the observations of hardware and software factors to the final 
performance results of the presented candidate application, reasoning the importance of careful 
131 
 
use for multiple CPU/GPU computing systems for optimal HPC modeling application 
performance. 
5.7 Performance and Relation to Software and Hardware Factors   
 The resulting performance of the multiple CPU/GPU computing system is directly tied to 
the interplay of software and hardware factors of the environments in which they executed and 
are related in this section – simply expanding hardware chips will not necessarily produce the 
desired performance boost if the algorithm poorly incorporates the hardware and vice versa.  
Point in fact, just loading a system with the largest possible number of threads (a software factor) 
will overload the register file (a hardware factor) with resource demands enforcing less 
utilization as well as register spilling to device memory and increasing the number of clock 
cycles to hundreds.  Increasing the number of computational chips via the increasing number of 
SMPs (a hardware factor)  will mean little if the access pattern of a matrix system defined by the 
application (an algorithmic factor) is accessed by Kernel threads in a row-major order when the 
GPU device is optimized for column-major causing non-contiguous addressing.  
 The single CPU/GPU systems from the previous chapter illustrate the overlap of software 
and hardware artifacts on resulting performance and the multiple CPU/GPU systems in this 
chapter show the same influence.  However, this is not as easy to spot as the aggregate costs 
imposed by intra-node communication can abrogate any performance benefits observed. And 
determining the cost of this intra-node communication is difficult given its combination of the 
local PCIe overhead of CPU/GPU communications. 
 The current state of the GPU is one of isolation from the CPU as well as the MPI 
standards – this is an area of current research and concern for future co-processor accelerators 
[36, 96, 97, 112, 113].  Equation (5.3) takes liberties and employs approximation with regard to 
132 
 
the final cost of this communication between nodes and the local CPU-GPU costs as there is an 
inherent double-copy when using MPI library calls for a set of one or more CPU/GPU systems 
[105].  
 Establishing a direct and dynamic relation among all the defined software, hardware, and 
algorithmic factors is necessary to elicit optimal performance boost for the presented candidate 
application.  This same judicious application of software and algorithmic methodologies are 
needed for many other HPC computational modeling applications as the iterative solution to the 
sparse matrix system defined by the presented candidate application is common to many 
scientific and engineering applications [20, 21, 23, 39, 55, 80] that wish to fully utilize the 
substantial performance boosting capabilities of not only GPU accelerators but the inexorable 




Summary and Future Directions 
 The major conclusions of this dissertation can be summarized as follows:  
(i) The relationship of software and hardware factors on the performance of 
computationally intense applications that wish to execute within the context of the 
modern CPU/GPU computing systems must be judiciously applied for optimal 
performance. 
(ii) A predictive performance model was adapted for this research and is within the range 
of acceptable normalized error for functionality.  This model can be used to assist 
with the proper determination of cost/benefit optimal manipulation of software and 
hardware factors. 
(iii) Intra-nodal communication and local CPU/GPU host communication can be 
deleterious to performance benefits for multiple CPU/GPU computing environments 
and the asymptotic upper bound on this communicational cost was calculated as 
asymptotically bound to cubic values with data locality. 
(iv) The more regular an input matrix being solved by a CPU/GPU computing system, 
single or multiple node, the more exposed software factors are to final performance 
whereas the less regular a resulting matrix system,  the greater the impact of 
hardware. 
 Computing systems are fast approaching a time when the non-deterministic paradigm of 
parallelism inherent in multi-cored architectures like the GPU will become common-place.  High 
Performance Computing applications wishing to harness this computational power optimally will 
have to be adjusted as per three categories of factors – software, hardware, and algorithmic.  
134 
 
Computing system environments will continue to evolve but the basic understanding of these 
performance factors will provide solid foundations upon which robust and efficient legacy and 
new computational modeling applications can be developed. 
 Chapters 2 and 3 provided the underlying hardware architectural and software 
algorithmic principles of two separate CPU/GPU computing systems defined in this work as 
System A and System B.  Algorithmic factor adjustments such as switching from a one thread 
per row to one warp per row to solve the sparse matrix system bxA   engender an immediate 
performance boost.  The same statement can be made for software factor adjustments such as 
data structure layout via the CSR to BCSR2x2, as the general improvement in locality mitigated 
the lack of real memory cache inherent to GPU devices.  System B illustrated a distinct hardware 
architectural advantage over System A, providing more than 3-times processing cores as well as 
4-times the number of registers and memory devices that executed on both sides of the clock 
pulse – affecting a double-pumped graphics pipeline.  These initial chapter results were reflected 
for both the single and multiple CPU/GPU computing systems in chapters 4 and 5, with the 
added complexity of MPI communication for the latter. 
 Chapter 4 also produced a computational complexity analysis of the CPU/GPU 
computing system that was used to project the performance of the presented candidate 
application within the context of both System A and System B machine environments.  
Adjusting the software and hardware variables in the complexity equation reflected actual 
performance results to within reasonable limits cohobating the interdependence of software and 
hardware factors of the CPU/GPU computing architecture at a mathematical level. The 
introduction of multiple CPU/GPU computing systems in chapter 5 further advanced the concept 
of these performance factors as the mathematical complexity was shown to be an exponential 
135 
 
factor of the number of SMPs per system utilized.  Chapter 5 also exposed the cost of intra-
nodal and CPU-GPU local host communication as a correlation of the percentage of locally 
defined non-zero elements held by a given GPU device registers and the factors off from the 
calculated ideal parallelism via domain decomposition as multiple processors/nodes – found as a 
negative factor on performance that is cubic in nature.  The performance results determined with 
the presented candidate application can be applied to other computationally intensive HPC 
applications as well.  Chapter 5 also revealed that the less regular input mesh defined by 10FT 
the less effect locality plays with regards to data compression formats – due to smaller likelihood 
of dense sub-matrices that are critical to blocked compression formats e.g., BCSR2x2. 
 The presented candidate application is designed around computational elements built up 
using the FEM methodology, resulting in a sparse matrix system that is a well documented point 
of computational intensity [21, 48, 71, 79, 110, 112].  The solution of systems involving sparse 
matrices is a common paradigm in the HPC modeling applications, all facing the same 
computational dilemma – how to optimally solve these algorithms using modern computing 
environments. Thus, the methodologies presented in this work can be applied to a wide range of 
computationally intensive applications built around sparse matrix systems and their solution in 
the computational modeling analysis. 
 The current popularity of the GPU as a computationally powerful co-processor will 
continue to grow as demand for more powerful machines to execute HPC applications grows – 
this will be exacerbated by the trend in mobile computing.  The on-chip architectures of mobile 
computing tables and smart phones have provided a new and interesting opportunity for GPGPU 
computing – fused addresses [115].  The PCIe CPU-GPU communication bottleneck is well 
documented [21, 48, 71, 79, 110, 112] but a fusing of CPU and GPU on the same chip will likely 
136 
 
change this but will also create some new issues, e.g. memory device I/O.  The fused systems, 
such as AMD APC processor use the slower DDR memory device rather than GDDR of the GPU 
resulting in the unusual situation of an efficient sparse matrix solution but with the opposite 
effect on dense matrix systems [114, 115].   
 The inexorable growth in multi-cored CPUs will also provide more computationally 
intensive power and a unique dynamic will develop as the GPU gets closer to the flexible 
memory structure of the CPU, and vice versa such as processors like Intel’s Sandy Bridge [114].  
The developer wishing to attain optimal performance with these new machines will need to 





1 Boggan, S.K., and Pressel, D.M.: ‘GPUs: An Emerging Platform for General-Purpose 
 Computation’, in Editor (Ed.)^(Eds.): ‘Book GPUs: An Emerging Platform for General-
 Purpose Computation’ (Army Research Lab, 2007, edn.), pp. 50 
2 Fatahalian, K., and Houston, M.: ‘GPUs: A Closer Look’, ACM Queue, 2008, 6, (2), pp. 
 10 
3 Ross, P.E.: ‘Why CPU Frequency Stalled’, IEEE Spectrum, 2008, 45, (4), pp. 1 
4 Tian, D.Z.: ‘Editorial (Moore's Law)’, IEEE Potentials, 2008, 27, (6), pp. 3 
5 Kang, S., Choi, H.J., Kim, C.H., Chung, S.W., Kwon, D., and Na, J.C.: ‘Exploration of 
 CPU/GPU co-execution: from the perspective of performance, energy, and temperature’, 
 in Editor (Ed.)^(Eds.): ‘Book Exploration of CPU/GPU co-execution: from the 
 perspective of performance, energy, and temperature’ (ACM New York, NY, USA 2011, 
 edn.), pp. 38-43 
6 Hong, S., and Kim, H.: ‘An integrated GPU power and performance model’, in Editor 
 (Ed.)^(Eds.): ‘Book An integrated GPU power and performance model’ (ACM New 
 York, NY, USA, 2010, edn.), pp. 280-289 
7 Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., and Hanrahan, 
 P.: ‘Brook for GPUs: Stream Computing on Graphics Hardware’, ACM Transactions on 
 Graphics - Proceedings of ACM SIGGRAPH 2004 2004, 23, (3), pp. 777-786  
8 Rumpf, M., and Strzodka, R.: ‘Graphics Processor Units: New Prospects for Parallel 
 Computing’, in Bruaset, A.M.a.T., Aslak (Ed.): ‘Numerical Solution of Partial 
 Differential Equations on Parallel Computers’ (Springer, 2005), pp. 89-134 
138 
 
9 Silberschatz, A., Galvin, P., and Gagne, G.: ‘Applied Operating System Concepts: 
 windows XP update’ (John Wiley & Sons, Inc., 2003, 1st edn. 2003) 
10 Silberschatz, A., Galvin, P., and Gagne, G.: ‘Applied Operating System Concepts’ (John 
 Wiley & Sons, Inc., 2000, 1 edn. 2000) 
11 Ross, P.E.: ‘Why CPU Frequency Stalled’, IEEE Spectrum, 2008, 45, (4), pp. 72-72 
12 Patterson, D.: ‘Computer Organization and Design : The Hardware/Software Interface’ 
 (Elsevier Science Ltd, 1997. 1997) 
13 Patterson, D.A., and Hennessy, J.L.: ‘Computer Organization & Design: The 
 Hardware/Software Interface’ (Morgan Kaufmann Publishers, Inc., 1998, 2nd edn. 1998) 
14 Liang, Y.: ‘The Use of Parallel Polynomial Preconditioners in the Solution of Systems of 
 Linear Equations’, University of Ulster, 2005 
15 Cheng, H.: ‘Vector Pipelining, Chaining, and Speed on the IBM 3090 and Cray X-MP’, 
 Computer, 1989, 22, (9), pp. 9 
16 Sivasubramaniam, A., Singla, A., Ramachandran, U., and Venkateswaran, H.: ‘Machine 
 Abstractions and Locality Issues in Studying Parallel Systems’, in Editor (Ed.)^(Eds.): 
 ‘Book Machine Abstractions and Locality Issues in Studying Parallel Systems’ (Georgia 
 Institute of Technology, 1993, edn.), pp.  
17 Thomas, S.: ‘Preconditioned Conjugate Gradient Methods for Semiconductor Device 
 Simulation on a CRAY C90 Vector Processor’. Proc. VECPAR '96 Selected papers from 




18 Thomas, S.: ‘Preconditioned conjugate gradient methods for semiconductor device 
 simulation on a CRAY C90 vector processor’, Vector and Parallel Processing — 
 VECPAR'96, 1997, 1215 
19 Tagaya, S., Nishida, M., Hagiwara, T., Yanagawa, T., Yokoya, Y., Takahara, H., stadler, 
 J.o., and Galle, M.: ‘The NEC SX-8 Vector Supercomputer System’: ‘High Performance 
 Computing on Vector Systems’ (Springer-Verlag Berlin, Heidelberg, 2006), pp. Part-1, 
 3-24 
20 Bustamam, A., Burrage, K., and Hamilton, N.A.: ‘Fast Parallel Markov Clustering in 
 Bioinformatics Using Massively Parallel Computing on GPU with CUDA and 
 ELLPACK-R Sparse Format’, IEEE/ACM Trans Comput Biol Bioinform., 2012, 3, (9), 
 pp. 13 
21 Corrigan, A., Camelli, F.F., Lohner, R., and Wallin, J.: ‘Running unstructured grid-based 
 CFD solvers on modern graphics hardware’, International Journal for Numerical Methods 
 in Fluids, 2011, 66, (2), pp. 221-229 
22 Grozea, C., Bankovic, Z., and Laskov, P.: ‘FPGA vs. Multi-Core CPUs vs. GPUs: 
 Hands-on Experience with a Sorting Application’: ‘Facing the multicore-challenge ’ 
 (Springer-Verlag Berlin, Heidelberg 2010), pp. 105-117 
23 Hamada, T., Narumi, T., Yasuoka, K., Nitadori, K., and Taiji, M.: ‘42 TFlops 
 Hierarchical N-body Simulations on GPUs with Applications in both Astrophysics and 
 Turbulence’, in Editor (Ed.)^(Eds.): ‘Book 42 TFlops Hierarchical N-body Simulations 
 on GPUs with Applications in both Astrophysics and Turbulence’ (ACM New York, NY, 
 USA, 2009, edn.), pp. 12 
24 Luebke, D., and Humphreys, G.: ‘How GPUs Work’, IEEE Computer, 2007, 2007 
140 
 
25 Luebke, D.: ‘CUDA: Scalable parallel programming for high-performance scientific 
 computing’, in Editor (Ed.)^(Eds.): ‘Book CUDA: Scalable parallel programming for 
 high-performance scientific computing’ (ACM New York, NY, USA 2008, edn.), pp. 836 
 - 838  
26 El-Ghazawi, T.A., Cantonnet, F., Yao, Y., Annareddy, S., and Mohamed, A.S.: 
 ‘Benchmarking parallel compilers: A UPC case study’, Future Generation Computer 
 Systems - Systems performance analysis and evaluation 2006, 22, (7), pp. 11 
27 Polze, A., and Troger, P.: ‘Trends and challenges in operating systems—from parallel 
 computing to cloud computing’, Concurrency and Computation: Practice & Experience, 
 2012, 24, (7), pp. 10 
28 Wilkinson, B., and Allen, M.: ‘Parallel Programming: Techniques and Applications 
 Using Networked Workstations and Parallel Computers’ (Pearson Education, Inc., 2005, 
 2nd edn. 2005) 
29 Haney, R.H.: ‘Study and Evaluation of Domain Decomposition Approaches in two 
 Parallel Software Code Developments for Process Flow Modeling in Liquid Composite 
 Molding’, North Carolina A & T State University, 2006 
30 Mohan, D.R., Shires, D., and Mark, A.: ‘Scalable Large Scale Process Modeling and 
 Simulations in Liquid Composite Molding’: ‘Computational Science - ICCS 2001’ 
 (Springer Berlin Heidelberg, 2001), pp. 1199-1208 
31 Angel, E., and Shreiner, D.: ‘An introduction to shader-based OpenGL programming’, in 
 Editor (Ed.)^(Eds.): ‘Book An introduction to shader-based OpenGL programming’ 
 (ACM, 2009, edn.), pp.  
141 
 
32 Udupa, A., Govindarajan, R., and Thazhuthaveetil, M.J.: ‘Software Pipelined Execution 
 of Stream Programs on GPUs’, in Editor (Ed.)^(Eds.): ‘Book Software Pipelined 
 Execution of Stream Programs on GPUs’ (IEEE Computer Society Washington, DC, 
 USA, 2009, edn.), pp. 200-209  
33 Chen, J.X., and Wegman, E.J.: ‘Foundations of 3D Graphics Programming: Using JOGL 
 and Java3D’ (Springer-Verlag London Limited, 2006, 1 edn. 2006) 
34 Hearn, D., and Baker, P.M.: ‘Computer Graphics C version - 2nd Edition’ (Pearson 
 Education, 1997, 1996, 2nd edn. 1996) 
35 http://www.opengl-tutorial.org, accessed 12-02-2012 2012 
36 Leeser, M., Yablonski, D., Brooks, D., and King, L.S.: ‘The Challenges of Writing 
 Portable, Correct and High Performance Libraries for GPUs’, ACM SIGARCH 
 Computer Architecture News, 2011, 39, (4), pp. 5 
37 Nvidia: ‘CUDA C BEST PRACTICES GUIDE’, in Editor (Ed.)^(Eds.): ‘Book CUDA C 
 BEST PRACTICES GUIDE’ (Nvidia Corporation, 2011, edn.), pp. 76 
38 Nvidia: ‘PARALLEL THREAD EXECUTION ISA VERSION 3.1’, in Editor 
 (Ed.)^(Eds.): ‘Book PARALLEL THREAD EXECUTION ISA VERSION 3.1’ (Nvidia 
 Corporation, 2012, edn.), pp. 241 
39 Ayanda, D., and Adejumo, Y.: ‘A Prototype Model of High Performance Computing 
 Using Beowulf Cluster’, International Journal of Emerging Sciences, 2011, 1, (4), pp. 
 696-705 
40 Goddeke, D., Wobker, H., Strzodka, R., Mohd-Yusof, J., McCormick, P., and Turek, S.: 
 ‘Co-Processor Acceleration of an Unmodified Parallel Solid Mechanics Code with 
142 
 
 FeastGPU’, International Journal of Computational Science and Engineering, 2009, 4, 
 (4), pp. 254-269 
41 Lu, F., Song, J., Yin, F., and Zhu, X.: ‘Performance evaluation of hybrid programming 
 patterns for large CPU/GPU heterogeneous clusters’, Computer Physics 
 Communications, 2011, 183, (6), pp. 1172-1181 
42 Mohan, R., Ngo, N.D., Tamma, K.K., and Fickie, K.D.: ‘Three-Dimensional Resin 
 Transfer Molding Process: Developments for Thick Composite Manufacturing 
 Applications’, in Editor (Ed.)^(Eds.): ‘Book Three-Dimensional Resin Transfer Molding 
 Process: Developments for Thick Composite Manufacturing Applications’ (U.S. Army 
 Research Laboratory, 1996, edn.), pp. 32 
43 Mohan, R.V., Ngo, N.D., Tamma, K.K., and Fickie, K.D.: ‘On a Pure Finite-Element-
 Based Methodology for Resin Transfer Mold Filling Simulations’, in Editor (Ed.)^(Eds.): 
 ‘Book On a Pure Finite-Element-Based Methodology for Resin Transfer Mold Filling 
 Simulations’ (Army Research Lab, 1996, edn.), pp. 18 
44 Chandrupatla, T.R., and Belegundu, A.D.: ‘Introduction To Finite Elements In 
 Engineering’ (Prentice-Hall, Inc., 2002, 3 edn. 2002) 
45 Rao, S.S.: ‘Applied Numerical Methods For Engineers And Scientists’ (Prentice-Hall, 
 Inc., 2002, 1st edn. 2002) 
46 Shewchuk, J.R.: ‘An Introduction to the Conjugate Gradient Method Without the 
 Agonizing Pain’, in Editor (Ed.)^(Eds.): ‘Book An Introduction to the Conjugate 
 Gradient Method Without the Agonizing Pain’ (Carnegie Mellon University, 1994, edn.), 
 pp. 64 
143 
 
47 Mohan, D.R., Ngo, N.D., and Tamma, K.K.: ‘On a pure finite-element-based 
 methodology for resin transfer molds filling simulations’, Polymer Engineering & 
 Science, 1999, 39, (1), pp. 26-43 
48 Baskaran, M.M., and Bordawekar, R.: ‘Optimizing Sparse Matrix-Vector Multiplication 
 on GPUs’, in Editor (Ed.)^(Eds.): ‘Book Optimizing Sparse Matrix-Vector Multiplication 
 on GPUs’ (IBM Research, 2008, edn.), pp. 11 
49 Bell, N., and Garland, M.: ‘Effcient Sparse Matrix-Vector Multiplication on CUDA’, in 
 Editor (Ed.)^(Eds.): ‘Book Effcient Sparse Matrix-Vector Multiplication on CUDA’ 
 (NVIDIA Corporation, 2008, edn.), pp. 32 
50 Buatois, L., Caumon, G., and Levy, B.: ‘Concurrent number cruncher: a GPU 
 implementation of a general sparse linear solver’, International Journal of Parallel, 
 Emergent and Distributed Systems, 2009, 24, (3), pp. 18 
51 Helfenstein, R., and Koko, J.: ‘Parallel preconditioned conjugate gradient algorithm on 
 GPU’, Journal of Computational and Applied Mathematics, 2011, 236, (15), pp. 6 
52 Shahnaz, R., Usman, A., and Chughtai, I.R.: ‘Review of Storage Techniques for Sparse 
 Matrices’, in Editor (Ed.)^(Eds.): ‘Book Review of Storage Techniques for Sparse 
 Matrices’ (IEEE, 2005, edn.), pp. 1-7 
53 Hugues, M.R., and Petiton, S.G.: ‘Sparse Matrix Formats Evaluation and Optimization on 
 a GPU’, in Editor (Ed.)^(Eds.): ‘Book Sparse Matrix Formats Evaluation and 
 Optimization on a GPU’ (IEEE Computer Society Washington, DC, USA, 2010, edn.), 
 pp. 122-129 
54 Rehman, M.S.: ‘Exploring Irregular Memory Access Applications on the GPU’, 
 International Institute of Information Technology, 2010 
144 
 
55 De Jong, M.A.: ‘Developing a CUDA solver for large sparse matrices for MARIN’. 
 Master thesis, Delft University of Technology, 2012 
56 Corporation, A.: ‘AMD Family 10h Server and Workstation Processor Power and 
 Thermal Data Sheet’, in Editor (Ed.)^(Eds.): ‘Book AMD Family 10h Server and 
 Workstation Processor Power and Thermal Data Sheet’ (2010, edn.), pp. 98 
57 Corporation, I.: ‘Intel® Xeon® Processor 5600 Series: The Next Generation of 
 Intelligent Server Processors’, in Editor (Ed.)^(Eds.): ‘Book Intel® Xeon® Processor 
 5600 Series: The Next Generation of Intelligent Server Processors’ (Intel Corporation, 
 2010, edn.), pp. 8 
58 Corporation, N.: ‘NVIDIA Quadro® FX 5600 Datasheet’, in Editor (Ed.)^(Eds.): ‘Book 
 NVIDIA Quadro® FX 5600 Datasheet’ (2008, edn.), pp. 2 
59 Corporation, N.: ‘TESLA M2050 AND TESLA M2070/M2070Q DUAL-SLOT 
 COMPUTING PROCESSOR MODULES’, in Editor (Ed.)^(Eds.): ‘Book TESLA 
 M2050 AND TESLA M2070/M2070Q DUAL-SLOT COMPUTING PROCESSOR 
 MODULES’ (Nvidia Corporation, 2010, edn.), pp. 18 
60 Nvidia: ‘NVIDIA GeForce 8800 Architecture Technical Brief’, in Editor (Ed.)^(Eds.): 
 ‘Book NVIDIA GeForce 8800 Architecture Technical Brief’ (Nvidia Corporation, 2006, 
 edn.), pp. 55 
61 Nvidia: ‘NVIDIA’s Next Generation CUDA Compute Architecture: Fermi - Whitepaper’, 
 in Editor (Ed.)^(Eds.): ‘Book NVIDIA’s Next Generation CUDA Compute Architecture: 
 Fermi - Whitepaper’ (NVIDIA Corporation, 2009, edn.), pp. 22 
145 
 
62 Nvidia: ‘NVIDIA CUDA C Programming Guide: Version 4.2’, in Editor (Ed.)^(Eds.): 
 ‘Book NVIDIA CUDA C Programming Guide: Version 4.2’ (Nvidia Corporation, 2012, 
 edn.), pp. 173 
63 Komatitsch, D., Michea, D., and Erlebacher, G.: ‘Porting a high-order finite-element 
 earthquake modeling application to NVIDIA graphics cards using CUDA’, Journal of 
 Parallel Computing, 2009, 69, (5), pp. 9 
64 Kuznik, F., Obrecht, C., Rusaouen, G., and Roux, J.-J.: ‘LBM based flow simulation 
 using GPU computing processor’, Computers & Mathematics with Applications, 2010, 
 59, (7), pp. 12 
65 Garland, M., Le Grand, S., Nickolls, J., Anderson, J., Hardwick, J., Morton, S., Phillips, 
 E., Zhang, Y., and Volkov, V.: ‘Parallel Computing Experiences with CUDA ’, Micro, 
 IEEE 2008, 28, (4), pp. 13-27 
66 Parakh, A.: ‘Performance estimation and application mapping on different GPUs’. Proc. 
 HiPC - High Performance Computing Confrence, Pune, INDIA, December 18-21, 2012 
 2012 pp. Pages 
67 Bernaschi, M., Bisson, M., and Rossetti, D.: ‘Benchmarking of communication 
 techniques for GPUs’, Journal of Parallel and Distributed Computing, 2013, 73, (2), pp. 5 
68 Micikevicius, P.: ‘3D Finite Difference Computation on GPUs using CUDA’, in Editor 
 (Ed.)^(Eds.): ‘Book 3D Finite Difference Computation on GPUs using CUDA’ (ACM 
 New York, NY, USA, 2009, edn.), pp.  
69 Papadrakakis, M., Stavroulakis, G., and Karatarakis, A.: ‘A new era in scientific 
 computing: Domain decomposition methods in hybrid CPU–GPU architectures’, 
146 
 
 Computer Methods in Applied Mechanics and Engineering, 2011, 200, (13-16), pp. 1490-
 1508 
70 Di, P., Wu, H., Xue, J., Wang, F., and Yang, C.: ‘Parallelizing SOR for GPGPUs using 
 alternate loop tiling’, Journal of Parallel Computing, 2012, 38, (6-7), pp. 18 
71 Goddeke, D., Strzodka, R., and Turek, S.: ‘Performance and accuracy of hardware-
 oriented native-, emulated- and mixed-precision solvers in FEM simulations’, 
 International Journal of Parallel, Emergent and Distributed Systems 2007, 22, (4), pp. 
 221-256  
72 Kothapalli, K., Mukherjee, R., Rehman, M.S., Patidar, S., Narayanan, P.J., and Srinathan, 
 K.: ‘A Performance Prediction Model for the CUDA GPGPU Platform’, in Editor 
 (Ed.)^(Eds.): ‘Book A Performance Prediction Model for the CUDA GPGPU Platform’ 
 (IEEE, 2009, edn.), pp. 463 - 472  
73 Lukash, M., and Rupp, K.: ‘Sparse Approximate Inverse Preconditioners for Iterative 
 Solvers on GPUs’, in Editor (Ed.)^(Eds.): ‘Book Sparse Approximate Inverse 
 Preconditioners for Iterative Solvers on GPUs’ (Society for Computer Simulation 
 International 2012, edn.), pp.  
74 Huo, X., Ravi, V., Ma, W., and Agrawal, G.: ‘An Execution Strategy and Optimized 
 Runtime Support for Parallelizing Irregular Reductions on Modern GPUs’, in Editor 
 (Ed.)^(Eds.): ‘Book An Execution Strategy and Optimized Runtime Support for 
 Parallelizing Irregular Reductions on Modern GPUs’ (ACM New York, NY, USA, 2011, 
 edn.), pp. 2-11 
147 
 
75 Wald, I.: ‘Active thread compaction for GPU path tracing’, in Editor (Ed.)^(Eds.): ‘Book 
 Active thread compaction for GPU path tracing’ (ACM New York, NY, USA, 2011, 
 edn.), pp. 51-58 
76 Hewlett-Packard Development Company, L.P.: ‘QuickSpecs NVIDIA Quadro FX 5600 
 PCIe Graphics Card’, in Editor (Ed.)^(Eds.): ‘Book QuickSpecs NVIDIA Quadro FX 
 5600 PCIe Graphics Card’ (2009, edn.), pp. 4 
77 Gou, C., and Gaydadjiev, G.N.: ‘Elastic Pipeline: Addressing GPU On-chip Shared 
 Memory Bank Conflicts’, in Editor (Ed.)^(Eds.): ‘Book Elastic Pipeline: Addressing 
 GPU On-chip Shared Memory Bank Conflicts’ (ACM New York, NY, USA, 2011, edn.), 
 pp.  
78 Nvidia: ‘CUDA CUBLAS Library: Version 1.1’, in Editor (Ed.)^(Eds.): ‘Book CUDA 
 CUBLAS Library: Version 1.1’ (Nvidia Corporation, 2007, edn.), pp. 84 
79 Corporation, N.: ‘CUDA CUSPARSE Users Guide’, in Editor (Ed.)^(Eds.): ‘Book 
 CUDA CUSPARSE Users Guide’ (Nvidia Corporation, 2012, v5.0 edn.), pp. 123 
80 Bahi, J.M., Couturier, R., and Khodja, L.Z.: ‘Parallel GMRES implementation for 
 solving sparse linear systems on GPU clusters’, in Editor (Ed.)^(Eds.): ‘Book Parallel 
 GMRES implementation for solving sparse linear systems on GPU clusters’ (Society for 
 Computer Simulation International San Diego, CA, USA, 2011, edn.), pp. 12-19 
81 Kruger, J., and Westermann, R.: ‘Linear Algebra Operators for GPU Implementation of 
 Numerical Algorithms’, ACM Transactions on Graphics (TOG) - Proceedings of ACM 
 SIGGRAPH 2003, 2003, 22, (3), pp. 8 
82 Holk, E., Byrd, W., Mahajan, N., Wilcock, J., Chauhan, A., and Lumsdaine, A.: 
 ‘Declarative Parallel Programming for GPUs’: ‘Advances in Parallel Computing, Volume 
148 
 
 22:  Applications, Tools and Techniques on the Road to Exascale Computing’ (IOS 
 Press, 2011) 
83 Kahan, W.: ‘IEEE Standard 754 for Binary Floating-Point Arithmetic’, in Editor 
 (Ed.)^(Eds.): ‘Book IEEE Standard 754 for Binary Floating-Point Arithmetic’ 
 (University of California Berkeley CA, USA, 1997, edn.), pp. 30 
84 Dimitrov, M., Mantor, M., and Zhou, H.: ‘Understanding Software Approaches for 
 GPGPU Reliability’, in Editor (Ed.)^(Eds.): ‘Book Understanding Software Approaches 
 for GPGPU Reliability’ (ACM New York, NY, USA, 2009, edn.), pp. 94-104 
85 Hillesland, K., and Lastra, A.: ‘GPU floating-point paranoia’. Proc. ACM Workshop on 
 General Purpose Computing on Graphics Processors In ACM Workshop on General 
 Purpose Computing on Graphics Processors 2004 2004 pp. Pages 
86 Whitehead, N., and Fit-Florea, A.: ‘Precision & Performance: Floating Point and IEEE 
 754 Compliance for NVIDIA GPUs’, in Editor (Ed.)^(Eds.): ‘Book Precision & 
 Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs’ (Nvidia, 
 2011, edn.), pp. 7 
87 Castaldo, A.M.: ‘ERROR ANALYSIS OF VARIOUS FORMS OF FLOATING POINT 
 DOT PRODUCTS’, The University of Texas at San Antonio, 2007 
88 Volkov, V., and Demmel, J.W.: ‘Benchmarking GPUs to tune dense linear algebra’. Proc. 
 International Conference for High Performance Computing, Networking, Storage and 
 Analysis, 2008. SC 2008., Austin, TX, USA, 15-21 Nov. 2008 2008 pp. Pages 
89 Mohan, R.V., Ngo, N.D., and Tamma, K.K.: ‘On a Pure Finite Element Methodology for 
 Resin Transfer Mold Filling Simulations’, Polymer Engineering and Science, 1999, 39, 
 pp. 26-43 
149 
 
90 Fatahalian, K., Sugerman, J., and Hanrahan, P.: ‘Understanding the Effciency of GPU 
 Algorithms for Matrix-Matrix Multiplication’, in Editor (Ed.)^(Eds.): ‘Book 
 Understanding the Effciency of GPU Algorithms for Matrix-Matrix Multiplication’ 
 (ACM New York, NY, USA 2004, edn.), pp. 133-137  
91 Cormen, T.H., Leiserson, C.E., Rivest, R.L., and Stein, C.: ‘Introduction To Algorithms’ 
 (The MIT Press, 2001, 2nd edn. 2001) 
92 Resios, A.: ‘GPU performance prediction using parametrized models’. Masters, Utrecht 
 University, 2011 
93 Zhang, Y., and Owens, J.D.: ‘A quantitative performance analysis model for GPU 
 architectures’. Proc. High Performance Computer Architecture (HPCA), 2011 IEEE 17th 
 International Symposium on San Antonio, TX, 12-16 Feb. 2011 2011 pp. Pages 
94 Wang, H., Potluri, S., Luo, M., Singh, A.K., Sur, S., and Panda, D.K.: ‘MVAPICH2-
 GPU: optimized GPU to GPU communication for InfiniBand clusters’, Computer 
 Science - Research and Development, 2011, 26, (3-4), pp. 257-266 
95 Karunadasa, N.P., and Ranasinghe, D.N.: ‘Accelerating High Performance Applications 
 with CUDA and MPI’. Proc. 2009 International Conference on Industrial and 
 Information Systems (ICIIS), Sri Lanka, December 28-31, 2009 2009 pp. Pages 
96 Tipparaju, V., and Vetter, J.S.: ‘GA-GPU: Extending a Library-based Global Address 
 Space Programming Model for Scalable Heterogeneous Computing Systems’, in Editor 
 (Ed.)^(Eds.): ‘Book GA-GPU: Extending a Library-based Global Address Space 
 Programming Model for Scalable Heterogeneous Computing Systems’ (ACM New York, 
 NY, USA 2012, edn.), pp. 53-64 
150 
 
97 Wang, L., Huang, M., Narayana, V., and El-Ghazawi, T.A.: ‘Scaling Scientific 
 Applications on Clusters of Hybrid Multicore/GPU Nodes’, in Editor (Ed.)^(Eds.): ‘Book 
 Scaling Scientific Applications on Clusters of Hybrid Multicore/GPU Nodes’ (ACM New 
 York, NY, USA, 2011, edn.), pp.  
98 Wilkinson, B., and Allen, M.: ‘Parallel Programming: Techniques and Applications 
 Using Networked Workstations and Parallel Computers’ (Pearson Prentice Hall, 2005, 2 
 edn. 2005) 
99 Sun Microsystems, I.: ‘Sun HPC ClusterTools 8.2 (software tools)’, in Editor 
 (Ed.)^(Eds.): ‘Book Sun HPC ClusterTools 8.2 (software tools)’ (Oracle, Inc., 2009, 
 edn.), pp.  
100 Geist, G.A., Kohl, J.A., and Papadopoulos, P.M.: ‘PVM and MPI: A comparison of 
 features’, Calculateurs Paralleles, 1996, 8, (2) 
101 Song, J.P., and Shires, D.: ‘Central Processing Unit/Graphics Processing Unit 
 (CPU/GPU) Hybrid Computing of Synthetic Aperture Radar Algorithm’, in Editor 
 (Ed.)^(Eds.): ‘Book Central Processing Unit/Graphics Processing Unit (CPU/GPU) 
 Hybrid Computing of Synthetic Aperture Radar Algorithm’ (U.S. Army Research 
 Laboratory, 2010, edn.), pp.  
102 Khajeh-Saeed, A., and Perot, J.B.: ‘Computational Fluid Dynamics Simulations Using 
 Many Graphics Processors’, Computing in Science & Engineering, 2011, 14, (3), pp. 10-
 19 
103 Komatitsch, D., Michea, D., and Erlebacher, G.: ‘Porting a high-order finite-element 
 earthquake modeling application to NVIDIA graphics cards using CUDA’, Journal of 
 Parallel and Distributed Computing, 2009, 69, (5), pp. 451-460 
151 
 
104 Fengshun, L., Song, J., Yin, F., and Zhu, X.: ‘Performance evaluation of hybrid 
 programming patterns for large CPU/GPU heterogeneous clusters’, Computer Physics 
 Communications, 2012, 183, (6), pp. 1172-1181 
105 Ji, F., Ajiy, A.M., Dinanz, J., Buntinasz, D., Balajiz, P., Fengy, W.-c., and Ma, X.: 
 ‘Efficient Intranode Communication in GPU-Accelerated Systems’. Proc. Parallel and 
 Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 
 26th International Shanghai 21-25 May 2012 2012 pp. Pages 
106 Belviranli, M.E., Bhuyan, L.N., and Gupta, R.: ‘A Dynamic Self-Scheduling Scheme for 
 Heterogeneous Multiprocessor Architectures’, ACM Transactions on Architecture and 
 Code Optimization (TACO) - Special Issue on High-Performance Embedded 
 Architectures and Compilers, 2013, 9, (4), pp. 19 
107 Filipovic, J., Peterlik, I., and Fousek, J.: ‘GPU Acceleration of Equations Assembly in 
 Finite Elements Method – Preliminary Results’, SAAHPC: Symposium on Application 
 Accelerators in HPC, 2009 
108 Fousek, J., Filipovic, J., and Madzin, M.: ‘Automatic Fusions of CUDA-GPU Kernels for 
 Parallel Map’, ACM SIGARCH Computer Architecture News, 2011, 39, (4), pp. 1 
109 Khajeh-Saeed, A., and Perot, J.B.: ‘Computational Fluid Dynamics Simulations Using 
 Many Graphics Processors’, Computing in Science & Engineering, 2012, 14, (3), pp. 9 
110 Goddeke, D., Strzodka, R., Mohd-Yusof, J., McCormick, P., Buijssen, S.H.M., 
 Grajewski, M., and Turek, S.: ‘Exploring weak scalability for FEM calculations on a 
 GPU-enhanced cluster’, Journal of Parallel Computing, 2007, 33, (10-11), pp. 685-699 
111 Samuels, M.L., and Witmer, J.A.: ‘Statistics for the Life Sciences’ (Pearson Education, 
 Inc., 2003, 3 edn. 2003) 
152 
 
112 Che, S., Sheaffer, J.W., and Skadron, K.: ‘Dymaxion: Optimizing Memory Access 
 Patterns for Heterogeneous Systems’, in Editor (Ed.)^(Eds.): ‘Book Dymaxion: 
 Optimizing Memory Access Patterns for Heterogeneous Systems’ (ACM New York, NY, 
 USA, 2011, edn.), pp.  
113 Steinberger, M., Kainz, B., Kerbl, B., Hauswiesner, S., Kenzel, M., and Schmalstieg, D.: 
 ‘Softshell: Dynamic Scheduling on GPUs’, ACM Transactions on Graphics (TOG) - 
 Proceedings of ACM SIGGRAPH Asia 2012, 2012, 31, (6), pp. 12 
114 Solutions, A.E.: ‘Intel Sandy Bridge Brings Many Benefits the PC/104 Form Factor’, in 
 Editor (Ed.)^(Eds.): ‘Book Intel Sandy Bridge Brings Many Benefits the PC/104 Form 
 Factor’ (Embedded Solutions, 2011, edn.), pp. 5 
115 Spafford, K.L., Meredith, J.S., Lee, S., Li, D., Roth, P.C., and Vetter, J.S.: ‘The Tradeoffs 
 of Fused Memory Hierarchies in Heterogeneous Computing Architectures’, in Editor 
 (Ed.)^(Eds.): ‘Book The Tradeoffs of Fused Memory Hierarchies in Heterogeneous 






 The CUDA Kernel code and associated functions and structures for the execution of 
sparse matrix-vector multiplication discussed in chapter 3 are presented below. 
 








#define BLOCK_SIZE 16 
 
// HOST-Side C-Code interface for generalized matrix multiplication operation of the form  
// 'Ax = b' using CRS compression. 
extern "C" void mul_multBlocks_CRS(float *val, unsigned int vLength,  unsigned int *rp, 
  unsigned int rpLength, unsigned int *cp, unsigned int cpLength,  float *b,  
  float *x, unsigned int m, unsigned int n, unsigned int p, float &time); 
 
// HOST-Side C-Code interface for generalized matrix multiplication operation of the form  
// 'Ax = b' using BCRS 2x2 compression. 
extern "C" void mul_multBlocks_BCRS(float *val, unsigned int vLength, unsigned int *rp, 
  unsigned int rpLength, unsigned int *cp, unsigned int cpLength, float *b,  
  float *x, unsigned int m, unsigned int n, unsigned int p, float &time); 
 
// GPU-Code for generalized matrix multiplication operation of the form 'Ax = b' for CSR 
// format. 
__global__ void mul_multipleblocks_CRS(float *val, float *b, float *x, uint2 *rp,  
  unsigned int *cp, unsigned int m, unsigned int n); 
 
// GPU-Code for generalized matrix multiplication operation of the form 'Ax = b' for BCRS 2x2 
// format. 
__global__ void mul_multipleblocks_BCRS(float4 *val, float2 *b, float2 *x, uint2 *rp,  
  unsigned int *cp, unsigned int m, unsigned int n); 
 
// Computes the current THREAD index. 
__device__ unsigned int compute_thread_index() { 
 return (blockIdx.x*BLOCK_SIZE*BLOCK_SIZE  
  + blockIdx.y*BLOCK_SIZE*BLOCK_SIZE*gridDim.x 





void mul_multBlocks_CRS(float *val, unsigned int vLength, unsigned int *rp, 
  unsigned int rpLength, unsigned int *cp, unsigned int cpLength,float *b, float *x, 
  unsigned int m, unsigned int n, unsigned int p, float &time) { 
 // Timing this operation. 
 cudaEvent_t start, stop;  time = 0.0f; 
 
 // Initialize EVENT Timers - CUDA. 
 cudaEventCreate(&start);   cudaEventCreate(&stop); 
 
 // Variables to be placed on GPU. 
 float *val_d = NULL;  float *b_d = NULL; 
 float *x_d = NULL;     uint2 *rp_d = NULL; 
 unsigned int *cp_d  = NULL; 
 
 // Compute ROW "pointer" BOUNDS to be pushed on the GPU 
 uint2 *cpu_rp =  new uint2[rpLength - 1]; 
 { 
  for(unsigned int i = 0; i < rpLength - 1; i++)  
  { 
   cpu_rp[i].x = rp[i]; 
   cpu_rp[i].y = rp[i + 1]; 
  } 
 } 
   
 // Allocate and initialize values for GPU 
 cudaMalloc((void**)&val_d, vLength*sizeof(float)); 
 cudaMalloc((void**)&x_d, m*p*sizeof(float)); 
 cudaMalloc((void**)&b_d, n*p*sizeof(float)); 
 cudaMalloc((void**)&cp_d, cpLength*sizeof(unsigned int)); 
 cudaMalloc((void**)&rp_d, rpLength*sizeof(uint2)); 
 
 cudaMemcpy(cp_d, cp, cpLength*sizeof(unsigned int), cudaMemcpyHostToDevice); 
 cudaMemcpy(rp_d, cpu_rp, rpLength*sizeof(uint2), cudaMemcpyHostToDevice); 
 cudaMemcpy(val_d, val,  vLength*sizeof(float), cudaMemcpyHostToDevice); 
 cudaMemcpy(b_d, b, n*p*sizeof(float), cudaMemcpyHostToDevice); 
 
 // Calculate dimensions for GPU Device 
 dim3 grid;  dim3 block; 
 grid.x = (unsigned int)(sqrt((float)n)/BLOCK_SIZE + 1); 
 grid.y = (unsigned int)(sqrt((float)n)/BLOCK_SIZE + 1); 
 block.x = BLOCK_SIZE;  block.y = BLOCK_SIZE; 
 
  




 // Kernel call 
 mul_multipleblocks_CRS<<<grid, block>>>(val_d, b_d, x_d, rp_d, cp_d, m, n); 
  
 // "Record" the stopping of this EVENT - i.e. return from kernel call. 
 cudaEventRecord(stop, 0);   cudaEventSynchronize(stop); 
 
 // Get the amount of time elapsed (in milliseconds) and DESTROY the CUDA timer 
 // objects. 
 cudaEventElapsedTime(&time, start, stop);  
 cudaEventDestroy(start);   cudaEventDestroy(stop); 
 
 // Retrieve results pointed by 'x_d' 
 cudaMemcpy(x, x_d, m*p*sizeof(float), cudaMemcpyDeviceToHost); 
 
 // Free Memory - CPU. 
 delete [] cpu_rp; 
  








void mul_multBlocks_BCRS(float *val, unsigned int vLength, unsigned int *rp, 
  unsigned int rpLength, unsigned int *cp, unsigned int cpLength, float *b, float *x, 
  unsigned int m, unsigned int n, unsigned int p, float &time) { 
 // Timing this operation. 
 cudaEvent_t start, stop;  time = 0.0f; 
 
 // Initialize EVENT Timers - CUDA. 
 cudaEventCreate(&start);  cudaEventCreate(&stop); 
 
 // Variables to be placed on GPU. 
 float4 *val_d = NULL;  float2 *x_d = NULL; 
 float2 *b_d = NULL;  uint2 *rp_d = NULL; 
 unsigned int *cp_d = NULL; 
 
 // Compute ROW "pointer" BOUNDS to be pushed on the GPU. 
 uint2 *cpu_rp =  new uint2[rpLength - 1]; 
 { 
  for(unsigned int i = 0; i < rpLength - 1; i++)  
  { 
   cpu_rp[i].x = rp[i]; 
   cpu_rp[i].y = rp[i + 1]; 
156 
 
  } 
 } 
  
 // Allocate and initialize values for GPU.  
 cudaMalloc((void**)&val_d, vLength*sizeof(float4)); 
 cudaMalloc((void**)&x_d, m*p*sizeof(float2)); 
 cudaMalloc((void**)&b_d, n*p*sizeof(float2)); 
 cudaMalloc((void**)&cp_d, cpLength*sizeof(unsigned int)); 
 cudaMalloc((void**)&rp_d, rpLength*sizeof(uint2)); 
 
 cudaMemcpy(cp_d, cp, cpLength*sizeof(unsigned int), cudaMemcpyHostToDevice); 
 cudaMemcpy(rp_d, cpu_rp, rpLength*sizeof(uint2), cudaMemcpyHostToDevice); 
 cudaMemcpy(val_d, val,  vLength*sizeof(float), cudaMemcpyHostToDevice); 
 cudaMemcpy(b_d, b, n*p*sizeof(float), cudaMemcpyHostToDevice); 
 
 // Calculate dimensions for GPU device. 
 dim3 grid;  dim3 block; 
 grid.x = (unsigned int)(sqrt((float)n/2.0f)/BLOCK_SIZE + 1); 
 grid.y = (unsigned int)(sqrt((float)n/2.0f)/BLOCK_SIZE + 1); 
 block.x = BLOCK_SIZE;  block.y = BLOCK_SIZE; 
 
 cudaEventRecord(start, 0); 
 
 // Kernel call. 
 mul_multipleblocks_BCRS<<<grid, block>>>(val_d, b_d, x_d, rp_d, cp_d, m, n); 
 
 // "Record" the stopping of this EVENT - i.e. return from kernel call. 
 cudaEventRecord(stop, 0);   cudaEventSynchronize(stop); 
 
 // Get the amount of time elapsed (in milliseconds) and DESTROY the CUDA 
 // timer objects. 
 cudaEventElapsedTime(&time, start, stop);   cudaEventDestroy(start);  
 cudaEventDestroy(stop); 
 
 // Retrieve results pointed by 'x_d' 
 cudaMemcpy(x, x_d, m*p*sizeof(float), cudaMemcpyDeviceToHost); 
 
 // Free Memory - CPU. 
 delete [] cpu_rp; 










C/C++ Matrix Class 
 
template <class T>  class Matrix { 
private: 
 T *data_; 
 
 // Number of Rows and Cols. 
 unsigned int m_;  unsigned int n_; 
 
 // Matrix Compression Format. 
 FORMAT_TYPE type_; 
 
 // I-Index (e.g. ROW "pointer" in CRS). 
 unsigned int *rowptr_; 
 
 // J-Index (e.g. COLUMN "pointer" in CRS). 
 unsigned int *colind_; 
 
 // First Index into Sub-Block for BCRS (i.e. 2x2 sub-blocks) Format. 
 unsigned int *nzptr_; 
 
 // Non-Zero(s) from ORIGINAL matrix for CRS (i.e. 1x1 Sub-Blocks). 
 T *val_; 
 
 // The LENGTH of 'val',  'colind_', 'rowptr_', and  'nzptr' Vectors respectively. 
 unsigned int vLength_;  unsigned int cLength_;  unsigned int rLength_; 
 
 // Compute the total number of Non-Zeros in this matrix. 
 unsigned int numNNZ(); 
 
 // Compress current matrix to CRS Format.(i.e. 1x1 Block.) 
 void compressCRS(); 
 
 // Compress current matrix to BCRS Format.(i.e. 2x2 Block.) 
 void compressBCRS(); 
 
 // CPU-Based Matrix-Vector Multiplication(s) using CRS, BCRS (2x2), and 
 // NO Compression. 
 void matVecMultCRS(T *b, T *x, unsigned int n); 
 void matVecMultBCRS(T *b, T *x, unsigned int n); 
 void matVecMultNONE(T *b, T *x, unsigned int n); 
         void matMatMult_NONE(T *b, T *x); 
 
 // GPU-Based Matrix-Vector Multiplication(s) using CRS, BCRS (2x2), and 
 // NO Compression. 
158 
 
 void matVecMultCRS_GPU(T *b, T *x, unsigned int p); 
 void matVecMultBCRS_GPU(T *b, T *x, unsigned int p); 
 void matVecMultNONE_GPU(T *b, T *x, unsigned int p); 
 
public: 
 // Create Matrix object from argument data of m-by-n dimensions. 
 Matrix(T **data, unsigned int m, unsigned int n); 
 
 // Create Matrix object from argument data of m-by-m dimensions. 
 Matrix(T *data, unsigned int m); 
 
 // Compress current data element (i.e. matrix), REMOVING the ORIGINAL 
 // Matrix elements. 
 // PARAM:  type  The Matrix Compression used (e.g. CRS). 
 void compress(FORMAT_TYPE type = CRS); 
 
 // Computes Matrix-Vector Product as defined by the current matrix compression 
 // format (if any). 
 // PARAM:  b  Right-Hand Side 
 // PARAM:  x  Solution Vector (holds solution) 
 // PARAM:  n  Length of Right-Hand Side Vector and Solution Vector 
 void matrixVectorMult(T *b, T *x, unsigned int n); 
 void matrixMultCPU(T *b, T *x); 
 
 // Computes Matrix-Vector Product as defined by the current matrix compression 
 // format (if any) for the 
 // GPU Device. 
 // PARAM:  b  Right-Hand Side 
 // PARAM:  x  Solution Vector (holds solution) 
 void matrixVectorMultGPU(T *b, T *x, unsigned int p); 
  void matrixMultGPU(T *B, T *C); 
 
 T& operator()(unsigned int i, unsigned int j); 
















C/C++ Class Template (calls CUDA file with Kernels) 
 
template <class T> 
void Matrix<T>::matVecMultCRS_GPU(T *b, T *x, unsigned int p){ 
 if(val_ == NULL || rowptr_ == NULL || colind_ == NULL) 
  throw MatrixException("Exception with  GPU call!\n"); 
 
 float gTime = 0.0f; 
 
 // Call kernel via C-Code interface. 
 mul_multBlocks_CRS(val_, vLength_, rowptr_, rLength_, colind_, cLength_, b, x, m_, 
  n_, p, gTime); 
 




Derivations of Presented Equations: 
 
 Equation (4.5) to calculate the average number of non-zeros per row when using the 
Compressed Sparse Row (CSR) data compression format is detailed below.  
 The data-type utilized in this work is the single-precision float each of which is defined 
by 4-bytes. The assumption is an initial square matrix of MM  dimension so the number of non-
zero elements for each row M is generated as a ratio subtracted from the GPU device global 
memory memG . The numerator of the ratio is 4 times the number of rows M plus 1, to account for 
even numbers of elements as well as 4-byte floats. The denominator is the number of rows 
M times 8 which defines the square of a single float – generating a ratio that is less than 1 and an 
average of length of a single row in the original matrix. The maximum of 1 or the generated 
average number of non-zeros per row is chosen as the result NZR since the value should at least 


















,1       (4.5) 
 
 Equation (4.6) to calculate the number of blocks when using the Compressed Sparse 
Row (CSR) data compression format is detailed below.  
 Each calculated block of data input to the GPU, BN , is partitioned such that a single warp 
(32 threads) is given for each row M . The number of blocks is a ratio such that the numerator is 
the product of the number of rows M and the average number of non-zero elements per row 
NZR and the denominator is the total number of warps and SMPs for the GPU device multiplied 









       (4.6) 
 
 Equation (5.3) to calculate the total solution time for multiple CPU/GPU computing 
systems when using Compressed Sparse Row (CSR) data compression format is detailed below. 
 The estimated solution time for the multiple CPU/GPU computing system is adapted 
from the estimated time for the single CPU/GPU computing system which is detailed in equation 
(4.9.3) of chapter 4 given as gpuT , and the cost of local CPU-GPU host and intra-nodal MPI 
communication defined asC . 
 The assumption is made that the single CPU/GPU computing system solution time 
estimation gpuT  and the number of active thread blocks per SMP ATBN are already known.  The 
naïve approach of computing the ratio of the single CPU/GPU solution time by the number of 
partitions sdP must be modified to account for overhead of communications defined asC . Given 
161 
 
each partition will produce an individual cost C , C  is divided by the number of partitions of the 
original global domain sdP  - this result is multiplied by the sum of the number of active thread 
blocks per SMP ATBN and the square root of the number of partitions sdP  represented in equation 




an average cost of communication assuming a square matrix, thus the sdP variable. 
 The estimated solution time of multiple CPU/GPU computing system is not complete 
until the rate of growth/decay is calculated using an exponent of the ratio of the total number of 




e . Using the 




e , the cost of communication will increase as the number of 























 The CUDA Kernels and C/C++ code for the execution of full candidate application 




// Name          : fertm2d.cpp 
// Author        : Richard Haney 
// Version       : 1.1a 
// Description : Simulated Resin Transfer Molding (RTM) such that the global solution is solved 
//                       using Finite Element Method (FEM) and is based on original FORTRAN 





using namespace std; 
 
int main(int argc, char **argv) { 
    Fertm fertm_(argc, argv); 
    int nfill = 0;  double sumf = 0.0;    
 long long int flops = 0;  double soltime = 0.0; 
  
    // initialize system 
    fertm_.initialize(); 
  
    // preprocess 
    fertm_.preprocess(nfill); 
  
    // process/solve system 
    fertm_.process(nfill, sumf, flops, soltime); 
 





















using namespace std; 
 
// Interface for Resin Transfer Molding (RTM) to be used to in the simulation program. 
class IRtm { 
public: 
 IRtm(); 
 virtual ~IRtm(); 
 virtual string get_filename() = 0; 
 
 /* 
  * Function initializes all value(s) to prepare for the execution of the RTM program 
  * ******************************************************************** 
  *                     PLEASE NOTE:   Function must be called FIRST! 
  * ******************************************************************** 
  */ 
 virtual void initialize() = 0; 
 
 /* 
  * Function executes preprocessing operations for RTM program returning the 
  * initial volume filled. 
  * ******************************************************************** 
  * PLEASE NOTE:  Function must be called AFTER the initialize() and BEFORE the 
  * process() 
  * ******************************************************************** 
  * @param numfill is the current number of filled nodes - assumed zero at this point 
  * @return total volume filled in model 
  */ 













  * Function executes the processing operations for RTM program - solving the system. 
  * ******************************************************************** 
  * PLEASE NOTE:  Function must only be called AFTER calling the preprocess() 
 *  function. 
  * ******************************************************************** 
  * @param num is the number of filled nodes 
  * @param sumf sum filled/volume 
  * @param flops number of floating-point operations 
  * @param solve_time total time for execution – in milliseconds 
  * @param verbose if true outputs verbose info. 
  */ 
 virtual void process(int &num, double &sumf, long long int &flops, double &solve_time, 
















class Fertm : public IRtm { 
protected: 
 FertmModel model_;  FertmParser parse_; 
public: 
 Fertm(int argc, char **argv); 
 virtual ~Fertm(); 
 
  // Function returns the current filename of the input file being "solved" by this class. 
 string get_filename(); 
 
 // Function returns the current "partitioned" filename being used for MPI-based 
 // parallelism, if any 
 string get_pfilename(); 
 
 // Function performs initialization operations such that the FERTM 2D program can 
 // execute properly. 





  * Function executes all preprocessing operations for the FERTM 2D program to execute 
 *  properly. 
  * @param numfill current number of filled nodes - assumed zero at this point 
  * @return total filled volume after preprocessing 
  */ 
 void preprocess(int &numfill); 
 
 /* 
  * Function executes the processing operations for  the FERTM 2D RTM - solving the 
  * system. 
  * ********************************************************************* 
  * PLEASE NOTE:  Function must only be called AFTER calling the preprocess() 
  * function. 
  * ********************************************************************* 
  * @param num is the number of filled nodes 
  * @param sumf sum filled/volume 
  * @param flops number of floating-point operations 
  * @param solve_time total time for execution – in milliseconds 
  * @param verbose if true outputs verbose info. 
  */ 
 void process(int &num, double &sumf, long long int &flops, double &solve_time,  










 The algorithms from chapter 2 of this dissertation defining the LCM solution strategy, 
sparse matrix-vector, and the preconditioned conjugate gradient iterative solver for sparse 
symmetric positive definitive matrices. 
 
 
Algorithm 2.1:  Implicit Pure FE methodology for LCM Computation 
 
      (For time step 1n and iteration m ) 
1.    REPEAT 
2.           SET   1 n
mi
to ni     (save previous fill factor values) 
3.           CALL assembleC for iC  (assembleC forms lump mass matrix) 
4.           CALL assembleK for ijK   (assembleK forms stiffness matrix K ) 
5.           CALL assembleLoad on iq  (assembleLoad forms load vector q ) 
6.           REPEAT 
7.                   SET boundary conditions on ijK  
                         (Modified load vector g ) 
8.                   SET  
mi








                         (Where ijK̂ is K matrix with boundary conditions applied) 
9.                   SOLVE      mimjmij gPK ˆ  
                         (Compute new nodal resin fraction field using equation (4)) 




















CC  THEN 
12.                      BREAK 
13.                 ELSE 
14.                      SET   1 n
mi








15.                 ENDIF  
16.         UNTIL mass resin convergence 









Algorithm 2.2:  Preconditioned conjugate gradient (solves bAx  ) 
Input:  Matrix A  and load/force vector b  
Output: solution vector x  
 
1.  Set 00 Axbr   




3.  Set 00 zp   
4.  Set 0k  
5.  DO UNTIL CONVERGENCE 











7.   kkkk pxx 1  
8.   kkkk Aprr 1  
9.   IF   1kk rr  BREAK 




  kk rMz  









rz 11   
12.   kkkk pzp   11  
13.   1 kk  

























Algorithm 2.3:  Sparse Matrix-Vector Multiplication (CSR Compression) 
Input:  Non-zero vector dat , load/force vector b , row pointer rptr , column indices cidx , and row 
length M  
Output: solution vector x  
 
1.  Set 0i  
2.  Set 0j  
3.  Set 0k  
4.  DO WHILE  Mi   
5.   Set  irptrj   
6.   Set  1 irptrk  
7.   DO WHILE  kj   
9.    Set          jcidxbjdatixix   
10.    Set 1 jj  
11.   END DO 
12.   Set 1 ii  






 This appendix contains TECPLOT visualized results of resin flow progression contours 
of the input unstructured meshes. 
 
 




Figure 97. Time filled for unstructured mesh MA single CPU/GPU (System A) 
 
 The following are the time-filled TECPLOT images for validation using the 2D circular 




Figure 98. Time filled single CPU/GPU with circular plate (System A) 
 




Figure 100. Time filled single CPU/GPU with circular plate (System B) 
 








Figure 103. Input mesh model 10FT multiple partition time-filled comparison (System B) 
