PARAMETERS THAT AFFECT PARALLEL PROCESSING FOR COMPUTATIONAL ELECTROMAGNETIC SIMULATION CODES ON HIGH PERFORMANCE COMPUTING CLUSTERS by Moon, Hongsik
Syracuse University 
SURFACE 
Dissertations - ALL SURFACE 
December 2016 
PARAMETERS THAT AFFECT PARALLEL PROCESSING FOR 
COMPUTATIONAL ELECTROMAGNETIC SIMULATION CODES ON 
HIGH PERFORMANCE COMPUTING CLUSTERS 
Hongsik Moon 
Syracuse University 
Recommended Citation 
Moon, Hongsik, "PARAMETERS THAT AFFECT PARALLEL PROCESSING FOR COMPUTATIONAL 
ELECTROMAGNETIC SIMULATION CODES ON HIGH PERFORMANCE COMPUTING CLUSTERS" (2016). 
Dissertations - ALL. 588. 
https://surface.syr.edu/etd/588 
ABSTRACT 
 
What is the impact of multicore and associated advanced technologies on 
computational software for science? Most researchers and students have multicore 
laptops or desktops for their research and they need computing power to run 
computational software packages. Computing power was initially derived from Central 
Processing Unit (CPU) clock speed. That changed when increases in clock speed became 
constrained by power requirements. Chip manufacturers turned to multicore CPU 
architectures and associated technological advancements to create the CPUs for the future. 
Most software applications benefited from the increased computing power in the same way 
that increases in clock speed had helped applications run faster. However, for Computational 
ElectroMagnetics (CEM) software developers, this change was not an obvious benefit – it 
appeared to be a detriment. Developers were challenged to find a way to correctly utilize 
the advancements in hardware so that their codes could benefit. The solution was 
parallelization and this dissertation details the investigation to address these challenges. 
Prior to multicore CPUs, advanced computer technologies were compared by 
running benchmark software, and the metric was FLoating-point Operations 
Per Second (FLOPS), which indicates system performance for scientific applications that 
make heavy use of floating-point calculations. Is FLOPS an effective metric for 
parallelized CEM simulation tools on new multicore systems? Parallel CEM software 
needs to be benchmarked not only by FLOPS but also by the performance of other 
parameters related to the type and utilization of the hardware, such as CPU, Random Access 
Memory (RAM), hard disk, and network. The codes need to be optimized for more than 
just FLOPS, and new parameters must be included in benchmarking. 
   
In this dissertation, the parallel CEM software named Higher Order Basis Based 
Integral Equation Solver (HOBBIES) is introduced. This code was developed to address 
the needs of the changing computer hardware platforms in order to provide fast, accurate 
and efficient solutions to large, complex electromagnetic problems. The research in this 
dissertation proves that the performance of parallel code is intimately related to the 
configuration of the computer hardware and can be maximized for different hardware 
platforms. To benchmark and optimize the performance of parallel CEM software, a 
variety of large, complex projects are created and executed on a variety of computer 
platforms. The computer platforms used in this research are detailed in this dissertation. 
The projects run as benchmarks are also described in detail and results are presented. The 
parameters that affect parallel CEM software on High Performance Computing Clusters 
(HPCC) are investigated. This research demonstrates methods to maximize the 
performance of parallel CEM software code. 
 
  
   
 
 
 
PARAMETERS THAT AFFECT PARALLEL PROCESSING 
FOR COMPUTATIONAL ELECTROMAGNETIC 
SIMULATION CODES ON HIGH PERFORMANCE 
COMPUTING CLUSTERS 
 
By 
Hongsik Moon 
B.E., Kyoungpook National University, Republic of Korea, 2003 
M.S., Syracuse University, Syracuse NY, USA 2008 
 
 
DISSERTATION 
Submitted in partial fulfillment of the requirements for the Degree of  
Doctor of Philosophy in Electrical & Computer Engineering 
 
 
Syracuse University 
December 2016 
 
  
   
 
 
 
 
 
 
 
 
 
Copyright 2016 Hongsik Moon 
All Rights Reserved 
 
ACKNOWLEDGEMENT 
 
I would like to express my gratitude to several individuals whose generous 
contributions of time and thought have eased the process of completing this doctoral 
degree. 
First of all, I would like to thank my advisor, Prof. Tapan K. Sarkar for his 
guidance, patience, and support for research. I have benefited from helpful discussions 
and collaborations with many graduate students and visiting scholars who have engaged 
in research in CEMLAB, in particular Dr. Yu Zhang, Dr. Daniel Garcia Donoro, and Dr. 
Mary C. Taylor. I could not have finished this dissertation without them. Also, I appreciate my 
colleagues in CEMLAB, Mohammad Abdallah, Heng Chen and Dojana Salama. 
I would like to thank my dissertation committee, Prof. Jay K. Lee, Prof. Jun H. 
Choi, Prof. Kishan G. Mehrotra and Prof. Fritz H. Schlereth, for their great comments 
despite their tight schedules. I would also like to thank Prof. Yoonseok Lee, who graciously 
agreed to serve as committee chair on short notice.  
I appreciate many good friendships, especially with Yunhee Kim, Chanwoong 
Shin, Sanglim Yoo, Dongwon Kim and Rev. Dr. Yong Jee and his wife, Soo Jee for 
guidance and prayer.  
I would like to express my deepest gratitude to my parents and family, 
especially my wife, Minjee Kim, for their love, support and prayer. I also would like to 
express my love to my children, Noah and Leah. 
Finally, I would like to thank God for completing one of the important tasks of 
my life with his grace, mercy and love. 
 
TABLE OF CONTENTS 
ABSTRACT ........................................................................................................................ i 
ACKNOWLEDGEMENT ................................................................................................ v 
TABLE OF CONTENTS ................................................................................................ vi 
LIST OF TABLES ......................................................................................................... viii 
LIST OF FIGURES .......................................................................................................... x 
ABBREVIATIONS ......................................................................................................... xii 
1. INTRODUCTION......................................................................................................... 1 
1.1. Background .............................................................................................................. 1 
1.2. Objectives ................................................................................................................. 3 
1.3. The Scope of the Dissertation .................................................................................. 4 
2. PARALLEL CEM SIMULATION SOFTWARE : HOBBIES AND 
PERFORMANCE ............................................................................................................. 6 
2.1. Overview .................................................................................................................. 6 
2.2. Key features .............................................................................................................. 7 
2.3. Criteria for checking the performance - Simulation time and load balancing .......... 8 
3. COMPUTING PLATFORMS ................................................................................... 14 
3.1. Overview ................................................................................................................ 14 
3.2. Description of computational platforms ................................................................. 14 
4. THE PARAMETERS THAT AFFECT PARALLEL PROCESSING FOR CEM 
SIMULATION CODES ON HPCC .............................................................................. 24 
4.1. Overview ................................................................................................................ 24 
4.2. Intel Hyper-Threading Technology ........................................................................ 25 
4.3. Single-core system versus multicore system .......................................................... 36 
4.4. New generation of multicore .................................................................................. 43 
4.5. Hard disk ................................................................................................................ 51 
4.6. Network .................................................................................................................. 58 
4.7. Effect of number of cores ....................................................................................... 71 
4.8. IASIZE: more than 4 GB per core ......................................................................... 85 
5. CONCLUSION ........................................................................................................... 91 
BIBLIOGRAPHY ........................................................................................................... 94 
VITA................................................................................................................................. 97 
 
  
 
LIST OF TABLES 
Table 3.1 Summary of the computing platforms used ..................................................... 23 
Table 4.1 Simulation times: with and without HT on Cluster-1 ...................................... 32 
Table 4.2 NUN and electrical size of aircraft model with three different frequencies .... 33 
Table 4.3 Simulation times: with and without HT on Cluster-2 ...................................... 35 
Table 4.4 NUN and electrical size of Model 2 with four operating frequencies ............. 38 
Table 4.5 Simulation times using 16 cores: Cluster-2 and Cluster-3 ............................... 39 
Table 4.6 CPU specification details: Cluster-2 and Cluster-3 ......................................... 40 
Table 4.7 Simulation times using 2 cores (one node): Cluster-2 and Cluster-3............... 42 
Table 4.8 Simulation times using 8 cores (four nodes): Cluster-2 and Cluster-3 ............ 43 
Table 4.9 CPU specification details: Cluster-3 and Cluster-8 ......................................... 46 
Table 4.10 Simulation times using 24 cores: Cluster-3 and Cluster-8 ............................. 47 
Table 4.11 CPU specification details: Cluster-3 and Cluster-4 ....................................... 48 
Table 4.12 Simulation times using 24 cores: Cluster-3 and Cluster-4 ............................. 49 
Table 4.13 Simulation times with CPU speed normalized: Cluster-3 and Cluster-4 ....... 50 
Table 4.14 CPU specification details: Cluster-4 and Cluster-8 ....................................... 53 
Table 4.15 Simulation times using 24 cores: SAS and SATA ......................................... 54 
Table 4.16 Simulation times with CPU speed normalized:  SAS and SATA .................. 55 
Table 4.17 Simulation times: 10,000 RPM and 15,000 RPM on Cluster-2 ..................... 56 
Table 4.18 Simulation times: local hard disk and disk array on Cluster-6 ...................... 57 
Table 4.19 Simulation times: without and with network on Cluster-3 ............................ 61 
Table 4.20 Simulation times: without network and with network with only 2 GB RAM 
per core on Cluster-3 ......................................................................................................... 63 
Table 4.21 Simulation times: without and with network on Cluster-4 ............................ 64 
Table 4.22 Simulation times: Gigabit Ethernet and InfiniBand on Cluster-6 .................. 65 
Table 4.23 Actual simulation times: Cluster-3 with Gigabit Ethernet and Cluster-5 with 
InfiniBand ......................................................................................................................... 66 
Table 4.24 Simulation times with CPU speed normalized: Cluster-3 with Gigabit 
Ethernet and Cluster-5 with InfiniBand ............................................................................ 67 
Table 4.25 Actual simulation times: Cluster-3 with Gigabit Ethernet and Cluster-5 with 
InfiniBand with larger NUN ............................................................................................. 68 
Table 4.26 Simulation times with CPU speed normalized: Cluster-3 with Gigabit 
Ethernet and Cluster-5 with InfiniBand with larger NUN ................................................ 68 
Table 4.27 Actual simulation times using 8 cores: Cluster-3 with Gigabit Ethernet and 
Cluster-5 with InfiniBand ................................................................................................. 69 
Table 4.28 Simulation times using 8 cores with CPU speed normalized: Cluster-3 with 
Gigabit Ethernet and Cluster-5 with InfiniBand ............................................................... 69 
Table 4.29 Actual simulation times using 16 cores: Cluster-3 with Gigabit Ethernet and 
Cluster-5 with InfiniBand ................................................................................................. 70 
Table 4.30 Simulation times using 16 cores with CPU speed normalized: Cluster-3 with 
Gigabit Ethernet and Cluster-5 with InfiniBand ............................................................... 70 
Table 4.31 Simulation times: varying the number of cores and process grid on Cluster-7
........................................................................................................................................... 73 
Table 4.32.a Simulation times: forty two cases varying the numbers of cores and process 
grid with larger NUN on Cluster-7 ................................................................................... 75 
Table 4.32.b Simulation times: forty two cases varying the numbers of cores and process 
grid with larger NUN on Cluster-7 (cont’d.) .................................................................... 76 
Table 4.33 Simulation times: one simulation at a time using 512 cores for each model . 81 
Table 4.34 Simulation times: two simulations in parallel, using 256 cores for each model
........................................................................................................................................... 81 
Table 4.35 Simulation times: four simulations in parallel, using 128 cores for each model
........................................................................................................................................... 82 
Table 4.36 Simulation times: eight simulations in parallel using 64 cores for each model
........................................................................................................................................... 82 
Table 4.37 Simulation times: sixteen simulations in parallel using 32 cores for each 
model................................................................................................................................. 83 
Table 4.38 Summary of simulation times: 16 projects with similar NUN ....................... 84 
Table 4.39 Simulation times using 20 cores: six different IASIZE on Cluster-9 ............ 87 
Table 4.40 Simulation times using 10 cores: six different IASIZE on Cluster-9 ............ 89 
 
  
 
LIST OF FIGURES 
Figure 2.1 Logo of HOBBIES 10 Professional.................................................................. 6 
Figure 2.2 “View Process Info” display in HOBBIES shows (a) Project information,  (b) 
Post-processing setting and simulation time ..................................................................... 10 
Figure 2.3 Cluster Ganglia Report showing load and CPU, memory and network usage 11 
Figure 2.4 Zoom-in of Cluster Ganglia Report showing CPU usage .............................. 11 
Figure 2.5 Zoom-in of Cluster Ganglia Report showing Network usage ........................ 12 
Figure 3.1 Cluster-1, custom-built workstation by Seneca Data ..................................... 14 
Figure 3.2 Cluster-2, ten Dell PowerEdge 1855 blade servers ........................................ 15 
Figure 3.3 Cluster-3, four Seneca Data Nexlink 7875 Series Workstations .................... 16 
Figure 3.4 Cluster-4, three Seneca Data Nexlink 7875 Series Workstations .................. 17 
Figure 3.5 Cluster-5, one HP Proliant DL380 G5 server as head node and sixteen 
BL460c G1 blade servers as compute node ...................................................................... 18 
Figure 3.6 Cluster-6, Dell PowerEdge 1950 Cluster Map ............................................... 19 
Figure 3.7 Cluster-8, Dell PowerEdge T410 server and Dell Precision T5500 ............... 21 
Figure 3.8 Cluster-9, Seneca Data VFX 9300 ................................................................. 22 
Figure 4.1 HT control in BIOS setup on Cluster-2 .......................................................... 26 
Figure 4.2 Ganglia overview of Cluster-2 with HT off ................................................... 28 
Figure 4.3 Ganglia CPU usage report for Clsuter-2, the first compute node HT off ....... 28 
Figure 4.4 Screen shot of Ganglia overview of Cluster-2 with HT on ............................ 29 
Figure 4.5 Ganglia CPU usage report for Cluster-2, the first compute node HT on. ...... 29 
Figure 4.6 Aircraft model, a half structure with symmetry ............................................. 31 
Figure 4.7 Top view of a half aircraft model with symmetry .......................................... 31 
Figure 4.8 Simulation results, RCS (Phi Cut, Phi=90°) of Models 1-1 and 1-2 with and 
without HT ........................................................................................................................ 32 
Figure 4.9 Simulation results, RCS (Phi cut, Phi=90°) of Models 1-3, 1-4, and 1-5 with 
and without HT ................................................................................................................. 34 
Figure 4.10 Full size aircraft, Model 2 ............................................................................. 37 
Figure 4.11 Simulation results, RCS of Models 2-1, 2-2, 2-3, and 2-4 on Cluster-2 and 
Cluster-3 ............................................................................................................................ 38 
Figure 4.12 Graph of simulation times (minutes): Cluster-2 and Cluster-3 ..................... 39 
Figure 4.13 Configuration of Intel QPI ............................................................................ 44 
Figure 4.14 Block Diagram of Processor with Intel QPI ................................................. 45 
Figure 4.15 New generation CPU Memory Architecture ................................................ 45 
Figure 4.16 Graph of simulation times (minutes): Cluster-3 and Cluster-8 .................... 47 
Figure 4.17 Graph of simulation times (minutes) with CPU speed normalized: Cluster-3 
and Cluster-8 ..................................................................................................................... 50 
Figure 4.18 Ganglia CPU report on Cluster-6 ................................................................. 58 
Figure 4.19 Top view and dimensions of Global Hawk, Model 3 ................................... 59 
Figure 4.20 Side view and dimensions of Global Hawk, Model 3 .................................. 60 
Figure 4.21 Simulation results, Radiation Pattern of dipole antenna array, (a) Phi = 0° cut, 
(b) Phi = 90° cut of Model 3-1, (c) Phi = 0° cut, (d) Phi = 90° cut of Model 3-2,  (e) Phi = 
0° cut, (f) Phi = 90° cut of Model 3-3 ............................................................................... 61 
Figure 4.22 Plot simulation times: fourteen cases varying the numbers of cores on 
Cluster-7 ............................................................................................................................ 74 
Figure 4.23 Screen shot of “PRE-HOBBIES” ................................................................. 78 
Figure 4.24 Model 4-1 with two horn antennas with a walled environment ................... 80 
Figure 4.25 Model 4-2 with a horn antenna probe and a helical Antenna AUT with a 
walled environment ........................................................................................................... 80 
Figure 4.26 Airbus A380, Model 5 .................................................................................. 86 
Figure 4.27 Simulation results, RCS (horizontal plane) of Model 5-1 simulated with six 
different IASIZE ............................................................................................................... 86 
Figure 4.28 Filling, solving and total simulation times using six different IASIZE using 
20 cores on Cluster-9 ........................................................................................................ 87 
Figure 4.29 CPU usages during the IASIZE simulation. ................................................. 88 
Figure 4.30 Memory usages during the IASIZE simulation. ........................................... 88 
Figure 4.31 Filling, solving and total simulation time using six different IASIZE using 
10 cores on Cluster-9 ........................................................................................................ 90 
 
  
 
ABBREVIATIONS 
AUT Antenna Under Test 
BIOS Basic Input/Output System 
CEM Computational ElectroMagnetics 
CPU Central Processing Unit 
DDR Double Data Rate 
EM ElectroMagnetics 
EMC ElectroMagnetic Compatibility 
EMI ElectroMagnetic Interference 
FLOPS FLoating-point Operations Per Second 
FSB FrontSide Bus 
GUI Graphical User Interface 
HOB Higher Order Basis function 
HOBBIES Higher Order Basis Based Integral Equation Solver 
HPC High Performance Computing 
HPCC High Performance Computing Cluster 
HT Hyper-Threading 
I/O Input/Output 
IASIZE In-core Array SIZE 
ICS In-Core Solver 
MoM Method of Moments 
NAS Network Attached Storage 
NUN Number of UNknowns 
OCS Out-of-Core Solver 
PEC Perfect Electric Conductor 
QPI Quick Path Interconnect 
RAID Redundant Array of Inexpensive (or Independent) Disks 
RAM Random Access Memory 
RCS Radar Cross Section 
RPM Revolutions Per Minute 
SAS Serial Attached SCSI 
SATA Serial Advanced Technology Attachment 
SCSI Small Computer System Interface 
TIDES Target IDEntification Software 
UAV Unmanned Aerial Vehicle 
 
 
 
 
 
1. INTRODUCTION 
 
1.1. Background 
Prior to 2005, chip manufacturers made only single-core processors for computers. 
Research efforts that required more processing power employed multiple single-core 
processors. Chip manufacturers achieved increasingly better performance by increasing 
the clock speed of the CPU. Hyper-Threading (HT) was another development aimed at 
improving the performance of CPUs by enabling multiple application threads to run on 
one core at the same time, essentially increasing efficiency as if the computer had more 
than one core. 
However, by 2005, power requirements began to constrain increases in CPU speed. 
Clock speed for single-core CPUs peaked around 4 GHz. Chip manufacturers began to 
employ new technology to replace the single-core CPU, initially with dual-core, then 
quad-core, and then more cores on one chip. The multicore technology benefited most 
desktop and laptop users by allowing multiple applications to run at the same time. The decreased 
CPU speed was not a detriment to these users and the decreased power consumption was 
a benefit to server, desktop, laptop and mobile platforms. The new architecture promised 
to increase user productivity by enabling greater multitasking beyond what HT on a 
single core could provide.  
However, when the first generation of multicore CPUs came out with two cores and 
slower CPU speeds, CEM researchers found that their existing applications took more 
time to execute because the code was running on only one core and at a much slower 
clock speed. It was clear that the new CPUs were not better for existing CEM software and that 
new methods were required for the development of future CEM codes. Having more 
cores running more slowly and communicating over the same system bus to access 
memory and Input/Output (I/O) was essentially the opposite of what computational 
software needs. To overcome the disadvantages of multicore CPUs, CEM researchers 
had to develop new coding methods to parallelize the software so that the multiple cores 
could work together efficiently on complex matrix math manipulations. Learning, testing, 
and optimizing parallel coding techniques was no longer optional for the CEM 
researcher; it was essential [1-4].  
Advanced computer hardware technologies are compared using benchmark 
software that reports FLOPS, which indicates hardware performance for scientific applications that 
make heavy use of floating-point calculations. Is FLOPS an effective metric for 
parallelized CEM simulation tools on new multicore hardware? FLOPS is one of the 
important parameters that indicate performance but parallelized CEM simulation tools 
need to benchmark not only the FLOPS but also load balancing. Codes themselves need 
to be optimized and researchers need to specify the hardware to enable maximum 
performance of the code. 
I performed much of the hardware optimization and testing that is documented in 
the book titled “Parallel Solution of Integral Equation-Based EM Problems In The 
Frequency Domain” by Yu Zhang and Tapan K. Sarkar. This book shows that the 
performance of a parallel code is intimately related to the configuration of the computer 
hardware platform. To optimize the performance of parallel CEM software, numerical 
benchmarks must be run on multiple computer platforms to quantify their effectiveness in 
solving a given problem in the fastest time. Parameters that affect the performance of 
parallel CEM software are the block size, the shape of the process grid, and In-core Array 
Size (IASIZE). These parameters and more are investigated and adjusted to provide 
excellent performance of a parallel code. Small and simple projects will not be affected 
much by an incorrect choice of parameters. However, large, complex projects will 
greatly benefit from the correct choice of relevant and critical parameters. 
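As an illustration of what "the shape of the process grid" means, the following minimal sketch lists the candidate grid shapes for a given number of cores. It assumes a two-dimensional rows-by-columns process grid of the kind used by ScaLAPACK-style dense solvers; the function name `process_grid_shapes` is hypothetical and is not part of HOBBIES. Which shape runs fastest is an empirical question that has to be answered by benchmarking.

```python
def process_grid_shapes(num_cores):
    """Return every (rows, cols) pair with rows * cols == num_cores.

    Assumption: processes are arranged in a 2D rows-by-columns grid,
    one process per core. This is an illustrative helper only.
    """
    shapes = []
    for rows in range(1, num_cores + 1):
        if num_cores % rows == 0:
            shapes.append((rows, num_cores // rows))
    return shapes


if __name__ == "__main__":
    # Candidate grids for a 16-core run: from a single row to a single column.
    for rows, cols in process_grid_shapes(16):
        print(f"{rows} x {cols}")
```

Each shape uses the same number of cores, so differences in measured simulation time between shapes can be attributed to the grid layout rather than to the amount of hardware.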
Several models documented in that book, from the Radar Cross Section (RCS) 
of a 2 m sphere to an aircraft carrier model with dimensions 265 m long, 6 m wide, and 
47 m high, were created and tested by H. S. Moon. One model depicts sixty-one aircraft 
and six helicopters. Projects include applications for Electromagnetic Compatibility 
(EMC), Electromagnetic Interference (EMI), and the radiation pattern of a radome on an 
aircraft. These projects, and others documented in the book, were tested on various types of 
workstations and clusters [1], some of which are also used for the research presented in 
this dissertation. 
The parallel CEM software HOBBIES is used to perform the tests in this research. 
HOBBIES is a Method of Moments (MoM) code that uses higher order basis functions 
(HOBs). A key feature is an Out-of-Core Solver (OCS), which uses hard disk storage 
instead of RAM to store the large dense matrix. HOBBIES is parallelized with highly 
efficient load balancing of CPU core, memory and hard disk resources. These and other 
features are presented in detail in chapter 2. HOBBIES is an accurate (MoM), fast 
(parallelized), and efficient (HOB and parallel OCS) CEM software package [5-7]. 
 
1.2. Objectives 
The objective of this dissertation is to investigate the effect of a selection of 
hardware and software parameters on the performance of a parallelized CEM code. 
Performance will be compared on nine HPCCs with different hardware configurations. 
The same parallel CEM software is used to create six complex models and the frequency 
of the simulation is used to change the electrical size of the structure. The same model 
can be executed as a small or large project, depending on the simulation frequency. The 
execution of the simulations on the different hardware platforms gives simulation times 
that are compared to identify the impact of the parameters. For small projects on small 
clusters, the impact may be negligible. For large projects, hardware and software 
parameters affect the performance significantly. CEM researchers want to use HPCC 
resources efficiently so they are not purchasing or leasing more than is required for the 
projects they need to run. This information is not typically detailed in the HPCC 
hardware or CEM software manuals. This dissertation details the experience gained from 
running a parallel CEM solver on workstation and blade server configurations with 
different numbers and types of CPUs, different amounts of RAM and storage, and different 
networks. This research was undertaken to solve challenging, realistic CEM problems in 
the fastest and most efficient way. The results detailed here will help researchers choose 
the right configurations for their research. 
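The statement that one model can be run as a small or a large project depending on frequency can be made concrete with a short sketch. It assumes free-space propagation, so that the wavelength is the speed of light divided by the frequency; the function names and the example length and frequencies are hypothetical and are not taken from the dissertation's models.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second, free space


def wavelength_m(freq_hz):
    """Free-space wavelength in metres at the given frequency."""
    return SPEED_OF_LIGHT / freq_hz


def electrical_length(length_m, freq_hz):
    """Physical length expressed in wavelengths (the electrical size)."""
    return length_m / wavelength_m(freq_hz)


if __name__ == "__main__":
    # Hypothetical 20 m long structure simulated at three frequencies.
    for freq_hz in (100e6, 500e6, 1e9):
        size = electrical_length(20.0, freq_hz)
        print(f"{freq_hz / 1e6:7.0f} MHz -> {size:6.1f} wavelengths")
```

Because the electrical length grows linearly with frequency, the same geometry becomes a progressively larger numerical problem as the simulation frequency is raised.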
 
1.3. The Scope of the Dissertation 
The dissertation consists of five chapters. The first chapter presents a brief 
background, introducing how multicore and new technology affect CEM software, and 
the objective and scope of this research. The second chapter describes the HOBBIES [5] 
parallelized CEM simulation tool used in this research to check the performance of 
parallelization on different configurations of hardware. The third chapter provides 
relevant information on the nine computing platforms that are tested in this research to 
compare the performance. The specifications of all the hardware systems are provided in 
detail, and a summary is provided at the end of the chapter. The fourth chapter presents 
the hardware and software parameters that affect parallel CEM software. The parameters 
related to the CPU include the number of cores, Hyper-Threading Technology (HT), 
system bus (either Front Side Bus (FSB) or Quick Path Interconnect (QPI)), and 
advanced features like the memory controller and Intel Turbo Boost. Different types of 
hard disk are tested, with different rotational speeds in revolutions per minute (RPM) and 
different types of connection. Performance with and without a network interconnection is 
measured, and two different types of networks are compared. Software parameters are studied to 
determine how to run the parallel CEM code efficiently. Users have to calculate how 
many cores are needed to efficiently solve a project with a given number of unknowns 
(NUN). CEM software cannot decide the optimum number of cores for users. The fastest 
way to run the simulation is to use the maximum number of cores, but that is not efficient and 
may be a waste of computer resources. An understanding of these parameters is critical if the 
researcher needs to specify the computational resources they need to purchase for their work. The 
fifth chapter presents a conclusion and the answer to the question, “Are multicore CPUs 
and associated new technologies helpful for the researchers and engineers who need to 
use CEM software simulation tools?” 
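The trade-off between the fastest run and the most efficient run can be sketched with the standard definitions of speedup and parallel efficiency. These are textbook metrics and are not necessarily the criteria used in the later chapters; the function names and the timings in the example are hypothetical.

```python
def speedup(time_baseline, time_parallel):
    """How many times faster the larger run is than the baseline run."""
    return time_baseline / time_parallel


def efficiency(time_baseline, cores_baseline, time_parallel, cores_parallel):
    """Relative parallel efficiency; ideal scaling gives 1.0."""
    ideal = cores_parallel / cores_baseline
    return speedup(time_baseline, time_parallel) / ideal


def core_minutes(time_minutes, cores):
    """Total resource cost of a run: wall-clock minutes times cores used."""
    return time_minutes * cores


if __name__ == "__main__":
    # Hypothetical timings: (cores, wall-clock minutes) for one project.
    runs = [(8, 100.0), (16, 55.0), (32, 32.0), (64, 25.0)]
    base_cores, base_time = runs[0]
    for cores, minutes in runs:
        eff = efficiency(base_time, base_cores, minutes, cores)
        cost = core_minutes(minutes, cores)
        print(f"{cores:3d} cores: {minutes:6.1f} min, "
              f"efficiency {eff:4.2f}, cost {cost:7.1f} core-minutes")
```

In this made-up data set the 64-core run finishes first but consumes the most core-minutes, which is the sense in which using the maximum number of cores can waste resources.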
  
2. PARALLEL CEM SIMULATION SOFTWARE : HOBBIES AND PERFORMANCE 
 
2.1. Overview 
HOBBIES stands for Higher Order Basis Based Integral Equation Solver. As the 
name implies, HOBBIES employs entire domain higher order polynomials over a large 
quadrilateral patch as the basis functions for the frequency domain integral equation 
solver. HOBBIES can handle solutions of a variety of types of electromagnetic field 
analyses involving electrically large objects of arbitrary shapes composed of complex 
metallic and dielectric structures. Because of its modeling capabilities, HOBBIES is 
chosen for this research and is tested on nine HPCCs to compare the performance 
between different hardware platforms. Figure 2.1 shows the logo of HOBBIES 10 
Professional [5]. 
 
Figure 2.1 Logo of HOBBIES 10 Professional 
 
2.2. Key features 
There are four key features of the HOBBIES software package: MoM, HOBs, 
OCS and parallelization. Each feature will be described briefly in the following 
paragraphs. 
The first feature is MoM which is a numerical procedure for solving linear 
operator equations. This general technique is well-known and widely used because of its 
numerical accuracy. MoM generates a full dense matrix of complex impedances which 
must be inverted to solve for the coefficients of the current expansion. The matrix is 
typically stored in RAM, so the amount of RAM on a computer platform limits the size 
of the problem that can be solved when an in-core solver (ICS) is used.  
The second feature is HOBs, which uses higher order (multiple) basis functions 
over electrically large quadrilateral patches. This significantly reduces the NUN 
compared to traditional piecewise basis functions defined over triangular patches. 
Typically, only 10 to 20 unknowns per wavelength squared of surface area are needed for 
HOBBIES simulations. 
The third feature is OCS, which utilizes inexpensive and easily expandable hard 
disk drives instead of RAM for matrix storage. A large problem can be solved by 
utilizing large amounts of hard disk storage. OCS eliminates the constraint that RAM 
places on problem size, which is typical with MoM. 
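The storage argument behind the in-core and out-of-core solvers can be sketched with a short calculation. The sketch below assumes that each entry of the dense impedance matrix is stored as a double-precision complex number (16 bytes); that precision is an assumption made here for illustration, not a documented property of HOBBIES, and the function names are hypothetical.

```python
import math

# Assumption: one matrix entry is a double-precision complex number (2 x 8 bytes).
BYTES_PER_ENTRY = 16


def matrix_bytes(nun: int, bytes_per_entry: int = BYTES_PER_ENTRY) -> int:
    """Storage needed for a dense nun x nun impedance matrix."""
    return nun * nun * bytes_per_entry


def max_unknowns(storage_bytes: int, bytes_per_entry: int = BYTES_PER_ENTRY) -> int:
    """Largest NUN whose dense matrix fits in the given storage (upper bound)."""
    return math.isqrt(storage_bytes // bytes_per_entry)


if __name__ == "__main__":
    gib = 1024 ** 3
    # Matrix size for 50,000 unknowns, in GiB.
    print(matrix_bytes(50_000) / gib)
    # Upper bound on NUN if the whole matrix had to fit in 6 GiB of RAM.
    print(max_unknowns(6 * gib))
```

Because the storage grows with the square of the NUN, doubling the number of unknowns quadruples the storage, which is why moving the matrix from RAM to hard disk raises the solvable problem size so much.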
The last feature is parallelization. To solve large problems as quickly as possible, 
the computational kernel is designed in a high-performance parallel way to execute on all 
computer platforms, from multicore laptops and desktops to workstations, servers and 
clusters. With single-core CPUs, parallelization was an optional feature to increase the 
speed of solution. With the technological advancement to multicore CPUs, parallelizing 
CEM codes is a requirement in order to provide efficient use of computational resources. 
Large-scale accurate CEM simulations were previously constrained by either physical 
memory space or CPU speed of execution. The out-of-core technique combined with 
parallel computing is the solution for the engineering design cycle and research challenges 
that arise from electrically large and complex real-world applications [5-7]. 
The HOBBIES text is an excellent reference for students and researchers. The 
HOBBIES 10 Academic version of the code is provided with the book. This is a fully 
functional parallelized electromagnetic solver capable of solving up to 15,000 NUN. 
With the Academic version, many useful antenna problems can be modeled and 
simulated in a short time with a laptop or desktop with a multicore CPU. 
 
2.3. Criteria for checking the performance - Simulation time and load balancing 
In this dissertation, simulation time is used to measure the performance. This 
simulation time includes the computing time, the I/O time between CPU and RAM and 
between RAM and hard disk through the system bus, and the network time among 
multiple nodes connected with a network switch. Without parallel processing, the 
computing time is the dominant component of performance. For parallel OCS 
processing, I/O time and network time also become important components in 
performance comparisons. The basics of parallel OCS processing are described as follows. 
First, HOBBIES reads a portion of the matrix, called a block, into RAM. The 
amount of time this takes is called the matrix filling time. Second, HOBBIES performs an 
LU decomposition on that block. When all the computations for the block are complete, 
the OCS writes the result of the block to hard disk. The time for this is called the matrix 
solving time. The matrix filling and matrix solving steps are repeated with the next block, 
and so on, until all the blocks have been computed. Third, HOBBIES 
does post-processing, which calculates network parameters, currents, near field 
radiation, far field radiation and RCS. The time for this final step is called the 
post-processing time. 
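The three timed phases described above can be illustrated with a small timing harness. This is only a sketch of how per-phase times accumulate over repeated blocks; the `fill`, `solve` and `post` callbacks, the `PhaseTimer` class and the `run_blocks` function are hypothetical placeholders and do not represent the HOBBIES implementation.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock seconds per named phase."""

    def __init__(self):
        self.seconds = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add to the running total so repeated blocks accumulate.
            self.seconds[name] += time.perf_counter() - start

    def total(self):
        return sum(self.seconds.values())


def run_blocks(n_blocks, fill, solve, post, timer):
    """Fill and solve each block in turn, then post-process once."""
    for block in range(n_blocks):
        with timer.phase("filling"):
            fill(block)       # hypothetical: bring one block into RAM
        with timer.phase("solving"):
            solve(block)      # hypothetical: factor the block, write it to disk
    with timer.phase("post-processing"):
        post()                # hypothetical: currents, fields, RCS
    return timer
```

A caller would pass its own block routines and then read `timer.seconds` to obtain the filling, solving and post-processing totals, in the same spirit as the times reported by the "View Process Info" display.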
HOBBIES provides details on how the project executed in the “View Process Info” 
[5] output display. A screenshot of this display is shown in Figure 2.2. Figure 2.2 (a) 
shows the project information including number of points, elements, excitations, 
unknowns, frequency and setting for post-processing. Figure 2.2 (b) shows details of the 
post-processing including the matrix filling, matrix solving and post-processing times. 
 
 
Figure 2.2 “View Process Info” display in HOBBIES shows (a) Project information,  
(b) Post-processing setting and simulation time 
 
More detailed information about filling, solving and post-processing time is 
provided by the Cluster Ganglia Report, a monitoring program that displays the load, 
CPU usage, memory usage and network usage for the total system and for each node [8]. 
Figure 2.3 shows a screen shot of the Cluster Ganglia Report. Figure 2.4 shows a zoom-in of 
the CPU usage and Figure 2.5 shows a zoom-in of the network usage. 
 
Figure 2.3 Cluster Ganglia Report showing load and CPU, memory and network usage 
 
 
Figure 2.4 Zoom-in of Cluster Ganglia Report showing CPU usage 
 
 
Figure 2.5 Zoom-in of Cluster Ganglia Report showing Network usage 
 
In Figure 2.4 and Figure 2.5, one can distinguish three parts: matrix filling, matrix 
solving and post-processing. The first part shows 100% CPU usage while network throughput 
is almost 0 bytes per second. This part is the matrix filling process, in which the block 
matrix is read into RAM. The second part shows about 80 to 90% CPU usage with about seven 
orange colored dips. The orange color shows the CPU wait time. During that time, the 
block whose calculation is done is stored to the hard disk and the next block is readied to 
be processed by the CPU. This process repeats until the whole matrix is solved; in the 
screenshots shown, it repeats seven times. This part is the matrix solving process. The third 
part shows high network throughput, since communication is needed between processes to 
synchronize the data and calculate the results the user needs. This screen shot is taken 
using 36 cores on three Seneca Data Nexlink 7875 workstations, with two hexa-core CPUs on 
each node, connected with a Gigabit Ethernet switch. This system is described in detail in the 
next chapter. 
Another important aspect of performance is load balancing. If the computational 
load is not balanced well across the cores, the lightly loaded cores will have to wait for the more 
heavily loaded cores. This imbalance results in inefficient use of computational resources 
and wasted time. Assuming that the code is optimized, the ideal situation for best load 
balancing is to have identical configurations for the compute nodes. These are called 
homogeneous nodes, where the CPU type, number of cores, RAM type and size, hard 
disk type, RPM and size, and the network type are identical for every node. In this 
dissertation, most of the clusters have identical node specifications for optimized 
hardware load balancing. This enables the best performance with respect to the 
parameters that affect parallel CEM code. 
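The cost of an unbalanced load can be made concrete with a simple model in which each worker finishes after `load / speed` time units and the whole run finishes only when the slowest worker does. The model and the function names below are illustrative assumptions, not measurements from this research.

```python
def worker_times(loads, speeds):
    """Time each worker needs for its share of the work."""
    return [load / speed for load, speed in zip(loads, speeds)]


def parallel_time(loads, speeds):
    """The run ends when the slowest worker finishes."""
    return max(worker_times(loads, speeds))


def balance_efficiency(loads, speeds):
    """Mean busy time divided by wall time; 1.0 means perfectly balanced."""
    times = worker_times(loads, speeds)
    return (sum(times) / len(times)) / max(times)


if __name__ == "__main__":
    # Homogeneous nodes with equal loads: nobody waits.
    print(balance_efficiency([1, 1, 1, 1], [1, 1, 1, 1]))
    # One worker at half speed: the other worker idles for part of the run.
    print(parallel_time([1, 1], [1, 0.5]), balance_efficiency([1, 1], [1, 0.5]))
```

With equal loads, a single slower node lowers the efficiency of every other node, which is the motivation for using homogeneous nodes.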
 
 
  
3. COMPUTING PLATFORMS 

3.1. Overview 
In this chapter, the nine computing platforms that are used in this research are 
introduced. The specifications of each system are provided in detail, including cluster 
type, CPU type and clock speed, RAM type and size, hard disk type, rpm and size, disk 
array controller configuration, network type and bandwidth, operating system and other 
relevant characteristics. The largest NUN that can be solved on the cluster is also stated. 
Comparisons using the nine systems demonstrate how the performance of parallel CEM 
software differs according to hardware specifications. A summary of the nine computing 
platforms is provided in Table 3.1 at the end of this chapter. 

3.2. Description of computational platforms 

Figure 3.1 Cluster-1, custom-built workstation by Seneca Data 

Cluster-1, shown in Figure 3.1, is a custom-built workstation by Seneca Data. The 
cluster has one CPU, which is an Intel i7-920 quad-core 2.66GHz processor with 8 MB 
Intel® Smart Cache, Quick Path Interconnect (QPI), Max 2.93 GHz turbo boost, and 6 
GB Double Data Rate (DDR) 3 RAM. It has two 7200 rpm serial advanced technology 
attachment (SATA) hard disks, one 160 GB and one 500 GB. The operating system is 
Windows 7 Enterprise and HOBBIES 10 Professional Windows version is installed. It 
has a total of 4 cores and can solve a project up to 160,000 NUN. 
 
Figure 3.2 Cluster-2, ten Dell PowerEdge 1855 blade servers 
 
Cluster-2, shown in Figure 3.2, consists of ten Dell PowerEdge 1855 blade 
servers, each with the same configuration. Each blade has two CPUs which are Intel 
Xeon 3.6 GHz EM64T single-core processors with 2 MB L2 cache and an 800 MHz FSB, 
8 GB of RAM, and two Small Computer System Interface (SCSI) hard disk drives: one 
146 GB, 15k rpm, and another 300 GB, 10k rpm. All the nodes are interconnected with a 
Dell 5324 24-port gigabit Ethernet switch. The total number of cores is 20, total RAM 
size is 80 GB, and total hard disk size is 4.5 TB. However, projects must be run with the 
same size disk on each node, so the usable hard disk size is 1.5 TB using just the 146 
GB hard disks or 3 TB using just the 300 GB hard disks. The operating system is 
Linux, ROCKS 5.3 (Rolled Tacos), which is a Linux distribution intended for HPC based 
on CentOS 5 [9]. The system is configured to have one head node and nine compute 
nodes. HOBBIES 10 Professional Linux version is installed. With the parallel OCS, 
Cluster-2 can solve a project with up to 275,000 NUN using the 146 GB hard disks, and 
394,000 NUN using the 300 GB hard disks. 
 
Figure 3.3 Cluster-3, four Seneca Data Nexlink 7875 Series Workstations 
 
Cluster-3, shown in Figure 3.3, is made with four Nexlink 7875 Series 
workstations by Seneca Data. Each node has two CPUs, which are Intel Xeon E5430 2.66 
GHz quad-core processors with 12 MB L2 Cache and FSB, and 16 GB RAM. There are three 
300 GB Serial Attached SCSI (SAS) hard disks with 15k rpm using a redundant array of 
inexpensive (or independent) disks (RAID) controller configured to RAID 0. The nodes 
are interconnected with a DELL 2708 8-port Gigabit Ethernet switch. The total number 
of cores is 32, total RAM size is 64 GB, and total hard disk size is 3.6 TB. The operating 
system is Linux, ROCKS 5.3, and the cluster is composed of one head node and three 
compute nodes. HOBBIES 10 Professional Linux version is installed and Cluster-3 can 
solve a project with up to 470,000 NUN using the parallel OCS. 
 
Figure 3.4 Cluster-4, three Seneca Data Nexlink 7875 Series Workstations 
 
Cluster-4, shown in Figure 3.4, is also composed of Nexlink 7875 Series 
workstations by Seneca Data, three in this case; however, the CPUs are a newer generation 
of multicore with six cores instead of four. Each of the three nodes has two Intel Xeon X5670 2.93 
GHz hexa-core processors with 12 MB Intel® Smart Cache, QPI, Max 3.33 GHz turbo 
boost, and 48 GB DDR3 RAM. There are three 600 GB SAS hard disks with 15k rpm using a 
RAID controller configured to RAID 0. The nodes are interconnected with a Linksys SD 
2008 Gigabit Ethernet switch. The total number of cores is 36, the total RAM size is 144 
GB, and the total hard disk size is 5.4 TB. The operating system is ROCKS 5.3 and the 
cluster has one head node and two compute nodes. HOBBIES 10 Professional Linux 
version is installed and Cluster-4 can solve a project with up to 580,000 NUN using the 
parallel OCS. 
 
Figure 3.5 Cluster-5, one HP Proliant DL380 G5 server as head node and sixteen 
BL460c G1 blade servers as compute node 
 
Cluster-5, shown in Figure 3.5, is a cluster with one Hewlett-Packard Proliant 
DL380 as head node and sixteen HP BL460c G1 blade servers as compute nodes. The 
head node has two CPUs, which are Intel Xeon E5450 3.0 GHz quad-core EM64T processors 
with 2 × 6 MB L2 Cache and a 1333 MHz FSB, and 16 GB RAM. There are four 146 GB 10k 
rpm SAS hard disks. Two of the hard disks are configured as RAID 0 and the other two 
as RAID 1 to provide for the system backup. Each compute node also has two quad-core 
Intel Xeon E5450 processors and 16 GB of RAM. There are two 146 GB 10k rpm SAS 
hard disks configured as RAID 0. One Procurve 2824 Gigabit switch is connected for 
network management and three InfiniBand (ConnectX IB DDR) switches are connected 
to the blade servers for computing purposes. The total number of cores is 128, total 
RAM size is 256 GB, and total hard disk size is 2.4 TB. The operating system is Linux, 
HP XC 4.0, and HOBBIES 10 Professional Linux version is installed. Cluster-5 can solve 
a project with up to 380,000 NUN. 
 
Figure 3.6 Cluster-6, Dell PowerEdge 1950 Cluster Map 
 
Cluster-6 is a DELL PowerEdge 1950 cluster with one head node and 10 compute 
nodes, each with two Intel Xeon quad-core CPUs and 32 GB RAM. Four 
compute nodes have a 2.66 GHz CPU clock speed and six compute nodes have a 3.16 GHz 
CPU clock speed. Each node has two 146 GB 15k rpm SAS hard disks with a RAID 
controller configured to RAID 1. In addition, a disk array, JetStor SCSI 416S RAID, is attached 
with an LSI Logic Ultra 320 SCSI card on the head node and with Gigabit Ethernet on the head 
node and compute nodes. Qlogic Single-Port PCI 8x SDR InfiniBand cards with a Qlogic 
9024 switch are used for connecting each node. Gigabit Ethernet is also connected to 
each node using a Gigabit switch. The total number of cores is 80, total RAM size is 
320 GB, and total hard disk size is 1.46 TB. A photograph of the cluster is not provided, but the 
cluster map is shown in Figure 3.6. The operating system is Linux and the HOBBIES Linux 
version is installed. It can solve a project with up to 300,000 NUN. 
Cluster-7 has the same specification as Cluster-5 but with 144 compute nodes. 
Cluster-5 only needs one blade server chassis and one equipment rack for sixteen 
compute nodes. Cluster-7 requires nine blade server chassis (each holds 16 blade server 
compute nodes) and four equipment racks. The first rack has the HP Proliant DL380 G5 
as head node and one chassis with sixteen HP Proliant BL460c G1 blades. The second and third 
racks have three chassis each, for a total of ninety-six HP Proliant BL460c G1 blades. The fourth 
rack has two chassis, for a total of thirty-two HP Proliant BL460c G1 blades. In total, Cluster-7 has one 
head node and one hundred forty-four compute nodes with 1,152 cores, 2,304 GB RAM, and 
36 TB of hard disk for matrix calculation. The operating system is HP XC 4.0 and 
HOBBIES 10 Professional is installed. A project with 380,000 NUN can be solved in 
RAM using the HOBBIES parallel ICS. A very large project of 1.5 million NUN can be 
solved with all the hard disks using the HOBBIES parallel OCS. 
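The totals quoted for Cluster-7 follow from the rack layout and the per-node specification given above (two quad-core CPUs and 16 GB of RAM per compute node). The short check below reproduces that arithmetic; the function name and dictionary keys are chosen here for illustration.

```python
def cluster_totals(chassis_per_rack, blades_per_chassis,
                   cpus_per_node, cores_per_cpu, ram_gb_per_node):
    """Aggregate compute-node resources from the rack layout."""
    nodes = sum(chassis_per_rack) * blades_per_chassis
    return {
        "compute_nodes": nodes,
        "cores": nodes * cpus_per_node * cores_per_cpu,
        "ram_gb": nodes * ram_gb_per_node,
    }


# Four racks holding 1, 3, 3 and 2 chassis of 16 blades each.
cluster_7 = cluster_totals([1, 3, 3, 2], 16, cpus_per_node=2,
                           cores_per_cpu=4, ram_gb_per_node=16)
print(cluster_7)
```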
 
Figure 3.7 Cluster-8, Dell PowerEdge T410 server and Dell Precision T5500 
 
Cluster-8, shown in Figure 3.7, consists of one Dell PowerEdge T410 server as 
head node and five Dell Precision T5500 servers as compute nodes. Each node has one 
CPU, an Intel Xeon X5650 hexa-core 2.66 GHz processor with max 3.06 GHz turbo boost, and 24 GB 
DDR3 RAM. Each node has a single 500 GB SATA 7200 rpm hard disk. All the nodes 
connect to a Gigabit switch. The total number of cores is 36, total RAM size is 144 GB, 
and total hard disk size is 3 TB. The operating system is Linux, ROCKS 5.4.3 (Viper), 
and HOBBIES 10 Professional Linux version is installed. Cluster-8 can solve a project 
with up to 390,000 NUN. 
  
Figure 3.8 Cluster-9, Seneca Data VFX 9300 

Cluster-9, shown in Figure 3.8, consists of one Seneca Data VFX 9300. This 
system has two CPUs, Intel deca-core Xeon E5 2687Wv3 3.1 GHz, 256 GB DDR4 
RAM and five 600 GB SAS 15k rpm hard disks with RAID 0. Total number of cores is 20, 
total RAM size is 256 GB, and total hard disk size is 3 TB. The operating system is Linux, 
ROCKS 6.2 (Sidewinder) and HOBBIES 10 Professional Linux version is installed. 
Cluster-9 can solve a project up to 400,000 NUN. 
Table 3.1 below provides a summary of the specifications on the nine clusters 
used in this research and detailed in the paragraphs above. 

Table 3.1 Summary of the computing platforms used 
Platform | Operating System | Model (Number of Nodes) | Processor | RAM (GB) | HDD/node (GB) 
Cluster-1 | Windows 7 Professional | Seneca Data (1) | Intel i7-920, quad-core, 1 CPU, 2.66 GHz (Max 2.93 GHz) | DDR3, 6 | SATA, 160, 500 
Cluster-2 | Linux (ROCKS 5.3) | Dell PowerEdge 1855 blade (10) | Intel Xeon, single-core, 20 CPUs, 3.6 GHz | DDR, 80 | SCSI, 146, 300 
Cluster-3 | Linux (ROCKS 5.3) | Seneca Data Nexlink 7875 (4) | Intel Xeon E5430, quad-core, 8 CPUs, 2.66 GHz | DDR, 64 | SAS, 300 × 3 (RAID0) 
Cluster-4 | Linux (ROCKS 5.3) | Seneca Data Nexlink 7875 (3) | Intel Xeon X5670, hexa-core, 6 CPUs, 2.93 GHz (Max 3.33 GHz) | DDR3, 144 | SAS, 600 × 3 (RAID0) 
Cluster-5 | Linux (HP XC 4.0) | HP ProLiant Server DL380 (1), BL460c (16) | Intel Xeon E5450, quad-core, 32 CPUs, 3.0 GHz | DDR, 256 | SAS, 146 × 2 (RAID0) 
Cluster-6 | Linux | DELL PowerEdge 1950 (6, 4) | Intel Xeon quad-core, 2 CPUs, 2.66 GHz; 2 CPUs, 3.16 GHz | DDR, 32 | SAS, 146 × 2 (RAID1), Disk array 
Cluster-7 | Linux (HP XC 4.0) | HP ProLiant DL380 (1), BL460c (144) | Intel Xeon E5450, quad-core, 288 CPUs, 3.0 GHz | DDR, 2,304 | SAS, 146 × 2 (RAID0) 
Cluster-8 | Linux (ROCKS 5.4.3) | DELL T410 (1), T5500 (5) | Intel Xeon X5650, hexa-core, 6 CPUs, 2.66 GHz (Max 3.06 GHz) | DDR3, 144 | SATA, 500 
Cluster-9 | Linux (ROCKS 6.2) | Seneca Data VFX 9300 | Intel Xeon E5 2687Wv3, deca-core, 2 CPUs, 3.1 GHz (Max 3.5 GHz) | DDR4, 256 | SAS, 600 × 5 (RAID0) 
 
  
4. THE PARAMETERS THAT AFFECT PARALLEL PROCESSING FOR CEM 
SIMULATION CODES ON HPCC 
 
4.1. Overview 
In the book, “Parallel Solution of Integral Equation-Based EM Problems in the 
Frequency Domain” [1], a parallel CEM code is successfully executed on a number of 
computer platforms and performance comparisons are made. Relevant 
hardware and software parameters, such as single core vs. multicore, choice of operating 
system, block size, the shape of the process grid, the amount of storage used on the hard disk, 
RAM type and size, and more, are optimized to improve the performance. The time when that 
book was written coincided with the transition of CPU architecture from single-core to 
multicore. Since that time, computer hardware has improved and there are new 
parameters to investigate. New technologies for multicore CPUs have been developed 
and these are analyzed and optimized in this research. The goal is to maximize 
system performance, and this dissertation updates and extends the work achieved in the 
book. 
In this chapter, the performance of parallel CEM software is explored on the nine 
computing platforms described in the previous chapter. The parallel CEM software 
package called HOBBIES is used to create models of five structures. Model 1 is a small 
aircraft modeled as a half structure with symmetry. Model 2 is a larger aircraft, fully modeled. 
Model 3 is a full-size Global Hawk UAV (unmanned aerial vehicle) with a dipole 
antenna array under its wings. Models 4 and 5 each consist of two antennas in a non-anechoic 
room. Model 4 has two horn antennas and Model 5 has one horn antenna and one helical 
antenna. 
HOBBIES performance results on the nine HPCCs are presented to compare hardware 
parameters, including CPU, RAM, hard disk, and network devices. The 
effect of Intel Hyper-Threading technology on workstation and blade cluster performance 
is also shown. Comparisons of simulation times between single-core CPU systems and 
multicore CPU systems are provided. As CPU technologies developed, new features were 
introduced, such as Intel Turbo Boost and Intel Quick Path Interconnect. New generations 
of multicore CPUs include an integrated memory controller with three memory channels, which 
yields more bandwidth than prior CPU architectures that used a separate memory 
controller outside of the processor. Since HOBBIES is using OCS, hard disk speed and interfaces are 
important parameters that impact performance. The performance of two different 
networks is shown: Gigabit Ethernet and InfiniBand. The effect of the number of 
cores is also presented, and investigations on the process grid size and orientation are provided. 
Since the price of RAM has gone down, the performance using more than 4 GB RAM as 
IASIZE is presented. 
 
4.2. Intel Hyper-Threading Technology 
Intel Hyper-Threading (HT) is a technology that enables multiple threads to run 
on each core and helps to use processor resources more efficiently. As a performance 
feature, Intel HT Technology also increases processor throughput and improves overall 
performance on threaded software [10]. Basically, it works as a virtual multicore processor 
by presenting one physical core as two logical cores (threads). One can enable HT 
through the Basic Input/Output System (BIOS) setting. Figure 4.1 is a screen shot 
showing how the BIOS setup, “Intel (R) Hyper-Threading Tech” can be enabled and 
disabled to turn HT on and off. 

Figure 4.1 HT control in BIOS setup on Cluster-2 

HOBBIES is running with 100% CPU loading during matrix filling and at least 
80% loading on matrix solving. Perhaps there is no room for the improvement that HT 
offers. How HT works for HOBBIES in HPCC is investigated. Several cases of 
scattering on an aircraft model are simulated with and without HT, comparing single node 
and multiple node systems. 

4.2.1. HT monitoring with Ganglia 
In this section, the monitoring software Ganglia [8] is used to show how HT is 
working on the system before doing the actual comparison with and without HT. Figures 
4.2 to 4.5 show screen shots of the Ganglia GUI which displays the details on the cluster 
performance parameters. Information about CPU, memory and network performance is 
shown for a project running on Cluster-2. Figure 4.2 shows that Cluster-2 has 10 nodes, 
one head node and nine compute nodes, numbered from 0-0 to 0-8. Each node has two 
single-core Intel Xeon CPUs. Cluster-2 has 20 physical CPUs, which can be seen in the 
top left corner of the display, circled in red. Each physical core corresponds to 1 thread, 
so with HT off, there are 20 processing threads. Figure 4.4 shows the same Cluster-2, still 
with just 20 physical cores. However, in the top left corner, circled in red, it appears that 
there are 40 CPUs. This is because HT is turned on and there are 2 threads per core, for a 
total of 40 threads. 
Figure 4.2 shows that Cluster-2 with HT off is running HOBBIES using 16 out of 
20 cores (threads) on 8 of the 10 nodes. The top right graph shows that 
Cluster-2 CPU usage for the last hour is 80%, which means the 8 nodes are fully loaded, as 
expected. Figure 4.3 shows the graph of CPU usage for the last hour of the first compute 
node (compute-0-0.local). The first 15 minutes of the plot show 100% CPU usage. 
This is the filling time and shows that the CPU is fully loaded. 
Figure 4.4 shows that Cluster-2 with HT on is running HOBBIES using 32 out of 
40 threads on 8 of the 10 nodes. The top right graph shows that Cluster-2 
CPU usage for the last hour is 40% during the filling time, which means the 8 nodes are not 
fully loaded. If they were fully loaded, CPU usage would be 80%, since 8 out of 
10 nodes are used. Figure 4.5 presents the graph of CPU usage for the last hour of the first 
compute node (compute-0-0.local), and it shows 50% CPU usage, which means it 
is not fully loaded even though all threads are used. 
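The relation between the per-node and cluster-wide percentages read from Ganglia can be checked with a one-line calculation. It assumes that the cluster-wide figure is a plain average over all 10 nodes and that idle nodes contribute 0%; the function name is chosen here for illustration.

```python
def cluster_cpu_usage(nodes_used, nodes_total, per_node_usage_percent):
    """Cluster-wide CPU usage when only some nodes are busy (idle nodes count as 0%)."""
    return nodes_used * per_node_usage_percent / nodes_total


# HT off: 8 of 10 nodes at 100% during filling.
print(cluster_cpu_usage(8, 10, 100))
# HT on: 8 of 10 nodes at 50% during filling.
print(cluster_cpu_usage(8, 10, 50))
```

Under that assumption the two cases give 80% and 40%, matching the cluster-wide readings described for Figures 4.2 and 4.4.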
  
Figure 4.2 Ganglia overview of Cluster-2 with HT off 

Figure 4.3 Ganglia CPU usage report for Cluster-2, the first compute node HT off 

Figure 4.4 Screen shot of Ganglia overview of Cluster-2 with HT on 

Figure 4.5 Ganglia CPU usage report for Cluster-2, the first compute node HT on 

These figures show that with HT on, the number of cores on each node shows four, 
but it still only has two physical cores. The cores which show as the threads added by HT 
are not physical cores. Do not be misled by the information from the Ganglia monitor 
display. As mentioned previously, HT is multi-threading within a physical core, and 
those threads share the system bus, which is limited and can become a bottleneck 
when threads communicate with RAM and I/O. HT needs to be tested to verify that it will 
improve the performance of parallel CEM software. 
Not all Intel Xeon server CPUs have HT. For example, the Intel Xeon E5430 of 
Cluster-3, the Intel Xeon X5670 of Cluster-4, and the Intel Xeon E5450 of Cluster-5 and 
Cluster-7 do not have HT available on their CPUs. 
 
4.2.2. Comparison: without and with HT on Cluster-1 (single node platform) 
To check the performance of HT, a simulation is run on the same system with and 
without HT, and the total simulation times are compared. Cluster-1 is a single node with 
one Intel i7 quad-core processor with HT. This provides 4 threads without HT and 8 
threads with HT (two threads per physical core). The scattering of an aircraft is simulated. 
Instead of modeling the whole aircraft, half of the structure is modeled and the symmetry 
function is used for the simulation. This cuts the NUN in half but gives the same 
simulation result as the full size aircraft. The half aircraft using symmetry is Model 1-1. 
Model 1-1 is run on Cluster-1 with an operating frequency of 1.85 GHz, giving 50,305 NUN. 
The size of the full aircraft is 11.6 m in length, 7.0 m in width and 2.92 m in height, which is 
about 71.6λ in length, 43.2λ in width, and 18λ in height. Model 1-2 is the same half 
aircraft with symmetry model, but is run with a higher operating frequency of 2.25 GHz, 
to give a higher NUN of 75,753. The electrical size is about 87λ in length, 52.5λ in width 
and 21.9λ in height. Two views of the half aircraft model are shown in Figures 4.6 and 
4.7. To show the simulations are valid and correct, the results, Radar Cross Section (RCS) 
of the aircraft, are compared and presented in Figure 4.8. Table 4.1 is a comparison of the 
total simulation times with and without HT. 
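The electrical sizes quoted above are the physical dimensions divided by the free-space wavelength, λ = c / f. The sketch below reproduces that conversion; the function names are chosen here for illustration, and the speed of light is taken as its exact defined value.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s


def wavelength_m(freq_hz: float) -> float:
    """Free-space wavelength in meters."""
    return C / freq_hz


def electrical_size(length_m: float, freq_hz: float) -> float:
    """Physical length expressed in wavelengths."""
    return length_m / wavelength_m(freq_hz)


if __name__ == "__main__":
    aircraft_m = {"length": 11.6, "width": 7.0, "height": 2.92}
    for freq_hz in (1.85e9, 2.25e9):
        sizes = {k: round(electrical_size(v, freq_hz), 1) for k, v in aircraft_m.items()}
        print(freq_hz, sizes)
```

Raising the frequency shortens the wavelength, so the same structure becomes electrically larger and needs more unknowns, which is how the NUN is varied between Model 1-1 and Model 1-2.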
 
 
Figure 4.6 Aircraft model, a half structure with symmetry 
 
 
Figure 4.7 Top view of a half aircraft model with symmetry 
 
 
Figure 4.8 Simulation results, RCS (Phi Cut, Phi=90°) of Models 1-1 and 1-2 with and 
without HT 
 
Table 4.1 Simulation times: with and without HT on Cluster-1 
Model | HT | NUN | Threads | Filling (min) | Solving (min) | Total (min) 
Model 1-1 | On | 50305 | 8 | 37 | 878 | 926 
Model 1-1 | Off | 50305 | 4 | 37 | 990 | 1041 
Model 1-2 | On | 75753 | 8 | 113 | 11076 | 11210 
Model 1-2 | Off | 75753 | 4 | 93 | 9317 | 9431 
 
In Table 4.1, for Model 1-1, the simulation with HT is faster by 115 minutes in 
total simulation time, and for Model 1-2, the simulation without HT is faster by 1,779 
minutes in total simulation time. Ideally, the simulation time decreases in inverse 
proportion to the number of cores, so the total simulation time using 8 cores should be half of 
the total simulation time using 4 cores. For Model 1-1, the simulation with HT is faster, but it 
does not take half the time of the simulation without HT. For Model 1-2, the total time with HT is 
worse than the total time without HT. This is a single node system using Windows 7, 
which means that the operating system itself takes up more of the resources. 
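The comparison with the ideal case can be expressed as a speedup ratio, the total time without HT divided by the total time with HT, for which doubling the thread count would ideally give 2.0. The totals below are copied from Table 4.1; the variable and function names are chosen here for illustration.

```python
# Total simulation times in minutes from Table 4.1: (HT off, HT on).
TOTALS_MIN = {
    "Model 1-1": (1041, 926),
    "Model 1-2": (9431, 11210),
}

IDEAL_SPEEDUP = 2.0  # 8 threads instead of 4


def ht_speedup(t_off: float, t_on: float) -> float:
    """Values above 1 mean HT helped; values below 1 mean HT hurt."""
    return t_off / t_on


for model, (t_off, t_on) in TOTALS_MIN.items():
    s = ht_speedup(t_off, t_on)
    print(f"{model}: saved {t_off - t_on} min, speedup {s:.2f} (ideal {IDEAL_SPEEDUP})")
```

For Model 1-1 the ratio is only slightly above 1, far from the ideal, and for Model 1-2 it falls below 1, which is the slowdown noted in the text.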
 
4.2.3. Comparison: without and with HT on Cluster-2 (multiple nodes platform) 
In this section, Cluster-2, which has 10 nodes, is used to compare code 
performance with and without HT. Each of the 10 nodes has two Intel Xeon single-core 
CPUs with HT capability. Cluster-2 can run using 20 threads without HT and 40 threads 
with HT. Simulations with more NUN are used to show the performance difference more 
clearly. Model 1, the half aircraft with symmetry, is again used, except here with three 
different operating frequencies. Model 1-3 has 50,844 NUN with a 1,875 MHz operating 
frequency, Model 1-4 has 100,463 NUN with 2,600 MHz and Model 1-5 has 204,331 NUN with 
3,850 MHz. The electrical size of the aircraft according to operating frequency is shown in 
Table 4.2. 
Table 4.2 NUN and electrical size of aircraft model with three different frequencies 
Model | Operating Frequency (MHz) | NUN | Length (λ) | Width (λ) | Height (λ) 
Model 1-3 | 1,875 | 50844 | 72.50 | 43.75 | 18.25 
Model 1-4 | 2,600 | 100463 | 100.53 | 60.67 | 25.31 
Model 1-5 | 3,850 | 204331 | 148.87 | 89.83 | 37.47 
 
Model 1-3 is simulated with seven different configurations: 1) two threads using 
one node with HT off, 2) four threads using one node with HT on, 3) four threads using two 
nodes with HT off, 4) eight threads using two nodes with HT on, 5) eight threads using four 
nodes with HT off, 6) sixteen threads using four nodes with HT on, and 7) sixteen threads 
using eight nodes with HT off. Model 1-4 is simulated with four different configurations: 
1) eight threads using two nodes with HT on, 2) eight threads using four nodes with HT off, 3) 
sixteen threads using four nodes with HT on, and 4) sixteen threads using eight nodes with 
HT off. Model 1-5 is simulated with three different configurations: 1) sixteen threads using 
four nodes with HT on, 2) sixteen threads using eight nodes with HT off, and 3) thirty-two 
threads using eight nodes with HT on. To show the simulations are valid and correct, the 
results, RCS of the aircraft, are compared and presented in Figure 4.9. All the simulation 
times are presented in Table 4.3. 
 
Figure 4.9 Simulation results, RCS (Phi cut, Phi=90°) of Models 1-3, 1-4, and 1-5 with 
and without HT 
 
 
 
 
 
 
 
 
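Table 4.3 Simulation times: with and without HT on Cluster-2 
Model | HT | NUN | Threads | Nodes | Filling (min) | Solving (min) | Total (min) 
Model 1-3 | Off | 50844 | 2 | 1 | 69 | 554 | 653 
Model 1-3 | On | 50844 | 4 | 1 | 95 | 592 | 717 
Model 1-3 | Off | 50844 | 4 | 2 | 48 | 272 | 335 
Model 1-3 | On | 50844 | 8 | 2 | 53 | 336 | 406 
Model 1-3 | Off | 50844 | 8 | 4 | 27 | 142 | 177 
Model 1-3 | On | 50844 | 16 | 4 | 31 | 169 | 209 
Model 1-3 | Off | 50844 | 16 | 8 | 15 | 76 | 96 
Model 1-4 | On | 100463 | 8 | 2 | 271 | 2711 | 3016 
Model 1-4 | Off | 100463 | 8 | 4 | 137 | 1150 | 1304 
Model 1-4 | On | 100463 | 16 | 4 | 144 | 1194 | 1358 
Model 1-4 | Off | 100463 | 16 | 8 | 72 | 548 | 630 
Model 1-5 | On | 204331 | 16 | 4 | 973 | 11464 | 12481 
Model 1-5 | Off | 204331 | 16 | 8 | 490 | 4860 | 5370 
Model 1-5 | On | 204331 | 32 | 8 | 507 | 5573 | 6106 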
 
Comparisons are first made between runs with the same number of threads, with and 
without HT. The total simulation time of Model 1-3 using 4 threads without HT is faster by 382 
minutes than that of Model 1-3 using 4 threads with HT. Similar results are 
shown in the comparisons between the same numbers of threads in Model 1-3, Model 1-4 and 
Model 1-5. Comparisons are then made between runs with the same number of nodes, with and 
without HT. The total simulation time of Model 1-3 using 2 threads on 1 node without HT is 
faster by 64 minutes than that of Model 1-3 using 4 threads on 1 node with HT. Similar 
results are shown in the comparisons between the same numbers of nodes in Model 1-3, Model 
1-4 and Model 1-5. 
As mentioned previously, HT is a multiple threading technology within a 
physical core. Through these comparisons, one can observe that running with HT 
degrades the performance even though the number of threads is twice the number 
running without HT. 
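The two kinds of comparison described above can be reproduced from the total times in Table 4.3 by pairing each HT-off run with the HT-on run that has either the same thread count or the same node count. The sketch below does that pairing; the data are copied from the table, while the function and variable names are chosen here for illustration.

```python
# (model, HT, threads, nodes, total minutes) copied from Table 4.3.
ROWS = [
    ("Model 1-3", "Off", 2, 1, 653), ("Model 1-3", "On", 4, 1, 717),
    ("Model 1-3", "Off", 4, 2, 335), ("Model 1-3", "On", 8, 2, 406),
    ("Model 1-3", "Off", 8, 4, 177), ("Model 1-3", "On", 16, 4, 209),
    ("Model 1-3", "Off", 16, 8, 96),
    ("Model 1-4", "On", 8, 2, 3016), ("Model 1-4", "Off", 8, 4, 1304),
    ("Model 1-4", "On", 16, 4, 1358), ("Model 1-4", "Off", 16, 8, 630),
    ("Model 1-5", "On", 16, 4, 12481), ("Model 1-5", "Off", 16, 8, 5370),
    ("Model 1-5", "On", 32, 8, 6106),
]


def paired_totals(rows, by):
    """Return {(model, key): (total_off, total_on)} for runs sharing `by`.

    `by` is "threads" or "nodes"; keys that lack either run are dropped.
    """
    grouped = {}
    for model, ht, threads, nodes, total in rows:
        key = (model, threads if by == "threads" else nodes)
        grouped.setdefault(key, {})[ht] = total
    return {k: (v["Off"], v["On"]) for k, v in grouped.items() if len(v) == 2}


for by in ("threads", "nodes"):
    for (model, key), (t_off, t_on) in paired_totals(ROWS, by).items():
        print(f"{model}, {key} {by}: HT off {t_off} min, HT on {t_on} min, "
              f"difference {t_on - t_off} min")
```

In every pair found this way the HT-off run has the shorter total time, in line with the observation that HT degrades performance for this code on Cluster-2.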
4.3. Single-core system versus multicore system 
In this section, a cluster with single-core CPUs and a cluster with multicore 
CPUs are compared. The advantage of the single-core CPU is clock speed, since 
multicore CPUs did not have faster clock speeds than single-core CPUs. Another advantage is 
that more system bus capacity is available per core than in a multicore system. A multicore CPU 
system has an equal or greater total number of system buses than a single-core CPU system, but 
per core, a multicore CPU has fewer buses than a single-core 
CPU. In the worst case, cores share the bus for system communication, which causes a 
bottleneck. On the other hand, an advantage of multicore is less communication among 
systems connected with networks in parallel computing. Network communication means 
communication among nodes. During the simulation, data is distributed to the cores 
and, when computing is done, data is collected. For the same number of cores, a multicore 
system needs fewer nodes than a single-core system, so less data crosses the network. 
Even though the multicore CPU has a lower clock speed than the single-core CPU, the clock 
speed is not the only parameter that affects the performance of parallel CEM tools. These 
aspects are considered when the performance comparison is made in the next section. 
 
4.3.1. Comparison using 16 cores: Cluster-2 and Cluster-3 
Cluster-2 is the single-core system with 20 cores and Cluster-3 is the quad-core 
system with 32 cores. The same simulation is run using 16 cores on each platform to 
compare the performance between the single-core system and the quad-core system. The 
scattering from a full-size aircraft, shown in Figure 4.10, is simulated. Model 2 is the 
full-size aircraft structure, which is 17.32 m in length, 11.4 m in width, and 3.7 m in 
height. It is modeled as a Perfect Electric Conductor (PEC). Different frequencies are 
used to scale the electrical size of the model, as was done in the previous sections. 
 
Figure 4.10 Full size aircraft, Model 2 
 
Four simulations are run on Cluster-2 and Cluster-3 using 16 cores each. Each 
simulation uses the same Model 2 structure but a different frequency to vary NUN. For 
the analysis, NUN is set to about 50k, 100k, 150k, and 200k. For Model 2-1, the 
operating frequency is 305 MHz and NUN is 50,104. For Model 2-2, the operating 
frequency is 450 MHz and NUN is 103,638. For Model 2-3, the operating frequency is 
800 MHz and NUN is 150,189. For Model 2-4, the operating frequency is 1.1 GHz and 
NUN is 200,622. The electrical size of the aircraft is shown in Table 4.4 and the RCS 
results are shown in Figure 4.11. The RCS results from the two clusters are reasonable 
and overlap each other. The simulation times are listed for comparison in Table 4.5. 
Figure 4.12 shows the simulation times on Cluster-2 and Cluster-3 for each model. 
Table 4.4 NUN and electrical size of Model 2 with four operating frequencies
Model      Operating Frequency (MHz)  NUN      Length (λ)  Width (λ)  Height (λ)
Model 2-1  305                        50,104   17.61       11.59      3.76
Model 2-2  450                        103,638  25.98       17.10      5.55
Model 2-3  800                        150,189  46.19       30.40      9.87
Model 2-4  1100                       200,622  63.51       41.80      13.57
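
The electrical sizes in Table 4.4 follow from dividing each physical dimension by the free-space wavelength, λ = c/f. The sketch below reproduces the table entries under the assumption that the speed of light is rounded to 3.0e8 m/s, which is the value that matches the tabulated numbers; the function name is illustrative.

```python
C = 3.0e8  # assumed (rounded) speed of light in m/s

# Physical dimensions of Model 2 in metres, from the text.
LENGTH_M, WIDTH_M, HEIGHT_M = 17.32, 11.4, 3.7


def electrical_size(dimension_m, freq_hz):
    """Return a physical dimension expressed in free-space wavelengths."""
    wavelength_m = C / freq_hz
    return dimension_m / wavelength_m


for name, freq_mhz in [("Model 2-1", 305), ("Model 2-2", 450),
                       ("Model 2-3", 800), ("Model 2-4", 1100)]:
    f = freq_mhz * 1e6
    sizes = [electrical_size(d, f) for d in (LENGTH_M, WIDTH_M, HEIGHT_M)]
    print(name, " ".join(f"{s:.2f}" for s in sizes))
```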
 
 
Figure 4.11 Simulation results, RCS of Models 2-1, 2-2, 2-3, and 2-4 on Cluster-2 and 
Cluster-3 
 
 
 
 
 
 
 
 
 
 
Table 4.5 Simulation times using 16 cores: Cluster-2 and Cluster-3
Model      System     NUN     CPU type     Filling (min)  Solving (min)  Total (min)
Model 2-1  Cluster-2  50104   Single-core  64             69             132
           Cluster-3  50104   Quad-core    39             62             101
Model 2-2  Cluster-2  103638  Single-core  239            599            838
           Cluster-3  103638  Quad-core    145            510            655
Model 2-3  Cluster-2  150189  Single-core  555            1829           2384
           Cluster-3  150189  Quad-core    336            1507           1843
Model 2-4  Cluster-2  200622  Single-core  1011           4432           5443
           Cluster-3  200622  Quad-core    621            3553           4174
 
 
Figure 4.12 Graph of simulation times (minutes): Cluster-2 and Cluster-3 
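
One way to read the growth of simulation time with NUN is to estimate an empirical scaling exponent p, assuming time is proportional to NUN^p between two measurements. The sketch below applies this to the Cluster-3 rows of Table 4.5. The expectation that dense matrix filling grows roughly as NUN^2 and a direct dense solve roughly as NUN^3 is a general assumption about dense-matrix solvers and is not stated in this section; the function name is illustrative.

```python
import math


def scaling_exponent(n_small, t_small, n_large, t_large):
    """Exponent p such that t_large / t_small == (n_large / n_small) ** p."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


# Cluster-3 values from Table 4.5 (Model 2-1 and Model 2-4).
NUN_SMALL, NUN_LARGE = 50104, 200622
fill_exp = scaling_exponent(NUN_SMALL, 39, NUN_LARGE, 621)
solve_exp = scaling_exponent(NUN_SMALL, 62, NUN_LARGE, 3553)

print(f"filling exponent ~ {fill_exp:.2f}")
print(f"solving exponent ~ {solve_exp:.2f}")
```

Under these assumptions the solving stage dominates as NUN grows, so a fixed percentage advantage of one cluster translates into a widening gap in minutes.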
 
Cluster-3, the quad-core system, is faster than Cluster-2, the single-core system, 
by 31 minutes for Model 2-1, 183 minutes for Model 2-2, 541 minutes for Model 2-3 and 
1,269 minutes for Model 2-4. Figure 4.12 shows that as NUN increases, the time 
difference between Cluster-2 and Cluster-3 also increases. What causes these 
performance differences between the two systems? The Intel website provides detailed 
CPU specifications and a comparison of selected CPUs [11]. One also needs to check the 
other hardware on each system that affects the performance. The hardware specifications 
of the two clusters are compared side by side in Table 4.6 and the highlighted cells have 
the better performance. 
Table 4.6 CPU specification details: Cluster-2 and Cluster-3
                Cluster-2 (Single-core)    Cluster-3 (Quad-core)
CPU type        Intel Xeon                 Intel Xeon E5430
CPU Speed       3.6 GHz                    2.66 GHz
Cache           2 MB L2                    12 MB L2
System Bus      800 MHz                    1333 MHz
RAM             4 GB / Core                2 GB / Core
Hard Disk type  SCSI                       SAS
RAID            No RAID controller         RAID 0
Network         8 nodes, Gigabit Ethernet  2 nodes, Gigabit Ethernet
 
The CPU clock speed of Cluster-2 is faster than that of Cluster-3, but the front-side 
bus (FSB) of Cluster-3 is faster than that of Cluster-2. Cluster-2 has more RAM per core 
than Cluster-3. Cluster-3 has SAS hard disks with a RAID controller and Cluster-2 has 
SCSI hard disks without a RAID controller. Comparing the maximum throughput of 
SCSI and SAS, SAS is faster. 
In conclusion, the multicore system performs better than the single-core system in 
every simulation with parallel processing, despite its slower CPU clock speed and smaller 
RAM size per core. Before multicore systems came out, CPU speed and RAM size were 
the dominant factors in computing performance, but now every aspect needs to be 
considered for parallel computing. For example, the number of cores became a very 
important parameter for the performance of an HPCC using parallel computing. The read 
and write speed of storage is also important, since OCS uses storage for the dense matrix. 
These clusters are not single-node systems. They have multiple nodes, so the bandwidth 
of the network also becomes important in parallel computing. 
 
4.3.2. Comparison between Cluster-2 and Cluster-3 using same number of nodes 
In the previous section, the comparison between the single-core system and the 
multicore system showed that CPU speed is not the only parameter that affects 
performance. The type of RAM is determined by the CPU type, so in most cases one can 
only vary the size of RAM, which decides the size of the problem that can be solved with 
ICS and works as a buffer for OCS. Optimizing IASIZE for OCS has already been done 
in the parallel book [1]. There is also a network effect. As mentioned in the previous 
section, the network is the connection among nodes. One advantage of a multicore 
system is that it needs less network communication than a single-core system, since the 
system bus is faster than the network and more bus connections than network 
connections are available per core. The network became a more important aspect for the 
parallel code when multicore systems are used. 
Comparisons are made on each cluster using one node and using four nodes, with 
the same number of cores per node. First, two cores on one node are used on Cluster-2 
and Cluster-3. Model 2-1 with 50,104 NUN and Model 2-2 with 103,638 NUN are 
simulated, and the simulation times are presented in Table 4.7. 
Table 4.7 Simulation times using 2 cores (one node): Cluster-2 and Cluster-3
Model      System     NUN     Cores  Nodes  Filling (min)  Solving (min)  Total (min)
Model 2-1  Cluster-2  50104   2      1      246            481            797
           Cluster-3  50104   2      1      204            320            572
Model 2-2  Cluster-2  103638  2      1      1001           4514           5614
           Cluster-3  103638  2      1      975            3125           4167
 
The times are compared between Cluster-2 and Cluster-3 for simulations using one 
node, which means that the simulation time includes no network time. For the 50,104 
NUN model, Cluster-3 is faster than Cluster-2 and the difference in total simulation time 
is 225 minutes. For the 103,638 NUN model, Cluster-3 is faster than Cluster-2 and the 
difference in total simulation time is 1,447 minutes. Although Cluster-2 has a faster CPU 
clock speed and more RAM for OCS, Cluster-3, the multicore system, shows better 
performance than Cluster-2, the single-core system, without network communication. 
Further performance comparisons are made between Cluster-2 and Cluster-3 with 
the same network effect. Each cluster uses four nodes with two cores per node, so a total 
of 8 cores are used for the simulation. Model 2-1 with 50,104 NUN and Model 2-2 with 
103,638 NUN are simulated. Simulation times are presented in Table 4.8. 
 
 
 
 
 
 
Table 4.8 Simulation times using 8 cores (four nodes): Cluster-2 and Cluster-3
Model      System     NUN     Cores  Nodes  Filling (min)  Solving (min)  Total (min)
Model 2-1  Cluster-2  50104   8      4      77             138            235
           Cluster-3  50104   8      4      58             87             159
Model 2-2  Cluster-2  103638  8      4      267            1165           1463
           Cluster-3  103638  8      4      234            833            1086
 
For the 50,104 NUN model, the total simulation time on Cluster-3 is shorter than 
on Cluster-2 by 76 minutes (235 versus 159 minutes in Table 4.8). For the 103,638 NUN 
model, the total simulation time on Cluster-3 is shorter than on Cluster-2 by 377 minutes. 
These results are similar to the previous comparison. Although Cluster-2 has the 
advantages of CPU clock speed and RAM size, the multicore system is faster than the 
single-core system. 
 
4.4. New generation of multicore 
When multicore CPUs came out, the first concern was the system bus, since the 
number of buses per core of a multicore CPU is much smaller than that of a single-core 
CPU, which could create a bottleneck. As time went by, new-generation multicore CPUs 
arrived with new technologies called Turbo Boost and QuickPath Interconnect (QPI). 
Intel Turbo Boost Technology provides more performance when needed on new-generation 
Intel Core processor-based systems. Intel Turbo Boost Technology 2.0 
automatically allows processor cores to run faster than the base operating frequency if 
they are operating below power, current, and temperature specification limits [12]. Intel 
QPI provides high-speed, point-to-point links inside and outside of the processor. These 
links speed up data transfers by connecting distributed shared memory, the internal cores, 
the I/O hub, and other Intel processors. Intel QPI accelerates the flow of data by 
introducing multiple pairs of high-speed serial links. By grouping several paths together 
with intelligent management, it can deliver up to 25.6 GB/s between components and 
maintain communications even when a link fails. Figure 4.13 shows the configuration of 
Intel QPI and Figure 4.14 shows the block diagram of a processor with Intel QPI [13]. 
New-generation multicore CPUs also include integrated memory controllers with 
channels that connect to the DIMM slots. This means each processor has direct access to 
its own memory. Integrated controllers yield more bandwidth than prior CPU 
architectures, which had a separate memory controller outside of the processor. Figure 
4.15 shows the memory architecture of the new generation of multicore CPUs [14]. 
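
The 25.6 GB/s figure quoted for QPI can be related to a link's transfer rate and width. The sketch below is a rough peak-bandwidth calculation under assumptions that are not taken from this dissertation: a QPI link running at 6.4 GT/s carrying 2 bytes of data per transfer in each of two directions, and a 1333 MT/s front-side bus that is 8 bytes wide. These are theoretical peaks, not measured throughput.

```python
def peak_gb_per_s(transfers_per_s, bytes_per_transfer, directions=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return transfers_per_s * bytes_per_transfer * directions / 1e9


# Assumed link parameters (publicly quoted figures, not from this section).
QPI_GB_S = peak_gb_per_s(6.4e9, 2, directions=2)   # both directions combined
FSB_GB_S = peak_gb_per_s(1333e6, 8)                # shared, one direction at a time

print(f"QPI peak: {QPI_GB_S:.1f} GB/s")
print(f"FSB peak: {FSB_GB_S:.2f} GB/s")
```

On these assumptions a single QPI link offers more than twice the peak bandwidth of the older shared bus, before the integrated memory controllers are taken into account.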
 
Figure 4.13 Configuration of Intel QPI 
 
 
 
 
Figure 4.14 Block Diagram of Processor with Intel QPI 
 
 
Figure 4.15 New generation CPU Memory Architecture 
 
At first glance, it is impossible to check the performance of these new 
technologies individually, since they are built into the CPU and cannot be tested 
separately. However, Intel Turbo Boost is active only when cores run below power, 
current and temperature specification limits, so this technology will often not be used by 
the parallel code. During a simulation, 80 to 100% of the CPU is used and there will be 
no chance to boost the CPU speed. Most of the performance improvement will therefore 
come from the other new technologies. With these new technologies, the performance of 
the previous-generation CPU and the new-generation CPU is compared using HOBBIES. 
 
4.4.1. Comparison: Cluster-3 and Cluster-8 
In this section, comparisons are made between Cluster-3 and Cluster-8. The CPU 
of Cluster-3 is the Intel Xeon E5430 quad-core CPU with FSB, the older system bus type, 
and the RAM of Cluster-3 is DDR SDRAM. The CPU of Cluster-8 is the Intel Xeon 
X5650 hexa-core CPU with Intel Turbo Boost, integrated memory controllers, and QPI, 
the new system bus type. The RAM of Cluster-8 is DDR3 SDRAM. Table 4.9 
summarizes the specifications of the two systems. All the specification data are from the 
Intel processor comparison website [11]. The differences in specifications between 
Cluster-3 and Cluster-8 are highlighted in Table 4.9. 
Table 4.9 CPU specification details: Cluster-3 and Cluster-8
                     Cluster-3                  Cluster-8
Processor Number     E5430                      X5650
# of Cores           4                          6
# of Threads         4                          12
Cache                12.0 MB                    12.0 MB
Clock Speed          2.66 GHz                   2.66 GHz
Max Turbo Frequency  -                          3.06 GHz
Bus Type             FSB                        QPI
Memory               DDR                        DDR3
Hard disk            SAS (15k RPM)              SATA (7,200 RPM)
Network              3 nodes, Gigabit Ethernet  4 nodes, Gigabit Ethernet
 
 
 
The same scattering problems for the full-size aircraft with four different operating 
frequencies, Model 2-1, Model 2-2, Model 2-3 and Model 2-4, are simulated. To make a 
fair comparison, simulations are run using 24 cores on both clusters. Three nodes are 
used on Cluster-3, and four nodes are used on Cluster-8. Table 4.10 compares the 
simulation times of Cluster-3 and Cluster-8. Figure 4.16 graphs the simulation times for 
Cluster-3 and Cluster-8. 
Table 4.10 Simulation times using 24 cores: Cluster-3 and Cluster-8
Model      System     NUN     Filling (min)  Solving (min)  Total (min)
Model 2-1  Cluster-3  50104   27             60             93
           Cluster-8  50104   16             34             56
Model 2-2  Cluster-3  103638  81             340            430
           Cluster-8  103638  52             309            370
Model 2-3  Cluster-3  150189  173            1020           1199
           Cluster-8  150189  105            875            986
Model 2-4  Cluster-3  200622  331            2455           2801
           Cluster-8  200622  182            2003           2197
 
Figure 4.16 Graph of simulation times (minutes): Cluster-3 and Cluster-8 
 
 
 
Although Cluster-8 has a slower SATA 7,200 RPM hard disk compared to the SAS 
15,000 RPM hard disk on Cluster-3, and also has more network communication since 
more nodes are used, Cluster-8 with the new-generation multicore CPU performs better 
than Cluster-3 in all the simulations. The differences in total simulation time between 
Cluster-3 and Cluster-8 are 37 minutes for Model 2-1, 60 minutes for Model 2-2, 213 
minutes for Model 2-3 and 604 minutes for Model 2-4. As NUN increases, the difference 
in simulation time between Cluster-3 and Cluster-8 also increases. 
 
4.4.2. Comparison between Cluster-3 and Cluster-4 
In this section, Cluster-3 with the Intel Xeon E5430 quad-core CPU and Cluster-4 
with the Intel Xeon X5670 hexa-core CPU are compared. Table 4.11 shows a comparison 
of the two CPUs [11]. The differences in specifications between Cluster-3 and 
Cluster-4 are highlighted. 
Table 4.11 CPU specification details: Cluster-3 and Cluster-4
                     Cluster-3                  Cluster-4
Processor Number     E5430                      X5670
# of Cores           4                          6
# of Threads         4                          12
Cache                12.0 MB                    12.0 MB
Clock Speed          2.66 GHz                   2.93 GHz
Max Turbo Frequency  -                          3.33 GHz
Bus Type             FSB                        QPI
Memory               DDR                        DDR3
Hard disk            SAS (15k RPM)              SAS (15k RPM)
Network              3 nodes, Gigabit Ethernet  2 nodes, Gigabit Ethernet
 
 
 
 
Both clusters are Seneca Data Nexlink 7875 systems, but they have different types 
of CPU, bus and RAM. The CPU of Cluster-4 is hexa-core with QPI and Intel Turbo 
Boost, with a maximum turbo frequency of 3.33 GHz, and DDR3 SDRAM is used. The 
CPU of Cluster-3 is quad-core with FSB, and DDR SDRAM is used. The same 
simulations using 24 cores are run for the comparison between Cluster-3 and Cluster-4. 
Three nodes are used on Cluster-3, and two nodes are used on Cluster-4. Table 4.12 
summarizes the simulation times. 
Table 4.12 Simulation times using 24 cores: Cluster-3 and Cluster-4
Model      System     NUN     Filling (min)  Solving (min)  Total (min)
Model 2-1  Cluster-3  50104   27             60             93
           Cluster-4  50104   16             31             52
Model 2-2  Cluster-3  103638  81             340            430
           Cluster-4  103638  47             285            340
Model 2-3  Cluster-3  150189  173            1020           1199
           Cluster-4  150189  93             780            881
Model 2-4  Cluster-3  200622  331            2455           2801
           Cluster-4  200622  164            1734           1907
 
Because of the difference in CPU clock speed, the simulation times need to be 
adjusted. For a fair comparison, every simulation time of Cluster-4 is multiplied by 
2.93/2.66, which scales it to the 2.66 GHz clock speed of Cluster-3. Table 4.13 shows the 
simulation times with the CPU clock speed matched. A graph of the simulation times is 
shown in Figure 4.17. 
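
The adjustment described above can be written as a one-line scaling. The sketch below assumes, as the text does implicitly, that simulation time is inversely proportional to CPU clock speed; the function name is illustrative.

```python
def normalize_time(minutes, cpu_ghz, reference_ghz):
    """Scale a measured time to the reference clock, assuming time ~ 1 / clock."""
    return minutes * cpu_ghz / reference_ghz


# Cluster-4 (2.93 GHz) times from Table 4.12, scaled to Cluster-3's 2.66 GHz.
CLUSTER4_TOTAL_MIN = {"Model 2-1": 52, "Model 2-2": 340,
                      "Model 2-3": 881, "Model 2-4": 1907}

for model, minutes in CLUSTER4_TOTAL_MIN.items():
    scaled = normalize_time(minutes, cpu_ghz=2.93, reference_ghz=2.66)
    print(f"{model}: {minutes} min measured, {scaled:.1f} min at 2.66 GHz")
```

This is a first-order correction only: stages limited by disk or network I/O do not speed up with the clock, so the scaled times should be read as approximate.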
 
 
 
 
 
 
 
 
 
Table 4.13 Simulation times with CPU speed normalized: Cluster-3 and Cluster-4
Model      System     NUN     Filling (min)  Solving (min)  Total (min)
Model 2-1  Cluster-3  50104   27             60             93
           Cluster-4  50104   18             34             57
Model 2-2  Cluster-3  103638  81             340            430
           Cluster-4  103638  52             314            375
Model 2-3  Cluster-3  150189  173            1020           1199
           Cluster-4  150189  102            859            970
Model 2-4  Cluster-3  200622  331            2455           2801
           Cluster-4  200622  181            1910           2100
 
 
Figure 4.17 Graph of simulation times (minutes) with CPU speed normalized: Cluster-3 
and Cluster-4 
 
Cluster-4, with the new generation of multicore CPU, is faster than Cluster-3 in all 
the simulations. The differences in normalized total simulation time between Cluster-3 
and Cluster-4 are 36 minutes for Model 2-1, 55 minutes for Model 2-2, 229 minutes for 
Model 2-3 and 701 minutes for Model 2-4. As NUN increases, the difference in 
simulation time also increases. 
In conclusion, the technologies in the new generation of multicore CPUs are 
effective. The performance is improved by QPI and by the integrated memory 
controllers, which connect the cores to memory directly. 
 
4.5. Hard disk 
Because of the limited RAM size in the system, it is hard to simulate an actual-size 
aircraft or vehicle with complicated antenna structures in RAM alone. HOBBIES 
provides a parallel OCS for those models. The OCS uses the hard disk instead of RAM to 
store the dense matrix during the computation, and it is less expensive to increase the 
size of the hard disk than the size of RAM. Processing in RAM has some advantages 
over processing on the hard disk, but processing on the hard disk makes it possible to 
simulate larger structures that cannot be solved with RAM only. The research on parallel 
OCS performance tests is provided in the parallel book [1]. 
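
To give a sense of scale, the storage needed for a dense NUN-by-NUN matrix can be estimated as follows. This is a rough sketch that assumes each matrix entry is a double-precision complex number (16 bytes); the precision actually used by HOBBIES is not stated in this section, and the function name is illustrative.

```python
def dense_matrix_gib(nun, bytes_per_entry=16):
    """Storage for a dense nun x nun matrix in GiB (assumed 16 bytes per entry)."""
    return nun * nun * bytes_per_entry / 2**30


for nun in (50104, 103638, 150189, 200622):
    print(f"NUN = {nun:>6}: about {dense_matrix_gib(nun):.0f} GiB")
```

Under this assumption even the smallest Model 2 case exceeds the RAM of a single node in the clusters described here, and storage grows with the square of NUN.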
The hard disk is an important parameter in parallel CEM simulation software. 
Specifically, during the solving process, the I/O read/write time depends on the type and 
specification of the hard disk. The CPU reads a portion of the large dense matrix, 
performs the calculation, writes the intermediate results back to the hard disk, and 
repeats the same process until the entire dense matrix is solved. Compared to RAM, the 
capacity of a hard disk is larger and its price is lower. There is also a limit on the 
amount of RAM that can be installed in a node, and the maximum RAM capacity of a 
node is typically much less than the capacity of a hard disk. There are three hard disk 
parameters that can impact simulations. The first is the hard disk type, the second is the 
presence of a RAID controller, and the third is the RPM of the hard disk. For the first 
parameter, as mentioned, SAS has better performance than SATA. For the second 
parameter, using more hard disks with a RAID controller per node is better, and there is 
no bottleneck associated with the corresponding I/O process [1]. For the third parameter, 
the higher the RPM, the better the performance. 
In the next section, these three factors are tested and compared. Simulations with 
two different RPMs of hard disk are run and compared.  Simulations with local hard disk 
and disk array on network are also run and compared. All the simulations are done using 
parallel OCS, which uses hard disk for computing. 
 
4.5.1. Comparison: SAS and SATA 
There are many types of hard disk, but in this section SATA and SAS are 
compared. In general, SATA is for general-purpose use and SAS is for high-performance 
servers. This means SAS is more reliable and performs well, but the price per GB 
(gigabyte) of a SATA hard disk is lower than that of a SAS hard disk. As mentioned, the 
solving stage reads from and writes to the hard disk frequently when running the parallel 
OCS, so the solving time is affected by the hard disk more than the filling time is. 
Table 4.14 compares the specifications of Cluster-4 and Cluster-8 [11]. 
The differences in specifications between Cluster-4 and Cluster-8 are highlighted. 
 
 
 
 
 
 
 
 
 
Table 4.14 CPU specification details: Cluster-4 and Cluster-8
                     Cluster-4                  Cluster-8
Processor Number     X5670                      X5650
# of Cores           6                          6
# of Threads         12                         12
Cache                12.0 MB                    12.0 MB
Clock Speed          2.93 GHz                   2.66 GHz
Max Turbo Frequency  3.33 GHz                   3.06 GHz
Bus Type             QPI                        QPI
Memory               DDR3                       DDR3
Hard disk            SAS (15,000 RPM)           SATA (7,200 RPM)
Network              2 nodes, Gigabit Ethernet  4 nodes, Gigabit Ethernet
 
Cluster-4 and Cluster-8 are used for this comparison. The main difference is that 
Cluster-4 has SAS hard disks with 15,000 RPM and a RAID controller, and Cluster-8 has 
a SATA hard disk with 7,200 RPM and no RAID controller. There are other differences: 
Cluster-4 has a faster CPU clock speed than Cluster-8, and Cluster-8 has more RAM per 
core than Cluster-4. Cluster-4 has a 2.93 GHz CPU clock speed and 2 GB RAM per 
core. Cluster-8 has a 2.66 GHz CPU clock speed and 4 GB RAM per core. IASIZE for 
OCS is set to 2 GB per core on both clusters, but Cluster-8 will have more RAM 
available as cache than Cluster-4, since it has RAM that is not involved in the 
simulation. Two nodes are used for Cluster-4 and four nodes are used for Cluster-8, so 
Cluster-8 will have more network communication than Cluster-4. It is challenging to 
separate all the parameters for a fair comparison, so one needs to keep these differences 
in mind when the comparison is made.  
 
 
 
The same scattering problems for the full-size aircraft with the different 
frequencies, Model 2-1, Model 2-2, Model 2-3 and Model 2-4, are simulated. Table 4.15 
presents the simulation times. 
Table 4.15 Simulation times using 24 cores: SAS and SATA
Model      System           NUN     CPU Speed (GHz)  RAM (GB per core)  Nodes  Filling (min)  Solving (min)  Total (min)
Model 2-1  SAS 15,000 RPM   50104   2.93             2                  2      16             31             52
           SATA 7,200 RPM   50104   2.66             4                  4      16             34             56
Model 2-2  SAS 15,000 RPM   103638  2.93             2                  2      47             285            340
           SATA 7,200 RPM   103638  2.66             4                  4      52             309            370
Model 2-3  SAS 15,000 RPM   150189  2.93             2                  2      93             780            881
           SATA 7,200 RPM   150189  2.66             4                  4      105            875            986
Model 2-4  SAS 15,000 RPM   200622  2.93             2                  2      164            1734           1907
           SATA 7,200 RPM   200622  2.66             4                  4      182            2003           2197
 
As described previously, there are other parameters that need to be considered. 
Among them, the effect of CPU clock speed on simulation time can be compensated for 
a fair comparison. Each simulation time of Cluster-8 is multiplied by the CPU clock 
speed ratio, 2.66/2.93. For example, the simulation time of Cluster-8 for 50,104 NUN is 
56 minutes, and after the compensation it is 51 minutes. This is done for every 
simulation time of Cluster-8. Table 4.16 shows the compensated times. 
Table 4.16 Simulation times with CPU speed normalized: SAS and SATA
Model      System          NUN     Filling (min)  Solving (min)  Total (min)
Model 2-1  SAS 15,000 RPM  50104   16             31             52
           SATA 7,200 RPM  50104   15             31             51
Model 2-2  SAS 15,000 RPM  103638  47             285            340
           SATA 7,200 RPM  103638  47             281            336
Model 2-3  SAS 15,000 RPM  150189  93             780            881
           SATA 7,200 RPM  150189  95             794            895
Model 2-4  SAS 15,000 RPM  200622  164            1734           1907
           SATA 7,200 RPM  200622  165            1818           1995
 
Both clusters have similar filling times, but the solving times, which include the 
I/O time, differ. For the hard disk performance comparison, the solving times are 
compared. For Model 2-1 with 50,104 NUN, the solving time is the same on Cluster-4 
with the SAS 15,000 RPM hard disk and on Cluster-8 with the SATA 7,200 RPM hard 
disk. For Model 2-2 with 103,638 NUN, Cluster-8 is faster by 4 minutes than Cluster-4. 
For Model 2-3 with 150,189 NUN, Cluster-4 is faster by 14 minutes than Cluster-8. For 
Model 2-4 with 200,622 NUN, Cluster-4 is faster by 84 minutes in solving time (88 
minutes in total time) than Cluster-8. As NUN increases, the difference in solving time 
also increases, which means more time is needed to read and write to the hard disks.  
 
4.5.2. Comparison: 10,000 RPM and 15,000 RPM 
When a hard disk seeks data, the actuator arm moves and the spindle rotates. To 
reach data quickly, the RPM needs to be high. For general purposes, a 7,200 RPM hard 
disk is sufficient for most desktop and laptop computers. But for servers, 10,000 or 
15,000 RPM hard disks are typically used.  
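
The link between RPM and access time can be illustrated with the average rotational latency, which is the time for half a revolution. This is a standard back-of-the-envelope figure rather than a measurement from this work, and it ignores seek time and transfer time; the function name is illustrative.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency in ms: time for half a revolution."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


for rpm in (7_200, 10_000, 15_000):
    print(f"{rpm:>6} RPM: {avg_rotational_latency_ms(rpm):.2f} ms")
```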
Cluster-2 has two hard disks: one is 146 GB at 15,000 RPM and the other is 
300 GB at 10,000 RPM. Of the 10 nodes, 8 nodes are used for the comparison, and the 
head node and the first compute node are excluded. Cluster-2 is configured so that four 
nodes use the 146 GB, 15,000 RPM hard disk and the other four nodes use the 300 GB, 
10,000 RPM hard disk. Full-size aircraft Models 2-1, 2-2 and 2-3 are run using 8 
cores. Table 4.17 shows the comparison of the simulation times. 
Table 4.17 Simulation times: 10,000 RPM and 15,000 RPM on Cluster-2
Model      RPM     NUN     Filling (min)  Solving (min)  Total (min)
Model 2-1  10,000  50104   124            135            279
           15,000  50104   124            133            277
Model 2-2  10,000  103638  488            1240           1759
           15,000  103638  490            1223           1743
Model 2-3  10,000  150189  1093           4026           5132
           15,000  150189  1091           3974           5078
 
The difference in solving times between the simulation with 15,000 RPM hard 
disk and simulation with 10,000 RPM is 2 minutes for Model 2-1 with 50,104 NUN, 17 
minutes for Model 2-2 with 103,638 NUN and 52 minutes for Model 2-3 with 150,189 
NUN. As NUN increases, the time difference also increases. The 15,000 RPM hard disk 
shows better performance as expected. 
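
The absolute differences above can also be expressed relative to the 10,000 RPM solving time. The sketch below uses only the solving times from Table 4.17; the function name is illustrative.

```python
# Solving times (minutes) from Table 4.17: (10,000 RPM, 15,000 RPM).
SOLVING_MIN = {
    "Model 2-1": (135, 133),
    "Model 2-2": (1240, 1223),
    "Model 2-3": (4026, 3974),
}


def relative_gain(slow_min, fast_min):
    """Fraction of the slower time saved by the faster disk."""
    return (slow_min - fast_min) / slow_min


for model, (rpm10k, rpm15k) in SOLVING_MIN.items():
    saved = rpm10k - rpm15k
    print(f"{model}: {saved} min saved, {100 * relative_gain(rpm10k, rpm15k):.1f}%")
```

In relative terms the saving stays between about 1% and 2% of the solving time for all three models, while the saving in minutes grows with NUN.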
 
4.5.3. Comparison between local hard disk and disk array 
In this section, a comparison is made between local hard disks and a 
network-attached disk array. Cluster-6 has a JetStor SCSI 416S rackmount RAID system. 
This disk array is connected to an Ultra 320 SCSI card on the head node and, through a 
Gigabit Ethernet switch, to every node. The connection of Cluster-6 is shown in Figure 
3.6 in Chapter 3. The disk array holds sixteen 300 GB hard drives, for a total of 4.8 TB 
of storage. The comparison is made between local SAS 146 GB hard disks with RAID 1 
and the disk array on Gigabit Ethernet. Model 1-6, the half-size aircraft with symmetry, 
is simulated using 80 cores on Cluster-6 with both the local hard disks and the disk 
array. The NUN is 99,973 at an operating frequency of 2.6 GHz. Simulation times are 
shown in Table 4.18. 
Table 4.18 Simulation times: local hard disk and disk array on Cluster-6
Model      System     Type of HDD      NUN    Filling (min)  Solving (min)  Total (min)
Model 1-6  Cluster-6  Local Hard disk  99973  8              92             127
                      Disk Array       99973  19             1078           1169
 
Cluster-6 with local hard disks is much faster than Cluster-6 with the disk array. 
The difference in solving time between the local hard disks and the network-attached 
disk array is 986 minutes. This is expected, because access to the disk array is a 
bottleneck. While the simulation is running, eighty cores try to access the disk array at 
the same time through the network connection. In Figure 4.18, the red box in the Ganglia 
[8] report shows that “WAIT CPU” (orange part) appears and “User CPU” (blue part), 
which is the actual use of the CPU, is low at those points. The CPUs take turns to access 
the disk array, since there are not many buses, and while one CPU accesses the disk 
array, the other CPUs need to wait for their turn. 
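
A rough feel for this bottleneck can be obtained by dividing the available link bandwidth among the competing cores. The sketch below assumes, purely for illustration, that all traffic to the disk array passes through a single Gigabit Ethernet link at its theoretical line rate of 125 MB/s and is shared evenly; the actual topology and achieved throughput of Cluster-6 may differ.

```python
# Theoretical Gigabit Ethernet line rate: 1e9 bits/s = 125 MB/s.
GIGABIT_ETHERNET_MB_PER_S = 1e9 / 8 / 1e6


def per_core_share_mb_per_s(link_mb_per_s, cores):
    """Even share of a single shared link per core (idealized assumption)."""
    return link_mb_per_s / cores


share = per_core_share_mb_per_s(GIGABIT_ETHERNET_MB_PER_S, cores=80)
print(f"about {share:.2f} MB/s per core when 80 cores share one link")
```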
 
 
 
 
Figure 4.18 Ganglia CPU report on Cluster-6 
 
4.6. Network 
The advantage of having a cluster is that one can expand the system through the 
network connection. There is a limit on the performance that a single system can reach, 
and a single system with high performance is expensive. Multiple systems connected by 
a network are more affordable, and with parallel codes their performance is comparable 
or better. The performance of a multiple-system cluster is affected by the bandwidth and 
the speed of the network. There are several types of networks, such as Ethernet, Fast 
Ethernet, Gigabit Ethernet, 10 Gigabit Ethernet and InfiniBand. Gigabit Ethernet is 
typically used for general-purpose systems. 10 Gigabit Ethernet and InfiniBand are 
typically used for high performance computing, which needs more bandwidth and speed 
than Gigabit Ethernet provides. 
First, the effect of the network is tested by comparing simulations without and 
with a network connection, that is, under the same conditions but using one node and 
using multiple nodes. Second, simulations are performed with different types of network 
and the performance is compared. 
 
 
 
 
4.6.1. Comparison: without and with network on Cluster-3 
To see the network effect, the same number of cores and the same type of CPU are 
used without and with the network. A simulation without a network connection means 
running on one node. Cluster-3 is used for the simulation. The total number of cores per 
node is eight. Simulations are run using one node with eight cores and using four nodes 
with only two cores per node. Both simulations use the same number of cores, which is 
eight. The total simulation time is compared. 
Model 3 is a full-size Global Hawk with 61 dipole antennas under the wings. 
Diagrams of Model 3 are shown in Figures 4.19 and 4.20. The length between nose and 
tail is 14.1 m, the wing span is 35.36 m and the height of the head is 2.4 m. It is modeled 
as a PEC [15]. The length of each antenna is 0.5 m and its radius is 5 mm. The spacing 
between the dipole antennas and the wing of the Global Hawk is 0.25 m, and the spacing 
between the dipole antennas is 0.5 m. 
 
 
Figure 4.19 Top view and dimensions of Global Hawk, Model 3 
 
 
 
 
 
 
Figure 4.20 Side view and dimensions of Global Hawk, Model 3 
 
Six simulations are run on Cluster-3. Three models are set with different 
operating frequencies to vary the NUN of the simulation, and each model is run without 
and with the network. Model 3-1 is set to 550 MHz for 50,104 NUN. Model 3-2 is set to 
1,010 MHz for 103,638 NUN. Model 3-3 is set to 1,650 MHz for 200,959 NUN. Figure 
4.21 shows the radiation patterns for all the simulations. For each operating frequency, 
the results without and with the network agree. The simulation times are summarized in 
Table 4.19. 
 
 
 
Figure 4.21 Simulation results, Radiation Pattern of dipole antenna array, (a) Phi = 0° cut, 
(b) Phi = 90° cut of Model 3-1, (c) Phi = 0° cut, (d) Phi = 90° cut of Model 3-2,  
(e) Phi = 0° cut, (f) Phi = 90° cut of Model 3-3 
 
Table 4.19 Simulation times: without and with network on Cluster-3
Model      Network  NUN     Cores per node  Number of nodes  Number of cores  Filling (min)  Solving (min)  Total (min)
Model 3-1  No       50104   8               1                8                23             111            137
           Yes      50104   2               4                8                22             91             116
Model 3-2  No       103638  8               1                8                99             783            887
           Yes      103638  2               4                8                96             763            864
Model 3-3  No       200959  8               1                8                646            7163           7824
           Yes      200959  2               4                8                643            7370           8021
 
 
 
For Model 3-1, the run with network is faster by 21 minutes, and for Model 3-2, 
the run with network is faster by 23 minutes. For Model 3-3, the run without 
network is faster by 197 minutes. Cluster-3 has 8 cores and 16 GB RAM on every node. 
When Model 3-1 runs on 8 cores spread over 4 nodes, the system has more RAM available 
for the cache, so the simulation does not need to access the hard disk as frequently to 
get the blocks of the dense matrix. When one node runs with all 8 cores, fully loading 
the node, each core can have 2 GB RAM for simulation. But when 4 nodes run with 2 
cores on each node, each core can have 8 GB RAM: 2 GB RAM for simulation and 6 GB 
RAM for the cache. That is likely the reason the simulation with network is faster in the 
case of Model 3-1 with 50,104 NUN and Model 3-2 with 103,638 NUN. For Model 3-3 
with 200,959 NUN, the simulation without network is faster than the one with network. 
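The per-core RAM arithmetic above can be checked with a short sketch. The function name 
ram_per_core is hypothetical, and the even split of a node's RAM among its active cores is 
a simplifying assumption made for illustration, not a description of how HOBBIES or the 
operating system actually allocates memory. 

```python
def ram_per_core(node_ram_gb, cores_used_per_node, sim_gb_per_core):
    """Return (RAM available per active core, RAM left per core for cache), in GB.

    Assumes the node's RAM is shared evenly among the cores in use.
    """
    available = node_ram_gb / cores_used_per_node
    return available, available - sim_gb_per_core


# Cluster-3: 16 GB per node, 2 GB per core used by the simulation.
print(ram_per_core(16, 8, 2))  # one fully loaded node
print(ram_per_core(16, 2, 2))  # four nodes, two cores per node
```

Under these assumptions the fully loaded node leaves nothing for cache, while the 
four-node layout leaves 6 GB per core. 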
The next simulation uses the same amount of RAM per core in both configurations. To 
remove the effect of the extra RAM used as cache, RAM is taken out of the system so 
that each core has 2 GB in the simulation. After the removal of RAM, each node 
has 4 GB RAM, and the simulation is rerun using 8 cores on 4 nodes, which is 2 cores 
per node. With each core having 2 GB RAM for simulation, there is no extra RAM for 
cache. Table 4.20 provides the previous simulation times of Cluster-3 using 8 cores on one 
node and the simulation times of Cluster-3 using 8 cores on 4 nodes with 2 GB RAM per 
core and no extra RAM. 
 
 
 
 
 
 
 
Table 4.20 Simulation times: without network and with network with only 2 GB RAM 
per core on Cluster-3 
Model      Network  NUN     Cores per node  Number of nodes  Number of cores  Filling (min)  Solving (min)  Total (min)
Model 3-1  No       50104   8               1                8                23             111            137
Model 3-1  Yes      50104   2               4                8                22             96             121
Model 3-2  No       103638  8               1                8                99             783            887
Model 3-2  Yes      103638  2               4                8                96             770            871
Model 3-3  No       200959  8               1                8                646            7163           7824
Model 3-3  Yes      200959  2               4                8                644            7386           8038
 
The simulation times with network are faster than the simulation times without 
network, except for Model 3-3. Even though the extra RAM is removed, the results do 
not change much. The run with network is faster by 16 minutes for Model 3-1 and by 16 
minutes for Model 3-2. The run without network is faster by 214 minutes for Model 3-3. 
For Models 3-1 and 3-2, the advantage of the run with network is smaller than in the 
previous simulations. 
The same simulations are run on Cluster-4, which has twelve cores per node. 
Simulations are run using one node with twelve cores and using three nodes with four 
cores per node, and the total simulation times are compared. The radiation patterns of 
the models are the same as in Figure 4.21. The simulation times are presented in 
Table 4.21. 
 
 
 
 
 
 
 
 
Table 4.21 Simulation times: without and with network on Cluster-4 
Model      Network  NUN     Cores per node  Number of nodes  Number of cores  Filling (min)  Solving (min)  Total (min)
Model 3-1  No       50104   12              1                12               15             73             89
Model 3-1  Yes      50104   4               3                12               9              83             93
Model 3-2  No       103638  12              1                12               39             460            502
Model 3-2  Yes      103638  4               3                12               39             460            502
Model 3-3  No       200959  12              1                12               183            3366           3560
Model 3-3  Yes      200959  4               3                12               302            3933           4240
 
For Model 3-1, the simulation without network is faster by 4 minutes, and for 
Model 3-2, there is no difference in time. For Model 3-3, the simulation without 
network is faster by 680 minutes. Cluster-4 has 12 cores and 48 GB RAM on every node. 
When one node runs with all 12 cores, fully loading the node, each core can 
have 4 GB RAM for simulation. But when 3 nodes run with 4 cores on each node, 
each core can have 12 GB RAM: 4 GB RAM for simulation and 8 GB RAM for the 
cache. Even though the simulation using 3 nodes has the advantage of extra RAM used as 
cache, the simulation without the network shows better performance. 
 
4.6.2. Comparison: Gigabit Ethernet and InfiniBand 
In this section, the performance of two types of networks is compared. 
Gigabit Ethernet is a network for transmitting Ethernet frames at a rate of one gigabit 
per second and is defined by the IEEE 802.3 standard for Ethernet [16]. InfiniBand is a 
network for high-performance computing whose advantages are high throughput and 
low latency [17]. When choosing a network for computational parallel processing, it is 
important to understand the traffic type and load of the parallel CEM software for the 
application. Gigabit Ethernet is recommended for clusters with 1 to 16 nodes and 
InfiniBand for clusters with more than 32 nodes. Choosing the network for clusters 
with 16 to 32 nodes depends on the traffic, but InfiniBand is typically better. Another 
aspect of choosing the network type is the project size. The network type is chosen not 
only by the total number of nodes in the system but also by the number of nodes that the 
user typically runs simultaneously in the HPCC for their projects. For example, if there 
are 64 nodes in total in the system, but the user is running only 8 nodes at a time, the 
user may not need InfiniBand, since only a small number of nodes are typically run 
simultaneously. If the user runs simulations using more than 16 nodes simultaneously, 
InfiniBand is recommended [18]. Projects using larger numbers of cores and higher NUN 
are more strongly affected by the choice of network type. 
 
4.6.2.1. Comparison: Gigabit Ethernet and InfiniBand on Cluster-6 
Cluster-6 has two types of network connected to all the nodes, Gigabit Ethernet 
and InfiniBand, as shown in Figure 3.6 in Chapter 3. Simulations are run using Cluster-6 
to compare the performance of Gigabit Ethernet and InfiniBand. The half size aircraft 
with symmetry, Model 1-6, is simulated. The operating frequency is 2.6 GHz and the 
NUN is 99,973. The simulations are run using 80 cores on 10 nodes. The simulation 
times are shown in Table 4.22. 
Table 4.22 Simulation times: Gigabit Ethernet and InfiniBand on Cluster-6 
Model      System     Network     NUN     Number of nodes  Filling (min)  Solving (min)  Filling+Solving (min)
Model 1-6  Cluster-6  Gigabit     99973   10               9              92             101
Model 1-6  Cluster-6  InfiniBand  99973   10               8              97             105
 
The expected result is that InfiniBand will be faster than Gigabit Ethernet, since 
InfiniBand has higher throughput and lower latency. But in this simulation, Gigabit 
Ethernet shows slightly better performance than InfiniBand. The difference in simulation 
time is 4 minutes. As stated in the previous section, because of the small number of 
cores and small NUN, the difference in network performance does not show clearly in the 
simulation. This simulation uses just 10 nodes, and Gigabit Ethernet is recommended for 
this kind of system. 
 
4.6.2.2. Comparison: Gigabit Ethernet and InfiniBand on Cluster-3 and Cluster-5 
In this section, Cluster-3, which has Gigabit Ethernet, and Cluster-5, which has 
InfiniBand, are used for performance comparisons. Both clusters have the same 
generation of CPU but different CPU clock speeds. Since Cluster-5 has an older version 
of HOBBIES with serial (not parallelized) post processing, only the filling time and 
solving time are used to compare the performance of the two different networks. Model 
1-7, the half size aircraft with symmetry, is used for the simulations. Its operating 
frequency is 2.8 GHz and its NUN is 110,994. The maximum number of cores that can be 
used for simulation is 32, since Cluster-3 only has 32 cores. The CPU speed is 
normalized for comparing the simulation times. The actual simulation time for each 
cluster is shown in Table 4.23. 
Table 4.23 Actual simulation times: Cluster-3 with Gigabit Ethernet and Cluster-5 with 
InfiniBand 
Model      System     Network     CPU (GHz)  NUN     Number of nodes  Filling (min)  Solving (min)  Filling+Solving (min)
Model 1-7  Cluster-3  Gigabit     2.66       110994  4                34             440            474
Model 1-7  Cluster-5  InfiniBand  3          110994  4                37             400            437
 
Cluster-5 with InfiniBand is faster than Cluster-3, but the difference in CPU speed 
needs to be taken into account. For a fair comparison, the CPU clock speed is 
normalized before the comparison is made. All the simulation times of Cluster-5 are 
multiplied by 3/2.66, and Table 4.24 shows the simulation times with CPU clock speed 
matched. 
Table 4.24 Simulation times with CPU speed normalized: Cluster-3 with Gigabit 
Ethernet and Cluster-5 with InfiniBand 
Model      System     Network     NUN     Number of nodes  Filling (min)  Solving (min)  Filling+Solving (min)
Model 1-7  Cluster-3  Gigabit     110994  4                34             440            474
Model 1-7  Cluster-5  InfiniBand  110994  4                42             451            493
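
The clock speed normalization used for Table 4.24 can be written as a short sketch. The 
function name normalize_minutes is hypothetical, the clock speeds are assumed to be in 
GHz, and the scaling assumes that run time is inversely proportional to clock speed, 
which is the simplification this comparison relies on. 

```python
def normalize_minutes(minutes, cpu_ghz, reference_ghz):
    """Scale a run time measured at cpu_ghz to what it would be at reference_ghz.

    Assumes run time is inversely proportional to CPU clock speed.
    """
    return round(minutes * cpu_ghz / reference_ghz)


# Cluster-5 (3 GHz) times from Table 4.23, normalized to Cluster-3 (2.66 GHz).
for label, minutes in [("filling", 37), ("solving", 400), ("filling+solving", 437)]:
    print(label, normalize_minutes(minutes, 3.0, 2.66))
```

With these inputs the sketch gives 42, 451 and 493 minutes, the Cluster-5 row of 
Table 4.24. 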
 
As mentioned, the comparison is made on filling time and solving time. The 
difference in filling time is 8 minutes and in solving time is 11 minutes. Cluster-5 with 
InfiniBand is slightly slower than Cluster-3 with Gigabit Ethernet. The result is the same 
as found in the previous section: Gigabit Ethernet is faster than InfiniBand. This 
comparison shows that the performance impact of the network does not show clearly 
in the simulations, due to the small number of cores and the small NUN. The 
project is simulated using 4 nodes, and Gigabit Ethernet is still sufficient for 
this kind of system. 
 
4.6.2.3. Comparison: Gigabit Ethernet and InfiniBand on Cluster-3 and Cluster-5 
with larger NUN 
In this simulation, the NUN of the project is increased. The same model is used as 
in the previous section, except with a higher frequency. Model 1-8 is simulated with the 
operating frequency set to 4.1 GHz, giving 232,705 NUN. The NUN is about twice that of 
the previous simulations, and the same clusters are used with the same number of cores. 
The actual simulation times are shown in Table 4.25. 
Table 4.25 Actual simulation times: Cluster-3 with Gigabit Ethernet and Cluster-5 with 
InfiniBand with larger NUN 
Model      System     Network     CPU (GHz)  NUN     Number of nodes  Filling (min)  Solving (min)  Filling+Solving (min)
Model 1-8  Cluster-3  Gigabit     2.66       232705  4                210            4631           4841
Model 1-8  Cluster-5  InfiniBand  3          232705  4                191            2707           2898
 
Because of the difference in CPU clock speed, the times are normalized, as was done in 
the previous section. The results are shown in Table 4.26. 
Table 4.26 Simulation times with CPU speed normalized: Cluster-3 with Gigabit 
Ethernet and Cluster-5 with InfiniBand with larger NUN 
Model      System     Network     NUN     Number of nodes  Filling (min)  Solving (min)  Filling+Solving (min)
Model 1-8  Cluster-3  Gigabit     232705  4                210            4631           4841
Model 1-8  Cluster-5  InfiniBand  232705  4                215            3053           3268
 
After normalizing the CPU clock speed, both clusters still show approximately the 
same filling time. But Cluster-5 with InfiniBand is faster than Cluster-3 with Gigabit 
Ethernet by 1,578 minutes in solving time. As the NUN increases, Cluster-5 with 
InfiniBand becomes faster than Cluster-3 with Gigabit Ethernet. 
Although Gigabit Ethernet is recommended for a system with 4 nodes, 
InfiniBand shows better performance with the larger NUN. 
 
4.6.2.4. Comparison: Gigabit Ethernet and InfiniBand on Cluster-3 and Cluster-5 
with fewer cores 
In this section, simulations with fewer cores are run to see the 
performance impact. It would be better to run simulations with more cores to 
compare Cluster-3 with Gigabit Ethernet and Cluster-5 with InfiniBand, but 32 cores is 
the maximum number of cores available. So this time, the simulation is done with 
fewer cores on each cluster. 
The same model as previously, the half size aircraft with symmetry, Model 1-7, is 
run on Cluster-3 and Cluster-5 using 8 cores on each cluster. The operating frequency is 
2.8 GHz and the NUN is 110,994. Table 4.27 shows the actual simulation times from 
both runs. Table 4.28 shows the simulation times with the CPU clock speed normalized. 
Table 4.27 Actual simulation times using 8 cores: Cluster-3 with Gigabit Ethernet and 
Cluster-5 with InfiniBand 
Model      System     Network     CPU (GHz)  NUN     Number of nodes  Filling (min)  Solving (min)  Filling+Solving (min)
Model 1-7  Cluster-3  Gigabit     2.66       110994  2                122            1243           1365
Model 1-7  Cluster-5  InfiniBand  3          110994  2                113            1657           1770
 
Table 4.28 Simulation times using 8 cores with CPU speed normalized: Cluster-3 with 
Gigabit Ethernet and Cluster-5 with InfiniBand 
Model      System     Network     NUN     Number of nodes  Filling (min)  Solving (min)  Filling+Solving (min)
Model 1-7  Cluster-3  Gigabit     110994  2                122            1243           1365
Model 1-7  Cluster-5  InfiniBand  110994  2                127            1869           1996
 
The results of the simulations using 8 cores show that Cluster-3 with Gigabit 
Ethernet is faster than Cluster-5 with InfiniBand. The difference in solving time is 
626 minutes. 
Table 4.29 shows the results of running the same simulations but increasing the 
number of cores to 16 on both clusters. Table 4.30 shows the simulation times with the 
CPU clock speed normalized. 
Table 4.29 Actual simulation times using 16 cores: Cluster-3 with Gigabit Ethernet and 
Cluster-5 with InfiniBand 
Model      System     Network     CPU (GHz)  NUN     Number of nodes  Filling (min)  Solving (min)  Filling+Solving (min)
Model 1-7  Cluster-3  Gigabit     2.66       110994  4                63             592            655
Model 1-7  Cluster-5  InfiniBand  3          110994  4                83             773            856
 
Table 4.30 Simulation times using 16 cores with CPU speed normalized: Cluster-3 with 
Gigabit Ethernet and Cluster-5 with InfiniBand 
Model      System     Network     NUN     Number of nodes  Filling (min)  Solving (min)  Filling+Solving (min)
Model 1-7  Cluster-3  Gigabit     110994  4                63             592            655
Model 1-7  Cluster-5  InfiniBand  110994  4                94             872            966
 
The simulations using 16 cores show that Cluster-3 with Gigabit Ethernet is faster than 
Cluster-5 with InfiniBand. The difference in solving time is 280 minutes. As mentioned, 
InfiniBand does not show a performance improvement when fewer than 16 nodes are used. 
However, when the project size increases, InfiniBand shows its superior performance, 
even when using only 8 nodes. 
 
4.7. Effect of number of cores  
Before the multicore processor came out, CPU speed was the most significant 
parameter for CEM simulation tools. Now, with multicore CPUs, the most significant 
parameter is the number of cores. Most parameters are optimized in the code by the 
developer. However, the number of cores needs to be chosen by the user. The code 
cannot decide how many cores to use for the project because all projects need different 
resources according to NUN. Another factor is how fast the simulation result is needed. 
Of course, speed of execution is limited by the hardware specification and 
configuration of the system. 
It is true that most of the time, simulating with the maximum number of cores 
possible will give the user the best performance for a single simulation run on a cluster. 
Sometimes, however, the user can make a bad choice of the number of cores. For 
example, choosing a prime number of cores will affect the performance and could make 
the simulation run slower than if fewer cores were selected with a better process grid. 
The process grid, P×Q, needs to be as square as possible. A good choice of the number 
of cores for the project can optimize the performance to the point that using even more 
cores will not improve the simulation time. If multiple simulations with similar NUN 
need to be run, the user needs to choose the number of cores wisely to run the projects 
quickly and efficiently. Choosing the right number of cores for multiple simulations will 
improve the total performance. 
 
 
 
4.7.1. Process Grid P×Q using one hundred or more cores 
The parallel book [1] describes the process grid parameter, the shape of the 
process grid. The best shape of the process grid is to have P and Q approximately equal, 
with Q slightly larger. Properly choosing the number of cores and the shape of the 
process grid is the key to getting the best performance from a parallel CEM code. If the 
number of cores is a prime number, the only possible grid size is 1×Q. That is the worst 
possible scenario, and it would be better to choose fewer or more cores to have the grid 
as square as possible [1]. HOBBIES will arrange the process grid as square as possible 
with the condition P ≤ Q. But HOBBIES cannot prevent users from choosing a prime 
number. Users are free to select the number of cores and HOBBIES then determines the 
best process grid, given the user’s choice of number of cores. For optimum performance, 
users need to choose a proper number of cores. The process grids for fourteen cases of 
numbers of cores are tested on Cluster-7. 
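
The "as square as possible" rule can be illustrated with a minimal sketch. The function 
process_grid below is hypothetical and is not taken from HOBBIES; it simply picks the 
largest divisor P of the core count that does not exceed its square root, so that P ≤ Q. 
This choice appears consistent with the P×Q values listed in the tables of this chapter, 
but the actual rule inside HOBBIES is an assumption here. 

```python
import math


def process_grid(cores):
    """Return (P, Q) with P * Q == cores, P <= Q, and P as close to Q as possible."""
    for p in range(math.isqrt(cores), 0, -1):
        if cores % p == 0:
            return p, cores // p


for n in (120, 121, 127, 128, 200):
    p, q = process_grid(n)
    print(f"{n} cores -> {p}x{q}")
```

A prime count such as 127 collapses to a 1×127 grid, the worst case described above, 
while one core more gives 8×16. 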
 
4.7.1.1. Process Grid: fourteen cases varying the number of cores and process grid 
on Cluster-7 
Cluster-7 was chosen for this simulation because it is the largest cluster and a 
range of numbers of cores can be chosen. The full size aircraft, Model 2-5, is simulated 
with 500 MHz operating frequency and 114,844 NUN. Simulations are run using 
between 120 and 200 cores, either with full nodes, meaning all cores in a node are used, 
or with at most one node using fewer cores, such that the overall number of cores 
produces an approximately square process grid. For example, for an 11×11 square process 
grid, the total number of cores needed is one hundred twenty one. Since each node has 
eight cores, fifteen full nodes and one core in a sixteenth node are needed. For a 13×13 
square process grid, the total number of cores needed is one hundred sixty nine. That 
means twenty-one full nodes and one core in a twenty-second node are needed. For a 14×14 
square process grid, the total number of cores is one hundred ninety six, so twenty-four 
full nodes plus four cores in a twenty-fifth node are needed. The numbers of cores that 
produce a square process grid are included, even if they do not fill a node, in order to 
test whether the square process grid provides the best performance and to compare it with 
the nearest full node count, which does not give a perfect square process grid. The 
simulation times are listed in Table 4.31 and plotted in Figure 4.22. 
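
The node counts quoted above follow from dividing the core count by the eight cores per 
node. A minimal sketch, with the hypothetical helper name node_layout: 

```python
def node_layout(cores, cores_per_node=8):
    """Return (full nodes, cores on one extra partial node, total nodes used)."""
    full_nodes, extra_cores = divmod(cores, cores_per_node)
    total_nodes = full_nodes + (1 if extra_cores else 0)
    return full_nodes, extra_cores, total_nodes


for cores in (121, 169, 196):
    print(cores, node_layout(cores))
```

For 121, 169 and 196 cores this gives 16, 22 and 25 nodes in total. 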
Table 4.31 Simulation times: varying the number of cores and process grid on Cluster-7 
Model      NUN     Number of cores  P×Q    Number of nodes  Filling (min)  Solving (min)  Total (min)
Model 2-5  114844  120              10×12  15               35             143            178
Model 2-5  114844  121              11×11  16               33             139            172
Model 2-5  114844  128              8×16   16               24             85             109
Model 2-5  114844  136              8×17   17               21             80             101
Model 2-5  114844  144              12×12  18               20             75             95
Model 2-5  114844  152              8×19   19               19             69             88
Model 2-5  114844  160              10×16  20               18             67             86
Model 2-5  114844  168              12×14  21               18             65             83
Model 2-5  114844  169              13×13  22               18             65             83
Model 2-5  114844  176              11×16  22               17             62             79
Model 2-5  114844  184              8×23   23               18             58             76
Model 2-5  114844  192              12×16  24               17             56             73
Model 2-5  114844  196              14×14  25               17             57             73
Model 2-5  114844  200              10×20  25               17             53             69
 
Figure 4.22 Plot of simulation times: fourteen cases varying the number of cores on 
Cluster-7 (horizontal axis: number of cores; vertical axis: total simulation time; 
NUN = 114,844) 
 
In Table 4.31, there is a 63-minute difference between the simulation time using 121 
cores and the simulation time using 128 cores. Using 7 more cores makes a large 
difference in performance. As the number of cores increases, the simulation time improves. 
The results also show that the simulation times using more than 128 cores do not improve 
as significantly as the number of cores increases. This is because the NUN is small. In 
the next section, process grid performance is further investigated using forty two cases 
of numbers of cores and process grids by running simulations with a larger NUN. 
 
4.7.1.2. Process Grid: forty two cases varying the number of cores and process grid 
with larger NUN on Cluster-7 
Cluster-7 is used again to simulate the full size aircraft. Model 2-6 is 
used with the frequency set to 1 GHz to give 179,472 NUN. Forty two cases are run with 
numbers of cores varying from 56 to 448. Table 4.32 lists the simulation times for 
the forty two cases. 
Table 4.32.a Simulation times: forty two cases varying the numbers of cores and process 
grid with larger NUN on Cluster-7 
Model      Case       NUN     Cores  Nodes  P×Q    Filling (min)  Solving (min)  Total (min)
Model 2-6  Case 1-1   179472  56     7      7×8    111            739            850
Model 2-6  Case 1-2   179472  64     8      8×8    97             621            718
Model 2-6  Case 1-3   179472  72     9      8×9    91             533            623
Model 2-6  Case 2-1   179472  80     10     8×10   84             542            626
Model 2-6  Case 2-2   179472  81     11     9×9    88             544            631
Model 2-6  Case 2-3   179472  88     11     8×11   71             538            609
Model 2-6  Case 3-1   179472  96     12     8×12   77             415            492
Model 2-6  Case 3-2   179472  100    12     10×10  72             454            526
Model 2-6  Case 3-3   179472  104    13     8×13   62             408            470
Model 2-6  Case 4-1   179472  120    15     10×12  72             372            443
Model 2-6  Case 4-2   179472  121    16     11×11  60             364            423
Model 2-6  Case 4-3   179472  128    16     8×16   60             330            390
Model 2-6  Case 5-1   179472  136    17     8×17   56             309            365
Model 2-6  Case 5-2   179472  144    18     12×12  55             349            405
Model 2-6  Case 5-3   179472  152    19     8×19   57             302            359
Model 2-6  Case 6-1   179472  168    21     12×14  49             301            350
Model 2-6  Case 6-2   179472  169    22     13×13  40             252            293
Model 2-6  Case 6-3   179472  176    22     11×16  40             311            350
Model 2-6  Case 7-1   179472  192    24     12×16  41             275            317
Model 2-6  Case 7-2   179472  196    25     14×14  56             312            368
Model 2-6  Case 7-3   179472  200    25     10×20  42             231            273
Model 2-6  Case 8-1   179472  224    28     14×16  41             235            276
Model 2-6  Case 8-2   179472  225    29     15×15  37             218            255
Model 2-6  Case 8-3   179472  232    29     8×29   40             224            264
Model 2-6  Case 9-1   179472  248    31     8×31   40             197            237
Model 2-6  Case 9-2   179472  256    32     16×16  38             228            266
Model 2-6  Case 9-3   179472  264    33     12×22  39             214            253
 
Table 4.32.b Simulation times: forty two cases varying the numbers of cores and process 
grid with larger NUN on Cluster-7 (cont’d.) 
Model      Case       NUN     Cores  Nodes  P×Q    Filling (min)  Solving (min)  Total (min)
Model 2-6  Case 10-1  179472  288    36     16×18  36             214            249
Model 2-6  Case 10-2  179472  289    37     17×17  33             196            229
Model 2-6  Case 10-3  179472  296    37     8×37   38             175            214
Model 2-6  Case 11-1  179472  320    40     16×20  22             141            164
Model 2-6  Case 11-2  179472  324    41     18×18  21             142            162
Model 2-6  Case 11-3  179472  328    41     8×41   25             144            168
Model 2-6  Case 12-1  179472  360    45     18×20  19             149            168
Model 2-6  Case 12-2  179472  361    46     19×19  25             129            154
Model 2-6  Case 12-3  179472  368    46     16×23  19             144            163
Model 2-6  Case 13-1  179472  392    49     14×28  19             133            152
Model 2-6  Case 13-2  179472  400    50     20×20  18             121            139
Model 2-6  Case 13-3  179472  408    51     17×24  18             114            132
Model 2-6  Case 14-1  179472  440    55     20×22  17             110            127
Model 2-6  Case 14-2  179472  441    56     21×21  17             110            127
Model 2-6  Case 14-3  179472  448    56     16×28  17             106            123
 
Although the process grid is square, for some cases, the simulation time is better 
using fewer cores. Comparing case 1-3, case 2-1 and case 2-2, case 2-2 would be expected 
to be the fastest because it has the most cores and its P×Q is a 9×9 square, but the three 
cases have similar times, and case 1-3 is the fastest. Even though case 1-3 has nine fewer 
cores than case 2-2, case 1-3 is faster by eight minutes. Comparing cases 3-1 to 3-3, case 
3-2 using 100 cores with the 10×10 square P×Q is the slowest, and the same result is found 
among cases 5-1 to 5-3, where case 5-2 using 144 cores with the 12×12 P×Q is the slowest. 
But comparing cases 6-1 to 6-3, case 6-2 using 169 cores with the 13×13 square P×Q is 
the fastest. Case 7-2 using 196 cores with the 14×14 P×Q is the slowest among cases 7-1 
to 7-3. Case 7-2 even has times similar to case 5-1, case 5-3, case 6-1 and case 6-3, which 
is not expected. Comparing cases 8-1 to 8-3, case 8-2 using 225 cores with the 15×15 
square process grid is the fastest. Case 9-1 is the fastest among cases 9-1 to 9-3 and is 
also faster than case 10-1, which uses 40 more cores than case 9-1. Comparing cases 11-1 
to 11-3, case 11-2 using 324 cores with the 18×18 P×Q is the fastest and case 11-3 is the 
slowest. Case 12-2 using 361 cores with the 19×19 P×Q is the fastest among cases 12-1 to 
12-3. 
In conclusion, it is not always true that a squarer process grid with more 
cores improves the performance. Choosing the proper number of cores has a strong 
impact on the simulations. More tests are needed to find the optimum 
number of cores for a given project size. 
 
4.7.2. Efficient way to run several similar projects 
Up to this point, simulation cases have been run one at a time, using a set number of 
cores. For several cases, the maximum number of cores provided the maximum 
performance. In this section, multiple simulations are run simultaneously to illustrate 
how to choose an optimum number of cores for efficiency. The first step is to use the 
NUN of the project to calculate the memory/storage size required for the matrix 
calculation. Double precision is used for the most accurate results, so each complex 
matrix entry occupies 16 bytes. The storage needed by a project with a given NUN is 
calculated using a simple equation. 
(Total storage required, in bytes) = (NUN) ^ 2 × 16. 
Depending on the resources available, the minimum number of cores that the project 
needs is then calculated with another simple equation. 
(Minimum number of cores needed) = (Total storage required) / (storage per core) 
Again, depending on the resources available, specifically, the storage capacity of the 
cluster and the storage required for the project, the user determines whether the projects 
can be run with ICS using memory, or with OCS using storage. If the project size exceeds 
the amount of RAM, the only option is to use hard disk storage and an OCS. Using fully 
loaded nodes is recommended, so the number of cores is determined by multiplying the 
number of nodes needed by the maximum number of cores in a node. 
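
The two equations above, together with the recommendation to use fully loaded nodes, can 
be combined in a short sketch. The function names and the example values (60,000 NUN, 
2 GB of RAM per core, 8 cores per node) are illustrative assumptions, and GB is taken 
here as 10^9 bytes. 

```python
import math

BYTES_PER_ENTRY = 16  # one double-precision complex matrix entry


def matrix_storage_bytes(nun):
    """Total storage required for the dense matrix: NUN^2 * 16 bytes."""
    return nun ** 2 * BYTES_PER_ENTRY


def min_cores_full_nodes(nun, storage_per_core_bytes, cores_per_node):
    """Minimum core count that holds the matrix, rounded up to whole nodes."""
    cores = math.ceil(matrix_storage_bytes(nun) / storage_per_core_bytes)
    nodes = math.ceil(cores / cores_per_node)
    return nodes * cores_per_node


GB = 10 ** 9
print(matrix_storage_bytes(60_000) / GB)            # storage in GB
print(min_cores_full_nodes(60_000, 2 * GB, 8))      # cores on fully loaded nodes
```

With these assumed values the matrix needs 57.6 GB, which rounds up to 32 cores on four 
fully loaded nodes. 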
HOBBIES has a function called “Pre-HOBBIES.” As the name implies, the project 
can be checked for errors before running a simulation. If there are no errors, the 
Pre-HOBBIES window will display information on the project, including the number of 
points, number of elements, NUN, number of frequencies, number of excitations, post 
processing, etc. [5]. A screen shot of the Pre-HOBBIES window is shown in Figure 4.23. 
 
Figure 4.23 Screen shot of “PRE-HOBBIES” 
 
If the user needs to run 16 projects with the same model, but each project has 
different antennas or excitations, how would the user decide the number of cores to solve 
these projects with maximum performance? As mentioned previously, more cores will 
run faster, and most users will run the 16 projects serially using the maximum number of 
cores available on the system. Is that the fastest way to run multiple projects? In this 
section, a comparison is made between running projects serially, one by one, and running 
projects in parallel, simultaneously. 
Two new simulation models are introduced. Model 4-1 has horn antennas serving 
as both probe and AUT (Antenna Under Test). Model 4-2 has a horn antenna 
serving as probe and a helical antenna as the AUT. These two new models are shown in 
Figures 4.24 and 4.25. This simulation is for non-anechoic measurements. From these 
measurement results, the free space radiation pattern can be reconstructed using impulse 
response simulation [19]. The probe and the AUT are separated by 2 m and the azimuth 
angle θ of the AUT is varied from -90° to 90° with a 5° step. PEC (Perfect Electric 
Conductor) walls surround the model for a realistic example. The operating frequency is 
from 6 GHz to 12 GHz with 121 points in 50 MHz steps. Each simulation model has 36 
projects. The NUN of Model 4-1, with the horn antenna as AUT, is 61,030. The NUN of 
Model 4-2, with the helical antenna as AUT, is 59,942. The simulations are run on 
Cluster-5, which has 512 cores on 64 nodes. 
According to the calculation, 58 GB of storage is needed for 60,000 NUN and at 
least 4 nodes are required to simulate a single project using ICS. Four nodes have 32 
cores and 64 GB of RAM. Five cases are simulated, and multiple simulations are run 
simultaneously using different numbers of cores. Sixteen projects are simulated for every 
case. 

Figure 4.24 Model 4-1 with two horn antennas with a walled environment 
 
 
Figure 4.25 Model 4-2 with a horn antenna probe and a helical Antenna AUT with a 
walled environment 
 
Model 4-1 and Model 4-2 are run one simulation at a time using 512 cores. 
Simulation times are shown in Table 4.33. 
Table 4.33 Simulation times: one simulation at a time using 512 cores for each model 
Model      Project   NUN    Number of cores  Filling (min)  Solving (min)  Total (min)
Model 4-1  horn00    61030  512              47             487            539
Model 4-2  spiral05  59942  512              54             453            508
 
For Model 4-1, one simulation takes 539 minutes, and if all 16 simulations are run 
one by one using 512 cores, the whole job will finish in 8,624 minutes, approximately 6 
days. For Model 4-2, one simulation takes 508 minutes, and if all 16 simulations are run 
in series, the task will take 8,128 minutes, approximately 5.6 days. 
The next case uses Model 4-1 and runs two simulations at the same time, in 
parallel, using 256 cores each. The simulation times are shown in Table 4.34. 
Table 4.34 Simulation times: two simulations in parallel, using 256 cores for each model 
Model      Project   NUN    Number of cores  Filling (min)  Solving (min)  Total (min)
Model 4-1  horn15    61030  256              55             914            979
Model 4-1  horn20    61030  256              53             915            979
 
Running two simulations in parallel, using 256 cores each, takes 979 
minutes. To complete 16 simulations, pairs of simulations would need to be run in 
parallel 8 times. Since the simulation times are similar, the total time to run all 16 
simulations is 7,832 minutes, less than 5.5 days. Running projects in parallel takes less 
time than running the same projects serially, and in this case, the saving is 792 minutes, 
approximately 13.2 hours, a little over half a day. 
The next case again uses Model 4-1 and runs four simulations at a time, in 
parallel, using 128 cores each. The simulation times are shown in Table 4.35. 
Table 4.35 Simulation times: four simulations in parallel, using 128 cores for each model 
Model      Project   NUN    Number of cores  Filling (min)  Solving (min)  Total (min)
Model 4-1  horn25    61030  128              101            1459           1573
Model 4-1  horn30    61030  128              101            1461           1575
Model 4-1  horn35    61030  128              102            1460           1574
Model 4-1  horn40    61030  128              101            1455           1569
 
The maximum time for the four simulations is 1,575 minutes. To complete 16 
simulations would require 4 parallel runs, which would take 6,300 minutes, 
approximately 4.4 days. This is a further improvement over running two simulations 
at a time, with a saving of 1,532 minutes, which is a little over one day. 
The next case again uses Model 4-1 and runs eight simulations at a time using 64 
cores for each. The simulation times are shown in Table 4.36. 
Table 4.36 Simulation times: eight simulations in parallel using 64 cores for each model 
Model      Project   NUN    Number of cores  Filling (min)  Solving (min)  Total (min)
Model 4-1  horn45    61030  64               194            2898           3105
Model 4-1  horn50    61030  64               194            2898           3105
Model 4-1  horn55    61030  64               194            2896           3103
Model 4-1  horn60    61030  64               194            2896           3103
Model 4-1  horn65    61030  64               185            2896           3094
Model 4-1  horn70    61030  64               186            2896           3095
Model 4-1  horn75    61030  64               194            2895           3102
Model 4-1  horn80    61030  64               194            2900           3107
 
The maximum simulation time is 3,107 minutes, just over 2 days. Only 2 
parallel runs are needed to complete all 16 simulations. Every simulation time is similar, 
so the worst case total simulation time to complete all 16 projects is 6,214 minutes, about 
4.3 days. This is a modest time saving of 86 minutes, almost 1.5 hours, over running 
four simulations with 128 cores at a time. 
The next case runs 16 projects in parallel. Model 4-2, with the helical antenna 
AUT, is used with 32 cores per simulation. The simulation times are shown in Table 4.37. 
Table 4.37 Simulation times: sixteen simulations in parallel using 32 cores for each 
model 
Model      Project   NUN    Number of cores  Filling (min)  Solving (min)  Total (min)
Model 4-2  spiral10  59942  32               763            7563           8372
Model 4-2  spiral15  59942  32               769            7574           8391
Model 4-2  spiral20  59942  32               758            7577           8384
Model 4-2  spiral25  59942  32               779            7612           8418
Model 4-2  spiral30  59942  32               763            7614           8436
Model 4-2  spiral35  59942  32               767            7582           8391
Model 4-2  spiral40  59942  32               763            7538           8345
Model 4-2  spiral45  59942  32               769            7590           8396
Model 4-2  spiral50  59942  32               755            7572           8390
Model 4-2  spiral55  59942  32               782            7587           8395
Model 4-2  spiral60  59942  32               756            7575           8379
Model 4-2  spiral65  59942  32               753            7576           8383
Model 4-2  spiral70  59942  32               765            7573           8376
Model 4-2  spiral75  59942  32               757            7557           8371
Model 4-2  spiral80  59942  32               760            7619           8472
Model 4-2  spiral85  59942  32               773            7575           8381
 
The total simulation time for the 16 projects is 8,472 minutes, just under 6 days. 
Even though the horn antenna model has more NUN, the helical antenna model run this way 
takes longer than running eight horn antenna simulations at a time using 64 cores each, 
which takes 6,214 minutes in total. The difference is 2,258 minutes, over 1.5 days. 
A summary of the simulation times for running 16 projects using the five different 
numbers of cores is shown in Table 4.38. 
Table 4.38 Summary of simulation times: 16 projects with similar NUN 
Model      Number of cores  Project(s) running at once  Number of repeat runs  Time once (min)  Total time (min)
Model 4-1  512              1                           16                     539              8,624
Model 4-1  256              2                           8                      979              7,832
Model 4-1  128              4                           4                      1,575            6,300
Model 4-1  64               8                           2                      3,107            6,214
Model 4-2  32               16                          1                      8,472            8,472
 
The user can choose how many cores to give each of several similar-size simulations 
running in parallel on the cluster to obtain the maximum performance. Running eight 
projects at a time using 64 cores each gives the best performance, and running one 
project at a time using 512 cores gives the worst. The time difference between the best 
and the worst case is 2,410 minutes, approximately 1.7 days. 
The user can choose the number of cores that gives the best performance for the 
projects that need to be run. Using the maximum number of cores gives the best 
performance for a single project. However, when there is no benefit in getting some 
data before the other runs complete, running multiple projects in parallel can make the 
most efficient use of the available resources. The user just needs to choose an efficient 
number of cores to improve performance. 
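The totals in Table 4.38 follow from simple batch arithmetic, which can be sketched in Python as shown here. The sketch assumes that 512 cores are shared evenly among the projects running at once (as the core counts in the table suggest) and that every batch of a given case takes the time listed in the Once column. The function and variable names are illustrative and are not part of HOBBIES.

```python
import math

# Assumption: 512 cores in total, split evenly among concurrent projects.
TOTAL_CORES = 512
NUM_PROJECTS = 16

# cores per project -> minutes for one batch ("Once" column of Table 4.38).
# The 32-core entry is Model 4-2; the others are Model 4-1.
batch_minutes = {512: 539, 256: 979, 128: 1575, 64: 3107, 32: 8472}


def total_minutes(cores_per_project, minutes_per_batch,
                  total_cores=TOTAL_CORES, num_projects=NUM_PROJECTS):
    """Total wall time when projects run in equal-size batches."""
    concurrent = total_cores // cores_per_project   # projects running at once
    batches = math.ceil(num_projects / concurrent)  # number of repeat runs
    return batches * minutes_per_batch


totals = {cores: total_minutes(cores, minutes)
          for cores, minutes in batch_minutes.items()}
best = min(totals, key=totals.get)
worst = max(totals, key=totals.get)

for cores, minutes in totals.items():
    print(f"{cores:>3} cores per project: {minutes:>5} minutes in total")
print(f"best: {best} cores, worst: {worst} cores, "
      f"difference: {totals[worst] - totals[best]} minutes")
```

Under these assumptions the sketch reproduces the Total column of Table 4.38, including the difference between the best and the worst case.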
 
 
4.8. IASIZE: more than 4 GB per core 
IASIZE is the size of the in-core buffer on each core when HOBBIES runs OCS. 
IASIZE is an important factor for achieving the best performance. The IASIZE test in 
the parallel book [1] goes only up to 4 GB per core, since the price of RAM at that time 
was much higher and adding hard disk capacity was a better way for parallel CEM 
simulation tools to improve performance. Hard disk storage is still cheaper than RAM 
per unit of capacity, but RAM is now much more affordable than when this research was 
initiated in 2005. The user can select IASIZE in the environment setup in HOBBIES [5]. 
 
4.8.1. Comparison: six different IASIZE using 20 cores on Cluster-9 
The IASIZE test is done using Cluster-9, which has 256 GB of RAM. With 20 cores, 
the maximum IASIZE of each core is 12.8 GB. A new simulation model is introduced: 
Model 5 is an Airbus A380, which is 62.9 m in length, 71.4 m in width and 21.8 m in 
height [20]. The operating frequency is set to 137 MHz and NUN is 200,242. This model 
is simulated with 2, 4, 6, 8, 10 and 12 GB IASIZE using 20 cores. The A380 model is 
shown in Figure 4.26 and the RCS result is shown in Figure 4.27. The simulation times 
are listed in Table 4.39 and plotted in Figure 4.28. 
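The memory figures quoted above can be checked with a short sizing sketch. The per-core limit is simply the RAM divided by the number of cores. The matrix size is only an estimate: it assumes a dense NUN x NUN matrix with double-precision complex entries (16 bytes each), which is an assumption and is not stated in this section. The function names are hypothetical.

```python
def max_iasize_gb(total_ram_gb, cores):
    """Upper bound on IASIZE per core if all RAM were split evenly."""
    return total_ram_gb / cores


def dense_matrix_gb(nun, bytes_per_entry=16):
    """Size of a dense NUN x NUN matrix in GB (10**9 bytes).

    Assumption: 16 bytes per entry (double-precision complex).
    """
    return nun * nun * bytes_per_entry / 1e9


per_core_gb = max_iasize_gb(256, 20)   # Cluster-9 with 20 cores
matrix_gb = dense_matrix_gb(200_242)   # Model 5, NUN = 200,242

print(f"maximum IASIZE per core: {per_core_gb:.1f} GB")
print(f"estimated dense matrix size: {matrix_gb:.1f} GB")
print(f"fits in 256 GB of RAM: {matrix_gb <= 256}")
```

Under the 16-byte assumption, the matrix is larger than the installed RAM, which is consistent with the model being run with an out-of-core buffer of limited size per core.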
Figure 4.26 Airbus A380, Model 5 
 
 
Figure 4.27 Simulation results, RCS (horizontal plane) of Model 5-1 simulated with six 
different IASIZE 
Table 4.39 Simulation times using 20 cores: six different IASIZE on Cluster-9
Model      IASIZE  Time (minutes)
                   Filling  Solving  Total
Model 5-1  2 GB    1163     620      1815
           4 GB    653      548      1232
           6 GB    439      519      988
           8 GB    352      515      898
           10 GB   309      517      857
           12 GB   267      517      816
 
 
Figure 4.28 Filling, solving and total simulation times using six different IASIZE using 
20 cores on Cluster-9 (x-axis: IASIZE in GB; y-axis: simulation time in minutes) 
 
First of all, it is important to recognize that, for the cases with smaller IASIZE, the 
rest of the memory that is not used for IASIZE is used as cache. This additional 
memory cache is advantageous to the performance. In Table 4.39, as IASIZE increases 
from 2 GB to 12 GB, filling time falls from 1,163 to 267 minutes, while solving time 
falls only from 620 to 517 minutes and is essentially flat beyond 6 GB. This means that 
filling time is more dependent on the proper selection of IASIZE than solving time. For 
performance only, using 12 GB per core as IASIZE is the best, but considering both 
price and performance, using 6 or 8 GB per core is more efficient. Screen shots of 
Ganglia [8] during the IASIZE testing are shown in Figure 4.29 and Figure 4.30. 
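The price and performance argument can be made concrete by computing how many minutes of total time each additional GB of IASIZE per core saves. The following is a minimal sketch using the values of Table 4.39. The helper name is illustrative, and "minutes saved per added GB" is only one possible way to express the trade-off.

```python
# (IASIZE in GB, filling, solving, total), times in minutes, from Table 4.39
table_4_39 = [
    (2, 1163, 620, 1815),
    (4, 653, 548, 1232),
    (6, 439, 519, 988),
    (8, 352, 515, 898),
    (10, 309, 517, 857),
    (12, 267, 517, 816),
]


def marginal_savings(rows):
    """Minutes of total time saved per extra GB between consecutive rows."""
    savings = []
    for (gb0, _, _, total0), (gb1, _, _, total1) in zip(rows, rows[1:]):
        savings.append((gb0, gb1, (total0 - total1) / (gb1 - gb0)))
    return savings


savings = marginal_savings(table_4_39)
for gb0, gb1, per_gb in savings:
    print(f"{gb0:>2} -> {gb1:>2} GB: {per_gb:6.1f} minutes saved per added GB")
```

The saving per added GB shrinks quickly after 6 to 8 GB, which matches the observation that those sizes are the more efficient choice when price is taken into account.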
 
Figure 4.29 CPU usages during the IASIZE simulation. 
 
 
Figure 4.30 Memory usages during the IASIZE simulation. 
 
The graphs show CPU usage and memory usage while the simulations are running. 
In order, the six blocks are the simulations with 12, 10, 8, 6, 4 and 2 GB IASIZE. In 
the memory usage graph, the blue part is memory used by HOBBIES for the simulation 
and the green part is the cache. One can see that the simulation time increases as the 
IASIZE (blue part) gets smaller. 
 
4.8.2. Comparison: six different IASIZE using 10 cores on Cluster-9 
Using fewer cores is one way to increase the IASIZE. With 10 cores, IASIZE can 
increase up to 24 GB. The same A380 model, Model 5-1, is used as in the previous 
section. Simulations are performed with 4, 8, 12, 16, 20 and 24 GB IASIZE using 10 
cores. The simulation times are listed in Table 4.40 and a graph of the simulation times is 
shown in Figure 4.31. 
Table 4.40 Simulation times using 10 cores: six different IASIZE on Cluster-9
Model      IASIZE (GB)  Time (minutes)
                        Filling  Solving  Total
Model 5-1  4            1291     2054     3382
           8            703      1461     2203
           12           525      1283     1845
           16           435      1298     1770
           20           390      1248     1676
           24           346      1119     1503
 
 
 
Figure 4.31 Filling, solving and total simulation time using six different IASIZE 
using 10 cores on Cluster-9 (x-axis: IASIZE in GB; y-axis: simulation time in minutes) 
 
One can see that simulation time improves as IASIZE increases. From 4 GB to 
12 GB IASIZE, the total time falls from 3,382 to 1,845 minutes; from 12 GB to 24 GB, 
it falls only from 1,845 to 1,503 minutes. As a result, 12 GB IASIZE is the best choice 
when both price and performance are considered. 
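Because halving the number of cores doubles the IASIZE available to each core, Table 4.39 and Table 4.40 can be compared at the same total in-core buffer (cores times IASIZE). The sketch below does this with the total times from the two tables. This comparison is an illustration added here, not an analysis performed in the tests above, and the names are hypothetical.

```python
# IASIZE per core (GB) -> total simulation time (minutes)
total_20_cores = {2: 1815, 4: 1232, 6: 988, 8: 898, 10: 857, 12: 816}      # Table 4.39
total_10_cores = {4: 3382, 8: 2203, 12: 1845, 16: 1770, 20: 1676, 24: 1503}  # Table 4.40


def slowdown_at_equal_buffer(minutes_10, minutes_20):
    """Ratio of 10-core to 20-core total time at the same total buffer.

    A 10-core run with IASIZE g uses the same total buffer as a 20-core
    run with IASIZE g / 2. Keys of the result are the total buffer in GB.
    """
    ratios = {}
    for iasize_10, time_10 in minutes_10.items():
        if iasize_10 % 2 == 0 and iasize_10 // 2 in minutes_20:
            ratios[iasize_10 * 10] = time_10 / minutes_20[iasize_10 // 2]
    return ratios


ratios = slowdown_at_equal_buffer(total_10_cores, total_20_cores)
for buffer_gb, ratio in ratios.items():
    print(f"{buffer_gb:>3} GB total buffer: 10 cores take {ratio:.2f}x as long")
```

For these measurements the 10-core runs take roughly 1.8 to 2.0 times as long as the 20-core runs with the same total buffer, so a larger IASIZE per core does not by itself make up for the cores that were given up.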
  
  
5. CONCLUSION 
 
The first question in this dissertation was “What is the impact of multicore and 
advanced technologies on computational science software?” and the next was “Are these 
helpful for researchers and engineers?” The answer to the first question is described in 
detail in this dissertation. The answer to the second question is yes, if the software is 
parallelized, the system hardware is appropriately selected, and the hardware and 
software parameters are balanced and optimized. The parallelization of the software is 
now inevitable, and the findings detailed in this dissertation will help software 
developers and users understand how parallelized software works best with typical 
hardware platforms. HOBBIES is a software package that is capable of satisfying all the 
necessary conditions on nine typical hardware platforms. 
In this dissertation, one can see that even though the clock speed of multicore CPUs 
is lower than that of single-core CPUs, multicore CPUs can be beneficial for research 
purposes. When used properly, multicore CPUs can give parallel CEM software faster 
performance and can solve very large problems in reasonable time. For example, a 
simulation of the E2C Hawkeye (a real-size aircraft with radome) took a month on a 
single-core CPU. Now, with HOBBIES parallel code, the same project can be 
solved within a day. Multicore CPU clock speed is approaching the speeds previously 
only available on single-core CPUs. The multicore CPUs also have new technologies, 
such as QPI with additional system buses and a memory controller in the CPU for direct 
access to designated memory. These advances provide larger bandwidth and faster 
system bus speed, which help each core run the simulation with improved performance. 
RAM is also improving, with more bandwidth and faster bus speed. The hard disk still 
needs more investigation, with larger numbers of cores and larger NUN, to show 
whether SAS hard disks at 15,000 RPM bound together by a RAID controller give better 
performance for parallel CEM software. Network attached storage (NAS) is not 
recommended, since it creates a bottleneck in which each core has to wait for its turn 
to access the storage. Network performance also depends on the number of cores and 
the NUN used in the simulation. If the system has fewer than 16 nodes, Gigabit Ethernet 
is sufficient, since it is simple to install and manage and its performance is good enough 
compared with InfiniBand. If the system has more than 32 nodes, 10 Gigabit Ethernet or 
InfiniBand is recommended. Between 16 and 32 nodes, one can choose the network 
according to the simulation size and the number of cores. If one often runs small 
projects, InfiniBand does not offer better performance than Gigabit Ethernet. 
There are two parameters that the user can use to control the performance of parallel 
CEM software: the number of cores and IASIZE. The process grid and the most efficient 
way to run the simulation depend on the number of cores that the user selects. The user 
can determine the minimum number of cores needed from the calculated NUN, which is 
the size of the project, and from that choose the right number of cores for their purpose. 
With RAM becoming less expensive, the user can adjust IASIZE to use more RAM, which 
improves the performance. The user can choose the size of RAM according to their 
budget and the expected maximum project size. 
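The network guidance given in this paragraph can be written down as a small decision rule. The sketch below only encodes the node-count thresholds stated in the text; the function name and the returned strings are illustrative, and for the range from 16 to 32 nodes it returns no single answer because the choice depends on the simulation size and the number of cores.

```python
def recommend_network(num_nodes):
    """Return the interconnect suggested in the conclusion for a cluster size."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if num_nodes < 16:
        # Simple to install and manage, and fast enough at this scale.
        return "Gigabit Ethernet"
    if num_nodes > 32:
        return "10 Gigabit Ethernet or InfiniBand"
    # 16 to 32 nodes: no fixed recommendation in the text.
    return "depends on simulation size and number of cores"


for nodes in (8, 16, 32, 64):
    print(f"{nodes:>2} nodes: {recommend_network(nodes)}")
```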
Researchers use various computer platforms, with different CPU, RAM, hard disk, 
and network environments. Given those parameters, the user needs to know how to 
choose the right hardware for parallel CEM simulation and how to run simulation 
models of different sizes in the optimum way. As mentioned previously, small projects 
will not be affected much by optimizing parameters. However, the impact of the 
parameters is significant for larger projects, such as radiation from antennas on 
real-size vehicles, scattering from real-size aircraft, an antenna array in a radome, an 
antenna pattern that includes the real environment, and similar real-world projects. 
The main contribution of this dissertation is that new hardware and technologies 
are benchmarked with the HOBBIES parallel CEM software, to see the performance 
from the perspective of CEM researchers. Previous researchers have not investigated 
the possibility that new hardware and technologies may degrade the performance of 
CEM software; new hardware and technological advancements were assumed to result 
in performance improvements. Usually, testing and comparisons of new hardware and 
technologies are done using benchmark programs written by software engineers, not by 
CEM developers. In this dissertation, by contrast, parameters that affect parallel CEM 
software are investigated using HOBBIES on nine computer platforms. Performance 
effects are more significant with large projects (large NUN) and large numbers of cores. 
With the information detailed in this dissertation, scholars and engineers can identify 
how to use their CEM software on specific platforms with the highest achievable 
efficiency, regardless of whether the platform is a laptop, a workstation or a computing 
cluster. 
  
BIBLIOGRAPHY 
 
[1] Yu Zhang, Tapan K. Sarkar, Parallel Solution Of Integral Equation-based EM 
Problems In The Frequency Domain, IEEE-Wiley Press, Hoboken, NJ, 2009 
[2] Pam Frost Gorder, Multicore Processors for Science and Engineering, IEEE 
Computing in Science & Engineering, Volume: 9, Issue: 2, Page(s): 3-7, March-April 2007 
[3] Jack Dongarra, Dennis Gannon, Geoffrey Fox, and Ken Kennedy, The Impact of 
Multicore on Computational Science Software, CTWatch QUARTERLY, Volume 3 
Number 1, Page(s): 3-10, February 2007 
[4] Microprocessor Quick Reference Guide, [Online] 
Available at: http://www.intel.com/pressroom/kits/quickrefyr.htm#IntelTop 
[5] Y. Zhang, T. K. Sarkar, X. Zhao, D. Garcia-Donoro, W. Zhao, M. Salazar-Palma, and 
S. Ting, Higher Order Basis Based Integral Equation Solver (HOBBIES), Hoboken, 
NJ, Wiley, 2012 
[6] T. K. Sarkar, B. Kolundzija, M. Salazar-Palma, Use of Higher Order Basis in 
Solution of Electromagnetic Field Problem, Ultra-Wideband, Short-Pulse 
Electromagnetics 7, Chapter 18, pp. 150-158, Springer New York 
[7] R. F. Harrington, Field Computation by Moment Methods, Macmillan, New York, 
1968 
[8] Ganglia Monitoring System, [Online] 
Available at: http://ganglia.info/ 
[9] ROCKS, [Online] 
Available at: http://www.rocksclusters.org/ 
[10] Intel, Hyper Threading Technology, [Online] 
Available at: http://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html 
[11] Intel processor comparison, [Online]  
Available at: http://ark.intel.com/ 
[12] Intel, Turbo Boost Technology, [Online] 
Available at: http://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html 
[13] Intel, An Introduction to the Intel QuickPath Interconnect, Document Number: 
320412-001US, Jan 2009 
Available at: http://www.intel.com/content/www/us/en/io/quickpath-technology/quickpath-technology-general.html 
[14] HP, DDR3 Configuration Recommendations for HP Proliant G6 Servers, Best 
Practice Guidelines for ProLiant Servers with the Intel Xeon 5500 processor series 
Engineering Whitepaper, 1st Edition, May 2009 
[15] Northrop Grumman Global Hawk, [Online] 
Available at: http://www.northropgrumman.com/capabilities/globalhawk 
[16] IEEE Standard for Ethernet, IEEE Computer Society, 28 December 2012 
Available at: https://standards.ieee.org/about/get/802/802.3.html 
[17] Interconnect Analysis: 10GigE and InfiniBand in High Performance Computing, 
HPC Advisory Council Whitepaper, 2009 
Available at: http://www.hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf 
 
 
[18] DELL Enterprise Technology Center, HPC Cluster networks, [Online] 
Available at: http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/hpc-cluster-networks.aspx 
[19] Jinhwan Koh, De, A., Sarkar, T.K., Hongsik Moon, Weixin Zhao, Salazar-Palma, 
M., Free Space Radiation Pattern Reconstruction from Non-Anechoic Measurements 
Using an Impulse Response of the Environment, Antennas and Propagation, IEEE 
Transactions, Vol 60, Issue 2, Part 2, 821 - 831, 2012  
[20] A380, Airbus [Online] 
Available at: http://www.airbus.com/aircraftfamilies/passengeraircraft/a380family/ 
  
VITA 
 
Name of Author: Hongsik Moon 
 
Education: 
M.S., Syracuse University, Syracuse NY, USA                                                    May 2008 
B.E., Kyoungpook National University, Daegu, Republic of Korea                      Feb 2003 
Bloomfield College, Bloomfield, NJ                                   Mar 2002 - Aug 2002 
o IT Education Program in Advanced Technology Institute supported by Ministry 
of Information and Communication in Korea 
 
Awards and Honors: 
Research Assistantship in EECS, Syracuse University, Syracuse NY 
             Fall 2006 - Aug 2013 
 
Member of the Phi Beta Delta International Honor Society            Mar 2008 
 
Experience: 
Technician in OHRN ENTERPRISE INC.          Oct 2014 - Present 
 
IEEE APS 2012 in Chicago US, OHRN Booth, HOBBIES demonstration          July 2012 
 
ROCKS and HOBBIES installation and demonstration in University of Macau, Macau, 
China                      Feb 2012 
 
IEEE APS 2010 in Toronto Canada, OHRN Booth, HOBBIES demonstration    July 2010 
 
TIDES and HOBBIES installation and demonstration, maintenance and troubleshooting 
of HP ProLiant blade server in NAVAIR, Maryland, USA     
          Nov 2008, April 2010, Sep 2010, Dec 2010 
  
Publications: 
Journal  
[1] T. Tantisopharak, H. Moon, P. Youryon, K. Bunya-athichart, M. Krairiksh and T. K. 
Sarkar, "Nondestructive Determination of the Maturity of the Durian Fruit in the 
Frequency Domain Using the Change in the Natural Frequency," in IEEE 
Transactions on Antennas and Propagation, vol. 64, no. 5, pp. 1779-1787, May 2016. 
[2] W. Lee, T. K. Sarkar, H. Moon, and M. Salazar-Palma, “Identification of multiple    
objects using their natural resonant frequencies,” IEEE Antennas Wireless Propag. 
Lett., vol. 12, pp. 54-57, Mar. 2013. 
[3] W. Lee, T. K. Sarkar, H. Moon, A. G. Lamperez, M. Salazar-Palma, “Effect of 
Material Parameters on the Resonant Frequencies of a Dielectric Object,” IEEE 
Antennas Wireless Propag. Lett., vol. 12, pp. 1311-1314, 2013.  
[4] W. Lee, T. K. Sarkar, H. Moon, and M. Salazar-Palma, “Computation of the natural 
poles of an object in the frequency domain using the Cauchy method,” IEEE 
Antennas Wireless Propag. Lett., vol. 11, pp. 1137-1140, Oct. 2012.  
[5] W. Lee, T. K. Sarkar, J. Koh, H. Moon, and M. Salazar-Palma, “Generation of a 
wide-band response using early-time and middle-frequency data through the use of 
orthogonal functions,” Prog. Electromagn. Res. M.(PIERM), Vol. 25, pp. 115-126, 
July 2012.  
[6] W. Lee, T. K. Sarkar, J. Koh, H. Moon, and M. Salazar-Palma, “Generation of a 
wide-band response using early-time and middle-frequency data through the Laguerre 
functions,” Prog. Electromagn. Res. Lett.(PIERL), Vol. 30, pp. 115-123, Mar. 2012. 
[7] J. Koh, A. De, T. K. Sarkar, H. Moon, W. Zhao, M. Salazar-Palma, “Free Space 
Radiation Pattern Reconstruction from Non-Anechoic Measurements Using an 
Impulse Response of the Environment,” IEEE Antennas and Propagation, IEEE 
Transactions on Volume: 60, Issue: 2, Part: 2, Page(s): 821-831, Feb. 2012 
[8] Yu Zhang, Taylor, M., Sarkar, T., De, A., Mengtao Yuan, Hongsik Moon, Changhong 
Liang, “Parallel in-core and out-of-core solution of electrically large problems using 
the RWG basis functions,” IEEE Antennas and Propagation Magazine, Volume: 50, 
Issue: 5, pp. 84-94, Oct. 2008  
[9] Yu Zhang, Taylor, M., Sarkar, T., Moon, H., Yuan, M., “Solving large complex 
problems using a higher-order basis: parallel in-core and out-of-core integral-equation 
solvers,” IEEE Antennas and Propagation Magazine, Volume: 50, Issue: 4, pp. 13-30, 
Aug. 2008 
 
Conference  
[1] D. G. Donoro, M. Salazar-Palma, T. K. Sarkar, Y. Zhang, H. Moon, S. W. Ting, 
"Use of optimization in designing complex electromagnetic radiating structures," 
in Wireless Symposium (IWS), 2014 IEEE International, pp.1-2, 2014 
[2] Sarkar, T.K., Zhang, Y., Garcia Donoro, D., Moon, H., Salazar-Palma, M., Ting, 
S.W., "Solving large complex problems using a higher order basis: Parallel out-of-
core integral equation solvers involving a million unknowns," in Wireless 
Symposium (IWS), 2014 IEEE International, pp.1-4, 24-26 March 2014 
 
[3] Zhang, Y., Moon, H., Garcia Donoro, D., Sarkar, T.K., Salazar-Palma, M., Ting, 
S.W., "The art of parallelization in solving large electromagnetic field problems out-
of-core," in Wireless Symposium (IWS), 2014 IEEE International , pp.1-3, 24-26 
March 2014 
[4] Salazar-Palma, M., Garcia Donoro, D., Sarkar, T.K., Zhang, Y., Moon, H., Ting, 
S.W., "Advantage of using a higher order basis for the solution of large 
electromagnetic field problems," in Wireless Symposium (IWS), 2014 IEEE 
International , pp.1-4, 24-26 March 2014 
[5] W. Lee, T. K. Sarkar, H. Moon, M. Salazar-Palma, "Identification of multiple objects 
using their natural resonant frequencies from both frequency and time domain data," 
in Radar Conference 2013, IET International, pp.1-7, 2013 
[6] W. Lee, T. K. Sarkar, H. Moon, M. Salazar-Palma, "Identification of an object located 
on the ground using its natural poles using both FD and TD data," in Antennas and 
Propagation Society International Symposium (APSURSI), IEEE, pp. 23-24, 2013 
[7] T. Tantisopharak, M. Krairiksh, H. Moon, W. Lee, and T. K. Sarkar, “Identification 
of maturity of fruit in the frequency domain using its natural frequencies,” in Proc. 
IEEE APCAP, Aug. 2012.  
[8] W. Zhao, M. HongSik and T. K. Sarkar, “Retrieval of Free Space Radiation Pattern 
through Non-Anechoic Data,” in Antennas and Propagation Society International 
Symposium (APSURSI), IEEE, pp. 1-2, 2012 
 
 
[9] W. Lee, T. K. Sarkar, H. Moon, and L. Brown, “Detection and identification using 
natural frequency of the perfect electrically conducting (PEC) sphere in the frequency 
domain and time domain,” in Proc. IEEE AP-S/URSI Int. Symp., pp. 2334-2337, July 
2011. 
[10] Yu Zhang, Sarkar, T.K., Taylor, M., Moon, H., “Solving MoM problems with 
million level unknowns using a parallel OCS on a high performance cluster,” IEEE 
Antennas and Propagation Society International Symposium, APSURSI, pp. 1-4, June 
2009 
[11] Yu Zhang, Sarkar, T.K., Moon, H., Taylor, M., Donoro, D.G., Salazar-Palma, M., 
“Parallel MoM simulation of complex EM problems,” IEEE Antennas and 
Propagation Society International Symposium, APSURSI, pp.1-4, June 2009  
[12] Ghosh, D., Hongsik Moon, Sarkar, T.K., “Design of through-the-earth mine 
communication system using helical antennas,” IEEE Antennas and Propagation 
Society International Symposium, pp. 1-4, July 2008  
[13] Zhang, Y., Sarkar, T.K., Moon, H., De, A., Taylor, M.C., “Solution of large complex 
problems in computational electromagnetics using higher order basis in MOM with 
parallel solvers,” IEEE Antennas and Propagation Society International Symposium, 
pp. 5620-5623, June 2007 
