Static and Quasi-Dynamic Load Balancing in Parallel FDTD Codes for Signal Integrity, Power Integrity, and Packaging Applications by Seguin, Sarah A. et al.
Missouri University of Science and Technology 
Scholars' Mine 
Electrical and Computer Engineering Faculty 
Research & Creative Works Electrical and Computer Engineering 
01 Aug 2004 
Static and Quasi-Dynamic Load Balancing in Parallel FDTD Codes 
for Signal Integrity, Power Integrity, and Packaging Applications 
Sarah A. Seguin 
Michael A. Cracraft 
James L. Drewniak 
Missouri University of Science and Technology, drewniak@mst.edu 
Recommended Citation 
S. A. Seguin et al., "Static and Quasi-Dynamic Load Balancing in Parallel FDTD Codes for Signal Integrity, 
Power Integrity, and Packaging Applications," Proceedings of the IEEE International Symposium on 
Electromagnetic Compatibility (2004, Santa Clara, CA), vol. 1, pp. 107-112, Institute of Electrical and 
Electronics Engineers (IEEE), Aug 2004. 
The definitive version is available at https://doi.org/10.1109/ISEMC.2004.1350006 
Static and Quasi-Dynamic Load Balancing in 
Parallel FDTD Codes for Signal Integrity, Power 
Integrity, and Packaging Applications 
Sarah A. Seguin, Michael A. Cracraft, James L. Drewniak 
Department of Electrical and 
Computer Engineering 
University of Missouri-Rolla 
Rolla, MO 65409 
Abstract- The Finite-Difference Time-Domain (FDTD)
method is a robust technique for calculating electromagnetic
fields, but practical problems, involving complex or large
geometries, can require a long time to calculate on any one
single-processor computer. One computer with many processors
or many single-processor computers can reduce the computation
time. However, some FDTD cell types, e.g., PML cells, require
more computation time than others. Thus, the size and
shape of the individual process allocations can significantly
influence the computation time. This paper addresses these
load balancing issues with static and quasi-dynamic approaches.
The Message-Passing Interface (MPI) library is applied to a
three-dimensional (3D) FDTD code. Timing results, including
speedup and efficiency, are presented for trials run on a cluster
of sixteen processing nodes and one server node.
Two examples are shown in this paper, a power bus with 
16 decoupling capacitors and a five layer power distribution 
network. In such models, the problem size and complexity make 
modeling with a serial code impractical and time consuming for 
engineering. Models with several million cells take days to run, 
but proper implementation, including load balancing, can reduce 
this execution time to hours on a sufficiently powerful cluster.
I. INTRODUCTION 
Execution time is decreased with parallel computer codes 
by distributing the problem over many processes. Ideally, the 
calculation time for a parallel code is proportional to $1/N$, where
$N$ is the number of processes. However, communications
and poor problem distributions increase calculation time. This
increase constitutes a loss of efficiency. For parallel codes the
efficiency [1] is defined as

$$E(N) = \frac{T_1}{N \, T_N}$$

$T_1$ is the time required to calculate the problem using a serial
code or the parallel code run on a single process. $T_N$ is the
time required to calculate the problem using $N$ processes. A
second measure for parallel codes is speedup [1], defined as

$$S(N) = \frac{T_1}{T_N}$$
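As a minimal sketch of how these two measures are computed from timing data, the following Python functions take the speedup as $T_1/T_N$ and the efficiency as $T_1/(N T_N)$, the usual definition in [1]. The function names and the sample timings are hypothetical and are not measurements from this paper.

```python
def speedup(t1: float, tn: float) -> float:
    """Speedup S(N) = T_1 / T_N."""
    if t1 <= 0 or tn <= 0:
        raise ValueError("times must be positive")
    return t1 / tn


def efficiency(t1: float, tn: float, n: int) -> float:
    """Efficiency E(N) = T_1 / (N * T_N), i.e. the speedup divided by N."""
    if n < 1:
        raise ValueError("the number of processes must be at least 1")
    return speedup(t1, tn) / n


if __name__ == "__main__":
    # Hypothetical average times per time step (seconds), keyed by process count.
    t1 = 1.0
    timings = {2: 0.55, 4: 0.30, 8: 0.18, 16: 0.125}
    for n, tn in timings.items():
        print(n, round(speedup(t1, tn), 2), round(efficiency(t1, tn, n), 2))
```

An efficiency of 1 corresponds to the ideal $1/N$ scaling; values below 1 reflect communication and load imbalance.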
Of the two problems decreasing efficiency mentioned above, 
communications are an essential part of the message passing 
scheme and can be streamlined but not removed. However, 
the problem distribution can be improved with proper load 
balancing, which can increase the efficiency of the code signif- 
icantly. Assume that each process depends on communications 
from the other processes during the calculation. Then, the 
overall calculation time is dominated by the slowest process. 
If the problem is improperly distributed, some processes will 
be more stressed than others. These stressed processors hold 
up the entire calculation. One approach is found in seeking 
a better load balancing scheme. This paper will describe 
methods for improving the load balancing for FDTD codes. 
The parallel 3D code used in this paper to simulate the models,
originally developed in [2], uses the Message-Passing
Interface (MPI) for the distribution of processes [3]. MPI
is a library of functions written for Fortran, C, and C++, 
and is not a programming language itself. MPI was designed 
with portability as the primary consideration. A program 
written using MPI will run on any system with a working 
MPI implementation. It provides a medium for programs to 
exchange data between processes, i.e., functions to arrange 
processes and functions to send and receive data. Through 
MPI functions, a process can find out who it is, also known 
as its rank, and the total number of processes that are present. 
It is possible to write efficient parallel programs with as few 
as six MPI functions. 
Load balancing concerns how the problem is divided and 
distributed and can be split into two general categories: static 
load balancing and dynamic load balancing [4]. In static load
balancing, prior to the calculation, some criteria are used to 
determine how much of the problem each process should be 
given. In dynamic load balancing, the problem is broken into 
work units that are sent to each process on request during the 
calculation. 
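The difference between the two categories can be illustrated with a small simulation, sketched here under strong simplifying assumptions: communication costs are ignored, all processes are equally fast, and the work unit costs are hypothetical. The static version fixes each share before the calculation; the dynamic version hands the next unit to whichever process becomes free first.

```python
import heapq


def static_makespan(unit_costs, n_proc):
    """Each process receives a fixed contiguous share of the units up front."""
    share = -(-len(unit_costs) // n_proc)  # ceiling division
    loads = [sum(unit_costs[i:i + share])
             for i in range(0, len(unit_costs), share)]
    return max(loads)


def dynamic_makespan(unit_costs, n_proc):
    """Units are handed out one at a time to the process that is free first."""
    free_at = [0.0] * n_proc          # time at which each process becomes idle
    heapq.heapify(free_at)
    for cost in unit_costs:
        t = heapq.heappop(free_at)    # earliest idle process requests work
        heapq.heappush(free_at, t + cost)
    return max(free_at)


if __name__ == "__main__":
    costs = [8, 8, 1, 1, 1, 1, 1, 1]  # two expensive units, six cheap ones
    print(static_makespan(costs, 2), dynamic_makespan(costs, 2))
```

In a real code, every request in the dynamic version costs a message exchange with the server process, which is why the simulated advantage is an upper bound.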
FDTD is naturally a data-parallel algorithm, and it lends 
itself well to a static load balancing scheme. In a data-parallel 
algorithm, the model space is divided among the processes. In 
this paper, a weighting scheme is used to determine how many 
cells each process is given. As an example, calculation times 
of perfectly matched layer (PML) cells are compared with the 
calculation times of free space cells. The ratio of those times 
is defined as the cell weight for a PML cell. PML cells are 
0-7803-8443-1/04/$20.00 © IEEE.
computationally intensive compared with free space cells,
and processes spatially located at the edge of the model space 
will have more PML cells than the interior processes. Thus, 
the edge processes will take longer to calculate than the other 
processes, slowing the entire calculation down. 
Load balancing attempts to alleviate imbalances, which are 
the major impediment to the efficiency of the parallel algo- 
rithm. Static load balancing divides the model at the beginning 
of the program to minimize differences in the process loads. 
Then, processes work on their portion of the model space for 
the duration of the program. Any imbalance in the loads
will cause delays in the program execution. Dynamic
load balancing attempts to break up a problem into work units
and distributes these work units as processes request them.
This type of load balancing requires more communication
between the processes and, thus, delays may be incurred when
the server process is mired in multiple requests at once or
when the communication speed of the network is limiting. Even so,
dynamic load balancing remains the most efficient approach.

Fig. 1. Top view of the power bus with 16 decoupling capacitors.
Although FDTD is not suited to a dynamic load balancing
approach, a quasi-dynamic load balancing approach is used
for the models in this paper that may fix the same problems
that a dynamic approach fixes, only more conservatively.
The quasi-dynamic load balancing builds on the static load
balancing method with PML weighting, but the code allows
the process boundaries to shift to minimize the time per
time step for all the processes. Basing the process boundaries
on performance during the calculation can not only fix the
distribution problems but also adjust for differences between
processors, e.g., processor clock speed or model.
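The text does not spell out the rule used to shift the process boundaries. The following Python sketch shows one conservative possibility, assuming a one-dimensional decomposition into contiguous slabs and an adjustment made after the time per time step of each process has been measured. The function name, the tolerance, and the one-slab-per-adjustment rule are all assumptions made for illustration, not the method of the code in [2].

```python
def shift_boundaries(slabs, times, tolerance=0.05):
    """Move one slab from the slowest process to the fastest one.

    slabs[i] is the number of contiguous slabs owned by process i and
    times[i] is its measured time per time step. Because the slabs are
    contiguous, changing the counts shifts every boundary that lies
    between the two processes by one slab.
    """
    if len(slabs) != len(times):
        raise ValueError("slabs and times must have the same length")
    slow = max(range(len(times)), key=times.__getitem__)
    fast = min(range(len(times)), key=times.__getitem__)
    new_slabs = list(slabs)
    # Only adjust when the imbalance is significant and the slow
    # process can give up a slab without becoming empty.
    if times[slow] - times[fast] > tolerance * times[slow] and new_slabs[slow] > 1:
        new_slabs[slow] -= 1
        new_slabs[fast] += 1
    return new_slabs


if __name__ == "__main__":
    # Hypothetical measurements: process 0 holds many PML cells and lags.
    print(shift_boundaries([10, 10, 10, 10], [1.4, 0.9, 0.9, 1.0]))
```

Repeating such an adjustment every few hundred time steps would let the division drift toward equal times per step, whether the imbalance comes from the cell types or from unequal processors.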
II. TRIALS AND RESULTS
Two examples were simulated using a parallel FDTD algorithm
and compared with a serial algorithm. The results for
both of the examples were compared, and provide an example
of the advantages of the PML weighting. A power bus with 16
decoupling capacitors, shown in Figs. 1 and 2, was simulated
above a ground plane with a dielectric in a parallel 3D FDTD
code.
In addition, a five layer power bus, described in Fig. 3 and
shown in Figs. 4 through 8, was also simulated.
Fig. 1, discussed in [5], shows the board configuration.
The lossy board dielectric in the test device is FR-4, 65 mil 
thick. The upper and lower planes are both copper, and an 
array of sixteen global decoupling capacitors connects to the 
power plane by wires passing through square via holes. 
Cell weighting in this paper refers to load balancing by 
weighting Perfectly-Matched Layer (PML) cells, which take 
longer to execute than the cells in the interior of the problem 
space. This may be ascertained from the complexity of the 
update equations for the PMLs [6]. By weighting the PML
cells the problem may be more efficiently divided and dis- 
tributed to the processing nodes. Load balancing by weighting 
is employed by assigning a numerical weight to each cell 
according to its cell type: interior or PML. The weight for 
an interior electric field and magnetic field calculation is set
to 1. The weights for the electric field PML and the magnetic
field PML calculations are then set higher, e.g., 3.5 and 5.3,
respectively. The break weight is the ideal weight for each
process. Ideally, the processes would be equally loaded and
there would be no loss in efficiency due to load imbalances.
However, it is unlikely that such a breakup is possible.

Fig. 3. The 5 layer power bus description. Layer thickness = 14 mils; dielectric constant = 4.24, loss tangent = 0.022; dx = dy = 0.5 mm, dz = 0.11853 mm.
Fig. 4. First layer of the 5 layer power bus.
Fig. 5. Second layer of the 5 layer power bus.
Fig. 6. Third layer of the 5 layer power bus.
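A minimal sketch of such a weighted division is given below. It assumes a one-dimensional decomposition into slabs along x, a single combined PML cell weight of 8.8 (3.5 + 5.3), a break weight equal to the total weight divided by the number of processes, and a greedy cut each time the running weight reaches a multiple of the break weight. These choices and the function names are assumptions made for illustration; the code in [2] may divide the model differently.

```python
def slab_weights(nx, ny, nz, pml, w_int=1.0, w_pml=8.8):
    """Total cell weight of each yz-slab along x, PML layers included."""
    full = (ny + 2 * pml) * (nz + 2 * pml)   # cells in one slab
    inner = ny * nz                           # interior cells in one slab
    weights = []
    for i in range(nx + 2 * pml):
        if i < pml or i >= nx + pml:
            weights.append(full * w_pml)      # slab lies entirely in the PML
        else:
            weights.append(inner * w_int + (full - inner) * w_pml)
    return weights


def partition(weights, n_proc):
    """Greedy split of the slabs into n_proc contiguous groups."""
    if len(weights) < n_proc:
        raise ValueError("need at least one slab per process")
    break_weight = sum(weights) / n_proc
    counts, acc, start = [], 0.0, 0
    for i, w in enumerate(weights):
        acc += w
        slabs_left = len(weights) - (i + 1)
        procs_left = n_proc - len(counts) - 1
        reached = acc >= break_weight * (len(counts) + 1)
        # Cut at the break weight, or earlier if the remaining
        # processes would otherwise be left without a slab.
        if len(counts) < n_proc - 1 and (reached or slabs_left == procs_left):
            counts.append(i + 1 - start)
            start = i + 1
    counts.append(len(weights) - start)
    return counts


if __name__ == "__main__":
    w = slab_weights(nx=100, ny=50, nz=20, pml=8)  # example dimensions
    print(partition(w, 4))
```

With these weights the processes at the two ends of the x axis receive fewer slabs than the interior processes, because their slabs consist entirely of PML cells.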
The objective of this section is to determine how well the 
PML weighting scheme increases the efficiency of the parallel 
FDTD program. Two sets of data were computed for both
examples: one case with and one case without the PML
weighting. Each case was run on two meshes with an increasing number
of processors. The two meshes are labeled low density and
high density, and differ only in their cell sizes. The cell
dimensions of the high density mesh are half those of the low
density mesh. Thus, there are eight times as many cells in the
interior region of the high density mesh.
The following sections discuss the data collected for each 
example. Each simulated example was compared and checked 
against the serial version as well as the measured results. All 
examples show the average time step, speedup, and efficiency
that were calculated from the results. The number of processes
indicated on each graph is the number of worker processes.
The total number is the indicated processes plus one, called
the server process. In each example, the execution times, the
speedup, and the efficiency behave approximately as expected
when compared with the theoretical models, although the
efficiency graph tends to be rather erratic for both examples.

Fig. 7. Fourth layer of the 5 layer power bus.
Fig. 8. Fifth layer of the 5 layer power bus.
A. Power Bus with 16 Decoupling Capacitors
The measured results and the simulated FDTD results are
shown in Figs. 9 and 10. The results for the low and high
resolution models are found in Figs. 11 through 13.
Fig. 11 shows the average time steps for the four sets of
data collected and their ideal ($T_1/N$) for the power bus with
16 decoupling capacitors. In particular for the high density
model, the unweighted set is asymptotically approaching a
time step larger than the weighted version. The weighted set
is still decreasing with the ideal even at the 16 processor limit of
the cluster.
Fig. 12 shows the speedup, which was expressed as the ratio
of the single processor execution time to that of N processors. 
In this figure, the asymptotic limit mentioned for the time steps 
is even more apparent. With 16 processors the unweighted 
sets of the two models are already beginning to plateau while 
the weighted sets are falling away from the linear ideal more 
gradually. 
The efficiency is shown in Fig. 13. As with the speedup,
the efficiency is a numerical gauge showing how parallelizable
the algorithm is. In the plot the weighted data set for both
models has the higher efficiency. The efficiency is erratic
in appearance, which can be attributed to the setup of the 
individual cluster, including the machines and networking 
equipment. 
The low density model of the power bus with 16 decoupling
capacitors uses 100 by 50 by 20 cells in the internal region. The
PML boundary is 8 cells thick, so there are 100,000 cells
in the internal region while there are 175,616 cells in the
PML boundaries. The high density model doubles the number of
cells in each dimension, so there are 200 by 100 by 40 cells
in the internal region. Using an 8 cell thick PML boundary,
there are 800,000 cells in the internal region and 603,136
cells in the PML boundary, for 1,403,136 cells in total. Estimate the overall calculation
time of a PML cell as $t_{PML} = 8.8\,t_{FS}$, where $t_{FS}$ is the
calculation time for a free-space cell and 8.8 is the sum
of the electric and magnetic field weights proposed earlier.
Then, the calculation time per time step for each model can
be estimated. The calculation time for the low density model
is

$$\begin{aligned}
t_{LD} &= (100000)\,t_{FS} + (175616)\,t_{PML} \\
       &= (100000 + 1545420.8)\,t_{FS} \\
       &= 1645420.8\,t_{FS}
\end{aligned}$$

and the high density model calculation time is

$$\begin{aligned}
t_{HD} &= (800000)\,t_{FS} + (603136)\,t_{PML} \\
       &= (800000 + 5307596.8)\,t_{FS} \\
       &= 6107596.8\,t_{FS}
\end{aligned}$$

The ratio of the two time estimates yields $t_{HD}/t_{LD} = 3.71$. In
Fig. 11, the times for one processor yield a ratio of around
2.75, which is less than the estimated ratio but still within a
factor of 2 of the estimate.
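The cell counts and the estimated ratio can be checked with a few lines of Python. The sketch assumes, as the estimate above does, that the PML surrounds the internal region on all six faces, so that the number of PML cells is the total number of cells minus the number of internal cells. The function names are hypothetical.

```python
def cell_counts(nx, ny, nz, pml):
    """Return (internal cells, PML cells) for a PML of the given thickness."""
    interior = nx * ny * nz
    total = (nx + 2 * pml) * (ny + 2 * pml) * (nz + 2 * pml)
    return interior, total - interior


def estimated_time(nx, ny, nz, pml, w_pml=8.8):
    """Time per time step in units of the free-space cell time t_FS."""
    interior, pml_cells = cell_counts(nx, ny, nz, pml)
    return interior + w_pml * pml_cells


if __name__ == "__main__":
    t_ld = estimated_time(100, 50, 20, 8)    # low density model
    t_hd = estimated_time(200, 100, 40, 8)   # high density model
    print(cell_counts(100, 50, 20, 8), cell_counts(200, 100, 40, 8))
    print(round(t_hd / t_ld, 2))
```

Because the PML thickness stays at 8 cells while the internal region grows by a factor of 8, the PML cells make up a smaller fraction of the high density model, which is why the estimated ratio is well below 8.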
As more processors are progressively used in the calcula- 
tions the average time step becomes dominated by commu- 
nication times between processes and other unparallelizable 
portions of the program. Therefore, there is a limit to the 
reduction of calculation time for a particular algorithm. For 
example, in Fig. 14, the low density model's average time
step is nearly constant for 12 through 16 processors without
using weights. However, the trace for the weighted trials on
the same model is still decreasing, which is evidence showing 
that weighting the PML cells provides for a better balanced 
problem distribution. 
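One simple way to picture this limit is to assume that the time per time step splits into a perfectly parallel part, which shrinks as $1/N$, and a fixed overhead per step for communication and serial work. This additive model, the function names, and the numbers below are assumptions for illustration only; they are not fitted to the measured data.

```python
def time_per_step(t_par, n, overhead):
    """Modeled time per time step: parallel part divided by N plus overhead."""
    return t_par / n + overhead


def modeled_speedup(t_par, n, overhead):
    """Speedup relative to one process under the same additive model."""
    return time_per_step(t_par, 1, overhead) / time_per_step(t_par, n, overhead)


def speedup_limit(t_par, overhead):
    """Value the modeled speedup approaches as N grows without bound."""
    return (t_par + overhead) / overhead


if __name__ == "__main__":
    t_par, overhead = 1.0, 0.05   # hypothetical values in seconds
    for n in (1, 2, 4, 8, 16, 64):
        print(n, round(modeled_speedup(t_par, n, overhead), 2))
    print("limit", round(speedup_limit(t_par, overhead), 2))
```

In this picture a load imbalance acts like additional overhead, so a better balanced distribution delays the point at which the curve flattens.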
Fig. 9. FDTD modeled and measured |S11| for the power bus with 16 decoupling capacitors.
Fig. 10. FDTD modeled and measured |S21| for the power bus with 16 decoupling capacitors.
Fig. 11. The average time step for the two sets of data from the high and low resolution models for the power bus with 16 decoupling capacitors.
Fig. 12. The speedup for the two sets of data from the two models for the power bus with 16 decoupling capacitors.
Fig. 13. The efficiency for the two sets of data from the two models for the power bus with 16 decoupling capacitors.
Fig. 14. An enlargement of the average time step for the two sets of data from the high and low resolution models for the power bus with 16 decoupling capacitors.

B. Multi-layer Power Distribution Network
The results for the multi-layer power bus are similar, though
not identical, to those of the power bus with 16 decoupling capacitors.
The timing results are somewhat model dependent; otherwise,
the results of the two models would be more nearly identical.
The results for the low and high resolution
models are found in Figs. 16 through 18. Fig. 16 shows the
average time steps for the four sets of data collected and their
ideals ($T_1/N$) for the multi-layer power bus. The differences
between the weighted and unweighted cases in the speedup and
efficiency curves, found in Fig. 17 and
Fig. 18, respectively, are as apparent for the multi-layer
power bus as for the power bus with 16 decoupling capacitors.
Fig. 18 shows the efficiency for the multi-layer power bus. Again with
this model, the weighted efficiencies are higher than those of
the unweighted cases. However, there is little difference between the
low and high resolution models. For a larger model, there
will be more cells to work on for each communication. It may
be that the high resolution model does not have a significantly
larger amount of calculations to make the differences apparent.
Again, it is seen in Fig. 18 that it is more profitable to use
the parallel algorithm with the weighted versions. The difference
between the cases is more dramatically illustrated by viewing
an enlargement of Fig. 16, found in Fig. 19, where only the
last 12 to 16 processes are shown.

Fig. 15. FDTD and FUI modeled IZ,, for the multi-layer power distribution network.
Fig. 16. The average time step for the two sets of data from the high and low resolution models for the multi-layer power distribution network.
Fig. 17. The speedup for the two sets of data from the two models for the multi-layer power distribution network.
Fig. 18. The efficiency for the two sets of data from the two models for the multi-layer power distribution network.

III. SUMMARY
Both examples demonstrated that although communications
are a defining part of MPI, they impede the efficiency of
the parallel algorithm. Load imbalances are the other major
impediment to the efficiency of the parallel algorithm. There
are two causes for these load imbalances. First, either the data
is not equally distributed among the processes, or the
data is distributed equally but part of the problem requires
more computation time. Second, the processors used have
different capabilities, such as running at different clock speeds
or having random access memory (RAM) of different speeds. The
load balancing in the parallel code for this paper attempted to
alleviate these imbalances by implementing a quasi-dynamic
load balancing approach that addresses, in part, both the first
and second issues.
Adding a weighting scheme for PML cells has its benefits.
The weighted versions of the program consistently outperformed
the unweighted versions with various numbers of processors.
To illustrate the significance of the difference in
execution times, assume that there are 16 processes
running and that, in the high density mesh, the average time steps
for the weighted and unweighted cases are approximately 0.5
and 1.4 seconds, respectively. Over 10,000 time steps, this amounts
to a difference of approximately 60 minutes more than what
is required for the weighted PML run. The power bus with 16
decoupling capacitors and the multi-layer power distribution
network are just small examples in comparison to other
models that might be considered. In addition to much larger
problems, models with vastly different dimensions or mixed
scale problems, for example, a thin power bus in which the
skin depth is considered, are well-suited to a parallel scheme. For both
cases, the high density model represents mostly simple dielectrics and
PEC sheets.
REFERENCES
[1] Pacheco, Peter S. Parallel Programming with MPI. San Francisco,
California: Morgan Kaufmann Publishers, Inc., 1997.
[2] Michael A. Cracraft. Parallelizing a Finite-Difference Time-Domain
Code Using the Message-Passing Interface. Master's thesis, University
of Missouri - Rolla, May 2002.
[3] Gropp, W., E. Lusk, and A. Skjellum. Using MPI. Cambridge,
Massachusetts: The MIT Press, 1999.
[4] Wilkinson, Barry and Michael Allen. Parallel Programming: Techniques
and Applications Using Networked Workstations and Parallel Computers.
Upper Saddle River, New Jersey: Prentice Hall, Inc., 1999.
[5] X. Ye, M. Y. Koledintseva, M. Li, and J. L. Drewniak. DC power-bus
design using FDTD modeling with dispersive media and surface
mount technology components. IEEE Transactions on Electromagnetic
Compatibility, vol. 43, no. 4, pp. 579-587, Nov. 2001.
[6] Taflove, Allen and Susan Hagness. Computational Electrodynamics:
The Finite-Difference Time-Domain Method, 2nd Edition. Boston,
Massachusetts: Artech House, 2000.
