Multithreaded self-scheduling: application of multithreading on loop scheduling for distributed shared memory multiprocessor by Yung, NHC et al.
Title Multithreaded self-scheduling: application of multithreading on loop scheduling for distributed shared memory multiprocessor
Author(s) Hung, KP; Yung, NHC; Cheung, YS
Citation IEEE International Conference on Algorithms and Architectures for Parallel Processing, 1995, v. 2, p. 680-689
Issued Date 1995
URL http://hdl.handle.net/10722/45975
Rights Creative Commons: Attribution 3.0 Hong Kong License
Multithreaded Self-Scheduling: 
Application of Multithreading on Loop Scheduling for
Distributed Shared Memory Multiprocessor
K P Hung, N H C Yung and Y S Cheung
Department of Electrical & Electronic Engineering
The University of Hong Kong
Haking Wong Building, Pokfulam Road, HONG KONG
Abstract 
A new loop scheduling scheme called multithreaded
self-scheduling (MSS) for distributed shared memory
multiprocessor is proposed. Based on the principles of
multithreading, MSS attempts to hide the remote memory
access latencies by switching between multiple contexts
of threads. Consequently, loops scheduled using MSS can
obtain better performance compared with the
single-thread approaches. In this paper, a series of
simulation results corresponding to various parameter
changes are presented, which provide a measure of the
effectiveness of MSS under different boundary conditions
and suggest ways for further improvement.
1. Introduction 
1.1 Distributed Shared Memory¹ [Li]
Shared memory multiprocessor offers a favourable 
programming paradigm of a global address space for 
parallel programs such that concurrent executing pro- 
gram components can communicate through shared 
variables. However, building an efficient network
connecting all the processing elements can be expensive.
Furthermore, such a system has poor scalability.
Figure 1. Communication between Processor Nodes in Distributed Memory Multiprocessor
On the other hand, a distributed memory multiprocessor
is more scalable, but programming it is clumsy and
difficult because communications between processor nodes
need to be explicitly coded using message passing
(Figure 1). Therefore, the concept of the distributed
shared memory (DSM) multiprocessor was developed to take
advantage of both systems' desirable characteristics
and eliminate some of their pitfalls.
The hardware architecture of a DSM multiprocessor
is the same as, or very similar to, that of a distributed
memory multiprocessor, with the addition of a software or
hardware layer (a DSM abstraction) that enables the
memory modules distributed over the processor nodes to
form a global address space (Figure 2). One of the
characteristics of a DSM multiprocessor is non-uniform
memory access (NUMA). When a memory access is
issued, it will be faster if the information requested is
located at the memory module of the local processor
node. However, it will take longer if the information
requested is located at the memory module of a remote
processor node, though the mechanisms for ensuring
data consistency over the global address space, and the
data transportation from remote memory module to local
memory module, are transparent to application
programs.
Figure 2. Distributed Memory Modules for Multiprocessor
¹ DSM is also known as shared virtual memory (SVM).
0-7803-2018-W5/$4.00 1995 IEEE
Although the abstraction of DSM can be implemented at
different levels inside the computer hardware or system
software, application software may not observe any
difference in terms of functionality (Figure 3). For
example, Kendall Square Research's KSR-1² uses a hardware
implementation of DSM called ALLCACHE³, and Open Software
Foundation's OSF/1-AD⁴ operating system supports an
in-kernel DSM server. Many other institutes have also
designed user-level DSM servers on top of common
distributed operating systems such as Mach.
Figure 3. Implementation of DSM in Different Levels
1.2 Latency Hiding Techniques 
² KSR-1 is a trademark of Kendall Square Research.
³ ALLCACHE is a patented invention of Kendall Square Research.
⁴ OSF/1-AD is a trademark of Open Software Foundation, Inc.
While DSM has the merits of the shared memory programming
paradigm and is more scalable, its memory latency problem
demands serious consideration. Unlike a parallel program
executing on a distributed memory multiprocessor, which
may be scheduled and partitioned to match the underlying
architecture so as to minimize the time consumed in remote
memory accesses, DSM assumes a shared memory such that
application programs running on top of it have no
knowledge of local and remote memory accesses.
Consequently, the number of remote memory accesses may
be more than expected. Several latency hiding techniques
mw], listed below, have been developed to cater
for this situation.
prefetching 
coherent caches 
relaxed memory consistency 
multiple contexts
Prefetching hides the latency of memory accesses
by issuing them in advance, expecting the data to be
available when the executing program needs it
[Sa]. Coherent caches reduce cache misses in
hardware, and hence lower the frequency of remote
memory accesses. Relaxed memory consistency models
hide the latency by pipelining and buffering memory
accesses. The multiple-contexts technique attempts to
hide latency by switching between the contexts of
different program execution components when a latency
(remote memory access or synchronization) is encountered.
1.3 Processes versus Threads 
To justify the technique of latency hiding using
multiple contexts, context switching time is a
determining factor. The context of a process can be divided
into two parts: system resources and execution states.
Typical system resources associated with a process are
addressable memory space, opened files, allocated
communication ports, access control information, etc.
They are often the larger part of a process context,
especially the memory address space, which contains a
large buffer called the translation lookaside buffer
(TLB) for address space mapping. On the other hand,
the context of the execution state consists mainly of the
processor registers, stack pointer, and program
counter. It is often a small part of a process context.
In a traditional operating system, a process (Figure
4a) supports only a single flow of execution (also
called a thread). Therefore, switching between different
flows of execution requires saving and restoring the
corresponding process contexts, which may take a very
long time because of the significant system resources
involved. In a modern operating system, a process
(Figure 4b) often supports multiple flows of execution
(multithread), and switching between different threads
may be faster as only the contexts of execution states
need to be saved or restored. For this reason, threads
are often called lightweight processes, and programs
that contain multiple threads are often described as
multithreaded programs.
Figure 4. Traditional Process and Multithreaded Process.
1.4 New Loop Scheduling Scheme for DSM 
In this paper, a new loop scheduling technique for
DSM multiprocessor is proposed. By multithreading
the chunks in the guided self-scheduling (GSS) scheme,
the remote memory access latencies that frequently
occur in DSM multiprocessor may be effectively
hidden by switching between multiple contexts of
threads. Therefore, this new scheme is named
multithreaded self-scheduling (MSS).
In order to compare and analyze the effectiveness
of MSS in latency hiding, a series of simulation
experiments was performed in comparison with
GSS. The simulation results suggest the boundary
conditions under which MSS obtains the best
performance, which may be useful as criteria for
improving both the working mechanism of threads
and the algorithmic approach of MSS.
1.5 Organization of the Paper 
This paper is organized into the following sections.
Section 2 revisits some well-known loop scheduling
schemes for shared memory multiprocessor. Then, the
multithreaded self-scheduling scheme is introduced in
Section 3 with an explanation of its working principles
and its suitability for DSM multiprocessor. In Section 4,
a simulation model is developed, and some simulation
cases are studied in Section 5 so as to compare the
characteristics of the multithreaded self-scheduling and
guided self-scheduling schemes under different simulation
conditions. Lastly, discussions on the simulation results
and their implications are presented in Section 6,
followed by a conclusion in Section 7.
2. Loop Scheduling Schemes
Parallel loops are recognized as a great source of
parallelism when parallelizing a program. Thus, a
number of loop scheduling schemes [Lj] have been
suggested. For a loop with no dependency between
iterations, every iteration of the loop may be executed in
parallel, and such a loop is sometimes called a doall loop.
Doall loops can be scheduled on a shared memory
multiprocessor statically (prescheduling) or dynamically
(self-scheduling). Prescheduling distributes loop
iterations evenly over the processor nodes. It incurs
no runtime scheduling overhead and gives good load
balancing if the execution time of each iteration is the
same. However, varying completion times for different
iterations may result in imbalanced loading, and
dynamic loop scheduling schemes were developed to solve
this problem by moving the scheduling decision from
compile or load time to run-time. The self-scheduling
schemes make each processor responsible for allocating
its own job. Scheduling one loop iteration at a time may
introduce a significant scheduling overhead, and the
chunk scheduling scheme was proposed to schedule
equal-size chunks of a number of iterations at a time
to reduce this overhead. To further improve the load
balancing performance of chunk scheduling, guided
self-scheduling (GSS) [Po] was developed. It is a
practical loop scheduling scheme which compromises
between load balancing and scheduling overhead. GSS
uses the strategy of decreasing the chunk size for
each successive chunk. The idea of GSS is to reduce
scheduling overhead by scheduling chunks of
coarser grain at the beginning, and to maintain a
good balance of load by scheduling chunks of finer
grain near the end. In the example illustrated in
Figure 5, a doall loop of 1000 iterations is
scheduled on a four-processor shared memory
multiprocessor by the GSS scheme. For each job requested by
a processor after completing a previous job, a chunk
with iteration-size of c is scheduled, where c
is calculated by the following equation.
$$c = \lceil R / p \rceil$$
where R is the number of iterations remaining
and p is the number of processors.
Figure 5. Example of GSS with Loop Count 1000 and 4 Processors (1000-iteration doall loop)
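The chunk-size rule above can be traced with a few lines of Python. This is a minimal sketch for illustration only: the function name gss_chunks is hypothetical, and the sketch assumes that chunks are handed out one after another with R updated after each request, ignoring which processor asks.

```python
from typing import List


def gss_chunks(total_iterations: int, processors: int) -> List[int]:
    """Successive chunk sizes under guided self-scheduling, c = ceil(R / p)."""
    remaining = total_iterations
    chunks = []
    while remaining > 0:
        # Integer form of ceil(remaining / processors).
        chunk = (remaining + processors - 1) // processors
        chunks.append(chunk)
        remaining -= chunk
    return chunks


chunks = gss_chunks(1000, 4)
print(chunks[:5])   # [250, 188, 141, 106, 79]
print(sum(chunks))  # 1000
```

The early chunks are coarse (250 iterations) and the final chunks shrink to a single iteration, which is the compromise between scheduling overhead and load balance described above.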
There are also some other scheduling schemes
derived from GSS which further improve the
load balancing (Factoring) [Hu] or the scheduling
overhead (Trapezoid self-scheduling) pz]. One
common feature of all these dynamic loop
scheduling schemes is that they schedule many loop
iterations in a chunk at a time. This common feature
not only reduces the scheduling overhead but also
allows the multiple-contexts (by multithreading) latency
hiding technique to be applied.
3. Multithreaded Self-Scheduling 
Although the above self-scheduling schemes are 
well known as appropriate methods to schedule loops 
on shared memory multiprocessor, using the same 
technique on DSM multiprocessor may result in sub- 
stantial performance degradation due to the large 
number of remote memory accesses. Consequently, the
multithreaded self-scheduling (MSS) scheme is proposed
in this paper to address this issue.
In executing doall loops with a large number of
iterations on a multiprocessor system, the loops are often
divided into chunks. Each chunk contains a number of
iterations and executes on an allocated processor node
as a process. For example, using the GSS scheme in Figure
6, a doall loop with a loop count of 1000 may be divided
into a number of chunks. These chunks can then be
scheduled on processor nodes as smaller processes
(sub-tasks).
[Figure 6]
A sub-task can further be divided into smaller 
processes such that each iteration is itself a process. 
Furthermore, these one-iteration processes may share 
the same system resources in execution. Hence, it is 
appropriate to define them as threads and to encapsu- 
late them into a process sharing the same system re- 
sources instead. This configuration of multithreaded 
sub-task supports multiple contexts of threads with ef- 
ficient thread management operations. It is also the 
basic chunk defined in the scheme of MSS. 
When a sub-task is executing on a processor node,
situations may arise in which latency is introduced. Two
kinds of latency are common in DSM multiprocessor,
namely remote memory access latency and
synchronization latency. Executing an instruction may
involve several operands, and these operands may be
located at different processor nodes. For example, as
described in Figure 7, the operands Ma and Mb are
located at processor nodes A and B respectively. If a
processor node, say A, is not the node where the
instruction is executing, the operand Ma needs to be
requested from the remote processor node A. Thus, a
remote memory access latency would be expected.
Moreover, the two operands, Ma and Mb, are likely to
become available at different times t1 and t2 respectively.
For this reason, the execution may have to wait until
both of them are available. Thus, a synchronization
latency (t2 - t1) would be expected.
Figure 7. A Typical Multithreaded Sub-task
Remote memory access latency may be considered
substantial. Synchronization latency, however, is
difficult to forecast: it may be small or large, and varies
according to program behaviour and execution
environment. In MSS, only the remote memory
access latency is intended to be hidden.
The working principle of MSS (Figure 8) is based
on the cooperative work of the DSM server and the
sub-task's thread scheduler. When a memory access is issued,
the DSM server determines whether the information
requested is available in the local memory module. If it
is in the local memory module (a hit), a local memory
access is performed and the currently executing thread
continues. However, if it is not in the local memory
module (a miss), the DSM server resolves the memory
access by requesting it from a remote processor node.
There are various methods to perform this resolution
in different DSM schemes pi]. In MSS, the DSM server
needs to notify the thread scheduler on a miss,
so that the scheduler can block the currently
executing thread and allocate the
processor to another runnable thread. When the remote
access is completed and the information has been
transferred from the remote memory module to the local
memory module, the thread scheduler is notified again
so that the previously blocked thread can be
changed to a runnable thread for reallocation. If the
time cost of managing the threads is small
and there is a sufficient number of threads for
switching, the remote memory access latency may be
effectively hidden.
Figure 8. Block Diagram of Multithreaded Self-Scheduling
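The cooperation between DSM server and thread scheduler described above can be sketched as a small single-processor simulation. This is not the authors' simulator: the function run_chunk, its parameters, and the model itself are assumptions made for illustration. In particular, each thread is modelled as a fixed number of memory accesses, each preceded by the same amount of computation, and one context switch is charged per thread dispatch.

```python
import heapq
import random
from collections import deque


def run_chunk(n_threads, accesses, work, hit_ratio, local_lat, remote_lat,
              switch_cost, multithreaded, seed=0):
    """Time one processor needs to finish a chunk (hypothetical model)."""
    rng = random.Random(seed)
    # Pre-draw hit/miss outcomes so both modes replay the same access pattern.
    miss = [[rng.random() >= hit_ratio for _ in range(accesses)]
            for _ in range(n_threads)]

    if not multithreaded:
        # Single-thread chunk (GSS-like): every latency stalls the processor.
        return sum(work + (remote_lat if m else local_lat)
                   for thread in miss for m in thread)

    now = 0
    done = [0] * n_threads            # accesses issued so far, per thread
    runnable = deque(range(n_threads))
    blocked = []                      # min-heap of (data_ready_time, thread_id)
    while runnable or blocked:
        if not runnable:
            # Runnable threads exhausted: the processor idles until data arrives.
            now = max(now, blocked[0][0])
        # Completed remote accesses turn blocked threads into runnable ones.
        while blocked and blocked[0][0] <= now:
            runnable.append(heapq.heappop(blocked)[1])
        tid = runnable.popleft()
        now += switch_cost            # cost of dispatching the next thread
        while done[tid] < accesses:
            m = miss[tid][done[tid]]
            done[tid] += 1
            now += work
            if m:
                # Miss: block this thread while the remote access proceeds.
                heapq.heappush(blocked, (now + remote_lat, tid))
                break
            now += local_lat          # hit: the thread simply continues
    return now


params = dict(n_threads=8, accesses=20, work=10, hit_ratio=0.9,
              local_lat=10, remote_lat=1000)
for cs in (0, 200, 400):
    print(cs, run_chunk(switch_cost=cs, multithreaded=True, **params),
          run_chunk(switch_cost=cs, multithreaded=False, **params))
```

Varying switch_cost against remote_lat in this sketch shows the trade-off qualitatively: a cheap switch lets remote latencies overlap with useful work, while an expensive one can cost more than it hides.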
Although only GSS is multithreaded for our 
simulation study, MSS is a general technique which 
may be applied to most self-scheduling schemes with 
reasonable chunk size. Chunk self-scheduling, factoring,
and trapezoid self-scheduling can all be
multithreaded. Furthermore, prescheduling can also
be multithreaded as long as the chunk size is large 
enough. 
4. Simulation Model
In order to compare the performance of the mul- 
tithreaded self-scheduling scheme against the tradi- 
tional self-scheduling scheme, a simulation model was 
built and tested based on both MSS and GSS. The
corresponding chunks scheduled in MSS are the same
size as those of GSS. The difference is only in the
behaviour of the chunks. In MSS, each chunk is a
multithreaded process, while it is a single-thread process
in GSS, as depicted in Figure 9.
The scheduler in Figure 9a refers to the thread
scheduler of MSS. It is responsible for managing the
execution of multiple threads. A thread can be in
different states (Figure 10), and any change of its state
is performed through the thread scheduler.
(b) Doall Loop Executed by Guided Self-Scheduling
Figure 9. Different Behaviour of Chunks in MSS and GSS.
For the sake of simplicity in the simulation, the
overheads of creating and destroying threads are
ignored, on the assumption that the number of context
switching operations between threads is sufficiently
large compared with the number of thread creation and
destruction operations; hence the total time of context
switching operations dominates the part of a chunk's
overall execution time contributed by thread management.
This is often the case because thread creation and
destruction are mostly the allocation and deallocation of
a small amount of storage for the execution states
(processor registers, stack pointer, program counter).
These overheads are generally small, though they depend
on the specific system and thread implementation method.
In addition, the number of context switches is
substantial in MSS, as remote memory accesses are common
in DSM multiprocessor and each remote memory access
triggers at least one context switch (one for
blocking this thread, and maybe another one for
dispatching this thread later when the remote memory
data is available). Furthermore, synchronizations
between threads are also ignored, since there
is no dependency between threads in doall loops.
Figure 10. State Diagram of a Thread in MSS.
Intuitively, the trade-off between MSS and GSS is
between the context switching time and the remote memory
access latency. Therefore, several simulation cases were
investigated with varying simulation
parameters in order to study this intuition in detail.
5. Simulation Cases 
The doall loop used in the simulation is shown in 
figure 11. It contains no interdependency between dif- 
ferent iterations within the loop, and each iteration is 
characterized by the portion of ExeOnly (execution 
time units without memory access) and the portion of 
ExeMem (execution time units with memory access). 
DOALL I = 1, N
    ith iteration (ExeOnly, ExeMem)
ENDDO
Figure 11. A Doall loop for the simulation experiments.
Throughout the simulation, the distribution of the two
kinds of execution time units (ExeOnly and ExeMem)
in an iteration is assumed to be random, because the
distribution of memory accesses in a set of instructions
is determined by the specific application, the coding
method of the programmer, and the code generation
method of the compiler. With such complex
factors affecting the memory access pattern, it is
almost impossible to forecast when a memory access
will be issued. Therefore, the use of a random memory
access pattern seems appropriate in this simulation
study. Moreover, the occurrence of local memory
access is also assumed to be random and fixed by a hit
ratio. The reason for assuming random occurrence of
local memory access is very similar to that for the memory
access pattern, because it is also affected by the
programming method, the code generation method, the
specific application, and the DSM scheme. The latencies of
the different kinds of memory access are fixed as constants,
so that the remote memory access latency is the same fixed
time period every time, and so is the local memory access
latency. This is the simplest model of a NUMA
multiprocessor, without considering the variation of the
memory access latency. In real situations, it is often
acceptable because the time difference between local
memory access and remote memory access is large, so that
the latency variations of these two kinds of memory
access are very localized. Therefore, using constant
values as in this simulation is an acceptable
approximation.
Context Switching Time:                      200 or 400 units
Local Memory Accesses Hit Ratio:             97%
Local Memory Accesses Latency:               10 units
Remote Memory Accesses Latency:              1000 units
ExeOnly (Per Iteration Without Memory
Access Execution Time Units):                1000
ExeMem (Per Iteration With Memory
Access Execution Time Units):                500
Number of Processors:                        20
Number of Loop Iterations:                   1000
Figure 12. Default Simulation Parameters.
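For orientation, the default parameters and the constant-latency model can be written down directly. The sketch below is illustrative only: the class SimParams and both helper functions are hypothetical names, and the second helper additionally assumes that every ExeMem time unit issues exactly one memory access, which the paper does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    """Default simulation parameters (time units unless noted)."""
    context_switch: int = 200     # 200 or 400 in the two experiment sets
    hit_ratio: float = 0.97       # fraction of accesses served locally
    local_latency: int = 10
    remote_latency: int = 1000
    exe_only: int = 1000          # per-iteration units without memory access
    exe_mem: int = 500            # per-iteration units with memory access
    processors: int = 20
    iterations: int = 1000


def expected_access_latency(p: SimParams) -> float:
    """Mean latency of one memory access under the fixed hit ratio."""
    return p.hit_ratio * p.local_latency + (1 - p.hit_ratio) * p.remote_latency


def expected_remote_accesses_per_iteration(p: SimParams) -> float:
    # ASSUMPTION: one memory access per ExeMem time unit.
    return p.exe_mem * (1 - p.hit_ratio)


defaults = SimParams()
print(round(expected_access_latency(defaults), 1))                 # 39.7
print(round(expected_remote_accesses_per_iteration(defaults), 1))  # 15.0
```

Even at a 97% hit ratio, roughly three quarters of the mean access latency (30 of 39.7 units) comes from remote accesses, which is why hiding the remote latency is the focus.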
Two sets of simulation experiments are performed,
based on different context switching
times, namely 200 and 400 time units. Unless a
specific simulation parameter is being varied, the default
values of the parameters are as given in Figure 12.
These parameters are carefully chosen with the
objective of reflecting different realistic situations.
The default context switching time is 200 or 400
units, which may reasonably reflect the time for
saving and restoring processor
contexts. We do not assume a fast context
switching time because most current thread
implementations are performed at the software
level. User-space threads have a faster context
switching time, while kernel-level threads have a
slower context switching time.
The default hit ratio is 97% because the effect of
locality is assumed, though it may vary with swap
page size, specific application, etc.
The default local memory access latency is 10 units
and that of remote access is 1000 units. The
local memory access time is very fast by its nature
and is likely to be much faster than the context
switching time; therefore 10 units are assumed. On
the other hand, remote memory access is
generally considered a slow process, affected by
the speed of the interconnection network and its
contention, the routing algorithm, and the DSM server
overhead. Consequently, it is assumed to be several
times slower than the default context switching
time.
ExeOnly and ExeMem are application specific, and a
reasonably long iteration is assumed.
The default number of loop iterations is 1000 while
the default number of processors is 20, so that a
reasonably large chunk size for self-scheduling
schemes results. The numbers are not chosen to
be excessively large, so that the simulation cases can
be completed within a reasonable time period.
One of the aims of applying latency hiding
techniques in DSM multiprocessor is to improve the
efficiency of the processors by minimizing their idling
time and keeping them busy as often as possible.
Therefore, we also present the simulation results in
terms of processor efficiency, defined by the following
formula.
$$\text{Processor Efficiency} = \frac{\text{Average Processor Busy Time}}{\text{Overall Execution Time}}$$
where a processor being busy means that the processor is producing output.
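As a small worked instance of this definition, the sketch below computes the efficiency from per-processor busy times. The function name and the sample figures are hypothetical and are not taken from the simulation results.

```python
from typing import Sequence


def processor_efficiency(busy_times: Sequence[float], overall_time: float) -> float:
    """Average processor busy time divided by overall execution time."""
    if not busy_times or overall_time <= 0:
        raise ValueError("need at least one processor and a positive overall time")
    average_busy = sum(busy_times) / len(busy_times)
    return average_busy / overall_time


# Hypothetical example: four processors, loop finishes after 400 time units.
print(processor_efficiency([80, 90, 100, 70], 400))  # 0.2125
```

Because the denominator is the overall execution time of the loop, any time a processor spends stalled on a remote access or waiting for the slowest processor lowers the ratio.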
5.1 Effect of Varying Thread Context Switching Time
This simulation experiment is performed by
varying the context switching time from 10 to 1500
time units. The overall execution time of MSS decreases
approximately linearly as the context switching time
decreases, whereas GSS is insensitive to changes in
context switching time. The two graphs intersect at a
context switching time of 970 time units and an
execution time of 910,000 time units. For context
switching times smaller than 970 time units, MSS
performs better than GSS and reaches
35,000 execution time units as the context switching time
approaches zero. Beyond 970 time units of context
switching, MSS performs worse than GSS.
Figure 13. Execution Time vs Context Switching Time (x-axis: Context Switching Time Units)
Similar to the case of execution time versus
context switching time, the two graphs of processor
efficiency versus context switching time intersect at a
context switching time of 970 time units. However,
MSS shows an exponential increase of processor
efficiency as the context switching time decreases. When
the context switching time approaches zero, the
processor efficiency approaches 21.5%, which is more
than twice that of GSS (8.5%).
Figure 14. Processor Efficiency vs Context Switching Time (x-axis: Context Switching Time Units)
5.2 Effect of Varying Local Memory Accesses Hit Ratio
This simulation experiment is performed by
varying the local memory access hit ratio from 70%
to 99%. In terms of overall execution time
(Figure 15), MSS performs better than GSS over this
range of hit ratios. Furthermore, by projection, a
higher improvement may be expected if the hit
ratio goes below 70%. MSS has a relatively small
variation in overall execution time in this range (e.g.
from 360,000 to 1,894,000 in MSS-200) compared
with that of GSS (from 360,000 to 7,680,000 in
GSS-200). However, all the graphs converge as the hit
ratio approaches 100%. In the extreme, a hit ratio of
100% is actually the special case of a shared memory
multiprocessor.
Figure 15. Execution Time versus Local Memory Access Hit Ratio (x-axis: Local Memory Access Hit Ratio (%))
While execution time shows a linear relationship
with the local memory access hit ratio, the processor
efficiency improves exponentially as the hit ratio
increases. With a very high hit ratio, say 99%, both MSS
and GSS have a processor efficiency of 21%. However,
GSS shows a substantial decrement (8% at 97% hit
ratio, 3% at 90% hit ratio for GSS-200) as the hit ratio
decreases. Relatively speaking, MSS suffers a smaller
processor efficiency degradation as the hit ratio
decreases (17% at 97% hit ratio, 9% at 90% hit ratio for
MSS-200).
Figure 16. Processor Efficiency versus Local Memory Access Hit Ratio (x-axis: Local Memory Access Hit Ratio (%))
5.3 Effect of Varying Local Memory Access Latency
This simulation experiment is performed by
varying the local memory access latency from 10 to
250 time units. All the graphs in the plot of overall
execution time versus local memory access latency are
linear and very close to each other. However, MSS
shows a constant improvement in overall execution
time compared with GSS over this range of local
memory access latency.
Figure 17. Execution Time versus Local Memory Access Latency (x-axis: Local Memory Access Latency)
In processor efficiency, the graphs decrease
exponentially as the local memory access latency increases.
For latencies larger than 100 time units, all 4 graphs
are sufficiently close to each other that no
significant difference in processor efficiency can be
found. However, for latencies smaller than 100 time
units, MSS (16% at 10 time units latency) shows a
two-fold improvement compared with GSS (8% at 10
time units latency).
Figure 18. Processor Efficiency versus Local Memory Access Latency (x-axis: Local Memory Access Latency)
5.4 Effect of Varying ExeMem (Per Iteration With Memory Access Execution Time Units)
This simulation experiment is performed by
varying ExeMem from 100 time units to 800 time
units. Figure 19 shows that the execution time
performance of MSS is always better than GSS in this
range of ExeMem. As ExeMem increases, the
improvement in overall execution time of MSS over
GSS increases. All the graphs show a linear
relationship between overall execution time and ExeMem,
and are expected (by projection) to converge at a point
near zero ExeMem.
Figure 19. Execution Time Units versus ExeMem (x-axis: ExeMem, Per Iteration With Memory Access Execution Time Units)
Processor efficiency increases exponentially as
ExeMem decreases. In this range of ExeMem, MSS
always performs better than GSS (38% at 100
ExeMem for MSS-200 versus 24% at 100 ExeMem for
GSS-200, and 13% at 800 ExeMem for MSS-200
versus 7% at 800 ExeMem for GSS-200).
Figure 20. Processor Efficiency versus ExeMem (x-axis: ExeMem, Per Iteration With Memory Access Execution Time Units)
5.5 Effect of Varying Number of Processors in the Multiprocessor
Figure 21. Execution Time Units versus Number of Processors (x-axis: Number of Processors)
This simulation experiment is performed by
varying the number of processors from 16 to 256. The
overall execution time increases exponentially as the
number of processors decreases. The graphs almost
converge at the point with 256 processors, while MSS
shows a substantial improvement over GSS with a smaller
number of processors (e.g. 1,100,000 time units at 16
processors for GSS-200, and 570,000 time units at 16
processors for MSS-200).
Processor efficiency decreases linearly as the
number of processors increases. However, the graphs
show only a small variation in processor efficiency
with the number of processors
(16.2% at 16 processors and 14.2% at 256 processors
for MSS-200, while 8.2% at 16 processors and 7.2% at
256 processors for GSS-200).
Figure 22. Processor Efficiency versus Number of Processors (x-axis: Number of Processors)
[Table header fragments: Ratio of Context Switching Time to Remote Memory ...; Ratio of MSS to GSS Execution ...]
[Table fragment. Column headers: from GSS to MSS (CS Time = 200); from GSS to MSS (CS Time = 400). Legible cell values: 70; 70.8, 50.0; 80.0; 103.8.]
[...] runnable thread may effectively hide the latency.
However, when runnable threads are exhausted,
blocking a thread by switching may be redundant
because there is no other runnable thread that can use
the processor. It may even prolong the latency, because
of the switch back when the thread becomes runnable
again upon the availability of the remote memory data.
Beyond the intersection point, the overhead of
context switching becomes dominant, so that MSS
performs worse than GSS.
This simulation result suggests that MSS may be an
effective way of hiding latency only if the context
switching time is reasonably smaller than the remote
memory access latency. Furthermore, the per-processor
efficiency is improved substantially when the context
switching time is smaller than 30% of the remote
memory access latency, as indicated in Figure 14. The
processor efficiency improves by approximately 2%
for each decrease of 100 time units in context switching
time within the range of 0 to 300 time units, while
the improvement is about 1.14% in the
range of 300 to 1000 time units and even smaller
beyond 1000 time units. However,
shortening the context switching time in the former range
may require substantially more effort than in
the latter range.
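The two quoted slopes can be combined into a rough piecewise-linear reading of the MSS efficiency curve. This is only an illustrative approximation: the function name is hypothetical, and anchoring the curve at the 21.5% efficiency reported in Section 5.1 for a context switching time approaching zero is an assumption of this sketch rather than something stated in the discussion.

```python
def mss_efficiency_estimate(cs_time: float) -> float:
    """Rough MSS processor efficiency (%) for 0 <= cs_time <= 1000 time units."""
    if not 0 <= cs_time <= 1000:
        raise ValueError("the quoted slopes only cover 0 to 1000 time units")
    peak = 21.5                       # % as the switching time approaches zero
    if cs_time <= 300:
        # About 2% lost per 100 time units in the 0-300 range.
        return peak - 2.0 * cs_time / 100
    at_300 = peak - 2.0 * 300 / 100
    # About 1.14% lost per 100 time units in the 300-1000 range.
    return at_300 - 1.14 * (cs_time - 300) / 100


for cs in (0, 200, 400, 970):
    print(cs, round(mss_efficiency_estimate(cs), 2))
```

Under this reading the estimate near 970 time units falls to roughly 8%, close to the GSS efficiency at the reported intersection point, which suggests the two quoted slopes are mutually consistent with the curves of Figure 14.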
The second simulation experiment investigates
the effect of varying the local memory access hit ratio
on the loop scheduling methods. From Figures 15 and
16, we can observe that MSS and GSS converge
to one point as the hit ratio approaches 100%, which
is also the special case of a shared memory
multiprocessor. This is because the number of remote
memory accesses approaches zero and hence the benefits
of thread switching no longer exist. As the hit ratio
decreases, the performance improvement of MSS becomes
more significant. This result suggests that GSS
and MSS have very similar performance on a shared
memory multiprocessor, while MSS may be more
suitable for a DSM multiprocessor if the simulation
parameters hold.
Table 2. Percentage Improvement on Processor Efficiency with Different Local Memory Access Hit Ratio.
[Columns: Hit Ratio (%); Percentage Improvement in Processor Efficiency (two columns). Legible cell values: 100.0; 7.93, 4.21; 7.67, 4.61; 99.]
Another interesting observation is that the percentage
improvement of processor efficiency is not monotonic, as
depicted in figure 16. Some quantitative comparisons can be
found in table 2. The percentage improvement reaches its
highest point at a hit ratio of around 96%. This may be
explained as follows. With a very high hit ratio, the number
of remote memory accesses is so small that the merit of MSS in
latency hiding cannot be effectively shown, hence a relatively
small processor efficiency improvement. On the other hand, a
very low hit ratio results in a large number of remote memory
accesses and relatively insufficient available threads for
effective multiple-context latency hiding. Therefore, the
processor efficiency improvement is relatively small too. At
the optimal point, the hit ratio is neither too high nor too
low, so that the number of threads matches the number of
remote memory accesses for the best latency hiding effect. In
short, the number of iterations (threads) in a chunk and the
number of remote memory accesses may need to match in order to
obtain the best processor efficiency improvement from MSS.
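The matching argument can be made concrete with a simple first-order model of a multithreaded processor. This is a sketch under stated assumptions, not the model behind figures 15 and 16: every thread runs for a fixed run_length between remote accesses, the remote latency and the context switching time are constant, and a lower hit ratio is represented only as a shorter run_length. All names are hypothetical.

```python
def saturation_threads(run_length, remote_latency, switch_time):
    """Number of threads needed to cover one remote access completely."""
    return 1 + remote_latency / (run_length + switch_time)


def processor_efficiency(n_threads, run_length, remote_latency, switch_time):
    """Fraction of processor time spent on useful work."""
    if n_threads >= saturation_threads(run_length, remote_latency, switch_time):
        # Saturated: latency is fully hidden, only switching is lost.
        return run_length / (run_length + switch_time)
    # Linear region: too few threads, part of the latency stays exposed.
    return n_threads * run_length / (run_length + switch_time + remote_latency)
```

For run_length = 100, remote_latency = 1000 and switch_time = 200, about 4.3 threads are needed to saturate the processor. With fewer threads the efficiency grows linearly with the thread count, and with more it stays at 100/300. A shorter run_length (lower hit ratio) raises the number of threads needed, which is one way to read the mismatch described above.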
Referring to figure 17, we can observe that varying the local
memory access latency has no significant effect on the overall
execution time difference between MSS and GSS. This can be
interpreted as the improvement from latency hiding being
unaffected by this parameter. This is logical, as the
switching of threads is determined solely by the signal from
the DSM server to the thread scheduler, which in turn is
decided by the nature of the memory access (local or remote).
If the local memory access latency is large compared with the
context switching time, the performance of MSS could be
improved further by switching threads on any kind of memory
access.
Since the number of local memory accesses is large for the
default hit ratio (97%), the improvement on processor
efficiency becomes insignificant when the local memory access
latency increases, as shown in figure 18. There is no reason
to spinning (spinning [...]
Table 3.
ExeMem   Percentage Improvement in Processor Efficiency from GSS to MSS
         (CS Time = 200)    (CS Time = 400)
100      13.8               8.4
200      11.8               1.3
400      8.9                5.1
600      7.0                4.3
800      6.2                3.6
[...] efficiency is slightly reduced (figure 20 and table 3)
as ExeMem increases. It can be explained that the fewer the
memory accesses (and hence the fewer remote memory accesses),
the better MSS can hide the remote memory access latency. It
is expected that the improvement in processor efficiency may
drop again when the number of memory accesses is very small,
which is similar to the case with a high local memory access
hit ratio.
The last simulation experiment is performed by varying the
number of processor nodes in the multiprocessor. This in turn
varies the average chunk size on scheduling. From figure 21,
it is observed that the performance improvement of MSS is more
significant when the number of processors is small (execution
time improved from 1,100,000 to 579,000 time units at 16
processors). In addition, figure 22 and table 4 show that the
processor efficiencies of MSS and GSS are converging slowly.
Table 4. Percentage Improvement on Processor Efficiency with
different Number of Processors.
Number of Processors | (CS Time = 200) | (CS Time = 400)
16 | [illegible] | [illegible]
32 | 7.6 | 4.5
64 | [illegible] | [illegible]
For the case with fewer processors, the average chunk size is
relatively large, and multithreading on a larger chunk may
result in a better latency hiding effect. However, as the
number of processor nodes increases, the performance of GSS
and MSS converges. In one extreme, the chunk size reduces to
one when the number of processors increases to 1000. In this
situation, MSS and GSS do not differ, as processor nodes in
both schemes have only one thread to execute. Of course, the
task completes faster with an increasing number of processors,
but the processor efficiency is poorer. In fact, scheduling
one iteration at a time is not a good idea, as the scheduling
overhead is large. A slight modification of GSS that defines a
minimum chunk size has been suggested in [Po], which may also
be beneficial to MSS.
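The dependence of chunk size on the number of processors can be sketched from the guided self-scheduling rule of [Po], in which each idle processor takes ceil(R/P) of the R remaining iterations. The function below is a minimal illustration, not the code used in the simulation. The name gss_chunks, the min_chunk parameter standing in for the minimum chunk size variant, and the loop sizes in the usage note are assumptions.

```python
def gss_chunks(n_iterations, n_processors, min_chunk=1):
    """Return the sequence of chunk sizes handed out by guided self-scheduling."""
    chunks = []
    remaining = n_iterations
    while remaining > 0:
        # ceil(remaining / n_processors) using integer arithmetic
        size = (remaining + n_processors - 1) // n_processors
        size = max(size, min_chunk)   # optional lower bound on chunk size
        size = min(size, remaining)   # never hand out more than is left
        chunks.append(size)
        remaining -= size
    return chunks
```

For a hypothetical loop of 100 iterations on 4 processors this yields chunks of 25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1. With 1000 iterations on 1000 processors every chunk has size one, so each node holds a single thread and multithreading has nothing to switch to. Setting min_chunk=4 in the first example replaces the tail of small chunks with three chunks of 4.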
6.2 Different Level of Threads 
In the above simulation study, we have not assumed any
specific implementation of threads or of the DSM. Since these
implementation decisions may be closely related, threads (like
DSM) can also be implemented at different levels. Having
discussed the implementation choices of DSM earlier, we now
look at the issues concerning threads. In practice, threads
can be classified as user-level threads
and kernel-level threads. A user-level thread implementation
can provide fast thread management operations because thread
scheduling takes place in user space. However, the scheduler
has no access to information in the kernel, and hence only
non-blocking kernel system calls may be used, which are
usually slower. In contrast, a kernel-level thread
implementation does not have this problem, as the scheduler
can obtain kernel information for thread management, but it
suffers a serious performance drawback because thread
management has to be performed through system calls.
Furthermore, the crash of a kernel thread may corrupt the
kernel, so protection checking has to be performed on kernel
thread operations, which is time consuming.
A hybrid kernel/user-level thread management system based on
scheduler activations has been suggested, which retains most
of the benefits of both user-level and kernel-level threads
[An] [Da]. The idea of this management system is that
user-level threads are built on top of a kernel entity called
a scheduler activation. Scheduler activations support
communication between user-level threads and the kernel by
notifying the user-level threads of kernel events and vice
versa. Therefore, such threads perform well because they
execute at user level, while retaining the functionality of
kernel-level threads.
7. Conclusion 
From the above performance analysis of MSS and 
GSS, we conclude that MSS is an efficient loop 
scheduling scheme for DSM multiprocessor. How- 
ever, several considerations may be desirable in order 
to obtain the best performance from MSS. 
The context switching time between threads needs to be small
compared with the remote memory access latency. For our
default simulation parameters, a context switching time of 30%
or less of the remote memory access latency (i.e. below 300
time units) can result in a 1.8- to 2.6-fold improvement in
processor efficiency, or a 1.7- to 2.4-fold improvement in
execution time, by using MSS instead of GSS. When the context
switching time is comparable to or larger than the remote
access latency, MSS performs poorly.
The number of remote memory accesses and the 
number of threads on a multithreaded sub-task need 
to be matched for the best processor efficiency. 
Therefore, calculation of chunk size (number of 
threads) can be modified from the method of GSS 
with consideration given to the number of remote 
memory accesses. 
Local memory accesses may also be hidden if the context
switching time is sufficiently small compared with the local
memory access latency.
MSS is more efficient when the number of processors in the
multiprocessor is small and the loop (or total number of
threads) is relatively large.
Several issues related to this simulation can be investigated
further. A more sophisticated simulation model which considers
both the memory locality effect of DSM and some real problem
loops can be studied. A machine realization of MSS is now
under investigation, which would reflect the real execution
environment more accurately. Further investigation on applying
the concept of MSS to other scheduling schemes is in progress,
which may eventually cover doacross loops and loops with other
dependencies.
8. References 
[An] T. E. Anderson, B. N. Bershad, E. D. Lazowska, H. M. Levy,
"Scheduler Activations: Effective Kernel Support for the
User-Level Management of Parallelism," Proc. of the 13th ACM
Symposium on Operating System Principles, in Operating System
Review, Vol. 25, No. 5, pp. 95-109, Oct. 1991.
[Da] P. B. Davis, D. McNamee, R. Vaswani, E. D. Lazowska,
"Adding Scheduler Activations to Mach 3.0," Technical Report
92-08-03, University of Washington, Aug. 1993.
[Hu] S. F. Hummel, E. Schonberg, and L. E. Flynn, "Factoring:
A Practical and Robust Method for Scheduling Parallel Loops,"
Proc. Supercomputing 91, IEEE CS Press, pp. 610-619, 1991.
[Hw] K. Hwang, "Advanced Computer Architecture: Parallelism,
Scalability, Programmability," Computer Science Series,
McGraw Hill, Inc., pp. 475-504, 1993.
[Li] K. Li, "Shared Virtual Memory System on Loosely Coupled
Multiprocessor," Technical Report, Yale University, 1986.
[Lj] D. J. Lilja, "Exploiting the Parallelism Available in
Loops," IEEE Computer, Feb. 1994.
[Ni] B. Nitzberg, V. Lo, "Distributed Shared Memory: A Survey
of Issues and Algorithms," IEEE Computer, Vol. 24, pp. 52-60,
Aug. 1991.
[Po] C. D. Polychronopoulos, D. J. Kuck, "Guided
Self-Scheduling: A Practical Scheduling Scheme for Parallel
Supercomputers," IEEE Trans. Computers, Vol. 36, No. 12,
pp. 1425-1439, Dec. 1987.
[Sa] R. H. Saavedra, W. Mao, K. Hwang, "Performance and
Optimization of Data Prefetching Strategies in Scalable
Multiprocessors," Journal of Parallel and Distributed
Computing, Vol. 22, 1994.
[Tz] T. H. Tzen, L. M. Ni, "Trapezoid Self Scheduling: A
Practical Scheduling Scheme for Parallel Computers," IEEE
Trans. Parallel and Distributed Systems, Vol. 4, No. 1,
pp. 81-98, Jan. 1993.
