A measurement-based study of concurrency in a multiprocessor by McGuire, Patrick John
A MEASUREMENT- 
BASED STUDY OF 
CONCURRENCY IN 
A MULTIPROCESSOR 
Patrick John McGuire 
(NASA-CR-180314) A MEASUREMENT-BASED STUDY OF CONCURRENCY IN A MULTIPROCESSOR
(Illinois Univ.) 58 p Avail: NTIS HC A04/MF A01 CSCL 09B N87-26510 Unclas G3/60 0063752
https://ntrs.nasa.gov/search.jsp?R=19870017077 2020-03-20T09:49:17+00:00Z
Approved for public release; 
distribution unlimited 
NASA NSF DOE UILU-ENG-87-2210 (CSG-61) 
Performing organization: University of Illinois, 1101 W. Springfield Avenue, Urbana, IL 61801
Sponsoring organizations: NASA, National Science Foundation, and DOE
Sponsor addresses: NASA: NASA-Langley Research Center, Hampton, VA; NSF: 1800 G St. N.W., Washington, DC 20550; DOE: Washington, DC
Procurement instrument identification numbers: NASA: NAG-1-613; NSF: DCR-84-10110; DOE: DE FG02-85ER25001
Title: A Measurement-Based Study of Concurrency in a Multiprocessor
Personal author: McGuire, Patrick John
Type of report: Technical
Date of report: January 1987
Subject terms: Concurrency, multiprocessors, Alliant FX/8, cache miss rate, bus activity
ABSTRACT 
This thesis develops a systematic measurement-based methodology for characterizing the 
amount of concurrency present in a workload, and the effect of concurrency on system perfor- 
mance indices such as cache miss rate and bus activity. Hardware and software instrumentation 
of an Alliant FX/8 at the University of Illinois Center for Supercomputing Research and 
Development was used to obtain data from a real workload environment. Results show that 35% 
of the workload is concurrent, with the concurrent periods typically using all available proces- 
sors. Measurements of periods of change in concurrency show uneven usage of processors during 
these times. Other system measures, including cache miss rate and processor bus activity, are 
analyzed with respect to the concurrency measures. Probability of a cache miss is seen to 
increase with concurrency. The change in the cache miss rate is much more sensitive to the frac- 
tion of concurrent code in the workload than the number of processors active during concurrency. 
Regression models are developed to  quantify the relationships between cache miss rate, bus 
activity and the concurrency measures. The model for cache miss rate predicts an increase in the 
median miss rate value of as much as 300% for a 100% increase in concurrency in the workload. 
TABLE OF CONTENTS
CHAPTER
1. INTRODUCTION
2. BACKGROUND AND MOTIVATION
2.1 Related Research
3. EXPERIMENTAL ENVIRONMENT
3.1 System Description
3.2 Alliant Concurrency
3.3 Instrumentation
3.4 Experiment Setup
3.5 Measurements
4. ANALYSIS OF MEASURED DATA
4.1 Concurrency Measures
4.2 Workload Sampling Results
4.3 Concurrency Transitions
4.4 Discussion of Results
5. CONCURRENCY AND SYSTEM MEASURES
5.1 Cache Miss Rate
5.2 Regression Models
5.3 Discussion of Results
6. CONCLUSIONS
APPENDIX A
APPENDIX B
APPENDIX C
REFERENCES
LIST OF TABLES 
TABLE 1. Hardware Event Counts
TABLE 2. Overall Concurrency Measures for All Sessions
TABLE 3. Regression Models versus Cw
TABLE 4. Regression Models versus Pc
TABLE A.1. Mean Concurrency Measures for Random Samples
LIST OF FIGURES 
Figure 1. Configuration of the Measured Alliant FX/8
Figure 2. Execution of a Concurrent Loop
Figure 3. Number of Records with N Processors Active / All Sessions
Figure 4. Distribution of Samples by Workload Concurrency / All Sessions
Figure 5. Distribution of Samples by Mean Concurrency Level / All Sessions
Figure 6. Number of Records with N Processors Active / Concurrency Transition Periods
Figure 7. Number of Records Active by Processor Number / Concurrency Transition Periods
Figure 8. Missrate vs. Workload Concurrency
Figure 9. Missrate vs. Mean Concurrency Level
Figure 10 (a). Distribution of Miss Rate, Cw <= 0.4
Figure 10 (b). Distribution of Miss Rate, 0.4 < Cw <= 0.8
Figure 10 (c). Distribution of Miss Rate, Cw > 0.8
Figure 11 (a). Distribution of Miss Rate, Pc <= 6.0
Figure 11 (b). Distribution of Miss Rate, 6.0 < Pc <= 7.5
Figure 11 (c). Distribution of Miss Rate, Pc > 7.5
Figure 12. Plot of Regression Model, Missrate vs. Cw
Figure 13. Plot of Regression Model, CE Bus Busy vs. Cw
Figure 14. Plot of Regression Model, CE Bus Busy vs. Pc
Figure A.1. Number of Records with N Processors Active / Session 1
Figure A.2. Number of Records with N Processors Active / Session 9
Figure A.3. Distribution of Samples by CE Bus Busy
Figure A.4. Distribution of Samples by Miss Rate
Figure A.5. Distribution of Samples by Page Fault Rate
Figure B.1. CE Bus Busy vs. Workload Concurrency
Figure B.2. CE Bus Busy vs. Mean Concurrency Level
Figure B.3 (a). Distribution of CE Bus Busy, Cw <= 0.4
Figure B.3 (b). Distribution of CE Bus Busy, 0.4 < Cw <= 0.8
Figure B.3 (c). Distribution of CE Bus Busy, Cw > 0.8
Figure B.4 (a). Distribution of CE Bus Busy, Pc <= 6.0
Figure B.4 (b). Distribution of CE Bus Busy, 6.0 < Pc <= 7.5
Figure B.4 (c). Distribution of CE Bus Busy, Pc > 7.5
Figure B.5. Page Fault Rate vs. Workload Concurrency
Figure B.6. Page Fault Rate vs. Mean Concurrency Level
Figure B.7 (a). Distribution of Page Fault Rate, Cw <= 0.4
Figure B.7 (b). Distribution of Page Fault Rate, 0.4 < Cw <= 0.8
Figure B.7 (c). Distribution of Page Fault Rate, Cw > 0.8
Figure B.8 (a). Distribution of Page Fault Rate, Pc <= 6.0
Figure B.8 (b). Distribution of Page Fault Rate, 6.0 < Pc <= 7.5
Figure B.8 (c). Distribution of Page Fault Rate, Pc > 7.5
Figure B.9. Plot of Regression Model, Page Fault Rate vs. Cw
Figure B.10. Plot of Regression Model, Page Fault Rate vs. Pc
A MEASUREMENT-BASED STUDY OF CONCURRENCY 
IN A MULTIPROCESSOR 
BY 
PATRICK JOHN MCGUIRE 
B.S.E.E., University of Santa Clara, 1981 
THESIS 
Submitted in partial fulfillment of the requirements 
for the degree of Master of Science in Computer Science 
in the Graduate College of the 
University of Illinois at Urbana-Champaign, 1987 
Urbana, Illinois 
ACKNOWLEDGMENT 
We thank Ed Davidson at CSRD for his support and for permitting us to instrument 
the Alliant system. Thanks are also due to Bob McGrath, Richard Barton, Allen Maloney 
and Tracy Tilton, at CSRD, for their invaluable assistance. 
This work was supported by NASA grant NAG-1-613 with additional support from 
the National Science Foundation NSF(DCR84-10110) and Dept. of Energy 3-6124(DOE DE 
FG02-85ER25001). 
CHAPTER 1. 
INTRODUCTION 
The study of concurrent operations is an important part of evaluating parallel processing 
machines. Several analytical and simulation studies, such as those found in [1], [2], [3], and [4],
have been undertaken to evaluate such machines, but few if any investigate multiprocessor
performance in a real workload environment. Such measurements are important for developing
realistic techniques to measure and model concurrent behavior of system workloads. 
This project is concerned with the development of a measurement-based technique to study
the use of loop-level concurrency in a real production workload. Measurements were performed
on the Alliant FX/8 system at the Center for Supercomputing Research and Development
(CSRD) at the University of Illinois at Urbana-Champaign. The FX/8 measured in the study is
used primarily for development of numerical applications software. Programs developed on the
machine range from high-level (FORTRAN) software, such as structural mechanics and circuit
simulation, to assembly-level kernels for linear system solving [5], [6]. The Alliant is also
networked to several other department machines.
The two main objectives of this work are 1) to find the percentage of concurrent operations
in the workload and the use of processing resources in these operations, and 2) to study the
system overheads associated with concurrency in the workload environment, including the effect of
concurrency on other system performance measures. The methodology developed is general and
in principle can be applied to study other parallel systems. The thesis first proposes measures for
characterizing concurrency in the system. Probability distributions of the values of these
measures show the extent of concurrent operations in the workload. Particular attention is paid to
the end of concurrent loops, and the corresponding overheads in these periods. Regression
techniques are used to assess the impact of increased system concurrency on other system measures,
including bus utilization and cache miss rate.
Results from workload measurement show that the system workload is concurrent 35% of
the time, and that concurrent periods typically use all available processors. Measurements of the
end of concurrent operations indicate an uneven use of processors during these periods. Joint
analysis of concurrency and system performance measures such as cache miss rate and processor
bus activity shows that the probability of high values of these measures increases as concurrency
levels increase. Importantly, cache miss rate is shown to depend much more strongly on the
fraction of parallel code in the workload than on the number of processors active during concurrent
operations. In particular, a 100% increase in the fraction of concurrent operations in the
workload can result in a greater than 300% increase in cache miss rate.
The following chapter presents background information and related research in the
multiprocessor performance evaluation area. Chapter 3 describes the measurement environment of
the study, including the Alliant FX/8 and the instrumentation used. Chapter 4 shows results of
workload measurements, with special attention to periods of changes in concurrency. Chapter 5
deals with relationships between system measures and concurrency, and Chapter 6 summarizes
results.
CHAPTER 2. 
BACKGROUND AND MOTIVATION 
Concurrent operations in multiprocessing systems typically exist at three levels. At a high
level are processes or tasks, either independent or related, which execute on separate processors.
A low level of parallelism is the pipelined vector processing available on supercomputers [7].
Several simultaneous arithmetic operations are possible in these machines, on data contained in
special vector registers. An intermediate level is loop concurrency, in which multiple iterations of
a program loop are assigned to separate processors. A large effort has been made to apply this
level of program concurrency to groups of processors; for example, see [8] and [9]. This thesis is
concerned with evaluation of loop concurrency under real workload conditions.
In loop-concurrent operations, processors may proceed independently if there is no
dependency between loop iterations. The efficiency of the execution of this loop (which may include
several more nested loops) is dependent on how the number of total iterations matches the
number of available processors; for example, assignments which require some small number of
processors to execute in parallel, while others remain idle (because there are no more iterations to
execute), result in lower efficiency [8]. Performance is also dependent on contention between
processors for shared resources (e.g., cache, memory). Sets of loop iterations that contain
dependencies are further restricted in their execution. Dependencies may cause additional loop execution
overhead, since processors may have to wait on those executing previous iterations to satisfy the
dependence [10].
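The effect of a mismatch between iteration count and processor count can be illustrated with a small calculation. The sketch below is not taken from the thesis; it assumes that every iteration takes one equal-length time slot, that iterations are independent, and that there is no contention. The function name `loop_schedule_efficiency` is hypothetical.

```python
import math

def loop_schedule_efficiency(iterations: int, processors: int) -> float:
    """Fraction of processor time slots that do useful work.

    Assumes equal-length, independent iterations (an idealization).
    """
    if iterations <= 0 or processors <= 0:
        raise ValueError("iterations and processors must be positive")
    # Number of passes over the processor set needed to finish the loop.
    rounds = math.ceil(iterations / processors)
    # Useful slots divided by all slots occupied until the loop completes.
    return iterations / (rounds * processors)

# With 8 processors, one extra iteration forces a second, mostly idle pass.
for n in (8, 9, 16, 17):
    print(n, loop_schedule_efficiency(n, 8))
```

Under these assumptions, 9 iterations on 8 processors occupy two passes, and in the second pass seven processors sit idle, which is the situation described in the paragraph above.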
2.1 Related Research 
There have been many studies of parallel machine performance and concurrency. Most
have employed simulation and analytical techniques; examples are found in [1], [2], [3], and
[4]. None of these studies has measured concurrency usage on a real workload of a multiprocessing
system. Such measurements are important for determining the sensitivity of system
performance to changes in the level of concurrency. Knowledge of these effects is essential for
developing performance evaluation methodologies. Furthermore, the results can be applied to
control strategies, such as processor scheduling, within the multiprocessor.
Two common measures of performance for a multiprocessing computer are Speedup and
Efficiency. Speedup is defined as $S_P = T_1/T_P$, where $T_1$ is the execution time required for a
program on a single processor, and $T_P$ is the execution time of the program on P processors. Efficiency is
given by the ratio $E_P = S_P/P$, $0 < E_P \le 1$ [11]. Speedup measures obtained from measurements
on the Alliant FX/8 are given in [12]. Speedup and Efficiency yield information about the
improvement in a program, but they are unable to provide a detailed characterization of the
program or system behavior. Importantly, when performance evaluation of a real production
workload is of interest, there is no direct applicability of the Speedup and Efficiency measures.
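As a minimal illustration of the two definitions, the following sketch computes Speedup and Efficiency from a pair of execution times. The timings are invented for illustration and are not measurements from this study; the function names are hypothetical.

```python
def speedup(t1: float, tp: float) -> float:
    """Speedup: single-processor time divided by time on P processors."""
    return t1 / tp

def efficiency(t1: float, tp: float, p: int) -> float:
    """Efficiency: speedup divided by the number of processors P."""
    return speedup(t1, tp) / p

# Hypothetical timings: 120 s on one processor, 20 s on eight processors.
t1, t8 = 120.0, 20.0
print("speedup:", speedup(t1, t8))
print("efficiency:", efficiency(t1, t8, 8))
```

Both values describe a single program run in isolation, which is why they do not carry over directly to a mixed production workload.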
Other studies of multiprocessor systems, such as [13] and [14], have investigated more
detailed aspects of machine performance. In [14] it is shown that shared resource contention,
which typically grows with the number of processors present in a multiprocessor, can be a
limiting performance factor. In [15], measurements on the FX/8 deal specifically with the effect of the
machine's memory hierarchy.
Research more closely related to the work in this thesis is presented in [16] and [17]. These
studies use hardware monitoring and special event marker instructions embedded in programs to
acquire execution traces. Captured events on different processors are time-stamped, and the
composite trace yields information about the overlapping operations (concurrency) in the
program. Since this technique requires specific code insertion in programs, it is difficult to apply to
the observation of a real workload of programs generated by multiple users.
None of the other measurement studies address the issues of evaluating the amount of con- 
currency in a workload, or relate this measured concurrency to the behavior of other system 
components, such as cache and main memory. This thesis studies three aspects of concurrency. 
The percentage of concurrency present in the workload is measured, along with the number of 
processors used during parallel operations. The end of loop-concurrent operations is studied to 
focus on overheads associated with concurrency. These operations are subject to performance 
degradations due to contention for shared resources, waiting associated with dependency resolu- 
tion, and less than full utilization of processing resources. The impact of concurrency on system 
performance measures is also analyzed to find the relationship between concurrency and other 
indices, including cache miss rate and processor bus activity. 
To investigate the issues described above, the Alliant FX/8 at CSRD was instrumented to 
extract data related to concurrency and other system performance measures. The following 
chapter includes a brief description of the FX/8 and its loop-concurrency mechanism, the instru- 
mentation, experiment setup, and the basic measurements made on the machine. 
CHAPTER 3. 
EXPERIMENTAL ENVIRONMENT 
This chapter describes the Alliant FX/8, the instrumentation used for the measurements, 
and the experimental setup for the work. A more detailed description of the FX/8 may be found 
in Appendix C. 
3.1 System Description 
Measurements described here were performed on the Alliant FX/8 computer system. This 
machine is a shared-memory multiprocessor, with an advertised peak performance of 94.4 mil- 
lion floating point operations per second (MFLOPS) [18]. A diagram of the Alliant FX/8 
configuration used in the measurements is shown in Figure 1. 
Concurrency on the FX/8 is supported on the "Computational Cluster" (hereafter referred
to as the Cluster) of eight Computing Elements (CEs). These processors have floating point and
vector processing capabilities, and are linked through a common cache to shared memory.
Interactive Processors (IPs) handle interactive traffic, operating system functions, and I/O. The
machine supports an extension of 4.2 BSD UNIX, called Concentrix [18].
3.2 Alliant Concurrency 
This study is concerned with the measurement and evaluation of loop-level concurrency on 
the FX/8. The Alliant FORTRAN compiler attempts to transform DO loops or array operations 
into a parallel form, where the iterations of a loop will be executed on separate CEs. Figure 2 
shows how a concurrent loop is executed on the CE cluster. 
[Diagram: memory bus, memory modules, cache modules, IPs, crossbar interconnect, CEs, and Concurrency Control Bus.]
Figure 1. Configuration of the Measured Alliant FX/8. 
Figure 2. Execution of a Concurrent Loop. 
A program is executed serially until a special instruction is encountered which enables the
start of concurrent operation. Iterations of the DO loop are assigned to CEs in a self-scheduled
fashion [19]. Processors are assigned iterations until all iterations are executed. As shown in the
figure, the processor which executes the last iteration will continue serial execution after all
iterations are complete, and need not be the same processor that entered the loop serially. The above
loop execution may be complicated by dependencies which exist between iterations, where
synchronization will be required between CEs to enforce correct program operation.
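The self-scheduled assignment described above can be sketched as a small simulation. This is an illustrative software model only: on the FX/8 the scheduling is done in hardware, and the iteration durations, the tie-breaking rule (lowest-numbered free CE first), and the function name `self_schedule` are all assumptions made for the example. Dependencies between iterations are not modeled.

```python
import heapq

def self_schedule(iteration_times, num_ces=8):
    """Simulate self-scheduling of independent loop iterations.

    Each CE takes the next unassigned iteration as soon as it is free.
    Returns (loop finish time, CE that ran the last iteration, busy time per CE).
    """
    # Heap of (time at which the CE becomes free, CE number).
    free_at = [(0.0, ce) for ce in range(num_ces)]
    heapq.heapify(free_at)
    busy = [0.0] * num_ces
    finish_time, last_ce = 0.0, None
    for duration in iteration_times:
        start, ce = heapq.heappop(free_at)   # earliest-free CE gets the iteration
        end = start + duration
        busy[ce] += duration
        finish_time = max(finish_time, end)
        last_ce = ce                         # CE holding the highest-numbered iteration so far
        heapq.heappush(free_at, (end, ce))
    return finish_time, last_ce, busy

# Hypothetical loop: one long iteration followed by four short ones, on two CEs.
print(self_schedule([3, 1, 1, 1, 1], num_ces=2))
```

In this toy run the two CEs end up with different amounts of work, and the CE that holds the final iteration is determined by the timing of earlier iterations rather than by which CE entered the loop.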
Both synchronization and processor scheduling functions are handled in hardware, and make
use of the Concurrency Control Bus shown in Figure 1 [18].
3.3 Instrumentation 
The Alliant FX/8 was instrumented at both the hardware and system software levels.
Hardware measurements yielded information about CE concurrency and system bus activity,
while software measurements consisted of counts of events logged by the operating system kernel.
The two levels of measurements occurred simultaneously but independently. Minimal system
overhead was incurred for measurements; the hardware monitoring is inherently non-intrusive,
and the statistics-gathering software was that which normally runs in Concentrix. A feature
of this instrumentation approach is that no modifications to the system were required
in order to perform the measurements. When monitoring a real workload, instrumentation must
normally gather data without relying on any special operations of the system or programs
monitored, since modification of user and system programs for measurement purposes is not possible
in many cases.
The hardware monitoring was accomplished with a Tektronix DAS 9100 Series logic
analyzer [20]. This instrument acquires the state of up to 80 signals (on the unit used), and
stores this data in a 512-deep buffer memory. The DAS is fully controllable through an I/O port;
all experiments used this feature to control the instrument, as well as to transfer acquired buffers
to files resident on the Alliant system.
Probes from the DAS were connected to the FX/8 at three different logical points:
1) Bus opcode was monitored for each CE, where the bus was that between the
CE and the CE Cache, on the CE's side of the crossbar switch. Bus opcode
indicates what type of operation (read, write, idle, etc.) is occupying the bus.
2) The shared memory bus opcode was monitored, yielding information about
interactions between memory and cache and between multiple caches.
3) The Concurrency Control Bus was also monitored, to determine whether a
processor was active in concurrent operation, or not in a concurrent-active state.
As mentioned above, the software measurements were those normally collected by the
Concentrix kernel, made available by a program written internally at CSRD to extract the values.
The operating system logs counts continuously for a variety of memory management, scheduling,
and interrupt variables. In this study, the measurement extracted was page faults generated by
the CEs.
3.4 Experiment Setup
The measurements were controlled by UNIX C-Shell script programs executing on the
FX/8, which controlled collection of both the hardware and software data. The programs have
the ability to configure the DAS monitor, enable the monitor's triggering, transfer the data from
the instrument to a host system, and reduce the acquired data to appropriate event counts (e.g.,
number of cache read operations). Table 1 shows the reduced set of events derived from a
monitor buffer.
In these experiments, the objective was to observe the CE Cluster; therefore we chose the IP
as the computing resource for executing the measurement control software. The Concentrix
system allows control over what type of computing resource (IP, CE, Cluster with 1 to 8 CEs, or
don't care) a program will run on, provided that resource has the correct capabilities for the
program [21]. Using the IP kept measurement artifact at a minimum for our experiments.
HARDWARE MEASUREMENT EVENT COUNTS
Name       Event
num_j      number of records with j processors active
proc_j     number of records with processor j active
ceop_j     number of records with CE bus opcode = j
dmbop_j    number of records with mem bus opcode = j
TABLE 1. Hardware Event Counts.

3.5 Measurements
Two types of measurements were performed on the FX/8. The first used random sampling
of the system to acquire data from the real workload on the machine. Nine sessions of this type
were performed on seven different midweek days, when the machine is used most heavily. Each
session lasted between four and eight hours. Five snapshots of the system were taken and
grouped together in a five-minute interval. A real-time program was written to condense the
acquisition into event counts; the result was then written to disk. Software measurements were
taken simultaneously with the hardware measurements. These were recorded at the time that
the hardware sample was stored.
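The reduction of an acquired buffer into event counts can be sketched as follows. The record layout is an assumption made for the example: each record is taken to be a pair of an 8-bit mask of concurrent-active CEs and a memory bus opcode, and the per-CE bus opcodes are omitted for brevity. The actual reduction program used in the study is not reproduced here, and the names `reduce_buffer`, `num`, `proc`, and `mem_op` are hypothetical.

```python
from collections import Counter

NUM_CES = 8

def reduce_buffer(records):
    """Reduce (active_mask, mem_opcode) records to event counts.

    Bit j of active_mask is 1 when CE j is concurrent-active (assumed encoding).
    """
    num = [0] * (NUM_CES + 1)   # num[j]: records with j processors active
    proc = [0] * NUM_CES        # proc[j]: records with processor j active
    mem_op = Counter()          # mem_op[k]: records with memory bus opcode k
    for active_mask, mem_opcode in records:
        active = [j for j in range(NUM_CES) if (active_mask >> j) & 1]
        num[len(active)] += 1
        for j in active:
            proc[j] += 1
        mem_op[mem_opcode] += 1
    return num, proc, mem_op

# Hypothetical four-record buffer (a real buffer would hold up to 512 records).
records = [(0b11111111, 1), (0b00000001, 0), (0b00000000, 0), (0b11111111, 1)]
num, proc, mem_op = reduce_buffer(records)
print(num, proc, dict(mem_op))
```

Storing only such counts for each sample, rather than the raw buffer, keeps the data volume per sample small.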
A second group of measurements was executed in order to extract system behavior when the
system was executing with high concurrency. These experiments dealt with hardware
measurements only. High concurrency operation was captured by triggering the hardware monitor at
times when the FX/8 Cluster was operating in a concurrent mode. Two different trigger events
were used. In ten of the experiment sessions, the monitor was triggered when all eight processors
in the Cluster were active, while in five other experiments, the transition from eight processors
active to a smaller number active was the trigger event. This latter condition was chosen
particularly to try to determine the behavior of the Cluster during times when the level of concurrency
in the machine was changing.
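The two trigger conditions can be expressed in software terms as follows. In the experiments the triggering was performed by the logic analyzer on the monitored signals; this sketch only mirrors the logic on a list of per-record active-processor counts, and the trace values and function names are invented for illustration.

```python
def first_all_active(active_counts, num_ces=8):
    """Index of the first record in which all processors are active, or None."""
    for i, n in enumerate(active_counts):
        if n == num_ces:
            return i
    return None

def first_drop_from_full(active_counts, num_ces=8):
    """Index of the first record following full activity in which fewer
    processors are active (a transition from num_ces to a smaller number),
    or None if no such transition occurs."""
    for i in range(1, len(active_counts)):
        if active_counts[i - 1] == num_ces and active_counts[i] < num_ces:
            return i
    return None

# Hypothetical trace of active-processor counts, one per record.
trace = [1, 1, 8, 8, 8, 5, 2, 1]
print(first_all_active(trace), first_drop_from_full(trace))
```

The second condition selects the records at the end of a fully concurrent period, which is the region of interest for studying changes in concurrency.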
Processing of the measured data was performed with the Statistical Analysis System (SAS) 
package on an IBM 4381. SAS provides a large set of data analysis procedures, including graphi- 
cal presentation, regression, clustering, and analysis of variance [22]. 
The next chapter describes the analysis performed on the above measurements. The first
section presents a definition of concurrency measures, and results obtained from the random
sampling of workload. A description of periods of transition of concurrency is then detailed.
CHAPTER 4. 
ANALYSIS OF MEASURED DATA 
The analysis of the measurements in this study is presented in this chapter. Measures
are defined which assist in characterizing concurrency in the workload, and distributions of these
measures for the acquired data are given. Analysis of periods of transitions in the level of system
concurrency is also performed, in order to examine the overheads associated with these
transitions.
4.1 Concurrency Measures 
In order to quantify concurrency in the workload, certain measures are defined. The
workload is one choice of scope for the measures; they could easily be applied at other levels, such as a
program or sub-program.
We first define a measure c . as follows: 
3 
cj = Prob(Number of Active Proeersorr = j) ( 4 4  
We call c . j-concurrency. From c .we derive the Workload Concurrency,  which is the probability 
that there is any level of concurrency (2 or more processors operating in parallel) in the system: 
3 3 
P 
C,, = Eci 
i=2 
The above measures deal with the amount of concurrency in the workload; we now restrict our
attention to times when the system is operating concurrently. The conditional probability

$$c_{j|c} = \mathrm{Prob}(\text{Number of Active Processors} = j \mid \text{Number of Active Processors} > 1) \tag{4.3}$$

measures the j-concurrency value for only those times when concurrency exists in the system.
If all c_j values from 2 to P are 0, this value is undefined. From c_{j|c} we calculate the Mean
Concurrency Level, which is the average number of processors operating concurrently when at
least 2 processors are active:

$$P_c = \sum_{j=2}^{P} j \, c_{j|c} \tag{4.4}$$

Pc is a measure of the utilization of computing resources during concurrency, and may vary
from 2 to P.
The above measures may be applied at any level of multiprocessing capability of a given 
machine. In our experiments, we apply the measures to the specific case of loop-level con- 
currency in the overall workload of the multiprocessor. 
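As a minimal sketch of how the measures of equations 4.1-4.4 can be computed from a histogram of records, the following Python function takes a mapping from the number of active processors to a record count. The function name, the dictionary layout, and the example counts are illustrative assumptions and are not the SAS procedures used in this study. The Mean Concurrency Level is computed as the conditional mean described in the text.

```python
from typing import Dict, Optional, Tuple


def concurrency_measures(
    counts: Dict[int, int], p: int = 8
) -> Tuple[Dict[int, float], float, Optional[Dict[int, float]], Optional[float]]:
    """Return (c, C_w, c_cond, P_c) from record counts per activity level.

    counts maps the number of active processors (0..p) to the number of
    records observed in that state.
    """
    total = sum(counts.get(j, 0) for j in range(p + 1))
    if total == 0:
        raise ValueError("no records")
    # j-concurrency, equation (4.1)
    c = {j: counts.get(j, 0) / total for j in range(p + 1)}
    # Workload Concurrency, equation (4.2)
    c_w = sum(c[j] for j in range(2, p + 1))
    if c_w == 0:
        # No concurrent records: the conditional measures are undefined.
        return c, c_w, None, None
    # Conditional j-concurrency, equation (4.3)
    c_cond = {j: c[j] / c_w for j in range(2, p + 1)}
    # Mean Concurrency Level, equation (4.4)
    p_c = sum(j * c_cond[j] for j in range(2, p + 1))
    return c, c_w, c_cond, p_c


# Hypothetical counts, for illustration only (not the measured data).
example = {0: 50, 1: 20, 2: 5, 8: 25}
c, c_w, c_cond, p_c = concurrency_measures(example)
print(c_w, c_cond[8], p_c)
```

Returning `None` for the conditional measures mirrors the statement that they are undefined when no concurrency is present.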
4.2 Workload Sampling Results 
Figure 3 shows the distribution of the number of active processors over all the measurement
sessions. This figure shows the dominant concurrency states of the system as well as the ratio of
concurrency to serial activity. The high points on the distribution at eight, one, and zero processors
active show that the CE Cluster spends the majority of its time in one of three states: full
concurrency, serial, or idle.¹ This analysis was performed on data from individual measurement
sessions and on the sum of all sessions. Distributions of processor activity in individual sessions
showed significant variation during different periods; examples for two sessions are shown in
Appendix A.

¹ Idle in this context is with respect to Concurrent-Mode operation. Detached processes (exclusively serial) may constitute a portion
of these states.
From the processor distribution, the concurrency measures defined in equations 4.1-4.4 may
be calculated. Table 2 shows these values for the sum of random-sample sessions. The value of
Cw shows that concurrency in the workload is at 35% for the full set of measurement sessions.
Eight processors are active 28% of the time in the overall workload, but when the system is concurrent,
the 8-active state predominates at 93% (c_{8|c}), resulting in a Mean Concurrency Level of
7.66.
Table 2. Concurrency measures for the sum of random-sample sessions.

  c_2      c_3      c_4      c_5      c_6      c_7      c_8      Cw
  0.0049   0.0015   0.0022   0.0005   0.0025   0.0100   0.2795   0.3506

  c_2|c    c_3|c    c_4|c    c_5|c    c_6|c    c_7|c    c_8|c    Pc
  0.0164   0.0051   0.0074   0.0018   0.0083   0.0331   0.9278   7.66

[Figure 3. Number of Records with N Processors Active / All Sessions. Horizontal bar chart of NUMBER OF PROCESSORS against TOTAL records.]

For each sample (five-minute period), a distribution of the number of active processors was generated,
and the corresponding concurrency measures calculated. Figures 4 and 5 show the
distribution of samples with particular values of Workload Concurrency and Mean Concurrency
Level, respectively. Some points are of note in the distributions. The first is the large percentage
of samples with Workload Concurrency at or near zero, indicating serial code
execution or idle states during these periods. Some concurrency in the workload exists for 55%
of the samples. Note that this number is significantly different from the Cw found in the overall
average. The reason for this is a difference between five-minute and overall average behavior.
Note from Figure 4 that there are many samples with low but non-zero Workload Concurrency,
which do not contribute significantly to the value of Cw for the total group of measurement sessions.
For samples with non-zero Cw, more than 94% have a Mean Concurrency
Level higher than 6.5. (Recall that for any Workload Concurrency value of zero, Mean Concurrency
Level is not calculated, since it is a measure of the system when concurrent.) Hence,
concurrency which does appear in the measured workload has a characteristically high utilization
of the total available concurrency resource.
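The difference between the fraction of samples that show some concurrency and the overall Workload Concurrency can be illustrated with a small sketch. The sample values below are hypothetical and chosen only to show the effect; the function names are likewise assumptions made for illustration.

```python
from typing import List, Tuple

# Each sample is (concurrent_records, total_records) for one period.
Sample = Tuple[int, int]


def fraction_of_samples_with_concurrency(samples: List[Sample]) -> float:
    """Fraction of samples whose Workload Concurrency is non-zero."""
    return sum(1 for conc, _ in samples if conc > 0) / len(samples)


def pooled_workload_concurrency(samples: List[Sample]) -> float:
    """Workload Concurrency computed over the records of all samples."""
    concurrent = sum(conc for conc, _ in samples)
    total = sum(tot for _, tot in samples)
    return concurrent / total


# Hypothetical samples: one idle/serial, two barely concurrent, one fully concurrent.
samples = [(0, 100), (5, 100), (5, 100), (100, 100)]
print(fraction_of_samples_with_concurrency(samples))
print(pooled_workload_concurrency(samples))
```

Samples with low but non-zero concurrency count fully toward the first figure while contributing very little to the second, which is the effect noted in the text.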
While the concurrency level is most often high, there are some periods in which it is not at its maximum.
Clearly, overheads due to multiprocessing can contribute to this problem. Study of periods of
change in concurrency can yield information about these overheads; this approach is described in
the following section.
4.3 Concurrency Transitions 
On the FX/8, transitions in concurrency typically happen at the end of a DO loop, when
there are no remaining iterations to perform, and processors begin to become idle while waiting
for serial execution to continue. These transitions affect the overall efficiency of parallel operations.
These idle periods correspond to a multiprocessing overhead; if the transition from P processors
to one (serial) is instantaneous, processors do not incur any idle time, and Mean Concurrency
Level, Pc, is maximum. To investigate the efficiency during transition periods, a
specific set of measurements was performed in which monitoring began when processor activity
changed from all processors active (full-concurrency) to a lower concurrency level.

Figure 4. Distribution of Samples by Workload Concurrency / All Sessions.

  Cw midpoint   FREQ   CUM. FREQ   PERCENT   CUM. PERCENT
  0.000          29        29       44.62        44.62
  0.125           2        31        3.08        47.69
  0.250          10        41       15.38        63.08
  0.375           7        48       10.77        73.85
  0.500           1        49        1.54        75.38
  0.625           2        51        3.08        78.46
  0.750           5        56        7.69        86.15
  0.875           2        58        3.08        89.23
  1.000           7        65       10.77       100.00

[Figure 5. Distribution of Samples by Mean Concurrency Level / All Sessions. Bar chart; the PERCENT column reads 0.00, 0.00, 0.00, 2.78, 2.78, 11.11, 83.33 and the CUM. PERCENT column reads 0.00, 0.00, 0.00, 2.78, 5.56, 16.67, 100.00.]
Figure 6 is the distribution of active processors for the concurrency transition periods. The
transition of interest is between 8-concurrency and 1-concurrency; hence the distribution shows
the number of records for 7 through 2 processors active. The state with 2 processors active
accounts for 52% of the transition states. On average, therefore, the transitions
between 7 and 2 processors active occur significantly faster than the transition from 2 processors
to serial operation.
Figure 7 shows the distribution of activity by individual processor during transitions. Pro- 
cessors 7 and 0 appear to be active significantly more often than the other processors, 
corresponding to the 2-concurrency peak in Figure 6, while processors 2, 3, and 4 are 
significantly less active than the others. 
A simple reason for uneven distribution of processor activity is a loop count of the form I =
8*j + 2, for integer j. If loop iterations throughout are equal in execution time, this would result
in two iterations remaining after 8*j iterations have been executed, which would then take a full
iteration time to complete. A particular number of "leftover" iterations occurring dominantly in
the workload would seem unlikely, however. Second, processors executing a loop may not follow
the same execution path, due to conditional branching which is iteration-dependent. The processor
executing the shorter path will obviously finish earlier, and be available for scheduling a new
iteration while the longer iteration is still executing. Data access patterns are also likely to differ
between iterations, which may result in differences in memory latency, if cache misses and/or
page faults are generated by the processor. Such variation in latency causes processors to lead or
lag one another. Dependencies in a program loop may also result in uneven distribution of
activity among processors, as some processors wait more than others. Finally, contention
for resources such as memory can contribute to uneven processor activity. If priority schemes
favor particular processors, the remaining processors will suffer greater delay, increasing the
probability that they will trail other processors in execution at the end of the loop.

Figure 6. Number of Records with N Processors Active / Concurrency Transition Periods.

  NUMBER OF PROCESSORS   TOTAL      PERCENT   CUM. PERCENT
  7                        828.00     8.03         8.03
  6                        831.00     8.08        16.11
  5                        564.00     5.48        21.59
  4                       1594.00    15.49        37.08
  3                       1079.00    10.49        47.57
  2                       5395.00    52.43       100.00
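The "leftover" iteration argument can be illustrated with a short sketch. It assumes that all iterations take the same time and that each free processor picks up one remaining iteration per step; the function name and this scheduling model are simplifying assumptions made for illustration, not a description of the FX/8 scheduling hardware.

```python
from typing import List


def active_per_step(iterations: int, p: int = 8) -> List[int]:
    """Number of busy processors in each step of a loop whose iterations
    all take one step, with up to p iterations running at once."""
    steps = []
    remaining = iterations
    while remaining > 0:
        busy = min(p, remaining)  # at most p iterations run concurrently
        steps.append(busy)
        remaining -= busy
    return steps


# Loop count I = 8*j + 2 with j = 3: the last step keeps only 2 processors busy.
print(active_per_step(8 * 3 + 2))
```

Under these assumptions the final step of such a loop always runs at 2-concurrency for a full iteration time, which is the behavior the first explanation relies on.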
4.4 Discussion of Results 
Concurrency measures presented in this chapter describe what fraction of the workload is
concurrent (Cw) and how many processors are used during concurrent operations (Pc). Random
sampling of the workload during nine different sessions shows that Cw = 0.35, indicating that
about one-third of the workload is devoted to concurrent operations. An overall Mean Concurrency
Level of 7.66 shows that parallel operations have a high utilization of the machine's processors.
Transitions in concurrency, typically at the end of parallel loops, show uneven use of processors.
The 2-concurrent state is significantly more frequent than other "transition" states. In
particular, CEs 7 and 0 tend to show more activity than other processors during transition; as
other processors begin to become idle, these two typically continue to execute. Several reasons
for this are possible, including a high frequency of loop counts that result in two "leftover" iterations.
Also, uneven distribution of waiting time among processors, due to priority assignment for
shared resources and/or loop iteration dependency, may result in differences in activity between
processors.
The variation in system concurrency affects other system components related to processor
activity; these include system busses and cache memory. Measured data is analyzed in the
following chapter to determine this relationship.
CHAPTER 5. 
CONCURRENCY AND SYSTEM MEASURES 
In this chapter the effect of system concurrency measures on key system performance measures
is described. Analysis was performed on the combination of the random-sampling and
high-concurrency measurement periods. Three measures, CE Bus Busy, Cache Miss Rate
(Missrate), and Page Fault Rate, were calculated from the acquired data. CE Bus Busy is the
fraction of processor-to-cache bus cycles that are not idle in the measured interval. The value
shown in the following analysis is the average value of this fraction over all eight busses, and is a
measure of the total information flow between CEs and Cache. Missrate is the fraction of total
bus cycles corresponding to cache misses. Page Fault Rate is the sum of user-mode and system-mode
page faults generated by the CEs during the measurement interval.
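The three system measures can be written out as a brief sketch. The counter names (busy cycles per bus, miss cycles, page-fault counts) are hypothetical stand-ins for whatever the hardware monitor actually records, and the example values are invented for illustration.

```python
from typing import Sequence


def ce_bus_busy(busy_cycles: Sequence[int], interval_cycles: int) -> float:
    """Average, over all CE busses, of the fraction of non-idle bus cycles."""
    fractions = [b / interval_cycles for b in busy_cycles]
    return sum(fractions) / len(fractions)


def missrate(miss_cycles: int, total_bus_cycles: int) -> float:
    """Fraction of total bus cycles corresponding to cache misses."""
    return miss_cycles / total_bus_cycles


def page_fault_rate(user_faults: int, system_faults: int) -> int:
    """Sum of user-mode and system-mode page faults in the interval."""
    return user_faults + system_faults


# Hypothetical counters for one measurement interval on eight busses.
print(ce_bus_busy([50, 50, 0, 0, 0, 0, 0, 0], 100))
print(missrate(23, 1000))
print(page_fault_rate(1200, 300))
```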
5.1 Cache Miss Rate
Figures 8 and 9 show scatter-plots of the cache miss rate against Workload Concurrency
and Mean Concurrency Level, respectively. Similar plots for CE Bus Busy and Page Fault Rate are
shown in Appendix B. An inspection of Figure 8 shows that the highest Missrate values occur at
maximum Workload Concurrency. In addition, an increase in Workload Concurrency appears to
increase the probability of a high Missrate value.
Figure 9 also shows some increasing probability of high Missrate as Pc increases, although
the Missrate is relatively unchanged for Pc > 7.0.
The distributions for Missrate are plotted in Figures 10 and 11 for increasing values of Cw
and Pc. (See Appendix B for the distributions of CE Bus Busy and Page Fault Rate.) Note that
the median Missrate value for 0.4 < Cw <= 0.8 is 0.009, and increases sharply to 0.023 for Cw
> 0.8. The median value of Missrate shows no increase between the middle and high ranges of Pc,
indicating less sensitivity to this measure than to Cw.
[Figure 8. Missrate vs. Workload Concurrency. Scatter plot of Missrate (vertical axis) against Cw (horizontal axis). Legend: A = 1 obs, B = 2 obs, etc.]
[Figure 9. Scatter plot of Missrate (vertical axis) against Mean Concurrency Level, Pc.]
[Figure 10 (a). Distribution of Miss Rate, Cw <= 0.4.]
[Figure 10 (c). Distribution of Miss Rate, Cw > 0.8.]
[Figure 11 (b). Distribution of Miss Rate, 6.0 < Pc <= 7.5.]
[Figure 11 (c). Distribution of Miss Rate, Pc > 7.5.]
[Each panel of Figures 10 and 11 is a bar chart of sample FREQUENCY by MISSRATE midpoint (0.00 to 0.10), with FREQ, CUM. FREQ, PERCENT, and CUM. PERCENT columns and the mean and median of the distribution.]
While the probability of a cache miss increases with Workload Concurrency and, to a lesser
extent, Mean Concurrency Level, a high value of Cw or Pc does not preclude a low Missrate
value. Several observations exist at maximum Cw with low or zero Missrate. Periods of high
Workload Concurrency and/or Mean Concurrency Level may generate low cache miss rates for
several reasons. If the concurrent portion of the workload has well-behaved data and code locality,
Missrate will obviously be low. Since each CE has an internal instruction cache (see Appendix
C), loops and other program constructs which are local and "fit" in this cache will not generate
successive requests to the shared cache for instruction fetch. A high degree of register-to-register
operations (which may include 32-element vector operations) will reduce data traffic
between CE and cache, and consequently the average number of cache misses. Data dependency
within concurrent loops may also reduce cache traffic. Processors are not required to access
memory while waiting for synchronization, since this mechanism uses the physically separate
Concurrency Control Bus [18]. Data and instruction locality across processors will also lessen the
overall impact on the cache of higher concurrency in the workload. Data which is fetched to the
shared cache for one CE and is soon needed by one or more additional processors will not result
in additional misses for these processors.
In summary, the distributions in this section show a general increase in cache miss rate with 
an increased amount of parallel code, and little relation between Missrate and the number of pro- 
cessors active within concurrent operations. In the following section, regression models are 
developed to quantify relationships between system and concurrency measures. 
5.2 Regression Models 
Regression models were developed to quantify the median behavior of system measures
with respect to concurrency measures. A median point is calculated with respect to Cw by
finding the median of the system measure for the set of points clustered around their closest
Workload Concurrency midpoint (0.0, 0.1, ... 1.0). The resulting set of coordinate pairs is then
used to determine a model of the form system measure versus Cw. The same technique was
used to calculate the median system measure points for Mean Concurrency Level (midpoints =
2.0, 3.0 ... 8.0), to develop models of system measure versus Pc.
Regression techniques were used to generate a fit of the median values described above to
the corresponding concurrency measure midpoints. Second-order linear models were determined
to most accurately model the data. These models were of the form:

$$\text{System Measure} = \beta_1 C_w + \beta_2 C_w^2 + C \tag{5.1}$$

or

$$\text{System Measure} = \beta_1 P_c + \beta_2 P_c^2 + C \tag{5.2}$$

The regression model finds \beta_1, \beta_2 and C such that the sum of squared errors

$$\sum_i \left( y_i - \beta_1 x_i - \beta_2 x_i^2 - C \right)^2$$

is minimized, where (x_i, y_i) is a (concurrency measure, system measure) observation. One commonly
used measure of the accuracy of the model for predicting the data is given by R^2, which
indicates the fraction of variability in the data explained by the model [23]. The results of the
regression modeling are shown in Tables 3 and 4.
R^2 values are categorized in [24] as: 0 = no relationship, 0.25 = moderately weak, 0.5 = moderate, 0.75 = moderately
strong, 1.0 = perfect.
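The median-point regression procedure described in this section can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not the SAS procedure used in the study: the function names are hypothetical, ties between midpoints go to the lower midpoint, and the fit is an ordinary least-squares solution of the normal equations.

```python
from statistics import median
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def median_points(obs: Sequence[Point], midpoints: Sequence[float]) -> List[Point]:
    """Group each (x, y) observation under its closest midpoint and return
    (midpoint, median of y) pairs, sorted by midpoint."""
    groups: Dict[float, List[float]] = {}
    for x, y in obs:
        m = min(midpoints, key=lambda mp: abs(mp - x))
        groups.setdefault(m, []).append(y)
    return sorted((m, median(ys)) for m, ys in groups.items())


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def fit_quadratic(points: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Least-squares fit of y = b1*x + b2*x^2 + c; returns (b1, b2, c, r2).
    Needs at least three distinct x values."""
    basis = [[x, x * x, 1.0] for x, _ in points]
    ys = [y for _, y in points]
    # Normal equations: (B^T B) beta = B^T y
    btb = [[sum(r[i] * r[j] for r in basis) for j in range(3)] for i in range(3)]
    bty = [sum(r[i] * y for r, y in zip(basis, ys)) for i in range(3)]
    b1, b2, c = solve(btb, bty)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (b1 * x + b2 * x * x + c)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return b1, b2, c, r2
```

A typical use would pass the (Cw, Missrate) observations and the midpoints 0.0, 0.1, ... 1.0 to `median_points`, then pass the resulting pairs to `fit_quadratic`.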
Table 3. Regression Models versus Cw

  System Measure            beta_1           beta_2           C                R^2
  Median Miss Rate          -3.30 x 10^?     2.57 x 10^?      2.62 x 10^-?     0.74
  Median CE Bus Busy         2.18 x 10^-?    1.01 x 10^-?     2.47 x 10^-?     0.89
  Median Page Fault Rate     1.46 x 10^4    -1.02 x 10^4      1.07 x 10^3      0.65

Table 4. Regression Models versus Pc

  System Measure            beta_1           beta_2           C                R^2
  Median Miss Rate           5.05 x 10^?    -7.43 x 10^?      1.86 x 10^-?     0.07
  Median CE Bus Busy         1.22 x 10^-1   -7.79 x 10^?     -1.82 x 10^-?     0.66
  Median Page Fault Rate     8.12 x 10^3    -5.28 x 10^2     -2.53 x 10^4      0.61
The plot of the model of Missrate versus Workload Concurrency is shown in Figure 12. The model
predicts that an increase in Cw from 0.5 to 1.0 will be accompanied by a more than threefold
increase in Missrate, from 0.007 to 0.024. While the scatter-plots showed that Missrate values vary
over a wide range for Workload Concurrency, the fact that the median value is increasing shows
that the probability of higher values grows with Cw.
A similar analysis was performed to estimate a relationship between CE Bus Busy and the
concurrency measures. The models for this measure versus Cw and Pc are plotted in Figures 13
and 14. The figures show that the activity on the CE busses is generally increasing with both
Workload Concurrency and Mean Concurrency Level. Both concurrency measures establish the
fraction of time that processors may be active; median bus activity then follows this fraction. As
expected, the model predicts an almost linear increase in bus activity with Workload Concurrency
(i.e., the fraction of parallel code in the workload). With respect to Mean Concurrency Level,
however, activity increases until Pc = 6.0, after which CE Bus Busy levels off around 0.30. The
results suggest that increased bus activity is more dependent on the percentage of parallel code in
the workload (given by Cw) than on the degree of concurrency within parallel operations.

[Figure 12. Plot of Regression Model, Missrate vs. Cw.]
[Figure 13. Plot of Regression Model, CE Bus Busy vs. Cw.]
[Figure 14. Plot of Regression Model, CE Bus Busy vs. Pc.]
5.3 Discussion of Results 
The analysis in this chapter presents a model for Missrate versus Workload Concurrency
which predicts a sharp increase in Missrate, from 0.007 to 0.024, between Cw = 0.5 and Cw =
1.0. Little correlation between Missrate and Pc is seen. This means that Missrate is much more
sensitive to the fraction of parallel code in the workload than to the number of processors active
within parallel operations. The probable cause for this result is that the kinds of functions which
are suitable for parallel encoding, such as matrix and concurrent vector operations, are usually
much more data intensive than general serial code. This can result in higher data traffic for
parallel codes (see the regression models for CE Bus Busy in the previous section), and a greater
number of cache misses.
As mentioned in section 5.1, locality of data and code across processors lessens the impact
of additional processors within a parallel operation on cache misses. This is an explanation for
the lack of relationship between Mean Concurrency Level and Missrate. Processors executing
iterations of concurrent loops will typically follow similar instruction execution paths, ensuring
good code locality. In addition, data access patterns between loop iterations will usually be
related, lowering any additive effects of growth in Pc.
CE Bus activity shows a near-linear growth with increasing Workload Concurrency. With
respect to Mean Concurrency Level, bus activity increases until reaching a maximum range at Pc
= 6.0. The increase with Cw is explained as above; the inherent difference between concurrent and serial
code results in greater traffic levels as Cw grows. The relatively constant bus activity after Pc = 6.0
is likely a reflection of a higher degree of dependence-related waiting in periods of maximum concurrency
(all processors active) in the workload; such waiting will reduce bus traffic.
The following chapter highlights the key results arising from this work, and makes suggestions for
future research.
CHAPTER 6. 
CONCLUSIONS 
This study used measurements on an Alliant FX/8 multiprocessor at the University of Illinois
Center for Supercomputing Research and Development to evaluate concurrency found in a
real workload on the machine. A systematic methodology was developed for characterizing the
amount of concurrency present in the workload, and the effect of concurrency on system performance
indices such as cache miss rate and bus activity.
Two measures, Workload Concurrency and Mean Concurrency Level, were defined and then
measured. Random sampling of the workload showed that Workload Concurrency varies from 0
(serial) to 1.0, and has an overall average of 35% for all measurement sessions. Idle, serial, and
fully concurrent states dominate in the CE Cluster. Mean Concurrency Level, which measures
the number of active processors during a concurrent operation, normally shows a value close to
maximum concurrency, or Pc = 8.
Analysis of transition periods between 8-concurrency and lower concurrency levels showed
that processor usage was uneven during these times; for the measured data, periods of 2-concurrency
dominate the transition periods. Possible reasons for this include a large percentage
of concurrent loops with 2 "leftover" iterations, uneven distribution of dependency waiting times,
and unbalanced sharing of resources during concurrent operation, where one or both of the two
processors that remain active experience greater delays than the remaining CEs.
System measures, including cache miss rate and CE bus activity, were analyzed with respect
to concurrency measures to observe what relationships exist. It was shown that in general, the
higher the value of Cw or Pc, the higher the probability of increase in the system measure. For
cache miss rate, neither concurrency measure established a lower bound on the value of the
system measure.
Second-order linear regression models were developed to find the relationship between Cw,
Pc, and median system measure behavior. A model for cache miss rate showed a reasonable fit
versus Cw, and predicted an increase in the median value of Missrate from 0.007 to 0.024 as
Workload Concurrency increased from 50% to 100%. Missrate showed low correlation with Pc,
yielding the result that Missrate is more strongly related to the fraction of parallel code in the
workload than to the level of concurrency within parallel operations. CE bus activity was also
modeled versus both concurrency measures, and was seen to increase with both Cw and Pc,
although less strongly in the highest range of Mean Concurrency Level.
The methodology and results presented here are useful for multiprocessor evaluation and 
optimization. In particular, understanding of machine characteristics in the presence of a real 
workload is important, since the complexity of parallel systems makes prediction of performance 
difficult. The techniques used here can be applied to other parallel processing systems, and be 
extended to other levels of concurrency and new performance indices. 
Similar studies are suggested, in order to obtain a wide range of representative practical
results in concurrency evaluation. Future research in the measurement of concurrency should
include evaluation of individual programs, to determine their behavior within the workload
environment. Also, the relationship of concurrency and software-level parameters (such as those
related to job scheduling) deserves attention.
APPENDIX A. 
WORKLOAD SAMPLING DATA

[Figure A.2. Number of Records with N Processors Active / Session 9. Horizontal bar chart of NUMBER OF PROCESSORS (8 down to 0) against FREQUENCY.]
Table A.1. Mean Concurrency Measures for Random Samples. 
Figure A.3. Distribution of Samples by CE Bus Busy.

  CE BUS BUSY midpoint   FREQ   CUM. FREQ   PERCENT   CUM. PERCENT
  0.00                    25        25       21.74        21.74
  0.05                    22        47       19.13        40.87
  0.10                    19        66       16.52        57.39
  0.15                    14        80       12.17        69.57
  0.20                     7        87        6.09        75.65
  0.25                    11        98        9.57        85.22
  0.30                     8       106        6.96        92.17
  0.35                     3       109        2.61        94.78
  0.40                     0       109        0.00        94.78
  0.45                     1       110        0.87        95.65
  0.50                     5       115        4.35       100.00

Figure A.4. Distribution of Samples by Miss Rate.

  MISSRATE midpoint   FREQ   CUM. FREQ   PERCENT   CUM. PERCENT
  0.00                 41        41       63.08        63.08
  0.01                 13        54       20.00        83.08
  0.02                  6        60        9.23        92.31
  0.03                  2        62        3.08        95.38
  0.04                  1        63        1.54        96.92
  0.05                  2        65        3.08       100.00
  0.06                  0        65        0.00       100.00
  0.07                  0        65        0.00       100.00
  0.08                  0        65        0.00       100.00
  0.09                  0        65        0.00       100.00
  0.10                  0        65        0.00       100.00

Figure A.5. Distribution of Samples by Page Fault Rate.

  PAGE FAULT RATE midpoint   FREQ   CUM. FREQ   PERCENT   CUM. PERCENT
      0                        3         3        2.54         2.54
   1000                       17        20       14.41        16.95
   2000                       24        44       20.34        37.29
   3000                        8        52        6.78        44.07
   4000                       15        67       12.71        56.78
   5000                        9        76        7.63        64.41
   6000                        4        80        3.39        67.80
   7000                        6        86        5.08        72.88
   8000                        7        93        5.93        78.81
   9000                        2        95        1.69        80.51
  10000                        2        97        1.69        82.20
  11000                        3       100        2.54        84.75
  12000                        0       100        0.00        84.75
  13000                        0       100        0.00        84.75
  14000                        0       100        0.00        84.75
  15000                        0       100        0.00        84.75
  16000                        0       100        0.00        84.75
  17000                        1       101        0.85        85.59
  18000                        0       101        0.00        85.59
  19000                        0       101        0.00        85.59
  20000                        3       104        2.54        88.14
  21000                        1       105        0.85        88.98
  22000                        2       107        1.69        90.68
  23000                        3       110        2.54        93.22
  24000                        8       118        6.78       100.00
APPENDIX B. 
CONCURRENCY VS. SYSTEM MEASURE DATA 
[Figures B.1 and B.2: scatter plots not recoverable from the scan; vertical axis 0.0 to 1.1. The caption of Figure B.1 did not survive.]
Figure B.2. CE Bus Busy vs. Mean Concurrency Level.
[Figure B.3: histograms of CE Bus Busy (midpoints 0.0 to 1.0) with FREQ, CUM. FREQ, PERCENT, and CUM. PERCENT columns; bar data not recoverable from the scan. The caption of panel (b) did not survive.]
Figure B.3 (a). Distribution of CE Bus Busy, Cw <= 0.4
Figure B.3 (c). Distribution of CE Bus Busy, Cw > 0.8
[Figure B.4: histograms of CE Bus Busy (midpoints 0.0 to 1.0) with FREQ, CUM. FREQ, PERCENT, and CUM. PERCENT columns; bar data not recoverable from the scan. The caption of panel (c) did not survive.]
Figure B.4 (a). Distribution of CE Bus Busy, Pc <= 6.0
Figure B.4 (b). Distribution of CE Bus Busy, 6.0 < Pc <= 7.5
[Figures B.5 and B.6: scatter plots of CE page fault rate not recoverable from the scan. The caption of Figure B.5 did not survive.]
Figure B.6. Page Fault Rate vs. Mean Concurrency Level.
[Figures B.7 and B.8 (a): histograms of CE Page Fault (midpoints 0 to 40000) with FREQ, CUM. FREQ, PERCENT, and CUM. PERCENT columns; bar data not recoverable from the scan. The captions of Figure B.7 (a) and (c) did not survive.]
Figure B.7 (b). Distribution of Page Fault Rate, 0.4 < Cw <= 0.8
Figure B.8 (a). Distribution of Page Fault Rate, Pc <= 6.0
[Remaining panels of Figure B.8: histograms of CE Page Fault not recoverable from the scan; their captions did not survive.]
[Figures B.9 and B.10: regression plots not recoverable from the scan.]
Figure B.9. Plot of Regression Model, Page Fault Rate vs. Cw.
Figure B.10. Plot of Regression Model, Page Fault Rate vs. Pc.
APPENDIX C. 
ALLIANT FX/8 DESCRIPTION 
The Alliant FX/8 supports two types of processors: the Interactive Processor (IP) and the
Computational Element (CE). The machine may be configured with as few as 1 IP and 1 CE
(FX/1), or with up to 12 IPs and 8 CEs (FX/8). The IP is based on a Motorola 68012 microprocessor.
512 Kbytes of local memory is available to the IP, which is also directly connected to a 32 Kbyte IP
cache (IPC); the IPC is in turn connected to the system memory bus. The IPs handle all system I/O
through a Multibus(TM) system. Their function is to support the interactive load, to support the
operating system, and to control I/O.
The CE has a base instruction set similar to that of the Motorola 68020 microprocessor. Additionally,
the CE supports vector processing, floating point operation, and concurrent execution in the
Computational Cluster, where loop-level multiprocessing takes place. Vector operations may
occur simultaneously with multiprocessing on the Cluster. Each CE contains a 16 Kbyte instruction
cache for efficient handling of loops and other localized portions of code. The CEs share a
four-way interleaved cache memory with a total size of 128 Kbytes, divided into two Computational
Element Caches (CPCs). The CEs connect to these cache modules through a
crossbar switch, which routes both addresses and data between cache and CE.
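Loop-level multiprocessing means that the iterations of a single loop are spread across the CEs of the Cluster. The sketch below shows one simple way to do that, a cyclic assignment of iteration indices to processors. The assignment policy and the function name are assumptions made for illustration; the text above does not say how the FX/8 actually dispatches iterations.

```python
def cyclic_partition(n_iterations, n_ces=8):
    """Assign loop iterations 0..n_iterations-1 to CEs in round-robin order.

    Returns one list of iteration indices per CE. Iteration i goes to
    CE number i % n_ces (an assumed policy, chosen for simplicity).
    """
    if n_ces < 1:
        raise ValueError("at least one CE is required")
    return [list(range(ce, n_iterations, n_ces)) for ce in range(n_ces)]


# Ten iterations shared among four CEs.
for ce, iterations in enumerate(cyclic_partition(10, n_ces=4)):
    print(f"CE {ce}: iterations {iterations}")
```

With this policy the per-CE iteration counts differ by at most one, so the number of active processors stays near the cluster size until the last few iterations of the loop.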
All data traffic between processors (CE or IP) and shared memory takes place through the
processors' respective caches. The caches maintain data coherency by requiring that a cache
possess a "unique" copy of data before modifying it. Traffic between caches and main memory travels over
two 64-bit wide data busses, with a total maximum bandwidth of 188 Mbytes per second. The
main memory has an interleaving factor of four and a maximum size of 64 Mbytes. The
system's virtual address spaces are organized as 1024 segments of 1024 pages per segment; pages
are 4 Kbytes in length [18], [25].
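The virtual address organization above implies a 32-bit address: 1024 segments x 1024 pages x 4 Kbytes = 2^32 bytes, or 10 + 10 + 12 bits. The sketch below splits an address into those three fields. The field order (segment in the high bits, byte offset in the low bits) and all names are assumptions for illustration; only the three sizes come from the text.

```python
PAGE_SIZE = 4 * 1024        # bytes per page (from the text)
PAGES_PER_SEGMENT = 1024    # pages per segment (from the text)
SEGMENTS = 1024             # segments per address space (from the text)

OFFSET_BITS = PAGE_SIZE.bit_length() - 1        # 12
PAGE_BITS = PAGES_PER_SEGMENT.bit_length() - 1  # 10
ADDRESS_SPACE = SEGMENTS * PAGES_PER_SEGMENT * PAGE_SIZE  # 2**32 bytes


def split_virtual_address(addr):
    """Return (segment, page, offset) for a virtual address.

    Assumed layout: [ segment | page | offset ], high bits to low bits.
    """
    if not 0 <= addr < ADDRESS_SPACE:
        raise ValueError("address outside the virtual address space")
    offset = addr & (PAGE_SIZE - 1)
    page = (addr >> OFFSET_BITS) & (PAGES_PER_SEGMENT - 1)
    segment = addr >> (OFFSET_BITS + PAGE_BITS)
    return segment, page, offset


print(ADDRESS_SPACE == 2**32)
print(split_virtual_address(0x00403004))
```

Each segment therefore spans 4 Mbytes, and the full address space is 64 times the 64 Mbyte maximum main memory size.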
The operating system on the Alliant FX/8 is called Concentrix, and is an implementation of
4.2 BSD UNIX with extensions that include support for multiple processors. Languages supported
include C, FX/FORTRAN, and assembler. FORTRAN is the only high-level language that
generates code using the Cluster concurrency feature; the feature can also be accessed by the
assembly language programmer. Programs may be specified to run on a CE, on an IP
(the latter only if neither floating point nor vector processing is required), or on the Cluster with a
particular number of processors [21].
REFERENCES 
[1] P. Heidelberger and K. Trivedi, Queueing Network Models for Parallel Processing
with Asynchronous Tasks. IEEE Trans. on Comp., v. C-31, 1982, pp. 1099-1108.
[2] ________, Analytic Queueing Models for Programs With Internal Concurrency.
IEEE Trans. on Comp., v. C-32, January, 1983, pp. 73-82.
[3] U. Herzog, W. Hoffman and W. Kleinöder, Performance Modeling and Evaluation for
Hierarchically Organized Multiprocessor Computer Systems. Proc. of the 1979
Int. Conf. on Parallel Processing, pp. 103-114.
[4] D. Kuck et al., The Effects of Program Restructuring, Algorithm Change, and
Architecture Choice on Program Performance. Proc. of the 1984 Int. Conf. on
Parallel Processing, pp. 129-135.
[5] K. Gallivan, W. Jalby, and U. Meier, The Use of BLAS3 in Linear Algebra on a
Parallel Processor with a Hierarchical Memory. CSRD Report No. 610, University
of Illinois at Urbana-Champaign, October 14, 1986.
[6] M. Berry and R. Plemmons, Algorithms and Experiments for Structural Mechanics on
High Performance Architectures. CSRD Report No. 602, University of Illinois
at Urbana-Champaign, September 1986.
[7] K. Hwang and F. Briggs, Computer Architecture and Parallel Processing.
McGraw-Hill, New York, 1984.
[8] C. Polychronopoulos, D. Kuck, and D. Padua, Execution of Parallel Loops on Parallel
Processor Systems. Proc. of the 1986 Int. Conf. on Parallel Processing, pp. 519-527.
[9] R. Cytron, Compile Time Scheduling and Optimization for Asynchronous Machines.
Ph.D. Thesis, University of Illinois at Urbana-Champaign, UIUC Report
No. UIUCDCS-R-84-1177, October, 1984.
[10] S. Midkiff and D. Padua, Compiler Generated Synchronizations for DO Loops.
Proc. of the 1986 Int. Conf. on Parallel Processing, pp. 544-551.
[11] D. Kuck, The Structure of Computers and Computations. John Wiley and Sons, New
York, 1978.
[12] W. Abu-Sufah and A. Malony, Vector Processing on the Alliant FX/8 Multiprocessor.
Proc. of the 1986 Int. Conf. on Parallel Processing, pp. 559-566.
[13] A. Jones and P. Schwarz, Experience Using Multiprocessor Systems - A Status
Report. Computing Surveys, v. 12, no. 2, June 1980, pp. 121-165.
[14] R. Vaughan and M. Anastas, Limiting Multiprocessor Performance Analysis. Proc.
of the 1979 Int. Conf. on Parallel Processing, pp. 55-64.
[15] W. Jalby and Ulrike Meier, Optimizing Matrix Operations on a Parallel Multiprocessor
with a Hierarchical Memory System. Proc. of the 1986 Int. Conf. on
Parallel Processing, pp. 429-432.
[16] U. Hercksen, R. Klar, W. Kleinöder, and F. Kneissl, Measuring Simultaneous
Events in a Multiprocessor System. Proc. of the 1982 ACM SIGMETRICS
Conf. on Measurement and Modeling of Computer Systems, pp. 77-82.
[17] H. Fromm et al., Experiments with Performance Measurement and Modeling of a
Processor Array. IEEE Trans. on Comp., v. C-32, no. 1, January, 1983, pp. 15-31.
[18] Alliant Computer Systems Corp., FX/Series Product Summary. June, 1985.
[19] P. Tang, P. Yew and C. Zhu, Processor Self-Scheduling in Large Multiprocessor
Systems. CSRD Report No. 536, University of Illinois at Urbana-Champaign,
October, 1985.
[20] Tektronix, DAS 9100 Series Operator's Manual. February 1985.
[21] Alliant Computer Systems Corp., Concentrix Commands and Applications Manual.
May 1985.
[22] SAS Institute, SAS User's Guide: Basics, 1982 Edition.
[23] W. Mendenhall and Terry Sincich, Statistics for the Engineering and Computer
Sciences. Dellen Publishing, San Francisco, 1984.
[24] M. Younger, A Handbook for Linear Regression. Wadsworth, Inc., 1979.
[25] Alliant Computer Systems Corp., FX/Series Architecture Manual. May 1985.
