PAWS: A performance evaluation tool for parallel computing systems by Pease, Daniel et al.
PAWS: A Performance 
Evaluation Tool for 
Parallel Computing Systems 
Daniel Pease, Arif Ghafoor, Ishfaq Ahmad, 
David L. Andrews, Kamal Foudil-Bey, Thomas E. Karpinski, 
Mohammad A. Mikki, and Mohamed Zerrouki 
Syracuse University 
PAWS (Parallel Assessment Window System) provides an interactive user environment for analysis of existing, prototype, and conceptual machine architectures running a common application.

Fifteen years ago, most large-scale scientific and engineering computations were performed on sequential von Neumann machines. Comparisons among these machines focused on running sets of common benchmarks and on ranking the machines based on the number of instructions executed per second.

However, as the number of commercial systems increased, so did the diversity of their architectural design. As each new architecture diverged from the classical von Neumann model, new languages and annotated versions of older sequential languages were developed for execution on these new machines. This made it difficult to run a standard benchmark. Not only did each benchmark require translation into each language, but the translation process and newer optimizing compilers obscured the relative merit of the results.

To date, no formal methods allow comparisons among different machines running a single common application. Furthermore, code is generally not portable among different parallel processing machines. This forces applications to be recoded for each language and each machine.

PAWS (Parallel Assessment Window System) is an experimental system for performing machine evaluation and comparisons. As shown in Figure 1, PAWS provides a user-friendly X window-based environment that allows comparisons among vastly different machines running common applications.
Figure 2 shows the PAWS block diagram, which consists of four tools: the application characterization tool, the architecture characterization tool, the performance assessment tool, and the interactive graphical display tool. Through the application characterization tool, PAWS translates applications written in a high-level source language into a single data dependence graph, which lets users view their applications’ attributes. This dataflow graph is a machine-independent intermediate representation that can be mapped onto different architectures.
The architecture characterization tool 
allows users to create, store, and retrieve 
descriptions of machines in a database. 
This approach permits users to evaluate 
conceptual machines before building any 
hardware. 
The performance assessment tool (PAT) 
generates profile plots through the interac- 
tive graphical display tool (IGDT). It shows 
both the ideal parallelism inherent in the 
machine-independent dataflow graph and 
the predicted parallelism of the partitioned 
dataflow graph on the target machine. 
Using dataflow graphical languages on dataflow machines is not new [1-3], but using them for parallel computer assessment through PAWS is original. A powerful PAWS feature is its ability to associate a graph’s visual display during assessment.
0018-9162/91/0100-0018$01.00 © 1991 IEEE. COMPUTER, January 1991.
[Figure 1 panels: a high-level language fragment (For i=1,N; A[i]=B+A[i]; end For), its data dependence graph, the architecture characterization, and a parallelism profile plot.]
Figure 1. Parallel Assessment Window System (PAWS) environment.
[Figure 2 blocks: applications in high-level languages; application characterization tool (parsers, characterize); architecture selection; architecture characterization tool; performance assessment tool; graphical representation; interactive graphical display tool; display.]
Figure 2. PAWS block diagram.
Through the windows environment, multiple windows can be opened to show the data dependence graph created from the original program, the evaluation machine’s characteristics, and the application’s performance metrics on that machine.
The application characterization tool

The application characterization tool provides the facility to evaluate the level and degree of an application’s parallelism. Application characterization consists of a data dependency analysis that determines the order in which program statements execute and identifies operations that may be executed in parallel. Parallelism within a program may exist at different “granularity” levels. For example, arithmetic operations within a program may be independent, using different data values or variables in each expression. This parallelism level is called fine-grained parallelism. Conversely, if multiple functions, subroutines, or procedures are independent, then they can be executed concurrently. This parallelism level is called coarse-grained parallelism.

    procedure Example is
      A, B, C, D : integer;
    begin
      D := (A+B) * (C-1);
    end Example;

[Panel (a) is the Ada program above; the IF1 graph of panel (b) is not reproduced.]
Figure 3. Example of a simple Ada program and IF1 representation, (a) and (b).

The application characterization tool transforms programs written in high-level languages (currently Ada) into an intermediate graphical form that exposes the program’s data dependencies and parallelism levels. Analysis is first performed on the intermediate form, providing insight into the program’s structure and organization. The information produced during this analysis is then graphically displayed to the user for characterizing the application and mapping the application onto a machine.
The intermediate form defined to represent the program’s data dependencies and parallelism must support parallelism at all granularity levels. Also, because the intermediate form is the target for mapping the application onto any machine, it should not be biased by a specific machine’s limitations. Translation of applications written in a high-level language to an intermediate form with these attributes offers the following PAWS advantages:

• Provides a common target for any high-level language;
• Allows a single application to be analyzed on each machine characterized by the PAWS architecture characterization tool;
• Allows users to perform a machine-independent application analysis.

The intermediate form chosen for PAWS is Intermediate Form 1 (IF1) [4], a dataflow graphical language.
Dataflow graphs as intermediate 
forms. Dataflow graphs present the pro- 
gram’s data dependencies and parallelism 
as graphs, clearly showing the program 
operations to the user. Figure 3 shows a 
simple Ada program and its equivalent IF1 
PAWS-generated graph. The assignment 
to the variable D in Figure 3a is expressed 
in IF1 by labeling the output edge of the 
“times” node D. Further references to the 
variable D in Ada are translated in IF1 by 
connecting the edge labeled D to the corre- 
sponding node that uses D as an input. 
Figure 3 shows that Ada operations (A+B)
and (C-1) can execute in parallel. The data
dependency of the times operation on both 
the plus and minus operations is also ex- 
posed. 
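
The dependency structure just described can be sketched in a few lines of Python. This is only an illustration, not the PAWS implementation: the `DEPENDS_ON` dictionary and the function names are hypothetical, and the sketch assumes each operation takes one step. Each node is assigned the earliest step at which all of its inputs are available; nodes that land on the same level are independent.

```python
from collections import defaultdict

# Hypothetical encoding of the Figure 3 graph: each node maps to the nodes
# whose results it consumes. The inputs A, B, C and the literal 1 arrive on
# edges and are not modeled as nodes.
DEPENDS_ON = {
    "plus": [],                  # A + B
    "minus": [],                 # C - 1
    "times": ["plus", "minus"],  # (A + B) * (C - 1), output edge labeled D
}


def node_levels(depends_on):
    """Assign each node the earliest step at which all its inputs are ready."""
    levels = {}

    def level(node):
        if node not in levels:
            preds = depends_on[node]
            levels[node] = 1 + max((level(p) for p in preds), default=0)
        return levels[node]

    for node in depends_on:
        level(node)
    return levels


def group_by_level(levels):
    """Group node names by level; nodes sharing a level are independent."""
    groups = defaultdict(list)
    for node, lvl in levels.items():
        groups[lvl].append(node)
    return {lvl: sorted(nodes) for lvl, nodes in sorted(groups.items())}


print(group_by_level(node_levels(DEPENDS_ON)))
# {1: ['minus', 'plus'], 2: ['times']}
```

The plus and minus nodes share level 1, so they can execute in parallel, while the times node must wait for both.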
Description of IF1. IF1 is an acyclic graphical language developed as a target for SISAL (Streams and Iteration in a Single Assignment Language), a high-level functional programming language. IF1’s hierarchical structure is well suited for graphical display. The IF1 language consists of nodes, edges, types, and graph boundaries. IF1 supports simple nodes and compound nodes. Simple nodes represent elementary operations such as addition, subtraction, and equality, while compound nodes represent complex constructs such as conditionals, loops, and the parallel construct Forall. As shown in Figure 3b, edges represent data values, and literal edges describe constants. Graph boundaries define functions, procedures, and compound nodes by encapsulating sets of nodes and edges.
Translation of Ada to IF1 in PAWS. PAWS translates Ada’s regular grammar expressions into IF1 by decomposing them into IF1 operation nodes. Operands, data values, and intermediate results are transformed into IF1 edges. Table 1 shows fundamental Ada constructs mapping to their corresponding IF1 constructs. While translation of most constructs from Ada to IF1 is straightforward, issues arise in the translation due to fundamental language differences. Three main issues are compile time initializations, handling of global variables, and the single assignment rule imposed by dataflow computation. The PAWS techniques that handle these three issues are briefly described below.
While single variable initialization can be accomplished by substituting the desired initialization value into the first occurrence of the variable in the IF1 graph during runtime, compound data objects declared at compile time in Ada (arrays, records, etc.) cannot always be initialized using this approach. Instead, compound data objects are initialized with user-supplied input data during program execution. This eliminates the execution overhead caused by the compound data object’s initialization during runtime.
Dataflow programs do not employ global memory. Instead, data values reside “locally” on edges between nodes. In PAWS, global variables in Ada programs are explicitly passed as parameters to each function using the global variables. This has the added benefit of transforming implied dependencies to explicit data dependencies.

Table 1. Translation of some Ada constructs to IF1 constructs.

Ada construct             IF1 construct
Arithmetic operations     Arithmetic nodes
Logical operations        Logical nodes
Array operations          AElement node, AReplace node, ABuild node, AFill node
If-then-else statement    Select compound node
Case statement            Select compound node
For loop statement        LoopA & forall nodes
While loop statement      LoopB & forall nodes
Functions & procedures    Subgraphs
Dataflow languages adhere to the single assignment rule, allowing variables to be written only once. PAWS solves this problem for Ada by introducing temporary variables. Once a variable is assigned a value, the value remains fixed for the remainder of the program. If reassignment is required, a temporary variable is created. After this reassignment, any subsequent read reference to the original variable is replaced with a reference to the temporary variable.
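
A minimal sketch of this renaming for straight-line code follows. It is an illustration under stated assumptions, not the PAWS translator: the function name `to_single_assignment` and the `x_1` naming scheme for temporaries are hypothetical, branches and loops are not handled, and the sketch assumes the program has no existing variable that collides with a generated name.

```python
import re


def to_single_assignment(statements):
    """Rename reassigned variables so every name is written exactly once.

    `statements` is a list of (target, expression) pairs for straight-line
    code. Names such as `x_1` are hypothetical temporaries.
    """
    version = {}   # variable -> number of assignments seen so far
    current = {}   # variable -> name that holds its latest value
    result = []
    for target, expr in statements:
        # Reads refer to the most recent value of each variable.
        new_expr = re.sub(
            r"[A-Za-z_]\w*",
            lambda m: current.get(m.group(0), m.group(0)),
            expr,
        )
        count = version.get(target, 0)
        new_target = target if count == 0 else f"{target}_{count}"
        version[target] = count + 1
        current[target] = new_target
        result.append((new_target, new_expr))
    return result


program = [("x", "a + b"), ("x", "x * 2"), ("y", "x + 1")]
for target, expr in to_single_assignment(program):
    print(f"{target} := {expr}")
# x := a + b
# x_1 := x * 2
# y := x_1 + 1
```

The second assignment to `x` creates the temporary `x_1`, and the later read of `x` is redirected to it, so each name is written once.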
The architecture characterization tool

Traditionally, machines have been classified as multiple instruction, multiple data (MIMD), single instruction, multiple data (SIMD), multiple instruction, single data (MISD), and single instruction, single data (SISD). These classifications differentiate among architectures based on instruction flow and dataflow. However, at these classification levels, significant architectural diversity exists within each class. For example, within the MIMD class, both tightly coupled and loosely coupled systems exist. Although both are classified as MIMD, significant differences appear between the two system types. This affects how a program is written or partitioned onto the machine. The PAWS architecture characterization tool differentiates between machines within each of the classes defined above by a characterization based on

• the number and flexibility of different functional units;
• the number of processors;
• memory bandwidths and memory hierarchies; and
• the types of interprocessor communication mechanisms.

Our characterization method functionally partitions an architecture into computation, data movement and communication, I/O, and control.

[Figure 4 labels: machine name; timing data format (Timingstruct); input/output timing module; computation timing module; data movement timing module; waiting/synchronization timing module (control); data sizes; communication modes; communication distances; actual timing.]
Figure 4. The top level of the parametric data structure.
A hierarchical organization of architectural parameters. Each category is successively partitioned into subsystems until a subsystem is fine enough to be characterized by raw timing information. PAWS organizes this information in a tree data structure with the raw timing information in the leaf nodes. For timing information, we use an integrated approach based on low-level benchmarking that determines the machine’s operation and behavior for each subsystem and application-dependent analytical models. Figure 4 shows the top level of this structure. As an example, the data movement subsystem shown in Figure 5 is partitioned into processor-to-processor, processor-to-memory, processor-to-peripherals, and memory-to-memory data movement subsystems. The data movement among processors can be achieved through various communication architectures such as multistage interconnection networks, bus systems, and link-oriented connections. The network subsystem describes the network’s physical characteristics, such as topology. A partial decomposition of the data movement subsystem is shown in Figure 5.

[Figure 5 labels: data movement timing module; processor to processor; peripheral to memory; number of networks; type of network topology; physical network topology; virtual network; physical description; virtual description; network parameters; timing information.]
Figure 5. The data movement submodules.

[Figure 6 labels: four groups of 8K processors; sequencer; graphic terminal; subsystem labels not recoverable.]
Figure 6. The architecture of CM-2 and its functional subsystems.

[Figure 7 labels: dual processors with floating-point accelerator; 16 processors (NS 32532); cache memory; 128 Megabyte RAM; Ethernet and mass storage interfaces.]
Figure 7. The architecture of Encore Multimax and its functional subsystems.
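
As a rough illustration of such a tree, the sketch below uses nested Python dictionaries whose leaves hold timing values. Everything in it is an assumption made for the example: the key names, the `EXAMPLE_MACHINE` contents, and the timing numbers are invented and do not describe any real machine or the actual PAWS data structure.

```python
# Hypothetical parametric tree: inner dictionaries are subsystems, leaves are
# raw timing values. All names and numbers are made up for illustration.
EXAMPLE_MACHINE = {
    "computation": {"arithmetic": {"binary": {"plus": 1.0, "times": 3.0}}},
    "data_movement": {
        "proc_proc": {"router": 50.0, "grid": 10.0},
        "proc_memory": 2.0,
    },
    "io": {},        # subtrees that a machine does not need stay empty
    "control": {},
}


def iter_leaves(tree, path=()):
    """Yield (path, timing) for every leaf in the tree."""
    for key, child in tree.items():
        if isinstance(child, dict):
            yield from iter_leaves(child, path + (key,))
        else:
            yield path + (key,), child


def lookup(tree, path):
    """Follow `path` from the root; return a subtree or a leaf timing value."""
    node = tree
    for key in path:
        node = node[key]
    return node


print(lookup(EXAMPLE_MACHINE, ("data_movement", "proc_proc", "router")))  # 50.0
for path, timing in iter_leaves(EXAMPLE_MACHINE):
    print("/".join(path), timing)
```

Partitioning stops wherever a value is simple enough to store directly, which is why `proc_memory` is a leaf while `proc_proc` is split further.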
The characterization of a particular machine may not need the whole data structure. Instead, the information necessary to fully characterize the machine may only require a few subtrees of the main data structure. Currently, PAWS characterizes a SIMD architecture, Thinking Machines’ CM-2, and a MIMD architecture, the Encore Multimax. The CM-2, shown in Figure 6, is configured as a 32K processor machine. This figure shows the CM-2 logical partitioning into the top-level PAWS parametric data structure. The architecture of the Encore Multimax model 520 [5] is shown in Figure 7, along with its various subsystems.

The block diagram of the architecture characterization tool is shown in Figure 8. Users interact with this tool via the user-interactive interface module for selecting, synthesizing, or modifying the machine specification. The complete specification of any existing, conceptual, or prototype machine can be captured and stored in the PAWS architectural data structure.
Benchmarking and analytical models. To use the parametric data structure, the user interactively enters both static and dynamic timing values. Static timing values, such as those for arithmetic operations, are for operations that are unaffected by the runtime environment. These values are generally obtained through benchmarking. Several benchmark studies for the Connection Machine have been reported [6-8] and used for CM-2 characterization in PAWS. Dynamic timing values are affected by the runtime environment and are determined by analytical modeling. Levit proposed an analytical model for grid communications in the CM-2 [6]. This model takes into account the geometry of physical and virtual processors, the dimension of communication, and the data size. The user must provide such information to the architecture characterization tool to generate the dynamic timing values for the specified grid geometry. Similar techniques have been used to obtain the timing values for the Multimax.
[Figure 8 blocks: machine list; machine selection process; user-interactive interface; selection of appropriate data structure; parametric structure; benchmark results; analytical models; generation of timing tables; interface to performance analyzer tool.]
Figure 8. The architectural tool and its subfunctions.
Interface between the architecture characterization tool and the PAT. The performance assessment tool obtains information from the architecture characterization tool by generating queries. Four query types correspond to the data structure’s four functions. For example, a query for the Plus operation on the Multimax has the following format: (AMAX, computation, arithmetic, binary, plus, 32, float). The first parameter specifies the machine as Multimax. The second parameter selects the computation function, and the third and fourth parameters specify the type of operation as arithmetic and binary. The fifth, sixth, and seventh parameters specify the name of the operation, the data size, and the data type, respectively. As a result of this query, a single timing value is returned.
A complete timing information table can also be generated instead of a single value. For example, the query (CM-2, data-movement, proc-proc, net, router, all, 4, all, 1) generates a full table of timing values for data communications using the router network on the CM-2 with Hamming distance 4 and 1- to 64-bit data sizes. The first parameter in the query specifies CM-2 as the target machine, the second specifies the data movement function, and the third specifies the data movement type as processor to processor. The fourth parameter specifies the network type and the fifth specifies the network name. The next three parameters describe the communication mode as “all,” a Hamming distance of 4, and data sizes of 1, 2, 4, 8, 32, and 64 bits. The last parameter defines the virtual-to-physical processor ratio.
The PAWS data structure can character- 
ize any machine. However, query attributes 
are only valid if the user initializes the 
corresponding subdata structure for that 
particular machine. 
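
The two query styles can be mimicked with a small lookup function. The sketch below is a guess at the behavior, not the PAWS interface: the `query` function, the flat `TIMINGS` table, and its timing numbers are all hypothetical, and treating the keyword "all" as a wildcard that matches any value in its position is an assumption drawn from the second example above.

```python
# Hypothetical timing table keyed by full query tuples; values are made up.
TIMINGS = {
    ("AMAX", "computation", "arithmetic", "binary", "plus", 32, "float"): 2.5,
    ("AMAX", "computation", "arithmetic", "binary", "plus", 64, "float"): 4.0,
    ("AMAX", "computation", "arithmetic", "binary", "times", 32, "float"): 6.0,
}


def query(table, *pattern):
    """Return one timing value, or a table of values if `pattern` uses "all"."""
    matches = {
        key: timing
        for key, timing in table.items()
        if len(key) == len(pattern)
        and all(p == "all" or p == k for p, k in zip(pattern, key))
    }
    if "all" not in pattern:
        # A fully specified query is valid only if that entry was initialized.
        if pattern not in matches:
            raise KeyError(f"no timing initialized for {pattern}")
        return matches[pattern]
    return matches


single = query(TIMINGS, "AMAX", "computation", "arithmetic", "binary",
               "plus", 32, "float")
table = query(TIMINGS, "AMAX", "computation", "arithmetic", "binary",
              "plus", "all", "float")
print(single)        # 2.5
print(len(table))    # 2
```

A query against an entry that was never initialized raises `KeyError` here, which mirrors the restriction that query attributes are valid only for initialized parts of the data structure.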
Interactive graphical 
display tool 
The interactive graphical display tool provides the user interface for accessing all PAWS tools. The IGDT has been implemented as a hierarchical menu-driven system, allowing multiple windows to be opened in a single session. The main menu shown in Figure 9 allows the user to select the three remaining PAWS tools: the application characterization tool (applications), the architecture characterization tool (architecture), and the performance assessment tool (performance). Users may simultaneously open windows containing information for each of these tools. Figure 9 shows a series of open menus, along with the IF1 graph description of the selected program “matrix1.a.” The displayed graph shows nodes organized by levels. All nodes at the same level can execute in parallel. An “optimizations” window lists the user’s different optimization choices during compilation.
For large applications, the number of nodes within a graph may be too large to easily display in a single window. IGDT displays graphs hierarchically, allowing users to select any compound node by placing the cursor on the node and selecting expand or collapse from the menu. Expanding a compound node takes the user into the next hierarchy level, showing the simple and compound nodes of the selected compound node. Collapse reverses the process, combining those nodes back into the selected node. Using this approach, users can display as many or as few nodes as required.
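
The expand and collapse behavior can be sketched as a small tree walk. This is a simplified model, not IGDT code: the `Node` class, the `visible` function, and the contents given to the innermost loop are hypothetical, and the loop names are borrowed from the matrix multiplication example only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)  # empty for simple nodes
    expanded: bool = False

    @property
    def is_compound(self):
        return bool(self.children)


def visible(node):
    """Return the names of the nodes currently shown for this subtree."""
    if node.is_compound and node.expanded:
        shown = []
        for child in node.children:
            shown.extend(visible(child))
        return shown
    return [node.name]  # simple node, or a collapsed compound node


# Three nested loops, innermost body reduced to two simple nodes.
inner = Node("LoopA''", [Node("Times"), Node("Plus")])
middle = Node("LoopA'", [inner])
outer = Node("LoopA", [middle])

print(visible(outer))   # ['LoopA']
outer.expanded = True
middle.expanded = True
inner.expanded = True
print(visible(outer))   # ['Times', 'Plus']
inner.expanded = False
print(visible(outer))   # ["LoopA''"]
```

Because each compound node keeps its own flag, the user controls how much of the hierarchy is on screen at any time.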
Figure 10 shows the menus for interacting with the architecture characterization tool. The user is guided through the different levels of the parametric data structure by IGDT-generated prompts. This figure also shows the data structure’s top level with the prompt for creating dynamic timing information on the CM-2 router communications network. The created timing values are passed to the performance assessment tool and displayed as graphs, as shown.

Figure 9. PAWS menus for application characterization.

Figure 10. PAWS menus for architecture characterization.

Performance assessment tool

The PAWS performance assessment tool (PAT) allows users to evaluate the performance of any application entered in the system (using the application characterization tool) by generating a set of performance metrics. These metrics include speedup curves (speedup being the average amount of computation performed in one step with unlimited processors), parallelism profile curves, and execution profiles. The metrics are generated for both the ideal case, which represents the application’s theoretical upper bounds of performance, and the case in which the application has been partitioned and mapped onto a machine. An analysis of the two sets of performance metrics shows the effects of mapping the application onto the machine.

Figure 11 shows PAT’s overall block diagram. The block labeled “theoretical model” generates an application’s ideal parallelism and speedup information. The blocks labeled “actual machine 1” and “actual machine 2” compute the predicted parallelism and speedup performance of the applications running on the specified machines after the application has been mapped onto each machine.

Figure 11. The performance assessment tool block diagram.
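
To make the contrast between the ideal and the mapped case concrete, the sketch below applies a deliberately crude machine model to an ideal parallelism profile. It is not the PAWS mapping: it assumes a fixed number of identical processors, unit-time operations, a step-by-step schedule, and no communication cost, and the function names and sample profile are hypothetical.

```python
def limited_profile(ideal, processors):
    """Stretch an ideal profile so no step uses more than `processors` ops.

    `ideal[t]` is the number of operations that could run at step t with
    unlimited processors. Each step is split into as many machine steps as
    needed (a simplifying assumption, not the PAWS mapping technique).
    """
    mapped = []
    for ops in ideal:
        full, rest = divmod(ops, processors)
        mapped.extend([processors] * full)
        if rest:
            mapped.append(rest)
    return mapped


def speedup(profile):
    """Average number of operations performed per step."""
    return sum(profile) / len(profile)


ideal = [4, 2, 1]                      # made-up ideal profile
mapped = limited_profile(ideal, processors=2)
print(mapped)                          # [2, 2, 2, 1]
print(speedup(ideal), speedup(mapped))
```

The same seven operations take three steps in the ideal case and four steps on the two-processor model, so the average work per step drops from 7/3 to 7/4.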
Mapping. The program execution time on any parallel machine is dependent on both the program operations and the users’ ability to express the parallelism at the machine’s correct granularity level. Therefore, to fairly compare two machines running a common application, two different mappings of the application will be required, one for each machine. In PAWS, applications are first transformed into dataflow graphs and then mapped onto a machine based on its attributes. However, guaranteeing optimal dataflow graph mapping is a nontrivial problem. Currently, PAWS uses the mapping techniques to run IF1 on MIMD machines. Research is underway for PAWS to develop new heuristic algorithms for mapping on both MIMD and SIMD machines [9].
    begin
      for i in 1..5 loop
        for j in 1..5 loop
          for k in 1..5 loop
            c(i,j) := c(i,j) + a(i,k)*b(k,j);
          end loop;
        end loop;
      end loop;
    end

Figure 12. Matrix multiplication program.

Generating performance parameters. Both parallelism and execution profiles are generated by performing a graph walk of an application’s dataflow graph. The graph walk routine traverses the dataflow graph, computing and recording each node’s performance and statistical information. To handle compound nodes, the graph walk routine is implemented recursively. This recursion allows statistics and timing information to be recorded for individual functions, procedures, and other program units. Therefore, parallelism profiles and other performance parameters may be generated for any single function or procedure of a program.

An application’s recursive function calls are modeled as linear loops with a predetermined number of iterations. The number of iterations can be input interactively or estimated, based on a frequency count derived from an actual program run.
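
A recursive graph walk of this kind can be sketched as follows. The sketch is an assumption-laden illustration rather than the PAT routine: the graph encoding, the `profile` and `speedup` functions, and the sample graphs are hypothetical, every simple node is assumed to take one step, and a compound node is modeled as its body repeated a fixed number of times.

```python
def profile(graph):
    """Return the ideal parallelism profile: operations active at each step.

    `graph` maps node name -> (deps, body, iterations). Simple nodes have
    body None and take one step. Compound nodes hold a nested graph whose
    profile is repeated `iterations` times, which models loops (and
    recursive calls) as linear loops.
    """
    finish = {}   # node -> step at which its result becomes available
    steps = []    # steps[t] = number of operations executing at step t

    def walk(name):
        if name in finish:
            return finish[name]
        deps, body, iterations = graph[name]
        start = max((walk(d) for d in deps), default=0)
        own = [1] if body is None else profile(body) * iterations
        for offset, ops in enumerate(own):
            t = start + offset
            while len(steps) <= t:
                steps.append(0)
            steps[t] += ops
        finish[name] = start + len(own)
        return finish[name]

    for name in graph:
        walk(name)
    return steps


def speedup(steps):
    """Average operations per step: total work over the number of steps."""
    return sum(steps) / len(steps)


# Hypothetical loop body: two independent loads, a multiply, then an add.
body = {
    "load_a": ([], None, 1),
    "load_b": ([], None, 1),
    "mul": (["load_a", "load_b"], None, 1),
    "add": (["mul"], None, 1),
}
program = {
    "init": ([], None, 1),
    "loop": (["init"], body, 3),   # compound node, three iterations
}

print(profile(body))      # [2, 1, 1]
print(profile(program))   # [1, 2, 1, 1, 2, 1, 1, 2, 1, 1]
print(speedup(profile(program)))
```

Calling `profile(body)` on its own gives the profile of the compound node alone, which corresponds to generating a separate profile for an individual loop or procedure.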
Examples. Two examples illustrate PAT’s flexibility. The Ada source program for the first example, a 5 x 5 matrix multiplication, is shown in Figure 12. Figures 13a through d show the whole program, the three compound nodes (three For loops), and several simple nodes that make up the program. Figures 14a through d show the parallelism profiles for the whole program and the three compound nodes (LoopA, LoopA’, and LoopA’’, respectively). The speedup corresponding to the complete program, shown in Figure 15, approaches two, the average number of operations performed at each execution step.

Figures 13a through d exhibit regular fine-grain parallelism patterns throughout the program. By identifying these patterns, a user can synthesize a machine while proceeding with investigations into this part of the code to enhance performance. Furthermore, a user performing algorithm analysis can be prompted to substitute a parallel Forall-type construct for the nested iterative constructs to obtain faster execution times and increased system efficiency. The ability to change programs and machine parameters in the PAWS architectural data structure and quickly observe the modifications’ results is a powerful design tool, since modifications to existing hardware can be time consuming and cost prohibitive.

[Figure 13 node labels include AFill, Plus, LessEqual, LoopA’, and FinalValue.]
Figure 13. Matrix multiplication IF1 graph (a) and expansions of Loop A, (b), (c), and (d).

[Figure 14 axes in each panel: maximum number of operations versus steps.]
Figure 14. Profile.matrixgraph.5 (a), Profile.LoopA (b), Profile.LoopA’ (c), and Profile.LoopA’’ (d).
As a second example, we use a program that performs binary integration of the function F as shown in Figure 16. An interesting characteristic of this program is its inclusion of recursive function calls. This program’s parallelism profile and speedup plot are shown in Figures 17 and 18. In Bohm, Gurd, and Kallstrom [2], a similar plot was generated using a graph interpreter originally designed for IF1 programs translated from SISAL [10]. The figures show that the number of operations available for parallel execution increases geometrically during execution because the number of recursive calls doubles every time a new call is performed in the function AREA.

[Figure 15 axes: speedup versus steps.]
Figure 15. Speedup for matrix multiplication.
The PAWS project’s objective is to provide a unified environment for users to assess various existing, conceptual, and prototype machines that execute a common application. PAWS provides a framework for users to compare various architectures for a given application and helps to identify the best machine. PAWS is a unique tool because it combines characterizations of both applications and architectures and generates for assessment various performance metrics, including parallelism profiles and speedup information. The effects of architectural changes can be investigated through PAWS’ ability to modify and store machine descriptions.

Research is underway for PAWS to develop new heuristic algorithms for mapping on both MIMD and SIMD machines [9].
    procedure main(A, B, Init : in float; r : out float) is
      function F(X : float) return float is
      begin
        return 3.0*X*X*X + 2.0*X*X + 5.0;
      end F;
      function TRAP(Le, Ri : float) return float is
      begin
        return (Ri-Le)*(F(Le)+F(Ri))/2.0;
      end TRAP;
      function AREA(L, R, Est, Tol : float) return float is
        Mid, A1, A2, News, a : float;
      begin
        Mid := (L+R)/2.0; A1 := TRAP(L, Mid); A2 := TRAP(Mid, R); News := A1+A2;
        if (abs(Est-News) < Tol) then
          a := News;
        else
          a := AREA(L, Mid, A1, Tol/2.0) + AREA(Mid, R, A2, Tol/2.0);
        end if;
        return a;
      end AREA;
    begin
      r := AREA(A, B, TRAP(A, B), Init);
    end main;

Figure 16. Binary integration program.
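
To see how the recursion fans out, the program can be transcribed into Python with a counter for the number of calls made at each recursion depth. This transcription is only a sketch: the `depth` and `calls` parameters are additions for instrumentation, and the interval [0, 1] and tolerance 0.001 are arbitrary choices that do not come from the article.

```python
from collections import Counter


def f(x):
    return 3.0 * x**3 + 2.0 * x**2 + 5.0


def trap(le, ri):
    """Trapezoid estimate of the integral of f over [le, ri]."""
    return (ri - le) * (f(le) + f(ri)) / 2.0


def area(l, r, est, tol, depth=0, calls=None):
    """Recursive binary integration; `calls[depth]` counts calls per depth."""
    if calls is not None:
        calls[depth] += 1
    mid = (l + r) / 2.0
    a1 = trap(l, mid)
    a2 = trap(mid, r)
    news = a1 + a2
    if abs(est - news) < tol:
        return news
    # The two halves are independent, so the calls could run in parallel.
    return (area(l, mid, a1, tol / 2.0, depth + 1, calls)
            + area(mid, r, a2, tol / 2.0, depth + 1, calls))


calls = Counter()
result = area(0.0, 1.0, trap(0.0, 1.0), 1e-3, calls=calls)
print(result)
print([calls[d] for d in sorted(calls)])
```

For this input the first three depths contain 1, 2, and 4 calls. The doubling continues only while both halves of an interval fail the tolerance test, so the counts taper off at deeper levels once subintervals begin to converge.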
[Figure 17 axes: maximum number of operations (x1000) versus steps.]
Figure 17. Parallelism profile for binary integration.
Acknowledgment 
This project was funded through a grant from 
the Rome Air Development Center under con- 
tract F306020-88-C-0136 
[Figure 18 axes: speedup versus steps.]
Figure 18. Speedup plot for binary integration.

References
1. Arvind, D.E. Culler, and G.K. Maa, “Assessing the Benefits of Fine-Grain Parallelism in Dataflow Programs,” IEEE Computer Society Press, Los Alamitos, Calif., order no. 882, pp. 60-69.
2. A.P.W. Bohm, J.R. Gurd, and M.C. Kallstrom, Monitoring Experimental Parallel Machines, Tech. Report, Dept. of Computer Science, Univ. of Manchester, 1988.
3. D.E. Culler, Effective Data Flow Execution of Scientific Applications, doctoral dissertation, Computational Science Laboratory, MIT, Cambridge, Mass., 1989.
4. “An Intermediate Form Language IF1,” Lawrence Livermore National Laboratory reference manual, 1985.
5. Multimax Technical Summary, Tech. Report, Encore Computer Corp., Marlboro, Mass., 1987.
6. C. Levit, “Grid Communication on the Connection Machine: Analysis, Performance, and Improvements,” Tech. Report 88.19, Research Inst. for Advanced Computer Science, NASA Ames Research Center, 1988.
7. D.W. Myers and G.B. Adams II, “Benchmarking and Performance Analysis of the CM-2,” Tech. Report 88.19, Research Institute for Advanced Computer Science, NASA Ames Research Center, 1988.
8. R. Pozo and A.E. MacDonald, “Performance Characteristics of Scientific Computations on the Connection Machine,” Tech. Report CU-CS-440-89, Center for Applied Parallel Processing, Dept. of Computer Science, Univ. of Colorado at Boulder, 1989.
9. V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessing, MIT Press, 1989.
10. “DI: An Interactive Debugging Interpreter for Applicative Languages,” Lawrence Livermore National Laboratory reference manual, 1987.
Ishfaq Ahmad has been a research assistant at 
the Northeast Parallel Architecture Center since 
January 1989. His research interests include 
parallel and distributed architectures, schedul- 
ing and load balancing, and performance evalu- 
ation. Ahmad won the best student paper award 
in systems at Supercomputing ’90. Ahmad is a 
PhD candidate in the School of Computer and 
Information Science, Syracuse University. He 
received his BSc in electrical engineering from 
the University of Engineering and Technology, 
Lahore, Pakistan, in 1984, and MS in computer 
engineering from Syracuse University in 1987. 
He is a student member of the IEEE Computer 
Society and ACM. 
Thomas E. Karpinski has been a research as- 
sistant in the Department of Electrical and Com- 
puter Engineering in the Northeast Parallel Ar- 
chitecture Center at Syracuse University since 
July 1989. His research interests include paral- 
lel and distributed languages, compiler design, 
and real-time systems. Currently, he is pursuing 
his MS in computer engineering at Syracuse 
University. He received his BS in computer 
science from the Rochester Institute of Technol- 
ogy, Rochester, New York, in 1986. 
Daniel Pease joined the Syracuse University 
faculty in 1979 and is currently an associate 
professor. Pease is involved in a number of 
research projects related to parallel processing 
and assessment of parallel systems. The projects 
are sponsored by DARPA, RADC, IBM, and 
DEC. He received his BSc in 1968, and MS and 
PhD degrees in 1973 and 1983, respectively, all 
from Syracuse University. Pease is a member of 
IEEE. 
Arif Ghafoor joined the Syracuse University 
Department of Electrical and Computer Engi- 
neering in September 1984 and is currently an 
associate professor. He is a consultant to such 
companies as Bell Laboratories and General 
Electric in the areas of telecommunications and 
distributed systems. His research interests in- 
clude design and analysis of parallel and distrib- 
uted systems and telecommunication. He re- 
ceived his BS degree in electrical engineering 
from the University of Engineering and Tech- 
nology, Lahore, Pakistan, in 1976, and his MS, 
MPhil, and PhD degrees from Columbia Univer-
sity in 1977, 1980, and 1985, respectively. He is
a senior member of IEEE and a member of Eta 
Kappa Nu. 
Readers may contact Arif Ghafoor at Syra- 
cuse University, Department of Electrical and 
Computer Engineering, Syracuse, NY 13244. 
Electronic mail can be sent to 
ghafoor@top.cis.syr.edu.
David L. Andrews is employed by the Ocean 
Systems Division of General Electric where he 
works on distributed operating systems and dis- 
tributed networks. His research interests include 
parallel and distributed architectures. Andrews 
is pursuing his PhD in computer science at 
Syracuse University. He received his BSEE in 
1983 and MSEE in 1984 from the University of
Missouri at Columbia. He is a member of IEEE. 
Kamal Foudil-Bey is a research assistant in the 
Northeast Parallel Architecture Center at Syra- 
cuse University. His primary research interests 
are in parallel computing, performance evalua- 
tion, and dataflow representation of algorithms 
on parallel computers. He is pursuing his PhD 
degree in performance evaluation and assess-
ment of parallel computers at Syracuse Univer-
sity. He received the degree of Ingenieur d’Etat
in Electronique from the Ecole Nationale Poly- 
technique in Algiers in 1984 and the MS degree 
in computer engineering from Syracuse Univer- 
sity. He is a student member of IEEE Computer 
Society. 
Mohammad A. Mikki has been a research as- 
sistant in the Department of Electrical and Com- 
puter Engineering since 1989. He is currently 
developing tools for displays of IF1 graphs us- 
ing the X window system. His primary area of 
research is in parallel processing. He is a candi- 
date for a PhD in computer engineering at Syr- 
acuse University. He received his BSc degree in 
general electrical engineering from Bir Zeit 
University in West Bank in August 1984 and his 
MS degree in computer engineering from Syr- 
acuse University in 1989. 
Mohamed Zerrouki has been a research assis- 
tant in the Department of Computer and Electri- 
cal Engineering at Syracuse since January 1989. 
He is currently developing the Ada-to-IF1 con- 
verter. His primary research interests are in 
parallel processing and compiler design. He is a 
PhD candidate in computer engineering at Syr- 
acuse University. He received the Ingenieur 
d’Etat degree in electronics from Ecole Nation-
ale Polytechnique d’Alger, Algiers, Algeria, in
1985, and the MS degree in computer engineer- 
ing from Syracuse University in 1988. 
January 1991
