A Framework for Intelligent Parallel Compilers by Wang, Ko-Yang
Purdue University
Purdue e-Pubs
Department of Computer Science Technical Reports
Department of Computer Science
1991




Wang, Ko-Yang, "A Framework for Intelligent Parallel Compilers" (1991). Department of Computer Science
Technical Reports. Paper 877.
https://docs.lib.purdue.edu/cstech/877 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 
INTELLIGENT PROGRAM OPTIMIZATION AND




A FRAMEWORK FOR INTELLIGENT PARALLEL COMPILERS
Ko-Yang Wang
Computing About Physical Objects,
Department of Computer Sciences, Purdue University, West Lafayette, IN 47907
E-mail: kyw@cs.purdue.edu
Phone number: (317) 494-4465
Fax number: (317) 494-0739
ABSTRACT
In this paper, we outline obstacles in building high performance, intelligent
parallel compilers and present some methodologies for overcoming these problems.
A framework for constructing parallel compilers that can analyze and
optimize programs automatically and intelligently is proposed. This framework
utilizes sophisticated AI techniques to optimize programs in a systematic
approach in order to prune non-promising decision subtrees as early as possible.
Methodologies to apply this framework to build optimizing parallel compilers,
including heuristic-guided state-space search and planning, machine knowledge
manipulation, system-knowledge organization and inference, opportunistic reasoning,
and a problem-solving model called hier-blackboard, are also discussed.
Keywords: AI, architecture, compiler, heuristics, intelligence, parallelism,
planning, problem-solving, program-transformation, optimization
September 29, 1990
† Technical Report, CSD-TR-1044, CER-90-52, Department of Computer Sciences, Purdue University, November
1990.
ACKNOWLEDGMENTS
I would like to thank all the individuals who have contributed, both personally and
professionally, to the completion of this thesis. In particular, I gratefully acknowledge the
friendship, support and guidance of my major professors Piyush Mehrotra and Dennis Gan-
non. I furthermore would like to thank my committee members Professors Tim Korb and
Mike Attallah for their valuable input and suggestions.
Much of the research reported in this thesis was done during 1985-1988 while I was a
graduate student in the Computer Science Department at Purdue. However, the writing of
this thesis was on and off during the last three years - including the last two and a half years
that I worked as a full-time research associate in the Parallel and Distributed Computing
Group of the Computing About Physical Objects (CAPO) project. I would like to thank Dr.
Houstis and Dr. Rice for their guidance and support; without their help and encouragement,
I would never have finished the writing of this thesis. I thank all my colleagues in the
CAPO for the enjoyable experience of working with a group of fine researchers. I would
like to thank Dr. Greg Pfister and the RP3 group of IBM for giving me the opportunity of
doing on-site research at IBM's facilities in Yorktown Heights, New York.
I thank all my friends at the computer sciences department of Purdue University who
made my stay a rewarding experience. I would like to thank Dr. William Gorman for his support
while I was a graduate student in the computer science department. I would also like to thank the
following secretaries for their help: Patty Minniear, Margaret Fabl, Daloris Williamson,
Wilma Birge, Paula Perkins, Georgia Conarroe, and Candace Walters. Dr. Frank Oreovicz
proofread the whole thesis and gave me lots of advice on technical writing, for which I am duly
grateful. I also would like to thank Scott McFaddin, who proofread chapters 3 and 4 and
also gave many valuable suggestions.
I would like to thank my wife Yuh-Jinn for her love, support and patience. I thank my
daughter Alice and son Arthur who made my years at Purdue busy but meaningful. I also
thank my mother for everything she rendered me. I thank my late father and regret that he






1.1.1 Parallel Programming Environments
1.1.2 The State-of-the-art of the Current Technology
1.1.3 Obstacles of Building Parallel Programming Environments
1.2 Problem Statements
1.3 Our Approach
1.3.1 Expert System Technologies
1.3.2 Multiple Target Architectures and Knowledge Generalization
1.4 Organization of the Thesis
2. PARALLELISM AND PARALLELISM IMPROVEMENT
2.1 Levels of Parallelism
2.2 Machine Parallelism
2.2.1 Variations of Parallel Computers
2.2.1.1 Network Topologies
2.2.1.2 Memory Hierarchy
2.2.2 Programming Issues for~ Computers
2.2.3 Computational Models
2.3 Program Parallelism
2.3.1 Granularity of the Program Parallelism
2.3.2 Program Dependence
2.3.3 Representation of Program Dependence Graph
2.4 Optimization of Parallelism with Program Transformations
2.4.1 Abstraction of Program Parallelism Based on Program Dependence Graph
2.4.2 Optimization of Program Parallelism
2.4.2.1 Parallelism Realization and Program Restructuring
2.4.2.2 Two Approaches for Program Parallelism Optimization
2.4.2.3 Pre-optimized Algorithm Substitution Versus Program Transformation
2.4.3 Classifications of Program Transformation Techniques
2.5 Summary
3. TOWARDS INTELLIGENT PARALLEL COMPILING
3.1 Need for Intelligence in Parallel Compilers
3.2 Methodologies for Improving Intelligence in Parallel Compilers
3.2.1 Models for Program Parallelism Improvement
3.2.1.1 Six Models for Selecting Program Transformation Sequences
3.2.1.2 Analysis of the Models
3.2.1.3 A New Paradigm for Program Parallelism Improvement
3.2.1.4 Comparison to Other Parallel Program Optimization Models
3.3 Utilization of Heuristics
3.3.1 Systematic Discovery of Heuristics
3.3.2 Translating Heuristics into Rules
3.4 Applying AI Techniques to Parallel Compilers
3.5 Conclusion
4. A FRAMEWORK FOR THE CONTROL OF INTELLIGENT PARALLEL COMPILERS
4.1 Parallel Program Optimization as a Planning Problem
4.2 Frameworks for Realizing the Feature-Directed Program Optimization Paradigm
4.2.1 Heuristic Guided State-space Search
4.2.1.1 Non-linear Planning and the Coordination of Multiple Thread State-space Search
4.2.2 Hierarchical Decomposition of the Parallelism Improving Problem
4.2.3 The Heuristic Guided Reasoning and the Expert Systems Approach
4.2.3.1 The Heuristic Hierarchy
4.2.4 Opportunistic Reasoning and the Blackboard Architecture
4.2.4.1 The Blackboard Architecture
4.2.4.2 Blackboard Systems and Production Systems
4.2.4.3 Advantages of the Blackboard Model
4.2.4.4 Weakness of the Blackboard Model
4.2.5 The Hier-Blackboard Model
4.2.5.1 The Framework of the Hier-Blackboard Model
4.2.5.2 Issues for Parallel Implementation
4.2.5.3 Simulation of Parallel Hier-blackboard on Sequential Machines
4.2.5.4 Comparison with Other Blackboard Models
4.2.5.5 Applying the Hier-Blackboard to the Parallel Compilers
4.2.6 Comparison of the Three Frameworks
4.3 Conclusion
5. MACHINE KNOWLEDGE MANIPULATION ISSUES FOR PARALLEL COMPILERS
5.1 Introduction
5.1.1 Feature-Directed Program Optimization
5.2 Machine Features and Parallel Compilers
5.2.1 Machine Features
5.2.2 Important Machine Features for Parallel Compilers
5.2.2.1 The Processing Elements
5.2.2.2 The Memory Hierarchy
5.2.2.3 Interconnection Networks and Busses
5.2.2.4 The Control Unit and Processor Clusters
5.3 Design Considerations for Machine Knowledge Representation Schemes
5.4 An Object Oriented Knowledge Representation Scheme for Parallel Computers
5.4.1 Feature Objects
5.4.2 Attributes Associated with Objects
5.4.3 Feature Organization
5.4.4 Operations on the Objects
5.4.4.1 Inheritance, Specification and Qualification
5.4.4.2 Feature Modification
5.4.4.3 State Adjustment with Dependencies
5.4.5 A Simple Example
5.4.6 Features of the Parallel Machine Knowledge Representation Scheme
5.4.6.1 Static and Dynamic Knowledge Representation
5.4.6.2 Flexibility in Knowledge Characterization and Organization
5.4.6.3 Various Abstraction Levels of Features
5.4.6.4 Global Visibility Versus Knowledge Hiding
5.5 Implementation of the Knowledge Representation System
5.5.1 A Machine Knowledge Representation System
5.5.2 Machine Feature Abstraction and Installation
5.5.2.1 Interactive Machine Feature Specification
5.5.3 Feature Deduction and Comparison
5.5.4 Specializing System Knowledge
5.6 Other Applications of the Representation Scheme
5.6.1 Distributed Computing Environments
5.6.2 Flexible Parallel Computing System Simulation Systems
5.7 Conclusion
6. PERFORMANCE PREDICTION AS A BASIS FOR INTELLIGENT PROGRAM OPTIMIZATION
6.1 Introduction
6.2 Performance Prediction Models
6.3 A Framework for Performance Prediction
6.3.1 Dynamic Performance Prediction and Run-time Tests
6.4 Performance Evaluation Functions
6.4.1 Examples of Performance Factors and Their Evaluation Functions
6.5 Parallel Execution Time of the Program
6.5.1 Example of Estimating Program Execution Time with the Model
6.6 Applying Performance Prediction to Intelligent Decision-Making
6.7 Related Work
6.8 Conclusion
7. ABSTRACTING PARALLELISM FOR MIMD PARALLEL COMPUTERS
7.1 Improving Parallelism with Feature-Directed Program Optimization
7.1.1 Focus of Program Optimization
7.2 Heuristic Guided Program Optimization
7.2.1 General Optimization of the Program Parallelism
7.2.2 Task Decomposition and Processor Assignment
7.2.3 Memory Utilization for Shared-Memory Architectures
7.3 Array Reshaping -- a Mechanism for Optimizing Array Usage
7.3.1 Array Reshaping Functions
7.3.2 Variations of Array Reshaping
7.3.3 Opportunities for Applying Array Reshaping
7.3.4 Heuristics for Applying Array Reshaping
7.3.5 Some Remarks for Array Reshaping
7.4 Optimizing Data Synchronization on Distributed Memory Systems
7.4.1 Introduction
7.4.2 Message Consolidation
7.4.3 Summary
7.5 Algorithm Substitution and Fine-Tuning with Pre-optimized Algorithms
7.5.1 Summary
8. IMPLEMENTATION AND EXPERIMENTAL RESULTS
8.1 An Experimental Intelligent Parallel Programming Environment
8.1.1 The System Architecture
8.1.1.1 The Front-Ends and the Back-Ends to the Programming Environment
8.1.1.2 The Machine Knowledge Manipulation System
8.1.1.3 The Intelligent Program Optimization System
8.1.1.4 The Intelligent Program Restructuring Control System
8.1.1.5 The User Interface
8.1.1.6 The Structure of the System
8.2 Examples and Experiments
8.2.1 Remarks about the Experiments
8.2.2 The Matrix-Vector Multiply Example Revisited
8.2.2.1 Mapping onto the BBN Butterfly
8.2.2.2 Mapping onto the Pringle/CHiP Architecture
8.2.2.3 Mapping onto the Alliant FX/8
8.2.3 A More Realistic Example: LU-Factorization
8.3 Summary
9. CONCLUSION
9.1 Summary of the Thesis
9.2 Contributions
9.3 Future Work
9.3.1 Chaining Multiple Program Transformations
9.3.2 Parallel Execution of the Compiling Process
9.3.3 Self-Learning Modules
9.3.3.1 Knowledge Acquisition
9.3.3.2 Knowledge Refinement
9.3.3.3 Self-Learning Modules
9.4 Closing Remarks
BIBLIOGRAPHY
APPENDIX
Appendix A.1 Sample Rules Used in Chapter 8
Appendix A.2 Sample Listing of Encoded Rules




The main focus of this thesis is the methodology for constructing intelligent, optimizing, parallel
compilers and programming environments. We emphasize intelligence in the context of the optimization of
program parallelism for the following two reasons:
1. Program-parallelism optimization is a difficult task and high level intelligence is required.
2. Applying state-of-the-art AI techniques to parallel compiling may significantly increase the effectiveness
and efficiency of parallel compilers. However, this approach has not received its due attention in the
parallel compiler research community.
Two typical remarks that we received when we first proposed utilizing AI techniques to build intelligent
parallel compilers back in 1985 were: "Aren't you making a difficult problem even more difficult?" and
"How can you build an expert system for a field that has no experts?" Indeed, building parallel compilers or
programming environments has never been straightforward. The task is so complex that no satisfactory parallel
compilers exist today. However, the purpose of introducing AI techniques into parallel compiling is to derive
new methodologies for organizing and integrating program optimization knowledge and a framework for
systematic and automatic analysis so that intelligent behaviors can be observed in parallel compilers. As will be
demonstrated in this thesis, powerful program optimization tools can be built based on some relatively simple
methodologies. The second remark demonstrates a common misunderstanding about expert systems. In fact,
the primary areas where expert system technology can be applied are ill-conditioned, complex problems where
the general level of understanding about the field is not mature enough to solve the problem analytically. The
problem of program parallelism optimization falls into this category.
In this chapter, we identify some problems with current state-of-the-art optimizing parallel compilers and
programming environments and discuss new directions for solving these problems.
1.1. Motivation
An emerging trend in computer design is the reliance on parallelism to achieve performance improvements
over sequential processors. During the last decade, parallel architectures have migrated out of the
research laboratories and spread into various classes of commercial product lines that range from
supercomputer class machines, such as the Cray-YMP and Intel Touchstone, to workstation class machines such as
Transputers or NCUBE-4. The advantage of utilizing parallelism is obvious - concurrent execution provides
the critical speedup that many applications need and opens up opportunities for many new applications that
were not computable before. For supercomputers, parallelism provides a critical technology to break the
sequential speed barrier. For general purpose machines, parallelism allows manufacturers to produce powerful
computers with impressive peak performance and excellent price-performance ratio without exclusively relying
on expensive, cutting edge hardware technology. This lowers the cost of the machines significantly.
However, parallelism is not without its price. Parallel programs contain multiple threads of control
and data flow. Many new concepts and difficulties that do not exist in programming sequential computers are
encountered in programming parallel computers. Issues such as the correct order of concurrent data update and
accesses, critical regions, communication, and deadlock prevention need to be handled carefully to ensure the
correctness of the programs. Furthermore, complicated issues such as program partitioning, mapping,
scheduling, cache and local memory utilization, and synchronization must be addressed to utilize the parallel
capability of the machines. This makes programming parallel computers a very tricky and complex task.
Programming parallel computers can be several orders of magnitude harder than programming their sequential
counterparts [CCHKT88]. The efficiency of the program is usually tied closely to the structure of the target machine.
Highly optimized programs are usually non-portable. Worse yet, it is normally very difficult to predict the
correct behavior or to assess the performance of the program on a target machine. Debugging parallel
programs often becomes a nightmare for programmers.
Because of these difficulties, the massive parallelism promised by parallel architectures is rarely realized.
Studies indicate that the effective performance of parallel computers usually ranges from only 5% to 35%
of their peak performance [Bem86, Dong87]. Such poor delivered performance emphasizes the great need for
appropriate software tools for parallelism utilization. What is needed are user-friendly parallel compilers or
programming environments that can shield users from the difficulties of parallel programming.
1.1.1. Parallel Programming Environments
A parallel programming environment is a software system designed to help the user to program and
debug programs for parallel computers. It usually integrates various subsystems
such as programming tools (editors, visual programming tools, etc.), parser, program dependence
analyzer, parallelism optimizer, code generators, performance evaluation tools, visualization tools, debugging
tools, and, most importantly, a friendly user-interface. Different programming environments emphasize
different aspects of the parallel programming task. We are particularly interested in programming environ-













Figure 1.1. The structure of a typical parallel programming environment.
What an ideal parallel programming environment consists of is debatable. Ideally, the user should
implement the application in a high level programming language without worrying about the efficiency
problem and leave the optimization task to the compilers. In other words, the responsibility for mapping programs
efficiently to the target architectures lies primarily with the compilers and programming environments instead
of with the users. In this way, the programming environment can shield programmers from the pragmatic
details of parallel programming and allow them to concentrate on the structure of the algorithms and the
correctness of the programs. The use of this kind of programming environment can be best described by the
following scenario.
After developing the algorithms for his problem, the programmer encodes his/her application
in a machine-independent form (which can be either sequential or parallel). He then
selects a set of target machines. For each of the target machines, the programming
environment parallelizes the code and generates an optimized program that matches the
underlying architecture model. Depending on his/her parallel programming experience,
the programmer may select to interact with the system's decision-making process. Such
interaction can range from querying the system for its reasons in making some particular
decisions to providing suggestions or other input which would ease the task of the system.
In the extreme case, the programmer may actually direct the choices taken by the system.
Of course, the programming environment as described above does not exist today, but there are some
ongoing efforts in this direction. A good parallel programming environment should provide programming and
optimization advice to ordinary parallel programmers so that user involvement in the decision-making is
minimal.
1.1.2. The State-of-the-art of the Current Technology
Most parallel compilers and programming environments of today have the following problems:
• They lack the needed intelligence to make critical decisions. Most systems rely on the user to make
critical optimization decisions.
• They are inefficient. Most systems are slow and expensive to use.
• They are hard to improve. Most systems lack a systematic organization of the system knowledge and
often have ad hoc heuristics scattered throughout the system. This makes it difficult for the systems to
evolve. None of the existing parallel compilers or parallel programming environments have knowledge
acquisition facilities or learning capabilities.
• They have a limited scope of applications. Most existing systems can only be applied to a very limited
set of target architectures. Transferring the knowledge in a parallel compiler for a particular machine to
another requires significant effort.
As a result of these problems, most parallel compilers rely on the users to exploit opportunities for
concurrent operations. Users have to structure their applications carefully in specific forms so that the compiler
can easily recognize the parallelism available in the program. This means that users have to figure out most or
all the transformations that the compilers need to perform and annotate the programs to instruct the compilers
to generate efficient code. This is certainly not the kind of help that novice programmers expect from
parallelizing compilers.
Although program vectorization is considered to be a well-researched and well-understood subject
([AlKe84a, AlKe87, KKLW80, Wolfe82]), research in program parallelization and optimization for non-shared
memory or hierarchical multiprocessor architectures is still in its infancy. Only recently have extensive efforts
been underway in this direction: Parafrase II at CSRD [PGHLLS89], PFC [AlKe84b] and Parascope
[CCHKT88] at Rice, PTRAN at IBM [Allen86, ABCCF88], and our efforts in building intelligent parallel
compilers [Wang85, WaGa89, Wang90d]. It is unfortunate that the advances in parallel hardware have not
been met with needed advances in software technology. In fact, the slowness in market growth of parallel
computers may be linked indirectly to the slowness in the development of good parallel compilers and
programming environments. In the next section, we will discuss some difficulties in building such environments
with current state-of-the-art technology and propose some solutions to the problems.
1.1.3. Obstacles in Building Parallel Programming Environments
Why is the parallel programming software still in such a primitive state even after extensive research
efforts during the last decade? The major difficulties in generating powerful optimizing parallel programming
environments can be categorized as follows:
1. Lack of comprehensive understanding of parallelism utilization: Despite extensive research during the
past decade, we still do not have an overall picture of how parallelism can be optimized without
extensive user involvement.
2. Detailed knowledge about the target machines required for program optimization: Different parallel
architectures use different techniques to speed up computations and require different tricks to utilize the
features. Extensive knowledge about the underlying hardware is needed.
3. Dynamism in program behavior: The performance of some programs depends on input and the control
flow of the program.
4. Incomplete knowledge at compile time: Parallel compilers have to base their decisions on approximate
information at compile time. These approximations are often incomplete, rough and rely on an
unrealistic and simplified model of computation.
5. Huge decision-trees for parallelism optimization: Even for a program of medium size, the decision-tree
for program optimization can be quite large. Methodologies for pruning unneeded branches in the
decision-tree are needed.
6. Use of ad hoc heuristics: The techniques adopted by parallel compilers and programmers of parallel
computers are mostly ad hoc heuristics. A number of heuristics are needed in order for the compiler to
perform an adequate job. Unfortunately, most systems lack proper knowledge organization facilities to
utilize the heuristics effectively and systematically. Also, these ad hoc heuristics are mostly
non-portable.
7. Expensive dependence analysis and performance estimation: Program dependence information and
estimation of performance are usually very computation-intensive.
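To give a feel for obstacle 5, the short sketch below counts the nodes of a decision tree in which every program state admits the same number of applicable transformations. The uniform branching factor, the depth limit, the function name, and the numbers are illustrative assumptions, not figures from this thesis.

```python
def decision_tree_size(branching: int, depth: int) -> int:
    """Count the nodes of a uniform tree: 1 + b + b^2 + ... + b^depth."""
    return sum(branching ** level for level in range(depth + 1))


# Hypothetical numbers: 10 applicable transformations per program state,
# and sequences of at most 6 transformations.
full_tree = decision_tree_size(10, 6)

# If heuristics rule out all but 2 candidates per state, far fewer
# program versions have to be generated and evaluated.
pruned_tree = decision_tree_size(2, 6)

print(full_tree, pruned_tree)
```

Under these assumptions the count grows geometrically with the sequence length, which is why pruning has to happen as early as possible rather than after the candidates have been evaluated.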
Because of these difficulties, much of the burden of achieving high performance is shifted to the
programmer. Most people agree that current parallel programming tools need a much higher degree of
intelligence. The question is: How far away are we from building truly intelligent parallel compilers that can shield
the user from optimization details? Have we effectively utilized the state-of-the-art technology in building the
current generation of parallel compilers? Judging from the abundance of research results and the state of
current parallel compilers, the answer is probably a "no." From our point of view, the root of the problem
lies in the lack of systematic mechanisms for the reasoning and control of program parallelization and
optimization process. Methodologies that can effectively integrate and utilize the current technology to improve
parallel programming environments and compilers can have an immediate impact on the development of
parallel software and should be important topics in the current research on parallel processing. In particular,
combining advanced program transformation techniques and the state-of-the-art AI technology in the construction of
parallel compilers or programming environments is a very promising approach. Four important areas that need
much more concentrated and coordinated research efforts are:
1. Framework for systematic program analysis and intelligent program restructuring control. This is the
central topic of chapter 4.
2. The integration of machine properties into the program restructuring process and the representation and
manipulation of machine features. These problems were discussed in [WaGa89, Wang88] and will be
briefly discussed here. [Poly86] and [Sarkar87] discussed task creation and scheduling problems based
on some machine features.
3. Methodologies for analyzing, representing, organizing, and integrating heuristics for improving
parallelism. A framework for knowledge organization and control, called the heuristic hierarchy, is discussed in
section 4.4.3, and the concept is extended to a new problem-solving model called the hier-blackboard
(see section 4.4.4).
4. Learning models and knowledge acquisition tools for the enhancement of the system knowledge.
1.2. Problem Statements
This research is aimed at studying methodologies for constructing intelligent parallel compilers for
different classes of parallel computers. We will examine new techniques and methodologies for building
intelligent compilers for different classes of parallel architectures. Two key issues that we concentrate on are the
intelligence of the system and the portability across a wide spectrum of target architectures.
More specifically, we intend to study the following problems:
1. Paradigms for program restructuring. Program restructuring techniques are only mechanical procedures
to change the program structure; to effectively utilize these techniques, we need to couple the techniques
with knowledge of the target machine and heuristics. A framework for systematically selecting program
transformations and evaluating the results is needed. The efficiency, flexibility, and effectiveness of the
framework need to be studied.
2. Methodologies for improving the parallelism of the program. Study the heuristics for applying program
transformations to match the parallelism of the program with the target machine, the theoretic
foundations of some program transformation techniques, and the incorporation of global consideration in the
decision-making process.
3. Knowledge manipulation for parallel program optimization. These include techniques for the analysis,
acquisition, organization, integration and accumulation of the knowledge for program parallelization and
optimization.
4. Application of AI techniques to parallel compiling. Assess possible contributions of various AI
techniques in the construction of intelligent parallel compilers.
In our view, these problems are essential for solving the general problem of building parallel
programming environments that we discussed above. To demonstrate the techniques developed in this thesis, a
prototype system has been constructed. Some experimental results and comparisons with other parallel
programming environments and parallel compilers will be presented.
1.3. Our Approach
Our goal is not to build a production quality parallel compiler that generates high quality object code for
certain parallel computers; this would require a great implementation effort and overshadow the main ideas of
this research. What we are trying to do is to derive methodologies and theoretical foundations for the construction
of such systems and show that these approaches are feasible. Our approach is based on a new framework for
systematic analysis and restructuring of the program. This framework integrates machine features and
performance evaluation into the decision-making process of program optimization. Under this framework, the
program optimization process is an iterative process of finding a suitable program transformation sequence to
improve the match between the program and the target machine. AI techniques can be utilized in this iterative
process to prune non-promising branches of the decision tree, provide expert advice and explanation to novice
users, and support learning to improve intelligence of the compiler.
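The iterative process described above can be pictured with a minimal sketch. It uses a plain beam search as a stand-in for heuristic-guided pruning; the function names, the toy program state `(loop_is_parallel, messages_per_step)`, the two toy transformations, and the cost function are all hypothetical and are not the control strategy or the transformations developed in this thesis.

```python
from typing import Callable, Hashable, List, Optional, Sequence, Tuple

State = Hashable
Transformation = Tuple[str, Callable[[State], Optional[State]]]
Node = Tuple[float, Tuple[str, ...], State]  # (estimated cost, sequence, state)


def search_transformations(
    start: State,
    transformations: Sequence[Transformation],
    estimate_cost: Callable[[State], float],
    max_depth: int = 4,
    beam_width: int = 2,
) -> Node:
    """Search for a low-cost transformation sequence; lower cost is better."""
    best: Node = (estimate_cost(start), (), start)
    frontier: List[Node] = [best]
    for _ in range(max_depth):
        candidates: List[Node] = []
        for _cost, path, state in frontier:
            for name, apply in transformations:
                new_state = apply(state)
                if new_state is None:  # transformation is not applicable here
                    continue
                candidates.append((estimate_cost(new_state), path + (name,), new_state))
        if not candidates:
            break
        # Pruning step: only the most promising branches are expanded further.
        candidates.sort(key=lambda node: node[0])
        frontier = candidates[:beam_width]
        if frontier[0][0] < best[0]:
            best = frontier[0]
    return best


# Toy model (hypothetical): a state is (loop_is_parallel, messages_per_step).
def interchange(state):
    parallel, messages = state
    return None if parallel else (True, messages)


def consolidate(state):
    parallel, messages = state
    return (parallel, messages // 2) if messages > 1 else None


def toy_cost(state):
    parallel, messages = state
    return (1 if parallel else 8) + messages


cost, sequence, final_state = search_transformations(
    (False, 4), [("interchange", interchange), ("consolidate", consolidate)], toy_cost
)
print(cost, sequence, final_state)
```

The sketch keeps the cost estimate separate from the search loop, so a machine-specific estimate could be swapped in without touching the control logic.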
This framework is realized by a generalized blackboard problem-solving model called the
hier-blackboard which features opportunistic reasoning and hierarchical knowledge organization. Features of
the target machine that affect the performance of the program are explicitly encoded in the heuristics. An
object-oriented machine knowledge representation scheme is designed to support reasoning on the knowledge
and integration of heuristics. A machine knowledge manipulation system is implemented to provide support
for the program optimization system and the knowledge acquisition tool.
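As a rough illustration of what an object-oriented machine feature representation might look like, the sketch below models features as objects whose attributes are inherited from a parent feature. The class name, attribute names, values, and the toy heuristic are hypothetical; the scheme actually used in this work is the subject of chapter 5.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Feature:
    """A machine feature with attributes and an optional parent to inherit from."""
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    parent: Optional["Feature"] = None

    def get(self, key: str) -> Any:
        """Look up an attribute locally, then along the inheritance chain."""
        node = self
        while node is not None:
            if key in node.attributes:
                return node.attributes[key]
            node = node.parent
        raise KeyError(key)


# Hypothetical feature hierarchy; all names and values are made up.
memory = Feature("memory", {"shared": False, "access_cycles": 10})
local_memory = Feature("local_memory", {"access_cycles": 2}, parent=memory)


def prefers_local_copies(mem: Feature) -> bool:
    """Toy heuristic that reads machine features instead of hard-coding them."""
    return not mem.get("shared") and mem.get("access_cycles") <= 4


print(prefers_local_copies(local_memory), prefers_local_copies(memory))
```

Because the heuristic only queries features, describing a new machine in this sketch means adding feature objects rather than rewriting the heuristic.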
A systematic program optimization process requires a flexible, efficient, and accurate performance
prediction model. We have designed and implemented a performance prediction model to estimate effects of
program transformations on the program. This performance prediction is based on the machine knowledge
manipulation system and integrated with the systematic program restructuring process.
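To make the role of such a model concrete, here is a deliberately simplified sketch that estimates execution time from a few machine parameters and compares a program before and after a transformation. The cost formula (evenly divided computation plus a per-message startup and per-byte cost), the class and field names, and all numbers are assumptions for illustration only; the model actually used is presented in chapter 6.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Machine:
    processors: int
    us_per_op: int            # microseconds per arithmetic operation
    message_startup_us: int   # fixed cost per message
    us_per_byte: int          # transfer cost per byte


@dataclass(frozen=True)
class ProgramProfile:
    operations: int           # total arithmetic operations
    messages: int             # messages sent by each processor
    bytes_per_message: int


def predict_time_us(program: ProgramProfile, machine: Machine) -> int:
    """Estimate run time as evenly divided computation plus communication."""
    compute = math.ceil(program.operations / machine.processors) * machine.us_per_op
    per_message = machine.message_startup_us + program.bytes_per_message * machine.us_per_byte
    return compute + program.messages * per_message


# Hypothetical machine and program variants.
machine = Machine(processors=4, us_per_op=1, message_startup_us=1000, us_per_byte=1)
before = ProgramProfile(operations=1_000_000, messages=100, bytes_per_message=8)
# Same data volume sent as one large message instead of 100 small ones.
after = ProgramProfile(operations=1_000_000, messages=1, bytes_per_message=800)

print(predict_time_us(before, machine), predict_time_us(after, machine))
```

A decision procedure can compare the two estimates and keep the transformed program only when the predicted time improves.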
We also study heuristics for utilizing various transformation techniques and the theoretic foundation of
two useful transformations, array reshaping and message consolidation.
The essential requirement for a system to be intelligent is the ability to acquire new knowledge, that is,
self-learning. A compiler that learns from its past experience will greatly enhance its capabilities and evolve with
advances in the technology. Although no self-learning modules are implemented at this point, the whole
system framework is designed with self-learning in mind. Provisions for integrating learning modules are
supported in modules for knowledge organization, program optimization, and performance prediction.
To summarize, our approach has the following distinct features:
• The program optimization process is driven by the features of the target machine and the program.
• Machine knowledge manipulation is integrated with the decision-making, performance-prediction, and
knowledge-manipulation processes.
• It supports multiple target machines and allows knowledge generalization and knowledge transfer.
• The whole decision tree is examined; AI techniques are utilized to prune non-promising branches of the
decision-tree.
• It supports systematic manipulation (encoding, organization, integration, and utilization) of domain
knowledge.
• New knowledge acquisition and self-learning modules can be easily incorporated.
The implementation of our approach differs from other parallel compilers in the following four ways.
First, we are using second generation expert system techniques as part of the internal design. Second, we are
exclusively relying on heuristics to make program optimization decisions based on the features of the machine
and the program. No predetermined sequence of transformations is required. Third, our system can handle different
classes of target architectures. Adding new target machines is a matter of specifying features of the machines
and installing new heuristics for programming the machine. Fourth, the level of user interaction can be chosen
by the user. This allows the system to adjust itself to suit the different backgrounds of the users. Also,
depending on the available resources, the optimization degree is selectable. Unlike other optimization
systems, which skip some optimization processes altogether when the optimization degree is low, here the
optimization degree is an indication of how many resources should be devoted to certain costly
estimations of the performance. Therefore, the optimization degree does not affect the functionality of the
optimization; a low degree only means that cheap heuristics are used to obtain a rough estimate of the needed information.
Why do we choose to use the expert system approach and to support multiple target architectures? We
examine this question in the following two subsections.
1.3.1. Expert System Technologies
Expert systems are suitable for problem domains with great complexity and jobs for which it is difficult
to develop algorithms that can handle general cases. These kinds of problems are usually solved by employing
human heuristics. Computer-based expert systems seek to capture enough knowledge of human specialists so
that they can solve the problem. Due to the inherent difficulties of knowledge acquisition and representation,
the discovery and test cycle of integrating the knowledge needs to be repeated many times before the system is
powerful enough to solve general cases of the problem. With the gradual maturation of the knowledge about
parallelism, the knowledge of the parallel compilers needs to be updated frequently. The open-ended
characteristic of this process makes the expert system a natural choice for implementation.
First generation expert systems rely on the use of surface knowledge, such as simple heuristics. As a
result, these expert systems show only a limited degree of intelligence. This is because human experts not
only have a grounding in the declarative and procedural knowledge of their particular domains, but also
possess something beyond facts and procedures. A major aspect of expertise resides in its structure and
organization (deep knowledge). However, the know-how sought by expert systems does not come pre-packaged as an
explicit and objective model. It may be better characterized as a collection of semi-complete, semi-structured,
hedged, and subjective know-how [Hayes83]. Therefore, it is important that the knowledge acquisition
techniques capture the underlying structure and organization of the knowledge. Second generation expert systems try
to solve this problem by looking beyond the use of surface knowledge and utilizing two additional features:
tools for acquisition of deep knowledge and machine learning. The expert system shell should integrate a
variety of tools that allow for acquisition, explanation, and utilization of the complex domain
knowledge. The ExpShell discussed in chapter 8 represents our efforts in this direction. The importance of
machine learning is in its low degree of human involvement. Although self-learning is still not a reality,
it has much higher potential than human-assisted knowledge acquisition in the long run. It can utilize
the available computing power to explore more possibilities and is not limited by the knowledge of the human
experts.
One major drawback of rule-based expert systems is the opacity of the knowledge. Translating
heuristics into rules causes the knowledge to be fragmented. Even though there are still strong relations
between many of the rules, the fragmentation causes an unfortunate loss of coherence. We address this
problem by deriving a new problem-solving model called the hier-blackboard. The hier-blackboard model is not
only a model for implementation of expert systems, but also a methodology for decomposing and organizing
both the problem-solving knowledge and the problem-solving state. Details of the model are discussed in
chapter 4.
1.3.2. Multiple Target Architectures and Knowledge Generalization
Different architectures present different programming models and use different tricks to speed up the
computation. Thus programs need to be restructured on the basis of knowledge of the target architectures
in order to utilize the potential parallelism the hardware provides. Generally, there is a trade-off between
optimization and portability. Porting a parallel program to another parallel computer with different characteristics
requires a major effort, which greatly increases the cost of developing software for parallel computers. One
solution to this problem is to build multiple-target or retargetable parallel compilers.
Many people believe that the optimization of the machine parallelism needs to be based on the
machine-specific instructions that the machine provides; therefore, the heuristics for program optimization are specific to
the machine that they are developed for. However, a closer examination reveals that the performance of the
parallel architecture is tied to the features of the machine instead of the machine-specific instructions that
reflect the features. So ad hoc knowledge needs to be analyzed and converted into knowledge that can be
shared by other systems; this requires a model of program transformation that can accommodate such
generalization. The feature-directed program optimization model that we discuss in chapter 4 suits this need fairly
well. A retargetable parallel compiler not only needs to understand hardware details related to code generation
but also needs to have enough knowledge about optimizing parallelism of target machines. This magnifies the
difficulties of all the problems in building parallel programming environments.
The major advantage of a multiple target architecture software system is that the knowledge accumulated
for each architecture can be transferred to other target machines. Knowledge generalization and transfer can be
achieved by systematically analyzing the machine features involved in the knowledge. We discuss this
problem in detail in chapter 5.
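
As an illustration of why feature-based knowledge transfers across machines, the following minimal Python sketch attaches each heuristic to the machine features it depends on rather than to a particular machine. The class, the heuristic names, and the feature names are hypothetical and chosen only for this example; they are not the representation used in the system described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heuristic:
    """A heuristic tagged with the machine features it relies on (hypothetical)."""
    name: str
    required_features: frozenset


def applicable(heuristics, machine_features):
    """Return the names of heuristics whose required features the machine has."""
    return [h.name for h in heuristics if h.required_features <= machine_features]


# Hypothetical knowledge base and machine descriptions.
HEURISTICS = [
    Heuristic("vectorize-inner-loop", frozenset({"vector-unit"})),
    Heuristic("consolidate-messages", frozenset({"message-passing"})),
    Heuristic("block-for-cache", frozenset({"cache"})),
]
MACHINE_A = frozenset({"vector-unit", "cache", "shared-memory"})
MACHINE_B = frozenset({"message-passing", "cache"})

print(applicable(HEURISTICS, MACHINE_A))
print(applicable(HEURISTICS, MACHINE_B))
```

In this sketch the cache-blocking heuristic is reused on both machines without change because both list the feature it depends on, while the other two heuristics apply only where their features are present.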
1.4. Organization of the Thesis
In chapter 2, we discuss some fundamental issues about machine parallelism and program parallelism.
These include machine architecture and parallel execution, program dependence, representation and
computation of program dependence, abstraction of program parallelism, and programming issues for parallel
computers.
In chapter 3, we discuss the foundation of program transformations and survey some restructuring
techniques and their applications in parallelism optimization.
In chapter 4, we present a new model for parallel program optimization, called the feature-directed
program optimization paradigm. A framework for building optimizing parallel compilers based on the
feature-directed program optimization paradigm is derived. A new problem-solving model called the hier-blackboard
is introduced. This model features opportunistic reasoning with hierarchical organization of the domain
knowledge and problem-solving states. We also discuss several approaches for efficient decision-making,
knowledge organization, learning, and intelligent user interface, which are the major subjects of the subsequent
chapters.
In chapter 5, features of the parallel architecture that affect program optimization are discussed. An
object-oriented machine knowledge representation scheme is presented. The utilization of this scheme to build
a machine knowledge manipulation system is presented. Problems in collecting, organizing and analyzing
machine features and heuristics are also discussed. Some applications of this technology to other related
problems are suggested.
In chapter 6, we present a performance prediction model that is designed for parallel compilers. The
model is flexible and efficient and can be utilized in parallel compilers to assess the overall performance of a
program on a target machine and update the performance estimation incrementally.
In chapter 7, we discuss problems of parallelism matching and techniques to handle the problems for
MIMD architectures. Theoretical foundations for array reshaping and message consolidation are discussed.
Heuristics for utilizing these and other program transformation techniques are also presented. We also discuss
the problem of selecting and fine-tuning pre-optimized algorithms.
In chapter 8, the implementation of an experimental parallel programming environment is discussed and
some examples and experimental data are also presented. Finally, in chapter 9, we summarize ideas presented





2.1. Levels of Parallelism
Parallelism can be exploited at three different levels: the algorithm level, the program level, and the
machine level. On each of these levels it can be abstracted as a conceptual concurrency model of computation
which is called the virtual machine of the level.
At the algorithm level, the parallelism lies in the concurrency of the multiple threads of computation
allowed by the algorithm. The computational assumptions that the parallel algorithms are based upon represent
the virtual machine of the algorithm. Commonly used models include shared-memory models
(concurrent-read-exclusive-write, concurrent-read-concurrent-write, etc.) and fixed interconnection networks with constant
degree (such as linear arrays, trees, meshes, hypercubes, systolic arrays, shuffle-exchange, and butterfly
networks). Algorithm level parallelism can be characterized by the number of virtual processors, the model of
concurrent read-writes, the complexity of inter-processor communications, and the complexity class of the
parallel execution time on the virtual machine model when expressed as a function of problem size.
At the program level, each parallel programming language defines a virtual machine by the semantics of
its parallel control constructs. Program level parallelism can be characterized by the control and data
dependence constraints imposed by the language and the user's choice of data structures. When an algorithm is
represented as a program, its parallelism is limited to the available parallelism of the algorithm under the
constraints of the program dependencies.
Machine level parallelism, which is the maximum concurrent execution capacity of the architecture, can
be characterized by various machine features such as peak performance, levels of memory hierarchy, ratio of
processing and memory access speed, number of pipelines, network topology, and network latency.
When problems are mapped from the algorithm level to the program level or from the program level to
the machine level, the differences between the computational models of the two levels may cause parallelism to
be lost. For example, when a program is compiled to run on a particular architecture, the effective parallelism
of the program is determined by the match between the program and the underlying machine architecture.
Similarly, when an algorithm is translated into a program, the concurrent properties of the algorithm may be
serialized by the dependence relations inherited from program constructs and data synchronization. In some
cases, the concurrency is lost because the limited parallel constructs provided by the programming language
simply cannot express the full parallelism of the algorithm.
The utilization of the parallelism of the algorithm on a machine is based on the match of parallelism
between the algorithm level and the program level and between the program level and the machine level. At
the algorithm-design stage, the user can select the most appropriate algorithm to carry out a particular
computation. At the programming stage, the user can select suitable programming languages and data structures to lay
down the computation. The compiler is then supposed to translate the user program into machine language
and optimize the program based on its knowledge of the target machine.
Unfortunately, the implementation of algorithms is heavily influenced by the computational model
selected, and changing the computational model sometimes requires a complete redesign of the algorithm.
There have been many attempts to solve these problems. One approach is to write programs based on a
carefully designed programming model. The programming model forms a virtual architecture that executes the
programs. Programs written for this programming model can be executed on parallel computers that the vir-























Figure 2.1. Parallelism and the process of parallel programming.
[BOMMKSW90] are just a few examples based on this approach. Another solution, which has achieved
mixed results so far, is to allow the user to program the application in high level programming languages that
support a limited degree of parallelism (such as Fortran, Fortran 9X, Dino [RoScWe89], Blaze [MeVR87], and
Kali [MeVR89]), and to rely on compilers or programming environments to utilize the parallelism of the
architecture. The user may either express the parallelism explicitly in the program or direct the parallel
programming environment or the compiler to map the program to the target machine.
The major difference between these two approaches lies in the level of abstraction. The former
approach usually provides a well-defined, very high level programming model which is not bound to any
particular architecture. This model has a smaller application base since some applications may not fit the model
well. Another problem is that users have no control over the program's performance, which is at the mercy of
the implementation of the programming model. Performance may suffer if the target architecture mismatches
the computational model. The biggest drawback of this approach is that the huge existing libraries and
applications cannot be utilized without complete reprogramming based on the new model; this is very expensive.
On the other hand, the compiler approach can utilize the rich existing software libraries by restructuring and
recompiling the libraries with parallel compilers or programming environments. Explicit parallelism
programming with general purpose high level languages is usually at a lower level than programming with
virtual architectures. Also, capabilities of the compilers and parallel programming environments are still very
limited. The retargetable parallel compilers and programming environments that we discuss in this thesis have the
potential to automate this process.
2.2. Machine Parallelism
Parallelism can be achieved by processing the data simultaneously. Techniques for achieving
simultaneous data processing include:
1. Overlapping I/O and CPU operations.
2. Using multiple independent functional units (FPU, ALU, etc.).
3. Using simultaneous data accesses with multiple memory modules.
4. Using hierarchical memory subsystems (disk, memories, cache memory, registers, etc.).
5. Overlaying execution stages of functional units (pipelining).
6. Using multiple processing elements.
These techniques, except for the use of multiple processing elements, have been successfully utilized in
most sequential computers. The major reason for the success of these methods is that advanced compiler and
operating system techniques make these methods transparent to users. Unfortunately, the same is not true for
systems that have multiple processing elements. As we discussed in chapter 1, programming parallel
computers with multiple processing elements requires many concepts that do not exist in programming sequential
machines.
The major complication of parallel programming lies in the management of the data. When multiple
processors can access and update the same data at the same time, simultaneous processing needs to be handled
with great care. For example, when several processors modify the same data, the order of the updates
determines the outcome of the program. Similarly, when multiple copies of the same data exist, the correctness and
consistency of the data are also of concern. To preserve the correctness of program execution, the
concurrent tasks must be synchronized. Synchronization introduces overhead and causes processors to be idle.
Programmers of parallel computers are forced to manage data very carefully since the data communication
time can very easily dominate the computation time. Efficient management of data to prevent serialization of
the parallelism is another area that needs much more research and is considered by many experts to be a
bottleneck of parallel programming.
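
The dependence of the outcome on the update order can be made concrete with a small deterministic simulation. In the Python sketch below, two hypothetical processors each perform an update of the form x = x + delta as a separate read step and write step; the processor names, the deltas, and the two schedules are assumptions chosen only for illustration.

```python
def run(schedule, initial=0):
    """Replay read/write steps of two processors on one shared variable.

    Each processor updates the shared value as `shared = shared + delta`,
    split into a read of the shared value and a later write of the sum.
    """
    deltas = {"P1": 1, "P2": 10}   # hypothetical per-processor increments
    shared = initial
    local = {}                     # value each processor last read
    for proc, step in schedule:
        if step == "read":
            local[proc] = shared
        else:                      # "write"
            shared = local[proc] + deltas[proc]
    return shared


# P1 finishes its update before P2 starts (synchronized execution).
serialized = [("P1", "read"), ("P1", "write"), ("P2", "read"), ("P2", "write")]
# Both processors read before either writes (unsynchronized execution).
interleaved = [("P1", "read"), ("P2", "read"), ("P1", "write"), ("P2", "write")]

print(run(serialized))
print(run(interleaved))
```

With the serialized schedule both increments are applied, whereas in the interleaved schedule the second write overwrites the first and one update is lost. This is the situation that synchronization is meant to rule out, at the price of the overhead noted above.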
Another problem is that computation must be split into parts that can be run simultaneously on the
processors. This includes two kinds of parallelism: multiprogramming and multi-tasking. In multiprogramming,
several programs are allowed to run concurrently on the parallel computer. The resources of the systems are
shared by the programs, but there is no further interaction between the programs. In contrast, multi-tasking
splits a single program into multiple parts, called tasks, to run on a parallel computer. The tasks
coordinate with each other to compute the solution of the problem. The parallelism is exploited by running the tasks
on multiple processors concurrently.
To achieve the maximum speed up, the workload must be evenly divided among the processors. This
problem is called load balancing. Since processes in multiprogramming are independent, the load balancing
problem in multiprogramming usually concerns the utilization of the resources and the throughput of the system.
The scheduling of multiprogramming is usually decided dynamically at run time by the operating
system of the target machine. On the other hand, tasks in the multi-tasking environment are usually related, and
the tasks and the relationships between them need to be analyzed to achieve a balanced load. Therefore,
multi-tasking is usually scheduled statically at compile time, although some decisions may be deferred to
run time.
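
As a rough illustration of static load balancing, the Python sketch below assigns tasks with known cost estimates to processors using the classical longest-processing-time-first greedy rule. This rule is a stand-in chosen for the example, not the scheduling method of this thesis, and it deliberately treats the tasks as independent, ignoring the relationships between tasks that a real compile-time scheduler has to analyze. The task names and costs are hypothetical.

```python
import heapq


def schedule_tasks(task_costs, num_processors):
    """Greedy static schedule: give the next-largest task to the least-loaded processor.

    Returns (assignment, makespan), where assignment[p] lists the tasks of
    processor p and makespan is the largest total load of any processor.
    """
    heap = [(0, p) for p in range(num_processors)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_processors)]
    # Consider tasks in order of decreasing estimated cost.
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(task)
        heapq.heappush(heap, (load + cost, p))
    makespan = max(load for load, _ in heap)
    return assignment, makespan


# Hypothetical compile-time cost estimates for five tasks.
costs = {"t1": 7, "t2": 5, "t3": 4, "t4": 3, "t5": 1}
assignment, makespan = schedule_tasks(costs, 2)
print(assignment, makespan)
```

For these costs the total work is 20 units, and the greedy schedule reaches the ideal split of 10 units per processor; in general the rule only approximates the best balance.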
2.2.1. Variations of Parallel Computers
Parallel computers can be classified by different criteria. One of the most widely accepted classifications
is based on the number of instruction streams and data streams [Flynn66]. An instruction stream is a
sequence of instructions as executed by the machine and a data stream is a sequence of data as referenced by
the instruction stream. Flynn classified computer architectures into four different classes:
• SISD: single instruction stream - single data stream
• SIMD: single instruction stream - multiple data streams
• MISD: multiple instruction streams - single data stream
• MIMD: multiple instruction streams - multiple data streams
In SISD systems, instructions are executed sequentially. Overlapping of instructions is possible (such as
overlapping data fetching and processing or overlapping execution stages) in order to speed up the
computation. Even though an SISD machine may have more than one functional unit (such as a floating point unit and an integer
unit), all of them are under the supervision of one control unit.
An SIMD machine contains multiple processing units which operate synchronously in lock step under
the same control unit. Most SIMD machines are special purpose array processors that are applied mainly to
signal and image processing. Examples of machines in this category include IBM GF11 [BeDeWe85],
Connection Machine [Hillis85], and FPS 164/MAX.
The MISD model is considered impractical and no research or commercial effort has ever been placed in
this category.
The MIMD architecture consists of a set of sequential processing elements (PEs), each of which may
execute instructions independently of the others. A generic MIMD machine consists of a set of PEs, a set of
memory modules and an interconnection network that interconnects PEs, memory modules and peripheral
devices.
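
The four classes follow mechanically from the two stream counts, as the following minimal Python sketch shows. The function name and its integer arguments are assumptions made for the example.

```python
def flynn_class(instruction_streams, data_streams):
    """Name the Flynn class for the given numbers of instruction and data streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"


print(flynn_class(1, 1))    # one control unit, one data stream
print(flynn_class(1, 64))   # lock-step array of 64 processing units
print(flynn_class(16, 16))  # 16 independent processing elements
```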
The most important properties of MIMD computers are the memory hierarchy and the communication
mechanism. This is because in most systems, especially machines that have several levels of memory
hierarchy, misuse of fast memory or the communication network introduces unwanted overhead and can significantly
degrade the performance of the system.
2.2.1.1. Network Topologies
Processing elements, memory modules and peripheral devices in an MIMD computer are connected to
each other through the interconnection network. Interconnection can be dynamic or static. In a dynamic
interconnection network, switches are provided so that elements in the system can be connected under program
control. Crossbar switches, shuffle exchange networks, etc. are widely used dynamic interconnection networks.
CHiP computers [KWGCS84] are highly configurable MIMD machines in which different topologies of the
interconnection can be formed under the program control of the switches. The static interconnection has a
fixed topology by which processing elements are arranged in multidimensional patterns and permanently
connected to each other. The most widely used interconnection method is the bus structure. In this method, all
elements share a single connection medium and employ a variety of techniques such as token passing or
time-sliced broadcasting to assure correct data transmission. Different types of communication networks include
mesh, hypercube, torus, cubic connected, and butterfly.
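
To make two of the static topologies concrete, the Python sketch below computes the neighbors of a processing element in a hypercube and in a two-dimensional mesh, together with the minimum hop count between two hypercube nodes. The node numbering (binary addresses for the hypercube, row and column coordinates for the mesh) is the conventional one and is an assumption of this example rather than something fixed by the text.

```python
def hypercube_neighbors(node, dimension):
    """Neighbors of `node` in a hypercube: flip each address bit in turn."""
    return [node ^ (1 << bit) for bit in range(dimension)]


def hypercube_distance(a, b):
    """Minimum number of hops between two hypercube nodes (Hamming distance)."""
    return bin(a ^ b).count("1")


def mesh_neighbors(row, col, rows, cols):
    """Neighbors of (row, col) in a rows x cols mesh without wraparound links."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


print(hypercube_neighbors(5, 3))   # node 101 in a 3-dimensional hypercube
print(hypercube_distance(0, 7))    # opposite corners of the 3-cube
print(mesh_neighbors(0, 0, 3, 3))  # a corner node of a 3 x 3 mesh
```

Every hypercube node has as many neighbors as the cube has dimensions, while the number of mesh neighbors depends on whether the node lies in a corner, on an edge, or in the interior. A torus would differ from the mesh only in that the boundary check is replaced by wraparound.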
2.2.1.2. Memory Hierarchy
Machines that use distributed local memories attached to processing elements are called
distributed-memory MIMD systems. Message passing is the communication method among the processing elements in
distributed-memory MIMD systems. Examples of distributed-memory MIMD systems include nCUBE 2, Intel
iPSC/2, Intel Touchstone, and Pringle.
Shared-memory MIMD machines are tightly coupled systems using shared memory among multiple
processors. Shared-memory MIMD machines can have multiple levels of memory hierarchy. In the extreme, a
machine with complete memory hierarchy may include global memory, cluster memories, local memories,
cache memories, or even multiple levels of cache memories [KDLS86]. Shared-memory MIMD machines are
normally bus connected or direct connected. Examples of shared-memory MIMD machines include Sequent
Symmetry, Alliant FX/80, and Cray X-MP.
2.2.2. Programming Issues for MIMD Parallel Computers
For both shared and non-shared machines, data management presents the biggest challenge to the
utilization of the parallel architectures. Bad memory management may cause network or bus contention and
dramatically degrade the performance of the system. For shared-memory systems, the major programming issues
include task creation and scheduling, data allocation, locality, memory and cache management, and
minimization of synchronization. For distributed-memory MIMD machines, process creation, data decomposition, load
balancing, locality, message routing, and deadlock prevention are the most significant problems.
Distributed-memory parallel computers are normally more difficult to program because data communication needs to be












Figure 2.2. Example of different network topologies.
2.2.3. Computational Models
The computer system as perceived by users may be quite different from the physical device that carries
out the computations. This is because the functionalities and limitations of the machine may be modified,
limited, or extended by the languages, operating system, software, microcode, or optional hardware of the system. In
other words, the computational model of the computer can be viewed as a hierarchy of layers that includes
hardware devices, operating system, programming languages, software environments, etc. Each level of the
"machine" that the users see appears to be different and possesses different functionalities and
methodologies for using the machine. When the user is using the computer, the actual computing device that "appears"
to the user depends on which level of the hierarchy the user is at. We call these virtual machines the
computational models on the machine.
Based on the level of abstraction, the computational model can be viewed as a hierarchy of levels (figure
2.5), with the physical hardware at the lowest level. The operating system and utilities are built on top of the
physical hardware level, and the programming languages are implemented on top of both the operating system
and the hardware. The applications and the software systems are the topmost layer in the hierarchy, which
usually is what users see as the computer. This division is an extension of the division into machine, program,
and algorithm levels of parallelism in section 2.1 and can be further extended to divide each level into
sub-levels. The computational models of the levels are composed of the features and methodologies of using the
levels and they form the interface between the levels. The upper levels can be viewed as programmed on the
lower levels. Starting from the bottom up, the computational model of the hardware level consists of
specifications and features of the hardware (for example, the number of vector registers, and the cost ratio of
memory access and processing). The operating system level includes operations and control for
multiprogramming and utilization of the devices (for example, process start up time, swapping time, and self-scheduling
operations). The programming language level is defined by the constructs and concepts of the programming
language. It defines data structures, instructions and control structures to control the execution of the program.
The programming language level forms the foundation for the program parallelism. The application level
consists of various applications and is outside the scope of our interest.
Figure 2.3. Different types of MIMD computers. (Panels: shared-memory architecture, distributed-memory architecture, hybrid-memory architecture, hierarchical-memory architecture.)
Features of a level may be preserved, obstructed or modified as we go across the level boundaries.
Some new features may be the abstraction of the features of the lower levels. Some may be modified because
of the limitations of the upper levels. When the machine is seen from a level down, the actual characteristics
of the computational model consist of the features that belong to the current level and those of lower levels
that are not obstructed by the level boundaries.
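
One simple way to picture this rule is sketched below in Python: each level contributes its own features and may hide some features of the levels beneath it, and the model seen from a level is whatever survives on the way up. The data layout, the `hides` field, and the particular choice of which feature is hidden are assumptions made for illustration; features that are modified rather than hidden are not modeled.

```python
def visible_features(levels, top):
    """Features seen from level index `top` in a bottom-up list of levels.

    Each level is a dict with its own 'features' and the set of lower-level
    features that its boundary 'hides'.
    """
    visible = set()
    for level in levels[: top + 1]:
        visible -= level["hides"]     # boundary obstructs some lower-level features
        visible |= level["features"]  # the level adds its own features
    return visible


# Hypothetical three-level hierarchy, listed from the bottom up.
LEVELS = [
    {"name": "hardware",
     "features": {"vector-registers", "memory-access-ratio"}, "hides": set()},
    {"name": "operating system",
     "features": {"process-startup-time"}, "hides": set()},
    {"name": "programming language",
     "features": {"parallel-loops"}, "hides": {"vector-registers"}},
]

print(sorted(visible_features(LEVELS, 1)))
print(sorted(visible_features(LEVELS, 2)))
```

Seen from the operating system level, all hardware features remain visible in this example, while from the language level the vector registers are hidden but the memory access ratio still shows through.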
Adopting the notion of computational models at different levels of the system provides a good
abstraction of the levels. The model can shield the users of the level from the lower levels and is therefore
conceptually easier to manipulate. It also makes identifying differences between levels easier.
2.3. Program Parallelism
2.3.1. Granularity of the Program Parallelism
Program level parallelism can be further divided into three concurrency levels: task, micro task, and
operation. At the task level, a program is decomposed into processes which may be run on different
processors. At the operation level, vector or scalar operations are the units of computation. The size of the vector
operation represents the degree of concurrency at this level. The micro task level is the level between the task
level and the operation level and is often characterized by loop bodies. More precisely, inside a task,
operations are grouped into micro tasks, which are blocks of code that are executed between synchronization points.
Efficient utilization of a program on a parallel machine depends on the composition of tasks and the balance
between operation level parallelism and task level parallelism. A parallel compiler will have to decide the best
way to form the tasks, the vector operations, and the scheduling of the tasks. Decisions about the granularity of
the tasks, the structure of the tasks and the scheduling of the tasks are based on many factors; a non-exhaustive list
includes the cost of task dispatching, distribution of shared data, compile and run time knowledge of the
program and system, and ratios of computation and external data references of the tasks. These decisions can be
made either statically at compile time or dynamically at run time. Making the decisions at run time has the
advantage of having additional information (such as values of some variables) that is not available at compile














Figure 2.4. Programming issues for MIMD computers. (Legible labels: critical regions, dead-lock prevention, message routing, message consolidation.)
compromise between compile and run time decision making is needed.
2.3.2. Program Dependence
Parallel execution of programs is governed by the dependence relations of the program. There are
two types of program dependence: control dependence and data dependence. The control dependence
[FeOtWa83] specifies the preconditions on the operations which are required for them to be actually executed
(such as loops, conditions, and exit statements). The data dependence represents the set of essential constraints
on the execution order of the data references. Together, these two types of dependencies form a complete
summary of the semantics of the program.
Since the program dependence specifies the constraints that the program parallelization process must
respect, violating the dependence relation may cause data access and modifications to happen in the wrong
order and thus change the meaning of the program.
There are four types of data dependence relations: input dependence, flow dependence, anti-dependence
and output dependence [KKLPW81] and [Wolfe82]. In the following discussion, a program component can be
a data reference, an expression, an assignment statement or a compound statement such as a loop or
conditional statement.
Definition 2.1 Consider two, not necessarily distinct, components S1 and S2; we say that:
1. There is a flow dependence from S1 to S2 if a value computed by S1 is stored in a location associated
with some variable name x which is later referenced in S2 before other components overwrite the value.
We denote the flow dependence relation as $\delta_x : S_1 \rightarrow S_2$.
2. There is an anti-dependence from S1 to S2 if a variable x referenced by S1 must be used before it is
overwritten by S2. This relationship is denoted by $\bar{\delta}_x : S_1 \rightarrow S_2$.
3. There is an output dependence from S1 to S2 if a value computed by S1 is stored in a location associated
with some variable x and is later overwritten by computations in S2. This relationship is denoted by
$\delta^{o}_x : S_1 \rightarrow S_2$.
4. There is an input dependence from S1 to S2 if both components use the same value stored in a location
associated with variable x. This relationship is denoted by $\delta^{i}_x : S_1 \rightarrow S_2$.
Figure 2.5. A hierarchical view of the computational models.
These data dependencies impose constraints on the execution order of the components. If there are flow,
anti- or output dependencies from S1 to S2, component S1 must be executed before component S2. These
constraints are necessary in order to preserve the input-output semantics of the program. Note that the input
dependence is a reflexive relation and does not impose any constraint on the potential parallel execution of the
two components. However, it can be used to compute the cache-miss ratios for memory management
[GaJaGa87].
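The four dependence types of Definition 2.1 can be told apart by whether the earlier and the later component read or write the shared location. The following Python sketch illustrates this under the assumption that the two accesses are already known to touch the same location; the names `classify_dependence` and `constrains_order` are hypothetical and do not come from the original system.

```python
def classify_dependence(first_is_write: bool, second_is_write: bool) -> str:
    """Classify the dependence from an earlier access to a later access
    of the same memory location (assumed, not checked here)."""
    if first_is_write and not second_is_write:
        return "flow"      # write, then read
    if not first_is_write and second_is_write:
        return "anti"      # read, then overwrite
    if first_is_write and second_is_write:
        return "output"    # write, then overwrite
    return "input"         # read, then read


def constrains_order(kind: str) -> bool:
    """Flow, anti- and output dependences fix the execution order;
    an input dependence does not."""
    return kind in {"flow", "anti", "output"}
```

Only the first three kinds force S1 to run before S2, which is why a parallelizer can ignore input dependences when checking legality.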
A loop instance in a loop or multiple loops can be identified by the index or indices associated with the
instance. A dependence relation that crosses from one loop instance into another is said to have a distance and
the distance is defined to be the difference of the loop indices.
Definition 2.2 Suppose there is a dependence relation from statement S1 in loop index tuple $I_1$ to statement S2
in loop index tuple $I_2$; then the distance vector of the dependence is defined to be $I_2 - I_1$.
Not all optimization techniques need detailed knowledge of the distances between references of
the dependence. In some cases, only the signs of the vector elements are needed. The data dependence direction
vector is defined as the vector of the signs of the distance vector of a dependence. For the representation
of the direction vector, '<', '=', '>' are used to represent the relationship between the indices of the source and
destination array elements, where '<' means the dependence crosses a loop boundary in a forward direction; '>'
means the dependence crosses a loop boundary in a backward direction; '=' means that the dependence appears
in the same loop instance.
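As a small illustration of Definition 2.2 and of the direction vector, the Python sketch below computes both from a pair of loop index tuples. The function names are hypothetical, and the sketch assumes that the source and destination iterations of the dependence are already known.

```python
def distance_vector(source, destination):
    """Distance vector I2 - I1 for a dependence from iteration I1 to I2."""
    return tuple(d - s for s, d in zip(source, destination))


def direction_vector(distance):
    """Signs of the distance vector: '<' forward, '>' backward, '=' same instance."""
    return tuple('<' if d > 0 else '>' if d < 0 else '=' for d in distance)


# A value written in iteration (2, 3) and read in iteration (3, 3),
# as happens for a reference a[i-1, j] in a statement that writes a[i, j].
dist = distance_vector((2, 3), (3, 3))
direction = direction_vector(dist)
```

Here `dist` is `(1, 0)` and `direction` is `('<', '=')`, so the dependence is carried forward by the outer loop only.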
Algorithms to compute the program dependence graph can be found in [Wolfe82, AlKe84a, Wang85b,
BuCy86, Bane88, Wolfe89, Zima90]. Interprocedural dependence analysis can be found in [Allen74, Bane76,
Hecht77, Barth78, GrWe76, KaUl76, Myer81, CoKe88, LiYe88a, LiYe88b].
In figure 2.6, we show the program dependence graph of the following sample program. The labels of




    for i in 1 .. N do
      for j in 1 .. M do
S1:     a[i,j] := (a[i-1,j] + a[i,j-1]) / 2;
      end for
    end for
(a) A sample program.
(b) The program dependence graph for (a): dependence arcs on S1 labeled D(<,=) and D(=,<).
(c) The expanded data flow graph of loop instances for the sample program: an N x M grid of loop
instances L(1,1) .. L(N,M) in which each instance L(i,j) is connected to L(i+1,j) and L(i,j+1).
Figure 2.6. A sample Blaze program, its dependence graph and expanded dependence.
2.3.2.1. Representation of Program Dependence Graph
There are two kinds of dependencies in the program dependence graph (that is generated by the dependence
decision algorithm): those that are proven to exist ("proven dependencies") and those that are assumed
to exist ("assumed dependencies"). The "assumed dependencies" are included because the decision algorithm
fails to prove the non-existence of the dependencies. For example, most compilers assume the existence
of the dependence when they encounter one of the following situations: a subscript of an array reference is
non-linear with respect to loop indices, a subscript contains variables, or there are variables in loop bounds.
Existing compilers make no distinction between "proven dependencies" and "assumed dependencies." Consequently,
the generated dependence graph contains many dependence relations that do not exist in the program.
These "false dependencies" may prevent the compiler from performing certain transformations or may
even sequentialize the program. This problem may be solved by generating run-time tests to validate the
existence of the dependencies at run time. However, run-time tests are expensive, and when dependencies
are generated at a different phase, tests that have been done in the dependence analysis are often repeated at
run time. We solve this problem by introducing a generalized representation for the program dependence
graph. In our representation scheme, the type of a data dependence is stored with the dependence. For an
"assumed dependence," conditions for it to exist are also stored. Also, a confidence factor is assigned to each
dependence. The stored conditions can be used to minimize the run-time tests when such tests are warranted.
The confidence factor can help the inference engine of the optimizer to make better decisions. Both types of
information can be provided by a modified dependence decision algorithm. Other information can also be
stored, and this set of information is collectively called the attributes of the dependence.
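One way to picture the attributes of a dependence is as a record attached to each arc of the graph. The Python sketch below is only an illustration under assumed names: the `Dependence` class, its fields, and `runtime_tests` are hypothetical, and the conditions are plain strings rather than whatever internal form the actual system uses.

```python
from dataclasses import dataclass, field


@dataclass
class Dependence:
    source: str                      # component the arc leaves, e.g. "S1"
    sink: str                        # component the arc enters, e.g. "S2"
    kind: str                        # "flow", "anti", "output" or "input"
    proven: bool                     # False for an assumed dependence
    conditions: list = field(default_factory=list)  # when an assumed one exists
    confidence: float = 1.0          # confidence factor for the inference engine


def runtime_tests(dependences):
    """Collect the distinct conditions of assumed dependences only,
    so that each run-time test is generated at most once."""
    tests = []
    for dep in dependences:
        if not dep.proven:
            for cond in dep.conditions:
                if cond not in tests:
                    tests.append(cond)
    return tests


deps = [
    Dependence("S1", "S2", "flow", proven=True),
    Dependence("S2", "S3", "anti", proven=False, conditions=["k != 0"], confidence=0.5),
    Dependence("S1", "S3", "output", proven=False, conditions=["k != 0"], confidence=0.3),
]
```

With this data, `runtime_tests(deps)` yields the single test `"k != 0"`, since proven dependences need no test and the shared condition is not repeated.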
2.4. Optimization of Parallelism with Program Transformations
2.4.1. Abstraction of Program Parallelism Based on Program Dependence Graph
Abstracting parallelism out of a program is the first step toward utilization of the program parallelism.
For compilers that allow assertions or interactive user involvement, program parallelism recognition is largely
the responsibility of the programmer. Automatic parallelism-optimizing compilers have
to discover the potential parallelism by examining the program. The program dependence graph represents the
constraints on the parallel execution of the program. The compiler usually realizes the parallelism of the program
by verifying that certain patterns of dependence do not exist in the program. For example, the following
vectorization test can be used to determine whether a loop can be vectorized.
Example 2.1 Vectorization test. Let PG(B,D) be a program dependence graph of a loop L with index i. If
there is a dependence arc from node S1 to node S2 whose direction for index i is '<' and S1 precedes S2 in the
topological sort order or S1 = S2, then the loop is not vectorizable.
Techniques that can be applied to abstract the parallelism of the program include pattern recognition,
graph-reduction, and heuristics.
The following Prolog procedure shows how a simple pattern recognition technique can be applied to discover
the fact that the program fragment is an inner product operation.
/* Stmt in loop Loop is an inner-product of vectors A1 and A2 */





    accumulateOver(Stmt, '+', Var, Expr),
    expression(Expr, '*', A1, A2),
    simpleIndexSubscript(A1, I),
    simpleIndexSubscript(A2, I).
What the above procedure shows is that the pattern of an inner-product is a statement inside a loop
whose index is I, and the loop can be distributed into the statements if there is more than one statement in the
loop. The computation is to multiply elements of two arrays, and the results are accumulated into the variable
on the left hand side of the statement. Furthermore, the subscripts of the arrays that contain index I are simple
subscripts (contain only variable I).
The pattern defined by the above program is rather general. For instance, all program fragments in
figure 2.7 except figure 2.7(e) satisfy this pattern.
The program in figure 2.7(e) fails the inner-product test because the accumulation of the pair-wise multiplications
is in the outer loop instead of the inner loop. By interchanging the execution order of the loops, the
program can be converted into inner-product operations. This situation can be covered by the following rule,
which says that if the statement is in a nested loop and the loop that the accumulation is based on can be moved
to be the innermost, then the program contains an inner-product.
(a) for i in 1 .. n do
      x := x + a[i] * b[i];
(b) for i in 1 .. n do
      for j in 1 .. m do
        y[i] := y[i] + a[i,j] * x[j];
(c) for i in 1 .. n do
      for j in 1 .. m do
        for k in 1 .. l do
          c[i,j] := c[i,j] + a[i,k] * b[k,j];
(d) for i in 1 .. n do
      for j in 1 .. m do
        b[i,j] := a[i,j] ** 2;
        y[i] := y[i] + a[i,j] * x[j];
        c[i] := x[j] + y[i];
      end for
(e) for j in 1 .. m do
      for i in 1 .. n do
        y[i] := y[i] + a[i,j] * x[j];
Figure 2.7. Example of program fragments that contain inner-products.
generalInnerProduct(NestedLoops, Loop, Stmt, A1, A2) :-
    interchangableLoopOrder(NestedLoops, NewOrder),
    innermostLoop(NewOrder, Loop),
    innerProduct(Loop, Stmt, A1, A2).
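The same pattern test can be sketched outside Prolog. The Python version below is a simplified stand-in rather than the original procedure: statements are encoded as nested tuples (a hypothetical encoding), an array reference is `("ref", name, subscripts)`, and the accumulation check assumes the left hand side must not be indexed by the loop variable.

```python
def is_inner_product(stmt, index):
    """Return (A1, A2) if stmt has the form  lhs := lhs + A1[..] * A2[..]
    accumulated over loop variable `index`, otherwise None."""
    kind, lhs, rhs = stmt
    if kind != "assign" or rhs[0] != "+":
        return None
    _, left, right = rhs
    if left == lhs:
        product = right
    elif right == lhs:
        product = left
    else:
        return None                      # not an accumulation into lhs
    if index in lhs[2]:
        return None                      # lhs varies with the loop index
    if product[0] != "*":
        return None
    _, a1, a2 = product
    for ref in (a1, a2):
        simple = ref[0] == "ref" and all(isinstance(s, str) for s in ref[2])
        if not simple or index not in ref[2]:
            return None                  # subscripts must be simple and use index
    return a1[1], a2[1]


# Statement of figure 2.7(b) and (e):  y[i] := y[i] + a[i,j] * x[j]
y = ("ref", "y", ("i",))
stmt = ("assign", y, ("+", y, ("*", ("ref", "a", ("i", "j")), ("ref", "x", ("j",)))))
```

Calling `is_inner_product(stmt, "j")` recognizes the pattern, while `is_inner_product(stmt, "i")` does not, which mirrors why fragment (e) only matches after the j loop has been moved innermost.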
After the potential inner-product loops and statements are identified, the statements can be split from the
loops and translated into inner-product statements. Figure 2.8 shows the programs that result from translating the
computation into explicit inner-product operations. Note that in (a), (b), and (c), the programs are simply
translated into inner-products. But in (d), the statement y[i] := y[i] + a[i,j] * x[j]; is first swapped with the
statement b[i,j] := a[i,j] ** 2; to separate it from the rest of the loop; then the j loop is distributed into the
two blocks of statements and the first generated j loop is translated into an inner-product. And (e) is obtained by
first interchanging the loops to move the j loop to the innermost position and then translating it into an inner-product.
(a) x := innerProduct(a, b, 1, n)
(b) for i in 1 .. n do
      y[i] := innerProduct(a[i,*], x[*], 1, m);
(c) for i in 1 .. n do
      for j in 1 .. m do
        c[i,j] := innerProduct(a[i,*], b[*,j], 1, l);
(d) for i in 1 .. n do
      y[i] := innerProduct(a[i,*], x[*], 1, m);
      for j in 1 .. m do




(e) for i in 1 .. n do
      y[i] := innerProduct(a[i,*], x[*], 1, m);
Figure 2.8. Result of translating the statements into inner-products.
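To make the equivalence behind translation (e) concrete, the following Python sketch runs the loop nest of figure 2.7(e) next to its inner-product form and compares the results. The function names are hypothetical, 0-based Python lists stand in for the Blaze arrays, and y is assumed to start at zero.

```python
def inner_product(u, v):
    """Sum of pair-wise products of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))


def fragment_e(a, x):
    """Figure 2.7(e): accumulation driven by the outer j loop."""
    n, m = len(a), len(x)
    y = [0] * n
    for j in range(m):
        for i in range(n):
            y[i] = y[i] + a[i][j] * x[j]
    return y


def translated_e(a, x):
    """Figure 2.8(e): after loop interchange, one inner-product per row."""
    return [inner_product(row, x) for row in a]


a = [[1, 2], [3, 4], [5, 6]]
x = [10, 1]
```

Both `fragment_e(a, x)` and `translated_e(a, x)` give `[12, 34, 56]` for this input; with floating point data the two orders of accumulation could differ in rounding.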
The techniques that we have just described to translate the program fragments into inner-product operations
are called program transformations. Another technique to improve the parallelism of the program is to
substitute the program fragment with a previously optimized program for the same target machine. These two
approaches will be discussed in more detail in the next section.
2.4.2. Optimization of Program Parallelism
The program-restructuring process changes a program into a semantically equivalent representation of
the program. The purpose of program restructuring is to modify the structure of the program to utilize the
parallelism potential of the target machine.
Definition 3.1 A program is said to be semantically equivalent to another program if their input-output
behaviors are the same.
Definition 3.2 The potential parallelism of a program on a target machine is defined to be the best parallel
execution order of the program on the target machine that is semantically equivalent to the original program.
Definition 3.3 The actual parallelism of a program on a target machine is the degree of parallelism that the
current form of the program (i.e., with no modifications to the structure of the program) can achieve on the target
machine. In other words, the actual parallelism is a measure of the match between the current form of the
program and the target machine. The major factors in deciding the actual parallelism of a program on the target
machine are data and control dependencies and patterns of data references.
The goal of the parallelism optimization process is to find a semantically equivalent program so that its
actual parallelism on the target machine is as close to the potential parallelism of the original program as possible.
2.4.2.1. Parallelism Realization and Program Restructuring
The process of optimizing program parallelism consists of two steps: the program-restructuring process
and the program-realization process. The program-restructuring process improves the potential parallelism of
the program by modifying the structure of the internal program representation. The program-realization process
maps the program onto the computational model of the target machine by effectively utilizing the concurrency
potential of the machine. The program-restructuring process optimizes the potential parallelism of the
program for a target machine, while the program-realization process actually maps the program onto the target
machine to utilize the actual parallelism of the program on the target machine.
Based on the dependence constraints of the program and the feature descriptions of the target machine,
the program-realization process partitions the program into operation blocks and composes them to form vector
operations, micro tasks, tasks, and processes. In the abstract, the program-realization process can be viewed as
a function:
$\mathit{Program\_realization} : \mathit{Computational\_model} \times \mathit{Program}_s \rightarrow \mathit{Program}_c$
where the elements of $\mathit{Program}_s$ are program dependence graphs that are semantically equivalent to the original program,
$\mathit{Computational\_model}$ is the computational model of the target architecture (note that target architecture here is
an abstract concept which does not need to have a physical implementation in hardware), and the elements of
$\mathit{Program}_c$ are program dependence graphs that are annotated with parallelism and run-time information such
as processor assignments, synchronization, and vectorizable or parallelizable loops.
Strictly speaking, the program-realization process does not improve the true parallelism of the program.
It simply takes the current form of the computation, as represented by the program dependence graph, and
applies a mapping to convert the program into a parallel form. For example, for multiprocessor systems the
outermost parallelizable loop can be used to generate tasks. For machines with vector capability, the innermost
loop can be vectorized (if it is legal to do so). The synchronization technique that is provided by the computational
model is used to satisfy any data dependence not already satisfied by sequential execution of parts of the
program. More sophisticated modification of the program (such as interchanging the nested loops or even
non-nested loops) is the task of program restructuring. Many existing compilers for parallel architectures provide
program parallelism-realization capability but not necessarily the program parallelism-optimization capability.
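A minimal sketch of such a realization mapping is given below in Python. It is not the realization process of this framework: it assumes that the dependences of a loop nest are summarized by direction vectors containing only '<', '=' and '>', and it uses the common sufficient condition that a loop carrying no dependence may run in parallel. All names are hypothetical.

```python
def carries(direction, level):
    """A loop at `level` carries a dependence if all outer entries are '='
    and the entry at that level is '<'."""
    return all(d == '=' for d in direction[:level]) and direction[level] == '<'


def parallelizable_levels(directions, depth):
    """Loop levels (0 = outermost) that carry no dependence."""
    return [k for k in range(depth)
            if not any(carries(d, k) for d in directions)]


def realize(directions, depth):
    """Annotate a loop nest: outermost parallelizable loop becomes tasks,
    the innermost loop is vectorized only if it carries no dependence."""
    levels = parallelizable_levels(directions, depth)
    return {
        "task_loop": levels[0] if levels else None,
        "vector_loop": depth - 1 if (depth - 1) in levels else None,
    }
```

For a nest whose only dependence has direction ('=', '<'), such as y[i] := y[i] + a[i,j] * x[j] with i outermost, `realize` marks the outer loop for tasks and leaves the inner loop unvectorized; improving on that is left to restructuring.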
2.4.2.2. Two Approaches for Program Parallelism Optimization
The program-restructuring process maps a program into a functionally equivalent program by altering the
control or data structures of the program. The goal of the restructuring is to properly change the order, the
decomposition, and the allocation of the data and control structures of the program so that the achievable
parallelism can approach the potential parallelism of the program on the target machine.
There are two different approaches to restructuring the programs: the pre-optimized algorithm substitution
approach and the program transformation approach. The pre-optimized algorithm substitution approach
replaces the program fragment under consideration with an algorithm in the library that has been pre-optimized
for the particular target machine. The precondition for this substitution is that the functionalities of the program
fragment need to match those of the pre-optimized algorithm. The program transformation approach
improves the match between the program level parallelism and the machine level parallelism in a stepwise
refinement fashion. The program transformation approach breaks the program restructuring process into basic
steps that are called program transformations. Each program transformation alters the structure of a program
segment in order to improve the parallelism of the program. It involves techniques such as changing the
instruction execution order (by forward substitution, statement reordering, etc.), modifying program control (by
loop interchange, loop distribution, etc.), and eliminating unnecessary data accesses and modifications (by data
localization, block transfer, cache optimization, dead code elimination, etc.).
Since a transformation only modifies the program structure slightly, it is possible to define the conditions
which the program must satisfy so that the resulting program will have the same input-output behavior as the
original program. These conditions are called the preconditions of the transformation. If a program satisfies
the preconditions of a transformation, the transformation is said to be applicable to the program.
Optimization of a program requires careful selection of a sequence of applicable transformations based
on the understanding of the program and the target machine. The composition function of the transformations
maps the original program into the final form and determines the net effect of the complete sequence of the
transformations.
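The idea of a transformation guarded by preconditions, and of composing a sequence of them, can be sketched as follows. This Python fragment is an illustration only: the `Transformation` record, the dictionary encoding of a two-deep loop nest, and `apply_sequence` are hypothetical, and the precondition shown is the standard direction-vector legality test for loop interchange rather than a rule taken from this report.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transformation:
    name: str
    precondition: Callable[[dict], bool]   # is the transformation applicable?
    apply: Callable[[dict], dict]          # returns the restructured program


def lex_nonnegative(direction):
    """True if the first non-'=' entry is '<' (or all entries are '=')."""
    for d in direction:
        if d == '<':
            return True
        if d == '>':
            return False
    return True


def interchange_legal(program):
    # Interchange is legal if no permuted direction vector becomes negative.
    return all(lex_nonnegative(d[::-1]) for d in program["directions"])


def interchange(program):
    return {"loops": program["loops"][::-1],
            "directions": [d[::-1] for d in program["directions"]]}


def apply_sequence(program, sequence):
    """Compose transformations, skipping any whose precondition fails."""
    applied = []
    for t in sequence:
        if t.precondition(program):
            program = t.apply(program)
            applied.append(t.name)
    return program, applied


loop_interchange = Transformation("loop interchange", interchange_legal, interchange)
```

A real optimizer would choose among many such records; the point here is only that applicability is checked before each step and that the net effect is the composition of the steps that were applied.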
2.4.2.3. Pre-optimized Algorithm Substitution Versus Program Transformation
The major difference between the pre-optimized algorithm substitution and program transformation
approaches lies in the granularity of the changes. Program transformation modifies the program structure step
by step, whereas the pre-optimized algorithm substitution takes the "wholesale" approach by replacing a whole
program fragment by a pre-optimized algorithm in the library.
The major problem with the pre-optimized algorithm substitution lies in the difficulty of recognizing the
opportunity for algorithm substitution. Deciding whether two programs are semantically equivalent is undecidable
in general, and it is usually very difficult if not impossible for a software system to match two programs. This approach is
used by some parallel programming environments since the programmer can provide the needed information
that a software system cannot obtain. Another drawback of this approach is that in most systems only a small
set of pre-optimized algorithms is available. This is due to the high cost of constructing the library of
pre-optimized algorithms, which also limits the scope of the application. On the other hand, once the opportunity
for algorithm substitution is recognized, this approach can usually achieve very good results since special
attention has been paid to the efficiency of the pre-optimized algorithms. The FAUST project [GGJMG89] at
CSRD, University of Illinois takes the expert systems approach to choose the appropriate algorithms that
match the user program.
Optimizing parallel compilers usually take the program transformation approach, since compilers are particularly
well-suited for mechanical analysis of the program dependence graph and for verifying the pre-conditions
of the transformations. However, the problem with this approach is that the decision for restructuring
the program structure is fragmented. The effect of a transformation on the utilization of the parallelism
may not be clear at the time when the transformation is selected. Global analysis of the transformations is
needed. Selecting the right sequence of transformations is highly target architecture and program dependent and
is very difficult. Finding a generic framework for selecting transformations is even more difficult, and most
experts rely on heuristics. Another problem in building parallel compilers is that most studies of the transformations
center on the theories about the pre-conditions of the transformations and algorithms for carrying out
the transformations. Few efforts have studied the practical problems of program transformations such as recognizing
the opportunities for applying the transformations or methodologies for selecting the best transformation
among several applicable transformations. This may be the main reason that, despite the development of so
many different program transformation techniques during the last decade, the performance of parallel compilers
is still disappointing.
In this research, we take the following stance concerning the two different approaches:
• The two approaches are not necessarily mutually exclusive and they should be applied together to overcome
the shortcomings of each other.
• At the compiler level, the pre-optimized algorithm substitution can be applied to some well-defined problems,
such as those functions defined in BLAS[I,II,III]. More complicated algorithm substitution is possible,
but recognizing the opportunity for application is usually too difficult for the compiler to handle.
For programming environments, this requirement is less critical since the user can provide some information
to ease the difficulty the environment faces in recognizing the chance for applying pre-optimized
algorithms.
• If the opportunity for algorithm substitution is found, it should take precedence over program transformation.
• It is impractical to optimize every useful algorithm for all different target machines. Therefore, for similar
architectures, we only need to pre-optimize an algorithm for a particular type of machine and equip
the system with rules to select the pre-optimized algorithm that is optimized for a similar machine and
apply a set of program transformations to fine-tune the pre-optimized algorithm to fit the actual target
machine. This approach will be discussed in more detail in Chapter 7.
2.4.3. Classifications of Program Transformation Techniques
Program transformation techniques change the structure of the program by modifying control
dependence(s), data dependence(s), the decomposition and allocation of instructions and data, data reference
patterns, or combinations of the above. Program transformation techniques can be classified into two
categories: machine-independent and machine-dependent transformations. This classification is not disjoint,
and the same transformation may be included in both categories for different objectives.
The machine-independent program transformations are used to improve the general parallelism of the
program, such as breaking dependence cycles, improving locality, and eliminating redundant instructions. More
specifically, transformations that can be used to break dependence cycles include scalar expansion, variable
renaming, statement splitting, and forward substitution. Transformations to improve locality include statement
reordering, array gathering, subscript blocking, array reshaping, and loop interchanging. Due to the difference
in architectural requirements, it may be wise to delay the locality improvement until the target machine is
chosen. However, some obvious improvement can still be achieved without knowing the target machine since
locality is almost always good for parallel execution. Transformations to eliminate redundant instructions
include code motion, dead-code elimination, and loop merging. Most traditional compiler optimization techniques
to remove redundant instructions can also be applied.
Machine-dependent transformations make use of special knowledge about the underlying parallel
architecture to improve the match between the program and the target machine. Machine-dependent transformations
can be used to improve vectorization and parallelization, decrease synchronization time, minimize data
access time, schedule execution, distribute data, etc. Transformations to improve vectorization include loop
distribution, loop merging, loop blocking, loop unrolling, loop collapsing, loop interchanging, loop spreading,
Boolean recognition, scalar expansion, if matching, and if removal. Transformations to improve parallelization
include loop interchanging, loop blocking, loop merging, loop distribution, loop collapsing, loop coalescing,
vector scalarizing, code motion, high level loop spreading, low level operation spreading, and doacross
scheduling. Transformations to do memory optimization and minimize synchronization include array decomposition
and distribution, array copying, synchronization, cache utilization, array reshaping, array block
transfer, array gathering, array scattering, array scalarizing, and scalar expansion. Transformations to create
concurrent tasks include loop interchange, loop blocking, and cyclic loop blocking. Transformations to
schedule execution include high level loop spreading, doacross scheduling, self-scheduling, and run-time
scheduling. This classification is summarized in figure 2.9 below. A survey of many program transformation
techniques can be found in [Zima90].
Note that program transformations are merely mechanical techniques to change the structure of the program.
Inappropriate application of the transformations is likely to cause more harm than good. To have a
positive effect on performance, the transformations must be carefully chosen based on clear objectives and
well-thought-out heuristics to take full advantage of the parallelism provided by the target architecture.
The capability of a compiler to optimize the program parallelism is usually determined by the richness
and effectiveness of the heuristics that the compiler uses. Due to the lack of systematic research in this area,
there is no readily available resource of heuristics for parallel compiler writers to employ. Consequently, only
scattered heuristics are utilized in most parallel compilers. What is needed is to integrate the heuristics for
applying the transformations into the knowledge database so that the inference engine of the compiler can
utilize them systematically. A few dozen well-known heuristics are described in [Allen83, Wolfe82, WaGa89].
Some new heuristics for utilizing these transformations to optimize the programs on different classes of target
machines are presented in chapters 7 and 8.
2.5. Summary
In this chapter, we discussed some fundamental problems about parallel programming. We first discussed
levels of parallelism and how computations are mapped from one level to another. We described some
basic concepts of machine parallelism and computational models. We also discussed program-dependence
relations and issues in representing the program dependence graph.
We talked about the abstraction of program parallelism and gave some simple examples. We then discussed
two different ways to improve program parallelism and some theoretical foundations for program
transformation. In the next chapter we will discuss a framework for integrating heuristics, machine features,
and performance prediction into the decision making of the program optimization process.
Figure 2.9. Classification of program transformation techniques based on objectives.
CHAPTER 3
TOWARD INTELLIGENT PARALLEL COMPILERS
3.1. Need for Intelligence in Parallel Compilers
For a compiler to generate reasonable parallel code for a parallel architecture, it has to be able to recognize
the parallelism of the hardware architecture, abstract the parallelism available in the user program, and use
suitable algorithms or heuristics to match the program parallelism with the machine parallelism. These tasks
require a fairly high level of intelligence. Unfortunately, most parallel compilers today lack such needed intelligence.
Although a few existing parallel compilers exhibit limited degrees of intelligence, none is capable of
generating acceptable code for parallel computers without extensive user guidance.
Below we will use two simple examples to demonstrate these difficulties. The first example involves
parallelizing a matrix-vector product program which nicely illustrates the complexity of the problem because




    y[i] = y[i] + a[i,j] * x[j];
  endfor;
endfor;
We assume that the resulting vector y has been previously initialized to zero. We seek to transform this
program for three different target architectures: the BBN Butterfly, the Alliant FX/8, and the Pringle. These
three parallel computers support different views of parallelism and represent a wide variety of parallel
architectures.
The BBN Butterfly supports a model of computation based on synchronized access to shared data in a
global address space. The memory modules and processors are connected to a multiple stage network. The
cost ratio of global and local memory accesses is about two to one.
The Alliant FX/8 is an eight processor shared memory system where each processor has a vector pipe.
The processors are connected by a crossbar switch to a shared cache, which, in turn, is connected to a shared
memory through a high speed bus. Lightweight threads are supported for programmer-directed task scheduling.
The Pringle† [KGSF84, KWGCS84] is a 64 processor, non-shared memory CHiP prototype system. The
Pringle demands that computations be viewed as a static network of communicating tasks that operates as a
data-driven systolic array. It is a good representative of many non-shared memory systems such as the Intel
iPSC/2 or nCUBE 2. The major difference between the Pringle and these other machines is that the latter two
machines are based on the hypercube architecture, which is a special case of the connections possible on the
reconfigurable network of the Pringle. Also, the cost ratio of communication (cost of sending one floating
point number) over computation (cost of one floating point operation) is about 1 for the Pringle and 200-300 for the
two hypercubes.
By using many different heuristics, we derived the three programs shown in figure 3.1 that are believed
to be optimal for the BBN Butterfly, the Alliant FX/8, and the Pringle, respectively. The block_transfer and
† Developed originally by L. Snyder and D. Gannon for an ONR sponsored project at Purdue University and
Washington University.
inner_product used in figure 3.1 (a) are pseudo names of the built-in operations that the BBN Butterfly provides.
Ch_x used in figure 3.1 (b) is the channel variable generated by the compiler to pipeline the data transfer in
the Pringle. A more detailed explanation of these programs and how they are derived is given in chapter 8.
forall k in 1 .. P loop
  block_transfer(x, x_local, sizeof(x));
endforall;
forall i in [1 .. N] loop
  block_transfer(a[i, *], a_local, sizeof(a[i, *]));
  y[i] := inner_product(a_local[*], x_local[*]);
endforall;
(a) The BBN Butterfly version.
forall k in 0 .. p-1 loop
  local tmp;
  for j in 1 .. m do
    tmp = if (k == 0) then x[j] else Ch_x[k-1];
    Ch_x[k] = tmp;
    for i in [k*n/p .. (k+1)*n/p] do




(b) The Pringle version.
forall k in 0 .. n/32-1 loop
  local k1, k2 : int;
  k1 = k * 32 + 1;
  k2 = (k+1) * 32;
  for j in [1 .. m] do
    y[k1 .. k2] sum= a[k1 .. k2, j] * x[j];
  endfor;
endforall;
(c) The Alliant FX/8 version.
Figure 3.1. Three different versions of the matrix vector multiply program (for Butterfly, Pringle and Alliant,
respectively).
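The structure of the Alliant version, strips of 32 rows with the j loop outside a vector operation, can be checked against the original loop nest with a short Python sketch. The function names are hypothetical, Python lists with 0-based indices stand in for the Blaze arrays, and the handling of a final partial strip is an added assumption, since the figure presumes n is a multiple of 32.

```python
def matvec(a, x):
    """Original loop nest: y[i] = y[i] + a[i][j] * x[j], y initialized to zero."""
    n, m = len(a), len(x)
    y = [0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y


def matvec_strips(a, x, strip=32):
    """Strip-mined form: each strip of rows could be one parallel task, and
    the innermost i loop stands in for a single vector operation."""
    n, m = len(a), len(x)
    y = [0] * n
    for k in range(0, n, strip):
        hi = min(k + strip, n)          # last strip may be shorter (assumption)
        for j in range(m):
            for i in range(k, hi):
                y[i] += a[i][j] * x[j]
    return y


a = [[i + j for j in range(3)] for i in range(5)]
x = [1, 2, 3]
```

For this integer input both functions return the same vector, since each y[i] still accumulates its terms in increasing j order.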
Each of these programs was derived from the original program by a sequence of correctness-preserving
transformations. The state-space in terms of the program transformations for this program can be represented
in a tree structure diagram, which we call a decision tree for program transformations. The paths that lead
from the original program to the above three programs are shown in figure 3.2. This diagram shows two
interesting facts:
1. The complexity of the problem. The state space search diagram that we show is only a fraction of the
whole tree that can be derived from the simple program.
2. The architectural dependence. The program transformation sequences for these three architectures are
dramatically different. It is obvious that the differences in the sequences of the transformations are due
to the differences in the architectures.
Some interesting questions arise from this example. What features of the architectures would cause such
differences in selecting the program transformation sequences? How can these features be identified and
integrated into the compilers? How does a compiler find these transformation sequences based on its understanding
of the architectural features? What kinds of intelligence does the compiler need to make such decisions?
How can we program the compiler so that it will have such intelligence? We will come back to this
example to explain how architectural differences affect the program performance and how our system utilizes
various heuristics to derive transformation sequences based on the features of the program and the architecture.
[Decision tree diagram; node labels include Algorithm substitution(L(i), InnerProduct), Code motion(K1=, K2=, L(j)), and The Pringle version.]
Figure 3.2. The subtree of the decision tree for program transformation that shows the transformation
sequence for the four architectures.
If programs could be optimized for an architecture based on a particular program transformation sequence,
this situation might not be too bad, since we could afford to spend a significant amount of resources, once and for
all, to find this powerful sequence of transformations for this architecture. Unfortunately, the optimal sequence
of program transformations depends not only on the target architecture but also on the program being optimized.
Also, the behavior of the program depends not only on the architecture and the program, but also on the
input to the program. However, the static analysis of the compiler will not be able to do a good job on the
input dependent behavior of the program. For this kind of program, the compiler should minimize the conditions
that a computation depends on and generate some run-time tests to select the better approach.
The second example that we present is an attempt to parallelize a program to compute a fractal image
called the Mandelbrot set on a 64 processor nCUBE 2. The Mandelbrot set arises from sequences of complex
numbers defined inductively by the relation $z_{n+1} = z_n^2 + c$, where $c$ is a complex constant. The behavior
of the sequences depends on the parameter $c$ and an initial value $z_0$. The Mandelbrot set is obtained by
fixing $z_0$ to 0 and varying $c$. The nCUBE 2 is a distributed memory MIMD machine that utilizes the
message-passing paradigm for inter-processor communication.
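The recurrence can be illustrated with a minimal Python sketch of the per-point computation. The function name `escape_count`, the iteration cap `max_iter`, and the escape test |z| > 2 are assumptions made for illustration (the escape radius 2 is the customary criterion and is not stated in this text); this is not the program used in the experiment.

```python
def escape_count(c: complex, max_iter: int = 256) -> int:
    """Iterate z <- z*z + c starting from z0 = 0.

    Returns the number of steps taken before |z| exceeds 2, or max_iter
    if the sequence stays bounded for the whole budget (assumed cap).
    """
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:      # assumed escape criterion
            return n
        z = z * z + c
    return max_iter
```

Points whose sequences stay bounded consume the full iteration budget while others exit after a few steps, so the cost per pixel varies widely across the image.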
The Mandelbrot problem is a so-called "embarrassingly parallel" problem, because each pixel in the picture
can be computed independently. The algorithm itself contains absolutely no interprocessor communication.
Even though this is an ideal program to parallelize, we will demonstrate that an unintelligent
parallelization can perform badly due to poor load balancing.
The sample image that we used for the experiment is shown in figure 3.3. This image was chosen
because the computation is not well distributed in the image.
Running the original program sequentially with the input we describe above on the nCUBE 2 takes
1045.737 seconds (see column one in table 3.1). Since the program is a perfectly nested loop (looping through
the rows and pixels), it is obvious that the program can be optimized by parallelizing the outermost loop. A
simple method for parallelizing the loops is to block the outermost loop to form P parallel tasks and distribute
the tasks to processors. The speedup we get using 32 processors is 16.601 (63.101 seconds) and the
speedup for 64 processors is 32.881 (31.859 seconds). This result is certainly not satisfying. The major
problem with this approach is the imbalance in computation loads, which is clearly shown in the variance of the
elapsed time among all processors shown in columns 2 and 5 in table 3.1.
A different approach that uses dynamic task scheduling to balance the load shows a very interesting
result (figure 3.4). By distributing tasks of 1024 pixels each, the resulting program is well balanced for
different sizes of cubes (see the variance of elapsed time in columns 3 and 6 in table 3.1).
Figure 3.3. The Mandelbrot set computed with p = [-1.781 .. -1.764], q = [0.0 .. 0.013], where the complex
plane c = p + i * q. The black region in the upper-right corner represents the most
computation-intensive area.
Table 3.1. Performance results of using cyclic, dynamic, and block allocations for the Mandelbrot set problem
(number of processors = 32 and 64, task size = 1000 pixels).

                                          Number of processors
                   1         32(block)  32(dynamic)  32(cyclic)  64(block)  64(dynamic)  64(cyclic)
Max. elapse        1045.737  63.101     40.471       33.492      31.859     64.968       17.201
Max. idle          4.890     0.335      8.790        0.286       1.92184    1.066        47.558
Vari. of elapse    0.000     307.411    0.003        0.228       67.379     0.231        0.038
Stand. deviation   0.000     17.533     0.051        0.478       8.208      0.48         0.196
Speedup            1.000     16.601     25.839       31.271      32.881     16.096       60.901
Figure 3.4. Performance results of dynamic scheduling using different sizes of tasks and numbers of
processors for the Mandelbrot set problem.
As can be seen in the timing results in table 3.1, for the 32-processor case, the maximum elapsed time for
dynamic task-scheduling is better than for the static blocking method. However, by doubling the processors to
64 processors, the result is actually slower than the 32 processor case. This slowdown is caused by the
increase in the communication cost for larger cubes and the overhead of assigning tasks at run time.
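The trade-off between balance and run-time assignment overhead can be seen in a small simulation. The sketch below is a simplified model, not the scheduler used in the experiment: the function name `dynamic_schedule` and the fixed per-task `overhead` parameter are assumptions, and the growth of communication cost with cube size is not modeled.

```python
import heapq

def dynamic_schedule(task_costs, p, overhead):
    """Simulate handing each task, in order, to the processor that frees up first.

    Every assignment pays a fixed `overhead` (an assumed stand-in for the
    cost of requesting and receiving a task). Returns the maximum elapsed time.
    """
    finish = [0.0] * p            # current finish time of each processor
    heapq.heapify(finish)
    for cost in task_costs:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + overhead + cost)
    return max(finish)
```

With zero overhead the uneven tasks `[4, 1, 1, 1, 1]` on two processors finish at time 4; with an overhead of 0.5 per task the same schedule finishes at 6.0, which shows how smaller tasks improve balance only as long as the per-task cost stays small.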
A much better result is obtained for this particular example by blocking the outermost loop cyclically to
form the tasks. The speedup for this method is 31.277 for the 32 processors and 60.901 for 64 processors (see
columns 4 and 7 in table 3.1).
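The difference between block and cyclic allocation can be sketched on a synthetic workload. Here "cyclic" is read as assigning row i to processor i mod P, which is an assumption about the method; the function names and the linearly increasing row costs are hypothetical and are not the measured data from table 3.1.

```python
def block_partition(costs, p):
    """Per-processor load when each processor gets one contiguous chunk of rows."""
    chunk = -(-len(costs) // p)   # ceiling division
    return [sum(costs[i * chunk:(i + 1) * chunk]) for i in range(p)]

def cyclic_partition(costs, p):
    """Per-processor load when row i goes to processor i mod p."""
    return [sum(costs[i::p]) for i in range(p)]

# Assumed workload: 64 rows whose cost grows steadily, so the expensive
# rows are clustered at one end, loosely mimicking figure 3.3.
costs = [1 + i for i in range(64)]
block = block_partition(costs, 4)
cyclic = cyclic_partition(costs, 4)
```

For this workload the heaviest block-allocated processor carries a load of 904 out of a total of 2080, while the heaviest cyclic-allocated processor carries 544, close to the ideal of 520. Cyclic allocation spreads the clustered expensive rows without any run-time scheduling cost.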
This example shows the characteristics of parallel programming: (1) Tedious details cannot be
overlooked; (2) Gain in one area may cause loss in other areas; (3) Even when the sequence of transformations is
known (in this case, to parallelize the outermost loop), the method for applying the transformations can still be
very different. In this example, the compiler needs a fairly high level of intelligence and a good performance
estimation tool to figure out the optimal task size (or cube size) to parallelize such an "embarrassingly parallel"
problem. The final method is clearly better after the performance results are analyzed, but the challenge to
parallel compilers is: how does the compiler figure this out without actually running the program?
The above two examples not only demonstrate the difficulties of parallel program optimization, but they
also lead to the following observations:
1. Architectural characteristics have a significant impact on program optimization.
2. Different programs may require different sequences of program transformation to achieve optimal results
on the same architecture.
3. Runtime behavior of programs affects the balance of the load and the balance between the computation
and the communication, and cannot be ignored.
3.2. Methodologies for Improving Intelligence in Parallel Compilers
In this thesis, we coordinate three approaches to improve the intelligence of parallel compilers. These
three approaches are a new paradigm for program optimization, systematic utilization of heuristics, and
extensive application of AI techniques. We will discuss these three approaches in this chapter; the realization
of the paradigm will be discussed in the next chapter.
3.2.1. Paradigms for Program Parallelism Improvement
Parallel compilers use program transformation techniques to restructure the control and data structures of
the programs. Different sequences of program transformations lead to programs with different performance
characteristics. One of the major tasks of parallel compilers is to choose an appropriate sequence of program
transformations so as to effectively map a program onto the target machine.
The key to achieving intelligent behavior in parallel compilers lies in appropriate paradigms for program
optimization and sound methodologies for knowledge organization and integration.
Below we first examine existing models for selecting program transformation sequences and the prob-
lems with these approaches. We then introduce a new model called the feature-directed program optimization
model.
3.2.1.1. Six Models for Selecting Program Transformation Sequences
Most existing parallel program optimizers adopt one or more of the following models:
1. Compiler option model. The compiler provides command-line options for users to choose a sequence of
transformations among a set of pre-defined sequences or to specify a user-determined sequence of
transformations. The Paraphrase system [KKLW80, KKLPW81] from the University of Illinois is an
example of such a system.
2. Annotation model. The programmer directs the compiler to parallelize or vectorize certain program
components (usually loops) or to decompose or distribute data in certain patterns by annotating the user
program in the form of user directives or assertions. This model is supported by most parallel or vector
compilers to make up for the shortcomings of the compilers.
3. Predefined sequence model. The compiler builder finds one or more predetermined sequences of
transformations that are supposed to be optimal for the particular target machine and applies the fixed
sequence to all programs. If there is more than one pre-defined sequence to choose from, the user may
use either command-line options or assertions to select the alternative sequence to use.
Paraphrase [KKLW80] and many other parallel compilers support this model.
4. Interactive model. The compiler provides the programmer with an interactive programming environment
and a set of program transformation techniques for the user to direct the program restructuring process
step by step and view the intermediate results of the transformations. Experienced parallel programmers
may utilize this kind of compiler (usually referred to as a programming environment) to produce highly
optimized programs. PFC [AlKe84] from Rice University and Faust [GGJMG89] are two good examples
of this model.
5. Heuristics-driven model. The compiler chooses the program transformations on the basis of heuristics.
The quality of the knowledge base of the compiler determines the quality of the decisions that the
compiler makes. Most parallel compilers utilize some heuristics, but very few rely exclusively on heuristics;
none provide systematic processing of heuristics. In [Wang85], we presented a prototype expert system
implementation of a parallel compiler that relies on heuristics and some user interaction to direct all its
decisions.
6. Program-directed model. In this special case of the heuristics-driven model, the compiler selects
transformations based on the patterns of the program dependence graph of the program. An example of
this approach is presented in [WaGa89].
3.2.1.2. Analysis of the Models
One way to compare these models is to compare the effects of the paths that these models visit during
the program optimization process.
The predefined sequence model will only visit a few predetermined paths out of the many possible paths
in the decision tree. Due to the dynamism of the program behavior, the chance that the predefined sequence is
the optimal path is rather small. The philosophy behind this model is that it will achieve acceptable
performance without expensive analysis for the specific types of problems that the sequence is designed for.
The user annotation model and the interactive model rely on the user to select the transformations. Thus
the part of the decision tree that is visited depends on the experience of the user and how hard he or she tries
to optimize the program. The help the compiler provides is the mechanical program transformation and code
generation.
The interactive model is superior to the user annotation model or predefined sequence model in the sense
that programmers can base their decisions on the results of previous transformations and various programming
tools can be incorporated into the environment to help users understand the consequences of their decisions.
Some interactive parallel compilers also provide a limited degree of advice. However, the programmer is still
the one who is supposed to make all the hard decisions. The interactive model gives the user better control of
the optimization and parallelization but fails to remove the major burdens of parallel programming.
The heuristics-driven model is efficient in choosing the transformations. The number of transformation
paths that the model examines depends on the size and quality of the heuristics. Some heuristics are rough and
may cut off useful transformation paths accidentally. The quality of the compiler is determined by the richness
of its knowledge base. Another problem of this model lies in the difficulty of collecting heuristics. Much
effort has to be put into constructing a powerful knowledge base.
The program-directed model can be programmed to exploit the decision tree of program optimization
systematically. It may achieve good results because the transformation sequence is chosen based on the
structure of the program. However, the performance of a program depends heavily on the target architecture.
Machine features need to be integrated into this model. Otherwise, the compiler will be machine dependent.
The major problem with the program-directed model is that information that is not available at compile time
may hinder the program optimization process. Heuristics for handling these situations or run-time tests need to
be incorporated into the model to solve this problem. To summarize, none of the models alone has the
capability of solving the complex program optimization problem. New models that support systematic
analysis, incorporation of machine knowledge, and knowledge organization and integration need to be designed so
that intelligent parallel compilers can be built.
Why do most existing parallel compilers avoid systematic analysis and thus leave the hard decisions of
parallelism improvement to the user? Among other considerations, we weigh the following as the two most
significant factors in influencing the design decisions of compiler writers.
• Difficulty in obtaining expertise for parallel programming. Heuristics that application programmers use
are usually specific to the particular application and are not directly extendable to the general problems that
a compiler is facing.
• Too much computing power required for deciding on the optimal program-restructuring sequence at
compile time. The computation and analysis of the program dependence graph also require a significant
amount of computing power.
We will show below that with suitable infrastructure, the difficulties in obtaining optimizing expertise
can be overcome and automatic knowledge acquisition is possible. With the rapid advances in workstation
technologies and the call for utilization of parallel computing resources, using the fast and cheap processing power
of workstations to optimize programs for supercomputers diminishes the problem of needing too much
computing resources and makes this approach attractive. Also, by applying appropriate AI techniques, it is
possible to cut down the decision tree to minimize the cost of optimization. The bottom line is: the desire for
improving parallelism automatically should not be put off by pragmatic problems that can be solved by
suitable methodologies. In the next section, we will present a new program optimization paradigm to solve these
problems.
3.2.1.3. A New Paradigm for Improving Program Parallelism
As we discussed in the last section, a new program optimization paradigm is needed so that intelligent
behavior can be observed in parallel compilers. In this section, we introduce a new paradigm for improving
program parallelism that incorporates machine features into the decision-making process and provides a good
foundation for the organization and integration of system knowledge.
A parallel architecture can be characterized by machine features which are properties of the machine that
are related to the concurrent execution of user programs. Also, a program can be abstracted into a list of
program features such as patterns or functionalities of the computation. By analysis of the heuristics for
improving the program parallelism on the target machine, the heuristics can usually be attributed to certain
features of the program and the architecture. On the basis of this understanding, we introduce a new program
optimization paradigm that is called the feature-directed program optimization paradigm. Under this
paradigm, heuristics are encoded with features of the target machine and the program. The control process that
guides the program-restructuring process utilizes these heuristics to select the program transformations to
apply. The program-restructuring process is an iterative process of selecting and applying the program
transformation techniques to match a program to a particular parallel architecture. At each step of the process,
the program is analyzed, a set of applicable transformations is chosen and compared based on some metrics,
the most promising transformation is chosen and applied on the program, and the resulting program is
evaluated. This process is repeated until the resulting program is "satisfactory." A performance evaluation
unit can be defined from the match between the program and machine parallelism to serve as a metric for
evaluating the transformations.
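The iterative process described above can be sketched as a simple greedy loop. This is only an illustration under stated assumptions: the names `restructure`, `applicable`, `score`, and `threshold` are hypothetical, the "program" is an arbitrary value, and the greedy one-step choice stands in for the search and planning methods that the thesis develops in later chapters.

```python
from typing import Callable, List, TypeVar

Prog = TypeVar("Prog")  # stand-in for whatever program representation is used

def restructure(program: Prog,
                applicable: Callable[[Prog], List[Callable[[Prog], Prog]]],
                score: Callable[[Prog], float],
                threshold: float,
                max_steps: int = 100) -> Prog:
    """Repeatedly apply the most promising transformation until satisfied."""
    current = program
    for _ in range(max_steps):
        if score(current) >= threshold:        # result is "satisfactory"
            break
        # Analyze the program and apply each applicable transformation.
        candidates = [t(current) for t in applicable(current)]
        if not candidates:
            break
        best = max(candidates, key=score)      # compare by the chosen metric
        if score(best) <= score(current):      # nothing improves: stop
            break
        current = best
    return current

# Toy usage: the "program" is an integer, the two transformations are
# "add one" and "double", and the metric rewards closeness to 10.
result = restructure(1,
                     lambda p: [lambda x: x + 1, lambda x: x * 2],
                     lambda x: -abs(10 - x),
                     threshold=0)
```

In the toy run the loop passes through 2, 4, 8, 9 and stops at 10. A greedy loop like this can stall in a local optimum, which is one reason heuristics and a good evaluation metric matter.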
To realize this process, several questions have to be answered.
• "What transformations to consider?" Each transformation has different effects and purposes; it is not
efficient to consider all possible transformations at each step. The selection of the set of transformations
affects the efficiency of the optimization process. This selection should be based on the objective of the
optimization. Heuristics to decide which transformations are more promising for certain tasks are
needed to limit the search tree. To improve the efficiency, multiple transformations may be chained and
treated as a new transformation in the iterative optimization process.
• "How to select the most appropriate transformation?" All transformations have a tradeoff and
overhead. The decision for selecting a transformation will need to be based on the particular program, target
machine and the current stage and objectives of the optimization. One possibility is to use
performance estimation to estimate the possible contributions the transformation can have on the concurrency.
A heuristics-oriented rule-based system can also be used to make the decision. We will study the
framework for control decisions in the next chapter.
• "What are the effects of the target machine on the program transformation?" The view of parallelism
of the target architecture directly affects the execution of the program and must be considered when the
transformations are chosen. To maximize the effectiveness of the compiler, the effect of the target
machine on the selection of transformations needs to be studied carefully. This issue is studied in detail
in chapter 5 and the paper [Wang88].
• "When to stop the transformation process?" For the same program, there are many different
representations that have the same input-output semantics as the original program. It is impractical to try all of the
sequences before choosing the best way to restructure the program. Heuristics, and some kind of metric,
must be employed in order to find the most promising transformation to apply at each step.
One major motivation for deriving the feature-directed program optimization paradigm is its potential
application in intelligent parallel compilers. This paradigm opens up many interesting research problems that
have to be addressed before an intelligent parallel compiler can be implemented based on it:
• Representation of the machine features. How should the machine features be abstracted, represented and
organized in the knowledge base and how are they integrated into the decision-making process?
• Representation of the heuristics. How do we acquire, represent, organize, and utilize the
program-restructuring heuristics? What is the relationship between the program-restructuring heuristics and the
machine features?
• Choosing program transformations. How does the system decide which program transformation to use
at a particular stage of program optimization? What effects do architectural differences have on the
selection of transformations?
• Efficiency of the decision making process. What techniques can the system utilize to improve its
efficiency and effectiveness? Can the compiler itself be parallelized and therefore be able to utilize more
computing power?
• Learning. Can the system learn from experience to improve its own ability? Can the framework
provide hooks for a learning module?
• Intelligent user interface. When the system fails to come to a conclusion about a certain situation, can
the system query the user in an intelligent manner? Can the system provide a friendly user interface?





Figure 3.5. The feature-directed program-restructuring paradigm. The black pipes represent the flow of the
program dependence graph. The gray pipes represent the flow of the control information of the
system. And the white pipes represent the machine and program features.
Our approach to solving these problems and building intelligent parallel compilers can be outlined as
follows:
1. Feature-based target machine description. The features of the parallel machines are analyzed and the
target machines are described on the basis of their features. The target machine can be represented in
either a heterogeneous or a hierarchical structure. An object-oriented knowledge representation scheme is
described in chapter 5.
2. Feature-based heuristics representation. The program transformation heuristics are encoded on the basis
of the features of the target machine and the programs. The heuristics are hooked to a hierarchy of
machine features that is constructed by the system. This allows heuristics to be manipulated efficiently.
3. Knowledge organization and integration. The inference knowledge of the program transformation is
organized into a structure that is called the heuristic hierarchy. This structure closely mimics the structure
of the decomposed problem space. It also features competitive and opportunistic problem-solving
methodologies. The heuristic hierarchy and methodologies of organizing and integrating this knowledge are
discussed in section 4.3.
4. Expert systems approach. Since heuristics are used extensively to control the decision making, expert
system technologies are used to build the intelligent program-restructuring system for program
optimization and to use it as the central control unit of the intelligent parallel programming environment. Other
expert systems, such as a machine knowledge manipulation expert system, an explanation expert system,
and a program feature abstraction expert system can be incorporated.
5. Intelligent program-restructuring based on feature-directed model. The control process of
program-restructuring is guided by features of both the target machine and the program. A knowledge base that
contains a rich set of program-restructuring heuristics can be used to aid the control of the
program-restructuring process. Users have the option to query the system about the decision-making process and, if
desired, to take over the control.
6. Utilization of AI techniques. AI techniques are used extensively to cut down the unnecessary branches in
the decision tree and improve the efficiency of the system.
7. Learning. Learning models are being studied.
8. Parallelism in the decision-making process. Parallelizing the compiler itself allows the compiler to run
on a more powerful machine and use more computing resources to solve the program optimization
problem and to improve the quality of the generated code. In the next chapter, we will discuss a
program-restructuring framework which itself exhibits parallelism in the program-restructuring process.
3.2.1.4. Comparison to Other Parallel Program Optimization Models
Our paradigm for building intelligent parallel compilers and programming environments differs from
traditional compiler approaches in the following respects:
• Model of parallelism optimization. Most compilers and parallel programming environments either use
predefined program transformation sequences or rely on users to make the major program-optimization
decisions. In our system, we utilize the feature-directed program-optimization model, which opens up a
completely new avenue for research into intelligent parallel compilers.
• Organization and integration of the knowledge. The knowledge is explicitly represented in our system
but implicitly hidden in most conventional parallel compilers. In our system, knowledge about the target
machine is encoded and employed in terms of machine features. Specification, organization, integration
and utilization of the heuristics are all based explicitly on the machine features and program features.
• Systematic analysis. The feature-directed program optimization model uses systematic analysis and
automatic program transformation, while most other parallel compilers rely on ad hoc heuristics.
• Degree of the user involvement. Our reasoning model provides a mechanism for the user to intervene in
the decision-making process. This allows the system to adjust to suit different experiences and
capabilities of different users.
• Extensive utilization of AI techniques. Our approach makes extensive use of AI techniques and
knowledge manipulation methodologies for parallel compilers.
• Multiple target machines and knowledge generalization. One key ability of an intelligent parallel
compiler lies in the ability of transporting and integrating experiences learned from a particular machine to
others. The integration of knowledge for optimizing different kinds of parallel architectures into one
system and the use of machine features as a basis for knowledge representation ease the problem of
knowledge transfer and knowledge generalization. Accumulating the abilities of the system can greatly
enhance the capability of the system as the development of the system progresses. This feature is
particularly valuable for parallel compilers since the cost of building such a system from scratch for each
individual target machine is so high. With this approach, only the back-end (code generation) needs to
be specialized for the target architecture. More important, users' parallel programs can be immunized
from machine-dependent constructs to preserve portability without sacrificing efficiency.
To summarize, the combination of expert systems, knowledge acquisition, and AI techniques for
analysis, collection and accumulation of the system knowledge provides a practical alternative to traditional
parallel compiler approaches. This paradigm can be used to build compilers that are far more powerful than
even the best parallel programming environments available today. On the other hand, with this approach the
need for efficient decision-making processes and new methodologies for representing, organizing, integrating
and utilizing the knowledge becomes even more important. Methodologies for applying state-of-the-art AI
techniques to these problems to realize this new framework in the construction of parallel compilers are studied
in [WaGa89], and in chapters 4 and 7 of this thesis.
3.3. Utilization of Heuristics
Heuristics are methods, criteria or guidelines for selecting promising actions among alternatives. They
are usually referred to as simple principles that we learn from experiences or experiments†.
Heuristics may not always lead to effective solutions, but they represent the compromises between
simple fast guesses and actual but expensive evaluations. Heuristics are usually effective in the following
situations: state-of-the-art decision-making in non-well-defined problem domains, short-cut principles for expensive
operations, and approximated substitutions for non-attractive exponential time-bound algorithms.
The state space of the problem consists of facts that are derived from the problem domain or
intermediate conclusions of the rules used in solving the problem. A heuristic can be translated into a function which
maps a state of the decision-tree into another state by adding new facts or modifying existing facts. In expert
systems, a heuristic may be represented in one or more rules in the knowledge base.
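The view of a heuristic as a function from one state to another can be made concrete with a minimal Python sketch. The representation of a state as a set of (subject, property) pairs, the rule name `prefer_outer_loop`, and the specific facts are all hypothetical choices made for illustration; they are not taken from the prototype system.

```python
from typing import FrozenSet, Tuple

Fact = Tuple[str, str]
State = FrozenSet[Fact]

def prefer_outer_loop(state: State) -> State:
    """Hypothetical rule: if the outer loop carries no dependence,
    add a fact recommending that it be parallelized."""
    if ("outer_loop", "no_carried_dependence") in state:
        return state | {("recommend", "parallelize_outer_loop")}
    return state  # rule does not fire; the state is unchanged
```

Applying the rule to a state that contains the triggering fact yields a new state with one additional fact, which is the "adding new facts" case described above.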
3.3.1. Systematic Discovery of Heuristics
Heuristics may be discovered systematically by consulting simplified auxiliary models of the original
problem ([Pear84]). By simplifying the problem, the hope of finding a solution is higher and the cost should
be lower. Also, since the problem being solved is identical to the original problem except for a few con-
straints, it is generally true that the solutions to the simplified model are likely to be admissible to the original
problem.
Although systematically relaxing constraints of the problem can lead to the discovery of good heuristics
in many cases, relaxing a few constraints in a complex problem domain may not be enough for simplifying the
problem. In general, the relaxing process can be repeatedly applied to the relaxed model until a solution is
easy to find. However, relaxing more constraints means that the differences between the relaxed problem and
the original problem are more significant; the heuristic derived from the relaxed problem may no longer be
effective in the original problem. Therefore, any new heuristics self-learned by heuristic relaxation should be
subject to the inspection of the knowledge engineers.
3.3.2. Translating Heuristics into Rules
One major source of heuristics is case studies; most program transformation heuristics
are derived through careful study of the problems at hand. Problems of knowledge acquisition in case studies
include recognizing the factors involved in the decision-making process and the similarities between the
particular case and more general cases. In figure 3.6, we describe a process of abstracting domain-dependent
knowledge.
For each heuristic, there are actually many ways to represent it in the knowledge base. The proper
choice of encoding is critical for the knowledge to be effective. To illustrate the complexity in transforming
heuristics into rules, we show an example for evaluating the match of a program fragment to the architecture.
Heuristic 3.1. When mapping a loop into parallel tasks, it would be better if the size of the loop, N, is a
multiple of the number of processes, P, if the loop size is small. This factor is less important when the loop
size, N, is large compared to P.
† In this thesis, the term "heuristics" is generalized into any simple, effective or efficient methodologies used in
decision-making. Therefore, a heuristic may be an expertise that comes out of the state-of-the-art experience of an expert
or a well-defined algorithm.
Figure 3.6. Process of abstracting a domain dependent knowledge base.
To translate this heuristic into rules, the following formula is used to capture both conditions in Heuristic
3.1.
$$\mathrm{Eval}(N, P) = \frac{\lfloor N/P \rfloor}{\lceil N/P \rceil}$$
This formula quantifies the heuristic quite nicely. $\lfloor N/P \rfloor$ and $\lceil N/P \rceil$ are both integers
differing by at most one. The result of the real division is 1 when P divides N, and it approaches 1 when N/P is large.
On the other hand, this encoding represents a non-trivial heuristic in knowledge acquisition. It is very
difficult for the system to perform this encoding without help from human experts. A good knowledge
acquisition system may ease this difficulty but may not solve the problem completely. This is why human assistance
is needed in the above knowledge abstraction process.
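As a worked illustration of the encoding of Heuristic 3.1, the sketch below evaluates the match of a loop of N iterations to P processes. It assumes that the formula is the ratio of the floor of N/P to the ceiling of N/P, which is a reading of the garbled original consistent with the statement that both terms are integers; the function name `loop_match` is hypothetical.

```python
def loop_match(n: int, p: int) -> float:
    """Return floor(n/p) / ceil(n/p) for n loop iterations on p processes.

    Assumed reading of the formula in the text: 1.0 when p divides n,
    close to 1.0 when n/p is large, and small when n is small and not
    a multiple of p.
    """
    if n <= 0 or p <= 0:
        raise ValueError("n and p must be positive")
    floor_q = n // p
    ceil_q = -(-n // p)          # ceiling division on integers
    return floor_q / ceil_q
```

For P = 32, a loop of 64 iterations scores 1.0, a loop of 33 iterations scores 0.5, and a loop of 1000 iterations scores 31/32 = 0.96875, matching the intent that divisibility matters mainly for small loops.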
3.4. Applying AI Technologies to Parallel Compilers
From our experience, there are many areas where AI techniques may help in the construction of parallel
compilers. The AI techniques that we have employed in the implementation of our prototype system
are listed below. We briefly discuss some of them here.
• Search algorithms. In chapter 4. we will discuss the use of some AI search algorithms, such as A ., for
searching through the decision tree in the program optimization process. These AI search algorithms use
heuristics to cut down the unnecessary lraversal of slates in the decision tree to improve efficiency. Willi
suilable performance estimation functions. the A • algorithm can find optimum solutions.
• Goal reduction. Goal reduction techniques (such as forward-chaining, backward chaining, hybrid
methods, etc), can be used as goal searching and processing and to cut down llie search trees 10 improve
efficiency in decision making, Furthermore, resolution and unification can be used to deduce llie search
goats. The theorem-proving system can be used to deduce and analyze heuristics and find the incon-
sistency in the knowledge. The hier -blackboard model discussed in section 4.4 is a framework for goal
reduction.
• Constraint propagation and satisfaction. Static analysis of the program has its limits. For instance, a
program dependence test may be obscured by a variable in a loop bound, or the task decomposition
may be crippled by the unknown loop bound in the outermost loop. Some of these decisions can be
postponed until run time. To ensure that only the minimal run time test is generated, constraint propaga-
tion can be used to propagate the critical conditions for such run time tests at the needed points.
• Planning. Using planning for selecting a program transformation sequence or generating the schedule of
parallel execution is discussed in section 4.2.
• Generate and test. The model consists of a generator and a tester, where the generator generates a
number of possible cases and the tester eliminates the inapplicable or non-promising ones. An example is to
apply this technique to data decomposition, where a data decomposition generator can generate possible
decompositions of arrays and an examining expert can be used to eliminate less promising decompositions,
limiting the selections. In [Wolfe82], the generate-and-test technique is also used in deciding the plausible
loop order for interchanging perfectly nested loops.
• Pattern recognition. Pattern recognition can be used to recognize the program parallelism, machine
features, and opportunities for improving parallelism. An example of using pattern recognition to
abstract program features involves recognizing opportunities for pre-optimized algorithm substitution
(see chapter 7). A sample program to recognize a generic pattern of inner product operations was given
in chapter 2.
• Man-machine interface. Advances in the man-machine interface such as natural language processing and
visual programming can be used to achieve intelligent user-interaction.
• Knowledge engineering. Knowledge representation and manipulation techniques can be employed to
represent, organize, and integrate the program transformation heuristics and manipulate machine features.
Object-oriented knowledge representation of machine features and the organization of heuristics are
discussed in chapter 5.
• Problem-solving models. Problem-solving models such as rule-based systems, blackboard systems, and
object-oriented models can serve as frameworks for reasoning in intelligent compilers. This topic is
discussed in detail in chapter 4.
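The generate-and-test model in the list above can be illustrated on the loop-order example. The sketch below is an assumption-laden illustration and not the procedure of [Wolfe82]: the generator enumerates every permutation of a perfectly nested loop, and the tester applies the standard legality check that each dependence distance vector stays lexicographically non-negative after permutation. All function names are hypothetical.

```python
from itertools import permutations
from typing import Iterable, List, Sequence, Tuple


def lexicographically_nonnegative(vector: Sequence[int]) -> bool:
    """True if the first non-zero entry is positive (or all entries are zero)."""
    for distance in vector:
        if distance > 0:
            return True
        if distance < 0:
            return False
    return True


def generate_loop_orders(depth: int) -> Iterable[Tuple[int, ...]]:
    """Generator: propose every ordering of the loop levels 0..depth-1."""
    return permutations(range(depth))


def is_legal_order(order: Sequence[int],
                   distance_vectors: Sequence[Sequence[int]]) -> bool:
    """Tester: reject an order if any permuted dependence points backwards."""
    return all(
        lexicographically_nonnegative([vector[level] for level in order])
        for vector in distance_vectors
    )


def plausible_loop_orders(depth: int,
                          distance_vectors: Sequence[Sequence[int]]
                          ) -> List[Tuple[int, ...]]:
    """Generate all loop orders and keep only those that pass the tester."""
    return [order for order in generate_loop_orders(depth)
            if is_legal_order(order, distance_vectors)]


if __name__ == "__main__":
    # A doubly nested loop with one dependence of distance (1, -1):
    # interchanging the two loops would reverse the dependence.
    print(plausible_loop_orders(2, [(1, -1)]))
    # Two dependences that tolerate either loop order.
    print(plausible_loop_orders(2, [(1, 0), (0, 1)]))
```

In a fuller system the tester would also rank the surviving orders by how promising they are, which is where the machine and program features enter.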
3.5. Conclusion
In this section we discuss some challenges that a parallel compiler faces. A new paradigm for program




A FRAMEWORK FOR THE CONTROL OF
INTELLIGENT PARALLEL COMPILERS
In chapter 3 we proposed a new paradigm for parallel program optimization. In this and the following
three chapters, we will discuss methodologies for realizing the feature-directed program optimization paradigm
in intelligent parallel compilers. We first decompose the problem of program optimization into a hierarchical
structure of subproblems. In this chapter, the framework for implementing the paradigm is discussed. The
problem is first formulated as a planning problem. Then three different frameworks for implementing the
paradigm are presented and their advantages and disadvantages are discussed.
4.1. Parallel Program Optimization as a Planning Problem
The program optimization system can be viewed as a planning system. A planning system is a program
that develops a course of actions, or a plan, to reach a desired goal. This plan is then used to guide the
execution of planned activities. When the activities represented in a plan are timed, the plan is called a schedule.
Formally, a planning system PS can be defined as a quadruple:
$$PS = (S, OP, S_0, S_G)$$
where $S$ is the set of problem states, $OP$ is the set of operators defined by a state-transition mapping from one
state to another, $S_0$ is the initial state, and $S_G$ is a set of goal states. The planning process is to find a plan $\Psi$
(which is a sequence of operators) that will transfer the initial state to one of the goal states. That is:
$$S_0 \xrightarrow{\Psi} S_g, \quad \text{where } S_g \in S_G.$$
$\Psi$ is actually a sequence of operators that change the problem states:
$$S_0 \xrightarrow{\psi_1} S_1 \xrightarrow{\psi_2} S_2 \xrightarrow{\psi_3} S_3 \cdots \xrightarrow{\psi_n} S_n$$
where the $\psi_i$'s are operators that map $S_{i-1}$ into $S_i$; that is, we have
$$\Psi(S_0) = S_g = \psi_n(\psi_{n-1}(\psi_{n-2}(\cdots \psi_1(S_0) \cdots))).$$
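The quadruple and the application of a plan can be made concrete with a small sketch. It assumes that states are hashable values and that operators are plain functions from state to state; the class `PlanningSystem` and the toy integer domain are illustrative only and are not part of the prototype. The state set S is left implicit as whatever the operators can reach.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Hashable, Sequence

State = Hashable
Operator = Callable[[State], State]


@dataclass(frozen=True)
class PlanningSystem:
    """PS = (S, OP, S_0, S_G); S is implicit in the operators' domain."""
    operators: Dict[str, Operator]   # OP, indexed by a descriptive name
    initial: State                   # S_0
    goals: FrozenSet[State]          # S_G

    def apply_plan(self, plan: Sequence[str]) -> State:
        """Apply the operators of the plan in order, starting from S_0."""
        state = self.initial
        for name in plan:
            state = self.operators[name](state)
        return state

    def is_solution(self, plan: Sequence[str]) -> bool:
        """A plan is a solution if it transfers S_0 into some state of S_G."""
        return self.apply_plan(plan) in self.goals


if __name__ == "__main__":
    # Toy domain (an assumption for illustration): states are integers.
    toy = PlanningSystem(
        operators={"inc": lambda s: s + 1, "double": lambda s: s * 2},
        initial=1,
        goals=frozenset({6}),
    )
    print(toy.apply_plan(["inc", "inc", "double"]))
    print(toy.is_solution(["double", "double"]))
```

In the compiler setting the states would be program dependence graphs and the operators would be program transformations, as described in the remainder of this section.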
A parallel program (represented as an augmented program dependence graph) is a rough schedule for
concurrently executing the computation specified in the program on a parallel architecture. The program
dependence graph represents the constraints that the operations on the program must respect. The schedule
generated by the compiler statically determines the flow of the control, the flow of the data, and the utilization
of resources. The parallelism optimization process modifies the schedule of the operations to improve the per-
formance of the program on the target machine.
There are many ways to formulate a parallel compiler as a planning system. The planner can use the
dependence graph as constraints for generating plausible execution plans. Alternatively, programs can be
viewed as problem states and transformations as operators to refine the states. For each node in the decision
tree of the program optimization process, the children of the node correspond to the transformations that can be
applied to the program. The planner's objective is thus to generate a plan for refining the execution plan, or
more specifically, to generate a list of transformations to improve the parallelism of the program. The program
optimization process can be represented by the planning problem $PS = (S, OP, S_0, S_G)$, where $S$ is the set of
program dependence graphs, $OP$ is the set of program transformation techniques, $S_0$ is the original program,
and $S_G$ is the set of optimized programs. The objective of the program optimization process is to find an
appropriate sequence of transformations $\psi_n \circ \psi_{n-1} \circ \psi_{n-2} \circ \cdots \circ \psi_1 = \Psi$ to translate the program
dependence graph into an optimal form for the target machine.
Once the program optimization problem is defined in the state-space transformation paradigm, the
problem is then to find a solution path in a search tree whose nodes are programs and whose arcs are program
transformations that modify the programs. The key issue is the selection of the most appropriate operator to
apply at the given state, represented as a node in the search tree. We shall now examine how a plan $\Psi$, the
sequence of transformations, can be obtained. This approach differs from an ordinary planning problem in two
respects: first, there is no clear definition of the goal states; second, even though optimization is desired, the
user may not be able to afford the cost of finding the optimal solution, since the cost of verifying the
applicability of the transformation is usually fairly high. In this case, a partial solution may be acceptable.
To guide the selection of the operators and identify the goal states, a performance objective function, $E$,
is defined. The function $E$ is a mapping from the state space to the real numbers, which represents the
performance measure of the states. This performance objective function is usually based on certain heuristics.
Search algorithms based on the performance evaluation function are called heuristic-guided graph-search
algorithms. The objective of the program optimization process is then to find the transformation sequence $\Psi$,
$\Psi = \psi_n \circ \psi_{n-1} \circ \psi_{n-2} \circ \cdots \circ \psi_1$, which transfers $S_0$ into $S_g$ such that $E(S_i) \ge E(S_g)$ for all $1 \le i < n$. The
definition of this performance objective function is subject to the degree of optimization, the affordable
resources, the knowledge of the architecture, etc. Modifying this function will change the characteristics of the
program optimization system.
4.2. Frameworks for Realizing the Feature-Directed Program Optimization Paradigm
Below we will examine three different frameworks for realizing the feature-directed program optimization
paradigm. The first approach is to use heuristic-guided state-space search algorithms. Under this
approach, the program optimization process is to find the optimal path in the decision tree of the program
transformation process. The algorithm searches through the decision tree to find the goal states and uses some
heuristic functions to prune subtrees that are not promising. A good representative of such algorithms is the
A* algorithm. As will be explained below, the A* algorithm is a variant of the best-first search algorithm that
uses a heuristic function to estimate the cost of going from the starting state to a goal state in the decision tree.
A* is examined here because it guarantees that the algorithm will finish with an optimal solution.
The second approach, the heuristic hierarchy approach, and the third approach, the hier-blackboard
approach, both utilize the hierarchical knowledge organization but with different flavors. The hierarchical
knowledge organization is preferred over the flat structure of rule-based systems because heuristics are often
fragmented during the knowledge acquisition process. Also, this loss of information (about the relationship
between the rules) cannot be recovered after the knowledge is translated into rules. The hierarchical
knowledge organization retains the relationship between rules in the knowledge base. The heuristic hierarchy
utilizes the hierarchical knowledge organization but also provides a control and reasoning strategy. The
hier-blackboard model is an extension of the heuristic hierarchy with the additional power of opportunistic
reasoning and parallel processing. We will discuss these three frameworks and compare them in detail below.
4.2.1. Heuristic Guided State-Space Search
The selection of transformations may be guided by heuristic functions or other systematic approaches
such as rules in a rule-based system. For instance, algorithm A* [Nilsson80] can be incorporated. The
algorithm A* is a variant of the best-first search of a problem graph. It transforms the planning process into a
graph search problem guided by a heuristic function $f$. The evaluation function $f(S_i)$ at any node $S_i$ estimates
the cost of the minimal path from the start node $S_0$ to the node $S_i$ (denoted $g(S_i)$) plus the cost of a minimal
cost path from node $S_i$ to a goal node in $S_G$ (denoted $h(S_i)$). That is, $f(S_i)$ is an estimate of the cost of a
minimal cost path constrained to go through node $S_i$:
$$f(S_i) = g(S_i) + h(S_i). \qquad (4.1)$$
At each stage of the node expansion, the algorithm chooses the node with the minimal value of the evaluation
function to expand. This algorithm is called Algorithm A.
Let $cost_{min}(S_i, S_j)$ be the actual cost of a minimal path between the two nodes $S_i$ and $S_j$. The function
$h^*(S_i)$ is defined to be the cost of the minimum cost path from node $S_i$ to any of the goal states.
Also define $g^*(S_i) = cost_{min}(S_0, S_i)$, which is the cost of a minimal path from the start node to the node $S_i$. Then
$f^*(S_i) = g^*(S_i) + h^*(S_i)$ is the cost of an optimal path from $S_0$ constrained to go through node $S_i$. When the
estimation function $h$ is a lower bound on $h^*$, this algorithm is called A* [Nilsson80].
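The definitions above can be sketched as a short program. This is a generic textbook-style A* over an explicit graph, written under the assumption that arc costs are non-negative and that h never overestimates $h^*$; it is not the compiler's search procedure, and the graph, the heuristic table, and all names are made up for illustration. The function also reports how many nodes it expanded so that differently informed heuristics can be compared.

```python
import heapq
from itertools import count
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple

Node = Hashable


def a_star(start: Node,
           is_goal: Callable[[Node], bool],
           successors: Callable[[Node], Iterable[Tuple[Node, float]]],
           h: Callable[[Node], float]
           ) -> Optional[Tuple[float, List[Node], int]]:
    """Return (cost, path, expanded_count) of a minimal cost path, or None.

    f = g + h is the priority; h is assumed to be a lower bound on h*.
    """
    tie_breaker = count()                      # keeps heap entries comparable
    frontier = [(h(start), next(tie_breaker), 0, start, [start])]
    best_g: Dict[Node, float] = {start: 0}     # cheapest known g for each node
    expanded = 0
    while frontier:
        _, _, g, node, path = heapq.heappop(frontier)
        if g > best_g.get(node, float("inf")):
            continue                           # stale entry, a cheaper path exists
        if is_goal(node):
            return g, path, expanded
        expanded += 1
        for child, arc_cost in successors(node):
            new_g = g + arc_cost
            if new_g < best_g.get(child, float("inf")):
                best_g[child] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + h(child), next(tie_breaker), new_g, child, path + [child]),
                )
    return None


# Hypothetical example graph: node -> list of (successor, arc cost).
GRAPH = {
    "S0": [("A", 1), ("B", 4), ("C", 2)],
    "A": [("B", 2), ("G", 6)],
    "B": [("G", 1)],
    "C": [("G", 9)],
    "G": [],
}
# A lower bound on the remaining cost for each node (assumed, hand-written).
H_TABLE = {"S0": 3, "A": 2, "B": 1, "C": 9, "G": 0}

if __name__ == "__main__":
    informed = a_star("S0", lambda n: n == "G", GRAPH.__getitem__, H_TABLE.__getitem__)
    blind = a_star("S0", lambda n: n == "G", GRAPH.__getitem__, lambda n: 0)
    print(informed)
    print(blind)
```

Running the same search with h identically zero yields the same minimal cost but cannot expand fewer nodes than the informed version, which is the behaviour the lemmas in this section describe.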
To map the algorithm to the program optimization process, the function $g(S_i, M)$ is the estimated
performance of the program $S_i$ on the target machine $M$, and $g^*(S_i, M)$ is the actual performance of the program $S_i$
on the target machine $M$. The function $h(S_i, M)$ is the estimated performance improvement the program
optimization process can have on the program $S_i$ on the machine $M$, and the function $h^*(S_i, M)$ is the
maximum improvement that program optimization can achieve on the target machine $M$. This definition realizes
the feature-directed program optimization paradigm because the features of the program and the machine are
cast into the performance estimation of the program on the machine.
Lemma 4.1. Algorithm A* will terminate if there is a path from $S_0$ to a goal state $S_g$ [Nilsson80].
Lemma 4.2. Algorithm A* is admissible (that is, A* will terminate by finding an optimal solution if there is a
path from $S_0$ to a goal state $S_g$) [Nilsson80].
The efficiency of the A* algorithm depends on the choice of the evaluation functions. The precision of $h$
depends on the amount of heuristic information it possesses. When $h \equiv 0$, it reflects complete absence of any
heuristic information about the problem and results in a uniform-cost search (a breadth-first search when all
arcs have equal cost). However, since such an estimate is a lower
bound on $h^*$, the algorithm is still an admissible algorithm. Another interesting property of A* is that the
more "informed" the algorithm is, the fewer nodes it will expand. This property is described in the following
lemma.
Lemma 4.3. Let $A_1$ and $A_2$ be two versions of algorithm A* that use different evaluation functions. If $A_2$ is
more informed than $A_1$ (i.e., $h_1(S_i) \le h_2(S_i)$ for all $i$), then at the termination of their searches on any graph
having a path from $S_0$ to a goal state $S_g$, every node expanded by $A_2$ is also expanded by $A_1$. It follows that
$A_1$ expands at least as many nodes as does $A_2$ [Nilsson80].
The major question in applying the A* algorithm involves defining the evaluation function and goal
states. There are many ways to define the evaluation functions; the more accurate the estimation is, the fewer
nodes the algorithm has to visit. On the other hand, accurate estimation of the program performance is usually
very expensive to obtain. So the proper choice of the estimation lies in the compromise between the cost of
computing the evaluation function and the cost of traversing the nodes (applying the transformations).
Evaluation functions and their application in controlling the program optimization are discussed in
detail in chapter 6.
Here we give a simple example. One heuristic for estimating the execution time involves using the
operation counts. For a sequential statement block, this number is the sum of the number of the operations in
each of the components. For loops, this is the number of iterations times the number of operations inside the
loops. For conditional statements, a probability value can be assigned to each branch and the operation count
for each branch is the number of operations in the branch times the probability that the branch would be taken.
The operation count for the conditional statement is then the maximum of the operation counts of the two
branches. The operation count for a parallel loop is defined to be the maximum number of operations in each
of the parallel tasks. The estimation function described above is rough but cheap to compute. Note that
architecture dependencies such as the synchronization cost, memory utilization, and cache miss ratio are all ignored
in the estimation. However, this can serve as a framework for more sophisticated performance estimation. For
example, for memory optimization, the operation count for a statement can be replaced by the cost of memory
accesses in the statement. The sequential operation count $OC_1$ is the total number of operations to be
performed in the program and the parallel operation count $OC_P$ is the number of operations for concurrent
execution on $P$ processors. The speedup based on this performance estimation function is thus defined to be
$SP_{OC} = OC_1 / OC_P$. For algorithm A*, the cost of the arc between a pair of nodes is defined to be the change in
operation count that the transformation would have on the program. The heuristic functions $g$ and $h$ are
defined as $g(S_i) = 0$ and $h(S_i) = SP_{OC}$.
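The operation-count estimate described above can be sketched over a tiny program representation. Everything here is an assumption made for illustration: the node classes `Stmt`, `Seq`, `Loop`, and `Cond` are hypothetical, a parallel loop is assumed to split its iterations evenly over the P processors, and a loop body nested inside a parallel loop is assumed to run on a single processor.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Stmt:
    ops: int                       # number of operations in a simple statement


@dataclass
class Seq:
    parts: List["Node"]            # sequential statement block


@dataclass
class Loop:
    iterations: int
    body: "Node"
    parallel: bool = False         # True for a parallel loop


@dataclass
class Cond:
    then_branch: "Node"
    else_branch: "Node"
    then_probability: float        # probability that then_branch is taken


Node = Union[Stmt, Seq, Loop, Cond]


def op_count(node: Node, processors: int = 1) -> float:
    """Estimated operation count of the node when run on `processors` processors."""
    if isinstance(node, Stmt):
        return node.ops
    if isinstance(node, Seq):
        return sum(op_count(part, processors) for part in node.parts)
    if isinstance(node, Loop):
        if node.parallel:
            # Largest task: ceil(iterations / P) iterations of the body.
            iterations_per_task = -(-node.iterations // processors)
            return iterations_per_task * op_count(node.body, 1)
        return node.iterations * op_count(node.body, processors)
    if isinstance(node, Cond):
        p = node.then_probability
        return max(p * op_count(node.then_branch, processors),
                   (1 - p) * op_count(node.else_branch, processors))
    raise TypeError(f"unknown node type: {type(node).__name__}")


def speedup(program: Node, processors: int) -> float:
    """SP_OC = OC_1 / OC_P."""
    return op_count(program, 1) / op_count(program, processors)


if __name__ == "__main__":
    program = Seq([Stmt(2), Loop(100, Stmt(3), parallel=True)])
    print(op_count(program, 1), op_count(program, 4), speedup(program, 4))
```

Replacing `Stmt.ops` with a memory-access cost would give the memory-oriented variant mentioned in the text without changing the traversal.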
Lemma 4.4. Algorithm A* defined above using the estimation functions $g$, $h$ and $f$ is admissible.
Proof: Since the arcs in the state space graph represent changes in operation counts, the cost of the path from
$S_0$ to node $S_i$ is the operation count of the program $S_i$. Subsequently, the function $h^*(S_i)$ is identical to the
function $h(S_i)$. And the function $h$ is always a lower bound on $h^*$, so by Lemma 4.2 we know that the A*
algorithm will always find a program that would generate most parallel statements.
QED.
Since this heuristic ignores the impact of synchronization cost and differences in cost of different
operations, the "optimal" program generated by the algorithm may not be optimal for general programs. However,
this heuristic is useful for "embarrassingly parallel programs" since the lack of communication between the
processes makes factors ignored by this heuristic unimportant, and the parallel operation count is thus a good
indication of the concurrent performance.
Another possible way of defining the heuristic function is to define the function $g$ to be the estimated
performance of the program based on the current state, and the function $h$ to be the estimated potential
improvement for this state.
4.2.1.1. Non-linear Planning and the Coordination of Multiple Thread State-space Search
For a finer formulation of the planning problem where each node represents the scheduling of an
operation, the goal of the system is to generate an execution plan. This problem can be translated into a
multiple-task scheduling process where nodes in the dependence graph are the operations to be performed and
dependence arcs define the precedence relations between the operations. The output of the planning system is a
$P$-thread parallel program with annotations for data decomposition and allocation. The control threads of the
program compete for shared resources such as the memory, communication network and processors but cooperate
to carry out the objectives of the program through synchronized communication. The resulting plan is partially
ordered since the dependence relations need to be respected. This type of planning is commonly referred to as
non-linear planning [Sacem, Tate76, Wilkins84]. The complexity of the search algorithm depends on the
number of nodes generated. The size of the search tree is bounded by $b^d$, where $b$ is the branching factor and
$d$ is the depth of the tree. Obviously, the search tree for parallel program optimization grows very rapidly. To
overcome this complexity, the problem can be decomposed into $P$ subproblems where each subproblem plans
the execution for a particular processor. The algorithm A* can be applied to generate the plans for the
subproblems. The coordination between these $P$ planning processes is a global strategy to avoid violating
precedence constraints and resource conflicts. The interaction between the processors is the data dependence
between the blocks of statements assigned to each of the processors. The cost of an arc is defined to be the
time to perform the operation that the arc points to. Communication and synchronization time needs to be
added to the processing time of the operation if data must be obtained from remote processors. For this
problem, the performance evaluation function $f$ is defined to be $g + h$, where $g(S_i)$ is defined to be the time it takes
to perform operations from the beginning up to operation $S_i$, and $h(S_i)$ is the estimated execution time for the
remaining operations. The parallel execution time is the maximum execution time of each of the processors.
The goal of the compiler is then to find a schedule (a parallel program) so that the parallel execution time of
the program on a $P$-processor machine is minimum.
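The cost model of this subsection (operation time, plus a communication penalty when an operand comes from another processor, with the parallel execution time taken as the maximum over processors) can be sketched as follows. Note the substitution: instead of the coordinated A* searches described above, the sketch uses a simple greedy list scheduler, chosen only because it is short. The uniform `comm_cost`, the task names, and the function name are assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def list_schedule(durations: Dict[str, int],
                  predecessors: Dict[str, Sequence[str]],
                  order: Sequence[str],
                  n_processors: int,
                  comm_cost: int) -> Tuple[int, Dict[str, int]]:
    """Greedily place each task on the processor where it finishes earliest.

    `order` must be a topological order of the dependence graph. Returns the
    parallel execution time (maximum over processors) and the placement.
    """
    processor_free: List[int] = [0] * n_processors   # time each processor is idle
    finish: Dict[str, int] = {}
    placed: Dict[str, int] = {}
    for task in order:
        best_end, best_proc = None, None
        for proc in range(n_processors):
            # Data from a remote processor arrives comm_cost time units later.
            ready = max(
                (finish[pred] + (comm_cost if placed[pred] != proc else 0)
                 for pred in predecessors.get(task, ())),
                default=0,
            )
            end = max(processor_free[proc], ready) + durations[task]
            if best_end is None or end < best_end:
                best_end, best_proc = end, proc
        finish[task], placed[task] = best_end, best_proc
        processor_free[best_proc] = best_end
    return max(processor_free), placed


if __name__ == "__main__":
    durations = {"a": 2, "b": 3, "c": 3, "d": 1}
    predecessors = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
    print(list_schedule(durations, predecessors, ["a", "b", "c", "d"], 2, 1))
    print(list_schedule(durations, predecessors, ["a", "b", "c", "d"], 1, 1))
```

The same finish-time bookkeeping could serve as the g component of an A*-based planner, with h estimating the time for the operations not yet placed.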
Depending on the granularity of the schedule, the program dependence graph can be abstracted at
different levels. It is generally impractical to schedule programs at a fine grain level, since this multiple-task
scheduling problem is an NP-complete problem [Sarkar87] and there is a big overhead for scheduling large
programs. Even at the task level, the problem is still relatively complex.
To summarize, the state-space search approach for parallel program optimization applies a heuristic
function as the guideline for searching for the solutions. Heuristics are quantified into heuristic functions to help
choose the "best" node to expand. As indicated by Lemma 4.3, the quality of the algorithm depends on the
quality of the heuristics used and the quality of the quantification of the knowledge. The advantage of this
approach is clear: systematic processing is possible. Also, the behavior of the algorithm can be controlled by
the heuristic function so that the characteristics of the algorithm can be modified by changing the heuristic
functions based on the objective of the system. Based on this model, new heuristics can be tested or compared to
existing heuristics. On the other hand, the disadvantage of the approach is that it is not always possible to
characterize the heuristic by numerical values. Distortions to the heuristics are likely in the process. Also, for
complex problems that involve many different heuristics at different stages of the problem (such as the problem
of program optimization), trying to merge all heuristics into one heuristic function and applying it to every
step in the search process is difficult and inefficient.
One important characteristic of the above approach lies in the specification of the problem. The
separation of control knowledge (planning), data (state descriptions and goals), and problem-solving knowledge
(operators) allows the system to be adaptive to different objectives of the problem at different stages of
problem solving. This suggests an alternative approach which hierarchically decomposes the problem into
subproblems and uses different sets of rules that are specialized for the subproblems in order to select the best node
to expand at the particular stage of the problem-solving process. We will first look at the structural organization
of the problem domain and then discuss two frameworks that are based on this approach.
4.2.2. Hierarchical Decomposition of the Parallelism Improving Problem
The problem of program optimization is a very complex problem; therefore, decomposing it into simpler
subproblems and coordinating the subproblems to solve it is beneficial. In this section we present one such
decomposition method. This decomposition will serve as the foundation for the discussion of two other
frameworks, the heuristic hierarchy and the hier-blackboard, which are discussed in sections 4.2.3 and 4.35, respectively.
The process of program parallelism optimization can be classified hierarchically into three problem-solving
modules that we call the parallelism-defining layer, the parallelism-matching layer, and the
parallelism-matching control layer. Depending on the complexity of the subproblem, each of these three layers may be
further decomposed into finer subproblems.
The parallelism-defining layer abstracts the program parallelism and the machine parallelism into lists of
machine and program features. The parallelism-matching layer matches the program onto the target machine
by performing a sequence of program transformations. Since a single transformation may serve different
purposes, it may belong to different categories. Therefore, we separate the heuristics in the program restructuring
control layer into two sub-layers: the program restructuring subgoal selection layer and the transformation
layer. The transformation layer contains the transformation techniques, which we term transformation modules.
Each transformation module consists not only of the description of the transformation technique, the
conditions for the transformation to be applicable and the procedures to carry out the transformation, but also the
heuristics pertaining to the feasibility of the transformation under various circumstances, short-cut rules for
applying the transformation, methods for estimating the effects of the transformation, etc.
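One possible way to package these components is sketched below. The field names, the tuple-of-loop-names stand-in for a program, and the interchange example are all assumptions for illustration; the prototype's actual representation is not shown in this section.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

Program = Any   # placeholder for a program dependence graph


@dataclass
class TransformationModule:
    """Bundle a transformation with the knowledge needed to decide on its use."""
    name: str
    description: str
    applicable: Callable[[Program], bool]          # applicability conditions
    apply: Callable[[Program], Program]            # procedure that performs it
    estimate_effect: Callable[[Program], float]    # estimated change in cost
    feasibility_heuristics: List[Callable[[Program], bool]] = field(default_factory=list)

    def worthwhile(self, program: Program) -> bool:
        """Applicable, and no feasibility heuristic advises against it."""
        return self.applicable(program) and all(
            heuristic(program) for heuristic in self.feasibility_heuristics
        )


# Toy example: a "program" is just a tuple of loop names, outermost first.
interchange = TransformationModule(
    name="loop_interchange",
    description="swap the two outermost loops of a perfect nest",
    applicable=lambda loops: len(loops) >= 2,
    apply=lambda loops: (loops[1], loops[0]) + tuple(loops[2:]),
    estimate_effect=lambda loops: -1.0,   # made-up constant: assumed cost reduction
    feasibility_heuristics=[lambda loops: loops[0] != loops[1]],
)

if __name__ == "__main__":
    nest = ("i", "j", "k")
    if interchange.worthwhile(nest):
        print(interchange.apply(nest), interchange.estimate_effect(nest))
```

Keeping the heuristics inside the module is what lets any of the subgoals in the layer above select the same transformation for different purposes.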
The program parallelism improving process can be decomposed into the following five subproblems.
• Improving general program parallelism. The major purpose of this process is to improve the structure
of the problem to prepare for other processes below. This goal can be achieved by reducing the amount
of data or control dependence present in the program dependence graph. Machine-independent
transformations for removing redundant code, breaking dependence cycles, and improving localities can be
applied.
• Creating tasks. The aim here is to decompose the control structure of the program so as to create tasks
and vector operations. One major consideration involves balancing the load of the created tasks.
• Scheduling tasks. The scheduling of the tasks/processes is another important factor in obtaining optimal
performance. Traditionally, this problem is viewed as the task of the operating system. However, studies
have shown that static estimates done at compile time can simplify the task of the operating system at run
time [Cytron84]. There are techniques (e.g. do-across) that can estimate the required minimum process
delay time to reduce unnecessary memory traffic due to polling the lock variables prematurely. Compile
time analysis can also help to decide what run-times to generate.
• Minimizing synchronization. When a sequential program is mapped to a multiprocessor machine, the
proper synchronization operations must be inserted in the code in order to preserve the semantics of the
original program. Synchronization costs penalize the program performance, and, in the worst case, may
serialize the whole computation. Fewer synchronization points mean less processor idling time and better
overall system performance. Grouping closely related micro-tasks into one task, copying repeatedly used
data into local memories, and changing data access patterns may have a positive effect on minimizing the
synchronization cost.
• Optimization of memory accesses. Since the data access time for different components of the memory
hierarchy may be different, the utilization of fast memory components (like cache) and the removal of
unnecessary data accesses will shorten the access time and speed up the computation. Array decomposition,
data copying, scalar gathering, strip-mining, loop interchanging, loop blocking, and other transforma-














Figure 4.1. A hierarchical decomposition of the process of optimizing program parallelism.
Each of these five subproblems may select any of the transformations in the underlying transformation
layer. The selection of the transformations is based on the heuristics in the transformation layer and the
features defined in the parallelism-defining layer. Since these five problems are interrelated, the restructuring
control process coordinates the interaction between them.
The program fragment under consideration is called the current focus of the system. A program is
decomposed into a list of focuses by the focus selection process. The topmost layer of the hierarchy is the
parallelism-matching control layer. It selects the current focus and controls the optimization of the current
focus.
Global coordination between different focuses is often needed. For example, the memory access
optimization subgoal will try to optimize the memory accesses and decompose the array storages based on the
program focus and the machine model to which it is assigned. The array decompositions chosen in the subgoal
may be changed when global consideration and adjustments are made.
The performance evaluation process evaluates the performance of transformations on the program focus
and provides quantified evaluation for the parallelism-matching control layer to make decisions.
This hierarchical decomposition of the program parallelism optimization problem divides the problem
into interacting processes based on the hierarchical structure we described above. It models the conceptual
interactions between different functions of the program restructuring process into a concrete structure so that
controls in these functional units can cooperate and interact with each other. This hierarchical decomposition
also allows specialized heuristics to solve the problem, so it can significantly improve the flexibility and
efficiency of the transformation process. It also provides a model for the decomposition and organization of
the heuristics.
4.2.3. Heuristic-Guided Reasoning and the Expert Systems Approach
For the rule-based approach, at each stage of the program optimization process, the control of the
process utilizes a set of rules to decide how to restructure the program. A prototype implementation that used the
rule-based expert system approach was reported in [Wang85]. In this experiment, control heuristics were
encoded into Prolog predicates which chose the "most appropriate" program transformation to apply. This
experiment with an expert system achieved mixed results. While the system was able to generate efficient
code for some particular programs, it failed in many other cases. The major drawback of the system was its
lack of heuristics. It employed only about three dozen rules for choosing the transformations. On the other
hand, this experiment exposed a problem common to flat-structured first generation rule-based systems: the
fragmentation of the knowledge and lack of systematic knowledge acquisition tools. This makes enhancing the
ability of the system a very involved process. We concluded that structured organization of the knowledge is
necessary and that systematic integration of the heuristics is a key issue for knowledge enhancement and
automatic learning of the system. The heuristic hierarchy reported in [WaGa89] was our first attempt at moving
to more intelligent expert systems.
4.2.3.1. The Heuristic Hierarchy
While the modularity and integrability of rule-based expert systems make modifying the knowledge
base easy, their opacity of knowledge and inefficiency in execution are the major drawbacks. For example,
translating a heuristic into a set of rules causes the knowledge to be fragmented across the rules. Even though
there may be strong relations among many of the rules, the fragmentation causes an unfortunate loss of
coherence. Furthermore, this makes maintenance and modification of the knowledge base difficult.
To improve the integration and modularity of the knowledge, we organize the heuristics based on the
decomposition of the problem-solving methods. This organizes the rules into a hierarchical knowledge
structure called the heuristic hierarchy [WaGa89]. A heuristic hierarchy consists of one or more hierarchical
layers. Based on the functionalities of the rules, rules in the same layer are divided into groups of rules that
are called actions. Each action has a goal associated with it; invoking the action is an attempt to accomplish
the goal of the action. The top layer of a hierarchy contains only one action, which is the entrance point of the
control flow, and the goal of this action is the goal of the hierarchy. The heuristic hierarchy is a way to
simplify the modeling of the problem into structured units. Layers in the hierarchy represent the conceptual
hierarchical levels of the problem-solving process, where in each layer the different actions represent possible
solution steps that can be utilized to achieve the goals of the subproblem that the layer faces. The heuristic
hierarchy integrates rules into conceptually and logically related units whose relationship reflects the control
flow of the problem solution. Horizontal relations among the actions represent the parallelism or independence
that can be exploited in a layer by employing multiple actions at the same time, and vertical relations represent
the inherited sequential control flows among the adjacent layers. This hierarchical structure-organization of the
heuristics is simple, modular, efficient, and flexible.
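The layered organization just described can be sketched as a small data structure. This is an illustrative reading of the description, not the structure used in [WaGa89]: rules are modeled as condition/effect pairs, a rule may hand control to an action in the next layer down, and the facts, action names, and goals are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Facts = Dict[str, bool]


@dataclass
class Rule:
    condition: Callable[[Facts], bool]
    effect: Optional[Callable[[Facts], None]] = None
    sub_action: Optional[str] = None      # action in the next layer to invoke


@dataclass
class Action:
    name: str
    goal: str                             # every action has an associated goal
    rules: List[Rule]


class HeuristicHierarchy:
    def __init__(self, layers: List[Dict[str, Action]]):
        if not layers or len(layers[0]) != 1:
            raise ValueError("the top layer must contain exactly one action")
        self.layers = layers

    def run(self, facts: Facts) -> List[str]:
        """Enter at the single top action; return the actions invoked, in order."""
        trace: List[str] = []
        top_action = next(iter(self.layers[0].values()))
        self._invoke(0, top_action, facts, trace)
        return trace

    def _invoke(self, depth: int, action: Action, facts: Facts, trace: List[str]) -> None:
        trace.append(action.name)
        for rule in action.rules:
            if not rule.condition(facts):
                continue
            if rule.effect is not None:
                rule.effect(facts)
            if rule.sub_action is not None:   # vertical control flow to the layer below
                child = self.layers[depth + 1][rule.sub_action]
                self._invoke(depth + 1, child, facts, trace)


def build_demo() -> HeuristicHierarchy:
    top = Action("optimize_parallelism", "match the program to the machine", [
        Rule(lambda f: f["dependence_cycle"], sub_action="improve_general_parallelism"),
        Rule(lambda f: f["parallel_loop"], sub_action="create_tasks"),
    ])
    improve = Action("improve_general_parallelism", "break dependence cycles", [
        Rule(lambda f: True,
             effect=lambda f: f.update(dependence_cycle=False, parallel_loop=True)),
    ])
    create = Action("create_tasks", "create balanced tasks", [
        Rule(lambda f: True, effect=lambda f: f.update(tasks_created=True)),
    ])
    return HeuristicHierarchy([{top.name: top},
                               {improve.name: improve, create.name: create}])


if __name__ == "__main__":
    facts = {"dependence_cycle": True, "parallel_loop": False}
    print(build_demo().run(facts), facts)
```

The sketch runs the actions of a layer one after another; the horizontal independence mentioned above would allow them to be tried concurrently instead.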
Note that the purpose of introducing the hierarchical structure is not to impose a tightly coupled structure
on the knowledge base, because not all knowledge can be represented in structured or procedural form. Also,
if the structure of the rules is too tight, then the flexibility of the rule-based system may be lost. The purpose
of the hierarchical structure is to provide a knowledge organization structure that matches the hierarchical
structures in top-down problem-solving processes. The hierarchical structure preserves all the advantages of a
rule-based system but has better efficiency, modularity, and flexibility in the way it represents knowledge.
An example which applies this technique to the hierarchical decomposition of the parallelism
optimization process was presented in [WaGa89]. The implementation of the control in each layer determines the
efficiency and effectiveness of the subsystem. One can apply forward reasoning, backward reasoning, or
opportunistic reasoning to achieve the best result. By merging the flexibility in opportunistic reasoning of a
blackboard architecture and the well-structured control in the heuristic hierarchy, we derived a new
problem-solving model called the hier-blackboard. A hier-blackboard is a hierarchical problem-solving model that
utilizes the inference power of opportunistic reasoning but follows the control flows inherited from the
subproblem decomposition. This achieves a very flexible model that is well-suited to solving complex problems such
as optimizing program parallelism.
4.2.4. Opportunistic Reasoning and the Blackboard Architecture
In this section, we will discuss a new problem-solving model called the hier-blackboard. The
hier-blackboard extends the power of the heuristic hierarchy by employing the heuristic hierarchy for
structured organization and dynamic control, opportunistic reasoning for inference, and blackboards for information
sharing and communication. This results in a flexible and powerful problem-solving model that is well-suited
to complex and ill-conditioned problems such as program parallelization and optimization.
4.2.4.1. The Blackboard Architecture
The blackboard architecture [Nii86b, Nii86c] combines the blackboard and the opportunistic reasoning
model. The opportunistic reasoning model [Hayes83, Hayes85] is a problem-solving model in which pieces of
knowledge are applied either forward or backward at the most opportune time, whereas the blackboard is a
centralized knowledge representation method in which solution states and information are kept in a shared
blackboard.
In a blackboard system, the solution space is divided into one or more application-dependent hierarchical
levels and is stored in the blackboard. Information at each level in the hierarchy represents partial solutions
currently known to the level. The problem task domain is divided into loosely coupled subtasks which
correspond to areas of specialization within the task. Accordingly, the knowledge for computing intermediate
results and performing subtasks is organized into modules called knowledge sources. The knowledge sources
are logically independent and specialized entities, and they communicate with each other only through the use
of the blackboard. During the problem-solving process, the knowledge sources post information or intermediate
results onto the blackboard to update the state of the solution incrementally and they act according to the
information in the blackboard. If more than one knowledge source is willing to make contributions, the
conflict is resolved by a unit called control. The control uses control strategies to choose the most appropriate
knowledge source(s) to update the solution state in the blackboard. Opportunistic reasoning is applied within
the overall organization of the solution space and task-specific knowledge; that is, which module of knowledge
to apply is determined dynamically, one step at a time, resulting in the incremental generation of partial
solutions. This problem-solving process is repeated until an acceptable solution is found or the process cannot
continue for lack of knowledge or information.
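As an illustration only, the cycle described above can be sketched in a few lines of Python. The class
KnowledgeSource, the function run_blackboard, the integer priority used as the control strategy, and the toy
"analyze"/"vectorize" sources are all hypothetical names introduced for this sketch; they are not taken from
any of the cited blackboard systems.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class KnowledgeSource:
    """A logically independent module that touches only the blackboard."""
    name: str
    priority: int                            # hypothetical input to the control strategy
    can_contribute: Callable[[Dict], bool]   # is this source willing to act now?
    contribute: Callable[[Dict], None]       # posts a partial result on the blackboard


def run_blackboard(blackboard: Dict, sources: List[KnowledgeSource],
                   is_solved: Callable[[Dict], bool], max_cycles: int = 100) -> bool:
    """Run solution cycles until solved, stuck, or out of cycles."""
    for _ in range(max_cycles):
        if is_solved(blackboard):
            return True
        willing = [ks for ks in sources if ks.can_contribute(blackboard)]
        if not willing:
            return False  # no source can act: lack of knowledge or information
        # Control resolves the conflict; in this sketch, simply by highest priority.
        chosen = max(willing, key=lambda ks: ks.priority)
        chosen.contribute(blackboard)
    return is_solved(blackboard)


def demo_blackboard() -> Dict:
    """Toy run: an analysis step must precede vectorization."""
    bb = {"analyzed": False, "vectorized": False, "log": []}

    def analyze(b: Dict) -> None:
        b["analyzed"] = True
        b["log"].append("analyze")

    def vectorize(b: Dict) -> None:
        b["vectorized"] = True
        b["log"].append("vectorize")

    sources = [
        KnowledgeSource("vectorize", 2,
                        lambda b: b["analyzed"] and not b["vectorized"], vectorize),
        KnowledgeSource("analyze", 1, lambda b: not b["analyzed"], analyze),
    ]
    bb["solved"] = run_blackboard(bb, sources, lambda b: b["vectorized"])
    return bb
```

Which source runs next is decided one step at a time from the blackboard contents alone, so the order of
the sources in the list does not fix the order of execution.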
4.2.4.2. Blackboard Systems and Production Systems
Several factors distinguish blackboard systems from production systems. First, the knowledge is organized
into independent or semi-independent modules in the blackboard system, while in production systems all
knowledge is represented as production rules. Second, in blackboard systems, the control decision is
distributed into the knowledge sources, whereas in production systems the control is sequential.
4.2.4.3. Advantages of the Blackboard Model
The blackboard model has been a favorite choice for solving ill-conditioned or complex problems
because of the following properties: modularity in knowledge organization, flexibility in opportunistic reasoning,
and potential for parallel implementations.
Studies have shown the effectiveness of opportunistic reasoning in complex and ill-structured problem
domains [EHLR80], [Nii86b]. An ill-structured problem is characterized by poorly defined goals and an
absence of a predetermined decision path from the initial state to the goal state. In our case, the problem of
optimizing program parallelism falls into this category. The blackboard approach requires no a priori
determined path; the decision of what to apply next is made during the problem-solving process at run time.
4.2.4.4. Weaknesses of the Blackboard Model
The blackboard model has the following weaknesses:
1. Centralized and global data is a bottleneck for parallel implementation. All modifications to the
blackboard are visible to and monitored by all knowledge sources; this can be very difficult to implement
efficiently on parallel machines. This problem can possibly be solved by designating private blackboard
sections to knowledge sources.
2. Lacking general guidelines for implementation. The division and organization of the solution states,
solution knowledge, and solution tasks make a great deal of difference in the efficiency, clarity, and
effectiveness of the implementation. General guidelines in this regard are needed, but it is difficult to
come up with such criteria because of the diversity of the problem domains. Also, the hierarchical
structure of the domain knowledge is blurred by the flat structure of the knowledge sources and control.
3. Expensive to build. The blackboard model should be used only when its advantages justify the cost of
building it.
4.2.5. The Hier-Blackboard Model
In this section, we introduce a hierarchical multi-blackboard model that we call the hier-blackboard.
The key idea here is to generalize the concept of knowledge sources to map the structure of knowledge sources
to the structure of the solution method and localize the interaction between knowledge sources. A commonly
used problem-solving method for complicated problems is the divide-and-conquer approach in which the
problem is divided into subproblems that can be solved directly or be further divided into subproblems until the
subproblems can be solved. This achieves a hierarchical level of problem partitioning which is a natural
structure for the organization of the problem-solving knowledge. When the task of the system is divided into
analytic levels of knowledge sources that implement the subtasks, the knowledge sources can be implemented as a
blackboard subsystem if the subtasks they are responsible for are complicated. By applying this methodology
recursively to the derived blackboard subsystems, we derive a model that is logically and structurally tailored
to the particular problem at hand.
4.2.5.1. The Framework of the Hier-Blackboard Model
A hier-blackboard system consists of the following four types of components: knowledge sources,
blackboards, controls and communications.
Knowledge Sources
Knowledge in a hier-blackboard system is organized into hierarchically structured knowledge sources
according to the problem domain. In our model, a knowledge source is a conceptual unit that interacts with
other knowledge sources which share the same blackboard through explicit information updates on the
blackboard. In other words, a knowledge source can be a set of rules, a procedure, or a blackboard subsystem if the
subtask warrants the creation of such a subsystem.
Blackboards
There is a master blackboard, which is the blackboard at the top level of the hierarchy that holds the
global solution state of the problem. Optional blackboards can be added in the hierarchical tree structure. The
blackboards in the sub-blackboard systems are used to hold private entities needed and produced by the local
knowledge sources. Each blackboard subsystem works on its private blackboard until global information
updates or accesses are needed; in that case, it simply acts like any other knowledge source and competes to
update the blackboard at the higher level.
Control
For each blackboard in the model, there is a blackboard controller which can be a set of knowledge
sources or a separate module that monitors changes on the blackboard and decides what knowledge sources in
the system should be executed in case of conflicts. The idea of having a control blackboard ([Hayes85]) can
also be incorporated into this model.
Under the supervision of the control, the problem-solving process proceeds through a series of solution
cycles. During each cycle, specialized knowledge sources check the blackboard; they self-nominate (if
possible) by reporting their possible contributions to a special section of the blackboard that is called the
registration board. The control checks the registration board and picks the most appropriate knowledge
sources to perform their actions. User-supplied control strategies can be consulted by control to select the
"most appropriate" knowledge sources among those that self-nominated. This process continues until the
problem is solved or terminates in failure.
As an analogy, the control structure of a hierarchical blackboard system is very similar to the corporate
hierarchy of a company. The divisions are connected by tree-structured command channels. Each subdivision
can make decisions on its own, but inter-division matters need to be solved by going through their supervisors.
Communication
As the structure of the system is defined, the communication channels between the knowledge sources
and the blackboards are defined implicitly. Interaction between the knowledge sources in the same blackboard
subsystem is done through the updates of the local blackboard. Inter-blackboard communication is carried out
through communication channels that are specified by the system organization. If a knowledge source is
implemented as a blackboard subsystem, the communication control in the subsystem will coordinate its
knowledge sources and update the shared blackboard at the upper level at the request of its knowledge
sources. The communication in the blackboard hierarchy is done by messengers who run up or down one level
in the hierarchy to deliver messages between the levels. A messenger is a special knowledge source which
monitors a designated area called a message box in the blackboard for communication. When certain data in
the message box is updated, certain actions are triggered and the messenger delivers the message to the target
blackboard. For example, if one blackboard subsystem decides to modify a piece of the global problem state,
it gives the message to its messenger for its parent, and the messenger delivers the message to the upper level.
The messenger in the upper level updates the information in its blackboard. The update of the information
triggers an action which is to update the blackboard one more level up. In this way, the message will be
propagated to the master blackboard. Note that the original knowledge source which initiates the chain of
update actions may not even know that the information has been sent up the hierarchical ladder since it is only
responsible for contributing to its own blackboard. The messenger mechanism helps to make the blackboard
subsystems modular and clean. The blackboard update operations can be implemented efficiently based on the
target architecture to minimize the communication overhead.
Primitives for communication between blackboards include access and update of entities of the parent or
children blackboards. For update operations, a condition may be sent along with the request. The condition
will be checked on the target blackboard with information present in the target blackboard. This last operation
is very powerful since the knowledge source does not need to know the state of the parent blackboard system
to update information. This conceptual simplicity is accomplished by the messengers who act as
representatives for the subsystems to their parent blackboards. This knowledge encapsulation shields the internal
operations inside the subsystems and allows them to appear to their parent blackboards as regular knowledge
sources. The messenger model we provide here is simple but powerful, and efficient operations can be defined
through this framework. For example, an entity in the message board can be set up so that whenever data is
written to it, the data will be immediately sent up to the message board of the parent. In this way, data can be
pipelined up the blackboard hierarchy.
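The messenger mechanism and the conditional update primitive can be sketched as follows. This is a minimal
Python illustration under stated assumptions: the names Blackboard, post_upward, run_messenger, propagate,
and improves are hypothetical, and re-checking the condition at every level of the hierarchy is one possible
reading of the text, not a detail it specifies.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Condition = Optional[Callable[[Dict[str, Any]], bool]]


class Blackboard:
    """One node in a tree of blackboards; the root plays the master blackboard."""

    def __init__(self, name: str, parent: Optional["Blackboard"] = None) -> None:
        self.name = name
        self.parent = parent
        self.entries: Dict[str, Any] = {}
        # Designated area watched by this blackboard's messenger.
        self.message_box: List[Tuple[str, Any, Condition]] = []

    def post_upward(self, key: str, value: Any, condition: Condition = None) -> None:
        """A knowledge source leaves an update request for the parent level."""
        self.message_box.append((key, value, condition))

    def run_messenger(self) -> None:
        """Deliver pending requests one level up, checking conditions at the target."""
        pending, self.message_box = self.message_box, []
        if self.parent is None:
            return  # master blackboard: there is no level above
        for key, value, condition in pending:
            # The condition sees only the target blackboard's own entries.
            if condition is None or condition(self.parent.entries):
                self.parent.entries[key] = value
                # The update triggers a further request one more level up.
                self.parent.post_upward(key, value, condition)


def propagate(board: Optional[Blackboard]) -> None:
    """Run the messengers from the given level up to the master blackboard."""
    while board is not None:
        board.run_messenger()
        board = board.parent


def improves(cost: float) -> Callable[[Dict[str, Any]], bool]:
    """Condition: accept the update only if it lowers the recorded best cost."""
    return lambda entries: cost < entries.get("best_cost", float("inf"))
```

The posting knowledge source never reads the parent's state; the condition travels with the request and is
evaluated where the data lives, which is what keeps the subsystems encapsulated.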
4.2.5.2. Issues for Parallel Implementation
Most early research in exploring parallelism of blackboard systems was based on models of target
hardware architectures. Examples of parallel implementations of blackboard systems are distributed systems
such as TRICERO ([Will85]) and the distributed vehicle monitoring test-bed ([LeCo83]), and concurrent
blackboards such as CAGE and POLIGON ([Nii86a, Nii86b]). The distinction between the structures of the
underlying computational model and the solution model allows the implementation of the solution to be constructed
in a cleaner and more structured fashion. The multiple-level structure of the hier-blackboard model can be
mapped onto actual hardware by a dynamic or static scheduling procedure. For an unlimited processor
model, the mapping is simple, since a processor can be assigned to each basic knowledge source. When this
unlimited processor model is mapped to an actual machine that has a finite number of processors, the static
mapping will have to be based on the estimated costs, the structure of the hier-blackboard, and the locality of
the data. Because these estimates are uncertain, a dynamic task-allocation scheme may have an edge over static
task allocation.
The actual instructions to update the blackboards can be tuned to the underlying hardware, but this is
shielded from users with the blackboard update operations.
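To make the finite-processor case concrete, the sketch below assigns knowledge sources to processors using
only estimated costs. It uses a standard greedy longest-processing-time rule, which is a technique swapped in
here for illustration and not the mapping procedure of this work; it deliberately ignores the structure of
the hier-blackboard and the locality of the data. The function name static_mapping and the cost figures in
the usage note are hypothetical.

```python
import heapq
from typing import Dict, List


def static_mapping(costs: Dict[str, float], num_processors: int) -> Dict[int, List[str]]:
    """Greedy static mapping: the most expensive source goes to the least loaded processor."""
    if num_processors < 1:
        raise ValueError("need at least one processor")
    # Min-heap of (current load, processor id).
    heap = [(0.0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    assignment: Dict[int, List[str]] = {p: [] for p in range(num_processors)}
    # Visit sources by decreasing estimated cost; break ties by name for determinism.
    for name, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, p = heapq.heappop(heap)
        assignment[p].append(name)
        heapq.heappush(heap, (load + cost, p))
    return assignment
```

With five sources of estimated cost 5, 4, 3, 3, and 1 on two processors, this rule balances the load at 8 units
each. A dynamic scheme would instead make the same choice at run time, using measured rather than estimated
costs.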
4.2.5.3. Simulation of Parallel Hier-Blackboard on Sequential Machines
When several knowledge sources need to share a processor, a special knowledge source called a
scheduler can be provided to control the execution of a set of knowledge sources on a processor. The
scheduler monitors the regions of the blackboard (not necessarily in the same processor) that its knowledge
sources are interested in and activates them when appropriate. In the extreme case, all knowledge sources of a
blackboard subsystem share a processor, and the scheduler enables a sequential simulation of the parallel
model.
4.2.5.4. Comparison with Other Blackboard Models
The hier-blackboard is a generalized blackboard model. It inherits all the benefits of the blackboard








Better framework for organization of problem-solving knowledge. The hier-blackboard model provides
a framework for better structuring of problem-solving knowledge by matching the structure of the
knowledge sources with the decomposition of the problem space and the solution methods.
Locality and efficiency. Localized communication improves both the locality and the efficiency of the
system.
Flexibility in utilizing opportunistic reasoning. Opportunistic reasoning can be applied only at the spots
that need the power and flexibility of opportunistic reasoning. Simpler subproblems can be solved by
using more straightforward approaches such as rule-based systems.
Flexibility in structural organization. The structure of the hier-blackboard model is very flexible. Its
message-passing method can range from the centralized blackboard model to the distributed
message-passing provided by the messenger mechanism. Depending on the problem domain, the structure of the
knowledge sources can be flat or a complicated multiple-level hierarchy. The flexibility in the structure
and the built-in specialized knowledge sources make problem solving easier than in traditional
blackboard systems.
Higher potential to be parallelized. The hier-blackboard model is designed with concurrency in mind.
Higher locality means higher potential for parallel implementation. Built-in knowledge sources such as
the scheduler, control, and the messenger simplify the implementation of the parallel problem-solving
model. Depending on the structure, the hier-blackboard model can be applied at different degrees of
parallelism ranging from sequential to highly parallel.
4.2.5.5. Applying the Hier-Blackboard to Parallel Compilers
To apply the hier-blackboard model to parallel compilers, we decompose the process of optimizing
program parallelism hierarchically based on the method we described in section 4.2.2. As can be seen from figure
4.1, the most complicated subproblems are the parallelism matching process and the five subproblems for
improving general parallelism, creating tasks, allocating processors, minimizing synchronization, and
optimizing memory access. These subsystems can be implemented as blackboard subsystems with the parallelism
matching module managing the other five blackboard subsystems corresponding to the five subproblems for
improving program parallelism. The topmost layer, the parallelism matching control layer that controls the
performance evaluation and the selection of the focuses, can be implemented as the control module for the
blackboard system for parallelism matching. We encapsulated each program transformation technique into an
object that contains the procedure for performing the transformation, the applicability test, heuristics for
utilizing the transformation for different purposes for different kinds of programs and target architectures, and
heuristics for selecting appropriate arguments to apply (methods of application). These modules form the
knowledge sources for the subsystems for improving general parallelism, creating tasks, allocating processors,
minimizing synchronization, and optimizing memory access. We also chose to implement the program
parallelism analysis and machine parallelism analysis as rule-based systems that can be activated by the program
parallelism matching subsystem and its knowledge sources.
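The encapsulation of a transformation into an object can be pictured roughly as below. This Python sketch
is illustrative only: the prototype's actual representation is not shown in this section, and the class
Transformation, the function nominate, the feature keys, and the constant benefit scores are all assumptions
made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Features = Dict[str, Any]


@dataclass
class Transformation:
    """One program transformation packaged as a knowledge source."""
    name: str
    applicable: Callable[[Features, Features], bool]   # applicability test
    benefit: Callable[[Features, Features], float]     # heuristic estimate of usefulness
    perform: Callable[[Features], Features]            # procedure that does the transformation


def nominate(candidates: List[Transformation], program: Features,
             machine: Features) -> List[str]:
    """Names of the applicable transformations, most promising first."""
    scored = [(t.benefit(program, machine), t.name)
              for t in candidates if t.applicable(program, machine)]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical transformations with made-up constant benefit scores.
VECTORIZE = Transformation(
    name="vectorize",
    applicable=lambda p, m: bool(m.get("vector_capability"))
    and bool(p.get("innermost_vectorizable")),
    benefit=lambda p, m: 8.0,
    perform=lambda p: {**p, "vectorized": True},
)
INTERCHANGE = Transformation(
    name="loop_interchange",
    applicable=lambda p, m: int(p.get("nest_depth", 1)) >= 2,
    benefit=lambda p, m: 2.0,
    perform=lambda p: {**p, "interchanged": True},
)
```

Because the applicability test and the heuristics take both program and machine features as arguments, the
same object can be reused unchanged when the target machine changes; only the machine feature list differs.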
4.2.6. Comparison of the Three Frameworks
We discussed three different frameworks for realizing the program optimization process for parallel
computers. We will compare these three frameworks from five aspects: optimality, efficiency, flexibility, simplicity
and parallelism.
The hier-blackboard is an extension to the heuristic hierarchy, so they share most of the characteristics
except that the hier-blackboard utilizes the parallelism within the program optimization process and features
opportunistic reasoning, which is much more sophisticated and flexible than the control in the heuristic
hierarchy. The opportunistic control in the hier-blackboard model is a bit more complicated than the rule-based
approach used in the control of the heuristic hierarchy and is more difficult to program (but it is still
significantly simpler than the general blackboard architecture due to the modular organization and abstraction).
The A* algorithm guarantees that the resulting solution will be optimal as long as the heuristic function
h is a lower bound of the actual cost h*. The other two approaches use heuristic rules to select the
transformations and thus lose this optimality guarantee. The efficiency of the A* algorithm depends on how close the
heuristic function h is to h*. In other words, the A* algorithm can be very efficient if the performance estimation
function h is very accurate. However, accurate performance estimation at compile time is as hard as the program optimi-





Figure 4.2. The structure of a program parallelism optimizing system based on the hier-blackboard model.
estimation and the efficiency of the program optimization process is needed.
The efficiency of the hierarchical reasoning framework depends heavily on the richness of the
knowledge. AI techniques such as pattern recognition, constraint propagation, goal reduction, and learning can
be used to improve the efficiency of the system.
The A* algorithm is not as flexible as the other two approaches since the heuristic functions are applied
to the whole decision tree, while the decision-making processes of the other two approaches can have different
control strategies at different stages.
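The lower-bound condition on h mentioned for the A* algorithm can be illustrated with a generic A* search
over a small state space. The sketch below is the textbook algorithm, not the specific formulation used in
this work; the function a_star, the four states P0 to P3, the step costs, and the heuristic values are
hypothetical. In the toy graph the true remaining costs are h*(P0) = 4, h*(P1) = 2, h*(P2) = 1, so the chosen
h never overestimates.

```python
import heapq
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple

State = Hashable


def a_star(start: State,
           is_goal: Callable[[State], bool],
           successors: Callable[[State], Iterable[Tuple[State, float]]],
           h: Callable[[State], float]) -> Optional[Tuple[float, List[State]]]:
    """Return (cost, path) of a cheapest path, assuming h never overestimates."""
    frontier = [(h(start), 0.0, start, [start])]   # entries are (f = g + h, g, state, path)
    best_g: Dict[State, float] = {start: 0.0}
    while frontier:
        _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return g, path
        if g > best_g.get(state, float("inf")):
            continue  # a cheaper route to this state was already expanded
        for nxt, step in successors(state):
            g2 = g + step
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None


# Hypothetical state space: each state is a program version, each edge a transformation.
GRAPH = {"P0": [("P1", 2.0), ("P2", 5.0)], "P1": [("P3", 2.0)],
         "P2": [("P3", 1.0)], "P3": []}
H = {"P0": 3.0, "P1": 2.0, "P2": 1.0, "P3": 0.0}   # lower bounds on the remaining cost
```

The closer h is to h*, the fewer states are expanded before the goal is popped, which is the efficiency
dependence noted above.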
The above comparison is summarized in table 4.1 below. Our original implementation of the prototype
system used the heuristic hierarchy [WaGa89], but was later transferred to the hier-blackboard. We also
studied the A* algorithm because of its optimality guarantee. The selection of the framework is based on many
complicated considerations and does not imply that any one of the frameworks is significantly better than the others.
Table 4.1. A subjective comparison of the three frameworks: the A* algorithm, the heuristic hierarchy, and
the hier-blackboard.
                     optimality  efficiency  simplicity  flexibility  parallelism
A* algorithm         optimal     depends     good        fair         NA
heuristic hierarchy  depends     good        fair        good         NA
hier-blackboard      depends     depends     complex     excellent    excellent
4.3. Conclusion
In this chapter, we discussed three frameworks for implementing the feature-directed program
optimization framework in a parallel compiler. The three frameworks are discussed in detail and are compared based
on optimality, efficiency, simplicity, flexibility, and parallelism. We will discuss issues that support the
framework, such as machine knowledge manipulation, performance prediction, heuristics and transformation
techniques to improve program parallelism, in the next three chapters. In chapter 8 we will describe a proto-




MACHINE KNOWLEDGE MANIPULATION ISSUES
FOR PARALLEL COMPILERS
5.1. Introduction
In this chapter, we study issues related to the representation and manipulation of the knowledge of
machine parallelism and the implications of these issues for parallel compilers. The main questions that we are
concerned with include what kind of machine knowledge a parallel compiler needs to have, how to analyze program
optimization heuristics and identify the machine features involved in them to describe the knowledge, how to
represent the machine knowledge and the heuristics, how to acquire the inference capability, and how to
support different classes of parallel computers.
An object-oriented hierarchical machine feature representation scheme that is designed to address these
problems is presented. This representation scheme features modular knowledge representation, various degrees
of abstraction, and hierarchical reasoning. It provides a foundation for systematic analysis of heuristics and
allows parallel compilers to support different classes of target architectures. A machine knowledge
manipulation system that realizes the representation scheme is also presented. The system is implemented in Prolog and
provides a mechanism for interactive machine feature specification and classification; it also supports reasoning
based on the program and machine features.
5.1.1. Feature-Directed Program Optimization
As discussed in the last chapter, methodologies for optimizing parallelism on a particular parallel
machine are often based on features of the architecture. Under the feature-directed program optimization
model, optimizing parallel compilers use the features of the program and machine explicitly to control the
restructuring of the programs. Unlike other parallel compilers where decisions for program optimization are
based on the implicit heuristics that are hardwired and scattered in the compilers, this approach allows the
compiler to base its decisions on features of both the target machine and the program. In this way, the
compiler is actually "programmed" by the features of the chosen target machine and the program to be optimized.
For example, figure 5.1 shows the flow graph of a simple heuristic for loop blocking that is based on machine
features.
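A heuristic of this kind can be sketched as a function of a machine feature list and a few loop properties.
The Python below is a loose reading, not a transcription, of figure 5.1: it assumes that V denotes the vector
register length and N the trip count of the innermost loop, and the function names choose_block_size and
strip_mine as well as the dictionary keys are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple


def choose_block_size(machine: Dict[str, Any], loop: Dict[str, Any]) -> Optional[int]:
    """Return a block size for the loop, or None if blocking is not indicated."""
    if not machine.get("has_vector_capability", False):
        return None
    v = int(machine.get("vector_register_length", 0))   # assumed meaning of V
    n = int(loop["trip_count"])                          # assumed meaning of N
    if not 0 < v < n:
        return None   # no vector registers, or the loop already fits in one register
    if not (loop["is_innermost"] and loop["is_vectorizable"]):
        return None
    return v


def strip_mine(n: int, block: int) -> List[Tuple[int, int]]:
    """Half-open (start, end) index ranges covering 0..n, each at most `block` long."""
    return [(start, min(start + block, n)) for start in range(0, n, block)]
```

The decision depends only on declared features, so retargeting the heuristic to a machine with longer vector
registers or with no vector unit means changing the feature list rather than the compiler code.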
The effectiveness of knowledge-based compilers and multiple-target parallel compilers relies on suitable
knowledge representation and processing schemes for the representation and manipulation of the machine
parallelism and program optimization knowledge. The machine knowledge representation scheme needs to
provide a foundation for the integration and organization of the program optimization knowledge and support
for performance evaluation and reasoning. In section 5.2, we discuss machine features that are important to
parallel program optimization and the requirements for knowledge representation in parallel compilers. In
section 5.3, we present an object-oriented machine knowledge representation scheme which features modular
knowledge representation, various degrees of abstraction, and support for hierarchical reasoning.
Recognizing and collecting useful heuristics and analyzing and separating machine features from the
program optimization heuristics are very involved jobs. A good knowledge manipulation system can help
knowledge engineers comb through the complex and ill-organized knowledge and identify the essential
elements of the knowledge to help them transform fragmented heuristics into well-defined programs. Automatic
tools to help knowledge engineers program parallel compilers are highly desirable and long overdue. In
section 5.4 we introduce a machine knowledge manipulation system that is based on the knowledge
representation scheme discussed in section 5.3. The machine knowledge manipulation system can be used to
interactively analyze heuristics for optimizing parallelism, comparing machine features, abstracting new machine
[Figure 5.1: flow graph of a loop-blocking heuristic. Recoverable node labels: has_vector_capability; V > 0 and
V < N; innermost loop Lk; number of vector registers = R; Lk is vectorizable. Inputs: machine feature list,
program dependence graph.]
Figure 5.1. A simple example to illustrate feature-directed program optimization.
knowledge, and installing new target machines.
5.2. Machine Features and Parallel Compilers
5.2.1. Machine Features
Properties of the target machine that affect the concurrent execution of the machine are called machine
features. The definition of the machine features records the distinct properties of the architecture that need to be
considered in utilizing the parallelism on the architecture. The manipulation of machine knowledge includes
representation and organization of machine features, inference and modification of machine features, and
support for reasoning based on machine features.
A detailed discussion of machine features and their effects on program transformation was presented in
[WaGa87]. In this section, we examine different aspects of machine features (such as interaction between
features, abstraction and classification of the features) and the criteria that these aspects impose on the
representation of machine features.
5.2.2. Important Machine Features for Parallel Compiler
Functionally, a parallel computer can be characterized by four components: the processing elements,
the memory hierarchy, the communication networks, and the control unit.
The processing element is the hardware unit that carries out the computations. The memory hierarchy is
a functional unit that provides data stores of different speeds. The communication network interconnects the
different components of the system. The control unit is the functional device that controls the execution of the
parallel computer. The processors can also be grouped into clusters. A cluster is a collection of processors
that is capable of executing a collection of tasks in a tightly coupled manner. For example, the computational
complex (CEs) of the Alliant FX/80 forms a cluster that is distinct from the interactive processor system in the
FX/80. A system may support multiple clusters with multiple processors per cluster, as in the Cedar system
[KDLS86], or it may be viewed as one tightly coupled cluster of processors, as in the Connection Machine, or
a loosely coupled system of one-processor clusters, as in the Cray X-MP.
A parallel computer may have some or all of the above components. Since there are so many possible
combinations of the machine configurations, it is impossible to describe all cases here. Instead, we will study
variations of some important features and their effects on the programming methodologies and the parallelism
optimization issues of parallel computers.
5.2.2.1. The Processing Elements
A processing element may contain one or more scalar arithmetic processors and vector processors. A
scalar arithmetic processor processes data sequentially and can be characterized by the costs of basic operations
and the number of functional units (adders, multipliers, etc.). Parallelism which exists at this level is very
fine-grained and has been explored extensively in sequential compilers.
The vector processor contains one or more vector pipelines, vector registers and vector operation
control. The vector pipeline overlaps the execution of the operations, and can be characterized by the number of
pipelines and the number of stages in each pipeline.
The vector operation control unit controls the loading, storing, chaining and execution of the vector
pipelines. It can be characterized by the following features: vector instruction types, costs of vector operations,
vector startup time, vector operation chaining, cost of non-uniform stride operations, cost of scatter-gather, and
vector reductions. For a machine that has vector registers, it is important to keep the vector operands in the
registers and to match the length of vector operations with the length of the vector registers. For a machine that has
no vector registers, the operands of the vector operations need to be in the memory (e.g. Cyber 205). Important
issues for these kinds of machines include increasing vector lengths, avoiding bad strides, and chaining the
vector operations.
Table 5.1. Machine features of processing elements of some parallel computers.
Feature Name   nCUBE 2   iPSC/860   Alliant FX/8
Number of processors   64   512   8
Maximum number of processors   8196   1024   8
computation mode   parallel   parallel   parallel+vector
scalar instruction type   three-address   three-address   three-address
scheduling method   [user_control]   [user_control]   [user_control, self_scheduling]
clock cycle rate   20   40
peak performance MIPS   7   40   10
peak performance FLOPS/PE   35   80   11.75
peak performance / vector pipe   40   30
address bits   26   32   28
operation overlapping   [data_prefetch]   [alu_fpu_units, cpu_fpu_units]   [alu_fpu_units]
Cpu-Cpu overlapping operations   [add, mul]
vector capability   no   no (floating-point pipeline)   true
vector startup time   small   small
vector chaining   true
vector operand   register
vector reduction   [inner-product, min, max, ...]
number of pipelines (per proc.)   2   1
pipeline stages   3   5
5.2.2.2. The Memory Hierarchy
A complete memory hierarchy may have global memory, cluster memory, local memory, cache memory,
and vector memory. Most parallel computers utilize two or more levels of memory hierarchy, but some have
only shared or private memories.
Global memory can be accessed by all processors, and can be either physically centralized in one
memory module (as in the Alliant FX/8) or distributed among processor units (as in the BBN Butterfly and the
IBM RP3). Cluster memory is shared by the processing units in the cluster (e.g. CSRD Cedar). Local
memory is owned exclusively by individual processors. However, some computers have a centralized
controller which can access all local memories (as in the Pringle [KGSF84, KWGCS84], or the Connection
Machine [Hillis85]). Another extreme is the memory modules in the IBM RP3 where private and shared
memories share the same module and the boundary between them is shiftable and controlled by software.
Cache memory is usually a very fast memory module; there is usually a high bandwidth bus between memory
and cache so that data can be transferred in blocks. The vector registers are used to slore the vector operands
so that memory accesses can be minimized Cor vector operations. Important features about veclor registers
include the number and the size of the vector registers and the cost of loading and storing infonnation into vec-
tor registers.
Important features about the structure of the memory hierarchy include the size of different memories,
memory sharing (shared cache, private cache, shared memory, private memory, etc), cache coherence strategy
(compiler management, snooping cache, etc), centralized or distributed memory modules, memory interleaving,
cost ratios of different memory accesses, vector prefetch mechanism, and available memory synchronization
commands (fetch-add, locks, memory tags, etc).
The major goal of memory management is to minimize the data access time. This can be achieved by
removing unnecessary data dependencies, keeping data in the fastest memory, changing memory access
patterns, overlapping memory accesses with other operations, using block memory accesses, deciding what data to
cache, and so on.
Table 5.2. Machine features for memory hierarchy of some parallel computers.
Feature Name nCUBE 2 iPSC/860 Alliant FX/80
number of vector registers 0 8
size of vector registers 32
data cache size (bytes) 0 8000 128000
cache block size (bytes) 32 32
cache coherence strategy write_back write_back
shared cache no no yes
local memory size (Mbytes) 4 16 0
max local memory (Mbytes) 64 4000 0
global memory size (Mbytes) 0 0 64
maximum global memory (Mbytes) 0 0 128
vector prefetch mechanism ['global to cache', 'global to register']
special synchronization instr ['memory tags']
register/cache access cost ratio 2 2
cache/memory access cost ratio 3 5
For machines with only global memory with uniform access times, the primary issues are minimization
of synchronization and critical regions and utilization of fast registers.
For machines that have local memory, cache memory, or global memory with non-uniform access times,
locality of the data becomes very important. In such situations, deciding where to put the data is generally
based on the ratios of the different memory access times. Storing shared data in the local or cache memory
introduces the data coherence problem along with the overhead of moving the data. Only when the gain in
shorter access time outweighs the loss in overhead of transmitting and updating the data should a local
copy of the data be created. Some machines support block access of the data, which can decrease the data
transmission time and should be used whenever possible. For a machine that has only local memory,
message-passing strategies are the basis of all synchronization and accesses to shared information. Issues to be
considered include domain decomposition, load balancing, the ratio of communication to computation, and
overlapping communication with computation. On most networks, the unit cost of transmitting data decreases as
the message gets longer. Therefore, long messages are usually favored. On the other hand, long messages need
more time for computing and so may undesirably increase the synchronization cost. Striking a balance between the
communication and computation costs is necessary and non-trivial.
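The observation that the unit cost of transmission falls as messages get longer can be illustrated with a simple linear cost model. The sketch below is only an illustration: it assumes that sending a message costs a fixed startup time plus a per-word time, and the function names and numeric parameters are hypothetical rather than taken from any of the machines above.

```python
def message_cost(n_words, startup_us, per_word_us):
    """Total time (microseconds) to send one message of n_words words,
    under the assumed model: fixed startup plus a per-word charge."""
    return startup_us + per_word_us * n_words


def unit_cost(n_words, startup_us, per_word_us):
    """Average time per word; approaches per_word_us as messages grow."""
    return message_cost(n_words, startup_us, per_word_us) / n_words


# Hypothetical parameters, chosen only for illustration.
STARTUP_US = 100.0
PER_WORD_US = 2.0

for n in (1, 10, 100, 1000):
    print(n, unit_cost(n, STARTUP_US, PER_WORD_US))
```

Under this model the startup time is amortized over the whole message, which is why batching data into longer messages pays off until the extra time spent producing the data starts to delay the receiver.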
5.2.2.3. Interconnection Networks and Busses
The connections between components of the system (processors, memory modules, clusters, etc) can be
either busses or more complex interconnection networks. A bus architecture has the advantage of simple and
fast data transmission, but it allows only a small number of devices to be attached to it. For machines with
busses, the primary concern is the cost of memory accesses. For machines with network connections,
important features of communication networks include network topology, network bandwidth, delay per
network stage, packet or circuit switching, packet size, maximum number of pending memory references a
processor can have in the network, network routing (self-routing, store-and-forward, worm-hole routing, etc), and the
performance penalty of self-routing.
Network topology is probably the most important property of a network. It affects the algorithms for
solving the problem and the way the data are distributed in the system. On networks with low bisection
widths, certain data movements are notoriously slow. For example, a matrix transpose is extremely costly on
trees and rings. A complete study of the role of topology in parallel algorithm design is found in [GaVR84].
From the compiler's point of view, there are two critical issues in network management: the routing
algorithms and minimization of network traffic. Some computers have self-routing hardware or software for
which routing mechanisms like worm-hole routing ([Dally87]) can save significant data transmission time and
ease the job of both programmers and compilers. For other machines such as the Pringle, the compiler needs
to plan a path and generate code to perform the routing. Also, if the network is such that some processors are
"nearer" than others, and if the message delay from a far processor is significantly more than from a near
processor, optimal data placement becomes critical. Unfortunately, not only is this problem NP-complete, but
there are also very few good heuristics for it. Techniques to solve this problem include minimizing
intersections of subdomains, minimizing distances the messages need to travel, accumulating data into longer
messages, and using code motion to minimize synchronization delays.
For machines that support self-scheduling loops, the program restructuring system can leave the task
scheduling problem to the operating system of the machine at run time by changing the outermost loop into a
self-scheduling loop. However, using the self-scheduling loops makes good global array decomposition almost
impossible, since it can only be known at run time which loop will be run by which processor. For machines
that have no combining network, balancing the computational load and avoiding network contention are among
the major challenges to parallel compilers.
Table 5.3. Machine features for communication networks of some parallel computers.
Feature Name nCUBE 2 iPSC/860 Alliant FX/80
interconnection type network network network/bus
network topology hypercube mesh(16x32) crossbar
memory-cache interconnection channel channel bus
cache-PEs interconnection NA bus crossbar
memory-cache bus bandwidth 150 MB
cache-PE network bandwidth 376 MB
network bandwidth (Mbytes/channel) 55 25 47
number of channels per processor 14 4
message startup init α0, μs 158.45 100
message startup per hop α1, μs 1.29 1.0
message transmit per word β0, μs 2.49
min package size (bytes) 0 100 1
routing control [self-routing] [self-routing] NA
5.2.2.4. The Control Unit and Processor Clusters
Some machines have control processors that control the execution of the processing units (for example,
most SIMD machines have a single control processor). On other machines, the control exists only in the form
of cooperating software at the operating system level. On machines with centralized control at the instruction
level, such as SIMD machines and VLIW machines, synchronization overhead is small, so parallelism can be
utilized at the fine-grain (instruction) level. In particular, for VLIW (very long instruction word) machines, all
scheduling, runtime synchronization and communication are completely specified at compile time. This results
in very expensive compilation and limited parallelism. Opposite to the VLIW machines are the data flow
machines where processors are activated by the flow of the data. The control on a dataflow machine tends to




of the dataflow graph of the program. This can be done by common subexpression elimination, code motion,
dead-code elimination, constant folding, etc.
Another type of control involves dynamic task scheduling on most shared memory MIMD systems. This
kind of system usually utilizes the parallelism at the process level with dynamic process control. Depending
on the configurations of the machines, the cost of process initialization and process switching may vary from
very cheap (light-weight task) to very expensive (heavy-weight process). For machines that support
light-weight processes (tasks), the major issues are the scheduling of tasks and the minimization of communications.
For machines that do not support light-weight tasks, large processes need to be created. This usually results in
matching the number of processes with the number of processors.
For machines with multiple clusters, there are two levels of scheduling: a "micro-task" level that
manages jobs within each processor and a "process" level that assigns processes to each cluster. The major
issues to consider are the granularity of the tasks and the locality of data in the clusters.
Features of the control unit at the hardware level include the number of processors, hardware primitives
for scheduling (for example, the fetch-and-add operation), types and features of shared resources, task
scheduling time, process switching cost, and primitives for scheduling and their costs. The control features at the
operating system level include the scheduling mode (self-scheduling, data-driven, etc), scheduling heuristics
and system scheduling operations. At the programming level, the control includes user assertions, heuristics
for process scheduling, explicit parallelism control constructs, and timing specification for real time control
programs. Features at the cluster level include cluster size, homogeneous or heterogeneous clusters,
shared resources within clusters, task switching time within a cluster, and processor scheduling policy within
a cluster.
5.3. Design Considerations for Machine Knowledge Representation Schemes
The machine features listed in the last section represent fragmented knowledge of the machine and lack a
comprehensive understanding of the whole architecture that they describe. In order for the system to build up
an understanding of the target machine, the knowledge needs to be connected by relationships between the
features and the knowledge of how these features affect the parallelism of the target machine. This implies
that the knowledge representation will have to support the composition of the knowledge and mold pieces of
the knowledge together to obtain a whole view. To avoid a tedious specification process, the task of finding
relationships between features should be done only once for each pair of features and not repeated for other
machines. In the next section we will describe an object-oriented knowledge representation scheme which
allows one to build up the structure among the features. Before we get into the details of the representation
scheme, we discuss several design decisions for supporting the reasoning capability of the feature-directed
program restructuring model. Should the machine knowledge base be organized as a flat-structured feature list or
a hierarchically structured tree? Should the representation allow implicit knowledge implication or should all
relationships be spelled out explicitly? These decisions affect both the power and elegance of the
representation scheme.
Flat Structure Versus Hierarchical Structure
Representing the machine features in a flat structure has the advantage of being simple and explicit. In
[WaGa87], a flat-structured, uniform machine feature representation scheme is used to represent and model
parallel computers. All basic features are represented as facts in the database. When a target machine is
specified, the features of the machine are loaded into the kernel of the expert system. The features are
abstracted by applying a set of rules to the facts. This approach is simple and yet powerful enough to model
the parallel architectures. Specification of the machine can be done feature by feature and is straightforward.
However, the flat-structure representation does not have a mechanism to show the relationships and interactions
between machine features. The relations between the features must be encoded by using other constructs such
as the rules and pre-conditions used in the above system. To find the relationship between particular features,
one must search the rules exhaustively to find those that define the relation between them. This makes manipulation
and maintenance of the system an involved task.
A hierarchical knowledge representation scheme incorporates the interrelationships into the representation
of the knowledge itself. The relationships between the machine features are mostly architecture independent
and can be inherited from previously known structures. Therefore, they do not need to be redefined when a
new target architecture is defined. On the other hand, it is important to have the ability to define new
relationships so that similar architectures can share most of the knowledge but are allowed to specify the
differences explicitly.
To effectively support the hierarchical knowledge representation, the knowledge representation scheme
should support abstraction and classification. Classification allows grouping of related features into distinct
classes that possess special features, whereas abstraction defines relationships between features of different
levels. For example, figure 5.3 (b) on page 125 shows an organization of the machine features; the vertical
dimension shows the abstraction levels of the architecture and the horizontal dimension shows how related
features are grouped together to form the base of the abstraction. Together, classification and abstraction
provide a powerful mechanism for the integration of feature knowledge. Our representation scheme defines a
hierarchical structure by using the relationship between the subclass and superclass along with relationship
functions.
Explicit Versus Implicit Knowledge Representation
Knowledge of the machine can be explicitly expressed or implied by other knowledge. Explicit
knowledge has the advantage of being simple and immediately available, but may contain redundant
knowledge and inflate the size of the knowledge base. On the other hand, allowing implicit knowledge
representation leads to more concise representation but increases the difficulty of maintaining the knowledge
base. To balance the tradeoff, it is a good idea to make all knowledge implication rules explicit.
For example, the feature 'cost ratio of global memory access and computation' is defined to be the ratio
of the cost of a global memory fetch to the cost of a floating point multiply. This property of the machine
can be defined explicitly, or implicitly by the features 'cost of a global memory fetch' and
'cost of a floating point multiply' together with rule 5.1, which explicitly defines the relation.
Rule 5.1  feature_value('cost ratio of global memory access and computation', R) :-
              feature_value('cost of a global memory fetch', F),
              feature_value('cost of a floating point multiply', M),
              R is F / M.
The advantage of spelling out the relationship instead of specifying the value is that
when an underlying feature is modified (for example, when the memory is upgraded to faster memory), the feature
'cost ratio of global memory access and computation' does not need to be redefined. On the other hand, when
optimizing a program for a particular machine, only the relative cost ratio will affect the decisions. So when
the actual costs of memory access and computation are not required elsewhere in the system, the value of the
ratio can be specified explicitly.
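The difference between an explicitly stored value and a value implied by a rule can be sketched in a few lines. The following is a minimal illustration only: the class FeatureBase, its method names, and the numeric costs are assumptions made for this example and are not part of the system described here.

```python
class FeatureBase:
    """Holds explicit feature values plus rules that derive implicit ones."""

    def __init__(self):
        self.values = {}   # feature name -> explicitly stored value
        self.rules = {}    # feature name -> function computing the value

    def define(self, name, value):
        self.values[name] = value

    def derive(self, name, rule):
        self.rules[name] = rule

    def value(self, name):
        # An explicit value takes precedence; otherwise apply the rule.
        if name in self.values:
            return self.values[name]
        return self.rules[name](self)


kb = FeatureBase()
kb.define('cost of a global memory fetch', 12.0)       # hypothetical cost
kb.define('cost of a floating point multiply', 4.0)    # hypothetical cost
kb.derive('cost ratio of global memory access and computation',
          lambda base: base.value('cost of a global memory fetch')
          / base.value('cost of a floating point multiply'))

print(kb.value('cost ratio of global memory access and computation'))
```

Because the ratio is computed on demand, changing the memory fetch cost changes the ratio without any further edits, which mirrors the maintenance argument made for rule 5.1.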
5.4. An Object-Oriented Knowledge Representation Scheme for Parallel Computers
From our point of view, an intelligent parallel compiler needs a "simple" machine knowledge
representation scheme that can support reasoning, knowledge abstraction, organization, and heuristic comparison. The
scheme we describe here is based on an object-oriented knowledge representation paradigm, in which features
about machines are treated as objects and the understanding of a machine can be composed from feature
objects.
The object-oriented paradigm is a sound knowledge representation methodology. Its modularity and
hierarchical abstraction capability make it particularly appealing for the representation of complicated real
world knowledge. The inheritance and chaining of this paradigm allow a compact and elegant representation
of the relationships between objects; this also makes porting the system to new machines easy. On the other
hand, just like any other knowledge representation paradigm, there are many tradeoffs in the object-oriented
paradigm and design decisions in the implementation that must be made based on the application domain. In
this chapter, we concentrate on just one application domain -- multiple-target parallel compilers -- although
most of the principles and methodologies we have discussed here can be applied or extended to other problem
domains as well.
Our knowledge representation scheme consists of three elements: the feature objects, relationships
between objects, and operators on the objects. Under this scheme, the knowledge of the parallel computers can
be decomposed into features and reassembled into hierarchical models based on feature organization and
abstraction techniques. This knowledge representation scheme provides a vehicle for heuristic manipulation
and intelligent compiler construction.
5.4.1. Feature Objects
We followed the convention of the object-oriented paradigm by which objects representing machine
features are represented by a set of basic objects that we call feature objects. Feature objects are classes of
objects that are the basic units for defining properties of parallel computers. A feature object represents a
particular property of the target architecture and has slots to store information about the property. These slots are
called the attributes of the feature object. The structure of the feature object varies by the type of the property
it represents. Some attributes of the features are common to all features and some apply only to certain
features. The template of feature objects is defined by a meta-class that we call a feature class.
Each entry in the feature class defines possible values for feature objects. An instance of a class is a
particular instantiation of a class. An instance of the feature class identifies the properties of a machine feature.
These properties include the name of the feature, type of the feature (used for type checking), conditions for
this feature to be meaningful, relation of the feature with other features, and the attributes of this feature. The
feature object, which defines what the instances of the feature should be, is the foundation of the
representation. The example shown in figure 5.2 defines a feature named vector registers. In the example, the feature
object vector registers is active only when the target machine has vector capabilities. This object is a subclass
of the objects register and memory, so it inherits all properties of the two classes. Since all machine features
are subclasses of the meta-class feature object, the latter can be omitted from all definitions.
In figure 5.2, slots in the first part of the template are common to all feature objects (inherited from the
class feature object). Slots in the second part are the attributes of the feature.
A class describes the implementation of a set of objects that all represent the same kind of knowledge.
The individual objects described by a class are called its instances. An instance of a feature object is defined
when a value is associated with the object by a feature assignment statement: featureDefine. For example, the
following Prolog statements cause a message to be sent to the object vector_registers and set the
number and the size of the object to 64 and 32, respectively.
featureDefine(vector_registers, number, 64).
featureDefine(vector_registers, size, 32).
When defining the machine features, the following alternative form is also accepted:
the number of vector_registers is 64.
the size of vector_registers is 32.
The statement featureValue provides a uniform way of accessing the feature instances. For example,
featureValue(vector_registers, number, N) returns the value of the current instance of the number of the feature
vector_registers in variable N.
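The define-and-query pattern of featureDefine and featureValue can be mimicked outside Prolog. The sketch below is a hypothetical Python analogue, not the system's implementation: the class FeatureObject and its methods feature_define and feature_value are names invented for this illustration, and the type check stands in for the "type of the feature" slot mentioned above.

```python
class FeatureObject:
    """A feature object with typed attribute slots (illustrative only)."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = dict(slots)   # attribute name -> expected Python type
        self.values = {}           # attribute name -> current value

    def feature_define(self, attribute, value):
        """Analogue of featureDefine: set the value of one attribute."""
        expected = self.slots[attribute]   # KeyError for an unknown slot
        if not isinstance(value, expected):
            raise TypeError(f"{attribute} of {self.name} must be {expected.__name__}")
        self.values[attribute] = value

    def feature_value(self, attribute):
        """Analogue of featureValue: read the current value of one attribute."""
        return self.values[attribute]


vector_registers = FeatureObject("vector_registers", {"number": int, "size": int})
vector_registers.feature_define("number", 64)
vector_registers.feature_define("size", 32)
n = vector_registers.feature_value("number")
print(n)
```

Routing every read and write through two operations keeps the rest of the system independent of how a feature object stores its attributes.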
5.4.2. Attributes Associated with Objects
New attributes can be associated with a feature object through the feature-attribute assignment
statements:
featureAttribute(ObjectName, AttributeName).
featureAttribute(ObjectName, AttributeName, Type, DefaultValue).
The first statement defines the relationship between the attribute and the feature, and in the second statement
both the relationship and the value are defined. A feature attribute assignment statement explicitly and
dynamically defines the binding between the feature object and its properties.
feature name: vector registers
number of vector registers   integer   64
size of vector registers     integer   32
other attributes: omitted
Figure 5.2. The definition of the feature object vector_registers.
An object defined to be dynamic may have more than one instance, but only one instance can be current.
In some cases, allowing more than one instance of the same object provides a degree of "non-determinism."
This is useful when the target machine contains multiple characteristics. For example, on hypercube
computers, the communication network can simulate different kinds of networks such as rings, meshes,
shuffle-exchange networks, and trees. Different algorithms can utilize any one of the communication patterns. On the
other hand, the current instance mechanism provides a way to obtain the deterministic effects.
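The idea of several coexisting instances with exactly one marked as current can be sketched as follows. This is an illustrative assumption about how such bookkeeping might look; the class DynamicFeature and its methods are hypothetical names, not constructs of the system described here.

```python
class DynamicFeature:
    """A feature that may hold several instances, one of which is current."""

    def __init__(self, name):
        self.name = name
        self.instances = []   # all admissible instances
        self.current = None   # the single instance in effect right now

    def add_instance(self, value):
        if value not in self.instances:
            self.instances.append(value)
        if self.current is None:
            self.current = value   # the first instance becomes current

    def make_current(self, value):
        if value not in self.instances:
            raise ValueError(f"{value!r} is not an instance of {self.name}")
        self.current = value


# A hypercube can simulate several topologies; only one is in use at a time.
topology = DynamicFeature("network topology")
for kind in ("hypercube", "ring", "mesh", "tree"):
    topology.add_instance(kind)
topology.make_current("mesh")
print(topology.current)
```

The list of instances carries the non-determinism (any of them may be chosen), while the current marker gives each query a single deterministic answer.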
5.4.3. Feature Organization
Feature objects form the basis of the representation of parallel computers. In the real world, knowledge
of objects is not isolated. Instead, relationships exist between knowledge on different levels of abstraction.
Therefore, representation of the relationships between objects by using a higher level of abstraction and
organization is needed so that the relationships can be recorded and manipulated. The bottom-up approach allows
feature objects of similar properties to be grouped to form a feature object of higher level. Using a top-down
approach, one may decompose the feature objects into more detailed descriptions and continue expanding until
the feature is a basic fact. With either approach, the feature classes are organized into hierarchical structures
based on relation functions. The relation functions explicitly define the relation between feature objects of
different levels. This includes predefined relations such as parent, children, exclude, complement, associate,
compose, along with user-defined relation functions. The relation functions serve two purposes. First, they
define the relationship between objects; second, they define the flow of control and messages. The collection
of the relation functions organizes the machine knowledge into a hierarchy of features. Some of the
relationships, such as parent and children in the class hierarchy, are inherited from the organization of the hierarchy
and are defined by the superclass or subclass_of relations; others need to be explicitly defined.
The organization of the features can simplify the representation of the features. For example, the
pre-condition 'has_vector_capabilities' listed in the example in figure 5.2 is redundant because the feature vector
registers is the descendant of the feature vector capabilities, and this pre-condition is actually implicitly
encoded in the hierarchy of the features.
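How a pre-condition can be read off the hierarchy instead of being stated on each feature is sketched below. The parent table is a hypothetical fragment invented for this example (only the link between vector registers and vector capabilities comes from the text), and the helper names are assumptions.

```python
# Hypothetical fragment of a feature hierarchy: child feature -> parent feature.
PARENT = {
    "vector registers": "vector capabilities",
    "vector chaining": "vector capabilities",
    "vector capabilities": "processing unit",
}


def ancestors(feature):
    """Return the chain of ancestors of a feature, nearest first."""
    chain = []
    while feature in PARENT:
        feature = PARENT[feature]
        chain.append(feature)
    return chain


def is_active(feature, absent):
    """A feature is active only if neither it nor any ancestor is absent."""
    return feature not in absent and not any(a in absent for a in ancestors(feature))


print(ancestors("vector registers"))
print(is_active("vector registers", absent={"vector capabilities"}))
```

With this arrangement a machine without vector capabilities deactivates every descendant feature at once, so no per-feature 'has_vector_capabilities' test is needed.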
5.4.4. Operations on the Objects
5.4.4.1. Inheritance, Specification and Qualification
The notion of classes and meta-classes provides a mechanism for sharing information between different
objects via inheritance. Inheritance allows a subclass to inherit its properties from a superclass. For example,
defining the class local_memory to be a subclass of the class memory implies that local_memory inherits all
properties of memory unless otherwise specified.
Different subclasses or instances of a class can have distinct properties through the use of specification.
For example, cache_memory is also a subclass of memory, but it has the property cache_coherence_scheme that
the class local_memory does not have.
A class can have more than one superclass, and it inherits all the properties of its superclasses. For
example, the class vector_registers defined above is a subclass of both memory and register.
Inheritance and specification are used to group closely related knowledge into the same class so that the
information can be accessed locally in the system. Specification helps to distinguish objects in the same class,
while inheritance keeps the size of the system manageable.
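Inheritance, specification, and multiple superclasses map directly onto class hierarchies in most object-oriented languages. The sketch below uses Python classes as a stand-in; the attribute sets listed for each class are assumptions chosen for illustration, and all_attributes is a hypothetical helper.

```python
class Memory:
    attributes = {"size", "read_cost", "write_cost"}


class Register:
    attributes = {"access_cost"}


class LocalMemory(Memory):
    attributes = set()                         # adds nothing of its own


class CacheMemory(Memory):
    attributes = {"cache_coherence_scheme"}    # specification: extra property


class VectorRegisters(Register, Memory):       # two superclasses
    attributes = {"number"}


def all_attributes(cls):
    """Union of the attributes declared by a class and all its superclasses."""
    result = set()
    for klass in cls.__mro__:
        result |= klass.__dict__.get("attributes", set())
    return result


print(sorted(all_attributes(VectorRegisters)))
```

Each class declares only what distinguishes it, and the full property set is recovered by walking the superclass chain.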
Inheritance and specification are usually associated with qualification. That is, an inheritance or
specification for an attribute is applied to an instance of a class when certain conditions are satisfied. For
example, in distributed computing, achieving a balance between computation and communication is important.
Also, communication with processors that are far away is generally more expensive. These heuristics can be
inherited by the knowledge of all distributed computers.
class 'distributed memory' subclass_of 'global_memory'
    heuristics: [heuristic('balance computation and communication ratio'),
                 heuristic('avoid far access'), ...].
However, when the cost ratio of far-access and near-access is close to unity, sending messages to far
away processes does not incur a significant penalty. In this case, the restriction can be lifted by changing the
attribute 'avoid far access' to 'far access OK' as below. In the following example, the heuristic 'far access
OK' is a specification that overrides the inherited heuristic 'avoid far access', and the conditions in the
qualification statement in_case validate the specification statement.
'distributedMemory' instance_of 'distributed memory' with
heuristics: [
heuristic('far access OK',




Some machine features can be changed during the course of program optimization. This provides
flexibility in both matching the algorithm with the machine and reasoning. We call features that can be modified
tunable features, and features that cannot be changed are called static features. Possible modifications to a
feature object include updating the feature value, changing the feature attributes, and modifying the heuristics
associated with the feature. An instance of a feature object can be modified by the featureDefine statement
that we described above. For example, suppose the target machine is a hypercube computer. Since the
hypercube can simulate other network topologies (like meshes, trees, and shuffle-exchange networks), at certain stages
of the algorithm, a particular type of communication topology may match the algorithm better than others.
The communication pattern can be easily changed by the following statements:
featureDefine(networkTopology, network, mesh, OldTopology).
..... communication using mesh .....
featureDefine(networkTopology, network, OldTopology, _).
..... communication using OldTopology .....
The feature modification operation can only be applied to the tunable properties of the architecture, and
attempts to modify static features will fail.
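The save-and-restore idiom shown above, together with the rejection of changes to static features, can be sketched in a few lines. This is an illustrative analogue under assumed names: Feature, redefine, and StaticFeatureError are hypothetical, and the choice of which features are tunable is made up for the example.

```python
class StaticFeatureError(Exception):
    """Raised when a static (non-tunable) feature is modified."""


class Feature:
    def __init__(self, name, value, tunable=False):
        self.name = name
        self.value = value
        self.tunable = tunable

    def redefine(self, new_value):
        """Set a new value and return the old one, as the four-argument
        featureDefine does; static features reject the change."""
        if not self.tunable:
            raise StaticFeatureError(self.name)
        old, self.value = self.value, new_value
        return old


topology = Feature("network topology", "hypercube", tunable=True)
old_topology = topology.redefine("mesh")
# ... communication using mesh ...
topology.redefine(old_topology)
# ... communication using the old topology ...

address_bits = Feature("address bits", 32)   # static by default
```

Returning the previous value from the modification call lets the caller restore the earlier state without querying the feature separately.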
5.4.4.3. State Adjustment with Dependencies
One problem that arises from object modification involves maintaining integrity and consistency. In our
object-oriented scheme, this problem is addressed by providing a dependence mechanism to notify an object of
changes in related objects. The relation functions explicitly define the dependence relations between the
objects involved. When the source of the dependence is changed, the target object is notified, and the target
object may examine the change to decide whether to change its own state or not. For example, the heuristic
object utilizing vector registers depends on the feature objects available vector registers and vector operations.
If the object available vector registers is changed and no vector register is available, then the object utilizing
vector registers needs either to move certain data in vector registers to memory to reclaim a vector register
for the next operand or to get the next operand from memory. What action to take depends on the heuristics
in the heuristic object.
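The dependence mechanism resembles the familiar observer pattern. The sketch below is one possible reading, not the system's implementation: the class names, the notify protocol, and in particular the policy of spilling a register when none is free are assumptions made for illustration.

```python
class FeatureObject:
    def __init__(self, name, value=None):
        self.name = name
        self.value = value
        self.dependents = []   # objects to notify when this one changes

    def depends_on(self, source):
        """Register this object as a target of changes in source."""
        source.dependents.append(self)

    def set(self, value):
        self.value = value
        for target in self.dependents:
            target.notify(self)

    def notify(self, source):
        """Default reaction: ignore the change."""


class UtilizeVectorRegisters(FeatureObject):
    def notify(self, source):
        # Hypothetical heuristic: spill when no vector register is free.
        if source.name == "available vector registers":
            if source.value == 0:
                self.value = "spill a register to memory"
            else:
                self.value = "use a free register"


available = FeatureObject("available vector registers", 8)
heuristic = UtilizeVectorRegisters("utilizing vector registers", "use a free register")
heuristic.depends_on(available)
available.set(0)
print(heuristic.value)
```

The source object only announces that it changed; the decision about how to react stays inside the dependent heuristic object, as the text describes.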
5.4.5. A Simple Example
The specification of a feature involves defining the template (class) and the value (instance) of the object.
The former is normally the task of the system implementor and the latter is the task of the system maintainer who
installs a new machine. For example, the following is a fragment of the program that specifies the template for
the global memory in a shared-memory architecture.
class memory_hierarchy with
    type: one_of {shared, distributed, hybrid},
    structure: list_of {global, cluster, local, cache}.
class memory with
    type: one_of {shared, distributed, hybrid},




    read_cost: real, % in clock cycles
    write_cost: real, % in clock cycles
    prefetch: {boolean, false},
    prefetch_size: integer in_case prefetch == true,
    connection: one_of {bus, network}, % to processor
    network_topology: one_of {hypercube, omega, crossbar, ...}
        in_case connection == network.
class local_memory subclass_of memory with
    type: local,
    size: {integer, 4000}, % specify default value
    connection: bus.
Under the above definition, the local memory on an nCUBE 2 can be defined as:
memoryHierarchy instance_of memory_hierarchy with
    type: distributed,
    structure: {localMemory}.
localMemory instance_of local_memory with
    size: 4000 in_case pid < 16, % has uneven memory distribution






The entries can be accessed or updated in the following way:
featureValue(localMemory, size, K).
the size of localMemory is K.
the connection of localMemory is bus.
An interactive feature-specification scheme is described in section 5.6.2 as an alternative way to interface
with the machine feature manipulation system. Different types of people involved with the system (users,
system implementors, or system programmers) can use different interface schemes.
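The interplay of default values and in_case conditions in a template can be sketched as follows. This is a simplified illustration under stated assumptions: templates are plain dictionaries, the function instantiate and the MEMORY template are hypothetical, and a slot whose condition fails is simply dropped from the instance.

```python
def instantiate(template, **values):
    """Build an instance from a template: fill in defaults, then drop
    slots whose in_case condition does not hold for the instance."""
    unknown = set(values) - set(template)
    if unknown:
        raise KeyError(f"unknown slots: {sorted(unknown)}")
    instance = {}
    for slot, spec in template.items():
        if slot in values:
            instance[slot] = values[slot]
        elif "default" in spec:
            instance[slot] = spec["default"]
    for slot, spec in template.items():
        condition = spec.get("in_case")
        if slot in instance and condition is not None and not condition(instance):
            del instance[slot]   # the slot is meaningless in this case
    return instance


# Hypothetical template loosely modeled on the memory class above.
MEMORY = {
    "size": {"default": 4000},
    "connection": {"default": "bus"},
    "network_topology": {"in_case": lambda m: m.get("connection") == "network"},
}

local_memory = instantiate(MEMORY)
networked = instantiate(MEMORY, connection="network", network_topology="hypercube")
print(local_memory)
print(networked)
```

Keeping the condition next to the slot it guards means an instance can never carry a network topology for a memory that is attached by a bus.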
5.4.6. Features of the Parallel Machine Knowledge Representation Scheme
The most important features of our knowledge manipulation scheme are its flexibility and its
modularity. The representation scheme can be used to achieve the following features.
1. Allows dynamic modification of machine knowledge.
2. Has high flexibility in knowledge characterization and organization.
3. Has various abstraction levels of the knowledge.
4. Supports knowledge hiding and global visibility.
5. Supports inference and knowledge encapsulation.
5.4.6.1. Static and Dynamic Knowledge Representation
Although most features of a parallel architecture are fixed after the configuration of the system is set up,
some features of the machine should be allowed to change dynamically. Allowing dynamic modification of
some machine features has the following advantages. First, conceptual decomposition of the machine is possible.
This means that the programmer can decompose the computational model to match the algorithm decomposition,
as in divide-and-conquer algorithms. Second, users are allowed to define "virtual machines" based
on the existing machine features and knowledge. This is ideal for testing new machine designs or programming
heuristics. Third, computer vendors usually provide many different configurations to suit specific needs
of the users. Fourth, specifications for similar architectures can be more elegant and compact. The representation
scheme we defined above allows the representation of both static and dynamic knowledge in a uniform
structure. Machine features that can be modified at run time are marked as dynamic features by setting the
attribute dynamicFeature to true. Dynamic machine features include the number of processors used in computing,
system load, algorithmic network topologies, and task control strategies. Features that are static over
time are termed static features, and attempts to modify static features will be rejected by the system.
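The static/dynamic distinction can be illustrated with a minimal Python sketch. Only the attribute name dynamicFeature comes from the text; the class names, method names, and example feature values below are hypothetical, and the prototype described in this chapter is written in Prolog rather than Python.

```python
class StaticFeatureError(Exception):
    """Raised when an attempt is made to modify a static feature."""


class Feature:
    def __init__(self, name, value, dynamic_feature=False):
        self.name = name
        self.value = value
        # Mirrors the dynamicFeature attribute described in the text.
        self.dynamic_feature = dynamic_feature


class MachineKnowledge:
    """Hypothetical container for the features of one machine."""

    def __init__(self):
        self.features = {}

    def install(self, feature):
        self.features[feature.name] = feature

    def modify(self, name, new_value):
        feature = self.features[name]
        if not feature.dynamic_feature:
            # Static features are fixed once the configuration is set up.
            raise StaticFeatureError(f"{name} is a static feature")
        feature.value = new_value


machine = MachineKnowledge()
machine.install(Feature("processors_in_use", 16, dynamic_feature=True))
machine.install(Feature("cache_size_kb", 64))  # static by default

machine.modify("processors_in_use", 8)  # accepted: dynamic feature
try:
    machine.modify("cache_size_kb", 128)  # rejected: static feature
except StaticFeatureError as err:
    print("rejected:", err)
```

Keeping the check inside a single modify operation means every update path goes through the same rule, which is one plausible way to realize the uniform structure for static and dynamic knowledge.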
5.4.6.2. Flexibility in Knowledge Characterization and Organization
Knowledge characterization is the basis for knowledge organization. Machine features can be characterized
in different ways under different constraints. For example, features of a parallel computer can be categorized
by the physical component modules (such as processing units, memory modules, communication medium,
clusters, and control) or by the functionality of the features (such as program partitioning, data decomposition,
process scheduling, memory utilization, communication minimizing, and synchronization). Each of these
modules can be further characterized by smaller modules. For instance, the physical configuration of the
memory hierarchy is composed of global memory modules, cluster memory modules, local memory modules,
and cache memory modules. Features of the modules can be refined by further decomposition. Different ways
of organizing the machine features have different advantages and tradeoffs. Our knowledge representation
scheme allows the application to choose the desired way of organizing the knowledge according to the requirements
of the application. Figure 5.3 shows two different ways to represent the hierarchical structures of
parallel computers based on different characterizing schemes. In figure 5.3(a) the organization is simpler but
features are not as detailed as the ones shown in figure 5.3(b). For applications requiring minor optimization,
the one in figure 5.3(a) may be enough, but for more complicated optimization, a finer representation such as
the one in figure 5.3(b) would be better. In fact, the representation can be refined by adding more
feature objects and relationship functions as the system implementation progresses. This refinement can
be done incrementally without affecting the part of the software that has already been written, and the
behavior of the system should improve with more detailed machine knowledge. One good strategy in
adding new machine features is to check if this added feature can distinguish between conflicting knowledge






[Figure 5.3 (diagram not reproduced). Panel (a): relationships of features based on classification. Legible labels
include Sequent, Alliant, BBN Butterfly, Pringle, control model, and memory organization.]
Figure 5.3. Two highly simplified examples for characterizing parallel architectures.
5.4.6.3. Various Abstraction Levels of Features
A key idea in automatic generation of high performance parallel programs is to express the knowledge of
the machine and the program at an appropriate level of abstraction. Abstracted features of the machines range
from high level concepts such as "shared-memory MIMD" or "distributed memory MIMD" or "topology of
the interconnection network" to detailed properties such as "vector startup time" or "memory reference
costs." The most appropriate abstraction level for the specification of the machine knowledge depends on the
current state and goals of the compiler and the types of applications that utilize the parallelism of the machine.
Different types of applications will require different levels of abstraction to express the computation model of
the application. In some cases, the level is determined by the extent of optimization that the compiler is seeking. For
example, to decide the patterns of data movement, only the topology of the communication network is
required. But to get the optimal data movement, other information such as the dimension of the hypercube, the
startup and unit costs for message transfers, network protocols, and the optimal communication/computation ratio
for the architecture are needed. Allowing different abstraction levels simplifies the implementation of parallel
compilers, since different tasks of the compilers can deal with computational models at different abstraction
levels that are suitable for the tasks. On the other hand, it complicates the implementation of the knowledge
representation. Our program representation scheme allows dynamic abstraction of the knowledge by encoding
the relationship between the knowledge at different abstraction levels, and the machine knowledge can be
mapped to the appropriate abstraction level at the run time of the compiler.
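The idea of serving different compiler tasks at different abstraction levels can be sketched as follows. This is a minimal illustration in Python, not the Prolog prototype: the numeric level tags, the feature values, and the linear message cost formula are all assumptions introduced for the example.

```python
# Hypothetical feature table: each entry carries an abstraction level
# (0 = coarse, larger = more detailed) alongside its value.
FEATURES = {
    "topology": (0, "hypercube"),
    "hypercube_dimension": (1, 4),
    "message_startup_cost": (2, 100.0),  # assumed unit: microseconds
    "message_unit_cost": (2, 0.5),       # assumed unit: microseconds per byte
}


def view(features, max_level):
    """Keep only the features at or below the requested level of detail."""
    return {name: value
            for name, (level, value) in features.items()
            if level <= max_level}


def transfer_cost(machine_view, nbytes, hops):
    """Linear message cost; the formula is an illustrative assumption."""
    per_hop = (machine_view["message_startup_cost"]
               + machine_view["message_unit_cost"] * nbytes)
    return hops * per_hop


coarse = view(FEATURES, 0)    # enough to decide the pattern of data movement
detailed = view(FEATURES, 2)  # needed to compare costs of alternative movements
print(coarse)
print(transfer_cost(detailed, nbytes=1000, hops=2))
```

A task that only needs the topology works with the coarse view and never sees the cost parameters, while a task that searches for the cheapest data movement asks for the detailed view.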
5.4.6.4. Global Visibility Versus Knowledge Hiding
Data abstraction hides the internal data representation of an object from the users of the object. This concept
has proven to be useful in conceptual abstraction of the objects and in the practice of modular programming.
On the other hand, there are many cases where the ability to access the states of the objects is
desirable. For example, the objects that define the relationships among objects are better made globally visible. This
is particularly useful in keeping the representation scheme flexible enough so that accommodating new architectures
is easy. Also, a uniform structure for systematic access and support for representing heuristics
requires individual features to be accessible globally. Our representation scheme provides mechanisms
for knowledge to be encapsulated in objects but also globally visible.
5.5. Implementation of the Machine Knowledge Manipulation System
In this section, we provide some details of a prototype machine knowledge manipulation system that we
implemented to support the parallel compiler we are constructing (see chapter 8). The knowledge manipulation
system uses the object-oriented representation scheme we described above, and it supports interactive
feature specification, update, query, and reasoning. It is designed for parallel compilers but may also be
applied to other software systems that require detailed hardware knowledge.
5.5.1. A Machine Knowledge Manipulation System
The machine knowledge representation system consists of a machine feature database, an inference
engine, an SQL relational database interpreter, and an interactive machine feature specification system.
The machine feature database contains three kinds of knowledge about machine features: knowledge of
feature definitions, knowledge of feature usages, and knowledge of features of parallel computers. The inference
engine can be used to compare and deduce features to help the specification and classification of the
features. A subset of the SQL relational database language is implemented in Prolog to compare the features
of the machines and help the knowledge expert to abstract machine features from heuristics; this provides a
powerful mechanism for the manipulation of machine knowledge. The interactive machine feature
specification system provides the man-machine interface for interactive specification and manipulation of the
features of parallel computers. The interactive machine feature specification system is interfaced to both the
SQL database server and the inference engine so that the user can query or analyze features of the machines.
The machine knowledge manipulation system is implemented in Prolog. The system is menu-driven and
the user is allowed to pick a fact or predicate known to the system from the menu or to specify a new fact
interactively. Details of the procedures for machine feature installation are given in the next section.
To effectively utilize parallel computers, a program optimization system needs to have enough
knowledge in two areas: the hardware features of the machine and the heuristics of using the machine. The
machine knowledge manipulation system can assist the parallel compiler writer in manipulating the parallel
machine knowledge by providing the following functionalities: specifying new machine features to the system,
specifying machine features for a new machine, finding relations of a feature with other features, comparing
features of different machines, and supporting the reasoning for intelligent compilers. The last three
abilities are especially important when analyzing the machine features to construct the system or collecting
new heuristics to enhance the capability of the system.
The process of installing new knowledge includes identifying, translating and representing new machine
features and heuristics. Human interaction is needed for this process, but systematic assistance from the
system can reduce the complexity of the task significantly.
[Figure 5.4 (diagram not reproduced). Legible components: inference engine, interactive machine feature
specification system, user interface.]
Figure 5.4. The machine knowledge manipulation system.
5.5.2. Machine Feature Abstraction and Installation
The process of collecting machine features is illustrated in figure 5.5. The figure shows four components
in the process: installing features for the machine, installing heuristics for the machine, installing new machine
features to the system, and installing new heuristics to the system. These procedures are interrelated and can
be carried out interactively.
5.5.2.1. Interactive Machine Feature Specification
New machines can be added to the system interactively with a user interface that allows the user to
specify the features one by one. The system composes queries to ask for the features of the new machine
based on the knowledge in its database and the machine features specified so far. A top-down approach based
on the hierarchical structure of the known machine features is adopted: the query session begins with a high
level specification of the machine, such as computational modes and distributed or shared memory, and gradually
gets to the details of the machine features. The user does not need to know the structure of the machine
features. Based on the pre-conditions and the organization of the features, the system is intelligent enough to
ask only for the related features. There are also commands to allow the user to input features that the system
did not ask for or does not know at all.
This interactive machine feature specification contains three kinds of activities:
1. The system asks the system programmer for the values of the related features of the machine.
2. The programmer specifies the features that the system does not know about.
3. If the feature specified by the programmer is new to the system, the user also needs to specify relationships
between the feature and other features, possibly with help from the machine knowledge
manipulation system. The system will also help the user in specifying the value of the new feature
for machines already in the database of the system but whose corresponding feature values have
not been specified yet.
To specify a feature that the system already knows (has a feature object definition for the feature), the system
prompts the user for the value of the feature with a menu dynamically constructed at run time. The user
can also input the values of the features through the keyboard.
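The top-down, precondition-driven questioning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the template list, the precondition linking number_of_clusters to shared memory, and the function names are hypothetical, and the real system builds its menus in Prolog.

```python
# Hypothetical feature templates in top-down order: (feature name, precondition).
# A precondition is None or a (feature, required_value) pair.
TEMPLATES = [
    ("memory_model", None),
    ("number_of_clusters", ("memory_model", "shared")),
    ("has_vector_capability", None),
    ("vector_pipelines", ("has_vector_capability", True)),
]


def next_feature(known):
    """Return the next unspecified feature whose precondition holds, or None."""
    for name, precondition in TEMPLATES:
        if name in known:
            continue
        if precondition is None or known.get(precondition[0]) == precondition[1]:
            return name
    return None


def run_session(answer):
    """Drive a query session; `answer` stands in for the user's menu choice."""
    known, asked = {}, []
    while (name := next_feature(known)) is not None:
        asked.append(name)
        known[name] = answer(name)
    return known, asked


replies = {
    "memory_model": "distributed",
    "has_vector_capability": True,
    "vector_pipelines": [2, 1],
}
known, asked = run_session(replies.__getitem__)
print(asked)
```

Because the hypothetical precondition of number_of_clusters is not met for a distributed memory machine, that question is never asked, which is the behavior the text describes as asking only for the related features.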
[Figure 5.5 (flowchart not reproduced). Legible steps: install a new machine; install features for M; install
heuristics for M; install feature value for M; install new heuristics; install new feature to the system; encode
program feature abstraction; apply new heuristics.]
Figure 5.5. Machine feature specification and installation.
A new feature is added to the system by first defining the template of the new feature (that is, by defining the feature
object). This process is illustrated by the example shown in figure 5.7.
After the new feature object is defined, the next step is to find relationships between the new object and
other features. This process is normally non-trivial, and the reasoning ability of the system may provide the
user with help. After the user specifies some basic relations, the system tries to help by finding all features
that are related to these features and providing this information to the user.
Finally, after the new feature is installed, an attempt is made to relate this new feature to the parallel
computers already known to the system. In other words, the system will try to enhance its database by figuring
out whether other machines in the database have this feature, and if so what the values of the feature are
for these machines. The user will have to decide whether the new feature can be applied to particular target
machines in the knowledge base of the system, but the system will be able to rule out machines that cannot
have the feature based on the relationships specified by the user.
After the machine features are installed, the heuristics of using the machine can be installed. One problem
is to figure out what machine features are involved in a given heuristic. The machine feature comparison
and deduction provided by the machine feature manipulation system is very useful here. To specify a heuristic, the
user uses menus to specify the preconditions and the actions of the rules. The menus can lead users through the
hierarchy of machine features from the top down, and the knowledge base keeps a list of abstractions of the
program features so that the user can relate machine features and the program features to the heuristics. The
structure of the hierarchy and the computational models helps the user in analyzing the relationship between
the features and the heuristics. After the related features are picked, the system uses information in the feature
objects to generate the preconditions and actions for the heuristics and translates them into rules. If the heuristic
involves features that are not in the knowledge base, then the new features need to be installed.
[Figure 5.6 (screen capture not reproduced). In the legible part of the session, the system asks for the machine
name, asks whether explanations of the features are needed, and then prompts for the values of features such as
number_of_clusters and clock_cycle_time (number of cycles per second, in MHz; the user enters 111.5).]
Figure 5.6. An interactive session to specify features of a target machine.
As we noted above, getting a complete description of the machine features is difficult but is usually
unnecessary. Using a partial list of the machine features to represent the machine adds two requirements in
the manipulation of the machine knowledge:
1. Knowing what features are relevant to the system, and
2. Knowing the relationship of the new knowledge with the existing knowledge of the system.
[Figure 5.7 (screen capture not reproduced). In the legible part of the session, the user defines the template of
a feature named vector_pipelines: its precondition is has_vector_capability, its value is a list of integers, and
its explanation reads "first gives the number of load and store pipelines, then the number of floating point
multiply and add pipelines."]
Figure 5.7. An interactive session to specify template of a feature.
One simple methodology to decide what machine features to represent is based on the heuristics of the
system. When installing the heuristics, users represent the heuristics and methodologies of utilizing the
machine parallelism as a function of the machine features and program properties. After the user finishes the
specification of the heuristics, the system collects the list of machine features used and compares the list with
the list of features that it already knows. In this way, new features can be discovered and installed systematically.
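The comparison step described above amounts to a set difference between the features mentioned by the heuristics and the features already known. A minimal Python sketch, with hypothetical heuristic and feature names chosen only for illustration:

```python
def features_used(heuristics):
    """Collect every machine feature mentioned by the installed heuristics."""
    used = set()
    for heuristic in heuristics:
        used.update(heuristic["features"])
    return used


def discover_new_features(heuristics, known_features):
    """Features that the heuristics rely on but the system does not know yet."""
    return sorted(features_used(heuristics) - set(known_features))


# Hypothetical heuristics, each listing the machine features it depends on.
heuristics = [
    {"name": "vectorize_inner_loop",
     "features": ["has_vector_capability", "vector_startup_time"]},
    {"name": "use_block_transfer",
     "features": ["cache_size", "block_transfer"]},
]
known_features = ["has_vector_capability", "cache_size"]

print(discover_new_features(heuristics, known_features))
```

Each feature reported this way would then go through the template definition and relationship specification steps described earlier in this section.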
The procedures outlined above rely on two things: human interaction and the reasoning ability of the
knowledge manipulation system to help the human expert sort out the complex relationship between the heuristics
and the machine features. Systematic knowledge manipulation can significantly reduce the complexity of
the machine knowledge abstraction and installation process. From our point of view, this is one of the basic
requirements for all retargetable software systems.
5.5.3. Feature Deduction and Comparison
Heuristics are knowledge without a theoretical background. In order to generalize a heuristic to other
parallel computers, the heuristic needs to be analyzed to determine the fundamental elements behind it.
In this way, a new target machine can utilize the heuristic if the machine possesses all the features involved in
the heuristic.
The ability to analyze relationships between machine features is needed because not all machine features
are represented at the same level. Some features may be derived from other features and some may be
obscured by other features. Similarly, when trying to abstract the features of a machine or distill effective
features from a new heuristic, it is often necessary to compare features of the machines. The machine
knowledge representation scheme we propose supports both operations plus other reasoning mechanisms. A
knowledge base that features a simplified SQL database language plus an inference mechanism is implemented to
support the task of analyzing machine features and heuristics.
For example, it is possible to collect all machines that have feature F or find all features that can be
derived from a set of features with the simple reasoning mechanism we described above.
Feature deduction is supported by the relation functions which are part of the object-oriented representations.
Feature comparison of different machines is available through the implementation of a relational database.
For example, we can find all common features or different sets of features of two machines with a relational
database command: "find all common features of A and B" or "find all features A subst B."
select rule from machine
where machine.memory.cache_size > 0
and machine.processor.special_instructions.block_transfer = P
and nonvar(P)
Figure 5.8. A sample query to retrieve all rules that the system knows about block-transfer for machines with
caches.
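The flavor of these comparisons can be shown with a small Python sketch over an in-memory table. It is an illustration only: the machine entries and feature values are hypothetical, the sketch selects machines rather than rules to stay short, and the test for a bound value merely stands in for the Prolog nonvar(P) condition in figure 5.8.

```python
BLOCK_TRANSFER = "processor.special_instructions.block_transfer"

# Hypothetical machine feature table: machine name -> {feature path: value}.
machines = {
    "A": {"memory.cache_size": 64, BLOCK_TRANSFER: "hardware",
          "has_vector_capability": True},
    "B": {"memory.cache_size": 0, BLOCK_TRANSFER: "hardware"},
    "C": {"memory.cache_size": 32},
}


def cached_block_transfer_machines(table):
    """Machines with a cache whose block-transfer feature has a value."""
    return sorted(
        name for name, features in table.items()
        if features.get("memory.cache_size", 0) > 0
        and features.get(BLOCK_TRANSFER) is not None  # stands in for nonvar(P)
    )


def common_features(table, a, b):
    """Features specified for both machines a and b."""
    return sorted(set(table[a]) & set(table[b]))


def feature_difference(table, a, b):
    """Features specified for machine a but not for machine b."""
    return sorted(set(table[a]) - set(table[b]))


print(cached_block_transfer_machines(machines))
print(common_features(machines, "A", "B"))
print(feature_difference(machines, "A", "B"))
```

Treating an unspecified feature as absent follows the tolerance rule stated in the chapter conclusion, so machine C simply drops out of the first query.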
5.5.4. Specializing System Knowledge
The advantage of having a general purpose machine knowledge manipulation system is that knowledge
can be accumulated and shared among different architectures. The price paid is that when reasoning is performed,
the performance suffers because of the added tests for checking the applicability of the heuristics at
compile time. We use a methodology called knowledge caching to improve the performance of the inference
system. The method works as follows. As the compiler is being constructed, maintained and enhanced, the
system encodes the knowledge with the multiple target paradigm. By the time the compiler is ready to be distributed
for a particular machine, the vendor may choose to specialize the compiler for that target
machine by validating or invalidating rules in the knowledge base. This is done by pre-evaluating the rules
based on the features of the target machine. All conditions that are implied by the static features are eliminated
from the rules, and the resulting simplified rules are "cached" in the specialized compiler. Also, rules
that are disqualified by the static features are deleted from the knowledge base of the new compiler. This
approach has the advantage of code re-use and knowledge sharing but does not suffer a loss in system efficiency.
The method is called knowledge caching because it also provides a runtime mechanism to optimize
feature access. The same procedure can be used at run time to further eliminate redundant conditions or
invalid rules based on the dynamic features. The most recently accessed or generated objects are kept in
system memory so that subsequent accesses will be much cheaper. On the other hand, if other information is
needed, it can be loaded when the system detects that it lacks the information (this is supported by an indexing
mechanism of the underlying expert system shell described in chapter 7). The resulting parallel
compiler is specialized for the target machine, with the features of the machine implicitly built into the rule
base of the system. The rules for the program transformation can be linked with the machine features by attributes
of the features. Therefore, after the features of the machine are specified, a personalized knowledge base
can also be built for the particular architecture, improving both space utilization and execution
efficiency.
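The pre-evaluation step of knowledge caching can be sketched as a partial evaluation of rule conditions against the static features. The Python below is a minimal sketch under stated assumptions: the rule format (a list of feature/predicate pairs), the rule names, and the feature values are hypothetical, and conditions on features that are not in the static set are assumed to be dynamic and are left for run time.

```python
def specialize(rules, static_features):
    """Pre-evaluate rule conditions against the static features of one target."""
    cached = []
    for rule in rules:
        remaining = []
        disqualified = False
        for feature, predicate in rule["conditions"]:
            if feature in static_features:
                if predicate(static_features[feature]):
                    continue  # implied by a static feature: eliminate the test
                disqualified = True  # can never hold on this target
                break
            remaining.append((feature, predicate))  # left for run time
        if not disqualified:
            cached.append({"name": rule["name"],
                           "conditions": remaining,
                           "action": rule["action"]})
    return cached


# Hypothetical multi-target rules.
rules = [
    {"name": "use_block_transfer",
     "conditions": [("cache_size", lambda v: v > 0),
                    ("system_load", lambda v: v < 0.8)],
     "action": "replace copy loop with block transfer"},
    {"name": "vectorize",
     "conditions": [("has_vector_capability", lambda v: v is True)],
     "action": "vectorize inner loop"},
]
static_features = {"cache_size": 64, "has_vector_capability": False}

for rule in specialize(rules, static_features):
    print(rule["name"], [feature for feature, _ in rule["conditions"]])
```

On this hypothetical target the vectorization rule is deleted outright and the block-transfer rule keeps only its dynamic test, so the specialized rule base is both smaller and cheaper to evaluate. Running the same function again with the dynamic features at run time corresponds to the second use of the procedure mentioned in the text.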
5.6. Other Applications of the Representation Scheme
The machine knowledge manipulation system we described above can be applied to many other software
systems that need details of the target machines. Two such examples are given here.
5.6.1. Distributed Computing Environments
A distributed computing environment is a system that contains a set of loosely connected computers.
Although not every distributed system needs detailed machine information about the computers in the system,
some applications do require the system to have low level knowledge of its members. An interesting example
is a network that contains a wide spectrum of architectures (for example, a network of
workstations, graphic workstations, mainframes, and supercomputers) and applications. Suppose the goal of
the operating system is to assign a task to the machine that is most appropriate (the objective may be adaptive)
for the application; the system requires a clear understanding of the capabilities of each machine in the system
and some information about the applications to make a smart decision. Our machine knowledge manipulation
system allows the system to possess and manipulate knowledge of many computer systems and supports the
matching of machines and applications. Thus, it is possible to build a smart task scheduler for distributed,
heterogeneous computing environments based on the machine knowledge manipulation system.
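A very small Python sketch of such a matching step is given below. Everything in it is an assumption made for illustration: the machine names, the feature names, and the rule of choosing the eligible machine with the largest value of a single objective feature. A real scheduler would use richer objectives, which the text notes may be adaptive.

```python
def pick_machine(task, machines):
    """Return the eligible machine with the best objective value, or None."""
    eligible = {
        name: features for name, features in machines.items()
        # A feature that is not specified is treated as absent.
        if all(features.get(required, False) for required in task["requires"])
    }
    if not eligible:
        return None
    objective = task["objective"]
    return max(sorted(eligible), key=lambda name: eligible[name].get(objective, 0))


# Hypothetical network of heterogeneous machines.
machines = {
    "workstation": {"peak_mflops": 5},
    "graphics_workstation": {"has_graphics": True, "peak_mflops": 20},
    "supercomputer": {"has_vector_capability": True, "peak_mflops": 1000},
}

vector_task = {"requires": ["has_vector_capability"], "objective": "peak_mflops"}
render_task = {"requires": ["has_graphics"], "objective": "peak_mflops"}

print(pick_machine(vector_task, machines))
print(pick_machine(render_task, machines))
```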
5.6.2. Flexible Simulation Systems for Parallel Architectures
A product design cycle consists of requirement, specification, design, prototyping, testing, and
modification. The design of new computer architectures usually goes through this cycle many times before the
final product emerges. At each iteration of the design cycle, the requirement, specification, and design of the
machine may be changed because of technical difficulties, marketing considerations, and other unexpected
problems. Any changes can affect the subsequent phases and complicate the design problem. Building a
hardware prototype is very time-consuming and expensive. In contrast, an electronic prototype can shorten the
design-testing cycles and decrease the product development lead time. The machine knowledge manipulation
system we discuss here provides a foundation for building software simulation systems for parallel computers.
When coupled with the performance evaluation system discussed in chapter 6, a very flexible general purpose
parallel architecture simulation system can be built. Under this model, revising the architecture design is relatively
easy since the machine knowledge manipulation system provides easy modification of the machine
feature entries. Furthermore, when integrated with the program transformation system, the resulting system
has two significant advantages:
• The domain that the architecture is targeting can be used to test and evaluate the design before a hardware
prototyping system is built. Problems in design can be discovered at the early phase of the development
cycle.
• The experience accumulated from other parallel computers can be applied to the new machine. Thus, a
great deal of heuristics for using the machine exist even before the machine is actually built.
5.7. Conclusion
In this chapter, we have presented a machine knowledge representation scheme for parallel computers
that supports reasoning. Intelligent reasoning is possible because decisions can be made from analysis of
machine features. The machine knowledge manipulation system forms the basis of a parallel programming
environment we are implementing which can restructure the program structure intelligently.
The knowledge representation scheme we propose has the following significant features:
1. Knowledge sharing. Separating the machine knowledge from the system heuristics allows heuristics to
be shared among different parallel computers.
2. Object-oriented representation. The object-oriented representation scheme allows modular and elegant
knowledge representation and ease of manipulation. With abstraction and classification operations, the
parallel machines can be abstracted into different levels of computational models.
3. Tolerance. Incomplete machine specification as well as incomplete system knowledge is allowed in this
representation scheme. When a feature is not present, the system simply assumes that the target machine
does not have it. Performance may suffer as a result: the more detailed the knowledge about a
machine the system contains, the better the computational model approximates the real
machine. However, the system works even with incomplete knowledge of the machine, so that
knowledge about a new machine can be incorporated incrementally.
4. Flexibility in the organization of the knowledge. The machine knowledge can be organized on different
criteria. For example, at the higher level, the machine knowledge can be classified on the levels of
parallelism (the multiprogramming level, multiprocess level, inter instruction level, and micro instruction
level). At the lower level, the machine knowledge can be grouped on the physical organization of the
system (processing units, memory hierarchy, communication network/bus, clusters of processors, and
control unit).
5. Test beds for complex systems. The proposed representation scheme allows easy construction of test
beds for complex software or hardware systems. Features and relation functions can be added or
removed to test the consequence of the action. This property of the system can also be used to evaluate
or study the effectiveness of the features, relation functions or heuristics.
Although we are targeting the methodologies to parallel compilers, the methodologies can be applied
to any software systems that require detailed knowledge of the underlying architectures, such as parallel simulation
systems, distributed operating systems, and runtime environments.
The entire machine knowledge manipulation system and the SQL database interpreter are implemented
in Prolog. The feature objects are stored as facts in the internal database of the Prolog interpreter. The user
interface is still clumsy but is expected to improve greatly when the X Window interface is constructed.
CHAPTER 6
PERFORMANCE PREDICTION AS A BASIS
FOR INTELLIGENT PROGRAM OPTIMIZATION
6.1. Introduction
Optimizing parallel compilers use heuristics and program transformation techniques to modify the program
structure to improve the parallelism of the user programs. The ability to accurately estimate the performance
of a program on a target machine and assess the improvement in performance or other aspects that a
transformation can have for the program under consideration is vital in the decision-making process of parallel
compilers. This process, called performance prediction, involves evaluating the match between the characteristics
of a program and the constraints of an architecture and then using this match to estimate the performance
of the program.
The most important criteria for a performance prediction module in an intelligent parallel compiler
are that it be inexpensive, accurate and flexible. It has to be inexpensive because the compiler
needs to use it repeatedly for evaluating the merits of the applicable transformations during the program restructuring
process. It has to be accurate because proper decisions of the compiler rely on a good estimation of
the performance. It has to be flexible because the program optimization process in an intelligent parallel compiler
uses different objectives at different stages of the process and for different architectures. A mechanism to
fine-tune the performance prediction model to suit different objectives is needed so that the system can adjust
itself to face different challenges in the decision-making.
In this chapter, we describe an analytical performance prediction model for estimating the performance
of programs on different classes of parallel architectures with high accuracy and efficiency. We define a template
for implementing this analytical model so that it can be integrated into the decision-making process of a
parallel compiler. Many performance factors are considered and listed in section 6.4. One distinct feature of
our model is that it is extremely flexible: different performance prediction functions can be incorporated for
different purposes, and different amounts of resources can be utilized. The performance prediction model utilizes
some heuristics to handle unstructured programs, loops with unknown bounds, and conditional statements, and
can minimize run-time tests for these cases. This makes it an ideal candidate for intelligent parallel compilers.
6.2. Performance Prediction Models
The performance of a program on an architecture can be estimated with either the simulation or the
characterization model. The simulation model estimates the performance of the program by simulating or
profiling the execution of the program or interpreting the execution time on a computational model of the
architecture. The accuracy of the prediction depends on how faithful the computational model is in simulating
the actual hardware and whether the performance of the program is sensitive to the input of the program. The
characterization model characterizes the performance of the machine by certain aspects of the machine. The
accuracy of the characterization model depends on how representative the selected aspects are. This model
utilizes a set of easy-to-compute functions called performance factors. Each performance factor represents a
particular aspect of the performance of programs on a target machine and is quantified by a corresponding
function called the performance evaluation function. Each performance evaluation function is a function of the
program fragment, P, and the target machine, Machine. The estimated performance, EP, is a function that
combines the results of the selected performance evaluation functions E_i:

\[ EP(P, Machine) = \sum_{i=1}^{m} g_i(E_i(P, Machine)) \tag{6.1} \]

where E_i, 1 ≤ i ≤ m, are m performance evaluation functions, and g_i is an adjustment function for E_i. When
g_i(E) = Weight_i * E for all i, then EP is a weighted linear combination of the evaluation functions E_i. That
is:

\[ EP(P, Machine) = \sum_{i=1}^{m} Weight_i \cdot E_i(P, Machine) \tag{6.2} \]
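As a minimal sketch of the weighted linear combination in equation (6.2), the following Python fragment combines two evaluation functions. The function names, the dictionary fields, and all numeric values are hypothetical and serve only to illustrate the shape of the computation.

```python
def estimated_performance(program, machine, eval_fns, weights):
    """EP(P, Machine) = sum over i of Weight_i * E_i(P, Machine)."""
    return sum(weights[name] * fn(program, machine)
               for name, fn in eval_fns.items())

# Hypothetical evaluation functions E_i(P, Machine).
def operation_cost(program, machine):
    return program["operations"] * machine["op_cost"]

def load_cost(program, machine):
    return program["loads"] * machine["load_cost"]

# Hypothetical program fragment and machine description.
program = {"operations": 200, "loads": 50}
machine = {"op_cost": 0.5, "load_cost": 2.0}

eval_fns = {"ops": operation_cost, "loads": load_cost}
weights = {"ops": 1.0, "loads": 0.5}

ep = estimated_performance(program, machine, eval_fns, weights)
print(ep)  # 1.0 * 100.0 + 0.5 * 100.0 = 150.0
```

Changing an entry of `weights` changes the influence of the corresponding factor without touching the evaluation functions themselves.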
Variations of the evaluation function model that use a certain formula to quantify the performance of
programs have been applied to many other parallel compilers. However, most of these applications use fixed
equations to compute the performance, with little consideration of machine variance and the objectives of the
compiler. This has the following problems:
1. Not suitable for a wide variety of architectures. Most models do not take architecture variations into
consideration. Parafrase [Husm86] uses several different parameters for computing program execution
time for seven different virtual shared-memory machines, but the number of parameters considered
is too small to be accurate for real parallel architectures.
2. Inflexible for different needs at different stages of parallel compiling. A more critical part of the
program requires more accurate but more expensive performance estimations. Existing performance prediction
models are too inflexible to accommodate the different objectives of the program optimization.
3. Cannot deal with programs that contain variables in control flows. Run-time tests are essential for
programs that have dynamic behaviors. Existing performance prediction models have no provision
for deciding the minimum set of run-time tests that are needed. This can lead to excessive run-time overheads.
In the next section, we propose a highly flexible performance prediction framework that can be
integrated into intelligent parallel compilers to solve the above problems. The framework, which is called
the refined combined characteristic model, is augmented with a knowledge base that can dynamically select
the evaluation functions to apply based on features of the machine and objectives of the program optimization.
Different weights can be given to the evaluation functions dynamically so that their effects can be adjusted.
Inexpensive performance evaluation functions are combined into a single real-valued function that indicates
the performance of the program on a parallel architecture and can be used to compare representations of
semantically equivalent programs.
6.3. A Framework for Performance Prediction
Based on the characterization model, a framework for predicting performance of programs on different
parallel compilers can be defined. This framework consists of a rich set of performance evaluation functions,
and a control to choose the evaluation functions and decide how to combine the results.
The control of the performance estimation can be divided into four modules: the prediction-setup
module, the prediction-construction module, the prediction-update module, and the prediction-refining module.
Each of these modules can be adjusted by rules in the knowledge base based on the objectives of the compiler.
A list of evaluation functions is discussed in the next section. We will examine the four modules first.
The prediction-setup module is invoked by other modules to prepare for prediction estimation. Its tasks
include selecting the set of performance evaluation functions to use, setting architecture-dependent constants
for the selected evaluation functions based on the machine features, setting weights for the selected evaluation
functions based on objectives of the optimization, and choosing the method for accumulating and combining
the results of the evaluation functions.
The prediction-construction module estimates the performance of the program by applying the set of
evaluation functions to each node in the program dependence graph according to a depth-first search order
based on the control dependence (with back edges being ignored). The process of performance construction
can be decomposed into the following steps. First, it invokes the prediction-setup module to set up the
constants. Second, the evaluation functions are applied recursively on elements of compound statements and the
results are aggregated by an accumulation procedure. Third, the prediction-combining procedure is applied to
integrate the results of the evaluation functions.
A simple example of the combining procedure is the weighted linear combination described in equation
(6.2) in section 6.2; more complicated combination functions can be supplied by the user. Also, depending
on the setup, the performance results can be combined while they are computed at each node, or they can be
computed individually and combined at the task level.
The performance-construction process can be summarized in the following procedure:
PerformanceConstruction(Node, PDG, ListOfEvalFunctions, Results)
Begin
    if (Node is a compound statement) then
        for each Child of Node in PDG do












Figure 6.1. Performance construction procedure.
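The construction steps described above (recursive application of the evaluation functions, accumulation over compound statements, then combination) can be sketched as follows. This is an illustrative reading, not the procedure of figure 6.1: the `Node` class, the choice of summation as the accumulation procedure, and the two sample evaluation functions are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    """A node of a simplified program dependence graph (hypothetical layout)."""
    kind: str                                    # "compound" or "simple"
    attrs: Dict[str, float] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

def performance_construction(node, eval_fns, accumulate=sum):
    """Apply every evaluation function to `node` in depth-first order.

    Returns {function name: value}. For a compound node, the results of
    the children are aggregated with `accumulate` (assumed to be a sum).
    """
    if node.kind == "compound":
        child_results = [performance_construction(child, eval_fns, accumulate)
                         for child in node.children]
        return {name: accumulate(r[name] for r in child_results)
                for name in eval_fns}
    return {name: fn(node) for name, fn in eval_fns.items()}

def combine(results, weights):
    """Prediction-combining step: weighted linear combination at the task level."""
    return sum(weights[name] * value for name, value in results.items())

# Hypothetical evaluation functions and a small three-statement program.
eval_fns = {
    "statements": lambda n: 1.0,
    "operations": lambda n: n.attrs.get("ops", 0.0),
}
pdg = Node("compound", children=[
    Node("simple", {"ops": 2}),
    Node("compound", children=[Node("simple", {"ops": 3}),
                               Node("simple", {"ops": 1})]),
])

results = performance_construction(pdg, eval_fns)
estimate = combine(results, {"statements": 0.5, "operations": 1.0})
```

Keeping the per-function results separate until `combine` corresponds to combining at the task level; combining at each node would instead fold the weights into the recursion.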
The prediction-update module is applied during the course of the program optimization process. The
evaluation functions involved in this process estimate the changes that a transformation might have on the
performance of the program and adjust the prediction accordingly. The performance evaluation functions and the
method to combine the results used in this module do not need to be the same as those used in the
performance-construction module. Only the region of the program that will be affected by the transformation
needs to be updated. Also, the overall performance prediction can be adjusted incrementally.
The prediction-refining module can be engaged optionally during the program optimization process to
refine the estimation by using a set of more detailed evaluation functions. It can either compute the estimation
from scratch using a set of more sophisticated evaluation functions or it can refine the original estimate by
updating certain aspects of the estimation.
6.3.1. Dynamic Performance Prediction and Run-time Tests
There are programs for which the static analysis of the compiler may fail to predict the control flows of
the programs. These cases are normally programs that have variables or indirect references in the control
structures (such as loop bounds or conditions), or conditional statements for which probabilities of the branches
cannot be estimated. These cases present challenges to the performance prediction.
If the unknown variable involved in the estimation is a loop index, the problem can be easily solved.
For example, for the following nested loops, the loop bound of the inner loop depends on the outer loop, but
can be easily computed.
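    for i in 1 .. N loop
        for j in i+1 .. k * i loop
            ...
        endfor;
    endfor;

If f_0 is the value of the evaluation function for the body of loop j, the inner loop executes (k-1) * i
instances for each value of i, so the evaluation function for loop i is

\[ \sum_{i=1}^{N} (k-1) \cdot i \cdot f_0 = \frac{N (N+1)}{2} \cdot (k-1) \cdot f_0. \]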
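The closed form for this loop nest can be checked by brute force. The sketch below assumes that the loop body has a constant cost `f0` and that `k` is an integer of at least 1; it counts the inner loop instances directly and compares the total with (k - 1) * f0 * N * (N + 1) / 2.

```python
def loop_estimate(N, k, f0):
    """Add the body cost f0 for every (i, j) instance of the loop nest."""
    total = 0
    for i in range(1, N + 1):
        for j in range(i + 1, k * i + 1):    # j runs over i+1 .. k*i
            total += f0
    return total

def closed_form(N, k, f0):
    """(k - 1) * f0 * N * (N + 1) / 2; N * (N + 1) is always even."""
    return (k - 1) * f0 * N * (N + 1) // 2

print(loop_estimate(10, 3, 4), closed_form(10, 3, 4))
```

Because the closed form depends only on N and k, a compiler can keep it as an expression in those variables when they are unknown at compile time.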
For programs where unknown variables appear in the loop bounds or conditions in the conditional state-
ments, the performance prediction model we described here can be applied by utilizing the inference capability
of the system. For each performance evaluation function there is a list of parameters that are used to compute
the evaluation. The inference engine evaluates the evaluation function by unifying the variables with their
values first. If variables remain undefined after the unification stage (for example, if there are variables in loop
bounds), the evaluation function will be evaluated as a function of the undefined variables by a common
compiler optimization technique called constant folding [AhSeUl86].
Operations (such as adds, multiplies, and comparisons) can be performed on results of evaluation functions
(including those with non-instantiated variables) to merge different evaluations or to use them to compute other
evaluation functions.
For variables that do not depend on the input, the values of the variables or the bounds of the variables
can be estimated by an AI technique called constraint propagation. This can normally solve the problem or
limit the range of the performance estimation.
For variables that depend on the input data, there is not much a compiler can do to estimate the values of
the variables. It can then either query the user for possible values or generate some conditions on the
performance based on the uninstantiated variables in the performance function. These conditions can be used to
generate run-time tests when the control decisions need to be made at run time.
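To make the idea of conditions on uninstantiated variables concrete, the following sketch keeps an estimate as a linear expression a + b*n in one unknown n, supports the add and scale operations mentioned above, and derives the threshold that a generated run-time test would compare n against. It is only an illustration under strong assumptions: the `Linear` class, the serial and parallel cost figures, and the restriction to a single variable are hypothetical and do not describe the inference engine itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Linear:
    """An estimate a + b*n with one uninstantiated variable n."""
    a: float
    b: float

    def __add__(self, other):
        return Linear(self.a + other.a, self.b + other.b)

    def scale(self, c):
        return Linear(self.a * c, self.b * c)

    def at(self, n):
        return self.a + self.b * n

def crossover(first, second):
    """Value of n above which `second` is cheaper than `first`.

    Returns None when `second` never becomes cheaper as n grows.
    """
    slope = first.b - second.b
    if slope <= 0:
        return None
    return (second.a - first.a) / slope

# Hypothetical costs: 2.0 per element serially; 4 processors with a
# fixed start-up cost of 100.0 for the parallel version.
serial = Linear(0.0, 2.0)
parallel = Linear(100.0, 0.0) + serial.scale(1 / 4)

threshold = crossover(serial, parallel)
print(f"run-time test: use the parallel version if n > {threshold:.2f}")
```

With these numbers the compiler would emit both versions of the loop and a single comparison of n against the threshold, instead of committing to one version at compile time.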
6.4. Performance Evaluation Functions
In this section, we first discuss how performance evaluations can be defined and combined, then give a
survey of some useful performance prediction functions.
A performance evaluation function is a quantification of the quality of an aspect of the match between
the program and the machine. Typically, the definition of an evaluation function has two parts: the evaluation of a
non-compound statement and the accumulation of the results of applying the function to the components of a
compound statement. For example, equation (6.3) defined below is used by several of the evaluation functions
defined in section 6.4.1 to aggregate the results of applying an evaluation function recursively on the children of a
compound statement:

\[
f(S) =
\begin{cases}
\ldots & \text{if } S = \{ S_1 \mid S_2 \mid \cdots \mid S_n \} \\
\ldots & \text{if } S = \text{for } (i = 1 .. u)\ S(i)\ \text{endfor} \\
\ldots & \text{if } S = \forall\, (i = 1 .. u)\ S(i)\ \text{endfor} \\
p(S_T) \cdot f(S_T) + p(S_F) \cdot f(S_F) & \text{if } S = \text{if (cond) } S_T \text{ else } S_F \text{ endif} \\
f(S_{fb}) & \text{if } S \text{ is a function and } S_{fb} \text{ is its body} \\
F(S) & \text{otherwise } (*)
\end{cases}
\tag{6.3}
\]
where p(S) stands for the probability that the conditional statement will branch to statement S, and S(i) stands
for the i-th instance of the loop statement S. Also, {S_1 | S_2 | ... | S_n} denotes that the statements S_i are
to be executed concurrently.
Evaluation functions can be defined by changing the definition of F(S) in the line marked with (*) in
equation (6.3) into some specific functions.
6.4.1. Examples of Performance Factors and Their Evaluation Functions
The evaluation functions defined by the performance factors map the analytical match of a program and
a machine into numerical values. Below we list some sample performance factors and their corresponding
evaluation functions. Each of these performance factors represents a particular aspect of the performance of
the program. In the next section, we will discuss methodologies for integrating these factors into a framework
for predicting program performance on target machines.
• Statement count:
The statement count characterizes the algorithm by counting the number of statements in each of the
control threads. The execution time of the program may be estimated by multiplying the statement count by a
constant called the average_statement_cost. The statement count can be computed by replacing F(S) in
equation (6.3) by
F(S) = 1 if S is not a compound statement
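A rough sketch of a statement count computed by recursive aggregation is given below. Only the conditional case and the base case F(S) = 1 follow the text directly; treating a statement sequence as a sum and a loop as trip count times body count are assumptions made for this illustration, as are the dictionary layout and the sample program.

```python
def statement_count(stmt):
    """Expected number of executed statements for a nested statement."""
    kind = stmt["kind"]
    if kind == "seq":    # assumption: sequential statements are added up
        return sum(statement_count(s) for s in stmt["body"])
    if kind == "for":    # assumption: every loop instance costs the same
        return stmt["trips"] * statement_count(stmt["body"])
    if kind == "if":     # p(S_T) * f(S_T) + p(S_F) * f(S_F)
        p = stmt["p_true"]
        return (p * statement_count(stmt["then"])
                + (1 - p) * statement_count(stmt["else"]))
    return 1             # F(S) = 1 for a non-compound statement

simple = {"kind": "simple"}
program = {
    "kind": "for", "trips": 10,
    "body": {"kind": "seq", "body": [
        simple,
        {"kind": "if", "p_true": 0.25,
         "then": {"kind": "seq", "body": [simple, simple]},
         "else": simple},
    ]},
}

count = statement_count(program)
average_statement_cost = 0.5          # hypothetical machine constant
estimated_time = count * average_statement_cost
```

Other factors can reuse the same recursion by replacing only the final `return 1` with a different base function.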
• Operation count:
A more accurate estimation, called the operation count, can be computed by counting the number of
operations instead of just the statements in the control threads. An evaluation function for computing the operation
count f_o for a program fragment S can be obtained by replacing the line marked with (*) in equation (6.3) with
the following definitions:

\[
F(S) =
\begin{cases}
\ldots & \text{if } S = \text{operand op expr} \\
\alpha^{v} + \beta^{v} \cdot n & \text{if } S \text{ is a vector operation of length } n \\
\ldots & \text{otherwise}
\end{cases}
\]

where α^v and β^v are constants representing the vector startup time and unit time, using the cost of a floating
point operation as the unit.
The estimated program execution time can be obtained by multiplying the operation count by a
constant called average_operation_cost. For RISC processors, this constant can be computed easily at the
assembly instruction level since most manufacturers have studied and made assumptions about the distribution of
operations in the target applications. On the other hand, for CISC processors or for operations at the language
level, the variation in the execution time of operations can be quite large, as is the discrepancy between the
estimated operation cost and the actual cost. An improvement to this approach is to classify the operations
into groups of different costs and use different average-time estimations for each operation group. For
example, integer operations often take less time than floating point operations, additions usually take less time than
divisions, and array references usually take several integer operations to compute their addresses. An even more
detailed estimation can be done by using the actual cost of each of the operations. For example, the machine
knowledge base may record the costs of floating-point multiply, integer addition, array address calculation,
startup and element time for vector add operations, startup time and element time for triadic operations,
etc. Note that some operations (especially floating point operations) on certain machines take
variable time depending on the operands, so estimations of the average costs of these operations are still needed.
• Number of memory loads and the number of memory stores.
The numbers of memory loads and stores are two special types of operation counts that we discuss
separately. The unit costs for memory load and store operations (C^{load} and C^{store}) are usually constant, so the
cost of memory access may be estimated by counting the number of memory load and store operations. For
machines with a multiple-level memory hierarchy, the numbers of global memory loads, global memory
stores, cluster memory loads, cluster memory stores, local memory loads, and local memory stores should be
counted separately since their costs are usually different.
For machines that support block accesses, the memory load and store costs for referencing a data block
of size n, T^{load_t}_{block} and T^{store_t}_{block}, can be computed by:

\[ T^{load_t}_{block}(n) = C^{load_t}_{start} + C^{load_t}_{unit} \cdot n \]
\[ T^{store_t}_{block}(n) = C^{store_t}_{start} + C^{store_t}_{unit} \cdot n \]

where the superscript t in the above formulas is either g, c, or l, which stand for global, cluster, and local
memory, respectively. Also, C^{load_t}_{start}, C^{load_t}_{unit}, C^{store_t}_{start}, and C^{store_t}_{unit} are the constants for the
starting cost for block-load, the unit cost for block-load, the starting cost for block-store, and the unit cost for
block-store, respectively.
• Cache hit ratio.
The cache hit ratio of an array is the ratio of the references to the elements of the array that are already
in the cache. It represents the degree of locality of the array in the program and can be used not only in
deciding cache allocation problems for architectures that have a cache, but also in utilizing block-access
operations in machines that support them. Following [GaJaGa87], we define the image in the computation of an
array x of dimension d, Im(x), as:
\[ Im(x) = \{ w \in Z^d \mid x[w] \text{ is referenced in the computation} \} \]
where Z^d is the bounded integer interval space of the indices of x. Assuming that R is the total number of
references to array x in the program, the cache hit ratio is defined as [GaJaGa87]:
\[ hit(x) = \frac{R - |Im(x)|}{R} \]
For example, in the following program fragment
    for i := 1 .. N loop
        for j := 1 .. M loop
            for k := 1 .. L loop




the number R, which is the total number of references to array c, is equal to N * M * L. Furthermore,
Im(c) = {(k, j) : 1 ≤ j ≤ M, 1 ≤ k ≤ L} and |Im(c)| = M * L. So the cache hit ratio for the array c in the
program fragment is:
\[ hit(c) = \frac{N \cdot M \cdot L - M \cdot L}{N \cdot M \cdot L} = \frac{N-1}{N} \]
On the other hand, if we focus on the loop j, then R = M * L and Im(c) is the same, so the cache hit ratio of
array c in loop j becomes 0.
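The example can be reproduced by enumerating the references explicitly. The sketch below assumes that the innermost statement references c at index (k, j) exactly once per loop instance, which matches the image Im(c) given above but is otherwise an assumption, and it uses small hypothetical bounds.

```python
def cache_hit_ratio(references):
    """hit(x) = (R - |Im(x)|) / R for a list of referenced index tuples."""
    R = len(references)
    image = set(references)          # Im(x): the distinct indices touched
    return (R - len(image)) / R

N, M, L = 4, 3, 5                    # hypothetical loop bounds

# References to c over the whole i-j-k nest.
refs_nest = [(k, j)
             for i in range(1, N + 1)
             for j in range(1, M + 1)
             for k in range(1, L + 1)]

# References to c within a single instance of loop i (loops j and k only).
refs_loop_j = [(k, j)
               for j in range(1, M + 1)
               for k in range(1, L + 1)]

print(cache_hit_ratio(refs_nest))    # (N - 1) / N = 0.75
print(cache_hit_ratio(refs_loop_j))  # 0.0
```

The reuse comes entirely from the outer loop i, which is why restricting attention to loop j drops the ratio to zero.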
• Data transmitting cost.
The time it takes to transmit a set of data of size n over the network or bus can be computed by the
following formula:
\[ T^{dt}_{block}(n) = \alpha + \beta \cdot n \]
where α and β are the constants for the start-up cost (for hand-shaking, setting up routing connections, etc.) and
the unit-transmitting cost. For distributed computers such as the hypercube, where data may need to be transmitted
by several hops through several processors, these two constants α and β can be computed by the following
formulas:
\[ \alpha(hops) = \alpha_0 + \alpha_1 \cdot hops \]
\[ \beta(hops) = \beta_0 + \beta_1 \cdot hops \]
where hops is the number of hops along the path. More precisely, the constant α_0 for a distributed machine
actually includes the time for a zero-length send and a zero-length receive; the constant β_0 is the time for the
sending task to access the data and the time for the receiving task to store the data. As an example, these
constants for NCUBE/1 and nCUBE 2 are shown in the following table (table 6.1).
Table 6.1. Constants for computing message transmitting cost on nCUBE.
(Time in microseconds)
MACHINE    α_0      α_1     β_0     β_1     mach. cycle    fl. add
NCUBE/1    261      193     4.25    4.5     0.137          2.47-3.84
nCUBE 2    158.5    1.3     2.49    0.0     0.05           0.35
Therefore, sending a one-word message to a neighbor takes 462.75 microseconds on NCUBE/1 and 162.3
microseconds on nCUBE 2, and sending a 1000-word message across a 128-node cube (7 hops) takes 37362.0
microseconds on NCUBE/1 but 2657.6 microseconds on nCUBE 2.
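The message cost model can be evaluated directly. In the sketch below, the placement of the decimal points in the constants is an assumption (chosen so that the 1000-word totals quoted in the text are reproduced), and the 7 hops for a 128-node cube assume a hypercube of dimension log2(128).

```python
import math

def message_time(words, hops, alpha0, alpha1, beta0, beta1):
    """T(n) = alpha(hops) + beta(hops) * n, with all times in microseconds."""
    alpha = alpha0 + alpha1 * hops   # start-up cost grows with the path length
    beta = beta0 + beta1 * hops      # per-word cost grows with the path length
    return alpha + beta * words

# Constants as read from table 6.1 (decimal points are an assumption).
NCUBE_1 = dict(alpha0=261.0, alpha1=193.0, beta0=4.25, beta1=4.5)
NCUBE_2 = dict(alpha0=158.5, alpha1=1.3, beta0=2.49, beta1=0.0)

# Assumption: crossing a 128-node hypercube takes log2(128) = 7 hops.
max_hops = int(math.log2(128))

long_1 = message_time(1000, max_hops, **NCUBE_1)
long_2 = message_time(1000, max_hops, **NCUBE_2)
short_2 = message_time(1, 1, **NCUBE_2)
print(long_1, round(long_2, 1), round(short_2, 1))
```

On the nCUBE 2 the per-hop terms are small, so the cost of a long message is dominated by the per-word term rather than by the distance.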
• Voided data-prefetch ratio.
For architectures that support data prefetch, most data accesses can be overlapped with the computation
if they are not voided by conditional or unconditional branch statements. Therefore, the cost of data
synchronization is a function of the ratio of prefetched data that are voided by the control statements. A
simple heuristic to estimate this ratio is to count the number of statements, N_stmt, in the statement block following
a branch, and assume that all but one of them benefit from the data prefetching. This gives us a ratio of
1/N_stmt, and the data synchronization cost can be estimated as:
\[ C^{ds}(S) = C^{da}(S) \cdot p^{vp}(S) = C^{da}(S) \cdot \frac{1}{N_{stmt}} \]
where C^{da}(S) is the data access cost of S and p^{vp}(S) is its voided data-prefetch ratio.
As the formula shows, the longer an uninterrupted statement block is, the more data accesses can be overlapped
with the computation.
• Ratio of data access overhead.
This function characterizes the ratio of the non-overlapped cost of data access over the total execution
time. The difference between this ratio and the voided data-prefetch ratio is that the first ratio is the data
access overhead over the total execution time of the statement, whereas the second ratio is over the total data
access time of the data in the statement. This means that this factor is used in case the computation of data
access time is to be avoided. Of course, the non-overlapped cost of data access depends on the features of the
architecture (prefetching mechanism, data cache, etc.) and the program (density of the local/external memory
references, distribution of the external references, etc.) and is hard to estimate without computing the data
access cost. This evaluation function assumes that the cost of data access is proportional to the cost of
computation, which may not always be true. We can estimate this cost by analyzing a set of representative program
structures and finding the ratio of data access that can be overlapped with the computation, then using these
patterns as the basis for interpolating the non-overlapping data access cost for general programs. With this ratio,
the data synchronization cost, C^{ds}, over a program region S becomes:
\[ C^{ds}(S) = C(S) \cdot p^{ds}(S), \]
where C(S) is the cost for computing in S, and p^{ds}(S) is the ratio of the data access overhead in S.
• Do-across delay.
A good measure for data synchronization cost is the do-across delay. The do-across delay is the artificial
idle time in the shared-memory environment that the compiler inserts into a do-across loop to minimize the
load (the hot-spot effect) that a busy-waiting processor induces on the memory, bus, or network in checking
the global synchronization data. This delay represents the expected data synchronization cost of the loop
instance and is the minimum delay that satisfies all control (conditional and unconditional branching statements)
and data dependencies in the parallel loops. The do-across delay of a parallel loop is computed by finding the
maximum static time interval between the pairs of statements involved in lexically backward data dependencies
in the loop and then spreading the delays into loop instances [Cytron84]. Given the do-across delay d_{L_i} for
loop L_i and the execution time T_{L_i} of its body, and assuming that the loop has N loop instances, the estimated
execution time for distributing the do-across loop over p processors is given by the following formula:
\[
T^{doacross}_{L_i} =
\begin{cases}
(N-1) \cdot d_{L_i} + T_{L_i} & \text{if } p \cdot d_{L_i} \ge T_{L_i} \\
\left( \left\lfloor \frac{N-1}{p} \right\rfloor + 1 \right) \cdot T_{L_i} + \mathrm{mod}(N-1, p) \cdot d_{L_i} & \text{otherwise}
\end{cases}
\]
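The two cases of the do-across estimate can be cross-checked against a small schedule simulation. The sketch below is an illustration under stated assumptions: the first case is taken to apply when p * d >= T (the delay, rather than the number of processors, limits the loop), instance i is assumed to run on processor i mod p, and each instance is assumed to start no earlier than d after its predecessor.

```python
def doacross_time(N, p, d, T):
    """Closed-form estimate: N instances, p processors, delay d, body time T."""
    if p * d >= T:                   # assumed condition for the first case
        return (N - 1) * d + T
    return ((N - 1) // p + 1) * T + ((N - 1) % p) * d

def simulate(N, p, d, T):
    """Finish time of the last instance under a cyclic assignment."""
    start = []
    for i in range(N):
        after_delay = start[i - 1] + d if i > 0 else 0      # dependence delay
        processor_free = start[i - p] + T if i >= p else 0  # same processor
        start.append(max(after_delay, processor_free))
    return start[-1] + T

print(doacross_time(10, 4, 1, 2), simulate(10, 4, 1, 2))  # delay-bound case
print(doacross_time(10, 2, 1, 4), simulate(10, 2, 1, 4))  # processor-bound case
```

In the delay-bound case adding processors does not help, while in the processor-bound case the time is governed by how many rounds of p instances are needed.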
• Do-across parallelism degree.
The do-across parallelism degree [Cytron84] is the complement of the ratio of the do-across delay over the
actual execution time of the loop body. Assume that T_{L_i} is the execution time of the body of loop L_i and d_{L_i} is
the do-across delay for loop L_i; then the do-across parallelism degree of the loop can be computed as:
\[ Para_{doacross} = 1 - \frac{d_{L_i}}{T_{L_i}} \]
It follows that, given the parallelism percentage, the do-across delay can be computed by the following
formula:
\[ d_{L_i} = (1 - Para_{doacross}) \cdot T_{L_i} \]
The do-across parallelism degree is an indication of the parallelism present in the do-across loop.
• Number of data synchronization points.
Synchronization adds overhead and may cause processors to idle but is needed to enforce data
dependencies that cross the task boundaries (to be more precise, for those dependencies that cross processor boundaries).
To estimate the effect of synchronization on the proposed task scheduling, the number of cross-processor data
dependencies can be used. This number can be collected by simply counting the cross-task data dependencies
and eliminating those among the tasks that are assigned to the same processor.
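The counting rule can be written down in a few lines. In this sketch the dependence list, the statement-to-task map, and the task-to-processor map are hypothetical inputs; in a compiler they would come from the data dependence graph and the proposed schedule.

```python
def cross_processor_dependences(dependences, task_of, processor_of):
    """Count data dependences whose endpoints run on different processors."""
    count = 0
    for src, dst in dependences:
        t_src, t_dst = task_of[src], task_of[dst]
        if t_src == t_dst:
            continue                       # not a cross-task dependence
        if processor_of[t_src] != processor_of[t_dst]:
            count += 1                     # needs a synchronization point
    return count

# Hypothetical example: four statements, three tasks, two processors.
dependences = [("s1", "s2"), ("s1", "s3"), ("s2", "s4"), ("s3", "s4")]
task_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "C"}
processor_of = {"A": 0, "B": 0, "C": 1}

sync_points = cross_processor_dependences(dependences, task_of, processor_of)
print(sync_points)  # 2
```

Re-running the count with a different `processor_of` map shows how a candidate schedule changes the synchronization overhead.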
• Uniformness of the execution time.
Let {P_i | i = 1 .. n} be a set of n tasks, T(P_i) the estimated execution time of the tasks, T_max be the
maximum of the estimated execution times, and T_mean be the mean of the estimated execution times. The
uniformness of the execution time is defined to be:
For tasks that have more uniform execution times, load balance is easier to achieve and static scheduling can be
used. On the other hand, for tasks whose execution times vary, dynamic scheduling may be better.
• Hot-spot percentage.
A hot-spot is a module in the multi-stage blocking network that has sufficient concentration of the traffic.
The non-uniform network traffic of hot-spots can produce effects (called tree saturation) that severely degrade
all network traffic [PfNo85]. Although message combining is an effective technique for solving this problem
when the sources of hot-spots are global shared locks, the hardware needed for supporting the combining is
quite expensive (Pfister and Norton [PfNo85] estimated that the combined network increases the size and cost
of the switch by a factor of 6 to 32).
Let the number of processors be p, and the number of network packets emitted per processor per switch
cycle be r (0 ≤ r ≤ 1). The hot-spot percentage, h [PfNo85], is defined to be the fraction of the data references
directed at the hot-spot (i.e., each processor emits packets directed to the hot-spot at a rate of r*h). Then
the effective data rate into the hot module is r(1 - h) + rhp, and the asymptotic limit of the total
communication bandwidth available is:
\[ B = \frac{p}{1 + h(p - 1)} \]
This function gives a limit on the available speedup for a given number of processors. The function was
originally derived in [PfNo85] for shared memory modules on multi-stage networks, but can be generalized to more
general distributed systems. This limit imposed by the hot-spot degradation is very significant. For example,
for a 1000-processor system, a hot-spot percentage of 0.125% can limit the potential speedup to 50%. The
hot-spot percentage can be estimated by analyzing patterns of the data dependence graph. The hot-spot
percentage can be used to generate network traffic in simulations, but is very difficult to compute for real programs
except in synchronized parallel loops.
6.5. Parallel Execution Time of the Program
The execution time of a program on a multiprocessor system can be broken down into four categories:
computation cost, data synchronization cost, control synchronization cost, and control overheads. The
computation cost is the time the processor spends in doing the computation. The data synchronization cost is the
length of time that a processor is forced to idle while waiting for needed data to arrive. Two sources of the
data synchronization are local data accesses that cannot be overlapped with the computation and the
synchronization introduced by the data dependence between a pair of statements that are executed by different
processors. The cost of a data reference can be broken down into the memory fetch cost, which is the time taken for
the request to reach the memory, and the data transition cost, which is the cost of transmitting the data
through the network or bus. The control synchronization is the time it takes to synchronize the control
between two or more processors (semaphores, barrier synchronization, conditional or unconditional branch
statements, etc.). The control synchronization is usually introduced by explicit control structures, while the
data synchronization is normally defined implicitly by the data dependence relations. The control overhead is
the overhead for concurrent execution; this includes factors such as vector start-up time, process start-up time,
and dynamic task scheduling time.




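The execution time of the program can then be estimated as:
\[ T = T^{c} + T^{ds} + T^{cs} + T^{co} = T^{c} + (T^{da} - T^{do}) + T^{cs} + T^{co} \]
where T^{c} is the computation cost, T^{ds} is the data synchronization cost (which is the difference of the data
access cost (T^{da}) and the data overlapping time (T^{do})), T^{cs} is the control synchronization cost, and T^{co} is the
control overhead. This is only one of the many different estimations that can be defined.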
6.5.1. Example of Estimating Program Execution Time with the Model
The execution time of the program can be estimated to different degrees of accuracy with different sets
of evaluation functions. The selection of the evaluation functions and the method for integrating the functions
are done by a set of rules and depend on how much computing resources can be allocated to the estimation.
In this section, we use a simple example to illustrate the performance prediction process under our
framework. We chose to use the nCUBE 2 to demonstrate how the estimation is established because the
nCUBE 2 has some interesting characteristics that present a challenge to the performance prediction system.
For example, integer addition and subtraction take one machine cycle, but operand decoding and fetching
take up to 3 cycles; array address calculation is expensive (the code generated by the nCUBE 2 compiler takes
about 11 cycles for one-dimensional arrays and about 31 cycles for two-dimensional arrays); integer multiply (7
cycles) is slower than floating point multiply (6 cycles); integer division (38 cycles) is much slower than
floating point division (7-18 cycles); communication between processors is very expensive (156 cycles start-up time
per message for one hop), etc. These factors imply that data prefetch has dominant effects on the cost of the
computation, that simply classifying operations into floating point and integer operations may not be fine enough,
and that interprocessor communication needs to be overlapped with computation as much as possible. All these
factors, plus the fact that the compiler on the nCUBE 2 does not generate optimal code, complicate the
estimation of the performance on the nCUBE 2. Nevertheless, this presents a good opportunity for examining our
performance prediction model.
The program that we chose is the matrix-vector multiply program shown in figure 6.2. We assume
that the program has been previously decomposed to be executed on P processors and that arrays a and y have
been distributed by blocks across the processors, a local copy of x is available on all processors, and the result
y is to be collected at processor 0. This example is chosen because it is simple but presents some
communication imbalance for distributed-memory computers because processor 0 is a hot-spot. The purpose of the
example is not to show how accurate the estimation can be, since this varies with programs, but to show the process
for constructing and refining estimations under the control of a rule base.
During the performance-setup process, the following performance factors are selected based on the
architectural features of nCUBE 2: counts for floating point operations, integer operations, memory loads,
memory stores, and array references, data synchronization, data transmitting cost, and control synchronization cost.
All of the above factors except the data transmitting cost are in the list of default evaluation functions. The
data transmitting cost over the network is needed because the nCUBE 2 is a distributed machine and the
balance between the communication and computation is very important. The results of the evaluation functions
will be combined at the task level. The performance of the program based on this setup is then:
\[
T^{computation} = (N^{fop} \cdot C^{fop} + N^{iop} \cdot C^{iop} + N^{ar} \cdot C^{ar}
+ N^{load} \cdot C^{load} \cdot p^{ds} + N^{store} \cdot C^{store} + N^{lo} \cdot C^{lo}) \cdot m \cdot \mathit{ln}
\]
where N^{fop}, N^{iop}, N^{ar}, N^{load}, N^{store}, and N^{lo} are the counts for floating point operations, integer operations,
array references, memory loads, memory stores, and loop overheads, respectively, in the loop instance (i, j).
Also, C^{fop}, C^{iop}, C^{ar}, C^{load}, C^{store}, and C^{lo} are the unit costs for a floating point operation, integer operation,
array reference, memory load, memory store, and loop overhead, respectively.
The constants that are used by the evaluation functions are provided by the machine knowledge
manipulation mechanism and are listed in the following table:
Table 6.2. Constants for evaluation functions for nCUBE 2.
Evaluation constants for nCUBE 2 (in microseconds)
performance factor                 time      performance factor                     time
floating point op, C^{fop}         0.33      integer op, C^{iop}                    0.35
local memory load, C^{load}        0.11      local memory store, C^{store}          0.1
message startup cost, α_0          158.45    message startup (per hop), α_1         1.293
data trans. (per word), β_0        2.49      data trans. (per word per hop), β_1    0.0
loop overhead, C^{lo}              0.6       0 bytes read                           80

    procedure matrix_vector_multiply(ln, m, a, x, y) returns (mbuf);
    begin
      forall pid in 1 .. P do                -- ln := ⌈n/P⌉
        local a : array [1..ln, 1..m] of real;
              x : array [1..m] of real;
              y : array [1..ln] of real;
        for i := 1 .. ln loop                -- the simple part
          for j := 1 .. m loop
            y[i] := y[i] + a[i,j] * x[j];
          endfor
        endfor
        if (pid == 0) then                   -- the harder part
          received = 0;
          ...
        else send vector y to processor 0;
        ...
Figure 6.2. Sample program and the array distribution for the arrays in the matrix vector multiply example.

A simple tree traversal reveals that there are 2 floating point operations, 3 array address calculations, 9
load operations, and 1 store operation in the loop instance (i, j). Since the loop bounds of loops i and j are
unknown, the evaluation functions are computed from the variables ln and m, which are the loop bounds of
loops i and j, respectively. There are in total 2*m*ln floating-point operations, 3*m*ln array address
calculations, 9*m*ln load operations, and m*ln store operations for loop i. Also, the loop overhead for each processor
is m*ln times the unit cost for each loop instance. Since m and ln are both unknown to the procedure at
compile time, the resulting estimation is a function of m and ln. The execution time of the loop i in each processor
is then estimated by the performance construction model to be the following expression:
\[
T^{computation} = (2 \cdot C^{fop} + 0 \cdot C^{iop} + 3 \cdot C^{ar} + 9 \cdot C^{load} \cdot p^{ds}
+ 1 \cdot C^{store} + 1 \cdot C^{lo}) \cdot m \cdot \mathit{ln}
\tag{6.6}
\]
where C^{fop}, C^{iop}, C^{ar}, C^{load}, C^{store}, and C^{lo} are the unit costs for a floating point operation, integer operation,
array reference, memory load, memory store, and loop overhead, respectively, and p^{ds} is the percentage of
data that cannot be overlapped with computation.
For processor P, ln is $n - (P-1) \cdot \lceil n/P \rceil$, and for all other loop instances, ln is
$\lceil n/P \rceil$. A rule in the knowledge base suggests that for loops with only one statement in the
innermost loop, data prefetching has very little effect unless the expression in the statement is very long.
Since the statement in the innermost loop, j, of figure 6.2 is relatively simple, the data synchronization
overhead is set to be 100% of the memory load cost (that is, $p^{ds} = 1$). Suppose n is 6400, m is 100, and
P is 64; then ln = 100. By substituting the constants in table 6.2 into the formula, the estimated computation
time for loop i is 4600 microseconds. As a comparison, the measured computation time for loop i is 4823.6
microseconds.
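Equation (6.6) can be evaluated mechanically once the constants are fixed. The following Python sketch does so with the constants of table 6.2. The names `COSTS` and `computation_time` are illustrative only. The array reference cost `C_ar` does not appear in table 6.2, so the value 0.24 used here is an assumption; it was chosen because it gives 3.07 microseconds per inner iteration, the rate implied by the predicted computation column of table 6.3.

```python
# Unit costs in microseconds, taken from table 6.2 (nCUBE 2).
COSTS = {
    "C_fop": 0.33,    # floating point operation
    "C_iop": 0.35,    # integer operation
    "C_load": 0.11,   # local memory load
    "C_store": 0.10,  # local memory store
    "C_lo": 0.60,     # loop overhead per iteration
    "C_ar": 0.24,     # ASSUMPTION: array reference cost, not listed in table 6.2
}


def computation_time(m, ln, p_ds=1.0, costs=COSTS):
    """Estimated time of loop i on one processor, following equation (6.6).

    p_ds is the fraction of loads that cannot be overlapped with computation.
    """
    per_iteration = (2 * costs["C_fop"] + 0 * costs["C_iop"]
                     + 3 * costs["C_ar"]
                     + 9 * costs["C_load"] * p_ds
                     + 1 * costs["C_store"] + 1 * costs["C_lo"])
    return per_iteration * m * ln


# n = 400, m = 200, P = 2 gives ln = 200 rows per processor.
estimate = computation_time(m=200, ln=200)
print(round(estimate, 2))
```

Because the result stays a function of `m` and `ln`, the same routine can be called again whenever a transformation changes an operation count or `p_ds`.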
The only cross-task data dependence in the program is caused by the communication statements
corresponding to the collection of the array y, so the data synchronization is computed with these statements.
The control synchronization is hard to predict on the nCUBE 2 because there is a non-trivial cost in loading
the node programs, so most programs do not synchronize unless it is necessary. This means that tasks on the
nodes start at different times depending on how the programs are loaded (broadcast or not), the size of the
node program, and the position of the node in the hypercube. The control synchronization is needed only when
values of y are needed later (which is outside the scope of the program that we discuss here). Lacking the
control synchronization at the beginning of the program makes the estimation of the data synchronization cost
difficult, since tasks are not started at the same time.
To estimate the cost of sending messages, we need to estimate the cost for read/write and data
transmission. Since message sending on the nCUBE 2 is non-blocking, on tasks that send the data only the time
to copy the data into the system buffer needs to be counted. On the receiving side, on the other hand, the data
synchronization cost includes the time for the sender to send out the data, the transmitting time, and the time
for the reader to receive and store the data. This time is a function of the message length and the distance
between the processors, and can be computed by equation (6.7) below.
Since all the data is sent to processor 0, processor 0 becomes a hot spot and the computation time of
processor 0 dominates the computation time of the program. For processor i, the distance to processor 0 is the
number of 1's in the binary representation of its processor identifier, represented as $dist^0(i)$. Therefore,
the cost of transmitting the vector y computed in processor i to processor 0 is:
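$$T^{dt}(i) = \alpha_0 + \alpha_1 \cdot dist^0(i) + \mathit{ln} \cdot (\beta_0 + \beta_1 \cdot dist^0(i)) \qquad (6.7)$$
$$\phantom{T^{dt}(i)} = 158.45 + 1.293 \cdot dist^0(i) + 2.492 \cdot \lceil n/P \rceil$$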
Although the receive statements in processor 0 are executed sequentially, the data synchronization cost is
not the sum of the costs of transmitting vector y from all other processors, because the data is transmitted
simultaneously. If all processors are synchronized, then the data synchronization cost would be:
$$T^{ds} = \max\{\, \min\{T^{dt}(i) : i \in [1..P]\} + \sum_{i} T^{r}(i),\ \max\{T^{dt}(i) + T^{r}(i) : i \in [1..P]\} \,\}$$
where $T^{r}(i)$ is the cost of moving the message from the system buffer into the data structure of the user
program.
For our sample program, $\min\{T^{dt}(i) : i \in [1..P]\} = 159.743 + 2.492 \cdot \lceil n/P \rceil$,
$\max\{T^{dt}(i) : i \in [1..P]\} = 158.45 + dim \cdot 1.293 + 2.492 \cdot \lceil n/P \rceil$, and
$T^{r}(i) = 80 + 0.11 \cdot \lceil n/P \rceil$. We also add a term $60 \cdot P$ to compensate for the hot-spot
effect.
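The communication estimate can be sketched in Python as follows. This is an illustration under stated assumptions rather than the report's implementation: the senders are taken to be nodes 1 to P-1, the receive cost is summed over P terms, and the hot-spot term of 60 per processor is added after the maximum is taken. With the per-word cost 2.49 from table 6.2 and ln = n/P, these choices reproduce the predicted communication time of 981.74 listed in table 6.3 for (n, m, P) = (400, 200, 2).

```python
# Constants in microseconds, from table 6.2 (nCUBE 2).
ALPHA0, ALPHA1 = 158.45, 1.293   # message startup, startup per hop
BETA0, BETA1 = 2.49, 0.0         # transfer per word, per word per hop
READ0, C_LD = 80.0, 0.11         # zero-byte read cost, local memory load


def dist0(i):
    """Hypercube distance from node i to node 0: number of 1 bits in i."""
    return bin(i).count("1")


def t_dt(i, ln):
    """Cost of transmitting ln words from node i to node 0, equation (6.7)."""
    d = dist0(i)
    return ALPHA0 + ALPHA1 * d + ln * (BETA0 + BETA1 * d)


def t_r(ln):
    """Cost of moving a received message of ln words into user data."""
    return READ0 + C_LD * ln


def data_sync_cost(P, ln, hot_spot_per_proc=60.0):
    """Data synchronization cost at node 0, assuming synchronized starts."""
    senders = range(1, P)                       # ASSUMPTION: nodes 1..P-1 send
    transmit = [t_dt(i, ln) for i in senders]
    serial = min(transmit) + P * t_r(ln)        # ASSUMPTION: P receive terms
    overlapped = max(t + t_r(ln) for t in transmit)
    # ASSUMPTION: the hot-spot compensation is added outside the max.
    return max(serial, overlapped) + hot_spot_per_proc * P


print(round(data_sync_cost(P=2, ln=200), 2))
```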
By substituting different values of n, m, and P, we obtain the estimations shown in table 6.3. The
measured execution times for these parameters are also shown for comparison.
Table 6.3. The predicted and measured performance of the matrix-vector multiply program.
                        Predicted performance                    Measured performance
   n    m   P   Computation  Communication      Total    Computation  Communication      Total   % error
 400  200   2     122800.00         981.74  123781.74      123632.00         729.00  124361.00      0.47
 400  200   4      61400.00        1012.74   62412.74       61806.00         776.00   62582.00      0.27
 400  200   8      30700.00        1448.24   32148.24       30909.00        1249.00   32158.00      0.03
 400  200  16      15350.00        2505.99   17855.99       15475.00        2270.00   17745.00      0.63
 400  200  32       7675.00        4714.87   12389.87        9391.00        3004.00   12395.00      0.04
 400  200  64       3837.50        9179.31   13016.81        6977.00        5899.00   12876.00      1.09
1600  100   4     122800.00        1891.74  124691.74      124623.00        2008.00  126631.00      1.53
1600  100   8      61400.00        1953.74   63353.74       62179.00        2038.00   64217.00      1.34
1600  100  16      30700.00        2824.74   33524.74       31082.00        2990.00   34072.00      1.61
1600  100  32      15350.00        4940.24   20290.24       15564.00        5022.00   20586.00      1.44
1600  100  64       7675.00        9357.99   17032.99        7792.00        9266.00   17058.00      0.15
6400   20  16      24560.00        4099.74   28659.74       26010.00        6514.00   32524.00      11.9
6400   20  32      12280.00        5841.74   18121.74       13008.00        7632.00   20640.00      12.2
6400   20  64       6140.00       10072.74   16212.74        6516.00       36930.00   43446.00      62.7
The performance prediction established in the performance-construction module provides a foundation
for estimating the effects of program transformations on the program. During the program optimization
process, we can use this estimation and apply the performance-update module to adapt the estimation based on
the program transformations. For example, in the above program, y[i] has an output dependence and a flow
dependence on itself in loop j. This means that the result of y[i] is loaded and updated right after it is
stored in the previous loop iteration. So if we allocate y[i] to a register during the execution of loop j, then
only one load and one store of y[i] are needed. This is a reduction of m-1 loads and m-1 stores for each y[i].
Therefore, by allocating the y[i]'s to registers, the load and store counts are each reduced by (m-1)·ln.
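The update step described above only touches the operation counts affected by the transformation. A minimal Python sketch of that idea follows; the dictionary layout and the function name `allocate_to_register` are hypothetical and not part of the performance-update module itself.

```python
def allocate_to_register(counts, m, ln):
    """Return updated operation counts after keeping y[i] in a register
    for the duration of loop j.

    Each y[i] then needs one load and one store instead of m of each,
    so both counts drop by (m - 1) * ln.
    """
    updated = dict(counts)            # leave the caller's estimate intact
    saved = (m - 1) * ln
    updated["load"] -= saved
    updated["store"] -= saved
    return updated


m, ln = 200, 200
before = {"load": 9 * m * ln, "store": m * ln}   # counts used in the text
after = allocate_to_register(before, m, ln)
print(after["load"], after["store"])
```

Only two entries change, which is why the update is much cheaper than rebuilding the whole estimate.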
Note that the performance of the matrix-vector multiply program is far short of the advertised peak
performance of the nCUBE 2. This is because there is only one statement in the loop body and the loop branches
prevent the prefetching mechanism of the nCUBE 2 from pre-loading the operands. Consequently, the computation
of the operands dominates the computation time. By applying the loop unrolling transformation, we may increase
the size of the code between the conditional branches of the loop, thus allowing the data loading to be
overlapped with the computation. For example, if we unroll the loop m 5 times, the estimated data fetching
time becomes one-fifth of the original cost since $p^{ds}$ is now 1/5. The loop overhead is also decreased by a
factor of 5. This implies that by applying loop unrolling the cost for loading can be decreased significantly.
This is confirmed by the result as shown in table 6.4:
Note that in tables 6.3 and 6.4, the estimation of the computation is fairly close to the actual cost of the
computation. Although there are instances where the simple-minded estimation for the effects of the data
prefetching is not accurate enough, this model can be utilized by the program optimization process to estimate
the effects of program transformations on the performance. More detailed estimation of the data prefetching
effects is needed only when this factor is important, in which case the performance-refine module can be used
to improve the accuracy of the estimation. On the other hand, the performance estimation for the communication
is good for some cases but far from accurate in others. This is because we did not consider the hot-spot
effects of the message passing in the example. Since all nodes are sending messages to node 0, node 0 becomes
the hot spot, and when the message is long and the size of the cube is large, performance
Table 6.4. The predicted and measured performance of the matrix-vector multiply program with innermost
loop unrolled 5 times.
                        Predicted performance                    Measured performance
   n    m   P   Computation  Communication      Total    Computation  Communication      Total   % error
 400  200   2      71128.00         981.74   72109.74       69332.00         730.00   70062.00      2.92
 400  200   4      35564.00        1012.74   36576.74       34691.00         889.00   35580.00      2.80
 400  200   8      17782.00        1448.24   19230.24       17357.00        1247.00   18604.00      3.37
 400  200  16       8891.00        2505.99   11396.99        8689.00        2261.00   10950.00      4.08
 400  200  32       4445.50        4714.87    9160.37        5881.00        3004.00    8885.00      3.10
 400  200  64       2222.75        9179.31   11402.06        5127.00        5899.00   11026.00      3.41
1600  100   4      71128.00        1891.74   73019.74       69898.00        1972.00   71870.00      1.60
1600  100   8      35564.00        1953.74   37517.74       34977.00        2056.00   37033.00      1.31
1600  100  16      17782.00        2824.74   20606.74       17499.00        2950.00   20449.00      0.77
1600  100  32       8891.00        4940.24   13831.24        8758.00        5065.00   13823.00      0.06
1600  100  64       4445.50        9357.99   13803.49        4406.00        9266.00   13672.00      0.96
6400   20  16      14225.60        4099.74   18325.34       14913.00        6478.00   21391.00      14.3
6400   20  32       7112.80        5841.74   12954.54        7468.00        7703.00   15171.00      14.6
6400   20  64       3556.40       10072.74   13629.14        3761.00       37045.00   40806.00      66.6
degradation occurs. This is especially so for the case where (n, m, P) = (6400, 20, 64): the cost of sending
64 messages of 100 words each to node 0 is more than five times that of sending 32 messages of 200 words each
to node 0. The nCUBE 2 uses the wormhole routing mechanism, where a channel is reserved for a message and
the data is pipelined to the destination, so messages that need to use the busy links will be blocked until the
previous message is finished with the channel. The longer the messages, the greater the chances that a message
will be blocked at the sender side.
When a more accurate performance prediction is required, the performance-refine module can be
invoked. For example, one way to refine the estimation is to use a finer classification of the operations. This
will not help much for our sample program here since it does not contain any expensive operations such as
divisions. For distributed architectures like the nCUBE 2, one way to better estimate the data synchronization
cost is to analyze the performance degradation due to the communication hot spot. The performance degradation
of the network caused by the hot-spot saturation can be estimated by the hot-spot percentage. However, the
computation of the hot-spot percentage is very expensive and is generally not accurate due to the dynamic
behavior of the program. A more practical approach is to model the performance degradation with a set of
selected patterns of different degrees of congestion (as described by the data dependence graph). For a data
dependence that does not match any of the pre-selected patterns, an interpolation to the nearest pattern is
done to find an estimation of the degradation. The more patterns we use, the better the estimation will be.
However, matching complex patterns is an expensive operation, so this method can be used only when the user
can afford the needed computing resources. For our example, the compiler can easily recognize that processor 0
is a hot spot, and an adjustment for hot-spot effects can be applied.
6.6. Applying Performance Prediction to Intelligent Decision-Making
The performance prediction model can be applied to intelligent parallel compilers in the following areas:
1. Choose among a list of prospective transformations.
2. Provide a measure for selecting pre-optimized algorithms when several algorithms are available under
the algorithm substitution approach.
3. Decide when to stop the optimization model.
4. Help to generate minimal run-time tests for situations where the compiler cannot make decisions by
static analysis.
The performance prediction model we have proposed is designed to be integrated into the decision-making
process for optimizing program parallelism. In particular, it is used in the feature-directed program
optimization model to guide the decision-making process. When applied to systematic decision-tree traversing
algorithms (such as the A* algorithm discussed in chapter 4), automatic parallel program optimization
can be achieved. Furthermore, the flexibility of our model makes it easy to refine the heuristic functions at
different parts of the decision tree, which makes the framework more dynamic. It can also be combined with
rule-based systems to improve the quality of the optimization. An evaluation function or a combination of
several evaluation functions can be used to decide the effects of program transformation techniques.
Heuristic-driven rules can use the prediction to select appropriate program transformations or pre-optimized
algorithms. In an interactive program optimization session, it provides users with performance information for
them to make optimization decisions.
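To make the role of the evaluation function in a tree traversal concrete, the following Python sketch runs a plain best-first search over sequences of transformations, ordered by predicted time. It is deliberately simpler than the A* algorithm of chapter 4 (there is no separate path cost), and everything in it is hypothetical: the state encoding, the two toy transformations, and the per-iteration cost figures in `estimate` are illustrations, not measured values.

```python
import heapq
import itertools


def best_first_optimize(initial, transformations, estimate, max_expansions=100):
    """Best-first search over transformation sequences.

    initial         -- hashable description of the program state
    transformations -- dict: name -> function(state) giving a new state or None
    estimate        -- function(state) -> predicted execution time
    Returns (best predicted time, list of transformation names).
    """
    tie = itertools.count()              # tie-breaker so states are never compared
    frontier = [(estimate(initial), next(tie), initial, [])]
    seen = {initial}
    best = frontier[0]
    for _ in range(max_expansions):
        if not frontier:
            break
        node = heapq.heappop(frontier)
        if node[0] < best[0]:
            best = node
        _, _, state, path = node
        for name, apply in transformations.items():
            new_state = apply(state)
            if new_state is None or new_state in seen:
                continue
            seen.add(new_state)
            heapq.heappush(
                frontier, (estimate(new_state), next(tie), new_state, path + [name]))
    return best[0], best[3]


# Toy state: (unroll factor, whether y[i] is kept in a register).
def estimate(state):
    unroll, in_register = state
    loads = 8 if in_register else 9      # toy counts per inner iteration
    stores = 0 if in_register else 1
    return 0.66 + 0.72 + loads * 0.11 / unroll + stores * 0.1 + 0.6 / unroll


transformations = {
    "unroll_by_5": lambda s: (5, s[1]) if s[0] == 1 else None,
    "register_allocate_y": lambda s: (s[0], True) if not s[1] else None,
}

time, plan = best_first_optimize((1, False), transformations, estimate)
print(plan)
```

Swapping in a different `estimate` changes the search order without touching the search itself, which is the kind of flexibility the text attributes to the model.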
For a pre-optimized algorithm substitution approach, there are cases where difficult decisions need to be
made. For example, there may be more than one algorithm that is equivalent to the program under
consideration, or the same algorithm may be optimized for several different parallel architectures but not for
the current target machine. In the former case, the performance prediction may help the system decide which
algorithm is most efficient for the problem. In the latter case, the performance prediction model can be
integrated with the machine knowledge manipulation system to find the architecture that best fits the target
machine.
In the program optimization process, the optimal solution is usually not known until the entire decision
tree is traversed. This makes deciding the termination condition difficult. Since the performance estimation
mechanism can be used to find the lower bound of the execution time, this number can be used to decide the
termination condition of the optimization. First, a tolerance for the lower bound can be selected; the greater
the optimization degree is, the smaller the tolerance will be. When the estimated performance falls into the
tolerance range of the "optimal execution time," the optimization process can be terminated.
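The termination test can be sketched in a few lines of Python. The rule that the tolerance halves with each optimization degree is purely an assumption made for this illustration; the text only requires that a higher degree give a smaller tolerance.

```python
def tolerance_for(degree, base=0.5):
    """ASSUMPTION: relative tolerance halves with each optimization degree."""
    return base / (2 ** degree)


def should_terminate(estimated_time, lower_bound, degree):
    """True when the estimate is within the tolerance range of the lower bound."""
    return estimated_time <= lower_bound * (1.0 + tolerance_for(degree))


# Same estimate, two optimization degrees: the stricter one keeps searching.
print(should_terminate(110.0, 100.0, degree=0))
print(should_terminate(110.0, 100.0, degree=3))
```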
The compiler may not be able to make decisions based on a performance estimation that has
uninstantiated variables in the expression. In these cases, multiple branches [BDHLW87] of the control may be
generated, and when dynamic decisions are needed, the conditions on the performance with uninstantiated
variables are used to find and generate the minimum run-time tests to decide the control flow.
Although any performance prediction model can be applied to the above tasks, our model is more flexi-
ble and efficient and can be easily integrated into the decision-making process of parallel compilers.
6.7. Related Work
There are many existing performance prediction tools for parallel architectures. For example, Faust
[GGJMG89] and IPS [MiYa87] allow program behavior to be described at many levels of detail and abstraction,
including the program, process, procedure, and instruction levels. PIE, developed at CMU [SeRu85], uses
a metalanguage to provide support for an efficient manipulation of parallel modules and programming for
observability. These systems are designed to be used as semi-automatic performance evaluation tools for the
user. The Parafrase system [AbKw85] provides a performance prediction module based on a similar program
hierarchy but is designed to be used in the compiler and is inexpensive to compute. The "load/store" modeling
method is used in [GJMW89, BWIALG90] to characterize the performance of shared-memory architectures by
a set of template sequences of vector load, store, and "nop" instructions. There are many other environments
that evaluate the performance of a particular type of architecture or characterize the potential parallelism of
the program on the architecture. However, none of the existing systems can be used to predict program
performance for a wide range of architectures accurately and inexpensively. Also, none of the systems provides
a systematic framework that is flexible enough to be utilized in an intelligent parallel compiler. Our framework
fills this gap by providing a flexible mechanism for the knowledge base system to adjust the prediction model
dynamically to suit different optimization objectives and different architectures. [Leung90] derived several
equations for static compile-time estimation of the overhead costs and execution time for several
shared-memory machine models. However, our work is more general, flexible, and accurate.
6.8. Conclusion
To summarize, our framework for the performance prediction of parallel computers has the following
advantages:
1. Different classes of parallel computers can be handled under the same framework. The performance
prediction model can be adjusted to suit different architectures.
2. The estimation can be tuned to suit different objectives by adjusting the performance combining
procedure and the weights associated with the evaluation functions.
3. Different amounts of resources can be committed at different stages of the compiling process by using
evaluation functions of different complexities.
4. The selection of evaluation functions and weights offers a good opportunity for the compiler to learn to
improve itself. This point will be investigated in [Wang91].
5. The prediction update process is inexpensive: only the effects of the related transformations are
computed. It can be applied repeatedly during the program optimization process.
Intelligent parallel compilers need an accurate, inexpensive, and flexible performance prediction model to
make critical decisions. Our framework is simple and yet flexible enough to be integrated into intelligent
parallel compilers with multiple target machines. It can be used by systematic state-space search algorithms
such as A* and other best-first search algorithms to find optimal program transformation sequences. It can also





The key issue in optimizing a program for a target architecture is to match the program parallelism with
the available machine parallelism. Chapter 4 discussed different approaches for systematic analysis and
inference to match the program and target architecture. In this chapter we discuss methodologies based on
performance estimation, evaluation functions, and heuristics to support this matching process. The heuristic
functions discussed in the next section quantify the match between the program parallelism and the machine
parallelism and form the basis for performance estimation and program optimization.
7.1. Improving Parallelism with Feature-Directed Program Optimization
In this section we discuss the methodologies and heuristics for program optimization based on the
feature-directed program restructuring model.
The program optimization process first decomposes the program into units of execution called tasks. To
parallelize a program or increase the concurrency of a program, the program is restructured for concurrent
execution by converting loops into parallel loops and executing tasks concurrently. The tasks and inter-task
dependence relations form a graph that is called the task graph. To preserve the correctness of the program,
synchronization is needed for concurrent execution. There are two kinds of synchronization: explicit
synchronization, defined by synchronization statements in the control flow (such as barrier synchronization
and conditional statements), and implicit synchronization, defined by the data dependence relation between a
pair of tasks allocated to different processors.
For shared-memory machines, explicit synchronization instructions such as locks, shared semaphores,
protection bits or words, etc. are generated for inter-task data and control dependencies. For do-across loops,
do-across delays [Cytron84] are also inserted to minimize the stress that constant polling of the shared
variable places on the network and memory modules.
For distributed-memory machines that use a message passing paradigm, the inter-task dependence
relations are replaced by explicit message read/write instructions that transmit the data between processors.
The overhead for sending and receiving messages is usually very high when compared to computation (with the
exception of Pringle, in which sending a word costs about the same as a floating point operation). Thus,
minimizing the number of messages is of major concern.
7.1.1. Focus of Program Optimization
Since a program is usually fairly large, it is divided roughly into modules that we call program focuses.
A program focus is a program unit (usually loops or a large block of code) that is the central focus of the
parallelism optimization. Program focuses usually define a sequential thread of execution. However, focuses
can be merged, or the target machine can be decomposed and parts of the machine assigned to different focuses
for concurrent execution.
One major reason for decomposing a program into program focuses is better utilization of the limited
resources. For most programs, the major part of the execution time is spent in a small portion of the program;
therefore, concentrating limited computing resources on the most important program focus is critical to the
success of the compiler. A program focus that is small in size also makes the optimization process easier. The
decomposition of the program can be guided either by profiling or by heuristic evaluation functions. For
programs whose performance is not sensitive to input data, profiling can usually expose the most computation-
or communication-intensive portion of the program. The advantage of using heuristic evaluation functions is
that the results of the evaluation functions can be functions of variables whose values are only available at
run time. This allows the generation of multiple thread controls that can be decided at run time based on the
values of those variables. Also, heuristic evaluation functions such as the operation counts are simple to
compute and often do a fairly good job without the program actually being executed.
Our heuristic for selecting focuses in a program is as follows:
1. Use profiling to find the few subroutines or functions that take the longest execution time.
2. The top-level loops in these subroutines form focuses of their own. Statements between these loops also
form focuses.
3. After the focuses are decided, they are ordered on the basis of heuristic evaluation functions such as
operation counts or the estimated execution time.
4. The program optimization process is applied to each of the focuses, starting from the program focuses
with the highest priorities, until the computer resources are exhausted. The allocation of the computer
resources can be determined by heuristics based on the optimization degree specified by the user or
dynamically determined by the estimated execution time of the focus.
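Steps 3 and 4 of the heuristic above can be sketched as follows. The proportional budget split and the `min_share` cutoff are assumptions made for this illustration; the text leaves the exact allocation rule to heuristics. The focus names in the example are hypothetical.

```python
def schedule_focuses(focuses, budget, min_share=0.05):
    """Order focuses by estimated time and divide an optimization budget.

    focuses -- list of (name, estimated_execution_time)
    budget  -- total compile-time resources available (any unit)
    Focuses whose proportional share falls below min_share of the budget
    are left unoptimized (ASSUMPTION: proportional allocation).
    """
    total = sum(t for _, t in focuses)
    if total <= 0:
        return []
    ordered = sorted(focuses, key=lambda f: f[1], reverse=True)
    plan, remaining = [], budget
    for name, t in ordered:
        share = budget * t / total
        if remaining <= 0 or share < min_share * budget:
            break
        allotted = min(share, remaining)
        plan.append((name, allotted))
        remaining -= allotted
    return plan


focuses = [("init_loop", 2.0), ("solve_loop", 83.0), ("output_loop", 15.0)]
plan = schedule_focuses(focuses, budget=100.0)
print([name for name, _ in plan])
```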
Decomposing a program into a set of focuses has the following advantages:
1. It is easier to optimize since the size of the program focuses is much smaller than the original program.
2. The resources are spent on the most important part of the program.
The disadvantage of this approach is that local optimization of the focuses may produce conflicts among
focuses. To solve this problem, a global optimization phase to resolve conflicts between the focuses can be
applied after all the focuses are optimized. Our approach of ordering the focuses also serves to prevent some
of the problems, since conflicts can be avoided by respecting the decisions made by the optimization on the
most "important focus" before the current focus is selected.
7.2. Heuristic Guided Program Optimization
A hierarchical framework for feature-based program optimization was discussed in section 3.4.2. This
framework classifies the program optimization process into the following five groups: general program
parallelism optimization, task decomposition, processor allocation and assignment, memory utilization, and
synchronization minimization. Based on the hierarchical problem decomposition, each of these five processes
utilizes a different set of program transformation heuristics to explore different aspects of the concurrency.
These five processes cooperate to achieve global optimization. The architecture of the hierarchical blackboard
allows non-determinism to be exploited at different levels of the hierarchy in the form of knowledge sources.
We will discuss some methodologies for program parallelization and optimization based on the hierarchical
structure of the program optimization process. Our methodology for controlling the decision flow is to use
heuristic-driven rules at the upper levels of the hierarchy and use the hierarchical blackboard control to
exploit the concurrency and non-determinism at the lower levels.
7.2.1. General Optimization of the Program Parallelism
The first step in program optimization is to apply machine-independent program restructuring
transformations to expose all the parallelism inherent in the program and thus improve its potential for further
machine-specific optimization. General program optimizations include dependence cycle breaking, locality
improvement, and eliminating redundant instructions. Breaking program dependence cycles or eliminating
dependence relations is always beneficial because a program dependence creates synchronization delays and a
dependence cycle forces the statements in the cycle to be serialized. For most parallel architectures, better
locality means faster data accesses and fewer processing delays. For these reasons, a fixed sequence of
applicable machine-independent program transformations can be applied to the program focus. The following is
a list of program optimization techniques that are applied to improve the parallelism of the program.
• Scalar expansion: breaks output- and anti-dependences associated with the expanded variable.
• Variable renaming: breaks output- and anti-dependences associated with the expanded vector.
• Statement splitting: breaks the dependence cycle by repositioning the anti-dependence arcs.
• Forward substitution: removes flow dependences associated with the expression.
• Statement reordering: moves statements out of dependence cycles, improves locality, etc.
• Array gathering: compresses the array to improve locality.
• Array reshaping: reshapes the array to improve locality (discussed in the next section).
• Loop interchanging: improves locality.
• Code motion: moves loop-invariant code outside of loops.
• Dead-code elimination: eliminates redundant code.
• Loop merging: saves loop overhead.
7.2.2. Task Decomposition and Processor Assignment
Program fragments that are executed as a unit sequentially are called tasks. A task is the basic unit for
processor allocation and scheduling. To decompose the program focus into a series of sequential tasks, the
following simple heuristic is used:
Heuristic 6.1. (Task definition)
1. Instances of the top-level loops form tasks by themselves.
2. Conditional or unconditional branch statements (if and exit statements) form tasks.
3. All other statements form a task of their own. We then apply topological sorting to these statements
based on the program dependence graph, and the strongly connected components in the program
dependence graph form tasks.
The above heuristic decomposes the program into a graph of tasks connected by inter-task dependencies.
Statements in strongly connected components are grouped into a task because dependence relations require
control and data synchronization. This implies that the group of statements has little concurrency and gives
rise to the following heuristic:
Heuristic 6.2. The communication cost resulting from executing a strongly connected component on multiple
processors usually overshadows the benefit from the concurrent execution.
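Item 3 of heuristic 6.1 amounts to finding the strongly connected components of the dependence graph and ordering them topologically. The Python sketch below uses Tarjan's algorithm for this, which is one standard choice and not necessarily the method used in the system described here; the statement labels and the `deps` encoding are hypothetical.

```python
def strongly_connected_tasks(statements, deps):
    """Group statements into tasks, one task per strongly connected component.

    statements -- list of statement ids in program order
    deps       -- dict: statement -> list of statements that depend on it
    Returns the tasks (lists of statements) in a topological order of the
    task graph, so every task appears after the tasks it depends on.
    """
    index, low, on_stack = {}, {}, set()
    stack, tasks = [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in deps.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:               # v is the root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            tasks.append(sorted(component, key=statements.index))

    for s in statements:
        if s not in index:
            visit(s)
    tasks.reverse()      # Tarjan emits components in reverse topological order
    return tasks


# S2 and S3 depend on each other, so they end up in the same task.
statements = ["S1", "S2", "S3", "S4"]
deps = {"S1": ["S2"], "S2": ["S3"], "S3": ["S2", "S4"]}
tasks = strongly_connected_tasks(statements, deps)
print(tasks)
```

The recursive form is adequate for the small dependence graphs of a single program focus; a very large focus would call for an iterative variant.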
Tasks that are not top-level loops can be assigned to processors by a high-level task-spreading algorithm.
For example, there is a task composition algorithm TACOM in [Poly86] that merges tasks on the basis of a
shared-memory model. The algorithm merges two tasks when their concurrent execution causes large
communication costs. Since loops have regular patterns, contain abundant parallelism, and are easier to
parallelize, they are usually handled differently than other tasks. A loop is parallelized when its loop
instances are selected as parallel tasks. The selection of parallel loops is usually based on the estimated
parallel execution time.
The task assignment and memory utilization form a classical chicken-and-egg problem: the optimal task
assignment must take the memory access time into consideration, and the estimation of memory access time is
only possible when the tasks are allocated. We take a two-phase approach to the task creation and allocation
problem. The tasks are first created by using some criteria such as the ones listed in heuristic 6.1, and are
adjusted by an algorithm such as TACOM or a parallel-loop selection algorithm. Then the memory access
optimization and synchronization minimization are performed based on the tentative tasks and processor
assignment. During the program optimization process, the tasks may also be merged to form larger tasks or be
further decomposed into smaller tasks, depending on the requirements of the current goals. The task
composition and processor assignment are finalized after this process. This two-phase approach is necessary
because optimizing memory access and minimizing synchronization need to be based on the task assignments,
but, on the other hand, optimal task assignment needs a good estimation of the execution time of the
restructured program to balance the computation load.
7.2.3. Memory Utilization for Shared-Memory Architectures
In this section we investigate some heuristics and techniques for optimizing the memory utilization on
shared-memory architectures. In [Husm86], Husmann uses the following heuristics for array allocation on
shared-memory machines:
Heuristic 6.3. Assuming the tasks have been created and allocated, the following heuristics are used to
allocate the arrays.
1. An array can be allocated to a local memory only when it is neither declared as a global variable nor
referenced by multiple CPUs.
2. An array can be allocated to cluster memory only when it is neither declared as a global variable nor
referenced by CPUs in multiple clusters.
3. All other arrays are allocated to the global memory.
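The three rules of heuristic 6.3 translate directly into a small decision function. The Python sketch below is an illustration only; the argument names and the `cluster_of` mapping are hypothetical and do not come from [Husm86].

```python
def allocate_array(is_global, referencing_cpus, cluster_of):
    """Choose a memory level for an array following heuristic 6.3.

    is_global        -- True if the array is declared as a global variable
    referencing_cpus -- set of CPU ids whose tasks reference the array
    cluster_of       -- dict: CPU id -> cluster id
    """
    if is_global:
        return "global"
    if len(referencing_cpus) <= 1:
        return "local"                      # rule 1: a single CPU uses it
    clusters = {cluster_of[cpu] for cpu in referencing_cpus}
    if len(clusters) == 1:
        return "cluster"                    # rule 2: all users share a cluster
    return "global"                         # rule 3: everything else


cluster_of = {0: "A", 1: "A", 2: "B", 3: "B"}
print(allocate_array(False, {0}, cluster_of))
print(allocate_array(False, {0, 1}, cluster_of))
print(allocate_array(False, {1, 2}, cluster_of))
```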
This heuristic is simple to implement and it can be used by the task creation and assignment algorithms
to estimate the data reference and program execution time. However, this approach is very conservative and
ignores many possible ways of speeding up the data references. For example, a more aggressive data allocation
strategy involves breaking down the array and allocating portions of the array to different processors.
Another approach is to copy a subset of the array into local memory and use the local copy instead of the
global one. Furthermore, [Husm86] assumes that the costs of all memory operations, data accesses, and
computation are constant, which forms the basis of the estimated execution time used to select the parallel
loop. This assumption (especially that the costs of network accesses and memory references are constant) is
unrealistic and affects the accuracy of the result. Data block-access is used in Husmann's algorithm, but only
under the condition that no multiple PEs write the same elements of a block and no PE writes to an element in
multiple blocks. Also, multiple-level parallel loops and interactions of array references in different blocks
are not considered. Husmann's algorithm can be used to generate the initial data allocation, and other
heuristics can be used to improve the performance. We list some extensions here.
• Copy repeatedly used shared arrays (such as the variable x in the matrix-vector multiply problem) into
local memory to decrease network traffic and increase locality for target machines that have no cache.
• Use the array reshaping transformation discussed in the next section to move a subset of an array to
local memory to increase locality.
• When any part of an array is updated by multiple processors, Husmann's algorithm will not attempt to
block-transfer the array to local memory. We can use array reshaping to separate the array into two parts,
with one part containing elements that are block-transferable and the other part containing elements that
are not block-transferable (elements that are updated by multiple processors or by a single processor but
in different blocks).
7.3. Array Reshaping -- a Mechanism for Optimizing Array Usage
In this section, we introduce a program transformation technique called array reshaping. Array reshaping
modifies the storage pattern of an array (called the shape of the array). The shape of an array is the way
the elements of the array are physically stored and is defined by the declaration of the array. This
transformation is not to be confused with existing transformations such as index shifting, loop skewing, and
linearization, which change the view of an array. A view of an array is an index set of the elements of the
array and is a function defined by the subscripts of an array reference. An array can have many views but
only one shape. When an array reference appears inside loops, the subscripts of the array reference are
functions of the indices of the loops. Program transformations can change the view of an array by changing
array subscripts or loop indices. For example, an n by n tri-diagonal matrix with only three non-zero
diagonals has a square shape of size n by n. By altering the subscripts of the array reference in the loop
nest, the view of the array can be changed so that only the elements on the three diagonals are referenced.
However, the change of view does not affect the physical pattern of the array storage. If array reshaping
(as shown in the example in the next section) is applied to the array, the shape of the array may be changed
into a filled array holding only the non-zero elements of the original array, i.e., the array has a
rectangular shape of size n by 3.
Definition 7.1 The shape of an array a can be characterized by the bounds of the array and can be defined as:
$$ \mathit{shape}(a) = [l_1 .. u_1] \times [l_2 .. u_2] \times \cdots \times [l_L .. u_L], $$
where $l_k$ and $u_k$ are the lower and upper bounds of the k-th subscript of the array.
The size of the array is then defined to be:
$$ \mathit{Size}(a) = \prod_{i=1}^{L} (u_i - l_i + 1). $$
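To make Definition 7.1 concrete, the following Python sketch represents a shape as a list of inclusive (lower, upper) bound pairs and counts its storage slots. The function name `size` and the tuple encoding are assumptions of this sketch, not notation from the report; each inclusive range l..u is taken to hold u - l + 1 slots, which matches the n^2 slots of an array declared [1..n, 1..n].

```python
from math import prod

def size(shape):
    """Number of storage slots of a shape given as inclusive (lower, upper) bounds."""
    return prod(u - l + 1 for (l, u) in shape)

n, w = 8, 2
square_shape = [(1, n), (1, n)]     # a: array [1..n, 1..n]
band_shape = [(1, n), (-w, w)]      # a: array [1..n, -w..w]

print(size(square_shape), size(band_shape))
```

With n = 8 and w = 2 the square shape needs 64 slots and the reshaped band shape needs 40.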
By nature, array reshaping is a global program transformation that can be applied only after global
analysis of the program to determine which shape and storage pattern of the array can yield the best program
performance or minimal program storage. Whether any elements thrown out by the reshaped array are used
anywhere in the original program also needs to be checked. These can be analyzed with the define-use chain
of the program dependence graph and the program performance prediction model that we proposed in chapter
6. Another important usage of array reshaping is to identify the portions of arrays that are used by certain
parts of the program and to make a local copy of these portions. This means that instead of the original array
being replaced, the reshaped array is placed in the local memory of a processor and the code is generated to
move the data from the original array to the new array.
7.3.1. Array Reshaping Functions
Array reshaping represents a group of one-to-one functions that operate on the shapes of arrays. A
reshaping function defines how the storage of an array is to be changed as well as the relation between the
elements of the original and the generated array. There are two kinds of array reshaping that we are
particularly interested in: truncation/extension and linear transformation. Truncation narrows the lower and
upper bounds of the subscripts, removing space from the array. Extension extends the boundaries of an array
beyond its original bounds by enlarging the bounds of its subscripts. Linear transformation applies linear
functions to the subscripts of the array and changes the shape of the array. The difference between these
two types of reshaping functions is that for linear transformation, the same linear function is applied to
both the subscripts of the elements and the bounds of the array; whereas for truncation or extension, only
the bounds of the array are changed (an identity function is applied to the subscripts of the elements). A
truncation or extension is often accompanied by a linear transformation function.
A truncating or extending array-reshaping function (represented as $[l..u]_i$) that changes the bounds of the
i-th subscript of an array a into $[l..u]$, changes the shape of the array a from
$$ [l_1 .. u_1] \times \cdots \times [l_i .. u_i] \times \cdots \times [l_L .. u_L] $$
into
$$ [l_1 .. u_1] \times \cdots \times [l .. u] \times \cdots \times [l_L .. u_L]. $$
A linear reshaping function $(f_1(I), f_2(I), \ldots, f_L(I))$ maps the element $a(i_1, i_2, \ldots, i_L)$ in the array
a into $a_r(f_1(i_1, \ldots, i_L), f_2(i_1, \ldots, i_L), \ldots, f_L(i_1, \ldots, i_L))$, where $a_r$ denotes the new array. And the
shape of the array a is mapped from
$$ [l_1 .. u_1] \times [l_2 .. u_2] \times \cdots \times [l_L .. u_L] $$
into
where $x_i$ denotes a vector whose entries are all zero except the i-th entry, which has the value x. For
convenience, the linear array reshaping function is denoted as:
$$ I \longrightarrow (f_1(I), f_2(I), \ldots, f_L(I)). $$
7.3.2. Variations of Array Reshaping
Although array reshaping can be used to change the shape of an array into many different shapes, the
primary purpose is to reduce the size of the array. Only elements whose images under the reshaping function
fall into the new shape of the array have slots in the new array. Some basic reshaping functions are listed
as follows:
• Projection: (i,j) --> (i) or (i,j) --> (j).
This function can be used when the interval corresponding to the dropped subscript in the shape of the
array is trivial, that is, it has only one constant in the interval. If one subscript of an array a in a for
loop is loop invariant, then the array can be projected into a lower dimensional array inside the loop.
For two-dimensional arrays, the projection (i,j) --> (i) maps the original array into the i-th row and the
projection (i,j) --> (j) maps the original array into the j-th column. The projection can be viewed as a
special case of truncation; when the range of a subscript of the array is truncated into only one integer,
the corresponding dimension of the array can be dropped in the task.
• Transposing: (i,j) --> (j,i).
Two subscripts of the array are exchanged. In this case, an array of shape [1..N]x[1..M] will be changed
into an array of shape [1..M]x[1..N].
• Compaction:
A linear function that compacts the bounds of an array subscript is called a compacting function. For
example, (i) --> (i/2) reduces the size of the array by half.
• Expansion:
A linear function that expands an array into a larger array is called an expansion function. For example,
(i) --> (2*i) doubles the size of the array. Expansion is usually applied to arrays that were
compacted, to map them back to the original arrays.
We denote compaction and expansion as:
$$ a' \leftarrow ((i,j) \longrightarrow (f_1(i,j), f_2(i,j)))(a) $$
and
$$ a \leftarrow ((i,j) \longrightarrow (f_1^{-1}(i,j), f_2^{-1}(i,j)))(a') $$
respectively, where $f_i^{-1}$ denotes the inverse function of $f_i$.
To reduce the data transmission cost in distributed systems, for a compacting reshape the compaction is
usually done before the data is sent; and for an expanding reshape, the expansion is usually done at the
receiving processor.
• Integer linear functions:
An integer linear function is a linear function whose coefficients are all integers. A general form for the
linear functions is $\sum_{k=1}^{n} a_k * i_k + b$, where the $a_k$'s are integer coefficients and the $i_k$'s are
the indices of the array subscripts.
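The basic reshaping functions listed above can be tried out on a small sparse array. The sketch below is only an illustration under stated assumptions: arrays are modeled as Python dictionaries keyed by index tuples, and the helper name `reshape` is hypothetical. It rejects any index map that sends two stored elements to the same slot, reflecting the requirement that a reshaping function be one-to-one.

```python
def reshape(a, f):
    """Apply the index map f to a sparse array a (dict: index tuple -> value)."""
    out = {}
    for idx, val in a.items():
        new_idx = f(*idx)
        if new_idx in out:                      # a reshaping function must be one-to-one
            raise ValueError(f"two elements map to slot {new_idx}")
        out[new_idx] = val
    return out

# a has shape [1..3] x [1..2]
a = {(i, j): 10 * i + j for i in range(1, 4) for j in range(1, 3)}

transposed = reshape(a, lambda i, j: (j, i))            # transposing: (i,j) --> (j,i)
row2 = reshape({k: v for k, v in a.items() if k[0] == 2},
               lambda i, j: (j,))                       # projection once i is fixed to 2
v = {(i,): float(i) for i in range(0, 8, 2)}            # only subscripts 0, 2, 4, 6 are used
compact = reshape(v, lambda i: (i // 2,))               # compaction: (i) --> (i/2)
expanded = reshape(compact, lambda i: (2 * i,))         # expansion: (i) --> (2*i)
```

Here `expanded` reproduces `v`, which mirrors the remark that expansion maps a compacted array back to the original one.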
7.3.3. Opportunities for Applying Array Reshaping
"How can array reshaping be used in program restructuring to improve the parallelism of a user program
on a target architecture?" Array reshaping can be used to copy the data into the local memory of a processor
or to change the storage pattern of an array to reduce space or access time. The latter includes simplifying
array subscript calculation, improving the cache-hit ratio, changing array strides without interchanging the
loops, and reducing unnecessary traffic on the network. We list a few cases here and study them through
examples.
Case 1. Minimizing the array storage.
For example, consider a band matrix a of size n by n with width w declared as:
a: array [1..n, 1..n] of real;
The matrix uses n^2 spaces. By reshaping the band matrix into a rectangular array of size n by 2*w+1, the
storage requirement is decreased to n*(2*w+1). This situation can be recognized by observing that the second
subscript j of the array reference a[i,j] is always bounded by i-w and i+w. By transforming the array based on
the function (i,j) --> (i, j-i), the array a is mapped into a new array a' with the elements being repositioned.
And the declaration of the array becomes:
a: array [1..n, -w..w] of real;
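A minimal sketch of this reshaping, assuming 0-based Python lists (so the second subscript range -w..w is stored at offsets 0..2w) and hypothetical helper names `to_band` and `band_get`:

```python
def to_band(dense, w):
    """Reshape a dense n x n band matrix into n x (2w+1) storage via (i,j) --> (i, j-i)."""
    n = len(dense)
    band = [[0.0] * (2 * w + 1) for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - w), min(n, i + w + 1)):
            band[i][(j - i) + w] = dense[i][j]   # column j-i, shifted by w to be non-negative
    return band

def band_get(band, w, i, j):
    """Read element (i, j) of the original matrix from the reshaped storage."""
    d = j - i
    return band[i][d + w] if -w <= d <= w else 0.0

n, w = 6, 1
dense = [[float(10 * i + j) if abs(i - j) <= w else 0.0 for j in range(n)]
         for i in range(n)]
band = to_band(dense, w)
```

Every element of `dense` can still be read through `band_get`, while the reshaped array holds n*(2*w+1) = 18 slots instead of n^2 = 36.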
Case 2. Changing the physical reference order of the array to improve performance.
For example, transposing a vectorizable array that has a long array stride but happens to be in a pair of
non-interchangeable loops may yield a stride-1 vector. Even if the loops are interchangeable, we may still
have a doubly-nested loop with references to two arrays, of which one has a stride-n reference and
the other a stride-1 reference. Interchanging the loops would merely change one reference into stride-1 and
the other into stride-n.
The reshaping function can also be compounded with other reshaping functions to change both the view
and shape of the array.
For example, for the illustration in case 1, if the array is better transposed (for it to be vectorized or for
other purposes) for other parts of the program, the transposing function (i,j) --> (j,i) can be applied. By
applying (i,j) --> (j,i) to the above example we obtain a combined reshaping function (i,j) --> (j-i, i), and the
array declaration becomes
a: array [-w..w, 1..n] of real;
Figure 7.1. Example of reshaping a band matrix.
Case 3. Minimizing messages in distributed-memory systems.
The data communication between processors of distributed-memory systems can be significantly reduced
when the sub-array that is used is compacted and copied over. If the sub-array is modified, the part of the
array that is modified can be stored back to the original processor by transmitting the results back and
expanding them into the original array.
For example, suppose array a is stored in processor i but used by processor j in the following loop:
for i := 0 .. P-1 do (* parallelized loop *)
  for k := 0 .. M-1 do
    b[i, k] := a[i-1, 2*k];
  endfor
endfor
In a distributed-memory system, the reference to array a will be replaced by a copy of the array a as shown
below:
forall i := 0 .. P-1 do
  local aloc: array [0 .. M-1] of real;
  if (i != 0) aloc[0..M-1] <- a[i-1, 0..M-1];
  for k := 0 .. M-1 do




Then array a has to be sent from processor i to processor j. By reshaping a', the copy of the array a in
processor i, with the reshaping function (i) --> (i/2) on the local array aloc, the program becomes:
forall i := 0 .. P-1 do
  local aloc: array [0 .. M/2] of real;
  if (i != 0) aloc[0..M/2] <- (a[i-1, 2*j], j = 0..M/2);
  for k := 0 .. M-1 do
    b[i, k] := aloc[k];
  endfor
endforall
As can be seen in the example, the conversion is done in the sender, the size of the array that is sent
through the network is halved, and the array subscript computation is also simplified.†
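The effect of compacting on the sender side can be sketched in a few lines of Python. This is an illustration under assumptions that are not in the report: the row owned by the sending processor is taken to have 2*M elements so that every reference a[i-1, 2*k] with k in 0..M-1 is in range, the message is modeled as a plain list, and `pack_for_send` is a hypothetical name.

```python
def pack_for_send(row, stride=2):
    """Sender side: compact the row with (j) --> (j/stride), keeping only the used elements."""
    return row[::stride]

M = 4
a_row = [float(j) for j in range(2 * M)]    # row a[i-1, 0..2M-1] on the owning processor
aloc = pack_for_send(a_row)                 # message payload: M elements instead of 2*M
b_row = [aloc[k] for k in range(M)]         # receiver: b[i,k] := aloc[k]
```

The receiver indexes the local copy with k directly, so the subscript 2*k disappears from the loop body and the payload is half the size of the full row.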
Note that this transformation is based on the specific shape and usage of the array; numerical analysts
have been doing this by hand for decades. However, this hand optimization usually obscures the clarity of the
algorithm and sometimes it also creates difficulties for the compiler in optimizing the program. Having the
compiler perform these kinds of optimizations automatically not only eases the burden on the programmers but
also makes the optimization easier for the compilers. This point is made even clearer in the following
example:
Case 4. Avoiding obscured algorithms due to optimization.
Excessive program optimization often leads to obscured algorithms. For example, consider factorizing a
band matrix using Gaussian elimination. The following program is a direct coding of the Gaussian elimination
with the additional knowledge that array a is a band matrix.
a: array [1..n, 1..n] of real;
for i in 1 .. n-w do
  for j in i+1 .. i+w do
    a[j,i] := - a[j,i] / a[i,i];
    for k := i+1 to i+w do




For large n, most "space" in array a is unused and wasted. In order to save space, a "competent"
programmer would implement the above algorithm as follows:
a: array [1..n, -w..w] of real;
for i in 1 .. n-w do
  for j in 1 .. w do
    a[j+i,-j] := - a[j+i,-j] / a[i,0];
    for k := 1 to w do




Unfortunately, this program is so obscure that most people have to spend quite a bit of time to
comprehend the meaning of the subscripts and the program. By examining the above example more carefully,
we find that the second program can be obtained by shifting the loop indices of loops j and k by i and then
applying the reshaping function (m,n) --> (m, n-m) to the first program. This optimization can be obtained by
a simple heuristic encoded as rules in the rule base. By letting the compiler optimize the usage of storage,
the program can be specified as close to the original algorithm as possible.
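The equivalence of the two codings can be checked numerically. The sketch below is not the report's program: the innermost update `a[j][k] += a[j][i] * a[i][k]` is an assumed standard elimination step, the test matrix and all function names are hypothetical, and lists are padded so that the 1-based subscripts of the text can be used directly. The band version stores a[m, m+d] at `ab[m][d + w]`, i.e. it applies (m,n) --> (m, n-m) with an offset of w.

```python
def make_band(n, w):
    # 1-based (n+1) x (n+1) list; row 0 and column 0 are unused padding
    return [[0.0] * (n + 1)] + [
        [0.0] + [(4.0 if r == c else 1.0 / (1 + abs(r - c))) if abs(r - c) <= w else 0.0
                 for c in range(1, n + 1)]
        for r in range(1, n + 1)]

def pack(a, n, w):
    # ab[m][d + w] holds a[m][m + d] for -w <= d <= w
    return [[0.0] * (2 * w + 1)] + [
        [a[m][m + d] if 1 <= m + d <= n else 0.0 for d in range(-w, w + 1)]
        for m in range(1, n + 1)]

def eliminate_dense(a, n, w):
    for i in range(1, n - w + 1):
        for j in range(i + 1, i + w + 1):
            a[j][i] = -a[j][i] / a[i][i]
            for k in range(i + 1, i + w + 1):
                a[j][k] += a[j][i] * a[i][k]          # assumed update step

def eliminate_band(ab, n, w):
    for i in range(1, n - w + 1):
        for j in range(1, w + 1):                     # loop index j shifted by i
            ab[j + i][-j + w] = -ab[j + i][-j + w] / ab[i][w]
            for k in range(1, w + 1):                 # loop index k shifted by i
                ab[j + i][k - j + w] += ab[j + i][-j + w] * ab[i][k + w]

n, w = 6, 2
a = make_band(n, w)
ab = pack(a, n, w)
eliminate_dense(a, n, w)
eliminate_band(ab, n, w)
```

Both routines perform the same arithmetic in the same order, so the entries inside the band agree after elimination; only the storage layout and the subscripts differ.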
† The same effect may be obtained by using other transformations, but array reshaping is a more powerful and more
general transformation.
Case 5. Array Copying for the Functional Semantics of Forall Loops.
For languages whose parallel loops have functional semantics, copy-arrays may need to be created in
each loop instance that uses or updates the array to preserve the copy-in and copy-out semantics [Wang86].
Array reshaping can be used to reshape the local copy-array of the original array for each loop instance. Since
memory accesses inside for loops are usually very regular, array reshaping can reduce the size
of the copy-array to the minimum. This guarantees that only the necessary data is copied into the parallel
loops. On distributed systems, only the remote array elements that are used by the local processors need to be
copied. This minimizes the cost of copying arrays in the implementation of functional FORALL loops and
makes it practical.
For example, consider the Jacobi iteration shown in the next program fragment:
var
  A, New_A: array [0..N, 0..N] of real;
  pid: integer;
forall i in 1 .. N-1, j in 1 .. N-1 do
  New_A[i,j] := 0.25 * (A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1]);
endforall;
Note that this forall loop is usually surrounded by an outer loop that iterates until the solution converges.
At the end of the iteration there is code to copy the contents of the array New_A into A (or an
optimization-minded user might interchange the array New_A with array A to avoid the copying). For shared memory
architectures, the arrays A and New_A can be in global memory, but it is beneficial to create a local copy of the
array. The local copies of the array, Old_A, are block-transferred to the processors before the first iteration and
the results are copied back after the last iteration. This model is similar to the distributed-memory model. To
execute this program on a distributed-memory machine, one simple decomposition is to partition the arrays
into P^2 blocks of size M x M, where M = (N + 2) / P. This yields the following program:
forall k in 0 .. P-1, l in 0 .. P-1 do
  const
    low1 = k*M; low2 = l*M;
    high1 = min(low1+M-1, N+1); high2 = min(low2+M-1, N+1);
  var
    Old_A, New_A: array [low1-1..high1+1, low2-1..high2+1] of real;
  DATAMOVEMENT();
  for i in 1 .. M, j in 1 .. M do




In the above program, the array Old_A is the reshaped array of the original array A that the iteration uses for
each process. There is an overlapping area of the array with processors that compute the adjacent blocks. If
we only look at the forall loop, the array New_A should be declared as follows:
New_A: array [low1..high1, low2..high2] of real;
However, since the array Old_A is copied into (or switched with) array New_A after the forall loop, the
boundaries of the array New_A are extended (array reshaping replaced the above declaration by the extended array).
Based on the dependence analysis of the original program, the statement DATAMOVEMENT() represents the
code to move data into the local processor at the first iteration of the outer loop of the forall loop, and to move
data from adjacent processors and copy inside the processor in the subsequent iterations. This simple
example actually represents a very sophisticated process of program restructuring.
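The role of the extended local copy can be illustrated with a deliberately simplified one-dimensional analogue of the Jacobi sweep. Everything in this sketch is an assumption made for illustration: the 1-D stencil, the ceiling-based block size, and the names `jacobi_global` and `jacobi_blocked`. Each block copies only the elements it owns plus one overlapping element on each side, which corresponds to the extended bounds of Old_A.

```python
def jacobi_global(A):
    """One sweep over the whole array: new[i] = 0.5 * (A[i-1] + A[i+1]) for interior i."""
    N = len(A) - 1
    new = A[:]
    for i in range(1, N):
        new[i] = 0.5 * (A[i - 1] + A[i + 1])
    return new

def jacobi_blocked(A, P):
    """Same sweep, computed block by block from reshaped local copies."""
    N = len(A) - 1
    new = A[:]
    M = -(-(N - 1) // P)                      # ceil((N-1)/P) interior points per block
    for k in range(P):                        # each k stands for one processor
        lo, hi = 1 + k * M, min((k + 1) * M, N - 1)
        if lo > hi:
            continue                          # no interior points left for this block
        old_a = A[lo - 1:hi + 2]              # local copy: owned range plus one cell per side
        for i in range(lo, hi + 1):
            new[i] = 0.5 * (old_a[i - lo] + old_a[i - lo + 2])
    return new

A = [float(i * i) for i in range(11)]
```

Calling `jacobi_blocked(A, 3)` gives the same result as `jacobi_global(A)`, although no block ever reads more than its own range and the two overlapping boundary elements.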
7.3.4. Heuristics for Applying Array Reshaping
How does a compiler recognize opportunities for applying array reshaping? When is array reshaping
beneficial? Since array reshaping changes the declaration of the arrays, it is only applicable when all the
expressions involved are resolvable at compile time. In the last section, we listed some opportunities for
applying array reshaping. Here we list some simple heuristics that a parallel compiler can use to decide
when to apply array reshaping.
1. If a subscript of an array is a constant for all references of that array inside a task, then the projection
can be applied to the array to map the array to a new array with the subscript dropped. Inside loops, the
precondition means that the said subscript of all references to the array is loop invariant.
2. If the range of one subscript in all references of the array is a subset of the bounds of that subscript, then
the shape of the array can be shrunk by truncating the bounds of that subscript to the actual
range.
3. If a subscript of the array is a multiple of a loop index by an integer, such as a*i, then the array can be
compacted by the array-reshaping function i/a in that subscript. This heuristic was used in the last
example.
4. If the expression of a subscript appears in another subscript of the array, this expression can be
eliminated by array reshaping. For example, for the program in the example in case 4 in the last section, the
index of loop i appears in the loop bounds of loops j and k. Consequently, array references in the loop
that use indices j and k are determined by the value of i implicitly. To see the relations more
explicitly, we apply the index shifting transformation to the original band matrix factorization program shown
in the above example and obtain:
for i in 1 .. n-w do
  for j in 1 .. w do
    a[j+i,i] := - a[j+i,i] / a[i,i];
    for k := 1 to w do




From the above transformed program it is clear that the index i appears in every subscript of every
reference. For each loop instance of loop i, i is a constant for the loop instance. Therefore, the shape of the
array a in the loop instance is:
[i+1..i+w] x [i] U [i] x [i] U [i+1..i+w] x [i+1..i+w] U [i] x [i+1..i+w]
= [i+1..i+w] x [i+1..i+w].
The shape of all references for a in the loop is [i+1..i+w] x [i+1..i+w], where 1 <= i <= n. This implies
that the array references of a in the loop all fall into a band of width w. Our heuristic says that by
applying the function (i,j) --> (i, j-i) to the subscripts of the array reference, the shape becomes
[i+1..i+w] x [-w..w]. And for loop i, the shape of the array becomes [1..n] x [-w..w], achieving a
reduction of (n-2*w-1)*n in space. The second subscript in the new array represents the distance of the
element to the diagonal in the original array.
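One way a compiler could test the band condition of heuristic 4 is to compute the range of the difference between the two subscripts over all references. The sketch below enumerates the references by brute force rather than symbolically, which is an assumption made to keep the example short; the four reference patterns follow the shape expression given above, and the names `band_refs` and `diagonal_range` are hypothetical.

```python
def band_refs(n, w):
    """Subscript pairs touched by the index-shifted factorization loop nest."""
    refs = set()
    for i in range(1, n - w + 1):
        for j in range(1, w + 1):
            refs |= {(j + i, i), (i, i)}
            for k in range(1, w + 1):
                refs |= {(j + i, k + i), (i, k + i)}
    return refs

def diagonal_range(refs):
    """Smallest interval [lo, hi] containing j - i over all referenced elements (i, j)."""
    diffs = [j - i for (i, j) in refs]
    return min(diffs), max(diffs)

n, w = 10, 2
lo, hi = diagonal_range(band_refs(n, w))
saved = n * n - n * (hi - lo + 1)      # slots saved by reshaping to [1..n] x [lo..hi]
```

For n = 10 and w = 2 the differences lie in -2..2, so the second subscript can be reshaped to -w..w, and the saving of 50 slots equals (n-2*w-1)*n.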
7.3.5. Some Remarks for Array Reshaping
The methodologies that we described in this paper can serve as a starting point for studying array
reshaping for program optimization. Array reshaping is a powerful but complicated program transformation
technique. It is particularly useful for minimizing data communication cost for architectures that have non-trivial
data transmission costs and programs that only utilize parts of the arrays. It can also be used to minimize data
storage for machines that have limited memory available (such as the hypercube computers). Since array
reshaping has significant effects on data references and communication costs, its usage should be carefully
planned to avoid disastrous counter effects. The effects of array reshaping on the performance of the program
depend on the architectural features of the target machine and can be estimated by the performance prediction
model. This warrants a more thorough study of the potential benefit of array reshaping in optimizing
programs.




Automatic program generation for distributed memory parallel computers is a very difficult problem and
has been largely ignored until recently. Nevertheless, the difficulties in programming distributed memory
parallel computers make this problem ever more important as distributed memory architectures such as the
nCUBE 2 [NCUBE90], iPSC/2, iPSC/860 [Intel90], and Intel Touchstone become more powerful and
popular. The major issue in programming distributed memory parallel computers lies in the distribution and
communication of the data. One approach being followed by several groups is to support a global shared name
space at the program level and to automatically generate the communications required for non-local references
[CaKe88], [Koel90], [MeVR89a], [MeVR89b], [RoScWe89], [RoPi90], [RuAn90], [ZiBaGe88]. Although
such a distributed shared memory approach provides users with a global memory space and allows them to
program distributed memory systems in a style close to shared memory computers, the fundamental problem of
reducing the communication and synchronization overhead remains to be solved by the underlying compilers.
The techniques we describe in this paper can be used to minimize the communication and synchronization
overheads in compilers that support program parallelization or distributed shared memory models.
For distributed memory architectures that use the message-passing paradigm, non-local data references
need to be converted into explicit send/receive instructions. The simplest approach is to generate a pair of
send/receive statements for each data dependence and utilize a set of control libraries that use message passing
for control dependencies. The problem is that message passing is an expensive operation, and the communication
overhead along with the serialization effects of the send and receive operations might destroy any benefits
of parallel execution.
One possible way of reducing the communication cost is to consolidate the messages into longer
messages. This technique has been practiced by many parallel programmers in programming distributed parallel
computers for a long time, but it has come into use in parallel compilers for the automatic program
optimization of distributed parallel computers only recently. Several distributed memory compiler research
groups are currently studying this optimization in the context of
loops [CaKe88], [RoPi90], [Gerndt90]. The basic approach followed by these groups is to first spread the
iterations of a loop across the processors. Any non-local reference within the iterations is then preceded by a
send/receive pair so as to communicate the appropriate data element. Where possible, such communication
statements are extracted out of the loop and then "vectorized" into a single communication statement. Thus,
instead of each iteration generating its own message, a single message is utilized between any pair of
processors to exchange the required non-local data. This optimization of messages in loops has been incorporated
in some of the above compiler efforts; however, the theoretical foundations and technical details of this
approach for general code have not yet been fully investigated.
Consolidating messages has two apparent effects: decreasing message passing overheads and increasing
data synchronization delays. Whether two messages can be profitably consolidated depends on the tradeoff
between the above two factors and the data dependence. Careless message-merging may slow down the
program, generate incorrect results or cause communication deadlocks. In this paper we examine the tradeoff of
consolidating messages and present an algorithm for deciding the optimal clustering of the messages.
7.4.2. Foundation of Message Consolidation
In the following discussion, we presume that a compiler has already generated the tasks to be executed
asynchronously on a distributed memory machine. In such a situation, each cross-task data dependence gives
rise to a data synchronization point so as to ensure the correctness of the concurrent execution. Enforcing
these data synchronization points has a high overhead on distributed memory systems because data
communication is a very expensive operation on these machines. For example, sending a 4-byte number to a
neighbor on an nCUBE 2 processor costs about 160 microseconds, as compared to a floating point multiply
which costs only 0.35 microsecond. In order to minimize the overhead of data and control synchronization,
the compiler can attempt to merge multiple messages into single longer messages and overlap communication
with computation. In other words, a single data synchronization point for multiple data dependencies between
two tasks is preferred. Unfortunately, merging data synchronization points means delaying the startup time of
the data transfer, and this decreases the overlapping of the data transmission of the message with the
computation at the receiving processor and thus increases the data synchronization cost. It is therefore necessary to
derive an algorithm that decides when and how data synchronization points can be merged beneficially.
Before we discuss the algorithm for performing such an optimization, it is necessary to examine some
theoretical foundations for the approach. In the following discussion, we assume that the architecture supports
read, write and test operations. The write statement sends the message, the read statement receives the
message, and the test operation checks if the designated message has arrived. We further assume that the write is
non-blocking and the read is blocking, that is, the sending processor can proceed with other computation after
the message is transferred to the underlying network transport hardware, but the receiving processor will have
to wait in the receive statement until the message has arrived.
In the following discussion, we assume that $T_1$ and $T_2$ are two tasks and $\delta$ is a flow or output
dependence from statement $S_1$ in $T_1$ to $S_2$ in $T_2$. Let $t(S_1)$ be the execution time of the statements between the first
statement in $T_1$ and $S_1$, including that of $S_1$, and $t'(S_2)$ be the execution time of the statements between the
first statement in $T_2$ and $S_2$, excluding that of $S_2$. Let M be the size of the data that causes the data
dependence, and trans(M) be the cost of transmitting data of size M.
Lemma 7.1. The data synchronization delay in the receiving processor caused by the data dependence $\delta$ is
given by:
$$ \mathit{delay} = \max\{0,\ t(S_1) + \mathit{trans}(M) - t'(S_2)\} \qquad (7.1) $$
Proof:
The data sent by task $T_1$ will arrive at task $T_2$ at time $t(S_1) + \mathit{trans}(M)$. As shown in figure 7.2,
if $t(S_1) + \mathit{trans}(M) \le t'(S_2)$ then the data arrives before the statement $S_2$ is reached, so there
is no synchronization delay. On the other hand, if $t(S_1) + \mathit{trans}(M) > t'(S_2)$ then the processor that
runs task $T_2$ will have to be idle until the data arrives, so the idle time is $t(S_1) + \mathit{trans}(M) - t'(S_2)$.
Combining the two cases, the synchronization delay caused by the dependence $\delta$ is then
$$ \mathit{delay} = \max\{0,\ t(S_1) + \mathit{trans}(M) - t'(S_2)\} $$
QED.
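Equation (7.1) translates directly into a one-line function. The sketch below uses the 160 microsecond transmission cost quoted earlier for a 4-byte message on the nCUBE 2; the two task timings are made-up values chosen to hit both cases of the proof, and the function name `sync_delay` is an assumption of this sketch.

```python
def sync_delay(t_s1, trans_m, t_s2):
    """Idle time at the receiver before S2: max{0, t(S1) + trans(M) - t'(S2)}."""
    return max(0.0, t_s1 + trans_m - t_s2)

TRANS = 160.0                                 # microseconds, 4-byte message to a neighbor

early = sync_delay(100.0, TRANS, 400.0)       # data arrives at 260, S2 is reached at 400
late = sync_delay(100.0, TRANS, 200.0)        # data arrives at 260, S2 is reached at 200
```

In the first call the data is already there when S2 is reached, so the delay is zero; in the second the receiver idles for 60 microseconds.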
Conventional data dependence graphs [Wolfe89] have no provision for representing the concept of
merging multiple data dependencies in a single synchronization point. Thus, we introduce here a new dependence
relation called the data dependence cluster.
Definition 7.2. A data dependence cluster, $\Psi$, is a quadruple $(\Omega, \Delta, S_1, S_2)$ where $\Omega$ is a set of data
dependencies from task $T_1$ to task $T_2$, $\Delta$ is the union of the data involved in the dependencies in $\Omega$, $S_1$ is the last
statement in $T_1$ that must be executed before the data can be sent to $T_2$, and $S_2$ is the first statement in $T_2$ where
the data involved in the dependencies in $\Omega$ must arrive or the execution of $T_2$ will be blocked.
The data dependence cluster is a generalization of a single data dependence. A data dependence $\delta$
involving the data d from statement $S_1$ to statement $S_2$ defines a data dependence cluster: $(\{\delta\}, \{d\}, S_1, S_2)$.
And the effect of this data dependence cluster on the performance of the program is the same as that of the
single data dependence.
Corollary 7.1.1 The data synchronization delay for a data dependence cluster that contains one data
dependence is the same as the data synchronization delay caused by the dependence.
Many operations can be defined on the data dependence cluster, but here we will discuss only the union
(called merge below) operation. We define the union of two data dependence clusters $(\Omega^a, \Delta^a, S_1^a, S_2^a)$ and
$(\Omega^b, \Delta^b, S_1^b, S_2^b)$ to be $(\Omega^a \cup \Omega^b, \Delta^a \cup \Delta^b, S_1', S_2')$, where $S_1'$ is $S_1^a$ or $S_1^b$, depending on which statement
occurs later lexicographically, and $S_2'$ is $S_2^a$ or $S_2^b$, depending on which statement occurs first lexicographically.
Definition 7.3. Two data dependence clusters $\Psi^a = (\Omega^a, \Delta^a, S_1^a, S_2^a)$ and $\Psi^b = (\Omega^b, \Delta^b, S_1^b, S_2^b)$ are called
mergeable if there is no dependence $\delta$ from statement $S_j$ to $S_k$ such that $S_j$ is between $S_2^a$ and $S_2^b$ in $T_2$ and $S_k$
is between statements $S_1^a$ and $S_1^b$ in $T_1$.
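The cluster, its union, and the mergeability test can be written down as a small data structure. The sketch below rests on assumptions that are not in the report: statements are identified by their integer position within their task, "between" is read as inclusive of both end points, and the names `Cluster`, `merge` and `mergeable` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cluster:
    deps: frozenset      # Omega: labels of the data dependencies from T1 to T2
    data: frozenset      # Delta: data items that have to be transmitted
    s1: int              # position in T1 of the last statement before the send
    s2: int              # position in T2 of the first statement that needs the data

def merge(a, b):
    """Union of two clusters: send after the later S1, receive before the earlier S2."""
    return Cluster(a.deps | b.deps, a.data | b.data, max(a.s1, b.s1), min(a.s2, b.s2))

def mergeable(a, b, reverse_deps):
    """reverse_deps: pairs (s_j, s_k) for a dependence from position s_j in T2 to s_k in T1."""
    lo1, hi1 = sorted((a.s1, b.s1))
    lo2, hi2 = sorted((a.s2, b.s2))
    return not any(lo2 <= s_j <= hi2 and lo1 <= s_k <= hi1 for s_j, s_k in reverse_deps)

a = Cluster(frozenset({"d1"}), frozenset({"x"}), 3, 5)
b = Cluster(frozenset({"d2"}), frozenset({"y"}), 7, 9)
merged = merge(a, b)
```

The merged cluster sends both x and y after statement 7 of T1 and receives them before statement 5 of T2; a dependence from statement 6 of T2 back to statement 4 of T1 would make the two clusters non-mergeable.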
Trying to merge two data dependence clusters that are not mergeable would violate the data dependence
by moving the source of a data dependence beyond a statement that depends on it. Furthermore, this would
cause the two tasks to deadlock since task $T_1$ would have to wait for data from $T_2$ before it could process the










Figure 7.2. Synchronization delay caused by inter-task dependency. Case a: delay = 0. Case b: delay = t(S1) + trans(M) - t'(S2). In general, delay = max{0, t(S1) + trans(M) - t'(S2)}.
The data dependence cluster can be used to guide the code generation for distributed memory
architectures. For example, a write statement is generated after statement $S_1$ which sends the data in $\Delta$, and a read
statement is generated in front of the statement $S_2$ to receive the data. Note that the data dependence clusters
have mutually exclusive data dependence sets and form a partition of the data dependencies. A cluster
Cluster1 is defined to be '<' Cluster2 if all statements in $T_1$ that are involved in the dependencies in Cluster1
are before those in Cluster2. This defines a partial order of the clusters. Two clusters are said to be adjacent
to each other if there are no clusters between them.
Lemma 7.2. Let $\delta^1$ and $\delta^2$ be two data dependencies from task $T_1$ to $T_2$, where $\delta^1$ is from statement $S_1$ in
$T_1$ to $S_2$ in $T_2$ and $\delta^2$ is from statement $S_3$ in $T_1$ to $S_4$ in $T_2$. Let $t(S_i)$ be the execution time of the
statements between the first statement in $T_1$ and $S_i$, including that of $S_i$, and $t'(S_j)$ be the execution time of the
statements between the first statement in $T_2$ and $S_j$, excluding that of $S_j$. Let $M^i$ be the size of the data that causes
the data dependence $\delta^i$, and $\mathit{trans}(M^i)$ be the cost of transmitting the data of size $M^i$. Then, the delay in $T_2$
caused by the two data dependencies is the maximum of the two delays. In other words, the delay for task $T_2$
is:
$$ \mathit{delay} = \max\{0,\ t(S_1) + \mathit{trans}(M^1) - t'(S_2),\ t(S_3) + \mathit{trans}(M^2) - t'(S_4)\} = \max\{\mathit{delay}^1, \mathit{delay}^2\} $$
Proof:
By lemma 7.1, we know that the delay caused by dependence $\delta^1$ is
$$ \mathit{delay}^1 = \max\{0,\ t(S_1) + \mathit{trans}(M^1) - t'(S_2)\} \qquad (7.2) $$
and similarly $\mathit{delay}^2 = \max\{0,\ t(S_3) + \mathit{trans}(M^2) - t'(S_4)\}$ for $\delta^2$.
The problem may be divided into two cases based on the order of statements $S_2$ and $S_4$.
Case 1: $t'(S_2) \le t'(S_4)$ (as shown in figure 7.2):
Due to the dependence $\delta^1$, a delay of $\mathit{delay}^1$ has to be inserted before the statement $S_2$; as a result, every
statement in task $T_2$ after $S_2$ is delayed by this amount of time. So the statement $S_4$ will be reached at
time $t'(S_4) + \mathit{delay}^1$, and by lemma 7.1, the delay caused by the dependence $\delta^2$ becomes
$$ \begin{aligned}
\mathit{delay}^{2'} &= \max\{0,\ t(S_3) + \mathit{trans}(M^2) - (t'(S_4) + \mathit{delay}^1)\} \\
&= \max\{0,\ (t(S_3) + \mathit{trans}(M^2) - t'(S_4)) - \mathit{delay}^1\} \\
&= \max\{0,\ \mathit{delay}^2 - \mathit{delay}^1\}
\end{aligned} $$
Therefore, the delay for both dependencies is given by
$$ \mathit{delay} = \mathit{delay}^1 + \mathit{delay}^{2'} = \mathit{delay}^1 + \max\{0,\ \mathit{delay}^2 - \mathit{delay}^1\} = \max\{\mathit{delay}^1, \mathit{delay}^2\} $$
Case 2: $t'(S_2) > t'(S_4)$ (as shown in figure 7.3):
Since the statement $S_4$ is in front of the statement $S_2$, $\mathit{delay}^2$ has to be inserted before statement $S_4$,
delaying the time statement $S_2$ gets executed. So the new delay for statement $S_2$ is:
$$ \begin{aligned}
\mathit{delay}^{1'} &= \max\{0,\ t(S_1) + \mathit{trans}(M^1) - (t'(S_2) + \mathit{delay}^2)\} \\
&= \max\{0,\ (t(S_1) + \mathit{trans}(M^1) - t'(S_2)) - \mathit{delay}^2\} \\
&= \max\{0,\ \mathit{delay}^1 - \mathit{delay}^2\}
\end{aligned} $$
This implies that the synchronization delay caused by dependencies $\delta^1$ and $\delta^2$ is
$$ \mathit{delay} = \mathit{delay}^2 + \mathit{delay}^{1'} = \mathit{delay}^2 + \max\{0,\ \mathit{delay}^1 - \mathit{delay}^2\} = \max\{\mathit{delay}^1, \mathit{delay}^2\} $$
This concludes the proof.
QED.
Note that case b in figure 7.3 can occur on architectures that support alternative routing when the
message size $M^1$ is very large. For other machines that use fixed routing between two processors, the first
message will block the second message, so this case is not possible. The lemma still holds because the
transmission time for sending the second message will be much longer, thus causing a longer delay.
A direct generalization of the above lemma leads to the following lemma.
Lemma 7.3. When there is more than one data dependence between two tasks, the synchronization delay in
task $T_2$ is determined by the dependence that causes the longest delay. That is, if there are $n$ data
dependencies from task $T_1$ to $T_2$, where the $j$-th dependence $\delta^j$ is from statement $S_1^j$ to $S_2^j$ and the
$j$-th delay, $delay^j$, is caused by $\delta^j$, then the delay for task $T_2$ caused by the data dependencies from
task $T_1$ is:
$$delay = \max_{1 \le j \le n} \{\, delay^j \,\} \qquad (7.3)$$
The lemma can be proved by induction. The base case is when there are two data dependencies and is
proven in lemma 7.2. Assuming that the lemma is true for any $m$ data dependencies where $m < n$, we now
proceed to prove the case of $n$ data dependencies. We also assume that the dependencies are ordered by
the order of the statements in task $T_2$. For the first $n-1$ data dependencies between the two tasks, assume that
$delay^k$ is the largest of the delays caused by the dependencies; then by the induction assumption, the combined
delay for the first $n-1$ dependencies is $delay^k$. Now consider the dependence $\delta^n$: the statement $S_2^n$ is
delayed by $delay^k$ because of the previous $n-1$ dependencies. The delay for the $n$-th dependence is then:
$$delay^{n'} = \max\{\, 0,\ t(S_1^n) + trans(M^n) - (t'(S_2^n) + delay^k) \,\} = \max\{\, 0,\ delay^n - delay^k \,\}$$
So the overall delay of all $n$ data dependencies is
$$delay = delay^k + \max\{\, 0,\ delay^n - delay^k \,\} = \max\{\, delay^k, delay^n \,\}$$
$$= \max\{\, \max_{1 \le j \le n-1}\{ delay^j \},\ delay^n \,\} = \max_{1 \le j \le n}\{\, delay^j \,\}$$
This completes the proof.
QED.
[Figure 7.3. The synchronization delay caused by two data dependencies between tasks $T_1$ and $T_2$ (statement
$S_2$ is before $S_4$). Legend: $t_1 = t(S_1)$, $t_2 = t'(S_2)$, $t_3 = t(S_3)$, $t_4 = t'(S_4)$; separate line styles mark data
transmission and data dependence. Case b: $t_4 > t_2$ and $delay^1 < delay^2$ imply
$delay = delay^1 + \max\{0,\ t_3 + trans(M^2) - t_4 - delay^1\} = delay^1 + \max\{0,\ delay^2 - delay^1\} = delay^2 = \max\{delay^1, delay^2\}$.]
[Figure 7.4. The synchronization delay caused by the two data dependencies between tasks $T_1$ and $T_2$
(statement $S_4$ is before $S_2$). Same legend. Case a: $t_4 < t_2$ and $delay^2 > delay^1$ imply
$delay = delay^2 = \max\{delay^1, delay^2\}$. Case b: $t_4 < t_2$ and $delay^2 < delay^1$ imply
$delay = delay^1 = \max\{delay^1, delay^2\}$.]
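The claim of lemmas 7.2 and 7.3 can be checked numerically. The sketch below is only an illustration under assumptions that are not in the text: a linear `trans` with made-up constants, and dependencies given as hypothetical `(t_send, t_recv, size)` triples standing for $t(S_1^j)$, $t'(S_2^j)$ and $M^j$. It walks task $T_2$ in statement order, accumulates the waits, and compares the total with the closed form.

```python
def trans(size, alpha=5.0, beta=0.5):
    """Assumed linear transmission cost: startup plus per-unit cost."""
    return alpha + beta * size


def simulated_delay(deps):
    """Accumulate the waits of the receiving task, statement by statement.

    deps: list of (t_send, t_recv, size) triples (hypothetical encoding).
    """
    total = 0.0
    for t_send, t_recv, size in sorted(deps, key=lambda d: d[1]):
        arrival = t_send + trans(size)
        reached = t_recv + total      # earlier waits push this statement back
        total += max(0.0, arrival - reached)
    return total


def lemma_delay(deps):
    """Closed form of lemma 7.3: the largest individual delay."""
    return max(max(0.0, s + trans(m) - r) for s, r, m in deps)


deps = [(10, 4, 8), (20, 30, 4), (25, 12, 16)]
print(simulated_delay(deps), lemma_delay(deps))
```

Both functions agree on this input, which is what the induction argument predicts; the simulation does not model message blocking on fixed-routing networks.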
The following formula determines the new delay in task $T_2$ when the first message is merged into the
second. The two cases that can occur are depicted in figure 7.4.
Lemma 7.4. If the two messages as defined in lemma 7.2 are merged into one, then the synchronization delay
for task $T_2$ becomes:
$$delay = t(S_3) + trans(M^1 + M^2) - \min(\, t'(S_2),\ t'(S_4) \,) \qquad (7.4)$$
Proof:
When the first message is merged into the second message, the data dependence from statement $S_1$ to
statement $S_2$ is changed into a dependence from statement $S_3$ to statement $S_2$, and the size of the data to be
sent from task $T_1$ to task $T_2$ is increased to $M^1 + M^2$. At time $t(S_3) + trans(M^1 + M^2)$ the message will be
available on task $T_2$, so the delay is this time minus the time at which the first statement involved in the data
dependencies is reached, which is $\min(t'(S_2), t'(S_4))$. This proves the formula.
QED.
[Figure 7.5. The synchronization delay for merging two messages into one. Legend: $t_1 = t(S_1)$, $t_2 = t'(S_2)$,
$t_3 = t(S_3)$, $t_4 = t'(S_4)$. Case a: $t_4 > t_2$ implies $delay = t_3 + trans(M^1 + M^2) - t_2$. Case b: $t_4 < t_2$
implies $delay = t_3 + trans(M^1 + M^2) - t_4$.]
The drawback of merging two messages is that the corresponding statements in task $T_2$ have to wait for
the message to arrive, which might increase the data synchronization time.
On most distributed systems, long messages are preferred over short messages since the overhead for
message startup is generally very high. The message transmission time $trans()$ is given by:
$$trans(N, hops) = \alpha(hops) + \beta(hops) * N \qquad (7.5)$$
where $N$ is the size of the message, $hops$ is the distance between the two processors, $\alpha(hops)$ is the message
startup time and $\beta(hops)$ is the unit cost for data transmission.
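To make equation (7.5) concrete, the following sketch evaluates the transmission time for two separate messages and for one merged message. The shapes of `alpha` and `beta` and all constants are assumptions chosen for illustration; the text only states that both depend on the hop count.

```python
def alpha(hops):
    """Assumed startup cost as a function of distance (made-up constants)."""
    return 100.0 + 10.0 * hops


def beta(hops):
    """Assumed per-unit transmission cost as a function of distance."""
    return 0.5 * hops


def trans(n, hops):
    """Equation (7.5): startup time plus unit cost times message size."""
    return alpha(hops) + beta(hops) * n


separate = trans(64, 2) + trans(32, 2)   # two messages, two startups
merged = trans(64 + 32, 2)               # one message, one startup
print(separate - merged)                 # the saving is one startup, alpha(2)
```

Under a linear model of this kind, merging always saves exactly one startup cost in total transmission time; whether that saving pays off depends on the extra synchronization delay discussed next.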
Below we seek to determine the condition under which two mergeable messages can be profitably merged.
Assuming that the same conditions exist as defined in lemma 7.2 and the two messages are mergeable, let the
delay between the two tasks after the two messages are merged be $delay$. We define $C_m$, the difference in
delays before and after merging the messages, to be the difference between the new and the old delays. In
other words, $C_m$ is defined to be $delay - \max\{delay^1, delay^2\}$. The two messages can be profitably merged
only if $C_m < 0$.
Note that if the original delay between the two tasks, $\max\{delay^1, delay^2\}$, is less than or equal to 0,
that is, there is no delay due to synchronization, then there is no point in merging the two messages.
The next lemma describes how $C_m$ can be computed.
Lemma 7.5. If $\max(delay^1, delay^2) > 0$ then $C_m$ is given by:
$$C_m = \begin{cases}
t(S_3) - t(S_1) + \beta(hops) * M^2 & \text{if } t'(S_2) \le t'(S_4) \text{ and } delay^1 \ge delay^2 \\
t'(S_4) - t'(S_2) + \beta(hops) * M^1 & \text{if } t'(S_2) \le t'(S_4) \text{ and } delay^1 < delay^2 \\
t(S_3) - t(S_1) + t'(S_2) - t'(S_4) + \beta(hops) * M^2 & \text{if } t'(S_2) > t'(S_4) \text{ and } delay^1 \ge delay^2 \\
\beta(hops) * M^1 & \text{if } t'(S_2) > t'(S_4) \text{ and } delay^1 < delay^2
\end{cases} \qquad (7.6)$$
Proof:
By definition $C_m = delay - \max(delay^1, delay^2)$.
If $t'(S_4) \ge t'(S_2)$ and $delay^1 \ge delay^2$ (as in figure 7.2 case a) then the delay caused by the two data
dependencies is $delay^1$. Since $delay^1 > 0$, the extra cost of merging the two messages is
$$C_m = (t(S_3) + trans(M^1 + M^2) - t'(S_2)) - (t(S_1) + trans(M^1) - t'(S_2))$$
$$= (t(S_3) - t(S_1)) + (trans(M^1 + M^2) - trans(M^1))$$
$$= t(S_3) - t(S_1) + \beta(hops) * M^2.$$
If $t'(S_4) \ge t'(S_2)$ and $delay^1 < delay^2$ (as in figure 7.2 case b) then the delay caused by the two data
dependencies is $delay^2$. Since $delay^2 > 0$, the extra cost of merging the two messages is then
$$C_m = (t(S_3) + trans(M^1 + M^2) - t'(S_2)) - (t(S_3) + trans(M^2) - t'(S_4))$$
$$= (t'(S_4) - t'(S_2)) + (trans(M^1 + M^2) - trans(M^2))$$
$$= t'(S_4) - t'(S_2) + \beta(hops) * M^1.$$
If $t'(S_4) < t'(S_2)$ and $delay^1 \ge delay^2$ (as in figure 7.3 case a) then the delay caused by the two data
dependencies is $delay^1$. Since $delay^1 > 0$, the extra cost of merging the two messages is
$$C_m = (t(S_3) + trans(M^1 + M^2) - t'(S_4)) - (t(S_1) + trans(M^1) - t'(S_2))$$
$$= (t(S_3) - t(S_1)) - (t'(S_4) - t'(S_2)) + (trans(M^1 + M^2) - trans(M^1))$$
$$= (t(S_3) - t(S_1)) - (t'(S_4) - t'(S_2)) + \beta(hops) * M^2.$$
If $t'(S_4) < t'(S_2)$ and $delay^1 < delay^2$ (as in figure 7.3 case b) then the delay caused by the two data
dependencies is $delay^2$. Since $delay^2 > 0$, the extra cost of merging the two messages is
$$C_m = (t(S_3) + trans(M^1 + M^2) - t'(S_4)) - (t(S_3) + trans(M^2) - t'(S_4))$$
$$= trans(M^1 + M^2) - trans(M^2)$$
$$= \beta(hops) * M^1.$$
QED.
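The four cases of lemma 7.5 can be cross-checked against the definition of the merge cost. The sketch below is an illustration only: `ALPHA` and `BETA` stand for $\alpha(hops)$ and $\beta(hops)$ at one fixed hop count with made-up values, and the argument names `t1` to `t4` are shorthand for $t(S_1)$, $t'(S_2)$, $t(S_3)$, $t'(S_4)$.

```python
ALPHA, BETA = 100.0, 1.0   # assumed alpha(hops) and beta(hops) for one hop count


def trans(m):
    return ALPHA + BETA * m


def merge_cost(t1, t2, t3, t4, m1, m2):
    """Merged delay (lemma 7.4) minus the old delay (lemma 7.2)."""
    delay1 = max(0.0, t1 + trans(m1) - t2)
    delay2 = max(0.0, t3 + trans(m2) - t4)
    merged = t3 + trans(m1 + m2) - min(t2, t4)
    return merged - max(delay1, delay2)


def merge_cost_closed_form(t1, t2, t3, t4, m1, m2):
    """The case analysis of lemma 7.5; assumes the dominant delay is positive."""
    delay1 = max(0.0, t1 + trans(m1) - t2)
    delay2 = max(0.0, t3 + trans(m2) - t4)
    if t2 <= t4:
        if delay1 >= delay2:
            return t3 - t1 + BETA * m2
        return t4 - t2 + BETA * m1
    if delay1 >= delay2:
        return (t3 - t1) + (t2 - t4) + BETA * m2
    return BETA * m1


cases = [(10, 20, 30, 140, 8, 4), (10, 120, 30, 125, 8, 4),
         (10, 50, 12, 45, 20, 4), (10, 50, 30, 40, 8, 4)]
for c in cases:
    print(merge_cost(*c), merge_cost_closed_form(*c))
```

Each tuple exercises one branch of the case analysis, and the two functions return the same value for all four.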
Merging two messages into a single message has two distinct effects:
1. It increases the synchronization delay of the receiving processor that runs task $T_2$, as described in lemma
7.5, by delaying the sender.
2. It decreases the execution time of the processor running task $T_1$ by $C_{send}$ (the overhead of sending a
message).
Although there is no clear-cut method for estimating the combined effects on the overall performance of
the program, lemma 7.5 can be used as a guideline for deciding when to merge two messages. Messages
should be merged only when the overhead for sending a message justifies the extra data synchronization cost
caused by delaying the sending of the first data.
Heuristic 7.1. Assuming $C_m$ is the overhead in task $T_2$ for merging the two messages and $C_{send}$ is the
overhead for sending a message, then the messages can be merged when the following condition is satisfied:
$$C_m = delay - \max(delay^1, delay^2) \le C_{send} \qquad (7.7)$$
where $C_m$ is defined in lemma 7.5.
The above heuristic assumes that the delays in different tasks have the same effects on the overall
performance of the program. This assumption is again conservative. Although changes in the execution time of any
task would affect all tasks that interact with it, more than likely the overhead in task $T_2$ can be masked by
overlapping the communication with computation. On the other hand, the saving from eliminating one message
communication will directly decrease the execution time of $T_1$.
Based on heuristic 7.1 and lemma 7.5, we can derive a heuristic-guided algorithm for deciding when
messages can be profitably merged. The problem of minimizing data synchronization can be defined as a
problem of partitioning the data dependencies into data dependence clusters. Initially, we assume that each
data dependence forms a data dependence cluster by itself. We then proceed to merge (union) the clusters into
larger clusters until the merge is no longer beneficial. This algorithm is applied to each pair of tasks $\{T_1, T_2\}$
that have cross-task dependencies between them.
Algorithm 7.1. Message merging.
For each pair of tasks $T_1$ and $T_2$ with data dependence from $T_1$ to $T_2$ do
1. Initialize each data dependence to form a data dependence cluster of its own.
2. For each data dependence cluster pair $(\Psi_i, \Psi_j)$ do
      If $\Psi_i$ and $\Psi_j$ are mergeable then
         Calculate $C_m$, the cost of merging $\Psi_i$ and $\Psi_j$, and set
         $B_{i,j} = C_{send} - C_m$
      else
         Set $B_{i,j}$ to be $-\infty$.
      end if
   end for
3. Sort the pairs of data dependence clusters $(\Psi_i, \Psi_j)$ in a heap based on $B_{i,j}$.
4. While the maximum $B$ over all pairs of data dependence clusters is positive do
      Merge $\Psi_i$ and $\Psi_j$ into $\Psi_{i'}$.
      Recompute the value $B_{i',j}$ for all $\Psi_j$ and adjust the heap.
   end while.
END.
This algorithm has the complexity of $O(n^2 * \log(n))$ where $n$ is the number of cross-task data
dependencies. This is because in step 2 the cost calculation is repeated $(n-1)^2$ times. In step 3, the cost of
sorting $n^2$ numbers is $O(n^2 * \log(n))$. There will be at most $n-1$ merges in step 4. So the overall complexity is
$O(n^2 * \log(n))$.
One can improve the above algorithm by applying a heuristic that merges only adjacent data dependence
clusters. This heuristic makes good sense, because the synchronization delay in $T_2$ is directly proportional to
the distance between statements $S_2^i$ and $S_2^j$.
Algorithm 7.2. Message merging (merge only adjacent dependencies).
For each pair of tasks $T_1$ and $T_2$ that have data dependence from $T_1$ to $T_2$ do
1. Initialize each data dependence to form a data dependence cluster of its own.
2. For each data dependence cluster $\Psi_i$ do
      Calculate $C_m$, the cost of merging $\Psi_i$ and its adjacent data dependence cluster $\Psi_{i+1}$ (if they are
      mergeable), and set $B_i = C_{send} - C_m$
   end for
3. Sort the data dependence clusters in a heap based on $B_i$.
4. While the maximum $B$ over all data dependence clusters is positive do
      Merge $\Psi_i$ and its adjacent data dependence cluster $\Psi_{i+1}$ into $\Psi_{i'}$.
      Recompute the values of $B$ for the clusters adjacent to $\Psi_{i'}$ and adjust the heap.
   end while.
END.
Lemma 7.6. The complexity of algorithm 7.2 is $O(n * \log(n))$, where $n$ is the number of cross-task
dependencies.
Proof:
Initially there are $n$ dependence clusters, and after each merge there is one less data dependence cluster.
So in the worst case, the algorithm can execute the while loop at step 4 at most $n-1$ times. In step 3, sorting $n$
numbers takes $O(n * \log(n))$. Thus the overall complexity of the algorithm is $O(n * \log(n))$.
QED.
Note that if we omit the sorting in step 3 of algorithm 7.2 and merge the data dependence clusters $\Psi_i$
and $\Psi_{i+1}$ whenever $B_i$ is positive, then algorithm 7.2 is linear. The drawback of this heuristic is that we lose
the optimality claim of algorithm 7.1 when speeding up the algorithm.
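The adjacent-merging idea can be sketched in a few lines. This is not the heap-based algorithm 7.2 itself: for brevity it recomputes all gains in every round, so it runs in quadratic rather than $O(n * \log(n))$ time. The cluster encoding `(t_send, t_recv, size)`, the rule that a merged message leaves after the later producer and is needed by the earlier consumer, the assumption that all adjacent clusters are mergeable, and the constants `ALPHA`, `BETA` and `C_SEND` are all assumptions made for illustration.

```python
import math

ALPHA, BETA = 100.0, 1.0   # assumed startup and per-unit transmission costs
C_SEND = 60.0              # assumed sender-side overhead of one message


def delay(c):
    t_send, t_recv, size = c
    return max(0.0, t_send + ALPHA + BETA * size - t_recv)


def merge(a, b):
    # Sent after the later producer, needed by the earlier consumer.
    return (max(a[0], b[0]), min(a[1], b[1]), a[2] + b[2])


def benefit(a, b):
    """B = C_send - C_m; no gain is credited when there was no delay."""
    old = max(delay(a), delay(b))
    if old <= 0:
        return -math.inf
    return C_SEND - (delay(merge(a, b)) - old)


def merge_adjacent(clusters):
    clusters = sorted(clusters, key=lambda c: c[1])   # order of use in T2
    while len(clusters) > 1:
        gains = [benefit(clusters[i], clusters[i + 1])
                 for i in range(len(clusters) - 1)]
        best = max(range(len(gains)), key=gains.__getitem__)
        if gains[best] <= 0:
            break
        clusters[best:best + 2] = [merge(clusters[best], clusters[best + 1])]
    return clusters


print(merge_adjacent([(10, 50, 8), (30, 40, 4), (500, 900, 16)]))
```

In this example the two early dependencies are merged because the added delay (8 time units) is below the assumed send overhead, while the late dependency is left alone because merging it would stall the receiver for a long time.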
When loops are parallelized, it is usually done by blocking the loops into a parallel loop and a sequential
inner loop. The non-local data references inside the sequential loop often define a very regular pattern of
accesses. Thus the message consolidation algorithm can take advantage of this regularity and does not need to
unroll the loop to consolidate the messages. Instead, the algorithm can work on the statements inside the loop
as well as statements at the end of the statement blocks by assuming that the loop wraps around once. The
results can then be generalized to all loop instances. This implies that the complexity of the algorithm for
parallel loops depends on the number of cross-task data dependencies in the inner loops.
Statement reordering can be used to move the definition of the data that is the source of a cross-task
dependence to as early in the code as possible and move the use of the cross-task data to as late as possible.
This has the effect of minimizing the synchronization delay.
The array reshaping transformation that we discussed in section 7.3 can be used to condense the size of
the data to be moved across the network to further decrease the data transmission time. The dependence graph
is adjusted so that the data references in the receiving task depend on the local variables that hold the arriving
messages instead of the original variables. This keeps the dependence graph in a consistent state.
7.4.3. Summary
The major problems in parallelizing sequential programs for distributed memory parallel computers lie in
distributing data into local memories and converting data dependencies into messages. In this paper, we have
discussed the problem of reducing the cost of data synchronization by consolidating data dependencies and thus
the corresponding messages. We introduced a special kind of data dependence called the data dependence
cluster. The data dependence clusters represent the generalized data dependence relations for programs under
the message-passing paradigm. We analyzed the performance of merging data dependence clusters and messages
based on the estimated parallel execution time of the program. The message consolidation techniques
introduced in this paper can reduce the communication overhead for parallel computers significantly when they are
applied appropriately. The algorithm for message consolidation that we have presented here utilizes a
heuristic-guided performance prediction model (see chapter 6) to decide whether two messages can be
beneficially merged. The algorithm is optimal in the sense that it minimizes the execution time predicted by the
model.
7.5. Algorithm Substitution and Fine-Tuning with Pre-Optimized Algorithms
Some widely used numerical algorithms have been explored extensively on different multiprocessor
systems. The optimizations of these algorithms often involve fundamental algorithm changes in order to utilize
the special parallelism features provided by the target machines. Some of the techniques and heuristics for
modifying these algorithms are problem-dependent and are applicable only to some particular problems and
architectures. The importance of these algorithms makes optimizing them important, but one does not want to
add the specific heuristics and techniques used in optimizing them to the knowledge base unless they can be
used by other problems. One solution to this problem is to include the pre-optimized algorithms for these
kinds of problems in the knowledge base of the system. When a program matches the pattern of a
pre-optimized algorithm, rather than performing a series of transformations, the algorithm of the program is
replaced by the pre-optimized version.
Some of the pre-optimized algorithms can actually be achieved by a series of basic transformations. In
these cases, the use of transformations or pre-optimized algorithms is debatable. The pre-optimized algorithms
eliminate the lengthy intermediate transformations. Once the pattern is recognized, the system can directly
substitute the parameters of the algorithm to translate the program into the desired form. Also, pre-optimized
algorithms can achieve optimal results for the particular problems that might well be beyond the ability of the
transformation system. Using pre-optimized algorithms helps to cut down the number of rules applied to
general problems, because those heuristics that are only applicable to the special pre-optimized algorithms need
not be included. On the other hand, the pre-optimized algorithms increase the size of the knowledge base, and
the pattern recognition test for a pre-optimized algorithm adds penalties to all problems that may not be related
to the algorithm at all.
The choice between fine-grain heuristic-directed transformations and pattern-directed pre-optimized
algorithms relies solely on the judgment of the system engineers. Our approach is that heuristics
that are only applicable to very limited classes of applications are not included, and pre-optimized
algorithms are only used for problems that cannot be optimized by the general heuristics and transformations.
Due to the nature of this approach, we will explain the ideas through an example. The example we
consider here is the array accumulation problem. This is an interesting problem because the program is simple but
it contains data dependencies and memory accesses that may serialize the computation. This example allows
us to illustrate the heuristics of selecting and fine-tuning the pre-optimized algorithms as well as their
applicability. It also shows the importance of the resolution of memory contentions. The following is the general
form of the array accumulation problems.




In the above program, f is a function that maps the range of the loop index to the range of the index of array
a. For the sake of simplicity, we assume that the accumulation statement is enclosed in a single loop. In more
general cases, the accumulation statement may be nested in multiple loops or the array may be a
multi-dimensional array.
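As a point of reference, the accumulation pattern described here can be written as a short sequential loop. The sketch reflects our reading of the surrounding text rather than the original listing: the names `accumulate`, `n`, `f` and `g` follow the prose, and the 1-based indexing with an unused slot 0 is an assumption made to mirror the `[1..B]` array bounds.

```python
def accumulate(a, n, f, g):
    """Sequential reference: add g(i) into bin f(i) for i = 1..n."""
    for i in range(1, n + 1):
        a[f(i)] += g(i)
    return a


B = 4
a = [0] * (B + 1)                      # slot 0 unused, bins are 1..B
result = accumulate(a, 8, lambda i: (i % B) + 1, lambda i: i)
print(result[1:])
```

Here the index function maps several loop indices to the same bin, which is the situation that forces synchronization once the loop is run in parallel.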
If the index function f is a one-to-one function, then f is a permutation on a subset of the interval [1..B]. In
this case, values of g(i) are accumulated into distinct elements of array a. When this program is run on a
multiprocessor system, no two processors will update the same memory cell at the same time. However,
memory/bus contention problems may still exist because of the limit of bus or network bandwidth, but memory
locks are not needed.
For machines with P processing units, one trivial approach is to divide the program into P equal-sized
tasks and run them on the P processors as shown below.
var a: array [1..B] of integer;
forall i in 1 .. P do
    var s: integer;
    s := N / P;




For this approach, no pre-optimized algorithm is needed but the speed-up may not be impressive because
the efficiency of the approach depends on the values of the index function f, the memory and bus bandwidths
as well as the degree of memory interleaving. If more information about the index function is available, then
better optimizations are possible.
For example, if the index function is a linear function with respect to the loop indices and the right-hand
side of the assignment takes about the same amount of time to be processed for all loop instances, then the
memory updates may be regulated by updating the memory according to the memory interleaving. This can be
accomplished by "index shifting" to shift the index function such that the memory update requests are evenly
distributed to all the memory modules. This approach might be the fastest that we can achieve on some
particular machines since the memory updates are processed to their maximum extent.
If f is a constant function then the values of g(i) are accumulated at the same memory cell a[C], where
C = f(i). In this case, the problem is called a one bin accumulation problem. If this program is to be
executed on a multiprocessor computer with P processors, each instance of the loop will have to compete to
update the same memory cell a[C]. On machines with a combining network that supports the "fetch and
add" operations, such as the NYU Ultracomputer [Schw80], the accumulation requests can be combined in the
network as the requests are routed to the memory. When the requests arrive at the memory module, there will
be only one combined memory update request remaining; thus no memory lock is needed. The function call
barrier() synchronizes the processors and guarantees that each processor starts the next operation at the same
time, which is essential for the coordination of the "fetch and add" operations.
var a: array [1..B] of integer;








        fetch_and_add(a[index], value);
    end for
end forall
For machines without a combining network and a "fetch-and-add" operation capability, mutually
exclusive accesses to a[C] need to be enforced to avoid memory update conflicts. This might serialize the
memory updates and lose all the parallelism of the machine. One possible solution is to block the loop into P
chunks and allocate them to P processors; each processor accumulates the values of g(i) in a local counter.
These counters are summed up through a tree-sum algorithm after all processors finish their local
accumulations. Only one synchronization is needed before the tree-sum operation is started.
var a: array [1..B] of integer;
    private: array [1..P] of integer;










If the expression g(i) contains no function calls then the private accumulations can be computed in N/P
cycles and the tree-sum operation can be computed in O(log P) cycles.
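The private-counter scheme for the one bin problem can be sketched as follows. The processors are simulated sequentially, the helper names `tree_sum` and `one_bin_accumulate` are hypothetical, and the block size assumes that P divides N; the sketch only illustrates the two steps and the logarithmic number of reduction rounds.

```python
def tree_sum(values):
    """Pairwise reduction; each round halves the number of partial sums."""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        vals = [sum(vals[i:i + 2]) for i in range(0, len(vals), 2)]
        rounds += 1
    return vals[0], rounds


def one_bin_accumulate(n, p, g):
    """Each of p simulated processors sums its own block of g(i),
    then the private counters are combined by tree_sum."""
    s = n // p                                  # block size, assumes p divides n
    private = [sum(g(i) for i in range(k * s + 1, (k + 1) * s + 1))
               for k in range(p)]
    return tree_sum(private)


print(one_bin_accumulate(16, 4, lambda i: i))   # total and number of rounds
```

With 4 private counters the reduction takes 2 rounds, in line with the O(log P) bound stated above.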
If the function f is neither a constant function nor a one-to-one function, this problem is called a multiple
bins accumulation problem because the values of g(i) are accumulated into many elements of array a
simultaneously. A significant number of synchronizations are needed because of the unpredictability of the values
of the function f(). Processors may compete to update any of the array elements at any instant, so memory
locks need to be placed on all elements of the array a to guard correct memory updates.
All three kinds of array accumulation problems are common in practical applications. One particularly
interesting example of these is the image processing algorithm called "histogramming." The histogramming
problem is a special case of the multiple bins accumulation problem whose index function is an array
reference. In image processing, a picture frame is represented by a two-dimensional array of points called pixels.
Each pixel has a small value between 0..(b-1) (typically an 8-bit number) that represents the grey scale value
or the color RGB value of the point. The histogramming involves keeping track of the occurrence of each grey
scale value in the picture.
Throughout this section, the multiple bins accumulation problem is used to demonstrate the general
ideas, but at times when more detailed illustration is needed the histogramming problem will be used.
The multiple bins problem can be detected by the system by matching the program with the general form
of the multiple bins problem described above. If the index function f involves only loop indices, then the system can
detect the type of the problem at hand by applying array subscript tests. However, when there are variables
other than the loop indices involved, determining whether the index function is a constant function or a
one-to-one function may not be trivial for the system. For example, the index may be an array reference whose i-th
element has value i. Although the user may know that the index function is an identity function, the system
will not be able to determine this fact simply by static analysis. In these cases, the system will have to rely
on user interaction to provide help. If the subscript test fails and the user cannot provide help, the more
complicated multiple bins problem is assumed.
When a multiple bins accumulation problem is recognized by the system, pre-optimized algorithms for
this problem are considered. When there is no pre-optimized version for the target architecture, the
computational model needs to be analyzed to see if any pre-optimized version can be applied to the target architecture.
Heuristics are also considered in fine-tuning the pre-optimized algorithm to match the target machine.
For machines that support combining networks and "fetch and add" operations, the accumulation
operations in the multiple bins accumulation problem can be translated into "fetch and add" operations. If we
divide the program into P tasks by loop blocking and run the tasks on P processors, there will be at most P
"fetch and add" requests for the same memory location at each cycle. Requests that have the same
destinations are combined in the network and merged into one request when arriving at the memory cell. The "fetch
and add" operations eliminate the need for memory locks, so the overall performance of the algorithm will be
improved. However, the speed-up of this approach may not be significant, because there may be many
memory cells to reference in the same operation cycle. Depending on the reference patterns, the memory
update requests may pile up in the network and memory modules and may thus cause network saturation.
The two-phase private counters approach we used in the one bin accumulation problem can be extended
to solve the multiple bins accumulation problem on multiprocessor machines that have several memory
modules. During the first phase, an array of private accumulation counters is created for each processor that
executes the program. Each processor gets a share of the job and updates the private counters independently.
In the second phase, these private counters are summed up either by tree-sum or other available parallel
summation algorithms.
The memory updates of a[f(indices)] in the original program are irregular and unpredictable. In the
first phase of the two-phase approach, the memory access pattern of private_a is unpredictable. But, since
each processor exclusively owns its private counters, the private counters can be updated simultaneously and
independently. One possible problem is that when the machine has a small cache, the irregular updates of the
array private_a may produce a high cache miss ratio. However, in most problems, the private counters are
fairly small. For example, the private counter array in the histogramming problem is of size 2**b. When b is
equal to 8, the size of the private counter array is 256, so the entire array can reside in the cache throughout
phase 1.
On the other hand, the memory update pattern in phase 2 is very regular. Very few synchronizations are
needed. Processors can cooperate to sum up the private counters. On most machines, the pre-defined tree-sum
algorithm will provide a reasonably good speed-up.
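The two-phase approach can be illustrated with a small histogram example. This is a sketch under assumptions, not the pre-optimized algorithm stored in the knowledge base: the function names are hypothetical, the workers are Python threads from the standard `concurrent.futures` module, and phase 2 uses a plain column sum in place of a tree-sum.

```python
from concurrent.futures import ThreadPoolExecutor


def private_histogram(pixels, bins):
    """Phase 1 work of one processor: count into its own private bins."""
    counts = [0] * bins
    for v in pixels:
        counts[v] += 1
    return counts


def two_phase_histogram(pixels, bins, p):
    s = -(-len(pixels) // p)        # ceiling division: block size per worker
    blocks = [pixels[k * s:(k + 1) * s] for k in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        private = list(pool.map(lambda blk: private_histogram(blk, bins), blocks))
    # Phase 2: regular, lock-free summation over the private counters.
    return [sum(column) for column in zip(*private)]


print(two_phase_histogram([0, 1, 1, 3, 2, 1, 0, 3, 3, 3], bins=4, p=3))
```

No lock is taken anywhere because each worker writes only to its own counter list. In standard CPython the threads show the structure of the computation rather than a real speed-up.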
Note that this approach of introducing private counters and dividing the memory accesses into two
phases is based on the expertise about this particular array accumulation problem and may not be applicable to
other problems. Therefore, the pre-optimized algorithm is used. The pre-optimized algorithm did not specify
where the private accumulation counters should be allocated because this problem falls into the category of the
array allocation problem and the general heuristics about array allocation can be used to decide how the private
counters should be allocated. In general, not all the details in pre-optimized algorithms need to be
implemented because some of them are covered by general heuristics. As a result, some of the variations of the
pre-optimized algorithm can be left undetermined until the actual program is substituted. The unspecified part
can then be obtained by applying the general transformations.
var a: array [1..B] of integer;
    private_a: array [1..P, 1..B] of integer;
    s: integer;
s := N / P;
forall i in 1 .. P do






    a[j] := tree_sum(private_a[*, j]);
end for;
This approach should cut down the number of pre-optimized algorithms that the system needs to store.
The implementation of the function tree_sum() may vary from machine to machine. For machines that have
combining networks, the summations can be accomplished by merging the "fetch and add" requests in the
network. In this case, the last loop that performs the tree-sum can be translated into:
forall i in 1 .. P




At each cycle, the P processes will generate P "fetch and add" requests to the same memory location.
These requests are combined in the network and merged into one request when arriving at the memory. Since
each processor accesses the same memory location at the same cycle, no hot spot network saturation problem
will ever occur. This last statement will be valid only when all the processors are homogeneous and the
system provides a barrier synchronization routine that can start all processors at the same time.
On other machines that do not support "fetch and add" operations, the tree-sum operations can be
handled by simulating the binary tree network. We can parallelize the outermost loop by synchronizing the
memory accesses of the tree-sum operations.
for pid in P .. 1 step -1 do
    local private_a: array [1..P, 1..B] of integer;
    for i in 1 .. B do
        if (pid == 1) then
            a[i] := private_a[pid,i] + private_a[pid*2,i] + private_a[pid*2+1,i];
        else if (pid <= P / 2) then /* is not a leaf */




In the above program, there are two pairs of dependencies from private_a[pid*2,i] and
private_a[pid*2+1,i] to private_a[pid,i]. These dependencies correspond to the inherent characteristic of the
tree operations, and the operations in the intermediate nodes of the tree cannot be executed until the children of
the node finish the computations. If the array private_a is allocated in the shared memory, then semaphores
need to be inserted to enforce the order of the tree computations.
var syn: array [1..P] of semaphore;
forall pid in P .. 1 step -1 do
    local private_a: array [1..P, 1..B] of integer;
    for i in 1 .. B do
        if (pid == 1) then
            a[i] := private_a[pid,i] + private_a[pid*2,i] + private_a[pid*2+1,i];
        else
            if (pid <= P / 2) then /* is not a leaf */
                wait(syn[pid*2]); wait(syn[pid*2+1]);






Here signal() and wait() are primitives for synchronization. If the array private_a is in the local memory, then
the values can be passed to other processors by either copying them to shared memory with synchronization
locks or sending them through the inter-processor communication channels.
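The ordering constraint that the semaphores enforce can be seen in a sequential simulation of the binary tree. The sketch below is an illustration only: the function name `tree_sum_by_pid` is hypothetical, row 0 of `private_a` is unused so that processor ids start at 1, and visiting the ids from P down to 1 stands in for the wait and signal calls, since every child id is larger than its parent id.

```python
def tree_sum_by_pid(private_a, p, b):
    """Combine the private counters along a binary tree rooted at pid 1.

    private_a[pid][i] holds counter i of processor pid, for pid in 1..p.
    The children of node pid are 2*pid and 2*pid+1 when they exist.
    """
    partial = [row[:] for row in private_a]      # leave the input untouched
    for pid in range(p, 0, -1):                  # children finish before parents
        for child in (2 * pid, 2 * pid + 1):
            if child <= p:
                for i in range(b):
                    partial[pid][i] += partial[child][i]
    return partial[1]


counters = [[0, 0], [1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
print(tree_sum_by_pid(counters, p=5, b=2))
```

In a real shared-memory version each node would block on its children instead of relying on the loop order, which is exactly the role of the semaphore array in the listing above.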
For a non-shared memory machine like Pringle, the "tree-sum" operations are pipelined through a tree
configuration of the machine. The switches in the network are configured such that the processor with
processor id pid is connected to its parent whose id is pid/2 and its two children whose processor ids are pid*2 and
pid*2+1. The processor with processor id 1 is the root of the tree. The synchronization and the buffers are
hidden in the implementation of the channel variables on the machine.
The pipelined tree-sum algorithm can be executed in time C * log P * B where C is a small constant.
for i in 1 .. B do
    if (is_leaf(pid)) then
        CH_parent <- private_a[pid, i];
    else
        l <- CH_l_child; r <- CH_r_child;
        CH_parent <- l + r + private_a[pid, i];
    end if;
end for;
The unpredictable memory access pattern in these kinds of programs causes loop optimization techniques
to be ineffective no matter how the control structure is modified. In this example, creating private
accumulation counters regulates the memory update patterns and eliminates the need for memory locking. The trade-off
is that more memory cells and computing cycles are used to store and sum up the private counters. These
costs are constant with respect to the number of processors P and the size, B, of the array a. Also, these costs
may be compensated for by minimizing synchronization and resolving memory contentions.
For problems like the histogramming program, the memory access pattern is highly data dependent.
There is no easy way for the compiler to tell whether the extra cost of manipulating the private bins justifies
the synchronization costs it saves. Heuristics are used in deciding whether pre-optimized algorithms are
suitable and beneficial. Use of the heuristics also allows the compiler to be more aggressive in parallelizing
the program.
7.5.1. Summary
In this section, the use of pre-optimized algorithms is demonstrated; variations of the algorithm are used
for different architecture configurations. The choice of the variations in implementation is determined by a set
of rules based on the computational model. After the algorithm substitution, basic transformations are applied
to match the algorithm with the computational model better.
A pre-optimized algorithm may use other pre-defined algorithms. For example, the two-phase approach
in solving the multiple-bin accumulation problem uses the pre-optimized pipelined tree-sum algorithm. The
actual implementation of the tree-sum operation is based on the computational model.
The pre-optimized algorithms differ from the fine-grain heuristics in the abstraction level of the knowledge:
the pre-optimized algorithms are pre-packed special-purpose heuristics which are only suitable for the
particular problem they are designed for. Better optimizations may be achieved because the pre-optimized code
has been optimized extensively for the particular target machine. The fine-tuning process we described above
can be used to utilize previously optimized code even though the target machine may not completely match the
architecture for which the program is optimized.
CHAPTER 8
IMPLEMENTATION AND EXPERIMENTAL RESULTS
In this chapter we describe an implementation of an intelligent parallel programming environment
designed to demonstrate the ideas we have proposed in this thesis. Some experimental results are also
presented.
8.1. An Experimental Intelligent Parallel Programming Environment
In this experiment, we constructed a prototype intelligent parallel programming environment, called
IntelliCompiler. IntelliCompiler incorporates some new concepts and distinct features:
1. No pre-selected program transformation sequences are assumed; rather, the program optimization unit
analyzes features of the program and the target machine and utilizes heuristics in the knowledge base to
choose the program transformation sequences dynamically.
2. When there are factors that cannot be resolved by the static analysis, multiple paths of control flow are
evaluated. If differences in performance are significant among different control flows, run-time tests are
generated to decide the best control flow at run time.
3. The system provides support for representing features and knowledge of the architectures and programs
explicitly.
4. A performance prediction unit is incorporated in the decision-making process to estimate the effects of
the transformation.
5. The system supports various interaction and optimization degrees and is designed to incorporate
self-learning modules and to accommodate different architectures.
8.1.1. The System Architecture
The major components of the parallel programming environment include:
• Multiple front-ends for different programming languages.
• Multiple back-ends.
• An intelligent program optimization system.
• A machine knowledge manipulation system.
• A user interface.
8.1.1.1. The Front-Ends and the Back-Ends to the Programming Environment
Currently, front-ends to the system include parsers and program dependence graph generators for Blaze,
C, and Fortran†. All front-ends generate BLAZE program dependence graphs as the input to the intelligent
program optimization system; this allows the latter to be language independent. The BLAZE program
dependence graphs are generated by the program dependence graph generator in the front-ends‡.
The back-ends of the system include unparsers that generate Blaze, E-Blaze, C, Fortran, and EPEX-C
programs§ and code generators for Sequents and Suns.
† The Blaze front-end was built with contributions from the following individuals: P. Mehrotra, K. Wang, C. Koelbel,
G. Shannon, and K. Miller. The C and Fortran front-ends were developed by D. Gannon and his students at Indiana
University.
‡ The Blaze program dependence graph generator was written by K. Wang and C. Koelbel.
§ The Blaze and E-Blaze unparsers were written by K. Wang. The C and Fortran unparsers were written by D. Gannon
[Figure: Fortran, Blaze/E-Blaze, and C programs enter the front end (Fortran parser, Blaze parser, C parser,
program dependence graph analyzer); the back end consists of unparsers (including Fortran and Blaze) and a
code generator.]
Figure 8.1. The structure of the parallel programming environment.
8.1.1.2. The Machine Knowledge Manipulation System
The machine knowledge manipulation system, as discussed in chapter 5, uses an object-oriented
knowledge representation scheme and is equipped with an inference engine to perform reasoning on the
features of the machines. Not all features of the machine need to be specified by the system programmer.
Based on some basic features, the system runs a test program to collect some machine-dependent features
(such as the size and magnitude of floating-point numbers, unit time for floating-point operations, and memory
reference costs) or language-dependent features (such as overheads for loops, array address calculations, and
procedure calls). Some features may be derived by the inference engine of the system based on the existing
knowledge of similar architectures. The machine knowledge manipulation system supports three different
modes:
and his students. A different version of the C unparser for nCUBE and iPSC was written by C. Koelbel and P. Mehrotra.
The EPEX-C unparser was written by K. Wang and C. Koelbel while they were supported as summer students at the
IBM T. J. Watson Research Center, 1986.
• Query mode. This is the usual interface to the programming environment. Given a name (or alias) of a
target machine, the system returns a list of the features currently known about the machine.
• Knowledge update mode. This starts an interactive session to install new machines, add new features of a
known machine, or modify the existing knowledge.
• Inference mode. This starts an interactive session with an interface to the SQL database. Its reasoning
capability allows the user to compare relations between features of the machines.
8.1.1.3. The Intelligent Program Optimization System
The intelligent program optimization system includes a machine feature interface, a program feature
analyzer, a program transformation system, an intelligent program restructuring control system, a
learning interface module, and a user interface module. The structure of the program optimization system is
shown in figure 8.2.
[Figure: machine features and the program dependence graph are the inputs to the intelligent program
optimization system, which produces the restructured program dependence graph.]
Figure 8.2. The structure of the intelligent program optimization system.
The machine feature interface stores the information generated by the machine knowledge manipulation
system. It also acts as the interface between the program optimization system and the machine knowledge
manipulation system. The system can use the interface to query the machine knowledge manipulation system
for properties or relationships of machine features.
The program feature analyzer abstracts the features of the program based on the program dependence
graph. Examples of such information include whether the program fragment matches a known algorithm (that
has been previously optimized), which arrays are most critical from the point of view of optimization, whether
the program has cross-task dependence, etc.
The program transformation system contains various program transformation techniques. The
transformations are organized into groups based on the objectives of the program-restructuring process (as
discussed in chapter 4). Each program transformation is a module that consists of the knowledge and
procedures for testing the applicability of the transformation, detecting opportunities for applying the
transformation, evaluating effects of the transformation on the program, deciding the parameters of the
transformation, and carrying out the transformation (by modifying the program dependence graph).
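The five responsibilities of a transformation module listed above can be pictured as an interface. The Python sketch below is only an illustration under stated assumptions: the actual system is written in Prolog, and the class names, method names, and the toy dictionary used in place of a program dependence graph are all hypothetical.

```python
from abc import ABC, abstractmethod


class Transformation(ABC):
    """Hypothetical interface mirroring the five duties of a module."""

    @abstractmethod
    def is_applicable(self, graph): ...          # test applicability

    @abstractmethod
    def find_opportunities(self, graph): ...     # detect where to apply

    @abstractmethod
    def estimate_effect(self, graph, site): ...  # evaluate the effect

    @abstractmethod
    def choose_parameters(self, graph, site): ...  # decide parameters

    @abstractmethod
    def apply(self, graph, site, params): ...    # modify the graph


class LoopBlocking(Transformation):
    """Toy loop blocking on a graph of the form {"loops": {name: trip_count}}."""

    def __init__(self, processors):
        self.processors = processors

    def is_applicable(self, graph):
        return bool(graph["loops"])

    def find_opportunities(self, graph):
        # Only loops with more iterations than processors are worth blocking.
        return [n for n, trip in graph["loops"].items() if trip > self.processors]

    def estimate_effect(self, graph, site):
        # Toy estimate: iterations each processor would execute.
        return graph["loops"][site] / self.processors

    def choose_parameters(self, graph, site):
        # Block size = ceiling(trip_count / processors).
        return -(-graph["loops"][site] // self.processors)

    def apply(self, graph, site, block):
        new_graph = {"loops": dict(graph["loops"])}
        trip = new_graph["loops"].pop(site)
        new_graph["loops"][site + "_block"] = -(-trip // block)  # outer loop
        new_graph["loops"][site] = block                          # inner loop
        return new_graph
```

A controller in this sketch would call the methods in the listed order, for example find_opportunities, then choose_parameters, then apply, and keep the original graph if estimate_effect is unfavorable.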
Table 8.1 shows the list of program transformations that are implemented and have heuristics for
utilizing them encoded in the knowledge base of the system.
Table 8.1. List of the program transformations that are currently implemented in the system.
Name GO VC TC PA MU AT MS
Statement reordering • • • •
Statement splitting • • •
Forward substitution • •
Loop blocking • • •
Loop interchanging • • • • • •
Loop merging • • • •
Loop distribution • • • •
Loop unrolling • • • •
Index shifting • •
Vectorization •
Cycle shrinking •
Array block transfer •
Array copying • •
Array reshaping • •
Array localization •
Message consolidation •
Run-time scheduling • •
Do-across scheduling • • •
The abbreviations used in the above table are explained in the following table.
Abbrev.  Subgoal                        Abbrev.  Subgoal
ET       Enable other Transformation    GO       General Optimization
MS       Minimizing Synchronization     TC       Task Creation
MU       Memory Utilization             VC       Vectorization
PA       Processor Allocation
8.1.1.4. The Intelligent Program Restructuring Control System
The intelligent program restructuring control system utilizes the feature-directed program optimization
paradigm and the hier-blackboard architecture. Its duties include choosing the appropriate optimization
focuses, selecting and carrying out applicable program transformations, giving necessary explanation, and
evaluating the result.
The intelligent program restructuring control system consists of an inference engine, a knowledge base,
and a performance prediction subsystem. An explanation mechanism and help utility are also supported for
users who choose to interact with the system. The inference engine is based on the blackboard architecture
and features opportunistic reasoning. The knowledge base contains heuristic-oriented rules for guiding the
program transformation system during the program restructuring process. The rules are organized into the
knowledge sources based on the structure discussed in chapter 4. The performance prediction subsystem
contains a set of evaluation functions which can be dynamically integrated to estimate the performance of the
current program on the target machine to aid the decision-making process. Different sets of evaluation
functions can be used in different stages and parts of the program optimization process based on the available
resources and the optimization degree. The details of the performance prediction system are described in
chapter 6.
8.1.1.5. The User Interface
The user interface module contains a Blaze/Kali unparser, a textual user interface, an X Window user
interface, and a program dependence graph sketcher. The dependence graph sketcher and the Blaze unparser
are used to show intermediate states of the optimized program in graphic and textual forms. The graphic
sketcher can display the program dependence graph on workstations that support the X Window system or
print hard copies of the program dependence graphs. The unparser translates the program dependence graph of
the program being optimized into user-understandable Blaze-like code. Both tools are intended for users who
want to have a high degree of interaction to control the program restructuring process. They also help the user
to understand the reasons behind the decision-making process of the system.
[Figure: screen capture showing the Blaze-like text of a program produced by the unparser and the
corresponding sketch of its dependence graph.]
Figure 8.3. The text and graphic form of the dependence graph generated by the Blaze unparser and the Blaze
dependence graph sketcher.
The system uses the UNIX socket primitives to communicate with the Prolog process, which is running
in the background. The communication between the program optimization system and the front-end is through
UNIX files in the form of BLAZE program dependence graphs.
The text user interface is based on a Prolog library that we developed for building a general text-based
user interface for Prolog programs. The Prolog library supports cursor movements and screen updates, a
dynamic menu-action selection mechanism, and explanation and help utilities. The cursor movements and
screen updates allow the menu selection to be used on dumb terminals. The library provides full support for
utilizing the terminal capability by consulting the terminal capability database. The dynamic menu (with
actions associated with the entries) is a versatile text-driven menu mechanism for interactive control of the
system. Items or actions
of the menu can be modified with other built-in menus or created at run time. A caching scheme in the
ExpShell (described below) supports dynamic loading of the menus at run time.
8.1.1.6. The Structure of the System
The entire program restructuring system is implemented on top of a hierarchy of expert system tools.
These tools include a hier-blackboard system, an expert system shell called ExpShell, and a C-Prolog
interpreter which runs on the UNIX† operating system. The architecture of the underlying system is shown in
the following figure.
Figure 8.4. The architecture of the underlying systems for the program restructuring system.
The hier-blackboard system was described in detail in chapter 4.
The ExpShell was built for the implementation of the intelligent programming environment, but it was
carefully designed so that it can be used to construct other expert systems by simply supplying it with
appropriate domain knowledge. The ExpShell library contains utilities for supporting unification and
inference, list manipulation, interface to the UNIX operating system, input/output and cursor movement control,
dynamic menu manipulation, user interface, knowledge manipulation and installation, rule caching, and Prolog
program analysis and debugging. The system debugging tool can discover errors where undefined functions
are called. It also warns about the following potential problems: a vector that appears only once in a
predicate (possibly due to a misspelled name), two functions with the same name (same or different number of
arguments) but which have other functions defined in between (misspelled name or poor choice of names), a
function whose predicates appear in two or more files, and asserting a function that is also defined in a file.
The debugger also builds up a cross-reference map of the predicates, so that optimization and reference checks
can be done.
Table 8.2 shows the size of each of the components of the program restructuring system.
8.2. Examples and Experiments
8.2.1. Remarks about the Experiments
Before we discuss the experimental results, we would like to address several controversial issues. First,
what are "hard" problems and what are "simple" problems for parallel compilers? Programs that a
compiler faces can be roughly divided into two groups: programs with few data dependencies and programs with
† UNIX is a trademark of Bell Laboratories.
Table 8.2. Sizes of components of the intelligent program restructuring system.
Component                                   Language   Lines
Front-end & back-end                        C          18019
ExpShell                                    Prolog      3198
Library for ExpShell                        Prolog      2543
Hier-Blackboard                             Prolog      1612
Machine knowledge manipulation              Prolog      2097
Program transformation system               Prolog      4377
Performance prediction                      Prolog      2162
Intelligent program restructuring control   Prolog      2353
User interface                              C/Prolog    2058
Total                                                  38419
many dependence relations. The former are normally considered to be simple because there is more parallelism
present in the program. The matrix-vector multiplication example shown in section 3.1 demonstrates that
this task is not necessarily as easy as many people would like to believe. The decision trees for such programs
tend to be large since there are many possible alternatives at each stage. The task of the compiler is to pick
the most plausible solution path efficiently. This task is important because the performance of a program on an
architecture depends not only on the degree of parallelism inherent in the program but also on the match
between the program and the architecture.
On the other hand, programs with many dependence relations are often considered to be "hard," since
good speedup is usually difficult to obtain for these programs. We note that if the goal of the compiler is to
find the best path among the applicable program transformations for the given program, the decision making
is actually simpler for programs with more data dependencies than for those with few. We are
not claiming that this kind of problem is easy. The really difficult problems for compilers are the programs for
which some critical information cannot be easily guessed by the compilers.
For the above reasons, we pick two problems to demonstrate our ideas. The first example is the
matrix-vector multiply example that we described in chapter 3. It is used here to show the complex
decision-making process and the power of the system. The second example is an LU-factorization problem which
has constructs that are inherently sequential, and as such it is difficult to parallelize for distributed machines.
We will show how our system parallelized it to obtain some satisfactory results.
Another issue is that the results we show below only demonstrate the quality of the implementation of
the framework. An important question is "what are the advantages of our framework over existing models?"
Clearly, the same heuristics may be encoded in other systems to produce similar results. What cannot be
shown in the experimental data is even more important to us. First, our framework allows us to implement
multiple-target parallel compilers; the system knowledge can be reused and transferred between different target
machines. For example, all heuristics that we used in the second example below, except those that are related
to message passing and array distribution, are collected from our experience on the shared-memory machines.
They are transferred over and reused without reprogramming because the heuristics are encoded based on the
machine features. Second, the hier-blackboard approach allows us to decompose the optimization problem
into smaller but specialized modules whose interactions are specified in a higher level blackboard subsystem.
Programming these modules is easier because the modules are more focused and less complex than the original
problem. Programming the interaction among the modules is simpler because we are working on a higher
level abstraction. Third, our system provides a heuristic analysis tool that can help to distill machine features
from heuristics at the knowledge-acquisition phase. Regrettably, some advantages of this framework are not
demonstrated because of time constraints. For example, our framework is designed to be integrated with
self-learning modules, but we have not finished a learning module to show the result. Our preliminary study of
learning modules will be discussed in the next chapter.
8.2.2. The Matrix-Vector Multiply Example Revisited
The matrix-vector multiply problem is considered by most to be simple to parallelize, because no
dependence relations are present. On the other hand, the lack of dependence relations implies that there are
many different ways to parallelize the program and presents a challenge to automatic parallelizing compilers
to find the most effective path. In this example, we illustrate how a heuristic hierarchy may be applied to the
program parallelism optimization process. Based on the subproblem decomposition we described in section 4.3,
the program restructuring process starts by examining the rules on the top layer of the hierarchy. After the
focus of the program is chosen, the transformation subgoals on the next layer are selected. The rules associated
with the subgoal are utilized to choose the applicable transformations. Similarly, when a transformation is
chosen, the rules associated with it are applied to decide the merits and methods of performing the
transformation on the program focus.
The flow of control is decided by the rules in the heuristic hierarchy. We will illustrate the
decision-making process of the system with the matrix-vector multiply example that we used in chapter 3 to
illustrate the complexity of the optimization process for different architectures. We will now examine how the
methodologies discussed above can be used to guide the compiler to generate the programs that were shown in
figure 3.1.
To simplify the discussion we assume that the result vector y has been previously initialized to zero. We
seek to transform this program to programs suitable for three different machines: the BBN Butterfly, the
Purdue Pringle, and the Alliant FX/8. The rules used in this example are listed in Appendix A.
8.2.2.1. Mapping onto the BBN Butterfly
The system starts the program optimization process by consulting the machine knowledge manipulation
system for the list of machine features for the target machine. For example, the fact "parallelize outermost
loop without blocking" is added into the knowledge base by rule a.1 (listed in Appendix A) because the
Butterfly provides a mechanism, GenOnIndex, which can schedule the loops automatically. The system
discovers, among other facts, that memory optimization dominates instruction minimization (rule a.5), locality
is important, and local memory should be used whenever possible (rule a.6). These facts are added to the
system's state space in the working memory.
Next, the transformation heuristic hierarchy is used to optimize the program. First, the
parallelism-matching control layer directs the restructuring of the program. In this example, it is trivial to
select the program focus. By rule b.1, the whole subroutine is chosen as the program focus, since the original
program consists only of a single statement inside the doubly nested loop.
The next step is for the program-restructuring control layer to decide which sequence of
program-restructuring subgoals to achieve. Due to the simplicity of the dependence graph of this program, none
of the transformations which are used to break the data dependence cycles are needed. Thus, the parallelism
improvement subgoal is skipped (rule c.1). For the sake of flexibility, it is best to do processor assignment
toward the end of the transformation process. However, array decomposition can be done only after tasks are
created. So there is a conflict in deciding which of the two subgoals, the task creation and processor allocation
subgoal or the memory access optimization subgoal, should be done first. Our solution to this problem is as
follows. First, we find the tentative process allocation scheme and block the outermost loop to create
"processes." The newly created outermost loop is marked, but is not actually parallelized. The loop instances
of this marked loop form the tentative processes, and this information will be used to guide the array
decompositions in the memory access optimization subgoal. The actual processor allocation is carried out at the
end of the transformation process if the marked loop remains marked till then. This heuristic is encapsulated in
the default ordering of rules c.4, c.5, and c.7.
After the task creation and processor allocation subgoal is picked, the system concentrates its
restructuring efforts on the loop structures. At this stage, applicable transformations include loop interchanging
and loop blocking (to create processes). According to the heuristic (rule e.1), if the program focus is a nested
loop, then loop interchanging is checked to find the best order of the loops before the processes are created.
Therefore, the control goes down to the lower level transformation layer, and rules associated with loop
interchanging are applied. We assume that the arrays in the Butterfly are stored in row order. There are no
dependence relations that prevent us from interchanging the loop, so the loop is interchangeable. However, if
loop j is changed to be the outermost loop, the array a will be accessed in columns no matter how we block
the outer loop to form processes. This is not attractive because it increases the inter-task communications
significantly. Therefore, based on the rules associated with loop interchange, the system decides that the
original loop order is the best and that no loop interchange is needed.
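To make the row-order argument concrete, the following Python sketch lists the flat memory offsets that one task would touch when it owns either one row or one column of a row-major array. The function name touched_offsets and the flat-offset model are assumptions for illustration only; they are not taken from the system described here.

```python
def touched_offsets(n_rows, n_cols, fixed, by_row):
    """Flat row-major offsets of the elements one task accesses.

    by_row=True : the task owns row `fixed` (loop i outermost).
    by_row=False: the task owns column `fixed` (loop j outermost).
    """
    if by_row:
        # One row is a single contiguous run of n_cols elements.
        return [fixed * n_cols + j for j in range(n_cols)]
    # One column is spread over the whole array with stride n_cols.
    return [i * n_cols + fixed for i in range(n_rows)]


if __name__ == "__main__":
    print(touched_offsets(3, 4, 1, by_row=True))   # contiguous
    print(touched_offsets(3, 4, 1, by_row=False))  # strided
```

With loop i outermost each task reads one contiguous block, which suits a block transfer; with loop j outermost every task touches every row of the array.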
The next step is to find a tentative way of allocating the processes to the processors. Since the Butterfly
has an instruction, GenOnIndex, that can schedule the loops automatically, we can parallelize the outermost
loop without blocking (rule a.1). As a result, the outer loop i is marked to form tasks (rule e.4). There are n
instances of the loop i, so n tasks are formed if each loop instance is viewed as a task. This information will
be used to guide the array decompositions when the memory access optimization subgoal is pursued.
After the processor allocation phase, rule c.3 chooses the memory access optimization subgoal. Since
local memory access is faster than global memory access on the Butterfly, locality is important (rule a.6).
Also, the Butterfly supports a "block-transfer" instruction, which allows a block of memory to be transferred
to, or from, the local memory to speed up the data transfer. This makes copying array references inside of
loops into local memory beneficial. In the matrix-vector multiply program, there are two array references in
the nested loops. Each element of array x is accessed once by every instance of the loop i. Also, elements of
the i-th row of the array a are accessed exclusively by loop instance i. Since loop i is marked to be
parallelized in the processor allocation subgoal, every processor that runs loop instance i will have to access
every element of the array x and the i-th row of array a once. Rule f.1 suggests we copy array x and array a into
local memory with block transfer operations. Since the i-th iteration accesses only the i-th row of the array a,
there is no need to copy the whole array. The block transfer operation on array a is later changed by rule f.2
into a block transfer operation on row i of the array a in loop i. This gives us (by applying rule f.3):
for i in [1 .. N] do
    block_transfer(x, x_local, sizeof(x));
    block_transfer(a[i, *], a_local, sizeof(a[i, *]));
    for j in [1 .. M] do
        y[i] := y[i] + a_local[j] * x_local[j];
    endfor
endfor
Since the block transfer statement copying array x does not depend on loop i, it can be moved outside
loop i to form another parallelized loop of P instances, where P is the number of processors (rule f.4). In
this way, the array is copied P times instead of N times, as it was in the original form.
After the memory optimizations are complete, the parallelism-improving subgoal is tried to see if there is
any chance to improve the program further. It is relatively easy for the system to recognize that the inner loop
j is an inner-product operation (rule d.1), so the loop is replaced by an inner-product operation (rule d.2). The
final step involves the processor allocation subgoal again. Since no transformation that might prevent the
parallelizing of the outermost loop i (which is marked for parallelizing) has been performed, the loop is
directly parallelized as shown below.
forall k in [1 .. P] do
    block_transfer(x, x_local, sizeof(x));
endforall
forall i in [1 .. N] do
    block_transfer(a[i, *], a_local, sizeof(a[i, *]));
    y[i] := inner_product(a_local[*], x_local[*]);
endforall
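The shape of the final Butterfly program can be mimicked in a few lines of Python. This is a sequential sketch under stated assumptions, not the generated code: list copies stand in for block_transfer, the loop over i stands in for the forall (each iteration would be an independent task), and the function name matvec_local_copies is hypothetical. Because the sketch runs in one address space, x is copied once rather than once per processor.

```python
def matvec_local_copies(a, x):
    """Compute y = a * x using per-task local copies of the operands."""
    n = len(a)
    y = [0] * n
    # Stands in for the first forall: copy x into local memory.
    x_local = list(x)
    # Stands in for the second forall: one independent task per row i.
    for i in range(n):
        a_local = list(a[i])  # block transfer of row i only
        # Inner-product operation replacing the inner loop over j.
        y[i] = sum(a_ij * x_j for a_ij, x_j in zip(a_local, x_local))
    return y


if __name__ == "__main__":
    print(matvec_local_copies([[1, 2], [3, 4]], [5, 6]))
```

No iteration reads anything written by another iteration, which is why the outer loop can be parallelized directly.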
8.2.2.2. Mapping onto the Pringle/CHiP Architecture
The Pringle/CHiP architecture consists of an array of 64 processors which communicate with each other
via a packet-switched message network. There is no shared memory, and each processor runs one process.
The communication pattern of messages between processors, defined at compile time as a communication
graph, is used to configure the switch network at load time. Each of the memory modules is dual ported. One
port goes to the processor while the other goes to a global bus; this allows the local memory of each processor
to be a page of the global address space of the front-end host. Programs and data are down-loaded to each
processor and the results of a computation are loaded to the host over this bus.
For the same reason as in the case of the Butterfly, the system decides not to change the original order of
loops after the rules in the transformation module, loop interchange, are used to decide the order of loop
headers. The program-restructuring task is different here because process creation on the Pringle is
expensive, and no self-scheduling primitive is available. The best strategy for processor allocation on the
Pringle is to create P processes to run on the P processors that the Pringle has (rule a.2). So the n instances of
the outermost loop i are blocked to form P tasks (rule e.3). The result is shown below:
forall k in [0 .. P-1] do
    for i in [k*n/P .. (k+1)*n/P] do
        for j in [1 .. m] do




Next, the memory access optimization subgoal is invoked to allocate the data. Since the Pringle is a
non-shared memory machine, all the data must be distributed among the processors. Array decompositions are
done by means of inter-process dependence analysis. By checking the bounds of the loops, the system
discovers that the processor which runs process k (k-th iteration of the forall loop) accesses only rows k*n/P to
(k+1)*n/P of the array a. In terms of the dependence relations, this means that no out-of-bounds dependence
(dependence edge that has only one end in the loops) or cross-iteration dependence (dependence whose source
and sink are in different loop iterations) of the array a exists. It is best to store these rows of the array in the
local memory of the processor that runs the task. By rule f.11, the array a is divided into P blocks according
to the memory access pattern, and the P blocks are allocated to local memories in the corresponding
processors. Similarly, array y can be blocked into P "chunks" and stored in the local memories of the processors.
Therefore, each of the processors computes n/P components of the y vector.
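The block decomposition described above can be sketched as a small Python helper that computes which rows each process owns. The function name block_bounds, the use of 0-based rows, and the half-open ranges [lo, hi) are assumptions made for this illustration; integer division is used so that the blocks cover all n rows even when P does not divide n.

```python
def block_bounds(n, P):
    """Return the half-open row range [lo, hi) owned by each process k.

    Process k owns rows k*n/P up to (but not including) (k+1)*n/P,
    using integer division so every row belongs to exactly one process.
    """
    return [(k * n // P, (k + 1) * n // P) for k in range(P)]


if __name__ == "__main__":
    for k, (lo, hi) in enumerate(block_bounds(10, 4)):
        print(f"process {k}: rows {lo}..{hi - 1}")
```

Each process stores its block of a and the matching chunk of y locally, so the only data it still needs from elsewhere is the vector x.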
Since each process uses all the elements of array x, each processor needs to access the whole array x no
matter where the array is allocated. If we are free to allocate the array x anywhere, the most direct method is
to put it in one processor, say PE0, and then "broadcast" it to other processors by means of a pipeline process
(rule f.12). To accomplish this, each element of x is passed from one processor to the next by using a
"channel" variable. This transformation is termed "pipelining," which is a modified version of the
transformation "scalar expansion" that passes the data through "channel variables" instead of temporary
variables. The channel variable Ch_x[k] implements a communication channel between processor k and processor
k+1. Processor k = 0 reads the value of x[j] and puts it in Ch_x[0]. Processor k = 1 reads the value in Ch_x[0]
and puts it into Ch_x[1], etc. This approach is possible because the ratio of computation to communication is
about 1 for Pringle. The overhead of this approach is the initial setup time for the pipeline, P pairs of
write-read operations. The result of the transformation is shown below:
forall k in [0 .. P-1] do
  local tmp;
  for j in [1 .. m] do
    tmp = if (k==0) then x[j] else Ch_x[k-1];
    Ch_x[k] = tmp;
    for i in [k*n/P .. (k+1)*n/P] do




On some non-shared memory machines it is too costly to send a message consisting of only one word
(for example, the Intel iPSC/2 and the nCUBE 2). In this case, it is best to broadcast large segments of the x
vector by using a tree structure.
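The channel-variable pipelining used for the Pringle can be simulated with a short Python sketch. The function pipeline_broadcast is a hypothetical name, and the simulation runs the processors one after another, so it shows only the data movement and not the overlap of communication with computation that a real pipeline would provide.

```python
def pipeline_broadcast(x, P):
    """Simulate passing each x[j] down a chain of P processors.

    ch[k] plays the role of the channel variable Ch_x[k] between
    processor k and processor k+1. Returns the copy of x seen by each
    processor and the number of channel writes performed.
    """
    received = [[] for _ in range(P)]
    writes = 0
    for j in range(len(x)):
        ch = [None] * P
        for k in range(P):
            tmp = x[j] if k == 0 else ch[k - 1]   # read from the left neighbour
            ch[k] = tmp                            # forward to the right neighbour
            writes += 1
            received[k].append(tmp)
    return received, writes
```

In this model every element costs one channel write per processor, which is acceptable only when single-word messages are cheap; that is the reason the text suggests tree-structured broadcasts of large segments on machines such as the iPSC/2 and the nCUBE 2.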
8.2.2.3. Mapping onto the Alliant FX/8
In the case of the Alliant FX/8 there are three important programming issues. First, because of the
powerful vector instruction set in each processor, one should exploit as many vector operations as possible.
Second, since a cache access is twice as fast as a memory access, the programmer must force as many memory
accesses to be from the shared data cache as possible. Third, because only one operand in a vector instruction
may come from memory or cache, it is important to keep vector operands that are used repeatedly in vector
registers.
Most parallel compilers can recognize the inner-product operation in the original matrix-vector multiply
program and translate the program into the following form:
for i in 1 .. n do
  y[i] = inner_product(a[i, *], x);
Although the Alliant supports fast inner-product operations, this transformation does not really utilize the
parallelism capabilities of the Alliant FX/8. The array x is accessed n times on each processor; thus the array
x needs to be brought into the cache repeatedly. Since each vector register in the Alliant FX/8 can hold only
thirty-two words of data, the vector x and the matrix a in the sample program need to be loaded into the vector
registers repeatedly. This data traffic floods the bus and slows down the computations significantly.
In general, without intelligent program analysis, this communication bottleneck problem is hard to solve.
Our system tries to improve the matching between the program and the computational model of the Alliant by
examining and managing the memory accesses intelligently.
As in the case of the Butterfly, task creation and processor allocation is the first subgoal selected. Since
the Alliant has a vector capability, both the vector processing parallelism in the innermost loop and the
multiprocessing parallelism in the outermost loop need to be explored. Before the outer loop is blocked to form
tasks and the inner loop is blocked to form vector operations, loop interchange is considered to find the best
ordering of the loop headers (rule e.1). Thus control goes down to the transformation layer, and the rules
associated with the transformation "loop interchange" are applied. First, the nested loops i and j in the original
source are checked, and as before, it is concluded that they are interchangeable. Next, rules relating to loop
orders are applied to decide the best order of the loop headers. Program size matching and memory utilization
matching indices can be used to select the loop order. Rule a.5 suggests that memory optimization dominates
instruction minimization, so memory optimization matching is considered.
The matrix-vector multiply program accesses vector x n times in total, once for each loop instance of
loop i. Loop j is the loop that scans through vector x. If loop j is the inner loop and loop i is the outer loop,
then each value of the vector x will be accessed once by every loop instance of loop i. Therefore, the vector
needs to be brought into the cache repeatedly. On the other hand, if loop i is the inner loop and loop j is the
outer loop, the value x[j] is brought into the cache and used by all loop instances of the inner loop i for each
loop instance of the outer loop j. In this loop order, the network traffic for references to vector x is decreased
significantly. Therefore, the loop order where loop j is outside is preferred according to the memory allocation
matching function. In other words, the loops need to be interchanged.
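The effect of the loop order on references to x can be illustrated with a deliberately crude Python model. The function x_loads is hypothetical, and the model is an assumption made for illustration: it pretends that only the most recently used element of x stays resident, which is far simpler than the Alliant's actual cache.

```python
def x_loads(n, m, j_outer):
    """Count loads of x under a toy model that keeps only the last x[j] used."""
    loads = 0
    last = None
    if j_outer:
        order = ((i, j) for j in range(m) for i in range(n))
    else:
        order = ((i, j) for i in range(n) for j in range(m))
    for _, j in order:
        if j != last:       # x[j] is not the value currently held
            loads += 1
            last = j
    return loads
```

For n = 4 and m = 3 this model counts 12 loads of x when j is the inner loop and 3 loads when j is the outer loop, which is the kind of difference the memory matching function is meant to capture.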
After the loops are interchanged, the innermost loop is blocked to form vector operations, and the outermost
loop is translated into tasks and may be blocked to form processes. For the vector loop blocking, the
inner loop i is blocked according to the vector register size of the Alliant (rule e.2). The vector operation is
created by vectorizing the innermost loop after the blocking. The resulting program is shown below. Each
loop instance of the outermost loop j forms a task. Since the Alliant instruction set can automatically allocate
the processes to the eight processors, no loop blocking is needed to match the number of processes with the
number of processors (rule a.1). Subsequently, loop j is marked to be parallelized.
for j in [1 .. m] do
  for k in [0 .. n/32-1] do
    k1 = k*32+1;
    k2 = (k+1) * 32;
    y[k1 .. k2] sum= a[k1 .. k2, j] * x[j];
  endfor;
endfor;
The next step is to optimize memory access. Rule a.7 suggests that keeping one vector operand in a vector
register is beneficial. Since vector segment y[k*32+1 .. (k+1)*32] is used repeatedly by each instance of
the outer loop j, it is best to keep this segment in the vector register. This can be accomplished by interchanging
loops j and k (rule f.13). Note that in the previous task creation and processor allocation subgoal, the loop
j is marked as "to be parallelized." However, according to rule f.14, the utilization of vector registers and
vector operations is weighted to be more important. So the previous decision is revoked, and the loops are
interchanged. Loop k becomes the outermost loop and is thus parallelized. The resulting program is:
forall k in [0 .. n/32-1] do
  local k1, k2 : int;
  k1 = k*32+1;
  k2 = (k+1) * 32;
  for j in [1 .. m] do
    y[k1 .. k2] sum= a[k1 .. k2, j] * x[j];
  endfor;
endforall;
In the final version, each 32-word y vector segment can be saved in a register for the lifetime of the process
and can be written to memory only at the end of the computation. Experiments performed in collaboration
with Dan Sorensen [Sore] at the Illinois Center for Supercomputer Research and Development have shown
that this implementation of the program is the fastest version of a matrix-vector multiply available for the
machine.
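The final loop structure can be mimicked in Python as a rough sketch. The name strip_mined_matvec is hypothetical, the list seg merely stands in for a vector register, and the min() guard for a final short segment is an addition of this sketch; the listing above assumes that n is a multiple of 32.

```python
VLEN = 32  # vector register length stated in the text for the Alliant FX/8


def strip_mined_matvec(a, x, vlen=VLEN):
    """y = a * x, with rows processed in segments of up to vlen elements."""
    n, m = len(a), len(x)
    y = [0.0] * n
    for k1 in range(0, n, vlen):          # outer loop over segments (the forall loop in the text)
        k2 = min(k1 + vlen, n)
        seg = [0.0] * (k2 - k1)           # stands in for the vector register holding y[k1..k2]
        for j in range(m):
            xj = x[j]
            for i in range(k1, k2):       # one "vector operation" on the segment
                seg[i - k1] += a[i][j] * xj
        y[k1:k2] = seg                    # written back to memory once per segment
    return y
```

The point of the structure is that each y segment is accumulated in place across all values of j and stored only once, while different segments are independent and can be assigned to different processors.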
The matrix-vector multiply example described above served three purposes:
1. It demonstrated how the inference engine works.
2. It illustrated that a different sequence of transformations was required to produce an optimal program
for each machine.
3. It showed the complexity of the program parallelism optimization process.
Many heuristics were needed even for this simple program. This reinforces our view that an expert systems
approach is more flexible and extensible than the conventional hard-wired heuristics approach.
On the other hand, the example described above is far too simple to illustrate many of the most interesting
and important issues in program restructuring. In particular, it fails to illustrate the issues relating to the
introduction of synchronization needed in many problems to satisfy data dependence constraints between
parallel tasks. This topic is considered in the next example.
8.2.3. A More Realistic Example: LU-Factorization
Our next experiment concerns parallelizing programs for distributed-memory parallel computers. The
example that we choose involves solving a system of linear equations with Gaussian elimination. This example
was used in [Karp87] as a realistic example to show "the state of the art of parallel programming and what
a sorry state that art is in"†. Here we chose the nCUBE 2, a distributed-memory architecture, to show how
heuristics and architecture properties influence the decision making of the compiler.
The procedure for solving the linear system is:
1. LU factorization: Use Gaussian elimination to compute two triangular matrices L and U such that
A = LU. The original problem becomes LU x = b, or L y = b where y = U x.
2. Forward elimination: Solve the lower triangular linear system L y = b by forward elimination.
3. Back substitution: Solve the upper triangular linear system U x = y by back substitution.
Since the time complexity for the three steps is O(N^3), O(N^2), and O(N^2), respectively, the LU factorization
accounts for most of the time in solving the linear system. Therefore, we will only discuss step 1, factoring
the matrix A into its LU components. For linear systems that have small or zero diagonal elements, Gaussian
elimination may generate disastrous results. To avoid computational errors, a technique called partial pivoting
is used to interchange the rows of the matrix so that the largest remaining element in the kth column is used as
the pivot.
In the program shown in figure 8.5, the following procedure is applied to each column k in turn:
a) Find pivot: Compare the elements on or below the diagonal of the current column and find the row index
of the element whose absolute value is the maximum. This element is called the pivot.
† Karp derives different versions of this program for several models of parallel architecture and explains difficulties
encountered for each model.
procedure lu_factorization(n, tolerance, a) returns: (a, ipvt, info);
param
  n: integer; -- sizes of the rows and columns
  tolerance: real; -- minimum size for acceptable pivot
  a: array[1..n, 1..n] of real; -- array to be factorized
  ipvt: array[1..n] of integer; -- record pivoting rows
  info: integer; -- how many nonzero pivots do we have
var
  rmax, t : real;
  imax, in: integer;
begin
  in = 0; info = 0;
  for k in 1 .. n-1 loop
    imax := k; rmax := abs(a[k,k]);
    for i in k+1 .. n loop
      if (abs(a[i,k]) > rmax) then
        imax := i; rmax := abs(a[i,k]);
      end;
    end;
    if (rmax < tolerance) then in = k; end;
    ipvt[k] := imax;
    --- find pivot row
    if (in != k) then
      info := info + 1;
      if (imax != k) then
        for j in 1 .. n loop --- interchange row k and imax
          a[k,j], a[imax,j] := swap(a[imax,j], a[k,j]);
        end;
      end;
      t := -1.0 / a[k,k];
      for i in k+1 .. n loop
        a[i,k] := a[i,k] * t;
      end;
      --- scale the k-th column
      for i in k+1 .. n loop --- apply multipliers to rows
        for j in k+1 .. n loop






Figure 8.5. The LU_factorization program represented in Blaze.
b) Move pivot to diagonal: Interchange the diagonal row and the row that contains the pivot so that the
pivot element will be moved to the diagonal.
c) Compute the multipliers: Divide the elements below the pivot by the pivot to produce a set of multipliers.
d) Apply multipliers to all rows below the diagonal: For each row below the diagonal, multiply the elements
to the right of the pivot by the multiplier of that row and subtract the product from the
corresponding part of the row.
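Steps (a) through (d) can be sketched sequentially in Python as follows. This is an illustration rather than a transcription of figure 8.5: the name lu_factor is hypothetical, indices are 0-based, and an unacceptable pivot is handled by skipping the column instead of by the flag variable used in the Blaze program. As in the figure, the stored multipliers are negated, so step (d) adds rather than subtracts.

```python
def lu_factor(a, tolerance=1e-12):
    """Factor the square matrix a in place; return (ipvt, info).

    After the call, the strict lower triangle holds the negated
    multipliers and the upper triangle holds U.
    """
    n = len(a)
    ipvt = list(range(n))
    info = 0
    for k in range(n - 1):
        # (a) find pivot: largest |a[i][k]| on or below the diagonal
        imax = max(range(k, n), key=lambda i: abs(a[i][k]))
        ipvt[k] = imax
        if abs(a[imax][k]) < tolerance:
            continue                      # no acceptable pivot in this column
        info += 1
        # (b) move pivot to the diagonal by swapping rows k and imax
        if imax != k:
            a[k], a[imax] = a[imax], a[k]
        # (c) compute the (negated) multipliers
        t = -1.0 / a[k][k]
        for i in range(k + 1, n):
            a[i][k] *= t
        # (d) apply the multipliers to the rows below the diagonal
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] += a[i][k] * a[k][j]
    return ipvt, info
```

The loop over k is the sequential spine of the algorithm; the inner loops of steps (c) and (d) are where the parallelism discussed in the rest of this section comes from.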
Some obstacles in parallelizing this procedure are as follows:
1. The procedure is inherently sequential in the sense that the kth iteration of the outermost loop cannot be
performed until after the (k-1)th iteration is finished.
2. On a distributed-memory machine, the cost of the procedure is very sensitive to the distribution of the
array a.
3. Steps (a) and (c) operate on columns and steps (b) and (d) operate on rows. None of the steps (a) - (d)
can be overlapped. The mix of sequential and parallel parts presented in this example is typical of many
numerical procedures. We deliberately chose this "non-perfect" algorithm to see how an automatic
parallel compiler can utilize some "generic" heuristics (as opposed to heuristics restricted to a particular
example) to parallelize the program.
The SPMD (single program, multiple data) model is used for distributed-memory machines because it is
convenient to have only one program for all the processors. (The control flow of the program in each processor
is normally decided by the processor identifier.) To generate programs for the distributed-memory SPMD
model, we assume that the outermost loop will be distributed across the processors either in blocks or in
cycles. Initially, a pair of I/O statements will be generated for all cross-task dependence relations. The I/O
statements between two processors may be merged with the message consolidation algorithm we described in
chapter 6. Computation that uses only local data of a particular processor will be done on that processor only.
For computation that needs data from more than one processor, the distribution of the computation depends on
the cost of getting the external data and the cost of distributing the results to the processors that use them. The
synchronization points of the program are taken into consideration. The computation will be duplicated in all
processors that use the data if the computation can be done locally before the expected arrival time of the data
when the data are computed by other processors.
The process of the transformation that the system went through is explained below.
In the LU factorization program, the outermost loop k is distributed onto the processors in cyclic distribution
because the cross-iteration dependence relations are of distance one. (This means that if the loop is distributed
in blocks the execution will have to be serialized.)
One of the most important decisions for the compiler to make is how to distribute the data; for this program,
the array a is the most critical array in the subroutine. There are many ways to distribute a non-sparse
array on distributed-memory computers; the most commonly used methods include block distribution (each
processor gets a chunk of the array, either rows or columns), cyclic distribution (the rows or columns of the array are
distributed to the processors in a round-robin scheme, for example, processor i gets columns i, p+i, 2p+i, ...), or
hybrid distribution (applying either block or cyclic distribution to each dimension of the array, such as
block-block or cyclic-block distribution on a 2-dimensional array).
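The block and cyclic distributions of columns can be written as small owner functions. The following Python sketch uses hypothetical names (block_owner, cyclic_owner, columns_of) and 0-based column and processor numbers; it is meant only to make the two mappings concrete.

```python
def block_owner(col, n, p):
    """Owner of column col (0-based) when n columns are split into p contiguous chunks."""
    chunk = (n + p - 1) // p          # ceil(n / p) columns per processor
    return col // chunk


def cyclic_owner(col, p):
    """Owner of column col (0-based) under round-robin distribution."""
    return col % p


def columns_of(owner_fn, proc, n):
    """List the columns assigned to processor proc by owner_fn."""
    return [c for c in range(n) if owner_fn(c) == proc]
```

With 8 columns and 4 processors, processor 1 holds columns 2 and 3 under block distribution and columns 1 and 5 under cyclic distribution. Because the factorization retires columns from left to right, the cyclic mapping keeps work on every processor for longer than the block mapping does.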
The objective is to distribute arrays so as to minimize the communication between processors. A heuristic
for distributing arrays is that if the index of the parallel loop appears as a simple subscript (containing no
operations) in all references of an array, then the array is to be distributed based on the distribution pattern of the
loop. As stated above, the LU factorization program is decomposed into four steps with data dependencies
between each of these four steps. For step (d), it does not matter which subscript we choose to distribute,
because the shape of the array a used in step (d) is symmetric. For steps (a) and (c), if array a is distributed
in rows, the computation needs data of size (n-k)/p and (n-1)/p from every processor, respectively, but
can be executed in parallel. If the array a is distributed in columns, then all the data used in the
computation belong to the processor ik that has the column k. This means that the computation can be done
sequentially in processor ik without reading from other processors. For step (a), imax and in have to be sent to
all other processors. For step (d), elements of array a[k..n,k] need to be distributed to all processors in both
kinds of array distributions. For step (c), distributing by rows would involve two processors exchanging data
of size n with each other, whereas the column distribution will allow all processors to work on their own data
in parallel. As a result, cyclic distribution of the columns is chosen.
Since the array a is distributed in columns, the process of selecting the pivot in step (a) involves only
local data of processor ik = k mod p at the kth iteration. So the computation is localized to processor ik and
the values imax and in are broadcast to all processors by processor ik. Similarly, scaling the column k at the
kth iteration of the outermost loop in step (c) involves only local variables in processor ik but the result is
needed by all processors. So the computation is again restricted to processor ik and the column a[k..n,k] is
then broadcast to all processors by processor ik. The resulting parallelized program is called P1 and is shown
in figure 8.6.
For the program shown in figure 8.6, there are two broadcast statements issued by processor ik; one
intuitive way to improve the performance is to merge the two broadcast statements. This move is non-trivial
for the compiler since the dependencies in step (b) (which swaps rows k and imax) prevent the two broadcast
statements from being merged. Although this restriction can be overridden through user interaction, the
speedup that results from merging the two broadcast statements is not very significant. This is due to the
overhead of copying the data into a temporary buffer and the additional data synchronization involved. This
version of the program is called P2 and its results are shown in column P2 of tables 8.4 and 8.5.
The current nCUBE 2 compiler produces very slow code for the calculation of addresses of multi-dimensional
array references. This implies that whenever there is a choice, making array references of higher
dimensions loop invariant is better than making array references of lower dimensions loop invariant. When
applied to our example, in step (d) w[i] is loop invariant in loop j, and hence can be replaced by a scalar
variable tmp through vector scalarization. On the other hand, if we interchange loop i and loop j, the reference
a[k,j] becomes loop invariant in the new inner loop i and can be scalarized. The above heuristic says that
interchanging the loops i and j and scalarizing the reference a[k,j] is better than scalarizing reference w[i] in
loop j. An additional speedup of about five percent was obtained through this transformation. The resulting
program is called P3 and is shown in figure 8.7.
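The two alternatives for step (d) can be placed side by side in a Python sketch. The function names are hypothetical and indices are 0-based; the sketch shows only that the two loop orders compute the same update while scalarizing different references, and says nothing about their relative speed on the nCUBE 2.

```python
def update_rows_first(a, w, k):
    """Step (d) with i outer: the multiplier w[i] is the scalarized value."""
    n = len(a)
    for i in range(k + 1, n):
        tmp = w[i]                      # invariant in the inner j loop
        for j in range(k + 1, len(a[0])):
            a[i][j] += tmp * a[k][j]


def update_columns_first(a, w, k):
    """Step (d) with j outer: the 2-D reference a[k][j] is the scalarized value."""
    n = len(a)
    for j in range(k + 1, len(a[0])):
        tmp = a[k][j]                   # invariant in the inner i loop
        for i in range(k + 1, n):
            a[i][j] += w[i] * tmp
```

The interchange is legal here because row k is only read, never written, inside the update, so every a[i][j] receives the same single contribution in either order.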
The sequence of the transformations that are applied in the above example is listed in table 8.3.
We translated the final optimized Blaze program into Fortran and tested it on a 64-node nCUBE 2. The
nCUBE 2 we used has 4 megabytes of memory in its first 16 processors and 1 megabyte in the rest of the
processors. We ran the tests with different sizes of array a, starting with 100 and doubling the size at each step up to
the largest size which would fit on the machine (800 for cubes of dimension ≤ 1, and 1600 and 3200 for larger
cubes). For sequential cases, we can only run the program with an array of size 800, so only the speedups for
arrays of size up to 800 are reported. The timing is reported for all tests.
We can see that the speedup is slowly going down for cubes of larger size. The cost of broadcasting
increases as the size of the cube increases and the operations between communication points decrease.
Considering that the program at hand has sequential control flow, the speedup we obtain is very satisfactory. The
rules that are used in this example are listed in Appendix A.
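One way to read the trend is to convert speedup into parallel efficiency, that is, speedup divided by the number of processors. The Python sketch below does this for the P3 figures at array size 800 reported in table 8.5; the helper names are hypothetical, and efficiency is a derived quantity that is not itself reported in the tables.

```python
# P3 speedups for array size 800, copied from Table 8.5.
P3_SPEEDUP_800 = {2: 2.055468, 4: 4.062814, 8: 7.916694, 16: 14.909070, 32: 26.716654}


def efficiency(speedup, processors):
    """Parallel efficiency: speedup divided by the number of processors."""
    return speedup / processors


def efficiencies(table):
    """Map each processor count in table to its efficiency."""
    return {p: efficiency(s, p) for p, s in sorted(table.items())}
```

For these five entries the efficiency falls steadily as the processor count grows, which is consistent with the observation that broadcasting costs rise while the work between communication points shrinks.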
8.3. Summary
In this chapter, we presented a prototype intelligent parallel programming environment and its
components. Two examples and experimental results were also given.
procedure lu_factorization(n, m, tolerance, a, p, pid) returns: (a, ipvt, info);
param
  n, m: integer; -- sizes of row and columns (m=n/p)
  tolerance: real; -- minimum size for acceptable pivot
  a: array[1..n, 1..m] of real; -- array to be factorized
  ipvt: array[1..n] of integer; -- record pivoting rows
  info: integer; -- how many non-zero pivots do we have
  p, pid: integer; -- number of processors and processor id
var




  info = 0; maxc = (n+p-1) / p;
  if (pid > mod(n-1, p)) then maxc = maxc - 1; end;
  for k in 1..n-1 loop
    -- ik = processor that has the column k of a0.
    -- ij = column number of k (of a0) in a of processor ik
    ik := mod(k-1, p); ij := (k+p-1)/p;
    if (pid = ik) then
      imax := k; rmax := abs(a[k,ij]); --- find pivoting row
      for i in k+1..n loop
        if (abs(a[i,ij]) > rmax) then
          imax := i; rmax := abs(a[i,ij]);
        end;
      end;
      if (rmax < tolerance) then in := k; end;
      w[1] := imax; w[2] := in;
    end;
    imax, in := broadcast(ik, w[1..2], 2);
    ipvt[k] := imax;
    if (in != k) then
      info := info + 1;
      if (imax != k) then --- interchange row k and imax
        for j in 1..maxc loop
          a[k,j], a[imax,j] := swap(a[imax,j], a[k,j]);
        end;
      end;
      if (pid = ik) then --- scale the k-th column
        t := -1.0 / a[k,ij];
        for i in k+1..n loop
          a[i,ij] := a[i,ij] * t;
        end;
      end;
      w[k..n] := broadcast(ik, a[k..n,ij], n-k+1);
      if (pid != ik) then ij := ij + 1; endif;
      for i in k+1..n loop --- apply multipliers to rows
        tmp := w[i];
        for j in ij..maxc loop
          a[i,j] := a[i,j] + tmp * a[k,j];
        end;
      end;
Figure 8.6. The parallelized LU factorization program for distributed-memory computers.
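The index arithmetic at the top of the loop in figure 8.6 can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes that "/" in the listing denotes integer division, with 1-based column numbers and 0-based processor identifiers.

```python
def owner_and_local_index(k, p):
    """For global column k (1-based), return (ik, ij) as computed in the listing."""
    ik = (k - 1) % p            # ik := mod(k-1, p)
    ij = (k + p - 1) // p       # ij := (k+p-1)/p, assuming integer division
    return ik, ij


def local_column_count(n, p, pid):
    """Number of columns held by processor pid (the value maxc in the listing)."""
    maxc = (n + p - 1) // p
    if pid > (n - 1) % p:
        maxc -= 1
    return maxc
```

Under these assumptions, with n = 10 and p = 4, processors 0 and 1 each hold three columns and processors 2 and 3 each hold two, and the local indices of each processor's columns run consecutively from 1.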
procedure lu_factorization(n, m, tolerance, a, p, pid) returns: (a, ipvt, info);
  same as in figure 8.6 except the last loop:
  if (pid != ik) then ij := ij + 1; endif;
  for j in ij..maxc loop --- apply multipliers to rows
    tmp = a[k,j];
    for i in k+1..n loop
      a[i,j] := a[i,j] + w[i] * tmp;
    end;
  end;
Figure 8.7. The optimized LU factorization program with last loop interchanged.
Table 8.3. Sequence of transformations applied in the example.
• Loop blocking (k, cyclic)
• Cyclic distribution of columns of array a
• Localize computations of finding the pivot and scaling the kth column
• Convert cross-task dependencies into messages
• Change read-write pairs into broadcasting routines
• Consolidate messages for imax and in and messages for a[k..n,k]
  (this gives us P1).
• Apply statement reordering and message consolidation to merge the
  above two broadcast calls (this gives us P2).
• Interchange loops i and j and then scalarize vector a[k,j]
  (this gives us P3).
Table 8.4. Test results for the LU_factorization (time in seconds), where S1 is the sequential program, and
P1, P2, and P3 are the parallelized programs as discussed above.






# of processors  array size          P1          P2          P3
       2             50        0.205675    0.199552    0.192359
       2            100        1.418245    1.405851    1.351634
       2            200       10.600395   10.575588   10.156595
       2            400       82.200417   82.150757   78.856606
       2            800      648.020020  647.920776  621.806885
       4             50        0.141532    0.125691    0.119751
       4            100        0.818316    0.786300    0.749784
       4            200        5.648145    5.583807    5.336084
       4            400       42.314747   42.184917   40.385593
       4            800      328.515936  328.253601  314.585968
       8             50        0.131961    0.096773    0.091504
       8            100        0.564934    0.493654    0.466063
       8            200        3.272318    3.129107    2.967397
       8            400       22.621096   22.332933   21.280478
       8            800      169.469940  168.888718  161.444214
      16             50        0.130673    0.095632    0.090602
      16            100        0.454801    0.383533    0.360316
      16            200        2.145916    2.002870    1.883705
      16            400       12.973400   12.687386   12.006158
      16            800       90.642464   90.074875   85.726631
      32             50        0.136305    0.094692    0.089864
      32            100        0.409748    0.324679    0.303406
      32            200        1.600865    1.429922    1.331797
      32            400        8.199018    7.855660    7.359412
      32            800       51.325867   50.639671   47.839237
      64             50        0.144678    0.096747    0.091986
      64            100        0.404622    0.306533    0.286258
      64            200        1.364162    1.165164    1.077079
      64            400        5.909894    5.512052    5.106190
      64            800       32.021828   31.221504   29.192616
Table 8.5. Speedup for the LU_factorization program.
# of processors  array size          P1          P2          P3
       2             50        1.629231    1.679222    1.742014
       2            100        1.831087    1.847230    1.921326
       2            200        1.918095    1.922594    2.001908
       2            400        1.955501    1.956683    2.038421
       2            800        1.972322    1.972624    2.055468
       4             50        2.367606    2.665998    2.798240
       4            100        3.173505    3.302722    3.463571
       4            200        3.599866    3.641345    3.810391
       4            400        3.798746    3.810437    3.980206
       4            800        3.890540    3.893649    4.062814
       8             50        2.539326    3.462660    3.662048
       8            100        4.596873    5.260628    5.572058
       8            200        6.213506    6.497881    6.851987
       8            400        7.105888    7.197575    7.553541
       8            800        7.541776    7.567731    7.916694
      16             50        2.564355    3.503974    3.698506
      16            100        5.710036    6.771073    7.207368
      16            200        9.475005   10.151715   10.797922
      16            400       12.390196   12.669511   13.388377
      16            800       14.100503   14.189355   14.909070
      32             50        2.458398    3.538757    3.728879
      32            100        6.337871    7.998454    8.559257
      32            200       12.700987   14.219353   15.267015
      32            400       19.605150   20.462057   21.841821
      32            800       24.901759   25.239192   26.716654
      64             50        2.316123    3.463591    3.642859
      64            100        6.418163    8.471943    9.071991
      64            200       14.904803   17.450389   18.877505
      64            400       27.198959   29.162091   31.480019





9.1. Summary of the Thesis
In this thesis we have discussed issues related to the construction of intelligent parallel compilers for
different parallel architectures.
First, we introduced a new program optimization model called the feature-directed program optimization
model. Under this model, the program optimization process is driven by architectural features and program
dependence graphs. The major differences between this model and existing models are the following:
1. Both features of the program and the machine influence the decision-making process. The model can
respond well to different programs and architectures.
2. The whole decision tree of the program optimization process is considered. Thus, it is possible for the
optimal version of a program for a target machine to be discovered.
3. Systematic state-space search algorithms (such as A*) can be easily integrated for fully automatic
program optimization and pruning non-promising branches in the decision tree.
4. Machine features are explicitly encoded in the heuristics. This allows systematic analysis of the
heuristics and provides hooks for organizing knowledge and integrating self-learning modules.
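The kind of state-space search mentioned in item 3 can be illustrated with a generic A* routine in Python. This is a textbook formulation, not the search procedure of the prototype system: the function a_star and its arguments are hypothetical, states would correspond to program versions, edges to transformations, and costs to values from the performance prediction model.

```python
import heapq


def a_star(start, goal_test, successors, heuristic):
    """Return (cost, path) of a cheapest path from start to a goal state.

    successors(state) yields (step_cost, next_state); heuristic(state)
    must never overestimate the remaining cost.
    """
    frontier = [(heuristic(start), 0, start, [start])]
    best = {start: 0}
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if goal_test(state):
            return cost, path
        if cost > best.get(state, float("inf")):
            continue                               # stale queue entry
        for step, nxt in successors(state):
            new_cost = cost + step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt, path + [nxt]))
    return None
```

Branches whose estimated total cost exceeds that of a competing branch simply stay in the queue and are never expanded, which is the pruning effect the model relies on.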
Next, a framework for realizing this model in intelligent parallel compilers is introduced. The framework
is based on the following essential components:
• Flexible machine feature manipulation. An object-oriented machine feature representation and manipulation
scheme is designed. The machine feature manipulation model is the foundation for knowledge
organization and generalization.
• Accurate performance prediction. A performance prediction model based on machine features and
performance-characterizing factors has been designed. The prediction model is used in heuristic-based
state-space search algorithms to compute heuristic functions, and in the rule-based system implementation
to discard non-promising paths. The prediction model is highly flexible and can be adjusted to suit
different objectives at different stages of compiling and accommodate different classes of parallel
computers. Many useful evaluation functions for the performance characterization factors are listed in
chapter 6.
• Inference capability. Inference capability and some AI techniques are integrated to improve efficiency of
the heuristic-based search algorithms.
• Effective run-time tests. Run-time tests can help to optimize programs that fail static analysis. However,
excessive run-time tests may do more harm than good to the performance of the program. A technique
using constraint propagation and constant folding to minimize run-time tests is introduced.
• Modular knowledge encapsulation. Under this framework, program transformations are viewed as
intelligent modules that contain both heuristics and techniques about the transformations. Each module can
evaluate its applicability and possible contributions based on the current program structures and
hardware features. This makes it possible to dynamically decide the sequence of transformations based
on features of the program and the target hardware.
A prototype intelligent program optimization system is built to demonstrate the ideas we have derived in
this thesis. The system is implemented as a collection of expert systems and is based on the feature-directed
program optimization framework and the hier-blackboard model that we discussed in chapter 4. The system
integrates state-of-the-art AI technologies with advanced program-restructuring techniques. Twenty-one
program transformation techniques and a host of heuristics to utilize these transformations are implemented in
the system. The system can dynamically select the most appropriate program transformation at compile time
or generate minimal run-time tests to determine the control flow at run-time. New target machines and
program transformation heuristics can be easily incorporated into the system. This allows the system to be used
as a test bed for new program transformation heuristics or new hardware designs of parallel computers.
The prototype system is realized by the implementation of the following subsystems:
• A machine knowledge manipulation system as discussed in chapter 5.
• A hier-blackboard simulator that features hierarchical problem-solving control and opportunistic
reasoning.
• An efficient performance prediction subsystem that can accurately predict the performance of programs
and performance changes caused by program transformations.
• A knowledge base of transformation knowledge. Various heuristics for loop optimization, memory
hierarchy utilization, program partitioning, scheduling, and interprocess communication and
synchronization for shared- and distributed-memory multi-processor computers are included in the system
knowledge base.
• A program transformation subsystem. Twenty-one program transformation techniques are included and
the number is steadily growing.
Theoretical foundations of two program optimization techniques, message consolidation and
array reshaping, are introduced. Message consolidation can be used to minimize communication cost for
distributed-memory parallel computers. An algorithm that can find the optimal grouping of the messages and
is deadlock free is also presented. Array reshaping is a generic mapping to modify array storage patterns. It
can be applied in many different ways to improve the data storage and communication processors. Some of
these methodologies were exploited in chapter 6.
Different AI techniques to increase the degree of intelligence and improve the efficiency of compilers are
also discussed.
9.2. Contributions
Our contributions to the field of parallel compiling can be itemized as follows:
1. We have laid down a foundation for systematic and automatic program optimization.
2. We have developed a practical framework for the construction of multiple-target parallel
compilers.
3. We have formulated the program optimization problem as a planning problem and have
derived several systematic algorithms for optimizing parallel programs.
4. We have developed a hierarchical blackboard problem-solving model which is highly parallel and
flexible and is suitable for program optimization.
5. We have designed a new machine classification and knowledge manipulation scheme. This
scheme allows better organization of the program-restructuring heuristics and makes porting the
system to new parallel machines easier.
6. We have designed an accurate, efficient, and flexible performance prediction model to estimate the
performance of programs on different parallel computers. The performance prediction system
features symbolic processing and forms a basis for the decision making of intelligent parallel
compilers.
7. We have studied two program transformation techniques, message consolidation and array
reshaping, and their application in improving program parallelism. We gave a theoretical foundation for
these two transformations and derived conditions and algorithms for utilizing these two
transformations profitably on distributed-memory architectures.
8. We have built a prototype parallel programming environment to demonstrate the ideas and to
experiment with different heuristics.
9.3. Future Work
The following problems are logical follow-ups of this research and should be probed further.
1. Study the orchestration effects of linking several program transformation techniques in the
systematic program optimization process.
2. Explore the parallelism in the hierarchical problem-solving model.
3. Integrate self-learning modules to improve the intelligence level of the system.
9.3.1. Chaining Multiple Program Transformations
Although heuristics may combine multiple transformations to achieve certain specific goals, the
procedure we used in Chapter 3 does not address this problem explicitly. Instead, the group of transformations
is treated as a new transformation. The lack of systematic treatment of the accumulated effects of program
transformations is a potential problem for the algorithm. One possible way of addressing this problem is to
allow several steps of look-ahead. This will allow the system to discover accumulated effects not specified in
the heuristics. On the other hand, the known relations between the transformations in the heuristics should not
be abandoned, so a representation scheme that integrates this knowledge without lumping the multiple
transformations into a new transformation should be developed.
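As an illustration of the look-ahead idea, the following Python sketch tries every transformation sequence up to a given depth and keeps the cheapest one. It is not the procedure of Chapter 3. The names `best_sequence`, `interchange`, `vectorize`, and the toy `cost` function are assumptions made for this example, chosen so that one transformation pays off only after another has been applied.

```python
from itertools import product

# Hypothetical toy transformations on a program state (a dict of flags).
def interchange(state):
    return {**state, "interchanged": True}

def vectorize(state):
    # Assumed to be profitable only once the loops have been interchanged.
    return {**state, "vectorized": state["interchanged"]}

def cost(state):
    # Hypothetical cost model: interchange alone costs a little,
    # vectorization after interchange saves a lot.
    c = 100
    if state["interchanged"]:
        c += 10
    if state["vectorized"]:
        c -= 70
    return c

TRANSFORMS = {"interchange": interchange, "vectorize": vectorize}

def best_sequence(program, transformations, cost_fn, depth):
    """Exhaustively look ahead up to `depth` steps.

    Returns (best_cost, names_of_applied_transformations).
    """
    best = (cost_fn(program), ())
    for k in range(1, depth + 1):
        for combo in product(transformations.items(), repeat=k):
            state = program
            for _, transform in combo:
                state = transform(state)
            c = cost_fn(state)
            if c < best[0]:
                best = (c, tuple(name for name, _ in combo))
    return best

start = {"interchanged": False, "vectorized": False}
one_step = best_sequence(start, TRANSFORMS, cost, depth=1)
two_step = best_sequence(start, TRANSFORMS, cost, depth=2)
```

With one step of look-ahead the search keeps the original program, because neither transformation helps on its own. With two steps it finds the interchange-then-vectorize pair. Exhaustive enumeration grows exponentially with the depth, so a practical system would presumably prune the candidates with the heuristics already in the knowledge base.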
9.3.2. Parallel Execution of the Compiling Process
Executing the compiler in parallel is not difficult under our model. To achieve this, we need to
implement the hier-blackboard architecture instead of using the simulator.
9.3.3. Self-Learning Modules
The essential characteristic of an intelligent system is the ability to learn -- the ability to
enhance its capability through the acquisition of new knowledge. Depending on the degree of human
assistance, the learning of software systems can be classified as knowledge acquisition (with help from knowledge
engineers) or machine learning (self-learning).
9.3.3.1. Knowledge Acquisition
After decades of intensive studies by knowledge engineers and psychologists, knowledge acquisition is
still more an art than a science. On the other hand, past research has given us some principles,
methodologies, and advice for conducting knowledge acquisition.
Knowledge acquisition is the activity of gathering information from any source. Techniques in
knowledge acquisition depend on the source of the knowledge. If the source is a human expert, then the
(sub)task is called knowledge elicitation. If the source of the knowledge is documents or literature, then the
process is called literature summarization. Knowledge elicitation is difficult because it involves not only the
skill of the knowledge engineers and domain experts, but also human interaction and psychological problems.
Literature summarization is a very time-consuming task, but it can also yield a great deal of knowledge and is
effective when no experts are available. We build up the knowledge base mainly from literature
summarization and our own experiments.
The problem domain of optimizing parallel programs for different classes of parallel computers has the
following characteristics:
• Bound to hardware architecture. Most heuristics are developed for a particular type of architecture
and need some analysis before they can be applied to other types of architectures or simply be encoded.
• Partial and fragmented knowledge. Most experts have only partial knowledge about the optimization.
They are familiar with only a few types of architectures and only a few types of problems, such as solving
PDE problems or computer graphics problems on parallel computers.
• Not program transformation oriented. Decomposing the process of program optimization into program
transformations is suitable for a compiler or automatic program optimization but is not the natural way
that experts attack the problem. Raw heuristics need to be examined and associated with program
transformation techniques before they can be applied.
9.3.3.2. Knowledge Refinement
The development of an expert system is typically an iterative process. It consists of knowledge
acquisition-refinement cycles that are repeated until the intended performance of the system is achieved. In the
refinement cycles, the machine's performance is compared with the performance of the human expert, and the
machine representation of the knowledge is compared with the knowledge of the original expert. This
refinement cycle extends the role of expert systems to expert support systems, since both man and machine
learn through repeated knowledge acquisition-refinement cycles.
9.3.3.3. Self-Learning Modules
Based on the above understanding, the following self-learning modules are being studied.
• Generalization of heuristics by relaxation. This is a simple systematic way of generalizing heuristics. A
heuristic may be generalized by systematically relaxing the conditions of the heuristic one by one. The
effects of the resulting knowledge are then tested with a library of programs. A limit to this approach is
that only minor relaxation may produce meaningful results, since the heuristics are supposed to have
been studied extensively by the knowledge engineer.
• Neural networks. The design of the system makes a neural network implementation an ideal candidate
for learning. The input to the neural network is a set of states of the system that includes the list of
predicates that encode the machine features and the program features, the weights of the evaluation
functions, the certainty factors in the rules, the program transformations under consideration, and a feedback
of the results of applying the transformations. After the system gets to an equilibrium state, the weights
of evaluation functions and certainty factors of rules derived by the neural network can be used to solve
the particular types of programs and the target architecture represented in the input. The input data can
be collected through the history files generated by the system.
• Causal-based learning. Causal-based learning is an incremental learning method that learns from
examples and instructor guidance. The module learns by watching an instructor perform optimization on the
programs. It uses an underlying model to explain some behavior of the experts and uses that
explanation as a hint for learning. Expert guidance can be used to increase the learning efficiency, to permit
continuous learning, and to reduce system brittleness. The objective of the learning is to learn relationships
between features of the program and machine and the program transformation sequences.
• Model-based learning. Model-based learning is a generate-and-test learning model. The system learns
by examining examples generated automatically by a qualitative model. In our case, the system can
generate sequences of program transformations and validate the quality of each sequence by measuring the
performance of the resulting programs.
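The relaxation module described in the first item can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the system's implementation: a heuristic is modeled as a dictionary of named conditions, the program library is a list of (features, measured speedup) pairs, and a relaxed heuristic is accepted only if it fires on more programs than the original and never on one where the transformation slows the program down. All names and numbers are hypothetical.

```python
def fires(conditions, features):
    """A heuristic fires when every one of its conditions holds."""
    return all(pred(features) for pred in conditions.values())

def evaluate(conditions, library):
    """Return (hits, harmful): how often the heuristic fires on the library,
    and how often it fires on a program where the action hurts."""
    hits = harmful = 0
    for features, speedup in library:
        if fires(conditions, features):
            hits += 1
            if speedup < 1.0:
                harmful += 1
    return hits, harmful

def acceptable_relaxations(conditions, library):
    """Drop the conditions one by one and keep the relaxations that widen
    coverage without firing on any program that gets slower."""
    base_hits, _ = evaluate(conditions, library)
    accepted = []
    for dropped in conditions:
        relaxed = {k: v for k, v in conditions.items() if k != dropped}
        hits, harmful = evaluate(relaxed, library)
        if harmful == 0 and hits > base_hits:
            accepted.append(dropped)
    return accepted

# Hypothetical heuristic with three conditions on program features.
heuristic = {
    "nested_loop": lambda f: f["nested"],
    "stride_one": lambda f: f["stride_one"],
    "trip_count_large": lambda f: f["trip_count"] >= 64,
}

# Hypothetical test library: (program features, speedup if the action is applied).
library = [
    ({"nested": True, "stride_one": True, "trip_count": 128}, 2.0),
    ({"nested": True, "stride_one": True, "trip_count": 16}, 1.3),
    ({"nested": True, "stride_one": False, "trip_count": 128}, 0.8),
    ({"nested": False, "stride_one": True, "trip_count": 256}, 0.9),
]

relaxable = acceptable_relaxations(heuristic, library)
```

On this toy library only the trip-count condition can be relaxed safely, which matches the remark above that only minor relaxations tend to be meaningful.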
9.4. Closing Remarks
Will users be freed from tedious parallel program optimization in the near future? Judging from the rate
of progress of the field and the nature of the problem, the answer is probably no. This prompts us to search
for alternative methodologies for systematic program optimization to utilize existing technologies and to find
better ways to analyze and integrate heuristics. There will always be programs that require extensive user
interaction because of the data-dependent nature of the programs. Our goal is to search for methodologies that
may replace the human programmer with a software system in the guess-test cycle of the program optimization
process and to find a more efficient framework for such a process.
The results that we have reported in this thesis are still somewhat primitive and warrant further study.
We have demonstrated the importance and potential of systematic analysis and AI techniques in parallel
compilers. However, the realization of truly intelligent parallel compilers requires joint efforts from both compiler
and AI researchers. We believe this work will evolve into multidisciplinary, coordinated research that has a
high potential for reshaping the focus of future research and parallel compiler systems. Our results in this
thesis provide a foundation for further probes and give a very practical implementation model as a start.
APPENDIX
Appendix A.I. Sample Rules Used In Chapter 8
Process Creation.
[Rule a.1, ['computational model construction']]
if (has 'self-scheduling-loop primitives')
then
assert('parallelize outermost loop without blocking').
[Rule a.2, ['computational model construction']]
if ('process creation cost' is high) and
(number-of-processors is P)
then
assert('number of processes to create' is P).
[Rule a.3, ['computational model construction']]
if ('process creation cost' is low)
then
assert('parallelize outermost loop without blocking').
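The three process-creation rules above can be read as one forward-chaining step from machine features to asserted facts. The following Python sketch is illustrative only: the feature dictionary, its keys, and the helper name `process_creation_facts` are assumptions chosen to mirror the rule text, not the encoding used by the system.

```python
def process_creation_facts(machine):
    """Derive facts from machine features, mirroring Rules a.1 to a.3.

    `machine` is a hypothetical feature dictionary; its keys are
    assumptions made for this sketch.
    """
    facts = set()
    # Rule a.1: self-scheduling loop primitives are available.
    if machine.get("self_scheduling_loop_primitives", False):
        facts.add("parallelize outermost loop without blocking")
    cost = machine.get("process_creation_cost")
    # Rule a.2: process creation is costly, so create one process per processor.
    if cost == "high" and "processors" in machine:
        facts.add(("number of processes to create", machine["processors"]))
    # Rule a.3: process creation is cheap.
    if cost == "low":
        facts.add("parallelize outermost loop without blocking")
    return facts


facts = process_creation_facts({"process_creation_cost": "high", "processors": 8})
print(facts)
```

Using a set means that Rules a.1 and a.3, which assert the same fact, do not produce duplicates.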
Locality




[Rule a.5, ['computational model construction']]
if ('data access/process cost ratio' is high)
then
assert('memory optimization dominates instruction minimization').
[Rule a.6, ['computational model construction']]
if ('shared/local memory access ratio' is high)
then
(assert('locality is important')) and
(assert('use local variable whenever possible')).
[Rule a.7, ['computational model construction']]
if (has 'vector register')
then
(try 'keep vector operand in register')
The Program Focus Selection Subgoal
[Rule b.1, ['program focus selection']]





The Transformation Selection Subgoal.
[Rule c.1, ['program restructuring subgoal selection']]




select('task creation and processor allocation').








[Rule c.4, ['program restructuring subgoal selection']]
:. (select('task creation and processor allocation')).
[Rule c.5, ['program restructuring subgoal selection']]
if (('has cache') or ('has arrays in'(Focus)) or ('locality is important'))
then
select('memory access optimization').
[Rule c.6, ['program restructuring subgoal selection']]
if ('multiple tasks are created')
then
select('parallelism improvement').
[Rule c.7, ['program restructuring subgoal selection']]
if (('task created'(FOCUS)) and (not 'parallelized'(FOCUS)))
then
select('task creation and processor allocation')
Parallelism Improvement Subgoal
[Rule d.1, ['parallelism improving']]
if (is-a-loop(L)) and
(L = (for i in [RANGE] do A += B[i] * C[i]; end for))
then
is-inner-product(L)
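Rule d.1 is a syntactic pattern match on the loop body. As a minimal sketch of that idea, the Python fragment below recognizes the same shape in Python source using the standard `ast` module. The function names `is_inner_product` and `_index_name` are hypothetical, and matching Python loops rather than the compiler's internal program representation is an assumption made purely for illustration.

```python
import ast


def _index_name(sub):
    """Return the variable name used as a simple subscript, or None."""
    sl = sub.slice
    if sl.__class__.__name__ == "Index":  # Python 3.8 wraps the index node
        sl = sl.value
    return sl.id if isinstance(sl, ast.Name) else None


def is_inner_product(source):
    """Check whether `source` is a single loop of the form
    `for i in RANGE: A += B[i] * C[i]` (the pattern of Rule d.1)."""
    body = ast.parse(source).body
    if len(body) != 1 or not isinstance(body[0], ast.For):
        return False
    loop = body[0]
    if not isinstance(loop.target, ast.Name) or len(loop.body) != 1:
        return False
    stmt = loop.body[0]
    # The body must be a scalar accumulation: A += <expr>.
    if not (isinstance(stmt, ast.AugAssign) and isinstance(stmt.op, ast.Add)):
        return False
    if not isinstance(stmt.target, ast.Name):
        return False
    prod = stmt.value
    # <expr> must be a product of two arrays indexed by the loop variable.
    if not (isinstance(prod, ast.BinOp) and isinstance(prod.op, ast.Mult)):
        return False
    return all(
        isinstance(op, ast.Subscript) and _index_name(op) == loop.target.id
        for op in (prod.left, prod.right)
    )


print(is_inner_product("for i in range(n):\n    a += b[i] * c[i]\n"))
```

Once a loop is tagged this way, later rules can treat it as a candidate for substitution by a pre-optimized inner-product routine.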
[Rule d.2, ['parallelism improving']]





[Rule d.3, ['parallelism improving']]




[Rule d.4, ['parallelism improving']]
if ('nested-loops'(Focus)) and
(not 'perfectly-nested-loops'(Focus)) and




Rules about task creation and processor allocation





[Rule e.2, ['task creation and processor allocation']]
if (is-nested-loop(FOCUS)) and
('has vector operations') and
('size of vector registers'(V)) and














[Rule e.4, ['task creation and processor allocation']]






[Rule f.1, ['memory access optimization']]
(Assume L2 is the innermost loop that is nested in L1 such
that array references of X depend on the loop index of L2.
Also let X-sub be the part of the array X whose references





(innermost-depends-on-loop(L1, X, L2)) and
(sub-depends-on(X, X-sub, L2)) and
(N = sizeof(X)) and
('minimal number of references to justify cost of block-transfer' = B) and
(N > B)
then
(apply('block transfer'(X-sub, L2))).
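The threshold B in Rule f.1 is a machine-dependent break-even point. The rule does not say how B is obtained, so the Python sketch below assumes a simple linear cost model: without block transfer, N references cost `N * shared_cost`; with it, they cost `setup_cost + N * local_cost`. The cost model, the parameter names, and both function names are assumptions made for illustration.

```python
import math


def break_even_references(setup_cost, shared_cost, local_cost):
    """Largest B for which block transfer is NOT yet cheaper, so that
    N > B implies N * shared_cost > setup_cost + N * local_cost.

    Assumed linear cost model; all three costs are hypothetical
    machine parameters in the same time unit.
    """
    if shared_cost <= local_cost:
        return math.inf  # block transfer never pays off
    return math.floor(setup_cost / (shared_cost - local_cost))


def should_block_transfer(n_refs, b):
    """Condition of Rule f.1: apply block transfer when N > B."""
    return n_refs > b


b = break_even_references(setup_cost=100, shared_cost=10, local_cost=1)
print(b, should_block_transfer(12, b), should_block_transfer(11, b))
```

With these sample costs, 11 references cost 110 units directly and 111 units with a block transfer, so the transfer only pays off from the twelfth reference on.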
[Rule f.2, ['memory access optimization']]
if (apply('block transfer'(X, L))) and
(parallelize(L)) and




[Rule f.3, ['memory access optimization']]
if (apply('block transfer'(X, L))) and
('nested in'(L, LO))
then
('create temporary array'(tmp, LO)) and
('create statement'(S, block-transfer(X, tmp, sizeof(X)))) and
('insert in front of'(S, L2)) and
(substitute(X, tmp, L)).
[Rule f.4, ['memory access optimization']]











('insert in front of'(LL, LO)).
[Rule f.5, ['memory access optimization']]












[Rule f.6, ['memory access optimization']]
if ('has cache') and
('mostly used array'(A, FOCUS))
then
('keep in cache'(A)).
[Rule f.7, ['memory access optimization']]
if ('locality is important') and
('has local memory') and
('data accessing ratio of shared memory-local memory' > 2) and
(shared-array(A))
then
('allocate array A to the local memory of each processor').
[Rule f.8, ['memory access optimization']]
if (has-local-memory) and
('mostly used array'(A, FOCUS)) and
(shared-array(A)) and
(appears-in(A, S)) and
('in nested loops'(S, [L1 .. Ln])) and
('not depends on loops'(A, L1))
then
('create tmp'(tmp, L1)) and
('create statement'(S1, (A := tmp))) and
('insert in front of'(S1, S)) and
(substitute(A, tmp, L1)).
[Rule f.9, ['memory access optimization']]
if ('mostly used array'(A, FOCUS)) and
(shared(A)) and
(appears-in(A, S)) and
('in nested loops'(S, [L1 .. Ln])) and
('depends on loops'(A, L1))
then
(find the plausible loop order ORD with most inner loops that A depends on) and
('loop interchange'(L1, ORD)) and
(innermost-depends-on-loop(L1, X, LL)) and
('create tmp'(tmp, LL)) and
('create statement'(S1, (A := tmp))) and
('insert in front of'(S1, S)) and
(substitute(A, tmp, LL)).
[Rule f.10, ['memory access optimization']]
if ('has local memory') and






('create tmp'(tmp, FOCUS)) and
(scalarize(A, tmp)).
[Rule f.11, ['memory access optimization']]








[Rule f.12, ['memory access optimization']]




('has inter task dependence in'(A, L))
then
('pipelining references'(A, L)).
[Rule f.13, ['memory access optimization']]
if ('has vector register') and
('is a vector'(V)) and
(appears-in(V, S)) and
('in nested loops'(S, LList)) and
(member(LL, LList)) and
('not depends on'(A, LL))
then
('interchange loops to move LL into the innermost').
[Rule f.14, ['memory access optimization']]
if ('has vector register')
then
('vector register optimization dominates memory access optimization')
Appendix A.2. Sample Listing of Encoded Rules
% File: tran/cshrinking.a.
% rules for cycle shrinking.
% Cycle shrinking is a special case of loop blocking!
% We can use cycle shrinking to squeeze parallelism out of serial loop.
% For heuristics and rules see file cshrinking.
% We actually use loop_blocking(Loop, parallel_row2, ReductionFactor)
% to perform cycle shrinking.
% For non-perfectly nested loops, loop unrolling for the first or last
% iteration of the inner loop may be needed.
% For true-dependence shrinking, we may be better off by collapsing the
% loops into one single loop and blocking the resulting loop.
% module(cycle_shrinking).
% transformation_file(lblocking).
% loop blocking is used to squeeze parallelism out of serial loops.
% Loop is the outermost loop of a nested loop (perfect or non-perfect)
% Rule 1.
% if apply_transformation(cycle_shrinking, Loop, Type, ReductionFactor)
% then call(perform_transformation(cycle_shrinking, [Loop, Type, ReductionFactor]))
rule(1, cycle_shrinking,
apply_transformation(cycle_shrinking, Loop, Type, ReductionFactor),
call(perform_transformation(loop_blocking, [Loop, Type, ReductionFactor])),
1.0).
% Rule 2.
% if serial(Loop) and /* forms a strongly connected graph */
% all_distance_are_known(Loop)
% then
% try_transformation(loop_shrinking, Loop, BlockingFactor)
rule(2, cycle_shrinking,
(serial(Loop), no_unknown_distance(Loop)),
try_transformation(cycle_shrinking, Loop, Type, ReductionFactor),
1.0).
% Rule 3.
% if try_transformation(loop_shrinking, Loop, D) and
% not_nested_loop(Loop) and /* there is no loop nested inside */
% D = min{DDep: distance(Dep, DDep),
% for all cross loop dependence Dep in Loop}
% then
% apply_transformation(loop_shrinking, Loop, D).
% /* i.e. apply('loop blocking', parallel_row2) and parallelize(new_inner) */
rule(3, cycle_shrinking,




apply_transformation(loop_shrinking, Loop, simple, ReductionFactor),
1.0).
% Rule 4.
% Assuming DistanceVector = (d1, d2, ..., dn) has the minimum true-distance
% in loop Loop, d sub r1 is the first positive distance in D, and
% loop Loop is a perfect nested loop.
% if there is a r2 s.t.
% all dr = 0 for r in (r1+1, ..., r2-1) and dr2 < 0
% then applying selective shrinking is better than applying true distance shrinking
rule(4, cycle_shrinking,





apply_transformation(loop_shrinking, Loop, selective, ReductionFactor),
1.0).
% Rule 5.
% For the same assumption as in rule 4.
% if there is no r2 s.t.
% all dr = 0 for r in (r1+1, ..., r2-1) and dr2 < 0











% if the loop is not perfectly nested;
% only the common outer loop is considered.
% the loops inside the outer loop can be considered separately.
rule(6, cycle_shrinking,





apply_transformation(loop_shrinking, Loop, non_perfect_nested, ReductionFactor),
1.0).
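Rule 3 of the listing above chooses the reduction factor as the minimum cross-iteration dependence distance of a non-nested serial loop, then blocks the loop by that factor. A minimal Python sketch of this simple case follows. The function name `cycle_shrink` and the representation of a loop as an inclusive iteration range plus a list of distances are assumptions made for illustration.

```python
def cycle_shrink(lower, upper, distances):
    """Partition iterations lower..upper (inclusive) into groups that must
    run one after another, while the iterations inside each group are
    mutually independent and may run in parallel.

    `distances` holds the known, positive cross-iteration dependence
    distances of a single (non-nested) loop.
    """
    if not distances or min(distances) < 1:
        raise ValueError("all dependence distances must be known and positive")
    factor = min(distances)  # reduction factor D of Rule 3
    groups = [
        list(range(start, min(start + factor, upper + 1)))
        for start in range(lower, upper + 1, factor)
    ]
    return factor, groups


factor, groups = cycle_shrink(1, 10, [3, 5])
print(factor, groups)
```

Two iterations in the same group differ by less than the smallest dependence distance, so no dependence can connect them. This is why the inner loop produced by blocking can be parallelized.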
Appendix B. Program Transformation Techniques
* Transformations included:
See table 8.1.
* Steps to Add New Transformations To the System.
1. Implement the transformation in three modules:
a. applicability test - routine to test if the transformation is applicable:
'transformation_name'_test(Focus, Arg).
b. argument selection - routine to find a suitable argument for the transformation:
'transformation_name'_arg(Focus, Arg).
c. algorithm application - routine to perform the transformation:
'transformation_name'(Focus, Arg).
2. Declare the transformation.
In file tran/transformation add the predicate:
transformation(transformation_name, SubGoal, Availability).
where Availability is the status of the transformation; its value is either
ok, not_reliable, partial, or no.
Example:
transformation(loop_interchanging, task_creation, ok).
3. Add rules for transformation selection:
put the rules to select transformations in select_transformation/6 in
file tran/transformation.
4. Add rules for making things happen:
add rules to make the transformation applicable when it is not applicable to the program;
these rules should be added to the file tran/transformation.
try_harder_'transformation_name'(Focus, SubGoal, Arg).
5. Add specialized evaluation functions and weight to the file
evaluation_function_for('transformation_name', Goal, EvalFunctions).




weight_evaluation(loop_interchange, _, array_stride, Weight) :-
    has_vector_capability,
    (feature_value(operand_stores, in_memory) ->
        Weight = 1.0
    ;
        Weight = 0.5
    ).
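The first two steps above (three modules per transformation plus a declaration) amount to a small plug-in registry. The Python sketch below illustrates that structure under stated assumptions: the `Transformation` record, the `REGISTRY` table, the `declare` and `try_transformation` helpers, and the toy representation of a loop nest as a list of index names are all hypothetical and are not the system's Prolog interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

AVAILABILITY = {"ok", "not_reliable", "partial", "no"}


@dataclass
class Transformation:
    name: str
    subgoal: str
    availability: str
    test: Callable[[Any, Any], bool]   # step 1a: applicability test
    select_arg: Callable[[Any], Any]   # step 1b: argument selection
    apply: Callable[[Any, Any], Any]   # step 1c: algorithm application


REGISTRY: Dict[str, Transformation] = {}


def declare(t: Transformation) -> None:
    """Step 2: declare the transformation with its subgoal and status."""
    if t.availability not in AVAILABILITY:
        raise ValueError(f"unknown availability: {t.availability}")
    REGISTRY[t.name] = t


def try_transformation(name, focus):
    """Select an argument, test applicability, then apply.
    Returns None when the transformation is unavailable or not applicable."""
    t = REGISTRY[name]
    if t.availability == "no":
        return None
    arg = t.select_arg(focus)
    if not t.test(focus, arg):
        return None
    return t.apply(focus, arg)


def swap_loops(focus, arg):
    """Toy loop interchange: swap two loop indices in a loop-nest list."""
    i, j = arg
    out = list(focus)
    out[i], out[j] = out[j], out[i]
    return out


declare(Transformation(
    name="loop_interchanging",
    subgoal="task_creation",
    availability="ok",
    test=lambda focus, arg: len(focus) >= 2 and max(arg) < len(focus),
    select_arg=lambda focus: (0, 1),
    apply=swap_loops,
))

print(try_transformation("loop_interchanging", ["i", "j"]))
```

Keeping the applicability test separate from the application routine lets selection rules reject a transformation cheaply before any program rewriting takes place.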
