Mapping of portable parallel programs by Chen, Song
New Jersey Institute of Technology
Digital Commons @ NJIT
Dissertations Theses and Dissertations
Spring 1995
Mapping of portable parallel programs
Song Chen
New Jersey Institute of Technology
Follow this and additional works at: https://digitalcommons.njit.edu/dissertations
Part of the Computer Sciences Commons
This Dissertation is brought to you for free and open access by the Theses and Dissertations at Digital Commons @ NJIT. It has been accepted for
inclusion in Dissertations by an authorized administrator of Digital Commons @ NJIT. For more information, please contact
digitalcommons@njit.edu.
Recommended Citation
Chen, Song, "Mapping of portable parallel programs" (1995). Dissertations. 1110.
https://digitalcommons.njit.edu/dissertations/1110
 
Please Note:  The author retains the copyright while the 
New Jersey Institute of Technology reserves the right to 
distribute this thesis or dissertation 
 
 
The Van Houten library has removed some of the 
personal information and all signatures from the 
approval page and biographical sketches of theses 
and dissertations in order to protect the identity of 
NJIT graduates and faculty.  
 
UMI Number: 9539580
Copyright 1995 by 
Chen, Song 
All rights reserved.
UMI Microform 9539580 
Copyright 1995, by UMI Company. All rights reserved.
This microform edition is protected against unauthorized 
copying under Title 17, United States Code.
ABSTRACT
MAPPING OF PORTABLE PARALLEL PROGRAMS
by 
Song Chen
An efficient parallel program designed for a parallel architecture includes a 
detailed outline of accurate assignments of concurrent computations onto processors, 
and data transfers onto communication links, such that the overall execution time 
is minimized. This process may be complex depending on the application task and 
the target multiprocessor architecture. Furthermore, this process is to be repeated 
for every different architecture even though the application task may be the same. 
Consequently, this has a major impact on the ever-increasing cost of software
development for multiprocessor systems. A remedy for this problem would be to design 
portable parallel programs which can be mapped efficiently onto any computer 
system. In this dissertation, we present a portable programming tool called
Cluster-M. The three components of Cluster-M are the Specification Module, the
Representation Module, and the Mapping Module. In the Specification Module, for a 
given problem, a machine-independent program is generated and represented in the 
form of a clustered task graph called Spec graph. Similarly, in the Representation 
Module, for a given architecture or heterogeneous suite of computers, a clustered 
system graph called Rep graph is generated. The Mapping Module is responsible 
for efficient mapping of Spec graphs onto Rep graphs. As part of this module, we 
present the first algorithm which produces a near-optimal mapping of an arbitrary 
non-uniform machine-independent task graph with M modules onto an arbitrary 
non-uniform task-independent system graph having N processors in O(MP) time, 
where P = max(M, N). Our experimental results indicate that Cluster-M produces
better or similar mapping results compared to other leading techniques which work 
only for restricted task or system graphs.
MAPPING OF PORTABLE PARALLEL PROGRAMS
by
Song Chen
A Dissertation
Submitted to the Faculty of
New Jersey Institute of Technology
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
Department of Computer and Information Science
May 1995
Copyright © 1995 by Song Chen
ALL RIGHTS RESERVED
APPROVAL PAGE 
MAPPING OF PORTABLE PARALLEL PROGRAMS 
Song Chen 
Dr. Mary M. Eshaghian, Dissertation Advisor
Director of Advanced Computer Architecture and Parallel Processing Laboratory
Assistant Professor of Computer and Information Science
Assistant Professor of Electrical and Computer Engineering, NJIT
Dr. John D. Carpinelli, Committee Member
Director of Computer Engineering
Acting Associate Chair of Electrical and Computer Engineering Department
Associate Professor of Electrical and Computer Engineering, NJIT
Dr. James McHugh, Committee Member
Director of Ph.D. Program in Computer Science
Professor of Computer and Information Science, NJIT
Dr. Peter A. Ng, Committee Member
Chair of Computer and Information Science Department
Professor of Computer and Information Science, NJIT
Dr. Sotirios G. Ziavras, Committee Member
Assistant Professor of Electrical and Computer Engineering, NJIT
BIOGRAPHICAL SKETCH 
Author: Song Chen 
Degree: Doctor of Philosophy 
Date: May 1995 
Undergraduate and Graduate Education: 
• Doctor of Philosophy in Computer Science, 
New Jersey Institute of Technology, Newark, New Jersey, 1995 
• Master of Science in Systems Engineering, 
Shanghai Jiao Tong University, Shanghai, P. R. China, 1990 
• Bachelor of Science in Computer Science, 
East China Normal University, Shanghai, P. R. China, 1987 
Major: Computer Science 
Presentations and Publications: 
S. Chen and M. M. Eshaghian, "A Fast Recursive Mapping Algorithm," to appear 
in Concurrency: Practice and Experience, August 1995. 
S. Chen, M. M. Eshaghian, R. F. Freund, J. L. Potter, and Y. Wu, "Evaluation 
of Two Programming Paradigms for Heterogeneous Computing," to appear in 
Journal of Parallel and Distributed Computing, 1995/1996. 
S. Chen, M. M. Eshaghian, and Y. Wu, "Mapping Arbitrary Non-Uniform Task 
Graphs onto Arbitrary Non-Uniform System Graphs," submitted to IEEE 
Transactions on Parallel and Distributed Computing. 
S. Chen and M. M. Eshaghian, "Tools for Design and Mapping of Portable Parallel 
Programs," to appear in Proceedings of Workshop on Challenges for Parallel 
Processing, International Conference on Parallel Processing, August 1995. 
S. Chen, M. M. Eshaghian, and Y. Wu, "Mapping Arbitrary Non-Uniform Task 
Graphs onto Arbitrary Non-Uniform System Graphs," to appear in Proceedings 
of International Conference on Parallel Processing, August 1995. 
S. Chen, M. M. Eshaghian, R. F. Freund, J. L. Potter, and Y. Wu, "Scalable 
Heterogeneous Programming Tools," Proceedings of Heterogeneous Computing 
Workshop, pp 89-96, April 1994. 
S. Chen, M. M. Eshaghian, A. Khokhar, and M. E. Shaaban, “A Selection 
Theory and Methodology for Heterogeneous Supercomputing,” Proceedings of 
Workshop on Heterogeneous Processing, pp 15-22, April 1993.
L. R. Welch, A. D. Stoyenko, and S. Chen, “Assigning ADT Modules with Random 
Neural Networks,” Proceedings of Hawaii International Conference on Systems 
Science, pp 546-555, January 1993.
This work is dedicated to 
my grandparents, my parents, and my lovely wife.
ACKNOWLEDGMENT
The author wishes to express his sincere gratitude to his advisor, Professor 
Mary M. Eshaghian, for her guidance, friendship, and moral support throughout 
this research.
Special thanks to Professor John D. Carpinelli, Professor James McHugh, 
Professor Peter A. Ng, and Professor Sotirios G. Ziavras for serving as members 
of the committee and offering invaluable suggestions to this dissertation.
The author is grateful to the Department of Computer and Information Science 
and the National Science Foundation for funding this project.
The author appreciates the consistent help from the Cluster-M project team 
members: Geetha Chitti, Ajitha Gadangi, Javier G. Vasquez, and especially
Ying-Chieh Wu.
Lastly, the author wants to thank his dear wife, Jing Zhu, for her love,
understanding, and help, without which he could not have completed this dissertation.
TABLE OF CONTENTS
1 INTRODUCTION
  1.1 Existing Parallel Programming Tools
  1.2 Mapping Techniques
    1.2.1 Mapping of Specialized Task onto Specialized Systems
    1.2.2 Mapping of Specialized Task onto Arbitrary Systems
    1.2.3 Mapping of Arbitrary Task onto Specialized Systems
    1.2.4 Mapping of Arbitrary Tasks onto Arbitrary Systems
  1.3 Cluster-M
  1.4 Contributions and Outline
2 CLUSTER-M PROGRAMMING
  2.1 Cluster-M Specifications
  2.2 Cluster-M Constructs
  2.3 Implementation of the Cluster-M Constructs
  2.4 Cluster-M Specification Macros
    2.4.1 Associative Binary Operation
    2.4.2 Vector Dot Product
    2.4.3 SIMD Data Parallel Operations
    2.4.4 Broadcast Operation
3 CLUSTERING GRAPHS
  3.1 Clustering Arbitrary Uniform Graphs
    3.1.1 Clustering Directed Graphs
    3.1.2 Clustering Undirected Graphs
  3.2 Clustering Arbitrary Non-Uniform Graphs
    3.2.1 Clustering Non-Uniform Directed Graphs
    3.2.2 Clustering Non-Uniform Undirected Graphs
4 CLUSTER-M MAPPING
  4.1 Cluster-M Uniform Mapping
    4.1.1 Uniform Mapping Algorithm
    4.1.2 Uniform Mapping Examples
  4.2 Uniform Mapping Comparison Results
    4.2.1 Task Scheduling Results
    4.2.2 Task Allocation Results
  4.3 Cluster-M Non-Uniform Mapping
    4.3.1 Non-Uniform Mapping Algorithm
    4.3.2 Non-Uniform Mapping Examples
  4.4 Non-Uniform Mapping Comparison Results
    4.4.1 Comparison with McCreary and Gill’s Clan Algorithm
    4.4.2 Comparison with El-Rewini and Lewis’s Mapping Heuristic
    4.4.3 Comparison with Wu-Gajski’s MCP Algorithm
    4.4.4 Comparison with Sarkar’s Edge-Zeroing Algorithm
    4.4.5 Comparison with Yang and Gerasoulis’ DSC Algorithm
5 HIERARCHICAL CLUSTER-M MAPPING FOR HETEROGENEOUS COMPUTING
  5.1 Heterogeneous Optimal Selection Theory (HOST)
  5.2 Modeling the Input to HOST
  5.3 Hierarchical Cluster-M (HCM)
    5.3.1 HCM Specification
    5.3.2 HCM Representation
  5.4 HCM Bound-Degree Mapping Algorithm
  5.5 Comparison Study
6 COMBINED USE OF CLUSTER-M WITH HASC
  6.1 Heterogeneous Associative Computing (HAsC)
  6.2 Combined Use of Cluster-M and HAsC
  6.3 Scalability Issues
    6.3.1 Homogeneous Scalability
    6.3.2 Heterogeneous Scalability
    6.3.3 Scalability of HAsC and Cluster-M
7 CONCLUSIONS
APPENDIX A Cluster-M Constructs in PCN




LIST OF TABLES
4.1 Mapping of Bokhari’s algorithm and Cluster-M
4.2 Comparisons of mappings of Bokhari’s algorithm and Cluster-M




LIST OF FIGURES
2.1 Spec graph of a unary operation on an array of size n.
2.2 Spec graph of a binary associative operation on 8 elements.
2.3 PCN system structure.
2.4 Cluster-M Specification of broadcast macro.
3.1 Clustering-directed-graphs algorithm.
3.2 A task graph and the obtained Spec graph.
3.3 Clustering-undirected-graphs algorithm.
3.4 An undirected graph and its clustering.
3.5 A clustered graph of a hypercube.
3.6 A clustered graph of a mesh.
3.7 A clustered graph of a ring.
3.8 A clustered graph of a completely connected graph.
3.9 Clustering-non-uniform-directed-graphs algorithm.
3.10 Two types of clustering.
3.11 Clustering on a Merge-node.
3.12 Clustering on a Merge-node: a general case.
3.13 Clustering on a Broadcast-node.
3.14 Clustering on a Broadcast-node: a general case.
3.15 Possible embedding on a Broadcast-node.
3.16 A task graph and the obtained Spec graph.
3.17 Clustering-non-uniform-undirected-graphs algorithm.
3.18 A non-uniform system graph and its clustering.
4.1 Uniform mapping algorithm.
4.2 A mapping example.
4.3 Gantt chart of the obtained schedule.
4.4 Comparison example with Lee and Aggarwal’s strategy.
4.5 Comparison with Bokhari’s mapping: task and system graph.
4.6 Non-uniform mapping algorithm.
4.7 A mapping example.
4.8 Gantt chart of the obtained schedule.
4.9 Mappings on different system graphs.
4.10 Gaussian elimination algorithm.
4.11 The mapping example of a 5 x 5 matrix Gaussian elimination.
4.12 Comparison example with Clan.
4.13 Comparison example with MH.
4.14 Comparison example with MCP, Sarkar and DSC.
4.15 Comparison example 2 with DSC.
4.16 Comparison example 3 with DSC.
4.17 Comparison example 4 with DSC.
5.1 Input format to HOST.
5.2 A heterogeneous subtask consists of MIMD and vector code segments.
5.3 Construction of the Spec subgraph of the MIMD code segment.
5.4 The system graph and its clustering of a heterogeneous suite.
5.5 HCM bound-degree mapping algorithm.
5.6 The obtained mapping result.
5.7 The mapping results of MIMD code blocks onto MIMD machine.
5.8 The mapping results of Gaussian elimination on the vector machine.
6.1 Analogy between an associative computer and an associative configuration of a network.
6.2 A layered heterogeneous network.
6.3 Instruction Synchronization.
6.4 Cluster-M aided HAsC computation within HAsC nodes.
6.5 Switching between Cluster-M and HAsC.
6.6 The task graph and Spec graph of the HAsC user level instructions.
6.7 The task graph of a GE on a 7 x 7 matrix.
6.8 The architectures of HAsC Node1 and Node2.
6.9 The Cluster-M mappings within the HAsC nodes.
6.10 Hierarchical breakdown of a task.
6.11 Scalability of HAsC and Cluster-M.
CHAPTER 1
INTRODUCTION
In this chapter, we first present an overview of existing parallel programming tools, 
focusing on tools for the design and mapping of portable parallel programs. An
essential component of these tools is the mapping technique employed. For this
reason, we present a detailed overview of various mapping techniques, classified
into four categories. Finally, we introduce the Cluster-M portable parallel
programming tool, which is studied in detail throughout this dissertation.
1.1 Existing Parallel Programming Tools
Many parallel programming tools have been developed. They can be classified as 
debuggers, high-level languages, libraries of specialized routines, mapping tools,
network tools, performance tools, parallelization tools, etc. [18]. For example,
PVM [76] is a network tool. It consists of library routines embedded in C or
FORTRAN that permit a network of heterogeneous computers to appear as one large
virtual machine. PVM is simple to use and is therefore widely adopted. However, a
PVM user must specify data allocation and task partitioning explicitly, because PVM
provides no automatic or intelligent load balancing or mapping. Therefore, programs
written in PVM may not be portable.
In this dissertation, we are only interested in portable mapping tools that 
port parallel programs onto different parallel systems. Using these programming 
tools, the user can write a parallel program without knowing all the details of the 
target computer where the program is to be executed. Examples of these tools 
include Linda, Prep-P, Oregami, Hypertool, and PYRROS [11, 6, 60, 82, 85]. Linda
[11, 1, 46] is a language extension to C and FORTRAN for parallel programming. 
It is a coordination language for creating parallel or distributed applications via a 
virtual shared memory paradigm. Linda defines a logically shared, data-structuring 
memory mechanism called tuple space. Tuple space holds two kinds of tuples: process 
tuples, which are under active evaluation, and data tuples, which are passive.
Ordinarily, building a Linda program involves dropping a process tuple into tuple
space, which then spawns other process tuples. This pool of process tuples, all
executing simultaneously, exchanges data by generating, reading, and consuming data
tuples. Once a process tuple has finished executing, it turns into a data tuple
indistinguishable from other data tuples. Linda requires large volumes of data to be
moved to and from the shared memory. This may cause heavy congestion on the available
communication channels of a typical multiprocessor system. For this reason, Linda has
mostly been used for coarse-grain computations. Furthermore, it is very difficult to
implement Linda on architectures that do not support a shared memory structure.
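The tuple-space behavior described above can be illustrated with a small sketch. It is a toy, single-machine model in Python rather than Linda itself: the class `TupleSpace`, the use of `None` as a wildcard, and the thread-based `eval` are assumptions made for illustration, while the operation names follow Linda's customary `out` (generate), `rd` (read), `in` (consume), and `eval` (spawn a process tuple).

```python
import threading


class TupleSpace:
    """Toy in-process tuple space; None in a pattern matches any field."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, tup):
        """Generate a passive data tuple."""
        with self._cond:
            self._tuples.append(tuple(tup))
            self._cond.notify_all()

    def _find(self, pattern):
        for tup in self._tuples:
            if len(tup) == len(pattern) and all(
                p is None or p == f for p, f in zip(pattern, tup)
            ):
                return tup
        return None

    def rd(self, pattern):
        """Read a matching tuple without removing it; blocks until one exists."""
        with self._cond:
            while (tup := self._find(pattern)) is None:
                self._cond.wait()
            return tup

    def in_(self, pattern):
        """Consume a matching tuple; blocks until one exists."""
        with self._cond:
            while (tup := self._find(pattern)) is None:
                self._cond.wait()
            self._tuples.remove(tup)
            return tup

    def eval(self, func, *args):
        """Spawn a 'process tuple': a thread whose result becomes a data tuple."""
        worker = threading.Thread(target=lambda: self.out(func(*args)))
        worker.start()
        return worker


def square(i):
    return ("square", i, i * i)


space = TupleSpace()
workers = [space.eval(square, i) for i in range(4)]
# The coordinator consumes the four result tuples in whatever order they arrive.
results = sorted(space.in_(("square", None, None)) for _ in range(4))
for w in workers:
    w.join()
```

Every exchange passes through the shared structure guarded by one lock, which mirrors the congestion concern raised above for fine-grain computations.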
On the other hand, Prep-P, Oregami, Hypertool, and PYRROS all include a 
mapping component which can map a given parallel program onto either a special or 
arbitrary system. However, the mapping components of Prep-P [6] and Oregami [60] 
are basically libraries of specialized mapping algorithms which only map regularly 
structured programs onto regularly structured systems. Their mappings for
irregularly structured programs or systems can be very slow and ineffective. Hypertool 
[82] and PYRROS [85] generate fast and near-optimal mappings by clustering the 
task graphs. However, they only map the clusters of task modules onto a fully 
connected system.
Cluster-M [29, 30, 31, 32], the parallel programming tool studied in this 
dissertation, also includes a mapping component, called the Cluster-M Mapping Module. 
The other components of Cluster-M are the Cluster-M Specification Module and the
Cluster-M Representation Module. Portable parallel programs can be written in
Cluster-M Specification without any information about the target computer, while the
target computer system is represented by Cluster-M Representation. The Cluster-M
Mapping Module uses a new mapping technique which maps Cluster-M Specification onto 
Cluster-M Representation. In the next section, we give an overview of existing 
mapping techniques.
1.2 Mapping Techniques
The mapping problem has been described in a number of different ways in the
literature [12, 27]. In general, the mapping problem can be viewed as determining an 
assignment of a given program, which consists of a collection of task modules 
that can be run serially or in parallel (representable in the form of a task graph),
onto the processing elements of the underlying architecture (representable in the
form of a system graph), so that some performance measure, such as total execution
time, is optimized.
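To make this formulation concrete, the following sketch evaluates candidate assignments of a small task graph onto a small system graph. The objective used here (the heaviest processor load plus distance-weighted communication) and all names and weights are assumptions chosen for illustration; the performance measure is left open in the description above, and the exhaustive search is only meant to show what is being optimized, not how practical techniques proceed.

```python
from itertools import product


def mapping_cost(exec_time, comm, dist, assign):
    """Hypothetical objective: heaviest processor load plus weighted communication."""
    load = {}
    for task, proc in assign.items():
        load[proc] = load.get(proc, 0) + exec_time[task]
    traffic = sum(w * dist[assign[u]][assign[v]] for (u, v), w in comm.items())
    return max(load.values()) + traffic


def best_mapping(exec_time, comm, dist):
    """Exhaustive search over all N**M assignments; viable only for tiny inputs."""
    tasks = sorted(exec_time)
    procs = range(len(dist))
    best = None
    for choice in product(procs, repeat=len(tasks)):
        assign = dict(zip(tasks, choice))
        cost = mapping_cost(exec_time, comm, dist, assign)
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best


# Task graph: node weights are execution times, edge weights are data volumes.
exec_time = {"a": 4, "b": 4, "c": 2}
comm = {("a", "b"): 1, ("b", "c"): 5}
# System graph: two processors at distance 1 from each other.
dist = [[0, 1], [1, 0]]

cost, assign = best_mapping(exec_time, comm, dist)
```

In this example the search places the heavily communicating modules b and c on the same processor and a on the other. The number of candidate assignments grows as N to the power M, which is why exhaustive search does not scale.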
The mapping problem is one of the most challenging problems in parallel 
and distributed computing. It is known to be NP-complete in its general form as 
well as in several restricted forms [47, 78, 79, 8, 33, 27]. The techniques 
used in mapping can be classified into three groups: graph theoretic, mathematical 
programming, and heuristic [71, 14, 2]. The graph theoretic and mathematical 
programming techniques are suitable only for certain special mapping problems, e.g., 
tasks without communication requirements or systems with special topologies.
In an attempt to solve the problem in the general case, a number of heuristics 
have been introduced. These heuristics do not guarantee an optimal solution to the 
problem, but they aim to find near-optimal solutions in most cases.
Mapping can be either static or dynamic. In static mapping, the assignment 
of the nodes of the task graph onto the nodes of the system graph is determined
prior to execution and does not change until execution ends. Static mapping can
be further divided into static mapping with task duplication and static mapping 
without task duplication. In static mapping with task duplication, a node (task 
module) of the task graph can be assigned to more than one node (processor) in the 
system graph [53, 63, 21, 19, 22, 62]. Task duplication is not permissible for tasks 
which perform destructive operations such as data output or modification. In static 
mapping without task duplication, a node (task module) of the task graph is assigned 
to exactly one node (processor) in the system graph [8, 71, 5, 4, 56, 52, 59, 61, 70, 33, 
82, 26, 28, 2, 40, 72, 41, 64, 57, 50, 86, 15, 17]. In this dissertation, we concentrate 
on static mapping without task duplication.
A static task graph or system graph can be either uniform or non-uniform. A 
graph is called non-uniform if the weights of its nodes differ and the weights of 
its edges also differ; otherwise, it is uniform. Mapping of directed task graphs
(in which there are precedence relations among the task modules) is called task
scheduling, as studied in [78, 79, 56, 52, 61, 70, 43, 82, 3, 26, 2, 40, 72, 41,
50, 86, 15, 17]. Mapping of undirected task graphs is called task allocation, as
studied in [8, 71, 5, 4, 59, 33, 28, 64, 57, 15]. Whether the graphs are directed or
undirected, uniform or non-uniform, there are basically four types of static mappings
based on the topological structures of the task and system graphs [15, 17]: (1) mapping
of specialized tasks onto specialized systems [20, 73, 10]; (2) mapping of specialized 
tasks onto arbitrary systems [9, 33]; (3) mapping of arbitrary tasks onto specialized 
systems [28, 61, 2, 40, 50]; and (4) mapping of arbitrary tasks onto arbitrary systems
[8, 25, 56, 5, 6, 59, 26, 60, 65, 15, 17].
1.2.1 Mapping of Specialized Task onto Specialized Systems
The most prominent examples of research on mapping specialized tasks onto
specialized systems include Coffman and Graham’s early work and the later work of
Stone and Bokhari on a two-processor system [20, 73, 74, 7, 75]. Coffman and Graham
[20] did not consider inter-processor communication cost, while Stone and Bokhari 
assumed that a serial program with multiple modules was to be mapped onto the two 
processors. In the latter case, an optimal solution can be obtained in polynomial
time using a max-flow min-cut algorithm [45, 73]. An extension of the min-cut approach
to three-processor and N-processor systems was also discussed. However, it was noted
that this extension is not trivial and that the resulting mappings are not guaranteed
to be optimal. Other examples in this group are Bokhari’s partitioning and mapping 
of chain-like tasks on chain-linked processors or a host-satellite system, and on a
tree-structured single-host multiple-satellite system (actually a star) [10].
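The two-processor case can be sketched as follows. The network construction is the standard one associated with this approach: the two processors act as source and sink, each module is linked to both with capacities equal to its execution cost on the opposite processor, and communicating modules are linked by their communication cost. The function name, the Edmonds-Karp max-flow routine, and the cost figures are assumptions made for this illustration, and module names must not collide with the labels "P1" and "P2".

```python
from collections import defaultdict, deque


def min_cut_assignment(cost_p1, cost_p2, comm):
    """Assign serial modules to two processors via max-flow/min-cut."""
    src, sink = "P1", "P2"  # the two processors act as source and sink
    cap = defaultdict(lambda: defaultdict(int))
    for m in cost_p1:
        cap[src][m] += cost_p2[m]   # this edge is cut when m runs on P2
        cap[m][sink] += cost_p1[m]  # this edge is cut when m runs on P1
    for (u, v), w in comm.items():
        cap[u][v] += w              # cut when u and v are separated
        cap[v][u] += w

    def reachable():
        """Breadth-first search in the residual graph; returns the parent map."""
        parent = {src: None}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    total = 0
    while True:
        parent = reachable()
        if sink not in parent:
            break
        # Walk back from the sink to collect the augmenting path.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        total += push

    # Modules still reachable from the source stay on P1; the rest go to P2.
    return total, {m: ("P1" if m in parent else "P2") for m in cost_p1}


cost_p1 = {"a": 2, "b": 6, "c": 9}   # execution cost of each module on P1
cost_p2 = {"a": 8, "b": 3, "c": 2}   # execution cost of each module on P2
comm = {("a", "b"): 5, ("b", "c"): 1}

total, placement = min_cut_assignment(cost_p1, cost_p2, comm)
```

The weight of any cut equals the execution cost plus the inter-processor communication cost of the corresponding assignment, so a minimum cut yields an optimal assignment for a serial program.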
1.2.2 Mapping of Specialized Task onto Arbitrary Systems
Some techniques have been developed for mapping specialized tasks onto arbitrary 
systems. Bokhari [9] used the shortest tree algorithm to obtain the optimal 
mapping of a tree-structured task graph having M task modules onto N arbitrary
processors in O(MN^2) time. Again, it was assumed that all the modules of the task
execute serially. Towsley [77] gave an algorithm for mapping a series-parallel
task graph onto an arbitrary system graph in O(MN^3) time. Fernandez-Baca 
[33] observed that tree graphs and series-parallel graphs are two special cases 
of a k-tree, and developed an efficient algorithm for mapping any k-tree or partial
k-tree task graph onto an arbitrary system graph in O(MN^(k+1)) time. With k = 1 and
k = 2, this matches the time complexities of [9] and [77], respectively. Also,
Fernandez-Baca developed
an algorithm for mapping an almost tree with parameter k in 0 ( M A ^ 21+2) time.
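The O(MN^2) bound for tree-structured task graphs can be illustrated with a short sketch. It uses a bottom-up dynamic program rather than the shortest tree algorithm of [9], and it assumes the serial-execution objective stated above, so that the total cost is the sum of execution and communication costs. The function name, the cost tables, and the example values are hypothetical.

```python
def map_tree(children, root, exec_cost, comm, dist):
    """Minimum total cost of placing a tree task graph on N processors.

    exec_cost[v][p]: cost of running module v on processor p
    comm[(v, c)]:    data volume on the tree edge from v to its child c
    dist[p][q]:      per-unit communication cost between processors p and q
    """
    n = len(dist)
    best = {}    # best[v][p]: optimal cost of v's subtree with v placed on p
    choice = {}  # choice[(c, p)]: best processor for child c when its parent is on p

    def solve(v):
        for c in children.get(v, []):
            solve(c)
        best[v] = []
        for p in range(n):
            total = exec_cost[v][p]
            for c in children.get(v, []):
                # N candidates per (edge, p) pair gives O(M * N^2) work overall.
                q = min(range(n), key=lambda q: best[c][q] + comm[(v, c)] * dist[p][q])
                choice[(c, p)] = q
                total += best[c][q] + comm[(v, c)] * dist[p][q]
            best[v].append(total)

    solve(root)

    # Recover the assignment top-down from the recorded choices.
    root_p = min(range(n), key=lambda p: best[root][p])
    assign, stack = {}, [(root, root_p)]
    while stack:
        v, p = stack.pop()
        assign[v] = p
        for c in children.get(v, []):
            stack.append((c, choice[(c, p)]))
    return best[root][root_p], assign


children = {"r": ["x", "y"]}
exec_cost = {"r": [1, 5], "x": [4, 1], "y": [3, 3]}
comm = {("r", "x"): 1, ("r", "y"): 2}
dist = [[0, 2], [2, 0]]

cost, assign = map_tree(children, "r", exec_cost, comm, dist)
```

Each of the M - 1 tree edges is examined for every pair of processors, which is where the N^2 factor comes from.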
1.2.3 Mapping of Arbitrary Task onto Specialized Systems
Most mapping techniques developed fall into the third category. When the system 
has N processors arranged in a specialized structure, specialized techniques can be 
used for the mapping. Because these techniques are tailored to a particular class of
systems, the mappings can be very efficient and effective. Examples include
the mapping algorithm of Ercal et al. [28] on a hypercube, which uses Kernighan-Lin’s
mincut bipartitioning technique [49]; Sadayappan and Ercal’s work on mesh [68]; and
Lo’s algorithm for bus network [58]. Indurkya et al. also analyzed the optimal 
mapping of an arbitrary task graph onto a specialized system graph with additive 
communication cost, e.g., a bus network [44]. Mappings on more general
regular-structured task graphs were studied by Berman and Snyder using edge grammar [5]. 
Many mapping algorithms do not model the system in detail and simply assume that 
all the processors are fully connected [61, 2]. Ali and El-Rewini in [2] proposed an 
interesting graph theoretic approach for mapping M modules onto N processors by 
constructing a split graph containing M module nodes and N processor nodes. The 
mapping problem was then reduced to finding cliques of the split graph. Since a fully 
connected system provides the strongest communication capacity, many heuristics 
have been developed for mapping on such a system [70, 82, 40, 86]. Various clustering 
techniques can be used (especially for mapping on a fully connected system) to reduce 
the time complexity of mappings [70, 82, 40, 86].
1.2.4 Mapping of Arbitrary Tasks onto Arbitrary Systems
General mappings from an arbitrary task onto an arbitrary system have proven to be
the most difficult, especially when the task graph and system graph become large
and complex. Bokhari [8] defined the mapping problem as matching the edges
of the task graph to those of the system graph. The order of the task graph was assumed
to be no greater than that of the system graph. The edges were uniformly weighted,
and the mapping was assumed to be a one-to-one and onto mapping, which may not be
the optimal mapping. A heuristic mapping algorithm based on local search using
pair-wise exchange was presented in [8]. The time complexity of this algorithm is
$O(N^3)$. Lee and Aggarwal [56] extended Bokhari’s general mapping algorithm
to take into account a directed task graph with a set of communication phases.
Therefore, communication edges of different phases can be mapped independently.
However, the order of the task graph was still assumed to be no greater than that
of the system graph. This restriction was relaxed in [14]. Stone’s max flow min cut
algorithm, by which an optimal mapping can be obtained on a two-processor system,
can also be used for sub-optimal mappings of M arbitrarily connected modules onto N
processors. Lo used the max flow min cut algorithm as the first step in her heuristic
algorithm [59]. The time complexity of Lo’s heuristic is $O(M^2 N |E_p| \log M)$, where
$|E_p|$ is the number of communication links between processors. El-Rewini and Lewis
presented their mapping heuristic (MH) algorithm in [26]. MH is a list scheduling
heuristic which maps an arbitrary task graph onto an arbitrary system graph. In list
scheduling, each task module is assigned a priority. Whenever a processor is available,
the task module with the highest priority is selected from the list and assigned to this
processor. MH has a time complexity of $O(M^2 N^3)$. A searching algorithm can also
be used to match an arbitrary task graph to an arbitrary system graph [71, 48].
Graph contraction and/or clustering is often used to reduce the task graph before
mapping, thus reducing the time complexity. For some regularly structured tasks,
a specialized graph contraction technique, such as edge grammar, can be used to
reduce the order of the task graph to a desired value [5, 60]. When the task graph is
arbitrarily shaped, heuristic approaches such as simulated annealing, simple greedy,
or critical path can be used for graph contraction or clustering [25, 4, 6, 61, 39, 28,
69, 60, 83, 85, 65, 81, 32].
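The list scheduling idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique and not the MH algorithm of [26]: the processors are assumed identical, communication cost is ignored, and the names `list_schedule`, `preds`, `duration` and `priority` are hypothetical.

```python
import heapq

def list_schedule(preds, duration, priority, num_procs):
    """Greedy list scheduling of a task DAG on identical processors.

    preds:     dict module -> set of parent modules (assumed acyclic)
    duration:  dict module -> execution time
    priority:  dict module -> number; larger means scheduled first
    Returns dict module -> (processor, start_time, finish_time).
    Communication cost is ignored (a simplifying assumption).
    """
    remaining = {m: set(p) for m, p in preds.items()}
    children = {m: [] for m in preds}
    for m, ps in preds.items():
        for p in ps:
            children[p].append(m)

    # Ready list ordered by descending priority (heapq is a min-heap).
    ready = [(-priority[m], m) for m in preds if not remaining[m]]
    heapq.heapify(ready)
    proc_free = [(0, p) for p in range(num_procs)]  # (time free, processor)
    heapq.heapify(proc_free)
    finish = {}
    schedule = {}

    while ready:
        _, m = heapq.heappop(ready)
        free_at, p = heapq.heappop(proc_free)
        # A module cannot start before all of its parents have finished.
        start = max([free_at] + [finish[q] for q in preds[m]])
        end = start + duration[m]
        schedule[m] = (p, start, end)
        finish[m] = end
        heapq.heappush(proc_free, (end, p))
        for c in children[m]:
            remaining[c].discard(m)
            if not remaining[c]:
                heapq.heappush(ready, (-priority[c], c))
    return schedule
```

A heuristic such as MH differs mainly in how priorities are computed and in accounting for the interconnection topology; neither is modeled here.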
An alternative approach to mapping an arbitrary task graph onto an arbitrary
system graph is to first map the task graph onto a completely connected graph
of a certain order, and then map this completely connected graph onto the
target system graph. Sarkar’s edge-zeroing algorithm [70], Wu and Gajski’s Modified
Critical-Path (MCP) and Mobility-Directed (MD) scheduling algorithms [82], and
Yang and Gerasoulis’s Dominant Sequence Clustering (DSC) algorithm [86] fall into
this group. They all produce fast and good mappings from an arbitrary directed
task graph onto a completely connected system graph. However, to finally map onto
an arbitrary target system graph, other mapping (allocation) algorithms such as
Bokhari’s $O(N^3)$ mapping algorithm have to be used, which may increase the overall
time complexity. Also, the final mapping results on an arbitrary system may not be
as good as on a completely connected system.
Most mapping techniques for mapping arbitrary tasks onto arbitrary systems
only consider uniform systems [8, 56, 5, 70, 82, 60, 28, 14, 86, 62, 15]. Therefore,
information about systems such as the computation speed of each processor and the
communication bandwidth on each link is implicitly known before mapping. Even in those
techniques which consider non-uniform systems [71, 59, 72], the information about
the system graphs is incorporated in their task graphs. Only in [26, 15, 17] is the
information about the speed of the processors and the communication links kept
independent of the task graphs. Therefore, these results can be directly used towards
designing portable programs which are representable in the form of machine-independent
task graphs.
1.3 Cluster-M
Cluster-M facilitates the design and mapping of portable parallel programs onto
various multiprocessor systems by clustering not only machine-independent task
graphs but also system graphs. Cluster-M has three components: the Cluster-M
Specification of the given task, the Cluster-M Representation of the underlying
system, and the Cluster-M Mapping Module. Portable programs are specified in
Cluster-M Specifications in a way which represents concurrent computations and
communications at every step of the overall execution. On the other hand, the
processors of the underlying system are clustered in a hierarchical fashion so that all
of those in the same cluster have an efficient communication medium. The Cluster-M
Specification and Representation are actually multi-level clustered graphs of the
directed (or undirected) task graph, called the Spec graph, and the undirected system
graph, called the Rep graph. The user can directly specify the Cluster-M Specification
of a given task which is representable in the form of a Spec graph. On the
other hand, for any given task graph, a Spec graph can be generated by using one of
the clustering algorithms presented in this dissertation. Similarly, a Rep graph can
be generated given the topology of the target system as input.
Both Spec and Rep graphs contain a certain number of clustering levels. In 
each level, there are a number of clusters which are called Spec clusters and Rep 
clusters respectively. A Spec cluster represents a set of concurrent computations 
which have inter-communication between each other. A Rep cluster represents a set 
of processors with a certain level of connection. The mapping is carried out from 
a Spec graph onto a Rep graph by matching Spec clusters to Rep clusters. As the 
number of clusters at each clustering level is much less than the number of original 
task modules or processors, the mapping process becomes very fast, yet produces 
sub-optimal results.
1.4 Contributions and Outline
The mapping technique presented in this dissertation is the first which produces a 
near-optimal mapping of an arbitrary non-uniform machine-independent task graph 
onto an arbitrary non-uniform task-independent system graph. The clustering is 
done only once for a given task graph (system graph) independent of any system 
graphs (task graphs). This is a machine-independent (application-independent) 
clustering and is not repeated for different mappings. The presented recursive 
mapping algorithm maps any task graph with M  modules onto any system graph 
having N  processors in O(MP)  time, where P = max(M, N).  This time complexity 
guarantees faster mappings compared to other leading mapping techniques. Our
experimental results also indicate that Cluster-M produces better or similar mapping 
results compared to other techniques which work only for restricted task or system 
graphs.
The rest of the dissertation is organized as follows. In Chapter 2, we
present Cluster-M programming in the Cluster-M Specification Module. The clustering
algorithms for uniform/non-uniform directed/undirected graphs are given in Chapter
3. Chapter 4 presents mapping algorithms for both uniform and non-uniform graphs,
together with experimental results and comparisons with other known techniques. Related
work on the mapping of specialized heterogeneous tasks, and the combined use of Cluster-M
with another tool, called HAsC, are discussed in Chapters 5 and 6 respectively.
Finally, conclusions are given in Chapter 7.
CHAPTER 2
CLUSTER-M PROGRAMMING
In this chapter, we show how to write portable parallel programs in the form of
Cluster-M Specifications. A set of Cluster-M constructs, which are essential for writing
Cluster-M Specifications, is described. To illustrate programming in Cluster-M
Specifications, several frequently used operations are coded using these constructs.
2.1 Cluster-M Specifications
A Cluster-M Specification of a task is a high level machine-independent program that 
specifies the computation and communication requirements of a given problem. A 
Cluster-M Specification can be translated into a Spec graph which contains multiple 
levels of clustering. In each level, there are a number of Spec clusters representing 
concurrent computations at a certain step. Clusters are merged when there is a need 
for communication among concurrent task modules. For example, if all n elements 
of an array are to be squared, each element is placed in a cluster, then the Cluster-M 
specification would state:
For all n clusters, square the contents.
Figure 2.1 Spec graph of a unary operation on an array of size n.
Note that since no communication is necessary, there is only one level in the
Spec graph as shown in Figure 2.1. The mapping of this Specification to any
architecture having n processors would be identical. Figure 2.2 shows the Spec graph
of a binary associative operation on 8 elements. Initially, each element is in a
cluster. Then clusters are merged at each level when they have inter-communication.
The result of this binary associative operation is obtained in the cluster at the last
clustering level.
Figure 2.2 Spec graph of a binary associative operation on 8 elements.
2.2 Cluster-M Constructs
The basic operations on the clusters and their contained elements are performed 
by a set of constructs which form an integral part of the Cluster-M Specification 
Module. The following is a list and description of the constructs essential for writing 
Cluster-M Specifications.
• CMAKE(LVL, ELEMENTS, x)
This construct creates a cluster x at level LVL which contains ELEMENTS
as its initial elements. ELEMENTS is an ordered tuple of the form
[e_1, e_2, ..., e_n] where n is the total number of components of ELEMENTS.
The components of ELEMENTS could be scalar, vector, mixed-type, or any
type of data structure required by the problem.
• CELEMENT(x, j, e)
This construct yields the j-th element of cluster x, and returns this element as
e. If j is replaced by a wildcard symbol, then CELEMENT yields all the elements of
cluster x. If x is replaced by a wildcard symbol, then CELEMENT yields all the
elements of all clusters.
• CSIZE(x, e)
Returns e as the number of elements of cluster x.
• CMERGE(x, y, ELEMENTS, z)
This construct merges clusters x, y into cluster z. The elements of the new
cluster are given by ELEMENTS. If ELEMENTS in CMERGE is replaced
by a wildcard symbol, the elements of the new cluster are the elements of x
concatenated to the elements of y.
• CUN(op, n, x, i, e)
This construct applies unary operation op to the i-th element of cluster x, and
returns the result in e. If op is a left or right shift operation, the number of shifts
is specified by n.
• CBI(op, x, i, y, j, e)
This construct applies binary operation op to the i-th element of cluster x and
the j-th element of cluster y, and returns the result in e. If i, j are replaced
by a wildcard symbol, then the binary operation is applied to all elements of x, y.
• CSPLIT(E, k, E1, E2)
This construct splits cluster E at the k-th element into two clusters E1 and
E2.
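To make the behavior of the constructs concrete, the following Python sketch models a cluster as a level plus a list of elements. It is an illustration only and not the PCN implementation of Appendix A. Several choices are assumptions: results are returned instead of being bound to an output argument, element indices are 1-based (as suggested by the call CBI(op, X1, 1, X2, 1, e) used later in the macros), `None` stands in for the wildcard, a merged cluster is placed one level above its inputs, and the shift count n of CUN is omitted.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional

@dataclass
class Cluster:
    level: int
    elements: List[Any] = field(default_factory=list)

def cmake(lvl: int, elements: List[Any]) -> Cluster:
    """CMAKE: create a cluster at level lvl holding the given elements."""
    return Cluster(lvl, list(elements))

def celement(x: Cluster, j: Optional[int] = None):
    """CELEMENT: j-th element (1-based, assumed); None returns all elements."""
    return list(x.elements) if j is None else x.elements[j - 1]

def csize(x: Cluster) -> int:
    """CSIZE: number of elements of cluster x."""
    return len(x.elements)

def cmerge(x: Cluster, y: Cluster, elements: Optional[List[Any]] = None) -> Cluster:
    """CMERGE: merge x and y; default contents are x's elements followed by y's."""
    if elements is None:
        elements = x.elements + y.elements
    # Placing the merged cluster one level above its inputs is an assumption.
    return Cluster(max(x.level, y.level) + 1, list(elements))

def cun(op: Callable[[Any], Any], x: Cluster, i: int):
    """CUN: apply unary op to the i-th element of x (shift count n omitted)."""
    return op(x.elements[i - 1])

def cbi(op: Callable[[Any, Any], Any], x: Cluster, i: int, y: Cluster, j: int):
    """CBI: apply binary op to the i-th element of x and the j-th element of y."""
    return op(x.elements[i - 1], y.elements[j - 1])

def csplit(e: Cluster, k: int):
    """CSPLIT: split e after its k-th element (which side gets element k is assumed)."""
    return Cluster(e.level, e.elements[:k]), Cluster(e.level, e.elements[k:])
```

For example, `cmerge(a, b, [cbi(op, a, 1, b, 1)])` combines two single-element clusters into one cluster holding the result of `op`, which is the pattern used by the associative binary operation macro in Section 2.4.1.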
2.3 Implementation of the Cluster-M Constructs
The Cluster-M Specification constructs have been implemented in Program Composition
Notation (PCN), a system for developing and executing parallel programs
[13, 35]. PCN consists of a high-level programming language with C-like syntax, tools
for developing and debugging programs in this language, and interfaces to Fortran 
and C allowing the reuse of existing code in multilingual parallel programs. Programs 
developed using PCN are portable across many different workstations, networks and 
parallel computers. The code portability aspect of PCN makes it suitable as an 
implementation medium for Cluster-M.
PCN focuses on the notion of program composition and emphasizes the 
techniques of using combining forms to put individual components such as blocks, 
procedures and modules together. This encourages the reuse of parallel code, since 
a single combining form can be used to develop many different parallel programs. 
In addition, this facilitates the reuse of sequential code and simplifies development, 
debugging and optimization by exposing the basic structure of parallel programs. 
PCN provides three core primitive composition operators: parallel, sequential, and
choice composition, represented by “||”, “;”, and “?” respectively. More sophisticated
combining forms can be implemented as user-defined extensions to this core notation. 
Such extensions are referred to as templates or user-defined composition operators. 
Program development, both with the core notation and the templates, is supported 
by a portable toolkit. The three main components of the PCN system are illustrated 
in Figure 2.3. The implementation of the seven Cluster-M constructs is listed in 
Appendix A.
2.4 Cluster-M Specification Macros
Several operations are frequently encountered in writing parallel programs. Macros 
can be defined using basic Cluster-M constructs to represent such common operations. 
We next present several macros, their coding in terms of Cluster-M constructs and 
their PCN implementations:
Figure 2.3 PCN system structure (components: portable toolkit, application-specific composition operators, core programming notation).
2.4.1 Associative Binary Operation
Performing an associative binary operation on N elements is a common operation in
parallel applications. The Spec graph for input size N = 8 is given in Figure 2.2. The
resulting Spec graph is an inverted tree with each input value in a leaf cluster at level
1 and the result in the root cluster at level log N + 1. Using Cluster-M constructs,
the macro ASSOC-BIN, written in PCN, applies associative binary operation * to
the N elements of input A and returns the resulting value as follows:
ASSOC_BIN(*, N, A)
int N, A[ ];
{; lvl = 0,
   make_tuple(N, cluster),
   {; i over 0 .. N - 1 ::





   Binary_Op(cluster, N, op, Z)
}
Binary_Op(X, N, op, B)
int N, n;
{? N > 1 -> {; n := N/2,
        make_tuple(n, Y),
        {; i over 0 .. n - 1 ::
            {; BI_MERGE(op, X[2 * i], X[2 * i + 1], Z),
               Y[i] = Z
            }
        },
        Binary_Op(Y, n, op, B)
    },
   default -> B = X
}
BI_MERGE(op, X1, X2, M)
int e;
{; CBI(op, X1, 1, X2, 1, e),
   CMERGE(X1, X2, [e], M)
}
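The level-by-level merging performed by the macro above can be sketched in Python as follows. This is an illustration of the pairwise reduction pattern, not a translation of the PCN code: the function names are hypothetical, and carrying an unpaired last element up to the next level is an assumption made here so that the sketch also works when the input size is not a power of two.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def assoc_bin(op: Callable[[T, T], T], values: Sequence[T]) -> T:
    """Reduce values with an associative op by pairwise merging, level by level."""
    if not values:
        raise ValueError("at least one input element is required")
    level: List[T] = list(values)          # one single-element cluster per input
    while len(level) > 1:
        merged = [op(level[2 * i], level[2 * i + 1])   # merge neighbouring clusters
                  for i in range(len(level) // 2)]
        if len(level) % 2:                 # odd leftover: carried up unchanged
            merged.append(level[-1])       # (an assumption, not in the PCN macro)
        level = merged
    return level[0]

def num_levels(n: int) -> int:
    """Number of clustering levels for n inputs (log2(n) + 1 for powers of two)."""
    levels = 1
    while n > 1:
        n = (n + 1) // 2
        levels += 1
    return levels
```

For 8 inputs this produces 4 levels, matching the inverted tree of Figure 2.2. Because only neighbouring results are combined, the operation needs to be associative but not commutative.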
2.4.2 Vector Dot Product
As a representative example of vector operations, the dot product of two vectors is
considered here. The vector dot product of two n-element vectors A and B is defined
as $d = \sum_{i=1}^{n} (a_i \cdot b_i)$. The Spec graph of this operation is similar to
that shown in Figure 2.2. This macro can be written in terms of Cluster-M constructs
and the above ASSOC-BIN macro as follows:
/* VECTOR DOT PRODUCT */
DOT-PRODUCT(N, op, A, B, Z)
int N, A[], B[], C[N], e;
{; lvl = 0,
   make_tuple(N, A1),
   make_tuple(N, B1),
   {|| i over 0 .. N - 1 ::
       {; CMAKE(lvl, [A[i]], a),
          CMAKE(lvl, [B[i]], b),




   {; j over 0 .. N - 1 ::




   ASSOC_BIN("+", N, C, Z)
}
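The structure of this macro, N independent multiplications followed by an associative sum, can be sketched in Python as below. This is an illustration under stated assumptions rather than a translation of the PCN code: the function name is hypothetical, the products are computed in an ordinary loop instead of in parallel, and an unpaired partial sum is simply carried to the next level.

```python
def dot_product(a, b):
    """Dot product as N independent multiplications followed by a tree sum."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Step 1: one product per index; these are independent of each other.
    level = [x * y for x, y in zip(a, b)]
    if not level:
        return 0
    # Step 2: associative sum by merging neighbouring partial results.
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd leftover carried up (assumption)
        level = nxt
    return level[0]
```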
2.4.3 SIMD Data Parallel Operations
In this class of operations each operation is applied to all the input elements 
without any communication. In this case each operand is assigned one cluster in the 
Cluster-M Specification. The desired operation is applied to all clusters. The macro 
DATA-PAR applies operation * to all N  elements of input A, as follows:
DATA-PAR(op, n, N, A, Z)
int A[ ];
{; lvl = 1,
   make_tuple(N, cluster),
   {; i over 0 .. N - 1 ::
       {; CMAKE(lvl, [A[i]], c),




   {; j over 0 .. N - 1 ::
       {; CUN(op, n, cluster[j], 1, e),




2.4.4 Broadcast Operation
A frequently encountered operation in parallel programs is the broadcast operation, in
which one value is to be broadcast to all processors in the system. The Cluster-M
Specification for a macro that broadcasts one value a from cluster x to N recipient
clusters can be written in terms of Cluster-M constructs as follows:
BROADCAST(N, e, Z)
{; lvl = 0,
   make_tuple(N, Z),
   {|| i over 0 to N - 1 ::









Figure 2.4 Cluster-M Specification of broadcast macro.
CHAPTER 3
CLUSTERING GRAPHS
In this chapter a set of clustering algorithms, which can be used to generate Spec 
and Rep graphs from arbitrary uniform/non-uniform directed/undirected task and 
system graphs, are presented. Clustering algorithms for uniform graphs are presented 
in Section 3.1, and clustering algorithms for non-uniform graphs are presented in 
Section 3.2. The obtained Spec and Rep graphs will be input to the Mapping Module 
as presented in the next chapter.
3.1 Clustering Arbitrary Uniform Graphs
This section addresses clustering algorithms for uniform graphs. If the task graph is 
directed, then the algorithm presented in Section 3.1.1 can be used to obtain the Spec 
graph. If the task graph is undirected, then the algorithm presented in Section 3.1.2 
can be used to generate the Spec graph. Since it is assumed that the connections 
between adjacent processors of the parallel systems studied here are bi-directional, 
the system graphs are always undirected. A Rep graph can be obtained by the 
clustering algorithm for undirected graphs in Section 3.1.2. For every architecture, 
at least one corresponding Cluster-M Rep graph can be constructed. A Rep graph 
with k nested clustering levels represents a connected network of processors with
diameter $\Omega(k)$.
3.1.1 Clustering Directed Graphs
Many clustering techniques have been developed to reduce the order and size of task
graphs [25, 4, 61, 28, 69, 60, 40, 65]. For example, a cluster can be a clan [61], which
is a set of nodes with common outside ancestors and descendants on the task graph.
Our Cluster-M based mapping requires clustering of both the task graph and
the system graph to obtain even better and faster solutions. For clustering either the
task graph or the system graph, the following algorithm is used if the input graph is
directed; otherwise the algorithm presented in the next section (3.1.2) is utilized. In
the scheduling problem, task graphs are directed, while in the task allocation problem
they are not. The system graphs, on the other hand, are always assumed to be
undirected (today’s computers have bi-directional links). Therefore, the algorithm
presented below is to be used only for directed task graphs. In the following, a
formal definition of directed task graphs, which is also applicable to undirected task
graphs (with the exception that for every i, j, $(t_i, t_j) = (t_j, t_i)$), is given.
A task can be represented by a task graph $G_t(V_t, E_t)$, where $V_t = \{t_1, \ldots, t_M\}$ is
a set of task modules to be executed, and $E_t$ is a set of edges representing the partial
orders and communication directions between task modules. A directed edge $(t_i, t_j)$
represents that a data communication exists from module $t_i$ to $t_j$ and that $t_i$ must be
completed before $t_j$ can start. Furthermore, each task module $t_i$ is associated with its
amount of computation $A_i$, and each edge $(t_i, t_j)$ is associated with $D_{ij}$, the amount of
data required to be transmitted from module $t_i$ to module $t_j$. Note that $A_i > 0$ and
$D_{ij} > 0$, for $1 \le i, j \le M$. If a directed edge $(t_i, t_j)$ exists, then $t_i$ is called a parent node
(module) of $t_j$ and $t_j$ a child node (module) of $t_i$. If a node has more than one child, it
is called a Broadcast-node. If a node has more than one parent, it is called a Merge-node.
Task modules are divided into different execution steps and communications
between modules are divided into different execution phases according to the data
and operational precedence. Computations in the same step and communications in
the same phase can be carried out in parallel, but cannot start before their parent
modules in the previous step have finished their computations. In this section, it
is assumed that the amount of computation within each task module and the amount
of data communication between any two task modules are uniform, i.e., $A_i = 1$ and
$D_{ij} = 1$, for $1 \le i, j \le M$ and $(t_i, t_j) \in E_t$. This assumption leads to the simple
greedy clustering in the clustering algorithm.
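The definitions above can be illustrated with a short Python sketch that assigns each module to an execution step and identifies the Broadcast-nodes and Merge-nodes of a directed task graph. The rule used for steps (a module runs one step after its latest parent, and parentless modules run at step 1) is an assumption consistent with the precedence constraint described above; the function name and the edge-list input format are hypothetical.

```python
from collections import defaultdict

def classify_task_graph(edges):
    """edges: iterable of (parent, child) pairs of a directed acyclic task graph.

    Returns (step, broadcast_nodes, merge_nodes), where step[t] is the
    execution step of module t (1 for modules without parents).
    """
    parents, children, nodes = defaultdict(set), defaultdict(set), set()
    for u, v in edges:
        parents[v].add(u)
        children[u].add(v)
        nodes.update((u, v))

    step = {}

    def step_of(t):
        # A module runs one step after its latest parent (assumed rule).
        if t not in step:
            step[t] = 1 + max((step_of(p) for p in parents[t]), default=0)
        return step[t]

    for t in nodes:
        step_of(t)

    broadcast = {t for t in nodes if len(children[t]) > 1}
    merge = {t for t in nodes if len(parents[t]) > 1}
    return step, broadcast, merge
```

For a diamond-shaped graph a → {b, c} → d, module a is a Broadcast-node at step 1, b and c share step 2, and d is a Merge-node at step 3.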
The algorithm for clustering directed graphs is presented in Figure 3.1. The
basic idea is to merge all the nodes in each execution step if they have a common
parent node or a common child node. Each cluster has a size which is defined to
be the number of concurrent nodes contained in this cluster. If a Spec cluster has
a size $\sigma_{S_i}$ and the sizes of its sub-clusters at the lower level are
$\sigma_{S_{i1}}, \ldots, \sigma_{S_{ik}}$, it is obvious that
$\sigma_{S_i} = \sigma_{S_{i1}} + \cdots + \sigma_{S_{ik}}$. Also, some task modules which
cannot be run in parallel will be embedded into a supernode, so that they will finally be
assigned to the same processor. The size of a supernode is still 1. If a parent node $t_i$ has
one or more children, one of its children is to be embedded into $t_i$. If a child node has one
or more parents, it will be embedded into one of its parents.
The complexity of the clustering-directed-graph algorithm is on the order of
the number of edges of the task graph, which is $O(M^2)$, where M is the number
of nodes of the task graph. To illustrate this algorithm, the following example is
presented.
A task graph of 15 modules is shown in Figure 3.2. Each module has a computation
amount of 1, and each edge carries the same amount of data communication. This
task graph contains two subgraphs that are not connected, which means the two
subtasks can be executed in parallel. The Spec graph is constructed by merging the
clusters when they have communication needs, as illustrated in Figure 3.2. The input
task graph has nodes a to o (15 nodes). The final Spec graph is a multi-layered graph
containing 9 nodes. For example, j, k and l are embedded into d, since j, k and l are
in different execution steps and cannot be executed concurrently. This will not only
save processor resources and communication cost, but also reduce the mapping
cost, since the Spec graph now contains only 9 nodes versus the original 15.
Clustering-directed-graphs Algorithm
group nodes of given task graph into corresponding steps
group edges of given task graph into corresponding phases
for all nodes at step 1, do
    make each node into a cluster
for all phases, do
    for all edges (t_i, t_j), do
    begin if t_j is a Merge-node, then
          begin embed t_j to t_i
                if the parent nodes of t_j are not in a cluster, then




          if t_i is a Broadcast-node, then
          begin k = number of nodes in cluster t_i belongs to
                if t_i has more than k children, then
                begin embed first k children to the above k nodes
                      merge the rest into the above cluster
                      increase cluster size
                end
                else embed all children
          end
    end









Figure 3.2 A task graph and the obtained Spec graph. (Steps in constructing the Spec graph: embed j, k and l to d; embed m to e; embed n to g; embed o to f.)
3.1.2 Clustering Undirected Graphs
The algorithm presented in this section can be used to generate the Spec graph of an
undirected task graph (for the allocation problem), as well as the Rep graph of an
(undirected) system graph. The definition of a directed task graph presented in
the last section is also applicable to an undirected task graph (with the exception
that $(t_i, t_j) = (t_j, t_i)$ for all i, j), so this section only presents the definition of
(undirected) system graphs. Then the algorithm for generating a clustered graph (Spec
graph for a task graph or Rep graph for a system graph) from such an undirected input
graph is presented.
A parallel system can be modeled as an undirected system graph $G_p(V_p, E_p)$.
$V_p = \{p_1, \ldots, p_N\}$ is a set of processors forming the underlying architecture, while
$E_p$ is a set of edges representing the interconnection topology of the parallel system. It
is assumed that the connections between adjacent processors of the parallel systems
studied here are bi-directional. Therefore, an edge $(p_i, p_j)$ represents a
direct connection between processors $p_i$ and $p_j$. The computation speed of processor
$p_i$ is denoted by $S_i$, and the communication bandwidth/rate between two processors
$p_i$ and $p_j$ is denoted by $R_{ij}$. In this section, we assume that there is a uniform speed
at each processor and a uniform transmission rate over any direct communication
link in the system, i.e., $S_i = 1$ and $R_{ij} = 1$, for $1 \le i, j \le N$ and $(p_i, p_j) \in E_p$. This
assumption leads to the simple greedy clustering.
To construct a clustered graph (Rep graph or Spec graph) from an undirected
input graph, initially, every node forms a cluster. This node is represented by $p_i$ in
the case of the system graph and by $t_i$ in the case of the task graph. Then clusters
which are completely connected are merged to form a new cluster. This is continued
until no more merging is possible. Two clusters x and y are connected if x contains
a node $p_x$ ($t_x$) and y contains a node $p_y$ ($t_y$), such that nodes $p_x$ ($t_x$) and $p_y$ ($t_y$)
are connected by a direct communication link. Each cluster has a size which is the
Clustering-undirected-graphs Algorithm
for all nodes p_i (t_i), do
    make a cluster for p_i (t_i) at clustering level 1
set cluster level to 1
while merging is possible, do
begin for all clusters c at current level, do
      begin make c into cluster c' at next level
            delete cluster c from current level
            for all clusters x in current level, do
                if x is connected to all sub-clusters of c', then
                begin merge x into c'
                      delete x from current level
                end
      end
      increment clustering level by 1
end
Figure 3.3 Clustering-undirected-graphs algorithm.
number of nodes it contains. If a Rep (Spec) cluster has a size $\sigma_{R_i}$ ($\sigma_{S_i}$) and the
sizes of its sub-clusters at the lower level are $\sigma_{R_{i1}}, \ldots, \sigma_{R_{ik}}$
($\sigma_{S_{i1}}, \ldots, \sigma_{S_{ik}}$), it is obvious that
$\sigma_{R_i} = \sigma_{R_{i1}} + \cdots + \sigma_{R_{ik}}$
($\sigma_{S_i} = \sigma_{S_{i1}} + \cdots + \sigma_{S_{ik}}$). The algorithm for clustering
undirected graphs is shown in Figure 3.3. An example is shown in Figure 3.4. The
undirected graph shown can represent a system graph, therefore the generated output
is shown as a Rep graph. However, if the same input is an undirected task graph for
the allocation problem, then the generated output is a Spec graph. Figures 3.5, 3.6, 3.7,
and 3.8 show the clustered graphs of the hypercube, mesh, ring, and a completely
connected graph respectively.
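The greedy procedure of Figure 3.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this dissertation: clusters are represented as frozensets of node labels, clusters are visited in list order, clustering stops when a pass produces no merge, and the function name is hypothetical.

```python
def cluster_undirected(nodes, edges):
    """Multi-level greedy clustering of an undirected graph.

    Returns a list of levels; each level is a list of clusters, and each
    cluster is a frozenset of the original nodes it contains.
    """
    adj = {frozenset(e) for e in edges}

    def connected(x, y):
        # Two clusters are connected if some node of x has a direct link
        # to some node of y.
        return any(frozenset((u, v)) in adj for u in x for v in y)

    levels = [[frozenset([n]) for n in nodes]]
    while True:
        current = list(levels[-1])
        nxt = []
        while current:
            subs = [current.pop(0)]            # c becomes the seed of c'
            for x in list(current):
                # x joins c' only if it is connected to every sub-cluster of c'.
                if all(connected(x, s) for s in subs):
                    subs.append(x)
                    current.remove(x)
            nxt.append(frozenset().union(*subs))
        if len(nxt) == len(levels[-1]):        # no merge happened: stop
            break
        levels.append(nxt)
    return levels
```

On a 4-node ring this yields three levels with 4, 2 and 1 clusters, while a completely connected 4-node graph collapses to a single cluster at the second level. The size of each cluster is simply `len(cluster)`, which is the sum of the sizes of its sub-clusters.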
The running time of this implementation is analyzed as follows. In every level,
each cluster in that level is compared with the higher numbered clusters in the same
level to check whether they form a clique. Suppose at a certain level of the system
graph (undirected task graph), there are m clusters $c_1, \ldots, c_m$, with each cluster






Figure 3.4 An undirected graph and its clustering.
Figure 3.5 A clustered graph of a hypercube.
Figure 3.6 A clustered graph of a mesh.
Figure 3.7 A clustered graph of a ring.
Figure 3.8 A clustered graph of a completely connected graph.
$c_i$ containing $\rho_i$ ($\tau_i$) nodes, so that $\sum_{i=1}^{m} \rho_i = N$
($\sum_{i=1}^{m} \tau_i = M$), where N is the number of underlying processors (M is
the number of task modules). The time of clustering at this level is dominated by
the total number of comparisons made to determine if each cluster is connected to
all sub-clusters of another cluster at the next level, which is at most
$\sum_{i=1}^{m} \sum_{j=i+1}^{m} \rho_i \rho_j \le \sum_{i=1}^{m} \rho_i N \le N^2$
($\sum_{i=1}^{m} \sum_{j=i+1}^{m} \tau_i \tau_j \le \sum_{i=1}^{m} \tau_i M \le M^2$). The
number of levels can be at most N - 1 (M - 1). Therefore, the total time complexity of
this algorithm is $O(N^3)$ ($O(M^3)$).
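The bound can be spelled out step by step. The derivation below restates the argument for the system graph, assuming that $\rho_i$ denotes the number of nodes in cluster $c_i$ so that the $\rho_i$ sum to N; the task-graph case follows by replacing $\rho_i$ and N with $\tau_i$ and M.

```latex
\begin{align*}
\sum_{i=1}^{m} \sum_{j=i+1}^{m} \rho_i \rho_j
  &\le \sum_{i=1}^{m} \rho_i \Bigl( \sum_{j=1}^{m} \rho_j \Bigr)
   = \sum_{i=1}^{m} \rho_i N = N^2, \\
\text{total time}
  &\le (N-1) \cdot N^2 = O(N^3).
\end{align*}
```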
3.2 Clustering Arbitrary Non-Uniform Graphs
In this section, two algorithms for clustering non-uniform graphs are presented. The 
clustering is done only once for a given task graph (system graph) independent of any 
system graphs (task graphs). It is a machine-independent (application-independent) 
clustering and is not repeated for different mappings. Once a Spec graph and a 
Rep graph are obtained, a sub-optimal mapping can be generated by using a fast 
recursive mapping algorithm to be presented in Chapter 4.
3.2.1 Clustering Non-Uniform Directed Graphs
The definition of a directed task graph is the same as that presented in Section 3.1.1,
except that the computation amount $A_i$ and the data transmission amount $D_{ij}$ may be
non-uniform for different nodes and edges in the task graph.
A clustering algorithm called clustering-non-uniform-directed-graphs is shown
in detail in Figure 3.9. The algorithm is briefly described in the following. It begins
with a quadruple of parameters $(\sigma_S, \delta_S, \Pi_S, \pi_S)$. Each of these parameters is
described as follows. The size of a cluster is denoted by $\sigma_S$, and represents the
maximum number of nodes in this cluster that can be computed in parallel. The
number of levels in a cluster represents the maximum sequential computation length
of each node in the cluster, and is denoted by $\delta_S$. The total amount of communication
among all clustering levels is denoted by $\Pi_S$, and the average communication
amount at the current (top) level is denoted by $\pi_S$. Furthermore, there are two types of
operations performed on clusters: embedding and merging. Embedding is when two
or more sequential clusters are combined into one cluster, as shown in Figure 3.10.
This is shown by a perforated line in Figure 3.10(a) and (b). Merging is when a number
of concurrently executable sub-clusters are grouped to form a new cluster. This is
shown by a solid line in Figure 3.10(b). The embedded cluster in Figure 3.10(a) has
$\sigma_S = 1$, $\delta_S = k$, $\Pi_S = 0$ and $\pi_S = 0$. The merged cluster in Figure 3.10(b) has
$\sigma_S = k$, $\delta_S = 2$, $\Pi_S = k$ and $\pi_S = k - 1$. In each of the two figures, the value of
the quadruple obtained is identical for both the uniform and non-uniform equivalent
representations.

Clustering-non-uniform-directed-graphs Algorithm
group nodes of given task graph into corresponding steps
group edges of given task graph into corresponding phases
for all nodes at step 1, do
begin make it into a cluster
      calculate the parameters of each cluster
end
for all phases, do
    for all edges (t_i, t_j), do
    begin if t_j is a Merge-node, then
          begin sort all incoming edges to t_j in descending
                order of communication amount
                embed the first child node to t_i
                if the parent nodes of t_j are not in a cluster, then
                begin merge them into a cluster
                      calculate the parameters of the new cluster
                end
          end
          if t_i is a Broadcast-node, then
          begin sort all outgoing edges from t_i in descending
                order of communication amount
                embed the first child node to t_i
                if the child nodes of t_i are not in a cluster, then
                begin merge them into a cluster

Figure 3.9 Clustering-non-uniform-directed-graphs algorithm.

(a) Embedding of k sequential computations represented by a node of weight k, or by its equivalent uniform graph having k unit-computation nodes.
(b) Merging of k communications represented by one edge of weight k, or by its equivalent uniform graph having k unit-communication edges.
Figure 3.10 Two types of clustering.
The clustering is done step by step, with each clustering step corresponding to a
computation step. At every step, the nodes (clusters) are clustered as follows. If
a node is a Merge-node, it is first embedded onto one of its parent nodes, and then
all the parent nodes are merged into a larger cluster, similar to what was done in
Section 3.1.1. A simple case where a Merge-node has only two parent nodes is shown in
Figure 3.11. Similarly, a general case is shown in Figure 3.12, where a Merge-node
has n parent nodes, and the n edges are sorted in descending order of edge weight
(communication amount). If a node is a Broadcast-node, then one of its child nodes
is embedded into this Broadcast-node, and then the rest of the child nodes are
merged with the Broadcast-node into a larger cluster. A simple case where a
Broadcast-node has only two child nodes is shown in Figure 3.13. Similarly, a general
case is shown in Figure 3.14, where a Broadcast-node has n child nodes, and the n
edges are sorted in descending order of their weights.

Note that since the task graphs are independent of the system graphs (unlike
[70, 82, 86]), they do not contain information about computation time and
communication delay. Therefore, only one child node can be embedded to the parent
node in both the merge and broadcast cases shown above. Consider the example in
Figure 3.15, where $D_1 > D_2 > \cdots > D_n$. If $A_i$ and $D_i$ ($1 \le i \le n$) were actual
computation times and communication delays, then more than one child node should be
embedded to the parent Broadcast-node if possible. Suppose $D_j > \sum_{i=1}^{j-1} A_i$ for
$j = 2, \ldots, k$, but $D_{k+1} < \sum_{i=1}^{k} A_i$; then the first k child nodes can be
embedded to the parent node. However, since $A_i$ and $D_i$ ($1 \le i \le n$) are just
computation and communication amounts, the time spent on such computations and
communications is unknown before they are mapped onto a particular system.
Therefore, only the first child node can be embedded to the parent node. The
embedding of multiple child nodes can, however, be done as part of the mapping,
which is explained in Chapter 4.

Figure 3.11 Clustering on a Merge-node. [Parent clusters $(\sigma_1, \delta_1, \Pi_1, \pi_1)$ and
$(\sigma_2, \delta_2, \Pi_2, \pi_2)$ feed Merge-node A over edges of weight $D_1$ and $D_2$. Suppose
$D_1 \ge D_2$: embed node A to its left parent cluster, then merge the two clusters,
giving cluster$(\sigma_1 + \sigma_2, \max(\delta_1 + A, \delta_2) + 1, \Pi_1 + \Pi_2 + D_1 + D_2, D_2)$.]
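The quadruple updates printed in Figures 3.11 and 3.13 can be sketched in code. The following is a minimal illustration, assuming the update rules exactly as they are read from those two figures; the class name `SpecCluster`, the function names and the sample values are hypothetical and are not part of the original algorithm description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecCluster:
    """Parameter quadruple of a Spec cluster (field names are illustrative)."""
    size: int        # sigma: nodes computable in parallel
    levels: int      # delta: sequential computation length
    total_comm: int  # Pi: communication over all clustering levels
    top_comm: int    # pi: communication at the current (top) level


def cluster_merge_node(c1, c2, a, d1, d2):
    """Two-parent Merge-node rule as read from Figure 3.11.

    c1, c2 are the parent clusters, a is the computation amount of the
    Merge-node, and d1, d2 are the weights of the edges from c1 and c2.
    The Merge-node is embedded into the parent with the heavier edge.
    """
    if d1 < d2:  # reorder so that (c1, d1) is the heavier side
        c1, c2, d1, d2 = c2, c1, d2, d1
    return SpecCluster(
        size=c1.size + c2.size,
        levels=max(c1.levels + a, c2.levels) + 1,
        total_comm=c1.total_comm + c2.total_comm + d1 + d2,
        top_comm=d2,
    )


def cluster_broadcast_node(c0, a1, a2, d1, d2):
    """Two-child Broadcast-node rule as read from Figure 3.13.

    c0 is the cluster of the Broadcast-node; a1, a2 are the computation
    amounts of the two children and d1, d2 the weights of their edges.
    """
    if d1 < d2:  # the child on the heavier edge is the one embedded
        a1, a2, d1, d2 = a2, a1, d2, d1
    return SpecCluster(
        size=c0.size + 1,
        levels=max(c0.levels + a1, a2) + 1,
        total_comm=c0.total_comm + d1 + d2,
        top_comm=d2,
    )


# Hypothetical values, chosen only to exercise the rule.
left = SpecCluster(1, 3, 0, 0)
right = SpecCluster(1, 2, 0, 0)
merged = cluster_merge_node(left, right, a=4, d1=5, d2=2)
```

Swapping the two parents (together with their edge weights) yields the same cluster, since the rule always embeds along the heavier edge.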
Figure 3.12 Clustering on a Merge-node: a general case. [Parent clusters
$(\sigma_1, \delta_1, \Pi_1, \pi_1)$, $(\sigma_2, \delta_2, \Pi_2, \pi_2)$, ..., $(\sigma_n, \delta_n, \Pi_n, \pi_n)$.]

Figure 3.13 Clustering on a Broadcast-node. [Cluster $(\sigma_0, \delta_0, \Pi_0, \pi_0)$ broadcasts to
child nodes $A_1$ and $A_2$ over edges of weight $D_1$ and $D_2$. Suppose $D_1 \ge D_2$: the result
is cluster$(\sigma_0 + 1, \max(\delta_0 + A_1, A_2) + 1, \Pi_0 + D_1 + D_2, D_2)$.]

Figure 3.14 Clustering on a Broadcast-node: a general case. [Cluster
$(\sigma_0, \delta_0, \Pi_0, \pi_0)$ with n child nodes; the result is
cluster$(\sigma_0 + n - 1, \max(\delta_0 + A_1, A_2, \ldots, A_n) + 1, \Pi_0 + \sum_{i=1}^{n} D_i, \sum_{i=2}^{n} D_i)$.]

Figure 3.15 Possible embedding on a Broadcast-node.

Figure 3.16 A task graph and the obtained Spec graph. [Panel: Constructing the Spec
graph; cluster labels include (1, 18, 26, 6) and (1, 4, 6, 6).]
The time complexity of the clustering-non-uniform-directed-graphs algorithm
is bounded by the number of edges in the task graph, which is $O(M^2)$, where $M$
is the number of nodes. To illustrate this algorithm, consider the task graph of
7 modules and its Spec graph shown in Figure 3.16. Each module is labeled
with its computation amount, and each edge is labeled with the amount of data
communication. The Spec graph is constructed by merging the clusters when they
have communication needs. The final Spec graph is a multi-level clustered graph.
3.2.2 Clustering Non-Uniform Undirected Graphs
The algorithm presented in this section can be used for generating the Spec graph
of an undirected task graph (for the allocation problem), as well as the Rep graph of
an (undirected) system graph. The definition of the directed task graph presented in
the last section also applies to an undirected task graph, except that
$(t_i, t_j) = (t_j, t_i)$ for all $i, j$. The definition of the undirected system graph is the
same as that presented in Section 3.1.2, except that both the computation speeds $S_i$
of different processors and the transmission rates $R_{ij}$ of different communication
links may be non-uniform. Therefore, the system discussed in this section can be a
truly heterogeneous system. The rest of the section concentrates on the clustering of
an undirected system graph to obtain Rep graphs. The same approach can be used to
obtain a Spec graph from a non-uniform undirected task graph.
Similar to a Spec cluster, each Rep cluster is associated with a quadruple
$(\sigma_R, \delta_R, \Pi_R, \pi_R)$, which represents the number of processors contained in the
cluster, the average computation speed of the processors in the cluster, the total
communication bandwidth, and the average communication bandwidth at the current
(top) clustering level. To construct a clustered graph (Rep graph) from an undirected
system graph, initially every node with computation speed $S_i$ forms a cluster with
parameters $(1, S_i, 0, 0)$. Then clusters which are completely connected are merged to
form a new cluster,
Clustering-non-uniform-undirected-graphs Algorithm
for all nodes p_i, do
begin make a cluster for p_i at clustering level 1
      set the parameters of the cluster to be (1, S_i, 0, 0)
end
set cluster level to 1
while there is at least one edge linking two clusters, do
begin sort all edges linking any two clusters
      while sorted edge list is not empty, do
      begin take the first edge (c_i, c_j) from sorted edge list
            delete the edge from the list
            merge c_i and c_j into cluster c' at next level
            calculate the parameters of c'
            delete clusters c_i and c_j from current level
            for each edge (c_x, c_y) in sorted edge list
               if c_x is a sub-cluster of c' and
                  c_y is not a sub-cluster of any cluster and
                  c_y is connected to all other sub-clusters of c', then
               begin merge c_y into c'
                     recalculate the parameters of c'
                     delete (c_x, c_y) from edge list
               end
               else if c_x and c_y are sub-clusters of
                       two different clusters at next level, then
               begin add the weight of (c_x, c_y) to
                       the edge between the two super-clusters
                     delete (c_x, c_y) from edge list
               end
      end
      increment clustering level by 1
end
Figure 3.17 Clustering-non-uniform-undirected-graphs algorithm.

Figure 3.18 A non-uniform system graph and its clustering. [Cluster labels shown:
(3, 5/3, 1, 1), (2, 2, 2, 2) and (1, 1, 0, 0).]
and the parameters of the new cluster are calculated accordingly. This process
continues until no more merging is possible. Two clusters x and y are connected if
x contains a node $p_x$ ($t_x$) and y contains a node $p_y$ ($t_y$) such that $p_x$ ($t_x$) and
$p_y$ ($t_y$) are connected by a direct communication link. The algorithm for clustering
undirected graphs is shown in Figure 3.17, and an example is shown in Figure 3.18.
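As a small illustration of how the first two Rep parameters evolve during merging, the sketch below starts from initial clusters $(1, S_i, 0, 0)$ and combines them. It assumes that the average speed of a merged cluster is the processor-weighted mean of its sub-clusters, which agrees with the labels (2, 2, 2, 2), (1, 1, 0, 0) and (3, 5/3, 1, 1) in Figure 3.18 if the three processors have speeds 2, 2 and 1. The rule for the two bandwidth parameters is not spelled out in the text, so the sketch takes them as explicit inputs. All names are hypothetical.

```python
from fractions import Fraction
from typing import NamedTuple


class RepCluster(NamedTuple):
    size: int            # sigma_R: number of processors
    avg_speed: Fraction  # delta_R: average computation speed
    total_bw: Fraction   # Pi_R: total communication bandwidth
    top_bw: Fraction     # pi_R: average bandwidth at the top level


def initial_cluster(speed):
    """A single processor of the given speed: (1, S_i, 0, 0)."""
    return RepCluster(1, Fraction(speed), Fraction(0), Fraction(0))


def merge(subclusters, total_bw, top_bw):
    """Merge sub-clusters; speed is the processor-weighted mean (assumption).

    total_bw and top_bw are supplied by the caller because their update
    rule is not given in the surrounding text.
    """
    size = sum(c.size for c in subclusters)
    speed = sum(c.size * c.avg_speed for c in subclusters) / size
    return RepCluster(size, speed, Fraction(total_bw), Fraction(top_bw))


# Bandwidth values are copied from the labels of Figure 3.18.
fast_pair = merge([initial_cluster(2), initial_cluster(2)], total_bw=2, top_bw=2)
top = merge([fast_pair, initial_cluster(1)], total_bw=1, top_bw=1)
```

Using `Fraction` keeps the average speed exact, so the top-level cluster carries 5/3 rather than a rounded float.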
The running time of this algorithm is analyzed as follows. At each level, all the
edges between clusters are first sorted, which takes $O(|E_p| \log |E_p|)$ time, where
$|E_p|$ is the number of edges in the system graph. Clusters keep merging into the next
levels. Suppose that at a certain level there are $m$ clusters $c_1, \ldots, c_m$. The time for
all the comparisons needed to merge them is at most $m \times m \le N^2$, where $N$ is the
number of processors in the system graph. The number of levels can be at most
$N - 1$. Therefore the total time complexity of this algorithm is
$O(N(|E_p| \log |E_p| + N^2))$. In the worst case, where the system graph is completely
connected so that $|E_p| = O(N^2)$, the upper bound for this algorithm is $O(N^3 \log N)$.
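The last step can be written out explicitly. The short derivation below only substitutes the worst-case edge count into the bound stated above and uses no additional assumptions.

```latex
\begin{align*}
O\bigl(N(|E_p| \log |E_p| + N^2)\bigr)
  &= O\bigl(N(N^2 \log N^2 + N^2)\bigr) && \text{since } |E_p| = O(N^2) \\
  &= O\bigl(N(2 N^2 \log N + N^2)\bigr) \\
  &= O(N^3 \log N).
\end{align*}
```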
CHAPTER 4
CLUSTER-M MAPPING
For a given problem, a high-level machine-independent parallel program can be
presented in the form of a Cluster-M Specification, which is directly representable
as a Spec graph. In addition, a Spec graph can be generated directly from a given task
graph using one of the algorithms in the last chapter. On the other hand, given a
system graph representing an underlying architecture or organization, a Rep graph
can be generated as shown in the last chapter. In this chapter, given a Spec graph and
a Rep graph as the input to the Mapping Module, efficient mapping algorithms are
presented which produce a sub-optimal matching of the two. The mapping procedure
presented in this chapter has a much lower time complexity than traditional
mappings, since it consists of a graph matching procedure for which both input
graphs have already been clustered. The uniform mapping algorithm presented in
Section 4.1 has a time complexity of $O(MN)$, while the non-uniform mapping
algorithm in Section 4.3 has a time complexity of $O(MP)$, where $P = \max(M, N)$.
Extensive experimental results and comparisons with other leading mapping
techniques are also presented in this chapter.
4.1 Cluster-M Uniform Mapping
This section presents a fast recursive mapping algorithm which produces a sub-optimal
mapping of a uniform Spec graph onto a uniform Rep graph in $O(MN)$ time. A set




4.1.1 Uniform Mapping Algorithm
The mapping function is defined by $f_m : V_t \to V_p$. Following the precedence
constraints and the computation and communication requirements of the original
task graph, a schedule can be obtained which places each task module $t_i$ onto
processor $f_m(t_i)$ at the proper time (the earliest possible starting time). Since the
communication weights along the edges of both uniform task graphs and system graphs
are 1, the communication time of the task graph edge $(t_i, t_j)$ is equal to
$dist(f_m(t_i), f_m(t_j))$, where $dist(p_i, p_j)$ is the shortest distance between processors
$p_i$ and $p_j$. It is also assumed that it takes no time to communicate data within the
same processor, i.e., $dist(p_i, p_i) = 0$.
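For a uniform system graph, the shortest distance is a hop count and can be found by breadth-first search. The sketch below is a minimal illustration; the adjacency-list representation, the four-processor ring and the task names are assumptions made for the example only.

```python
from collections import deque


def dist(adjacency, src, dst):
    """Hop count of the shortest path between two processors (BFS).

    adjacency maps each processor to the list of its direct neighbors.
    Returns 0 when src == dst and None when dst is unreachable.
    """
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        for nxt in adjacency[node]:
            if nxt == dst:
                return hops + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


# Hypothetical 4-processor ring and a mapping of two task modules.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
mapping = {"t1": 0, "t2": 2}

# Communication time of task edge (t1, t2) under this mapping.
comm_time = dist(ring, mapping["t1"], mapping["t2"])
```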
A schedule can be illustrated by a Gantt chart, which consists of a list of all
processors and, for each processor, a list of all task modules allocated to that
processor ordered by their execution time [27]. The total execution time of a
schedule, denoted by $T_m$, is the latest finishing time of the last scheduled task
module on any processor. $T_m$ is thus equal to the total execution time of the given
task on the given system. As the shortest execution time of a given task on a system
is the ultimate goal in scheduling, $T_m$ is taken as the measure of the quality of a
mapping. However, since $T_m$ can only be calculated once a schedule has been
obtained, it is difficult to predict $T_m$ during the mapping process. Therefore,
another objective function is presented as the mapping heuristic to guide the
mapping process.
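To make the definition of $T_m$ concrete, the following sketch derives a schedule from a fixed mapping by starting every task module as early as its data and its processor allow. It is only an illustration: the function name, the input layout and the rule that tasks sharing a processor run in the given topological order are assumptions, not details taken from the text.

```python
def earliest_start_schedule(order, preds, comp, proc_of, dist):
    """Return (finish_times, T_m) for a fixed task-to-processor mapping.

    order   : task modules in a topological order of the task graph
    preds   : dict task -> list of predecessor tasks
    comp    : dict task -> computation time of the task
    proc_of : dict task -> processor (the mapping f_m)
    dist    : function (p, q) -> communication time, with dist(p, p) == 0
    """
    finish = {}
    free_at = {}  # time at which each processor becomes idle
    for t in order:
        p = proc_of[t]
        # A task may start only after all incoming data has arrived.
        data_ready = max(
            (finish[u] + dist(proc_of[u], p) for u in preds.get(t, [])),
            default=0,
        )
        start = max(data_ready, free_at.get(p, 0))
        finish[t] = start + comp[t]
        free_at[p] = finish[t]
    return finish, max(finish.values())


# Hypothetical example: a and b feed c; two processors one hop apart.
finish, t_m = earliest_start_schedule(
    order=["a", "b", "c"],
    preds={"c": ["a", "b"]},
    comp={"a": 1, "b": 1, "c": 1},
    proc_of={"a": "P0", "b": "P1", "c": "P0"},
    dist=lambda p, q: 0 if p == q else 1,
)
```

In this example, module c has to wait one time unit for the data from b to cross the link, so the schedule finishes at time 3 rather than 2.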
A detailed description of the uniform mapping algorithm is presented in Figure
4.1; an overview is given here. Before starting the mapping, it is necessary to
compute a reduction factor, denoted by $f$, which is essential for mapping task
graphs having more nodes than the system graphs. The reduction factor $f$ is the
ratio of the total size of the Rep clusters over the total size of the Spec clusters.
It is used to estimate how many computation nodes need to share a processor. The
mapping is done recursively at each clustering level, where the best matching between
Spec clusters and Rep clusters is found. To match Spec clusters to Rep clusters,
first the Spec and Rep clusters are sorted in descending order of their sizes. Then,
to map each Spec cluster $\kappa_{S_i}$ with size $\sigma_{S_i}$, a Rep cluster with the best matched
size, i.e., closest to $f \times \sigma_{S_i}$, is searched for. Therefore, the objective function
to be minimized can be formulated as:

$|f_m| = \sum_i \left| f \times \sigma_{S_i} - \sigma_{f_m(\kappa_{S_i})} \right|$    (4.1)

As shown in (4.1), the objective is to find a minimum $|f_m|$ by matching all Spec
clusters to Rep clusters. When the mapping at the top level is done, for each pair of
mapped Spec and Rep clusters, the same mapping procedure is continued recursively
at a lower level until the mapping is refined to the processor level.
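The size-matching step at one level can be sketched as follows. This is a simplified illustration that always picks the remaining Rep cluster whose size is closest to $f \times \sigma_{S_i}$; it does not implement the splitting and merging of Rep clusters described in Figure 4.1. The function name is hypothetical, and the sample sizes (Spec clusters of sizes 5 and 4, Rep clusters of sizes 5 and 3) are the ones used in the example of Section 4.1.2.

```python
from fractions import Fraction


def match_by_size(spec_sizes, rep_sizes):
    """Greedy closest-size matching at one clustering level.

    Returns (f, pairs, objective), where pairs holds (spec_size, rep_size)
    tuples and objective is the value of Equation (4.1) for this matching.
    """
    # Reduction factor, capped at 1 as in Figure 4.1.
    f = min(Fraction(sum(rep_sizes), sum(spec_sizes)), Fraction(1))
    remaining = sorted(rep_sizes, reverse=True)
    pairs = []
    objective = Fraction(0)
    for s in sorted(spec_sizes, reverse=True):
        if not remaining:
            break  # no Rep cluster left; merge/split is not modeled here
        target = f * s
        best = min(remaining, key=lambda r: abs(target - r))
        remaining.remove(best)
        pairs.append((s, best))
        objective += abs(target - best)
    return f, pairs, objective


f, pairs, objective = match_by_size([4, 5], [3, 5])
```

With these sizes the sketch gives $f = 8/9$, matches 5 to 5 and 4 to 3, and evaluates the objective to $5/9 + 5/9 = 10/9$.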
The time complexity of the mapping algorithm is dominated by the time of
finding the best matches of all Spec clusters at all levels. At each level $i$, the time
complexity of finding the best matches of all $K_i$ Spec clusters is $O(K_i N)$, as the
total number of clusters in the Rep graph is $O(N)$, where $N$ is the number of
processors. Since the total number of Spec clusters is $O(M)$, i.e., $\sum_i K_i = O(M)$,
where $M$ is the number of nodes in the original task graph, the total time complexity
of this mapping algorithm is $O(MN)$.
4.1.2 Uniform Mapping Examples
In Section 3.1, a Spec graph and a Rep graph have been constructed from the original
uniform task graph and system graph, as shown in Figures 3.2 and 3.4. Figure 4.2
shows the mapping from the obtained Spec graph to the Rep graph following the
mapping algorithm described above. First, calculate $\sigma_S = 9$, $\sigma_R = 8$ and $f = 8/9$.
Then sort the Spec and Rep clusters at the top level, and find the matching Rep cluster
for each Spec cluster. The Spec cluster of size 5 is mapped onto the Rep cluster of
Cluster-M Uniform Mapping Algorithm
sort all Spec clusters at top level in descending order of sizes.
sort all Rep clusters at top level in descending order of sizes.
calculate $\sigma_S$, $\sigma_R$ and $f$.
if $f > 1$, let $f = 1$.
calculate the required size of the matching Rep cluster to be $f \times \sigma_{S_i}$
for each Spec cluster at top level sorted list, do
begin if a Rep cluster of required size is found, then
         match the Spec cluster to the Rep cluster
         delete the Spec and Rep cluster from Spec and Rep list
end
for each unmatched Spec cluster, do
begin if the size of the first Rep cluster > the required size, then
      begin split the Rep cluster into two parts with one part having the required size
            match the Spec cluster to this part
            insert the other part to proper position of the sorted Rep cluster list
      end
      else
      begin merge Rep clusters until the sum of sizes $\ge$ the required size
            if =, then
               match the Spec cluster to the merged Rep cluster
            else
            begin split the merged Rep cluster into two parts with one having required size
                  match the Spec cluster to this part

for each matching pair of Spec cluster and Rep cluster, do
begin if the Rep cluster contains only one processor, then
         map all the modules in the Spec cluster to the processor
      else
      begin go to the sub-clusters of the Spec and Rep cluster
              (thus they are pushed to top level)
            call the same mapping algorithm for these clusters
      end
end
Figure 4.1 Uniform mapping algorithm
Figure 4.2 A mapping example. [Cluster-M Specification mapped onto Cluster-M
Representation with $f = 8/9$; the panels show the successive mapping steps, ending
with Step 4.]

Figure 4.3 Gantt chart of the obtained schedule. [Rows recoverable from the chart:
processor F: c, d, j, k, l; processor G: b; processor H: a.]
the same size; however, the Spec cluster of size 4 has to be mapped onto the Rep
cluster of size 3, since this is the closest matching of sizes. Then the same procedure
is applied to the Spec clusters at the lower level. As shown in step 2 in Figure 4.2,
task module a is mapped onto Rep cluster H, which contains a single processor. In step
3, modules b, e, f, g, h and i are mapped onto corresponding processors. Finally, in
step 4, modules c and d are both mapped onto processor F. Since modules j, k and
l are embedded to module d (see Figure 3.2), they are also mapped onto processor
F, onto which d is mapped. Similarly, modules m, n and o are mapped onto
processors D, A and E respectively. Now all the task modules in the original task
graph have been mapped onto corresponding processors. Figure 4.3 shows the final
schedule obtained from the above mapping by following the data and operational
precedence of the task graph. As shown in the Gantt chart, $T_m = 6$.
4.2 Uniform Mapping Comparison Results
This section presents a set of experimental results obtained in comparing the
Cluster-M mapping algorithm with other leading techniques for mapping
uniform arbitrary task graphs onto uniform arbitrary system graphs. The examples
selected here are the same as those presented and experimented with by the authors
of the papers reporting the leading techniques [8, 56]. The following three criteria
are used for evaluating the performance of the algorithms examined: (1) the total
time complexity of executing the mapping algorithm, $T_c$; (2) the total execution time
of the generated mappings, $T_m$; and (3) the number of processors used, $N_m$. From (2)
and (3), speedup $S_m = T_s / T_m$ and efficiency $\eta = S_m / N_m$ can be obtained, where
$T_s$ is the sequential execution time of the task. In the following, we present the
comparison results for both the scheduling problem and the allocation problem.
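The two derived measures are simple ratios, as the sketch below illustrates. The values $T_m = 11$ and $N_m = 8$ are those reported for the Cluster-M mapping in Figure 4.4(c); the sequential time $T_s = 22$ is a hypothetical value chosen only to make the arithmetic easy to follow, and the function names are illustrative.

```python
def speedup(t_s, t_m):
    """S_m = T_s / T_m."""
    return t_s / t_m


def efficiency(t_s, t_m, n_m):
    """eta = S_m / N_m."""
    return speedup(t_s, t_m) / n_m


t_s = 22  # hypothetical sequential execution time
s_m = speedup(t_s, t_m=11)
eta = efficiency(t_s, t_m=11, n_m=8)
```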
4.2.1 Task Scheduling Results
In the comparison study of task scheduling of uniform graphs, we choose Lee and
Aggarwal’s mapping strategy [56]. Their strategy considers the task graph as a
directed graph and differentiates nodes and edges into different computation steps
and communication phases in order to accurately calculate the actual communication
cost between two non-adjacent processors. However, Lee and Aggarwal’s strategy
maps the entire task graph onto the system graph without graph contraction or
clustering. Also, it assumes that the order of the task graph is no greater than
that of the system graph. The time complexity of Lee and Aggarwal’s algorithm is
$O(N^3)$, while ours is $O(MN)$ (i.e., if $M = N$, then ours is $O(N^2)$).
Given a task graph as shown in Figure 4.4(a), the mapping obtained by Lee 
and Aggarwal on a 16-processor hypercube is (iQ t\ t-r t9 t9 t2 110 / 1 3  t5 tn  114 t6 
18  ^ 1 5  1^ 2 ) [56]. The final schedule following the task graph precedency is illustrated 
in Figure 4.4(b), while the schedule obtained from Cluster-M mapping is illustrated
in Figure 4.4(c). An optimal schedule, which also uses fewer processors,
is shown in Figure 4.4(d).
4.2.2 Task Allocation Results
The goal of task allocation is to minimize the communication delay between 
processors and to balance the load among processors. The problem of task allocation 












(b) Lee and Aggarwal’s mapping, $T_c = O(N^3)$, $T_m = 12$
(c) Cluster-M mapping, $T_c = O(MN)$, $T_m = 11$, $N_m = 8$
(d) An optimal mapping, $T_c = O(2^{MN})$, $T_m = 8$, $N_m = 4$
Figure 4.4 Comparison example with Lee and Aggarwal’s strategy.

Figure 4.5 Comparison with Bokhari’s mapping: task and system graph.
Therefore, the task graph in task allocation is undirected and the clustering- 
undirected-graphs algorithm is used to generate the Spec graph in this case. The 
measure of mapping quality in task allocation is still Tm.
The Cluster-M mapping algorithm is compared to Bokhari’s mapping (allocation)
algorithm [8] using uniform undirected task graphs. Bokhari’s algorithm has a
running time complexity of $O(N^3)$, while Cluster-M has $O(MN)$. Bokhari’s
algorithm assumes that the number of task modules is no greater than the number
of processors, so that the mapping can be one to one. In this case, a lower bound
on $T_m$ is $\delta + 1$, where $\delta$ is the degree of the given task graph.
In comparing Cluster-M with Bokhari’s algorithm on the example shown in Figure 4.5,
which has a 33-node task graph and a 6 × 6 finite element machine (FEM) [8], a Sun
SPARCstation 1 was used. The results are shown in Table 4.1. The lower bound
on $T_m$ as described before is 9, and both Cluster-M and Bokhari’s algorithms
obtained near-optimal results, $T_m = 17$ and 13, respectively. The above
example uses the same structured task and system graphs as tried in [8]. Other
randomly generated task and system graphs have also been tested. Table 4.2 shows
the mapping results and comparisons for 10 randomly generated task and system
graphs of 10 nodes. Similar results were obtained for other random graphs. Bokhari’s
algorithm is not applicable to mapping a larger task graph onto a smaller system
graph (called cardinality variation [5]). However, Cluster-M mapping is an efficient
approach to mapping problems having topological and cardinality variation in
their input graphs [5].
4.3 Cluster-M Non-Uniform Mapping
This section presents an efficient mapping algorithm which produces a sub-optimal
matching of a non-uniform Spec graph onto a non-uniform Rep graph in $O(MP)$
time, where $P = \max(M, N)$. A high-level description of the mapping algorithm is
presented in Section 4.3.1. In Section 4.3.2, a few examples are given to illustrate
the non-uniform mapping algorithm.
4.3.1 Non-Uniform Mapping Algorithm
The mapping function is defined to be $f_m : V_t \to V_p$, as in Section 4.1.
However, since both the task graph and the system graph may be non-uniform,
we assume that the communication time of the task graph edge $(t_i, t_j)$ is equal to
the sum, over all links $(x, y) \in path(f_m(t_i), f_m(t_j))$, of the transmission time on
each link, where $path(p_i, p_j)$ is the shortest path between processors $p_i$ and $p_j$.
A detailed description of the non-uniform mapping algorithm is presented in
Figure 4.6. In the following, an overview of the algorithm is given. Before the
mapping begins, it is necessary to compute a reduction factor, denoted by $f$, which
is essential for mapping task graphs having more nodes than the system graphs.
The reduction factor $f$ is the ratio of the total size of the Rep clusters over the total
size of the Spec clusters. It is used to estimate how many computation nodes
are to share a processor. The mapping is done recursively at each clustering level,
where the best matching between Spec clusters and Rep clusters is found. To
match Spec clusters to Rep clusters, first the Spec and Rep clusters are sorted
Table 4.1
                  Bokhari      Cluster-M
$T_c$             $O(N^3)$     $O(MN)$
$T_m$             13           17
$T_t$ (sec)       152.5        0.05
Table 4.2 Comparisons of mappings of Bokhari’s algorithm and Cluster-M

Random graph    $T_m$                                running time (sec)
of 10 nodes     Bokhari  Cluster-M  lower bound      Bokhari  Cluster-M
 1              15       15         8                0.82     0.03
 2               9       13         7                1.58     0.03
 3              10       11         8                1.20     0.03
 4              11       14         8                1.00     0.03
 5              11       12         9                1.02     0.03
 6              10       12         8                2.35     0.02
 7              11       12         8                1.40     0.03
 8              10       12         8                1.18     0.03
 9              10       13         9                1.20     0.02
10               9       10         7                1.03     0.02
in descending order with respect to the four parameters $(\sigma, \delta, \Pi, \pi)$. For example,
Spec clusters with larger sizes are sorted before those with smaller sizes, and among
Spec clusters with the same size, those with a larger number of levels are sorted first.
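This ordering can be sketched as a descending lexicographic sort on the quadruple. The sketch assumes that the four parameters are compared in the order $\sigma$, $\delta$, $\Pi$, $\pi$, with later parameters used only to break ties; the sample quadruples are illustrative.

```python
def sort_clusters(quadruples):
    """Sort (sigma, delta, Pi, pi) tuples in descending lexicographic order.

    Python compares tuples element by element, so size decides first and
    the number of levels breaks ties between clusters of equal size.
    """
    return sorted(quadruples, reverse=True)


clusters = [(1, 4, 6, 6), (2, 13, 12, 2), (2, 17, 20, 0)]
ordered = sort_clusters(clusters)
```

Here the two clusters of size 2 precede the cluster of size 1, and among them the one with 17 levels comes before the one with 13.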
Second, each of the Spec clusters (denoted by $\kappa_{S_i}$) is mapped as follows. First,
search for the Rep cluster (denoted by $\kappa_{R_j}$) with the best matched size, i.e., closest
to $f \times \sigma_{S_i}$. The first objective function in mapping is thus the same as formulated in
Equation (4.1) in Section 4.1.1. If multiple Rep clusters with the matching size are
found, the one with the minimum estimated execution time is selected. The estimated
execution time of mapping Spec cluster $\kappa_{S_i}$ onto Rep cluster $\kappa_{R_j}$, $\tau(\kappa_{S_i}, \kappa_{R_j})$, is
equal to the number of clustering levels of $\kappa_{S_i}$ times the average computation and
communication time at each level, as formulated in Equation (4.2). If no Rep cluster
with a matching size can be found for a Spec cluster, Rep clusters are either merged
or split (unmerged) until a matching Rep cluster is found.
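$\tau(\kappa_{S_i}, \kappa_{R_j}) = \delta_{S_i} \times \left( \frac{1}{\delta_{R_j}} \times \frac{1}{f} + \frac{\Pi_{S_i} \times \sigma_{R_j} \times f}{\Pi_{R_j} \times \sigma_{S_i}} \right)$    (4.2)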
Thirdly, for every matched pair of Spec and Rep clusters, the following
is done to embed communication-intensive nodes together. (This is similar to the
clustering process in [70, 82, 86]. However, here it is only done in the mapping step,
so that the clustering of the task graph is kept independent of the system graph,
as described in Chapter 3.) If a Spec cluster has multiple sub-clusters and the
average communication time between these sub-clusters is greater than the possible
computation time of a sub-cluster, as formulated in Inequality (4.3), then the
sub-clusters are embedded onto the sub-cluster having the largest size, and the
parameter quadruple of the new cluster is calculated. The new cluster is then inserted
in the proper position in the sorted list of Spec clusters for mapping, and the matching
described above by Equations (4.1) and (4.2) is repeated for the remaining Spec
clusters in the list. If no embedding is necessary, then the mapping of this Spec
cluster onto a Rep cluster is done for this level, and therefore this Spec cluster is
removed from the list.
$\frac{\pi_{S_i}}{\pi_{R_j}} > \frac{\min(\sigma_{sub\text{-}cluster} \times \delta_{sub\text{-}cluster})}{\delta_{R_j}} \times \frac{1}{f}$    (4.3)
In the above mapping algorithm, the worst case of a mapping at level $i$
happens either when (case 1), for each Spec cluster, all the remaining Rep clusters
have the matching size, so that Equation (4.2) is used to select the best Rep cluster;
or when (case 2), for each Spec cluster, no Rep cluster of matching size is found,
so that Rep clusters are merged or split recursively until a Rep cluster of matching
size is obtained. Suppose the number of Spec clusters at level $i$ is $K_i$. In both cases
described above, or in any combination of the two, it takes $O(K_i N)$ time to find
the best matches of all $K_i$ Spec clusters, as the total number of clusters in the Rep
graph is $O(N)$, where $N$ is the number of processors. For each pair of matching Spec
and Rep clusters, if Inequality (4.3) is satisfied, the extra time taken in embedding
is $O(M)$. Since the total number of Spec clusters is $O(M)$, i.e., $\sum_i K_i = O(M)$,
where $M$ is the number of nodes in the original task graph, the total time
complexity of this mapping algorithm is $\sum_i (K_i N + M) = O(MN) + O(M^2) =
O(MP)$, where $P = \max(M, N)$.
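The chain of equalities can be expanded as follows. This is a short worked derivation rather than part of the original analysis; it assumes that the sum ranges over at most $O(M)$ terms, which is what makes the second term $O(M^2)$.

```latex
\begin{align*}
\sum_i (K_i N + M)
  &= N \sum_i K_i + \sum_i M \\
  &= O(MN) + O(M^2)
     && \text{since } \textstyle\sum_i K_i = O(M) \text{ and there are } O(M) \text{ terms} \\
  &= O\bigl(M (N + M)\bigr) \\
  &= O(MP)
     && \text{since } N + M \le 2 \max(M, N) = 2P .
\end{align*}
```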
Cluster-M Non-Uniform Mapping Algorithm
sort all Spec clusters at top level in descending order of $\sigma_S$, $\delta_S$, $\Pi_S$, and $\pi_S$.
sort all Rep clusters at top level in descending order of $\sigma_R$, $\delta_R$, $\Pi_R$, and $\pi_R$.
calculate $f$.
calculate the required size of the Rep cluster matching $\kappa_{S_i}$ to be $f \times \sigma_{S_i}$
for each Spec cluster $\kappa_{S_i}$ at top level sorted list, do
begin if the cluster has only one sub-cluster
         then go to a lower level where there are multiple or no sub-clusters
      if at least a Rep cluster of required size is found
         then begin select the Rep cluster $\kappa_{R_j}$ with minimum $\tau(\kappa_{S_i}, \kappa_{R_j})$
                    match the Spec cluster to the Rep cluster
                    delete the Spec and Rep cluster from Spec and Rep list
              end
end
for each unmatched Spec cluster, do
begin if the size of the first Rep cluster > the required size
         then begin split the Rep cluster into two parts with one part having required size
                    match the Spec cluster to this part
                    insert the other part to proper position of the sorted Rep cluster list
              end
         else begin merge Rep clusters until the sum of sizes $\ge$ the required size
                    if = then match the Spec cluster to the merged Rep cluster
                    else begin split the merged Rep cluster into two parts
                                 with one having required size
                               match the Spec cluster to this part

for each matching pair of Spec cluster and Rep cluster, do
begin if the Rep cluster contains only one processor
         then map all the modules in the Spec cluster to the processor
      else if Inequality (4.3) is satisfied
         then begin select the sub-cluster of the Spec cluster with the largest size
                    embed the nodes of other sub-clusters
                      to the connected nodes of the selected sub-cluster
                    embed these sub-clusters onto the selected one
                    calculate the parameters for the new cluster
                    insert it into the sorted Spec cluster list
              end
         else begin delete the Spec and Rep cluster from Spec and Rep list
                    go to the sub-clusters of the Spec and Rep cluster
                    call the same mapping algorithm for these clusters
              end
end












Figure 4.7 A mapping example. [Snapshots of the mapping process; clusters are labeled
with their parameter quadruples, such as (1, 4, 6, 6) for a Spec cluster and
(2, 2, 2, 2), (1, 1, 0, 0) and (3, 5/3, 1, 1) for Rep clusters. In the final snapshot,
$t_3$, $t_4$, $t_5$ and $t_6$ are mapped onto $P_2$.]
4.3.2 Non-Uniform Mapping Examples
In Section 3.2, a Spec graph and a Rep graph were constructed from the original
non-uniform task graph and system graph, as shown in Figures 3.16 and 3.18. Figure 4.7
shows snapshots of the mapping process from the obtained Spec graph to the Rep
graph following the mapping algorithm described above. Figure 4.8 shows the final
schedule obtained from the above mapping by following the data and operational
precedence of the task graph. As shown in the Gantt chart, $T_m = 10$.
Figure 4.8 Gantt chart of the obtained schedule. [Time axis from 0 to 10;
$P_1$: $t_1$, $t_2$, $t_7$; $P_2$: $t_3$, $t_4$, $t_5$, $t_6$; $P_3$: no task modules.]
To show that the same task graph can be mapped onto various system graphs,
three different system graphs are chosen and shown in Figure 4.9. Figure 4.9(a)
is the same task graph as shown in Figure 3.16. Figure 4.9(b) shows a uniform
fully connected system graph and its clustering. The computation speed of each
processor and the communication bandwidth of each communication link are both
equal to 2. The result of Cluster-M mapping onto this graph is shown in Figure
4.9(c). In Figure 4.9(d), the system is fully connected with unit computation speed
at each processor but higher communication bandwidths at the edges. In this case,
the Cluster-M mapping algorithm distributes the task modules to all three
processors, as shown in Figure 4.9(e), to utilize the relatively high communication
bandwidth available. On the other hand, if the system is fully connected with unit
communication bandwidth but higher computation speeds at the processors, as shown
in Figure 4.9(f), the Cluster-M mapping algorithm maps all the task modules onto
the processor with the highest speed to avoid the relatively expensive
communication cost. This is shown in Figure 4.9(g).
Finally, an example of mapping a real application task onto a heterogeneous
system is given. The Gaussian elimination algorithm used in LINPACK [23, 24] is
chosen. The FORTRAN code is given in Figure 4.10. Suppose it takes 1 unit of time
to do an addition or subtraction, and 2 units of time to do a multiplication or
division of two real numbers. It is also assumed that the communication amount of
sending or receiving each real number is 1. A task graph for computing the Gaussian
elimination of a 5 x 5 matrix is shown in Figure 4.11(a). In each task module $T_j$,
column $j$ is modified by using column $k$. Suppose that the system running this task
(a) Task graph
(b) A uniform system graph
(c) Mapping result on (b)
(d) A non-uniform system graph
(e) Mapping result on (d)
(f) A different non-uniform system graph
(g) Mapping result on (f)
Figure 4.9





C FORM KJI-SAXPY 
C
REAL A(LDA,N)












Figure 4.10 Gaussian elimination algorithm.
contains only two workstations $p_1$ and $p_2$. Also, $p_1$ and $p_2$ have speeds of 2 and
1.6, respectively, and are connected by a link of bandwidth 1. The mapping result
obtained with the Cluster-M technique is illustrated in Figure 4.11(b).
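The column-oriented elimination and the per-task arithmetic cost described above can be sketched as follows. This is a minimal illustration, not the LINPACK routine: it assumes no pivoting, the function name `gauss_kji` and the `cost` dictionary are hypothetical, and only the column updates are costed (the divisions that form the multipliers are left out of the cost).

```python
def gauss_kji(a):
    """In-place Gaussian elimination in KJI (column SAXPY) order, no pivoting.

    `a` is a square matrix given as a list of row lists. On return it holds
    the multipliers below the diagonal and the upper triangle on and above it.
    The returned dict maps (k, j) to the arithmetic cost of updating column j
    with column k, using 2 time units per multiplication and 1 per subtraction.
    """
    n = len(a)
    cost = {}
    for k in range(n - 1):
        # Form the multipliers of column k (one division per row).
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
        # Each remaining column j is modified by using column k.
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
            # One multiplication and one subtraction per updated row.
            cost[(k, j)] = (n - 1 - k) * (2 + 1)
    return cost


matrix = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
update_cost = gauss_kji(matrix)
```

Under these assumptions, the cost of an update shrinks as $k$ grows, because fewer rows remain below the pivot.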
4.4 Non-Uniform Mapping Comparison Results
Presented in this section is a set of experimental results obtained by comparing
the Cluster-M mapping algorithm with other leading techniques. The examples
selected here were not designed by us; rather, they are those presented and studied
by the authors of the papers reporting the leading techniques. Again, the following
three criteria are used for evaluating the performance of the algorithms examined:
(1) the total time complexity of executing the mapping algorithm, $T_c$; (2) the total
execution time of the generated mapping, $T_m$; and (3) the number of processors
used, $N_m$.






(b) Mapping result
Figure 4.11 The mapping example of a 5 x 5 matrix Gaussian elimination.
Since there is no existing mapping technique which maps a machine-independent
arbitrary non-uniform task onto an arbitrary non-uniform system, it is not easy to
choose candidates for the comparison study. Therefore, the comparison study is
focused on the leading mapping techniques designed for arbitrary non-uniform tasks,
but for specialized systems only. The mapping techniques in this category include
McCreary and Gill’s Clan [61], El-Rewini and Lewis’s MH [26], Sarkar’s
Edge-Zeroing clustering [70], Wu and Gajski’s MCP [82], and Yang and Gerasoulis’
DSC [86]. These algorithms have proven to be very effective and efficient in mapping
arbitrary and non-uniform directed tasks. Similar to the Cluster-M mapping
algorithm, these algorithms also cluster the task graphs before the mapping. Except
for MH, which is a list scheduling algorithm, they all assume that the target systems
are fully connected with an unbounded number of uniform processors and
communication links. If the number of processors is bounded and smaller than the
number of obtained clusters of task modules, some clusters will be merged until the
number of clusters is no greater than the number of processors. For a fully connected
system, it does not matter to which processor a cluster is mapped. If the system
graph is arbitrary but uniform, some allocation algorithms such as Bokhari’s
pairwise exchange mapping [8] can be used for one-to-one mapping of clusters of task
modules onto processors [85]. The following comparison results show that Cluster-M
produces better or similar mapping results with less time complexity compared to
the other mapping techniques studied here.
4.4.1 Comparison with McCreary and Gill’s Clan Algorithm
McCreary and Gill’s Clan algorithm finds a suitably sized grain (cluster) of task
modules to be assigned to the same processor before scheduling the tasks [61]. A
set of nodes $X$ of the directed task graph $G_t$ is a clan iff for all $t_x, t_y \in X$ and
all $t_z \in G_t - X$, (1) $t_z$ is a parent node of $t_x$ iff $t_z$ is a parent node of
$t_y$; and (2) $t_z$ is a child node of $t_x$ iff $t_z$ is a child node of $t_y$. Informally, a
clan is a subset of nodes where every element outside the set is related in the same
way to each member in the set. An $O(M^3)$ parsing algorithm has been proposed that
decomposes a task graph into clans. In McCreary and Gill’s algorithm, it is also
assumed that the underlying system is fully connected and all the processors and
communication links are uniform ($S_i = 1$, $R_{ij} = 1$, for all $i, j$). Using McCreary
and Gill’s algorithm, the following task modules of the task graph shown in Figure
4.12(a) are clustered together and are assigned to the processors of a fully connected
four processor system:
P1: 1, 2, 9
P2: 3, 4, 10
P3: 5, 6, 11
P4: 7, 8, 12
As task module 13 receives data from 9 and 10, it is assigned to P1. Similarly,
14 is assigned to P2 and 15 is assigned to P1. The schedule resulting from
this assignment appears in Figure 4.12(b). Even though the Cluster-M clustering and
mapping algorithms are different from and more generic than Clan, similar results
have been obtained, as shown in Figure 4.12(c).
(a) Task graph
(b) Clan mapping, $T_c = O(M^3)$, $T_m = 59$, $N_m = 4$
(c) Cluster-M mapping, $T_c = O(MP)$, $T_m = 59$, $N_m = 4$
Figure 4.12 Comparison example with Clan.
4.4.2 Comparison with El-Rewini and Lewis’s Mapping Heuristic
Next, the Cluster-M mapping algorithm is compared with El-Rewini and Lewis’s
mapping heuristic (MH) algorithm [26]. The time complexity of MH is $O(M^2 N^3)$,
while Cluster-M has an $O(MN)$ time complexity. Given a task graph as shown in
Figure 4.13(a) and a uniform 8-processor hypercube ($D_{ij} = 1$ if edge $(t_i, t_j)$ exists,
$1 \le i, j \le 18$), the schedule obtained from MH is illustrated by a Gantt chart in
Figure 4.13(b) [26]. Similarly, the Gantt chart of the schedule obtained by Cluster-M
mapping is shown in Figure 4.13(c). An optimal schedule is also shown in Figure
4.13(d). Both MH and Cluster-M mappings have produced close to optimal $T_m$ for
this example, yet Cluster-M is faster by a factor of $O(MN^2)$.
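As a quick check of the stated factor, the two complexity bounds quoted above can be divided directly. Treating both bounds as tight is an assumption made only for this illustration:

```latex
\frac{O(M^2 N^3)}{O(MN)} = O\!\left(\frac{M^2 N^3}{M N}\right) = O(M N^2)
```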
4.4.3 Comparison with Wu-Gajski’s MCP Algorithm
The Modified Critical Path (MCP) algorithm [82] is based on the critical path
concept introduced by Hu [42]. A critical path in a DAG is a path of greatest weight
from a source node to a sink node, including the weights of all the nodes and edges
along this path. The critical paths can be shortened by removing communication
weights (zeroing edges) and embedding the nodes on the path. MCP assumes that the
weights of task nodes and edges are the actual computation and communication
times. Therefore, given the same task graph as shown in Figure 3.16 and the system
graph as shown in Figure 4.9(b), a transformed task graph incorporating the
information about the system graph has to be generated first, as shown in Figure
4.14(b). The mapping results by Cluster-M and MCP are shown in Figure 4.14(c) and
(d), respectively. Cluster-M has produced a mapping with $T_m = 10$ while MCP has
$T_m = 10.5$. The time complexity of MCP is $O(M^2 \log M)$.
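The critical path definition used above can be illustrated with a short sketch. It is not the MCP algorithm itself; it only computes the heaviest source-to-sink path of a weighted DAG and shows how zeroing one edge shortens it. The function name `critical_path` and the four-node example graph are hypothetical.

```python
def critical_path(node_w, edge_w):
    """Return (weight, path) of the heaviest path in a DAG.

    node_w maps each node to its computation weight; edge_w maps each
    directed edge (u, v) to its communication weight. The path weight
    includes the weights of all nodes and edges along the path.
    """
    succ = {u: [] for u in node_w}
    indeg = {u: 0 for u in node_w}
    for u, v in edge_w:
        succ[u].append(v)
        indeg[v] += 1
    # Topological order (Kahn's algorithm).
    order, ready = [], [u for u in node_w if indeg[u] == 0]
    while ready:
        u = ready.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    # best[u] is the weight of the heaviest path ending at u.
    best = dict(node_w)
    pred = {u: None for u in node_w}
    for u in order:
        for v in succ[u]:
            cand = best[u] + edge_w[(u, v)] + node_w[v]
            if cand > best[v]:
                best[v], pred[v] = cand, u
    sink = max(best, key=best.get)
    path, u = [], sink
    while u is not None:
        path.append(u)
        u = pred[u]
    return best[sink], path[::-1]


nodes = {"a": 2, "b": 3, "c": 1, "d": 2}
edges = {("a", "b"): 1, ("a", "c"): 4, ("b", "d"): 1, ("c", "d"): 1}
before = critical_path(nodes, edges)
# Zeroing edge (a, c) models placing a and c on the same processor.
after = critical_path(nodes, {**edges, ("a", "c"): 0})
```

In this toy graph the critical path moves from a-c-d to a-b-d once the heavy edge is zeroed, which is the effect that edge-zeroing heuristics exploit.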
4.4.4 Comparison with Sarkar’s Edge-Zeroing Algorithm
The basic idea of Sarkar’s Edge-Zeroing algorithm is to repeatedly zero the highest
weighted edge if doing so does not increase the estimated $T_m$, until all the edges
have been examined. Its time complexity is $O(|E_t|(M + |E_t|))$, where $|E_t|$ is the
number of edges in the task graph. Figure 4.14(e) shows the mapping result obtained
by the Edge-Zeroing algorithm on the same example used for MCP in Figure 4.14.
This result matches that of Cluster-M.
(a) Task graph
(b) MH mapping, $T_c = O(M^2 N^3)$, $T_m = 26$, $N_m = 7$
(c) Cluster-M mapping, $T_c = O(MP)$, $T_m = 26$, $N_m = 8$
(d) An optimal mapping, $T_c = O(2^{MN})$, $T_m = 25$, $N_m = 8$
Figure 4.13
(a) The original task graph
(b) The transformed task graph
(c) Cluster-M, $T_c = O(MP)$, $T_m = 10$, $N_m = 3$
(d) MCP, $T_c = O(M^2 \log M)$, $T_m = 10.5$, $N_m = 3$
(e) Sarkar, $T_c = O(|E_t|(M + |E_t|))$, $T_m = 10$, $N_m = 3$
(f) DSC, $T_c = O((|E_t| + M) \log M)$, $T_m = 9$, $N_m = 2$
Figure 4.14 Comparison example with MCP, Sarkar and DSC.
4.4.5 Comparison with Yang and Gerasoulis’ DSC Algorithm
Yang and Gerasoulis’ Dominant Sequence Clustering (DSC) algorithm [86] is also
based on critical path and edge zeroing, and it incorporates several other heuristics
for better clustering. DSC can find optimal schedules for some special DAGs such
as fork and join. However, the task graphs considered in DSC are not
machine-independent, and similar to the above three techniques, it cannot map to
non-uniform systems such as those shown in Figure 4.9(d) and (f). The time
complexity of DSC is $O((|E_t| + M) \log M)$, where $|E_t|$ is the number of edges in the
task graph.
Figure 4.14(f) shows the mapping result obtained by DSC for the same example
which was studied in comparison with MCP and Sarkar’s algorithms. Among these
results for this example, DSC’s is the best and is in fact optimal, yet the result
by Cluster-M is very close to optimal. In the following, we show several more
comparison examples with DSC. These examples are taken from [84, 86]. Figures
4.15 and 4.16 show the mapping of two task graphs onto an unbounded number of
identical processors fully connected by identical communication links. The
computation speed and communication bandwidth of the system in Figure 4.15 are
both 1, while the computation speed and communication bandwidth of the system in
Figure 4.16 are both 2.
Finally, the same example is taken that was used for comparison with MH in
Figure 4.13. As shown in Figure 4.13, the mapping by MH has $T_m = 26$ and $N_m = 7$,
and an optimal mapping uses 8 processors and has $T_m = 25$. The mapping results
by Cluster-M and DSC are illustrated in Figure 4.17(b) and (c). If a 4-processor
hypercube is used, the mappings of the same task graph by Cluster-M and DSC are
shown in Figure 4.17(d) and (e).
(a) Task graph
(b) Cluster-M, $T_m = 5$, $N_m = 3$
(c) DSC, $T_m = 5$, $N_m = 3$
Figure 4.15
(a) Task graph
(b) Transformed task graph
(c) Cluster-M mapping result, $T_m = 9$, $N_m$ =
(d) DSC-I’s mapping result, $T_m = 9.5$, $N_m = 2$
Figure 4.16 Comparison example 3 with DSC.
(a) Task graph
(b) Cluster-M mapping on 8 processors, $T_m = 26$, $N_m = 8$
(c) DSC’s mapping on 8 processors, $T_m = 27$, $N_m = 7$
(d) Cluster-M mapping on 4 processors, $T_m = 27$, $N_m = 4$
(e) DSC’s mapping on 4 processors, $T_m = 27$, $N_m = 4$
Figure 4.17 Comparison example 4 with DSC.
CHAPTER 5
HIERARCHICAL CLUSTER-M MAPPING FOR HETEROGENEOUS
COMPUTING
Heterogeneous Computing (HC) has been proposed as a novel approach to
solving computationally demanding application tasks by exploiting the
heterogeneity of a variety of high-performance computers [38, 51]. There are two
types of HC systems: a mixed-mode machine and a mixed-machine system. A
mixed-mode machine is a single machine which can operate in different modes of
parallelism [34], while a mixed-machine system is a suite of diverse machines which
are interconnected by a high-speed network. These diverse machines are
high-performance computers with different parallelism modes (such as vector
processors, SIMD, MIMD, and mixed-mode machines). Khokhar et al. [51] have
addressed the challenges and issues posed by HC systems. The trend of HC has also
brought researchers’ attention to the problem of mapping tasks onto a suite of
heterogeneous computers [80, 16, 54, 55, 67]. However, many existing mapping
techniques which were designed for uniform homogeneous systems are not suitable
for mappings on heterogeneous systems. This chapter studies the problem of mapping
specialized application tasks onto heterogeneous systems by applying the mapping
algorithms presented in the last chapter.
In [37], Freund proposed the Optimal Selection Theory (OST), which is a proof
of existence of an optimal configuration of heterogeneous machines for executing an
application task such that the total execution time is minimized. OST was then
augmented by Wang et al. [80] into the Augmented Optimal Selection Theory
(AOST), which incorporates non-optimal machine choices and non-uniform
decompositions of code segments. This chapter first presents the Heterogeneous
Optimal Selection Theory (HOST), which is an extension to OST and AOST. HOST
includes various additional assumptions to be more suitable for an HC environment.
Based on HOST, a modified Cluster-M mapping algorithm is presented for mapping
bound-degree heterogeneous task graphs onto bound-degree heterogeneous system
graphs. At each step of mapping, instead of using a greedy algorithm for matching
the Spec clusters to the Rep clusters as in the last chapter, the optimal matching is
found by using Integer Linear Programming (ILP). The comparison results show that
the modified mapping algorithm produces better mappings than the original
algorithm as well as other heterogeneous mapping techniques.
5.1 Heterogeneous Optimal Selection Theory (HOST)
In Freund’s Optimal Selection Theory (OST) [37], it is assumed that the number of 
machines available is unlimited and an application task comprises several uniform 
and non-overlapping code segments. Each segment has homogeneous parallelism 
embedded in its computations. Also, code segments are considered to be executed 
serially. A code segment is further decomposed into different code blocks. All 
code blocks within a code segment have the same type of parallelism and can be 
executed concurrently. The goal of OST is to assign the code blocks within each 
code segment to the available matching machine types such that the code segment 
can be optimally executed. Augmented Optimal Selection Theory (AOST) [80] 
extended OST to incorporate the performance of code segments on non-optimal 
machine choices, assuming that the number of available machines for each type is 
limited. Based on this assumption, a code segment which is most suitable for one 
type of machine may have to be assigned to another type.
HOST, presented in this section, is an extension to AOST in two ways: it
incorporates the effects of various fine grain mapping techniques available on
individual machines, and the task is assumed to have heterogeneous embedded
parallelism. The input format to HOST, as shown in Figure 5.1, allows concurrent
execution of mutually independent code segments. An application task is decomposed
into several subtasks. Subtasks are executed serially. Each subtask may contain a
collection of code segments which can be executed in parallel. A code segment
consists of homogeneous parallel instructions. Each code segment is further divided
into several code blocks which can be executed concurrently. These code blocks are
to be assigned to machines of the same type. A machine type is identified according
to the underlying architecture, such as SIMD, MIMD, vector, scalar, etc. Each
machine type may have more than one model; for example, a hypercube and a mesh
may be two models of the SIMD machine type. In HOST, heterogeneous code blocks of
different code segments can be executed concurrently on different types of machines,
exploiting




Figure 5.1 Input format to HOST.
To express the formulation of HOST, some parameters need to be defined. Let
$S$ be the number of code segments of a given subtask, and $M$ be the number of
different machine types to be considered. Let $\eta[t]$ be the number of machine models
of type $t$, $\alpha[t]$ be the number of mappings available on machine type $t$, and
$\beta[t,l]$ be the number of machines of model $l$ of type $t$ available. Assume $\nu[t,j]$ is
the maximum number of code blocks code segment $j$ can be decomposed into. Define
$\gamma[t,j]$ to be the number of machines of type $t$ that are actually used to execute
code segment $j$. Therefore,

$$\gamma[t,j] = \min\Bigl(\sum_{l=1}^{\eta[t]} \beta[t,l],\ \nu[t,j]\Bigr).$$

A parameter $m[t,k]$ is defined to specify the effect of the mapping technique
available for a code block $k$ on machine type $t$. Assume that for a particular mapping
$m$ on machine type $t$, the best matched code segment can obtain the optimal speedup
$\theta[t,m]$ in comparison to a baseline system. A real number $\pi[t,j]$ indicates how
well a code segment $j$ can be matched with machine type $t$. $\Lambda[t,k]$ is a utilization
factor when running code block $k$ on a machine of type $t$. Thus $0 \le \pi[t,j] \le 1$ and
$0 \le \Lambda[t,k] \le 1$. Let $p[j]$ be the percentage of time spent executing code segment
$j$ within the overall execution of a given subtask on the baseline machine, so that
$\sum_{j=1}^{S} p[j] = 1$. Similar to the definition of $p[j]$, let $p[j,k]$ be the percentage of
time spent executing code block $k$ within the overall execution of code segment $j$ on
the baseline machine, so that $\sum_{k} p[j,k] = 1$.
Suppose code segment $j$ is assigned to machine type $t$. For each code block $k$
within code segment $j$, there is a mapping $m[t,k]$. Let $\mu[t,j]$ be the mapping vector
for code segment $j$ on machine type $t$:

$$\mu[t,j] = (m[t,1], m[t,2], \ldots, m[t,\gamma[t,j]]), \quad 1 \le m[t,k] \le \alpha[t].$$

With this mapping vector $\mu$ on machine type $t$, the relative execution time of
segment $j$ will be:

$$\delta[t,j,\mu] = \max_{1 \le k \le \gamma[t,j]} \frac{p[j] \times p[j,k]}{\theta[t,m[t,k]] \times \pi[t,j] \times \Lambda[t,k]}$$

Therefore, different mappings $\mu$ available on machine type $t$ result in different
execution times of segment $j$. Let $\lambda[t,j]$ be the minimum execution time of segment
$j$ among all the possible mappings on type $t$:

$$\lambda[t,j] = \min_{\mu[t,j]} \delta[t,j,\mu[t,j]]$$

Let the machine type selection vector $\tau$ indicate the selection of machine types
for code segments 1 to $S$, such that $\tau = (t[1], t[2], \ldots, t[S])$. Let $c[t[j]]$ denote the
cost of the machine selected to execute code segment $j$, and $C$ be the total cost
constraint. Define $\chi[\tau]$ to be the execution time of the given subtask with
heterogeneous machine type selection $\tau$ on all the code segments, such that
$\chi[\tau] = \max_{1 \le j \le S} \lambda[t[j], j]$. Then HOST is formulated as follows:
For any subtask, there exists a $\tau$ with

$$\min_{\tau} \chi[\tau] \quad \text{subject to} \quad \sum_{j=1}^{S} \bigl(\gamma[t[j],j] \times c[t[j]]\bigr) \le C$$

For easy reference, all the notations used in the HOST formulation are listed in
Table 5.1.

Table 5.1 Notations used in HOST formulation
$S$                   the number of code segments of a given subtask
$M$                   the number of different machine types to be considered
$\eta[t]$             the number of machine models of type $t$
$\alpha[t]$           the number of mappings available on machine type $t$
$\beta[t,l]$          the number of available machines of model $l$ of type $t$
$\nu[t,j]$            the maximum number of code blocks code segment $j$ can be decomposed into
$\gamma[t,j]$         the number of machines of type $t$ actually used to execute code segment $j$
$m[t,k]$              mapping technique used for a code block $k$ on machine type $t$
$\theta[t,m]$         the optimal speedup for a particular mapping $m$ on machine type $t$
$\pi[t,j]$            how well a code segment $j$ can be matched with machine type $t$
$\Lambda[t,k]$        utilization factor when running code block $k$ on a machine of type $t$
$p[j]$                the percentage of execution time of code segment $j$ within a given subtask
$p[j,k]$              the percentage of execution time of block $k$ within code segment $j$
$\mu[t,j]$            mapping vector for code segment $j$ on machine type $t$
$\delta[t,j,\mu]$     execution time of segment $j$ with mapping $\mu$ on machine type $t$
$\lambda[t,j]$        minimum execution time of segment $j$ among all possible mappings on type $t$
$\tau$                machine type selection vector
$\chi[\tau]$          execution time of the given subtask with machine type selection $\tau$
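The selection problem can be made concrete with a small exhaustive search. This is only a sketch under stated assumptions, not the procedure used in the dissertation: the data layout, the names `segment_time` and `host_select`, and all numbers are hypothetical; every code block is assumed to take an equal share of its segment; and the cost of a segment is assumed to be the number of machines used times the per-machine cost of the selected type.

```python
from itertools import product


def segment_time(t, j, d):
    """Minimum relative execution time of segment j on machine type t.

    Tries every mapping vector (one mapping choice per code block) and
    keeps the smallest of the per-vector maxima. Also returns the number
    of machines used.
    """
    used = min(sum(d["beta"][t]), d["nu"][t][j])   # machines actually used
    share = 1.0 / used                             # assumed equal block shares
    best = float("inf")
    for mu in product(range(len(d["theta"][t])), repeat=used):
        worst_block = max(
            d["p"][j] * share / (d["theta"][t][m] * d["pi"][t][j] * d["util"][t][k])
            for k, m in enumerate(mu)
        )
        best = min(best, worst_block)
    return best, used


def host_select(d, budget):
    """Enumerate machine type selections; return (subtask time, selection)."""
    n_types, n_segs = len(d["beta"]), len(d["p"])
    best = (float("inf"), None)
    for tau in product(range(n_types), repeat=n_segs):
        times, cost = [], 0.0
        for j in range(n_segs):
            lam, used = segment_time(tau[j], j, d)
            times.append(lam)
            cost += used * d["cost"][tau[j]]
        if cost <= budget and max(times) < best[0]:
            best = (max(times), tau)
    return best


# Hypothetical data: two machine types, two code segments.
data = {
    "beta": [[2], [1, 1]],            # machines per model, per type
    "nu": [[2, 2], [2, 2]],           # max code blocks per (type, segment)
    "theta": [[4.0, 2.0], [3.0]],     # speedup per (type, mapping)
    "pi": [[1.0, 0.5], [0.5, 1.0]],   # match quality per (type, segment)
    "util": [[1.0, 1.0], [1.0, 1.0]], # utilization per (type, block)
    "p": [0.5, 0.5],                  # time share of each segment
    "cost": [3.0, 2.0],               # cost of one machine of each type
}
choice = host_select(data, budget=10.0)
```

With this toy data, tightening the budget forces a segment onto a machine type that matches it less well, which is the situation AOST and HOST are meant to capture.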
5.2 Modeling the Input to HOST
HOST, as described in the previous section, is an existence proof for an optimal
selection of processors for a given subtask in HC. The input formulation in HOST
assumes that a parallel task $T$ is divided into subtasks $t_i$, $1 \le i \le N$. Each subtask
$t_i$ is further divided into code segments $t_{ij}$, $1 \le j \le S$, which can be executed
concurrently. Each code segment within a subtask can belong to a different type of
parallelism (i.e. SIMD, MIMD, vector, etc.), and thus should ideally be mapped onto
a machine with a matching type of parallelism. Each code segment may further be
decomposed into several concurrent code blocks with the same type of parallelism.
These code blocks $t_{ijk}$, $1 \le k \le B$, are suited for parallel execution on machines
having the same type of parallelism. This decomposition of the task into subtasks,
code segments, and code blocks is shown in Figure 5.1.
A good model of this input format is needed to facilitate the mapping of tasks 
onto a heterogeneous architecture. In addition to modeling the input format, the 
architecture being considered for the execution of the task should also be modeled. 
Several requirements for this model are identified as follows:
• The modeling of the input format should handle the decomposition of the task 
into subtasks, code segments, and code blocks, while preserving the information 
regarding the type of parallelism present in each portion of the task. This is 
essential to match the type of each code block with a suitable machine type in 
the system.
• The model should handle parallelism at fine grain and coarse grain levels.
• Modeling of the input code should emphasize the communication requirements
of the various code segments.
• The modeling of the input code should be independent of the underlying
architecture.
• The modeling of the system should provide the mode of computation of each
machine in the system.
• The interconnection topology of individual architectures should be
systematically represented in the model at both system and machine levels.
Cluster-M meets most of the above requirements. However, Cluster-M has no 
provision to model the heterogeneity present in the task and the system. In the 
next section, an extension to Cluster-M, called Hierarchical Cluster-M (HCM), is 
presented to incorporate the heterogeneous types both in tasks and systems. Then, 
a HOST based HCM mapping algorithm is presented in Section 5.4 which finds sub- 
optimal selection and mapping of each subtask. The HCM mapping algorithm is 
compared with other techniques in Section 5.5.
5.3 Hierarchical Cluster-M (HCM)
Hierarchical Cluster-M (HCM) is an extension to Cluster-M to exploit parallelism at 
the subtask, code segment, code block, and instruction levels. This is accomplished 
by modifying both the Cluster-M Specification and Representation. The extended 
Cluster-M Specification takes into account the type of parallelism present in each 
portion of the task. The modification to the Cluster-M Representation takes into 
account the presence of several interconnected machines in the system, providing a 
spectrum of computational modes.
5.3.1 HCM Specification
The HOST formulation can be applied to a non-uniform task graph $G_t = (V_t, E_t)$ as
defined in Chapter 3. In a non-uniform task graph, each task module $t_i$ is a code
block. Task modules with the same type of computation requirements compose a code
segment. A subtask consists of several sequential or concurrent code segments of
different types. Thus, the Hierarchical Cluster-M Specification can be constructed in
the same way as the original Cluster-M Specification. In addition, in the HCM
Specification, each Spec cluster is labeled by a computation type. Only clusters of
the same type can be embedded and merged. For example, given a heterogeneous
subtask as shown in Figure 5.2, the HCM Spec graph can be obtained by clustering the
MIMD and vector type task modules (code blocks) respectively. Therefore, the obtained
HCM Spec graph will consist of two subgraphs: one contains MIMD type clusters
and the other contains vector type clusters. The MIMD type Spec subgraph is
constructed as shown in Figure 5.3.




Figure 5.2 A heterogeneous subtask consisting of MIMD and vector code segments.
Figure 5.3 Construction of the Spec subgraph of the MIMD code segment.
5.3.2 HCM Representation
The Hierarchical Cluster-M Representation of a system consists of two layers of
clustering: a system layer and a machine layer. System layer clustering consists of
several levels of nested clusters. At the lowest level of clustering, each machine in
the system is assigned a cluster by itself. Completely connected clusters are merged
to form the next level of clustering. This process is continued until no more merging
is possible. Machine layer clustering is obtained in the same way as described in
Chapter 3. For a heterogeneous suite of interconnected computers, the HCM
Representation is obtained as follows:
1. For each computer in the system, apply the Clustering-non-uniform-undirected- 
graphs algorithm as in Chapter 3 to obtain Cluster-M Representation of all 
the processors in the computer. The resulting clusters are called machine level 
clusters.
2. Each resulting cluster is labeled according to the type of parallelism present in 
the cluster (i.e. SIMD, MIMD, vector, etc.).
3. Treat each computer as a processor and apply the Clustering-non-uniform- 
undirected-graphs algorithm. At the first level of clustering, each computer 
in the system is in a cluster by itself. Each clustering level is constructed by 
merging clusters from the lower level that are completely connected. This is 
continued until no more clustering is possible. The resulting clusters are called 
system level clusters.
A heterogeneous parallel computing system is shown in Figure 5.4, which
consists of one MIMD machine and one vector machine. The MIMD machine has
three processors, P1, P2 and P3, and the vector machine has two processors, P4 and
P5.
Figure 5.4 The system graph and its clustering of a heterogeneous suite.
5.4 HCM Bound-Degree Mapping Algorithm
Given the HCM Spec graph and HCM Rep graph, the mapping can be done for 
each type Spec subgraph and Rep subgraph respectively, using the original Cluster- 
M mapping algorithm as presented in Chapter 4. However, the original Cluster-M 
mapping, whether for uniform or non-uniform graphs, does not find optimal matching 
of Spec clusters with Rep clusters at each level. In this section, we present an 
HCM mapping algorithm for bound-degree task and system graphs. This mapping 
algorithm is a modified version of the original Cluster-M mapping algorithm. Instead 
of greedily matching Spec clusters to Rep clusters in the original mapping, the 
new algorithm finds an optimal matching using integer linear programming yet still 
maintains a polynomial time complexity.
Many parallel systems, such as the ring, binary tree, and mesh, have constant
degree. Many applications, such as image processing and most divide-and-conquer
applications, can also be expressed as bound-degree graphs. If the degree of a graph
is bound by a constant $k$, the number of sub-clusters within each Spec or Rep cluster
at any clustering level will be at most $k$. Therefore, the function 4.2 in Chapter 4
can be used to find the optimal matching of each Spec cluster without increasing the
time complexity of mapping. In the following, a modified mapping algorithm is
presented which uses an Integer Linear Programming (ILP) approach to find the
optimal matching between Spec clusters and Rep clusters at each mapping step.
Mathematica 2.2 for SPARC, a product of Wolfram Research, Inc., is used to solve
this ILP problem.
Assume that the degrees of the given task graph and system graph are bound
by two constants $k_S$ and $k_R$, respectively. To formulate each mapping step as
an ILP model, a binary variable $\mu(\kappa_{S_i}, \kappa_{R_j})$ is defined to indicate whether a
Spec cluster $\kappa_{S_i}$ is mapped onto a Rep cluster $\kappa_{R_j}$: $\mu(\kappa_{S_i}, \kappa_{R_j}) = 1$ if
$\kappa_{S_i}$ is mapped to $\kappa_{R_j}$; otherwise, $\mu(\kappa_{S_i}, \kappa_{R_j}) = 0$. This transforms the
mapping problem into an ILP model in which each Spec cluster can be mapped to only
one Rep cluster, which can be represented by

$$\sum_{j} \mu(\kappa_{S_i}, \kappa_{R_j}) = 1, \quad 1 \le i \le k_S.$$

The estimated execution time on Rep cluster $\kappa_{R_j}$ is denoted by $T(\kappa_{R_j})$, and

$$T(\kappa_{R_j}) = \sum_{i} \mu(\kappa_{S_i}, \kappa_{R_j})\, t(\kappa_{S_i}, \kappa_{R_j}).$$

Since the overall execution time is denoted by $T_m$, there are constraints that
$T_m \ge T(\kappa_{R_j})$ for all $j$. The objective is to minimize the overall estimated
execution time. Therefore, the objective function of our ILP model can be expressed
as follows.
Minimize $T_m$, while $T_m \ge T(\kappa_{R_j})$ for all $j$
Once the minimal $T_m$ is found, the matching of Spec clusters and Rep clusters can
be determined by using the binary variables $\mu(\kappa_{S_i}, \kappa_{R_j})$.
HCM Bound-Degree Mapping Algorithm
begin
for each machine type
calculate reduction factor f =
for each Spec cluster $S_i$, $1 \le i \le k_S$
begin
for each Rep cluster $R_j$, $1 \le j \le k_R$
end
/* Starting Integer Linear Programming */
/* Set Constraints */
$\sum_{j} \mu(S_i, R_j) = 1$, $1 \le i \le k_S$
$T(R_j) = \sum_{i} \mu(S_i, R_j)\, t(S_i, R_j)$, $1 \le j \le k_R$
$T_m \ge T(R_j)$, $1 \le j \le k_R$
Figure 5.5 HCM bound-degree mapping algorithm.
A detailed description of the HCM bound-degree mapping algorithm is
presented in Figure 5.5. The time complexity of this algorithm can be analyzed as
follows. The numbers of iterations of the outer for loop and the inner for loop are
at most $k_S$ and $k_R$, respectively. Therefore, the total number of iterations of these
for loops is bounded by $O(k_S \times k_R)$. The ILP portion of the algorithm
examines all instances of $(\kappa_{S_i}, \kappa_{R_j})$ pairs for all $i$ and $j$. The running time of this
portion, hence, is $O((k_R)^{k_S})$. The overall time complexity of the mapping
algorithm is therefore $O(k_S \times k_R) + O((k_R)^{k_S}) = O((k_R)^{k_S})$. However, since both $k_S$
and $k_R$ are constants, it is still a polynomial time complexity.
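The min-max formulation can be illustrated with a minimal sketch that replaces the ILP solver by exhaustive enumeration, which makes the $(k_R)^{k_S}$ search space explicit. The cost model `t = work / speed`, the function name `min_max_mapping`, and the input values are assumptions chosen for illustration; they are not taken from the dissertation.

```python
from itertools import product

def min_max_mapping(work, speed):
    """Exhaustively search all k_R ** k_S assignments of Spec clusters to Rep clusters.

    work[i]  -- hypothetical computation amount of Spec cluster i
    speed[j] -- hypothetical processing speed of Rep cluster j
    Returns (T_m, mapping), where mapping[i] is the Rep cluster chosen for Spec cluster i.
    """
    k_s, k_r = len(work), len(speed)
    # t[i][j]: estimated time of Spec cluster i on Rep cluster j (assumed cost model).
    t = [[work[i] / speed[j] for j in range(k_r)] for i in range(k_s)]
    best_tm, best_mapping = float("inf"), None
    # Each tuple sets mu(S_i, R_j) = 1 for exactly one j per i.
    for mapping in product(range(k_r), repeat=k_s):
        load = [0.0] * k_r
        for i, j in enumerate(mapping):
            load[j] += t[i][j]      # T(R_j) = sum_i mu(S_i, R_j) * t(S_i, R_j)
        tm = max(load)              # smallest T_m with T_m >= T(R_j) for all j
        if tm < best_tm:
            best_tm, best_mapping = tm, mapping
    return best_tm, best_mapping

# Four Spec clusters onto three Rep clusters (illustrative numbers only).
tm, mapping = min_max_mapping(work=[12, 14, 10, 20], speed=[2, 1, 1])
print(tm, mapping)
```

A real implementation would hand the same constraints to an ILP solver; the enumeration is only practical because the number of clusters per level is bounded.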
Consider mapping the task graph illustrated in Figure 5.2 to the system graph
of Figure 5.4. The mapping is done for each type of Spec and Rep clusters respectively.
The mapping of the MIMD Spec subgraph onto the MIMD Rep subgraph is
done as below. At the top level, the mapping is trivial since there is only one Spec
cluster $\kappa_{S_0}$(4, 25, 10, 6) and one Rep cluster $\kappa_{R_0}$(3, |, 5, |). At the next level, four
Spec clusters $\kappa_{S_1}$(1, 12, 2, 0), $\kappa_{S_2}$(1, 14, 2, 0), $\kappa_{S_3}$(1, 10, 3, 0), and $\kappa_{S_4}$(1, 20, 5, 0) are
to be mapped to three Rep clusters $\kappa_{R_1}$(1, 2, 0, 0), $\kappa_{R_2}$(1, 1, 0, 0), and $\kappa_{R_3}$(1, 1, 0, 0).
Using the modified mapping algorithm, $\kappa_{S_3}$ and $\kappa_{S_4}$ are mapped onto $\kappa_{R_1}$, and $\kappa_{S_2}$
and $\kappa_{S_1}$ are mapped to $\kappa_{R_2}$ and $\kappa_{R_3}$ respectively. This implies that task modules d, e,
g, h, i are mapped to processor P1, b, f are mapped to P2, and a, c are assigned to
P3.
The mapping of the vector Spec subgraph onto the vector Rep subgraph can
be done in a similar way. The overall mapping result is shown in Figure 5.6.
5.5 Comparison Study
In this section, the HCM bound-degree mapping algorithm is compared with the
original Cluster-M non-uniform mapping algorithm as well as two other techniques
which are capable of mapping tasks onto distributed heterogeneous systems. Lo’s
mapping algorithm in [59] is a heuristic which combines recursive invocation of a max
flow min cut algorithm to find suboptimal assignments of tasks to heterogeneous
processors. In [71], Shen and Tsai considered a cost function and a minimax criterion
for minimization of the cost function, then solved the mapping problem by the
well-known A* algorithm. Since neither algorithm incorporates the heterogeneous
computation and machine types in its mapping, it is only possible to compare the
result of mapping each type of task modules (code blocks) onto the same type of
processors respectively.
Consider again the example discussed in the previous section of mapping the task
graph of Figure 5.2 to the system graph of Figure 5.4. The mapping results of MIMD
type task modules onto MIMD type processors by HCM bound-degree mapping, the
original Cluster-M non-uniform mapping, Lo’s heuristic, and Shen and Tsai’s A*
heuristic are shown in Figure 5.7. Their total execution times are 22.5, 24.5, 24, and
28 respectively. HCM bound-degree mapping produces the best result.
The mapping results of the vector type task modules onto the vector type
processors are shown in Figure 5.8. The values of $T_m$ obtained by the four mapping
algorithms are 25.67, 30.17, 38, and 33.83, respectively. Again the HCM bound-degree
algorithm produces the best mapping, and the original Cluster-M algorithm gives the
second best result.
Figure 5.6 The obtained mapping result.
Figure 5.7 The mapping results of MIMD code blocks onto MIMD machine: (a) HCM bound-degree mapping; (b) Original Cluster-M non-uniform mapping; (c) Lo’s heuristic; (d) Shen and Tsai’s A* searching.





Figure 5.8 The mapping results of Gaussian elimination on the vector machine: (a) HCM bound-degree mapping; (b) Original Cluster-M non-uniform mapping; (c) Lo’s heuristic; (d) Shen and Tsai’s A* searching.
CHAPTER 6
COMBINED USE OF CLUSTER-M WITH HASC
In this chapter, we study how Cluster-M can be used together with a different
programming paradigm, called Heterogeneous Associative Computing (HAsC), to
provide an efficient scheme for heterogeneous programming. Unlike other existing
heterogeneous orchestration tools which are MIMD based, HAsC is for data-parallel
SIMD associative computing. HAsC models a heterogeneous network as a coarse-grained
associative computer. It is designed to optimize the execution of tasks where
the program size is small compared with the amount of data processed. Ease of
programming and execution speed are the primary goals of HAsC. On the other hand,
Cluster-M can be applied to both coarse-grained and fine-grained networks. Cluster-M
provides an environment for porting heterogeneous tasks onto the machines to
maximize the resource utilization and to minimize the execution time. Both Cluster-M
and HAsC can efficiently support heterogeneous networks by preserving a level
of abstraction without containing any architecture details. They are both
machine-independent and scalable for various network and task sizes.
6.1 Heterogeneous Associative Computing (HAsC)
Heterogeneous Associative Computing (HAsC) models a heterogeneous network as 
a coarse-grained associative computer. It assumes that the network is organized 
into a relatively small number of very powerful nodes. Basically, each node is a 
supercomputer (vector, SIMD, MIMD, etc). Thus each node of the network provides 
a unique computational capability. There may be more than one node of a specific 
type in a case where special properties are present. For example, one SIMD node
may be specialized for associative processing and a second SIMD node may contain
a very powerful internal network configuration.
Figure 6.1 Analogy between an associative computer and an associative configuration of a network: (a) An associative computer (Sequential Control; Memory 1 / PE 1 through Memory n / PE n); (b) Associative configuration of a network (HAsC Control; Disk 1 / Computer 1 through Disk n / Computer n).
Figure 6.1 illustrates the logical similarity of an associative machine and a
heterogeneous network. In particular, a disk-computer node on a network can be
compared to an associative memory-PE cell. As in an associative cell, the node’s
computer is dedicated to processing the data on the node’s disk(s). The disk-to-machine
data transfer rate is much higher than the node-to-node transfer
rate, just as memory-to-PE transfers are much faster than PE-to-PE transfers. Note
that the associative computer and network diagrams are quite different from shared
memory MIMD models. Shared memory configurations emphasize the concept that
all data is equally accessible from all processors. This is not the case in a
heterogeneous network.
HAsC is “layered” so that any node in the HAsC network may be another
network. Thus a HAsC node may be a HAsC cell containing more than one computer;
most nodes may contain a general purpose computer in addition to a supercomputer
to function as the node’s port to the rest of the HAsC network. Figure 6.2 shows
a typical HAsC network organization. Each HAsC node has access to a number of
instruction stream channels. Each channel broadcasts a different sequence of code.
The HAsC node selects the appropriate channel based on its local data and previous
state. The selected channel is saved in a channel register. A port, or transponder
node, will accept a high level command and “translate it” into the command(s)
appropriate for the subnetwork.
Figure 6.2 A layered heterogeneous network.
Some of the properties of the associative computing paradigm well suited for 
heterogeneous computing include: (1) efficient programming and execution with large 
data sets and small programs; (2) optimal data placement; (3) software scalability 
(see Section 6.3); (4) cellular memory allocation; and (5) search-process-retrieve 
synchronism [66].
In HAsC, instructions are broadcast to all of the cells listening to a channel, but
each individual cell must determine whether to execute the instruction. This determination
is performed as follows: Upon receipt of an instruction, a node “unifies” it with
its local instruction set and data files. Several languages such as Prolog and STRAND
[36] incorporate this process. HAsC is different in that it uses unification only at the
top level. Thus there is only one unification operation per data file, as opposed to
one per record or field. This difference is critical in a heterogeneous network where
communication of individual data items would be prohibitively expensive.
If there is a match, the appropriate instruction is initiated. The instruction may 
in turn issue more instructions. Thus, control is distributed throughout HAsC. That 
is, a program starts by issuing a command from a control node. If a receiving node 
receives a command that is in effect a subroutine call, it may become a transponder 
control node. It may first perform some local computations and then start issuing 
(broadcasting) commands of its own. If the node happens to be a port node, the 
commands are issued to its subnet as well as to its own network. Thus it is possible 
for multiple instruction streams to be broadcast simultaneously at several different 
logical network levels in a HAsC network.
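The broadcast-and-unify step can be sketched as follows. This is a simplified model under stated assumptions: an instruction is represented as an `(opcode, data_file)` pair, and unification is reduced to a membership test against the node's instruction set and resident data files. The class and function names are hypothetical and do not come from HAsC itself.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    instruction_set: set        # names of locally installed instructions
    data_files: set             # names of data files resident at this node
    executed: list = field(default_factory=list)

    def unify(self, instruction):
        """Top-level unification: one match per data file, not per record or field."""
        opcode, data_file = instruction
        return opcode in self.instruction_set and data_file in self.data_files

    def receive(self, instruction):
        # Each node decides locally whether the broadcast instruction applies to it.
        if self.unify(instruction):
            self.executed.append(instruction)

def broadcast(nodes, instruction):
    """Send the same instruction to every node listening on the channel."""
    for node in nodes:
        node.receive(instruction)

nodes = [
    Node("Node1", {"GE", "transpose"}, {"A"}),
    Node("Node2", {"GE", "transpose"}, {"B"}),
]
broadcast(nodes, ("GE", "A"))
broadcast(nodes, ("GE", "B"))
broadcast(nodes, ("FFT", "A"))   # no node has FFT installed, so nobody executes it
```

The controller never addresses a node by name; selection happens entirely at the receiving side.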
In general, HAsC assumes that data is resident in a cell. As a result, data
movement is minimal. However, it is common for one cell to compute a value
and broadcast it to other cells. Thus, there is a need to synchronize the arrival
of commands and data. There are basically two cases which are handled automatically
by the HAsC administrator as a part of the search-process-retrieve protocol.
The normal case is for data to be resident in a cell when the HAsC command 
arrives. Instruction unification and execution proceed as described above. HAsC 
allows data transfers, but protocol insists that the data transfer be completed before 
any associated commands are broadcast.
The second case involves command parameters. When a command arrives and 
is unified with resident data at a node but some parameter data is missing, the 
unified command is stored in a table to wait for the parameter in a synchronism 
process called a data rendezvous. When parameter data arrives, the rendezvous 
table is searched for a match. If found, the associated command is executed.
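The data rendezvous can be sketched with a small table keyed by parameter name. The class `RendezvousNode` and its method names are hypothetical, and a command is modeled as a plain Python callable taking the parameter value; the actual HAsC table layout is not given in the text.

```python
class RendezvousNode:
    """A node that parks unified commands until their parameter data arrives."""

    def __init__(self):
        self.parameters = {}    # parameter name -> value already resident
        self.rendezvous = {}    # parameter name -> list of waiting commands
        self.log = []           # results of executed commands, in order

    def command_arrives(self, func, param_name):
        if param_name in self.parameters:
            # Normal case: the data is resident, so execute immediately.
            self.log.append(func(self.parameters[param_name]))
        else:
            # Parameter missing: store the unified command in the rendezvous table.
            self.rendezvous.setdefault(param_name, []).append(func)

    def data_arrives(self, param_name, value):
        self.parameters[param_name] = value
        # Search the rendezvous table for commands waiting on this parameter.
        for func in self.rendezvous.pop(param_name, []):
            self.log.append(func(value))

node = RendezvousNode()
node.command_arrives(lambda x: x * 2, "scale")   # parked: "scale" is not here yet
node.data_arrives("scale", 21)                   # the rendezvous fires the command
node.command_arrives(lambda x: x + 1, "scale")   # data now resident, runs at once
```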
HAsC uses network administrators and execution engines to effect the
paradigm. Each HAsC network level has a system administrator and each node
in a network has its own local administrator. The local administrator monitors
network traffic, capturing incoming instructions and checking for illegal commands.
It is also responsible for maintaining the local HAsC instruction set.
The HAsC administrator receives all incoming HAsC instructions from the
local network. It then verifies whether each instruction is legal. If it is, the administrator
puts it in the Execution Engine queue. Otherwise, it attempts to identify the source
and makes a report to the system administrator. Repeat offenses cause escalating
diagnostic actions as determined by the network administrator.
If a Meta HAsC instruction such as (un)install, (un)extend, or (un)augment, 
is received, it is processed immediately. The Meta instructions will create, modify 
and delete HAsC instructions from the local HAsC instruction set respectively. Meta 
instructions can also modify local data structure definitions.
Since the instruction set can be dynamically expanded by the users, it is
possible for two users to install the same instructions. The node administrator
distinguishes between the two instructions by a user id and program id, which are
broadcast with every HAsC instruction.
Instructions can be added at several different logical levels: (1) system, (2) 
project, and (3) user. Typical system level instructions would be data move and 
formatting commands. Project commands would be project oriented. For example, 
a numerical analysis project would have matrix multiplication and vector-matrix 
multiplication instructions, while a logic programming project might have specialized 
logic instructions, such as unification. At the user level, one user might specify a 
SAXPY operation while another might want a dot product. Scalable libraries may 
exist at any level, but most commonly at the project level.
Each node/cell has an execution engine which controls instruction execution
at that node. The execution engine selects the next instruction, makes the bindings
specified by instruction unification and causes the instruction to be executed. The
execution engine performs the following tasks: save environment, get next unified
instruction, bind unified variables, establish environment, execute unified instruction,
and restore old environment.
Instruction execution may take two basic forms. First the instruction may be a 
HAsC program which is executed in the transponder mode. Second, the instruction 
may be a library call written in FORTRAN, C, LISP, etc. In this case, the established 
environment restrictions produce the proper interface for the appropriate language.
HAsC must allow for a dynamic instruction set and data structure modifications.
Thus the HAsC install Meta instruction consists of an associative pattern
and a body of code. When it is broadcast to the system, all nodes which successfully
unify with the instruction gather the body of code and install it on the local node.
The extend instruction consists of a pattern and a data definition. Responding
nodes add the data definition to the local associations. Extend may add a named
row or column to an existing association. Augment can be used to add an entire new
association.
The patterns in these instructions contain administrative data, such as job id,
project id, etc. If the node is not participating in the project or job, then it does
not unify and the instruction is not installed or the data definition not extended.
Uninstall, unextend and unaugment perform the inverse operations.
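A minimal sketch of install and uninstall, assuming the pattern is a dictionary of administrative fields and that unification checks only the project id. The class `LocalAdministrator`, the field names, and the instruction bodies are hypothetical. Keying the table by user id and program id lets two users install instructions with the same name.

```python
class LocalAdministrator:
    """Keeps a node's instruction set, keyed by (user id, program id, name)."""

    def __init__(self, projects):
        self.projects = set(projects)   # projects this node participates in
        self.instructions = {}

    def _unifies(self, pattern):
        # The pattern carries administrative data; only the project id is checked here.
        return pattern["project_id"] in self.projects

    def install(self, pattern, name, body):
        if self._unifies(pattern):
            key = (pattern["user_id"], pattern["program_id"], name)
            self.instructions[key] = body

    def uninstall(self, pattern, name):
        if self._unifies(pattern):
            key = (pattern["user_id"], pattern["program_id"], name)
            self.instructions.pop(key, None)

admin = LocalAdministrator(projects={"numerics"})
alice = {"user_id": "alice", "program_id": 1, "project_id": "numerics"}
bob = {"user_id": "bob", "program_id": 7, "project_id": "numerics"}
carol = {"user_id": "carol", "program_id": 3, "project_id": "logic"}

admin.install(alice, "saxpy", lambda a, x, y: [a * xi + yi for xi, yi in zip(x, y)])
admin.install(bob, "saxpy", lambda a, x, y: [yi + a * xi for xi, yi in zip(x, y)])
admin.install(carol, "unify", lambda *args: None)   # ignored: node is not in "logic"
admin.uninstall(bob, "saxpy")                       # removes only bob's version
```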
Basic to the HAsC philosophy is the concept that data, when initially loaded 
into the system, are sent to the appropriate node and are never moved. While this 
would be ideal, there will always be a need to move data from one node to another. 
Accordingly, there are a number of HAsC move commands. Move commands can 
be divided into intra-association and inter-association instructions. Intra-association 
instructions are very much like expressions in conventional languages and are not 
discussed here due to lack of space. Inter-association instructions include file I/O as
a special case. Inter-association moves must have node identifiers, and for I/O, a file
server, a disk or other peripheral is a legal node.
The essence of HAsC is to model a distributed heterogeneous network as 
an associative data parallel computer, where processor synchronization is on an 
instruction by instruction basis. Accordingly, in HAsC, the associative instructions 
are synchronized. A hierarchy of instructions is briefly described here - from the 
highest, most global (easiest to synchronize) to the lowest, most local (hardest to 
synchronize). HAsC will perform most efficiently if the programs are written using 
high level commands. The lower the level of the command, the more inter-node 
communication is required. Five different levels of instruction coupling are required 
to implement all of the HAsC statements on a heterogeneous network.
Communication and synchronization are built into the HAsC instruction. 
There is no need for the programmer to be aware of the degree of instruction 
communication. The five levels of instructions are presented here to more clearly 
delineate the relationship between associative and heterogeneous computing.
The highest level of instruction synchronization is pure associative data parallelism
and involves the use of the channel registers only, i.e., there is no global
coupling. There are two types of top level instructions: (1) those which execute based
on the channel register value only, such as logical and arithmetic expressions; and (2)
those which set the channel register. Data parallel logical expressions (associative
searches) can be used to set the channel registers and are “automatically” incorporated
into many HAsC statements. Thus a data parallel IF or WHERE consists
of only an associative search, followed by a sequence of data parallel expressions. It
is a top level instruction. Top level instructions execute in real time and require no
global response or communication. Most computation is done at the top level.
Figure 6.3 gives an example of instruction synchronization, where $ is the
parallel marker. Result$ is a data parallel pronoun referring to the results of the last
performed data parallel computation. The top level synchronization box shows the
programming style for algebraic expressions supported by HAsC.
Figure 6.3 Instruction Synchronization. The figure groups sample HAsC statements into five levels: top level synchronization (expressions and WHERE commands), second level synchronization (data move and I/O commands), third level synchronization (ANY command), fourth level synchronization (item selection), and fifth level synchronization (iteration). The statements shown are:
    add the b$ to the c$
    subtract the result$ from the d$
    convolve the result$ with the e$
    save the result$ in the f$
    compare the a$ with the b$
    where the result$ are equal do ... elsewhere do ...
    pick one of the responder$
    any a$ greater than the b$
    move the a$ to the b$
    save the a$ in the b$
    read c$
    read matrix a$
    exit if EOF
    any a$ greater than 5
The second level of instruction coupling requires only global synchronism.
Prime examples are the data transfer and I/O commands. I/O is always local to
a cell’s processor, but in general the processors may be quite different physically
and therefore I/O times may vary dramatically, requiring synchronization before the
next HAsC command is issued. Again, the programmer need not be aware of the
synchronization requirements of this class of instructions. The synchronization is
automatic. The programmer only recognizes the need for I/O or data movement.
The third level of complexity consists of simple responder commands. These
commands require the ORing of the responder results of all processors (i.e., an OR
reduction). On a SIMD machine this is a single instruction. In HAsC, it is the
simplest form of a HAsC reduction communication. The instructions at this level,
such as ANY, are used to check for error conditions, or determine whether special
case computing needs to be done.
The fourth level is random selection. The HAsC commands in Figure 6.3 at this 
level consist of an associative search, followed by the selection of a responder by the 
“first reduction” operation. The data object of the selected responder is broadcast 
to the entire HAsC network for further processing.
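The two reductions used at the third and fourth levels can be sketched sequentially. The function names are hypothetical, and the loops only simulate what would be a network-wide reduction over all cells.

```python
def any_responder(flags):
    """Third level: OR reduction over the responder flags of all cells."""
    result = False
    for flag in flags:
        result = result or flag
    return result

def pick_one(flags, data):
    """Fourth level: the 'first reduction' selects the first responding cell's datum."""
    for flag, item in zip(flags, data):
        if flag:
            return item
    return None     # no cell responded

a = [3, 9, 4, 7]
b = [5, 2, 6, 1]
responders = [x > y for x, y in zip(a, b)]   # associative search: a$ greater than the b$
found = any_responder(responders)
selected = pick_one(responders, a)           # datum that would be broadcast to the network
```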
The fifth level is iteration. The only use for iteration at the top level of HAsC 
is for user interaction. For example, a typical program might be one which allows 
the user to interactively specify kernels to be convolved with an image and to review 
the results. Data iteration does not exist in Figure 6.3.
6.2 Combined Use of Cluster-M and HAsC
HAsC is most suitable for coarse-grained heterogeneous parallel computing. It is
intended to ease the programming effort and to maximize execution speed. Cluster-M,
on the other hand, provides both coarse-grained and fine-grained mapping in a
clustered fashion. It aims at maximizing both execution speed and resource
utilization. Therefore, the two paradigms can be combined to achieve better overall
performance, featuring ease of programming, increased execution speed and optimal
resource utilization.
Cluster-M mapping can be applied to HAsC in several ways. First, Cluster-M
can be used to determine the initial data mapping before HAsC computation
begins so that the overall execution time is minimized. Secondly, Cluster-M mapping 
can be used to decide the fine-grained mapping within HAsC nodes as shown in 
Figure 6.4. Thirdly, Cluster-M can be alternated with HAsC at run time. In this 
approach, a Cluster-M Specification for the task is generated first. The Cluster-M 
Specification preserves computation and communication information in a multi-level 
cluster organization. Clusters at the same level represent computations at a given 
step which can be executed concurrently. This cluster organizational information 
can be sent to the HAsC network controller which then broadcasts the clusters of 
HAsC instructions (Figure 6.5). As described in Section 6.1, the local HAsC nodes 
determine which of the clusters to execute based on their local configuration and data. 
Global results, if any, are returned to the initiating HAsC controller which may use 
them to select the next level of clusters to be broadcast. The process repeats until 
all cluster levels have been processed. This approach is a network implementation of 
the multiple-SIMD architecture described in [66].
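The alternation between Cluster-M levels and HAsC broadcasting can be sketched as a controller loop. This is an illustration under simplifying assumptions: each cluster is an `(instruction, data_name)` pair, a node executes a cluster when the named data is resident, and every level is always broadcast (the selection of the next level from global results is omitted). The function name and data layout are hypothetical.

```python
def run_levels(spec_levels, nodes):
    """Broadcast the clusters of each Spec level and record who executed what.

    spec_levels -- list of levels, each a list of (instruction, data_name) clusters
    nodes       -- dict mapping node name to the set of data names resident there
    """
    trace = []
    for level in spec_levels:                   # one broadcast round per cluster level
        executed = {}
        for node, resident in nodes.items():    # every node sees every cluster
            for instruction, data_name in level:
                if data_name in resident:       # local decision based on local data
                    executed.setdefault(node, []).append(instruction)
        trace.append(executed)                  # results returned to the controller
    return trace

spec_levels = [
    [("GE", "A"), ("GE", "B")],     # step 1: two concurrent clusters
    [("transpose", "B")],           # step 2
]
nodes = {"Node1": {"A"}, "Node2": {"B"}}
trace = run_levels(spec_levels, nodes)
```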
Figure 6.4 Cluster-M aided HAsC computation within HAsC nodes.
Figure 6.5 Switching between Cluster-M and HAsC.
The following illustrates the combined use of Cluster-M and HAsC by an
example. Given two $7 \times 7$ real matrices $A$ and $B$, suppose we want to calculate
$U_A \times U_B^T$, where $L_A \times U_A = A$ and $L_B \times U_B = B$. The matrices $L_A$ (or $L_B$) and
$U_A$ (or $U_B$) have the same dimensions as $A$; $L_A$ (or $L_B$) is unit lower triangular
(i.e., zeros above the diagonal and the value one on the diagonal), and $U_A$ (or $U_B$)
is upper triangular (i.e., zeros below the diagonal). To transform the original square
matrix into upper triangular form,
a Gaussian Elimination (GE) algorithm can be used. Therefore, the solution to the
above problem can be written at HAsC user level as below:
do GE on A$ 
save result$ in UA$ 
do GE on B$ 
transpose result$ 
save result$ in UBT$ 
multiply UA$ with UBT$
The task graph of this coarse-grain solution is shown in Figure 6.6(a). Using
Cluster-M clustering, the corresponding Spec graph is obtained, as shown in
Figure 6.6(b). Suppose there is more than one HAsC node available in the system.
Using Cluster-M mapping, the matrices $A$ and $B$ will be allocated to two different
HAsC nodes, say Node1 and Node2, respectively.
Figure 6.6 The task graph and Spec graph of the HAsC user level instructions.
Next, for each level of clustering in the Spec graph (which represents each
computation step in the original task graph), the concurrent clusters at that level
(which represent concurrent computation modules) can be sent to the HAsC network
controller to be broadcast to all the HAsC nodes. For example, at step 1, two clusters
of HAsC user level instructions (function calls) “do GE on A$” and “do GE on B$”
are broadcast to all HAsC nodes at the same time. The HAsC node Node1 will select
to execute the first instruction, while the HAsC node Node2 will select to execute
the second instruction.
Finally, Cluster-M mapping is used to decide the fine-grain mapping within
each HAsC node. The GE operation, which is a function in the user level library,
actually consists of many system level instructions which may look similar to the
SAXPY code in LINPACK [23, 24]. The task graph of a GE on a $7 \times 7$ matrix $A$ or
$B$ is illustrated in Figure 6.7. In each task module $T_k^j$, column $j$ is modified by using
column $k$. Suppose Node1 is a $2 \times 3$ torus, and Node2 is a 4-processor completely
connected machine, as shown in Figure 6.8. Also, suppose for both Node1 and
Node2, it takes 1 unit of time to compute each $T_k^j$ and 1 unit of time to transmit
each column between two connected processors. Using the Cluster-M clustering and
mapping algorithms, the fine-grain mappings of system level HAsC instructions onto
the processors within each HAsC node can be obtained, as shown in Figure 6.9.
Figure 6.7 The task graph of a GE on a $7 \times 7$ matrix.
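The column-oriented view of GE, in which each task module modifies column j by using column k, can be sketched as below. The sketch assumes one task module per pair k < j (0-based indices), no pivoting, and nonzero pivots; the function names are hypothetical, and exact fractions are used only to keep the arithmetic free of rounding.

```python
from fractions import Fraction

def ge_tasks(n):
    """List the task modules (k, j): column j is modified by using column k."""
    return [(k, j) for k in range(n - 1) for j in range(k + 1, n)]

def gaussian_elimination(a):
    """Return U from A = L U (no pivoting), processed one task module at a time."""
    n = len(a)
    u = [[Fraction(x) for x in row] for row in a]
    for k, j in ge_tasks(n):
        for i in range(k + 1, n):
            multiplier = u[i][k] / u[k][k]      # entry of L, read from column k
            u[i][j] -= multiplier * u[k][j]     # SAXPY-style update of column j
    # Column k below the diagonal still holds the unscaled multipliers; clear it.
    for k in range(n):
        for i in range(k + 1, n):
            u[i][k] = Fraction(0)
    return u

upper = gaussian_elimination([[2, 1, 1], [4, 3, 3], [8, 7, 9]])
task_count = len(ge_tasks(7))   # number of task modules for a 7 x 7 matrix
```

Tasks that share the same k touch different columns j and can therefore run concurrently, which is the parallelism a mapping onto the processors of a node would exploit.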
Figure 6.9 The Cluster-M mappings within the HAsC nodes: (a) Cluster-M mapping within HAsC Node 1; (b) Cluster-M mapping within HAsC Node 2. The horizontal axis of each panel is time.
6.3 Scalability Issues
Scalability is often understood differently by different authors. We will consider
scalability to refer to hardware, tasks and software in roughly analogous fashion. In
addition, scalability may refer to either homogeneous or heterogeneous architectures.
In the following, homogeneous scalability is first defined and then extended to
heterogeneous scalability. Then the scalability of HAsC and Cluster-M is discussed.
6.3.1 Homogeneous Scalability
Homogeneous hardware scalability refers to multiple machines which are of the same
basic architectural type, typically various-sized versions of the same vendor product.
The hardware scalability function, $\chi(\alpha, \beta)$, between two homogeneous architectures
$\alpha$ (the larger) and $\beta$ (the smaller), is defined to be the ratio of the size of $\alpha$ over
the size of $\beta$. For example, an eight-processor CRAY YMP is a hardware example
of a scaled-up version of a two-processor CRAY YMP. In this example, the
eight-processor CRAY YMP has a scalability factor of 4 ($\chi = 4$) over the two-processor.
Task scalability is more complex. What is typically implied is the ability to take
a task (algorithm plus data) executing on a small machine and execute the same task
on a scaled-up machine. Thus, using the additional resources of the larger machine
allows scaled-up performance reasonably close to $\chi$. One ambiguity in this concept
is what is meant by the same task. If it means only executing the same program, but
with possibly different (i.e., larger) data, then tasks in a homogeneous environment
often scale. The type 1 task scalability function, $T(\alpha, \beta)$, for a given program applied
to two differently sized data sets $\alpha$ (the larger) and $\beta$ (the smaller), is defined to be the
ratio of the size of $\alpha$ over the size of $\beta$. For example, if the size of $\alpha$ is 16K items and
the size of $\beta$ is 2K items, then $T = 8$. This means that a program is type 1 scalable
if it processes data set $\beta$ eight times faster than data set $\alpha$, using the same hardware
configuration.
However, if the above definitions are applied to the case where both the data and
the algorithm are fixed, then tasks often do not scale. Type 2 task scalability, between
two homogeneous architectures $\alpha$ (the larger) and $\beta$ (the smaller), is defined to be
the potential to exploit the inherent hardware scalability between them on some task
of a size that fills $\alpha$.
The software scalability refers to the ability to exploit task and hardware scalability,
with little or no changes other than parameters. Software scalability function,
$\sigma(\alpha, \beta)$, for the case of two homogeneous architectures $\alpha$ (the larger) and $\beta$ (the
smaller), is defined to be the real-valued function giving the increase in performance
of $\alpha$ over $\beta$. Typically some increase in performance is expected but generally, at
least in the homogeneous case, not super-linear performance, i.e., $1 \le \sigma(\alpha, \beta) \le
\chi(\alpha, \beta)$. In most cases $\sigma$ is a simple multiple of $\chi$, i.e., $\sigma(\alpha, \beta) = \lambda \times \chi(\alpha, \beta)$, where
$1/\chi(\alpha, \beta) \le \lambda \le 1.0$.
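As a worked instance of these definitions, write $\chi$ for the hardware scalability function, $\sigma$ for the software scalability function, and $\lambda$ for the multiplier relating them. Taking the eight-processor and two-processor CRAY YMP pair from above and a hypothetical value $\lambda = 0.75$ (an assumption chosen only for illustration), the bounds can be checked directly:

```latex
\begin{align*}
\chi(\alpha,\beta) &= \frac{8}{2} = 4, \\
\sigma(\alpha,\beta) &= \lambda \times \chi(\alpha,\beta) = 0.75 \times 4 = 3, \\
\frac{1}{\chi(\alpha,\beta)} = 0.25 &\le \lambda = 0.75 \le 1.0
\quad\Longrightarrow\quad 1 \le \sigma(\alpha,\beta) = 3 \le \chi(\alpha,\beta) = 4 .
\end{align*}
```

The two extremes correspond to no speedup at all ($\lambda = 1/\chi$, so $\sigma = 1$) and perfectly linear speedup ($\lambda = 1$, so $\sigma = \chi$).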
6.3.2 Heterogeneous Scalability
Heterogeneous scalability is clearly more complicated than homogeneous scalability,
though it is also the case in which one can aspire to the ultimate in heterogeneous
computing potential, i.e., to achieve $\sigma$ significantly greater than $\chi$. This is what
is meant by super-linear performance. In the heterogeneous case, there may be no
commonality between two different architectures; therefore, hardware scalability does
not apply to the heterogeneous case.
Consider the breakdown of a task into four levels, as shown in Figure 6.10. The
top level is the functional level. In this level, the function “find a datum” is specified.
Next is the approach level. For this problem, there is a radical difference between the
approach for a SIMD machine used associatively and non-SIMD machines. In the
former case, we can use simple associative search, which is $O(1)$. In the latter case
we would typically use a sort, then search operation, the asymptotic performance
of which is bounded by $\Omega(\log n)$. For the associative search on a suitable SIMD
machine, there is really only one instruction, “find datum,” so that there is no room
for further variation. For the non-SIMD approach, in contrast, there
are many variations possible. For example, depending on the data, parameters,
architecture, etc., a number of different search techniques can be used and similarly
a number of different coding schemes for each algorithm could also be utilized.
Figure 6.10 Hierarchical breakdown of a task. “Find a datum” splits into a non-SIMD branch (sort, then search, i.e., >= O(log n); various sorts such as Quicksort and Bubblesort; various encodings for any specific algorithm) and a SIMD branch (associative search without sorting, i.e., O(1); a single associative command, e.g., find datum).
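The distinction between the functional level and the approach level can be sketched in code. The function names are hypothetical. The associative approach is only simulated: on an associative SIMD machine all cells would compare in a single step, whereas the list comprehension below visits them one after another.

```python
from bisect import bisect_left

def find_sorted(data, target):
    """Non-SIMD approach: sort, then binary search (the search takes O(log n) steps)."""
    ordered = sorted(data)
    pos = bisect_left(ordered, target)
    return pos < len(ordered) and ordered[pos] == target

def find_associative(cells, target):
    """SIMD approach, simulated: every cell compares its own datum with the target."""
    responders = [datum == target for datum in cells]
    return any(responders)

def find_datum(data, target, associative):
    """Functional level: callers ask to 'find a datum' without naming an approach."""
    return find_associative(data, target) if associative else find_sorted(data, target)

data = [7, 3, 9, 1]
hit = find_datum(data, 9, associative=True)
miss = find_datum(data, 4, associative=False)
```

Only `find_datum` is the same on every machine; the choice between the two approaches, and between sorting algorithms within the non-SIMD approach, is machine dependent.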
In this context, the term scalability only applies to either functional level
or approach level. In the above example, the scalable approach is the non-SIMD
approach. However, this raises the following dilemmas: (1) it is possible to have
a non-scalable implementation (at the approach level) inherently more effective than
a scalable approach implemented on the same machine; and (2) it is possible to have
high hardware scalability but low task/software scalability, or vice versa. In other
words, the scalability metric is inherently defective in this case if scalability is applied
to the approach level.
In conclusion, the only kind of scalability applicable to a heterogeneous network
is type 1 task scalability at the functional level. In essence heterogeneous scalability
refers to the property that a given software scalable program will execute efficiently
on any size data set on any heterogeneous network configuration without any
modification. While functional level scalability may be trivial on a homogeneous network,
it is fundamental to establishing a common unifying programming environment for
heterogeneous networks.
6.3.3 Scalability of HAsC and Cluster-M
Both HAsC and Cluster-M are machine-independent as explained in detail and 
therefore support heterogeneous scalability. In HAsC, a program is broadcast to 
the entire network and the individual nodes determine locally which instructions to 
execute. The global broadcasting approach means that there is no need to know 
how nodes are connected in the network or how data is distributed across the nodes. 
This allows data files to be analyzed dynamically at run time as they enter the HAsC 
system and to be directed to the nodes best suited to process them. Broadcasting 
allows scalability. The hardware can be expanded or modified and the problem size 
can be changed without having to reprogram or recompile the basic HAsC program. 
New nodes consisting of new machines with installed HAsC software can be added to 
a network at any time, and at any location. HAsC is not dependent on any physical 
machine or network configuration. This is because the instruction broadcast, cell
memory organization and associative searching allow the removal of any reference
to data set size and type from the program.
Cluster-M is also scalable. When a new machine is added to the heterogeneous 
network, a new Cluster-M Representation of the new suite can be generated. 
However, the Cluster-M Specification, which is machine-independent, can be 
efficiently executed without any change. An appropriate new mapping can be 
computed to map the Cluster-M Specification to the new Cluster-M Representation. 
Furthermore, the two paradigms can be used concurrently as a hybrid scalable
programming paradigm. Figure 6.11 illustrates the above claims.
Figure 6.11 Scalability of HAsC and Cluster-M. The panels include the structure of a scalable heterogeneous paradigm (a problem expressed as a machine independent program and passed through a distribution unit to the machines) and scalability in concurrent use of HAsC and Cluster-M (a multi-level Cluster-M Specification of HAsC instructions distributed by HAsC controller broadcasting).
CHAPTER 7
CONCLUSIONS
In this dissertation, we presented a generic and efficient technique for mapping
portable parallel programs onto various multiprocessor systems. The mapping
technique is based on Cluster-M, a parallel programming tool. We presented
the three main components of Cluster-M: Cluster-M Specifications, Cluster-M
Representations, and the Cluster-M Mapping Module. Cluster-M Specifications are
high-level, machine-independent descriptions of parallel tasks, while Cluster-M
Representations capture the computation and communication capacity and pattern of
the underlying parallel computer systems. Cluster-M Specifications and
Representations can be viewed as two special types of clustered graphs, called
Spec graphs and Rep graphs, respectively. The clustering is done only once for a
given task graph, independently of any system graph, and only once for a given
system graph, independently of any task graph. The clustering is therefore
machine-independent for task graphs and application-independent for system graphs,
and it is not repeated for different mappings. The Cluster-M Mapping Module maps
a Spec graph onto a Rep graph. The mapping algorithms presented in this
dissertation can map arbitrary tasks onto arbitrary systems, for uniform and
non-uniform graphs, in O(MN) and O(MP) time, respectively, where M is the number
of task modules, N is the number of processors, and P = max(M, N). Our
experimental results indicate that Cluster-M produces mappings that are better
than or similar to those of other leading techniques, which work only for
restricted task or system graphs. Lastly, we presented several applications of
the mapping technique to heterogeneous computing.
APPENDIX A
Cluster-M Constructs in PCN
The seven Cluster-M constructs are implemented in PCN as follows:
/* 1. Makes given elements into one cluster */
CMAKE(LVL, ELEMENTS, x)
{|| MIN_ELEMENT(ELEMENTS, n),
    /* n is the smallest number in ELEMENTS */
    x = [LVL, n, ELEMENTS]
}
MIN_ELEMENT(E, n)
{; sys:list_length(E, len),
   {? len == 1 -> n = E[0],
      default -> {? E ?= [m | E1] ->
         {; MIN_ELEMENT1(E1, m, min),




MIN_ELEMENT1(E1, m, min)
{? E1 ?= [h | E2] ->
   {;
      {? h < m -> m1 = h,




      MIN_ELEMENT1(E2, m1, min)
   },
   default -> min = m
}
/* 2. Yields an element of the cluster */
CELEMENT(x, j, e)
{; CSIZE(x, s),
   {? j == "-", x ?= [_, _, x1] -> e = x1,
      j <= s, x ?= [_, _, x1] -> CELEMENT1(x1, j, e)
   }
}
CELEMENT1(x, j, e)
{? j > 1 ->
   {? x ?= [_ | x1] ->
      CELEMENT1(x1, j - 1, e),
   },
   default -> e = x[0]
}
/* 3. Yields the size of the cluster */
CSIZE(x, s)
{? x ?= [_, _, x2] -> CSIZE1(x2, 0, s),
   default -> s = 0
}
CSIZE1(x, acc, s)
{? x ?= [_ | x1] -> CSIZE1(x1, acc + 1, s),
   default -> s = acc
}
/* 4. Merges cluster x and y */
CMERGE(x, y, ELEMENTS, z)
{? x ?= [LVL_x, _, x1], y ?= [LVL_y, _, y1] ->
   {; MIN_ELEMENT(ELEMENTS, min),
      make_tuple(3, T),
      T[0] = LVL_x + 1,
      T[1] = min,
      {? ELEMENTS == "-" ->
         {; sys:list_concat(x1, y1, xy),
            T[2] = xy
         },
         default -> T[2] = ELEMENTS
      },
      sys:tuple_to_list(T, z, [])
   }
}
/* 5. Does the Unary operation */
CUN(op, n, x, i, e)
{; CELEMENT(x, i, e1),
   {? op == "<<" -> left_shift(e1, n, e),
      op == ">>" -> right_shift(e1, n, e),
      op == "~" -> ones_complement(e1, e),
      op == "sqr" -> e = e1 * e1,
      op == "-" -> e = 0 - e1
   }
}
/* 6. Does the Binary operation */
CBI(op, x, i, y, j, e)
{; CELEMENT(x, i, e1),
   CELEMENT(y, j, e2),
   {? op == "+" -> e = e1 + e2,
      op == "-" -> e = e1 - e2,
      op == "*" -> e = e1 * e2,
      op == "/" -> e = e1 / e2,
      op == "%" -> e = e1 % e2,
      op == "&" -> bitwise_and(e, e1, e2),
      op == "|" -> bitwise_or(e, e1, e2),
      op == "^" -> bitwise_xor(e, e1, e2)
   }
}
/* 7. Does the Split operation */
CSPLIT(x, k, p, q)
{|| CSIZE(x, s),
    {? x ?= [LVL, n, E] ->
       {? k == s ->
          {|| p = [LVL + 1, n, E],
              q = [LVL + 1, 0, 0],
          {|| CSPLIT1(E, k, E1, E2),
              MIN_ELEMENT(E1, n1),
              MIN_ELEMENT(E2, n2),
              p = [LVL + 1, n1, E1],





CSPLIT1(E, k, E1, E2)
{? k > 0 ->
   {? E ?= [h | t] ->
      {|| CSPLIT1(t, k - 1, E3, E2),
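For readers unfamiliar with PCN, the seven constructs listed in this appendix can be paraphrased in Python. This is a hedged sketch, not a translation of the original implementation. It models a cluster as the list `[level, smallest element, elements]`, following the `[LVL, n, ELEMENTS]` lists in the listing, and it rests on several assumptions where the listing is incomplete: that `"-"` means "all elements" or "no explicit element list", that elements are integers, that the operator spellings follow C, and that splitting at the full size leaves an empty second cluster. Unlike the listing, `cmerge` takes the minimum over the merged elements. The lowercase function names are chosen here for illustration.

```python
import operator

# A cluster is modeled as [level, smallest element, elements].

def cmake(lvl, elements):
    """1. Make the given elements into one cluster."""
    return [lvl, min(elements), list(elements)]

def csize(x):
    """3. Yield the size of the cluster."""
    return len(x[2])

def celement(x, j):
    """2. Yield the j-th element (1-based); "-" is assumed to mean all elements."""
    if j == "-":
        return list(x[2])
    if not 1 <= j <= csize(x):
        raise IndexError("element index out of range")
    return x[2][j - 1]

def cmerge(x, y, elements="-"):
    """4. Merge clusters x and y into a cluster one level above x."""
    merged = x[2] + y[2] if elements == "-" else list(elements)
    return [x[0] + 1, min(merged), merged]

# Assumed operator spellings; integer elements are assumed throughout.
UNARY = {
    "<<": lambda e, n: e << n,
    ">>": lambda e, n: e >> n,
    "~": lambda e, n: ~e,
    "sqr": lambda e, n: e * e,
    "-": lambda e, n: 0 - e,
}
BINARY = {
    "+": operator.add, "-": operator.sub, "*": operator.mul,
    "/": operator.floordiv, "%": operator.mod,
    "&": operator.and_, "|": operator.or_, "^": operator.xor,
}

def cun(op, n, x, i):
    """5. Apply a unary operation to element i of cluster x."""
    return UNARY[op](celement(x, i), n)

def cbi(op, x, i, y, j):
    """6. Apply a binary operation to element i of x and element j of y."""
    return BINARY[op](celement(x, i), celement(y, j))

def csplit(x, k):
    """7. Split x after its first k elements (0 < k <= size is assumed)."""
    lvl, n, elems = x
    if not 0 < k <= len(elems):
        raise ValueError("k must satisfy 0 < k <= cluster size")
    if k == len(elems):
        # Nothing is left for the second cluster; an empty cluster is assumed.
        return [lvl + 1, n, list(elems)], [lvl + 1, 0, []]
    first, rest = list(elems[:k]), list(elems[k:])
    return [lvl + 1, min(first), first], [lvl + 1, min(rest), rest]

# Usage: build two clusters, merge them, then split the result in half.
a = cmake(0, [5, 3])
b = cmake(0, [9, 1])
m = cmerge(a, b)
p, q = csplit(m, 2)
```

Returning new lists instead of binding output variables is a deliberate simplification: PCN definitions communicate results through definitional variables such as `x`, `e`, `p`, and `q`, whereas the Python functions return them.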





1. C-Linda Reference Manual. Scientific Computing Associates, Inc., New Haven,
CT, 1990.
2. H. H. Ali and H. El-Rewini. “A graph theoretic approach for task allocation.”
In Proc. Hawaii International Conference on Systems Science, pages 
577-584, 1992.
3. F. D. Anger, J. Hwang, and Y. Chow. “Scheduling with sufficient loosely coupled
processors.” Journal of Parallel and Distributed Computing, 9:87-92, 
1990.
4. F. Berman. “Experience with an automatic solution to the mapping problem.”
The Characteristics of Parallel Algorithms, pages 307-334, 1987.
5. F. Berman and L. Snyder. “On mapping parallel algorithms into parallel archi­
tectures.” Journal of Parallel and Distributed Computing, 4:439-458, 
1987.
6. F. Berman and B. Stramm. “Prep-P: Evolution and overview.” Technical report
cs89-158, Department of Computer Science, University of California at 
San Diego, 1987.
7. S. H. Bokhari. “Dual processor scheduling with dynamic reassignment.” IEEE
Trans, on Software Engineering, SE-5:341-349, July 1979.
8. S. H. Bokhari. “On the mapping problem.” IEEE Trans, on Computers,
c-30(3):207-214, March 1981.
9. S. H. Bokhari. “A shortest tree algorithm for optimal assignments across space
and time in a distributed processor system.” IEEE Trans, on Software 
Engineering, SE-7(6):583-589, November 1981.
10. S. H. Bokhari. “Partitioning problem in parallel, pipelined, and distributed
computing.” IEEE Trans, on Computers, 37(l):48-57, January 1988.
11. N. Carriero, D. Gelernter, and J. Leichter. “Distributed data structures
in Linda.” In Proc. Thirteenth ACM Symposium on Principles of 
Programming Languages, January 1986.
12. T. L. Casavant and J. G. Kuhl. “A taxonomy of scheduling in general-purpose
distributed computing systems.” IEEE Trans, on Software Engineering, 
14(2):42-45, February 1988.
13. K. M. Chandy and S. Taylor. An Introduction to Parallel Programming. Jones
and Bartlett Publishers, Boston, MA, 1992.
108
109
14. V. Chaudhary and J. K. Aggarwal. “A generalized scheme for mapping
parallel algorithms.” IEEE Trans, on Parallel and Distributed Systems, 
4(3):328-346, March 1993.
15. S. Chen and M. M. Eshaghian. “A fast recursive mapping algorithm.” To appear
at Concurrency: Practice and Experience, August 1995.
16. S. Chen, M. M. Eshaghian, A. Khokhar, and M. E. Shaaban. “A selection
theory and methodology for heterogeneous supercomputing.” In Proc. 
Second Heterogeneous Processing Workshop, pages 15-22, April 1993.
17. S. Chen, M. M. Eshaghian, and Y. Wu. “Mapping arbitrary non-uniform task
graphs onto arbitrary non-uniform system graphs.” To appear at Proc. 
International Conference on Parallel Processing, August 1995.
18. D. Y. Cheng. “A survey of parallel programming languages and tools.” Report
RND-93-005, NASA Ames Research Center, Moffett Field, CA, 1993.
19. Y. Chung and S. Ranka. “Applications and performance analysis of a compile­
time optimization approach for list scheduling algorithms on distributed 
memory multiprocessors.” In Proc. Supercomputing ’92, pages 512-521,
1992.
20. E. G. Coffman and R. L. Graham. “Optimal scheduling for two processor
systems.” Acta Informatica, 1:200-213, 1972.
21. J. Y. Colin and P. Chritienne. “CPM scheduling with small communication
delays and task duplication.” Operations Research, 39(4):680-684, 1991.
22. S. Darbha and D. P. Agrawal. “SDBS: A task duplication based optimal
scheduling algorithm.” In Proc. Scalable High Performance Computing 
Conference, pages 756-763, 1994.
23. J. J. Dongarra, J. Bunch, C. Moler, and G. Stewart. LINPACK User’s Guide.
SIAM, Philadelphia, PA, 1979.
24. J. J. Dongarra, F. Gustavson, and A. Karp. “Implementing linear algebra
algorithms for dense matrices on a vector pipeline machine.” SIAM  
Review, 26(1):91-112, 1984.
25. K. Efe. “Heuristic models of task assignment scheduling in distributed systems.”
IEEE Computer, 15(6):50-56, 1982.
26. H. El-Rewini and T. G. Lewis. “Scheduling parallel program tasks onto arbitrary
target machines.” Journal of Parallel and Distributed Computing, 
9:138-153, 1990.
27. H. El-Rewini, T. G. Lewis, and H. H. Ali. Task Scheduling in Parallel and 
Distributed Systems. Prentice Hall, Englewood Cliffs, NJ, 1994.
110
28. F. Ercal, J. Ramanujam, and P. Sadayappan. “Task allocation onto a hypercube
by recursive mincut bipartitioning.” Journal of Parallel and Distributed 
Computing, 10:35-44, 1990.
29. M. M. Eshaghian. “Cluster-M parallel programming model.” In Proc. Interna­
tional Parallel Processing Symposium, pages 462-465, March 1992.
30. M. M. Eshaghian and R. F. Freund. “Cluster-M paradigms for high-order
heterogeneous procedural specification computing.” In Proc. Workshop 
on Heterogeneous Processing, pages 47-49, March 1992.
31. M. M. Eshaghian and M. E. Shaaban. “A Cluster-M based mapping
methodology.” In Proc. International Parallel Processing Symposium, 
pages 213-221, April 1993.
32. M. M. Eshaghian and M. E. Shaaban. “Cluster-M parallel programming
paradigm.” International Journal of High Speed Computing, 
6(2):287-309, June 1994.
33. D. Fernandez-Baca. “Allocating modules to processors in a distributed systems.”
IEEE Trans, on Software Engineering, 15( 11): 1427—1436, November 1989.
34. S. A. Fineberg, T. L. Casavant, and H. J. Siegel. “Experimental analysis of a
mixed-mode parallel architecture using bitonic sequence sorting.” Journal 
of Parallel and Distributed Computing, 11(3):239-251, March 1991.
35. I. Foster and S. Tuecke. “Parallel programming with PCN.” Technical report,
Argonne National Laboratory, University of Chicago, January 1993.
36. I. Foster and T. Stephen. STRAND, New Concepts in Parallel Programming.
Prentice Hall, 1975.
37. R. F. Freund. “Optimal selection theory for superconcurrency.” In Proc. Super­
computing ’89, pages 699-703, November 1989.
38. R. F. Freund and D.S. Conwell. “Superconcurrency: A form of distributed
heterogeneous supercomputing.” Supercomputing Review, 3:47-50, 
October 1990.
39. A. Gerasoulis, S. Venugopal, and T. Yang. “Clustering task graphs for message
passing architectures.” In Proc. ACM International Conference of Super­
computing, June 1990.
40. A. Gerasoulis and T. Yang. “A comparison of clustering heuristics for scheduling
directed acyclic graphs on multiprocessors.” Journal of Parallel and 
Distributed Computing, 16:276-291, 1992.
Ill
41. D. H. Gill, T. J. Smith, T. E. Gerasch, J. V. Warren, C. L. McCreary, and
R. E. K. Stirewalt. “Spatial-temporal analysis of program dependence 
graphs for useful parallelism.” Journal of Parallel and Distributed 
Computing, 19:103-118, October 1993.
42. T. C. Hu. “Parallel sequencing and assembly line problems.” Operations
Research, 9(6):841—848, 1961.
43. J. Hwang, Y. Chow, F. D. Anger, and C. Lee. “Scheduling precedence graphs
in systems with interprocessor communication times.” SIAM  Journal on 
Computing, 18:244-257, 1989.
44. B. Indurkya, H. S. Stone, and X. Lu. “Optimal partitioning of randomly
generated distributed programs.” IEEE Trans, on Software Engineering, 
SE-12(3):483-495, March 1986.
45. L. R. Ford Jr. and D. R. Fulkerson. Flows in Networks. Princeton University
Press, Princeton, NJ, 1962.
46. S. Kambhatla, J. Inouye, and J. Walpole. “Experiences with BeLinda: A
synthetic Linda benchmark for parallel computing platforms.” In Proc. 
International Conference on Parallel Processing, 1990.
47. R. M. Karp. “Reducibility among combinatorial problems.” Complexity of
Computer Computations, 1972.
48. H. Kasahara and S. Narita. “Practical multiprocessor scheduling algorithms
for efficient parallel processing.” IEEE Trans, on Computers, 
c-33( 11): 1023-1029, November 1984.
49. B. W. Kernighan and S. Lin. “An efficient heuristic procedure for partitioning
graphs.” Bell System Technical Journal, February 1970.
50. A. A. Khan, C. L. McCreary, and M. S. Jones. “A comparison of multi­
processor scheduling heuristics.” In Proc. International Conference on 
Parallel Processing, pages II 243-250, 1994.
51. A. Khokhar, V. K. Prasanna, M. E. Shaaban, and C. Wang. “Heterogeneous
computing: Challenges and opportunities.” IEEE Computer, 26(6): 18-27, 
June 1993.
52. S. J. Kim and J. C. Browne. “A general approach to mapping of parallel
computation upon multiprocessor architectures.” In Proc. International 
Conference on Parallel Processing, volume 3, pages 1-8, 1988.
53. B. Kruatrachue and T. Lewis. “Grain size determination for parallel processing.” 
IEEE Trans, on Software Engineering, January 1988.
112
54. B. Narahari, L. Tao, and Y. C. Zhao. “Heuristics for mapping parallel compu­
tations to heterogeneous parallel architectures.” In Proc. Workshop on 
Heterogeneous Processing, pages 36-41, April 1993.
55. C. Leangsuksun and J. Potter. “Problem representation for an automatic
mapping algorithm on heterogeneous processing environment.” In Proc. 
Workshop on Heterogeneous Processing, pages 48-56, April. 1993.
56. S. Lee and J. K. Aggarwal. “A mapping strategy for parallel processing.” IEEE
Trans, on Computers, 36:433-442, April 1987.
57. R. Leland and B. Hendrickson. “An empirical study of static load balancing
algorithms.” In Proc. Scalable High-Performance Computing Conference, 
pages 682-685, 1994.
58. V. M. Lo. “Algorithms for static task assignment and symmetric contraction
in distributed computing systems.” In Proc. International Conference on 
Parallel Processing, pages 239-244, August 1988.
59. V. M. Lo. “Heuristic algorithms for task assignment in distributed systems.”
IEEE Trans, on Computers, 37( 11 ):1384—1397, November 1988.
60. V. M. Lo, S. Rajopadhye, S. Gupta, D. Keldsen, M. A. Mohamed, and J. A.
Telle. “Oregami: Software tools for mapping parallel computations to 
parallel architectures.” In Proc. International Conference on Parallel 
Processing, 1990.
61. C. McCreary and H. Gill. “Automatic determination of grain size for
efficient parallel processing.” Communications of ACM, 32(9):1073-1078, 
September 1989.
62. M. A. Palis, J. Liu, and D. S. L. Wei. “Task clustering and scheduling for
distributed memory parallel architectures.” Technical report, Department 
of Electrical and Computer Engineering, New Jersey Institute of 
Technology, 1995.
63. C. H. Papadimitriou and M. Yannakakis. “Towards an architecture-independent
analysis of parallel algorithms.” SIAM  Journal on Computing, 
19(2):322-328, April 1990.
64. F. Pellegrini. “Static mapping by dual recursive bipartitioning of process and
architecture graphs.” In Proc. Scalable High-Performance Computing 
Conference, pages 486-493, 1994.
65. R. Ponnusamy, N. Mansour, A. Choudhary, and G. C. Fox. “Mapping realistic
data sets on parallel computers.” In Proc. 7th International Parallel 
Processing Symposium, pages 123-128, April 1993.
66. J. L. Potter. Associative Computing. Plenum Press, New York, NY, 1992.
113
67. S. Prakash and A. C. Parker. “A design method for optimal selection of
application-specific heterogeneous multiprocessor systems.” In Proc. 
Workshop on Heterogeneous Processing, pages 75-80, April 1992.
68. P. Sadayappan and F. Ercal. “Nearest-neighbor mapping of finite element graphs
onto processor meshes.” IEEE Trans, on Computers, C-36(12):1408-1424, 
December 1987.
69. P. Sadayappan, F. Ercal, and J. Ramanujam. “Cluster partitioning approaches
to mapping parallel programs onto a hypercube.” Parallel Computing, 
13:1-16, 1990.
70. V. Sarkar. Partitioning and Scheduling Parallel Programs for Execution on
Multiprocessors. MIT Press, Cambridge, MA, 1989.
71. C. Shen and W. Tsai. “A graph matching approach to optimal task assignment
in distributed computing systems using a minmax criterion.” IEEE Trans, 
on Computers, c-34(3):197-203, March 1985.
72. G. C. Sih and E. A. Lee. “A compile-time scheduling heuristic
for interconnection-constrained heterogeneous processor architectures.” 
IEEE Trans, on Parallel and Distributed Systems, 4(2):75-87, February
1993.
73. H. S. Stone. “Multiprocessor scheduling with the aid of network flow
algorithms.” IEEE Trans, on Software Engineering, SE-3(l):85-93, 
January 1977.
74. H. S. Stone. “Critical load factors in distributed systems.” IEEE Trans, on
Software Engineering, SE-4:254-258, May 1978.
75. H. S. Stone and S. H. Bokhari. “Control of distributed processes.” IEEE
Computer, July 1978.
76. V. S. Sunderam. “PVM: A framework for parallel distributed computing.”
Concurrency: Practice and Experience, 2(4):315-339, December 1990.
77. D. Towsley. “Allocating programs containing branches and loops within a
multiple processor system.” IEEE Trans, on Software Engineering, 
SE-12(10): 1018-1024, 1986.
78. J. D. Ullman. “NP-complete scheduling problems.” Journal of Computer
Systems and Science, June 1975.
79. K. Vairavan and R. DeMillo. “On the computational complexity of a generalized
scheduling problem.” IEEE Trans, on Computers, c-25(ll):1067-1073, 
November 1976.
114
80. M. Wang, S. Kim, M. Nichols, R. Freund, and H. J. Siegel. “Augmenting the
optimal selection theory for superconcurrency.” In Proc. Workshop on 
Heterogeneous Processing, pages 13-21, March 1992.
81. L. R. Welch, S. Chen, A. D. Stoyenko, and A. K. Ganesh. “Applying random
neural networks to exploit parallelism and conserve processors in ADT 
module assignments.” Technical report, Department of Computer and 
Information Science, New Jersey Institute of Technology, 1993.
82. M. Y. Wu and D. Gajski. “Hypertool: A programming aid for message-
passing systems.” IEEE Trans, on Parallel and Distributed Systems, 
1(3):101—119, 1990.
83. J. Yang, L. Bic, and A. Nicolan. “A mapping strategy for MIMD computers.”
In Proc. International Conference on Parallel Processing, 1991.
84. T. Yang and A. Gerasoulis. “List scheduling with or without communication
delays.” Technical report, Department of Computer Science, Rutgers 
University, 1992.
85. T. Yang and A. Gerasoulis. “A parallel programming tool for scheduling
on distributed memory multiprocessors.” In Proc. IEEE Scalable High 
Performance Computing Conference, April 1992.
86. T. Yang and A. Gerasoulis. “DSC: Scheduling parallel tasks on an unbounded
number of processors.” IEEE Trans, on Parallel and Distributed Systems, 
5(9):951-967, September 1994.
