In this paper, we evaluate two different programming paradigms, Cluster-M and Heterogeneous Associative Computing (HAsC), for heterogeneous computing. These paradigms can efficiently support heterogeneous networks by preserving a level of abstraction without containing any architectural details. The paradigms are architecturally independent and scalable for various network and problem sizes. Cluster-M can be applied to both coarse-grained and fine-grained networks. Cluster-M provides an environment for porting heterogeneous tasks onto the machines in a heterogeneous suite such that resource utilization is maximized and the overall execution time is minimized. HAsC models a heterogeneous network as a coarse-grained associative computer. It is designed to optimize the execution of problems where the program size is small compared with the amount of data processed. Unlike other existing heterogeneous orchestration tools, which are MIMD based, HAsC is for data-parallel SIMD associative computing. Ease of programming and execution speed are the primary goals of HAsC. We evaluate how these two paradigms can be used together to provide an efficient scheme for heterogeneous programming. Finally, their scalability issues are discussed.
1 Introduction

data to be exchanged to and from the shared memory. This may cause heavy congestion over available communication channels of a typical multiprocessor system. For this reason, Linda has been mostly used for coarse-grained computations. Furthermore, it is very difficult to implement Linda on architectures not supporting the shared memory structure, which makes Linda unsuitable for a heterogeneous suite of computers.
On the other hand, Prep-P, Oregami, Hypertool and PYRROS all include an architecturally independent mapping component which can map a given parallel program onto either a special or arbitrary system. However, the mapping components of Prep-P [2] and Oregami [20] are basically libraries of specialized mapping algorithms which only map regularly structured programs onto regularly structured systems. Their mappings for irregularly structured programs or systems that are not found in the libraries may be very slow and ineffective. Hypertool [28] and PYRROS [30] generate fast and near-optimal mappings for arbitrary programs by clustering the task graphs. However, they only map the clusters of task modules onto a fully connected system. Therefore, they are not suitable for a heterogeneous network which may have arbitrary interconnections. In this paper, we will only focus on the tools which can efficiently map arbitrary program tasks onto arbitrary computer systems. Since homogeneous programming tools are not suitable for heterogeneous computing, we need to develop a new tool based on a heterogeneous programming paradigm. An essential component of such a tool will be an efficient mapping algorithm, which maps an arbitrary task onto an arbitrary system.
One of the earliest mapping algorithms which can map an arbitrary task onto an arbitrary system is Lo's heuristic [19]. Basically, this heuristic repetitively uses a max-flow min-cut algorithm [27] to find mappings of task modules onto heterogeneous processors. The time complexity of Lo's heuristic is $O(M^2 N |E_p| \log M)$, where $M$ is the number of task modules, $N$ is the number of processors and $|E_p|$ is the number of communication links between processors. El-Rewini and Lewis [10] presented their mapping heuristic (MH). MH is a list scheduling algorithm which maps an arbitrary task graph onto an arbitrary system graph. In list scheduling, each task module is assigned a priority. Whenever a processor is available, a task module with the highest priority is selected from the list and assigned to a processor. The mapping problem can also be addressed as a graph matching problem [3, 18, 1, 11]. The input to the mapping problem is two graphs. The first graph is called the task graph, which is similar to the data flow representation of the execution process, where each node is a task module and edges represent dependency and flow of data. The second graph is called the system graph, which is a representation of the underlying architecture. The mapping problem is defined as the matching of these two graphs such that the overall execution time is minimized. This problem is known to be NP-complete in its general form as well as in several restricted forms [11]. In an attempt to solve the problem in the general case, a number of heuristics have been introduced [3, 18, 1, 11]. Lee and Aggarwal's mapping strategy is an example of this approach [18]. It assumes the number of nodes of the task graph to be no greater than that of the system graph. Its time complexity is $O(N^3)$.
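The list scheduling scheme described above for MH can be illustrated with a minimal sketch. It is not El-Rewini and Lewis's algorithm: the function name `list_schedule`, the externally supplied priorities, and the decision to ignore communication costs and the system graph topology are all assumptions made to keep the example short.

```python
import heapq

def list_schedule(costs, deps, priority, num_procs):
    """Greedy list scheduling sketch (hypothetical helper, not MH itself).

    costs:    {task: execution time}
    deps:     {task: iterable of parent tasks that must finish first}
    priority: {task: number}, larger means more urgent (assumed given)
    Returns {task: (processor, start, finish)}.
    """
    finish = {}                                   # task -> finish time
    schedule = {}
    proc_free = [(0, p) for p in range(num_procs)]  # (time free, processor id)
    heapq.heapify(proc_free)
    remaining = set(costs)
    while remaining:
        # A task is ready once all of its parents have been scheduled.
        ready = [t for t in sorted(remaining)
                 if all(p in finish for p in deps.get(t, ()))]
        task = max(ready, key=lambda t: priority[t])
        free_at, proc = heapq.heappop(proc_free)  # earliest available processor
        start = max([free_at] + [finish[p] for p in deps.get(task, ())])
        end = start + costs[task]
        finish[task] = end
        schedule[task] = (proc, start, end)
        heapq.heappush(proc_free, (end, proc))
        remaining.remove(task)
    return schedule
```

The sketch assumes the task graph is acyclic. A realistic mapping heuristic would also account for the communication delay between the processors chosen for a parent and its child.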
To reduce the complexity of the mapping problem, a number of approaches such as graph contraction and clustering have been studied [9, 2, 17, 24, 29, 30, 22]. However, in all these graph matching based techniques, only the task graph is clustered, and the entire task graph is still matched against the entire system graph. In this paper, we will first introduce a new mapping technique which not only clusters the task graph but also clusters the system graph for more efficient mapping. It has a time complexity of only $O(MN)$. This technique is based on a new programming paradigm called Cluster-M [12]. Cluster-M provides an environment for porting various tasks onto the machines in a heterogeneous suite such that resource utilization is maximized and the overall execution time is minimized. The other programming paradigm evaluated in this paper, HAsC [25], models a heterogeneous network as a coarse-grained associative computer. HAsC is designed to optimize the execution of problems where the size of the program is small compared to the amount of data processed. It uses broadcasting to avoid the mapping problem. Ease of programming and execution speed, rather than the utilization of idle resources, are the primary goals of HAsC. We will illustrate how these two paradigms can be used together to provide an efficient medium for heterogeneous programming. We will also define scalability for heterogeneous programming paradigms and show that both paradigms presented are scalable.
The rest of the paper is organized as follows. In sections 2 and 3, the Cluster-M paradigm and its mapping methodology are evaluated. In section 4, HAsC is presented and evaluated for our purpose. We then introduce the combined use of HAsC and Cluster-M in section 5. Scalability issues of heterogeneous hardware, tasks, and software are addressed in section 6. Finally, we draw conclusions in section 7.
divided into different execution steps and communications between modules are divided into different execution phases according to the data and operational precedence. Computations in the same step and communications in the same phase can be carried out in parallel, but cannot start before the parent modules in the previous step finish their computations. The algorithm for clustering directed graphs is presented in [5]. The basic idea is to merge all the nodes in each execution step if they have a common parent node or a common child node. If a parent node $t_i$ has one or more children, one of its children is embedded into $t_i$. Each Spec cluster has a size which is equal to the maximum number of concurrent nodes contained in the cluster. To illustrate this algorithm, the following example is presented.
A task graph of 15 modules is shown in Figure 1. Each module has a computation amount of 1 and each edge carries a data communication amount of 1. This task graph contains two subgraphs which are not connected, which means the two subtasks can be executed in parallel. The Spec graph is constructed by merging the clusters when they have communication needs, as illustrated in Figure 1. The input task graph has nodes a to o (15 nodes). The final Spec graph is a multi-layered graph containing member nodes a to i (9 nodes). For example, j, k and l are embedded into d, since j, k and l are in different execution steps and cannot be executed concurrently. This will not only save processor resources and communication cost, but also reduce the mapping cost, since the Spec graph now contains only 9 nodes instead of the original 15 nodes.
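The notion of execution steps that underlies the Spec clustering can be sketched as follows. This is only an illustration of how modules might be grouped into steps, assuming that a module's step is one more than the latest step among its parents; the helper names `execution_steps` and `group_by_step` are hypothetical, and the merging and embedding rules of the algorithm in [5] are not reproduced.

```python
def execution_steps(parents):
    """parents: {node: iterable of parent nodes}.
    Returns {node: step}, with steps starting at 1 (assumed definition:
    a node's step is one more than the latest step among its parents)."""
    step = {}

    def visit(node):
        if node not in step:
            step[node] = 1 + max((visit(p) for p in parents.get(node, ())),
                                 default=0)
        return step[node]

    for node in parents:
        visit(node)
    return step

def group_by_step(step):
    """Collect the nodes of each step; nodes in one step may run concurrently."""
    groups = {}
    for node, s in step.items():
        groups.setdefault(s, []).append(node)
    return {s: sorted(nodes) for s, nodes in sorted(groups.items())}
```

Nodes that end up in different steps along one dependency chain can never run concurrently, which is why embedding them into a single cluster member costs no parallelism.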
A parallel system can be modeled as an undirected system graph $G_p(V_p, E_p)$. $V_p = \{p_1, \ldots, p_N\}$ is a set of processors forming the underlying architecture, while $E_p$ is a set of edges representing the interconnection topology of the parallel system. We assume the connections between adjacent processors of the parallel systems studied here are bidirectional. Therefore, an edge $(p_i, p_j)$ represents a direct connection between processors $p_i$ and $p_j$.
To construct a clustered graph (Rep graph or Spec graph) from an undirected input graph, initially, every node forms a cluster. This node is represented by $p_i$ in the case of a system graph and by $t_i$ in the case of a task graph. Then clusters which are completely connected are merged to form a new cluster. This is continued until no more merging is possible. Two clusters $x$ and $y$ are connected if $x$ contains a node $p_x$ ($t_x$) and $y$ contains a node $p_y$ ($t_y$), such that nodes $p_x$ ($t_x$) and $p_y$ ($t_y$) are connected by a direct communication link. Each cluster has a size which is the number of processors in this cluster. An example is shown in Figure 2.

Figure 1: A task graph and the obtained Spec graph (panel title: Constructing the Spec graph).
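The merging procedure for undirected graphs can be sketched as follows. This is a minimal illustration under stated assumptions: "completely connected" is read as "pairwise connected", groups are formed greedily in input order, and the names `clusters_connected`, `merge_level` and `cluster_graph` are hypothetical rather than taken from the Cluster-M algorithms.

```python
def clusters_connected(x, y, edges):
    """Two clusters are connected if some node of x links directly to some node of y."""
    return any(frozenset((u, v)) in edges for u in x for v in y)

def merge_level(clusters, edges):
    """One clustering level: greedily group clusters that are pairwise connected."""
    groups = []
    for c in clusters:
        for g in groups:
            if all(clusters_connected(c, member, edges) for member in g):
                g.append(c)
                break
        else:
            groups.append([c])
    return [frozenset().union(*g) for g in groups]

def cluster_graph(nodes, links):
    """Return the list of clustering levels, from single nodes up to the final clusters."""
    edges = {frozenset(e) for e in links}
    level = [frozenset([n]) for n in nodes]
    levels = [level]
    while True:
        merged = merge_level(level, edges)
        if len(merged) == len(level):      # no more merging is possible
            return levels
        level = merged
        levels.append(level)
```

For a four-processor ring this yields three levels: four singleton clusters, two clusters of size 2, and one cluster of size 4. The size of each cluster is simply `len(cluster)`.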
The Cluster-M mapping algorithm is specified in Figure 3. Before we start the mapping, we need to compute a reduction factor denoted by $f$, which is essential for the mapping of task graphs having more nodes than the system graphs. The reduction factor, $f$, is the ratio of the total sizes of the Rep clusters over the total sizes of the Spec clusters. It is used to estimate how many computation nodes need to share a processor. The mapping is done recursively at each clustering level, where we find the best matching between Spec clusters and Rep clusters. In order to match Spec clusters to Rep clusters, first the Spec and Rep clusters are sorted in descending order with respect to their sizes. Next, to map each Spec ...
Figure: Clustering of the undirected graph (panels labeled Step 2 to Step 4).

We next compare Cluster-M to Lee and Aggarwal's mapping strategy [18]. Their mapping algorithm considers the task graph as a directed graph and differentiates nodes and edges into different computation steps and communication phases. This is done in order to accurately calculate the actual communication cost between two non-adjacent processors. However, Lee and Aggarwal's strategy maps the entire task graph onto the system graph without ...

Given a similar task graph as shown in Figure 6a, the mapping obtained on a 16-processor hypercube is illustrated in Figure 6b. In this example, all the computation and communication requirements are uniform. The schedule obtained from Cluster-M mapping is illustrated in Figure 6c. An optimal schedule, which also uses fewer processors, is shown in Figure 6d. Experimental results and comparison studies with other mapping methods [3, 21, 24, 28, 15] are presented in [5, 6]. We have shown that Cluster-M mapping has a superior running time and that the results obtained are similar to or slightly better than those from other algorithms.
Heterogeneous Associative Computing
Heterogeneous Associative Computing (HAsC) models a heterogeneous network as a coarse-grained associative computer. It assumes that the network is organized into a relatively small number of very powerful nodes. Basically, each node is a supercomputer architecture (vector, SIMD, MIMD, etc.). Thus each node of the network provides a unique computational capability. There may be more than one node of a specific type in a case where special properties are present. For example, one SIMD node may be specialized for associative processing and a second SIMD node may contain a very powerful internal network configuration. Figure 7 illustrates the logical similarity of an associative machine and a heterogeneous network. In particular, a disk-computer node on a network can be compared to an associative memory-PE cell. As in an associative cell, the node's computer is dedicated to processing the data on the node's disk(s). The disk-to-machine data transfer rate is much more efficient than the node-to-node transfer rate, just as memory-to-PE transfers are much faster than PE-to-PE transfers. Note that the associative computer and network diagrams are quite different from shared memory MIMD models. Shared memory configurations emphasize the concept that all data is equally accessible from all processors. This is not the case in a heterogeneous network.

HAsC is "layered" so that any node in the HAsC network may be another network. Thus a HAsC node may be a HAsC cell containing more than one computer, or it may be a port to another level of computing in the HAsC network. For example, most nodes may contain a general purpose computer in addition to a supercomputer to function as the node's port for the rest of the HAsC network, for file management and other support roles. Figure 8 shows a typical HAsC network organization. Each HAsC node has access to a number of instruction stream channels. Each channel broadcasts a different sequence of code. The HAsC node selects the appropriate channel based on its local data and previous state. The selected channel is saved in a channel register. A port, or transponder node, will accept a high level command and "translate it" into the command(s) appropriate for the subnetwork.
Some of the properties of the associative computing paradigm which make it well suited for heterogeneous computing are: 1) efficient programming and execution with large data sets and small programs; 2) optimal data placement; 3) software scalability (see section 6); 4) cellular memory allocation; and 5) search-process-retrieve synchronism [23]. In HAsC, instructions are broadcast to all of the cells listening to a channel, but each individual cell must determine whether to execute the instruction. This determination is performed as follows: Upon receipt of an instruction, a node "unifies" it with its local instruction set and data files. Several languages such as Prolog and STRAND [13] incorporate this process. HAsC is different in that it uses unification only at the top level. Thus there is only one unification operation per data file, as opposed to one per record or field. This difference is critical in a heterogeneous network where communication of individual data items would be prohibitively expensive.
If there is a match, the appropriate instruction is initiated. The instruction may in turn issue more instructions. Thus, control is distributed throughout HAsC. That is, a program starts by issuing a command from a control node. If a receiving node receives a command that is in effect a subroutine call, it may become a transponder control node. It may first perform some local computations and then start issuing (broadcasting) commands of its own. If the node happens to be a port node, the commands are issued to its subnet as well as to its own network. Thus it is possible for multiple instruction streams to be broadcast simultaneously at several different logical network levels in a HAsC network.
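The broadcast-and-unify behavior described above can be sketched in a few lines. This is a simplified model, not HAsC code: the `Node` class, the `broadcast` function, and the representation of an instruction as an opcode plus a data file name are assumptions made for illustration, and unification is reduced to a membership test performed once per data file.

```python
class Node:
    """Hypothetical model of a HAsC node with local instructions and data files."""

    def __init__(self, name, instruction_set, data_files):
        self.name = name
        self.instruction_set = instruction_set  # {opcode: callable(data)}
        self.data_files = data_files            # {file name: data}

    def receive(self, opcode, file_name):
        """Unify a broadcast instruction with the local instruction set and data files."""
        if opcode in self.instruction_set and file_name in self.data_files:
            return self.instruction_set[opcode](self.data_files[file_name])
        return None  # no match: this node ignores the instruction

def broadcast(nodes, opcode, file_name):
    """Send one instruction to every listening node and collect responders' results."""
    results = {}
    for node in nodes:
        out = node.receive(opcode, file_name)
        if out is not None:
            results[node.name] = out
    return results
```

The sender never names a destination. Each node decides locally whether to respond, so the program does not need to know how data is distributed across the network.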
In general, HAsC assumes that data is resident in a cell. As a result, data movement is minimal. However, it is common for one cell to compute a value and broadcast it to other cells. Thus, in general, there is a need to synchronize the arrival of commands and data. There are basically two cases which are handled automatically by the HAsC administrator as a part of the search-process-retrieve protocol.
The normal case is for data to be resident in a cell when the HAsC command arrives.
Instruction unification and execution proceed as described above. HAsC allows data transfers, but the protocol insists that the data transfer be complete before any associated commands are broadcast. The second case involves command parameters. When a command arrives and is unified with resident data at a node but some parameter data is missing, the unified command is stored in a table to wait for the parameter in a synchronism process called a data rendezvous. When parameter data arrives, the rendezvous table is searched for a match. If found, the associated command is executed.
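The data rendezvous can be sketched as a small table keyed by the missing parameter. This is an illustrative model only: the class name `RendezvousTable`, its method names, and the use of Python callables as commands are assumptions, not part of HAsC.

```python
class RendezvousTable:
    """Hypothetical rendezvous table: commands wait here for missing parameters."""

    def __init__(self):
        self.waiting = {}   # parameter name -> list of pending commands
        self.params = {}    # parameter values that have already arrived

    def command_arrives(self, command, param_name):
        """command is a callable taking the parameter value.
        Run it now if the parameter is resident; otherwise store it."""
        if param_name in self.params:
            return command(self.params[param_name])
        self.waiting.setdefault(param_name, []).append(command)
        return None

    def parameter_arrives(self, param_name, value):
        """Record the parameter and execute every command that was waiting for it."""
        self.params[param_name] = value
        pending = self.waiting.pop(param_name, [])
        return [cmd(value) for cmd in pending]
```

Either arrival order leads to the same execution: a command that arrives first waits in the table, while a command that arrives after its parameter runs immediately.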
HAsC uses network administrators and execution engines to effect the paradigm. Each HAsC network level has a system administrator and each node in a network has its own local administrator. The local administrator monitors network traffic, capturing incoming instructions and checking for illegal commands. It is also responsible for maintaining the local HAsC instruction set.
The HAsC administrator receives all incoming HAsC instructions from the local network. It then verifies whether each instruction is legal. If it is, the administrator puts it in the Execution Engine queue. Otherwise, it attempts to identify the source and makes a report to the system administrator. Repeat offenses cause escalating diagnostic actions as determined by the network administrator.
If a Meta HAsC instruction such as (un)install, (un)extend, or (un)augment is received, it is processed immediately. The Meta instructions will create, modify and delete HAsC instructions from the local HAsC instruction set, respectively. Meta instructions can also modify local data structure definitions.
Since the instruction set can be dynamically expanded by the users, it is possible for two users to install the same instructions. The node administrator distinguishes between the two instructions by a user id and a program id, which are broadcast with every HAsC instruction.
Instructions can be added at several different logical levels: 1) system, 2) project, 3) user. Typical system level instructions would be data move and formatting commands. Project commands would be project oriented. For example, a numerical analysis project would have matrix multiplication and vector-matrix multiplication instructions, while a logic programming project might have specialized logic instructions, such as unification. At the user level, one user might specify a SAXPY operation while another might want a dot product. Scalable libraries may exist at any level, but most commonly at the project level.
Each node/cell has an execution engine which controls instruction execution at that node. The execution engine selects the next instruction, makes the bindings specified by instruction unification and causes the instruction to be executed. The execution engine performs the following tasks: Save Environment, Get Next Unified Instruction, Bind Unified Variables, Establish Environment, Execute Unified Instruction, Restore Old Environment.
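The six tasks of the execution engine can be laid out as one loop iteration each. The sketch below is an assumption-laden illustration: the function name `run_engine`, the representation of a unified instruction as a `(function, bindings)` pair, and the use of a dictionary as the environment are all hypothetical.

```python
from collections import deque

def run_engine(queue, environment):
    """queue: deque of (function, bindings) pairs that have already been unified.
    environment: dict of node-level variables.
    Returns the list of instruction results."""
    results = []
    while queue:
        saved = dict(environment)            # Save Environment
        func, bindings = queue.popleft()     # Get Next Unified Instruction
        environment.update(bindings)         # Bind Unified Variables
        local_env = dict(environment)        # Establish Environment
        results.append(func(local_env))      # Execute Unified Instruction
        environment.clear()                  # Restore Old Environment
        environment.update(saved)
    return results
```

Because the old environment is restored after every instruction, bindings made for one instruction do not leak into the next one in the queue.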
Instruction execution may take two basic forms. First the instruction may be a HAsC program which is executed in the transponder mode. Second, the instruction may be a library call written in FORTRAN, C, LISP, etc. In this case, the established environment restrictions produce the proper interface for the appropriate language.
HAsC must allow for dynamic instruction set and data structure modifications. Thus the HAsC install meta instruction consists of an associative pattern and a body of code. When it is broadcast to the system, all nodes which successfully unify with the instruction gather the body of code and install it on the local node. The extend instruction consists of a pattern and a data definition. Responding nodes add the data definition to the local associations. Extend may add a named row or column to an existing association. Augment can be used to add an entire new association.
The patterns in these instructions contain administrative data, such as job id, project id, etc. If the node is not participating in the project or job, then it does not unify and the instruction is not installed or the data definition is not extended. Uninstall, unextend and unaugment perform the inverse operations.
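The install and uninstall meta instructions, together with the user id and program id that distinguish identically named instructions, can be sketched as follows. The class `LocalAdministrator`, the pattern fields, and the string bodies are hypothetical stand-ins chosen for illustration; unification is reduced to a check of the project id.

```python
class LocalAdministrator:
    """Hypothetical model of a node administrator's instruction table."""

    def __init__(self, projects):
        self.projects = set(projects)   # projects this node participates in
        self.instructions = {}          # (user_id, program_id, name) -> body

    def install(self, pattern, name, body):
        """pattern: dict with 'project_id', 'user_id' and 'program_id' (assumed fields)."""
        if pattern["project_id"] not in self.projects:
            return False                # pattern does not unify: nothing is installed
        key = (pattern["user_id"], pattern["program_id"], name)
        self.instructions[key] = body
        return True

    def uninstall(self, pattern, name):
        """Inverse of install; returns True only if an instruction was removed."""
        key = (pattern["user_id"], pattern["program_id"], name)
        return self.instructions.pop(key, None) is not None
```

Keying the table on the user id and program id lets two users install an instruction with the same name without overwriting each other.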
Basic to the HAsC philosophy is the concept that data, when initially loaded into the system, is sent to the appropriate node and is never moved. While this would be ideal, there will always be a need to move data from one node to another. Accordingly, there are a number of HAsC move commands. Move commands can be divided into intra-association and inter-association instructions. Intra-association instructions are very much like expressions in conventional languages and are not discussed here due to lack of space. Inter-association instructions include file I/O as a special case. Inter-association moves must have node identifiers and, for I/O, a file server, a disk or other peripheral is a legal node.

The essence of HAsC is to model a distributed heterogeneous network as an associative data parallel computer, where processor synchronization is on an instruction by instruction basis. Accordingly, in HAsC, the associative instructions are synchronized. An efficient implementation of the synchronization requires an understanding of how the various associative statements are mapped onto sequences of virtual machine commands, most importantly the degree of network communication complexity of the sequences.
Here, we briefly describe a hierarchy of instructions, from the highest, most global (easiest to synchronize) to the lowest, most local (hardest to synchronize). HAsC will perform most efficiently if the programs are written using high level commands. The lower the level of the command, the more inter-node communication is required. Five different levels of instruction coupling are required to implement all of the HAsC statements on a heterogeneous network.
Communication and synchronization are built into the HAsC instruction. There is no need for the programmer to be aware of the degree of instruction communication. The five levels of instructions are presented here to more clearly delineate the relationship between associative and heterogeneous computing. Figure 9 gives an example of instruction synchronization, where $ is the parallel marker. Result$ is a data parallel pronoun referring to the last performed data parallel computation.
The top level synchronization box shows the programming style for algebraic expressions supported by HAsC.
Combined Use of Cluster-M and HAsC
HAsC is most suitable for coarse-grained heterogeneous parallel computing. It is intended to ease the programming effort and maximize execution speed. Cluster-M, on the other hand, provides both coarse-grained and fine-grained mapping in a clustered fashion. It aims at maximizing both execution speed and resource utilization. Therefore, both paradigms can be combined to achieve a better overall performance featuring ease of programming, increased execution speed and optimal resource utilization.
Cluster-M mapping can be applied to HAsC in several ways. First, Cluster-M can be used to determine the initial data mapping before HAsC computation begins so that the overall execution time is minimized. Secondly, Cluster-M mapping can be used to decide the fine-grained mapping within HAsC nodes, as shown in Figure 10. Thirdly, Cluster-M can be alternated with HAsC at run time. In this approach, a Cluster-M Specification for the task is generated first. The Cluster-M Specification preserves computation and communication information in a multi-level cluster organization. Clusters at the same level represent computations at a given step which can be executed concurrently. This cluster organizational information can be sent to the HAsC network controller, which then broadcasts the clusters of HAsC instructions (Figure 11). As described in section 4, the local HAsC nodes determine which of the clusters to execute based on their local configuration and data. The task graph of this coarse-grain solution is shown in Figure 12a. Using one of Cluster-M's clustering algorithms, a Spec graph can be obtained as shown in Figure 12b. Suppose there is more than one HAsC node available in the system. Using Cluster-M mapping, the matrices A and B will be allocated to two different HAsC nodes, say Node1 and Node2 respectively.
Next, for each level of clustering in the Spec graph (which represents each computation step in the original task graph), the concurrent clusters at that level (which represent concurrent computation modules) can be sent to the HAsC network controller to be broadcast to all the HAsC nodes. For example, at step 1, two clusters of HAsC user level instructions (function calls) "do GE on A$" and "do GE on B$" are broadcast to all HAsC nodes at the same time. The HAsC node Node1 will select to execute the first instruction, while the HAsC node Node2 will select to execute the second instruction.
Finally, Cluster-M mapping is used to decide the fine-grain mapping within each HAsC node. The GE operation, which is a function in the user level library, actually consists of many system level instructions which may look similar to the SAXPY code in LINPACK [7, 8], as shown in Figure 13a. The task graph of a GE on a $7 \times 7$ matrix A or B is illustrated in Figure 13b. In each task module $T_j^k$, column $j$ is modified by using column $k$. Suppose Node1 is a $2 \times 3$ torus, and Node2 is a 4-processor completely connected machine, as shown in Figure 14. Also, suppose for both Node1 and Node2, it takes 1 unit of time to compute each $T_j^k$ and 1 unit of time to transmit each column between two connected processors. Using the Cluster-M clustering and mapping algorithms, the fine-grain mappings of system level HAsC instructions onto the processors within each HAsC node can be obtained, as shown in Figure 15.
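The column-oriented structure of the GE task modules can be sketched as follows. This is an illustration under stated assumptions, not the LINPACK code of Figure 13a: pivoting is omitted, the enumeration of $(k, j)$ pairs in `ge_tasks` is one plausible reading of the task modules $T_j^k$ (the exact set shown in Figure 13b may differ), and both function names are hypothetical.

```python
def ge_tasks(n):
    """Enumerate (k, j) pairs, 1-based: column j is modified using column k."""
    return [(k, j) for k in range(1, n) for j in range(k + 1, n + 1)]

def column_ge(a):
    """Column-oriented Gaussian elimination without pivoting on a list-of-rows
    matrix. Each (k, j) step is one SAXPY-like update of column j."""
    n = len(a)
    a = [list(map(float, row)) for row in a]
    for k, j in ge_tasks(n):
        kk, jj = k - 1, j - 1
        t = a[kk][jj]
        for i in range(kk + 1, n):
            multiplier = a[i][kk] / a[kk][kk]   # taken from column k
            a[i][jj] -= multiplier * t          # update column j
    # Entries below the diagonal are no longer needed; zero them to expose U.
    for i in range(n):
        for c in range(i):
            a[i][c] = 0.0
    return a
```

Under this enumeration a $7 \times 7$ matrix gives 21 column updates, and all updates that share the same $k$ are independent of one another, which is the parallelism a fine-grain mapping can exploit.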
Scalability Issues
Scalability is often understood differently by different authors. We will consider scalability to refer to hardware, tasks and software in roughly analogous fashion. In addition, scalability may refer to both homogeneous and heterogeneous architectures. In the following, we first define homogeneous scalability and extend it to heterogeneous scalability. Then we discuss the scalability of HAsC and Cluster-M.
Homogeneous Scalability
Homogeneous hardware scalability refers to multiple machines which are of the same basic architectural type, typically various-sized versions of the same vendor product. We define the hardware scalability function, $\eta(\alpha, \beta)$, between two homogeneous architectures, $\alpha$ (the larger) and $\beta$ (the smaller), to be the ratio of the size of $\alpha$ over the size of $\beta$. For example, an eight-processor Cray is a hardware example of a scaled-up version of a two-processor Cray. In this example, the eight-processor Cray has a scalability factor of 4 ($\eta = 4$) over the two-processor Cray. Software scalability refers to the ability to exploit task and hardware scalability, with little or no changes other than parameters. We define the software scalability function, $\sigma(\alpha, \beta)$, for the case of two homogeneous architectures, $\alpha$ (the larger) and $\beta$ (the smaller), to be the real-valued function giving the increase in performance of $\alpha$ over $\beta$. Typically we expect some increase in performance but we do not generally, at least in the homogeneous case, expect super-linear performance, i.e., $1 \le \sigma(\alpha, \beta) \le \eta(\alpha, \beta)$. In most cases we expect $\sigma$ to be a simple multiple of $\eta$, i.e., $\sigma(\alpha, \beta) = \kappa\,\eta(\alpha, \beta)$, where $1/\eta(\alpha, \beta) \le \kappa \le 1.0$.
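As a worked instance of these definitions, consider the Cray example. Here we write $\eta$ for hardware scalability, $\sigma$ for software scalability, and $\kappa$ for the multiple relating them, with $\alpha$ the eight-processor machine and $\beta$ the two-processor machine; these symbols are assumed for illustration.

```latex
\eta(\alpha,\beta) = \frac{\mathrm{size}(\alpha)}{\mathrm{size}(\beta)} = \frac{8}{2} = 4,
\qquad
\sigma(\alpha,\beta) = \kappa\,\eta(\alpha,\beta), \quad \tfrac{1}{4} \le \kappa \le 1
\;\Longrightarrow\; 1 \le \sigma(\alpha,\beta) \le 4 .
```

The lower bound on $\kappa$ corresponds to no performance gain at all on the larger machine, and the upper bound corresponds to perfectly linear speedup.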
Heterogeneous Scalability
Heterogeneous scalability is clearly more complicated than homogeneous scalability, though it is also the case in which we can aspire to the ultimate in heterogeneous computing potential, i.e., to achieve software scalability significantly greater than hardware scalability. This is what we mean by super-linear performance. In the heterogeneous case, there may be no commonality between two different architectures; therefore, hardware scalability does not apply to the heterogeneous case.

Figure 16: Hierarchical breakdown of a task
Consider the breakdown of a task into four levels, as shown in Figure 16. The top level is the functional level. In this level, the function "Find a datum" is specified. Next is the approach level. For this problem, there is a radical difference between the approach for a SIMD machine used associatively and non-SIMD machines. In the former case, we can use simple associative search, which is $O(1)$. In the latter case we would typically use a sort, then search operation, the asymptotic performance of which is bounded by a term of order $\log n$. For the associative search on a suitable SIMD machine, there is really only one instruction, "find datum", so that there is no room for differing algorithmic or code variations. However, in the non-SIMD case, there are several possible variations. For example, depending on the data, parameters, architecture, etc., we could use a number of different search techniques and similarly we could use a number of different coding schemes for each algorithm.
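The two approaches to "Find a datum" can be contrasted in a short sketch. It is only a sequential simulation: on an associative SIMD machine every cell would perform its comparison in the same step, whereas the list comprehension below visits the cells one after another. The function names are hypothetical.

```python
from bisect import bisect_left

def associative_find(cells, datum):
    """Every cell compares its own value with the broadcast datum.
    On an associative SIMD machine all comparisons happen in one step;
    this loop only simulates that step sequentially."""
    return [i for i, value in enumerate(cells) if value == datum]

def sort_then_search(values, datum):
    """Non-SIMD approach: sort once, then locate the datum by binary search."""
    ordered = sorted(values)
    pos = bisect_left(ordered, datum)
    return pos < len(ordered) and ordered[pos] == datum
```

The associative version has a single natural form, while the non-SIMD version already embeds choices (which sort, which search, which data layout) that can vary with the data and the architecture.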
In this context, the term scalability only applies to either the functional level or the approach level. In the above example, the scalable approach is the non-SIMD approach. However, this brings the following dilemmas: 1) it is possible to have a non-scalable implementation (at the approach level) that is inherently more effective than a scalable approach implemented on the same machine; and 2) it is possible to have high hardware scalability but low task/software scalability, or vice versa. In other words, the scalability metric is inherently defective in this case if scalability is applied to the approach level.
We conclude that the only kind of scalability applicable to a heterogeneous network is type 1 task scalability at the functional level. In essence, heterogeneous scalability refers to the property that a given software scalable program will execute efficiently on any size data set, on any heterogeneous network configuration, without any modification. While the functional level scalability may be trivial on a homogeneous network, it is fundamental to establishing a common unifying programming environment for heterogeneous networks.
Scalability of HAsC and Cluster-M
Both programming paradigms presented in this paper are architecture-independent, as explained in detail, and are therefore heterogeneously scalable. In HAsC, a program is broadcast to the entire network and the individual nodes determine locally which instructions to execute. The global broadcasting approach means that there is no need to know how nodes are connected in the network or how data is distributed across the nodes. This allows data files to be analyzed dynamically at run time as they enter the HAsC system and to be directed to the nodes best suited to process them. Broadcasting allows scalability. The hardware can be expanded or modified and the problem size can be changed without having to reprogram or recompile the basic HAsC program. New nodes consisting of new machines with installed HAsC software can be added to a network at any time, and at any location. HAsC is not dependent on any physical machine or network configuration. This is because the instruction broadcast, cell memory organization and associative searching allow the removal of any reference to data set size and type from the program.
Cluster-M is also scalable. When a new machine is added to the heterogeneous network, a new Cluster-M Representation of the new suite can be generated. However, the Cluster-M Specification, which is architecture-independent, can be efficiently executed without any change. An appropriate new mapping can be computed to map the Cluster-M Specification to the new Cluster-M Representation. Furthermore, the two paradigms can be used concurrently as a hybrid scalable programming paradigm. Figure 17 illustrates the above claims.
