This framework provides a technique for efficiently managing execution of applications in a distributed heterogeneous supercomputing system. The technique is based on code profiling and machine benchmarking.
To handle these issues, an integrated approach to DHSS management is needed. Such an approach should allow management of both computational and network resources by adapting to application needs and providing a true superconcurrent environment. In this article, we suggest one such system, the Distributed Heterogeneous Supercomputing Management System (DHSMS). DHSMS is based on the principle of user/system upscale compatibility. According to this principle, the more information a user can present to the system about a solution domain and the more the system can match that information against its internal structure, the more accurately the system can be managed. Following this principle, we describe a framework for managing a DHSS by characterizing an application on the basis of code profiling and computation as well as I/O benchmarking of machines. This framework provides a methodology for dividing user task profiles and system architecture characteristics into manageable and measurable components.
We propose a general framework that applies to any type or class of supercomputers in any combination. The proposed DHSMS has some features identical to the distributed intelligent network system, but it also has substantial differences. For example, DHSMS includes a systematic methodology for both code profiling and analytical benchmarking. We also suggest a Universal Set of Codes (USC) for generating architecture-dependent code profiles systematically at varying levels of detail. DHSMS takes account of both I/O benchmarking and network interface delay. Furthermore, it uses network caching of data communicated among machines to increase the performance of a DHSS. An experimental prototype DHSMS is under development at Purdue University to demonstrate these and other features.
Characterization of applications for a DHSS
A distributed application consists of a set of tasks with certain relations among them. Tasks are the basic units handled by the proposed DHSMS. To run an application efficiently, a DHSMS must analyze both the computational and communicational requirements of the application. An application can be formally modeled as either a task-flow graph (TFG) or a task-interaction graph (TIG). A TFG expresses the explicit precedence relationships among application tasks, while a TIG is more suitable for representing distributed interactive tasks without their explicit dependencies. Distributed systems use scheduling and mapping algorithms for managing these graphs. Since both the TFG and TIG models are architecture-independent, they are only suitable for homogeneous systems and do not carry any information about task behavior in heterogeneous systems.
A DHSS requires a more precise and general method for characterizing these applications, one that not only incorporates the information about the "degree of suitability" of a task to a specific machine but also quantifies the communication interaction among tasks. This interaction is an important parameter because data must be exchanged among machines that may have diverse I/O architectures as well as network interfaces with drastically different performance profiles.
For the DHSMS, we solve these problems by introducing the notions of a code-flow graph (CFG) and a code-interaction graph (CIG). The CFG and CIG provide detailed architecture-dependent task and I/O characterization.
We use code profiling to characterize tasks in terms of their computational behavior and to evaluate the "degree of match" between the codes and machines. The literature offers very few code-profiling methodologies within a DHSS context, and those that exist have limited applicability. Most of them are based on rather simplistic and highly abstract views of parallelism. They do not account for detailed architectural characteristics. We need new code-profiling methods that incorporate these details and thus support more accurate scheduling and mapping decisions with regard to application execution. However, there is a trade-off between the accuracy of the information generated by a profile and the complexity involved in generating it.
For scheduling/mapping tasks, code profiling itself is not sufficient. Rather, we require an estimate of code execution time on a specific machine. For this purpose, we also need to use analytical benchmarking, a process used to estimate machine performance relative to a baseline system. Until now, benchmarking research has focused on devising methods to measure the overall performance of each machine on a realistic application program having several tasks with different processing requirements. However, a DHSS environment decomposes an application into multiple tasks that can run separately on different machines; analytical benchmarking for a DHSS must therefore estimate machine performance on each part of the application as well as performance of the I/O subsystem.
The ultimate objective is to combine both code profiles and benchmarks. We must therefore have a finite set of codes that serve both purposes. The USC is one such approach. It can be viewed as a standardized universal set of benchmarking programs. This set can also provide information (profiles) about the effect of machine architectural characteristics. The proposed USC can then be used for generating both code profiles and benchmarks that can subsequently be used to estimate the execution time of a code on a specific machine.
Most existing benchmark programs are architecture-independent and cannot provide realistic and meaningful profiles about machines. This is because such programs cannot be mapped properly on the machine. Instead of yielding benchmark profiles, they may even cause a speedup of less than one, that is, performance worse than a uniprocessor. For example, analyses have shown that if the standard molecular motion computation algorithm executes on a supercomputer with a multistage interconnection network, such as the butterfly system, or on a shared-bus interconnection system, the speedup approaches zero as we increase the number of processors beyond a certain value. It is, therefore, highly desirable to write benchmark programs based on the architectural features of machines. The architecture-driven USC provides one such solution.
There are many ways to synthesize a USC. Our approach is hierarchical. It provides not only a systematic way of generating this set but also a flexible way for the user to choose a subset of the USC that suits the desired accuracy in profiling and benchmarking.
Similarly, for quantifying overhead incurred in the I/O subsystem and network interface at the time of data communication among machines, it is desirable to benchmark DHSS machines for their I/O and interface performance profiles. These profiles can provide information about the timing delays in transferring data among machines.
Generating USC. The hierarchical scheme for generating USC is basically a detailed architectural characterization of supercomputers. At the highest level, we can select the type of processing parallelism for classifying architectures. At the second level, we can further classify these architectures on the basis of finer architectural features such as the organization of the memory system and the interconnection topology. An important characteristic of this structure is that the levels in the hierarchy are selected in such a way that the main architectural features characterized at any level are related to each other.
A similar approach has been used to characterize supercomputers for evaluating their performance. Every node in the proposed hierarchy, except the leaf nodes, represents a machine type described by the path from the root of the hierarchy to that node. The leaf nodes of this hierarchy correspond to the actual machine models present in a DHSS. Figure 1 shows one such possible classification hierarchy. In this example, the first level is classified according to the type of parallelism of the machines, namely, single instruction, multiple data (SIMD); multiple instruction, multiple data (MIMD); vector, etc. The second level further classifies these machine types according to their memory organization, such as shared-memory system and distributed-memory system. The detail of feature information of machine architecture increases as we go down the hierarchy.
To generate a USC, we first assign a code type to each node of this hierarchical tree. The path from the root to a lower node provides profile information (suitability of those architectural features that are given by the path) for the code associated with that node. A more detailed profile can be used to screen out machines that may have identical benchmarks. This screening can then provide a better estimate for the execution time.
Formally, a USC is defined as a set of code types, $\{c_i\}$, $0 < i \le K$, where each code type is represented by a node in the hierarchy, $K$ is the total number of nodes in the hierarchy, and types $c_i$ at every level can be grouped together to generate the profile information, as discussed below.
On the basis of this hierarchy, we define a code profiling vector, $v_t^l$, for a given task ($t$) and for each level ($l$) of the hierarchy. We assume that at level $l$, nodes are labeled from 1 to $N_l$, where $N_l$ is the number of nodes at that level.
Formally, $v_t^l$ is given as:
$$v_t^l = [\,v_0(t),\ v_1(t),\ \ldots,\ v_{N_l}(t)\,]$$
where $v_0(t)$ represents the size of parallelism in task $t$ and is equal to $n$. The elements $v_i(t)$, $1 \le i \le N_l$, quantify the "degree of match" that exists between task $t$ and the code associated with the $i$th node present at level $l$. These elements can be obtained by the "matching" function $V$. For the set of tasks $T$ and the USC,
$$v_i(t) = V(t, c_i), \qquad t \in T,\ c_i \in \mathrm{USC}$$
There are various ways to quantify these elements. For example, $v_i(t) = 1$ indicates that task $t$ shows the same expected speedup as the optimal code type $i$ associated with the USC, and $v_i(t) = 0.5$ means that task $t$ may produce half of the speedup of the code type $i$. In general, such a quantification of match is determined on the basis of many factors, for example, the amount of parallelism present in the task and the number of loop iterations. This problem of code profiling is nontrivial. (Yang et al. have proposed a graphical approach.)
As an example, suppose a user selects the first level of the hierarchy in Figure 1. Then $v_t^1$ contains eight elements, corresponding to the size and type of processing parallelism, namely, SIMD, MIMD, vector, special, dataflow, very long instruction word (VLIW), and mixed mode. Similarly, if the user specifies a more detailed characterization, say, up to level two, then the vector is of length 15, corresponding to the two cases of memory organization (distributed and shared), with each organization in turn consisting of the seven cases of the first level.
In most cases, code profiling is an online process that incurs a runtime overhead. A "detailed" profile may take into account all the important architectural characteristics of a machine, such as the type of parallelism, the interconnection topology, and the memory organization scheme. Generation of such a profile requires a detailed analysis of the task with respect to the architectural characteristics defined at the selected level in the hierarchy. Although such a profile provides very useful information for efficiently scheduling/mapping a task via accurately matching it to a machine, it can only be generated at the cost of increased overhead associated with the task analyses.
A "coarse" profile, on the other hand, can be generated with relatively low overhead by choosing only a few levels in the hierarchy. However, such a profile may not be accurate enough for scheduling and mapping tasks effectively. This accuracy-versus-complexity trade-off depends on the hierarchy level selected. This selection can be a part of the user-specified processing requirements.
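The vector lengths quoted for the example hierarchy (8 at level one, 15 at level two) can be checked with a small sketch. Only the seven parallelism types and the two memory organizations come from the text; the node labels, the `profile_vector` helper, and the match values are illustrative assumptions, not part of the USC definition.

```python
from typing import Dict, List

# First-level machine types and second-level memory organizations from the text.
PARALLELISM_TYPES = ["SIMD", "MIMD", "vector", "special",
                     "dataflow", "VLIW", "mixed-mode"]
MEMORY_ORGANIZATIONS = ["distributed", "shared"]


def level_nodes(level: int) -> List[str]:
    """Return the node labels at hierarchy level 1 or 2 (labels are hypothetical)."""
    if level == 1:
        return list(PARALLELISM_TYPES)
    if level == 2:
        return [f"{mem}/{par}"
                for mem in MEMORY_ORGANIZATIONS
                for par in PARALLELISM_TYPES]
    raise ValueError("only levels 1 and 2 are modeled in this sketch")


def profile_vector(n: int, match: Dict[str, float], level: int) -> List[float]:
    """Build [v_0, v_1, ..., v_N]: v_0 is the size of parallelism n,
    v_i is the degree of match for node i (0.0 when no match is supplied)."""
    return [float(n)] + [match.get(node, 0.0) for node in level_nodes(level)]


# Illustrative task: full match to SIMD code, half match to vector code.
coarse = profile_vector(64, {"SIMD": 1.0, "vector": 0.5}, level=1)
detailed = profile_vector(64, {"distributed/SIMD": 1.0}, level=2)
print(len(coarse), len(detailed))  # prints "8 15"
```

Choosing `level=2` doubles the number of match values that must be computed for each task, which is the accuracy-versus-complexity trade-off in miniature.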
Computation benchmarking. There are several methodologies for benchmarking parallel machines; among them are kernel, partial (trace), and synthetic benchmarks. Several codes have been proposed for benchmarking the performance of parallel machines; these include Livermore Loops, Linpack, and others. In a DHSS environment, an application is decomposed into multiple tasks that run separately on different machines. Analytical benchmarking for a DHSS must therefore be able to estimate the performance of a machine on each part of the application. Also, a benchmark program must take into account the architectural characteristics of machines.
Existing benchmark programs are not specially designed to measure a specific architecture's performance. Rather, their objective is to measure overall machine performance under a simulated application environment. To accurately estimate the performance of a code on a certain machine, we need a standard set of codes, based on architectural features, that can be used for both code profiling and benchmarking. We can use the proposed USC for this purpose.
Formally, we can define analytical benchmarking by a vector, $B(n)$, which is given as
$$B(n) = [\,b^1(n),\ b^2(n),\ \ldots,\ b^M(n)\,]$$
where $M$ is the number of machine models and $b^j(n)$ gives the speedup for the size of parallelism equal to $n$ and for a machine model $j$. The speedup $b^j(n)$ is based on the machine model's "optimal" benchmark code type as represented by one of the leaf nodes of the USC. Furthermore, this vector should be generated for various sizes of parallelism, as depicted in Figure 2. Because the benchmarking vectors must be usable at any hierarchy level that is selected for code profiling, we need to group machine models as we move up the hierarchy. This grouping allows code profiling and benchmarking vectors to be used together to estimate execution time. Figure 2(a) shows the grouping process. It is important to mention that the vector $B(n)$ depends on the size of parallelism ($n$) present in each code type $j$ associated with the USC leaf node. Its elements need to be chosen for the given value of $n$, as depicted in Figure 2(b).
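The text does not give a closed-form rule for combining the two vectors, so the following is only one plausible reading, stated as an assumption: if a machine achieves speedup $b^j(n)$ on its optimal code type and the task attains a fraction $v_i(t)$ of that speedup, the estimated time is the baseline time divided by their product. The function name, the machine-to-type grouping, and all numbers are hypothetical.

```python
import math
from typing import Dict


def estimate_execution_times(baseline_time: float,
                             match_by_type: Dict[str, float],
                             machine_type: Dict[str, str],
                             speedup: Dict[str, float]) -> Dict[str, float]:
    """Estimate a task's time on each machine model.

    ASSUMPTION (not from the source): time = baseline / (v_i(t) * b^j(n)),
    where i is the code type under which machine j is grouped.
    """
    times = {}
    for machine, b in speedup.items():
        v = match_by_type.get(machine_type[machine], 0.0)
        effective = v * b
        # A zero match or zero speedup means the machine is unusable for this task.
        times[machine] = baseline_time / effective if effective > 0 else math.inf
    return times


# Illustrative numbers only: speedups b^j(n) for one fixed n.
times = estimate_execution_times(
    baseline_time=100.0,
    match_by_type={"SIMD": 1.0, "MIMD": 0.5},
    machine_type={"MasPar": "SIMD", "nCube": "MIMD"},
    speedup={"MasPar": 20.0, "nCube": 10.0},
)
print(times)  # {'MasPar': 5.0, 'nCube': 20.0}
```

A different combining rule would change only the line that computes `effective`; the grouping of machine models under code types is what lets a coarse profile and a leaf-level benchmark be used together.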
I/O benchmarking and network-interface profiles. Not much work has been done for analytical benchmarking of supercomputer I/O subsystems. The I/O overhead depends on many factors, such as the effective bandwidth of memory channels, the topological characteristics of the I/O interconnection network, and the number and speed of the I/O processors. Accordingly, we can express the I/O benchmarking of a given architecture as a performance function that depends on the amount of data being transferred through the I/O subsystem.
For a typical I/O subsystem, a performance graph, such as the one in Figure 3, can represent this function. Typically, such a function shows latency time increasing linearly until it reaches a saturation point, as shown in Figure 3. Usually a single component determines this type of growth in latency rate, probably the slowest one in the I/O subsystem. However, beyond the saturation point, the latency growth rate can increase substantially due to the saturation and loading of various components. This saturation results from many factors, for example, contention within communication interconnections and the physical limitation on the movement of disk heads.
The interface system between a machine and the network can also cause considerable delay. This is because the protocols for communication and media access can dominate the overall communication overhead incurred during intermachine data transfer. Therefore, we must consider performance profiles for the network interface, along with the I/O profiles. Such a composite profile, as shown in Figure 3 [...] TIG. We can find this overhead by using the composite performance function $d_j$ associated with the machine (see Figure 3). We can represent communication overhead associated with the whole task graph in terms of these functions. The functions must be tabulated for the machines in a DHSS. It is important to mention that the total volume of data, $a_i$, communicated between two tasks in a TFG/TIG generally represents an aggregated value. In reality, the exchange of data among machines may be intermittent. Therefore, some sort of stochastic performance profiles may be more suitable.
Data conversion is another critical factor that restricts DHSS performance. In this runtime process, data communication among machines completes only when the conversion process is over. The overhead associated with this process depends on the amount of data being transferred and the data types used by the communicating machines. It also depends on the efficiency of the conversion process for a specific data type. We assume that the conversion process depends only on the data size and type. Accordingly, to handle data conversion cost, overhead can be added to the respective I/O function.
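A performance function of the kind described for Figure 3 can be sketched as a piecewise-linear curve with a steeper slope beyond the saturation point, plus a conversion term that depends only on data size. This is a minimal sketch under stated assumptions: the `IOProfile` class, its field names, the choice of linear growth after saturation, and every constant are hypothetical, and a real profile would be tabulated from measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IOProfile:
    """Hypothetical composite I/O plus network-interface profile of one machine."""
    slope: float                    # seconds per MB below the saturation point
    saturation_mb: float            # data volume at which saturation begins
    saturated_slope: float          # seconds per MB beyond the saturation point
    conversion_per_mb: float = 0.0  # data-conversion cost for one data type

    def delay(self, mb: float) -> float:
        """Return the estimated delay (seconds) for transferring `mb` megabytes."""
        if mb < 0:
            raise ValueError("data volume must be non-negative")
        below = min(mb, self.saturation_mb)
        above = max(mb - self.saturation_mb, 0.0)
        transfer = self.slope * below + self.saturated_slope * above
        # Conversion overhead is simply added to the I/O function.
        return transfer + self.conversion_per_mb * mb


profile = IOProfile(slope=0.01, saturation_mb=100.0,
                    saturated_slope=0.05, conversion_per_mb=0.002)
print(profile.delay(50.0), profile.delay(150.0))
```

Because the conversion term is folded into the same function, later stages need only one lookup per transfer, which matches the suggestion of adding conversion overhead to the respective I/O function.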
CFG and CIG.
Using a code-profiling technique, such as the proposed USC, and analytical benchmarking, we can generate a CFG/CIG from a TFG/TIG. Figure 4 illustrates the overall generation process. This process starts with a TFG/TIG, which describes the execution time of each task $t_i$ on a baseline system and the communication cost $a_i$ in terms of the amount of data transmitted among tasks. An intermediate CFG/CIG is generated from the TFG/TIG by using the code-profiling information. As a result, each task in the TFG/TIG is as- [...]
The resulting graph is a CFG/CIG that carries the detailed information about the machine-dependent execution and about the I/O performance of the tasks and data-communication overhead associated with a TFG/TIG. This elaborated machine-dependent characterization of DHSS applications is important for the DHSMS to carry out its task management functions.
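To make the end product concrete, the sketch below annotates a tiny TFG with per-machine execution-time estimates and per-link communication overheads. It illustrates only the shape of the result, not the generation procedure of Figure 4: the dictionary layout, the `build_cfg` name, the use of an "effective speedup" per task and machine, and the delay functions are all assumptions.

```python
from typing import Callable, Dict, Tuple

Task = str
Machine = str


def build_cfg(
    baseline_time: Dict[Task, float],
    data_volume: Dict[Tuple[Task, Task], float],
    effective_speedup: Dict[Task, Dict[Machine, float]],
    delay: Dict[Tuple[Machine, Machine], Callable[[float], float]],
):
    """Annotate a TFG/TIG with machine-dependent estimates (hypothetical layout).

    Returns (E, D): E[task][machine] is an estimated execution time and
    D[edge][(src, dst)] is the overhead of moving that edge's data between
    two distinct machines. Speedups are assumed to be positive.
    """
    E = {
        task: {m: baseline_time[task] / s
               for m, s in effective_speedup[task].items()}
        for task in baseline_time
    }
    D = {
        edge: {pair: fn(volume) for pair, fn in delay.items()}
        for edge, volume in data_volume.items()
    }
    return E, D


# Illustrative two-task graph on two hypothetical machines A and B.
E, D = build_cfg(
    baseline_time={"t1": 100.0, "t2": 40.0},
    data_volume={("t1", "t2"): 50.0},
    effective_speedup={"t1": {"A": 20.0, "B": 5.0},
                       "t2": {"A": 2.0, "B": 10.0}},
    delay={("A", "B"): lambda mb: 0.02 * mb,
           ("B", "A"): lambda mb: 0.04 * mb},
)
print(E["t1"], D[("t1", "t2")])
```

Each node now carries a vector of times and each link a small matrix of overheads, which is the information a scheduler or mapper needs in a heterogeneous setting.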
DHSMS architecture
A Distributed Heterogeneous Supercomputing Management System differs from existing experimental testbeds because it provides a framework for managing applications with different characteristics and machines with heterogeneous machine architectures. A DHSMS consists of various modules, each of which may contain submodules that vary in their functional capabilities and complexity. A DHSMS's basic function is to select a proper set of modules to meet an application's computational needs.
A DHSMS manages the resources and applications, and tries to satisfy the applications' processing requirements, such as on-line and off-line requirements, by making their scheduling/mapping decisions. Figure 5 shows a conceptual architecture of a DHSMS. It consists of the seven modules described here.
Core. This module selects an appropriate set of components on the basis of the type of graph and the user-specified degree of accuracy for code profiling.
By implementing the core as a module that is independent of a distributed operating system (DOS), we can integrate existing DOSs into the DHSMS. Cronus Kernel and V-kernel are typical examples of such a module. This module also allows the integration of new local operating systems without changing the local system or the DHSMS itself.
DOS. This module is the actual administrator for resource management and engagement of needed components. Its basic functions include supporting communication among machines, maintaining service-level protocol structures including data-type conversion, and handling some standard services such as managing files and directories. A DHSMS can use most existing classes of DOSs, such as integrated, object-oriented, and server-pool model-based systems.
Task analyzer. This is a key module in DHSMS. It accepts user applications in the form of source programs and converts them into graphical forms, such as TFGs or TIGs. These graphs are subsequently processed by other modules such as the code profiler and intermediate graph generator, the analytical benchmarker, and the code graph generator.
To resolve the problem of heterogeneity in programming languages, we assume that a standard graphical model of a program exists to help in generating a TFG/TIG. One such graphical "language" is Intermediate Form 1 (IF1), which is an acyclic graphical language that can be used to represent the flow of code execution. Such a representation is useful for estimating the computation time and overhead for application tasks. This estimation process requires a tool for analyzing applications. Parallel Assessment Window System (PAWS) is one possible tool that can be used for this purpose.
The task analyzer has two components: a task preprocessor and a TFG/TIG generator. The task preprocessor converts an application into a graphical language. The TFG/TIG generator analyzes the application.
Task coordinator. This module makes the scheduling/mapping decisions for the applications represented as CFGs or CIGs. By using the values of $E_i$ and $D_i$ associated in these graphs, the module assigns tasks to various machines in a manner that optimizes a cost function, such as the overall application-execution time. Because a DHSS requires both scheduling and mapping mechanisms, the task-coordinator module contains two corresponding components, namely, the scheduler and the mapper.
The scheduler consists of a set of scheduling algorithms with varying complexity and accuracy. Most scheduling algorithms used in homogeneous systems can be modified to handle the DHSS scheduling environment. The formulation of a cost function is generally based on two assumptions: that the computation and I/O occur sequentially without any overlapping and that the process of data conversion starts only after the transfer of data between two machines is complete. (The latter assumption is not required when network caching is employed, as discussed later in this section.) The total cost, $C_{Total}$, of running an application is the overall execution time of a CFG and is given by the length of the critical path in the CFG. The length of this path depends on elements $e_i$ and $d_{jk}(a_i)$ along the path.
The scheduler's objective is to minimize $C_{Total}$ by matching each code with a suitable machine from the pool of available machines. Because a limited number of machines is available, an intelligent assignment for the best performance is required. The precedence relationships among CFG codes impose restrictions on the execution order of codes. The various CFG paths provide such an ordering; they must be evaluated to finalize the scheduling decision. In a similar fashion, we can obtain a CIG cost function that depends on estimated times $e_i$ and $d_{jk}(a_i)$.
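For a fixed assignment of codes to machines, the critical-path cost can be computed with a longest-path pass over the CFG in topological order. The sketch below assumes, as the text does, that computation and communication do not overlap; the `total_cost` name, the input layout, and the numbers are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple


def total_cost(exec_time: Dict[str, float],
               comm_delay: Dict[Tuple[str, str], float],
               edges: List[Tuple[str, str]]) -> float:
    """Length of the critical path of a CFG under one fixed assignment.

    exec_time[c]       : estimated time of code c on its assigned machine
    comm_delay[(u, v)] : communication overhead on the link from u to v
    """
    preds = {code: set() for code in exec_time}
    for u, v in edges:
        preds[v].add(u)

    finish: Dict[str, float] = {}
    # static_order() yields every code after all of its predecessors.
    for code in TopologicalSorter(preds).static_order():
        start = max((finish[p] + comm_delay[(p, code)] for p in preds[code]),
                    default=0.0)
        finish[code] = start + exec_time[code]
    return max(finish.values())


# Diamond-shaped CFG with illustrative times.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
cost = total_cost(
    exec_time={"a": 5.0, "b": 3.0, "c": 4.0, "d": 2.0},
    comm_delay={("a", "b"): 1.0, ("a", "c"): 2.0,
                ("b", "d"): 1.0, ("c", "d"): 0.5},
    edges=edges,
)
print(cost)  # 13.5, along the path a -> c -> d
```

A scheduler would call such a routine once per candidate assignment, which is why the number of assignments explored matters so much to the overall complexity.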
Minimizing $C_{Total}$ is basically a dual optimization problem that requires not only the best match of codes with the machines but also the minimization of communication overhead in the exchange of data among machines. A DHSMS must identify the computation and data-communication overhead associated with a CFG/CIG critical path and handle it efficiently by assigning tasks on this path to the "most suitable" machines. This problem is NP-hard. Various heuristic approaches to scheduling and mapping can be used. We do not propose any new algorithm here. Rather, we describe the concept of "network caching" and discuss how the overall scheduling/mapping problem can be handled more efficiently by utilizing network resources in conjunction with the DHSMS task-coordinator module.
Data caching within a network. To manage application execution more efficiently, we propose a mechanism for utilizing underlying network resources, especially the buffering capabilities of various nodes. These buffers can be used to cache data that is exchanged among machines during CFG/CIG execution. We expect that data caching within the network can compensate for the I/O and network-interface bottlenecks as well as reduce the data exchange and conversion overhead.
The network can provide fast buffers at each node. Since the network may be operating at an extremely high rate (in the gigabits-per-second range), we can view these buffers as constituting a large distributed memory with fast access throughout the network. When two machines need to communicate, the network can allocate an appropriate number of these buffers at the time a CFG/CIG is scheduled. The DHSMS can use the allocated buffers as a cache for data exchange between these machines. Figure 6 illustrates this process. When machine $M_1$ accesses data from the I/O subsystem of machine $M_2$, the DHSMS can bring appropriate additional data from the storage subsystem of $M_2$ and store it at intermediate nodes after converting it into a format suitable for $M_1$.
Various existing data-caching algorithms can be used for this purpose.
The total size of cache required between two machines depends on the performance of their I/O subsystems and network interfaces, as well as on the amount of data transferred between them. The size of the network cache can be estimated from the elements of the communication-overhead matrix in the CFG/CIG. We assume that the task coordinator can generate such requirements.
Reducing the complexity of the scheduling/mapping problem via data caching within the network can be achieved in two steps. Starting with a CFG/CIG, the task coordinator carries out its scheduling/mapping decision based only on the estimated computation-time vectors $E_i$. That is, only the computation-time estimate is used in the cost function to find the best-matched machines; no communication cost is involved. Equivalently, we can modify the CFG/CIG by dropping the communication-overhead matrix. Any heuristic algorithm, such as the one given in reference 4, can be used for such scheduling/mapping.
After the machines are selected, the corresponding communication-overhead matrix elements, $d_{jk}(a_i)$, are evaluated. This evaluation determines the composite data-communication performance profiles for the selected machines that correspond to the communication costs associated with the CFG/CIG links. Such an evaluation can provide the total buffer size required to ensure sufficient cache memory to gain "delay compensation" that offsets the data-communication overhead. We know that we can improve I/O subsystem performance by increasing the size of the system cache. We need to explore the relation between network-cache size and the value of elements $d_{jk}(a_i)$. The interaction between the task coordinator and a network resource manager also needs to be investigated for this purpose.
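The two steps can be sketched as follows. Step one here uses a simple greedy choice (the fastest machine for each code) purely as a stand-in for whatever heuristic is actually used, and it ignores the limit on available machines; step two evaluates the overhead only for the machine pairs that were selected. The function name, input layout, and delay functions are assumptions, and the sketch stops short of translating overhead into a buffer size, since that relation is left open above.

```python
from typing import Callable, Dict, Tuple


def two_step_schedule(
    E: Dict[str, Dict[str, float]],
    data_volume: Dict[Tuple[str, str], float],
    delay: Dict[Tuple[str, str], Callable[[float], float]],
):
    """Step 1: match codes to machines using computation time only.
    Step 2: evaluate communication overhead for the selected machines."""
    # Step 1 (greedy stand-in for any heuristic): communication cost is ignored.
    assignment = {code: min(times, key=times.get) for code, times in E.items()}

    # Step 2: only the links between the chosen machines need to be evaluated.
    overhead = {}
    for (u, v), volume in data_volume.items():
        src, dst = assignment[u], assignment[v]
        # Codes placed on the same machine incur no intermachine overhead.
        overhead[(u, v)] = 0.0 if src == dst else delay[(src, dst)](volume)
    return assignment, overhead


assignment, overhead = two_step_schedule(
    E={"t1": {"A": 5.0, "B": 20.0},
       "t2": {"A": 20.0, "B": 4.0},
       "t3": {"A": 9.0, "B": 3.0}},
    data_volume={("t1", "t2"): 50.0, ("t2", "t3"): 10.0},
    delay={("A", "B"): lambda mb: 0.02 * mb,
           ("B", "A"): lambda mb: 0.04 * mb},
)
print(assignment, overhead)
```

The overhead values returned by step two are the quantities that network caching is expected to compensate for, so they are the natural input to a cache-sizing policy.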
Experimental platform for DHSMS
At Purdue University, we are developing a DHSS platform by interconnecting a MasPar, an nCube, and two four-processor systems through a high-speed optical network called TeraNet. The network operates at a rate of 1 gigabit/second. This platform provides a facility to test and evaluate various DHSMS-related concepts similar to those we have presented here. Specifically, we are developing a task analyzer, code profiler and intermediate graph generator, and analytical benchmarker. We are using PAWS, which allows us to assess the performance of various supercomputers and rank the machines for a given application.
PAWS has capabilities that can be used effectively in the proposed DHSMS:
It can generate a machine-independent graphical representation of an application written in a high-level language, namely, IF1. We can then transform this representation into an equivalent TFG/TIG.
It can simulate execution of a code for a parallel machine, which can provide approximate benchmark results. Exact benchmarks can be obtained by explicitly running codes on the machines.
Recently, we proposed a mapping algorithm for heterogeneous systems. We are planning to use a generalized version of this algorithm that is suitable for a DHSS environment by incorporating the code profiling and the benchmarking information.
The general DHSMS framework we have proposed is based on a code-profiling and benchmarking methodology for characterizing distributed applications. This methodology lets us incorporate the computational and data-communication overhead associated with an application within a single graphical model of tasks constituting these applications. We have also shown how network data caching can help reduce the complexity associated with scheduling decisions for applications in a DHSS.
