Application software development for High-Performance Parallel Computing (HPC) is a non-trivial process; its complexity can be primarily attributed to the increased degrees of freedom that have to be resolved and tuned in such an environment. Performance prediction tools enable a developer to evaluate available design alternatives and can assist in HPC application software development.
Introduction
Development of efficient application software for High-Performance Parallel Computing (HPC) is a non-trivial process and requires a thorough understanding not only of the application but also of the target computing environment. A key factor contributing to this complexity is the increased degrees of freedom that have to be resolved and tuned in such an environment. Typically, during the course of parallel application software development, the developer is required to select between available algorithms, between possible hardware configurations, amongst possible decompositions of the problem onto the selected hardware configuration, and between different communication and synchronization strategies. The set of reasonable alternatives that have to be evaluated is quite large, and selecting the most appropriate one can be a formidable task.
Evaluation tools enable a developer to visualize the effects of various design alternatives. Conventional evaluation techniques typically require extensive experimentation and data collection. Most existing evaluation tools post-process traces generated during an execution run. This implies instrumenting source code, executing the application on the actual hardware to generate traces, post-processing these traces to gain insight into the execution and overheads in the implementation, refining the implementation and then repeating the process. The process is repeated until all possibilities have been evaluated and the most suitable options for the problem have been identified. Such a development overhead can be tedious if not impractical.
Performance prediction tools provide a more practical and cost-effective means for evaluating available design alternatives and making design decisions. These tools, in symbiosis with other development tools, can be effectively used to complete the feedback loop of the "develop-evaluate-tune" cycle in the HPC application software development process.
In this paper we first present a novel interpretive approach for accurate and cost-effective performance prediction that can be effectively used during HPC software development. The essence of the approach is the application of interpretation techniques to performance prediction through an appropriate characterization of the HPC system and the application. An interpretive HPF/Fortran 90D application performance prediction framework has been implemented using the interpretive approach and is part of the NPAC¹ HPF/Fortran 90D application development environment. The accuracy and usability of the framework are experimentally validated.
Next, we outline the stages typically encountered during HPC application software development and highlight the significance and requirements of a performance prediction tool at relevant stages. Numerical results obtained using application codes and benchmarking kernels are then presented to demonstrate the application of the performance prediction framework to different stages of the application software development process outlined.
The rest of the paper is organized as follows: Section 2 introduces the interpretive approach to performance prediction. Section 3 then describes the HPF/Fortran 90D performance prediction framework and presents numerical results to validate the accuracy and usability of the interpretive approach. Section 4 outlines the HPC software development process and highlights the significance of performance prediction tools. Section 5 presents experiments to illustrate the application of the framework to different stages of the HPC software development process. Section 6 presents some concluding remarks.
¹ Northeast Parallel Architectures Center
2 Interpretive Performance Prediction
Interpretive performance prediction is an accurate and cost-effective approach for compile-time estimation of application performance. The essence of the approach is the application of interpretation techniques to performance prediction through an appropriate characterization of the HPC system and the application. A system abstraction methodology is defined to hierarchically abstract the HPC system into a set of well-defined parameters which represent its performance. A corresponding application abstraction methodology is defined to abstract a high-level application description into a set of well-defined parameters which represent its behavior. Performance prediction is then achieved by interpreting the execution costs of the abstracted application in terms of the parameters exported by the abstracted system. The interpretive approach is illustrated in Figure 1 and is composed of the following four modules:
1. A system abstraction module that defines a comprehensive system characterization methodology capable of hierarchically abstracting a high performance computing system into a set of well-defined parameters which represent its performance.
2. An application abstraction module that defines a comprehensive application characterization methodology capable of abstracting a high-level application description (source code) into a set of well-defined parameters which represent its behavior.
3. An interpretation module that interprets the performance of the abstracted application in terms of the parameters exported by the abstracted system.
4. An output module that communicates estimated performance metrics.
A key feature of this approach is that each module is independent of the other modules. Further, independence between individual modules is maintained throughout the characterization process and at every level of the resulting abstractions. As a consequence, abstraction and parameter generation for each module, and for individual units within the characterization of the module, can be performed separately using techniques or models best suited to that particular module or unit. This independence not only reduces the complexity of individual characterization models, allowing them to be more accurate and tractable, but also supports reusability and easy experimentation. For example, when characterizing a multiprocessor system, each processing node can be characterized independently. Further, the parameters generated for the processing node can be reused in the characterization of any system that has the same type of processors. Finally, experimentation with another type of processing node will only require the particular module to be changed. The four modules are briefly described below. A more detailed discussion of the performance interpretation approach can be found in [1].
System Abstraction Module
Abstraction of an HPC system is performed by hierarchically decomposing the system to form a rooted tree structure called the System Abstraction Graph (SAG). Each level of the SAG is composed of a set of System Abstraction Units (SAUs). Each SAU abstracts a part of the entire system into a set of parameters representing its performance, and exports these parameters via a well-defined interface. The interface can be generated independently of the rest of the system using evaluation techniques best suited to the particular unit (e.g. analytic, simulation, or specifications). The interface of an SAU consists of four components: (1) Processing Component (P), (2) Memory Component (M), (3) Communication/Synchronization Component (C/S), and (4) Input/Output Component (I/O). Figure 2 illustrates the system abstraction process using the iPSC/860 system. At the highest level (SAU-1), the entire iPSC/860 system is represented as a single compound processing component. SAU-1 is then decomposed into SAU-11, SAU-12, and SAU-13, corresponding to the i860 cube, the interconnect between the System Resource Manager (SRM) and the cube, and the SRM (or host) respectively. Each SAU is composed of P, M, C/S, and I/O components, each of which can be simple, compound or void. Compound components can be further decomposed. A component at any level is void if it is not applicable at that level (for example, SAU-12 has void P, M, and I/O components). System characterization thus proceeds recursively down the system hierarchy, generating SAUs of finer granularity at each level. The process terminates when the required granularity of parameterization is achieved. This choice is usually driven by a tradeoff between accuracy and cost-effectiveness.
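The hierarchical decomposition described above can be pictured as a small tree data structure. The following Python sketch is only an illustration under stated assumptions: the class name `SAU`, its fields, and the `bandwidth` parameter are hypothetical, and the component kinds are assumptions except for the two facts given in the text (SAU-1 has a compound P component, and SAU-12 has void P, M, and I/O components).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SAU:
    """One System Abstraction Unit, i.e. one node of the SAG (hypothetical layout)."""
    name: str
    # Kind of each interface component: "simple", "compound", or "void".
    kinds: Dict[str, str]
    # Exported performance parameters (placeholder values, not measurements).
    params: Dict[str, float] = field(default_factory=dict)
    children: List["SAU"] = field(default_factory=list)

    def find(self, name: str) -> Optional["SAU"]:
        """Depth-first search for the SAU with the given name."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

    def leaves(self) -> List["SAU"]:
        """SAUs at the finest granularity reached by the decomposition."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def build_ipsc860_sag() -> SAU:
    """Build the three-way decomposition of SAU-1 named in the text."""
    cube = SAU("SAU-11", {"P": "compound", "M": "compound",
                          "C/S": "compound", "I/O": "void"})      # assumed kinds
    link = SAU("SAU-12", {"P": "void", "M": "void",
                          "C/S": "simple", "I/O": "void"},        # void P, M, I/O per the text
               params={"bandwidth": 1.0})                         # placeholder parameter
    host = SAU("SAU-13", {"P": "simple", "M": "simple",
                          "C/S": "void", "I/O": "simple"})        # assumed kinds
    return SAU("SAU-1", {"P": "compound", "M": "void",
                         "C/S": "void", "I/O": "void"},           # only P is stated in the text
               children=[cube, link, host])
```

Because each SAU only exposes its own interface, replacing one subtree (for example, a different processing node type under SAU-11) would leave the rest of the tree untouched, which is the independence property the approach relies on.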
Application Abstraction Module
Machine-independent application abstraction is performed by recursively characterizing the application description into Application Abstraction Units (AAUs). Each AAU represents a standard programming construct and parameterizes its behavior. An AAU can be either Compound or Simple depending on whether it can or cannot be further decomposed.
Various classes of simple and compound AAUs are listed in Table 1. AAUs are combined to abstract the control structure of the application, forming an Application Abstraction Graph (AAG). The communication/synchronization structure of the application is superimposed onto the AAG by augmenting the graph with a set of edges corresponding to the communications or synchronizations between AAUs. The structure generated after augmentation is called a Synchronized Application Abstraction Graph (SAAG). The machine-specific filter then incorporates machine-specific information (such as introduced compiler transformations/optimizations which are specific to the particular machine) into the SAAG, based on the mapping that is being evaluated. Figure 3 illustrates the application abstraction process using a sample application description.
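As a rough illustration of how an AAG becomes a SAAG, the sketch below stores the control structure as a tree of AAUs and keeps the communication/synchronization edges in a separate list. All class and method names are hypothetical, and the `"Root"` kind is an assumption introduced only to hold the example together; `Seq`, `Comm`, and `IterSync` are AAU classes mentioned in the text.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class AAU:
    """One Application Abstraction Unit (hypothetical representation)."""
    name: str
    kind: str                      # e.g. "Seq", "Comm", "IterSync"
    children: List["AAU"] = field(default_factory=list)

    @property
    def compound(self) -> bool:
        # A compound AAU is one that is decomposed further.
        return bool(self.children)


class SAAG:
    """AAG (control-structure tree) augmented with synchronization edges."""

    def __init__(self, root: AAU) -> None:
        self.root = root
        self.sync_edges: List[Tuple[str, str]] = []

    def walk(self) -> Iterator[AAU]:
        """Yield every AAU in depth-first, pre-order."""
        stack = [self.root]
        while stack:
            node = stack.pop()
            yield node
            stack.extend(reversed(node.children))

    def add_sync_edge(self, src: str, dst: str) -> None:
        """Superimpose one communication/synchronization edge on the AAG."""
        names = {aau.name for aau in self.walk()}
        if src not in names or dst not in names:
            raise KeyError("both endpoints must be AAUs in the graph")
        self.sync_edges.append((src, dst))


def example_saag() -> SAAG:
    send = AAU("send", "Comm")
    recv = AAU("recv", "Comm")
    loop = AAU("loop", "IterSync", [AAU("update", "Seq"), send, recv])
    graph = SAAG(AAU("program", "Root", [AAU("init", "Seq"), loop]))
    graph.add_sync_edge("send", "recv")
    return graph
```

Keeping the synchronization edges apart from the control tree means the same AAG can be reused while different communication structures or mappings are evaluated.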
Interpretation Engine
The interpretation engine (or interpretation module) estimates performance by interpreting the abstracted application in terms of the performance parameters obtained via system abstraction. The interpretation module consists of two components: an interpretation function that interprets the performance of an individual AAU, and an interpretation algorithm that recursively applies the interpretation function to a SAAG to predict the performance of the corresponding application. The interpretation function defined for each AAU class abstracts its performance in terms of parameters exported by the SAU to which it is mapped. Functional interpretation techniques are used to resolve the values of variables that determine the flow of the application, such as conditions and loop indices. Models and heuristics used to interpret communications/synchronizations, iterative and conditional flow-control structures, accesses to the memory hierarchy, and user experimentation are briefly described below. A more detailed discussion of these models and the complete set of interpretation functions can be found in [1].
Modeling Communication
Waiting Time: Waiting time models overheads due to synchronizations, unavailable communication links, or unavailable communication buffers.
The contribution of each of the above components depends on the type of communication/synchronization and may differ for the sender and receiver. For example, in the case of an asynchronous communication, the waiting time and transmission time components do not contribute to the execution time at the sender.
The waiting time component is determined using a global communication structure, which maintains the specifications and status of each communication/synchronization, and a global clock, which is maintained by the interpretation algorithm. The global clock is used to timestamp each communication/synchronization call and message transmission, while the global communication structure stores information such as the time at which a particular message left the sender, or the current count at a synchronization barrier.
Modeling of Iterative Flow-Control Structures: In the case of the IterSync AAU, although the number of iterations is known, the loop body contains one or more communication or synchronization calls. This AAU cannot be interpreted as described above, as it is necessary to identify the calling time of each instance of the communication/synchronization calls. In this case, the loop body is partitioned into blocks without communication/synchronization and the communication/synchronization calls themselves. The interpretation function for the entire AAU is then defined recursively such that the execution time of the current iteration is a function of the execution time of the previous iteration. Similarly, the calling and execution times of the communication/synchronization calls are also defined recursively. The final case is a non-deterministic iterative structure (IterND), where the number of iterations or the execution of the loop body is not known. For example, the number of iterations may depend on the execution of the loop body, as in a while loop, or the execution of the loop body may vary from iteration to iteration. In this case performance is predicted by unrolling the iterations using functional interpretation and interpreting the performance of each iteration sequentially.
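The recursive treatment of a loop containing synchronization calls can be sketched for one simple, assumed situation: every iteration consists of a communication-free block followed by a single barrier. The function name, the fixed barrier cost, and the per-process compute times are all hypothetical; the sketch only shows how the calling time of each barrier instance follows from the completion time of the previous iteration, tracked on a global clock.

```python
from typing import List, Tuple


def interpret_itersync(compute_times: List[float],
                       barrier_cost: float,
                       iterations: int) -> Tuple[List[float], List[float]]:
    """Interpret a loop whose body is: compute block, then one barrier.

    compute_times[p] is the time process p spends in the communication-free
    block of each iteration (assumed constant across iterations).
    Returns (global-clock value at the end of each iteration,
             total barrier waiting time accumulated by each process).
    """
    clocks = [0.0] * len(compute_times)        # per-process position on the global clock
    total_wait = [0.0] * len(compute_times)
    finish = []
    for _ in range(iterations):
        # Calling time of this barrier instance depends on the previous iteration.
        call_times = [clock + t for clock, t in zip(clocks, compute_times)]
        last_arrival = max(call_times)         # barrier opens when the slowest process arrives
        for p, called in enumerate(call_times):
            total_wait[p] += last_arrival - called
        release = last_arrival + barrier_cost
        clocks = [release] * len(clocks)       # all processes leave the barrier together
        finish.append(release)
    return finish, total_wait
```

For two processes with block times 1.0 and 2.0 and a barrier cost of 0.5, three iterations finish at clock values 2.5, 5.0, and 7.5, and the faster process accumulates 3.0 units of waiting time.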
Modeling of Conditional Flow-Control Structures: The execution time for a conditional flow-control structure is broken down into three components: (1) the overhead associated with each condition tested (i.e. every "if", "elseif", etc.), (2) an additional overhead for the branch associated with a true condition, and (3) the time required to execute the body associated with the true condition. The interpretation function for the conditional AAU is a weighted sum of the interpreted performances of each of its branches; the weights evaluate to 1 or 0 during interpretation depending on whether the branch is taken or not. Functional interpretation is used to resolve the execution flow. Modeling of CondtD and CondtSync AAUs is similar to the corresponding iterative AAUs described above.
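The three-component breakdown above can be written as a short function. This is a minimal sketch, assuming the truth value of each condition has already been resolved (in the paper this is the job of functional interpretation) and that conditions are tested in source order until one is true; the function and parameter names are hypothetical.

```python
from typing import List, Tuple


def interpret_conditional(branches: List[Tuple[bool, float]],
                          t_test: float,
                          t_branch: float) -> float:
    """Estimate the execution time of an if / elseif / ... structure.

    branches: (resolved condition, interpreted body time) in source order.
    t_test:   overhead for each condition that is tested.
    t_branch: additional overhead for the branch whose condition is true.
    """
    total = 0.0
    for condition, body_time in branches:
        total += t_test                           # (1) cost of testing this condition
        weight = 1 if condition else 0            # weight is 1 only for the taken branch
        total += weight * (t_branch + body_time)  # (2) branch overhead + (3) body time
        if weight:
            break                                 # later conditions are never tested
    return total
```

With `branches = [(False, 10.0), (True, 4.0), (True, 7.0)]`, `t_test = 0.5`, and `t_branch = 0.25`, two conditions are tested and the second body is executed, giving 0.5 + 0.5 + 0.25 + 4.0 = 5.25.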
Modeling Access to the Memory Hierarchy: Access to the memory hierarchy of a computing element is modeled using heuristics based on the access patterns in the application description and the physical structure of the hierarchy. In the current implementation, application access patterns are approximated during interpretation by maintaining an access count and a detected miss count at the program level and by associating with each program variable a local access count, the last access offset (in the case of arrays), and the values of both program-level counters at the last access. A simple heuristic model uses these counts, together with the size of the cache block, its associativity and the replacement algorithm, to estimate cache misses for each AAU. This model is computationally efficient and provides the required accuracy, as can be seen from the results presented in Section 3.
$$t^{AAU}_{Comm} = (1 - f_{overlap}) \times t_{comm}$$
The $f_{overlap}$ factor could be a typical or explicitly defined value for the system. Alternatively, the user can define this factor for the particular application or experiment with different values.
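Reading the expression above as t_Comm(AAU) = (1 - f_overlap) x t_comm, the effect of experimenting with different overlap factors can be tabulated with a few lines of Python. The function name and the sample values are hypothetical, and restricting f_overlap to the range [0, 1] is an assumption of this sketch.

```python
def effective_comm_time(t_comm: float, f_overlap: float) -> float:
    """Communication time charged to a Comm AAU after accounting for overlap.

    t_comm:    interpreted communication time without overlap.
    f_overlap: fraction of t_comm assumed to be hidden (0 = none, 1 = all).
    """
    if not 0.0 <= f_overlap <= 1.0:
        raise ValueError("f_overlap is assumed to lie in [0, 1]")
    return (1.0 - f_overlap) * t_comm


def sweep_overlap(t_comm: float, factors):
    """Evaluate the same communication under several candidate overlap factors."""
    return [(f, effective_comm_time(t_comm, f)) for f in factors]
```

For example, `effective_comm_time(8.0, 0.25)` charges 6.0 time units to the AAU, while a factor of 1.0 hides the communication entirely.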
Supporting User Experimentation: The interpretation engine provides support for two types of user experimentation:
1. Experimentation with run-time situations, e.g. computation and communication loads.
2. Experimentation with system parameters, e.g. processing capability, memory size, communication channel bandwidth.
The effect of each experiment on application performance is modeled by abstracting its effect on the parameters exported by the system and application modules and setting their values accordingly. Heuristics are used to perform this abstraction. For example, the effect of increased network load on a particular communication channel is modeled by decreasing the effective available bandwidth on that channel. An appropriate scaling factor is then defined, which is used to scale the parameters exported by the C/S component associated with the communication channel. Similarly, doubling the bandwidth effectively decreases the transmission time over the channel, while increasing the cache size is reflected in the miss rate.
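The scaling-factor idea can be illustrated with a toy channel model. The linear cost model used here (startup time plus message size divided by bandwidth) is an assumption of this sketch, not a model taken from the paper, and all names are hypothetical.

```python
def transmission_time(message_bytes: float, bandwidth: float,
                      startup: float = 0.0) -> float:
    """Assumed linear channel model: startup + size / bandwidth."""
    return startup + message_bytes / bandwidth


def scaled_bandwidth(bandwidth: float, factor: float) -> float:
    """Apply an experiment's scaling factor to the exported bandwidth.

    factor < 1 models extra network load (less bandwidth available);
    factor > 1 models a faster channel, e.g. 2.0 for doubled bandwidth.
    """
    if factor <= 0.0:
        raise ValueError("scaling factor must be positive")
    return bandwidth * factor
```

With a 1000-byte message and a nominal bandwidth of 100 bytes per time unit, the baseline transmission takes 10.0 units; a load factor of 0.5 raises it to 20.0, and doubling the bandwidth lowers it to 5.0. Only the exported parameter changes, so the interpretation functions themselves stay untouched.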
Output Module
The output module provides an interactive interface through which the user can access estimated performance statistics. The user has the option of selecting the type of information and the level at which the information is to be displayed. Available information includes cumulative execution times, the communication time/computation time breakup, existing overheads and wait times. This information can be obtained for an individual AAU, cumulatively for a branch of the AAG (i.e. a sub-AAG), or for the entire AAG.
Related Research in Performance Prediction
Existing approaches and models for performance prediction on multicomputer systems can be broadly classified as analytic, simulation, monitoring, or hybrid; hybrid approaches make use of a combination of the above techniques along with possible heuristics and approximations.
A general approach for analytic performance prediction for shared memory systems has been proposed by Siewiorek et al. in [2], while probabilistic models for parallel programs based on queueing theory have been presented in [3]. An analytic performance prediction technique based on the approximation of parallel flow graphs by sequential flow graphs has been proposed by Qin et al. in [4]. The above approaches require users to explicitly model the application along with the entire system. A source-based analytic performance prediction model for Dataparallel C has been developed by Clement et al. [5]. The approach uses a set of assumptions and specific characteristics of the language to develop a speedup equation for applications in terms of system costs.
A simulation-based approach is used in the SiGLe system (Simulator at Global Level) [6], which provides special description languages to describe the architecture, the application, and the mapping of the application onto the architecture.
An evaluation approach based on instrumentation, data collection and post-processing has been proposed by Darema et al. [7]. Balasundaram et al. [8] use "training routines" to benchmark the performance of the architecture and then use this information to evaluate different data decompositions.
The PPPT system [9] uses monitoring techniques to profile the execution of the application program on a single processor, and to derive sequential program parameters such as conditional branch probabilities, loop iteration counts, and frequency counts for each statement type. The user is required to provide a characteristic set of input data for this profiling run. The information obtained is then used by the static, parameter-based performance prediction tool to estimate performance information for the parallelized (SPMD) application program on a distributed memory system.
A hybrid approach is presented in [10], where the runtime of each node of a stochastic graph representing the application is modeled as a random variable. The distributions of these random variables are then obtained using hardware monitoring.
The layered approach presented in [11] uses a methodology based on application and system characterization. The developer is required to characterize the application as an execution graph and define its resource requirements in this system.
High Performance Fortran (HPF) [12] is based on the research language Fortran 90D, developed jointly by Syracuse University and Rice University, and has the overriding goal of producing a dialect of Fortran that can be used on a variety of parallel machines, providing portable, high-level expression of data parallel algorithms. The idea behind HPF (and Fortran 90D) is to develop a minimal set of extensions to Fortran 90 to support the data parallel programming model. The incorporated extensions provide a means for explicit expression of parallelism and data mapping. These extensions include compiler directives, which are used to advise the compiler on how data objects should be assigned to processor memories, and new language features like the forall statement and construct.
HPF/Fortran 90D adopts a two-level mapping using the PROCESSORS, ALIGN, DISTRIBUTE, and TEMPLATE directives to map data objects to abstract processors. The data objects (typically array elements) are first aligned with an abstract index space called a template. The template is then distributed onto a rectilinear arrangement of abstract processors. The mapping of abstract processors to physical processors is implementation dependent. Data objects not explicitly distributed are mapped according to an implementation-dependent default distribution (e.g. replication). Supported distribution types include BLOCK and CYCLIC. Use of the directives is shown in Figure 5.
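To make the difference between the BLOCK and CYCLIC distributions concrete, the sketch below computes which abstract processor owns each element of a one-dimensional template. It uses 0-based indices for brevity (Fortran arrays are 1-based by default) and the usual HPF convention that BLOCK uses blocks of size ceil(n/p); the function names are hypothetical.

```python
def block_owner(i: int, n: int, p: int) -> int:
    """Owner of element i (0-based) of an n-element template under BLOCK over p processors."""
    block_size = -(-n // p)        # ceil(n / p) using integer arithmetic
    return i // block_size


def cyclic_owner(i: int, p: int) -> int:
    """Owner of element i (0-based) under CYCLIC: elements are dealt out round-robin."""
    return i % p


def distribution(n: int, p: int, kind: str):
    """Owner of every template element for kind in {"BLOCK", "CYCLIC"}."""
    if kind == "BLOCK":
        return [block_owner(i, n, p) for i in range(n)]
    if kind == "CYCLIC":
        return [cyclic_owner(i, p) for i in range(n)]
    raise ValueError("unsupported distribution: " + kind)
```

For a 10-element template on 4 processors, BLOCK yields owners `[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]` (the last processor holds a single element), whereas CYCLIC yields `[0, 1, 2, 3, 0, 1, 2, 3, 0, 1]`. The choice affects both load balance and the communication pattern, which is why the decomposition is one of the design alternatives worth evaluating.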
Our current implementation of the HPF/Fortran 90D compiler and performance prediction framework supports a formally defined subset of HPF. The term HPF/Fortran 90D in the rest of this document refers to this subset.
ESP: The HPF/Fortran 90D Performance Prediction Framework
ESP (see Figure 6) is an interpretive framework for HPF/Fortran 90D application performance prediction. It uses the interpretive approach outlined above to provide accurate and cost-effective performance prediction of HPF/Fortran 90D. ESP has been implemented as a part of the HPF/Fortran 90D application development environment [13] developed at NPAC, Syracuse University.
The design of ESP is based on the HPF source-to-source compiler technology [14], which translates HPF into loosely synchronous, SPMD (single program, multiple data) Fortran 77 + Message-Passing codes. It uses this technology in conjunction with the performance interpretation model to provide performance estimates for HPF/Fortran 90D applications on a distributed memory MIMD multicomputer. HPF/Fortran 90D performance prediction is performed in two phases: Phase 1 uses HPF compilation technology to produce an SPMD program structure consisting of Fortran 77 plus calls to run-time routines. Phase 2 then uses the interpretation approach to abstract and interpret the performance of the application. These two phases are described below:
The compiler symbol table is extended in this parse by tagging all variables that are critical (a critical variable being defined as a variable whose value affects the flow of execution, e.g. a loop limit). Critical variables are then resolved using functional interpretation by tracing their definition paths. If this is not possible, or if they are external inputs, the user is prompted for their values. If a critical variable is defined within an iterative structure, the user has the option of either explicitly defining the value of that variable or instructing the system to unroll the loop so as to compute its value. Access information required to model accesses to the memory hierarchy is abstracted from the input program structure in this parse and stored in the extended symbol table.
The final task of the abstraction parse is the clustering of consecutive Seq AAUs into a single AAU. The granularity of clustering can be specified by the user; the tradeoff here is estimation time versus estimation accuracy. At the finest level, each Seq AAU abstracts a single statement of the application description.
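The clustering step can be sketched as a single pass over a flat sequence of AAUs. The representation (a list of `(kind, label)` pairs) and the `max_cluster` knob standing in for the user-specified granularity are assumptions made for this illustration.

```python
from typing import List, Optional, Tuple


def cluster_seq(aaus: List[Tuple[str, str]],
                max_cluster: Optional[int] = None) -> List[Tuple[str, List[str]]]:
    """Merge runs of consecutive "Seq" AAUs into single clustered AAUs.

    aaus:        (kind, label) pairs in program order.
    max_cluster: largest number of statements per clustered Seq AAU;
                 None means unlimited, 1 means one statement per AAU.
    """
    clustered: List[Tuple[str, List[str]]] = []
    for kind, label in aaus:
        can_extend = (
            kind == "Seq"
            and clustered
            and clustered[-1][0] == "Seq"
            and (max_cluster is None or len(clustered[-1][1]) < max_cluster)
        )
        if can_extend:
            clustered[-1][1].append(label)      # grow the current Seq cluster
        else:
            clustered.append((kind, [label]))   # start a new AAU
    return clustered
```

A coarser clustering leaves fewer AAUs to interpret, which shortens estimation time, while `max_cluster=1` keeps one statement per Seq AAU, the finest level described above.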
Interpretation Parse: The interpretation parse performs the actual performance interpretation using the interpretation model described above. For each AAU in the SAAG, the corresponding interpretation function is used to generate the performance metrics associated with it. Metrics maintained at each AAU are its computation, communication and overhead times, and the value of the global clock. In addition, metrics specific to each AAU type (e.g. wait and transmission times for a Comm AAU) are also maintained. Cumulative metrics are maintained for the entire SAAG and for each compound AAU. The interpretation parse has provisions to take into consideration a set of system/compiler optimizations for the generated Fortran 77 + Message Passing code, such as loop re-ordering and inline expansion. These can be turned on or off by the user.
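The recursive structure of the interpretation parse can be sketched as follows. This is an illustration under stated assumptions rather than the framework's implementation: the cost models, the parameter names (`time_per_op`, `startup`, `bandwidth`, `loop_overhead`), and the compound kinds `"Block"` and `"Iter"` are hypothetical. Only the overall shape (an interpretation function per AAU kind, applied recursively, with cumulative metrics at compound AAUs) follows the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Metrics:
    computation: float = 0.0
    communication: float = 0.0
    overhead: float = 0.0

    def __add__(self, other: "Metrics") -> "Metrics":
        return Metrics(self.computation + other.computation,
                       self.communication + other.communication,
                       self.overhead + other.overhead)

    def scaled(self, k: int) -> "Metrics":
        return Metrics(self.computation * k, self.communication * k, self.overhead * k)

    @property
    def total(self) -> float:
        return self.computation + self.communication + self.overhead


@dataclass
class AAU:
    kind: str                 # "Seq", "Comm", or a compound: "Block", "Iter" (hypothetical)
    cost: float = 0.0         # operations for Seq, bytes for Comm
    repeat: int = 1           # iteration count for "Iter"
    children: List["AAU"] = field(default_factory=list)


def interpret(aau: AAU, sau: Dict[str, float]) -> Metrics:
    """Apply the interpretation function for aau.kind using exported SAU parameters."""
    if aau.kind == "Seq":
        return Metrics(computation=aau.cost * sau["time_per_op"])
    if aau.kind == "Comm":
        return Metrics(communication=sau["startup"] + aau.cost / sau["bandwidth"])
    # Compound AAU: accumulate the metrics of its children.
    body = Metrics()
    for child in aau.children:
        body = body + interpret(child, sau)
    if aau.kind == "Iter":
        # Deterministic loop: repeat the body and charge a per-iteration overhead.
        return body.scaled(aau.repeat) + Metrics(overhead=sau["loop_overhead"] * aau.repeat)
    return body
```

For instance, with `time_per_op=2.0`, `startup=1.0`, `bandwidth=4.0`, and `loop_overhead=0.5`, a block containing a 3-operation Seq AAU followed by a 2-iteration loop over a 1-operation Seq AAU and an 8-byte Comm AAU is interpreted as 10.0 units of computation, 6.0 of communication, and 1.0 of overhead.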
Output Parse: The final parse communicates estimated performance metrics to the user. The output interface provides three types of output. The first is a generic performance profile of the entire application broken up into its communication, computation and overhead components. Similar measures for each individual AAU and for sub-graphs of the AAG are also available. The second form of output allows the user to query the system for the metrics associated with a particular line or a set of lines of the application description. Finally, the system can generate an interpretation trace which can be used as input to a performance visualization package such as ParaGraph². The user can then use the capabilities provided by the package to analyze the performance of the application.
Experimental Evaluation of ESP
The experimental evaluation presented in this section has the following objectives:
1. To validate the accuracy of the performance prediction framework for applications executing on a high performance computing system. The goal is to show that the predicted metrics are accurate enough to provide realistic information about application performance and can be used as a basis for design tuning.
2. To demonstrate the usability (ease of use) of the performance interpretation framework and its cost-effectiveness.
The high performance computing system used for the validation is an iPSC/860 hypercube connected to an 80386-based host processor. The particular configuration of the iPSC/860 consists of 8 i860 nodes. Each node has a 4 KByte instruction cache, an 8 KByte data cache and 8 MBytes of main memory. The node operates at a clock speed of 40 MHz and has a theoretical peak performance of 80 MFlop/s for single precision and 40 MFlop/s for double precision. The validation application set was selected from the NPAC HPF/Fortran 90D Benchmark Suite [15]. The suite consists of a set of benchmarking kernels and "real-life" applications and is designed to evaluate the efficiency of the HPF/Fortran 90D compiler and, specifically, automatic partitioning schemes. The selected application set includes kernels from standard benchmark sets like the Livermore Fortran Kernels and the Purdue Benchmark Set, as well as real computational problems. The applications are listed in Table 2.
Accuracy of the interpretive performance prediction framework is validated by comparing estimated execution times with actual measured times. For each application, the experiment consisted of varying the problem size and the number of processing elements used. Measured timings represent an average taken over multiple runs. The results obtained are summarized in Table 3. Error values listed are percentages of the measured time and represent maximum/minimum absolute errors over all problem sizes and system sizes. For example, the N-Body computation was performed for 16 to 4094 bodies on 1, 2, 4, and 8 nodes of the iPSC/860. The minimum absolute error between estimated and measured times was 0.09% of the measured time, while the maximum absolute error was 5.9%.
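The error metric described above (absolute error as a percentage of the measured time, reported as a minimum and maximum over all configurations) can be computed as in the following sketch. The function names are hypothetical, and the sample timings in the usage note are made-up values chosen only to exercise the code; they are not results from the paper.

```python
from typing import Iterable, Tuple


def percent_abs_error(estimated: float, measured: float) -> float:
    """Absolute error between estimate and measurement, as a percentage of the measurement."""
    if measured <= 0.0:
        raise ValueError("measured time must be positive")
    return abs(estimated - measured) / measured * 100.0


def error_range(pairs: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Minimum and maximum percentage error over (estimated, measured) pairs."""
    errors = [percent_abs_error(est, meas) for est, meas in pairs]
    if not errors:
        raise ValueError("at least one (estimated, measured) pair is required")
    return min(errors), max(errors)
```

As a usage example with invented timings, `error_range([(7.5, 8.0), (4.5, 4.0), (2.0, 2.0)])` returns a minimum of 0.0 and a maximum of 12.5, since the three individual errors are 6.25%, 12.5%, and 0%.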
The results obtained show that in the worst case the interpreted performance is within 20% of the measured value, with the best-case error being less than 0.001%. The larger errors are produced by the benchmark kernels, which have been specifically coded to task the compiler. The objective of the predicted metrics is to serve either as a first-cut performance estimate of an application or as a relative performance measure to be used as a basis for design tuning. In either case, the interpreted performance is accurate enough to provide the required information.
Validating Usability of the Interpretive Framework
The interpreted performance estimates for the experiments described above were obtained using the interpretive framework running on a Sparcstation 1+. The framework provides a friendly, menu-driven, graphical user interface and requires no special hardware other than a conventional desktop workstation. Application characterization is performed automatically (unlike most approaches), while system abstraction is performed off-line and only once. Application parameters and directives were varied from within the interface itself. Typical experimentation on the iPSC/860 to obtain measured execution times consisted of editing code, compiling and linking using a cross compiler (compiling on the front end is not allowed, to reduce its load), transferring the executable to the iPSC/860 front end, loading it onto the i860 node and then finally running it. The process had to be repeated for each instance of each experiment. Relative experimentation times for different implementations of the Laplace Solver application (for different problem decompositions) using measurements and the performance interpreter are shown in Figure 7. Experimentation using the interpretive approach required approximately 10 minutes for each of the three implementations. Experimentation using measurements, however, took a minimum of 27 minutes for the (BLOCK,*) decomposition and required almost 1 hour for the (*,BLOCK) case. Clearly, the measurements approach can be very tedious and time consuming, especially when a large number of options have to be evaluated. Further, the iPSC/860, being an expensive resource, is shared by various development groups in the organization. Consequently, its usage can be restrictive and the required configuration may not be immediately available. The comparison above validates the convenience and cost-effectiveness of the framework for experimentation during application development.
4 The HPC Application Software Development Process
In this section we outline the HPC application software development process as a set of stages (see Figure 8) typically encountered by an application developer. The input to the development process is the application specification, generated either from the problem statement itself (if it is a new problem) or from existing code (when porting dusty decks). The final output is a running application. Feedback loops are present at some stages for step-wise refinement and tuning. The stages are briefly listed below. A detailed description of each stage, as well as the nature and requirements of support tools that can assist the developer, can be found in [16].
Inputs
The input to the software development process is the application specification in the form of a functional flow description of the application and its requirements. The application specification corresponds to the "user requirement document" in traditional life-cycle models. Supporting tools at this stage include expert-system-based tools and intelligent editors, both equipped with a knowledge base to assist in analyzing the application. In Figure 8 these tools are included in the "Application Specification Filter" module.
Application Analysis Stage
The function of the application analysis stage is to thoroughly analyze the input application specification with the objective of achieving the most efficient implementation. The output of this stage is a detailed process flow graph (the "Parallelization Specification") where the nodes of the graph represent functional modules and the edges represent interdependencies. The key functions performed by this stage include: (1) functional module creation, i.e. identification of functions that can be executed in parallel; (2) functional module classification, i.e. identification of standard functions; and (3) module synchronization, i.e. analysis of mutual interdependencies. This stage corresponds to the "design phase" in standard software life-cycle models and its output corresponds to the "design document".
Application Development Stage
The application development stage receives a process flow graph as input and generates an implementation which can then be compiled and executed. The key functions performed by this stage include: (1) algorithm development, i.e. assisting the developer in identifying functional components in the input flow graph and selecting appropriate algorithmic implementations; (2) system-level mapping, i.e. helping the developer select the appropriate HPC system and system configuration for the application; (3) machine-level mapping, i.e. helping the developer appropriately map functional components onto processors in the selected HPC configuration; and (4) implementation and coding, i.e. handling code generation and code filling of selected templates so as to produce a parallel program which can then be compiled and executed on the target system.
A k ey component of this stage is the design evaluator that assists the developer in evaluating di erent options available and identifying the option that provides the best performance. The design evaluator estimates the performance of the current design on the target system and provides insight i n to computation and communication costs, existing idle times and overheads. The estimated performance can then be used to identify regions where further re nement or tuning is required. The key features of the design evaluator are: 1 the ability to provide evaluations with desired accuracy, with minimum resource requirements and within a reasonable amount of time; 2 the ability to automate the evaluation process; and 3 the ability to perform the evaluation without having to run the application on the target systems.
Compile-Time & Run-Time Stage
The compile-time/run-time stage handles the task of executing the parallelized application generated by the development stage to produce the required output. The compile-time portion of this stage consists of optimizing compilers and tools for resource allocation and initial scheduling. The responsibilities of the run-time portion include dynamic scheduling, dynamic load balancing, migrations, and irregular communications.
Evaluation Stage
In the evaluation stage, the developer retrospectively evaluates the design choices made during the development stage and looks for ways to improve the design. This stage performs a thorough evaluation of the execution of the entire application, detailing communication and computation times, communication and synchronization overheads and existing idle times. That is, it uses application performance debugging to identify regions in the implementation where performance improvement is possible. The evaluation methodology enables the developer to investigate the effect of various run-time parameters, such as system load and network contention, on performance, as well as the scalability of the application with machine and problem size. The key feature of this stage is the ability to perform evaluation with the desired accuracy and granularity, while maintaining tractability and non-intrusiveness.
Maintenance/Evolution Stage
In addition to the stages described above, which are encountered during the development and execution of HPC applications, there is an additional stage in the life-cycle of this software that involves its maintenance and evolution. The functions of this stage include monitoring the operation of the software and ensuring that it continues to meet its specifications as the system configuration changes.
Application of the Interpretive Framework to HPC Software Development
The interpretive performance prediction framework can be effectively used at different stages of the software development process outlined in Section 4. In this section, we present experiments performed using the current implementation of the ESP HPF/Fortran 90D performance prediction framework to illustrate its application to HPC software development.
Application Development Stage
The Design Evaluator module of the Application Development Stage is responsible for evaluating the different implementation and mapping alternatives available to the other modules of this stage.
To illustrate the application of the interpretive framework to this stage, we demonstrate how the framework can be used to select an appropriate problem decomposition and mapping for a given system configuration. This is achieved by comparing the performance of the Laplace solver application for three different distributions (HPF DISTRIBUTE directive) of the template, namely (BLOCK,BLOCK), (BLOCK,X) and (X,BLOCK), and the corresponding alignments (HPF ALIGN directive) of the data elements to the template. These three distributions on 4 processors are shown in Figure 9, and the corresponding HPF/Fortran 90D descriptions are listed in Table 4. Figures 10-13 compare the performance of each of the three cases for different system sizes using both measured and estimated times. These graphs can be used to select the best directives for a particular problem size and system configuration. For the Laplace solver, the (BLOCK,X) distribution is the appropriate choice. Further, since the maximum absolute error between the estimated and measured times is less than 1%, the directive selection can be made accurately using the interpretive framework.
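As a rough illustration of what the three distributions mean on 4 processors, the following Python sketch maps each element of a small N x N template to its owning processor. It assumes that X marks a dimension that is not distributed, that BLOCK uses a block size of ceil(N/P) as in HPF, and that processors are numbered row by row over the processor grid. The helper names `block_owner` and `owner_map` are hypothetical and are not part of the framework.

```python
import math


def block_owner(index, extent, nprocs):
    """Owner of a 0-based index when `extent` elements are BLOCK-distributed."""
    block = math.ceil(extent / nprocs)  # HPF-style block size
    return index // block


def owner_map(n, proc_grid):
    """Return an n x n table of owning processor numbers.

    proc_grid = (p_rows, p_cols); a 1 means that dimension is not
    distributed (the X entry in the distribution).
    """
    p_rows, p_cols = proc_grid
    return [
        [block_owner(i, n, p_rows) * p_cols + block_owner(j, n, p_cols)
         for j in range(n)]
        for i in range(n)
    ]


# The three distributions on 4 processors (assumed processor grids).
distributions = {
    "(BLOCK,BLOCK)": (2, 2),
    "(BLOCK,X)": (4, 1),
    "(X,BLOCK)": (1, 4),
}

for name, grid in distributions.items():
    print(name)
    for row in owner_map(4, grid):
        print("  ", row)
```

The sketch only shows data ownership. Which distribution performs best depends on the resulting computation and communication costs, which is what the estimated and measured times in the figures compare.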
The key requirements of the design evaluator module are the ability to obtain evaluations with the desired accuracy, with minimum resource requirements and within a reasonable amount of time; the ability to automate the evaluation process; and the ability to perform the evaluation within an integrated workstation environment without running the application on the target computers. In the above experiment, performance interpretation was source-driven and can be automated into an intelligent tool capable of selecting appropriate decompositions and mappings. Further, as demonstrated in Section 3.5.2, performance interpretation is performed on a workstation and requires a fraction of the experimentation time. The interpretive framework can thus be effectively used to provide the functionality of the Design Evaluator module in the Application Development stage of the HPC software development process.
Evaluation Stage
The Evaluation stage of the HPC software development process is responsible for performing a thorough evaluation of the implementation with two key objectives:
1. Identify regions of the implementation where performance improvement is possible, by performance debugging the implementation, analyzing the contribution of different parts of the application description, and viewing their computation time/communication time breakdown.
2. Investigate the scalability of the application with machine and problem size, as well as the effect of system and run-time parameters on its performance. This enables the developer to test the robustness of the design and to modify it to account for different run-time scenarios.
The key requirement of this stage is the ability to perform the above evaluations with the desired accuracy and granularity, while maintaining tractability, non-intrusiveness, and cost-effectiveness. The application of the interpretive framework to the Evaluation stage of the HPC software development process is illustrated by the following experiments:
1. Application performance debugging.
2. Evaluation of application scalability.
3. Experimentation with system and run-time parameters.
Application Performance Debugging
The metrics generated by the interpretive framework can be used to analyze the performance contribution of different parts of the application description and to view their computation time/communication time breakdown. This is illustrated below using two applications.
N-Body Computations: Figure 15 shows the performance profile for two phases of the n-body application. Phase 1 (see Figure 14) represents the forward movement of data around the virtual processor ring, while Phase 2 represents the accumulation of force data at the original processors. For n processors, each phase requires n/2 circular shifts of the data; consequently, their communication profiles are similar. However, Phase 1 performs more computation, as it computes the force interactions. Overhead time represents parallelization overheads. Similar profiles can be obtained at finer granularities, down to a single line of code.
Application performance debugging using conventional means involves instrumentation, execution and data collection, and post-processing of this data. Further, this process requires a running application and has to be repeated to evaluate each design modification. Using the interpretive framework, this information is available, at all levels required, during application development.
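The idea of viewing a profile at different granularities can be sketched as follows. This is a minimal Python illustration, assuming that per-line metrics are available as computation, communication and overhead times; the data layout, the helper names and all numbers are hypothetical placeholders, not output of the framework or values from the figures.

```python
from collections import defaultdict

KINDS = ("computation", "communication", "overhead")


def aggregate_by_phase(line_metrics, phase_of_line):
    """Sum per-line times into per-phase totals for each kind of time."""
    totals = defaultdict(lambda: dict.fromkeys(KINDS, 0.0))
    for line, metrics in line_metrics.items():
        phase = phase_of_line[line]
        for kind in KINDS:
            totals[phase][kind] += metrics.get(kind, 0.0)
    return dict(totals)


def breakdown_percent(profile):
    """Express one profile as percentages of its total time."""
    total = sum(profile.values())
    if total == 0:
        return {kind: 0.0 for kind in profile}
    return {kind: 100.0 * t / total for kind, t in profile.items()}


# Placeholder per-line times in arbitrary units (not measured data).
line_metrics = {
    10: {"computation": 3.0, "communication": 0.0, "overhead": 0.5},
    11: {"computation": 0.0, "communication": 1.0, "overhead": 0.5},
    20: {"computation": 1.0, "communication": 1.0, "overhead": 0.0},
}
phase_of_line = {10: "phase1", 11: "phase1", 20: "phase2"}

for phase, profile in aggregate_by_phase(line_metrics, phase_of_line).items():
    print(phase, breakdown_percent(profile))
```

The same aggregation could be applied to any grouping of source lines, for example per loop or per subroutine, by changing the `phase_of_line` table.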
Application Scalability Evaluation
Figures 18, 19, and 20 plot the scalability of three applications (PI, N-Body and Financial) with problem size as well as system size. Both measured and estimated times are plotted to show that the estimated times provide sufficiently accurate scalability information.
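One common way to summarize such scalability curves is through speedup and efficiency relative to the smallest configuration, together with the gap between estimated and measured times. The Python sketch below assumes this choice of metrics, which the text itself does not prescribe; the function names are hypothetical and the timing values are placeholders, not results from the figures.

```python
def speedup(times):
    """Speedup of each configuration relative to the smallest one.

    times: dict mapping number of processors to execution time.
    """
    base_procs = min(times)
    base_time = times[base_procs]
    return {p: base_time / t for p, t in sorted(times.items())}


def efficiency(times):
    """Speedup divided by the relative increase in processor count."""
    base_procs = min(times)
    return {p: s * base_procs / p for p, s in speedup(times).items()}


def max_percent_error(estimated, measured):
    """Largest relative gap between estimated and measured times, in percent."""
    return max(
        abs(estimated[p] - measured[p]) / measured[p] * 100.0
        for p in measured
    )


# Placeholder timings in seconds (illustrative only).
measured = {1: 8.0, 2: 4.0, 4: 2.5}
estimated = {1: 8.0, 2: 4.1, 4: 2.4}

print("speedup (measured):   ", speedup(measured))
print("efficiency (measured):", efficiency(measured))
print("max error (%):        ", max_percent_error(estimated, measured))
```

If the estimated times track the measured times closely, the speedup and efficiency curves computed from the estimates lead to the same scalability conclusions.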
Experimentation with System/Run-Time Parameters
The results presented in this section demonstrate the use of the interpretive framework for evaluating the effects of different system and run-time parameters on application performance. The following experiments were conducted:
Effect of Varying Processor Speed: In this experiment, we evaluate the effect of increasing/decreasing the speed of each processor in the iPSC/860 system on application performance. The results are shown in Figure 21. Such an evaluation enables the developer to visualize how the application will perform on a faster prospective machine or, alternatively, if it has to be run on a slower processor. It can also be used to evaluate the benefits of upgrading to a faster processor system.
Effect of Varying Network Load: Figure 22 shows the interpreted effects of network load on application performance. It can be seen that the performance deteriorates rapidly as the network gets saturated. Further, the effect of network load is more pronounced for larger system configurations, as illustrated in Figure 23.
Effect of Varying Interconnection Bandwidth: The effect of varying the interconnect bandwidth on application performance is shown in Figure 24. The increase/decrease in application execution times is greater for larger processor configurations, as illustrated in Figure 25.
The ability to experiment with different system parameters not only allows the user to evaluate the application during the Evaluation stage, but can also be used during the Maintenance/Evolution stage to check whether the application continues to meet its specification when the system configuration changes.
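The kind of what-if experimentation described above can be illustrated with a deliberately simple analytic cost model. The Python sketch below is an assumption made for illustration and is not the model used by the interpretive framework: computation time is scaled inversely with a processor speed factor, and the bandwidth available to the application is reduced in proportion to the network load. All parameter names and values are hypothetical.

```python
def predicted_time(comp_time, num_messages, bytes_sent, latency, bandwidth,
                   speed_factor=1.0, network_load=0.0):
    """Estimate execution time under an assumed linear cost model.

    comp_time    : computation time on the baseline processor (s)
    num_messages : number of messages sent
    bytes_sent   : total bytes communicated
    latency      : per-message startup cost (s)
    bandwidth    : unloaded interconnect bandwidth (bytes/s)
    speed_factor : processor speed relative to the baseline (2.0 = twice as fast)
    network_load : fraction of bandwidth used by other traffic, in [0, 1)
    """
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    if not 0.0 <= network_load < 1.0:
        raise ValueError("network_load must be in [0, 1)")
    effective_bandwidth = bandwidth * (1.0 - network_load)
    comm_time = num_messages * latency + bytes_sent / effective_bandwidth
    return comp_time / speed_factor + comm_time


# Placeholder application parameters (illustrative only).
app = dict(comp_time=10.0, num_messages=100, bytes_sent=1e6,
           latency=1e-4, bandwidth=1e6)

for load in (0.0, 0.5, 0.9, 0.99):
    print("load", load, "->", predicted_time(network_load=load, **app))
for factor in (0.5, 1.0, 2.0):
    print("speed", factor, "->", predicted_time(speed_factor=factor, **app))
```

Under this assumed model, the communication term grows without bound as the load approaches saturation, which matches the qualitative behavior described for the network load experiment.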
Conclusions
Software development in any high-performance parallel computing environment is non-trivial, and the development of efficient application software capable of exploiting the available computing potential depends to a large extent on the availability of suitable tools and application development environments. Evaluation tools enable a developer to visualize the effects of the various design alternatives and make appropriate design decisions, and thus form a critical component of such a development environment.
In this paper, we first presented a novel interpretive approach for accurate and cost-effective performance prediction that can be effectively used during HPC application software development. A source-driven HPF/Fortran 90D performance prediction framework based on the interpretive approach has been implemented as part of the NPAC HPF/Fortran 90D integrated application development environment. The accuracy and usability of the interpretive performance prediction framework were experimentally validated.
We then outlined the stages typically encountered during application software development in an HPC environment and highlighted the significance and requirements of a performance prediction tool at the relevant stages. Numerical results using benchmarking kernels and application codes were presented to demonstrate the application of the performance prediction framework to different stages of the application software development process.
We are currently developing an intelligent HPF/Fortran 90D compiler based on the source-based interpretation model. This tool will enable the compiler to automatically evaluate directives and transformation choices and to optimize the application at compile time. We are also working on expanding the HPF/Fortran 90D application development environment to incorporate a wider set of tools, so as to span all the stages of the HPC application software development process.
