Compile-Time Performance Prediction of HPF/Fortran 90D by Parashar, Manish & Hariri, Salim
Syracuse University 
SURFACE 
Electrical Engineering and Computer Science College of Engineering and Computer Science 
1996 
Compile-Time Performance Prediction of HPF/Fortran 90D 
Manish Parashar 
University of Texas at Austin 
Salim Hariri 
Syracuse University 
Recommended Citation 
Parashar, Manish and Hariri, Salim, "Compile-Time Performance Prediction of HPF/Fortran 90D" (1996). 
Electrical Engineering and Computer Science. 75. 
https://surface.syr.edu/eecs/75 
Compile-Time Performance Prediction of HPF/Fortran 90D
To be published in IEEE Parallel & Distributed Technology

Manish Parashar
Department of Computer Sciences
University of Texas at Austin
Austin, TX 78712-1081
parashar@cs.utexas.edu

Salim Hariri
Department of Computer Engineering
Syracuse University
Syracuse, NY 13244-4100
hariri@fruit.ece.syr.edu

Contents
1 Introduction
2 Interpretive Performance Prediction
  2.1 System Abstraction Module
  2.2 Application Abstraction Module
  2.3 Interpretation Engine
  2.4 Output Module
3 An Overview of HPF/Fortran 90D
4 Design of the HPF/Fortran 90D Performance Prediction Framework
  4.1 Phase 1 - Compilation
  4.2 Phase 2 - Interpretation
  4.3 Abstraction & Interpretation HPF/Fortran 90D Parallel Constructs
  4.4 Abstraction of the iPSC/860 System
5 Validation/Evaluation of the Interpretation Framework
  5.1 Validating Accuracy of the Framework
  5.2 Application of the Interpretive Framework to HPC Application Development
    5.2.1 Appropriate Directive Selection
    5.2.2 Experimentation with System/Run-Time Parameters
    5.2.3 Application Performance Debugging
  5.3 Validating Usability of the Interpretive Framework
6 Related Work
7 Conclusions and Future Work
A Accuracy of the Interpretation Framework
Abstract

In this paper we present an interpretive approach for accurate and cost-effective performance prediction in a high performance computing environment, and describe the design of a compile-time HPF/Fortran 90D performance prediction framework based on this approach. The performance prediction framework has been implemented as a part of the HPF/Fortran 90D application development environment that integrates it with an HPF/Fortran 90D compiler and a functional interpreter. The current implementation of the environment framework is targeted to the iPSC/860 hypercube multicomputer system. A set of benchmarking kernels and application codes has been used to validate the accuracy, utility, and usability of the performance prediction framework. The use of the framework for selecting appropriate HPF/Fortran 90D compiler directives, for application performance debugging, and for experimentation with run-time and system parameters is demonstrated.

Keywords: Performance experimentation & prediction, HPF/Fortran 90D application development, System & Application characterization.

1 Introduction

Although currently available High Performance Computing (HPC) systems possess large computing capabilities, few existing applications are able to fully exploit this potential. The fact remains that development of efficient application software capable of exploiting available computing potential is non-trivial and is largely governed by the availability of sufficiently high-level languages, tools, and application development environments.

A key factor contributing to the complexity of parallel/distributed software development is the increased degrees of freedom that have to be resolved and tuned in such an environment. Typically, during the course of parallel/distributed software development, the developer is required to select between available algorithms for the particular application; between possible hardware configurations and amongst possible decompositions of the problem onto the selected hardware configuration; and between different communication and synchronization strategies. The set of reasonable alternatives that have to be evaluated is very large and selecting the best alternative among these is a formidable task. Consequently, evaluation tools form a critical part of any software development environment. These tools, in symbiosis with other development tools, complete the feedback loop of the "develop-evaluate-tune" cycle.

In this paper we present a novel interpretive approach for accurate and cost-effective performance prediction in a high performance computing environment and describe the design of a source-driven HPF¹/Fortran 90D performance prediction framework based on this approach. The interpretive approach defines a comprehensive characterization methodology which abstracts system and application components of the HPC environment. Interpretation techniques are then used to interpret performance of the abstracted application in terms of parameters exported by the abstracted system. System abstraction is performed off-line through a hierarchical decomposition of the computing system. Parameters required to abstract each component of this hierarchy can be generated independently, using existing techniques or system specifications. Application abstraction is achieved automatically at compile time.

The performance prediction framework has been implemented as a part of the HPF/Fortran 90D application development environment developed at the Northeast Parallel Architectures Center, Syracuse University (see Figure 1). The environment integrates an HPF/Fortran 90D compiler, a functional interpreter and the source based performance prediction tool and is supported by a graphical user interface. The current implementation of the environment is targeted to the iPSC/860 hypercube multicomputer system.

Figure 1: The HPF/Fortran 90D Application Development Environment

¹ High Performance Fortran
A set of benchmarking kernels and application codes has been used to validate the accuracy, utility, and usability of the performance prediction framework. The use of the framework for selecting appropriate HPF/Fortran 90D compiler directives, for application performance debugging, and for experimentation with run-time and system parameters is demonstrated.

The rest of the paper is organized as follows: Section 2 introduces the interpretive performance prediction approach. Section 3 provides an overview of HPF/Fortran 90D. Section 4 describes the design of the HPF/Fortran 90D performance prediction framework. Section 5 presents numerical results to validate the interpretive performance prediction framework. Section 6 discusses some related research. Section 7 presents some concluding remarks.

2 Interpretive Performance Prediction

Interpretive performance prediction is an accurate and cost-effective approach for compile-time estimation of application performance. The essence of the approach is the application of interpretation techniques to performance prediction through an appropriate characterization of the HPC system and the application. A system abstraction methodology is defined to hierarchically abstract the HPC system into a set of well-defined parameters which represent its performance. A corresponding application abstraction methodology is defined to abstract a high-level application description into a set of well-defined parameters which represent its behavior. Performance prediction is then achieved by interpreting the execution costs of the abstracted application in terms of the parameters exported by the abstracted system. The interpretive approach is illustrated in Figure 2 and is composed of the following four modules:

1. A system abstraction module that defines a comprehensive system characterization methodology capable of hierarchically abstracting a high performance computing system into a set of well-defined parameters which represent its performance.
2. An application abstraction module that defines a comprehensive application characterization methodology capable of abstracting a high-level application description (source code) into a set of well-defined parameters which represent its behavior.
3. An interpretation module that interprets performance of the abstracted application in terms of the parameters exported by the abstracted system.
4. An output module that communicates the estimated performance metrics.

Figure 2: Interpretive Performance Prediction

A key feature of this approach is that each module is independent with respect to the other modules. Further, independence between individual modules is maintained throughout the characterization process and at every level of the resulting abstractions. As a consequence, abstraction and parameter generation for each module, and for individual units within the characterization of the module, can be performed separately using techniques or models best suited to that particular module or unit. This independence not only reduces the complexity of individual characterization models, allowing them to be more accurate and tractable, but also supports reusability and easy experimentation. For example, when characterizing a multiprocessor system, each processing node can be characterized independently.
Further, the parameters generated for the processing node can be reused in the characterization of any system that has the same type of processors. Finally, experimentation with another type of processing node will only require the particular module to be changed. The four modules are briefly described below. A detailed discussion of the performance interpretation approach can be found in [1].

2.1 System Abstraction Module

Abstraction of an HPC system is performed by hierarchically decomposing the system to form a rooted tree structure called the System Abstraction Graph (SAG). Each level of the SAG is composed of a set of System Abstraction Units (SAU's). Each SAU abstracts a part of the entire system into a set of parameters representing its performance, and exports these parameters via a well-defined interface. The interface can be generated independent of the rest of the system using evaluation techniques best suited to the particular unit (e.g. analytic, simulation, or specifications). The interface of an SAU consists of 4 components: (1) Processing Component (P), (2) Memory Component (M), (3) Communication/Synchronization Component (C/S), and (4) Input/Output Component (I/O). Figure 3 illustrates the system abstraction process using the iPSC/860 system. At the highest level (SAU-1), the entire iPSC/860 system is represented as a single compound processing component. SAU-1 is then decomposed into SAU-11, SAU-12, and SAU-13 corresponding to the i860 cube, the interconnect between the System Resource Manager (SRM) and the cube, and the SRM or host respectively. Each SAU is composed of P, M, C/S, and I/O components, each of which can be simple, compound or void. Compound components can be further decomposed. A component at any level is void if it is not applicable at that level (for example, SAU-12 has void P, M, and I/O components). Parameters exported by the i860 cube (SAU-11) are presented in Section 4.4. System characterization thus proceeds recursively down the system hierarchy, generating SAU's of finer granularity at each level.

Figure 3: System Abstraction Process
[Figure: SAU-1 (iPSC/860) decomposed into SAU-11 (Cube), SAU-12 (SRM-Cube Link), and SAU-13 (SRM). Each SAU shows its Processing (P), Memory (M), Comm/Sync (C/S), and Input/Output (I/O) components, each marked as compound, simple, or void.]

2.2 Application Abstraction Module

Machine independent application abstraction is performed by recursively characterizing the application description into Application Abstraction Units (AAU's). Each AAU represents a standard programming construct and parameterizes its behavior. An AAU can be either Compound or Simple depending on whether it can or cannot be further decomposed. Various classes of simple and compound AAU's are listed in Table 1. AAU's are combined so as to abstract the control structure of the application, forming the Application Abstraction Graph (AAG). The communication/synchronization structure of the application is superimposed onto the AAG by augmenting the graph with a set of edges corresponding to the communications or synchronization between AAU's. The structure generated after augmentation is called the Synchronized Application Abstraction Graph (SAAG) and is an abstracted application task graph. The machine specific filter then incorporates machine specific information (such as introduced compiler transformations/optimizations which are specific to the particular machine) into the SAAG based on the mapping that is being evaluated. Figure 4 illustrates the application abstraction process using a sample application description.

Figure 4: Application Abstraction Process
[Figure: the application description listed below, its Application Abstraction Graph (AAG), and the corresponding Synchronized AAG (SAAG). Graph nodes are labeled Start, Seq, IterSync, Spawn, Comm, CondtSync, and End.]

Application Description:

  Host Program
    N = 2
    DO I = 0,N-1
      Spawn Node I
    ENDDO
    Recv RESULTS
  END

  Node Program
    ME = MYNODE()
    CALC.....
    SyncSend (ME+1) MOD 2
    SyncRecv (ME-1+2) MOD 2
    IF ME EQ 0
      Send RESULTS
    ENDIF
  END

2.3 Interpretation Engine

The interpretation engine (or interpretation module) estimates performance by interpreting the abstracted application in terms of the performance parameters obtained via system abstraction. The interpretation module consists of two components: an interpretation function that interprets the performance of an individual AAU, and an interpretation algorithm that recursively applies the interpretation function to the SAAG to predict the performance of the corresponding application. Interpretation functions defined for each AAU class abstract its performance in terms of parameters exported by the SAU to which it is mapped. Functional interpretation techniques are used to resolve the values of variables that determine the flow of the application, such as conditions and loop indices. Models and heuristics used to interpret communications/synchronizations, iterative and conditional flow control structures, accesses to the memory hierarchy, and user experimentation are described below. This description omits many details for brevity; a more detailed discussion of these models and the complete set of interpretation functions can be found in [1].
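The split between a per-AAU interpretation function and a recursive interpretation algorithm can be illustrated with a minimal Python sketch. It is not the framework's implementation: the class names, the `SystemParams` fields, and the operation-count cost model for Seq AAU's are hypothetical, and only two AAU classes (Seq and IterD, see Table 1) are covered. The deterministic-loop cost follows the setup/per-iteration/body decomposition given later in this section.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SystemParams:
    """Hypothetical parameters exported by an SAU (illustrative values only)."""
    time_per_op: float       # cost of one abstract operation in a Seq AAU
    loop_setup_ovhd: float   # one-time loop setup overhead
    loop_iter_ovhd: float    # overhead paid on every iteration


@dataclass
class Seq:
    """Simple AAU: contiguous statements, summarized here by an operation count."""
    num_ops: int


@dataclass
class IterD:
    """Compound AAU: deterministic loop with a known iteration count."""
    num_iters: int
    body: List[object] = field(default_factory=list)


def interpret(aau, sau: SystemParams) -> float:
    """Interpretation function: cost of one AAU in terms of SAU parameters."""
    if isinstance(aau, Seq):
        return aau.num_ops * sau.time_per_op
    if isinstance(aau, IterD):
        body_time = interpret_all(aau.body, sau)  # recurse into the compound AAU
        return sau.loop_setup_ovhd + aau.num_iters * (sau.loop_iter_ovhd + body_time)
    raise TypeError(f"no interpretation function for {type(aau).__name__}")


def interpret_all(aaus, sau: SystemParams) -> float:
    """Interpretation algorithm: apply the function to each AAU in sequence."""
    return sum(interpret(a, sau) for a in aaus)


# A small abstracted program: a sequential block followed by a doubly nested loop.
program = [Seq(10), IterD(100, [Seq(4), IterD(8, [Seq(2)])])]
sau = SystemParams(time_per_op=1.0, loop_setup_ovhd=5.0, loop_iter_ovhd=0.5)
total = interpret_all(program, sau)
```

Because the system parameters are passed in separately from the abstracted program, the same `program` can be re-interpreted against a different `SystemParams` instance, which mirrors the module independence described in Section 2.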
AAU Class | AAU Type | Description
Start AAU (Start) | Simple | marks the beginning of the application
End AAU (End) | Simple | represents the termination of an independent flow of control
Sequential AAU (Seq) | Simple | abstracts a set of contiguous statements containing only library functions, system routines, assignments and/or arithmetic/logical operations
Spawn AAU (Spawn) | Compound | abstracts a "fork" type statement generating independent flows of control
Iterative-Deterministic AAU (IterD) | Compound | abstracts an iterative flow control structure with deterministic execution characteristics and no comm/sync in its body
Iterative-Synchronized AAU (IterSync) | Compound | abstracts an iterative flow control structure with deterministic execution characteristics and at least one comm/sync in its body
Iterative-NonDeterministic (IterND) | Compound | abstracts a non-deterministic iterative flow control structure, e.g. number of iterations depends on loop execution
Conditional-Deterministic (CondtD) | Compound | abstracts a conditional flow control structure with deterministic execution characteristics and no comm/sync in any of its bodies
Conditional-Synchronized (CondtSync) | Compound | abstracts a conditional flow control structure which contains a communication/synchronization in at least one of its bodies
Communication AAU (Comm) | Simple | abstracts statements involving explicit communication
Synchronization AAU (Sync) | Simple | abstracts statements involving explicit synchronization
Synchronized Sequential AAU (SyncSeq) | Simple | abstracts any Seq AAU which requires synchronization or communication, e.g. a global reduction operation
Call AAU (Call) | Compound | abstracts invocations of user-defined functions or subroutines

Table 1: Application Characterization
Modeling Communication/Synchronization: Communication or synchronization operations in the application are decomposed during interpretation into three components (as shown in Figure 5):

- Call Overhead: This represents fixed overheads associated with the operation.
- Transmission Time: This is the time required to actually transmit the message from the source to the destination.
- Waiting Time: Waiting time models overheads due to synchronizations, unavailable communication links, or unavailable communication buffers.

Figure 5: Interpretation Model for Communication/Synchronization AAU's
[Figure: global time line for Proc 1 and Proc 2 marking the events "Proc 1 Calls Send", "Proc 2 Calls Recv", "Proc 1 Returns", and "Proc 2 Returns", with the intervals Call Ovhd, Wait Time, and XMission Time.]

The contribution of each of the above components depends on the type of communication/synchronization and may differ for the sender and receiver. For example, in case of an asynchronous communication, the waiting time and transmission time components do not contribute to the execution time at the sender.

The waiting time component is determined using a global communication structure which maintains specifications and status of each communication/synchronization, and a global clock which is maintained by the interpretation algorithm. The global clock is used to timestamp each communication/synchronization call and message transmission, while the global communication structure stores information such as the time at which a particular message left the sender, or the current count at a synchronization barrier.

Modeling of Iterative Flow-Control Structures: The interpretation of an iterative flow control structure depends on its type. Typically, its execution time comprises three components: (1) loop setup overhead, (2) per iteration overhead, and (3) execution cost of the loop body.

In case of deterministic loops (IterD AAU), where the number of iterations is known and there are no communications or synchronizations in the loop body, the execution time is defined as

\[ T^{Exec}_{IterD} = T^{Ovhd}_{Setup} + NumIters \times \left[ T^{Ovhd}_{PerIter} + T^{Exec}_{Body} \right] \]

where $T^{Exec}$ and $T^{Ovhd}$ are estimated execution time and overhead time respectively.

In the case of the IterSync AAU, although the number of iterations is known, the loop body contains one or more communication or synchronization calls.
This AAU cannot be interpreted as described above since it is necessary to identify the calling time of each instance of the communication/synchronization calls. In this case, the loop body is partitioned into blocks without communication/synchronization and the communication/synchronization calls themselves. The interpretation function for the entire AAU is then defined as a recursive equation such that the execution time of the current iteration is a function of the execution time of the previous iteration. Similarly, the calling and execution times of the communication/synchronization calls are also defined recursively. For example, consider a loop body that contains two communication calls (Comm1 & Comm2). Let Blk1 represent the block before Comm1 and Blk2 represent the block between Comm1 and Comm2. If the loop starts execution at time $T$, the calling times ($T^{Call}$) for the first iteration are:

\[ T^{Call}_{IterSync}(1) = T \]
\[ T^{Call}_{Comm1}(1) = T^{Call}_{IterSync}(1) + T^{Ovhd}_{IterSync} + T^{Exec}_{Blk1} \]
\[ T^{Call}_{Comm2}(1) = T^{Call}_{IterSync}(1) + T^{Ovhd}_{IterSync} + T^{Exec}_{Blk1} + T^{Exec}_{Comm1}(1) + T^{Exec}_{Blk2} \]

And for the $i$th iteration:

\[ T^{Call}_{IterSync}(i) = T^{Call}_{IterSync}(i-1) + T^{Ovhd}_{IterSync} + T^{Exec}_{Blk1} + T^{Exec}_{Comm1}(i-1) + T^{Exec}_{Blk2} + T^{Exec}_{Comm2}(i-1) \]
\[ T^{Call}_{Comm1}(i) = T^{Call}_{IterSync}(i) + T^{Ovhd}_{IterSync} + T^{Exec}_{Blk1} \]
\[ T^{Call}_{Comm2}(i) = T^{Call}_{IterSync}(i) + T^{Ovhd}_{IterSync} + T^{Exec}_{Blk1} + T^{Exec}_{Comm1}(i) + T^{Exec}_{Blk2} \]

The final case is a non-deterministic iterative structure (IterND) where the number of iterations or the execution of the loop body is not known. For example, the number of iterations may depend on the execution of the loop body as in the while loop, or the execution of the loop body may vary from iteration to iteration. In this case performance is predicted by unrolling the iterations using functional interpretation and interpreting the performance of each iteration sequentially.

Modeling of Conditional Flow-Control Structures: The execution time for a conditional flow control structure is broken down into three components: (1) the overhead associated with each condition tested (i.e. every "if", "elseif", etc.), (2) an additional overhead for the branch associated with a true condition, and (3) the time required to execute the body associated with the true condition. The interpretation function for the conditional AAU is a weighted sum of the interpreted performances of each of its branches; the weights evaluate to 1 or 0 during interpretation depending on whether the branch is taken or not. Functional interpretation is used to resolve the execution flow. Modeling of CondtD and CondtSync AAU's is similar to the corresponding iterative AAU's described above.

Modeling Access to the Memory Hierarchy: Access to the memory hierarchy of a computing element is modeled using heuristics based on the access patterns in the application description and the physical structure of the hierarchy.
In the current implementation, application access patterns are approximated during interpretation by maintaining an access count and a detected miss count at the program level and by associating with each program variable a local access count, the last access offset (in case of arrays), and the values of both program level counters at the last access. A simple heuristic model uses these counts and the size of the cache block, its associativity and the replacement algorithm, to estimate cache misses for each AAU. This model is computationally efficient and provides the required accuracy, as can be seen from the results presented in Section 5.

Modeling Communication-Computation Overlaps: Overlap between communication and computation is accounted for during interpretation as a fraction of the communication cost; i.e. if a communication takes time $t_{comm}$ and $f_{overlap}$ is the fraction of this time overlapped with computation, then the execution time of the Comm AAU is weighted by the factor $(1 - f_{overlap})$; i.e.

\[ t^{AAU}_{Comm} = (1 - f_{overlap}) \times t_{comm} \]
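As a small illustration of this weighting, the following Python sketch evaluates the Comm AAU time for several overlap fractions. The function name and the sample values are hypothetical and chosen only for illustration; the range check on the fraction is an added assumption.

```python
def comm_aau_time(t_comm: float, f_overlap: float) -> float:
    """Execution time charged to a Comm AAU after accounting for overlap.

    t_comm    -- total time taken by the communication
    f_overlap -- fraction of t_comm overlapped with computation, in [0, 1]
    """
    if not 0.0 <= f_overlap <= 1.0:
        raise ValueError("f_overlap must lie between 0 and 1")
    return (1.0 - f_overlap) * t_comm


# Sweep a few overlap fractions for one (illustrative) communication cost.
t_comm = 200.0
sweep = {f: comm_aau_time(t_comm, f) for f in (0.0, 0.25, 0.5, 1.0)}
```

With no overlap the full communication cost is charged, and with complete overlap the Comm AAU contributes nothing to the execution time.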
The $f_{overlap}$ factor could be a typical (or explicitly defined) value defined for the system. Alternatively, the user can define this factor for the particular application or could experiment with different values.

Supporting User Experimentation: The interpretation engine provides support for two types of user experimentation:

- Experimentation with run-time situations, e.g. computation and communication loads.
- Experimentation with system parameters, e.g. processing capability, memory size, communication channel bandwidth.

The effect of each experiment on application performance is modeled by abstracting its effect on the parameters exported by the system and application modules and setting their values accordingly. Heuristics are used to perform this abstraction. For example, the effect of increased network load on a particular communication channel is modeled by decreasing the effective available bandwidth on that channel. An appropriate scaling factor is then defined which is used to scale the parameters exported by the C/S component associated with the communication channel. Similarly, doubling the bandwidth effectively decreases the transmission time over the channel, while increasing the cache size will be reflected in the miss rate.

2.4 Output Module

The output module provides an interactive interface through which the user can access estimated performance statistics. The user has the option of selecting the type of information and the level at which the information is to be displayed. Available information includes cumulative execution times, the communication time/computation time breakup, existing overheads and wait times. This information can be obtained for an individual AAU, cumulatively for a branch of the AAG (i.e. sub-AAG), or for the entire AAG.

3 An Overview of HPF/Fortran 90D

High Performance Fortran (HPF) [2] is based on the research language Fortran 90D developed jointly by Syracuse University and Rice University and has the overriding goal to produce a dialect of Fortran that can be used on a variety of parallel machines, providing portable, high-level expression to data parallel algorithms. The idea behind HPF (and Fortran 90D) is to develop a minimal set of extensions to Fortran 90 to support the data parallel programming model. The incorporated extensions provide a means for explicit expression of parallelism and data mapping. These extensions include compiler directives which are used to advise the compiler on how data objects should be assigned to processor memories, and new language features like the forall statement and construct.
HPF/Fortran 90D adopts a two level mapping using the PROCESSORS, ALIGN, DISTRIBUTE, and TEMPLATE directives to map data objects to abstract processors. The data objects (typically array elements) are first aligned with an abstract index space called a template. The template is then distributed onto a rectilinear arrangement of abstract processors. The mapping of abstract processors to physical processors is implementation dependent. Data objects not explicitly distributed are mapped according to an implementation dependent default distribution (e.g. replication). Supported distribution types include BLOCK and CYCLIC. Use of the directives is shown in Figure 6.

Figure 6: HPF/Fortran 90D Directives
[Figure: the 8 x 8 template TMPL with arrays A and B aligned to it and distributed (*,BLOCK) over PROC 1 to PROC 4, as specified by the directives below.]

        REAL, ARRAY(5,4) :: A
        REAL, ARRAY(5,6) :: B
  CHPF$ PROCESSORS PROC(4)
  CHPF$ TEMPLATE TMPL(8,8)
  CHPF$ ALIGN A(I,J) WITH TMPL(I,J)
  CHPF$ ALIGN B(I,J) WITH TMPL(I+3,J+2)
  CHPF$ DISTRIBUTE TMPL(*,BLOCK)

Our current implementation of the HPF/Fortran 90D compiler and performance prediction framework supports a formally defined subset of HPF. The term HPF/Fortran 90D in the rest of this document refers to this subset.

4 Design of the HPF/Fortran 90D Performance Prediction Framework

The HPF/Fortran 90D performance prediction framework is based on the HPF source-to-source compiler technology [3] which translates HPF into loosely synchronous, SPMD (single program, multiple data) Fortran 77 + Message-Passing codes. It uses this technology in conjunction with the performance interpretation model to provide performance estimates for HPF/Fortran 90D applications on a distributed memory MIMD multicomputer. HPF/Fortran 90D performance prediction is performed in two phases: Phase 1 uses HPF compilation technology to produce an SPMD program structure consisting of Fortran 77 plus calls to run-time routines. Phase 2 then uses the interpretation approach to abstract and interpret the performance of the application. These two phases are described below:
4.1 Phase 1 - Compilation

The compilation phase uses the same front-end as the HPF/Fortran 90D compiler. Given a syntactically correct HPF/Fortran 90D program, phase 1 parses the program to generate a parse tree and transforms array assignment and where statements to equivalent forall statements. Compiler directives are used to partition the data and computation among the processors, and parallel constructs in the program are converted into loops or nested loops. Required communications are identified and appropriate communication calls are inserted. The output of this phase is a loosely synchronous SPMD program structure consisting of alternating phases of local computation and global communication.

4.2 Phase 2 - Interpretation

Phase 2 is implemented as a sequence of parses: (1) The abstraction parse generates the application abstraction graph (AAG) and synchronized application abstraction graph (SAAG); (2) The interpretation parse performs the actual interpretation using the interpretation algorithm; and (3) The output parse generates the required performance metrics.

Abstraction Parse: The abstraction parse intercepts the SPMD program structure produced in phase 1 and abstracts its execution and communication structures to generate the corresponding AAG and SAAG (as defined in Section 2). A communication table (global communication structure) is generated to store the specifications and status of each communication/synchronization.

The compiler symbol table is extended in this parse by tagging all variables that are critical (a critical variable being defined as a variable whose value affects the flow of execution, e.g. a loop limit). Critical variables are then resolved using functional interpretation by tracing their definition paths. If this is not possible, or if they are external inputs, the user is prompted for their values.
If a critical variable is defined within an iterative structure, the user has the option of either explicitly defining the value of that variable or instructing the system to unroll the loop so as to compute its value. Access information required to model accesses to the memory hierarchy is abstracted from the input program structure in this parse and stored in the extended symbol table.

The final task of the abstraction parse is the clustering of consecutive Seq AAU's into a single AAU. The granularity of clustering can be specified by the user; the tradeoff here is estimation time versus estimation accuracy. At the finest level, each Seq AAU abstracts a single statement of the application description.

Interpretation Parse: The interpretation parse performs the actual performance interpretation using the interpretation model described above. For each AAU in the SAAG, the corresponding interpretation function is used to generate performance measures associated with it. Metrics maintained at each AAU are its computation, communication and overhead times, and the value of the global clock. In addition, metrics specific to each AAU type (e.g. wait and transmission times for a Comm AAU) are also maintained. Cumulative metrics are maintained for the entire SAAG, and for each compound AAU. The interpretation parse has provisions to take into consideration a set of system compiler optimizations (for the generated Fortran 77 + Message Passing code) such as loop re-ordering and inline expansion. These can be turned on or off by the user.

Output Parse: The final parse communicates estimated performance metrics to the user. The output interface provides three types of outputs. The first type is a generic performance profile of the entire application broken up into its communication, computation and overhead components. Similar measures for each individual AAU and for sub-graphs of the AAG are also available. The second form of output allows the user to query the system for the metrics associated with a particular line (or a set of lines) of the application description. Finally, the system can generate an interpretation trace which can be used as input to a performance visualization package. The user can then use the capabilities provided by the package to analyze the performance of the application.

4.3 Abstraction & Interpretation HPF/Fortran 90D Parallel Constructs

The abstraction/interpretation of the HPF/Fortran 90D parallel constructs, i.e. forall, array assignment and where, is described below:

forall Statement: The forall statement generalizes array assignments to handle new shapes of arrays by specifying them in terms of array elements or sections. The element array may be masked with a scalar logical expression. Its semantics are an assignment to each element or section (for which the mask expression evaluates true) with all the right-hand sides being evaluated before any left-hand sides are assigned. The order of iteration over the elements is not fixed.
Examples of its use are:

    forall (I = 1:N, J = 1:N) P(I,J) = Q(I-1,J-1)
    forall (I = 1:N, J = 1:N, Q(I,J) .NE. 0.0) P(I,J) = 1.0/Q(I,J)

Phase 1 translates the forall statement into a three-level structure consisting of a collective communication level, a local computation level and another collective communication level, to be executed by each processor. This three-level structure is based on the "owner computes" rule. The processor that is assigned an iteration of the forall loop is responsible for computing the right-hand-side expression of the assignment statement, while the processors that own an array element used in the left-hand side or right-hand side of the assignment statement must communicate that element to the processor performing the computation. Consequently, the first communication level fetches the off-processor data required by the computation level. Once this data has been gathered, computations are local. The final communication level writes the calculated values back to off-processor locations.
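The three-level structure described above can be illustrated with a small simulation. The sketch below is not the code generated by the compiler; it is a minimal Python model, assuming a one-dimensional BLOCK distribution and the statement forall (I = 2:N) P(I) = Q(I-1) written with 0-based indices. The helper names block_owner and forall_shift are hypothetical and exist only for this illustration.

```python
def block_owner(i, n, nprocs):
    """Owner of global index i under an assumed BLOCK distribution of n elements."""
    block = -(-n // nprocs)  # ceiling division: elements per processor
    return i // block


def forall_shift(q, nprocs):
    """Simulate forall (I = 2:N) P(I) = Q(I-1), using 0-based indices.

    Returns the result array and, per simulated processor, the number of
    off-processor elements fetched in the first communication level.
    """
    n = len(q)
    p = [0.0] * n
    fetched = [0] * nprocs
    for proc in range(nprocs):
        # Owner computes: this processor handles the iterations whose
        # left-hand-side element P(i) it owns.
        mine = [i for i in range(1, n) if block_owner(i, n, nprocs) == proc]

        # Level 1: collective communication, gather off-processor operands.
        ghost = {}
        for i in mine:
            src = i - 1
            if block_owner(src, n, nprocs) != proc:
                ghost[src] = q[src]
        fetched[proc] = len(ghost)

        # Level 2: local computation, using only local or gathered data.
        for i in mine:
            src = i - 1
            p[i] = ghost[src] if src in ghost else q[src]

        # Level 3: no write-back is needed here, since every P(i)
        # assigned by this processor is local to it.
    return p, fetched
```

With 8 elements on 4 simulated processors, every processor except the first fetches exactly one boundary element, which is the kind of count the first communication level has to account for.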
    HPF/Fortran 90D source:

        forall (K = 2:N-1, V(K) .GT. 0) X(K+1) = X(K) + X(K-1)

    Phase 1 (SPMD code)                Phase 2 (AAUs)
    ADJUST_BOUNDS()                    Seq
    PACK_PARAMETERS()
    GATHER_DATA(G)                     Comm
    DO K = LocalLB, LocalUB            IterD
      IF (V(K) .GT. 0) THEN            CondtD
        X(K+1) = X(K) + G(K)           Seq
      END IF
    END DO

Figure 7: Abstraction of the forall Statement

Phase 2 then generates a corresponding sub-AAG using the application abstraction model. The communication level translates into a Seq AAU, corresponding to the index translations and message packing performed, and a Comm/Sync AAU. The computation level generates an iterative AAU (IterD/IterND/IterSync), which may contain a conditional AAU (CondtD/CondtSync) depending on whether a mask is specified. The abstraction of the forall statement is shown in Figure 7. In this example, the final communication phase is not required as no off-processor data needs to be written.

Array Assignment Statements: HPF/Fortran 90D array assignment statements allow entire arrays (or array sections) to be manipulated atomically, thereby enhancing the clarity and conciseness of the program and making parallelism explicit. Array assignments are special cases of the forall statement and are abstracted by first translating them into equivalent forall statements. The resultant forall statement is then interpreted as described above. The translation is illustrated by the following example:

    A(l1:u1:s1) = B(l2:u2:s2)

translates to:

    forall (i = l1:u1:s1) A(i) = B(l2 + ((i - l1)/s1) * s2)

where Statement: Like the array assignment statement, the HPF/Fortran 90D where statement is also a special case of the forall statement and is handled in a similar way. The translation of the where statement into an equivalent forall is illustrated below:

    where (C(l3:u3:s3) .NE. 0.0) A(l1:u1:s1) = B(l2:u2:s2)

translates to:

    forall (i = l1:u1:s1, C(l3 + ((i - l1)/s1) * s3) .NE. 0.0) A(i) = B(l2 + ((i - l1)/s1) * s2)
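The index mapping used in these translations can be checked with a short sketch. The Python function below is only an illustration of the arithmetic, assuming positive strides, 0-based indexing and an inclusive upper bound u1; the function name and its mask parameters are hypothetical. The k-th element of the left-hand section, at index i = l1 + k*s1, is paired with B(l2 + k*s2) and, when a mask array is given, guarded by C(l3 + k*s3) being non-zero.

```python
def assign_section_as_forall(a, b, l1, u1, s1, l2, s2, mask=None, l3=0, s3=1):
    """Element-wise form of A(l1:u1:s1) = B(l2::s2), optionally masked by C.

    All right-hand sides are evaluated before any element of `a` is
    written, mirroring the forall semantics described in the text.
    """
    updates = []
    for i in range(l1, u1 + 1, s1):
        k = (i - l1) // s1  # position of i within the left-hand section
        if mask is None or mask[l3 + k * s3] != 0.0:
            updates.append((i, b[l2 + k * s2]))
    for i, value in updates:
        a[i] = value
    return a
```

For an unmasked assignment, the result agrees with Python's own strided slice assignment, e.g. `a[1:8:2] = b[3:16:4]` for l1 = 1, u1 = 7, s1 = 2, l2 = 3, s2 = 4.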
Processing Component        Memory Component                  Comm/Sync Component
Arithmetic Opers            Memory Org                        Specifications
  Integer/Float Add/Sub       Cache Size                        Topology
  Integer/Float Multiply      Cache Block Size                  Routing Scheme
  Integer Divide              Cache Replication Policy          Static Buffer Size
  Float Divide                Cache Associativity             Comm - Static Buffers
  Conv: Integer->Float        Cache Write Policy                Startup Ovhd
  Conv: Float->Int            Main Memory Size                  Transmission Time/byte
  ......                      Main Memory Page Size             Per Hop Ovhd
Iterative Opers               Instruction Cache Size            Receive Ovhd
  Per-Iteration Loop Ovhd     Instruction Cache Block Size    Comm - Dynamic Buffers
  Step Limit Ovhd           Memory Hierarchy                    Startup Ovhd
  ......                      Fetch/Fetch Miss Clks             Transmission Time/byte
Conditional Opers             Store/Store Miss Clks             Per Hop Ovhd
  Condition Ovhd            Main Memory                         Receive Ovhd
  Branch Taken Ovhd           Main Memory Fetch (pipelined)   Synchronization
  ......                      Main Memory Store (pipelined)     Sync Ovhd
Call Opers                  Access Ovhds                      Group Communication
  Call Ovhd                   TLB Miss                          Broadcast Algorithm
Lib Chars                     Read/Write Switch                 Multicast Algorithm
  abs()                       ......                            Reduction
  exp()                                                         Shifts
  ......                                                        ......

Table 2: Abstraction of the iPSC/860 System

4.4 Abstraction of the iPSC/860 System

Abstraction of the iPSC/860 hypercube system to generate the corresponding SAG was performed off-line using a combination of assembly instruction counts, measured timings and system specifications. The processing and memory components were generated using the system specification provided by the vendor, while iterative and conditional overheads were computed using instruction counts. The communication component was parameterized using benchmarking runs. These parameters abstracted both the low-level primitives and the high-level collective communication library used by the compiler. Benchmarking runs were also used to parameterize the HPF parallel intrinsic library. The intrinsics included circular shift (cshift), shift to temporary (tshift), global sum operation (sum), global product operation (product), and the maxloc operation, which returns the location of the maximum in a distributed array. Some of the parameters exported by each component of the i860 cube are summarized in Table 2. Sample values for these parameters can be found in [1]. Characterization of the SRM (System Resource Manager) and the communication channel connecting the SRM to the i860 cube was performed in a similar manner.
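To make the role of the Comm/Sync parameters in Table 2 concrete, the following sketch combines them into a point-to-point message time. The paper does not give this formula: the linear cost model, the rule that messages fitting in the static buffer use the static-buffer parameter set, and all names (CommParams, message_time_us) are assumptions made for illustration. Only the hop count is not an assumption of the sketch, since the minimum number of hops between two hypercube nodes is the Hamming distance of their node numbers.

```python
from dataclasses import dataclass


@dataclass
class CommParams:
    """One parameter set from the Comm/Sync column of Table 2 (microseconds)."""
    startup_us: float
    per_byte_us: float
    per_hop_us: float
    receive_us: float


def hypercube_hops(src, dst):
    """Minimum hop count between two hypercube nodes (Hamming distance)."""
    return bin(src ^ dst).count("1")


def message_time_us(nbytes, src, dst, static, dynamic, static_buffer_bytes):
    """Assumed linear cost: startup + per-byte + per-hop + receive overhead."""
    # Assumption: messages that fit in the static buffer use the
    # static-buffer parameters, larger ones use the dynamic-buffer set.
    params = static if nbytes <= static_buffer_bytes else dynamic
    return (params.startup_us
            + params.per_byte_us * nbytes
            + params.per_hop_us * hypercube_hops(src, dst)
            + params.receive_us)
```

Any values passed to these functions would have to come from benchmarking runs of the kind described above; the sketch itself carries no measured numbers.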
5 Validation/Evaluation of the Interpretation Framework

In this section we present numerical results obtained using the current implementation of the HPF/Fortran 90D performance prediction framework. In addition to validating the viability of the interpretive approach, this section has the following objectives:

1. To validate the accuracy of the performance prediction framework for applications on a high performance computing system. The aim is to show that the predicted performance metrics are accurate enough to provide realistic information about the application performance and to be used as a basis for design tuning.
2. To demonstrate the application of the framework and the metrics generated to HPC application development. The results presented illustrate the utility of the framework for the following:
   - Application design and directive selection.
   - Experimentation with system and run-time parameters.
   - Application performance debugging.
3. To demonstrate the usability (ease of use) of the performance interpretation framework and its cost-effectiveness.

The high performance computing system used for the validation is an iPSC/860 hypercube connected to an 80386-based host processor. The particular configuration of the iPSC/860 consists of eight i860 nodes. Each node has a 4 KByte instruction cache, an 8 KByte data cache and 8 MBytes of main memory. Each node operates at a clock speed of 40 MHz and has a theoretical peak performance of 80 MFlop/s for single precision and 40 MFlop/s for double precision. The validation application set was selected from the NPAC (Northeast Parallel Architectures Center) HPF/Fortran 90D Benchmark Suite. The suite consists of a set of benchmarking kernels and "real-life" applications and is designed to evaluate the efficiency of the HPF/Fortran 90D compiler and, specifically, automatic data-mapping schemes. The selected application set includes kernels from standard benchmark sets like the Livermore Fortran Kernels and the Purdue Benchmark Set, as well as real computational problems. The applications are listed in Table 3.

Name      Description
LFK 1     Hydro Fragment
LFK 2     ICCG Excerpt (Incomplete Cholesky - Conjugate Gradient)
LFK 3     Inner Product
LFK 9     Integrate Predictors
LFK 14    1-D PIC (Particle In Cell)
LFK 22    Planckian Distribution
PBS 1     Trapezoidal rule estimate of an integral of f(x)
PBS 2     Compute the value of $e = \sum_{i=1}^{n} \prod_{j=1}^{m} \left(1 + \frac{0.5}{|i-j| + 0.001}\right)$
PBS 3     Compute the value of $S = \sum_{i=1}^{n} \prod_{j=1}^{m} a_{ij}$
PBS 4     Compute the value of $R = \sum_{i=1}^{n} \frac{1}{x_i}$
PI        Approximation of PI by calculating the area under the curve using the n-point quadrature rule
N-Body    Newtonian gravitational n-body simulation
Finance   Parallel stock option pricing model
Laplace   Laplace solver based on Jacobi iterations

LFK = Livermore Fortran Kernel; PBS = Purdue Benchmarking Set

Table 3: Validation Application Set

5.1 Validating Accuracy of the Framework

Accuracy of the interpretive performance prediction framework is validated by comparing estimated execution times with actual measured times. For each application, the experiment consisted of varying the problem size and the number of processing elements used. Measured timings represent an average taken over 1000 runs. The results obtained are summarized in Table 4. Error values listed (and plotted) are percentages of the measured time and represent maximum/minimum absolute errors over all problem sizes and system sizes. For example, the N-Body computation was performed for 16 to 4096 bodies on 1, 2, 4, and 8 nodes of the iPSC/860. The minimum absolute error between estimated and measured times was 0.09% of the measured time, while the maximum absolute error was 5.9%. Plots of estimated and measured execution times are included in Appendix A.

The results show that in the worst case the interpreted performance is within 20% of the measured value, the best case error being less than 0.001%. The larger errors are produced by the benchmark kernels, which have been specifically coded to task the compiler. Further, it was found that the interpreted performance typically lies within the variance of the measured times over the 1000 runs. This indicates that the main contributors to the error are the tolerance of the timing routines and fluctuations in the system load. The objective of the predicted metrics is to serve either as a first-cut performance estimate of an application or as a relative performance measure to be used as a basis for design tuning.
In either case, the interpreted performance is accurate enough to provide the required information.

Name                Problem Sizes      System Size   Min Abs Error   Max Abs Error
                    (data elements)    (# procs)     (%)             (%)
LFK 1               128 - 4096         1 - 8         1.3%            10.2%
LFK 2               128 - 4096         1 - 8         2.5%            18.6%
LFK 3               128 - 4096         1 - 8         0.7%            7.2%
LFK 9               128 - 4096         1 - 8         0.3%            13.7%
LFK 14              128 - 4096         1 - 8         0.3%            13.8%
LFK 22              128 - 4096         1 - 8         1.4%            3.9%
PBS 1               128 - 4096         1 - 8         0.05%           7.9%
PBS 2               256 - 65536        1 - 8         0.6%            6.7%
PBS 3               256 - 65536        1 - 8         0.8%            9.5%
PBS 4               128 - 4096         1 - 8         0.2%            3.9%
PI                  128 - 4096         1 - 8         0.00%           5.9%
N-Body              16 - 4096          1 - 8         0.09%           5.9%
Finance             32 - 512           1 - 8         1.1%            4.6%
Laplace (Blk,Blk)   16 - 256           1 - 8         0.2%            4.4%
Laplace (Blk,*)     16 - 256           1 - 8         0.6%            4.9%
Laplace (*,Blk)     16 - 256           1 - 8         0.1%            2.8%

Table 4: Accuracy of the Performance Prediction Framework

(BLOCK,BLOCK)                   (BLOCK,*)                     (*,BLOCK)
PROCESSORS PRC(2,2)             PROCESSORS PRC(4)             PROCESSORS PRC(4)
TEMPLATE TEMP(N,N)              TEMPLATE TEMP(N)              TEMPLATE TEMP(N)
DISTRIBUTE TEMP(BLOCK,BLOCK)    DISTRIBUTE TEMP(BLOCK)        DISTRIBUTE TEMP(BLOCK)
ALIGN A(i,j) with TEMP(i,j)     ALIGN A(i,*) with TEMP(i)     ALIGN A(*,j) with TEMP(j)
ALIGN B(i,j) with TEMP(i,j)     ALIGN B(i,*) with TEMP(i)     ALIGN B(*,j) with TEMP(j)
ALIGN C(i,j) with TEMP(i,j)     ALIGN C(i,*) with TEMP(i)     ALIGN C(*,j) with TEMP(j)
ALIGN D(i,j) with TEMP(i,j)     ALIGN D(i,*) with TEMP(i)     ALIGN D(*,j) with TEMP(j)

Table 5: Possible Distributions for the Laplace Solver Application

5.2 Application of the Interpretive Framework to HPC Application Development

The application of the interpretive performance prediction framework to HPC application development is illustrated by validating its utility for the following: (1) selection of appropriate HPF/Fortran 90D directives based on the predicted performance, (2) experimentation with larger system configurations, varying system parameters, and with different run-time scenarios, and (3) analyzing different components of the execution time and their distributions with respect to the application. The experiments performed are described below:

5.2.1 Appropriate Directive Selection

To demonstrate the utility of the interpretive framework in selecting HPF compiler directives, we compare the performance of the Laplace solver for 3 different distributions (DISTRIBUTE directive) of the template, namely (BLOCK,BLOCK), (BLOCK,*) and (*,BLOCK), and corresponding alignments (ALIGN directive) of the data elements to the template.
These three distributions (on 4 processors) are shown in Figure 8 and the corresponding HPF/Fortran 90D descriptions are listed in Table 5. Figures 9-12 compare the performance of each of the three cases for different system sizes using both measured times and estimated times. These graphs can be used to select the best directives for a particular problem size and system configuration. For the Laplace solver, the (BLOCK,*) distribution is the appropriate choice. Further, since the maximum absolute error between the estimated and measured times is less than 5% (Table 4), the directive selection can be accurately made using the interpretive framework. Using the interpretive framework is also significantly more cost-effective, as will be demonstrated in Section 5.3.

Figure 8: Laplace Solver - Data Distributions (panels (Block,Block), (Block,*) and (*,Block), each showing the array partitioned among processors P1-P4)

Figure 9: Laplace Solver (1 Proc) - Estimated/Measured Times (execution time in sec versus problem size; (Blk,Blk) on a 1x1 processor grid, (Blk,*) and (*,Blk) on 1 processor)

Figure 10: Laplace Solver (2 Procs) - Estimated/Measured Times (execution time in sec versus problem size; (Blk,Blk) on a 1x2 processor grid, (Blk,*) and (*,Blk) on 2 processors)

Figure 11: Laplace Solver (4 Procs) - Estimated/Measured Times (execution time in sec versus problem size; (Blk,Blk) on a 2x2 processor grid, (Blk,*) and (*,Blk) on 4 processors)

Figure 12: Laplace Solver (8 Procs) - Estimated/Measured Times (execution time in sec versus problem size; (Blk,Blk) on a 2x4 processor grid, (Blk,*) and (*,Blk) on 8 processors)

In the above experiment, performance interpretation was source driven and can be automated. This exposes the utility of the framework as a basis for an intelligent compiler capable of selecting appropriate directives and data decompositions. Similarly, it can also enable such a compiler to select code optimizations such as the granularity of the computation phase per communication phase in the loosely synchronous computation model.

5.2.2 Experimentation with System/Run-Time Parameters

Results presented in this section demonstrate the use of the interpretive framework for evaluating the effects of different system and run-time parameters on the application performance. The following experiments were conducted:

Effect of Varying Processor Speed: In this experiment we evaluate the effect of increasing/decreasing the speed of each processor in the iPSC/860 system on application performance. The results are shown in Figure 13 for speeds 2 times (100% processor speed increase), 3 times (200% processor speed increase), and 4 times (300% processor speed increase) the i860 processor speed. Such an evaluation enables the developer to visualize how the application will perform on a faster (prospective) machine or, alternatively, if it has to be run on a slower processor. It can also be used to evaluate the benefits of upgrading to a faster processor system.

Figure 13: Effect of Increasing Processor Speed on Performance (LFK 9 - Integrate Predictors, size 8192; estimated execution time in sec versus processor speed increase in %, for 1, 2, 4, 8 and 16 processors)
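The kind of what-if question posed in the processor-speed experiment can be sketched as follows. The paper does not describe how the framework rescales its parameters, so this is a simplified model under a stated assumption: only the computation component of the interpreted time scales with processor speed, while communication and overhead times are left unchanged. The function name is hypothetical.

```python
def time_at_processor_speed(comp, comm, ovhd, speed_increase_pct):
    """Estimated time after changing processor speed by the given percentage.

    Assumption: computation time scales inversely with processor speed;
    communication and overhead times are unaffected.
    """
    factor = 1.0 + speed_increase_pct / 100.0
    if factor <= 0.0:
        raise ValueError("speed change must leave a positive processor speed")
    return comp / factor + comm + ovhd
```

Under this assumption the benefit of a faster processor shrinks as the communication share grows: a 300% speed increase turns 80 units of computation plus 20 units of communication into 40 units in total, a 2.5x gain rather than 4x.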
Figure 14: Effect of Increasing Network Bandwidth on Performance (N-Body computation, size 4096; estimated execution time in sec versus network bandwidth increase in %, for 2, 4, 8 and 16 processors)

Figure 15: Effect of Varying Network Bandwidth on Performance (% Change in Execution Time) (N-Body computation, size 4096; execution time increase in % versus network bandwidth increase in %, for 2 and 16 processors)

Effect of Varying Interconnection Bandwidth: The effect of varying the interconnect bandwidth on the application performance is shown in Figure 14. The increase/decrease in application execution times is greater for larger processor configurations, as illustrated in Figure 15 (negative percentages indicate a decrease in execution time or network bandwidth).

Effect of Varying Network Load: Figure 16 shows the interpreted effects of network load on application performance. It can be seen that the performance deteriorates rapidly as the network gets saturated. Further, the effect of network load is more pronounced for larger system configurations, as illustrated in Figure 17.

Experimentation with Larger System Configurations: In this experiment we evaluate system configurations larger than those physically available (i.e. 16 & 32 processors). The results are shown in Figures 18 & 19. It can be seen that the first application (Approximation of PI) scales well with an increased number of processors, while in the second application (Parallel Stock Option Pricing), larger configurations are beneficial only for larger problem sizes.

The ability to experiment with system parameters not only allows the user to evaluate the application characteristics, but also enables the evaluation of new and different system configurations. This exposes the potential of the framework as a design evaluation tool for system architects. Experimentation with run-time parameters enables the developer to test the robustness of the design and to modify it to account for different run-time scenarios.
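One plausible reading of why bandwidth changes affect larger configurations more strongly is that communication makes up a larger share of their execution time. The paper does not state this explicitly, so the sketch below is only an illustration under that assumption. It further assumes that just the byte-transfer part of the communication time scales inversely with bandwidth, while computation and per-message latency stay fixed. The function name is hypothetical.

```python
def pct_change_in_time(comp, comm_latency, comm_transfer, bandwidth_increase_pct):
    """Percentage change in execution time for a percentage change in bandwidth.

    Assumption: only the transfer component of communication time scales
    inversely with bandwidth. Negative results indicate a decrease.
    """
    factor = 1.0 + bandwidth_increase_pct / 100.0
    if factor <= 0.0:
        raise ValueError("bandwidth change must leave a positive bandwidth")
    base = comp + comm_latency + comm_transfer
    new = comp + comm_latency + comm_transfer / factor
    return 100.0 * (new - base) / base
```

With transfer time at 40% of the total, doubling the bandwidth cuts execution time by 20%; with transfer time at 80%, the same change cuts it by 40%.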
Figure 16: Effect of Increasing Network Load on Performance (N-Body computation, size 4096; estimated execution time in sec versus network load in %, for 2, 4, 8 and 16 processors)

Figure 17: Effect of Varying Network Load on Performance (% Change in Execution Time) (N-Body computation, size 4096; execution time increase in % versus network load in %, for 2 and 16 processors)

Figure 18: Experimentation with Larger System Configurations - Approximation of PI (estimated execution time in sec versus problem size, for 16 and 32 processors)

Figure 19: Experimentation with Larger System Configurations - Financial Model (Parallel Stock Option Pricing; estimated execution time in sec versus problem size, for 16 and 32 processors)

5.2.3 Application Performance Debugging

The metrics generated by the interpretive framework can be used to analyze the performance contribution of different parts of the application description and to view the breakdown of their computation and communication times. This is illustrated below using two applications.
Figure 20: N-Body - Application Phases (Phase 1: REPEAT I=1:N/2 loop with CSHIFT 1 and COMPUTE; Phase 2: REPEAT I=1:N/2 loop with CSHIFT 1, and ACCUMULATE)

Figure 21: N-Body Computation - Interpreted Performance Profile (Procs = 4, Size = 1024; computation, communication and overhead times in usec for Phase 1 and Phase 2)

N-Body Computations: Figure 21 shows the performance profile for the two phases of the n-body application. Phase 1 (see Figure 20) represents the forward movement of data around the virtual processor ring, while Phase 2 represents the accumulation of force data at the original processors. For n processors, each phase requires n/2 circular shifts of the data; consequently, their communication profiles are similar. However, Phase 1 performs more computation as it computes the force interactions. Overhead time represents parallelization overheads. Similar profiles can be obtained at smaller granularities (down to a single line of code).

Parallel Stock Option Pricing: A performance profile for the parallel stock option pricing application is shown in Figure 23. This application has two phases, as shown in Figure 22. Phase 1 creates the (distributed) option price lattice, while Phase 2, which requires no communication, computes the call prices of stock options.

Figure 22: Financial Model - Application Phases (Phase 1: Create Stock Price Lattice (shift); Phase 2: Compute Call Price)

Figure 23: Financial Model - Interpreted Performance Profile (Stock Option Pricing, Procs = 4, Size = 256; computation, communication and overhead times in usec for Phase 1 and Phase 2)

Application performance debugging using conventional means involves instrumentation, execution and data collection, and post-processing of this data. Further, this process requires a running application and has to be repeated to evaluate each design. Using the interpretive framework, this information is available, at all levels required, during application development.

5.3 Validating Usability of the Interpretive Framework

The interpreted performance estimates for the experiments described above were obtained using the interpretive framework running on a Sparcstation 1+. The framework provides a friendly, menu-driven graphical user interface and requires no special hardware other than a conventional workstation and a windowing environment. Application characterization is performed automatically (unlike most approaches), while system abstraction is performed off-line and only once. Application parameters and directives were varied from within the interface itself. Typical experimentation on the iPSC/860 (to obtain measured execution times) consisted of editing code, compiling and linking using a cross compiler (compiling on the front end, or SRM, is not allowed, in order to reduce its load), transferring the executable to the iPSC/860 front end, loading it onto the i860 nodes and then finally running it. The process had to be repeated for each instance of each experiment. Relative experimentation times for the different implementations of the Laplace Solver (Section 5.2.1) using measurements and the performance interpreter are shown in Figure 24. Experimentation using the interpretive approach required approximately 10 minutes for each of the three implementations. Experimentation using measurements, however, took a minimum of 27 minutes (for the (Blk,*) implementation) and required almost 1 hour for the (*,Blk) case. Clearly, the measurement approach is not feasible, especially when a large number of options have to be evaluated. Further, the iPSC/860, being an expensive resource, is shared by various development groups in the organization. Consequently, its usage can be restrictive and the required configuration may not be immediately available. The comparison above validates the convenience and cost-effectiveness of the framework for experimentation during application development.
Figure 24: Experimentation Time - Laplace Solver (experimentation time in min for the (Blk,Blk), (Blk,*) and (*,Blk) implementations, using the interpreter and the iPSC/860)

6 Related Work

Existing approaches and models for performance prediction on multicomputer systems can be broadly classified as analytic, simulation, monitoring or hybrid (which make use of a combination of the above techniques along with possible heuristics and approximations).

A general approach for analytic performance prediction for shared memory systems has been proposed by Siewiorek et al. in [4], while probabilistic models for parallel programs based on queueing theory have been presented in [5]. An analytic performance prediction technique based on the approximation of parallel flow graphs by sequential flow graphs has been proposed by Qin et al. in [6]. The above approaches require users to explicitly model the application along with the entire system. A source based analytic performance prediction model for Dataparallel C has been developed by Clement et al. [7]. The approach uses a set of assumptions and specific characteristics of the language to develop a speedup equation for applications in terms of system costs.

A simulation based approach is used in the SiGLe system (Simulator at Global Level) [8], which provides special description languages to describe the architecture, the application and the mapping of the application onto the architecture.

An evaluation approach based on instrumentation, data collection and post-processing has been proposed by Darema et al. [9]. Balasundaram et al. [10] use "training routines" to benchmark the performance of the architecture and then use this information to evaluate different data decompositions.

The PPPT system [11] uses monitoring techniques to profile the execution of the application program on a single processor, and to derive sequential program parameters such as conditional branch probabilities, loop iteration counts, and frequency counts for each statement type. The user is required to provide a characteristic set of input data for this profiling run. The information obtained is then used by the static parameter based performance prediction tool to estimate performance information for the parallelized (SPMD) application program on a distributed memory system.

A hybrid approach is presented in [12], where the runtime of each node of a stochastic graph representing the application is modeled as a random variable. The distributions of these random variables are then obtained using hardware monitoring.

The layered approach presented in [13] uses a methodology based on application and system characterization. The developer is required to characterize the application as an execution graph and define its resource requirements in this system.

7 Conclusions and Future Work

Evaluation tools form a critical part of any software development environment as they enable the developer to evaluate the different design choices available at various stages of application development and to make the most appropriate selection. These tools, in symbiosis with other development tools, complete the feedback loop of the "develop-evaluate-tune" cycle.

In this paper, we described a novel interpretive approach for accurate and cost-effective performance prediction on high performance computing systems. A comprehensive characterization methodology is used to abstract the system and application components of the HPC environment into a set of well defined parameters. An interpreter engine then interprets the performance of the abstracted application in terms of the parameters exported by the abstracted system. A source-driven HPF/Fortran 90D performance prediction framework based on the interpretive approach has been implemented as part of the HPF/Fortran 90D integrated application development environment. The current implementation of the framework is targeted to the iPSC/860 hypercube multicomputer system.

Numerical results using benchmarking kernels and applications from the NPAC HPF/Fortran 90D Benchmark Suite were presented to validate the accuracy, utility, and usability of the performance prediction framework. The use of the framework for selecting appropriate compiler directives, for application performance debugging and for experimentation with run-time and system parameters was demonstrated.

We are currently working on developing an intelligent HPF/Fortran 90D compiler based on the source based interpretation model. This tool will enable the compiler to automatically evaluate directives and transformation choices and optimize the application at compile time. Future development of the framework will involve moving it to high performance distributed computing systems and exploiting its potential as a system design evaluation tool.
References

[1] Manish Parashar, Interpretive Performance Prediction for High Performance Parallel Computing, PhD thesis, Syracuse University, 121 Link Hall, Syracuse, NY 13244-1240, July 1994. Available via WWW at http://godel.ph.utexas.edu/Members/parashar/ESP/esp.html.
[2] High Performance Fortran Forum, High Performance Fortran Language Specification, Version 1.0, Jan. 1993. Also available as Technical Report CRPC-TR92225 from Center for Research on Parallel Computing, Rice University, Houston, TX 77251-1892.
[3] Zeki Bozkus, Alok Choudhary, Geoffrey Fox, Tomasz Haupt, and Sanjay Ranka, "Compiling HPF for Distributed Memory MIMD Computers", in David Lilja and Peter Bird, editors, Impact of Compilation Technology on Computer Architecture. Kluwer Academic Publishers, 1993.
[4] Dalibor F. Vrsalovic, Daniel P. Siewiorek, Zary Z. Segall, and Edward F. Gehringer, "Performance Prediction and Calibration for a Class of Multiprocessors", IEEE Transactions on Computers, 37(11):1353-1365, Nov. 1988.
[5] Philip Heidelberger and Kishor S. Trivedi, "Analytic Queueing Models for Programs with Internal Concurrency", IEEE Transactions on Computers, C-32(1):73-82, Jan. 1983.
[6] Reda A. Ammar and Bin Qin, "A Technique to Derive the Detailed Time Costs of Parallel Computations", Proceedings of the 12th Annual International Computer Software and Application Conference, pp. 113-119, 1988.
[7] Mark J. Clement and Michael J. Quinn, "Analytic Performance Prediction on Multicomputers", Technical report, Department of Computer Science, Oregon State University, Mar. 1993.
[8] F. Andre and A. Joubert, "SiGLe: An Evaluation Tool for Distributed Systems", Proceedings of the International Conference on Distributed Computing Systems, pp. 466-472, 1987.
[9] Frederica Darema, "Parallel Applications Performance Methodology", in Margaret Simmons, Rebecca Koskela, and Ingrid Bucher, editors, Instrumentation for Future Parallel Computing Systems, chapter 3, pp. 49-57. Addison-Wesley Publishing Company, 1988.
[10] Vasanth Balasundaram, Geoffrey Fox, Ken Kennedy, and Ulrich Kremer, "A Static Performance Estimator in the Fortran D Programming System", in Joel Saltz and Piyush Mehrotra, editors, Languages, Compilers and Run-Time Environments for Distributed Memory Machines, pp. 119-138. Elsevier Science Publishers B.V., 1992.
[11] Thomas Fahringer and Hans P. Zima, "A Static Parameter based Performance Prediction Tool for Parallel Programs", Proceedings of the 7th ACM International Conference on Supercomputing, Japan, July 1993.
[12] Franz Sötz, "A Method for Performance Prediction of Parallel Programs", in H. Burkhart, editor, Joint International Conference on Vector and Parallel Processing, Proceedings, Zurich, Switzerland, pp. 98-107. Springer, Berlin, LNCS 457, Sep. 1990.
[13] E. Papaefstathiou, D. J. Kerbyson, and G. R. Nudd, "A Layered Approach to Parallel Software Performance Prediction: A Case Study", Massively Parallel Processing Applications and Development, Delft, 1994.
A Accuracy of the Interpretation Framework

Estimated and measured execution times corresponding to the results summarized in Table 4 are plotted in Figures 25-37 below:
Figure 25: LFK 1 - Estimated/Measured Times. [Plot: LFK 1 - Hydro Fragment; x-axis: Problem Size (0 to 4096); y-axis: Execution Time (usec); estimated and measured times for 1, 2, 4, and 8 processors.]

Figure 26: LFK 2 - Estimated/Measured Times. [Plot: LFK 2 - ICCG Excerpt; x-axis: Problem Size (0 to 4096); y-axis: Execution Time (sec); estimated and measured times for 1, 2, 4, and 8 processors.]
Figure 27: LFK 3 - Estimated/Measured Times. [Plot: LFK 3 - Inner Product; x-axis: Problem Size (0 to 4096); y-axis: Execution Time (sec); estimated and measured times for 1, 2, 4, and 8 processors.]

Figure 28: LFK 9 - Estimated/Measured Times. [Plot: LFK 9 - Integrate Predictors; x-axis: Problem Size (0 to 4096); y-axis: Execution Time (sec); estimated and measured times for 1, 2, 4, and 8 processors.]
Figure 29: LFK 14 - Estimated/Measured Times. [Plot: LFK 14 - 1-D PIC; x-axis: Problem Size (0 to 4096); y-axis: Execution Time (sec); estimated and measured times for 1, 2, 4, and 8 processors.]

Figure 30: LFK 22 - Estimated/Measured Times. [Plot: LFK 22 - Planckian Distribution; x-axis: Problem Size (0 to 4096); y-axis: Execution Time (sec); estimated and measured times for 1, 2, 4, and 8 processors.]
Figure 31: PBS 1 - Estimated/Measured Times. [Plot: PBS 1 - Trapezoidal Rule; x-axis: Problem Size (0 to 4096); y-axis: Execution Time (sec); estimated and measured times for 1, 2, 4, and 8 processors.]

Figure 32: PBS 2 - Estimated/Measured Times. [Plot: PBS 2 - Computation of e*; x-axis: sqrt(Problem Size) (0 to 256); y-axis: Execution Time (sec); estimated and measured times for 1, 2, 4, and 8 processors.]
Figure 33: PBS 3 - Estimated/Measured Times. [Plot: PBS 3 - Computation of S; x-axis: sqrt(Problem Size) (0 to 256); y-axis: Execution Time (sec); estimated and measured times for 1, 2, 4, and 8 processors.]

Figure 34: PBS 4 - Estimated/Measured Times. [Plot: PBS 4 - Computation of R; x-axis: Problem Size (0 to 4096); y-axis: Execution Time (sec); estimated and measured times for 1, 2, 4, and 8 processors.]
Figure 35: PI - Estimated/Measured Times. [Plot: Approximation of PI; x-axis: Problem Size (0 to 4096); y-axis: Execution Time (sec); estimated and measured times for 1, 2, 4, and 8 processors.]

Figure 36: N-Body - Estimated/Measured Times. [Plot: N-Body Computation; x-axis: Problem Size (0 to 4096); y-axis: Execution Time (sec); estimated and measured times for 1, 2, 4, and 8 processors.]
Figure 37: Financial Model - Estimated/Measured Times. [Plot: Parallel Stock Option Pricing; x-axis: Problem Size (0 to 512); y-axis: Execution Time (sec); estimated and measured times for 1, 2, 4, and 8 processors.]
