Interpreting the Performance of HPF/Fortran 90D by Parashar, Manish et al.
Syracuse University 
SURFACE 
Northeast Parallel Architecture Center College of Engineering and Computer Science 
1994 
Interpreting the Performance of HPF/Fortran 90D 
Manish Parashar 
Syracuse University, Northeast Parallel Architectures Center, parashar@npac.syr.edu 
Salim Hariri 
Syracuse University 
Tomasz Haupt 
Syracuse University, haupt@npac.syr.edu 
Geoffrey C. Fox 
Syracuse University 
Recommended Citation 
Parashar, Manish; Hariri, Salim; Haupt, Tomasz; and Fox, Geoffrey C., "Interpreting the Performance of 
HPF/Fortran 90D" (1994). Northeast Parallel Architecture Center. 2. 
https://surface.syr.edu/npac/2 
This Working Paper is brought to you for free and open access by the College of Engineering and Computer 
Science at SURFACE. It has been accepted for inclusion in Northeast Parallel Architecture Center by an authorized 
administrator of SURFACE. For more information, please contact surface@syr.edu. 
Interpreting the Performance of HPF/Fortran 90D

Manish Parashar, Salim Hariri, Tomasz Haupt, and Geoffrey C. Fox
Northeast Parallel Architectures Center
Syracuse University
Syracuse, NY 13244-4100
{parashar,hariri,haupt,gcf}@npac.syr.edu

To be presented at Supercomputing '94, Washington DC

Abstract

In this paper we present a novel interpretive approach for accurate and cost-effective performance prediction in a high performance computing environment, and describe the design of a source-driven HPF/Fortran 90D performance prediction framework based on this approach. The performance prediction framework has been implemented as part of an HPF/Fortran 90D application development environment. A set of benchmarking kernels and application codes is used to validate the accuracy, utility, usability, and cost-effectiveness of the performance prediction framework. The use of the framework for selecting appropriate compiler directives and for application performance debugging is demonstrated.

Keywords: Performance prediction, HPF/Fortran 90D application development, System & Application characterization.

1 Introduction

Although currently available High Performance Computing (HPC) systems possess large computing capabilities, few existing applications are able to fully exploit this potential. The fact remains that the development of efficient application software capable of exploiting available computing potentials is non-trivial and is largely governed by the availability of sufficiently high-level languages, tools, and application development environments.

A key factor contributing to the complexity of parallel/distributed software development is the increased degrees of freedom that have to be resolved and tuned in such an environment. Typically, during the course of parallel/distributed software development, the developer is required to select between available algorithms for the particular application; between possible
hardware conguration and amongst possible decom-positions of the problem onto the selected hardwareconguration; between dierent communication andsynchronization strategies; and so on. The set of rea-sonable alternatives that have to be evaluated is verylarge and selecting the best alternative among theseis a formidable task. Consequently, evaluation toolsform a critical part of any software development envi-ronment.In this paper we present a novel interpretive ap-proach for accurate and cost-eective performanceprediction in a high performance computing envi-ronment, and describe the design of a source-drivenHPF1/Fortran 90D performance prediction frameworkbased on this approach. The interpretive approachdenes a comprehensive characterization methodologywhich abstracts system and application components ofthe HPC environment. Interpretation techniques arethen used to interpret performance of the abstractedapplication in terms of parameters exported by the ab-stracted system. System abstraction is performed o-line through a hierarchical decomposition of the com-puting system. Application abstraction is achievedautomatically at compile time. The performance pre-diction framework has been implemented as a part ofthe HPF/Fortran 90D application development envi-ronment [1] developed at the Northeast Parallel Ar-chitectures Center (NPAC), Syracuse University. Theenvironment integrates a HPF/Fortran 90D compiler,a functional interpreter and the source based perfor-mance prediction tool; and is supported by a graphi-cal user interface. The current implementation of theenvironment framework is targeted to the iPSC/860hypercube multicomputer system.A set of benchmarking kernels and application1High Performance Fortran
codes are used to validate the accuracy, utility, and usability of the performance prediction framework. The use of this framework for selecting appropriate compiler directives and for application performance debugging is demonstrated.

The rest of the paper is organized as follows: Section 2 gives an overview of HPF/Fortran 90D. Section 3 introduces the interpretive performance prediction approach, and the system and application characterization methodologies. Section 4 describes the design of the HPF/Fortran 90D performance prediction framework. Section 5 presents numerical results to validate the approach and the framework. Section 6 discusses some related research. Finally, Section 7 presents some concluding remarks and discusses future extensions to the project.

2 An Overview of HPF/Fortran 90D

High Performance Fortran (HPF) [2] is based on the research language Fortran 90D [3] and provides a minimal set of extensions to Fortran 90 to support the data parallel programming model, which is defined as single threaded, global name space, loosely synchronous parallel computation. Extensions incorporated into HPF/Fortran 90D provide a means for explicit expression of parallelism and data mapping. These extensions include compiler directives, which are used to advise the compiler how data objects should be assigned to processor memories, and new language features such as the forall statement and construct.

HPF adopts a two level mapping using the PROCESSORS, ALIGN, DISTRIBUTE, and TEMPLATE compiler directives to map data objects to abstract processors. The data objects (typically array elements) are first aligned with an abstract index space called a template. The template is then distributed onto a rectilinear arrangement of abstract processors. The mapping of abstract processors to physical processors is implementation dependent. Data objects not explicitly distributed are mapped according to an implementation dependent default distribution (e.g. replication). Supported distributions include BLOCK and CYCLIC.

Our current implementation of the HPF compiler and performance prediction framework supports a formally defined subset of HPF. The term HPF/Fortran 90D is used to refer to this subset.
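
To make the BLOCK and CYCLIC distributions mentioned above concrete, the following minimal Python sketch computes which abstract processor owns each element of a one-dimensional template. It is an illustration only, not part of the framework: the function names are hypothetical, indices are 0-based for brevity (Fortran arrays are normally 1-based), and the block size is taken as the ceiling of n/p.

```python
def block_owner(i, n, p):
    """Owner of global index i (0-based) when n elements are split into p contiguous blocks."""
    block = -(-n // p)  # ceiling division: number of elements per processor
    return i // block


def cyclic_owner(i, p):
    """Owner of global index i (0-based) when elements are dealt out round-robin."""
    return i % p


n, p = 8, 4  # hypothetical template size and processor count
block_map = [block_owner(i, n, p) for i in range(n)]
cyclic_map = [cyclic_owner(i, p) for i in range(n)]
print("BLOCK :", block_map)
print("CYCLIC:", cyclic_map)
```

With BLOCK, neighbouring elements tend to share an owner, which favours nearest-neighbour computations; with CYCLIC, work is spread evenly even when the active index range shrinks.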
[Figure 1: An Interpretive Performance Prediction Approach]

3 An Interpretive Approach to Performance Prediction

The essence of the interpretive approach is the application of interpretation techniques to performance prediction through an appropriate characterization of the HPC system and the application. It consists of four modules as follows (see Figure 1):

1. The Systems Module, which defines a comprehensive system characterization methodology capable of hierarchically abstracting the HPC system into a set of well defined parameters representing its performance.

2. The Application Module, which defines a corresponding application characterization methodology capable of abstracting a high-level application description into a set of well defined parameters representing its behavior.

3. The Interpretation Engine (or module), which predicts the performance of the application on the HPC system by interpreting the execution costs of the abstracted application in terms of the parameters exported by the abstracted system.

4. The Output Module, which communicates the predicted performance metrics, and provides the application developer with the required information at the required granularity.

The four modules are briefly described below. A detailed discussion of the performance interpretation approach can be found in [4].

3.1 Systems Module

The systems module abstracts an HPC system by hierarchically decomposing it to form a rooted tree structure called the System Abstraction Graph (SAG). Each node of the SAG is a System Abstraction Unit (SAU) which abstracts a part of the HPC system into a set of parameters representing its performance. A SAU is composed of 4 components: (1) Processing Component (P), (2) Memory Component (M), (3) Communication/Synchronization Component (C/S), and (4) Input/Output Component (I/O); each component parameterizes relevant characteristics of the associated system unit.

3.2 Application Module

Application abstraction is performed in two steps. The first step, machine independent application abstraction, is performed by recursively characterizing the application description into Application Abstraction Units (AAUs). Each AAU represents a standard programming construct (such as iterative, conditional, sequential) or a communication/synchronization operation, and parameterizes its behavior. AAUs are combined to abstract the control structure of the application, forming the Application Abstraction Graph (AAG). The communication/synchronization structure of the application is superimposed onto the AAG by augmenting the graph with a set of edges corresponding to the communications or synchronization between AAUs. The resulting structure is the Synchronized Application Abstraction Graph (SAAG). The second step consists of machine specific augmentation and is performed by the machine specific filter. This step incorporates machine specific information (such as introduced compiler transformations/optimizations) into the SAAG based on a mapping defined by the user.

3.3 Interpretation Engine

The interpretation engine consists of two components: an interpretation function that interprets the performance of an individual AAU, and an interpretation algorithm that recursively applies the interpretation function to the SAAG to predict the performance of the corresponding application. An interpretation function is defined for each AAU type to compute its performance in terms of parameters exported by the associated SAU. Models and heuristics are defined to handle accesses to the memory hierarchy, overlap between computation and communication, and user experimentation with system and run-time parameters. Details of these models and the complete set of interpretation functions can be found in [4].
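
The recursive structure described in Section 3.3 can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the interpretation functions of [4]: the cost formulas, the `SAU` parameter names and values, and the `prob` branch probability are all assumptions made for the example, and the memory hierarchy, computation/communication overlap, and global clock are ignored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SAU:
    """Hypothetical system parameters in seconds; the values are invented."""
    t_op: float = 1e-7       # cost of one arithmetic operation
    t_loop: float = 2e-7     # per-iteration loop overhead
    t_branch: float = 1e-7   # cost of evaluating a condition
    t_startup: float = 1e-4  # message start-up latency
    t_byte: float = 4e-7     # per-byte transfer time


@dataclass
class AAU:
    kind: str                # "Seq", "IterD", "CondtD" or "Comm"
    ops: int = 0             # Seq: number of operations
    trips: int = 0           # IterD: iteration count
    prob: float = 1.0        # CondtD: assumed probability that the body runs
    nbytes: int = 0          # Comm: message size in bytes
    body: List["AAU"] = field(default_factory=list)


def interpret(aau, sau):
    """Return (computation, communication, overhead) time for an AAU and its children."""
    comp = comm = ovhd = 0.0
    for child in aau.body:   # recursive application to the sub-graph
        c, m, o = interpret(child, sau)
        comp, comm, ovhd = comp + c, comm + m, ovhd + o
    if aau.kind == "Seq":
        comp += aau.ops * sau.t_op
    elif aau.kind == "IterD":
        comp, comm = comp * aau.trips, comm * aau.trips
        ovhd = ovhd * aau.trips + aau.trips * sau.t_loop
    elif aau.kind == "CondtD":
        comp, comm, ovhd = comp * aau.prob, comm * aau.prob, ovhd * aau.prob
        ovhd += sau.t_branch
    elif aau.kind == "Comm":
        comm += sau.t_startup + aau.nbytes * sau.t_byte
    return comp, comm, ovhd


# A small sub-AAG: pack, communicate, then a masked loop.
masked_loop = AAU("IterD", trips=100,
                  body=[AAU("CondtD", prob=0.5, body=[AAU("Seq", ops=2)])])
sub_aag = AAU("Seq", body=[AAU("Seq", ops=10), AAU("Comm", nbytes=1024), masked_loop])
comp, comm, ovhd = interpret(sub_aag, SAU())
print(f"comp={comp:.3e}s comm={comm:.3e}s ovhd={ovhd:.3e}s")
```

Keeping the three cost components separate at every node mirrors the per-AAU and cumulative metrics that the output module reports.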

3.4 Output Module

The output module provides an interactive interface through which the user can access estimated performance statistics. The user has the option of selecting the type of information, and the level at which the information is to be displayed. Available information includes cumulative execution times, the communication time/computation time breakup, and existing overheads and wait times. This information can be obtained for an individual AAU, cumulatively for a branch of the AAG (i.e. sub-AAG), or for the entire AAG.

4 Design of the HPF/Fortran 90D Performance Prediction Framework

The HPF/Fortran 90D performance prediction framework is based on the HPF source-to-source compiler technology [5] which translates HPF into loosely synchronous, SPMD (Single Program, Multiple Data) Fortran 77 + Message-Passing codes. It uses this technology in conjunction with the performance interpretation model to provide performance estimates for HPF/Fortran 90D applications on a distributed memory MIMD multicomputer. Performance prediction is performed in two phases as described below:

4.1 Phase 1 - Compilation

The compilation phase is based on the HPF/Fortran 90D compiler. Given a syntactically correct HPF/Fortran 90D program, this phase performs the following steps:

1. The first step parses the program to generate a parse tree. Array assignment statements and where statements are transformed into equivalent forall statements with no loss of information.

2. The partitioning step processes the compiler directives and, using these directives, partitions the data and computation among the processors.

3. The sequentialization step is responsible for converting parallel constructs in the node program into loops or nested loops.

4. The communication detection step detects communication requirements and inserts appropriate communication calls.

5. In the final step, a loosely synchronous SPMD program structure is generated consisting of alternating phases of local computation and global communication.

4.2 Phase 2 - Interpretation

Phase 2 is implemented as a sequence of parses: (1) The abstraction parse generates the application abstraction graph (AAG) and synchronized application abstraction graph (SAAG). (2) The interpretation parse performs the actual interpretation using the interpretation algorithm. (3) The output parse generates the required performance metrics.

Abstraction Parse: The abstraction parse intercepts the SPMD program structure produced in phase 1 and abstracts its execution and communication structures to generate the corresponding AAG and SAAG (as defined in Section 3). A communication table is generated to store the specifications and status of each communication/synchronization.

The abstraction parse also identifies all critical variables in the application description; a critical variable is defined as a variable whose value affects the flow of execution, e.g. a loop limit. The critical variables are then resolved either by tracing their definition paths or by allowing the user to explicitly specify their values.

Interpretation Parse: The interpretation parse performs the actual performance interpretation using the interpretation algorithm. For each AAU in the SAAG, the corresponding interpretation function is used to generate the performance measure associated with it. Performance metrics maintained at each AAU are its computation, communication and overhead times, and the value of the global clock. In addition, cumulative metrics are also maintained for the entire SAAG. The interpretation parse has provisions to take into consideration a set of compiler optimizations (for the generated Fortran 77 + MP code) such as loop re-ordering. These can be turned on/off by the user.

Output Parse: The final parse communicates estimated performance metrics to the user. The output interface provides three types of outputs. The first type is a generic performance profile of the entire application broken up into its communication, computation and overhead components. 
Similar measures for each individual AAU and for sub-graphs of the AAG are also available. The second form of output allows the user to query the system for the metrics associated with a particular line (or a set of lines) of the application description. Finally, the system can generate an interpretation trace which can be used as input to the ParaGraph [6] visualization package. The user can then use the capabilities provided by the package to analyze the performance of the application.

4.3 Abstraction & Interpretation of HPF/Fortran 90D Parallel Constructs

The abstraction/interpretation of the HPF/Fortran 90D parallel constructs, i.e. forall, array assignments, and where, is described below:

forall Statement: The forall statement generalizes array assignments to handle new shapes of arrays by specifying them in terms of array elements or sections. The element array may be masked with a scalar logical expression. Its semantics are an assignment to each element or section (for which the mask expression evaluates true) with all the right-hand sides being evaluated before any left-hand sides are assigned. The order of iteration over the elements is not fixed. Examples of its use are:

    forall (I = 1:N, J = 1:N) P(I,J) = Q(I-1,J-1)
    forall (I = 1:N, Q(I) .NE. 0.0) P(I) = 1.0/Q(I)

Phase 1 translates the forall statement into a three level structure consisting of a collective communication level, a local computation level and another collective communication level, to be executed by each processor. The processor that is assigned an iteration of the forall loop is responsible for computing the right-hand-side expression of the assignment statement, while the processors that own an array element used in the left-hand side or right-hand side of the assignment statement must communicate that element to the processor performing the computation. Consequently, the first communication level fetches off-processor data required by the computation level. Once this data has been gathered, computations are local. The final communication level writes calculated values to off-processors.

Phase 2 then generates a corresponding sub-AAG using the application abstraction model. The communication level translates into a sequential (Seq) AAU corresponding to index translations and message packing performed, and a communication (Comm) AAU. The computation level generates an iterative (IterD) AAU which may contain a conditional (CondtD) AAU (depending on whether a mask is specified). The abstraction of the forall statement is shown in Figure 2. In this example, the final communication phase is not required as no off-processor data needs to be written.

[Figure 2: Abstraction of the forall Statement. The figure shows the source statement forall (K=2:N-1, V(K) .GT. 0) X(K+1) = X(K) + X(K-1); the Phase 1 node code, consisting of calls to ADJUST_BOUNDS(), PACK_PARAMETERS() and GATHER_DATA(G) and a loop DO K = LocalLB,LocalUB containing IF (V(K) .GT. 0) THEN X(K+1) = X(K) + G(K) END IF; and the Phase 2 AAUs Seq, Comm, IterD, CondtD and Seq.]

Array Assignment Statements: HPF/Fortran 90D array assignment statements allow entire arrays (or array sections) to be manipulated atomically, thereby enhancing the clarity and conciseness of the program and making parallelism explicit. Array assignments are special cases of the forall statement and are abstracted by first translating them into equivalent forall statements. The resultant forall statement is then interpreted as described above.

where Statement: Like the array assignment statement, the HPF/Fortran 90D where statement is also a special case of the forall statement and is handled in a similar way.

4.4 Abstraction of the iPSC/860 System

Abstraction of the iPSC/860 hypercube system to generate the corresponding SAG was performed off-line using a combination of assembly instruction counts, measured timings, and system specifications. The processing and memory components were generated using system specifications provided by the vendor, while iterative and conditional overheads were computed using instruction counts. The communication component was parameterized using benchmarking runs. These parameters abstracted both low-level primitives and the high-level collective communication library used by the compiler. Benchmarking runs were also used to parameterize the HPF parallel intrinsic library. The intrinsics included circular shift (cshift), shift to temporary (tshift), global sum operation (sum), global product operation (product), and the maxloc operation which returns the location of the maximum in a distributed array. Characterization of the SRM (host) and the communication channel connecting the SRM to the i860 cube was performed in a similar manner.
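
Section 4.4 states that the communication component was parameterized from benchmarking runs but does not give the fitting procedure. One common way to do this, assumed here purely for illustration, is to model message time as a start-up cost plus a per-byte cost and fit both by least squares. The function name and the synthetic timings below are hypothetical and are not measurements from the paper.

```python
def fit_linear_cost(sizes, times):
    """Least-squares fit of the assumed model: time = startup + per_byte * size."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    per_byte = sxy / sxx
    startup = mean_y - per_byte * mean_x
    return startup, per_byte


# Synthetic benchmark timings in microseconds, generated from a known line
# so the fit can be checked by eye; real runs would be noisy.
sizes = [0, 256, 1024, 4096]
times = [100.0 + 0.5 * s for s in sizes]
startup, per_byte = fit_linear_cost(sizes, times)
print(f"startup = {startup:.2f} us, per-byte = {per_byte:.4f} us")
```

The two fitted numbers are the kind of parameters a Communication/Synchronization component could export to the interpretation functions.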

5 Validation/Evaluation of the Interpretation Framework

In this section we present numerical results obtained using the current implementation of the HPF/Fortran 90D performance prediction framework. In addition to validating the viability of the interpretive approach, this section has the following objectives:

1. To validate the accuracy of the performance prediction framework for applications on a high performance computing system. The aim is to show that the predicted performance metrics are accurate enough to provide realistic information about the application performance and to be used as a basis for design tuning.

2. To demonstrate the utility of the framework and the metrics generated for efficient HPC application development. The results presented illustrate the framework's utility for: (1) Application design and directive selection; and (2) Application performance debugging.

3. To demonstrate the usability (ease of use) of the performance interpretation framework and its cost-effectiveness.

The high performance computing system used is an iPSC/860 hypercube connected to an 80386 based host processor. The particular configuration of the iPSC/860 consists of 8 i860 nodes. Each node has a 4 KByte instruction cache, 8 KByte data cache and 8 MBytes of main memory. The node operates at a clock speed of 40 MHz and has a theoretical peak performance of 80 MFlop/s for single precision and 40 MFlop/s for double precision. The validation application set was selected from the NPAC HPF/Fortran 90D Benchmark Suite [7]. The suite consists of a set of benchmarking kernels and "real-life" applications and is designed to evaluate the efficiency of the HPF/Fortran 90D compiler and, specifically, automatic partitioning schemes. The selected application set includes kernels from standard benchmark sets like the Livermore Fortran Kernels and the Purdue Benchmark Set, as well as real computational problems. The applications are listed in Table 1.

Table 1: Validation Application Set

Name      Description
Livermore Fortran Kernels (LFK)
LFK 1     Hydro Fragment
LFK 2     ICCG Excerpt (Incomplete Cholesky; Conj. Grad.)
LFK 3     Inner Product
LFK 9     Integrate Predictors
LFK 14    1-D PIC (Particle In Cell)
LFK 22    Planckian Distribution
Purdue Benchmarking Set (PBS)
PBS 1     Trapezoidal rule estimate of an integral of f(x)
PBS 2     Compute $e = \sum_{i=1}^{n} \prod_{j=1}^{m} \left(1 + \frac{0.5}{|i-j|+0.001}\right)$
PBS 3     Compute $S = \sum_{i=1}^{n} \prod_{j=1}^{m} a_{ij}$
PBS 4     Compute $R = \sum_{i=1}^{n} \frac{1}{x_i}$
PI        Approximation of $\pi$ by calculating the area under the curve using the n-point quadrature rule
N-Body    Newtonian gravitational n-body simulation
Finance   Parallel stock option pricing model
Laplace   Laplace solver based on Jacobi iterations

5.1 Validating Accuracy of the Framework

Accuracy of the performance prediction framework is validated by comparing estimated execution times with actual measured times. For each application, the experiment consisted of varying the problem size and number of processing elements used. Measured timings represent an average of 1000 runs. The results are summarized in Table 2. Error values listed are percentages of the measured time and represent maximum/minimum absolute errors over all problem sizes and system sizes. For example, the N-Body computation was performed for 16 to 4096 bodies on 1, 2, 4, and 8 nodes of the iPSC/860. 
The minimum absolute error between estimated and measured times was 0.09% of the measured time while the maximum absolute error was 5.9%.

The obtained results show that in the worst case, the interpreted performance is within 20% of the measured value, the best case error being less than 0.001%.
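
The error figures quoted above are absolute differences expressed as a percentage of the measured time, with the minimum and maximum taken over all problem and system sizes. A minimal Python sketch of that calculation follows; the function names and the (estimated, measured) pairs are made up for the example and are not data from the experiments.

```python
def abs_error_pct(estimated, measured):
    """Absolute error between estimate and measurement, as a percentage of the measured time."""
    return abs(estimated - measured) / measured * 100.0


def error_range(pairs):
    """Minimum and maximum absolute error (%) over a set of (estimated, measured) runs."""
    errors = [abs_error_pct(e, m) for e, m in pairs]
    return min(errors), max(errors)


# Hypothetical (estimated, measured) execution times in seconds.
runs = [(0.101, 0.100), (0.190, 0.200), (0.405, 0.400)]
min_err, max_err = error_range(runs)
print(f"min abs error = {min_err:.2f}%, max abs error = {max_err:.2f}%")
```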

The larger errors are produced by the benchmark kernels, which have been specifically coded to stress the compiler. Further, it was found that the interpreted performance typically lies within the variance of the measured times over the 1000 iterations. This indicates that the main contributors to the error are the tolerance of the timing routines and fluctuations in the system load. Predicted metrics typically serve either as the first-cut performance estimate of an application or as a relative performance measure to be used as a basis for design tuning. In either case, the interpreted performance is accurate enough to provide the required information.

5.2 Validating Utility of the Framework

The utility of the performance prediction framework is validated through the following experiments: (1) selecting the appropriate HPF/Fortran 90D directives based on the predicted performance, and (2) using the tool to analyze different components of the execution time and their distributions with respect to the application. These experiments are described below:

5.2.1 Appropriate Directive Selection

To demonstrate the utility of the interpretive framework in selecting HPF compiler directives we compare the performance of the Laplace solver for 3 different distributions (DISTRIBUTE directive) of the template, namely (BLOCK,BLOCK), (BLOCK,X) and (X,BLOCK), and corresponding alignments (ALIGN directive) of the data elements to the template. These three distributions (on 4 processors) are shown in Figure 3. Figures 4 & 5 compare the performance of each of the three cases for different system sizes using both measured times and estimated times. These graphs can be used to select the best directives for a particular problem size and system configuration. For the Laplace solver, the (Block,X) distribution is the appropriate choice. 
Further, since the maximum absolute error between estimated and measured times is less than 1%, directive selection can be accurately performed using the interpretive framework. Using the interpretive framework is also significantly more cost-effective as will be demonstrated in Section 5.3.

In the above experiment, performance interpretation was source driven and can be automated. This exposes the utility of the framework as a basis for an intelligent compiler capable of selecting appropriate directives and data decompositions. Similarly, it can also enable such a compiler to select code optimizations such as the granularity of the computation phase per communication phase in the loosely synchronous computation model.

Table 2: Accuracy of the Performance Prediction Framework

Name               Problem Sizes     System Size   Min Abs Error   Max Abs Error
                   (data elements)   (# procs)     (%)             (%)
LFK 1              128 - 4096        1 - 8         1.3%            10.2%
LFK 2              128 - 4096        1 - 8         2.5%            18.6%
LFK 3              128 - 4096        1 - 8         0.7%            7.2%
LFK 9              128 - 4096        1 - 8         0.3%            13.7%
LFK 14             128 - 4096        1 - 8         0.3%            13.8%
LFK 22             128 - 4096        1 - 8         1.4%            3.9%
PBS 1              128 - 4096        1 - 8         0.05%           7.9%
PBS 2              256 - 65536       1 - 8         0.6%            6.7%
PBS 3              256 - 65536       1 - 8         0.8%            9.5%
PBS 4              128 - 4096        1 - 8         0.2%            3.9%
PI                 128 - 4096        1 - 8         0.00%           5.9%
N-Body             16 - 4096         1 - 8         0.09%           5.9%
Financial          32 - 512          1 - 8         1.1%            4.6%
Laplace (Blk-Blk)  16 - 256          1 - 8         0.2%            4.4%
Laplace (Blk-X)    16 - 256          1 - 8         0.6%            4.9%
Laplace (X-Blk)    16 - 256          1 - 8         0.1%            2.8%
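
Directive selection from predicted times, as described in Section 5.2.1, amounts to choosing the distribution with the smallest predicted execution time for each problem size and system configuration. The sketch below shows that selection step in Python. The function name, the dictionary layout, and all timing values are hypothetical; in the framework the times would come from the interpretation engine.

```python
def select_distribution(predicted):
    """For each (procs, size) configuration, return the distribution with the smallest predicted time."""
    best = {}
    for config, times in predicted.items():
        best[config] = min(times, key=times.get)
    return best


# Hypothetical predicted execution times in seconds, keyed by (processors, problem size).
predicted = {
    (4, 128): {"(BLOCK,BLOCK)": 0.12, "(BLOCK,X)": 0.09, "(X,BLOCK)": 0.15},
    (8, 256): {"(BLOCK,BLOCK)": 0.21, "(BLOCK,X)": 0.17, "(X,BLOCK)": 0.30},
}
choice = select_distribution(predicted)
for config, name in sorted(choice.items()):
    print(config, "->", name)
```

Because the comparison only needs relative ordering, modest absolute prediction errors do not change the outcome unless two candidates are nearly tied.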

[Figure 3: Laplace Solver - Data Distributions. Panels show the (Block,*), (Block,Block) and (*,Block) distributions of the template over processors P1 to P4.]

5.2.2 Application Performance Debugging

The performance metrics generated by the framework can be used to analyze the performance contribution of different parts of the application description and to identify bottlenecks. A performance profile for the phases (Figure 6) of the parallel stock option pricing application is shown in Figure 7. Phase 1 creates the (distributed) option price lattice while Phase 2, which requires no communication, computes the call prices of stock options.

Application performance debugging using conventional means involves instrumentation, execution and data collection, and post-processing this data. Further, this process requires a running application and has to be repeated to evaluate each design modification. Using the interpretive framework, this information (at all levels required) is available during application development (without requiring a running application).

5.3 Validating Usability of the Framework

The interpreted performance estimates for the experiments described above were obtained using the interpretive framework running on a Sparcstation 1+. The framework provides a friendly menu-driven, graphical user interface to work with and requires no special hardware other than a conventional workstation and a windowing environment. Application characterization is performed automatically (unlike most approaches) while system abstraction is performed off-line and only once. Application parameters and directives were varied from within the interface itself. Typical experimentation on the iPSC/860 (to obtain measured execution times) consisted of editing code, compiling and linking using a cross compiler (compiling on the front end is not allowed to reduce its load), transferring the executable to the iPSC/860 front end, loading it onto the i860 node and then finally running it. The process had to be repeated for each instance of each experiment. Relative experimentation times for different implementations of the Laplace Solver (Section 5.2.1) using measurements and the performance interpreter are shown in Figure 8. Experimentation using the interpretive approach required
approximately 10 minutes for each of the three implementations. Experimentation using measurements, however, took a minimum of 27 minutes (for the (Blk,*) implementation) and required almost 1 hour for the (*,Blk) case. Clearly, the measurements approach is not feasible, especially when a large number of options have to be evaluated. Further, the iPSC/860, being an expensive resource, is shared by various development groups in the organization. Consequently, its usage can be restrictive and the required configuration may not be immediately available. The comparison above validates the convenience and cost-effectiveness of the framework for experimentation during application development.

[Figure 4: Laplace Solver (4 Procs) - Estimated/Measured Times. Execution time (sec) versus problem size for estimated and measured (Blk,Blk) on a 2x2 processor grid, (Blk,*) on 4 processors, and (*,Blk) on 4 processors.]

[Figure 5: Laplace Solver (8 Procs) - Estimated/Measured Times. Execution time (sec) versus problem size for estimated and measured (Blk,Blk) on a 2x4 processor grid, (Blk,*) on 8 processors, and (*,Blk) on 8 processors.]

6 Related Work

Existing performance prediction approaches and models for multicomputer systems can be broadly classified as analytic, simulation, monitoring or hybrid (which make use of a combination of the above techniques along with possible heuristics and approximations).

Analytic techniques use mathematical models to abstract the system and application, and solve these models to obtain performance metrics. A general approach for analytic performance prediction for shared memory systems has been proposed by Siewiorek et al. in [8] while probabilistic models for parallel programs based on queueing theory have been presented in [9]. The above approaches require users to explicitly model the application along with the HPC system.

[Figure 8: Experimentation Time - Laplace Solver. Experimentation time (min) for the (Blk,Blk), (Blk,*) and (*,Blk) implementations using the interpreter and the iPSC/860.]

[Figure 6: Financial Model - Application Phases. Phase 1: Create Stock Price Lattice (shift); Phase 2: Compute Call Price.]

[Figure 7: Financial Model - Interpreted Performance Profile. Stock Option Pricing, Procs = 4, Size = 256; time (usec) per application phase, broken into Comp Time, Comm Time and Ovhd Time.]

A source based analytic performance prediction model for Dataparallel C has been developed by Clement et al. [10]. The approach uses a set of assumptions and specific characteristics of the language to develop a speedup equation for applications in terms of system costs.

Simulation techniques simulate the hardware and the actual execution of a program on that hardware. These techniques are typically expensive in terms of the time and computing resources required. A simulation based approach is used in the SiGLe system (Simulator at Global Level) [11] which provides special description languages to describe the architecture, application and the mapping of the application onto the architecture.

The PPPT system [12] uses monitoring techniques to profile the execution of the application program on a single processor. Obtained information is then used by the static parameter based performance prediction tool to estimate performance information for the parallelized (SPMD) application program on a distributed memory system. A similar evaluation approach based on instrumentation, data collection and post-processing has been proposed by Darema et al. [13]. Balasundaram et al. [14] use "training routines" to benchmark the performance of the architecture and then use this information to evaluate different data decompositions.

A hybrid approach is presented in [15] where the runtime of each node of a stochastic graph representing the application is modeled as a random variable. The distributions of these random variables are then obtained using hardware monitoring.

The layered approach presented in [16] uses a methodology based on application and system characterization. The developer is required to characterize the application as an execution graph and define its resource requirements in this system.

7 Conclusions and Future Work

Evaluation tools form a critical part of any software development environment as they enable the developer to evaluate the different design choices available
at various stages of application development, and to make the most appropriate selection.

In this paper, we described a novel interpretive approach for accurate and cost-effective performance prediction on high performance computing systems. A comprehensive characterization methodology is used to abstract the system and application components of the HPC environment into a set of well-defined parameters. An interpreter engine then interprets the performance of the abstracted application in terms of the parameters exported by the abstracted system. A source-driven HPF/Fortran 90D performance prediction framework based on the interpretive approach has been implemented as part of the HPF/Fortran 90D integrated application development environment. The current implementation of the environment framework is targeted to the iPSC/860 hypercube system.

Numerical results using benchmarking kernels and application codes from the NPAC HPF/Fortran 90D Benchmark Suite were presented to validate the accuracy, utility, and usability of the performance prediction framework. The use of the framework for selecting appropriate compiler directives, and for application performance debugging, was demonstrated.

We are currently working on developing an intelligent HPF/Fortran 90D compiler based on the source based interpretation model. This tool will enable the compiler to automatically evaluate directives and transformation choices and optimize the application at compile time. Future development of the framework will involve moving it to high performance distributed computing systems and exploiting its potential as a
system design evaluation tool.

Acknowledgment

The presented research has been jointly sponsored by DARPA under contract #DABT63-91-k-0005 and by Rome Labs under contract #F30602-92-C-0150. The content of the information does not necessarily reflect the position or the policy of the sponsors and no official endorsement should be inferred.

References

[1] Manish Parashar, Salim Hariri, Tomasz Haupt, and Geoffrey C. Fox, "Design of An Interpretive Toolkit for HPF/Fortran 90D Application Development", Technical report, Northeast Parallel Architectures Center, Syracuse University, Syracuse NY 13244-4100, Apr. 1994.

[2] High Performance Fortran Forum, High Performance Fortran Language Specifications, Version 1.0, Jan. 1993. Also available as Technical Report CRPC-TR92225 from Center for Research on Parallel Computing, Rice University, Houston, TX 77251-1892.

[3] Geoffrey C. Fox, Seema Hiranandani, Ken Kennedy, Charles Koelbel, Uli Kremer, Chau-Wen Tseng, and Min-You Wu, "Fortran D Language Specifications", Technical Report SCCS 42c, Northeast Parallel Architectures Center, Syracuse University, Syracuse NY 13244-4100, Dec. 1990.

[4] Manish Parashar, Salim Hariri, Tomasz Haupt, and Geoffrey C. Fox, "An Interpretive Framework for Application Performance Prediction", Technical Report SCCS-479, Northeast Parallel Architectures Center, Syracuse University, Syracuse NY 13244-4100, Apr. 1993.

[5] Zeki Bozkus, Alok Choudhary, Geoffrey Fox, Tomasz Haupt, and Sanjay Ranka, "Compiling HPF for Distributed Memory MIMD Computers", in David Lilja and Peter Bird, editors, Impact of Compilation Technology on Computer Architecture. Kluwer Academic Publishers, 1993.

[6] J. A. Etheridge and M. Heath, "Paragraph", Technical report, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, Oct. 1991.

[7] A. Gaber Mohamed, Geoffrey C. Fox, Gregor von Laszewski, Manish Parashar, Tomasz Haupt, Kim Mills, Ying-Hua Lu, Neng-Tan Lin, and Nang kang Yeh, "Application Benchmark Set for Fortran-D and High Performance Fortran", Technical Report SCCS-327, Northeast Parallel Architectures Center, Syracuse University, Syracuse, NY 13244-4100, June 1992.

[8] Dalibor F. Vrsalovic, Daniel P. Siewiorek, Zary Z. Segall, and Edward F. Gehringer, "Performance Prediction and Calibration for a Class of Multiprocessors", IEEE Transactions on Computers, 37(11):1353-1365, Nov. 1988.

[9] A. Kapelnikov, R. R. Muntz, and M. D. Ercegovac, "A Methodology for Performance Analysis of Parallel Computations with Looping Constructs", Journal of Parallel and Distributed Computing, 14:105-120, 1992.

[10] Mark J. Clement and Michael J. Quinn, "Analytic Performance Prediction on Multicomputers", Technical report, Department of Computer Science, Oregon State University, Mar. 1993.

[11] F. Andre and A. Joubert, "SiGLe: An Evaluation Tool for Distributed Systems", Proceedings of the International Conference on Distributed Computing Systems, pp. 466-472, 1987.

[12] Thomas Fahringer and Hans P. Zima, "A Static Parameter based Performance Prediction Tool for Parallel Programs", Proceedings of the 7th ACM International Conference on Supercomputing, Japan, July 1993.

[13] Frederica Darema, "Parallel Applications Performance Methodology", in Margaret Simmons, Rebecca Koskela, and Ingrid Bucher, editors, Instrumentation for Future Parallel Computing Systems, chapter 3, pp. 49-57. Addison-Wesley Publishing Company, 1988.

[14] Vasanth Balasundaram, Geoffrey Fox, Ken Kennedy, and Ulrich Kremer, "A Static Performance Estimator in the Fortran D Programming System", in Joel Saltz and Piyush Mehrotra, editors, Languages, Compilers and Run-Time Environments for Distributed Memory Machines, pp. 119-138. Elsevier Science Publishers B.V., 1992.

[15] Franz Sotz, "A Method for Performance Prediction of Parallel Programs", in H. Burkhart, editor, Joint International Conference on Vector and Parallel Processing, Proceedings, Zurich, Switzerland, pp. 98-107. Springer, Berlin, LNCS 457, Sep. 1990.

[16] E. Papaefstathiou, D. J. Kerbyson, and G. R. Nudd, "A Layered Approach to Parallel Software Performance Prediction: A Case Study", Massively Parallel Processing Applications and Development, Delft, 1994.
