A Study of Software Development for High Performance Computing by Parashar, Manish et al.
Syracuse University 
SURFACE 
Northeast Parallel Architecture Center College of Engineering and Computer Science 
1994 
A Study of Software Development for High Performance 
Computing 
Manish Parashar 
Syracuse University 
Salim Hariri 
Syracuse University 
Tomasz Haupt 
Syracuse University 
Geoffrey C. Fox 
Syracuse University 
Recommended Citation 
Parashar, Manish; Hariri, Salim; Haupt, Tomasz; and Fox, Geoffrey C., "A Study of Software Development 
for High Performance Computing" (1994). Northeast Parallel Architecture Center. 71. 
https://surface.syr.edu/npac/71 
A Study of Software Development for High Performance Computing

Manish Parashar, Salim Hariri, Tomasz Haupt and Geoffrey Fox
Northeast Parallel Architectures Center
Syracuse University

Published in Programming Environments for Massively Parallel Distributed Systems, Birkhäuser Verlag, Basel, Switzerland, August 1994. Also presented at the IFIP WG10.3 Working Conference on Programming Environments for Massively Parallel Distributed Systems, 1994.

Abstract

Software development in a High Performance Computing (HPC) environment is non-trivial and requires a thorough understanding of the application and the architecture. The objective of this paper is to study the software development process in a high performance computing environment and to outline the stages typically encountered in this process. Support required at each stage is also highlighted. The modeling of stock option pricing is used as a running example in the study.

1 Introduction

Software development in any High Performance (Parallel/Distributed) Computing (HPC) environment is a non-trivial process and requires a thorough understanding of the application and the architecture. This is apparent from the fact that applications currently achieve only a fraction of peak available performance [Zor92]. HPC software development requires the developer to resolve and tune a large number of available design options. For example, during the course of software development, the developer is required to select the optimal hardware configuration for a particular application, the best decomposition and mapping of the problem onto the selected hardware configuration, the best communication and synchronization strategy to be used, etc. Using conventional techniques, this would require extensive experimentation, data collection and post-processing. The set of reasonable
alternatives that have to be evaluated is very large, and selecting the best among these is a formidable task. As a result, the exploitation of the vast potential of HPC systems will largely be governed by the availability of suitable tools and application development environments to support application developers.

The objective of this paper is to study the software development process in a high performance computing environment and to outline the stages encountered. Further, the nature of the supporting tools that can assist the developer at each stage is identified. Parallel modeling of stock option pricing is used as an illustrative example in the study. The rest of the document is organized as follows: Section 2 presents the study of the HPC software development process and outlines the stages (subsections 2.3 - 2.7). Section 3 presents some conclusions.

2 HPC Software Development

The HPC software development process is described as a set of stages which correspond to the phases typically encountered by a developer. At each stage, a set of support tools which can assist the developer is identified. The stages can be viewed as a set of filters in cascade (see Figure 1) forming a development pipeline. The input to this system of filters is the application description and specification, which is generated from the application itself (if it is a new problem) or from existing sequential code (porting of dusty decks). The final output of the pipeline is a running application. Feedback loops present at some stages signify step-wise refinement and tuning. Related discussions pertaining to parallel computing environments and spanning parts of the software development process can be found in [BM91, BBDK91, RL88]. A survey of existing tools and techniques corresponding to the development stages is presented in [PHHF93a]. The stages in the HPC software development process are described in the following sections.
Parallel modeling of Stock Option Pricing [MCV+92] is used as an illustrative running example in the discussion.

[Figure 1: The HPC Software Development Process. Inputs (Dusty Decks, New Application) pass through the Application Specification Filter to the Application Analysis Stage, the Application Development Stage (Algorithm Development Module, System Level Mapping Module, Machine Level Mapping Module, Implementation/Coding Module, Design Evaluator Module), the Compile-Time/Run-Time Stage, the Evaluation Stage and the Maintenance/Evolution Stage. Labeled flows: Application Specification, Parallelization Specification, Parallelized Structure, Evaluation Specification, Evaluation Recommendation.]

2.1 Parallel Modeling of Stock Option Pricing

Stock options are contracts that give the holder of the contract the right to buy or sell the underlying stock at some time in the future for an agreed upon striking or exercise price. Option contracts are traded just as stocks, and models that quickly and accurately predict their prices are valuable to the traders. Stock option pricing models estimate the price for an option contract based on historical market trends and current market information. The model requires three classes of inputs: Market Variables, which include the current stock price, call price, exercise price and time to maturity. Model Parameters, which include the volatility of the asset (variance of the asset price over time), variance of the volatility and the correlation between asset price and volatility. These parameters cannot be directly observed and must be estimated from historical data. User Inputs, which specify the nature of the required estimation; e.g. American/European call, constant/stochastic volatility, time of dividend payoff, and other constraints regarding acceptable accuracy and running times. A number of option pricing models have been developed using varied approaches, e.g. non-stochastic analytic models, Monte Carlo simulation models, binomial models, binomial models with forced recombination, etc. These models each involve a set of tradeoffs in the nature and accuracy of the
estimation and suit different user requirements. In addition, these models make varied demands in terms of programming models and computing resources.

2.2 Inputs

The HPC software development process presented in this section addresses "new" application development as well as the porting of existing applications (dusty decks) to HPC environments. The input to the development pipeline is the application specification in the form of a functional flow description, which is a very high-level flow diagram of the application outlining the sequence of functions to be performed. Each node (termed a functional module) in the functional flow diagram is a black box and contains information about (1) its input(s), (2) the function to be performed, (3) the desired output(s) and (4) the resource requirements at each node. The application specification can be thought of as corresponding to the "user requirement document" in traditional life-cycle models.

In the case of new applications, the inputs are generated from the textual description of the problem and its requirements. In the case of dusty deck code porting, the developer is required to analyze the existing source code. In either case, expert system based tools and intelligent editors, both equipped with a knowledge base to assist in analyzing the application, are required. In Figure 1, these tools are included in the "Application Specification Filter" module.

The stock price modeling application comes under the first class of applications (i.e. new applications). The application specification, based on the textual description presented in Section 2.1, is shown in Figure 2. It consists of three functional modules: (1) The input module, which accepts user specification, market information and historical data and generates the three classes of inputs required by the model. (2) The estimation module, which consists of the actual model and generates the stock option pricing estimates.
(3) The output module provides a graphical display of the estimation to the user. The feedback from the output module to the input module represents tuning of the user specification based on the output displayed.

2.3 Application Analysis Stage

The first stage of the HPC software development pipeline is the application analysis stage. The input to this stage is the application specification as described in Section 2.2. The function of this stage is to thoroughly analyze the application with the sole objective of achieving the most efficient implementation. The problems dealt with in this stage are: (1) module creation problem, i.e. identification of tasks which can be executed in parallel; (2) module classification problem, i.e. identification of standard modules; and (3) module synchronization problem, i.e. analysis of mutual interdependencies. The output of this stage is a detailed process flow graph called the "Parallelization Specification" where the nodes represent
functional components and the edges represent interdependencies. This stage corresponds to the "design phase" in standard software life-cycle models and its output corresponds to the "design document".

[Figure 2: Stock Option Pricing Model: Application Specifications]
[Figure 3: Stock Option Pricing Model: Parallelization Specifications]

Tools which can assist the user at this stage of software development are: (1) smart editors which can interactively generate directed graph models from the application specifications; (2) intelligent tools with learning capabilities which can use the directed graphs to analyze dependencies, identify potentially parallelizable modules and attempt to classify the functional
modules into standard modules; and (3) problem specific tools equipped with a database of transformations and strategies applicable to the specific problem.

The parallelization specification for the running example is shown in Figure 3. The Input functional module is subdivided into two functional components: (1) analyzing historical data and generating model parameters; and (2) accepting
market information and user inputs to generate market variables and estimation specifications. The two components can be executed concurrently. The Estimation module is identified as a standard computational module and is retained as a single functional component. The Output functional module consists of two independent functional components: (1) rendering the estimated information onto a graphical display; and (2) writing it onto disk for subsequent analysis.

2.4 Application Development Stage

The application development stage receives as its input the Parallelization Specification and produces the Parallelized Structure, which can then be compiled and executed. This stage is made up of five modules: (1) Algorithm Development Module; (2) System Level Mapping Module; (3) Machine Level Mapping Module; (4) Implementation/Coding Module; and (5) Design Evaluator Module. It should be noted, however, that these modules are not executed in any fixed sequence or a fixed number of times. There exists instead a feedback system from each module to the other modules through the design evaluator module. This allows the development as well as the tuning to proceed in an iterative manner using step-wise refinement. The modules are described below:

2.4.1 Algorithm Development Module

The function of the algorithm development module is to assist the developer in identifying functional components in the parallelization specification and selecting appropriate algorithmic implementations. The input information to this module includes: (1) the classification and requirements of the components specified in the parallelization specification; (2) hardware configuration information; and (3) mapping information generated by the system level mapping module. It then uses this information to select the best algorithmic implementation and the corresponding implementation template from its database.
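The selection step described above can be illustrated with a minimal sketch. It assumes a hypothetical template database in which each entry records the component class it implements, the architecture class it targets, and a cost estimate of the kind the design evaluator module might supply. The names `Template`, `TEMPLATE_DB` and `select_template`, and all numeric values, are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Template:
    """One entry of a hypothetical algorithmic template database."""
    name: str
    component_class: str   # classification from the parallelization specification
    architecture: str      # architecture class the template is written for
    estimated_cost: float  # assumed estimate (lower is better), e.g. predicted run time


def select_template(db: List[Template], component_class: str,
                    architecture: str) -> Optional[Template]:
    """Return the cheapest template matching the component and architecture."""
    candidates = [t for t in db
                  if t.component_class == component_class
                  and t.architecture == architecture]
    # min() with default=None returns None when no template matches.
    return min(candidates, key=lambda t: t.estimated_cost, default=None)


# Illustrative entries only; the costs are made-up numbers.
TEMPLATE_DB = [
    Template("monte_carlo_simd", "estimation", "SIMD", 40.0),
    Template("binomial_lattice_simd", "estimation", "SIMD", 5.0),
    Template("binomial_lattice_mimd", "estimation", "MIMD", 8.0),
    Template("disk_writer", "output", "io_server", 1.0),
]

best = select_template(TEMPLATE_DB, "estimation", "SIMD")
print(best.name if best else "no matching template")
```

In a real environment the cost field would be refreshed by feedback from the design evaluator module rather than stored as a constant.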
The algorithm development module uses the services of the design evaluator module to select between possible algorithmic implementations. Tools needed during this phase include an intelligent algorithm development environment (ADE) equipped with a database of optimized templates for different algorithmic implementations, an evaluation of the requirements of these templates and an estimation of their performance on different platforms.

The algorithm chosen to implement the Estimation Component of the stock option pricing model (shown in Figure 3) depends on the nature of the estimation (constant/stochastic volatility, American/European calls/puts, dividend payoff time, etc.) to be performed and the accuracy/time constraints. For example, models based on Monte Carlo simulation provide high accuracy. However, these models are computationally intensive and slow and therefore cannot be used in real-time systems. Further, they are not suitable for American calls/puts when early dividend payoff is possible. Binomial models are less accurate than Monte
Carlo models but are more tractable and can handle early exercise. Models using constant volatility (as opposed to treating volatility as a stochastic process) lack accuracy but are simple and easy to compute. The algorithmic implementations of the input and output functional components must be capable of handling terminal and disk I/O at rates specified by the time constraint parameters. Further, the output display must provide all information required by the user.

2.4.2 System Level Mapping Module

The function of the system level mapping module is to use the information provided by the algorithm development module to map the functional components of the application to the appropriate computing elements of a distributed (possibly heterogeneous) HPC environment. The objective is to map each functional component to the computing element that maximizes the performance of the application. Some data and load distribution issues may have to be resolved in this module. In addition, this module may also cluster functional component nodes specified in the parallelization specification to obtain a better mapping. The system level mapping module uses feedback from the evaluation module to select between different mapping candidates. System level mapping can be accomplished in an interactive mapping environment equipped with intelligent tools for analyzing the requirements of the functional components, and a knowledge base consisting of analytic benchmarks for the different computing elements and interconnection media in the HPC environment.

The algorithms for stock option pricing have been efficiently implemented on architectures like the CM2 and the DECmpp-12000 [MCV+92]. Thus, an appropriate mapping for the estimation functional component in the parallelization specification in Figure 3 is an SIMD architecture.
The input and output interfaces (Input/Output Component-A) require graphics capability with support for high speed rendering (output display) and must be mapped to appropriate graphics stations. Finally, Input/Output Component-B requires high speed disk I/O and must be mapped to an I/O server with such capabilities.

2.4.3 Machine Level Mapping Module

The machine level mapping module performs the mapping of the functional component(s) onto the processor(s) of the computing elements. This stage resolves issues like data partitioning, load distribution, control distribution, etc. and makes transformations specific to that computing element. It uses the feedback from the design evaluator module to select between possible alternatives. Machine level mapping can be accomplished in an interactive mapping environment similar to that described for the system level mapping module, but equipped with information pertaining to the individual computing elements of a specific computer architecture.

The performance of the stock option pricing models is very sensitive to the layout of data onto the processing elements. The optimal layout is dictated by
the input parameters (e.g. time of dividend payoff, terminal time, etc.) and by the specification of the architecture onto which the component is mapped. For example, in the binomial model, the continuous time processes for stock price and volatility are represented as discrete up/down movements forming a binary lattice. Such a lattice is generally implemented as asymmetric arrays which are distributed onto the processing elements. It has been found that the default mapping of these arrays (i.e. in two dimensions) on architectures like the DECmpp-12000 leads to poor load balancing and performance, especially for extreme values of the dividend payoff time. Further, the performance in the case of such a mapping is very sensitive to this value, and the mapping has to be modified for each set of inputs. Hence, in this case it is favorable to explicitly map them as one dimensional arrays. This is done by the machine level mapping module.

2.4.4 Implementation/Coding Module

The function of the implementation/coding module is to handle all code generation and perform the code filling of selected templates, so as to produce parallel code which can then be compiled and executed on the target computer architecture. This module incorporates all machine specific transformations, optimized libraries and codes; handles the introduction of calls to communication and synchronization routines; and takes care of the distribution of data among the processing elements. It also handles any input/output redirection that may be required.

With regard to the pricing model application, the implementation/coding module is responsible for introducing the machine specific communication routines. For example, the binary estimation model makes use of the "end-of-shift" function for its nearest-neighbor communication. The corresponding function calls in C (CM2) or MPL (DECmpp-12000) are introduced by this module.
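As a rough illustration of the nearest-neighbor pattern mentioned above, the following sketch prices an option by backward induction on a lattice stored as a one dimensional array. It assumes a standard Cox-Ross-Rubinstein lattice with constant volatility, which is simpler than the stochastic-volatility model of [MCV+92]. The helper `shift_left` is a hypothetical serial stand-in for a machine specific shift routine; it is not the CM2 or MPL call, and all function and parameter names are illustrative.

```python
import math
from typing import List


def shift_left(values: List[float], fill: float = 0.0) -> List[float]:
    """Shift by one position: element i receives the value of neighbor i + 1.

    On a data-parallel machine this would be a single nearest-neighbor
    communication step; here it is emulated serially.
    """
    return values[1:] + [fill]


def binomial_price(spot: float, strike: float, rate: float, sigma: float,
                   maturity: float, steps: int,
                   is_call: bool = True, american: bool = False) -> float:
    """Price an option on a Cox-Ross-Rubinstein lattice (assumed model)."""
    dt = maturity / steps
    up = math.exp(sigma * math.sqrt(dt))
    down = 1.0 / up
    p_up = (math.exp(rate * dt) - down) / (up - down)  # risk-neutral probability
    discount = math.exp(-rate * dt)
    sign = 1.0 if is_call else -1.0

    def payoff(step: int, j: int) -> float:
        # Node j at a given step has seen j up-moves and (step - j) down-moves.
        price = spot * up ** j * down ** (step - j)
        return max(sign * (price - strike), 0.0)

    # One value per terminal node, kept in a one dimensional array.
    values = [payoff(steps, j) for j in range(steps + 1)]
    for step in range(steps - 1, -1, -1):
        neighbors = shift_left(values)  # neighbors[j] is the "up" successor
        values = [discount * (p_up * neighbors[j] + (1.0 - p_up) * values[j])
                  for j in range(step + 1)]
        if american:
            # Early exercise: keep the larger of continuation and exercise value.
            values = [max(v, payoff(step, j)) for j, v in enumerate(values)]
    return values[0]


call = binomial_price(100.0, 100.0, 0.05, 0.2, 1.0, steps=500)
print(round(call, 2))
```

Because each backward step needs only the value held by the adjacent element, the whole update maps onto one shift followed by an element-wise computation.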
A possible machine specific optimization that can be introduced by this module is to reduce communication by making use of in-processor arrays. This optimization can improve performance by about two orders of magnitude [MCV+92].

2.4.5 Design Evaluator Module

The design evaluator module is a critical component of the application development stage. Its function is to assist the developer in evaluating the different options available to each of the other modules, and identifying the option that provides the best performance. It receives information about the hardware configuration, the application structure, the requirements of the selected algorithms and the mappings. This input information is then used to estimate the performance of the application on the target configuration. Further, it provides insight into the computation and communication costs, the existing idle times and the overheads. This information can be used by the other modules to identify regions where further refinement or tuning is required. The key features of this module are: (1) the ability to provide evaluations with the desired accuracy, with minimum resource
requirements and within a reasonable amount of time; (2) the ability to automate the evaluation process; and (3) the ability to perform the evaluation within an integrated workstation environment without running the application on the target computers. Support applicable to this module consists primarily of performance prediction and estimation tools. Simulation approaches can also be used to achieve some of the required functionality. A novel approach which uses interpretive techniques to realize a performance prediction framework that can meet these requirements is presented in [PHHF93b].

2.5 Compile-Time & Run-Time Stage

The compile-time/run-time stage handles the task of executing the parallelized application generated by the development stage to produce the required output. The input to this stage is the parallelized source code (parallelized structure). The compile-time portion of this stage consists of a set of cross compilers for the computing elements and tools for scheduling and allocation. The run-time portion of this stage handles run-time functions like debugging, scheduling, dynamic load balancing, migration, irregular communications, etc. It also enables the user to (non-intrusively) instrument the code for profiling and debugging and allows checkpointing for fault-tolerance. During the execution of the application, it accepts outputs from the different computing elements and directs them for proper visualization. It intercepts the error messages generated and provides proper interpretation.

2.6 Evaluation Stage

In the evaluation stage, the developer retrospectively evaluates the design choices made during the design process and looks for ways to improve the performance. The evaluation stage performs a thorough evaluation of the execution of the entire application, detailing communication and computation times, synchronization overheads and existing idle times at every execution level (application level, node level, procedure level, etc.).
It uses this evaluation to identify regions in the implementation where performance improvement is possible. Further, it allows a cost-effective evaluation (in terms of time and resources) of the application for a representative input set, as well as of the effect of various run-time parameters, such as system load and network contention, on performance. The scalability of the application with machine and problem size is also evaluated. The key requirement of this stage is the ability to provide the desired accuracy and granularity of evaluation while maintaining tractability and non-intrusiveness. Support applicable to the evaluation stage includes different analytic tools, monitoring tools, simulation tools and prediction/estimation tools.
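A minimal sketch of the node-level breakdown described in this section is given below. It assumes that per-node computation, communication and idle times have already been collected by some monitoring tool, and it uses the usual definitions of speedup (sequential time divided by parallel time) and efficiency (speedup divided by the number of nodes). The names `NodeTiming` and `summarize`, and the sample numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class NodeTiming:
    """Measured times for one node, in seconds (assumed to be available)."""
    computation: float
    communication: float
    idle: float

    @property
    def total(self) -> float:
        return self.computation + self.communication + self.idle


def summarize(timings: List[NodeTiming], sequential_time: float) -> Dict[str, float]:
    """Aggregate node-level timings into application-level metrics."""
    parallel_time = max(t.total for t in timings)  # the slowest node sets the run time
    node_seconds = sum(t.total for t in timings)
    speedup = sequential_time / parallel_time
    return {
        "parallel_time": parallel_time,
        "speedup": speedup,
        "efficiency": speedup / len(timings),
        "computation_fraction": sum(t.computation for t in timings) / node_seconds,
        "communication_fraction": sum(t.communication for t in timings) / node_seconds,
        "idle_fraction": sum(t.idle for t in timings) / node_seconds,
    }


# Made-up sample: two nodes and a 16 second sequential run.
report = summarize([NodeTiming(8.0, 1.0, 1.0), NodeTiming(6.0, 2.0, 2.0)],
                   sequential_time=16.0)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

A large idle or communication fraction on a particular node points to a region where remapping or further tuning may be worthwhile.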
2.7 Maintenance/Evolution Stage

In addition to the stages described above, which are encountered during the development and execution of HPC applications, there is an additional stage in the life-cycle of this software which involves its maintenance and evolution. Maintenance includes monitoring the operation of the software and ensuring that it continues to meet its specifications. It involves detecting and correcting bugs as they surface. The maintenance stage also handles modifications needed to incorporate changes in the system configuration. Software evolution deals with improving the software, adding functionality, incorporating new optimizations, etc. Another aspect of evolution is the development of more efficient algorithms and corresponding algorithmic templates and the incorporation of new hardware architectures. To support such development, the maintenance/evolution stage provides tools for the rapid prototyping of hardware and software and for evaluating new configurations and designs without having to implement them. Other support required during this stage includes tools for monitoring the performance and execution of the software, fault detection and recovery tools, and system configuration and configuration evaluation tools.

3 Conclusions

Software development in any Parallel/Distributed environment is a non-trivial process and requires a thorough understanding of the application and the architecture. This is apparent from the fact that, currently, applications are able to achieve only a fraction of peak available performance. This paper studies the software development process in a High Performance Computing environment. It describes the stages typically involved in this process and outlines the support required at each stage. The development of a parallel model for stock option pricing is used as a running example.

References

[BBDK91] J. E. Boillat, H. Burkhart, K. M. Decker, and P. G. Kropf. Parallel Computing in the 1990's: Attacking the Software Problem.
Physics Report (Review Section of Physics Letters), 207(3-5):141-165, 1991.
[BM91] Victor R. Basili and John D. Musa. The Future Engineering of Software: A Management Perspective. IEEE Computer, 24(9):90-96, September 1991.
[MCV+92] Kim Mills, Gang Cheng, Michael Vinson, Sanjay Ranka, and Geoffrey C. Fox. Software Issues and Performance of a Parallel Model for Stock Option Pricing. Proceedings of the 5th Australian Supercomputing Conference, Melbourne, Australia, December 1992.
[PHHF93a] Manish Parashar, Salim Hariri, Tomasz Haupt, and Geoffrey C. Fox. An Integrated Software Development Model for Heterogeneous High Performance
Computing. Technical Report SCCS-453, Northeast Parallel Architectures Center, Syracuse University, Syracuse NY 13244-4100, April 1993.
[PHHF93b] Manish Parashar, Salim Hariri, Tomasz Haupt, and Geoffrey C. Fox. An Interpretive Framework for Application Prediction. Procs of the 1993 Int'l Conference On Parallel and Distributed Systems, 668-672, December 1993.
[RL88] Lucian Russell and R. N. C. Lightfoot. Software Development Issues for Parallel Processing. Proceedings of the 12th Annual International Computer Software and Applications Conference, 306-307, 1988.
[Zor92] Glenn Zorpette. Teraflops Galore. IEEE Spectrum, 29(9):26-76, September 1992.
