17 research outputs found

    THE NAS PARALLEL BENCHMARKS

    Get PDF
    The Numerical Aerodynamic Simulation (NAS) Program, which is based at NASA Ames Research Center, is a large-scale effort to advance the state of computational aerodynamics. Specifically, the NAS organization aims &dquo;to provide the Nation’s aerospace research and development community by the year 2000 a highperformance, operational computing system capable of simulating an entire aerospace vehicle system within a computing time of one to several hours&dquo; (NAS Systems Division, 1988, p. 3). The successful solution of this &dquo;grand challenge&dquo; problem will require the development of computer systems that can perform the required complex scientific computations at a sustained rate nearly 1,000 times greater than current generation supercomputers can achieve. The architecture of computer systems able to achieve this level of performance will likely be dissimilar to the shared memory multiprocessing supercomputers of today. While no consensus yet exists on what the design will be, it is likely that the system will consist of at least 1,000 processors computing in parallel. Highly parallel systems with computing power roughly equivalent to that of traditional shared memory multiprocessors exist today. Unfortunately, for various reasons, the performance evaluation of these systems on comparable types of scientific computations is very difficult. Relevant data for the performance of algorithms of interest to the computational aerophysics community on many currently available parallel systems are limited. Benchmarking and performance evaluation of such systems have not kept pace with advances in hardware, software, and algorithms. In particular, there is as yet no generally accepted benchmark program or even a benchmark strategy for these systems

    A high-speed linear algebra library with automatic parallelism

    Get PDF
    Parallel or distributed processing is key to getting highest performance workstations. However, designing and implementing efficient parallel algorithms is difficult and error-prone. It is even more difficult to write code that is both portable to and efficient on many different computers. Finally, it is harder still to satisfy the above requirements and include the reliability and ease of use required of commercial software intended for use in a production environment. As a result, the application of parallel processing technology to commercial software has been extremely small even though there are numerous computationally demanding programs that would significantly benefit from application of parallel processing. This paper describes DSSLIB, which is a library of subroutines that perform many of the time-consuming computations in engineering and scientific software. DSSLIB combines the high efficiency and speed of parallel computation with a serial programming model that eliminates many undesirable side-effects of typical parallel code. The result is a simple way to incorporate the power of parallel processing into commercial software without compromising maintainability, reliability, or ease of use. This gives significant advantages over less powerful non-parallel entries in the market

    The NAS parallel benchmarks

    Get PDF
    A new set of benchmarks was developed for the performance evaluation of highly parallel supercomputers. These benchmarks consist of a set of kernels, the 'Parallel Kernels,' and a simulated application benchmark. Together they mimic the computation and data movement characteristics of large scale computational fluid dynamics (CFD) applications. The principal distinguishing feature of these benchmarks is their 'pencil and paper' specification - all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided

    Human factors in the design of parallel program performance tuning tools

    Get PDF

    Developing and Measuring Parallel Rule-Based Systems in a Functional Programming Environment

    Get PDF
    This thesis investigates the suitability of using functional programming for building parallel rule-based systems. A functional version of the well known rule-based system OPS5 was implemented, and there is a discussion on the suitability of functional languages for both building compilers and manipulating state. Functional languages can be used to build compilers that reflect the structure of the original grammar of a language and are, therefore, very suitable. Particular attention is paid to the state requirements and the state manipulation structures of applications such as a rule-based system because, traditionally, functional languages have been considered unable to manipulate state. From the implementation work, issues have arisen that are important for functional programming as a whole. They are in the areas of algorithms and data structures and development environments. There is a more general discussion of state and state manipulation in functional programs and how theoretical work, such as monads, can be used. Techniques for how descriptions of graph algorithms may be interpreted more abstractly to build functional graph algorithms are presented. Beyond the scope of programming, there are issues relating both to the functional language interaction with the operating system and to tools, such as debugging and measurement tools, which help programmers write efficient programs. In both of these areas functional systems are lacking. To address the complete lack of measurement tools for functional languages, a profiling technique was designed which can accurately measure the number of calls to a function , the time spent in a function, and the amount of heap space used by a function. From this design, a profiler was developed for higher-order, lazy, functional languages which allows the programmer to measure and verify the behaviour of a program. This profiling technique is designed primarily for application programmers rather than functional language implementors, and the results presented by the profiler directly reflect the lexical scope of the original program rather than some run-time representation. Finally, there is a discussion of generally available techniques for parallelizing functional programs in order that they may execute on a parallel machine. The techniques which are easier for the parallel systems builder to implement are shown to be least suitable for large functional applications. Those techniques that best suit functional programmers are not yet generally available and usable

    Compilation Techniques for High-Performance Embedded Systems with Multiple Processors

    Get PDF
    Institute for Computing Systems ArchitectureDespite the progress made in developing more advanced compilers for embedded systems, programming of embedded high-performance computing systems based on Digital Signal Processors (DSPs) is still a highly skilled manual task. This is true for single-processor systems, and even more for embedded systems based on multiple DSPs. Compilers often fail to optimise existing DSP codes written in C due to the employed programming style. Parallelisation is hampered by the complex multiple address space memory architecture, which can be found in most commercial multi-DSP configurations. This thesis develops an integrated optimisation and parallelisation strategy that can deal with low-level C codes and produces optimised parallel code for a homogeneous multi-DSP architecture with distributed physical memory and multiple logical address spaces. In a first step, low-level programming idioms are identified and recovered. This enables the application of high-level code and data transformations well-known in the field of scientific computing. Iterative feedback-driven search for “good” transformation sequences is being investigated. A novel approach to parallelisation based on a unified data and loop transformation framework is presented and evaluated. Performance optimisation is achieved through exploitation of data locality on the one hand, and utilisation of DSP-specific architectural features such as Direct Memory Access (DMA) transfers on the other hand. The proposed methodology is evaluated against two benchmark suites (DSPstone & UTDSP) and four different high-performance DSPs, one of which is part of a commercial four processor multi-DSP board also used for evaluation. Experiments confirm the effectiveness of the program recovery techniques as enablers of high-level transformations and automatic parallelisation. Source-to-source transformations of DSP codes yield an average speedup of 2.21 across four different DSP architectures. The parallelisation scheme is – in conjunction with a set of locality optimisations – able to produce linear and even super-linear speedups on a number of relevant DSP kernels and applications

    Translating expert system rules into Ada code with validation and verification

    Get PDF
    The purpose of this ongoing research and development program is to develop software tools which enable the rapid development, upgrading, and maintenance of embedded real-time artificial intelligence systems. The goals of this phase of the research were to investigate the feasibility of developing software tools which automatically translate expert system rules into Ada code and develop methods for performing validation and verification testing of the resultant expert system. A prototype system was demonstrated which automatically translated rules from an Air Force expert system was demonstrated which detected errors in the execution of the resultant system. The method and prototype tools for converting AI representations into Ada code by converting the rules into Ada code modules and then linking them with an Activation Framework based run-time environment to form an executable load module are discussed. This method is based upon the use of Evidence Flow Graphs which are a data flow representation for intelligent systems. The development of prototype test generation and evaluation software which was used to test the resultant code is discussed. This testing was performed automatically using Monte-Carlo techniques based upon a constraint based description of the required performance for the system
    corecore