THE NAS PARALLEL BENCHMARKS
The Numerical Aerodynamic Simulation (NAS) Program, which is based at NASA Ames Research Center, is a large-scale effort to advance the state of computational aerodynamics. Specifically, the NAS organization aims "to provide the Nation's aerospace research and development community by the year 2000 a high-performance, operational computing system capable of simulating an entire aerospace vehicle system within a computing time of one to several hours" (NAS Systems Division, 1988, p. 3). The successful solution of this "grand challenge" problem will require the development of computer systems that can perform the required complex scientific computations at a sustained rate nearly 1,000 times greater than current-generation supercomputers can achieve. The architecture of computer systems able to achieve this level of performance will likely be dissimilar to the shared-memory multiprocessing supercomputers of today. While no consensus yet exists on what the design will be, it is likely that the system will consist of at least 1,000 processors computing in parallel. Highly parallel systems with computing power roughly equivalent to that of traditional shared-memory multiprocessors exist today. Unfortunately, for various reasons, the performance evaluation of these systems on comparable types of scientific computations is very difficult. Relevant data on the performance of algorithms of interest to the computational aerophysics community on many currently available parallel systems are limited. Benchmarking and performance evaluation of such systems have not kept pace with advances in hardware, software, and algorithms. In particular, there is as yet no generally accepted benchmark program or even a benchmark strategy for these systems.
Ultracomputer research project
The NYU Ultracomputer project continues to pioneer the study of architecture and software for large-scale, shared-memory parallel computers. During this past year, we achieved several very significant milestones; most notably, we fabricated and used the first-ever combining switches, and we increased our industrial involvement. Other important accomplishments include porting our Symunix operating system to the Ultra 3 prototypes; further developing a very high-quality, portable C compiler needed for our prototypes, which has attracted considerable commercial attention; producing a fast solver for Laplace's equation on multiply connected domains; and furthering the analysis of buffered interconnection networks and parallel random number generators. In addition to further developments in the areas mentioned above, we plan two new activities for next year. First, we will obtain extensive measurements of the effect of combining on scientific and other application software, using both the Ultra 3 hardware prototypes and a new simulation environment that we are presently constructing. Our successful VLSI development of combining switches has already demonstrated that the additional cost of combining is about 100% with modest packaging, 50% with 300 pins, and zero given next-generation densities and 400 pins. We hope these chips will refute the often-quoted claim that combining increases the cost of the network by a factor of between 6 and 30. Second, we will port our operating system to the new NCR series of Intel 486-based multiprocessors. NCR has agreed to donate a machine during the first quarter of 1992 for this effort, which will strengthen the ties between our project and NCR.
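Combining switches merge concurrent memory requests to the same location inside the interconnection network, the classic example being fetch-and-add. As a hedged illustration of why this matters to software, the following minimal C sketch uses fetch-and-add for loop self-scheduling, the coordination pattern whose hot-spot traffic combining is designed to absorb; the code and all names in it are ours, not the project's.

```c
/* Minimal sketch: fetch-and-add based loop self-scheduling, the kind of
   coordination primitive that combining switches accelerate in hardware.
   Illustrative only; not code from the Ultracomputer project. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static atomic_long next_index = 0;
static double work[N];

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* One fetch-and-add claims a unique iteration; with combining,
           simultaneous requests from many processors merge in the network
           instead of serialising at the memory module. */
        long i = atomic_fetch_add(&next_index, 1);
        if (i >= N)
            break;
        work[i] = (double)i * 0.5;   /* stand-in for real computation */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int k = 0; k < 4; k++)
        pthread_create(&t[k], NULL, worker, NULL);
    for (int k = 0; k < 4; k++)
        pthread_join(t[k], NULL);
    printf("done: %ld iterations claimed\n", (long)atomic_load(&next_index));
    return 0;
}
```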
A high-speed linear algebra library with automatic parallelism
Parallel or distributed processing is key to achieving the highest performance from workstations. However, designing and implementing efficient parallel algorithms is difficult and error-prone. It is even more difficult to write code that is both portable to and efficient on many different computers. Finally, it is harder still to satisfy the above requirements and also provide the reliability and ease of use required of commercial software intended for a production environment. As a result, the application of parallel processing technology to commercial software has been extremely limited, even though there are numerous computationally demanding programs that would benefit significantly from parallel processing. This paper describes DSSLIB, a library of subroutines that perform many of the time-consuming computations in engineering and scientific software. DSSLIB combines the high efficiency and speed of parallel computation with a serial programming model that eliminates many undesirable side effects of typical parallel code. The result is a simple way to incorporate the power of parallel processing into commercial software without compromising maintainability, reliability, or ease of use. This gives significant advantages over less powerful non-parallel entries in the market.
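Since the paper's own interface is not reproduced here, the following is a hypothetical C sketch of the programming model described: the call site looks strictly serial, and all parallelism stays hidden inside the library routine. The name dss_dot and the OpenMP-based internals are our assumptions, not the actual DSSLIB API.

```c
/* Hypothetical sketch of a serial programming model over a parallel
   library routine. dss_dot is an invented name, not the real DSSLIB API. */
#include <stdio.h>

/* Inside the library: the loop is parallelised (here with OpenMP as a
   stand-in), but the caller never sees threads or synchronisation. */
double dss_dot(int n, const double *x, const double *y)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

int main(void)
{
    double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
    /* Caller's view: an ordinary serial function call with no visible
       parallel side effects. */
    printf("%g\n", dss_dot(4, a, b));   /* prints 20 */
    return 0;
}
```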
The NAS parallel benchmarks
A new set of benchmarks was developed for the performance evaluation of highly parallel supercomputers. These benchmarks consist of a set of kernels, the 'Parallel Kernels', and a simulated application benchmark. Together they mimic the computation and data movement characteristics of large-scale computational fluid dynamics (CFD) applications. The principal distinguishing feature of these benchmarks is their 'pencil and paper' specification: all details of the benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.
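To give a flavour of what a pencil-and-paper kernel looks like when rendered in code, here is a hedged C sketch loosely in the spirit of the embarrassingly parallel (EP) kernel: generate pseudorandom pairs and count those passing the unit-disk acceptance test. The official specification fixes the generator, constants, and tallies exactly; this paraphrase does not.

```c
/* Illustrative sketch only, in the spirit of the EP kernel; not the
   official NAS specification, which prescribes the exact pseudorandom
   generator and the tallies to report. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 1000000;
    long accepted = 0;
    srand(42);                                      /* fixed seed for repeatability */
    for (long i = 0; i < n; i++) {
        double x = 2.0 * rand() / RAND_MAX - 1.0;   /* uniform in (-1, 1) */
        double y = 2.0 * rand() / RAND_MAX - 1.0;
        if (x * x + y * y <= 1.0)
            accepted++;        /* pair would yield Gaussian deviates */
    }
    printf("accepted %ld of %ld pairs\n", accepted, n);
    return 0;
}
```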
Strategies and tools for the exploitation of massively parallel computer systems
The aim of this thesis is to develop software and strategies for the exploitation of parallel computer hardware, in particular distributed-memory systems, and to embed these strategies within a parallelisation tool so that they can be generated automatically.
The parallelisation of four structured mesh codes using the Computer Aided Parallelisation Tools provided a good initial parallelisation of the codes. However, investigation revealed that simple optimisation of the communications within these codes provided an even greater improvement in performance. The dominant factor within the communications was the data transfer time, with communication start-up latencies also significant. This was the case throughout the codes, but especially in sections of pipelined code containing large amounts of communication.
This thesis describes the development and testing of methods that increase the performance of these communications by overlapping them with unrelated calculation. This overlapping was applied to the data exchange communications as well as to the pipelined communications.
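As a hedged illustration of the overlapping technique (not code from the thesis, which targets mesh codes via the tools), the following C/MPI sketch starts a non-blocking halo exchange, performs unrelated interior computation while the message is in flight, and waits only when the communicated data is actually needed.

```c
/* Minimal sketch of overlapping communication with unrelated calculation.
   Illustrative MPI code; the array names and sizes are ours. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double halo_out = (double)rank, halo_in = 0.0;
    int right = (rank + 1) % size, left = (rank + size - 1) % size;

    MPI_Request reqs[2];
    /* Start the exchange; control returns immediately. */
    MPI_Irecv(&halo_in, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&halo_out, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Unrelated interior calculation, overlapped with the data transfer. */
    double interior = 0.0;
    for (int i = 0; i < 1000000; i++)
        interior += 1.0 / (i + 1.0);

    /* Wait only at the point where the halo value is needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d: got %g, interior sum %g\n", rank, halo_in, interior);

    MPI_Finalize();
    return 0;
}
```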
The successful application by hand provided the motivation for these methods to be incorporated and automatically generated within the Computer Aided Parallelisation Tools. These methods were integrated within these tools as an additional stage of the parallelisation. This required a generic algorithm that made use of many of the symbolic algebra tests and symbolic variable manipulation routines within the tools.
The automatic generation of overlapped communications was applied to the four codes previously parallelised, as well as to a further three codes, one of which was a real-world Computational Fluid Dynamics code.
Methods to apply automatic generation of overlapped communications to unstructured mesh codes were also discussed. These methods are similar to those applied to the structured mesh codes, and their automation is expected to proceed in a similar fashion.
Developing and Measuring Parallel Rule-Based Systems in a Functional Programming Environment
This thesis investigates the suitability of functional programming for building parallel rule-based systems. A functional version of the well-known rule-based system OPS5 was implemented, and there is a discussion of the suitability of functional languages for both building compilers and manipulating state. Functional languages can be used to build compilers that reflect the structure of the original grammar of a language and are, therefore, very suitable. Particular attention is paid to the state requirements and the state manipulation structures of applications such as a rule-based system because, traditionally, functional languages have been considered unable to manipulate state. From the implementation work, issues have arisen that are important for functional programming as a whole. They are in the areas of algorithms and data structures and of development environments. There is a more general discussion of state and state manipulation in functional programs and of how theoretical work, such as monads, can be used. Techniques for interpreting descriptions of graph algorithms more abstractly in order to build functional graph algorithms are presented. Beyond the scope of programming, there are issues relating both to the interaction of functional languages with the operating system and to tools, such as debugging and measurement tools, which help programmers write efficient programs. In both of these areas functional systems are lacking. To address the complete lack of measurement tools for functional languages, a profiling technique was designed which can accurately measure the number of calls to a function, the time spent in a function, and the amount of heap space used by a function. From this design, a profiler was developed for higher-order, lazy, functional languages which allows the programmer to measure and verify the behaviour of a program. This profiling technique is designed primarily for application programmers rather than functional language implementors, and the results presented by the profiler directly reflect the lexical scope of the original program rather than some run-time representation. Finally, there is a discussion of generally available techniques for parallelizing functional programs so that they may execute on a parallel machine. The techniques that are easiest for the parallel systems builder to implement are shown to be the least suitable for large functional applications, while those that best suit functional programmers are not yet generally available and usable.
Compilation Techniques for High-Performance Embedded Systems with Multiple Processors
Despite the progress made in developing more advanced compilers for embedded systems, programming of embedded high-performance computing systems based on Digital Signal Processors (DSPs) is still a highly skilled manual task. This is true for single-processor systems, and even more so for embedded systems based on multiple DSPs. Compilers often fail to optimise existing DSP codes written in C due to the employed programming style. Parallelisation is hampered by the complex multiple-address-space memory architecture found in most commercial multi-DSP configurations.
This thesis develops an integrated optimisation and parallelisation strategy that can deal with low-level C codes and produces optimised parallel code for a homogeneous multi-DSP architecture with distributed physical memory and multiple logical address spaces. In a first step, low-level programming idioms are identified and recovered. This enables the application of high-level code and data transformations well known in the field of scientific computing. Iterative, feedback-driven search for "good" transformation sequences is investigated. A novel approach to parallelisation based on a unified data and loop transformation framework is presented and evaluated. Performance optimisation is achieved through exploitation of data locality on the one hand, and utilisation of DSP-specific architectural features such as Direct Memory Access (DMA) transfers on the other.
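As a hedged illustration of the idiom recovery step (an invented example, not one from the thesis), the following C fragment shows a typical pointer post-increment DSP idiom alongside its recovered explicit-array form, which is far more amenable to the high-level loop and data transformations mentioned above.

```c
/* Invented illustration of "program recovery": the pointer idiom obscures
   the access pattern; the recovered array form exposes it to analysis. */
#include <stdio.h>

#define N 8

/* Original low-level idiom: pointer post-increment hides the indexing. */
int dot_pointer(const int *p, const int *q, int n)
{
    int acc = 0;
    while (n--)
        acc += *p++ * *q++;
    return acc;
}

/* Recovered form: explicit subscripts, analysable by the compiler. */
int dot_array(const int a[], const int b[], int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

int main(void)
{
    int a[N] = {1, 2, 3, 4, 5, 6, 7, 8}, b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    /* Both forms compute the same dot product. */
    printf("%d %d\n", dot_pointer(a, b, N), dot_array(a, b, N));
    return 0;
}
```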
The proposed methodology is evaluated against two benchmark suites (DSPstone & UTDSP) and four different high-performance DSPs, one of which is part of a commercial four-processor multi-DSP board also used for evaluation. Experiments confirm the effectiveness of the program recovery techniques as enablers of high-level transformations and automatic parallelisation. Source-to-source transformations of DSP codes yield an average speedup of 2.21 across four different DSP architectures. The parallelisation scheme is, in conjunction with a set of locality optimisations, able to produce linear and even super-linear speedups on a number of relevant DSP kernels and applications.
Translating expert system rules into Ada code with validation and verification
The purpose of this ongoing research and development program is to develop software tools which enable the rapid development, upgrading, and maintenance of embedded real-time artificial intelligence systems. The goals of this phase of the research were to investigate the feasibility of developing software tools which automatically translate expert system rules into Ada code, and to develop methods for performing validation and verification testing of the resultant expert system. A prototype system was demonstrated which automatically translated rules from an Air Force expert system and detected errors in the execution of the resultant system. The method and prototype tools for converting AI representations into Ada code, by converting the rules into Ada code modules and then linking them with an Activation Framework based run-time environment to form an executable load module, are discussed. This method is based upon the use of Evidence Flow Graphs, which are a data flow representation for intelligent systems. The development of prototype test generation and evaluation software which was used to test the resultant code is also discussed. This testing was performed automatically using Monte-Carlo techniques based upon a constraint-based description of the required performance of the system.
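As a hedged illustration of the constraint-based Monte-Carlo testing idea (in C rather than Ada, with an invented rule and constraint), the following sketch drives a translated rule module with random inputs and checks each result against a tolerance-banded specification.

```c
/* Invented example of Monte-Carlo validation against a constraint-based
   specification; the rule, constants, and tolerance band are ours. */
#include <stdio.h>
#include <stdlib.h>

/* A translated rule module: IF temperature > 100 THEN raise alarm. */
static int rule_alarm(double temp) { return temp > 100.0; }

/* Constraint-based oracle: the alarm must fire exactly when the limit is
   exceeded; behaviour inside a small tolerance band is left unconstrained. */
static int satisfies_spec(double temp, int alarm)
{
    if (temp > 100.5) return alarm == 1;
    if (temp < 99.5)  return alarm == 0;
    return 1;   /* within the tolerance band, either answer is acceptable */
}

int main(void)
{
    srand(7);
    int failures = 0;
    for (int i = 0; i < 100000; i++) {
        double temp = 200.0 * rand() / RAND_MAX;   /* random test input */
        if (!satisfies_spec(temp, rule_alarm(temp)))
            failures++;
    }
    printf("failures: %d\n", failures);
    return 0;
}
```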