Approved for public release; distribution is unlimited.
BACKGROUND

INTRODUCTION
There is a growing consensus [1] among supercomputer scientists that super-speed computers of the future will be parallel processors, since the traditional vector processors are only able to pipeline out one or a few results per cycle. Parallel processors are potentially able to have hundreds or thousands of execution streams going at once. In earlier years, the clock cycle speed of supercomputers was so much faster than that of parallel processors, and the parallel machines were in such an experimental state, that it still made sense to look to vector processing for practical supercomputing. Both of those conditions have changed today, providing computational scientists with the occasion to try to realize the full power of parallel processing. As we start to do this, however, we realize that while vector processors are basically all very similar to each other, parallel architectures present a wide variety of types. It seems quite unlikely that any one of these types will be ideal for a wide range of problems. For that reason, a number of computational scientists are looking at distributed heterogeneous processing as a potential solution. Superconcurrency is one approach to this form of computing.
VECTOR ARCHITECTURES
Vector architectures, such as the CRAY XMP (which will be the primary example in this paper), are primarily means for the hardware to support pipelining (Freund [2]). Suppose we wish to add a set of {xi} to a set of {yi}, i.e., {zi} = {xi} + {yi}. We refer to Fig 1.1 to see how this is normally done on a vector machine. The {xi} are loaded into one vector register (called V0 here) and the {yi} into another vector register (V1).
These operands are then fed through the floating point add unit and the {zi} are then pipelined out at the rate of one per clock cycle.
Most vector architectures are able to get some additional concurrency or parallelism by having some of the following features: (a) two or three functional units in the pipeline stream (called chaining), (b) independent execution of the scalar portion of the processor, and (c) several copies of the scalar/vector CPU. Still, the potential concurrency available in vector machines is quite limited; such machines are unlikely ever to have hundreds, much less thousands, of execution streams going at once.
There are several idiosyncrasies characteristic of vector machines. For example, memory is organized such that stepping through it in units greater than one (the step is called the stride, as defined by the reverse lexicographic order implicit in FORTRAN) can often result in significant performance degradation through memory conflicts. The author demonstrated several years ago a typical result, Fig 1.2, in which it is clear that the greater the power of 2 in the stride, the worse the performance (due to bank and section conflicts).
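The effect of power-of-2 strides can be illustrated with a small sketch. It assumes a simple interleaved memory in which word address a lives in bank a mod B; the bank count B = 16 and the function names are illustrative assumptions, not a description of any particular machine. Under that assumption, a stride s touches only B / gcd(s, B) distinct banks, so larger powers of 2 in the stride concentrate accesses on fewer banks.

```python
from math import gcd

def banks_touched(stride, num_banks=16):
    """Distinct banks hit by a constant-stride access pattern.

    Assumes word address a maps to bank a % num_banks (simple
    interleaving); num_banks=16 is a hypothetical value.
    """
    return num_banks // gcd(stride, num_banks)

def banks_touched_by_simulation(stride, num_banks=16, accesses=1024):
    """Same quantity, obtained by walking the addresses directly."""
    return len({(i * stride) % num_banks for i in range(accesses)})

if __name__ == "__main__":
    for stride in (1, 2, 3, 4, 8, 16):
        print(stride, banks_touched(stride))
```

In this model an odd stride still spreads accesses over all banks, while a stride equal to the bank count sends every access to the same bank.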
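Another idiosyncrasy concerns linear algebra. Let us examine chaining in a basic linear algebra operation, matrix times a vector. Let X^T = (X(1), X(2), ..., X(N)) be a point in N-space and A = (A(I,J)) be the M by N matrix mapping X into M-space, i.e.,
$$
\begin{pmatrix} Y(1) \\ \vdots \\ Y(M) \end{pmatrix}
=
\begin{pmatrix}
A(1,1) & A(1,2) & \cdots & A(1,N) \\
\vdots & \vdots & & \vdots \\
A(M,1) & A(M,2) & \cdots & A(M,N)
\end{pmatrix}
\begin{pmatrix} X(1) \\ \vdots \\ X(N) \end{pmatrix}
$$
where $Y(I) = \sum_{J=1}^{N} A(I,J) * X(J)$.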
Algorithmically we can think of this in two ways: (a) as N updates to the elements of Y by successively adding in terms A(I,J)*X(J), commonly called SAXPY, or (b) as M dot products of rows of A with the column vector X, also known as SDOT. Both are written in FORTRAN below (assuming initialization of Y to 0):
While the SDOT method corresponds to the way we have normally been taught to think theoretically of linear algebra operations, SAXPY is the method that works better on most vector architectures because of the nature of the hardware (essentially, in this case, the inability to add a vector to a scalar).
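A minimal sketch of the two loop orderings follows, written in Python rather than FORTRAN. The function names matvec_sdot and matvec_saxpy are illustrative, and A is assumed to be stored as a list of M rows of length N. Both compute the same Y; only the loop nesting differs. In the SAXPY form the inner loop updates the whole vector Y with a scalar multiple of one column of A, while in the SDOT form the inner loop reduces one row of A against X to a single scalar.

```python
def matvec_sdot(A, X):
    """Y = A X as M dot products (inner loop runs over J)."""
    M, N = len(A), len(X)
    Y = [0.0] * M
    for i in range(M):
        for j in range(N):
            Y[i] += A[i][j] * X[j]   # accumulates into one scalar Y(I)
    return Y

def matvec_saxpy(A, X):
    """Y = A X as N vector updates (inner loop runs over I)."""
    M, N = len(A), len(X)
    Y = [0.0] * M
    for j in range(N):
        xj = X[j]                    # scalar multiplier for column J
        for i in range(M):
            Y[i] += A[i][j] * xj     # Y <- Y + X(J) * column J of A
    return Y

if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    X = [1.0, 0.0, -1.0]
    print(matvec_sdot(A, X), matvec_saxpy(A, X))
```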
One of the consequences of these idiosyncrasies is the way people think about performing and benchmarking code for super-speed architectures. Most of the standard analysis tools, e.g., LINPACK [3], use code strongly configured to implement SAXPY and avoid non-unit FORTRAN stride. However, these rules of thumb, learned from vector architectures, do not necessarily apply to parallel processors. NOSC's Superconcurrency Research Team (SRT) has striking examples where natural, parallel implementation of fundamental algorithms yields dramatic performance increases over traditional vector implementations (and associated limitations).
TYPES OF PARALLELISM
One of the fundamental facts of parallel processing is the wide variety of types. There are a number of variant factors, e.g., memory organization (distributed, global, hierarchical, etc.) or processor interconnect scheme (bus, mesh, hypercube, etc.). However, the most basic distinction is whether the processors execute the same instruction on multiple data (SIMD) or multiple instructions on multiple data (MIMD). Fig 1.3 below summarizes these types of parallelism compared to vector processing, with asymptotic performance factors. Since the time to execute on each MIMD processor often cannot be determined until run-time, there is some probability that many processors may have to wait for one to finish. Naturally this probability tends to increase as the number of processors increases, so that MIMD machines usually do not have thousands of processors. On the other hand, each processor in a SIMD machine is usually a simple, e.g., bit-slice, processor (sometimes with an associated coprocessor), so that the execution time for any one processor is long. Thus SIMD machines do well only when the number of different data streams is quite large, i.e., in the thousands. The variety of parallel processors is also increased by such features as very long instruction word (VLIW) design, data-flow technology, and the fact that many designs are hybrids incorporating several different features. The fundamental result is that most parallel architectures are a good fit for some problems and a poor fit for others. The consequence is that an optimal method (to be made more precise in section 3 below) to compute a wide diversity of computational types is with a corresponding variety of architectures, i.e., the distributed heterogeneous processing approach mentioned in the introduction. Superconcurrency is a general technique for matching and managing optimally configured suites of super-speed processors.
In particular, this paper shows a general method for choosing the most powerful suite of heterogeneous parallel and vector supercomputers for a given problem set, subject to a fixed constraint, such as cost. The dual problem could find a minimal cost configuration for a fixed speed requirement. Thus the Optimal Selection Theory is a mathematical program for which one wishes to minimize the total time spent on the sum of all code subsegments. The theory is mathematically dependent on a new methodology of code profiling and a new methodology of analytical benchmarking. The intent is to use this technique to provide supercomputing power for Naval Command and Control (C2) problems; however, this paradigm should work for many classes of supercomputing problems. The basic result is that for a computational problem that has a diverse set of computational types, not all tightly-coupled, the optimal solution is a heterogeneous suite of parallel and vector processors rather than a single supercomputing architecture. This solution is called superconcurrency both because it is an approach to supercomputing and because it concurrently uses concurrent (vector and parallel) processors. Ercegovac [4] has recently looked at the feasibility of a suite of heterogeneous processors to solve supercomputing problems. Resnikoff [5] and Kamen [6] have examined the cost-effectiveness of supercomputers (one generally finds the smaller mini-supers to be more cost-effective than the largest-sized machines). Bokhari [7] has investigated partitioning problems among various types of processors. There are several reasons for partitioning. First, many large codes have diverse computational types. Second, the various super-speed parallel and vector processors have quite different performance profiles on these types, often differing by several orders of magnitude.
It is a commonplace observation and a corollary of Amdahl's Law [8] that any single type of supercomputer often spends most of its time computing code types for which it is poorly designed. If we could configure our processor suite so that each processor could spend almost all its time on code for which it is well designed, the overall increase in speed could be orders of magnitude over what is now achieved by conventional supercomputing.
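A small numerical sketch makes the corollary concrete. It assumes a code that is half vectorizable and half not, run on a single vector machine that is 20 times faster than scalar speed on the vectorizable half; the factor of 20 and the function name time_shares are hypothetical choices for illustration only.

```python
def time_shares(fractions, speedups):
    """Overall speedup and share of run time spent on each code type.

    fractions: portion of the work in each code type (sums to 1).
    speedups:  speedup over scalar speed achieved on each type.
    """
    times = [f / s for f, s in zip(fractions, speedups)]
    total = sum(times)
    return 1.0 / total, [t / total for t in times]

if __name__ == "__main__":
    # Hypothetical: 50% vectorizable at 20x, 50% at scalar speed (1x).
    overall, shares = time_shares([0.5, 0.5], [20.0, 1.0])
    print(f"overall speedup: {overall:.2f}")
    print(f"share of time on poorly suited code: {shares[1]:.1%}")
```

Under these assumptions the overall speedup is below 2, and more than 95% of the run time goes to the half of the code the machine is not designed for.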
REASONS FOR SUPERCONCURRENCY
One way of understanding the reasons for superconcurrency is to look at Amdahl's Law. Basically this says that the overall rate at which a machine will compute an overall code or set of codes is determined by the inverse of the sum of the times spent on each subportion. The paradoxical consequence of this is that, in the face of diverse computation requirements, a single machine asked to execute all the code will spend most of its time on the portions of code for which it is not well designed, as illustrated in Fig 2.1 below. The superconcurrency approach is also shown here, in which we try to identify and use a suite of machines wherein each is used primarily to compute code types for which it is well-suited, and conversely each portion of code is matched to an appropriate architecture.
3. BENCHMARKING AND CODE PROFILING
As discussed earlier, the basic approach of this paper is contingent upon breaking down the overall code into groups of segments within which the processing requirements are the same or homogeneous. The segments of homogeneous type are assigned to optimal processors for that type. Before that can be done, it is necessary to take two benchmarking type steps. The first, called code-type profiling, is a code specific function to identify the "natural" types of code that are actually present and group the code segments by type. Types that might be identified include vectorizable decomposable, vectorizable non-decomposable, fine/coarse-grain parallel, SIMD/MIMD parallel, scalar, special purpose, e.g., FFT or specialized sorting algorithm, etc. The second step, called analytical benchmarking, is an analysis of how the available processors perform on the identified types, i.e., this identifies processors that are appropriate solutions for each code type.
Thus it is more analytical than some previous techniques that simply looked at the overall result of running a processor on an entire benchmark code or set of loops (without any real analysis of how the myriad of relevant factors contributed). However, it should be pointed out that recent research by Dongarra on LINPACK provides some insight into the processes involved. Both code profiling and analytical benchmarking are now being undertaken by the Superconcurrency Research Team (SRT) at the Naval Ocean Systems Center (NOSC). Our initial research at Profiling/Benchmarking was directed at several large Naval C2 problems and a suite of potentially matching mini-supers/parallel processors (including the Connection Machine, DAP, Ardent, Encore, Butterfly, MultiFlow, Aspen, and Convex). Most of the C2 applications we have looked at so far have been relatively loosely-coupled and we have found it feasible to break them up (manually) into homogeneous portions and assign them to appropriate processors. From the processor (benchmarking) point of view, our most interesting result to date is how consistently the long vector problems are much better done on SIMD (Connection Machine or DAP) processors rather than vector processors.
Figure 2.3 shows a method by which the same calculation could be computed on a SIMD architecture. Namely, for all i we load xi and yi into processor i. Then we issue the same instruction, e.g., a floating point add, to all processors simultaneously. The add takes much longer on the simple SIMD processor than a comparable single add instruction on a vector machine. However, since all processors are simultaneously computing the same instruction, the results are computed in time O(1), i.e., it takes the same time for any N ≤ M, where M is the number of processors in the SIMD machine. Thus the time is bounded below by Ts, where Ts is the time needed for one of the SIMD processors to compute a floating point add. The implications of this are clear.
If N is large enough that N * Tv > Ts, where Tv is the time per result on the vector machine, then the total computation is performed faster on the SIMD machine than on the vector machine. The value of N for which the SIMD machine overtakes the vector machine, i.e., the least N such that N > Ts / Tv, is called the crossover point, or x-point hereafter. Freund, Gherrity, and Kamen [9] computed x-points for several operations oriented around linear algebra computations (Lubeck [10]).
3.1. MATHEMATICAL FORMULATION
We can state the basic problem as a linear (actually integer) program. We want to get the most power we can, given some overall cost constraint. More mathematically, we wish to maximize the power (or speed) function, P. We do this by minimizing a time function, T, giving the time taken on a code. Let I denote the set of all possible indices of one machine type per segment, and let vi be the number of machines of type i used (which may be 0 if machine i is not in the indexed configuration). Then the mathematical programming problem can be stated as:
$$
\text{minimize } T \quad \text{subject to} \quad \sum_{i} v_i c_i \le C
$$
where ci is the cost of one machine of type i and C is the overall cost constraint.
[Figure: Code Type Distribution in Large Application Suite on Baseline System. Panel labels: OLD WAY, NEW WAY, MACHINES.]
3.2. EXAMPLE
Let us consider the following example. Suppose the code to be 50% vectorizable (35% nondecomposable, i.e., only one vector machine at a time can run it, and 15% decomposable), 20% suitable for SIMD, 20% MIMD, and 10% inherently scalar. We shall assume that each type of machine only achieves scalar speed on code for which it is not designed, e.g., a vector machine will be assumed to get only scalar speed on parallel code. In Table 3.1 below we denote by α the speedup each machine achieves on portions of code for which it is best suited. The V's are vector machines, the S's SIMD, the M's MIMD, and the Sc a scalar machine. Suppose our overall cost constraint is $4M.
Table 3.1. Optimal Selection Theory Example
We can reformulate eq. 3.1 above as:
where N = number of different code types, pj = percentage of code of type j, vi = total number of processors for code type i, M = number of processor types for code type j, and tkj = time for processor k on code type j.
In this computable form, we see the traditional vector supercomputer solution of one V1 has P = 4.00. However, the multimachine solution of one V2, three V3, one S1, and one M1, in which no one machine is a traditional supercomputer, has a greater power function, P = 5.14. This is true in spite of the fact that 50% of the code was assumed vectorizable.
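The flavor of this selection problem can be sketched as a brute-force search over machine counts under a cost constraint. Everything numeric in the sketch apart from the code mix is a placeholder: the machine catalogue (speedups and costs), the rule that decomposable vector code runs at the summed speed of all vector machines, and the function names are assumptions made for illustration. The sketch does not reproduce the entries of Table 3.1 or the values P = 4.00 and P = 5.14 quoted above.

```python
from itertools import product

# Code mix from the example in the text (fractions of total work).
CODE_MIX = {
    "vector_nondecomposable": 0.35,
    "vector_decomposable": 0.15,
    "simd": 0.20,
    "mimd": 0.20,
    "scalar": 0.10,
}

# HYPOTHETICAL catalogue: name -> (kind, speedup on suited code, cost in $M).
# Placeholder numbers only, not the values of Table 3.1.
MACHINES = {
    "V1": ("vector", 20.0, 4.0),
    "V2": ("vector", 10.0, 1.0),
    "S1": ("simd", 50.0, 1.0),
    "M1": ("mimd", 8.0, 1.0),
}

def power(counts):
    """Return P = 1/T for a suite given as {machine name: count}."""
    by_kind = {"vector": [], "simd": [], "mimd": []}
    for name, n in counts.items():
        kind, alpha, _cost = MACHINES[name]
        by_kind[kind] += [alpha] * n
    if not any(by_kind.values()):
        return 0.0  # an empty suite computes nothing

    def best(kind):
        # Unsuited code runs at scalar speed (speedup 1), as in the text.
        return max(by_kind[kind], default=1.0)

    speed = {
        "vector_nondecomposable": best("vector"),  # one machine at a time
        # Assumption: decomposable code is shared by all vector machines.
        "vector_decomposable": max(sum(by_kind["vector"]), 1.0),
        "simd": best("simd"),
        "mimd": best("mimd"),
        "scalar": 1.0,
    }
    total_time = sum(p / speed[t] for t, p in CODE_MIX.items())
    return 1.0 / total_time

def best_suite(budget, max_each=4):
    """Exhaustively search machine counts with total cost <= budget."""
    names = list(MACHINES)
    best_counts, best_p = None, 0.0
    for combo in product(range(max_each + 1), repeat=len(names)):
        counts = dict(zip(names, combo))
        cost = sum(MACHINES[n][2] * k for n, k in counts.items())
        if cost <= budget:
            p = power(counts)
            if p > best_p:
                best_counts, best_p = counts, p
    return best_counts, best_p

if __name__ == "__main__":
    print(best_suite(4.0))
    print(power({"V1": 1}))
```

With these placeholder numbers the search prefers a mixed suite of smaller machines over the single large vector machine, which is the qualitative behavior the example describes. Exhaustive search is used only because the catalogue is tiny; a realistic problem would call for an integer programming solver.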
DINS OR DYNAMIC OPTIMIZATION
Distributed Intelligent Network System (DINS): One of the most active current research areas of the NOSC Superconcurrency Research Team (SRT) has been the development of the DINS concept. DINS will be a reasoning system that uses information from Code Profiling, Analytical Benchmarking, and network bandwidth to optimally manage a network of heterogeneous, high-performance, concurrent processors and assign portions of code to appropriate processors. In a general sense, this is similar to current research in load balancing and priority assignment. However, the information to be used will be the three sources mentioned above, with the primary aim of optimally matching code portions to processors rather than the secondary factors of load balancing and priority assignment. Since DINS will reason about processors actually available to it, we can achieve configuration control at different sites even though there may be a different superconcurrent suite at each. Similarly, DINS will continue to function and assign a second best processor if a first choice is unavailable or down. Thus DINS is robust and survivable. Likewise it is compatible with evolutionary development; when a new processor is introduced because of changing technology, we simply replace the old benchmarking data with the new. The features of robustness, configuration control, survivability, tailorability, and evolutionary development are essential for Naval C2 problems. We call DINS dynamic optimization since it dynamically tasks in an optimal way the backend suite of heterogeneous, superconcurrent processors that were chosen from the Optimal Selection Theory.
APPROACH
We plan to use artificial intelligence and compiler writing techniques to build the DINS using an existing off-the-shelf high-level distributed operating system, e.g., CRONUS (BBN product) or MACH (DARPA-sponsored Carnegie Mellon product). We will then use the ongoing results of analytically benchmarking code profile types on a variety of machines for automating the partitioning of complex codes so that homogeneous portions can be sent to the best suited processors. Our superconcurrency efforts will also draw on the developing taxonomy of code profile types with similar processing requirements, as well as our current work on the code profile types to find out what machines are ideal. Some code portions may be complex mixes of simple codes which are not easily decomposable because of, for example, unusual data dependencies in the algorithms.
4.2. EXAMPLE
An example of how DINS would work can be seen from the SIMD/Vector crossover point study. DINS would have matrices of the x-points for the various vector and SIMD machines available on its network. A vector problem that was short would be done on a traditional vector machine; a long one on a SIMD machine. The kind of reasoning DINS would do would be similar in general nature to the reasoning involved in the now classical problem of load-balancing, but the data it would reason about would be the performance matrices determining optimal machine/code portion matching. Load balancing could, in fact, be a secondary consideration, but only secondary, since the performance increases one gets from this are typically much less than from superconcurrent matching.
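The short-versus-long decision can be sketched with the x-point definition given earlier (the least N with N * Tv > Ts). The function names, the integer time units, and the treatment of vectors longer than the SIMD processor count M (assumed here to need one pass per M elements) are illustrative assumptions, not part of the DINS design.

```python
import math

def crossover_point(t_simd, t_vector):
    """Least integer N such that N * t_vector > t_simd (the x-point)."""
    return int(t_simd // t_vector) + 1

def choose_machine(n, t_simd, t_vector, simd_processors):
    """Pick the faster machine for an elementwise operation of length n.

    t_vector: time per pipelined result on the vector machine.
    t_simd:   time for one SIMD processor to perform the operation.
    Assumption: for n > simd_processors the SIMD machine needs
    ceil(n / simd_processors) passes.
    """
    vector_time = n * t_vector
    simd_time = math.ceil(n / simd_processors) * t_simd
    return "simd" if vector_time > simd_time else "vector"

if __name__ == "__main__":
    # Hypothetical timings in arbitrary integer units.
    t_simd, t_vector, m = 1000, 10, 65536
    print(crossover_point(t_simd, t_vector))
    for n in (50, 100, 101, 5000):
        print(n, choose_machine(n, t_simd, t_vector, m))
```

A table of such x-points, one per operation and machine pair, is the kind of performance matrix the example above has DINS consulting.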
EXPECTED RESULTS
The findings of this project will enable us to assess the potential for improvements in performance from a heterogeneous mix of concurrent processors. Based on the findings of our Optimal Selection Theory, we expect that lower cost multi-machine solutions will have speedups better than what can be achieved with even the most powerful single supercomputer. With an intelligent system to distribute tasks among multiple processors having disparate capabilities based on the code type, two to three orders of magnitude of speedup could be achieved. The intelligent system for distributing appropriate code should prevent problems of low vectorization fractions for the vector machines. We expect the various parallel and supercomputer machines to come closer to their peak performance ratings when they run code for which they are optimal. Another of the advantages of constructing a system which can access multiple processors as needed is that new computing technologies can be seamlessly incorporated into the system as they become available. The end users of the system need not learn any new interfaces to take advantage of improvements in technology. We can also expect fault tolerance from the ability to choose a second-best processor when one of the machines is unavailable, implying robustness. This reasoning about what is locally and currently available also implies automatic configuration control, since DINS can run transparently at different sites with different back-end supercomputers. This also implies graceful evolutionary acquisition, as well as survivability and tailorability, all important considerations for Navy C2 environments.
FEASIBILITY
An important issue in superconcurrency is the feasibility of switching machines for various codes or subcodes in our applications suite. In this section we look at several aspects of this, and mention related research.
LEVELS
Superconcurrency could be conducted at three distinct levels. The coarsest or highest level would be one in which we optimally match distinct whole codes to separate machines. The medium level of granularity would correspond to sending different subroutines or largely autonomous subportions to optimal processors. The finest or lowest level would be the one at which we break up tightly-coupled portions of code in order to optimally match them to hardware. Clearly the coarsest level is easiest to implement, but yields the least performance, whereas the lowest level of granularity is hardest, but gives the best results. A fundamental issue at every level is the interprocessor bandwidth. Fortunately, rates of 1 Gbit and beyond should be readily achievable in the near future.
